Optimize scalar product (reduction) on CPU

Benchmarks show that our CPU implementation of the scalar product is about 3.3 times slower than BLAS:

operation                     size implementation     BW [GiB/s]       time [s]        speedup
scalar product              400000            CPU         5.4616     0.00109134            N/A
scalar product              400000         CPU ET        4.96865     0.00119962       0.909742
scalar product              400000       CPU BLAS        17.7799    0.000335237        3.25543

Since the ET (expression templates) and non-ET versions perform almost the same, it seems that even the original implementation, before the switch to ET, was not optimal. We should check how the scalar product is implemented in BLAS and improve our implementation accordingly.
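
One possible direction, as a minimal sketch only: optimized BLAS kernels commonly combine SIMD with several independent accumulators, although the exact approach depends on the BLAS library and has not been checked here. The snippet below contrasts a single-accumulator loop with a 4-way unrolled variant. The function names `dot_naive` and `dot_unrolled` and the unroll factor of 4 are hypothetical and are not taken from our code base.

```cpp
#include <cstddef>
#include <vector>

// Baseline: one accumulator, so every addition depends on the previous one.
// Without relaxed floating-point flags the compiler may not reorder this sum.
double dot_naive(const double* x, const double* y, std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

// Hypothetical variant: four independent partial sums break the dependency
// chain, which gives the compiler and the CPU more room to overlap work.
double dot_unrolled(const double* x, const double* y, std::size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    // Remainder loop for the last n % 4 elements.
    for (; i < n; ++i)
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the unrolled variant adds the terms in a different order, so its result can differ from the naive loop in the last bits. Whether it actually closes the gap to BLAS would have to be confirmed by rerunning the benchmark.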