Optimize scalar product (reduction) on CPU
Benchmarks show that our CPU implementation of the scalar product is slow: for size 400000, the BLAS version is about 3.3 times faster.
operation       size    variant   bandwidth  time         speedup
scalar product  400000  CPU        5.4616    0.00109134   N/A
scalar product  400000  CPU ET     4.96865   0.00119962   0.909742
scalar product  400000  CPU BLAS  17.7799    0.000335237  3.25543
Since the ET (expression templates) and non-ET versions perform almost the same, it seems that even the original implementation, before the switch to ET, was not optimal. We should check how the scalar product is implemented in BLAS and improve our implementation accordingly.
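One thing worth checking first is the dependency chain in the reduction loop. With a single accumulator, every addition has to wait for the previous one, and without relaxed floating-point flags the compiler is not allowed to reorder the sum in order to vectorize it. The reference BLAS ddot routine, for comparison, uses a manually unrolled loop. The sketch below is only an illustration of that idea and not our actual code: the names dot_naive and dot_unrolled are hypothetical, and the choice of four accumulators is an assumption that would have to be tuned and benchmarked.

```cpp
#include <cstddef>
#include <vector>

// Baseline: a single accumulator, so each addition depends on the previous one.
// Assumes y has at least as many elements as x.
double dot_naive(const std::vector<double>& x, const std::vector<double>& y)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += x[i] * y[i];
    return sum;
}

// Sketch (hypothetical helper): four independent accumulators break the
// dependency chain and give the compiler a pattern that maps onto SIMD lanes.
// Assumes y has at least as many elements as x.
double dot_unrolled(const std::vector<double>& x, const std::vector<double>& y)
{
    const std::size_t n = x.size();
    const std::size_t blocked = n - n % 4;  // largest multiple of 4 not exceeding n

    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i < blocked; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }

    // Combine the partial sums, then handle the remaining 0 to 3 elements.
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Note that the unrolled version adds the terms in a different order, so for general floating-point data its result may differ from the naive loop in the last bits. Whether this alone closes the gap to BLAS has to be measured with the benchmark above.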