Further BLAS improvements.

1. Switch to Eigen's implementation when all dimensions are fixed.
2. Use lazyProduct for Eigen matrix-vector products. This brings
   Eigen's performance on iterative_schur closer to what it was
   before the last commit. There is, however, still an
   improvement to be had by using the naive implementation when
   the matrix and vector have dynamic dimensions.
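
The dispatch described in the two items above can be sketched as follows.
This is only an illustration, not the code in this change: the names
kDynamic and MatrixVectorMultiplyAdd are hypothetical, the matrix is assumed
to be row-major, and the fixed-size branch uses a plain loop with
compile-time bounds as a stand-in for the Eigen fixed-size kernel.

```cpp
#include <cassert>

// Hypothetical sentinel for a dimension that is only known at run time.
const int kDynamic = -1;

// Accumulates c += A * b, where A is a row-major num_rows x num_cols matrix.
// kRows and kCols are the compile-time dimensions, or kDynamic if unknown.
template <int kRows, int kCols>
void MatrixVectorMultiplyAdd(const double* A, int num_rows, int num_cols,
                             const double* b, double* c) {
  // Known at compile time, so the compiler can drop the unused branch.
  const bool all_fixed = (kRows != kDynamic) && (kCols != kDynamic);
  if (all_fixed) {
    assert(num_rows == kRows && num_cols == kCols);
    // Fixed-size path: loop bounds are constants, so the loops can be fully
    // unrolled. Assumption: this is where a fixed-size kernel would be used.
    for (int r = 0; r < kRows; ++r) {
      double sum = 0.0;
      for (int k = 0; k < kCols; ++k) {
        sum += A[r * kCols + k] * b[k];
      }
      c[r] += sum;
    }
  } else {
    // Dynamic path: the naive loop over run-time dimensions.
    for (int r = 0; r < num_rows; ++r) {
      double sum = 0.0;
      for (int k = 0; k < num_cols; ++k) {
        sum += A[r * num_cols + k] * b[k];
      }
      c[r] += sum;
    }
  }
}
```

Calling MatrixVectorMultiplyAdd<2, 3>(A, 2, 3, b, c) takes the fixed-size
branch, while MatrixVectorMultiplyAdd<kDynamic, kDynamic>(A, 2, 3, b, c)
takes the naive branch. Both should produce the same result.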

BENCHMARK
                                      HEAD                                       CHANGE

problem-16-22106-pre.txt
gcc-eigen       sparse_schur         0.859    gcc-eigen       sparse_schur        0.853
clang-eigen     sparse_schur         0.848    clang-eigen     sparse_schur        0.850
gcc-blas        sparse_schur         0.956    gcc-blas        sparse_schur        0.865
clang-blas      sparse_schur         0.954    clang-blas      sparse_schur        0.858
gcc-eigen       iterative_schur      4.656    gcc-eigen       iterative_schur     3.271
clang-eigen     iterative_schur      4.664    clang-eigen     iterative_schur     3.307
gcc-blas        iterative_schur      2.598    gcc-blas        iterative_schur     2.620
clang-blas      iterative_schur      2.554    clang-blas      iterative_schur     2.567

problem-49-7776-pre.txt
gcc-eigen       sparse_schur         0.477    gcc-eigen       sparse_schur        0.472
clang-eigen     sparse_schur         0.475    clang-eigen     sparse_schur        0.479
gcc-blas        sparse_schur         0.521    gcc-blas        sparse_schur        0.469
clang-blas      sparse_schur         0.508    clang-blas      sparse_schur        0.471
gcc-eigen       iterative_schur      3.172    gcc-eigen       iterative_schur     2.088
clang-eigen     iterative_schur      3.161    clang-eigen     iterative_schur     2.079
gcc-blas        iterative_schur      1.701    gcc-blas        iterative_schur     1.720
clang-blas      iterative_schur      1.708    clang-blas      iterative_schur     1.694

problem-245-198739-pre.txt
gcc-eigen       sparse_schur        28.092    gcc-eigen       sparse_schur       28.233
clang-eigen     sparse_schur        28.148    clang-eigen     sparse_schur       28.400
gcc-blas        sparse_schur        30.919    gcc-blas        sparse_schur       28.110
clang-blas      sparse_schur        31.001    clang-blas      sparse_schur       28.407
gcc-eigen       iterative_schur     63.095    gcc-eigen       iterative_schur    43.694
clang-eigen     iterative_schur     63.412    clang-eigen     iterative_schur    43.473
gcc-blas        iterative_schur     33.353    gcc-blas        iterative_schur    33.321
clang-blas      iterative_schur     33.276    clang-blas      iterative_schur    33.278

problem-257-65132-pre.txt
gcc-eigen       sparse_schur         3.687    gcc-eigen       sparse_schur        3.629
clang-eigen     sparse_schur         3.669    clang-eigen     sparse_schur        3.652
gcc-blas        sparse_schur         3.947    gcc-blas        sparse_schur        3.673
clang-blas      sparse_schur         3.952    clang-blas      sparse_schur        3.678
gcc-eigen       iterative_schur    121.512    gcc-eigen       iterative_schur    76.833
clang-eigen     iterative_schur    123.547    clang-eigen     iterative_schur    78.763
gcc-blas        iterative_schur     68.334    gcc-blas        iterative_schur    68.612
clang-blas      iterative_schur     67.793    clang-blas      iterative_schur    68.266

Notes:

1. Naive BLAS was a bit worse than Eigen on fixed-size matrices. We did not see this
   before because of the different inlining thresholds. Fixing this boosted eigen's
   performance. Also, the disparity between gcc and clang has gone away.

2. SPARSE_SCHUR performance remains the same, since it only exercises statically sized
   matrices.

3. ITERATIVE_SCHUR performance goes up substantially due to the lazyProduct change,
   but even there, since most of the products are dynamically sized, the naive
   implementation wins handily.
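
To put rough numbers on note 3, the ratios can be computed from the gcc
iterative_schur rows of the table above. The helper below is only a
convenience sketch: the struct and function names are hypothetical, and the
timings are copied verbatim from the benchmark. With these values the Eigen
build speeds up by roughly 1.4x to 1.6x, yet is still roughly 1.1x to 1.3x
slower than the naive BLAS build.

```cpp
#include <cstdio>

// One gcc iterative_schur row group from the benchmark table.
struct Timing {
  const char* problem;
  double eigen_head;    // gcc-eigen, HEAD column
  double eigen_change;  // gcc-eigen, CHANGE column
  double blas_change;   // gcc-blas, CHANGE column
};

const Timing kIterativeSchurGcc[] = {
    {"problem-16-22106-pre.txt", 4.656, 3.271, 2.620},
    {"problem-49-7776-pre.txt", 3.172, 2.088, 1.720},
    {"problem-245-198739-pre.txt", 63.095, 43.694, 33.321},
    {"problem-257-65132-pre.txt", 121.512, 76.833, 68.612},
};
const int kNumTimings = 4;

// How much faster the Eigen build became with this change (HEAD / CHANGE).
double Speedup(const Timing& t) { return t.eigen_head / t.eigen_change; }

// How much slower the Eigen build still is than naive BLAS after the change.
double RemainingGap(const Timing& t) { return t.eigen_change / t.blas_change; }

// Prints one summary line per problem.
void PrintSummary() {
  for (int i = 0; i < kNumTimings; ++i) {
    const Timing& t = kIterativeSchurGcc[i];
    std::printf("%-28s speedup %.2fx, remaining gap %.2fx\n", t.problem,
                Speedup(t), RemainingGap(t));
  }
}
```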

Change-Id: Idc17f35b9c68aaebb1b2e131adf3af8374a85a4c
1 file changed