Parallel right products for partitioned view

Parallel implementations for right-multiply by dense vector for:
 - Partitioned matrix view
 - Block-sparse matrix
 - CRS matrix (non-symmetric only)

When coupled with non-interleaving indexes in parallel for, this
simple aproach provides a reasonable speedup.
For example, in CRS case difference with GPGPU approach reduces
closer to memory throughput ratio for high enough core count.

./bin/spmv_benchmark
-------------------------------------------------------------------
Benchmark                                                      Time
-------------------------------------------------------------------
BM_BlockSparseRightMultiplyAndAccumulateBA/1              28.5   ms
BM_BlockSparseRightMultiplyAndAccumulateBA/2              15.7   ms
BM_BlockSparseRightMultiplyAndAccumulateBA/4               9.01  ms
BM_BlockSparseRightMultiplyAndAccumulateBA/8               5.60  ms
BM_BlockSparseRightMultiplyAndAccumulateBA/16              3.86  ms
BM_BlockSparseRightMultiplyAndAccumulateBA/28              3.84  ms
BM_BlockSparseRightMultiplyAndAccumulateUnstructured/1    23.8   ms
BM_BlockSparseRightMultiplyAndAccumulateUnstructured/2    15.0   ms
BM_BlockSparseRightMultiplyAndAccumulateUnstructured/4     8.01  ms
BM_BlockSparseRightMultiplyAndAccumulateUnstructured/8     4.02  ms
BM_BlockSparseRightMultiplyAndAccumulateUnstructured/16    2.39  ms
BM_BlockSparseRightMultiplyAndAccumulateUnstructured/28    1.68  ms
BM_BlockSparseLeftMultiplyAndAccumulateBA                 30.7   ms
BM_BlockSparseLeftMultiplyAndAccumulateUnstructured       41.5   ms
BM_CRSRightMultiplyAndAccumulateBA/1                      24.1   ms
BM_CRSRightMultiplyAndAccumulateBA/2                      13.6   ms
BM_CRSRightMultiplyAndAccumulateBA/4                       8.70  ms
BM_CRSRightMultiplyAndAccumulateBA/8                       5.34  ms
BM_CRSRightMultiplyAndAccumulateBA/16                      3.99  ms
BM_CRSRightMultiplyAndAccumulateBA/28                      4.00  ms
BM_CRSRightMultiplyAndAccumulateUnstructured/1            21.1   ms
BM_CRSRightMultiplyAndAccumulateUnstructured/2            10.83  ms
BM_CRSRightMultiplyAndAccumulateUnstructured/4             5.88  ms
BM_CRSRightMultiplyAndAccumulateUnstructured/8             3.68  ms
BM_CRSRightMultiplyAndAccumulateUnstructured/16            2.21  ms
BM_CRSRightMultiplyAndAccumulateUnstructured/28            1.71  ms
BM_CRSLeftMultiplyAndAccumulateBA                         23.6   ms
BM_CRSLeftMultiplyAndAccumulateUnstructured               22.5   ms
BM_CudaRightMultiplyAndAccumulateBA                        0.679 ms
BM_CudaRightMultiplyAndAccumulateUnstructured              0.480 ms
BM_CudaLeftMultiplyAndAccumulateBA                         0.774 ms
BM_CudaLeftMultiplyAndAccumulateUnstructured               0.361 ms

./bin/partitioned_matrix_view_benchmark
-----------------------------------------------------------------
Benchmark                                                    Time
-----------------------------------------------------------------
BM_PatitionedViewRightMultiplyAndAccumulateE_Static/1    18.5  ms
BM_PatitionedViewRightMultiplyAndAccumulateE_Static/2    10.7  ms
BM_PatitionedViewRightMultiplyAndAccumulateE_Static/4     6.34 ms
BM_PatitionedViewRightMultiplyAndAccumulateE_Static/8     4.26 ms
BM_PatitionedViewRightMultiplyAndAccumulateE_Static/16    3.86 ms
BM_PatitionedViewRightMultiplyAndAccumulateE_Static/28    3.75 ms
BM_PatitionedViewRightMultiplyAndAccumulateF_Static/1    18.8  ms
BM_PatitionedViewRightMultiplyAndAccumulateF_Static/2    11.9  ms
BM_PatitionedViewRightMultiplyAndAccumulateF_Static/4     6.94 ms
BM_PatitionedViewRightMultiplyAndAccumulateF_Static/8     4.41 ms
BM_PatitionedViewRightMultiplyAndAccumulateF_Static/16    3.63 ms
BM_PatitionedViewRightMultiplyAndAccumulateF_Static/28    3.86 ms

Timings correspond to intel 8176 cpu and 2080ti nvidia gpu,
with OpenMP threading backend.

Change-Id: Idc07d0563103d057ca3c8412de81a7823fe232af
20 files changed
tree: ef572ab94ee4f7215e75aa88245d150baf8927a4
  1. .github/
  2. bazel/
  3. cmake/
  4. config/
  5. data/
  6. docs/
  7. examples/
  8. include/
  9. internal/
  10. scripts/
  11. .clang-format
  12. .gitignore
  13. BUILD
  14. CITATION.cff
  15. CMakeLists.txt
  16. CONTRIBUTING.md
  17. LICENSE
  18. package.xml
  19. README.md
  20. WORKSPACE
README.md

Android Linux macOS Windows

Ceres Solver

Ceres Solver is an open source C++ library for modeling and solving large, complicated optimization problems. It is a feature rich, mature and performant library which has been used in production at Google since 2010. Ceres Solver can solve two kinds of problems.

  1. Non-linear Least Squares problems with bounds constraints.
  2. General unconstrained optimization problems.

Please see ceres-solver.org for more information.