Block-sparse to CRS conversion using block-structure

Instead of pre-computing the permutation from block-sparse to CRS order,
the index of each value in the CRS matrix is computed on the fly from the
block-sparse structure while the values are being updated.
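
As a rough illustration (not the code of this change), a CUDA kernel along
the following lines can scatter values from block-sparse order into the CRS
values array, deriving the CRS index from the block structure instead of
looking it up in a stored permutation. The Cell/RowBlock layout and all names
below are hypothetical simplifications:

  // Simplified, hypothetical block-sparse layout: cells of a row block are
  // stored contiguously, sorted by column block; cell values are row-major.
  struct Cell {
    int col_begin;     // first scalar column of the column block (used when
                       // building CRS column indices, not needed here)
    int num_cols;      // width of the column block
    int value_offset;  // offset of the cell's first value in block-sparse order
  };
  struct RowBlock {
    int row_begin;     // first scalar row of the row block
    int num_rows;      // height of the row block
    int cell_begin;    // index of this row block's first cell in `cells`
    int num_cells;     // number of cells in this row block
  };

  // One CUDA block per row block; threads loop over the row block's cells.
  __global__ void ScatterBlockSparseToCrs(const RowBlock* row_blocks,
                                          int num_row_blocks,
                                          const Cell* cells,
                                          const int* crs_row_offsets,
                                          const double* block_sparse_values,
                                          double* crs_values) {
    const int rb_idx = blockIdx.x;
    if (rb_idx >= num_row_blocks) return;
    const RowBlock rb = row_blocks[rb_idx];
    for (int c = threadIdx.x; c < rb.num_cells; c += blockDim.x) {
      const Cell cell = cells[rb.cell_begin + c];
      // Offset of this cell's values inside each scalar CRS row: the sum of
      // the widths of the preceding cells of the same row block.
      int offset_in_row = 0;
      for (int p = 0; p < c; ++p) {
        offset_in_row += cells[rb.cell_begin + p].num_cols;
      }
      for (int r = 0; r < rb.num_rows; ++r) {
        const int crs_pos = crs_row_offsets[rb.row_begin + r] + offset_in_row;
        const int bs_pos = cell.value_offset + r * cell.num_cols;
        for (int j = 0; j < cell.num_cols; ++j) {
          crs_values[crs_pos + j] = block_sparse_values[bs_pos + j];
        }
      }
    }
  }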

When it is possible to update the values via a simple host-to-device copy,
the block-sparse structure on the GPU is discarded after the CRS structure
has been computed.
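
That fast path can then look roughly like this (sketch with assumed names;
cudaFree and cudaMemcpyAsync are the only real APIs used): once the CRS
structure exists and the value orders coincide, the device-side block
structure is freed and every subsequent update is a single copy.

  #include <cstddef>
  #include <cuda_runtime.h>

  // Hypothetical fast path: block-sparse and CRS value order coincide, so an
  // update is a plain host-to-device copy and the device-side block structure
  // is not needed once the CRS structure has been computed.
  void DiscardBlockStructure(void* d_cells, void* d_row_blocks) {
    cudaFree(d_cells);
    cudaFree(d_row_blocks);
  }

  void UpdateCrsValuesByCopy(const double* host_values, double* d_crs_values,
                             size_t num_values, cudaStream_t stream) {
    cudaMemcpyAsync(d_crs_values, host_values, num_values * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
  }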

Computing the index on the fly is significantly slower than using a
pre-computed permutation, but the cost is still hidden by the host-to-device
transfer.
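
One way to sanity-check that claim is to time the transfer and the scatter
kernel separately with CUDA events (measurement sketch; the kernel launch and
its arguments are placeholders referring to the sketch above):

  #include <cstddef>
  #include <cstdio>
  #include <cuda_runtime.h>

  // Times the host-to-device value copy and the scatter kernel back to back;
  // if the kernel time stays below the copy time, the on-the-fly index
  // computation is hidden behind the transfer.
  void TimeUpdate(const double* host_values, double* d_values, size_t n) {
    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&mid);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_values, host_values, n * sizeof(double),
               cudaMemcpyHostToDevice);
    cudaEventRecord(mid);
    // ScatterBlockSparseToCrs<<<num_row_blocks, 256>>>(...);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float copy_ms = 0.f, kernel_ms = 0.f;
    cudaEventElapsedTime(&copy_ms, start, mid);
    cudaEventElapsedTime(&kernel_ms, mid, stop);
    std::printf("copy %.2f ms, scatter %.2f ms\n", copy_ms, kernel_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(mid);
    cudaEventDestroy(stop);
  }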

On problems from the BAL dataset this reduces the extra GPU memory
consumption from 33% (permutation stored as 32-bit indices) to ~10%
(storing the block-sparse structure).
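
For reference, the 33% figure is consistent with the usual CRS storage cost
per non-zero (assumed accounting, not measured from the code):

  value (double)        8 bytes
  column index (int32)  4 bytes   -> 12 bytes per non-zero in the CRS matrix
  permutation (int32)   4 bytes   -> 4 / 12 ~ 33% extra per non-zero

The block-sparse structure, in contrast, is stored per cell and per block
rather than per value, which for the small (2x9 and 2x3) cells of BAL-style
Jacobians works out to roughly the quoted ~10%.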

Benchmark results:

======================= CUDA Device Properties ======================
Cuda version         : 11.8
Device ID            : 0
Device name          : NVIDIA GeForce RTX 2080 Ti
Total GPU memory     :  11012 MiB
GPU memory available :  10852 MiB
Compute capability   : 7.5
Warp size            : 32
Max threads per block: 1024
Max threads per dim  : 1024 1024 64
Max grid size        : 2147483647 65535 65535
Multiprocessor count : 68
====================================================================
Running ./bin/evaluation_benchmark
Run on (112 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x56)
  L1 Instruction 32 KiB (x56)
  L2 Unified 1024 KiB (x56)
  L3 Unified 39424 KiB (x2)
Load Average: 24.58, 11.75, 8.52

-----------------------------------------------------------------------
Benchmark                                                          Time
-----------------------------------------------------------------------
Using on-the-fly computation of CRS index corresponding to block-sparse
index:

JacobianToCRS<g/final/problem-4585-1324582-pre.txt>             1607 ms
JacobianToCRSView<g/final/problem-4585-1324582-pre.txt>          564 ms
JacobianToCRSMatrix<g/final/problem-4585-1324582-pre.txt>       2226 ms
JacobianToCRSViewUpdate<g/final/problem-4585-1324582-pre.txt>    228 ms
JacobianToCRSMatrixUpdate<g/final/problem-4585-1324582-pre.txt>  400 ms

Using precomputed permutation:
JacobianToCRS</final/problem-4585-1324582-pre.txt>              1656 ms
JacobianToCRSView</final/problem-4585-1324582-pre.txt>           553 ms
JacobianToCRSMatrix</final/problem-4585-1324582-pre.txt>        2255 ms
JacobianToCRSViewUpdate</final/problem-4585-1324582-pre.txt>     228 ms
JacobianToCRSMatrixUpdate</final/problem-4585-1324582-pre.txt>   406 ms

Performance of JacobianToCRSViewUpdate is still limited by the host-to-device
transfer, and JacobianToCRSView is faster than computing the CRS structure on
the CPU.

Change-Id: Ifb6910fb01ae6071400d36c277846fadc5857964