Block-sparse to CRS conversion using block-structure

Instead of pre-computing pemutation from block-sparse to CRS order,
index of value in CRS matrix is computed in the process of updating
values using block-sparse structure.

When it is possible to update values via a simple host-to-device copy,
block-sparse structure on GPU is discarded after computing CRS

Computing index is significantly slower than using pre-computed
permutation, but is still hidden by host-to-device transfer.

On problems from BAL dataset this results into reduction of extra
gpu memory consumption from 33% (permutation stored as 32-bit indices)
to ~10% for storing block-sparse structure.

Benchmark results:

======================= CUDA Device Properties ======================
Cuda version         : 11.8
Device ID            : 0
Device name          : NVIDIA GeForce RTX 2080 Ti
Total GPU memory     :  11012 MiB
GPU memory available :  10852 MiB
Compute capability   : 7.5
Warp size            : 32
Max threads per block: 1024
Max threads per dim  : 1024 1024 64
Max grid size        : 2147483647 65535 65535
Multiprocessor count : 68
Running ./bin/evaluation_benchmark
Run on (112 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x56)
  L1 Instruction 32 KiB (x56)
  L2 Unified 1024 KiB (x56)
  L3 Unified 39424 KiB (x2)
Load Average: 24.58, 11.75, 8.52

Benchmark                                                          Time
Using on-the-fly computation of CRS index corresponding to block-sparse

JacobianToCRS<g/final/problem-4585-1324582-pre.txt>             1607 ms
JacobianToCRSView<g/final/problem-4585-1324582-pre.txt>          564 ms
JacobianToCRSMatrix<g/final/problem-4585-1324582-pre.txt>       2226 ms
JacobianToCRSViewUpdate<g/final/problem-4585-1324582-pre.txt>    228 ms
JacobianToCRSMatrixUpdate<g/final/problem-4585-1324582-pre.txt>  400 ms

Using precomputed permutation:
JacobianToCRS</final/problem-4585-1324582-pre.txt>              1656 ms
JacobianToCRSView</final/problem-4585-1324582-pre.txt>           553 ms
JacobianToCRSMatrix</final/problem-4585-1324582-pre.txt>        2255 ms
JacobianToCRSViewUpdate</final/problem-4585-1324582-pre.txt>     228 ms
JacobianToCRSMatrixUpdate</final/problem-4585-1324582-pre.txt>   406 ms

Performance of JacobianToCRSViewUpdate is still limited by
host-to-device transfer, and JacobianToCRSView is faster than computing
CRS structure on CPU.

Change-Id: Ifb6910fb01ae6071400d36c277846fadc5857964
23 files changed
tree: 932a5daf02335c23d9487571bf5d5bc250f89abe
  1. .github/
  2. bazel/
  3. cmake/
  4. config/
  5. data/
  6. docs/
  7. examples/
  8. include/
  9. internal/
  10. scripts/
  11. .clang-format
  12. .gitignore
  13. BUILD
  14. CITATION.cff
  15. CMakeLists.txt
  18. package.xml

Android Linux macOS Windows

Ceres Solver

Ceres Solver is an open source C++ library for modeling and solving large, complicated optimization problems. It is a feature rich, mature and performant library which has been used in production at Google since 2010. Ceres Solver can solve two kinds of problems.

  1. Non-linear Least Squares problems with bounds constraints.
  2. General unconstrained optimization problems.

Please see for more information.