Block-sparse to CRS conversion using block-structure

Instead of pre-computing the permutation from block-sparse to CRS order,
the index of each value in the CRS matrix is computed on the fly from the
block-sparse structure while the values are being updated.
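
As a rough illustration (not the code of this change), a CUDA kernel along
the following lines can scatter values from block-sparse order into the CRS
values array, deriving the CRS index from the block structure instead of
looking it up in a stored permutation. The Cell/RowBlock layout and all names
below are hypothetical simplifications:

  // Simplified, hypothetical block-sparse layout: cells of a row block are
  // stored contiguously, sorted by column block; cell values are row-major.
  struct Cell {
    int col_begin;     // first scalar column of the column block (used when
                       // building CRS column indices, not needed here)
    int num_cols;      // width of the column block
    int value_offset;  // offset of the cell's first value in block-sparse order
  };
  struct RowBlock {
    int row_begin;     // first scalar row of the row block
    int num_rows;      // height of the row block
    int cell_begin;    // index of this row block's first cell in `cells`
    int num_cells;     // number of cells in this row block
  };

  // One CUDA block per row block; threads loop over the row block's cells.
  __global__ void ScatterBlockSparseToCrs(const RowBlock* row_blocks,
                                          int num_row_blocks,
                                          const Cell* cells,
                                          const int* crs_row_offsets,
                                          const double* block_sparse_values,
                                          double* crs_values) {
    const int rb_idx = blockIdx.x;
    if (rb_idx >= num_row_blocks) return;
    const RowBlock rb = row_blocks[rb_idx];
    for (int c = threadIdx.x; c < rb.num_cells; c += blockDim.x) {
      const Cell cell = cells[rb.cell_begin + c];
      // Offset of this cell's values inside each scalar CRS row: the sum of
      // the widths of the preceding cells of the same row block.
      int offset_in_row = 0;
      for (int p = 0; p < c; ++p) {
        offset_in_row += cells[rb.cell_begin + p].num_cols;
      }
      for (int r = 0; r < rb.num_rows; ++r) {
        const int crs_pos = crs_row_offsets[rb.row_begin + r] + offset_in_row;
        const int bs_pos = cell.value_offset + r * cell.num_cols;
        for (int j = 0; j < cell.num_cols; ++j) {
          crs_values[crs_pos + j] = block_sparse_values[bs_pos + j];
        }
      }
    }
  }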

When it is possible to update the values via a simple host-to-device copy,
the block-sparse structure on the GPU is discarded after the CRS structure
has been computed.
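
That fast path can then look roughly like this (sketch with assumed names;
cudaFree and cudaMemcpyAsync are the only real APIs used): once the CRS
structure exists and the value orders coincide, the device-side block
structure is freed and every subsequent update is a single copy.

  #include <cstddef>
  #include <cuda_runtime.h>

  // Hypothetical fast path: block-sparse and CRS value order coincide, so an
  // update is a plain host-to-device copy and the device-side block structure
  // is not needed once the CRS structure has been computed.
  void DiscardBlockStructure(void* d_cells, void* d_row_blocks) {
    cudaFree(d_cells);
    cudaFree(d_row_blocks);
  }

  void UpdateCrsValuesByCopy(const double* host_values, double* d_crs_values,
                             size_t num_values, cudaStream_t stream) {
    cudaMemcpyAsync(d_crs_values, host_values, num_values * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
  }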

Computing the index on the fly is significantly slower than using a
pre-computed permutation, but the cost is still hidden by the host-to-device
transfer.
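
One way to sanity-check that claim is to time the transfer and the scatter
kernel separately with CUDA events (measurement sketch; the kernel launch and
its arguments are placeholders referring to the sketch above):

  #include <cstddef>
  #include <cstdio>
  #include <cuda_runtime.h>

  // Times the host-to-device value copy and the scatter kernel back to back;
  // if the kernel time stays below the copy time, the on-the-fly index
  // computation is hidden behind the transfer.
  void TimeUpdate(const double* host_values, double* d_values, size_t n) {
    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&mid);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_values, host_values, n * sizeof(double),
               cudaMemcpyHostToDevice);
    cudaEventRecord(mid);
    // ScatterBlockSparseToCrs<<<num_row_blocks, 256>>>(...);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float copy_ms = 0.f, kernel_ms = 0.f;
    cudaEventElapsedTime(&copy_ms, start, mid);
    cudaEventElapsedTime(&kernel_ms, mid, stop);
    std::printf("copy %.2f ms, scatter %.2f ms\n", copy_ms, kernel_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(mid);
    cudaEventDestroy(stop);
  }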

On problems from the BAL dataset this reduces the extra GPU memory
consumption from 33% (permutation stored as 32-bit indices) to ~10%
(storing the block-sparse structure).
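
For reference, the 33% figure is consistent with the usual CRS storage cost
per non-zero (assumed accounting, not measured from the code):

  value (double)        8 bytes
  column index (int32)  4 bytes   -> 12 bytes per non-zero in the CRS matrix
  permutation (int32)   4 bytes   -> 4 / 12 ~ 33% extra per non-zero

The block-sparse structure, in contrast, is stored per cell and per block
rather than per value, which for the small (2x9 and 2x3) cells of BAL-style
Jacobians works out to roughly the quoted ~10%.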

Benchmark results:

======================= CUDA Device Properties ======================
Cuda version         : 11.8
Device ID            : 0
Device name          : NVIDIA GeForce RTX 2080 Ti
Total GPU memory     :  11012 MiB
GPU memory available :  10852 MiB
Compute capability   : 7.5
Warp size            : 32
Max threads per block: 1024
Max threads per dim  : 1024 1024 64
Max grid size        : 2147483647 65535 65535
Multiprocessor count : 68
====================================================================
Running ./bin/evaluation_benchmark
Run on (112 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x56)
  L1 Instruction 32 KiB (x56)
  L2 Unified 1024 KiB (x56)
  L3 Unified 39424 KiB (x2)
Load Average: 24.58, 11.75, 8.52

-----------------------------------------------------------------------
Benchmark                                                          Time
-----------------------------------------------------------------------
Using on-the-fly computation of CRS index corresponding to block-sparse
index:

JacobianToCRS<g/final/problem-4585-1324582-pre.txt>             1607 ms
JacobianToCRSView<g/final/problem-4585-1324582-pre.txt>          564 ms
JacobianToCRSMatrix<g/final/problem-4585-1324582-pre.txt>       2226 ms
JacobianToCRSViewUpdate<g/final/problem-4585-1324582-pre.txt>    228 ms
JacobianToCRSMatrixUpdate<g/final/problem-4585-1324582-pre.txt>  400 ms

Using precomputed permutation:
JacobianToCRS</final/problem-4585-1324582-pre.txt>              1656 ms
JacobianToCRSView</final/problem-4585-1324582-pre.txt>           553 ms
JacobianToCRSMatrix</final/problem-4585-1324582-pre.txt>        2255 ms
JacobianToCRSViewUpdate</final/problem-4585-1324582-pre.txt>     228 ms
JacobianToCRSMatrixUpdate</final/problem-4585-1324582-pre.txt>   406 ms

Performance of JacobianToCRSViewUpdate is still limited by the host-to-device
transfer, and JacobianToCRSView is faster than computing the CRS structure on
the CPU.

Change-Id: Ifb6910fb01ae6071400d36c277846fadc5857964