commit | bdee4d6172b940c9c2b46dda863baec0e6fc3308 | [log] [tgz] |
---|---|---|
author | Dmitriy Korchemkin <dmitriy.korchemkin@gmail.com> | Wed May 17 14:43:33 2023 +0300 |
committer | Dmitriy Korchemkin <dmitriy.korchemkin@gmail.com> | Fri May 26 01:12:47 2023 +0300 |
tree | 932a5daf02335c23d9487571bf5d5bc250f89abe | |
parent | 0f9de3daf44312bfad6d3a44c1a8793c64a28c3d [diff] |
Block-sparse to CRS conversion using block-structure Instead of pre-computing pemutation from block-sparse to CRS order, index of value in CRS matrix is computed in the process of updating values using block-sparse structure. When it is possible to update values via a simple host-to-device copy, block-sparse structure on GPU is discarded after computing CRS structure. Computing index is significantly slower than using pre-computed permutation, but is still hidden by host-to-device transfer. On problems from BAL dataset this results into reduction of extra gpu memory consumption from 33% (permutation stored as 32-bit indices) to ~10% for storing block-sparse structure. Benchmark results: ======================= CUDA Device Properties ====================== Cuda version : 11.8 Device ID : 0 Device name : NVIDIA GeForce RTX 2080 Ti Total GPU memory : 11012 MiB GPU memory available : 10852 MiB Compute capability : 7.5 Warp size : 32 Max threads per block: 1024 Max threads per dim : 1024 1024 64 Max grid size : 2147483647 65535 65535 Multiprocessor count : 68 ==================================================================== Running ./bin/evaluation_benchmark Run on (112 X 3200 MHz CPU s) CPU Caches: L1 Data 32 KiB (x56) L1 Instruction 32 KiB (x56) L2 Unified 1024 KiB (x56) L3 Unified 39424 KiB (x2) Load Average: 24.58, 11.75, 8.52 ----------------------------------------------------------------------- Benchmark Time ----------------------------------------------------------------------- Using on-the-fly computation of CRS index corresponding to block-sparse index: JacobianToCRS<g/final/problem-4585-1324582-pre.txt> 1607 ms JacobianToCRSView<g/final/problem-4585-1324582-pre.txt> 564 ms JacobianToCRSMatrix<g/final/problem-4585-1324582-pre.txt> 2226 ms JacobianToCRSViewUpdate<g/final/problem-4585-1324582-pre.txt> 228 ms JacobianToCRSMatrixUpdate<g/final/problem-4585-1324582-pre.txt> 400 ms Using precomputed permutation: JacobianToCRS</final/problem-4585-1324582-pre.txt> 1656 ms JacobianToCRSView</final/problem-4585-1324582-pre.txt> 553 ms JacobianToCRSMatrix</final/problem-4585-1324582-pre.txt> 2255 ms JacobianToCRSViewUpdate</final/problem-4585-1324582-pre.txt> 228 ms JacobianToCRSMatrixUpdate</final/problem-4585-1324582-pre.txt> 406 ms Performance of JacobianToCRSViewUpdate is still limited by host-to-device transfer, and JacobianToCRSView is faster than computing CRS structure on CPU. Change-Id: Ifb6910fb01ae6071400d36c277846fadc5857964
Ceres Solver is an open source C++ library for modeling and solving large, complicated optimization problems. It is a feature rich, mature and performant library which has been used in production at Google since 2010. Ceres Solver can solve two kinds of problems.
Please see ceres-solver.org for more information.