| commit | c0c4f93940f86e8d108e84a60104bfda4aad66b3 | |
|---|---|---|
| author | Dmitriy Korchemkin <dmitriy.korchemkin@gmail.com> | Thu Aug 18 22:10:17 2022 +0300 |
| committer | Dmitriy Korchemkin <dmitriy.korchemkin@gmail.com> | Tue Sep 20 11:06:22 2022 +0300 |
| tree | 2cbbba12cd9d231f1a10926e055d122ba457f416 | |
| parent | fc826c578032c19054a27e5bcac626a3ac6883ac | |
Change implementation of parallel for
Implemented templated invocation routines for ParallelFor backends
in order to improve loop body inlining.
Several modifications of ParallelFor implementation using CXX threads:
- Index order changed from interleaved to sequential
- Static task scheduling replaced with dynamic (controlled by
kWorkBlocksPerThread)
- Changed index retrieval to atomic
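
The three changes above can be sketched as follows. This is a minimal illustration and not the code from this commit: `ParallelForSketch` and `kWorkBlocksPerThreadSketch` are hypothetical names, the value 4 is an assumption (the commit names kWorkBlocksPerThread but does not state its value), and plain `std::thread` objects stand in for whatever thread pool the real backend uses.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for kWorkBlocksPerThread; the value is an assumption.
constexpr int kWorkBlocksPerThreadSketch = 4;

// Calls function(i) exactly once for every i in [start, end).
// The loop body is a template parameter, so the compiler can inline it.
template <typename F>
void ParallelForSketch(int start, int end, int num_threads, const F& function) {
  assert(num_threads > 0);
  if (end <= start) return;

  const int num_items = end - start;
  // More blocks than threads, so faster threads can pick up extra work.
  const int num_blocks =
      std::min(num_items, num_threads * kWorkBlocksPerThreadSketch);
  const int base = num_items / num_blocks;
  const int remainder = num_items % num_blocks;

  // Shared counter: each fetch_add hands out the next unclaimed block.
  std::atomic<int> next_block{0};

  auto worker = [&]() {
    for (;;) {
      const int block = next_block.fetch_add(1, std::memory_order_relaxed);
      if (block >= num_blocks) break;
      // Blocks are contiguous (sequential index order); the first
      // `remainder` blocks hold one extra item.
      const int block_start =
          start + block * base + std::min(block, remainder);
      const int block_end = block_start + base + (block < remainder ? 1 : 0);
      for (int i = block_start; i < block_end; ++i) function(i);
    }
  };

  std::vector<std::thread> threads;
  threads.reserve(num_threads - 1);
  for (int t = 1; t < num_threads; ++t) threads.emplace_back(worker);
  worker();  // The calling thread also processes blocks.
  for (auto& thread : threads) thread.join();
}
```

A call such as `ParallelForSketch(0, n, 4, [&](int i) { out[i] = f(i); });` visits every index once, provided the body only writes to locations owned by index `i`.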
Modifications of OpenMP backend:
- Changed loop scheduling to guided
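
For reference, guided scheduling in OpenMP is requested with the `schedule(guided)` clause. The wrapper below is a hypothetical sketch (the name `ParallelForOpenMPSketch` is not from this commit) and assumes the loop body is passed as a template parameter, as in the other backends. Without OpenMP enabled at compile time the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>

// Hypothetical wrapper: runs function(i) for every i in [start, end).
template <typename F>
void ParallelForOpenMPSketch(int start, int end, int num_threads,
                             const F& function) {
  assert(num_threads > 0);
  // schedule(guided): threads grab chunks dynamically, and the chunk size
  // shrinks as the remaining iteration count decreases.
#pragma omp parallel for num_threads(num_threads) schedule(guided)
  for (int i = start; i < end; ++i) {
    function(i);
  }
}
```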
Changing the index order from interleaved to sequential seems to
significantly improve run times of parallel loops, for example in the
evaluation of the Jacobian and residuals.
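
To make the two index orders concrete, the sketch below lists which indices a given thread would visit under each scheme, assuming a simple static split. The helper names are hypothetical. The commit does not say why the sequential order is faster; better memory locality for contiguous ranges is a plausible explanation, but that is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Interleaved order: thread t visits t, t + n, t + 2n, ...
std::vector<int> InterleavedIndices(int num_items, int num_threads,
                                    int thread_id) {
  assert(num_threads > 0 && thread_id >= 0 && thread_id < num_threads);
  std::vector<int> indices;
  for (int i = thread_id; i < num_items; i += num_threads) {
    indices.push_back(i);
  }
  return indices;
}

// Sequential order: thread t visits one contiguous range of indices.
std::vector<int> SequentialIndices(int num_items, int num_threads,
                                   int thread_id) {
  assert(num_threads > 0 && thread_id >= 0 && thread_id < num_threads);
  const int base = num_items / num_threads;
  const int remainder = num_items % num_threads;
  // The first `remainder` threads receive one extra index.
  const int begin = thread_id * base + std::min(thread_id, remainder);
  const int end = begin + base + (thread_id < remainder ? 1 : 0);
  std::vector<int> indices;
  for (int i = begin; i < end; ++i) indices.push_back(i);
  return indices;
}
```

With 10 items and 3 threads, thread 0 visits 0, 3, 6, 9 in the interleaved order and 0, 1, 2, 3 in the sequential order.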
The other modifications provide minor improvements for unbalanced
sub-problem lengths and for parallel loops with a small amount of
computation per iteration.
Single-threaded performance was improved by avoiding costs of
wrapping parallel loop bodies in std::function.
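
The difference between the two call styles can be sketched as below. Both function names are hypothetical and are not taken from Ceres. The `std::function` version calls the body through a type-erased wrapper, which typically prevents inlining; the templated version sees the concrete callable type and lets the compiler inline the body into the loop.

```cpp
#include <cassert>
#include <functional>

// Type-erased body: every iteration goes through std::function's
// indirect call.
inline void LoopWithStdFunction(int start, int end,
                                const std::function<void(int)>& body) {
  assert(start <= end);
  for (int i = start; i < end; ++i) body(i);
}

// Templated body: F is the concrete lambda type, so the call can be inlined.
template <typename F>
void LoopWithTemplate(int start, int end, F&& body) {
  assert(start <= end);
  for (int i = start; i < end; ++i) body(i);
}
```

Both produce the same result for the same body; only the call overhead differs.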
On the BAL dataset, the following improvements were observed in the time
taken to evaluate residuals only (r) or the Jacobian and residuals (J):

| dataset | threads | OLD r | OLD J | NEW r | NEW J | OLD/NEW r | OLD/NEW J |
|---|---|---|---|---|---|---|---|
| problem-257-65132-pre.txt | 1 | 0.025 | 0.079 | 0.025 | 0.074 | 1.016 | 1.056 |
| problem-257-65132-pre.txt | 2 | 0.030 | 0.062 | 0.022 | 0.050 | 1.333 | 1.246 |
| problem-257-65132-pre.txt | 4 | 0.023 | 0.052 | 0.014 | 0.034 | 1.592 | 1.515 |
| problem-257-65132-pre.txt | 8 | 0.015 | 0.035 | 0.010 | 0.025 | 1.477 | 1.401 |
| problem-257-65132-pre.txt | 16 | 0.011 | 0.027 | 0.008 | 0.019 | 1.365 | 1.377 |
| problem-356-226730-pre.txt | 1 | 0.150 | 0.442 | 0.147 | 0.412 | 1.017 | 1.070 |
| problem-356-226730-pre.txt | 2 | 0.155 | 0.322 | 0.100 | 0.281 | 1.542 | 1.145 |
| problem-356-226730-pre.txt | 4 | 0.129 | 0.291 | 0.089 | 0.196 | 1.439 | 1.485 |
| problem-356-226730-pre.txt | 8 | 0.091 | 0.184 | 0.066 | 0.139 | 1.381 | 1.319 |
| problem-356-226730-pre.txt | 16 | 0.070 | 0.148 | 0.055 | 0.110 | 1.272 | 1.340 |
| problem-1723-156502-pre.txt | 1 | 0.084 | 0.243 | 0.082 | 0.229 | 1.023 | 1.063 |
| problem-1723-156502-pre.txt | 2 | 0.088 | 0.188 | 0.055 | 0.154 | 1.589 | 1.222 |
| problem-1723-156502-pre.txt | 4 | 0.072 | 0.159 | 0.049 | 0.108 | 1.475 | 1.475 |
| problem-1723-156502-pre.txt | 8 | 0.050 | 0.105 | 0.037 | 0.077 | 1.348 | 1.368 |
| problem-1723-156502-pre.txt | 16 | 0.038 | 0.083 | 0.030 | 0.062 | 1.269 | 1.344 |
| problem-1778-993923-pre.txt | 1 | 0.621 | 1.777 | 0.609 | 1.667 | 1.018 | 1.065 |
| problem-1778-993923-pre.txt | 2 | 0.621 | 1.273 | 0.415 | 1.199 | 1.494 | 1.061 |
| problem-1778-993923-pre.txt | 4 | 0.514 | 1.140 | 0.361 | 0.786 | 1.421 | 1.449 |
| problem-1778-993923-pre.txt | 8 | 0.365 | 0.808 | 0.277 | 0.559 | 1.319 | 1.443 |
| problem-1778-993923-pre.txt | 16 | 0.279 | 0.608 | 0.223 | 0.441 | 1.252 | 1.379 |
| problem-13682-4456117-pre.txt | 1 | 3.877 | 10.726 | 3.738 | 10.082 | 1.037 | 1.063 |
| problem-13682-4456117-pre.txt | 2 | 3.310 | 7.170 | 2.423 | 6.448 | 1.366 | 1.111 |
| problem-13682-4456117-pre.txt | 4 | 3.070 | 6.344 | 2.064 | 4.474 | 1.486 | 1.417 |
| problem-13682-4456117-pre.txt | 8 | 2.051 | 4.612 | 1.527 | 3.133 | 1.343 | 1.472 |
| problem-13682-4456117-pre.txt | 16 | 1.549 | 3.453 | 1.218 | 2.488 | 1.271 | 1.387 |

Run times are in seconds for a single evaluation, measured with evaluation_benchmark:
numactl -N 0 -m 0 ./bin/evaluation_benchmark --bal_root ${path_to_BAL}
The evaluation was performed on a 28-core CPU.
Note: performance degrades when running across NUMA nodes in both the old
and the proposed implementations, so the test was executed with memory
and compute resource allocation limited to a single NUMA node.
Change-Id: Ia195580bdab9d05c95ac983bfe37b045eecfaf49
Ceres Solver is an open source C++ library for modeling and solving large, complicated optimization problems. It is a feature rich, mature and performant library which has been used in production at Google since 2010. Ceres Solver can solve two kinds of problems.
Please see ceres-solver.org for more information.