commit | c0c4f93940f86e8d108e84a60104bfda4aad66b3 | |
---|---|---|
author | Dmitriy Korchemkin <dmitriy.korchemkin@gmail.com> | Thu Aug 18 22:10:17 2022 +0300 |
committer | Dmitriy Korchemkin <dmitriy.korchemkin@gmail.com> | Tue Sep 20 11:06:22 2022 +0300 |
tree | 2cbbba12cd9d231f1a10926e055d122ba457f416 | |
parent | fc826c578032c19054a27e5bcac626a3ac6883ac | |

Change implementation of parallel for

Implemented templated invocation routines for ParallelFor backends in order to improve loop body inlining.

Several modifications of the ParallelFor implementation using CXX threads:

- Index order changed from interleaved to sequential
- Static task scheduling replaced with dynamic (controlled by kWorkBlocksPerThread)
- Changed index retrieval to atomic

Modifications of the OpenMP backend:

- Changed loop scheduling to guided

Changing the index order from interleaved to sequential seems to significantly improve run times of parallel loops, for example in the evaluation of jacobian and residuals. The other modifications provide minor improvements for unbalanced sub-problem lengths and for parallel loops with a small amount of computation per operation.

Single-threaded performance was improved by avoiding the cost of wrapping parallel loop bodies in std::function.

On the BAL dataset the following improvements were observed in the time consumed for evaluation of residuals (r) or of jacobian and residuals (J):

dataset | threads | OLD r | OLD J | NEW r | NEW J | OLD/NEW r | OLD/NEW J |
---|---|---|---|---|---|---|---|
problem-257-65132-pre.txt | 1 | 0.025 | 0.079 | 0.025 | 0.074 | 1.016 | 1.056 |
problem-257-65132-pre.txt | 2 | 0.030 | 0.062 | 0.022 | 0.050 | 1.333 | 1.246 |
problem-257-65132-pre.txt | 4 | 0.023 | 0.052 | 0.014 | 0.034 | 1.592 | 1.515 |
problem-257-65132-pre.txt | 8 | 0.015 | 0.035 | 0.010 | 0.025 | 1.477 | 1.401 |
problem-257-65132-pre.txt | 16 | 0.011 | 0.027 | 0.008 | 0.019 | 1.365 | 1.377 |
problem-356-226730-pre.txt | 1 | 0.150 | 0.442 | 0.147 | 0.412 | 1.017 | 1.070 |
problem-356-226730-pre.txt | 2 | 0.155 | 0.322 | 0.100 | 0.281 | 1.542 | 1.145 |
problem-356-226730-pre.txt | 4 | 0.129 | 0.291 | 0.089 | 0.196 | 1.439 | 1.485 |
problem-356-226730-pre.txt | 8 | 0.091 | 0.184 | 0.066 | 0.139 | 1.381 | 1.319 |
problem-356-226730-pre.txt | 16 | 0.070 | 0.148 | 0.055 | 0.110 | 1.272 | 1.340 |
problem-1723-156502-pre.txt | 1 | 0.084 | 0.243 | 0.082 | 0.229 | 1.023 | 1.063 |
problem-1723-156502-pre.txt | 2 | 0.088 | 0.188 | 0.055 | 0.154 | 1.589 | 1.222 |
problem-1723-156502-pre.txt | 4 | 0.072 | 0.159 | 0.049 | 0.108 | 1.475 | 1.475 |
problem-1723-156502-pre.txt | 8 | 0.050 | 0.105 | 0.037 | 0.077 | 1.348 | 1.368 |
problem-1723-156502-pre.txt | 16 | 0.038 | 0.083 | 0.030 | 0.062 | 1.269 | 1.344 |
problem-1778-993923-pre.txt | 1 | 0.621 | 1.777 | 0.609 | 1.667 | 1.018 | 1.065 |
problem-1778-993923-pre.txt | 2 | 0.621 | 1.273 | 0.415 | 1.199 | 1.494 | 1.061 |
problem-1778-993923-pre.txt | 4 | 0.514 | 1.140 | 0.361 | 0.786 | 1.421 | 1.449 |
problem-1778-993923-pre.txt | 8 | 0.365 | 0.808 | 0.277 | 0.559 | 1.319 | 1.443 |
problem-1778-993923-pre.txt | 16 | 0.279 | 0.608 | 0.223 | 0.441 | 1.252 | 1.379 |
problem-13682-4456117-pre.txt | 1 | 3.877 | 10.726 | 3.738 | 10.082 | 1.037 | 1.063 |
problem-13682-4456117-pre.txt | 2 | 3.310 | 7.170 | 2.423 | 6.448 | 1.366 | 1.111 |
problem-13682-4456117-pre.txt | 4 | 3.070 | 6.344 | 2.064 | 4.474 | 1.486 | 1.417 |
problem-13682-4456117-pre.txt | 8 | 2.051 | 4.612 | 1.527 | 3.133 | 1.343 | 1.472 |
problem-13682-4456117-pre.txt | 16 | 1.549 | 3.453 | 1.218 | 2.488 | 1.271 | 1.387 |

Run time in seconds for a single evaluation, using evaluation_benchmark:

    numactl -N 0 -m 0 ./bin/evaluation_benchmark --bal_root ${path_to_BAL}

Evaluation was performed on a 28-core CPU.

Note: performance when running across NUMA nodes degrades in both the old and the proposed implementations, so the test was executed with memory and compute resource allocation limited to a single NUMA node.

Change-Id: Ia195580bdab9d05c95ac983bfe37b045eecfaf49

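
The scheduling changes listed in the commit message (sequential index order, dynamic scheduling of work blocks, atomic index retrieval, and a templated loop body instead of std::function) can be illustrated with a minimal sketch. This is not the code from the commit: the name `ParallelForSketch`, the value chosen for `kWorkBlocksPerThread`, and the decision to spawn plain `std::thread` workers on every call are all assumptions made for illustration only.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Assumed value for illustration; the commit message names this constant
// but does not state what it is set to.
constexpr int kWorkBlocksPerThread = 4;

// Hypothetical sketch: calls function(i) once for every i in [start, end).
// F is a template parameter, so the loop body can be inlined instead of
// being called through std::function.
template <typename F>
void ParallelForSketch(int start, int end, int num_threads, const F& function) {
  assert(num_threads > 0);
  if (end <= start) return;
  const int num_indices = end - start;
  if (num_threads == 1) {
    for (int i = start; i < end; ++i) function(i);
    return;
  }

  // Split the range into contiguous blocks: more blocks than threads, so
  // threads that finish early can pick up remaining work (dynamic scheduling).
  const int num_blocks =
      std::min(num_indices, num_threads * kWorkBlocksPerThread);
  const int base_size = num_indices / num_blocks;
  const int remainder = num_indices % num_blocks;
  std::atomic<int> next_block{0};

  auto worker = [&]() {
    while (true) {
      // Atomic retrieval of the next unprocessed block id.
      const int block = next_block.fetch_add(1, std::memory_order_relaxed);
      if (block >= num_blocks) break;
      // The first `remainder` blocks hold one extra index each.
      const int block_start =
          start + block * base_size + std::min(block, remainder);
      const int block_end = block_start + base_size + (block < remainder ? 1 : 0);
      // Sequential (not interleaved) indices inside each block.
      for (int i = block_start; i < block_end; ++i) function(i);
    }
  };

  const int num_workers = std::min(num_threads, num_blocks);
  std::vector<std::thread> threads;
  threads.reserve(num_workers - 1);
  for (int t = 0; t + 1 < num_workers; ++t) threads.emplace_back(worker);
  worker();  // The calling thread also processes blocks.
  for (auto& thread : threads) thread.join();
}
```

A call such as `ParallelForSketch(0, n, 8, [&](int i) { out[i] = f(i); });` visits every index exactly once, provided the body only writes to data owned by index `i`. Because each thread walks contiguous index ranges, neighbouring iterations touch neighbouring memory, which is one plausible reason for the run-time improvement the commit message attributes to sequential ordering.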
Ceres Solver is an open source C++ library for modeling and solving large, complicated optimization problems. It is a feature rich, mature and performant library which has been used in production at Google since 2010. Ceres Solver can solve two kinds of problems.
Please see ceres-solver.org for more information.