)]}' { "commit": "c0c4f93940f86e8d108e84a60104bfda4aad66b3", "tree": "2cbbba12cd9d231f1a10926e055d122ba457f416", "parents": [ "fc826c578032c19054a27e5bcac626a3ac6883ac" ], "author": { "name": "Dmitriy Korchemkin", "email": "dmitriy.korchemkin@gmail.com", "time": "Thu Aug 18 22:10:17 2022 +0300" }, "committer": { "name": "Dmitriy Korchemkin", "email": "dmitriy.korchemkin@gmail.com", "time": "Tue Sep 20 11:06:22 2022 +0300" }, "message": "Change implementation of parallel for\n\nImplemented templated invocation routines for ParallelFor backends\nin order to improve loop body inlining.\n\nSeveral modifications of ParallelFor implementation using CXX threads:\n - Index order changed from interleaved to sequential\n - Static task scheduling replaced with dynamic (controlled by\n kWorkBlocksPerThread)\n - Changed index retrieval to atomic\n\nModifications of OpenMP backend:\n - Changed loop scheduling to guided\n\nChanging index order from interleaved to sequential in parallel seems\nto significantly improve run-times of parallel loops, for example in\nevaluation of jacobian and residuals.\n\nOther modifications provide minor improvements for unbalanced\nsub-problem lengths and parallel for loops with a small amount of\ncomputation per operation.\n\nSingle-threaded performance was improved by avoiding costs of\nwrapping parallel loop bodies in std::function.\n\nOn the BAL dataset the following improvements in time consumed for\nevaluation of residuals or jacobian and residuals were observed:\n\n OLD NEW OLD/NEW\n dataset threads r J r J r J\nproblem-257-65132-pre.txt 1 0.025 0.079 0.025 0.074 1.016 1.056\nproblem-257-65132-pre.txt 2 0.030 0.062 0.022 0.050 1.333 1.246\nproblem-257-65132-pre.txt 4 0.023 0.052 0.014 0.034 1.592 1.515\nproblem-257-65132-pre.txt 8 0.015 0.035 0.010 0.025 1.477 1.401\nproblem-257-65132-pre.txt 16 0.011 0.027 0.008 0.019 1.365 1.377\nproblem-356-226730-pre.txt 1 0.150 0.442 0.147 0.412 1.017 1.070\nproblem-356-226730-pre.txt 2 0.155 0.322 0.100 0.281 
1.542 1.145\nproblem-356-226730-pre.txt 4 0.129 0.291 0.089 0.196 1.439 1.485\nproblem-356-226730-pre.txt 8 0.091 0.184 0.066 0.139 1.381 1.319\nproblem-356-226730-pre.txt 16 0.070 0.148 0.055 0.110 1.272 1.340\nproblem-1723-156502-pre.txt 1 0.084 0.243 0.082 0.229 1.023 1.063\nproblem-1723-156502-pre.txt 2 0.088 0.188 0.055 0.154 1.589 1.222\nproblem-1723-156502-pre.txt 4 0.072 0.159 0.049 0.108 1.475 1.475\nproblem-1723-156502-pre.txt 8 0.050 0.105 0.037 0.077 1.348 1.368\nproblem-1723-156502-pre.txt 16 0.038 0.083 0.030 0.062 1.269 1.344\nproblem-1778-993923-pre.txt 1 0.621 1.777 0.609 1.667 1.018 1.065\nproblem-1778-993923-pre.txt 2 0.621 1.273 0.415 1.199 1.494 1.061\nproblem-1778-993923-pre.txt 4 0.514 1.140 0.361 0.786 1.421 1.449\nproblem-1778-993923-pre.txt 8 0.365 0.808 0.277 0.559 1.319 1.443\nproblem-1778-993923-pre.txt 16 0.279 0.608 0.223 0.441 1.252 1.379\nproblem-13682-4456117-pre.txt 1 3.877 10.726 3.738 10.082 1.037 1.063\nproblem-13682-4456117-pre.txt 2 3.310 7.170 2.423 6.448 1.366 1.111\nproblem-13682-4456117-pre.txt 4 3.070 6.344 2.064 4.474 1.486 1.417\nproblem-13682-4456117-pre.txt 8 2.051 4.612 1.527 3.133 1.343 1.472\nproblem-13682-4456117-pre.txt 16 1.549 3.453 1.218 2.488 1.271 1.387\n\nRun time in seconds for a single evaluation, using evaluation_benchmark\nnumactl -N 0 -m 0 ./bin/evaluation_benchmark --bal_root ${path_to_BAL}\nEvaluation was performed on 28-core CPU.\n\nNote: performance when running across numa-nodes degrades in both old\nand proposed implementations, thus the test was executed limiting memory\nand compute resources allocation to a single numa-node.\n\nChange-Id: Ia195580bdab9d05c95ac983bfe37b045eecfaf49\n", "tree_diff": [ { "type": "modify", "old_id": "e5599cc70449d7dedb4763c39170d7c2a14e8477", "old_mode": 33188, "old_path": "internal/ceres/parallel_for.h", "new_id": "3c3d8874a10d64087f074c9de5c385ed14417d86", "new_mode": 33188, "new_path": "internal/ceres/parallel_for.h" }, { "type": "modify", "old_id": 
"df2f619eadaa3f8ddb363668e9f46bb73a0a546c", "old_mode": 33188, "old_path": "internal/ceres/parallel_for_cxx.cc", "new_id": "13cabf90bd1619447bffbaa9e32b2fb4b27e1de0", "new_mode": 33188, "new_path": "internal/ceres/parallel_for_cxx.cc" }, { "type": "add", "old_id": "0000000000000000000000000000000000000000", "old_mode": 0, "old_path": "/dev/null", "new_id": "90edc0774a77ca55bd4e18bd9ccd301a7e930d64", "new_mode": 33188, "new_path": "internal/ceres/parallel_for_cxx.h" }, { "type": "modify", "old_id": "1c1871662c8b1ae9afd6473a9ea5e86a25b24786", "old_mode": 33188, "old_path": "internal/ceres/parallel_for_nothreads.cc", "new_id": "8d3611dbc36444b2ef876e73819736a31b67dd2c", "new_mode": 33188, "new_path": "internal/ceres/parallel_for_nothreads.cc" }, { "type": "modify", "old_id": "1d44bf9977ab93b34f3e2fea4e37dee1f93ff596", "old_mode": 33188, "old_path": "internal/ceres/parallel_for_openmp.cc", "new_id": "02690f31011f18bd25087fc2df68c1c72db9446c", "new_mode": 33188, "new_path": "internal/ceres/parallel_for_openmp.cc" }, { "type": "add", "old_id": "0000000000000000000000000000000000000000", "old_mode": 0, "old_path": "/dev/null", "new_id": "94254c45564efae99d70a4d4384bd26a8c3ef7a9", "new_mode": 33188, "new_path": "internal/ceres/parallel_for_openmp.h" } ] }