commit	354002f989e7abdab9a4c17291662ab6e3102283	[log] [tgz]
author	Dmitriy Korchemkin <dmitriy.korchemkin@gmail.com>	Fri Oct 06 18:19:12 2023 +0000
committer	Dmitriy Korchemkin <dmitriy.korchemkin@gmail.com>	Fri Oct 06 20:25:16 2023 +0000
tree	36d2fc04af2b52d1a87a22f0ecad4c66a8bf18a7
parent	a9b3fcff426eb33644413f766fdc785af937d9ed [diff]

Schedule task in ParallerFor from the previous task

As pointed out by several users, introduction of parallel operations on
vectors severely impacts solver performance on small problems, with time
consumption increasing with the number of threads.

The problem is two-fold:
 - Single-threaded execution is faster than multi-threaded
 - Overhead of multi-threaded execution increases dramaticaly when
   number of threads is increased

Supposedly, the second problem is due to ParallelInvoke scheduling
a task for each thread via ThreadPool.
When the time required to perform computations is smaller than costs of
scheduling task, runtime becomes linear in num_threads.
Moreover, main thread competes with working threads for mutex in
ConcurrentQueue.

In order to limit scheduling overhead and minimize lock contention,
each new task is scheduled from the previous one, if:
 - Number of scheduled tasks is less than num_threads
 - At the moment of creating the task not all work has been done

Correctness is granted by atomicity of thread_id counter.

SchedulerBenchmark mini-benchmark was added to illustrate the issue.
Each iteration of parallel loop performs change of a single value.

With the previous scheduling strategy, increasing number of threads
leads to significant increase of runtime:
-----------------------------------------------------
Benchmark                           Time   Iterations
-----------------------------------------------------
SchedulerBenchmark/128/1         14.1 ns     49496153
SchedulerBenchmark/128/2         3965 ns       240173
SchedulerBenchmark/128/4        13162 ns        71478
SchedulerBenchmark/128/8        30643 ns        29614
SchedulerBenchmark/128/16       63694 ns        10000
SchedulerBenchmark/256/1         24.1 ns     28943598
SchedulerBenchmark/256/2         3878 ns       227498
SchedulerBenchmark/256/4        13293 ns        69817
SchedulerBenchmark/256/8        31117 ns        32640
SchedulerBenchmark/256/16       59503 ns        14910
SchedulerBenchmark/1024/1        56.7 ns     12048398
SchedulerBenchmark/1024/2        4346 ns       203140
SchedulerBenchmark/1024/4       13487 ns        66736
SchedulerBenchmark/1024/8       30982 ns        33090
SchedulerBenchmark/1024/16      63199 ns        14762
SchedulerBenchmark/4096/1         189 ns      3633540
SchedulerBenchmark/4096/2        5932 ns       131884
SchedulerBenchmark/4096/4       14784 ns        61236
SchedulerBenchmark/4096/8       35857 ns        29276
SchedulerBenchmark/4096/16      63934 ns        10000

With new scheduling strategy, increasing requested number of threads
does not result in that high increase of runtime
-----------------------------------------------------
Benchmark                           Time   Iterations
-----------------------------------------------------
SchedulerBenchmark/128/1         14.1 ns     49323498
SchedulerBenchmark/128/2         2411 ns       362916
SchedulerBenchmark/128/4         3556 ns       243026
SchedulerBenchmark/128/8         4346 ns       200626
SchedulerBenchmark/128/16        5066 ns       169698
SchedulerBenchmark/256/1         24.2 ns     28960018
SchedulerBenchmark/256/2         2330 ns       388470
SchedulerBenchmark/256/4         3864 ns       219233
SchedulerBenchmark/256/8         4399 ns       195225
SchedulerBenchmark/256/16        5111 ns       161858
SchedulerBenchmark/1024/1        55.9 ns     12204777
SchedulerBenchmark/1024/2        2541 ns       329807
SchedulerBenchmark/1024/4        3977 ns       222628
SchedulerBenchmark/1024/8        4607 ns       193548
SchedulerBenchmark/1024/16       5031 ns       160285
SchedulerBenchmark/4096/1         188 ns      3714433
SchedulerBenchmark/4096/2        4203 ns       188284
SchedulerBenchmark/4096/4        4832 ns       171811
SchedulerBenchmark/4096/8        5605 ns       159093
SchedulerBenchmark/4096/16       6425 ns       126861

(both runs were executed on 28-core 56-thread cpu)

Change-Id: I91eca783280598997bfe6abd28019847731692e4

3 files changed

tree: 36d2fc04af2b52d1a87a22f0ecad4c66a8bf18a7

README.md

Ceres Solver

Ceres Solver is an open source C++ library for modeling and solving large, complicated optimization problems. It is a feature rich, mature and performant library which has been used in production at Google since 2010. Ceres Solver can solve two kinds of problems.

Non-linear Least Squares problems with bounds constraints.
General unconstrained optimization problems.

Please see ceres-solver.org for more information.