#include "thread_pool.hpp" // just drop the header somewhere in your include path
std::future<void> fut = SP::ThreadPool::submit_task(0, N, [](std::size_t i) {
heavy_work(i);
});
SP::ThreadPool::wait_for_all(); // blocks until every submitted task finishes
// fut.get() to wait for a specific task the future is returned from
- No manual boot‑up required – the first call auto‑initialises the pool.
- `submit_task(...)` has overloads for range loops, custom chunking, and per‑task thread caps.
- `set_thread_count(n)` changes the pool size globally, at any time.
- `shutdown()` tears everything down (use once at program exit).
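A minimal lifecycle sketch tying these together (the thread count, range bound, and the work inside the lambda are placeholder choices, not requirements):

```cpp
#include <cstddef>
#include "thread_pool.hpp"

int main() {
    SP::ThreadPool::set_thread_count(8); // optional: fix the pool size up front

    auto fut = SP::ThreadPool::submit_task(0, 1'000, [](std::size_t i) {
        (void)i; // placeholder for real per-index work
    });
    fut.get();                      // wait on this specific task...
    SP::ThreadPool::wait_for_all(); // ...or on everything in flight

    SP::ThreadPool::shutdown();     // tear everything down once, at exit
}
```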
`SP::ThreadPool` is a single‑header, C++17 work‑stealing thread pool designed for embarrassingly parallel workloads. It uses only the C++ Standard Library. Idle threads pull tasks from heavier queues for near‑perfect load balancing.
You can divide a loop into chunks by specifying `chunk_count`, `chunk_size`, or `chunk_multiplier`:

- `chunk_count`: the exact number of chunks to divide the task into.
- `chunk_size`: how many items each chunk should hold; the pool then computes the appropriate `chunk_count`.
- `chunk_multiplier`: the number of chunks each thread should be assigned; the pool then computes the appropriate `chunk_count`.
For simple tasks, either set `chunk_count` below the number of available threads or use `threads_cap` to reduce the overhead of multi‑threading.
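For concreteness, a sketch of all three chunking styles using the overloads documented in the API reference below (`N`, the chunk parameters, and the `process` lambda are placeholders; the default multiplier is 4, per the API table):

```cpp
constexpr std::size_t N = 100'000;
auto process = [](std::size_t i) { (void)i; /* per-index work */ };

// Exact chunk count: split [0, N) into 8 chunks.
SP::ThreadPool::submit_task(0, N, /*chunk_count=*/8, process);

// Fixed chunk size: 1024 items per chunk; the pool derives chunk_count.
SP::ThreadPool::submit_task_with_chunk_size(0, N, /*chunk_size=*/1024, process);

// Chunk multiplier: chunks = 2 × active_threads.
SP::ThreadPool::submit_task_with_chunk_multiplier(0, N, /*multiplier=*/2, process);

SP::ThreadPool::wait_for_all();
```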
- Add the header:

  ```cpp
  #include "thread_pool.hpp"
  ```

- Add the compiler flag `-pthread`. That's it! The thread pool is self‑contained inside that header.

- Submit work:

  ```cpp
  std::size_t N = 1'000'000;
  SP::ThreadPool::submit_task(0, N, [](std::size_t i) { compute(i); });
  SP::ThreadPool::wait_for_all();
  ```

- Tune if required:

  ```cpp
  SP::ThreadPool::set_thread_count(8);      // fixed pool size
  SP::ThreadPool::set_work_stealing(false); // FIFO scheduling only
  ```
| Mechanism | Summary |
|---|---|
| Per‑thread deques | Each worker owns a double‑ended queue. It pushes its own tasks at the back and pops from the front (FIFO, cache‑friendly for recursion). |
| Work stealing | When a worker runs dry, it scans the other queues and steals from the back of the fullest one, minimizing contention. Stealing can be disabled at any time via `set_work_stealing(bool)`. |
| Thread cap | `set_threads_cap(k)` (permanent) or `submit_task_with_threads_cap()` (temporary) limits active workers to `k` (useful when sharing CPUs). The pool automatically uses the first `k` threads. |
| Processor affinity | `set_processor_affinity()` pins each worker to a core (`SetThreadAffinityMask` / `pthread_setaffinity_np`). The pool uses sequential pinning; if there are more threads than cores, it wraps around and assigns multiple threads to a single core. Optional; call once after boot. |
| Auto‑teardown | When the last task finishes and you call `shutdown()`, all threads join cleanly. The destructor also triggers this on program exit. |
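Putting these knobs together, a sketch of a boot/tune sequence (the ordering follows the notes above; the cap of 4 is an arbitrary example):

```cpp
SP::ThreadPool::soft_boot();              // pre-launch workers without submitting work
SP::ThreadPool::set_processor_affinity(); // pin workers to cores (once, after boot)
SP::ThreadPool::set_work_stealing(false); // e.g. for jitter-sensitive phases
SP::ThreadPool::set_threads_cap(4);       // only the first 4 workers pick up tasks

// ... submit tasks here ...

SP::ThreadPool::shutdown();               // join all workers at program exit
```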
All methods are static – call them via the class.
| Method | Purpose | Notes / Warnings |
|---|---|---|
| `submit_task(F&& f)` | Enqueue a single functor. | Round‑robin assignment; returns `std::future<void>` immediately. |
| `submit_task(start, end, F&& f)` | Split `[start, end)` into `threads × 4` chunks (default multiplier) and process in parallel. | Non‑blocking; wait via the returned future or `wait_for_all()`. |
| `submit_task(start, end, chunk_cnt, F&& f)` | Explicit number of chunks. | Chunk count 0 is a no‑op (the future resolves instantly). |
| `submit_task_with_chunk_multiplier(start, end, mul, F&& f)` | `chunks = mul × active_threads`. | Use for coarse‑/fine‑grained tuning. |
| `submit_task_with_chunk_size(start, end, chunk_sz, F&& f)` | Fixed item count per chunk. | |
| `submit_task_with_threads_cap(start, end, cap, F&& f)` | Temporarily restrict workers for this task. | Important: tasks submitted with a cap must finish (wait on the future) before submitting new capped tasks, to avoid starvation. |
| `submit_task_with_threads_cap(start, end, cap, mul, F&& f)` | Cap + custom chunking. | Same caution as above. |
| `wait_for_all()` | Block until every queued task completes. | Safe to call multiple times. |
| `set_thread_count(n)` | Resize the pool (spawns or joins threads). | Active tasks continue; the new size applies to future submissions. |
| `get_thread_count()` | Current configured pool size. | Does not reflect temporary caps. |
| `set_threads_cap(k)` | Manually cap active workers. | Applies to all subsequent tasks until changed again. |
| `get_threads_cap()` | Current global cap. | |
| `set_work_stealing(bool)` | Enable/disable stealing. | Disabling may reduce jitter for real‑time work at the cost of load balance. |
| `set_processor_affinity()` | Pin threads to cores. | Call once after the pool is running. No‑op on some platforms. |
| `shutdown()` | Join all workers and free memory. | Call once, usually from the `main()` shutdown path. |
| `soft_boot()` | Pre‑launch threads without submitting a task. | Rarely needed; used in low‑latency systems. |
```cpp
// Parallel loop over a container
std::vector<int> data(1'000'000);
auto fut = SP::ThreadPool::submit_task(0, data.size(), [&](std::size_t i) {
    data[i] = heavy_compute(i);
});
fut.get(); // or SP::ThreadPool::wait_for_all();
```

```cpp
// Cap this task at 4 worker threads
auto fut = SP::ThreadPool::submit_task_with_threads_cap(
    0, files.size(), /*cap=*/4, [&](std::size_t i) {
        parse_file(files[i]);
    });
fut.get(); // MUST finish before you raise the cap or submit further capped work
```

```cpp
// Fixed chunk size: 16 items per chunk
constexpr std::size_t CHUNK = 16;
SP::ThreadPool::submit_task_with_chunk_size(0, N, CHUNK, do_work);
```

```cpp
// Fire-and-forget single task
SP::ThreadPool::submit_task([] { prewarm(); });
```
- Always wait (`future::get()` or `wait_for_all()`) when using per‑call thread caps to avoid deadlocks.
- Call `set_processor_affinity()` after the pool is up and before heavy computation phases.
- Avoid very small chunk sizes (< 2–4 µs of work per chunk) to minimize scheduler overhead.
- For library authors: wrap pool calls so that you can fall back to the caller's executor in the future (see the sketch below).
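One possible shape for that last point, as a hypothetical sketch (`parallel_for` is an invented wrapper name, not part of the library):

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include "thread_pool.hpp"

// Single seam through which all pool calls go; swapping in a caller-supplied
// executor later only requires changing this one function.
template <typename F>
std::future<void> parallel_for(std::size_t begin, std::size_t end, F&& f) {
    return SP::ThreadPool::submit_task(begin, end, std::forward<F>(f));
}
```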
No build‑system magic – include the header and compile with C++17 or newer (e.g. `g++ -std=c++17 -pthread main.cpp`). On Windows you may need to link against `Synchronize.lib`, which MSVC provides implicitly (nothing extra to do on recent toolchains).
- `submit_bulk()` returning a `future<void>` per element group.
- Integration with C++26 `std::execution`.
- Optional task priorities.
(to be filled by repo owner)
Benchmark performed on a 3000×3000 image with a maximum of 300 iterations, using 8 threads.
| Tasks | Mean Time (ms) | Std Dev (ms) | Speed (pixels/ms) | Speedup |
|---|---|---|---|---|
| 1 | 3209.35 | 12.24 | 2804.3 | 1.00x |
| 2 | 1609.89 | 4.61 | 5590.5 | 1.99x |
| 4 | 1501.88 | 5.27 | 5992.5 | 2.14x |
| 8 | 1048.94 | 2.45 | 8580.1 | 3.06x |
| 16 | 719.48 | 23.85 | 12509.1 | 4.46x |
| 32 | 583.73 | 14.72 | 15418.1 | 5.50x |
| 64 | 553.96 | 24.56 | 16246.7 | 5.79x |
| 128 | 585.07 | 38.29 | 15382.7 | 5.49x |
Maximum speedup: 5.793x using 64 tasks
| Tasks | Mean Time (ms) | Std Dev (ms) | Speed (pixels/ms) | Speedup |
|---|---|---|---|---|
| 1 | 3150.02 | 28.70 | 2857.1 | 1.00x |
| 2 | 1582.81 | 3.22 | 5686.1 | 1.99x |
| 4 | 1483.74 | 4.78 | 6065.8 | 2.12x |
| 8 | 1038.28 | 13.72 | 8668.2 | 3.03x |
| 16 | 678.65 | 17.39 | 13261.7 | 4.64x |
| 32 | 627.67 | 5.54 | 14338.8 | 5.02x |
| 64 | 545.37 | 19.13 | 16502.6 | 5.78x |
| 128 | 513.02 | 9.62 | 17543.3 | 6.14x |
Maximum speedup: 6.140x using 128 tasks
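For reference, the Speedup column in both tables is the single‑task mean time divided by the N‑task mean time; e.g. 3150.02 / 513.02 ≈ 6.14 for the 128‑task row above.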
- `SP::ThreadPool` shows slightly better performance than `BS::Thread_Pool`.
- Both implementations achieve significant speedups (5.79x vs 6.14x).
- Optimal task counts differ between implementations (64 vs 128 tasks).
- Fork & branch.
- Follow the style in the existing header (clang‑format file forthcoming).
- Open a PR; GitHub CI runs sanitizer + unit tests.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.