I’m trying to optimize the running time of my C++ code by utilizing multiple threads. I have tried a few different solutions.
I have this code using boost:
#include <iostream>
#include <vector>
#include <chrono>
#include <atomic>
#include <thread>
#include <boost/asio.hpp>
#include <boost/bind/bind.hpp>

void Test::boost_worker_task() {
    char new_state[3][3];
    MyEngine::random_start_state(new_state);
    MyEngine::solve_game(new_state);
    ++games_solved_counter;
}

void Test::run(const unsigned int games_to_solve, const bool use_mul_thread) {
    MyEngine::init_rand();
    const auto start = std::chrono::high_resolution_clock::now();
    const unsigned int num_threads = use_mul_thread ? std::thread::hardware_concurrency() : 1;
    std::cout << "Using " << num_threads << " threads to solve " << games_to_solve << " games" << std::endl;
    boost::asio::io_service io_service;
    boost::asio::thread_pool pool(num_threads);
    for (unsigned int i = 0; i < games_to_solve; ++i) {
        io_service.post([] { return Test::boost_worker_task(); });
    }
    // Run and wait for all tasks to complete
    io_service.run();
    pool.join();
    const auto end = std::chrono::high_resolution_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = end - start;
    std::cout << "Solved " << games_solved_counter << " games!" << std::endl;
    std::cout << "Elapsed time: " << elapsed.count() / 1000.0 << " seconds" << std::endl;
    std::cout << "Elapsed time: " << elapsed.count() << " milliseconds\n" << std::endl;
}
As well as this code:
#include <iostream>
#include <vector>
#include <thread>
#include <mutex>
#include <chrono>
#include <atomic>
#include <future>

std::atomic<int> games_solved(0);

void TestMulThread::worker_task(const unsigned int num_iterations, std::mutex& games_solved_mutex) {
    for (unsigned int i = 0; i < num_iterations; ++i) {
        char new_state[3][3];
        MyEngine::random_start_state(new_state);
        MyEngine::solve_game(new_state);
        ++games_solved;
    }
}

void TestMulThread::run(const unsigned int total_games_to_solve) {
    MyEngine::init_rand();
    const auto start_time = std::chrono::high_resolution_clock::now();
    const unsigned int num_threads = std::thread::hardware_concurrency();
    const unsigned int games_per_thread = total_games_to_solve / num_threads;
    // Games left over after every thread gets games_per_thread (modulo the thread count)
    const unsigned int remaining_games = total_games_to_solve % num_threads;
    std::cout << "Using " << num_threads << " threads to solve " << total_games_to_solve << " games" << std::endl;
    // Distribute the remaining games
    std::vector<unsigned int> games_for_each_thread(num_threads, games_per_thread);
    for (unsigned int i = 0; i < remaining_games; ++i) {
        games_for_each_thread[i]++;
    }
    std::vector<std::future<void>> futures;
    std::mutex games_solved_mutex;
    for (unsigned int i = 0; i < num_threads; ++i) {
        futures.push_back(std::async(std::launch::async, worker_task, games_for_each_thread[i], std::ref(games_solved_mutex)));
    }
    for (auto& future : futures) {
        future.get();
    }
    const auto end_time = std::chrono::high_resolution_clock::now();
    const auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();
    std::cout << "Solved " << games_solved << " games!" << std::endl;
    std::cout << "Elapsed time: " << elapsed_time / 1000.0 << " seconds" << std::endl;
    std::cout << "Elapsed time: " << elapsed_time << " milliseconds\n" << std::endl;
}
My issues are twofold:
- The Boost version is running much slower than the second code. It’s even slower than a simple for loop that doesn’t use threads at all. I understand that running tasks in parallel can add overhead of different sorts, but the second code is running super fast and I would like to understand why.
- The second code snippet is running really fast when using Visual Studio (compiling in Release with -O2 as the C++ optimization flag), but when I compile and run the same code on my Linux machine using g++, it’s again running slower than a plain for loop solving the same number of games. I’ve tried compiling with a few different settings, like:
g++ -O2 -o test Test.cpp -std=c++20 -lpthread
g++ -O2 -o test Test.cpp -std=c++20 -pthread
Any ideas as to why this is the case?
Thanks!
2 Answers
@sehe
That's interesting. However, if you don't mind having a look at this modified code:
I've changed the solve_game implementation to somewhat resemble the task I do in the original code. I also added a chance to print to verify that the tasks are running. These are the surprising results I get:
When using USE_SINGLE_THREAD 1:
When using USE_SINGLE_THREAD 0:
How I compile it on Linux: g++ -Ofast -o test test.cpp -lboost_thread -lboost_system && ./test
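One hypothesis for the Linux slowdown, offered as an assumption since MyEngine's source isn't shown: if random_start_state draws from rand(), every thread ends up sharing one generator. As far as I know, glibc protects that generator's state with a lock, while the MSVC runtime keeps rand() state per thread, which would fit the pattern of fast on Visual Studio and slow with g++. Below is a sketch of a per-thread generator. The name random_start_state_tl and the ' ', 'X', 'O' cell encoding are hypothetical, not part of the original code.

```cpp
#include <cassert>
#include <random>

// Hypothetical replacement for MyEngine::random_start_state (the real one is not shown).
// Each thread owns its own engine, so no shared generator state is touched.
void random_start_state_tl(char (&state)[3][3]) {
    // Constructed once per thread, on that thread's first call.
    thread_local std::mt19937 engine{std::random_device{}()};
    std::uniform_int_distribution<int> pick(0, 2);
    static constexpr char symbols[3] = {' ', 'X', 'O'};  // assumed cell encoding

    for (auto& row : state) {
        for (char& cell : row) {
            cell = symbols[pick(engine)];
        }
    }
}
```

Each worker would call random_start_state_tl(new_state) in place of MyEngine::random_start_state(new_state). Whether this actually explains the timings would need profiling.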
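The first program never uses the thread pool, except to join it. Every task is posted to io_service, so io_service.run() executes all of them on the calling thread while the pool's threads sit idle.
If you fix at least the first, the performance matches between the two: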
Live On Coliru
File test.h
File test.cpp
Prints, online:
And locally for me: