I am using the OpenMP task construct to parallelize a binary tree application. However, the single-threaded run outperforms the multi-threaded runs. How should I interpret these results?
The implementation is the following:
size_t N = 1 << 16;
size_t *D = new size_t[N];
size_t threads = 1; // 1, 2, 4, 8, 16

std::vector<std::vector<int>> mat1(128); // a 2D matrix
std::vector<std::vector<int>> mat2(128); // a 2D matrix (stored transposed)
for(size_t i = 0; i < mat1.size(); ++i) {
  mat1[i].resize(128);
  mat2[i].resize(128);
}

#pragma omp parallel num_threads(threads)
{
  #pragma omp single
  {
    for(size_t i = 1; i < N; ++i) {
      size_t p = i / 2;  // parent index
      size_t l = i * 2;  // left child index
      size_t r = l + 1;  // right child index
      if(l < N && r < N) {
        #pragma omp task firstprivate(i) depend(out:D[l], D[r]) depend(in:D[p])
        {
          // perform matrix multiplication
          for(size_t x = 0; x < 128; ++x) {
            for(size_t y = 0; y < 128; ++y) {
              size_t sum = 0;
              for(size_t z = 0; z < 128; ++z) {
                //sum += mat1[x][z] * mat2[z][y];
                sum += mat1[x][z] * mat2[y][z]; // mat2 is transposed for better memory access pattern
              }
            }
          }
        }
      }
      else {
        #pragma omp task firstprivate(i) depend(in:D[p])
        {
          // perform matrix multiplication (leaf node: no children to signal)
          for(size_t x = 0; x < 128; ++x) {
            for(size_t y = 0; y < 128; ++y) {
              size_t sum = 0;
              for(size_t z = 0; z < 128; ++z) {
                //sum += mat1[x][z] * mat2[z][y];
                sum += mat1[x][z] * mat2[y][z]; // mat2 is transposed for better memory access pattern
              }
            }
          }
        }
      }
    }
  }
}
delete [] D;
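Note that each task computes `sum` but never stores it anywhere. A minimal sketch (illustrative names, not the original benchmark) of the inner-product loop rewritten to return its result, so each task can write `D[i] = ...` and the optimizer can no longer discard the three nested loops as dead code:

```cpp
#include <cstddef>
#include <vector>

// Same triple loop as in the question, but the accumulated sum is returned so
// the caller can store it into the shared D array; a stored result is
// observable, so -O3 must actually execute the loops.
size_t matmul_checksum(const std::vector<std::vector<int>>& m1,
                       const std::vector<std::vector<int>>& m2) {
    size_t sum = 0;
    for (size_t x = 0; x < m1.size(); ++x)
        for (size_t y = 0; y < m2.size(); ++y)
            for (size_t z = 0; z < m1[x].size(); ++z)
                sum += static_cast<size_t>(m1[x][z]) * m2[y][z]; // m2 transposed
    return sum;
}
```

Inside each task this would be used as `D[i] = matmul_checksum(mat1, mat2);`, which also gives the `depend` clauses real data to protect.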
However, if I replace the matrix multiplication with a sleep call, the runtime does scale with the number of threads.
The following image shows the runtime of the matrix-multiplication workload.
The following image shows the runtime of the sleep workload.
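For reference, a sketch (function names are illustrative, not the original code) of what the sleep variant might look like: each task only sleeps, so the measurement isolates OpenMP's task scheduling from compute and memory traffic.

```cpp
#include <chrono>
#include <thread>

// Stand-in for the matrix multiplication: the task body just sleeps.
void fake_work(int ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}

// Spawns n_tasks sleeping tasks and returns elapsed wall time in ms. With
// -fopenmp and T threads this should approach n_tasks * ms_each / T; without
// OpenMP the pragmas are ignored and the loop runs serially.
long run_sleep_tasks(int n_tasks, int ms_each) {
    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n_tasks; ++i) {
        #pragma omp task
        fake_work(ms_each);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}
```

Because a sleeping task occupies no execution resources, this workload scales almost ideally, which is consistent with the graph described above.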
If I replace the matrix multiplication with a lightweight operation, such as the following:
size_t sum = i * 1000; // i is the private value for each task
I observe the same behavior as with the matrix multiplication: OpenMP does not scale. The runtime graph of the lightweight operation is shown below.
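A rough sketch of why the lightweight body cannot scale: a single multiply costs well under a nanosecond, while creating and scheduling one OpenMP task typically costs on the order of a microsecond, so per-task overhead dwarfs the work being parallelized. The helper below (an illustrative measurement, not the original benchmark) estimates the cost of one such operation:

```cpp
#include <chrono>
#include <cstddef>

// Times n repetitions of the "lightweight operation" sum = i * 1000 and
// returns the average cost in nanoseconds. The volatile sink keeps -O3 from
// deleting the loop as dead code.
double ns_per_lightweight_op(size_t n) {
    volatile size_t sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; ++i)
        sink = i * 1000;
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
}
```

If the per-operation time comes out orders of magnitude below the task-creation cost, wrapping each operation in its own task can only slow the program down, matching the flat scaling curve observed.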
The implementation is compiled with gcc-12 with -O3 enabled, and run on an Ubuntu 19.10 (Eoan Ermine) machine with 80 Intel Xeon Gold 6138 CPUs at 2.00 GHz and 256 GB of RAM.
I expected runtime versus thread count to form a convex curve, with the minimum at 4 or 8 threads.
2 Answers
Based on the previous discussions and comments, I found that OpenMP tasks can scale under two conditions:
To avoid this optimization (the compiler removing the loops whose results are unused), we should use global data storage, as PierU suggests.
In the following, I post the entire implementation I used to measure the runtime of the OpenMP task construct on a binary tree structure, modified according to the previous comments. It is compiled with gcc-9 with -O3 enabled, and run on an Ubuntu 19.10 (Eoan Ermine) machine with 80 Intel Xeon Gold 6138 CPUs at 2.00 GHz and 256 GB of RAM.
I have tested your code and found two issues:
The compiler (invoked as g++ -O3 -fopenmp) simply skips the loops (i.e. doesn't execute them at all) because their results are never used, so the timings mean nothing. I have introduced some instructions to avoid this issue. Here is the modified code:
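The modified code itself is not reproduced above. As an assumption about what such "introduced instructions" could look like, one common technique is an empty extended asm that forces GCC to treat a value as observed, so the loop that produces it cannot be eliminated:

```cpp
#include <cstddef>

// Escape hatch for benchmarks: tells the optimizer that `value` is read and
// possibly modified, so the computation feeding it must actually run.
inline void do_not_optimize(size_t& value) {
#if defined(__GNUC__)
    asm volatile("" : "+r"(value)); // empty asm; GCC assumes value is used
#else
    volatile size_t sink = value;   // portable fallback: a volatile store
    (void)sink;
#endif
}

// Example loop that -O3 would otherwise delete once `sum` is unused.
size_t busy_sum(size_t n) {
    size_t sum = 0;
    for (size_t z = 0; z < n; ++z)
        sum += z;
    do_not_optimize(sum); // the loop result now counts as observable
    return sum;
}
```

Storing each task's result into a shared array, as the accepted suggestion does, achieves the same effect without compiler-specific constructs.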
As I said, it is much slower, and it scales (the machine has 10 cores):