
I am using the OpenMP task construct to parallelize a binary-tree application. However, the measured runtime shows that the single-threaded run outperforms the multi-threaded runs. How should I interpret these results?

The implementation is the following:

  size_t N = 1 << 16;
  size_t *D = new size_t [N];
  size_t threads = 1;  // 1, 2, 4, 8, 16

  std::vector<std::vector<int>> mat1(128); // a 2D matrix
  std::vector<std::vector<int>> mat2(128); // a 2D matrix

  for(size_t i = 0; i < mat1.size(); ++i){
    mat1[i].resize(128);
    mat2[i].resize(128);
  }

  #pragma omp parallel num_threads(threads)
  {
    #pragma omp single
    {
      for(size_t i = 1; i < N; ++i) {

        size_t p = i / 2;
        size_t l = i * 2;
        size_t r = l + 1;

        if(l < N && r < N) {
          #pragma omp task firstprivate(i) depend(out:D[l], D[r]) depend(in:D[p])
          {
            // perform matrix multiplication
            for(size_t x = 0; x < 128; ++x) {
              for(size_t y = 0; y < 128; ++y) {
                size_t sum = 0;
                for(size_t z = 0; z < 128; ++z) {
                  //sum += mat1[x][z] * mat2[z][y];
                  sum += mat1[x][z] * mat2[y][z]; // mat2 is transposed for better memory access pattern
                }
              }
            }     
          }
        }
        else {
          #pragma omp task firstprivate(i) depend(in:D[p])
          {           
            // perform matrix multiplication
            for(size_t x = 0; x < 128; ++x) {
              for(size_t y = 0; y < 128; ++y) {
                size_t sum = 0;
                for(size_t z = 0; z < 128; ++z) {
                  //sum += mat1[x][z] * mat2[z][y];
                  sum += mat1[x][z] * mat2[y][z]; // mat2 is transposed for better memory access pattern
                }
              }
            }     
          }
        }
      }
    }
  }

  delete [] D;
}

However, if I replace the matrix multiplication with a sleep call, the runtime does scale with the number of threads.
[Figure: runtime of the matrix-multiplication version vs. number of threads]

[Figure: runtime of the sleep version vs. number of threads]
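
The sleep experiment is essentially the following stripped-down sketch (the 100 µs delay and the dropped depend clauses are simplifications for illustration; only the task body differs from the code above):

  #include <omp.h>
  #include <chrono>
  #include <thread>
  #include <cstdio>

  int main() {
    size_t N = 1 << 16;
    double tic = omp_get_wtime();
    #pragma omp parallel
    {
      #pragma omp single
      {
        for(size_t i = 1; i < N; ++i) {
          #pragma omp task
          {
            // stand-in for the matrix multiplication
            std::this_thread::sleep_for(std::chrono::microseconds(100));
          }
        }
      }
    }
    std::printf("Elapsed: %lf s\n", omp_get_wtime() - tic);
    return 0;
  }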

If I replace the matrix multiplication with a lightweight operation, such as the following:

size_t sum = i * 1000; // i is the private value of each task

I observe the same behaviour as with the matrix multiplication: OpenMP does not scale.

[Figure: runtime of the lightweight-operation version vs. number of threads]

The implementation is compiled with gcc-12 with -O3 enabled and runs on an Ubuntu 19.10 (Eoan Ermine) machine with 80 Intel Xeon Gold 6138 CPUs at 2.00 GHz and 256 GB of RAM.

I expected the runtime versus the number of threads to follow a convex curve, with the minimum at around 4 or 8 threads.

2 Answers


  1. Chosen as BEST ANSWER

    Based on the previous discussion and comments, I found that OpenMP tasks can only scale when two conditions hold:

    1. The computational load of each task must be large enough (> 100 µs). The overhead of the task construct cannot be ignored, so the elapsed time of small tasks is dominated by task creation and scheduling.
    2. The results of the computation must actually be used. When tasks only perform local computations whose results are never read, the compiler optimizes them away entirely, such as
    for(size_t x = 0; x < 128; ++x) {
      for(size_t y = 0; y < 128; ++y) {
        size_t sum = 0;
        for(size_t z = 0; z < 128; ++z) {
          sum += mat1[x][z] * mat2[y][z]; // the compiler optimizes this out because sum is never used
        }
      }
    }    
    

    To avoid this optimization, we should accumulate the results into global storage, as PierU suggests:

    size_t *sum = new size_t[N];
    
    for(size_t x = 0; x < 128; ++x) {
      for(size_t y = 0; y < 128; ++y) {
        for(size_t z = 0; z < 128; ++z) {
          sum[i] += mat1[x][z] * mat2[y][z]; 
        }
      }
    }    
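
    An alternative sketch of mine (not something PierU proposed) that avoids a shared array is to funnel each task's result into a task-local volatile variable; a volatile store is observable behaviour, so the compiler has to keep the loop nest:

    // Inside each task: keep the result observable via a volatile sink,
    // so -O3 cannot remove the loop nest as dead code.
    size_t sum = 0;
    for(size_t x = 0; x < 128; ++x) {
      for(size_t y = 0; y < 128; ++y) {
        for(size_t z = 0; z < 128; ++z) {
          sum += mat1[x][z] * mat2[y][z];
        }
      }
    }
    volatile size_t sink;
    sink = sum;  // volatile write: computing sum cannot be skipped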
    

    Below is the complete implementation I used to measure the runtime of the OpenMP task construct on a binary-tree structure, modified according to the previous comments. It is compiled with gcc-9 with -O3 enabled and runs on an Ubuntu 19.10 (Eoan Ermine) machine with 80 Intel Xeon Gold 6138 CPUs at 2.00 GHz and 256 GB of RAM.

    #include <omp.h>
    #include <iostream>
    #include <chrono>
    #include <vector>
    #include <string>
    #include <stdlib.h>
    
    int main(int argc, char** argv) {
      
      size_t num_layers = 18;
      unsigned num_threads = strtol(argv[1], NULL, 10);
      size_t msize = strtol(argv[2], NULL, 10);
      
      size_t N = 1 << num_layers;
    
      std::vector<std::vector<int>> mat1(msize);
      std::vector<std::vector<int>> mat2(msize);
    
      for(int i = 0; i < msize; ++i) {
        mat1[i].resize(msize);
        mat2[i].resize(msize);
      }
      
      for(int i = 0; i < msize; ++i) {
        for(int j = 0; j < msize; ++j) {
          mat1[i][j] = i+j;
          mat2[i][j] = i-j;
        }
      }
    
      auto beg = std::chrono::high_resolution_clock::now();
      
      size_t *D = new size_t [N];
    
      #pragma omp parallel num_threads(num_threads)
      {
        #pragma omp single
        {
          for(size_t i = 1; i<N; ++i) {
    
            size_t p = i / 2;
            size_t l = i * 2;
            size_t r = l + 1;
    
            if(l < N && r < N) {
              #pragma omp task firstprivate(i) depend(out:D[l], D[r]) depend(in:D[p])
              {
                D[i] = 0; 
                for (int x = 0; x < mat1.size(); x++) {
                  for (int y = 0; y < mat1.size(); y++) {
                    for (int z = 0; z < mat1.size(); z++) {
                      D[i] += mat1[x][z] * mat2[y][z];
                    }
                  }
                }
              }
            }
            else {
              #pragma omp task firstprivate(i) depend(in:D[p])
              {
                 
                D[i] = 0; 
                for (int x = 0; x < mat1.size(); x++) {
                  for (int y = 0; y < mat1.size(); y++) {
                    for (int z = 0; z < mat1.size(); z++) {
                      D[i] += mat1[x][z] * mat2[y][z];
                    }
                  }
                }
              }
            }
          }
        }
      }
    
      delete [] D;
    
      auto end = std::chrono::high_resolution_clock::now();
    
      std::cout << "Elapsed: " << std::chrono::duration_cast<std::chrono::microseconds>(end - beg).count() << " us\n";
    
      return 0;
    }
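
    It can be compiled and run like this, for example (the file name here is arbitrary; the two arguments are the number of threads and the matrix size):

    % g++ -O3 -fopenmp task_tree.cpp
    % ./a.out 8 128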
    
    

  2. I have tested your code and found two issues:

    1. Since you never use the results of the calculations, the compiler (I used g++ -O3 -fopenmp) simply skips the loops (i.e. doesn't execute them at all), so the timings mean nothing. I have introduced some instructions to avoid this issue.
    2. But then I got segmentation faults. I'm no C++ specialist, but are you sure your matrices are correctly declared? The outer vector is dimensioned to 128, but the inner vector sizes are unspecified (edit: I see that you have now corrected your code). I replaced them with classical C 2D arrays, and now the code runs fine… and takes much more time.
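
    For reference, the nested vectors could also be sized in one declaration, e.g. (just a sketch; I kept the classical C arrays in the code below):

    // 128 inner vectors, each already sized to 128 and zero-initialized
    std::vector<std::vector<int>> mat1(128, std::vector<int>(128));
    std::vector<std::vector<int>> mat2(128, std::vector<int>(128));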

    Here is the modified code:

    #include <cstdio>
    #include <cstddef>
    #include <vector>
    #include <omp.h>
    
    int main() {
    
      size_t N = 1 << 10; //16;
      size_t *D = new size_t [N];
      size_t threads = 2;  // 1, 2, 4, 8, 16
      size_t *sum = new size_t [N]; // global sum array to avoid skipping the calculations
    
      //std::vector<std::vector<int>> mat1(128); // a 2D matrix
      //std::vector<std::vector<int>> mat2(128); // a 2D matrix
    
      // classical C arrays
      size_t mat1[128][128], mat2[128][128];
      for (int i = 0; i < 128; i++) {
        for (int j = 0; j < 128; j++) {
          mat1[i][j] = 0;
          mat2[i][j] = 0;
        }
      }
    
      double tic = omp_get_wtime();
    
      // number of threads set by the environment variable OMP_NUM_THREADS
      #pragma omp parallel 
      {
        #pragma omp single
        {
          for(size_t i = 1; i < N; ++i) {
    
            size_t p = i / 2;
            size_t l = i * 2;
            size_t r = l + 1;
    
            if(l < N && r < N) {
              #pragma omp task firstprivate(i) depend(out:D[l], D[r]) depend(in:D[p])
              {
                // perform matrix multiplication
                //size_t sum = 0;
                for(size_t x = 0; x < 128; ++x) {
                  for(size_t y = 0; y < 128; ++y) {
                    for(size_t z = 0; z < 128; ++z) {
                      //sum += mat1[x][z] * mat2[z][y];
                      sum[i] += mat1[x][z] * mat2[y][z]; // mat2 is transposed for better memory access pattern
                    }
                  }
                }  
              }
            }
            else {
              #pragma omp task firstprivate(i) depend(in:D[p])
              {           
                // perform matrix multiplication
                //size_t sum = 0;
                for(size_t x = 0; x < 128; ++x) {
                  for(size_t y = 0; y < 128; ++y) {
                    for(size_t z = 0; z < 128; ++z) {
                      //sum += mat1[x][z] * mat2[z][y];
                      sum[i] += mat1[x][z] * mat2[y][z]; // mat2 is transposed for better memory access pattern
                    }
                  }
                }
              }
            }
          }
        }
      }
    
      double toc = omp_get_wtime();
      printf("%lf\n", toc - tic);
    
      delete [] D;
      delete [] sum;
    }
    

    As I said, it is much slower, and it scales (the machine has 10 cores):

    % g++ -O3 -fopenmp mmtask.cpp
    % setenv OMP_NUM_THREADS 1 && ./a.out
    1.803429
    % setenv OMP_NUM_THREADS 2 && ./a.out
    1.011750
    % setenv OMP_NUM_THREADS 4 && ./a.out
    0.475696
    % setenv OMP_NUM_THREADS 8 && ./a.out
    0.301431
    % setenv OMP_NUM_THREADS 16 && ./a.out
    0.357968
    