
I am using the OpenMP task construct to parallelize a binary-tree application. However, the measured runtime shows that the single-threaded run outperforms the multi-threaded runs. How should I interpret these results?

The implementation is the following:

  size_t N = 1 << 16;
  size_t *D = new size_t [N];
  size_t threads = 1;  // 1, 2, 4, 8, 16

  std::vector<std::vector<int>> mat1(128); // a 2D matrix
  std::vector<std::vector<int>> mat2(128); // a 2D matrix

  for(size_t i = 0; i < mat1.size(); ++i){
    mat1[i].resize(128);
    mat2[i].resize(128);
  }

  #pragma omp parallel num_threads(threads)
  {
    #pragma omp single
    {
      for(size_t i = 1; i < N; ++i) {

        size_t p = i / 2;
        size_t l = i * 2;
        size_t r = l + 1;

        if(l < N && r < N) {
          #pragma omp task firstprivate(i) depend(out:D[l], D[r]) depend(in:D[p])
          {
            // perform matrix multiplication
            for(size_t x = 0; x < 128; ++x) {
              for(size_t y = 0; y < 128; ++y) {
                size_t sum = 0;
                for(size_t z = 0; z < 128; ++z) {
                  //sum += mat1[x][z] * mat2[z][y];
                  sum += mat1[x][z] * mat2[y][z]; // mat2 is transposed for better memory access pattern
                }
              }
            }     
          }
        }
        else {
          #pragma omp task firstprivate(i) depend(in:D[p])
          {           
            // perform matrix multiplication
            for(size_t x = 0; x < 128; ++x) {
              for(size_t y = 0; y < 128; ++y) {
                size_t sum = 0;
                for(size_t z = 0; z < 128; ++z) {
                  //sum += mat1[x][z] * mat2[z][y];
                  sum += mat1[x][z] * mat2[y][z]; // mat2 is transposed for better memory access pattern
                }
              }
            }     
          }
        }
      }
    }
  }

  delete [] D;
}

However, if I replace the matrix multiplication with a sleep call, the runtime does scale with the number of threads.
[Figure: runtime of the matrix-multiplication version vs. number of threads]

[Figure: runtime of the sleep version vs. number of threads]
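
The sleep experiment is essentially the following stripped-down sketch (the 100 µs delay and the dropped depend clauses are simplifications for illustration; only the task body differs from the code above):

  #include <omp.h>
  #include <chrono>
  #include <thread>
  #include <cstdio>

  int main() {
    size_t N = 1 << 16;
    double tic = omp_get_wtime();
    #pragma omp parallel
    {
      #pragma omp single
      {
        for(size_t i = 1; i < N; ++i) {
          #pragma omp task
          {
            // stand-in for the matrix multiplication
            std::this_thread::sleep_for(std::chrono::microseconds(100));
          }
        }
      }
    }
    std::printf("Elapsed: %lf s\n", omp_get_wtime() - tic);
    return 0;
  }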

If I replace the matrix multiplication with a lightweight operation, such as the following:

size_t sum = i * 1000; // i is the private value of each task

I observe the same behaviour as with the matrix multiplication: OpenMP does not scale.

[Figure: runtime of the lightweight-operation version vs. number of threads]

The implementation is compiled with gcc-12 with -O3 enabled and runs on an Ubuntu 19.10 (Eoan Ermine) machine with 80 Intel Xeon Gold 6138 CPUs at 2.00 GHz and 256 GB of RAM.

I expected the runtime versus the number of threads to follow a convex curve, with the minimum at around 4 or 8 threads.

2 Answers


  1. Chosen as BEST ANSWER

    Based on the previous discussion and comments, I found that OpenMP tasks can only scale when two conditions hold:

    1. The computational load of each task must be large enough (> 100 µs). The overhead of the task construct cannot be ignored, so the elapsed time of small tasks is dominated by task creation and scheduling.
    2. The results of the computation must actually be used. When tasks only perform local computations whose results are never read, the compiler optimizes them away entirely, such as
    for(size_t x = 0; x < 128; ++x) {
      for(size_t y = 0; y < 128; ++y) {
        size_t sum = 0;
        for(size_t z = 0; z < 128; ++z) {
          sum += mat1[x][z] * mat2[y][z]; // the compiler optimizes this out because sum is never used
        }
      }
    }    
    

    To avoid this optimization, we should accumulate the results into global storage, as PierU suggests:

    size_t *sum = new size_t[N];
    
    for(size_t x = 0; x < 128; ++x) {
      for(size_t y = 0; y < 128; ++y) {
        for(size_t z = 0; z < 128; ++z) {
          sum[i] += mat1[x][z] * mat2[y][z]; 
        }
      }
    }    
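
    An alternative sketch of mine (not something PierU proposed) that avoids a shared array is to funnel each task's result into a task-local volatile variable; a volatile store is observable behaviour, so the compiler has to keep the loop nest:

    // Inside each task: keep the result observable via a volatile sink,
    // so -O3 cannot remove the loop nest as dead code.
    size_t sum = 0;
    for(size_t x = 0; x < 128; ++x) {
      for(size_t y = 0; y < 128; ++y) {
        for(size_t z = 0; z < 128; ++z) {
          sum += mat1[x][z] * mat2[y][z];
        }
      }
    }
    volatile size_t sink;
    sink = sum;  // volatile write: computing sum cannot be skipped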
    

    Below is the complete implementation I used to measure the runtime of the OpenMP task construct on a binary-tree structure, modified according to the previous comments. It is compiled with gcc-9 with -O3 enabled and runs on an Ubuntu 19.10 (Eoan Ermine) machine with 80 Intel Xeon Gold 6138 CPUs at 2.00 GHz and 256 GB of RAM.

    #include <omp.h>
    #include <iostream>
    #include <chrono>
    #include <vector>
    #include <string>
    #include <stdlib.h>
    
    int main(int argc, char** argv) {
      
      size_t num_layers = 18;
      unsigned num_threads = strtol(argv[1], NULL, 10);
      size_t msize = strtol(argv[2], NULL, 10);
      
      size_t N = 1 << num_layers;
    
      std::vector<std::vector<int>> mat1(msize);
      std::vector<std::vector<int>> mat2(msize);
    
      for(int i = 0; i < msize; ++i) {
        mat1[i].resize(msize);
        mat2[i].resize(msize);
      }
      
      for(int i = 0; i < msize; ++i) {
        for(int j = 0; j < msize; ++j) {
          mat1[i][j] = i+j;
          mat2[i][j] = i-j;
        }
      }
    
      auto beg = std::chrono::high_resolution_clock::now();
      
      size_t *D = new size_t [N];
    
      #pragma omp parallel num_threads(num_threads)
      {
        #pragma omp single
        {
          for(size_t i = 1; i<N; ++i) {
    
            size_t p = i / 2;
            size_t l = i * 2;
            size_t r = l + 1;
    
            if(l < N && r < N) {
              #pragma omp task firstprivate(i) depend(out:D[l], D[r]) depend(in:D[p])
              {
                D[i] = 0; 
                for (int x = 0; x < mat1.size(); x++) {
                  for (int y = 0; y < mat1.size(); y++) {
                    for (int z = 0; z < mat1.size(); z++) {
                      D[i] += mat1[x][z] * mat2[y][z];
                    }
                  }
                }
              }
            }
            else {
              #pragma omp task firstprivate(i) depend(in:D[p])
              {
                 
                D[i] = 0; 
                for (int x = 0; x < mat1.size(); x++) {
                  for (int y = 0; y < mat1.size(); y++) {
                    for (int z = 0; z < mat1.size(); z++) {
                      D[i] += mat1[x][z] * mat2[y][z];
                    }
                  }
                }
              }
            }
          }
        }
      }
    
      delete [] D;
    
      auto end = std::chrono::high_resolution_clock::now();
    
      std::cout << "Elapsed: " << std::chrono::duration_cast<std::chrono::microseconds>(end - beg).count() << " us\n";
    
      return 0;
    }
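
    It can be compiled and run like this, for example (the file name here is arbitrary; the two arguments are the number of threads and the matrix size):

    % g++ -O3 -fopenmp task_tree.cpp
    % ./a.out 8 128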
    
    

  2. I have tested your code and found two issues:

    1. Since you never use the results of the calculations, the compiler (I used g++ -O3 -fopenmp) simply skips the loops (i.e. doesn't execute them at all), so the timings mean nothing. I have introduced some instructions to avoid this issue.
    2. But then I got segmentation faults. I'm no C++ specialist, but are you sure your matrices are correctly declared? The outer vector is dimensioned to 128, but the inner vector sizes are unspecified (edit: I see that you have now corrected your code). I replaced them with classical C 2D arrays, and now the code runs fine… and takes much more time.
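
    For reference, the nested vectors could also be sized in one declaration, e.g. (just a sketch; I kept the classical C arrays in the code below):

    // 128 inner vectors, each already sized to 128 and zero-initialized
    std::vector<std::vector<int>> mat1(128, std::vector<int>(128));
    std::vector<std::vector<int>> mat2(128, std::vector<int>(128));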

    Here is the modified code:

    #include <cstdio>
    #include <cstddef>
    #include <vector>
    #include <omp.h>
    
    int main() {
    
      size_t N = 1 << 10; //16;
      size_t *D = new size_t [N];
      size_t threads = 2;  // 1, 2, 4, 8, 16
      size_t *sum = new size_t [N]; // global sum array to avoid skipping the calculations
    
      //std::vector<std::vector<int>> mat1(128); // a 2D matrix
      //std::vector<std::vector<int>> mat2(128); // a 2D matrix
    
      // classical C arrays
      size_t mat1[128][128], mat2[128][128];
      for (int i = 0; i < 128; i++) {
        for (int j = 0; j < 128; j++) {
          mat1[i][j] = 0;
          mat2[i][j] = 0;
        }
      }
    
      double tic = omp_get_wtime();
    
      // number of threads set by the environment variable OMP_NUM_THREADS
      #pragma omp parallel 
      {
        #pragma omp single
        {
          for(size_t i = 1; i < N; ++i) {
    
            size_t p = i / 2;
            size_t l = i * 2;
            size_t r = l + 1;
    
            if(l < N && r < N) {
              #pragma omp task firstprivate(i) depend(out:D[l], D[r]) depend(in:D[p])
              {
                // perform matrix multiplication
                //size_t sum = 0;
                for(size_t x = 0; x < 128; ++x) {
                  for(size_t y = 0; y < 128; ++y) {
                    for(size_t z = 0; z < 128; ++z) {
                      //sum += mat1[x][z] * mat2[z][y];
                      sum[i] += mat1[x][z] * mat2[y][z]; // mat2 is transposed for better memory access pattern
                    }
                  }
                }  
              }
            }
            else {
              #pragma omp task firstprivate(i) depend(in:D[p])
              {           
                // perform matrix multiplication
                //size_t sum = 0;
                for(size_t x = 0; x < 128; ++x) {
                  for(size_t y = 0; y < 128; ++y) {
                    for(size_t z = 0; z < 128; ++z) {
                      //sum += mat1[x][z] * mat2[z][y];
                      sum[i] += mat1[x][z] * mat2[y][z]; // mat2 is transposed for better memory access pattern
                    }
                  }
                }
              }
            }
          }
        }
      }
    
      double toc = omp_get_wtime();
      printf("%lf\n", toc - tic);
    
      delete [] D;
      delete [] sum;
    }
    

    As I said, it is much slower, and it scales (the machine has 10 cores):

    % g++ -O3 -fopenmp mmtask.cpp
    % setenv OMP_NUM_THREADS 1 && ./a.out
    1.803429
    % setenv OMP_NUM_THREADS 2 && ./a.out
    1.011750
    % setenv OMP_NUM_THREADS 4 && ./a.out
    0.475696
    % setenv OMP_NUM_THREADS 8 && ./a.out
    0.301431
    % setenv OMP_NUM_THREADS 16 && ./a.out
    0.357968
    