Visual Studio Code - CUDA & OpenCL GPU Kernel function queuing are far too slow. Any faster method or technology? Future perhaps? Is this a known issue?

Mike
August 25, 2024
130 views
0 votes
2 Answers

Background

I have spent the past week doing experiments towards using GPU for audio processing/synthesis, for example in areas like finite difference modeling (eg. wave equation) where the equations are highly parallelized at least at each serial time step.

However, there has appeared a massive bottleneck which I am not sure is solvable.

Queuing of the kernel GPU functions is far too slow. I have tried an OpenCL method and just now a CUDA method and they both are extremely slow. To a crushing extent.

Requirements

For real time audio processing or synthesis, audio is typically at 44100 samples per second (Hz) and the audio buffer is typically at most 512 or 1024 samples.

This means we have 23 ms (at 1024 buffer) to process each buffer. In real world scenarios, you only have a tiny fraction of this, as there are often multiple audio processes that must finish in this time on the same CPU thread.

Problem

In order to do any useful audio work, one must finish processing one sample before moving on to the next. Furthermore, you may need several steps to go through per sample. Even if each step is highly parallel in itself, it would be hard to do anything useful without 3-4 steps per sample.

Both in CUDA and OpenCL I have found, in pseudo-code, running:

for (int i=0; i < 1024; i++){
    enqueueStep1();
    enqueueStep2();
    enqueueStep3();
    enqueueStep4();
}
runQueueOrWaitTilDone();

Takes around 10-12 ms. This has been using empty or near empty functions, so this is just the extra burden and cost of queueing the functions, not from heavy processing.

Method

OpenCL – I used this [OpenCL wrapper][1] which allows you to easily create a Kernel kernel and then kernel.enqueue_run(); as many times as you wish then, kernel.finish_queue(); to run it.
CUDA – I used the default CUDA project created by Visual Studio 2022 that has a simple add function built into it. I created a stream so could be sure it was all running on one process and nothing else was interfering. Then I wrapped the basic add function they give you in a for loop and ran cudaStreamSynchronize(stream) after this loop.

Running four operations in a loop of 1024: OpenCL with completely empty functions took around 8-12 ms (less if no arguments were passed in, more if they were) and CUDA around 12 ms with the basic add function (basically also empty function).

Result

This makes real time audio processing or wave equation work essentially useless on GPU’s at the present.

Although in theory a GPU would be great for solving something like each step of a finite difference wave equation, the cost of simply enqueuing the many serial steps required is too great. You are eating up ~half your total audio buffer time just with enqueuing the empty functions.

Question

If you have a hypoethetical function like this where you were trying to process all 1024 samples in one buffer (OpenCL Wrapper code):

kernel void processAudioBlock(global float* A, global float* B, global float* C) { 

    const uint n = get_global_id(0);
    const uint N = get_global_size(0);

    //iterate through 1024 times for 1024 samples in audio buffer
    for (int i=0; i< 1024;i++){

        //Step 1: perform function that does not depend on neighbors
        C[n] = A[n]+B[n];

        //Step 2: perform function that DOES depend on neighbors (and thus needs Step 1 complete for all other samples)
        if (n > 0 && n < N - 1) {
            C[n] = (C[n] + C[n] + C[n - 1] + C[n + 1]) * 0.25f;
        }
    }
}

As you can see, step 1 must be completed for all samples before it moves on to step 2, as step 2 requires looking at the calculations done by the neighboring GPU cores. And hypothetically (in a real case) before you move on to the next sample, the prior sample must be fully done (step 2 must be done for all cores before looping to the next step 1).

I believe I have misunderstood how GPU’s work. I thought I needed multiple kernel calls to keep the cores synchronized in such a case.

However, is it the case that when you run step 1 or 2 for example here, ALL cores do so at exactly the same time? And ALL cores are done step 1 before step 2 is run?

Thanks for any clarification.
[1]: https://github.com/ProjectPhysX/OpenCL-Wrapper

Tags: audio c#cuda gpu opencl

Answers

Chosen as BEST ANSWER
- Mike
- August 25, 2024 at 11:16 pm
- 0 votes
0
The solution was to use barrier(CLK_GLOBAL_MEM_FENCE) in OpenCL liberally inside my kernel to separate each step. Then when running, it one must make sure OpenCL workgroups are coordinated so that each workgroup is expected to stay in sync only with itself.

Ie. barrier(CLK_GLOBAL_MEM_FENCE) can only work across one workgroup at a time. But each workgroup can be up to 1024 in size or more based on the complexity of the Kernel and GPU capacity. All of which can be calculated using the OpenCL API.

And when you run the OpenCL Kernel, you can request number of workgroups and total thread count so it is all predictable.

(Edit)

- huseyintugrulbuyukisik
- August 20, 2024 at 2:21 pm
- 0 votes
0
The part
```
if (n > 0 && n < N - 1) {
    C[n] = (C[n] + C[n] + C[n - 1] + C[n + 1]) * 0.25f;
}
```
computes convoluted values from 1022 elements at most. For faster convolution, you can prepare a 1022-wide kernel of coefficients, then compute its FFT in compile time, then only in run-time compute the FFT of the array of 1024 elements and multiply the FFT of kernel and FFT of array, then compute the inverse-FFT of the result. This should give the convolution faster than brute force (i.e. scanning all elements to compute last element).

Normally this works as is for symmetric kernels (like gravitational force field kernel in 2D/3D to compute all-pair forces between planets) but this version may require some tricks before or after computations.

FFT is O(NlogN) so using 1024 (or less) cuda threads to do it should do less number of total work than brute force and in parallel so each thread (N threads assumed) would do about logN operations instead of N.

To do this inside a kernel, you can implement a Complex type, then a bit-reversing function then a 1024-element FFT function using 2-wide twiddling(called butterfly) operations in parallel. Or use Cuda’s FFT API.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Visual Studio Code – CUDA & OpenCL GPU Kernel function queuing are far too slow. Any faster method or technology? Future perhaps? Is this a known issue?