Title says it all. I have a big array A of type unsigned int that I want to copy to an array B of type char. I am not sure whether it would help performance if B were of type unsigned char, because then I could simply copy every fourth byte and ignore type conversions (loss of data is not an issue here; every number in A fits in (unsigned) char). If so, unsigned char would be perfectly OK.
Doing a simple for-loop like below works of course, but is slow:
for (int n = 0; n < size_of_A; n++)
{
B[n] = A[n];
}
What would be the fastest way of doing this?
Edit:
I am using an Intel i5 8259u with integrated graphics. For software: Visual Studio 2019 with MSVC compiler, of course in release mode. As for compiler options – I never changed anything after standard install.
Edit 2:
Thank you for your answers. I have now found that the problem was somewhere else entirely: as Solei showed, timing the for-loop reveals that it takes very little CPU time. In my case, I wanted to copy directly from GPU memory (after running a compute shader). I am using a second OpenGL context, and that sometimes seems to cause problems. My first code looked like this:
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
void* mapped = glMapNamedBuffer(buffer, GL_READ_ONLY);
unsigned int* Temp = (unsigned int*)mapped;
for (int n = 0; n < size_of_SSBO; n++)
{
B[n] = Temp[n];
}
glUnmapNamedBuffer(buffer);
For some reason (probably copying from GPU in one go), this works much faster:
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
void* mapped = glMapNamedBuffer(buffer, GL_READ_ONLY);
std::vector<unsigned int> Temp(size_of_SSBO);
memcpy(&Temp[0], mapped, size_of_SSBO * sizeof(unsigned int)); // memcpy takes a byte count; assumes size_of_SSBO is an element count
for (int n = 0; n < size_of_SSBO; n++)
{
B[n] = Temp[n];
}
glUnmapNamedBuffer(buffer);
2 Answers
In general, there are several ways to make this faster: using SIMD (vector) registers can help, using the iGPU can help, and eventually offloading to a discrete GPU can help.
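To make the register-level option concrete, here is a minimal sketch of a hand-vectorized narrowing copy with SSE2 intrinsics. The function name narrow_u32_to_u8 is hypothetical, and the sketch assumes every source value is at most 255, as stated in the question. On targets without SSE2 it simply falls back to the scalar loop.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define NARROW_HAVE_SSE2 1
#endif

// Hypothetical helper: copies n values from src to dst as single bytes.
// Assumes every src value is <= 255 (larger values saturate in the SIMD
// path but truncate in the scalar tail, so the results would differ).
void narrow_u32_to_u8(const std::uint32_t* src, unsigned char* dst, std::size_t n)
{
    std::size_t i = 0;
#ifdef NARROW_HAVE_SSE2
    for (; i + 16 <= n; i += 16)
    {
        // Load 16 x 32-bit values with unaligned loads.
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 4));
        __m128i c = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 8));
        __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 12));
        // 32 -> 16 bit with signed saturation (exact for 0..255).
        __m128i ab = _mm_packs_epi32(a, b);
        __m128i cd = _mm_packs_epi32(c, d);
        // 16 -> 8 bit with unsigned saturation (exact for 0..255).
        __m128i out = _mm_packus_epi16(ab, cd);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), out);
    }
#endif
    // Scalar tail (or the whole array when SSE2 is unavailable).
    for (; i < n; ++i)
        dst[i] = static_cast<unsigned char>(src[i]);
}
```

Whether this beats the compiler's own output depends on the build settings and on memory bandwidth, so it is worth measuring before adopting it.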
But before trying anything fancy, I tried it on my CPU.
With a preliminary test (cold run on a 9900K, VS2022, release build with /O2):
The assembly of the loop does not use anything parallel:
Comment: there is no vectorization, nothing parallel.
The story is different in debug:
The test:
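A minimal sketch of what such a timing test might look like. This is not the original listing: the function name run_copy_test, the fill pattern, and the use of std::chrono::steady_clock are all assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical test helper: fills A with values below 256, times the plain
// copy loop, and returns one element of B so the result of the loop is used.
unsigned char run_copy_test(std::size_t n, double& elapsed_ms)
{
    std::vector<unsigned int> A(n);
    std::vector<unsigned char> B(n);
    for (std::size_t i = 0; i < n; ++i)
        A[i] = static_cast<unsigned int>(i % 256);

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        B[i] = static_cast<unsigned char>(A[i]);
    const auto stop = std::chrono::steady_clock::now();

    // Elapsed wall-clock time of the copy loop in milliseconds.
    elapsed_ms = std::chrono::duration<double, std::milli>(stop - start).count();
    return B[16];  // arbitrary index; requires n > 16
}
```

A single cold run like this is noisy, so repeating the measurement several times gives a more reliable number.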
Comment about return B[16];: 16 is arbitrary. If I do not use B, the loop is optimized away (i.e., it will not exist in the release assembly). This is not needed for a debug assembly.
Helpers, put before main:
You can preallocate the memory your arrays need, and you can use concurrency to speed up looping over a large array: each task reads the elements of its own section and copies them to the other array.
Be aware that, due to the simplicity of your task, I did not use any synchronization (mutexes and locks), since it is assumed that the array is divided so that no section overlaps with another.
There is no way to convert everything in an array other than using a loop; alternatively, you could cast each element to the desired type whenever you read it.
Further reading: https://en.cppreference.com/w/cpp/thread/async
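The chunked approach described above can be sketched with std::async roughly as follows. The function name parallel_narrow and the tasks parameter are hypothetical, and the sketch assumes that A and B each hold at least n elements and do not alias.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: splits [0, n) into up to `tasks` contiguous chunks
// and narrows each chunk on its own std::async task.
void parallel_narrow(const unsigned int* A, unsigned char* B,
                     std::size_t n, std::size_t tasks)
{
    if (tasks == 0) tasks = 1;
    const std::size_t chunk = (n + tasks - 1) / tasks;  // ceiling division
    std::vector<std::future<void>> pending;
    for (std::size_t begin = 0; begin < n; begin += chunk)
    {
        const std::size_t end = std::min(begin + chunk, n);
        pending.push_back(std::async(std::launch::async, [=] {
            // Each task writes only B[begin, end), so no locking is needed.
            for (std::size_t i = begin; i < end; ++i)
                B[i] = static_cast<unsigned char>(A[i]);
        }));
    }
    for (auto& f : pending)
        f.get();  // wait for completion and propagate any exception
}
```

For a copy this simple, memory bandwidth and thread start-up cost may well dominate, so the split is likely to pay off only for very large arrays, if at all.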