
Title says it all. I have a big array A of type unsigned int that I want to copy to an array B of type char. I am not sure whether it would help performance if B were of type unsigned char, because then I could simply copy every fourth byte and skip the type conversion (loss of data is not an issue here; every number in A fits in an (unsigned) char). If so, unsigned char would be perfectly OK.

A simple for loop like the one below works, of course, but is slow:

    for (int n = 0; n < size_of_A; n++)
    {
        B[n] = A[n];
    }

What would be the fastest way of doing this?

Edit:

I am using an Intel Core i5-8259U with integrated graphics. For software: Visual Studio 2019 with the MSVC compiler, in release mode of course. As for compiler options, I never changed anything after the standard install.

Edit 2:

Thank you for your answers. I have now found that the problem was somewhere else entirely: as Solei showed, the for loop itself does not take much CPU time. In my case, I wanted to copy directly from GPU memory (after running a compute shader). I am using a second OpenGL context, and that seems to cause problems sometimes. My first code was like this:

    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    void* mapped = glMapNamedBuffer(buffer, GL_READ_ONLY);
    unsigned int* Temp = (unsigned int*)mapped;
    for (int n = 0; n < size_of_SSBO; n++)
    {
        B[n] = Temp[n];
    }
    glUnmapNamedBuffer(buffer);

For some reason (probably because it copies from the GPU in one go), this works much faster:

    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    void* mapped = glMapNamedBuffer(buffer, GL_READ_ONLY);
    std::vector<unsigned int> Temp(size_of_SSBO);
    memcpy(&Temp[0], mapped, size_of_SSBO * sizeof(unsigned int)); // memcpy takes a byte count, not an element count
    for (int n = 0; n < size_of_SSBO; n++)
    {
        B[n] = Temp[n];
    }
    glUnmapNamedBuffer(buffer);
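
The two-step pattern above (one bulk read, then narrowing from ordinary memory) can be sketched without any OpenGL calls. This is only an illustration: `copy_then_narrow` is a made-up name, `mapped` stands in for the pointer returned by glMapNamedBuffer, and `count` is assumed to be the number of unsigned int elements. One plausible explanation for the speedup, not confirmed here, is that mapped memory can be slow to read element by element.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: `mapped` stands in for the mapped buffer pointer,
// `count` is the number of unsigned int elements it holds.
void copy_then_narrow(const void* mapped, std::size_t count, unsigned char* out)
{
    // Step 1: read the source exactly once, in bulk, into ordinary memory.
    std::vector<unsigned int> staging(count);
    std::memcpy(staging.data(), mapped, count * sizeof(unsigned int));

    // Step 2: narrow from the staging copy; the mapped memory is no longer touched.
    for (std::size_t i = 0; i < count; ++i)
        out[i] = static_cast<unsigned char>(staging[i]);
}
```

With this split, the buffer could in principle be unmapped right after step 1, before the narrowing loop runs.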

2 Answers


  1. In general, there are several ways to make it faster: using SIMD registers can help, using the iGPU can help, and eventually offloading to a discrete GPU can help.

    But before trying anything fancy, I tried it on my CPU.

    With a preliminary test (cold run on a 9900K, VS2022, Release (/O2)):

    malloc A: 0.5376ms
    malloc B: 0.01ms
    loop with cast: 55.4634ms
    

    The release assembly of the loop is purely scalar:

    loc_140001030:
    movzx   edx, byte ptr [rbx]
    lea     rbx, [rbx+8]
    mov     [rcx], dl
    lea     rcx, [rcx+2]
    movzx   edx, byte ptr [rbx-4]
    mov     [rcx-1], dl
    sub     r8, 1
    jnz     short loc_140001030
    

    Comment: the loop is unrolled by two, but there is no vectorization; every element is moved with a scalar byte load and a scalar byte store.
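
    Since the compiler left the loop scalar, one option is to vectorize it by hand. Below is a minimal sketch using SSE2 intrinsics, assuming an x86-64 target and a 32-bit unsigned int. The function name `narrow_u32_to_u8_sse2` is made up for illustration, and the sketch has not been timed against the loop above.

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <emmintrin.h> // SSE2 intrinsics

    // Hypothetical helper: narrows n 32-bit values to bytes, 16 per iteration.
    void narrow_u32_to_u8_sse2(const std::uint32_t* src, std::uint8_t* dst, std::size_t n)
    {
        const __m128i low_byte = _mm_set1_epi32(0xFF);
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16)
        {
            const __m128i* p = reinterpret_cast<const __m128i*>(src + i);
            // Load 4 x 4 values and keep only the low byte of each, as a cast would.
            __m128i a = _mm_and_si128(_mm_loadu_si128(p + 0), low_byte);
            __m128i b = _mm_and_si128(_mm_loadu_si128(p + 1), low_byte);
            __m128i c = _mm_and_si128(_mm_loadu_si128(p + 2), low_byte);
            __m128i d = _mm_and_si128(_mm_loadu_si128(p + 3), low_byte);
            // Pack 32 -> 16 -> 8 bits; after masking, saturation never triggers.
            __m128i ab = _mm_packs_epi32(a, b);
            __m128i cd = _mm_packs_epi32(c, d);
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_packus_epi16(ab, cd));
        }
        // Scalar tail for the last n % 16 elements.
        for (; i < n; ++i)
            dst[i] = static_cast<std::uint8_t>(src[i]);
    }
    ```

    In the test program further down, the call would look like `narrow_u32_to_u8_sse2(A, B, size)` in place of the loop, provided uint and std::uint32_t are the same type on the platform.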

    The story is different in debug:

    malloc A: 32.6284ms
    malloc B: 7.8048ms
    loop with cast: 130.097ms
    

    The test:

    int main()
    {
        size_t size{ 8192 * 8192 };
    
        Chrono a{};
        auto A = (uint*)malloc(size * sizeof(uint));
        a.StopAndDisplay("malloc A");
    
        a.Start();
        auto B = (uchar*)malloc(size * sizeof(uchar));
        a.StopAndDisplay("malloc B");
    
        a.Start();
        for (size_t i = 0; i < size; i++)
            B[i] = (uchar)A[i];
        a.StopAndDisplay("loop with cast");
        
        return B[16];
    }
    

    Comment about return B[16];: the index 16 is arbitrary. If B is not used at all, the loop is optimized away (i.e., it will not exist in the release assembly). This is not needed in a debug build.
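
    An alternative to returning a single element is to fold the whole output into one value and print or return that, so the result depends on every byte written. A small sketch, where `checksum` is a hypothetical helper and not part of the test above:

    ```cpp
    #include <cstddef>
    #include <cstdint>

    // Hypothetical helper: sums every byte so that the result depends on the
    // complete output of the conversion loop.
    std::uint64_t checksum(const unsigned char* data, std::size_t n)
    {
        std::uint64_t sum = 0;
        for (std::size_t i = 0; i < n; ++i)
            sum += data[i];
        return sum;
    }
    ```

    The checksum should be computed after the timed section ends, so that it does not add to the measured time.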

    Helpers, put before main:

    #include <chrono>
    #include <cstdlib>   // malloc
    #include <iostream>
    #include <string>    // std::string in the Chrono display functions
    
    using uint = unsigned int;
    using uchar = unsigned char;
    
    class Chrono
    {
        std::chrono::time_point<std::chrono::steady_clock> t0{};
        std::chrono::time_point<std::chrono::steady_clock> t1{};
    
    public:
        Chrono()
        {
            Start();
        }
    
        void Start() { t0 = std::chrono::steady_clock::now(); }
    
        std::chrono::duration<double> Stop()
        {
            t1 = std::chrono::steady_clock::now();
            std::chrono::duration<double> elapsed_seconds = t1 - t0;
            return elapsed_seconds;
        }
    
        std::chrono::duration<double> Restart()
        {
            auto elapsed = Stop();
            t0 = std::chrono::steady_clock::now();
            return elapsed;
        }
    
        void StopAndDisplay(const std::string& message)
        {
            auto elapsed = Stop();
            std::cout << message << ": " << elapsed.count() * 1e3 << "ms" << std::endl;
        }
    
        void RestartAndDisplay(const std::string& message)
        {
            auto elapsed = Restart();
            std::cout << message << ": " << elapsed.count() * 1e3 << "ms" << std::endl;
        }
    };
    
  2. You can preallocate the memory your arrays need, and you can use concurrency to speed things up: several tasks each loop over their own part of the large array and copy its elements to the other array.

    #include <vector>
    #include <future>
    
    void copyToNewArray(std::vector<unsigned int>& read, std::vector<unsigned char>& write, 
    unsigned int fromIndex, unsigned int toIndex)
    {
        for(unsigned int i{fromIndex}; i <= toIndex; ++i)
        {
            write[i] = static_cast<unsigned char>(read[i]);
        }
    }
    int main()
    {
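        // Assumed size for illustration; it must be divisible by 4 for the split below.
        const unsigned int arraySize{1000};
        // Size both vectors up front. reserve() alone would leave them empty,
        // so indexing into them would be undefined behavior.
        std::vector<unsigned int> readVec(arraySize);
        std::vector<unsigned char> writeVec(arraySize);
        // Fill your read (first) vector here.
        // Assuming the vector size is divisible by 4, launch 4 async tasks.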
        
        auto result1 = std::async(std::launch::async, copyToNewArray, std::ref(readVec), std::ref(writeVec), 0, arraySize/4 - 1);
        auto result2 = std::async(std::launch::async, copyToNewArray, std::ref(readVec), std::ref(writeVec), arraySize/4, arraySize/2 - 1);
        auto result3 = std::async(std::launch::async, copyToNewArray, std::ref(readVec), std::ref(writeVec), arraySize/2, 3*arraySize/4 - 1);
        auto result4 = std::async(std::launch::async, copyToNewArray, std::ref(readVec), std::ref(writeVec), 3*arraySize/4, arraySize - 1);
        result1.wait();
        result2.wait();
        result3.wait();
        result4.wait();
        //your answer is ready.
    }
    

    Be aware that, due to the simplicity of your task, I did not use any synchronization (mutexes and locks), since the array is divided such that no section overlaps another.
    There is no way of converting everything in an array other than a loop; alternatively, you could cast each element to your desired type whenever you read it.
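
    The same splitting idea can be sketched for sizes that are not divisible by the number of tasks. This is a sketch only: `parallel_narrow` and its `tasks` parameter are made-up names, and each task works on a half-open range [begin, end) so that no two tasks touch the same element.

    ```cpp
    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Hypothetical helper: splits [0, read.size()) into up to `tasks` chunks
    // and narrows each chunk in its own async task.
    void parallel_narrow(const std::vector<unsigned int>& read,
                         std::vector<unsigned char>& write,
                         std::size_t tasks)
    {
        write.resize(read.size());
        tasks = std::max<std::size_t>(tasks, 1);
        // Ceiling division, so the last chunk absorbs any remainder.
        const std::size_t chunk = (read.size() + tasks - 1) / tasks;

        std::vector<std::future<void>> pending;
        for (std::size_t begin = 0; begin < read.size(); begin += chunk)
        {
            const std::size_t end = std::min(begin + chunk, read.size());
            pending.push_back(std::async(std::launch::async, [&read, &write, begin, end] {
                for (std::size_t i = begin; i < end; ++i)
                    write[i] = static_cast<unsigned char>(read[i]);
            }));
        }
        // Wait for every chunk; get() also rethrows any exception from a task.
        for (auto& f : pending)
            f.get();
    }
    ```

    A call such as `parallel_narrow(readVec, writeVec, 4)` would mirror the four tasks above. Whether the threads pay off depends on the array size, since launching them has a cost of its own.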

    Further reading: https://en.cppreference.com/w/cpp/thread/async
