
I want to XOR large shifted arrays; the following is a portable version of that function, for ease of explanation. How can I improve this computation? I have tried using AVX2 but didn't see much improvement. Currently, for the DB shown in the example below, it takes 50 ms to process everything, which is 12 GB/s. I would appreciate any tips to improve the computation.

#include <iostream>
#include <cstdint>

uint64_t partition_size = 4096;
uint64_t entry_size = 256; // bits
uint64_t DB_size = 16777216;
uint64_t *DB = new uint64_t[DB_size * entry_size/64];


//partition_index will be a random multiple of partition_size, e.g. 0, 8192, 4096 etc
//random_offset will be a random number in [0, partition_size]
void xor_shifted_arrays(uint32_t partition_index, uint32_t random_offset, uint64_t *result)
{
    auto uint64_per_entry = entry_size / sizeof(uint64_t);

    int shift_offset;
    uint32_t shift;
    
    for (int i = 0; i < partition_size  ; i = i + 1)
    {
        shift = (i + random_offset) & (partition_size - 1);
        shift_offset = shift * uint64_per_entry;
        
        for (int j = 0; j < uint64_per_entry; j=j+1){
            result[shift_offset + j] = result[shift_offset + j] ^ DB[partition_index + j];  
        }
        partition_index = partition_index + uint64_per_entry;
    }
}

Update:
Here is a godbolt link: https://godbolt.org/z/j14av3fGq
I also ran this code on two instances:

Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz running MacOS 13.6, 16 GB DDR4 RAM, compiler Apple clang version 15.0.0 (clang-1500.0.40.1)

AWS r7i.2xlarge Intel(R) Xeon(R) Platinum 8488C with Ubuntu, 64 GB DDR5 RAM, compiler g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Surprisingly, it is 2x slower on the Xeon!!

I am using the compiler flag -O3.

Update 2: I feel this may be useful: the above code gets called from outer functions, something like this (not running code):

void outer_function(){
  uint64_t *result1 = new uint64_t[partition_size];
  uint64_t *result2 = new uint64_t[partition_size];
  uint64_t number_partitions = 4096;
  for (int i = 0; i < number_partitions; i++){
      xor_shifted_arrays(i*partition_size, some_rnd_gen(), result1);
  }
  for (int i = 0; i < number_partitions; i++){
      xor_shifted_arrays(i*partition_size, some_rnd_gen(), result2);
  }
}

2 Answers


  1. These are mostly generic tips which will help your compiler optimize the code (many of them were already mentioned in the comments):

    • For array indexes/sizes always use size_t (or ptrdiff_t if you want to allow negative indexes). This will save the compiler from dealing with potential wrap-around or overflow behavior.
    • Declare variables at the point you initialize them/in the scope you use them
    • Declare all variables which never change after initialization as const. If the compiler knows their value, it can optimize multiplications or division with them, especially if they are powers of 2.
    • Pass the pointer of arrays you modify as function parameter with a __restrict keyword (only do this, if you are sure that this array does not alias/overlap with other arrays). This often helps the compiler with unrolling and SIMD optimization.
    • Compile your code with -march=native (or -march=target_architecture) to optimize for your local or some specific CPU architecture (additionally to -O3, of course).
    • Just for readability: Instead of A = A + B, write A += B, (same with ^/^=, etc). This usually does not matter for the compiler.

    Assuming the j=j+4 in your godbolt-link was a typo and you meant j+=1 (or ++j), this results in very clean AVX2 code:

    // assuming these are known at compile-time:
    const size_t partition_size = 4096; 
    const size_t entry_size = 256; // bits
    const size_t DB_size = 16777216;
    
    uint64_t *DB = new uint64_t[DB_size * entry_size/64];
    const size_t uint64_per_entry = entry_size / sizeof(uint64_t);
    //uint64_t *result = new uint64_t[partition_size]; // pass this as parameter
    
    //partition_index will be a random multiple of partition_size, e.g. 0, 8192, 4096 etc
    //random_offset will be a random number in [0, partition_size]
    void xor_shifted_arrays(size_t partition_index, size_t random_offset, uint64_t *__restrict result)
    {
    
        for (size_t i = 0; i < partition_size  ; i++)
        {
            const size_t shift = (i + random_offset) & (partition_size - 1);
            const size_t shift_offset = shift * uint64_per_entry;
            
            for (size_t j = 0; j < uint64_per_entry; j++){
                result[shift_offset + j] ^= DB[partition_index + j];    
            }
            partition_index += uint64_per_entry;
        }
    }
    

    Godbolt-Link: https://godbolt.org/z/Khs5aTPaK
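    For reference, the vectorized inner loop the compiler produces is roughly equivalent to this hand-written AVX2 sketch (an illustration, not the compiler's exact output; `xor_block_avx2` is a hypothetical helper name, and the sketch assumes the element count is a multiple of 4, using unaligned loads/stores so it works with any pointer):

    ```cpp
    #include <immintrin.h>
    #include <cassert>
    #include <cstdint>
    #include <cstddef>

    // XOR `count` uint64_t values from `src` into `dst`, four lanes at a time.
    // The target attribute lets this compile without -mavx2; it must only be
    // called on a CPU that actually supports AVX2.
    __attribute__((target("avx2")))
    void xor_block_avx2(uint64_t *dst, const uint64_t *src, std::size_t count)
    {
        for (std::size_t j = 0; j < count; j += 4) {
            __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src + j));
            __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(dst + j));
            _mm256_storeu_si256(reinterpret_cast<__m256i *>(dst + j),
                                _mm256_xor_si256(a, b));
        }
    }
    ```

    With `-march=native` (or `-mavx2`) the compiler generates essentially this on its own, so intrinsics usually buy nothing here.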

    I don’t see much room for further improvement in this code. You need to read and write every result entry once, together with the corresponding DB entry.
    If you have control over memory management, you could make sure that pointers are aligned to the page size (or at least to your SIMD width).
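    For example, C++17's std::aligned_alloc can provide a page-aligned buffer (`alloc_aligned_u64` is just an illustrative wrapper; note that std::aligned_alloc requires the allocation size to be a multiple of the alignment, and the pointer must be released with std::free, not delete[]):

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <cstdlib>
    #include <cstddef>

    // Allocate `count` uint64_t values aligned to `alignment` bytes.
    uint64_t *alloc_aligned_u64(std::size_t count, std::size_t alignment)
    {
        std::size_t bytes = count * sizeof(uint64_t);
        // std::aligned_alloc requires size to be a multiple of alignment: round up.
        bytes = (bytes + alignment - 1) / alignment * alignment;
        return static_cast<uint64_t *>(std::aligned_alloc(alignment, bytes));
    }
    ```

    Page alignment (typically 4096 bytes) also guarantees SIMD-width alignment, so the compiler can use aligned loads and stores.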

    Also, if (for a different problem) some entries were read or modified multiple times, you could try to re-arrange your loops so that variables are read from or written to memory less often.
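    As a hypothetical illustration of that point: if every iteration updated the same result slot, keeping the running value in a local variable would replace a load and a store per iteration with a single store at the end (both function names here are made up for the example):

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <cstddef>

    // Naive: result[0] is read from and written to memory every iteration
    // (unless the compiler can prove result and db don't alias).
    void reduce_naive(uint64_t *result, const uint64_t *db, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i++)
            result[0] ^= db[i];
    }

    // Hoisted: the accumulator stays in a register; memory is touched once.
    void reduce_hoisted(uint64_t *result, const uint64_t *db, std::size_t n)
    {
        uint64_t acc = result[0];
        for (std::size_t i = 0; i < n; i++)
            acc ^= db[i];
        result[0] = acc;
    }
    ```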

  2. OpenCL can vectorize simple code easily and distribute the work across all cores. For example, the following code computes the XOR of 16 million 64-bit integers in 3 milliseconds on a Ryzen 7900 (12 cores), reaching ~44 GB/s of bandwidth on dual-channel 4800 MHz DDR5 memory:

    int main()
    {
        constexpr int N = 4096 * 4096;
        constexpr int numWorkers = 4096*32;
        GPGPU::Computer computer(GPGPU::Computer::DEVICE_CPUS);
        std::string nStr = std::string("constant int N = ") + std::to_string(N) + R"(;
        )";
        std::string nwStr = std::string("constant int stride = ") + std::to_string(numWorkers) + R"(;
        )";
        computer.compile(nStr+nwStr+R"(
        
            kernel void Xor(global unsigned long * dataIn, global unsigned long * dataOut)
            { 
                int id = get_global_id(0);
                unsigned long xorData = dataIn[id];
                for(int i=1;i<N/stride;i++)
                {
                    xorData = xorData ^ dataIn[id + i*stride];
                }
    
                // reduced result
                dataOut[id]=xorData;
            })", "Xor");
    
        GPGPU::HostParameter dataIn = computer.createArrayInput<uint64_t>("dataIn",N);
        GPGPU::HostParameter dataOut = computer.createArrayOutput<uint64_t>("dataOut", numWorkers);
        auto parameters = dataIn.next(dataOut);
    
        uint64_t result = 0;
        size_t t;
        for (int rep = 0; rep < 20; rep++)
        {
            for (int init = 0; init < N; init++)
                dataIn.access<uint64_t>(init) = init;
            {
                GPGPU::Bench bench(&t);
                computer.compute(parameters, "Xor", 0, numWorkers, 256);
    
                // final pass on host-side with just numWorkers elements
            uint64_t xorData = dataOut.access<uint64_t>(0);
                for (int i = 1; i < numWorkers; i++)
                {
                    xorData ^= dataOut.access<uint64_t>(i);
                }
                result = xorData;
            }
            std::cout << "computation took " << t / 1000000000.0 << "s" << std::endl;
            std::cout << "xor: " << result << std::endl;
            std::cout << "------------" << std::endl;
        }
        return 0;
    }
    

    output:

    computation took 0.0079244s
    xor: 0
    ------------
    computation took 0.0030533s
    xor: 0
    ------------
    computation took 0.0030468s
    xor: 0
    ------------
    computation took 0.0031714s
    xor: 0
    ------------
    computation took 0.0030102s
    xor: 0
    ------------
    computation took 0.0030884s
    xor: 0
    ------------
    computation took 0.0029352s
    xor: 0
    ------------
    computation took 0.0029854s
    xor: 0
    ------------
    computation took 0.0029936s
    xor: 0
    ------------
    computation took 0.0029326s
    xor: 0
    ------------
    computation took 0.0030838s
    xor: 0
    ------------
    computation took 0.0031311s
    xor: 0
    ------------
    computation took 0.0030022s
    xor: 0
    ------------
    computation took 0.0031073s
    xor: 0
    ------------
    computation took 0.0029577s
    xor: 0
    ------------
    computation took 0.0030004s
    xor: 0
    ------------
    computation took 0.0031038s
    xor: 0
    ------------
    computation took 0.0029255s
    xor: 0
    ------------
    computation took 0.002938s
    xor: 0
    ------------
    computation took 0.0029927s
    xor: 0
    ------------
    

    I had to install Intel’s Runtime to be able to use the CPU. I also tried the GPU, but PCIe v4.0 has some bottleneck (only around 32 GB/s, and the data sits outside of the CPU cache).
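    For comparison, the same two-pass strided reduction the kernel performs can be sketched in plain C++ with std::thread, without the GPGPU wrapper (`xor_reduce` and the worker count are illustrative, not part of the original code):

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Two-pass XOR reduction: each worker XORs a strided slice of the input
    // (like the OpenCL kernel), then the partials are combined on one thread.
    uint64_t xor_reduce(const std::vector<uint64_t> &data, std::size_t num_workers)
    {
        std::vector<uint64_t> partial(num_workers, 0);
        std::vector<std::thread> pool;
        for (std::size_t id = 0; id < num_workers; id++) {
            pool.emplace_back([&, id] {
                uint64_t acc = 0;
                for (std::size_t i = id; i < data.size(); i += num_workers)
                    acc ^= data[i];          // first pass: strided slice
                partial[id] = acc;
            });
        }
        for (auto &t : pool)
            t.join();

        uint64_t result = 0;
        for (uint64_t p : partial)
            result ^= p;                     // second pass: combine partials
        return result;
    }
    ```

    The strided access pattern keeps all workers streaming through the same region of memory, which is friendlier to the shared caches than giving each worker a distant contiguous chunk.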
