I want to XOR large shifted arrays; the following is a portable version of that function, for ease of explanation. How can I improve this computation? I have tried using AVX2 but didn't see much improvement. Currently, for the DB shown in the example below, it takes 50 ms to process everything, which is about 12 GB/s. I would appreciate any tips to improve the computation.
#include <iostream>
#include <cstdint>

uint64_t partition_size = 4096;
uint64_t entry_size = 256; // bits
uint64_t DB_size = 16777216;
uint64_t *DB = new uint64_t[DB_size * entry_size / 64];

// partition_index will be a random multiple of partition_size, e.g. 0, 4096, 8192, etc.
// random_offset will be a random number in [0, partition_size)
void xor_shifted_arrays(uint32_t partition_index, uint32_t random_offset, uint64_t *result)
{
    auto uint64_per_entry = entry_size / 64; // 64-bit words per entry (entry_size is in bits)
    int shift_offset;
    uint32_t shift;
    for (int i = 0; i < partition_size; i = i + 1)
    {
        // rotate the destination index within the partition
        shift = (i + random_offset) & (partition_size - 1);
        shift_offset = shift * uint64_per_entry;
        // XOR one whole entry from DB into the rotated position in result
        for (int j = 0; j < uint64_per_entry; j = j + 1) {
            result[shift_offset + j] = result[shift_offset + j] ^ DB[partition_index + j];
        }
        partition_index = partition_index + uint64_per_entry;
    }
}
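For illustration, a hand-written AVX2 version of the inner loop might look roughly like the sketch below (assuming 256-bit entries, i.e. 4 uint64_t per entry, and compilation with -mavx2; this is an illustration, not the exact intrinsics code I benchmarked):

#include <immintrin.h>

// Same logic as above, but XORs one whole 256-bit entry per _mm256_xor_si256.
void xor_shifted_arrays_avx2(uint32_t partition_index, uint32_t random_offset, uint64_t *result)
{
    const uint64_t uint64_per_entry = entry_size / 64; // 4 x uint64_t per entry
    for (uint64_t i = 0; i < partition_size; ++i)
    {
        uint64_t shift = (i + random_offset) & (partition_size - 1);
        uint64_t shift_offset = shift * uint64_per_entry;
        __m256i r = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(result + shift_offset));
        __m256i d = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(DB + partition_index));
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(result + shift_offset),
                            _mm256_xor_si256(r, d));
        partition_index += uint64_per_entry;
    }
}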
Update:
Here is a Godbolt link: https://godbolt.org/z/j14av3fGq
I also ran this code on two machines:
Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz running macOS 13.6, 16 GB DDR4 RAM, compiler Apple clang version 15.0.0 (clang-1500.0.40.1)
AWS r7i.2xlarge Intel(R) Xeon(R) Platinum 8488C with Ubuntu, 64 GB DDR5 RAM, compiler g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Surprisingly, it is 2x slower on the Xeon!
I am using the compiler flag -O3.
Update 2: I feel this may be useful: the above code gets called from outer functions, something like this (not running code):
void outer_function() {
    auto uint64_per_entry = entry_size / 64;
    // each result holds partition_size entries of uint64_per_entry words, zero-initialized
    uint64_t *result1 = new uint64_t[partition_size * uint64_per_entry]();
    uint64_t *result2 = new uint64_t[partition_size * uint64_per_entry]();
    uint64_t number_partitions = 4096;
    for (int i = 0; i < number_partitions; i++) {
        xor_shifted_arrays(i * partition_size, some_rnd_gen(), result1);
    }
    for (int i = 0; i < number_partitions; i++) {
        xor_shifted_arrays(i * partition_size, some_rnd_gen(), result2);
    }
}
2 Answers
These are mostly generic tips, which will help your compiler to optimize the code (many of them were already in the comments); a short sketch with them applied follows after the Godbolt link below:

- Use size_t for sizes and loop indexes (or ptrdiff_t, if you want to allow negative indexes). This will save the compiler from dealing with potential wrap-around or overflow behavior.
- Declare the constants as const. If the compiler knows their value, it can optimize multiplications or divisions with them, especially if they are powers of 2.
- Mark the pointers with the __restrict keyword (only do this if you are sure that the array does not alias/overlap with other arrays). This often helps the compiler with unrolling and SIMD optimization.
- Compile with -march=native (or -march=target_architecture) to optimize for your local or some specific CPU architecture (in addition to -O3, of course).
- Instead of A = A + B, write A += B (same with ^/^=, etc.). This usually does not matter for the compiler.

Assuming the j=j+4 in your Godbolt link was a typo and you meant j+=1 (or ++j), this results in very clean AVX2 code. Godbolt link: https://godbolt.org/z/Khs5aTPaK
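A sketch of the function with those suggestions applied (constant globals, size_t indexes, __restrict-qualified pointers, ^=); this illustrates the tips above and is not the exact code behind the Godbolt link:

#include <cstdint>
#include <cstddef>

constexpr std::size_t partition_size = 4096;
constexpr std::size_t entry_size = 256;                   // bits
constexpr std::size_t uint64_per_entry = entry_size / 64; // 4 words per entry

// __restrict is a GCC/Clang/MSVC extension promising the arrays do not overlap.
void xor_shifted_arrays(std::size_t partition_index, std::size_t random_offset,
                        uint64_t *__restrict result, const uint64_t *__restrict DB)
{
    for (std::size_t i = 0; i < partition_size; ++i)
    {
        std::size_t shift = (i + random_offset) & (partition_size - 1);
        std::size_t shift_offset = shift * uint64_per_entry;
        for (std::size_t j = 0; j < uint64_per_entry; ++j)
            result[shift_offset + j] ^= DB[partition_index + j];
        partition_index += uint64_per_entry;
    }
}
// Compile with: -O3 -march=native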
I don't see many further possible improvements for this code. You need to read and write every result entry once, together with the corresponding DB entry.

If you have control over memory management, you could make sure that the pointers are aligned to page sizes (or at least to your SIMD width).
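A minimal sketch of one way to get such alignment, using std::aligned_alloc (the 64-byte alignment, buffer name, and sizes here are illustrative choices, not part of the original code):

#include <cstdint>
#include <cstdlib>
#include <cstring>

int main()
{
    constexpr std::size_t partition_size = 4096;
    constexpr std::size_t uint64_per_entry = 256 / 64; // 4 x 64-bit words per 256-bit entry
    constexpr std::size_t bytes = partition_size * uint64_per_entry * sizeof(uint64_t);

    // std::aligned_alloc requires the size to be a multiple of the alignment;
    // 64 bytes covers a cache line and a full AVX-512 vector.
    uint64_t *result = static_cast<uint64_t *>(std::aligned_alloc(64, bytes));
    std::memset(result, 0, bytes);

    // ... run xor_shifted_arrays(...) on the aligned buffer ...

    std::free(result);
    return 0;
}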
Also, if (for a different problem) some entries were read or modified multiple times, you could try to re-arrange your loops so that variables are read from or written to memory less often.
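A hypothetical illustration of that kind of rearrangement (a different access pattern than the question's, where every entry is touched exactly once; the function name and parameters are made up for the example): if one DB entry were XORed into several result rows, loading it once per outer iteration avoids re-reading it from memory:

#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical: each 4-word DB entry is XORed into K different result rows.
// Loading the entry once and reusing it keeps it in registers instead of
// re-reading it from memory K times.
void xor_entry_into_rows(const uint64_t *DB, uint64_t *result,
                         const std::vector<std::size_t> &row_index, std::size_t num_entries)
{
    const std::size_t K = row_index.size();
    for (std::size_t e = 0; e < num_entries; ++e)
    {
        const uint64_t d0 = DB[e * 4 + 0], d1 = DB[e * 4 + 1],
                       d2 = DB[e * 4 + 2], d3 = DB[e * 4 + 3]; // read once
        for (std::size_t k = 0; k < K; ++k)
        {
            uint64_t *row = result + row_index[k] * 4;
            row[0] ^= d0; row[1] ^= d1; row[2] ^= d2; row[3] ^= d3;
        }
    }
}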
OpenCL can vectorize simple code easily and distribute the work across all cores. For example, code that XORs 16 million 64-bit integers runs in 3 milliseconds on a Ryzen 7900 (12 cores), which is ~44 GB/s of bandwidth on dual-channel 4800 MHz DDR5 memory.
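A rough sketch of that kind of setup with the plain OpenCL C API is shown below; the device selection, kernel source, test data, and buffer names are illustrative assumptions, not the exact code behind the 3 ms measurement (error checking omitted, link with -lOpenCL):

#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// OpenCL C kernel: one work-item XORs one 64-bit element.
static const char *kernel_src = R"CLC(
__kernel void xor_arrays(__global ulong *result, __global const ulong *db)
{
    size_t i = get_global_id(0);
    result[i] ^= db[i];
}
)CLC";

int main()
{
    const size_t n = 16777216; // 16 million 64-bit integers
    std::vector<uint64_t> db(n, 0x1234567890abcdefULL), result(n, 0);

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, nullptr);

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, device, nullptr, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, nullptr, &err);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "xor_arrays", &err);

    cl_mem d_result = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                     n * sizeof(uint64_t), result.data(), &err);
    cl_mem d_db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(uint64_t), db.data(), &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_result);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_db);

    size_t global = n; // one work-item per element
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, d_result, CL_TRUE, 0, n * sizeof(uint64_t), result.data(),
                        0, nullptr, nullptr);

    std::printf("result[0] = %llx\n", (unsigned long long)result[0]);

    clReleaseMemObject(d_db);
    clReleaseMemObject(d_result);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}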
I had to install Intel's OpenCL runtime to be able to use the CPU. I also tried the GPU, but PCIe 4.0 is a bottleneck there (only around 32 GB/s, and outside of the CPU cache).