
I want to have X std::vectors of equal size that can be processed together in a single for loop running linearly from start to finish. For example:

for (int i = 0; i < vector_length; i++)
    vector1[i] = vector2[i] + vector3[i] * vector4[i];

I want all of this to take full advantage of SIMD instructions. For that to happen, the compiler must be able to assume that each vector is aligned optimally for __m256 use. If it can't assume this, it may generate all sorts of non-optimal loops.

How do I ensure this optimal alignment of std::vectors and optimal code generation for such aligned data?

It can be assumed that all the vectors hold the same element type, which can be added and multiplied using standard SIMD instructions.

I’m using C++17.

MORE INFORMATION AS REQUESTED BY THE PEOPLE HERE:

32 bytes of alignment is good for my use.

I want to get this running on Intel Macs and PCs (Xcode + Visual Studio), and later on ARM Macs (Xcode again) once I get one of those computers.

3 Answers


  1. Chosen as BEST ANSWER

    As a couple of people pointed out, there's a related question that can be used to first ensure the memory owned by the std::vector is properly aligned:

    Modern approach to making std::vector allocate aligned memory

    That, combined with __attribute__((aligned(ALIGNMENT_IN_BYTES))) added to the method's pointer parameters, seems to do the trick. Example:

    void Process(__attribute__((aligned(ALIGNMENT_IN_BYTES))) const uint8_t* p_source1,
                 __attribute__((aligned(ALIGNMENT_IN_BYTES))) const uint8_t* p_source2,
                 __attribute__((aligned(ALIGNMENT_IN_BYTES))) uint8_t*       p_destination,
                 const int      count)
    {
        for (int i = 0; i < count; i++)
            p_destination[i] = p_source1[i] + p_source2[i];
    }
    

    That seems to compile nicely (checked in Godbolt) so the compiler clearly assumes it can simply use large registers to process the data with SIMD instructions.
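A possible alternative, offered as a sketch rather than as the answer's own method: as far as I know, `aligned` on a pointer parameter describes the pointer variable itself rather than the memory it points to, so GCC and Clang also provide `__builtin_assume_aligned` for stating the alignment of the pointee. The helper name `AssumeAligned` and the constant `kAlignment` below are made up for illustration, and the fallback branch for other compilers (such as Visual Studio) simply gives no hint. C++20 adds `std::assume_aligned` for the same purpose, but that is outside C++17.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 32 bytes, as stated in the question.
constexpr std::size_t kAlignment = 32;

// Hypothetical helper: promises the optimizer that p is kAlignment-aligned.
// Passing a pointer that is not actually aligned is undefined behaviour.
template <typename T>
inline T* AssumeAligned(T* p)
{
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<T*>(__builtin_assume_aligned(p, kAlignment));
#else
    return p; // no alignment hint on other compilers in this sketch
#endif
}

void Process(const std::uint8_t* p_source1,
             const std::uint8_t* p_source2,
             std::uint8_t*       p_destination,
             const int           count)
{
    // Use the returned pointers: the hint is attached to them, not to the originals.
    const std::uint8_t* s1 = AssumeAligned(p_source1);
    const std::uint8_t* s2 = AssumeAligned(p_source2);
    std::uint8_t*       d  = AssumeAligned(p_destination);

    for (int i = 0; i < count; i++)
        d[i] = static_cast<std::uint8_t>(s1[i] + s2[i]);
}
```

Whether this changes the generated code is best checked in Godbolt for the compilers and flags you actually use.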

    Thank you everyone!


  2. The only way to control the allocation of std::vector is by replacing the allocator. Boost has an implementation that ensures alignment: https://www.boost.org/doc/libs/1_84_0/doc/html/align/reference.html#align.reference.classes
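If pulling in Boost is not an option, a replacement allocator can be sketched in C++17 with the aligned forms of `operator new` and `operator delete`. This is a minimal illustration, not Boost's implementation: the names `AlignedAllocator` and `AlignedVector` are made up here, `Alignment` is assumed to be a power of two, and overflow checking of `n * sizeof(T)` is left out.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical minimal allocator that returns Alignment-aligned storage.
template <typename T, std::size_t Alignment>
struct AlignedAllocator
{
    using value_type = T;

    // Explicit rebind is needed because Alignment is a non-type parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;

    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n)
    {
        // C++17 aligned allocation function.
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(Alignment)));
    }

    void deallocate(T* p, std::size_t) noexcept
    {
        // Must be paired with the aligned form of operator delete.
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }

template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Convenience alias using the 32-byte alignment from the question.
template <typename T>
using AlignedVector = std::vector<T, AlignedAllocator<T, 32>>;
```

With this in place, `AlignedVector<float> v(1000);` behaves like an ordinary vector whose `data()` pointer is 32-byte aligned, including after reallocation.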

  3. Is the size of the data known beforehand, or are you using fixed buffers? If so, you could just use a normal array with alignas.
    As for using SIMD instructions, you could use valarray. Both it and vector allocate through std::allocator by default, which since C++17 respects the element type's alignment, even when the type is over-aligned.

    So std::vector<__m256i> mySIMDVector; is aligned.
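Both suggestions can be checked with a small portable sketch. To avoid depending on x86 intrinsics, it uses a made-up struct `Float8` with `alignas(32)` as a stand-in for `__m256`; the function names are also hypothetical. It assumes the code is compiled as C++17 or later, since earlier standards did not require `std::allocator` to honour over-aligned types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for __m256: eight floats, 32-byte aligned.
struct alignas(32) Float8
{
    float lane[8];
};

// True if p is aligned to a 32-byte boundary.
inline bool IsAligned32(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Fixed-size case: a plain array with alignas.
inline bool FixedArrayIsAligned()
{
    alignas(32) static float buffer[1024];
    return IsAligned32(buffer);
}

// Dynamic case: std::vector of an over-aligned element type (C++17 or later).
inline bool VectorOfOverAlignedTypeIsAligned(std::size_t n)
{
    std::vector<Float8> v(n);
    return IsAligned32(v.data());
}
```

Note that a vector of a plain type such as `float` gets no such guarantee; the alignment comes from the element type.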
