uint8_t data[] = "mykeyxyz:1234\nky:123\n...";
My lines of text have the format key:value, where each line is guaranteed to have len(key) <= 16. I want to load mykeyxyz into a __m128i, but fill the higher positions with 0.
The easiest way is to keep an array of 255/0 mask bytes and load the mask from it, but that requires another memory load.
Is there any way to do this faster?
Edit: preferably without AVX512
Edit 2: I need the variable len so I can start parsing the value part.
Edit 3: the function will be used in a loop (for example, to parse 1 million lines of text), but strcmp_mask will basically always be in L1 cache.
Edit 4: I benchmarked the functions by parsing 1 billion lines of (key,value) and processing them. You can download the code/data and replicate the results from my repo: https://github.com/lehuyduc/1brc-simd . The discussion post also contains more info.
The accepted answer makes total program time ~2% faster. To compare, test 1brc_valid13.cpp against 1brc_valid14.cpp (which uses the accepted answer). Hardware: AMD 2950X, Ubuntu 18.04, g++ 11.4. Compile command: g++ -o main 1brc_final_valid.cpp -O3 -std=c++17 -march=native -m64 -lpthread
#include <iostream>
#include <immintrin.h>
#include <cstdint>
#include <string>
#include <cstring>
using namespace std;
alignas(4096) const uint8_t strcmp_mask[32] = {
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};
int main()
{
    // "\n" separates lines; only the first 16 bytes are examined here.
    uint8_t data[] = "mykeyxyz:1234\naaaaaaaaaaa";
    __m128i chars = _mm_loadu_si128((__m128i*)data);
    __m128i separators = _mm_set1_epi8(':');
    __m128i compared = _mm_cmpeq_epi8(chars, separators);
    uint32_t separator_mask = _mm_movemask_epi8(compared);
    // Assumes ':' occurs within the first 16 bytes; __builtin_ctz(0) is undefined.
    uint32_t len = __builtin_ctz(separator_mask);
    cout << "len = " << len << "\n";
    // Unaligned load from the mask table: the first len bytes are 255, the rest 0.
    __m128i mask = _mm_loadu_si128((__m128i*)(strcmp_mask + 16 - len));
    __m128i key_chars = _mm_and_si128(chars, mask);
    uint8_t res[16];
    memcpy(res, (char*)&key_chars, 16);
    for (int i = 0; i < 16; i++) cout << int(res[i]) << " ";
    cout << "\n";
}
// len = 8
// 109 121 107 101 121 120 121 122 0 0 0 0 0 0 0 0
3 Answers
I’m not aware of an efficient, load-free way of doing it without AVX-512.
See "Is there an inverse instruction to the movemask instruction in Intel AVX2?" for various approaches.
Using AVX-512, you can generate a mask using _mm_mask_broadcastb_epi8 for later use with _mm_and_si128. Even more simply, you can mask the input characters with _mm_maskz_mov_epi8 (1). Using this, you can mask your string like this:
This code prints mykeyxyz. See live code at Compiler Explorer.
(1) thanks to @PeterCordes for the suggestion
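As a rough sketch of the _mm_maskz_mov_epi8 idea described above (not the answer's original listing): the code below assumes GCC or Clang on x86-64, and the names key16_avx512, key16_scalar and key16 are hypothetical. The runtime check and scalar fallback are additions so the sketch also runs on CPUs without AVX-512BW/VL.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (AVX-512BW + AVX-512VL path). The target attribute lets
// GCC/Clang compile it without -mavx512bw -mavx512vl on the command line.
__attribute__((target("avx512bw,avx512vl")))
uint32_t key16_avx512(const uint8_t* p, uint8_t out[16]) {
    __m128i chars = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    // One mask bit per byte, set where the byte equals ':'.
    uint32_t sep = _mm_cmpeq_epi8_mask(chars, _mm_set1_epi8(':'));
    // Bit 16 is a sentinel so that len == 16 when no ':' is found (avoids ctz(0)).
    uint32_t len = __builtin_ctz(sep | 0x10000u);
    // Low `len` bits set: keep those bytes, zero the rest.
    __mmask16 keep = static_cast<__mmask16>((1u << len) - 1u);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
                     _mm_maskz_mov_epi8(keep, chars));
    return len;
}

// Portable fallback producing the same result on CPUs without AVX-512BW/VL.
uint32_t key16_scalar(const uint8_t* p, uint8_t out[16]) {
    uint32_t len = 0;
    while (len < 16 && p[len] != ':') ++len;
    std::memset(out, 0, 16);
    std::memcpy(out, p, len);
    return len;
}

// Runtime dispatch; p must point to at least 16 readable bytes.
uint32_t key16(const uint8_t* p, uint8_t out[16]) {
    assert(p != nullptr && out != nullptr);
    if (__builtin_cpu_supports("avx512bw") && __builtin_cpu_supports("avx512vl"))
        return key16_avx512(p, out);
    return key16_scalar(p, out);
}
```

Calling key16(data, out) on the question's input fills out with the eight key bytes followed by eight zeros and returns 8, so the value part starts at data + len + 1. In a hot loop you would normally compile with the right -m flags and call the AVX-512 path directly instead of dispatching per line.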
The following code (requiring only SSE 4.1) masks out every byte following the first occurrence of char c in string:
Godbolt demo: https://godbolt.org/z/Tnsj1sf46
N.B.: If you need the length of the key anyways, OP’s original code is probably fine (loading data from cache is usually cheap).
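For illustration, here is one possible way to zero everything from the first occurrence of c onward without a mask table. This is a sketch, not the answer's listing: it swaps in a prefix-OR built from SSE2 byte shifts, which may differ from the SSE 4.1 approach in the Godbolt demo, and the names mask_after_first and key_before are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Hypothetical helper: zero the first byte equal to `c` and every byte after it.
__m128i mask_after_first(__m128i string, char c) {
    // 0xFF in every byte position that equals c.
    __m128i m = _mm_cmpeq_epi8(string, _mm_set1_epi8(c));
    // Smear each 0xFF toward the higher byte positions (a prefix OR) using
    // whole-register byte shifts of 1, 2, 4 and 8.
    m = _mm_or_si128(m, _mm_slli_si128(m, 1));
    m = _mm_or_si128(m, _mm_slli_si128(m, 2));
    m = _mm_or_si128(m, _mm_slli_si128(m, 4));
    m = _mm_or_si128(m, _mm_slli_si128(m, 8));
    // Keep only the bytes where the smeared mask is still zero.
    return _mm_andnot_si128(m, string);
}

// Hypothetical convenience wrapper; p must point to at least 16 readable bytes.
void key_before(const uint8_t* p, char c, uint8_t out[16]) {
    assert(p != nullptr && out != nullptr);
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), mask_after_first(v, c));
}
```

This variant never computes the key length, which is why the N.B. above applies: if len is needed anyway for parsing the value, the extra shifts and ORs are unlikely to beat a cached mask load.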
I often find it interesting to see how others approach a problem, so here’s my version. It only requires SSE2, but benefits from BMI1 for the trailing zeros calculation.
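A version along those lines might look like the sketch below. It is not the answerer's listing: it assumes GCC or Clang on x86-64, and the name key16_sse2 is hypothetical. The mask is built by comparing a broadcast of len against the byte indices 0..15, so no mask table is read. The index vector is a loop-invariant constant that the compiler can usually keep in a register.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Hypothetical helper: writes the zero-padded key to out and returns its length.
// p must point to at least 16 readable bytes.
uint32_t key16_sse2(const uint8_t* p, uint8_t out[16]) {
    assert(p != nullptr && out != nullptr);
    __m128i chars = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    // One bit per byte, set where the byte equals ':'.
    uint32_t sep = static_cast<uint32_t>(
        _mm_movemask_epi8(_mm_cmpeq_epi8(chars, _mm_set1_epi8(':'))));
    // Bit 16 is a sentinel so that len == 16 when no ':' is found (avoids ctz(0)).
    // With -mbmi this can compile to tzcnt.
    uint32_t len = __builtin_ctz(sep | 0x10000u);
    // Byte indices 0..15.
    const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    // 0xFF where index < len, 0 elsewhere (signed compare is fine for 0..16).
    __m128i keep = _mm_cmpgt_epi8(_mm_set1_epi8(static_cast<char>(len)), idx);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_and_si128(chars, keep));
    return len;
}
```

Whether this beats the table load from the question depends on the surrounding loop: broadcasting len costs a few extra instructions in plain SSE2, so it is worth benchmarking both.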