I wanted to try making my own absolute value function. I figured that the fastest way to calculate absolute value would be to simply mask out the sign bit (the most significant bit in IEEE 754). I wanted to compare its speed to the standard abs function. Here is my implementation:
// Union used for type punning
union float_uint_u
{
    float f_val;
    unsigned int ui_val;
};

// 'MASK' has all bits == 1 except the sign bit (the most significant one)
constexpr unsigned int MASK = ~(1U << (sizeof(unsigned int) * 8 - 1));

float abs_bitwise(float value)
{
    float_uint_u ret;
    ret.f_val = value;
    ret.ui_val &= MASK;
    return ret.f_val;
}
For the record, I know that this sort of type punning is not standard C++. However, this is just for educational purposes, and according to the docs, this is supported in GCC.
I figured this should be the fastest way to calculate absolute value, so it should at the very least be as fast as the standard implementation. However, timing 100000000 iterations of random values, I got the following results:
Bitwise time: 5.47385 | STL time: 5.15662
Ratio: 1.06152
My abs function is about 6% slower.
Assembly output
I compiled with -O2
optimization and the -S
option (assembly output) to help determine what was going on. I have extracted the relevant portions:
; 16(%rsp) is a value obtained from standard input

; standard fabs:
movss    16(%rsp), %xmm0
andps    .LC5(%rip), %xmm0   ; .LC5 == 2147483647
movq     %rbp, %rdi
cvtss2sd %xmm0, %xmm0

; my abs_bitwise:
movl     16(%rsp), %eax
movq     %rbp, %rdi
andl     $2147483647, %eax
movd     %eax, %xmm0
cvtss2sd %xmm0, %xmm0
Observations
I’m not great at assembly, but the main thing I noticed is that the standard function operates directly on the xmm0 register, whereas mine first moves the value to eax (for some reason), performs the and, and then moves it into xmm0. I’m assuming the extra mov is where the slowdown happens. I also noticed that the standard version stores the bit mask elsewhere in the program rather than using an immediate; I’m guessing that’s not significant, however. The two versions also use different instructions (e.g. movl vs movss).
System info
This was compiled with g++ on Debian Linux (unstable branch). g++ --version output:
g++ (Debian 10.2.1-6) 10.2.1 20210110
If these two versions of the code both calculate absolute value the same way (via an and), why doesn’t the optimizer generate the same code? Specifically, why does it feel the need to include an extra mov when it optimizes my implementation?
2 Answers
I got somewhat different assembly. According to the x86_64 Linux ABI, a float argument is passed via xmm0. With the standard fabs, the bitwise AND operation is performed directly on this register. In your case, however, the bitwise AND is performed on objects of type unsigned int, so GCC does the same, which requires moving xmm0 to eax first. Live demo: https://godbolt.org/z/xj8MMo
I haven’t found any way to force the GCC optimizer to perform the AND directly on xmm0 with pure C/C++ source code alone. It seems that efficient implementations need to be built upon assembler code or Intel intrinsics. Relevant question: How to perform a bitwise operation on floating point numbers. All the proposed solutions basically result in the same outcome.
I also tried to use the copysign function, but the result was even worse: the generated machine code contained x87 instructions. Anyway, it is quite interesting that the Clang optimizer was clever enough to make the assembly in all three cases equivalent: https://godbolt.org/z/b6Khv5.
Because with most optimizing compilers (in particular GCC or Clang), it would use a specialized machine instruction known by the compiler. GCC even has a builtin for abs.
Be sure to compile with gcc -O3 and perhaps -ffast-math.

You could study the assembler code: compile your example.c as gcc -Wall -O3 -ffast-math -fverbose-asm example.c and look inside the emitted example.s assembler file.

On Linux systems (e.g. Debian), you could study the source code of GNU libc and look inside the math.h standard header (and use g++ -O3 -C -E to get the preprocessed form).