I stumbled upon a failing unit test that involves converting doubles to integers. The actual number being converted is 1.234 * 1000, and the code basically boils down to:
#include <iostream>
#include <cstdint>

int64_t deltatime(double numSeconds) {
    return (int64_t) (numSeconds * 1000.0);
}

int main() {
    double s = 1.234;
    int64_t ms = deltatime(s);
    std::cout << std::fixed << s << "s -> " << ms << "ms\n";
    return 0;
}
Now, compiling this for x86-64 with optimization disabled gives me:
$ g++ test.cpp && ./test
1.234000s -> 1234ms
Whereas compiling this for x86-32 (either using the -m32 flag, or natively in an x86-32 chroot/VM) gives me:
$ g++ -m32 test.cpp && ./test
1.234000s -> 1233ms
This is with g++ --version:
g++ (Debian 14.2.0-7) 14.2.0
I didn’t use any -O optimization options, and none of the expressions have more than one constant, so GCC won’t do compile-time evaluation (constant folding).
Now, I understand that the number 1.234 cannot be represented exactly in IEEE-754: in single-precision float it really is 1.2339999675750732421875, and similarly in double precision it is 1.2339999999999999857891452848....

Multiplying the actual value by 1000.0 (which can be represented exactly) should always give me 1233.9999..., and casting this to int64_t should actually yield 1233 (rather than the naively expected 1234).
But why-oh-why do I get 1234 on x86-64 (without specifying any rounding options for the compiler)?
Is the proper solution just adding 0.5 to the product (before casting to int64_t)?
int64_t deltatime(double numSeconds) {
    return (int64_t) (numSeconds * 1000.0 + 0.5);
}
2 Answers
It may be that on one compiler 80-bit extended precision is used. 1.234 * 1000 is a number very close to 1234. It can be equal, a tiny bit larger, or a tiny bit smaller, so the correct answer to your question is “your code is broken”.
GCC targeting i386 defaults to -mfpmath=387, with FLT_EVAL_METHOD == 2 (the excess precision for temporaries being the 80-bit x87 type, with 64-bit mantissas). GCC targeting x86-64 uses -mfpmath=sse with FLT_EVAL_METHOD == 0 (doubles are evaluated as double, rounding each temporary result to double precision).
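You can check which evaluation method a given compiler invocation uses; a minimal sketch, relying only on the standard FLT_EVAL_METHOD macro from <cfloat>:

#include <cfloat>
#include <iostream>

int main() {
    // 0: float/double arithmetic evaluated at the type's own precision (x86-64 SSE)
    // 2: all arithmetic evaluated at long double precision (i386 x87)
    std::cout << "FLT_EVAL_METHOD = " << FLT_EVAL_METHOD << '\n';
    return 0;
}

g++ -m32 should print 2 here, and plain x86-64 g++ should print 0.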
With -std=gnu++20 (the default), you get -fexcess-precision=fast. With -std=c++20, that would set -fexcess-precision=standard, which is somewhat like -ffloat-store for runtime variables.
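If you only want that store-and-reload rounding at one spot, a minimal sketch (the volatile qualifier and the function name are my additions):

#include <cstdint>

int64_t deltatime_stored(double numSeconds) {
    // Assigning to a volatile double forces the x87 temporary to be stored,
    // i.e. rounded to 64-bit double, similar to what -ffloat-store does.
    volatile double product = numSeconds * 1000.0;
    return (int64_t)product;
}

Since the product rounded to double is exactly 1234.0 (shown below), this should make -m32 agree with x86-64 and print 1234.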
Not rounding up requires excess temporary precision in this case
For single-precision float, https://www.h-schmidt.net/FloatConverter/IEEE754.html shows 1233.9999675750732421875 rounding up to 1234.0 if you enter that value. (That’s the exact product which the FP multiplier has to round to fit in the output, producing some representable float. I got that input by manually moving the decimal place in your float value.) Presumably the same thing happens with double.

(1000.0 isn’t a power of 2, so multiplying by it changes the mantissa and thus requires rounding.)
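You can confirm the double case directly; a minimal sketch printing the product at full precision (the output comments assume x86-64, i.e. FLT_EVAL_METHOD == 0):

#include <cstdio>

int main() {
    double s = 1.234;         // actually 1.2339999999999999857...
    double prod = s * 1000.0; // exact result 1233.9999999999999857... rounds up to 1234.0
    printf("%.17g\n", s);     // prints 1.2339999999999999
    printf("%.17g\n", prod);  // prints 1234
    printf("%d\n", prod == 1234.0); // prints 1
    return 0;
}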
But that doesn’t happen with double(1.234) promoted to 80-bit: there are trailing zeros at the bottom of the mantissa of that 80-bit input, so multiplying by an exact 1000.0 doesn’t round up all the way to 1234.0, and it still truncates to 1233. In GDB, I can see the product is st0 = 1233.99999999999998579 (raw 0x40099a3fffffffffff80).
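The same 80-bit behaviour can be reproduced even on x86-64 by using long double explicitly; a sketch that assumes long double is the 80-bit x87 type (as on x86/x86-64 Linux):

#include <cstdio>

int main() {
    long double s = 1.234;          // double(1.234) promoted: low mantissa bits are zero
    long double prod = s * 1000.0L; // fits exactly in the 64-bit x87 mantissa, no rounding
    printf("%.21Lg\n", prod);       // 1233.99999999999998578...: still below 1234
    printf("%lld\n", (long long)prod); // truncates to 1233
    return 0;
}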
If you’d started with auto s = (long double)1234.0 / 1000 and the function took a long double, that might also round up.
Since the cast to int64_t is in the same expression as the multiply, even strict ISO C++ rules allow FLT_EVAL_METHOD == 2 to keep excess precision there. (And even gcc -O0 still has the value in an x87 register, not storing/reloading like it would if you assigned it to a separate variable.)
As we can see from the asm output, g++ -m32 does fmulp, then (after changing the x87 rounding mode) fistp, which rounds to integer using the current rounding mode. It doesn’t store/reload as a qword (64-bit = double) between the multiply and the truncation.

GCC by default keeps excess precision even across statements (when optimization is enabled, of course), not just within statements.
What do you consider "proper"? What do you want your real code to do in your real use-case? If you want it to truncate toward 0, adding 0.5 doesn’t do that. That’s closer to round-to-nearest (but different for the largest representable double below 0.5, which will round up to 1.0 because the distance between consecutive doubles grows with larger exponents. Also different for negative values.)
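A sketch of both failure modes of the +0.5 trick (the first constant is the largest double below 0.5, written out in decimal; the comments assume double-precision evaluation, FLT_EVAL_METHOD == 0):

#include <cstdio>
#include <cmath>

int main() {
    double x = 0.49999999999999994;         // largest double below 0.5
    printf("%lld\n", (long long)(x + 0.5)); // 1: the addition rounds up to exactly 1.0
    printf("%g\n", std::nearbyint(x));      // 0: true round-to-nearest
    printf("%lld\n", (long long)(-2.4 + 0.5)); // -1: truncation toward zero after +0.5
    printf("%lld\n", std::llround(-2.4));      // -2: proper rounding of a negative value
    return 0;
}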
If you want rounding, there are functions like long lrint(double) or long long llrint(double), which can (hopefully) compile to an SSE2 cvtsd2si, although you might need -fno-math-errno for GCC to inline it. Basically nobody checks errno after calling math functions like sqrt or nearbyint.
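Applied to the question’s function, a minimal sketch using llrint (the name deltatime_nearest is mine):

#include <cmath>
#include <cstdint>

int64_t deltatime_nearest(double numSeconds) {
    // Rounds using the current FP rounding mode (round-to-nearest-even by
    // default) instead of the cast's truncation toward zero.
    return std::llrint(numSeconds * 1000.0);
}

With -fno-math-errno, GCC can inline this to a single conversion instruction.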
Whatever you choose, you need to accept the fact that tiny differences in rounding can push a value across a threshold to round to a different integer. Different choices change where the cutoffs are. If you don’t like binary floating-point math, don’t use it, e.g. use a decimal floating-point or fixed-point type or something.
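For a delta-time use-case specifically, one fixed-point-flavoured option (my suggestion, not from the answer above) is to keep durations as integer tick counts via std::chrono, so the float-to-integer rounding happens exactly once and explicitly:

#include <chrono>
#include <iostream>

int main() {
    using namespace std::chrono;
    duration<double> s(1.234);                   // seconds held as a double
    auto trunc = duration_cast<milliseconds>(s); // truncates: same pitfall as the raw cast
    auto nearest = std::chrono::round<milliseconds>(s); // C++17: round to nearest, ties to even
    std::cout << trunc.count() << " vs " << nearest.count() << '\n';
    return 0;
}

milliseconds stores an integer count, so once past this single conversion the 1233-vs-1234 ambiguity can’t resurface.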