
I stumbled upon a failing unit test that involves converting doubles to integers.

The actual number being converted is 1.234 * 1000.0, and the code basically boils down to:

#include <iostream>
#include <cstdint>

int64_t deltatime(double numSeconds) {
        return (int64_t) (numSeconds * 1000.0);
}

int main() {
        double s = 1.234;
        int64_t ms = deltatime(s);
        std::cout << s << "sec -> " << ms << "ms\n";
        return 0;
}

Now, compiling this for x86-64 with optimization disabled gives me:

$ g++ test.cpp -o test && ./test
1.234sec -> 1234ms

Whereas compiling this for x86-32 (either using the -m32 flag or natively in an x86-32 chroot/VM) gives me:

$ g++ -m32 test.cpp -o test && ./test
1.234sec -> 1233ms

This is with g++ (Debian 14.2.0-7) 14.2.0, as reported by g++ --version.

I didn’t use any -O optimization options, and none of the expressions have more than one constant, so GCC won’t do compile-time evaluation (constant folding).


Now, I understand that the number 1.234 cannot be represented exactly in IEEE-754,
e.g. in single-precision float it really is 1.2339999675750732421875, and similarly in double-precision it is 1.2339999999999999857891452848.
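
For reference, the stored values can be inspected by printing extra digits:

#include <cstdio>

int main() {
        // Varargs promote float to double, so both values print exactly.
        std::printf("float : %.25f\n", 1.234f); // 1.2339999675750732421875000
        std::printf("double: %.25f\n", 1.234);  // 1.2339999999999999857891453
        return 0;
}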

Now, multiplying the actual values by 1000.0 (which can be represented exactly) should always give me 1233.9999..., and casting this to int64_t would actually yield 1233 (rather than the naively expected 1234).

But why, oh why, do I get 1234 on x86-64 (without specifying any rounding options for the compiler)?

Is the proper solution around this just adding 0.5 to the product (before casting to int64_t)?

int64_t deltatime(double numSeconds) {
        return (int64_t) (numSeconds * 1000.0 + 0.5);
}

2 Answers


  1. It may be that on one compiler the 80-bit extended format is used. 1.234 * 1000 is a number very close to 1234. It can be equal, a tiny bit larger, or a tiny bit smaller, so the correct answer to your question is “your code is broken”.

  2. GCC targeting i386 defaults to -mfpmath=387, with FLT_EVAL_METHOD == 2 (with the excess precision for temporaries being the 80-bit x87 type, 64-bit mantissas).

    GCC targeting x86-64 uses -mfpmath=sse with FLT_EVAL_METHOD == 0 (doubles are evaluated as double, rounding each temporary result to double-precision).
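
    Which mode a given build uses can be checked from the FLT_EVAL_METHOD macro in <cfloat>; a minimal sketch:

    #include <cfloat>
    #include <cstdio>

    int main() {
        // 0: float/double operations are evaluated in their own type (SSE).
        // 2: float/double temporaries are evaluated as long double (x87).
        std::printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
        return 0;
    }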


    Not rounding up requires excess temporary precision in this case

    For single-precision float, https://www.h-schmidt.net/FloatConverter/IEEE754.html shows 1233.9999675750732421875 rounding up to 1234.0 if you enter that value. (That’s the exact product which the FP multiplier has to round to fit in the output, producing some representable float. I got that input by manually moving the decimal place in your float value). Presumably the same thing happens with double.

    (1000.0 isn’t a power of 2 so multiplying by it changes the mantissa and thus requires rounding.)
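
    The same single rounding does happen at double precision: the exact product 1233.9999999999999857891452848 is within half an ulp of 1234.0, so rounding once to double lands on exactly 1234.0. A minimal check:

    #include <cstdio>

    int main() {
        double s = 1.234;
        double p = s * 1000.0;     // stored as a double: exactly 1234.0
        std::printf("%.17g\n", p); // prints 1234
        return 0;
    }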

    But that doesn’t happen with double(1.234) promoted to 80-bit: there are trailing zeros at the bottom of the mantissa of that 80-bit input, so multiplying by an exact 1000.0 doesn’t round up all the way to 1234.0, and the cast still truncates to 1233. In GDB, I can see the product is st0 = 1233.99999999999998579 (raw 0x40099a3fffffffffff80).
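
    You can reproduce that behaviour explicitly by widening to long double yourself (assuming an x86 target, where long double is the 80-bit type):

    #include <cstdio>

    int main() {
        // The 80-bit product stays just below 1234 (about 1233.99999999999998579),
        // so truncating it gives 1233.
        long double p = (long double)1.234 * 1000.0L;
        std::printf("%.20Lf -> %lld\n", p, (long long)p);
        return 0;
    }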

    If you’d started with auto s = (long double)1234.0 / 1000 and the function took a long double, that might also round up.

    Since the cast to int64_t is in the same expression as the multiply, even strict ISO C++ rules allow FLT_EVAL_METHOD == 2 to keep excess precision there. (And even gcc -O0 still has the value in an x87 register, not storing/reloading like it would if you assigned it to a separate variable.)

    As we can see from the asm output, g++ -m32 does fmulp then (after changing the x87 rounding mode) fistp which rounds to integer using the current rounding mode. It doesn’t store/reload as a qword (64-bit = double) between the multiply and the truncation.

    GCC by default keeps excess precision even across statements (when optimization is enabled of course), not just within statements.
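
    If you want the -m32 build to round the product down to double before truncating, one sketch is to force a store/reload; the volatile keeps the store alive even when optimization is enabled:

    #include <cstdint>

    int64_t deltatime(double numSeconds) {
        // Storing to a 64-bit double discards the x87 excess precision;
        // the product rounds up to exactly 1234.0, so the cast gives 1234.
        volatile double product = numSeconds * 1000.0;
        return (int64_t)product;
    }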



    Is the proper solution around this just adding 0.5 to the product (before casting to int64_t)?

    What do you consider "proper"? What do you want your real code to do in your real use-case? If you want it to truncate toward 0, adding 0.5 doesn’t do that. It’s closer to round-to-nearest, but it differs for the largest representable double below 0.5: adding 0.5 rounds that up to exactly 1.0 (the distance between consecutive doubles grows with larger exponents), so truncation then gives 1 where round-to-nearest gives 0. It also behaves differently for negative inputs.
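
    A minimal counterexample for the +0.5 idiom, using std::nextafter to get the largest double below 0.5:

    #include <cmath>
    #include <cstdio>

    int main() {
        double x = std::nextafter(0.5, 0.0); // largest double below 0.5
        double y = x + 0.5; // exact sum is 1 - 2^-54; ties-to-even rounds it up to 1.0
        std::printf("x = %.17g\n", x);                              // 0.49999999999999994
        std::printf("truncate(x + 0.5) = %d\n", (int)y);            // 1
        std::printf("round-to-nearest(x) = %ld\n", std::lround(x)); // 0
        return 0;
    }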

    If you want rounding, there are functions like long lrint(double) or long long llrint(double) which can (hopefully) compile to an SSE2 cvtsd2si, although you might need -fno-math-errno for GCC to inline it. Basically nobody checks errno after calling math functions like sqrt or nearbyint.
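
    Applied to the question’s deltatime, a rounded version might look like this (assuming round-to-nearest is what the unit test actually expects):

    #include <cmath>
    #include <cstdint>

    int64_t deltatime(double numSeconds) {
        // llrint rounds to nearest in the default FP rounding mode
        // (round-to-nearest-even) instead of truncating toward zero.
        return (int64_t)std::llrint(numSeconds * 1000.0);
    }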

    Whatever you choose, you need to accept the fact that tiny differences in rounding can push a value across a threshold and round to a different integer. Different choices change where the cutoffs are. If you don’t like binary floating-point math, don’t use it, e.g. use a decimal floating-point or fixed-point type or something.
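
    As one sketch of that last route, std::chrono durations keep time as integer tick counts, so unit conversions are exact integer arithmetic:

    #include <chrono>
    #include <iostream>

    int main() {
        using namespace std::chrono;
        milliseconds delta{1234};  // stored exactly as an integer count
        microseconds us = delta;   // widening to a finer unit is exact
        std::cout << delta.count() << "ms = " << us.count() << "us\n";
        return 0;
    }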
