https://dotat.at/@/2025-06-08-floats.html
A couple of years ago I wrote about random floating point
numbers. In that article I was mainly concerned with how
neat the code is, and I didn't pay attention to its performance.
Recently, a comment from Oliver Hunt and a blog post from
Alisa Sireneva prompted me to wonder if I had made an
unwarranted assumption. So I wrote a little benchmark, which you can
find in pcg-dxsm.git.
(Note 2025-06-09: I've edited this post substantially after
discovering some problems with the results.)
recap
Briefly, there are two basic ways to convert a random integer to a
floating point number between 0.0 and 1.0:
Use bit fiddling to construct an integer whose format matches a
float between 1.0 and 2.0; this is the same span as the result but
with a simpler exponent. Bitcast the integer to a float and
subtract 1.0 to get the result.
Shift the integer down to the same range as the mantissa, convert
to float, then multiply by a scaling factor that reduces it to the
desired range. This produces one more bit of randomness than the
bithacking conversion.
(There are other less basic ways.)
code
The double precision code for the two kinds of conversion is below.
(Single precision is very similar so I'll leave it out.)
It's mostly as I expect, but there are a couple of ARM instructions
that surprised me.
bithack
The bithack function looks like:
double bithack52(uint64_t u) {
    u = ((uint64_t)(1023) << 52) | (u >> 12);
    return (bitcast(double, u) - 1.0);
}
It translates fairly directly to amd64 like this:
bithack52:
    shr     rdi, 12
    movabs  rax, 0x3ff0000000000000
    or      rax, rdi
    movq    xmm0, rax
    addsd   xmm0, qword ptr [rip + .number]
    ret
.number:
    .quad   0xbff0000000000000
On arm64 the shift-and-or becomes one bfxil instruction (which is a
kind of bitfield move), and the constant -1.0 is encoded more briefly.
Very neat!
bithack52:
    mov     x8, #0x3ff0000000000000
    fmov    d0, #-1.00000000
    bfxil   x8, x0, #12, #52
    fmov    d1, x8
    fadd    d0, d1, d0
    ret
multiply
The shift-convert-multiply function looks like this:
double multiply53(uint64_t u) {
    return ((double)(u >> 11) * 0x1.0p-53);
}
It translates directly to amd64 like this:
multiply53:
    shr      rdi, 11
    cvtsi2sd xmm0, rdi
    mulsd    xmm0, qword ptr [rip + .number]
    ret
.number:
    .quad    0x3ca0000000000000
GCC and earlier versions of Clang produce the following arm64 code,
which is similar though it requires more faff to get the constant into
the right register.
multiply53:
    lsr     x8, x0, #11
    mov     x9, #0x3ca0000000000000
    ucvtf   d0, x8
    fmov    d1, x9
    fmul    d0, d0, d1
    ret
Recent versions of Clang produce this astonishingly brief
two-instruction translation: apparently you can convert fixed-point to
floating point in one instruction, which gives us the power-of-two
scale factor for free!
multiply53:
    lsr     x8, x0, #11
    ucvtf   d0, x8, #53
    ret
benchmark
My benchmark has 2 x 2 x 2 tests: the bithack or the multiply
conversion, in single or double precision, fed sequential or random
numbers.
I ran the benchmark on my Apple M1 Pro and my AMD Ryzen 7950X.
These functions are very small and work entirely in registers, so they
have been tricky to measure properly.
To prevent the compiler from inlining and optimizing the benchmark
loop to nothing, the functions are compiled in a separate translation
unit from the test harness. This is not enough to get plausible
measurements because the CPU overlaps successive iterations of the
loop, so we also use fence instructions.
On arm64, a single ISB (instruction synchronization barrier) in the loop is
enough to get reasonable measurements.
I have not found an equivalent of ISB on amd64, so I'm using MFENCE.
It isn't effective unless I pass the argument and return values via
pointers (because it's a memory fence) and place MFENCE instructions
just before reading the argument and just after writing the result.
results
In the table below, the leftmost column is the number of random bits;
"old" is arm64 with older clang, "arm" is newer clang, "amd" is gcc.
The first line is a baseline do-nothing function, showing the overhead
of the benchmark loop, function call, argument load, result store, and
fences.
The upper half measures sequential numbers, the bottom half is random
numbers. The times are nanoseconds per operation.
      old     arm     amd
00  21.44   21.41   21.42

23  24.28   24.31   22.19
24  25.24   24.31   22.94
52  24.31   24.28   21.98
53  25.32   24.35   22.25

23  25.59   25.56   22.86
24  26.55   25.55   23.03
52  27.83   27.81   23.93
53  28.57   27.84   25.01
The times vary a little from run to run but the difference in speed of
the various loops is reasonably consistent.
The numbers on arm64 are reasonably plausible. The most notable thing
is that the "old" multiply conversion is about 3 or 4 clock cycles
slower than the bithack conversion, but with a newer compiler that can
eliminate the multiply, it's the same speed.
On amd64 the multiply conversion is about 1 or 2 clock cycles slower
than the bithacking conversion.
conclusion
The folklore says that bithacking floats is faster than normal integer
to float conversion, and my results generally agree, except on arm64
with a good compiler. It would be interesting to compare other CPUs to
get a better idea of when the folklore is right or wrong -- or if any
CPUs perform the other way round!