# Revisiting the Fast Inverse Square Root – Is It Still Useful?

*by* Phil Tadros

*April 20, 2023*

In 2005, id Software released the source code for their 1999 game *Quake III Arena* under the GPL-2 license. In the file code/game/q_math.c, there is a function for calculating the reciprocal square root of a number which at first glance seems to use a very peculiar algorithm:

```
float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;                       // evil floating point bit level hacking
	i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//	y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

	return y;
}
```

Many articles have been written about this particular algorithm, and it has its own well-written Wikipedia page where it is called the *fast inverse square root*. The algorithm actually appeared on various forums before the Q3 source code was released. Ryszard of Beyond3D did some investigating in 2004–2005 and eventually tracked down the original author of the algorithm: Greg Walsh at Ardent Computer, who created it more than a decade earlier.

## How does it work?

So how does the method work, anyway? It is carried out in two steps:

- obtain a rough approximation `y` for the reciprocal square root of our `number`: `y = number; i = * ( long * ) &y; i = 0x5f3759df - ( i >> 1 ); y = * ( float * ) &i;`
- improve the approximation using a single step of the Newton-Raphson (NR) method: `const float threehalfs = 1.5F; x2 = number * 0.5F; y = y * ( threehalfs - ( x2 * y * y ) );`

### First approximation

The most interesting part is the first one. It uses a seemingly magic number `0x5f3759df` and some bit shifting and somehow ends up with the reciprocal square root. The first line stores the 32-bit floating-point number `y` as a 32-bit integer `i` by taking a pointer to `y`, converting it to a `long` pointer and dereferencing it. So `y` and `i` hold two identical 32-bit vectors, but one is interpreted as a floating-point number and the other as an integer. Then the integer is shifted one step to the right, negated, and the constant `0x5f3759df` is added. Finally, the resulting value is interpreted as a floating-point number again by dereferencing a `float` pointer that points to the integer value `i`.
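
As an aside, the pointer cast technically violates C's strict aliasing rules. A well-defined way to do the same reinterpretation in modern C is `memcpy`, which compilers optimize down to the same single register move; a minimal sketch:

```
#include <stdint.h>
#include <string.h>

/* reinterpret the bits of a float as a 32-bit integer without
   relying on undefined behavior */
static inline uint32_t float_bits(float y) {
	uint32_t i;
	memcpy(&i, &y, sizeof i);
	return i;
}
```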

Here, shifting, negation and addition are performed in the integer domain; how do these operations affect the number in the floating-point domain? In order to understand how this can yield an approximation of the reciprocal square root, we need to be familiar with how floating-point numbers are represented in memory. A floating-point number consists of a sign $$s \in \{0,1\}$$, an exponent $$e \in \mathbb{Z}$$ and a fractional part $$0 \leq f < 1$$. The value of the floating-point number is then

$$y = (-1)^s \cdot (1 + f) \cdot 2^e.$$

In our case, we can assume that our `float` is in the IEEE 754 binary32 format; the bits are then ordered as shown below. The most significant bit is the sign bit $$S$$, followed by 8 bits ($$E$$) representing the exponent $$e$$ and the remaining 23 bits ($$F$$) representing the fractional part $$f$$. The number is negative when $$S=1$$. The 8-bit number $$E$$ is not used directly as the exponent; it has an offset or bias of $$2^7-1 = 127$$, so $$E=0$$ means that the exponent is $$e=-127$$. $$F$$ is simply a fractional binary number with the binary point before the first digit, such that $$f = F \cdot 2^{-23}$$.
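
For example, $$5.5 = (1 + 0.375) \cdot 2^2$$, so $$S=0$$, $$E=2+127=129$$ and $$f=0.375$$.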

We can write a simple C program `interp.c` to print both the integer and floating-point interpretations of a given number, and also extract the different parts:

```
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(int argc, char *args[]) {
	/* parse number from args */
	uint32_t i;
	int ret;
	if (argc == 2) {
		ret = sscanf(args[1], "%u", &i);
	} else if (argc == 3 && strcmp(args[1], "-h") == 0) {
		ret = sscanf(args[2], "%x", &i);
	} else if (argc == 3 && strcmp(args[1], "-f") == 0) {
		float y;
		ret = sscanf(args[2], "%f", &y);
		i = *(uint32_t*)&y;
	} else {
		return EXIT_FAILURE;
	}
	if (ret != 1) return EXIT_FAILURE;
	/* print representations */
	printf("hexadecimal: %x\n", i);
	printf("unsigned int: %u\n", i);
	printf("signed int: %d\n", i);
	printf("floating-point: %f\n", *(float*)&i);
	/* print parts */
	int S = i >> 31;
	int E = (i >> 23) & ((1 << 8)-1);
	int e = E - 127;
	int F = i & ((1 << 23)-1);
	float f = (float)F / (1 << 23);
	printf("S: %d\n", S);
	printf("E: %d (0x%x) <=> e: %d\n", E, E, e);
	printf("F: %d (0x%x) <=> f: %f\n", F, F, f);
	return EXIT_SUCCESS;
}
```

We can, for example, look at the number `0x40b00000`:

```
$ ./interp -h 40b00000
hexadecimal: 40b00000
unsigned int: 1085276160
signed int: 1085276160
floating-point: 5.500000
S: 0
E: 129 (0x81) <=> e: 2
F: 3145728 (0x300000) <=> f: 0.375000
```

We can also extract the parts of a floating-point number:

```
$ ./interp -f -32.1
hexadecimal: c2006666
unsigned int: 3254806118
signed int: -1040161178
floating-point: -32.099998
S: 1
E: 132 (0x84) <=> e: 5
F: 26214 (0x6666) <=> f: 0.003125
```

Even now that we know how floating-point numbers are represented in memory, it is not entirely obvious how performing operations in the integer domain affects the floating-point domain. As a first step we can simply iterate over a range of floating-point numbers and see what integer values we get:

```
#include <stdio.h>

int main() {
	float x;
	for (x = 0.1; x <= 8.0; x += 0.1) {
		printf("%f\t%d\n", x, *(int*)&x);
	}
}
```

We can then plot the floating-point values on the x-axis and the integer values on the y-axis with e.g. gnuplot to get a plot like this:

Well, this curve looks quite familiar. We can look further at some of the data points using our earlier program:

```
$ ./interp -f 1.0
hexadecimal: 3f800000
unsigned int: 1065353216
signed int: 1065353216
floating-point: 1.000000
S: 0
E: 127 (0x7f) <=> e: 0
F: 0 (0x0) <=> f: 0.000000
$ ./interp -f 2.0
hexadecimal: 40000000
unsigned int: 1073741824
signed int: 1073741824
floating-point: 2.000000
S: 0
E: 128 (0x80) <=> e: 1
F: 0 (0x0) <=> f: 0.000000
$ ./interp -f 3.0
hexadecimal: 40400000
unsigned int: 1077936128
signed int: 1077936128
floating-point: 3.000000
S: 0
E: 128 (0x80) <=> e: 1
F: 4194304 (0x400000) <=> f: 0.500000
```

For 1.0 and 2.0 we get $$S=0$$, $$F=0$$ and a non-zero biased exponent $$E$$. If we remove the bias from this number (subtract `127 << 23`) and then shift it all the way to the right, we end up with the exponent $$e$$, in other words the base-2 logarithm of the floating-point number. However, this only works when $$S=0$$ and $$F=0$$, i.e. for powers of two. If $$S=1$$ we have a negative number, for which the logarithm is undefined. And if $$F \ne 0$$, shifting the exponent all the way to the right simply discards all of that data. We can instead convert the value to floating point and divide by $$2^{23}$$, so that the fractional part scales our resulting value linearly:

```
(float) (*(int*)&x - (127 << 23)) / (1 << 23)
```

Then we do not get the logarithm exactly, but we do get a linear approximation for all non-power-of-two values. We can plot the approximation together with the actual logarithmic function:
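
A small program along these lines can generate the data for that comparison (a sketch; the sampling range is an arbitrary choice):

```
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void) {
	/* print x, the integer-domain approximation, and the exact log2 */
	for (float x = 0.25f; x <= 8.0f; x += 0.05f) {
		uint32_t i;
		memcpy(&i, &x, sizeof i);
		float approx = (float)((int32_t)i - (127 << 23)) / (1 << 23);
		printf("%f\t%f\t%f\n", x, approx, log2f(x));
	}
	return 0;
}
```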

This means that when we take a floating-point number and interpret it as an integer, we obtain an approximation of the logarithm of that number, with some offset and scaling. And when we interpret an integer as a floating-point number, we get the reverse, i.e. the exponential or antilogarithm of our integer value. This basically means that when we perform operations in the integer domain, it is as if we perform operations in the logarithmic domain. For example, if we remember our logarithmic identities, we know that if we take the logarithms of two numbers and add them together, we get the logarithm of their product:

$$\log{a} + \log{b} = \log{(a \cdot b)}.$$

In other words, if we perform addition in the integer domain we get multiplication in the floating-point domain (roughly, anyway). We can try this with another simple C program. One thing we need to consider is how our operation affects the exponent bias. When we add two numbers with biased exponents we get double the bias:

$$E_1 + E_2 = (e_1 + B) + (e_2 + B) = e_1 + e_2 + 2B.$$

We want our bias to remain $$B$$ rather than $$2B$$, so to counter this we simply subtract $$B$$ from the result. Our C program that performs floating-point multiplication using integer addition could then look like this:

```
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

const uint32_t B = (127 << 23);

int main(int argc, char *args[]) {
	/* parse factors from args */
	float a, b;
	if (argc == 3) {
		int ret = sscanf(args[1], "%f", &a);
		ret += sscanf(args[2], "%f", &b);
		if (ret != 2) return EXIT_FAILURE;
	} else {
		return EXIT_FAILURE;
	}
	/* perform multiplication (integer addition) */
	uint32_t sum = *(uint32_t*)&a + *(uint32_t*)&b - B;
	float y = *(float*)&sum;
	/* compare with exact */
	float y_exact = a*b;
	float rel_err = (y - y_exact) / y_exact;
	printf("%f =? %f (%.2f%%)\n", y, y_exact, 100*rel_err);
}
```

Let's try it out:

```
$ ./mul 3.14159 8.0
25.132721 =? 25.132721 (0.00%)
$ ./mul 3.14159 0.2389047
0.741016 =? 0.750541 (-1.27%)
$ ./mul -15.0 3.0
-44.000000 =? -45.000000 (-2.22%)
$ ./mul 6.0 3.0
16.000000 =? 18.000000 (-11.11%)
$ ./mul 0.0 10.0
0.000000 =? 0.000000 (inf%)
```

Most of the time it is not perfectly accurate: it is correct only if one of the factors is a power of two, and least accurate when both factors lie right between two powers of two.
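
The -11.11% case illustrates why: the fractional parts simply add in the integer domain, so for $$6.0 \cdot 3.0 = (1.5 \cdot 2^2)(1.5 \cdot 2^1)$$ we get $$(1 + 0.5 + 0.5) \cdot 2^{2+1} = 2^4 = 16$$ instead of $$1.5 \cdot 1.5 \cdot 2^3 = 18$$.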

How about the reciprocal square root? The reciprocal square root $$\frac{1}{\sqrt{x}}$$ is equivalent to $$x^{-1/2}$$, so we need another logarithmic identity:

$$p\log{x} = \log{x^p}$$

This means that if we perform multiplication in the integer domain, we get exponentiation in the floating-point domain. Depending on our exponent $$p$$ we can obtain several different functions, e.g.:

| $$p$$ | $$f(x)$$ |
|---|---|
| 2 | $$x^2$$ |
| 1/2 | $$\sqrt{x}$$ |
| -1 | $$\frac{1}{x}$$ |
| -1/2 | $$\frac{1}{\sqrt{x}}$$ |
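
All of these follow the same recipe: scale the bit pattern by $$p$$ and restore the bias. As a generic sketch (the helper name `pow_appr` is made up here, and `double` is used so the scaling does not lose integer precision):

```
#include <stdint.h>
#include <string.h>

/* rough x^p for x > 0: multiply by p in the "logarithmic" integer
   domain, then restore the bias: i' = p*(i - B) + B */
static float pow_appr(float x, double p) {
	const double B = (double)(127u << 23);
	uint32_t i, k;
	float y;
	memcpy(&i, &x, sizeof i);
	k = (uint32_t)(p * ((double)i - B) + B);
	memcpy(&y, &k, sizeof y);
	return y;
}
```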

To get a first approximation of the reciprocal square root, we thus need to multiply by -1/2 in the integer domain and adjust for the bias. The bias will then be $$-B/2$$ and we want the bias to be $$B$$, so we need to add $$3B/2 = \texttt{0x5f400000}$$. So we multiply by -1/2 by shifting right one step and negating, and then add the bias:

```
- (i >> 1) + 0x5f400000;
```
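
As a sanity check: the integer interpretation is roughly $$i \approx B + 2^{23}\log_2 x$$, so $$-\frac{i}{2} + \frac{3B}{2} \approx B + 2^{23}\log_2 x^{-1/2}$$, which is exactly the bit pattern of $$x^{-1/2}$$.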

This is now very similar to the Q3 source code, except that the constant differs slightly: they used `0x5f3759df` whereas we currently have `0x5f400000`.

We can see whether it is possible to make improvements by looking at our error. We simply subtract the exact value of the reciprocal square root from our approximation and plot the result for a certain range of numbers:

The graph repeats horizontally in both directions (only at a different scale), so we only need to look at this part to understand the error for all (normal) floating-point numbers. We can see that the approximation always overestimates; by subtracting a constant that is around half the maximum error we can make it symmetric and thus decrease the average absolute error. Looking at the graph, subtracting something like 0x7a120 might work. Our constant would then be 0x5f385ee0, which is closer to the constant used in Q3.

In the integer domain, this adjustment simply centers the error around the x-axis in the above diagram. In the floating-point domain, the error is affected similarly, except when our subtraction borrows from the exponent:

We could potentially try to find an exact optimum for some reasonable objective function, but we will stop here. In the case of the original Q3 constant, it is not really clear how it was chosen; perhaps by trial and error.
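
For the curious, here is a sketch of how candidate constants could be compared numerically, by measuring the worst-case relative error over one period of the pattern (the objective function and step size are arbitrary choices):

```
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* worst-case relative error of the first approximation over x in [1, 4) */
static double worst_error(uint32_t magic) {
	double worst = 0.0;
	for (float x = 1.0f; x < 4.0f; x += 1e-4f) {
		uint32_t i;
		float y;
		memcpy(&i, &x, sizeof i);
		i = magic - (i >> 1);
		memcpy(&y, &i, sizeof y);
		double err = fabs(y * sqrt((double)x) - 1.0);
		if (err > worst) worst = err;
	}
	return worst;
}

int main(void) {
	printf("0x5f400000: %.4f\n", worst_error(0x5f400000));
	printf("0x5f3759df: %.4f\n", worst_error(0x5f3759df));
	return 0;
}
```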

### Improving the approximation

The second part is less unconventional. Once a first approximation has been obtained, one can improve it using a method known as Newton-Raphson (NR). If you are unfamiliar with it, Wikipedia has a good article on it. The NR method improves an approximation of a root of an equation. Since we want the reciprocal square root, we need a function $$f(y)$$ that is zero when $$y$$ is exactly the reciprocal square root of $$x$$:

$$y = \frac{1}{\sqrt{x}} \;\Leftrightarrow\; \frac{1}{y^2} = x \;\Rightarrow\; f(y) = \frac{1}{y^2} - x = 0$$

If we have an approximate value $$y_n$$, we can get a better approximation $$y_{n+1}$$ by calculating where the tangent of the function's graph at $$y=y_n$$ (whose slope is the derivative $$f'(y_n) = -2/{y_n}^3$$) intersects $$f(y)=0$$. That value can be expressed as

$$y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} = y_n + \frac{y_n - x\,{y_n}^3}{2} = y_n \left( \frac{3}{2} - \frac{x}{2} \cdot {y_n}^2 \right)$$

which is the very same expression used in the second part of the Q3 function.

## How fast is it?

Back in 2003, Chris Lomont wrote an article about his investigations of the algorithm. His testing showed the algorithm to be four times faster than the more straightforward approach of simply taking the reciprocal of `sqrt(x)` from the standard library.

In 2009, Elan Ruskin made a post, Timing Square Root, where he primarily looked at the square root function but also compared the fast inverse square root algorithm to other methods. On his Intel Core 2, the fast inverse square root was four times slower than using `rsqrtss`, or 30% slower than `rsqrtss` with a single NR step.

Since then, several new extensions to the x86 instruction set have appeared. I have tried to sum up all square root instructions currently available:

| Set | $$\sqrt{x}$$ | $$\frac{1}{\sqrt{x}}$$ | Width |
|---|---|---|---|
| x87 (1980) | `fsqrt` | | 32 |
| 3DNow! (1998) | | `pfrsqrt` | 128 |
| SSE (1999) | `sqrtps`, `sqrtss` | `rsqrtps`, `rsqrtss` | 128 |
| SSE2 (2000) | `sqrtpd`, `sqrtsd` | | 128 |
| AVX (2011) | `vsqrtps`, `vsqrtpd`, `vsqrtps_nr` | `vrsqrtps`, `vrsqrtps_nr` | 256 |
| AVX-512 (2014) | | `vrsqrt14pd`, `vrsqrt14ps`, `vrsqrt14sd`, `vrsqrt14ss` | 512 |

The `fsqrt` instruction is more or less obsolete by now. The 3DNow! extension has also been deprecated and is no longer supported. All x86-64 processors support at least SSE and SSE2. Most processors support AVX and some support AVX-512, but e.g. GCC currently chooses not to emit any AVX instructions by default.

The `p` and `s` are short for "packed" and "scalar". The packed instructions are vector SIMD instructions, while the scalar ones operate on only a single value at a time. With a register width of e.g. 256 bits, a packed instruction can perform $$256/32=8$$ calculations in parallel. The trailing `s` or `d` is short for "single" or "double" precision floating point. Since we are dealing with approximations, we will be using single-precision floating-point numbers, so we can use either the `ps` or `ss` variants.
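
For illustration, the same naming shows up in the SSE intrinsics from `<xmmintrin.h>` (a minimal sketch, not used in the benchmarks below):

```
#include <xmmintrin.h>

/* packed: four reciprocal square root estimates at once */
__m128 rsqrt_packed(__m128 x) { return _mm_rsqrt_ps(x); }

/* scalar: only the lowest of the four lanes is computed */
__m128 rsqrt_scalar(__m128 x) { return _mm_rsqrt_ss(x); }
```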

The fast inverse square root method already had a pretty hard time against the `rsqrtss` instruction back in 2009. And since then, several extensions with specialized SIMD instructions have been implemented in modern x86 processors. Surely the fast inverse square root has no chance today and its time has passed?

Why don't we give it a try ourselves right now? We can start by running some tests on my current machine, which has a relatively modern processor: an AMD Zen 3 5950X from late 2020.

### Initial testing

We will write a C program that tries to calculate the reciprocal square root using three different methods:

- `exact`: simply `1.0 / sqrtf(x)`, using the `sqrtf` function from the C standard library,
- `appr`: the first approximation from the Q3 source, as explained above,
- `appr_nr`: the full Q3 method with one iteration of Newton-Raphson.

For each method we perform the calculation for every value in a randomized input array and time how long it takes in total. We can use the `clock_gettime` function from libc (on POSIX systems) to get the time before and after we perform the calculations and compute the difference. We then repeat this many times to reduce random variation. The C program looks like this:

```
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <math.h>

#define N 4096
#define T 1000
#define E9 1000000000

#ifndef CLOCK_REALTIME
#define CLOCK_REALTIME 0
#endif

enum methods { EXACT, APPR, APPR_NR, M };
const char *METHODS[] = { "exact", "appr", "appr_nr" };

static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }

static inline float rsqrt_appr(float x) {
	uint32_t i = *(uint32_t*)&x;
	i = -(i >> 1) + 0x5f3759df;
	return *(float*)&i;
}

static inline float rsqrt_nr(float x, float y) { return y * (1.5f - x*0.5f*y*y); }

static inline float rsqrt_appr_nr(float x) {
	float y = rsqrt_appr(x);
	return rsqrt_nr(x, y);
}

int main() {
	srand(time(NULL));
	float y_sum[M] = {0};
	double t[M] = {0};
	for (int trial = 0; trial < T; trial++) {
		struct timespec start, stop;
		float x[N], y[N];
		for (int i = 0; i < N; i++) { x[i] = rand(); }

		clock_gettime(CLOCK_REALTIME, &start);
		for (int i = 0; i < N; i++) { y[i] = rsqrt_exact(x[i]); }
		clock_gettime(CLOCK_REALTIME, &stop);
		for (int i = 0; i < N; i++) { y_sum[EXACT] += y[i]; }
		t[EXACT] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);

		clock_gettime(CLOCK_REALTIME, &start);
		for (int i = 0; i < N; i++) { y[i] = rsqrt_appr(x[i]); }
		clock_gettime(CLOCK_REALTIME, &stop);
		for (int i = 0; i < N; i++) { y_sum[APPR] += y[i]; }
		t[APPR] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);

		clock_gettime(CLOCK_REALTIME, &start);
		for (int i = 0; i < N; i++) { y[i] = rsqrt_appr_nr(x[i]); }
		clock_gettime(CLOCK_REALTIME, &stop);
		for (int i = 0; i < N; i++) { y_sum[APPR_NR] += y[i]; }
		t[APPR_NR] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);
	}

	printf("rsqrt\tfs/op\tratio\terr\n");
	for (int m = 0; m < M; m++) {
		printf("%s\t%.0f\t%.2f\t%.4f\n",
			METHODS[m],
			t[m] * 1000.0f / N / T,
			(double) t[EXACT] / t[m],
			(y_sum[m] - y_sum[EXACT]) / y_sum[EXACT]);
	}
	return 0;
}
```

At the end, the program prints three things for each method:

- the average time for a single operation, in femtoseconds – the lower the better,
- the ratio of the calculation time compared to the exact method – the higher the faster,
- the average error relative to the exact method – just to make sure the calculations are performed correctly.

So, what do we expect? There are dedicated instructions for calculating the reciprocal square root in the x86 instruction set that the compiler should be able to emit. The throughput could then be higher than with the approximate method, where we perform several operations.

Let's go ahead and try it. We will compile it using GCC without any optimizations at first, explicitly with `-O0`. Since we are using `math.h` for the exact method, we also need to link the math library using `-lm`:

```
$ gcc -lm -O0 rsqrt.c
$ ./a.out
rsqrt    fs/op  ratio  err
exact    3330   1.00   0.0000
appr     2020   1.65   0.0193
appr_nr  6115   0.54   -0.0010
```

This seems reasonable. The error is noticeable for the first approximation but is reduced after one iteration of NR. The first approximation is indeed faster than the exact method, but together with a step of NR it is twice as slow. The NR step requires several more operations, so this seems plausible.

Alright, but this is only a debug build; let's try adding optimizations using the `-O3` flag. This enables all optimizations that do not disregard standards compliance.

```
$ gcc -lm -O3 rsqrt.c
$ ./a.out
rsqrt    fs/op  ratio  err
exact    1879   1.00   0.0000
appr     72     26.01  0.0193
appr_nr  178    10.54  -0.0010
```

Hmm, now the approximations are a lot faster than before, but the time of the exact method has only halved, making the approximation with NR more than ten times faster than the exact method. Perhaps the compiler did not emit the reciprocal square root instructions? Maybe it will improve if we use the `-Ofast` flag instead, which is described by the gcc(1) man page:

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens. It turns off -fsemantic-interposition.

Our exact method may not be as accurate as before, but it might be faster.

```
$ gcc -lm -Ofast rsqrt.c
$ ./a.out
rsqrt    fs/op  ratio  err
exact    153    1.00   0.0000
appr     118    1.30   0.0137
appr_nr  179    0.85   -0.0009
```

And it is indeed faster. The first approximation is still faster, but with a step of NR it is now slower than the exact method. The error has decreased slightly for the approximations because we are still comparing against the "exact" method, which now yields different results. Oddly enough, the first approximation has become half as fast. This seems to be a quirk of GCC; Clang does not have this issue, and otherwise produces similar results:

```
$ clang -lm -O0 rsqrt.c
$ ./a.out
rsqrt    fs/op  ratio  err
exact    3715   1.00   0.0000
appr     1933   1.92   0.0193
appr_nr  6001   0.62   -0.0010
$ clang -lm -O3 rsqrt.c
$ ./a.out
rsqrt    fs/op  ratio  err
exact    1900   1.00   0.0000
appr     61     31.26  0.0193
appr_nr  143    13.24  -0.0010
$ clang -lm -Ofast rsqrt.c
$ ./a.out
rsqrt    fs/op  ratio  err
exact    148    1.00   0.0000
appr     62     2.40   0.0144
appr_nr  145    1.02   -0.0009
```

### Disassembly

For both compilers, there is quite a large difference between `-O3` and `-Ofast`. We can look at the disassembly to see what is going on. We need to pass the `-g` flag to the compiler to get debug symbols in the binary that tell us which object code corresponds to which source code. Then we can run `objdump -d` with the `-S` flag to see the disassembled instructions next to the source code:
```
$ gcc -lm -O3 -g rsqrt.c
$ objdump -d -S a.out
...
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
118e: 66 0f ef db pxor %xmm3,%xmm3
1192: 0f 2e d8 ucomiss %xmm0,%xmm3
1195: 0f 87 e1 02 00 00 ja 147c <main+0x3bc>
119b: f3 0f 51 c0 sqrtss %xmm0,%xmm0
119f: f3 0f 10 0d 99 0e 00 movss 0xe99(%rip),%xmm1 # 2040
11a6: 00
...
11ab: f3 0f 5e c8 divss %xmm0,%xmm1
...
2040: 00 00 80 3f 1.0f
```

If you are unfamiliar with it, this is the AT&T syntax for x86-64 assembly. Note that the source operand always comes before the destination operand. Parentheses indicate an address; for example `movss 0xecd(%rip),%xmm1` copies the value located 0xecd bytes ahead of the address in the `rip` register (the instruction pointer, a.k.a. PC) to the `xmm1` register. The `xmmN` registers are 128 bits wide, or four 32-bit words. However, the `ss` instructions are for scalar single-precision values, so the operation is only applied to a single floating-point value in the least significant 32 bits.

In the `-O3` case we use the scalar `sqrtss` followed by `divss`. There is also a compare `ucomiss` and a jump `ja` that will set `errno` to `EDOM` if the input is less than -0. We are not using `errno` at all, so we can remove the `errno` handling by passing the `-fno-math-errno` flag:

```
$ gcc -lm -O3 -g -fno-math-errno rsqrt.c
$ ./a.out
rsqrt    fs/op  ratio  err
exact    479    1.00   0.0000
appr     116    4.13   0.0193
appr_nr  175    2.74   -0.0010
```

```
$ objdump -d -S a.out
...
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
1170: 0f 51 0c 28 sqrtps (%rax,%rbp,1),%xmm1
1174: f3 0f 10 1d c4 0e 00 movss 0xec4(%rip),%xmm3 # 2040
117b: 00
117c: 48 83 c0 10 add $0x10,%rax
1180: 0f c6 db 00 shufps $0x0,%xmm3,%xmm3
1184: 0f 28 c3 movaps %xmm3,%xmm0
1187: 0f 5e c1 divps %xmm1,%xmm0
...
2040: 00 00 80 3f 1.0f
```

This avoids having to check every input value individually and thus allows the compiler to use the packed variants of the instructions, performing four operations at a time. This improved performance a lot. However, we still get `sqrtps` followed by `divps`. We must also enable `-funsafe-math-optimizations` and `-ffinite-math-only` to make GCC emit `rsqrtps` instead. We then get code identical to what we got with `-Ofast`:

```
$ gcc -lm -O3 -g -fno-math-errno -funsafe-math-optimizations -ffinite-math-only rsqrt.c
$ ./a.out
rsqrt    fs/op  ratio  err
exact    155    1.00   0.0000
appr     120    1.29   0.0137
appr_nr  182    0.85   -0.0009
```

```
$ objdump -d -S a.out
...
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
1170: 0f 52 0c 28 rsqrtps (%rax,%rbp,1),%xmm1
1174: 0f 28 04 28 movaps (%rax,%rbp,1),%xmm0
1178: 48 83 c0 10 add $0x10,%rax
117c: 0f 59 c1 mulps %xmm1,%xmm0
117f: 0f 59 c1 mulps %xmm1,%xmm0
1182: 0f 59 0d c7 0e 00 00 mulps 0xec7(%rip),%xmm1 # 2050
1189: 0f 58 05 b0 0e 00 00 addps 0xeb0(%rip),%xmm0 # 2040
1190: 0f 59 c1 mulps %xmm1,%xmm0
...
2040: 00 00 40 c0 -3.0f
...
2050: 00 00 00 bf -0.5f
```

Now it uses `rsqrtps`, but it also has several multiplication instructions as well as an addition. Why are these needed; isn't the reciprocal square root all we asked for? We can get a hint by looking at the disassembly of the `appr_nr` function:

```
static inline float rsqrt_nr(float x, float y) { return y * (1.5f - x*0.5f*y*y); }
12f8: f3 0f 10 1d 80 0d 00 movss 0xd80(%rip),%xmm3 # 2080
12ff: 00
...
1304: 0f 59 05 65 0d 00 00 mulps 0xd65(%rip),%xmm0 # 2070
...
1310: 0f c6 db 00 shufps $0x0,%xmm3,%xmm3
...
1318: 0f 28 d1 movaps %xmm1,%xmm2
131b: 0f 59 d1 mulps %xmm1,%xmm2
131e: 0f 59 d0 mulps %xmm0,%xmm2
1321: 0f 28 c3 movaps %xmm3,%xmm0
1324: 0f 5c c2 subps %xmm2,%xmm0
1327: 0f 59 c1 mulps %xmm1,%xmm0
...
2070: 00 00 00 3f 0.5f
...
2080: 00 00 c0 3f 1.5f
```

The last part looks quite similar, because it is actually doing the same thing: an iteration of Newton-Raphson. This is hinted at in the gcc(1) man page:

This option enables use of the reciprocal estimate and reciprocal square root estimate instructions with additional Newton-Raphson steps to increase precision instead of doing a divide or square root and divide for floating-point arguments.

The `rsqrtps` instruction only guarantees a relative error smaller than $$1.5 \cdot 2^{-12}$$; the NR iteration reduces it further, just like in the Q3 code.
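
One NR step roughly squares the relative error: if $$y_n = (1+\epsilon)/\sqrt{x}$$, then $$y_{n+1}\sqrt{x} = 1 - \frac{3}{2}\epsilon^2 - \frac{1}{2}\epsilon^3$$, so an initial error of $$1.5 \cdot 2^{-12} \approx 3.7 \cdot 10^{-4}$$ lands on the order of $$10^{-7}$$ after the iteration.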

If we do not need this extra precision, can we get a speedup by skipping the NR step? We can use compiler intrinsics to make the compiler emit only the `rsqrtps` instruction. The GCC manual has a list of built-in functions for the x86 instruction set. There is a `__builtin_ia32_rsqrtps` function that will emit the `rsqrtps` instruction:

```
v4sf __builtin_ia32_rsqrtps (v4sf);
```

The manual also has a chapter about how to use these vector instructions with built-in functions. We need to add a `typedef` for the `v4sf` type, which holds four floating-point numbers. We then use an array of $$N/4$$ such vectors and simply feed one vector at a time to the built-in function. N is a multiple of 4, so there are no half-full vectors. We can simply cast our previous `float` input array to a `v4sf` pointer. We add these parts to our previous program:

```
typedef float v4sf __attribute__ ((vector_size(16)));
v4sf rsqrt_intr(v4sf x) { return __builtin_ia32_rsqrtps(x); };

v4sf *xv = (v4sf*)x, *yv = (v4sf*)y;
clock_gettime(CLOCK_REALTIME, &start);
for (int i = 0; i < N/4; i++) { yv[i] = rsqrt_intr(xv[i]); }
clock_gettime(CLOCK_REALTIME, &stop);
for (int i = 0; i < N; i++) { y_sum[INTR] += y[i]; }
t[INTR] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);
```

We can compile it in order to run and disassemble it:

```
$ gcc -lm -O3 -g rsqrt_vec.c
$ ./a.out
rsqrt    fs/op  ratio  err
exact    1895   1.00   0.0000
appr     72     26.39  0.0193
appr_nr  175    10.81  -0.0010
rsqrtps  61     31.00  0.0000
```

```
$ objdump -d -S a.out
...
v4sf rsqrt_intr(v4sf x) { return __builtin_ia32_rsqrtps(x); };
1238: 41 0f 52 04 04 rsqrtps (%r12,%rax,1),%xmm0
...
```

Now we are down to a single instruction, and it is slightly faster than before. There are also extensions that not all processors support which we can try to use. We can tell the compiler to use any extensions available on our processor using `-march=native`. This may make the binary incompatible with other processors, though.

```
$ gcc -lm -Ofast -g -march=native rsqrt_vec.c
$ ./a.out
rsqrt    fs/op  ratio  err
exact    78     1.00   0.0000
appr     40     1.96   0.0137
appr_nr  85     0.91   -0.0009
rsqrtps  62     1.25   0.0000
```

Now the exact method is almost as fast as the first approximation, and the intrinsic version is just about as fast as well. The "exact" method got replaced by a 256-bit `vrsqrtps` and a step of NR:

```
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
11d0: c5 fc 52 0c 18 vrsqrtps (%rax,%rbx,1),%ymm1
11d5: c5 f4 59 04 18 vmulps (%rax,%rbx,1),%ymm1,%ymm0
11da: 48 83 c0 20 add $0x20,%rax
11de: c4 e2 75 a8 05 79 0e vfmadd213ps 0xe79(%rip),%ymm1,%ymm0
11e5: 00 00
11e7: c5 f4 59 0d 91 0e 00 vmulps 0xe91(%rip),%ymm1,%ymm1
11ee: 00
11ef: c5 fc 59 c1 vmulps %ymm1,%ymm0,%ymm0
```

The `__builtin_ia32_rsqrtps` version now uses a single `vrsqrtps` and no NR step; however, it still uses only 128-bit registers.
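
If we wanted the full 256-bit width without the NR step, the AVX intrinsic `_mm256_rsqrt_ps` from `<immintrin.h>` should do it (a sketch; this requires compiling with AVX enabled, e.g. `-mavx` or `-march=native`):

```
#include <immintrin.h>

/* eight approximate reciprocal square roots per call */
__m256 rsqrt8(__m256 x) { return _mm256_rsqrt_ps(x); }
```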

### Broad sweep

So, we did some testing on my machine and gained some insight into what kinds of instructions we can use to calculate the reciprocal square root and how they might perform. We will now run these benchmarks on several machines to get an idea of how well our findings apply in general. These machines are all the ones I happen to have convenient SSH access to. All resulting data can be downloaded from here; it also includes results for the inverse and square root functions separately.

Below is a list of the x86 machines that were tested, along with their CPUs and release dates. All of the previous tests were run on the computer labeled "igelkott".

| Hostname | CPU Family | CPU Model | Year | Form factor |
|---|---|---|---|---|
| jackalope | Core | Intel Celeron 550 | 2007 | i686 laptop |
| narwhal | Piledriver | AMD FX-6300 | 2012 | x86_64 desktop |
| silverback | Ivy Bridge | Intel Xeon E5-1410 | 2014 | x86_64 server |
| bovinae | Kaby Lake | Intel Core i5-8250U | 2017 | x86_64 laptop |
| igelkott | Zen 3 | AMD Ryzen 5950X | 2020 | x86_64 desktop |
| deck | Zen 2 | AMD APU 0405 | 2022 | x86_64 mobile |

Below is a plot of the performance ratio compared to the `exact` method, i.e. the time of each method divided by the time of the `exact` method. A higher ratio means higher performance; anything below 1 is slower than `exact` and anything above is faster. We use the `-Ofast` flag here, as it is the fastest option that can be used without sacrificing portability.

The results are quite similar across all the machines; the times of the methods are roughly ranked in the order `rsqrtps` <= `appr` < `exact` <= `appr_nr`. Using the `appr_nr` method is either slower than or on par with the `exact` method, so it has no real benefit in this case.

The "jackalope" machine was not included in the plot above because its `exact` method was extremely slow, especially when not using `-march=native`, since the compiler then resorted to the antique `fsqrt` instruction.

Below is a table of the actual timings when using `-Ofast`; numbers in parentheses use `-march=native`. Each number is how long a single operation takes, in femtoseconds.

| Machine/Compiler | exact | appr | appr_nr | rsqrtps |
|---|---|---|---|---|
| jackalope-clang | 53634 (5363) | 1500 (2733) | 4971 (3996) | N/A |
| narwhal-gcc | 419 (363) | 443 (418) | 601 (343) | 396 (231) |
| narwhal-clang | 389 (796) | 340 (321) | 445 (859) | 349 (388) |
| silverback-gcc | 422 (294) | 179 (199) | 543 (543) | 178 (189) |
| bovinae-gcc | 260 (127) | 155 (81) | 321 (119) | 108 (105) |
| bovinae-clang | 255 (132) | 108 (78) | 272 (112) | 95 (96) |
| igelkott-gcc | 141 (79) | 111 (63) | 168 (87) | 58 (64) |
| igelkott-clang | 152 (76) | 63 (40) | 149 (70) | 61 (62) |
| deck-gcc | 342 (160) | 234 (114) | 444 (172) | 226 (120) |
| deck-clang | 297 (166) | 189 (123) | 332 (140) | 101 (126) |

The square root function yields slightly different results:

Oddly enough, the `sqrtps` intrinsic is slower than the `exact` method, and `appr` without NR is now the faster one instead. The `appr_nr` method still provides no advantage; it is instead consistently worse than `exact`.

Here are the raw timings for the square root function as well, with `-Ofast`. Again, numbers in parentheses use `-march=native`:

| Machine/Compiler | exact | appr | appr_nr | sqrtps |
|---|---|---|---|---|
| jackalope-clang | 35197 (5743) | 1494 (2738) | 19191 (4308) | N/A |
| narwhal-gcc | 505 (399) | 399 (427) | 659 (559) | 796 (785) |
| narwhal-clang | 448 (823) | 327 (319) | 638 (847) | 803 (780) |
| silverback-gcc | 625 (297) | 271 (190) | 958 (728) | 1163 (1135) |
| bovinae-gcc | 301 (148) | 155 (81) | 408 (200) | 225 (226) |
| bovinae-clang | 315 (244) | 92 (60) | 399 (159) | 317 (227) |
| igelkott-gcc | 173 (95) | 119 (38) | 233 (124) | 288 (296) |
| igelkott-clang | 168 (143) | 63 (48) | 234 (104) | 170 (283) |
| deck-gcc | 419 (205) | 215 (108) | 519 (252) | 575 (574) |
| deck-clang | 325 (244) | 153 (88) | 372 (180) | 315 (458) |

### Try it yourself

You can try running the benchmarks on your machine and see if you get similar results. There is a shell script `bench/run.sh` that generates and runs benchmarks using the `bench/bench.c.m4` file. These files can be found in this blog's repo. Simply run the script with no arguments and it will produce a `.tsv` file with all results:

```
$ cd bench
$ sh run.sh
$ grep rsqrt bench.tsv | sort -nk3 | head
rsqrt appr 40 1.91 0.0139 clang-Ofast-march=native
rsqrt rsqrtps 56 32.08 0.0000 clang-O3
rsqrt appr 58 31.08 0.0193 clang-O3
rsqrt rsqrtps 58 2.48 0.0000 clang-O3-fno-math-errno-funsafe-math-optimizations-ffinite-math-only
rsqrt rsqrtps 59 2.45 0.0000 gcc-Ofast
rsqrt rsqrtps 59 2.48 0.0000 clang-Ofast
rsqrt rsqrtps 59 31.07 0.0000 gcc-O3
rsqrt rsqrtps 59 7.83 0.0000 gcc-O3-fno-math-errno
rsqrt appr 60 2.41 0.0144 clang-O3-fno-math-errno-funsafe-math-optimizations-ffinite-math-only
rsqrt rsqrtps 60 8.09 0.0000 clang-O3-fno-math-errno
```

## Closing thoughts

To summarize, simply using `1/sqrtf(x)` on modern x86 processors can be both faster and more accurate than the *fast inverse square root* method from Quake III's `Q_rsqrt` function.

However, a key takeaway is that you have to instruct the compiler to make it fast. When simply compiling with `-O3`, the fast inverse square root method is actually *considerably* faster than the naive implementation. We have to allow the compiler to violate some strict standards requirements to make it emit a faster implementation (mainly to allow vectorization). The same can be said for the ordinary square root function: just using `sqrtf(x)` and changing compiler flags allows for a very fast implementation.

If very low accuracy can be tolerated, it is possible to get a slightly faster implementation by skipping the Newton-Raphson step of the fast inverse square root method. Interestingly, the compiler also performs an NR step after using approximate implementations of the inverse square root. This too can be made slightly faster by skipping the NR step, emitting only the approximate instruction with the help of compiler intrinsics.

In this post we focused on x86, but what about other instruction sets? The fast inverse square root method could perhaps still be useful on processors without dedicated square root instructions. And how are hardware implementations of approximate square roots typically built? Could an approximate hardware implementation possibly use something like the first approximation of the fast inverse square root method?