# Bitwise Binary Search: Elegant and Fast

*by* Phil Tadros

I recently read the article *Beautiful Branchless Binary Search* by Malte Skarupke. In it they discuss the merits of the following snippet of C++ code implementing a binary search:

```
template<typename It, typename T, typename Cmp>
It lower_bound_skarupke(It begin, It end, const T& value, Cmp comp) {
    size_t length = end - begin;
    if (length == 0) return end;

    size_t step = std::bit_floor(length);
    if (step != length && comp(begin[step], value)) {
        length -= step + 1;
        if (length == 0) return end;
        step = std::bit_ceil(length);
        begin = end - step;
    }

    for (step /= 2; step != 0; step /= 2) {
        if (comp(begin[step], value)) begin += step;
    }
    return begin + comp(*begin, value);
}
```

Frankly, while the ideas behind the algorithm are beautiful, I find the implementation complex and hard to understand or prove correct. This isn't meant as a jab at Malte Skarupke; I find almost all binary search implementations hard to understand or prove correct.

In this article I'll present an alternative implementation based on similar ideas but with a very different *interpretation* that is (in my opinion) highly elegant and clear to understand, at least as far as binary searches go. The resulting implementation also saves a comparison in almost every case and ends up quite a bit smaller.

### A brief history lesson

Feel free to skip this section if you're not interested in history, but I had to find out whose shoulders we're standing on. This isn't only to give credit where credit is due, but also to see if any useful details were lost in translation.

Malte Skarupke says they learned about the above algorithm from Alex Muscar in their blog post.

Alex says they found the algorithm while reading Jon L. Bentley's book *Writing Efficient Programs* (ISBN 0-13-970244-X). Jon Bentley writes:

> If we need more speed then we should consult Knuth's [1973] definitive treatise on searching. Section 6.2.1 discusses binary search, and Exercise 6.2.1-11 describes an extremely efficient binary search program; […]

I own a hardcopy of the referenced book, Donald Knuth's *The Art of Computer Programming* (also known as TAOCP), volume 3: *Sorting and Searching*. Exercise 6.2.1-11 is not the right exercise in my edition, but 12 and 13 are, which are exercises referring to "Shar's method".

We have to scan chapter 6.2.1 to find the mentioned method. Finally, we find it on page 416. First, as context, Knuth uses the following notation for binary search:

> **Algorithm U** (*Uniform binary search*). Given a table of records $R_1, R_2, \dots, R_N$ whose keys are in increasing order $K_1 < K_2 < \cdots < K_N$, this algorithm searches for a given argument $K$. If $N$ is even, the algorithm will sometimes refer to a dummy key $K_0$ that should be set to $-\infty$. We assume that $N \geq 1$.

Now we can finally see Shar's method:

> Another modification of binary search, suggested in 1971 by L. E. Shar, will be still faster on some computers, because it is uniform after the first step, and it requires no table.
>
> The first step is to compare $K$ with $K_i$, where $i = 2^k$, $k = \lfloor \lg N \rfloor$. If $K < K_i$, we use a uniform search with the $\delta$'s equal to $2^{k-1}, 2^{k-2}, \dots, 1, 0$. On the other hand, if $K > K_i$ we reset $i$ to $i' = N + 1 - 2^l$ where $l = \lceil \lg(N - 2^k + 1) \rceil$, and pretend that the first comparison was actually $K > K_{i'}$, using a uniform search with the $\delta$'s equal to $2^{l-1}, 2^{l-2}, \dots, 1, 0$.

The $\delta$'s the first paragraph refers to can be understood as the `step` variable in the above C++ code. Overall, Skarupke's C++ code looks like a fairly faithful implementation of Shar's method as described by Knuth, except that Knuth uses one-based indexing while Skarupke's method does not.

Knuth goes on to state that Shar's method never makes more than $\lfloor \lg N \rfloor + 1$ comparisons, which is one more than the minimum possible number of comparisons.

To finish the history lesson: I did look on Google Scholar, but I couldn't find a 1971 paper by L. E. Shar. I assume the modification was described in private communication with Knuth.

Let us assume that we have a zero-indexed array $A$ of length $n$ that is in ascending order: $A[0] \leq A[1] \leq \cdots \leq A[n-1]$. We want to find the *lower bound* of some element $x$ in this array. This is the leftmost position at which we could insert $x$ into the array while keeping it sorted. Alternatively phrased, it is the number of elements strictly less than $x$ in the array.

A traditional binary search algorithm finds this number by keeping a range of possible solutions, repeatedly cutting that range into two pieces and selecting the piece that contains the solution. This tends to be tricky to get right, as you must avoid overflows while computing the midpoint, and you are dealing with multiple boundary conditions, both in your code as well as in your correctness invariant.

Before we begin with our solution that avoids this, we have to take a moment to understand an important aspect of binary search. With each comparison we can distinguish between two sets of outcomes. Thus with $k$ comparisons, we can distinguish between $2^k$ total outcomes. However, for $n$ elements, there are $n+1$ outcomes! For example, for an array of 7 elements there are 8 positions in which $x$ could be sorted:

Thus the natural array size for binary search is $2^k - 1$, and not $2^k$.

Let's take a look at the sixteen possible 4-bit integers in binary:

```
0 = 0000     8 = 1000
1 = 0001     9 = 1001
2 = 0010    10 = 1010
3 = 0011    11 = 1011
4 = 0100    12 = 1100
5 = 0101    13 = 1101
6 = 0110    14 = 1110
7 = 0111    15 = 1111
```

Notice how once the top bit of the integer is set, it stays set for all larger integers. And within each group with the same top bit, once the second most significant bit is set, it stays set for larger integers within that group. And so on for the third bit within each group with the same top two bits, ad infinitum. **In binary, within each group with identical top $t$ bits, the value of the $(t+1)$-th bit is monotonically increasing.**

Since our desired solution is the number of elements strictly less than $x$, we can rephrase it as finding the largest number $b$ such that $A[b-1] < x$, or $b = 0$ if no elements are less than $x$.

We can find the unique $b$ very efficiently by constructing it **directly**, bit by bit, using the above observation.

Let's assume that $A$ has length $n = 2^k - 1$. Then any possible answer $b$ fits in exactly $k$ bits. Since $A$ is sorted, if we find that $A[i-1] < x$, we know that $b \geq i$. Conversely, if that comparison fails, we know that $b < i$. Thus, using the above observation we can figure out whether the top bit of $b$ is set simply by testing $A[i-1] < x$ with $i = 2^{k-1}$.

Now we know what the top bit should be and set it accordingly, never changing it again. Using the above observation, this time within the group of integers with a given top bit, we know that if we set the second bit and find that $A[i-1] < x$ still holds, the second bit must be set, and if not it must be zero. We repeat this process bit by bit until we have found all bits of $b$, giving our answer!

Perhaps you are like me, and you are a visual thinker. Let us turn our earlier tree on its side and visually associate a binary $b$ value with each gap between the elements of our array:

The small red arrows indicate which element $A[b-1] < x$ would test for a given guess of $b$. Note that no element is associated with $b = 0$, as we can only end up with this value if all other tests failed, and thus we never have to test this value. A search for $5$ would end up testing ${b = {\color{red}1}00_2}$ (success, set bit), ${b = 1{\color{red}1}0_2}$ (fail, don't set bit) and ${b = 10{\color{red}1}_2}$ (success, set bit).

In C++ we get the following:

```
// Only works for n = 2^k - 1.
template<typename It, typename T, typename Cmp>
It lower_bound_2k1(It begin, It end, const T& value, Cmp comp) {
    size_t two_k = (end - begin) + 1;
    size_t b = 0;
    for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        if (comp(begin[(b | bit) - 1], value)) b |= bit;
    }
    return begin + b;
}
```

Note that we always do exactly $k$ comparisons, which is optimal.

However, there is a glaring issue: our original array might not have length $2^k - 1$. The easiest way to solve this is to add elements with value $\infty$ to the end, to pad the array out to $2^k - 1$ elements.

Instead of physically adding $\infty$ elements to the array, we can simply check whether the index lies within the original array, and if not, skip our test entirely, as it would fail (we would be testing whether $\infty < x$).

To pad our array out we want to find the smallest integer $k$ such that $2^k - 1 \geq n$, which implies $k \geq \log_2(n + 1)$, which after rounding up gives

$$k = \lceil \log_2(n + 1) \rceil = \lfloor \log_2(n) \rfloor + 1.$$

Alternatively, and certainly more enlightening, this can be understood as initializing `bit` in our loop to the highest set bit of $n$:

```
template<typename It, typename T, typename Cmp>
It lower_bound_pad(It begin, It end, const T& value, Cmp comp) {
    size_t n = end - begin;
    size_t b = 0;
    for (size_t bit = std::bit_floor(n); bit != 0; bit >>= 1) {
        size_t i = (b | bit) - 1;
        if (i < n && comp(begin[i], value)) b |= bit;
    }
    return begin + b;
}
```

In my opinion this is the most elegant binary search implementation there is.

The above works well, but introduces an index check before each array access. This means the compiler can't eliminate the branch here, lest we access out-of-bounds memory.

To solve this problem we use a similar trick to L. E. Shar's: we do an initial comparison with the middle element, then either test $2^k - 1$ elements at the start, or $2^k - 1$ elements at the end of the array. If the array size itself isn't of the form $2^k - 1$, these two subslices overlap in the middle. To completely cover our array (together with the element we initially compare with) we must have

$$(2^k - 1) + (2^k - 1) + 1 = 2^{k+1} - 1 \geq n$$

and thus we choose $k = \lceil \log_2(n + 1) - 1 \rceil = \lfloor \log_2(n) \rfloor$:

```
template<typename It, typename T, typename Cmp>
It lower_bound_overlap(It begin, It end, const T& value, Cmp comp) {
    size_t n = end - begin;
    if (n == 0) return begin;

    size_t two_k = std::bit_floor(n);
    if (comp(begin[n / 2], value)) begin = end - (two_k - 1);

    size_t b = 0;
    for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        if (comp(begin[(b | bit) - 1], value)) b |= bit;
    }
    return begin + b;
}
```

### Improving the efficiency

If our array doesn't have size $2^{k+1} - 1$, the subarrays overlap in the middle. This means that part of the subarray already eliminated by the initial comparison $A[n / 2] < x$ is being unnecessarily searched. Can we improve on this?

We can, if we choose two different sizes $2^l - 1$ and $2^r - 1$ for when we are respectively searching the start (left) or the end (right) of the array. Again, together with the initial element we compare with (which is now $A[2^l - 1] < x$) we find that we must have

$$(2^l - 1) + (2^r - 1) + 1 = 2^l + 2^r - 1 \geq n$$

to be able to handle an arbitrary size $n$. And of course, we must have $2^l - 1 \leq n$ and $2^r - 1 \leq n$ for our subarrays to fit. Let's find the optimal choice for $l, r$, which isn't trivial.

#### Cost analysis

What is the cost of a certain choice of $l, r$, assuming a uniform distribution over the $n + 1$ possible outcomes of the binary search? We know that after the initial comparison, for $2^l$ of those outcomes we use $l$ comparisons, and thus the rest must use $r$ comparisons, for an expected cost of

$$C = 1 + \frac{2^l}{n + 1}\cdot l + \frac{n + 1 - 2^l}{n + 1}\cdot r$$

We only really care about minimizing this cost, so we can throw out the additive constant $+1$ and the factor $1/(n+1)$, as they don't change the relative order:

$$C' = 2^l\cdot l + (n + 1 - 2^l)\cdot r$$

#### Optimizing $r$

Compare the cost $C'$ to the expression $2^l \cdot l + 2^{r}\cdot r$. That expression is entirely symmetric, so we could freely swap $l$ and $r$. But we know from our earlier array size inequality that $n + 1 - 2^l \leq 2^{r}$. Thus we can conclude that $l$ carries a greater weight in the cost than $r$, and therefore we can safely assume that $l \leq r$ in the optimal solution.

From this, plus the fact that $2^l + 2^{r} - 1 \geq n$, we can immediately deduce that $2^{r} \geq (n + 1) / 2$ by weakening the inequality with $2^l \to 2^r$, and thus rounding up to the nearest integer gives

$$r = \lceil \log_2(n+1) - 1 \rceil = \lfloor \log_2(n) \rfloor$$

Note that we can't choose $r$ any larger, nor smaller, and thus we have determined the optimal value for $r$.

#### Optimizing $l$

Let's reorder our relative cost $C'$ a bit:

$$C' = 2^l\cdot (l - r) + (n + 1)\cdot r$$

We can ignore the second term as a constant, as we are now trying to optimize $l$ given the optimal $r$. The function

$$f(l) = 2^l(l - r)$$

has derivative w.r.t. $l$

$$f'(l) = 2^l(\ln(2)(l - r) + 1)$$

with a single zero corresponding to the global minimum at $l = r - \frac{1}{\ln(2)} \approx r - 1.4427$.

Let's plug the two integers closest to this minimum into $f$:

$$f(r - 1) = 2^{r - 1}(r - 1 - r) = -2^{r-1}$$

$$f(r - 2) = 2^{r - 2}(r - 2 - r) = -2^{r-1}$$

Thus we find that both $l = r - 1$ and $l = r - 2$ have optimal cost. For simplicity we can just limit ourselves to $l = r - 1$, as it is equivalent but easier to satisfy $2^l + 2^r - 1 \geq n$ with. Speaking of that inequality, we can't always choose $l = r - 1$, as we are sometimes forced to choose $l = r$ by it.

### Putting it together

We found that $r = \lfloor \log_2(n) \rfloor$, and that

$$l = \begin{cases}
r-1 & \text{if } 2^r + 2^{r-1} - 1 \geq n \\
r & \text{otherwise}
\end{cases}$$

The condition $2^r + 2^{r-1} - 1 \geq n$ can be seen to be equivalent to "the $(r-1)$-th bit of $n$ is not set". And as $2^r - 2^{r-1} = 2^{r-1}$ we can isolate that bit, negate it, and subtract it from `two_r` to get our `two_l`:

```
template<typename It, typename T, typename Cmp>
It lower_bound_opt(It begin, It end, const T& value, Cmp comp) {
    size_t n = end - begin;
    if (n == 0) return begin;

    size_t two_r = std::bit_floor(n);
    size_t two_l = two_r - ((two_r >> 1) & ~n);
    bool use_r = comp(begin[two_l - 1], value);
    size_t two_k = use_r ? two_r : two_l;
    begin = use_r ? end - (two_r - 1) : begin;

    size_t b = 0;
    for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        if (comp(begin[(b | bit) - 1], value)) b |= bit;
    }
    return begin + b;
}
```

The somewhat odd use of ternary statements and `use_r` is there to convince the compiler to generate branchless code. We definitely lost some of the elegance we had before, but at least now we do the minimum number of comparisons we can do with our bitwise binary search while remaining branchless. And it is in fact better than L. E. Shar's original method, whose initial comparison $A[i - 1] < x$ uses $i = 2^{\lfloor \log_2(n) \rfloor}$, which is suboptimal as we've seen.

For some reason the standard implementation of `std::bit_floor`… sucks a bit. E.g. on x86-64, Clang 16.0 with `-O2` compiles this:

```
size_t bit_floor(size_t n) {
    if (n == 0) return 0;
    return std::bit_floor(n);
}

size_t bit_floor_manual(size_t n) {
    if (n == 0) return 0;
    return size_t(1) << (std::bit_width(n) - 1);
}
```

to this:

```
bit_floor(unsigned long):
        test    rdi, rdi
        je      .LBB0_1
        shr     rdi
        je      .LBB0_3
        bsr     rcx, rdi
        xor     rcx, 63
        jmp     .LBB0_5
.LBB0_1:
        xor     eax, eax
        ret
.LBB0_3:
        mov     ecx, 64
.LBB0_5:
        neg     cl
        mov     eax, 1
        shl     rax, cl
        ret

bit_floor_manual(unsigned long):
        bsr     rcx, rdi
        mov     eax, 1
        shl     rax, cl
        test    rdi, rdi
        cmove   rax, rdi
        ret
```

Yikes. Manual computation it is!

### Optimizing the tight loop

The astute observer might have noticed that in the following loop, we only ever set each bit in `b` at most once:

```
size_t b = 0;
for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
    if (comp(begin[(b | bit) - 1], value)) b |= bit;
}
return begin + b;
```

This means we could change the binary OR to a simple addition, which might optimize better in pointer calculations.

For the bitwise version we see the following tight loop on x86-64 with GCC:

```
.L7:
mov rsi, rdx
or rsi, rcx
cmp DWORD PTR [rax-4+rsi*4], r8d
cmovb rcx, rsi
shr rdx
jne .L7
```

With addition we see the following:

```
.L7:
lea rsi, [rdx+rcx]
cmp DWORD PTR [rax-4+rsi*4], r8d
cmovb rcx, rsi
shr rdx
jne .L7
```

In fact, when using addition we could eliminate the variable $b$ entirely and directly add to `begin` (similar to Skarupke's original version that sparked this article):

```
for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
    if (comp(begin[bit - 1], value)) begin += bit;
}
return begin;
```

However, I've found that some compilers, e.g. GCC on x86-64, will refuse to make this variant branchless. I hate how fickle compilers can be sometimes, and I wish compilers exposed not just the `likely`/`unlikely` attributes, but also an attribute that lets you mark something as `unpredictable` to nudge the compiler towards branchless techniques like CMOVs.

Instead of eliminating `b`, we can optimize the loop to only do a single addition explicitly, by shifting the `-1` into the value of `b` itself:

```
size_t b = -1;
for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
    if (comp(begin[b + bit], value)) b += bit;
}
return begin + (b + 1);
```

Yay for two's complement and integer overflow! This generated the best code on all platforms I've looked at, so I applied this pattern to all my implementations in the benchmark.

Let's compare all the variants we've made, both in comparisons and actual runtime. The latter I'll test on my Apple M1 2021 MacBook Pro, which is an ARM machine. Your mileage **will** vary on different machines, especially x86-64 machines, but I want this article to focus more on the algorithmic side of things rather than become a detailed study of the exact characteristics of branch mispredictions, cache misses, and how to get the compiler to generate branchless code for a variety of platforms.

The code for the benchmark below is available on my GitHub.

### Comparisons

To compare the average number of comparisons for size $n$, we can simply query, for each of the $n + 1$ outcomes, how many comparisons it takes to reach that outcome. We then average over all of these for a given $n$. The result is the following graph:

We see that `lower_bound_opt` does in fact do the fewest comparisons of all the branchless methods, following the optimal `lower_bound_std` more closely than `lower_bound_pad` or `lower_bound_skarupke`.

Across all sizes less than 256 we see the following average comparison counts, minus the optimal comparison count:

| Algorithm | Comparisons above optimal |
|---|---|
| `lower_bound_skarupke` | 1.17835 |
| `lower_bound_overlap` | 0.37250 |
| `lower_bound_pad` | 0.17668 |
| `lower_bound_opt` | 0.17238 |
| `lower_bound_std` | 0.00000 |

All our hard work finding the optimal split into subarrays only saved us ~0.2 comparisons on average for `lower_bound_opt` versus the much simpler `lower_bound_overlap`.

### Runtime (32-bit integers)

To benchmark runtime for a certain size $n$, I pre-generated one million random integers in the range $[0, n + 1]$. Then I record the time it takes to look them all up using the lower bound routine of interest, and calculate the average. I've found that eliminating `b` in our tight loop as mentioned above can either help or hurt performance depending on the compiler and architecture, so I've benchmarked both variants.

Using clang 13.0.0 with `g++ -O2 -std=c++20` we get the following:

I think this graph gives a fascinating insight into the branch predictor on the Apple M1. Most striking is the relatively poor performance of `lower_bound_opt`. Within each bracket of sizes $[2^k, 2^{k+1})$ it performs much worse than `lower_bound_overlap`, with a size-dependent slope, before suddenly dropping to consistently good performance.

This puzzled me for a while, and I triple-checked to see that `lower_bound_opt`

was really being compiled with branchless instructions. Only then did I realize

there was a hidden branch all along: the loop exit condition.

`lower_bound_overlap` always performs the same number of loop iterations, allowing the CPU to always correctly predict the loop exit, whereas `lower_bound_opt` tries to reduce the number of iterations it does to save comparisons. It turns out that for integers the cost of an extra iteration is much lower than risking a mispredict on the loop condition on the Apple M1.

If we also look at larger inputs we see that the above pattern keeps up

for quite a while until we start hitting sizes where cache effects become

a factor:

We also note that it truly is important for a binary search benchmark to look at

a variety of sizes, as you might reach rather different conclusions in

performance at $n = 2^{12}$ versus $n = 2^{12} \cdot 1.5$.

A commonly heard advice is to not use binary search for small arrays, but to

use a linear search instead. I find that not to be true on the Apple M1 for integers,

at least compared to my branchless binary search, when searching a runtime-sized

but otherwise fixed size array:

Note that a linear search must always incur at least one branch misprediction: on the loop exit condition. For a fixed size array `lower_bound_overlap` has zero branch mispredictions, including the loop exit.

### Runtime (strings)

To benchmark the performance on strings I copied the above benchmark, except

that I convert all integers to strings, zero-padded to a length of four.

I also reduced the number of samples to 300,000 per size, as the string

benchmark was significantly slower.

Using clang 13.0.0 with `g++ -O2 -std=c++20` we get the following:

Strings are a lot less interesting than integers in this case, as most of the branchless optimizations are moot. We find that initially the branchless versions are only slightly slower than `std::lower_bound` due to the extra comparisons needed. However, once we get to the larger-than-cache sizes `std::lower_bound` becomes significantly better as it can do speculative loads to reduce cache misses.

It seems that for strings the advice to use linear searches for small input

arrays doesn’t help that much, but doesn’t hurt either for $n \leq 9$,

on the Apple M1.

In my opinion the bitwise binary search is an elegant alternative to the

traditional binary search method, at the cost of ~0.17 to ~0.37 extra comparisons

on average. It can be implemented in a branchless manner, which can be

significantly faster when searching elements with a branchless comparison

operator.

In this article we found the following implementation to perform the best

on Apple M1 after all micro-optimizations are applied:

```
template<typename It, typename T, typename Cmp>
It lower_bound(It begin, It end, const T& value, Cmp comp) {
    size_t n = end - begin;
    if (n == 0) return begin;

    size_t two_k = size_t(1) << (std::bit_width(n) - 1);
    size_t b = comp(begin[n / 2], value) ? n - two_k : -1;
    for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        if (comp(begin[b + bit], value)) b += bit;
    }
    return begin + (b + 1);
}
```

However, when it comes to readability and elegance I still find the following method the most beautiful:

```
template<typename It, typename T, typename Cmp>
It lower_bound(It begin, It end, const T& value, Cmp comp) {
    size_t n = end - begin;
    size_t b = 0;
    for (size_t bit = std::bit_floor(n); bit != 0; bit >>= 1) {
        size_t i = (b | bit) - 1;
        if (i < n && comp(begin[i], value)) b |= bit;
    }
    return begin + b;
}
```