
Bitwise Binary Search: Elegant and Fast

2023-05-16 18:51:50

I recently read the article Beautiful Branchless Binary Search by Malte
Skarupke. In it they discuss the merits of the following snippet of C++ code
implementing a binary search:

template<typename It, typename T, typename Cmp>
It lower_bound_skarupke(It begin, It end, const T& value, Cmp comp) {
    size_t length = end - begin;
    if (length == 0) return end;

    size_t step = bit_floor(length);
    if (step != length && comp(begin[step], value)) {
        length -= step + 1;
        if (length == 0) return end;
        step = bit_ceil(length);
        begin = end - step;
    }

    for (step /= 2; step != 0; step /= 2) {
        if (comp(begin[step], value)) begin += step;
    }

    return begin + comp(*begin, value);
}

Frankly, while the ideas behind the algorithm are beautiful, I find the
implementation complex and hard to understand or prove correct. This isn't
meant as a jab at Malte Skarupke; I find almost all binary search
implementations hard to understand or prove correct.

In this article I'll present an alternative implementation based on similar
ideas but with a very different interpretation that is (in my opinion)
highly elegant and easy to understand, at least as far as binary searches
go. The resulting implementation also saves a comparison in almost every case
and ends up quite a bit smaller.

A brief history lesson

Feel free to skip this section if you're not interested in history, but I had
to find out whose shoulders we're standing on. This isn't only to give credit
where credit is due, but also to see if any useful details were lost in
translation.

Malte Skarupke says they learned about the above algorithm from
Alex Muscar's blog post.
Alex says they learned about the algorithm while reading Jon L. Bentley's book
Writing Efficient Programs (ISBN 0-13-970244-X). Jon Bentley writes:

If we need more speed then we should consult Knuth's [1973] definitive treatise on
searching. Section 6.2.1 discusses binary search, and Exercise 6.2.1-11 describes
a particularly efficient binary search program; […]

I own a hardcopy of the referenced book, Donald Knuth's The Art of Computer Programming
(also known as TAOCP), volume 3: Sorting and Searching. Exercise 6.2.1-11 is not
the correct exercise in my edition, but 12 and 13 are, which are exercises
referring to "Shar's method".

We have to scan chapter 6.2.1 to find the mentioned method. Finally, we find it
on page 416. First, as context, Knuth uses the following notation for binary search:

Algorithm U (Uniform binary search). Given a table of records
$R_1, R_2, \dots, R_N$ whose keys are in increasing order $K_1 < K_2 < \cdots < K_N$,
this algorithm searches for a given argument $K$. If $N$ is even, the
algorithm will sometimes refer to a dummy key $K_0$ that should be set to $-\infty$.
We assume that $N \geq 1$.

Now we can finally see Shar's method:

Another modification of binary search, suggested in 1971 by L. E. Shar,
will be still faster on some computers, because it is uniform after the first
step, and it requires no table.

The first step is to compare $K$ with $K_i$, where $i = 2^k$, $k = \lfloor \lg N\rfloor$.
If $K < K_i$, we use a uniform search with the $\delta$'s equal to $2^{k-1},
2^{k-2}, \dots, 1, 0$. On the other hand, if $K > K_i$ we reset $i$ to $i' = N + 1 - 2^l$
where $l = \lceil\lg(N - 2^k + 1)\rceil$, and pretend that the first
comparison was actually $K > K_{i'}$, using a uniform search with the
$\delta$'s equal to $2^{l-1}, 2^{l-2}, \dots, 1, 0$.

The $\delta$'s the first paragraph refers to can be understood as the 'step'
variable in the above C++ code. Overall, Skarupke's C++ code seems a fairly
faithful implementation of Shar's method as described by Knuth, except that
Knuth uses one-based indexing, which Skarupke's method does not take into account.
Knuth goes on to describe that Shar's method never makes more than $\lfloor \lg
N \rfloor + 1$ comparisons, which is one more than the minimum possible number
of comparisons.

To conclude the history lesson, I did look on Google Scholar, but I couldn't find a 1971 paper by L. E. Shar.
I assume the modification was described in private communication with Knuth.

The algorithm

Let us assume that we have a zero-indexed array $A$ of length $n$ that is in
ascending order: $A[0] \leq A[1] \leq \cdots \leq A[n-1]$. We want to find the
lower bound of some element $x$ in this array. That is the leftmost position
at which we could insert $x$ into the array while keeping it sorted. Alternatively
phrased, it is the number of elements strictly less than $x$ in the array.

A traditional binary search algorithm finds this number by keeping a range of
possible solutions, repeatedly cutting that range into two pieces and
selecting the piece that contains the solution. This tends to be tricky
to get right, as you must avoid overflow while computing the midpoint, and
you are dealing with several boundary conditions, both in your code as well as
in your correctness invariant.
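
To make the contrast concrete, here is a minimal sketch of such a traditional range-halving lower bound (my own illustration, not code from any of the cited sources), with the usual overflow-safe midpoint:

// Sketch of a traditional binary search, for contrast only.
// Invariant: the answer always lies in [lo, hi].
template<typename It, typename T, typename Cmp>
It lower_bound_traditional(It begin, It end, const T& value, Cmp comp) {
    size_t lo = 0, hi = end - begin;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;            // The naive (lo + hi) / 2 could overflow.
        if (comp(begin[mid], value)) lo = mid + 1;  // Solution lies right of mid.
        else hi = mid;                              // Solution lies at or left of mid.
    }
    return begin + lo;
}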

Before we begin with our solution that avoids this, we have to take a moment to
understand an important aspect of binary search. With each comparison we can
distinguish between two sets of outcomes, so with $k$ comparisons we can
distinguish between $2^k$ total outcomes. However, for $n$ elements there are
$n+1$ outcomes! For example, for an array of 7 elements there are 8 positions at
which $x$ could be sorted:

A full binary search tree of size 7.

Thus the natural array size for binary search is $2^k - 1$, and
not $2^k$.

Let's take a look at the sixteen possible 4-bit integers in binary:

 0 = 0000      8 = 1000 
 1 = 0001      9 = 1001 
 2 = 0010     10 = 1010 
 3 = 0011     11 = 1011 
 4 = 0100     12 = 1100 
 5 = 0101     13 = 1101 
 6 = 0110     14 = 1110 
 7 = 0111     15 = 1111 

Notice how, if the top bit of the integer is set, it stays set for all larger
integers. And within each group with the same top bit, when the second most
significant bit is set, it stays set for larger integers within that group.
And so on for the third bit within each group with the same top two bits, ad
infinitum. In binary, within each group with identical top $t$ bits, the
value of the $(t+1)$th bit is monotonically increasing.

Since our desired solution is the number of elements strictly less than $x,$
we can rephrase it as finding the largest number $b$ such that ${A[b-1] < x},$
or $b = 0$ if no elements are less than $x$.
We can find this unique $b$ very efficiently by constructing it directly, bit by bit,
using the above observation.

Let's assume that $A$ has length $n = 2^k - 1$. Then any possible answer $b$
fits exactly in $k$ bits. Since $A$ is sorted, if we find that $A[i-1] < x$, we
know that ${b \geq i}$. Conversely, if that comparison fails, we know that ${b <
i}.$ Thus, using the above observation, we can find out whether the top bit of $b$
is set simply by testing $A[i-1] < x$ with $i = 2^{k-1}$.

Now we know what the top bit should be and set it accordingly, never changing it
again. Using the above observation once more, this time within the group of
integers with a given top bit, we know that if we set the second bit and find
that $A[i-1] < x$ still holds, the second bit must be set, and if not it
must be zero. We repeat this process bit by bit until we have found all
bits of $b$, giving our answer!

Perhaps you are like me, and you are a visual thinker. Let us turn our
earlier tree on its side and visually associate a binary $b$
value with each gap between the elements of our array:

A full binary search tree of size 7 with binary indices for the gaps between them.

The small red arrows indicate which element $A[b-1] < x$ would test
for a given guess of $b$. Note that no element is associated with $b = 0$,
as we can only end up with this value if all other tests failed, and thus
we never have to test it. A search for $5$ would end up testing
${b = {\color{red}1}00_2}$ (success, set bit), ${b = 1{\color{red}1}0_2}$ (fail, don't set bit) and ${b = 10{\color{red}1}_2}$ (success, set bit).

In C++ we get the following:

// Only works for n = 2^k - 1.
template<typename It, typename T, typename Cmp>
It lower_bound_2k1(It begin, It end, const T& value, Cmp comp) {
    size_t two_k = (end - begin) + 1;
    size_t b = 0;
    for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        if (comp(begin[(b | bit) - 1], value)) b |= bit;
    }
    return begin + b;
}
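
As a quick sanity check, here is a hypothetical usage example (mine, not from the article) reproducing the search for $5$ illustrated above:

int main() {
    int a[7] = {0, 1, 2, 3, 4, 5, 6};
    // Tests b = 100 (a[3] < 5, success), b = 110 (a[5] < 5, fail),
    // then b = 101 (a[4] < 5, success), returning a + 5.
    auto less = [](int x, int y) { return x < y; };
    return lower_bound_2k1(a, a + 7, 5, less) - a;  // 5
}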

Note that we always do exactly $k$ comparisons, which is optimal.

However, there is a glaring issue: our original array might not have length
$2^k - 1$. The simplest way to solve this is to add elements with value
$\infty$ to the end, padding the array out to $2^k - 1$ elements.
Instead of physically adding $\infty$ elements to the array, we can simply
check whether the index lies within the original array, and if not skip our test
entirely, as it would fail (we would be testing whether $\infty < x$).

To pad our array out we want to find the smallest integer $k$ such that $2^k - 1 \geq n$,
which means $k \geq \log_2(n + 1)$, which after rounding up gives
$$k = \lceil \log_2(n + 1) \rceil = \lfloor \log_2(n) \rfloor + 1.$$

Alternatively, and certainly more enlightening, this can be understood as
initializing bit in our loop to the highest set bit of $n$:

template<typename It, typename T, typename Cmp>
It lower_bound_pad(It begin, It end, const T& value, Cmp comp) {
    size_t n = end - begin;
    size_t b = 0;
    for (size_t bit = std::bit_floor(n); bit != 0; bit >>= 1) {
        size_t i = (b | bit) - 1;
        if (i < n && comp(begin[i], value)) b |= bit;
    }
    return begin + b;
}

In my opinion this is the most elegant binary search implementation there is.


Becoming branchless

The above works well, but it introduces an index check before every array access.
This means the compiler cannot eliminate the branch here, lest we
access out-of-bounds memory.

To solve this problem we use a similar trick to L. E. Shar's: we do an initial
comparison with the middle element, then either search $2^k - 1$ elements at
the start, or $2^k - 1$ elements at the end of the array. If the array size
itself isn't of the form $2^k - 1$, these two subslices overlap in the middle.
To completely cover our array (together with the element we initially compare
with) we must have $$(2^k - 1) + (2^k - 1) + 1 = 2^{k+1} - 1 \geq n$$ and thus
we choose $k = \lceil \log_2(n + 1) - 1 \rceil = \lfloor \log_2 (n) \rfloor$:

template<typename It, typename T, typename Cmp>
It lower_bound_overlap(It begin, It end, const T& value, Cmp comp) {
    size_t n = end - begin;
    if (n == 0) return begin;

    size_t two_k = std::bit_floor(n);
    if (comp(begin[n / 2], value)) begin = end - (two_k - 1);

    size_t b = 0;
    for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        if (comp(begin[(b | bit) - 1], value)) b |= bit;
    }
    return begin + b;
}

Improving the efficiency

If our array doesn't have size $2^{k+1} - 1$, the subarrays overlap in the
middle. This means that parts of the subarrays already eliminated by the initial
comparison $A[n / 2] < x$ are being searched unnecessarily. Can we improve on
this?

We can, if we choose two different sizes $2^l - 1$ and $2^r - 1$ for when
we are respectively searching at the start (left) or end (right) of the array.
Again, together with the initial element we compare with (which is now
$A[2^l - 1] < x$) we find that we must have

$$(2^l - 1) + (2^r - 1) + 1 = 2^l + 2^r - 1 \geq n$$

to be able to handle an arbitrary size $n$. And naturally, we must have
$2^l - 1 \leq n$ and $2^r - 1 \leq n$ for our subarrays to fit. Let's find the
optimal choice for $l, r$, which isn't trivial.

Cost analysis

What is the cost of a certain choice of $l, r$, assuming a uniform distribution
over the $n + 1$ possible outcomes of the binary search? We know that after
the initial comparison, for $2^l$ of those outcomes we use $l$ more comparisons,
and thus the rest must use $r$ more comparisons, for an expected cost of

$$C = 1 + \frac{2^l}{n + 1}\cdot l + \frac{n + 1 - 2^l}{n + 1}\cdot r.$$

We only really care about minimizing this cost, so we can throw out the additive
constant $+1$ and the factor $1/(n+1)$, as they don't change the relative order:

$$C' = 2^l\cdot l + (n + 1 - 2^l)\cdot r.$$
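
To make this concrete, here is a small helper (my own sketch, not from the article's benchmark code) that evaluates the expected cost $C$ for a candidate split:

// Expected number of comparisons for a given split (l, r) and size n,
// assuming a uniform distribution over the n + 1 outcomes: after the
// initial comparison, 2^l outcomes take l more comparisons and the
// remaining n + 1 - 2^l outcomes take r more.
double expected_cost(size_t l, size_t r, size_t n) {
    double two_l = double(size_t(1) << l);
    return 1.0 + (two_l * double(l) + (double(n) + 1 - two_l) * double(r)) / (double(n) + 1);
}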

Optimizing $r$

Compare the cost $C'$ to the expression $2^l \cdot l + 2^{r}\cdot r$. That
expression is entirely symmetric, so we could freely swap $l$ and $r$. But
we know from our earlier array size inequality that $n + 1 - 2^l \leq 2^{r}$.
Thus we can conclude that $l$ carries a greater weight in the cost than $r$, and
therefore we can safely assume that $l \leq r$ in an optimal solution.

From this, plus the fact that $2^l + 2^{r} - 1 \geq n$,
we can immediately deduce that $2^{r} \geq (n + 1) / 2$ by weakening the
inequality with $2^l \to 2^r$ (since $l \leq r$ we have $2^r + 2^r \geq 2^l + 2^r \geq n + 1$),
and thus rounding up to the nearest integer gives

$$r = \lceil\log_2(n+1) - 1\rceil = \lfloor\log_2(n)\rfloor.$$

Note that we can't choose $r$ any larger, nor smaller, and thus we have
determined the optimal value of $r$.

Optimizing $l$

Let's reorder our relative cost $C'$ a bit:

$$C' = 2^l\cdot (l - r) + (n + 1)\cdot r.$$

We can ignore the second term as a constant, as we are now trying to optimize $l$
given the optimal $r$. The function

$$f(l) = 2^l(l - r)$$
has derivative with respect to $l$
$$f'(l) = 2^l(\ln(2)(l - r) + 1),$$
with a single zero corresponding to the global minimum at $r - \frac{1}{\ln(2)} \approx r - 1.4427$.
Let's plug the two integers closest to this minimum into $f$:

$$f(r - 1) = 2^{r - 1}(r - 1 - r) = -2^{r-1}$$
$$f(r - 2) = 2^{r - 2}(r - 2 - r) = -2^{r-1}$$

Thus we find that both $l = r - 1$ and $l = r - 2$ have optimal cost. For
simplicity we can restrict ourselves to $l = r - 1$, as it is equivalent but makes
the constraint $2^l + 2^r - 1 \geq n$ easier to satisfy. Speaking of that constraint,
we can't always choose $l = r - 1$, as it sometimes forces us to choose $l = r$.
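
If you would rather not trust the calculus, a brute-force check (again my own sketch) over all feasible splits confirms this for small $n$:

// Brute-force the split (l, r) minimizing C' = 2^l * l + (n + 1 - 2^l) * r
// subject to 2^l + 2^r - 1 >= n, 2^l - 1 <= n and 2^r - 1 <= n. For every
// n >= 1 this reports r = floor(log2(n)) with l = r - 1 or l = r.
std::pair<size_t, size_t> best_split(size_t n) {
    size_t best_l = 0, best_r = 0, best_cost = size_t(-1);
    for (size_t l = 0; (size_t(1) << l) - 1 <= n; ++l) {
        for (size_t r = 0; (size_t(1) << r) - 1 <= n; ++r) {
            if ((size_t(1) << l) + (size_t(1) << r) - 1 < n) continue;
            size_t cost = (size_t(1) << l) * l + (n + 1 - (size_t(1) << l)) * r;
            if (cost < best_cost) { best_cost = cost; best_l = l; best_r = r; }
        }
    }
    return {best_l, best_r};
}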

Putting it together

We found that $r = \lfloor \log_2(n) \rfloor$, and that

$$l = \begin{cases}
r-1&\text{if }2^r + 2^{r-1} - 1 \geq n,\\
r&\text{otherwise}.
\end{cases}$$

The condition $2^r + 2^{r-1} - 1 \geq n$ can be seen to be equivalent to "the
$(r-1)$th bit of $n$ is not set". And as $2^r - 2^{r-1} = 2^{r-1}$, we can isolate
that bit, negate it, and subtract it from two_r to get our two_l:

template<typename It, typename T, typename Cmp>
It lower_bound_opt(It begin, It end, const T& value, Cmp comp) {
    size_t n = end - begin;
    if (n == 0) return begin;

    size_t two_r = std::bit_floor(n);
    size_t two_l = two_r - ((two_r >> 1) & ~n);
    bool use_r = comp(begin[two_l - 1], value);
    size_t two_k = use_r ? two_r : two_l;
    begin = use_r ? end - (two_r - 1) : begin;

    size_t b = 0;
    for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        if (comp(begin[(b | bit) - 1], value)) b |= bit;
    }
    return begin + b;
}

The somewhat odd use of ternary expressions and use_r is to convince the
compiler to generate branchless code. We definitely lost some of the elegance we
had before, but at least we now do the minimum number of comparisons we can do
with our bitwise binary search while remaining branchless. And it is in fact better
than L. E. Shar's original method, whose initial comparison $A[i - 1] < x$
uses $i = 2^{\lfloor \log_2 (n) \rfloor}$, which is suboptimal as we've
seen.

For some reason the standard implementation of
std::bit_floor sucks
a bit. E.g. on x86-64, Clang 16.0 with
-O2 compiles this:

size_t bit_floor(size_t n) {
    if (n == 0) return 0;
    return std::bit_floor(n);
}

size_t bit_floor_manual(size_t n) {
    if (n == 0) return 0;
    return size_t(1) << (std::bit_width(n) - 1);
}

to this:

bit_floor(unsigned long):
        test    rdi, rdi
        je      .LBB0_1
        shr     rdi
        je      .LBB0_3
        bsr     rcx, rdi
        xor     rcx, 63
        jmp     .LBB0_5
.LBB0_1:
        xor     eax, eax
        ret
.LBB0_3:
        mov     ecx, 64
.LBB0_5:
        neg     cl
        mov     eax, 1
        shl     rax, cl
        ret

bit_floor_manual(unsigned long):
        bsr     rcx, rdi
        mov     eax, 1
        shl     rax, cl
        test    rdi, rdi
        cmove   rax, rdi
        ret

Yikes. Manual computation it is!

Optimizing the tight loop

The astute observer might have noticed that in the following loop, we only
ever set each bit of b at most once:

size_t b = 0;
for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
    if (comp(begin[(b | bit) - 1], value)) b |= bit;
}

return begin + b;

This means we could change the binary OR to a simple addition, which might
optimize better in pointer calculations.

For the bitwise version we see the following tight loop for the above,
on x86-64 with GCC:

.L7:
        mov     rsi, rdx
        or      rsi, rcx
        cmp     DWORD PTR [rax-4+rsi*4], r8d
        cmovb   rcx, rsi
        shr     rdx
        jne     .L7

With addition we see the following:

.L7:
        lea     rsi, [rdx+rcx]
        cmp     DWORD PTR [rax-4+rsi*4], r8d
        cmovb   rcx, rsi
        shr     rdx
        jne     .L7

In fact, when using addition we could eliminate the variable $b$ entirely,
and directly add to begin (similar to Skarupke's original version that sparked
this article):


for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
    if (comp(begin[bit - 1], value)) begin += bit;
}
return begin;

However, I've found that some compilers, e.g. GCC on x86-64, refuse to make
this variant branchless. I hate how fickle compilers can be sometimes, and I
wish compilers exposed not just the
likely/unlikely
attributes, but also an attribute that lets you mark a condition as unpredictable,
to nudge the compiler towards branchless techniques such as CMOVs.

Instead of eliminating b, we can optimize the loop to do only a single explicit
addition, by moving the -1 into the value of b itself:

size_t b = -1;
for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
    if (comp(begin[b + bit], value)) b += bit;
}
return begin + (b + 1);

Yay for two's complement and integer overflow! This generated the best code on
all platforms I've looked at, so I applied this pattern to all my
implementations in the benchmark.

Benchmark

Let's compare all the variants we've made, both in comparison counts and in actual
runtime. The latter I'll test on my 2021 Apple M1 MacBook Pro, which is an ARM
machine. Your mileage will vary on different machines, especially x86-64
machines, but I want this article to focus more on the algorithmic side of
things rather than become a detailed study of the exact characteristics of
branch mispredictions, cache misses, and how to get the compiler to generate
branchless code for a variety of platforms.

The code for the benchmark below is available on my GitHub.

Comparisons

To test the average number of comparisons for size $n$ we can simply query, for
each of the $n + 1$ outcomes, how many comparisons it takes to reach that outcome.
We then average these for a given $n$. The result is the following
graph:

Comparison count graph.

We see that lower_bound_opt does in fact do the fewest comparisons of
all the branchless methods, following the optimal lower_bound_std
more closely than lower_bound_pad or lower_bound_skarupke.

Across all sizes less than 256 we see the following average comparison counts,
minus the optimal comparison count:

Algorithm              Comparisons above optimal
lower_bound_skarupke   1.17835
lower_bound_overlap    0.37250
lower_bound_pad        0.17668
lower_bound_opt        0.17238
lower_bound_std        0.00000
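
One way to reproduce these counts (my own sketch, not the benchmark code from the repository) is to wrap the comparator in a counter and query every possible outcome:

// Average comparison count for size n: query all n + 1 outcomes with a
// counting comparator. Using odd values 1, 3, 5, ... as array elements
// and even values as queries hits every lower bound position exactly once.
double average_comparisons(size_t n) {
    std::vector<int> a(n);
    for (size_t i = 0; i < n; ++i) a[i] = 2 * int(i) + 1;
    size_t count = 0;
    auto counting_less = [&count](int x, int y) { ++count; return x < y; };
    for (int x = 0; x <= 2 * int(n); x += 2)
        lower_bound_opt(a.begin(), a.end(), x, counting_less);
    return double(count) / double(n + 1);
}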

All our hard work finding the optimal split into subarrays only saved us ~0.2
comparisons on average for lower_bound_opt versus the much simpler
lower_bound_overlap.

Runtime (32-bit integers)

To benchmark the runtime for a certain size $n$ I pre-generated one million random
integers in the range $[0, n + 1]$. Then I record the time it takes to look them
all up using the lower bound routine of interest, and compute the average. As
I've found that eliminating b in our tight loop as mentioned above can either
help or hurt performance depending on the compiler / architecture, I've
benchmarked both variants.
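
Roughly, the harness looks like this (a simplified sketch under my own assumptions; the real benchmark with warmup and repetitions is in the linked repository):

// Average nanoseconds per lookup for one routine at a given size n.
double bench_ns_per_lookup(size_t n) {
    std::vector<uint32_t> a(n);
    for (size_t i = 0; i < n; ++i) a[i] = uint32_t(i);

    std::mt19937_64 rng(42);
    std::uniform_int_distribution<uint32_t> dist(0, uint32_t(n + 1));
    std::vector<uint32_t> queries(1'000'000);
    for (auto& q : queries) q = dist(rng);

    uint64_t sink = 0;  // Accumulate results so the loop isn't optimized out.
    auto t0 = std::chrono::steady_clock::now();
    for (uint32_t q : queries)
        sink += lower_bound_overlap(a.begin(), a.end(), q, std::less<uint32_t>{}) - a.begin();
    auto t1 = std::chrono::steady_clock::now();
    volatile uint64_t keep = sink; (void)keep;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / queries.size();
}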

Compiling with g++ -O2 -std=c++20 (Clang 13.0.0) we get the following:

Performance graph.

I think this graph gives a fascinating insight into the branch predictor on the
Apple M1. Most striking is the relatively poor performance of lower_bound_opt.
Within each bracket of sizes $[2^k, 2^{k+1})$ it performs much worse
than lower_bound_overlap, with a size-dependent slope, before suddenly dropping to a
consistently good performance.

This puzzled me for a while, and I triple-checked to see that lower_bound_opt
was really being compiled with branchless instructions. Only then did I realize
there was a hidden branch all along: the loop exit condition.
lower_bound_overlap always performs the same number of loop iterations,
allowing the CPU to always correctly predict the loop exit, whereas
lower_bound_opt tries to reduce the number of iterations it does to save
comparisons. It turns out that for integers the cost of an extra iteration is
much lower than risking a mispredict on the loop condition on the Apple M1.

If we also look at larger inputs we see that the above pattern keeps up
for quite a while until we start hitting sizes where cache effects become
a factor:

Larger performance graph.

We also note that it truly is important for a binary search benchmark to look at
a variety of sizes, as you might reach rather different conclusions in
performance at $n = 2^{12}$ versus $n = 2^{12} \cdot 1.5$.

A common piece of advice is to not use binary search for small arrays, but to
use a linear search instead. I find that not to be true on the Apple M1 for integers,
at least compared to my branchless binary search, when searching a runtime-sized
but otherwise fixed-size array:

Smaller performance graph with linear search.

Note that a linear search must always incur at least one branch misprediction:
on the loop exit condition. For a fixed size array lower_bound_overlap has
zero branch mispredictions, including the loop exit.
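
For reference, the linear search I would compare against looks something like this (my sketch; the benchmarked code may differ):

// Linear search lower bound: scan until the first element not less
// than value. The loop exit is data-dependent, so it mispredicts once.
template<typename It, typename T, typename Cmp>
It lower_bound_linear(It begin, It end, const T& value, Cmp comp) {
    while (begin != end && comp(*begin, value)) ++begin;
    return begin;
}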

Runtime (strings)

To benchmark the performance on strings I copied the above benchmark, except
that I convert all integers to strings, zero-padded to a length of four.
I also reduced the number of samples to 300,000 per size, as the string
benchmark was significantly slower.
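
The conversion is simple; something along these lines (my own sketch, not necessarily the benchmark's exact code):

// Zero-pad an integer to (at least) four characters, e.g. 42 -> "0042".
std::string to_padded(uint32_t x) {
    std::string s = std::to_string(x);
    while (s.size() < 4) s.insert(s.begin(), '0');
    return s;
}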

Compiling with g++ -O2 -std=c++20 (Clang 13.0.0) we get the following:

Large performance graph for strings.

Strings are a lot less interesting than integers in this case, as most of the
branchless optimizations are moot. We find that initially the branchless
versions are only slightly slower than std::lower_bound due to the extra
comparisons needed. However once we get to the larger-than-cache sizes
std::lower_bound becomes significantly better as it can do speculative loads
to reduce cache misses.

Smaller performance graph with linear search for strings.

It seems that for strings the advice to use linear searches for small input
arrays doesn't help that much, but doesn't hurt either for $n \leq 9$,
on the Apple M1.

Conclusion

In my opinion the bitwise binary search is an elegant alternative to the
traditional binary search method, at the cost of ~0.17 to ~0.37 extra comparisons
on average. It can be implemented in a branchless manner, which can be
significantly faster when searching elements with a branchless comparison
operator.

In this article we found the following implementation to perform the best
on Apple M1 after all micro-optimizations are applied:

template<typename It, typename T, typename Cmp>
It lower_bound(It begin, It end, const T& value, Cmp comp) {
    size_t n = end - begin;
    if (n == 0) return begin;

    size_t two_k = size_t(1) << (std::bit_width(n) - 1);
    size_t b = comp(begin[n / 2], value) ? n - two_k : -1;
    for (size_t bit = two_k >> 1; bit != 0; bit >>= 1) {
        if (comp(begin[b + bit], value)) b += bit;
    }
    return begin + (b + 1);
}
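
A quick usage example (mine) to show that the interface matches std::lower_bound with an explicit comparator:

int main() {
    int v[] = {1, 2, 4, 4, 8};
    auto less = [](int x, int y) { return x < y; };
    // The leftmost position where 4 could be inserted while keeping v sorted.
    return lower_bound(v, v + 5, 4, less) - v;  // 2
}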

However, when it comes to readability and elegance I still find the following
method the most beautiful:

template<typename It, typename T, typename Cmp>
It lower_bound(It begin, It end, const T& value, Cmp comp) {
    size_t n = end - begin;
    size_t b = 0;
    for (size_t bit = std::bit_floor(n); bit != 0; bit >>= 1) {
        size_t i = (b | bit) - 1;
        if (i < n && comp(begin[i], value)) b |= bit;
    }
    return begin + b;
}
