The World’s Smallest Hash Table
This December I once again did the Advent of Code, in Rust. If you are interested, my solutions are on GitHub. I wanted to highlight one particular solution to the day 2 problem, as it is both optimized completely beyond the point of reason yet contains a useful technique. For simplicity we're only going to do part 1 of the day 2 problem here, but the very same techniques apply to part 2.
We're going to start off slow, but stick around, because at the end you should have an idea of what on earth this function is doing, how it works, how you can make one, and why it is the world's smallest hash table:
pub fn phf_shift(x: u32) -> u8 {
let shift = x.wrapping_mul(0xa463293e) >> 27;
((0x824a1847u32 >> shift) & 0b11111) as u8
}
We receive a file where each line contains A, B, or C, followed by a space, followed by X, Y, or Z. These are to be understood as choices in a game of rock-paper-scissors as such:
A = X = Rock
B = Y = Paper
C = Z = Scissors
The first letter (A/B/C) indicates the choice of our opponent, the second letter (X/Y/Z) indicates our choice. We then compute a score, which has two components:
- If we picked Rock we get 1 point, if we picked Paper we get 2 points, and 3 points if we picked Scissors.
- If we lose we gain 0 points, if we draw we gain 3 points, and if we win we get 6 points.
For instance, if our input file looks as such:
A Y
B X
C Z
Our total score would be (2 + 6) + (1 + 0) + (3 + 3) = 15.
A sane solution would verify that indeed our input lines have the format [ABC] [XYZ] before extracting these two letters. After converting these letters to integers 0, 1, 2 by subtracting either the ASCII code for 'A' or 'X' respectively, we can immediately calculate the first component of our score as 1 + ours.
The second component is more involved, but can be elegantly solved using modular arithmetic. Note that if Rock = 0, Paper = 1, Scissors = 2 then we always have that choice $k + 1 \bmod 3$ beats $k$. Alternatively, $k$ beats $k - 1$, modulo 3.
If we divide the number of points that Advent of Code expects for a loss/draw/win by three we find that a loss is $0$ points, a draw is $1$ point and a win is $2$ points. From these observations we can derive the following modular equivalence:
$$1 + \mathrm{ours} - \mathrm{theirs} \equiv \mathrm{points} \pmod 3.$$
To see that it is correct, note that if we drew, ours - theirs is zero and we correctly get one point. If we add one to ours we change from a draw to a win, and points becomes congruent with $2$ as desired. Symmetrically, if we add one to theirs we change from a draw to a loss, and points once again becomes congruent with $0$ as desired.
Translated into code we find that our total score is
1 + ours + 3 * ((1 + ours + (3 - theirs)) % 3)
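As a concrete illustration, here is a minimal sketch of my own (not the author's full solution), assuming each line has already been validated to match [ABC] [XYZ]:
fn line_score(line: &str) -> u32 {
    let b = line.as_bytes();
    let theirs = (b[0] - b'A') as u32; // 0 = Rock, 1 = Paper, 2 = Scissors
    let ours = (b[2] - b'X') as u32;
    1 + ours + 3 * ((1 + ours + (3 - theirs)) % 3)
}

fn main() {
    // The example input from above: (2 + 6) + (1 + 0) + (3 + 3) = 15.
    let total: u32 = ["A Y", "B X", "C Z"].into_iter().map(line_score).sum();
    assert_eq!(total, 15);
}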
We found a neat closed form, but if we were even slightly less fortunate it might not have existed. A more general method for solving similar problems would be nice. In this particular instance that is possible. There are only $3 \times 3 = 9$ input pairs, so we can simply hardcode the answer for each scenario:
let answers = HashMap::from([
("A X", 4),
("A Y", 8),
("A Z", 3),
("B X", 1),
("B Y", 5),
("B Z", 9),
("C X", 7),
("C Y", 2),
("C Z", 6),
]);
Now we can simply get our answer using answers[input]. This might feel like a bit of a non-answer, but it is a legitimate technique. We have a mapping of inputs to outputs, and sometimes the simplest or fastest (in either programmer time or execution time) solution is to write it out explicitly and completely rather than compute the answer at runtime with an algorithm.
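As a quick usage sketch of my own (assuming the answers map above is in scope), summing the example input from before gives 8 + 1 + 6 = 15:
// Each lookup returns the hardcoded score for that line.
let total: i32 = ["A Y", "B X", "C Z"].iter().map(|line| answers[*line]).sum();
assert_eq!(total, 15);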
The above solution works fine, but it pays a price for its genericity. It uses a full-fledged string hash algorithm, and lookups involve the full codepath for hash table lookups (most notably hash collision resolution).
We can drop the genericity for a big boost in speed if we were to use a perfect hash function. A perfect hash function is a specially constructed hash function on some set $S$ of values such that each value in the set maps to a different hash output, without collisions. It is important to note that we only care about its behavior for inputs in the set $S$, with a complete disregard for other inputs.
A minimal perfect hash function is one that also maps the inputs to a dense range of integers $[0, 1, \dots, |S|-1]$. This can be very useful because you can then directly use the hash function output to index a lookup table. This effectively creates a hash table that maps set $S$ to anything you want. However, strict minimality is not necessary for this as long as you are okay with wasting some of the space in your lookup table.
There are fully generic methods for constructing (minimal) perfect hash functions, such as the “Hash, displace and compress” algorithm by Belazzougui et al., which is implemented in the phf crate. However, they tend to use lookup tables to construct the hash itself. For small inputs where speed and size are absolutely critical I've had good success just trying stuff. This may sound vague (because it is), so let me walk you through some examples.
Reading the input
As a bit of a hack we can note that each line of our input from the Advent of Code consists of exactly four bytes. One letter for our opponent's choice, a space, our choice, and a newline byte. So we can simply read our input as a u32, which simplifies the hash construction immensely compared to dealing with strings.
For example, consulting the ASCII table we find that A has ASCII code 0x41, space maps to 0x20, X has code 0x58 and the newline symbol has code 0x0a, so the input "A X\n" can also simply be viewed as the integer 0x0a582041 if you are on a little-endian machine. If you are confused why 0x41 is in the last place, remember that we humans write numbers with the least significant digit on the right as a convention.
Note that on a big-endian machine the order of bytes in a u32 is flipped, so reading these four bytes into an integer would result in the value 0x4120580a. Calling u32::from_le_bytes converts four bytes assumed to be little-endian to the native integer representation by swapping the bytes on a big-endian machine and doing nothing on a little-endian machine. Virtually all modern CPUs are little-endian however, so it is generally a good idea to write your code such that the little-endian path is fast and the big-endian path involves a conversion step, if a conversion step can't be avoided.
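Here is a minimal sketch of my own of what reading the input this way could look like in Rust, under the four-bytes-per-line assumption above:
fn lines_as_u32(input: &[u8]) -> impl Iterator<Item = u32> + '_ {
    // Each line is exactly 4 bytes: their choice, a space, our choice, a newline.
    input
        .chunks_exact(4)
        .map(|chunk| u32::from_le_bytes(chunk.try_into().unwrap()))
}

fn main() {
    let input = b"A X\nB Y\nC Z\n";
    let first = lines_as_u32(input).next().unwrap();
    assert_eq!(first, 0x0a582041); // "A X\n" viewed as a little-endian u32.
}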
Doing this for all inputs gives us the following desired integer → integer mapping:
Input   LE u32      Answer
---------------------------
A X     0xa582041   4
A Y     0xa592041   8
A Z     0xa5a2041   3
B X     0xa582042   1
B Y     0xa592042   5
B Z     0xa5a2042   9
C X     0xa582043   7
C Y     0xa592043   2
C Z     0xa5a2043   6
Example constructions
When I said I just try stuff, I mean it. Let's load our mapping into Python and write a test:
inputs = [0xa582041, 0xa592041, 0xa5a2041, 0xa582042,
          0xa592042, 0xa5a2042, 0xa582043, 0xa592043, 0xa5a2043]
answers = [4, 8, 3, 1, 5, 9, 7, 2, 6]

def is_phf(h, inputs):
    return len({h(x) for x in inputs}) == len(inputs)
There are 9 inputs, so maybe we get lucky and get a minimal perfect hash function right away:
>>> [x % 9 for x in inputs]
[0, 7, 5, 1, 8, 6, 2, 0, 7]
Alas, there are collisions. What if we don't need to be perfectly minimal?
>>> next(m for m in range(9, 2**32)
... if is_phf(lambda x: x % m, inputs))
12
>>> [x % 12 for x in inputs]
[9, 1, 5, 10, 2, 6, 11, 3, 7]
That's not too bad! Only three elements of wasted space. We can make our first perfect hash table by placing the answers in the correct spots:
def make_lut(h, inputs, answers):
    assert is_phf(h, inputs)
    lut = [0] * (1 + max(h(x) for x in inputs))
    for (x, ans) in zip(inputs, answers):
        lut[h(x)] = ans
    return lut

>>> make_lut(lambda x: x % 12, inputs, answers)
[0, 8, 5, 2, 0, 3, 9, 6, 0, 4, 1, 7]
Giving the simple mapping:
const LUT: [u8; 12] = [0, 8, 5, 2, 0, 3, 9, 6, 0, 4, 1, 7];
pub fn answer(x: u32) -> u8 {
LUT[(x % 12) as usize]
}
Compressing the table
We stopped here at the first modulus that works, which is really fine in this case because only three bytes of wasted space is pretty good. But what if we didn't get so lucky? We have to keep looking. Even though modulo $m$ has $[0, m)$ as its codomain, when applied to our set of inputs its image might span a smaller subset. Let's inspect some:
>>> [(m, max(x % m for x in inputs))
... for m in range(1, 30)
... if is_phf(lambda x: x % m, inputs)]
[(12, 11), (13, 11), (19, 18), (20, 19), (21, 17), (23, 22),
(24, 19), (25, 23), (26, 21), (27, 25), (28, 19), (29, 16)]
Sadly but also logically, there is an upwards trend of the maximum index as you increase the modulus. But $13$ also seems promising, let's take a look:
>>> [x % 13 for x in inputs]
[3, 6, 9, 4, 7, 10, 5, 8, 11]
>>> make_lut(lambda x: x % 13, inputs, answers)
[0, 0, 0, 4, 1, 7, 8, 5, 2, 3, 9, 6]
Well, well, well, aren't we lucky? The first three indices are unused, so we can shift all the others back and get a minimal perfect hash function!
const LUT: [u8; 9] = [4, 1, 7, 8, 5, 2, 3, 9, 6];
pub fn answer(x: u32) -> u8 {
LUT[(x % 13 - 3) as usize]
}
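As a quick sanity check of my own (not from the original post), we can assert this minimal perfect hash table against the full mapping from earlier, assuming the answer function above is in scope:
fn main() {
    // The full input → answer mapping from the table above.
    let table: [(u32, u8); 9] = [
        (0xa582041, 4), (0xa592041, 8), (0xa5a2041, 3),
        (0xa582042, 1), (0xa592042, 5), (0xa5a2042, 9),
        (0xa582043, 7), (0xa592043, 2), (0xa5a2043, 6),
    ];
    for (x, expected) in table {
        assert_eq!(answer(x), expected);
    }
}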
In my experience with making a bunch of similar mappings in the past, you'd be surprised how often you get lucky, as long as your mapping isn't too large. As you add more 'things to try' to your toolbox, you also have more opportunities to get lucky.
Fixing near-misses
Another thing to try is fixing near-misses. For example, let's take another look at our original naive attempt:
>>> [x % 9 for x in inputs]
[0, 7, 5, 1, 8, 6, 2, 0, 7]
Only the last two inputs give a collision. So a fairly naive but workable way to resolve these collisions is to move them to a different index:
>>> [x % 9 + 3*(x == 0xa592043) - 3*(x == 0xa5a2043) for x in inputs]
[0, 7, 5, 1, 8, 6, 2, 3, 4]
Oh look, we got slightly lucky again: both are using the constant 3, which can be factored out. It can be quite addictive to try out various permutations of operations and tweaks to find these (minimal) perfect hash functions using as few operations as possible.
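Put together, here is a hedged Rust sketch of my own of this fixed near-miss hash; the 9-entry table is derived with make_lut above, not taken from the post, and like all the perfect hash functions here it is only meaningful for the nine valid inputs:
const NEAR_LUT: [u8; 9] = [4, 1, 7, 2, 6, 3, 9, 8, 5];

pub fn answer_nearmiss(x: u32) -> u8 {
    let bump = (x == 0xa592043) as u32; // "C Y" collided, move it up by 3
    let dip = (x == 0xa5a2043) as u32;  // "C Z" collided, move it down by 3
    NEAR_LUT[(x % 9 + 3 * bump - 3 * dip) as usize]
}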
So far we've just been using the modulo operator to reduce our input space to a much smaller one. However, integer division/modulo is rather slow on most processors. If we take a look at Agner Fog's instruction tables we see that the 32-bit DIV instruction has a latency of 9-12 cycles on AMD Zen 3 and 12 cycles on Intel Ice Lake. However, we don't need a fully generic division instruction, since our divisor is constant here. Let's take a quick look at what the compiler does for mod 13:
pub fn mod13(x: u32) -> u32 {
x % 13
}
example::mod13:
mov eax, edi
mov ecx, edi
imul rcx, rcx, 1321528399
shr rcx, 34
lea edx, [rcx + 2*rcx]
lea ecx, [rcx + 4*rdx]
sub eax, ecx
ret
It translates the modulo operation into a multiplication with some shifts / adds / subtractions instead.
To see how that works, let's first consider the most magical part: the multiplication by $1321528399$ followed by the right shift of $34$. That magical constant is actually $\lceil 2^{34} / 13 \rceil$, which means it is computing
$$q = \left\lfloor \frac{x \cdot \lceil 2^{34} / 13 \rceil}{2^{34}}\right\rfloor = \lfloor x / 13 \rfloor.$$
To prove that this is in fact correct we note that $2^{34} + 3$ is divisible by $13$, allowing us to split the division into the correct result plus an error term:
$$\frac{x \cdot \lceil 2^{34} / 13 \rceil}{2^{34}} = \frac{x \cdot (2^{34} + 3) / 13}{2^{34}} = x / 13 + \frac{3x}{13\cdot 2^{34}}.$$
Then we inspect the error term and substitute $x = 2^{32}$ as an upper bound to see it never impacts the result after flooring:
$$\frac{3x}{13\cdot 2^{34}} \leq \frac{3 \cdot 2^{32}}{13\cdot 2^{34}} \leq \frac{3}{13 \cdot 4} < 1/13.$$
For more context and references I'd suggest “Integer division by constants: optimal bounds” by Lemire et al.
After computing $q = \lfloor x/13\rfloor$ it then computes the actual modulo we want as $x - 13q$ using the identity
$$x \bmod m = x - \lfloor x / m \rfloor \cdot m.$$
It avoids using another (relatively) expensive integer multiplication by using the lea instruction, which can compute a + k*b, where k can be a constant 1, 2, 4, or 8. This is how it computes $13q$:
Instruction              Translation     Effect
lea edx, [rcx + 2*rcx]   t := q + 2*q    t = 3q
lea ecx, [rcx + 4*rdx]   o := q + 4*t    o = (q + 4*3q) = 13q
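To convince ourselves, here is a small sanity check of my own (not from the post) that this multiply-shift-subtract sequence really computes x % 13 for any u32:
fn mod13_magic(x: u32) -> u32 {
    let q = ((x as u64 * 1321528399) >> 34) as u32; // q = floor(x / 13)
    x - 13 * q                                      // x mod 13 = x - 13q
}

fn main() {
    for x in [0u32, 1, 12, 13, 14, 0xa582041, u32::MAX] {
        assert_eq!(mod13_magic(x), x % 13);
    }
}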
We've seen that choosing different moduli works, and that compilers implement fixed-divisor modulo using multiplication. It's time to cut out the middleman and go straight to the good stuff: integer multiplication.
We can get a better understanding of what integer multiplication actually does by multiplying two integers in binary using the schoolbook method:
4242 = 0b1000010010010
4871 = 0b1001100000111 = 2^0 + 2^1 + 2^2 + 2^8 + 2^9 + 2^12
Binary                    | Decimal
            1000010010010 | 4242 * 2^0
           1000010010010  | 4242 * 2^1
          1000010010010   | 4242 * 2^2
    1000010010010         | 4242 * 2^8
   1000010010010          | 4242 * 2^9
1000010010010             | 4242 * 2^12
--------------------------|-------------- +
1001110110100100111111110 | 20662782
There is a lovely property here we can take advantage of: all of the upper bits of the product $x \cdot c$ for some constant $c$ depend on most of the bits of $x$. That is, for good choices of the constants $c$ and $s$, c*x >> s will give you a result that is wildly different even for small differences in $x$. It is a strong bit mixer.
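To get a feel for this, here is an illustration of my own (the constant 0x9e3779b1 is just a commonly used multiplier, not one from this post): even consecutive inputs land on very different top-4-bit values.
fn top4(x: u32) -> u32 {
    x.wrapping_mul(0x9e3779b1) >> 28
}

fn main() {
    // Inputs 1, 2, 3 differ only in their lowest bits, yet the top bits scatter.
    assert_eq!([top4(1), top4(2), top4(3)], [9, 3, 13]);
}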
Hash functions like bit mixing functions, because they need to be unpredictable. A good measure of unpredictability is found in the avalanche effect. For a true random oracle, changing one bit in the input should flip each bit in the output with 50% probability. Thus having all of your output bits depend on the input is a good property for a hash function, as a random oracle is the ideal hash function.
So, let's just try something. We'll stick to using modulo $2^k$ for maximum speed (as these can be computed with a binary AND instead of needing multiplication), and try to find constants $c$ and $s$ that work. We want our codomain to have size $2^4 = 16$ since that is the smallest power of two greater than $9$. We'll use a $32 \times 32 \to 32$ bit multiply since we only need 4 bits of output, and the top 4 bits of the multiplication will depend sufficiently on most of the bits of the input. By doing a right-shift of $28$ on a u32 result we also get our mod $2^4$ for free.
import random
random.seed(42)

def h(x, c):
    m = (x * c) % 2**32
    return m >> 28

while True:
    c = random.randrange(2**32)
    if is_phf(lambda x: h(x, c), inputs):
        print(hex(c))
        break
It's always a bit exciting to hit enter when doing a random search for a magic constant, not knowing whether you'll get an answer or not. In this case it instantly printed 0x46685257. Since it was so fast there are likely many solutions, so we can definitely be a bit greedier and see if we can get closer to a minimal perfect hash function:
best = float('inf')
while best >= len(inputs):
    c = random.randrange(2**32)
    max_idx = max(h(x, c) for x in inputs)
    if max_idx < best and is_phf(lambda x: h(x, c), inputs):
        print(max_idx, hex(c))
        best = max_idx
This quickly iterated through a couple of solutions before finding a constant that gives a minimal perfect hash function, 0xedc72f12:
>>> [h(x, 0xedc72f12) for x in inputs]
[2, 5, 8, 1, 4, 7, 0, 3, 6]
>>> make_lut(lambda x: h(x, 0xedc72f12), inputs, answers)
[7, 1, 4, 2, 5, 8, 6, 9, 3]
Ironically, if we want optimal performance in safe Rust, we still have to zero-pad the array to 16 elements so we can never go out of bounds. But if you are absolutely certain there are no inputs other than the specified inputs, and you wanted optimal speed, you could reduce your memory usage to 9 bytes with unsafe Rust. Sticking with the safe code option we get:
const LUT: [u8; 16] = [7, 1, 4, 2, 5, 8, 6, 9, 3,
0, 0, 0, 0, 0, 0, 0];
pub fn phf_lut(x: u32) -> u8 {
LUT[(x.wrapping_mul(0xedc72f12) >> 28) as usize]
}
Inspecting the assembly code using the Compiler Explorer, it's extremely tight now:
example::phf_lut:
imul eax, dword ptr [rdi], -305713390
shr rax, 28
lea rcx, [rip + .L__unnamed_1]
movzx eax, byte ptr [rax + rcx]
ret
.L__unnamed_1:
.asciz "\007\001\004\002\005\b\006\t\003\000\000\000\000\000\000"
You thought 9 bytes was the world's smallest hash table? We're only just getting started! You see, it's actually possible to have a small lookup table without accessing memory, by storing it in the code.
A very effective method for storing a small lookup table with small elements is to store it as a constant, indexed using shifts. For example, the lookup table [1, 42, 17, 26][i] could also be written as such:
(0b11010010001101010000001 >> 6*i) & 0b111111
Each individual value fits in 6 bits, and we can easily fit $4 \times 6 = 24$ bits in a u32. In isolation this might not make sense over a normal lookup table, but it can be combined with perfect hashing, and can be vectorized as well.
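As a tiny sketch of my own, that shift-indexed constant could look like this in Rust:
fn tiny_lut(i: u32) -> u32 {
    (0b11010010001101010000001u32 >> (6 * i)) & 0b111111
}

fn main() {
    // Recovers the original table [1, 42, 17, 26].
    assert_eq!([tiny_lut(0), tiny_lut(1), tiny_lut(2), tiny_lut(3)], [1, 42, 17, 26]);
}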
Unfortunately we have 9 values that each require 5 bits, which doesn't fit in a u32… or does it? You see, by co-designing the lookup table with the perfect hash function we could theoretically overlap the end of the bitstring of one value with the start of another if we directly use the hash function output as the shift amount.
Update on 2023-03-05: As tinix0 rightfully points out on reddit, our values only require 4 bits, not 5. I've made things unnecessarily harder for myself by effectively prepending a zero bit to each value. That said, you'd still need overlapping for fitting $4 \times 9 = 36$ bits in a u32.
We are thus looking for two 32-bit constants c and d such that
(d >> (x.wrapping_mul(c) >> 27)) & 0b11111 == answer(x)
Note that the magic shift is now 32 - 5 = 27 because we want 5 bits of output to feed into the second shift, as $2^5 = 32$.
Luckily we don't have to actually increase the search space, as we can construct d from c by simply placing the answer bits at the indicated shift positions. Doing this we can also find out whether c is valid or not by detecting conflicts in whether a bit should be $0$ or $1$ for different inputs. Will we be lucky?
def build_bit_lut(h, w, inputs, answers):
    zeros = ones = 0
    for x, a in zip(inputs, answers):
        shift = h(x)
        if zeros & (a << shift) or ones & (~a % 2**w << shift):
            return None  # Conflicting bits.
        zeros |= ~a % 2**w << shift
        ones |= a << shift
    return ones

def h(x, c):
    m = (x * c) % 2**32
    return m >> 27

random.seed(42)
while True:
    c = random.randrange(2**32)
    lut = build_bit_lut(lambda x: h(x, c), 5, inputs, answers)
    if lut is not None and lut < 2**32:
        print(hex(c), hex(lut))
        break
It takes a second or two, but we found a solution!
pub fn phf_shift(x: u32) -> u8 {
let shift = x.wrapping_mul(0xa463293e) >> 27;
((0x824a1847u32 >> shift) & 0b11111) as u8
}
example::phf_shift:
imul ecx, dword ptr [rdi], -1537005250
shr ecx, 27
mov eax, -2109073337
shr eax, cl
and al, 31
ret
We've managed to replace a fully-fledged hash table with one that is so small it consists of 6 (vectorizable) assembly instructions without any extra data.
Whew, that was a wild ride. Was it worth it? Let's compare four hash-based versions on how long they take to process ten million lines of random input and sum all answers (a sketch of the summing loop follows the list):
- hashmap_str processes the lines properly as newline-delimited strings, as in the general solution.
- hashmap_u32 still uses a hashmap, but reads the lines and does lookups using u32s like the perfect hash functions do.
- phf_lut is the earlier defined function that feeds a perfect hash function into a lookup table.
- phf_shift is our world's smallest hash function.
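For reference, here is a hedged sketch of my own of the kind of summing loop being timed for the phf_shift variant; the real harness is linked below, and phf_shift is the function defined earlier:
fn sum_answers_phf_shift(input: &[u8]) -> u64 {
    // Read each 4-byte line as a little-endian u32 and feed it to the hash.
    input
        .chunks_exact(4)
        .map(|chunk| u32::from_le_bytes(chunk.try_into().unwrap()))
        .map(|x| phf_shift(x) as u64)
        .sum()
}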
The complete test code can be found here. On my 2021 Apple M1 Macbook Pro I get the following results with cargo run --release on Rust 1.67.1:
| Algorithm   | Time      |
|-------------|-----------|
| hashmap_str | 262.83 ms |
| hashmap_u32 | 81.33 ms  |
| phf_lut     | 2.97 ms   |
| phf_shift   | 1.41 ms   |
So not only is it the smallest, it's also the fastest, beating the original string-based HashMap solution by over 180 times. The reason phf_shift is two times faster than phf_lut on this machine is because it can be fully vectorized by the compiler, whereas phf_lut needs to do a lookup in memory, which is either impossible or relatively slow to do in vectorized code, depending on which SIMD instructions you have available. Your results may vary, and you might need RUSTFLAGS='-C target-cpu=native' for phf_shift to autovectorize.