Scrambling Eggs for Spotify with Knuth’s Fibonacci Hashing
First Published: 08/12/23
He was very late in returning, so late that I knew that the concert
could not have detained him all the time. Dinner was on the table
before he appeared.
"It was magnificent," he said, as he took his seat. "Do you remember
what Darwin says about music? He claims that the power of producing
and appreciating it existed among the human race long before the power
of speech was arrived at. Perhaps that is why we are so subtly
influenced by it. There are vague memories in our souls of those misty
centuries when the world was in its childhood."
"That's rather a broad idea," I remarked.
"One's ideas must be as broad as Nature if they are to interpret
Nature," he answered.
– A Study in Scarlet, Arthur Conan Doyle
A few weeks back, while browsing Hacker News'
second chance pool, I
came across a
2014 blog post from Spotify
discussing how to shuffle songs.
Now, at first glance, you might think, "How complicated could it be to
shuffle songs in a playlist? Can't we just shuffle them randomly?"
You see, that's precisely the approach Spotify initially took. They used
the
Fisher–Yates shuffle
to do this. Say you have five songs from three artists, as illustrated
in the figure below:
The modern implementation of the Fisher–Yates shuffle is an \(O(n)\) time
algorithm and can be described as follows:
 To shuffle an array \(a\) of \(n\) elements (indices \(0..n-1\)):
 For \(i\) from \(n - 1\) down to \(1\) do
  \(j \leftarrow\) random integer such that \(0 \le j \le i\)
  exchange \(a[j]\) and \(a[i]\)
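In Python, the procedure above can be sketched as follows (note that `random.randint` is inclusive on both ends):

```python
import random

def fisher_yates_shuffle(a):
    """Shuffle the list a in place in O(n) time."""
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)  # random integer with 0 <= j <= i
        a[j], a[i] = a[i], a[j]   # exchange a[j] and a[i]
    return a
```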
Using the figure above, we can visualize the algorithm:
Now, this was all fine for the Sweden-based team until they faced an
interesting issue. In their words:
… We noticed some users complaining about our shuffling algorithm
playing a few songs from the same artist right after each other. The
users were asking "Why isn't your shuffling random?". We responded "Hey!
Our shuffling is random!"
At first we didn't understand what the users were trying to tell us by
saying that the shuffling is not random, but then we read the comments
more carefully and noticed that some people don't want the same artist
playing two or three times within a short time period.
… If you just heard a song from a particular artist, that doesn't
mean that the next song will be more likely from a different artist in a
perfectly random order. However, the old saying says that the user is
always right, so we decided to look into ways of changing our shuffling
algorithm so that the users are happier. We learned that they don't like
perfect randomness.
Interesting indeed! To create an illusion of randomness
(randomness with artists evenly spread throughout the playlist),
the engineers drew inspiration from a
2007 blog post by Martin Fiedler.
Fiedler's algorithm is split into two parts:
the merge algorithm and the fill algorithm. Initially, the
songs to be shuffled are categorized based on their artist. For
instance, assume that we have three artists and ten songs in our
playlist. Then,
The fill algorithm aims to distribute dummy tracks evenly among the
shorter categories to match the length of the longest one. This step is
followed by the merge algorithm, which combines tracks from
different categories to form a single playlist in a column-wise fashion.
For example, with the categories mentioned earlier, the fill algorithm
would:
Subsequently, the merge algorithm would create a final playlist as
follows:
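As a rough sketch (my own reconstruction of the idea, not Fiedler's exact algorithm), the fill-and-merge steps could look like this, using `None` as the dummy track:

```python
import random
from collections import defaultdict

def fiedler_like_shuffle(songs):
    """Group songs by artist, pad the groups with dummies (None) to a
    common length, then merge column-wise, skipping the dummies."""
    groups = defaultdict(list)
    for song in songs:
        groups[song['artist']].append(song)
    categories = list(groups.values())
    for tracks in categories:
        random.shuffle(tracks)
    longest = max(len(tracks) for tracks in categories)
    # Fill: insert dummy tracks at random positions in shorter categories.
    for tracks in categories:
        for _ in range(longest - len(tracks)):
            tracks.insert(random.randrange(len(tracks) + 1), None)
    # Merge: read the categories column by column, dropping the dummies.
    playlist = []
    for col in range(longest):
        for tracks in categories:
            if tracks[col] is not None:
                playlist.append(tracks[col])
    return playlist
```

This is a simplification: Fiedler's actual fill step spaces the dummies more deliberately, and his merge step adds extra randomization, which is part of why Spotify found it slow.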
While this method distributes the songs quite well, as mentioned
by Spotify, it is complicated and potentially slow in certain scenarios,
notably because of the numerous randomization operations.
Spotify suggested drawing inspiration from existing dithering
algorithms, like
Floyd–Steinberg dithering, to spread each artist out as evenly as possible. Although they don't
provide specific details of their algorithm, the basic concept they
describe is as follows:
Let's say we have 4 songs from The White Stripes as in the picture
above. This means that they should appear roughly every 25% of the
length of the playlist. We spread out the 4 songs along a line, but
their distance will vary randomly from about 20% to 30% to make the
final order look more random.
We introduce a random offset in the beginning; otherwise all first songs
would end up at position 0.
We also shuffle the songs by the same artist among each other. Here we
can use Fisher–Yates shuffle or apply the same algorithm recursively,
for example we could prevent songs from the same album playing too close
to each other.
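A sketch of that description (my own interpretation; the jitter range is a made-up parameter, and Spotify's real implementation is not public):

```python
import random
from collections import defaultdict

def spread_positions(n):
    """Place an artist's n songs along [0, 1): evenly spaced, with a
    random initial offset and random jitter in the spacing."""
    spacing = 1.0 / n
    offset = random.uniform(0, spacing)
    return [offset + i * spacing + random.uniform(-0.25, 0.25) * spacing
            for i in range(n)]

def dither_shuffle(songs):
    """Spread each artist's songs along a line, then sort by position."""
    groups = defaultdict(list)
    for song in songs:
        groups[song['artist']].append(song)
    placed = []
    for tracks in groups.values():
        random.shuffle(tracks)  # shuffle songs by the same artist among each other
        placed.extend(zip(spread_positions(len(tracks)), tracks))
    placed.sort(key=lambda pair: pair[0])
    return [song for _, song in placed]
```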
After reading Spotify's blog post, an idea struck me! You see, a year
ago, I wrote a small library to generate aesthetically pleasing colors
in the
HSV (hue, saturation, value) space.
But how do we define "aesthetically pleasing colors"? And what is the
connection to shuffling algorithms?
Let's consider an example where we need to pick five colors to label a
chart. Choosing these colors randomly from the RGB space can be quite
problematic. Some colors, especially on darker backgrounds, can make
text hard to read. Moreover, it is possible to end up with colors that
are too similar to one another.
Addressing the issue of dark or light backgrounds turns out to be
relatively straightforward: avoid the RGB space and instead
use the HSV space with fixed saturation and value. However, even
then, there is a high chance of picking colors that are too close on
the color spectrum. In other words,
the selection would be too random, whereas we want colors that
are distributed as evenly as possible across the entire color spectrum.
To achieve this, my library employed Fibonacci hashing. The
algorithm I implemented
was proposed by Martin Leitner-Ankerl. It uses the reciprocal of the golden ratio: \[ \Phi^{-1} =
\dfrac{\sqrt{5} - 1}{2} \approx 0.618033988749895 \]
As Martin explains in his blog, one of the intriguing properties of the
golden ratio is that
each subsequent hash value divides the interval into which it falls
according to the golden ratio!
To better understand Fibonacci hashing, let us explore it from the
perspective of Donald Knuth's
The Art of Computer Programming, Volume 3.
The foundation of Fibonacci hashing rests upon the
three-distance theorem. This theorem states: Suppose \(\theta\) is an irrational number. When
the points \(\theta \bmod 1\), \((2\theta) \bmod 1\), …, \((n\theta) \bmod 1\) are
placed on a line segment ranging from \(0\) to \(1\) (inclusive), the
\(n + 1\) line segments formed will have at most three distinct lengths.
Moreover, the next point \(((n+1)\theta) \bmod 1\)
will fall into one of the largest existing segments.
Consider the largest existing segment, spanning from \(a\) to \(c\)
(inclusive), being divided into two parts: \(a\) to \(b\) and \(b\) to
\(c\). If one of these parts is more than twice the length of the other,
we call this division a bad break. Using the three-distance
theorem, it can be shown that the two numbers \(\Phi^{-1}\) (the reciprocal of the golden ratio!) and \(1 - \Phi^{-1}\) lead to the most uniformly distributed
sequences among all numbers \(\theta\) between \(0\) and \(1\). In other
words, a bad break will occur unless \(\Phi^{-1}\) or \(1 -
\Phi^{-1}\) is used.
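We can check the theorem numerically. The sketch below cuts the unit interval with the points \(k\theta \bmod 1\) and counts the distinct segment lengths, rounding to absorb floating-point noise:

```python
def segment_lengths(theta, n):
    """Lengths of the segments formed by theta, 2*theta, ..., n*theta (mod 1)."""
    points = sorted((k * theta) % 1 for k in range(1, n + 1))
    cuts = [0.0] + points + [1.0]
    return [round(b - a, 9) for a, b in zip(cuts, cuts[1:])]

phi_inv = (5 ** 0.5 - 1) / 2
for n in (5, 12, 50):
    print(n, len(set(segment_lengths(phi_inv, n))))  # never more than 3
```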
Using the theory described above, we can now define
Fibonacci hashing. However, before diving into that, it is
important to first understand multiplicative hashing.
Multiplicative hashing involves multiplying a key \(K\) by a constant
\(\alpha\). The fractional part of this product is then used to
determine the index in a hash table.
Fibonacci hashing is a specific variant of multiplicative hashing
where the constant \(\alpha\) is set to the reciprocal of the golden
ratio, \(\Phi^{-1}\).
And that's it! We can start at a random hue value between 0 and 1 (this
will be our initial key \(K_1\)) and use Fibonacci hashing from that
point to pick evenly spaced colors on the color spectrum.
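A minimal sketch of that idea (this is not my library's actual API, just the core loop):

```python
import random

PHI_INV = (5 ** 0.5 - 1) / 2  # reciprocal of the golden ratio

def pleasing_hues(n, seed=None):
    """Return n hues in [0, 1), spread out via Fibonacci hashing."""
    rng = random.Random(seed)
    hue = rng.random()  # random starting hue: the initial key K_1
    hues = []
    for _ in range(n):
        hues.append(hue)
        hue = (hue + PHI_INV) % 1  # the next hash value
    return hues
```

Each hue can then be paired with a fixed saturation and value, e.g. `colorsys.hsv_to_rgb(hue, 0.5, 0.95)`.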
So how do we apply all of this to shuffle songs? Well, the algorithm I
came up with is as follows:

1. Categorization Function: Let \(S\) be the set of all songs in
the playlist. Define a categorization function \(C: S \rightarrow A\),
where \(A\) is the set of artists. This function assigns each song in
\(S\) to an artist in \(A\).
2. Shuffle Songs per Artist: For each artist \(a \in A\), shuffle
the subset of \(S\) where \(C(s) = a\) for all \(s\) in the subset.
3. Shuffle Artist List: Let \(A'\) be the list of all artists.
Apply a random permutation \(\pi\) to \(A'\).
4. Initialize Playlist \(P\): Let \(P\) be an empty list that will
store the final shuffled songs.
5. Set Parameters: Initialize \(K = 1\) and let \(N = |A|\) (the
total number of artists).
6. Select Artist Index: Compute the index \(I = \lfloor ((K \cdot
0.618033988749895) \bmod 1) \cdot N \rfloor\).
7. Add Songs to \(P\): Remove the first song from the list of
songs corresponding to the artist at \(A'[I]\) and append it to \(P\).
8. Update Parameters: Increment \(K\) by 1.
9. Update Artist List: If the artist at \(A'[I]\) has no remaining
songs, decrement \(N\) by 1 and remove the artist from \(A'\).
10. Loop Through: Repeat steps 6 to 9 until \(A'\) is empty. When
empty, return \(P\).
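Put together in Python, the steps above look roughly like this:

```python
import random

PHI_INV = 0.618033988749895  # reciprocal of the golden ratio

def fibonacci_shuffle(songs):
    """Shuffle songs (dicts with an 'artist' key) so that artists are
    spread out via Fibonacci hashing."""
    # Steps 1-2: categorize songs by artist and shuffle each artist's songs.
    groups = {}
    for song in songs:
        groups.setdefault(song['artist'], []).append(song)
    for tracks in groups.values():
        random.shuffle(tracks)
    # Step 3: shuffle the artist list.
    artists = list(groups)
    random.shuffle(artists)
    # Steps 4-5: initialize the playlist and the parameters K and N.
    playlist = []
    k, n = 1, len(artists)
    # Steps 6-10: repeatedly pick an artist index via Fibonacci hashing.
    while artists:
        i = int(((k * PHI_INV) % 1) * n)
        playlist.append(groups[artists[i]].pop(0))
        k += 1
        if not groups[artists[i]]:
            del artists[i]
            n -= 1
    return playlist
```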
If we use a hash table in the categorization function and an array to
store the list of all artists, then the overall time complexity of the
algorithm is \(O(|S|)\).
To test this algorithm against the naive Fisher–Yates shuffle, I used
a measure that calculates the ratio of the count of adjacent song pairs
by the same artist to the total number of songs in the playlist minus
one. That is,
```python
def measure(shuffled_songs):
    clusters = sum(
        1
        for i in range(len(shuffled_songs) - 1)
        if shuffled_songs[i]['artist'] == shuffled_songs[i + 1]['artist']
    )
    return clusters / (len(shuffled_songs) - 1)
```
When the playlist contains songs from only one artist, this measure will
be 1. Conversely, when no adjacent song pairs are by the same artist,
the measure will return 0.
Using fuzz testing with a maximum of ten artists, each having up to
ten songs, I generated 1,000,000 playlists and calculated the statistics
for the above algorithm (A) and the Fisher–Yates shuffle (B). The
results were as follows:

A – Mean: 0.11831, Median: 0.05263, Mode: 0.0, p25: 0.02040, p75: 0.14814, p90: 0.33333, p95: 0.5
B – Mean: 0.23296, Median: 0.18181, Mode: 0.33333, p25: 0.12121, p75: 0.29629, p90: 0.46666, p95: 0.58333
At first glance, the p95 result may seem surprising. However, this
occurs in cases where the fuzz testing populates a playlist
predominantly with songs from a single artist, rendering optimal
shuffling infeasible. When a minimum of four songs per artist was used,
the results were:

A – Mean: 0.05948, Median: 0.03703, Mode: 0.0, p25: 0.01694, p75: 0.07843, p90: 0.14285, p95: 0.2
B – Mean: 0.17223, Median: 0.15555, Mode: 0.2, p25: 0.10909, p75: 0.21875, p90: 0.29268, p95: 0.34482
The results look decent. I would expect Fiedler's algorithm to have a
near-zero mean; however, the simplicity and speed of the above algorithm
do appeal to me.
And that's it, folks! I hope you enjoyed this little journey. Happy
exploring!
Addendum
Recently, I was exploring the "History" section in Knuth's TAOCP Volume
3 to trace back the history of hashing. In it, Knuth states that:
The idea of hashing appears to have been originated by H. P. Luhn, who
wrote an internal IBM memorandum in January 1953 that suggested the use
of chaining; in fact, his suggestion was one of the first applications
of linked linear lists.
Unfortunately, I couldn't locate the memo. It seems it was never made
public. However, I stumbled upon a nice paper from 1953 in which he
discusses
enhancing search engines by refining sets.
Knuth also references Arnold I. Dumey, who appears to be the first to
describe hash tables in the open literature.
I was able to retrieve his paper.
Dumey opens with an \(O(\log n)\) solution based on the "twenty
questions" game, subsequently explaining how we would be better off if we
could do computation before we access the memory. I found his
introduction to the hash function quite intriguing:
A certain manufacturing company had a parts and assemblies list of many
thousands of items. A mixed numerical and alphabetic system of numbering
items was used, of six positions in all. Eight complicated machines or
assemblies were sold to the public. These had item numbers taken from
the general system. In setting up a punched card control system on these
eight items it was first proposed to record the entire six digit number
for each item. However, examination of the eight assembly numbers
disclosed that no two were alike in the fourth digit. It was therefore
sufficient, for sorting purposes, merely to record the fourth digit,
thereby freeing five badly needed record positions for other
purposes.
This rather extreme case indicates that an examination of the item
description may disclose a built-in redundancy which can be used to cut
the field down to practical size.
He further discusses handling duplicates and introduces chaining:
Adjust the addressing scheme, according to a method which will be
described later, to reduce the number of direct addresses, and use the
extra spaces to store overflows. Put the overflow address at the
tail end of the stored item information. What the best reduction is varies
from case to case. Note that the expectation of the number of accesses
to be made goes up when these methods are used. At each access we check
by using the full item description, usually.
He then discusses how a reasonably efficient hash address can be
constructed by dividing by a prime number and using the remainder:
Consider the item description as if it were a number in the scale of
37 or whatever. Or write it as a binary number by using the appropriate
punched tape coding. Divide this number by a number slightly less than
the number of addressable locations (the writer prefers the nearest
prime). Throw away the quotient. Use the remainder as the address,
adjusting to base 10, as the case may be.
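In modern terms, Dumey is describing division hashing. A minimal sketch (the item string and table size below are made up for illustration):

```python
def division_hash(item_description, table_size=997):
    """Interpret the item description as a number (via its byte encoding)
    and use the remainder after dividing by a nearby prime as the address."""
    as_number = int.from_bytes(item_description.encode('ascii'), 'big')
    return as_number % table_size
```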