Jane Street Tech Blog – How to shuffle a big dataset
At Jane Street, we often work with data that has a very low
signal-to-noise ratio, but fortunately we also have a lot of it.
Where practitioners in many fields might be accustomed to
having tens or hundreds of thousands of correctly labeled
examples, some of our problems are more like having a billion training
examples whose labels have only a slight tendency to be correct.
These huge datasets present a number of interesting engineering
challenges. The one we address here: How do you shuffle a really
big dataset? (If you're not familiar with why one might need this,
jump to the section Why shuffle below.)
For a dataset x_0, ..., x_{n − 1} that fits in RAM, you can shuffle using something like
Fisher–Yates:

for i = 0, ..., n − 2 do
    swap x[i] and x[j], where j is a random draw from {i, ..., n − 1}
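As a concrete sketch, here is what that pseudocode might look like in Python (illustrative, not production code):

```python
import random

def fisher_yates(xs, rng=random):
    """Shuffle the list xs in place, uniformly at random."""
    n = len(xs)
    for i in range(n - 1):
        # j is drawn from {i, ..., n - 1}; randrange's upper bound is exclusive.
        j = rng.randrange(i, n)
        xs[i], xs[j] = xs[j], xs[i]
    return xs
```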
But what if your dataset doesn't fit in RAM?
I'll present the algorithm I use for shuffling large datasets. It
isn't novel, and one can find several
instances of
people
reinventing
it or something similar (and in essence it descends from
Rao). Still, I don't know
of anywhere that states the algorithm, shows why it's correct, and
gets into the particular practical issues we address below. Also,
when I first encountered this problem and searched online for an
answer, I didn't find any of the good examples above, just lots of bad
ideas, so hopefully this post will improve the odds for the next
person.
To be clear, this isn't some minor performance hack. For big
datasets, it makes the difference between feasible and infeasible.
(See the appendix for a more quantitative comparison.)
A 2-pass shuffle algorithm
Suppose we have data
x_0, ..., x_{n − 1}.
Choose an M large enough that a set of n/M points can be shuffled
in RAM using something like Fisher–Yates, but small enough that you can have
M open files for writing (with decent buffering). Create M "piles"
p_0, ..., p_{M − 1}
that we can write data to. The mental model of a "pile" here is that it's a
file you can append to, but in practice
you might, say, have several piles exist as datasets in the same HDF5
file. The first pass of the algorithm splits the data into these
M piles, and the second pass shuffles each pile and appends it to
the final result.
-- First pass
create empty piles p[0], ..., p[M − 1]
for i = 0, ..., n − 1 do
    j := uniform random draw from {0, ..., M − 1}
    append x[i] to pile p[j]

-- Second pass (perhaps done lazily)
for j = 0, ..., M − 1 do
    shuffle p[j] in RAM with Fisher–Yates or whatever is convenient
    append p[j] to output file
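A minimal Python sketch of both passes, using one temporary file per pile and pickled records (the file layout here is illustrative; in practice you might use HDF5 datasets as described above):

```python
import os
import pickle
import random
import tempfile

def two_pass_shuffle(records, out_path, n_piles, rng=random):
    """Shuffle an iterable of records into out_path via n_piles on-disk piles."""
    tmpdir = tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, "pile_%d" % j) for j in range(n_piles)]
    # First pass: scatter each record into a uniformly chosen pile.
    piles = [open(p, "wb") for p in paths]
    for x in records:
        pickle.dump(x, piles[rng.randrange(n_piles)])
    for f in piles:
        f.close()
    # Second pass: load each pile, shuffle it in RAM, append to the output.
    with open(out_path, "wb") as out:
        for p in paths:
            chunk = []
            with open(p, "rb") as f:
                try:
                    while True:
                        chunk.append(pickle.load(f))
                except EOFError:
                    pass
            rng.shuffle(chunk)
            for x in chunk:
                pickle.dump(x, out)
```

Because the piles are written and read as streams, nothing here assumes fixed-size records.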
[Figure] Example of a shuffle: we start with unshuffled data (top); the first
pass leaves M=6 piles (middle); the second pass yields shuffled data (bottom).
Assuming you have enough memory to satisfy the above constraint on
M, and assuming that drawing a random number is O(1), this is a
linear-time algorithm; the constant factor is dominated by having to
read and write each data point twice in external storage (but the
reading/writing can be done in blocks rather than one point at a time).
Since the reading and writing is stream-oriented, the algorithm still
works for data with variable record length.
To see that the 2-pass shuffle yields an unbiased random permutation,
consider another algorithm already known to be correct: draw
U_0, ..., U_{n − 1} ~ Uniform(0,1), associate x_i with U_i, and
sort by U_i; this yields an unbiased permutation. Our algorithm
above can be seen to be equivalent to this: for M=1000, the choice
of pile is like radix sorting on the first 3 digits of U_i, and then
shuffling within each pile is like sorting on the remaining digits.
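To make the radix-sort analogy concrete, here is a tiny Python check that bucketing on the leading digits of U_i and then sorting within each bucket reproduces exactly the order you would get from one big sort by U_i:

```python
import random

def sort_by_key(xs, us):
    """Order xs by their uniform keys us: one big sort."""
    return [x for _, x in sorted(zip(us, xs))]

def bucket_then_sort(xs, us, m):
    """Order xs by first bucketing on floor(u * m), then sorting each bucket.
    Bucket index is monotone in u, so this yields the same order."""
    buckets = [[] for _ in range(m)]
    for x, u in zip(xs, us):
        buckets[int(u * m)].append((u, x))
    out = []
    for b in buckets:
        out.extend(x for _, x in sorted(b))
    return out

xs = list(range(100))
us = [random.random() for _ in xs]
assert sort_by_key(xs, us) == bucket_then_sort(xs, us, 10)
```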
Dealing with oversized piles
Even when the expected pile size is
small enough to shuffle in RAM, there is some chance of getting an
oversized pile that's too big to shuffle in RAM. You can make
the probability of getting an oversized pile very small: if the expected
pile size is s, the standard deviation is just under
√s, so you can easily arrange for, say,
s + 6√s
to be a size that you can still shuffle in RAM.
Even with M=1000, the chance
that some pile will be larger than expected by 6 standard deviations is about
10^{−6}.
(This 6√s business is just
a formality. In practice, you just leave yourself what seems like a sufficient amount
of headroom, and if you get an oversized pile, it's overwhelmingly likely that
you overestimated how many points you could fit in memory rather than that you got unlucky,
and you try again with a smaller pile size.)
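For intuition on that 10^{−6}: pile sizes are Binomial(n, 1/M), which is roughly Normal for large piles, so a union bound over M one-sided normal tails gives approximately:

```python
import math

def p_some_pile_oversized(n_piles, n_stdevs=6.0):
    """Union bound on the chance that any of n_piles exceeds its mean
    by n_stdevs standard deviations, under the normal approximation."""
    tail = math.erfc(n_stdevs / math.sqrt(2)) / 2  # P(Z > n_stdevs)
    return n_piles * tail

# With M = 1000 piles and 6-stdev headroom, this is on the order of 1e-6.
```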
In the rare case that you end up with an oversized pile, you could recursively
apply the algorithm to the oversized pile, but it's also fine just to
start over. Because the probability of having to restart is
small, the expected runtime is only slightly increased.
You might worry that starting over would introduce some bias into the shuffle,
but, perhaps surprisingly, it doesn't, because the tuple of pile sizes that results from the first pass
is independent of the permutation that's generated.
(Consider the above way of thinking of the algorithm as
associating each point with some U_i and then sorting; if I tell
you how many of the U_i happened to fall in certain intervals, I
still haven't given you any information about the relative ordering
among the U_i.)
A similar consideration applies if the way you are storing your data
makes it necessary or advantageous to preallocate the storage
for each pile: you preallocate
s + 6√s
for each pile, on average waste
6√s
per pile, and very rarely have to restart because you exceeded
the storage you had preallocated.
Parallelizing, and other practical matters
As a practical matter, with very large datasets, the input is often
broken across multiple files rather than being in a single file, and it would be
desirable for the result of the shuffle to be broken across multiple
files as well. The above algorithm adapts naturally to this setting.

- Suppose the input is spread across files
X_0, ..., X_{K − 1}. We do the
first pass for each of these files in parallel, leaving many sets of
piles
p^k_0, ..., p^k_{M − 1}
for
k = 0, ..., K − 1.
- For j = 0, ..., M − 1, combine
p^0_j, ..., p^{K − 1}_j
into p_j.
- Proceed with the second pass as above.
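An in-memory sketch of this scheme (threads and lists standing in for processes and files; names are illustrative):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def first_pass(records, n_piles, rng=random):
    """Scatter one input file's records into n_piles piles."""
    piles = [[] for _ in range(n_piles)]
    for x in records:
        piles[rng.randrange(n_piles)].append(x)
    return piles

def parallel_first_pass(input_files, n_piles):
    """Run the first pass over each input file in parallel, then merge
    pile j across all files into a single pile p[j]."""
    with ThreadPoolExecutor() as pool:
        per_file = list(pool.map(lambda f: first_pass(f, n_piles), input_files))
    return [sum((piles[j] for piles in per_file), []) for j in range(n_piles)]
```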
Often, the data you are trying to shuffle is the output of some
preprocessing step. The first pass can be integrated into the
preprocessing, so that the extra cost incurred by the first pass is
near zero: during preprocessing, where you would have written the
preprocessed data to one file, you instead write it to many piles.
Also, in practice, it can be helpful to have the resulting chunks be
small enough that they can be shuffled in RAM while also
training your model. Then, the second pass is done lazily: you
only shuffle the piles as they are loaded for training. This is often
a net win, depending on how many times you will consume the
data without reshuffling. (Fancier still, if the piles are small
enough that you can fit 2 in memory at the same time, you can have
a better input pipeline: while you are training on one pile, you start
loading and shuffling the next one.)
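One way to sketch that double-buffered lazy second pass in Python (here a pile is just a list, and load_and_shuffle is a stand-in for reading a pile off disk):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def load_and_shuffle(pile):
    """Stand-in for loading one pile from storage, then shuffling it in RAM."""
    chunk = list(pile)
    random.shuffle(chunk)
    return chunk

def shuffled_examples(piles):
    """Yield examples pile by pile, prefetching and shuffling the next pile
    in a background thread while the current one is being consumed."""
    if not piles:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_and_shuffle, piles[0])
        for j in range(len(piles)):
            chunk = future.result()
            if j + 1 < len(piles):
                future = pool.submit(load_and_shuffle, piles[j + 1])
            yield from chunk
```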
Leaving the piles unshuffled also allows for another trick pointed
out by my colleague David Wu: Suppose new
data is arriving at a roughly constant rate, and you want to maintain a moving
window of length Y years. Think of each pile as a circular buffer,
with its contents in chronological order. As new data comes in, when
you write to a pile, you remove old data and append the new data.
In this way you can incrementally maintain a shuffled copy of the last
Y years of data. (Okay, it's only a half-shuffled copy, but the
remaining work is easy to do when you load each pile.)
Leaving the data in many piles, rather than combining it into a single
monolithic output, also allows you to get imperfect (but for many
purposes sufficient) reshuffles by permuting the order in which you
load the piles (and shuffling within each pile when you load it).
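A short sketch of such a reshuffle (piles again modeled as in-memory lists):

```python
import random

def reshuffle(piles, rng=random):
    """Yield all examples with the pile order permuted and each pile
    shuffled internally: an imperfect but often sufficient reshuffle."""
    order = list(range(len(piles)))
    rng.shuffle(order)
    for j in order:
        chunk = list(piles[j])
        rng.shuffle(chunk)
        yield from chunk
```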
Why shuffle
When training neural nets by stochastic gradient descent (or a variant thereof),
it is common practice to shuffle the data. Without getting bogged down in
a detailed discussion, let's try to get a sense for why this
shuffling is useful by considering an extreme example. Suppose you are training
a classifier to tell cats from dogs, and your training set
is 50,000 cats followed by 50,000 dogs. If you don't shuffle, you
will get poor training performance. Strictly speaking, the problem
arises from having serial correlation in the noise of your gradients,
combined with non-commutativity of parameter updates (if training on
x and then y were equivalent to training on y and then x, then
shuffling would have no effect); intuitively, your net will spend
50,000 examples learning "everything's a cat," followed by 50,000
examples learning "no, everything's a dog," and most of the finer
structure you might have learned along the way will get drowned out.
If you only locally shuffle (e.g., maintain a reservoir of 10,000
examples that you draw from randomly, which is replenished by
streaming through your dataset), then that could be sufficient if serial correlations
in your data persist for much fewer than 10,000 examples, but it would be insufficient
in our 50,000-cat/50,000-dog example.
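The local-shuffle idea mentioned above (a fixed-size buffer you draw from at random while streaming) can be sketched as:

```python
import random

def local_shuffle(stream, buffer_size, rng=random):
    """Approximately shuffle a stream: fill a buffer, then repeatedly
    emit a uniformly chosen buffered element and refill from the stream.
    Correlations longer than buffer_size survive this shuffle."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) >= buffer_size:
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    # Drain whatever is left once the stream is exhausted.
    rng.shuffle(buf)
    yield from buf
```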
That's not to say that shuffling is itself optimal. E.g., you might get
better training performance by making sure each consecutive pair of training
examples has one cat and one dog (though we've found there are other problems that
crop up with this idea). Or, there are approaches like
curriculum learning (Bengio et al.).
Appendix: Performance comparison
The two-pass shuffle seemed so clearly better than random access into
a file that I hadn't bothered to measure how much faster it actually
is. One approach works, the other doesn't, what's there to measure?
But the post was met with a fair amount of skepticism about whether it's
faster at all, apparently on the theory that the 2-pass algorithm has
an extra read/write and SSDs are fast. So I measured the difference
and found that, for my data and the way it's stored, the 2-pass approach
is 1000 times as fast as random access (and that's before
incorporating further improvements to the 2-pass approach that are
done in practice, which are to parallelize the first pass and
integrate it with the data preprocessing). If this sounds too good to
be true, remember that this isn't a comparison against some
highly regarded practice; it's a comparison against a bad idea, like
quicksort versus bubblesort.
Even with uncompressed data on local SSDs, sequential traversals are
48 times as fast as random-access traversals for my data.
Clearly the performance gap will depend on how big your training
examples are, your storage setup, what file format you're using,
whether the data is compressed, and so on. In particular, if
individual examples are very large (500kB each?) then random access
could be competitive.
The dataset I tested this on is 220 million examples, 9kB each. It
would be 2TB uncompressed. It's 320GB compressed (4 HDF5 files, 80GB
each, using HDF5's internal compression). If I try to traverse the
data by grabbing one random example at a time, it takes 394,000μs
per example (random access into compressed 80GB files is SLOW). At
that rate, it would take 2.75 years to traverse the data once.
(That's not doing anything obviously wrong like reopening the file for
each read; the 4 files are only opened once. The only
obviously wrong thing it's doing is trying to traverse the data via
random access.)
By comparison, reading the data sequentially in big blocks, it takes
120μs/example, and a single traversal of the dataset takes 7.3
hours. Taking into account the fact that with the 2-pass algorithm you
have to read each data point twice and do an intermediate write, it
takes about a day, starting from unshuffled data, to do a random
traversal. This is a 1000x speedup over random access, without
incorporating anything like parallelizing the first pass, or
piggybacking the first pass on top of whatever preprocessing you're
already doing. If I put some effort into optimizing the silly
approach, I can get the gap to be smaller. E.g., if I go to the
trouble of putting the data on local storage (a RAID array of SSDs in
this case), still compressed, and only reading from one file, it's
"only" a 460x speedup. Using uncompressed data (I tested with a
memory-mapped .npy file) on locally attached SSD storage yields a
hefty speedup for both approaches, with random reading taking
720μs/example and sequential reading taking 15μs/example.
This narrows the gap, but not enough to make random access competitive.
So, the relative speed of sequential access more than compensates for
the cost of the first pass (which itself is negligible if you are
going to preprocess the data anyway, as pointed out earlier).
You might wonder: even in RAM, sequential access is faster than random
access; does this mean that we can make in-memory shuffles faster
using an algorithm like this rather than Fisher–Yates
(where RAM is the new disk, and cache is
the new RAM)? According to the Sanders
paper mentioned in the
introduction, the answer is yes, and he claims a 4x speedup on
contemporary hardware. (Of course, in the context of our problem here,
where the in-memory operations are cheap relative to getting stuff off
the disk, that 4x speedup for the in-memory shuffle would make little
difference for us.)