The Billion Row Challenge (1BRC)

2024-02-22 08:49:07

QuestDB is a high-performance time-series database with SQL analytics that can power through data ingestion and analysis. It is open source and integrates with many tools and languages. Give us a try!

As I was browsing my timeline on the boring afternoon of New Year's Day 2024,
this tweet by Gunnar Morling
jumped out:

How fast can YOU aggregate 1B rows using modern #Java? Grab your threads, flex
your SIMD, and kick off 2024 true coder style by joining this friendly little
competition. Submissions accepted until Jan 31.

The challenge was this:

Write a Java program for retrieving temperature measurement values from a text
file and calculating the min, mean, and max temperature per weather station.
There's just one caveat: the file has 1,000,000,000 rows!

My first thought was, "Pfft, min, mean and max, that is so simple!" And the
dataset was simple as well: just 413 unique keys, of fairly uniform, short
lengths. A super-simple data format. A whole month to do it. Where's the
challenge?? As is often the case, the devil was in the details.

It soon dawned on the contestants that calculating only min, mean and max made
the competition harder, not easier. But why? There wasn't one obvious place
consuming CPU cycles. Opportunities to make it even faster lay everywhere, even
in the darkest corners of the CPU architecture.

As a saving grace, the challenge was held out in the open on GitHub. Copying
others' ideas wasn't just allowed, it was encouraged. This was to be a learning
experience, not a war of secrets.

It was also going to be a month-long frenzy of research, ingenuity, and just
plain fun. It grabbed the attention and devotion of hundreds of enthusiasts,
including at least a dozen or two top Java performance experts.

The winning solution is the one submitted by none other than Thomas Wuerthinger
(@thomaswue), the lead of the GraalVM project:
that's the kind of heavyweights I'm talking about! Together with several other key
contestants, he introduced many of the great ideas that everyone else grabbed
and incorporated into their own submissions.

My submission did quite well, ending up at spot #9. My QuestDB colleague,
Jaromir Hamala (@jerrinot), fared even better,
placing third, just 73 milliseconds behind the winner!

Screenshot of 1BRC Scoreboard
The Official 1BRC Scoreboard

# What does the winning code look like?

Now that you've read the rules, can you guess what the actual challenge is? Can you
visualize what the winning code would look like? Here,
have a look!

For that matter, take a look at any of the top 10 solutions, including
mine,
and especially at the hot loop in each of them: the part of the code where the
program spends almost all of its time.

If you've ever had the experience of writing a small program in C++ or Rust, and
then looking at the optimized machine code the compiler produced, you'll get
similar vibes here. Abstractions are spilled open, things criss-cross
and interleave with one another. A ton of totally alien-looking, bit-twiddling
logic.

How could a human programmer possibly get to this point? As in so many other
cases, it was people working together and improving step by step. Dozens of Java
experts iterated through many tricks and hacks, and as January rolled on, the
processing time kept dropping lower and lower.

The main thing I want to show you in this post is that a good part of that
amazing speed comes from easy-to-grasp, reusable tricks that you could apply in
your own code as well. Towards the end, I'll also show you some of the magical parts
that take it beyond that level.

# The baseline: idiomatic Java

OK: 1,000,000,000 rows – here we go!

For context, here is a sample of the temperature input file (the rows shown below are
illustrative of the format). Each line contains a station name and a temperature reading:
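```
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
```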

And this is the expected output: every station in alphabetical order, each with its
min/mean/max (the values shown below are illustrative):
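```
{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, ..., Zürich=-42.0/9.3/63.6}
```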

As a first take, let's apply some
idiomatic Java code
that would pass muster with any seasoned Java developer:
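The listing below is a sketch rather than the exact code from the repo: parallel streams over Files.lines(), grouping into a TreeMap with a small Stats accumulator (output formatting is simplified).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.TreeMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;

public class CalculateAverageNaive {
    record Measurement(String station, double temp) {}

    static class Stats {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum;
        long count;
    }

    public static void main(String[] args) throws IOException {
        try (var lines = Files.lines(Path.of("measurements.txt"))) {
            var stats = lines.parallel()
                    .map(line -> {
                        int semi = line.indexOf(';');
                        return new Measurement(line.substring(0, semi),
                                Double.parseDouble(line.substring(semi + 1)));
                    })
                    .collect(Collectors.groupingBy(Measurement::station, TreeMap::new,
                            Collector.of(Stats::new,
                                    (Stats acc, Measurement m) -> {
                                        acc.min = Math.min(acc.min, m.temp);
                                        acc.max = Math.max(acc.max, m.temp);
                                        acc.sum += m.temp;
                                        acc.count++;
                                    },
                                    (Stats a, Stats b) -> {
                                        a.min = Math.min(a.min, b.min);
                                        a.max = Math.max(a.max, b.max);
                                        a.sum += b.sum;
                                        a.count += b.count;
                                        return a;
                                    })));
            // Print min/mean/max per station, stations already sorted by the TreeMap
            stats.forEach((station, s) -> System.out.printf("%s=%.1f/%.1f/%.1f%n",
                    station, s.min, s.sum / s.count, s.max));
        }
    }
}
```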

This code:

  • uses parallel Java streams, which put all the CPU cores to work
  • doesn't fall into any known performance traps, like Java regex
  • leans heavily on all the nice building blocks provided by the JDK

On a Hetzner CCX33 instance with OpenJDK 21.0.2, it takes 71 seconds to
complete. But the best solution takes 1.5 seconds: that's a jaw-dropping 47
times faster! As we said, the way to get there is step by step, so let's start
with the first one.

# Easy win: use a faster JVM

Before we touch the code, there's a low-effort way to speed up your program: use
a modern JVM. Many production deployments still run on Java 8 or 11, but the
pace of progress since those days has been significant.

During the 1BRC challenge, we found that GraalVM is
one damn fast JVM. It also supports compiling into a native binary, which
eliminates the JVM startup cost.

By simply downloading GraalVM and making it the default, my solution improved
from 71 seconds to 66: a solid 7.5% improvement for very little effort.

Once we get deeper into optimizing and bring the runtime down to 2-3 seconds,
eliminating the JVM startup provides another 150-200 ms of relief. That becomes a
big deal.

# Profile before you optimize

Every successful 1BRC contestant used profiling of one kind or another to guide
their optimization efforts. I used a combination of three tools:

  1. Good old VisualVM
  2. Async Profiler
  3. The perf command-line tool

Many people consider VisualVM outdated, but it harbors a hidden gem: the
VisualGC plugin. You have to install it from the Tools→Plugins menu. Once you
attach it to a running Java program, VisualGC shows up as the rightmost tab in
the window.

Be sure to choose the shortest refresh interval (100 ms), and then enjoy the show.
A real-time animation of all the memory regions the garbage collector maintains appears,
along with a graph of JIT compilations and GC runs. I used to spend
hours staring at this oddly satisfying, complex dance of the GC's cogwheels. For the
1BRC program, I added a while (true) loop to keep processing the input
file forever; otherwise things just flash by.

Async Profiler came from following Gunnar's advice on the
1BRC GitHub page.
The jbang tool provides very convenient access to it. You run the program
once and get an HTML file with a flamegraph. The flamegraph then tells you
which functions/methods your program is spending the most time in.

The third tool, perf, has many features, but for Java the most popular choice
is perf stat. It doesn't analyze any specific method, but gives you insight
into low-level CPU counters. It shows:

  • How many instructions the program executed
  • How many branches and memory accesses it made
  • How many of those were branch/L1 cache misses

To obtain these insights, I used the following command:
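A typical invocation looks something like this (the program name is a placeholder; the event list covers the counters discussed in this post):

```bash
perf stat -e task-clock,cycles,instructions,branches,branch-misses,cache-references,cache-misses \
    ./calculate_average    # or: java CalculateAverage
```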

VisualGC was the most useful in the initial optimization phase. Then, once I
sorted out allocation, the flamegraph proved extremely helpful for pinpointing the
bottlenecks in the code. However, once the runtime went below ~3 seconds, its
usefulness declined. At this level we're squeezing out performance not from
methods, but from individual CPU instructions. That's where perf stat became
the best tool.

For reference, here's the perf stat report for our basic implementation:

It's most useful to interpret the numbers on a per-row basis (dividing
everything by 1 billion). We can see that the program spends more than 2,000
instructions on each row. No need to get into more detail; initially we'll be
driving down just this metric.

# Parallelize the I/O

A quick profiling run with VisualVM and a
flamegraph shows no clear
bottleneck for our initial Streams API code.

Note: Scroll down the flamegraph page to see the graph!

The time divides roughly equally among three main tasks:

  1. BufferedReader work that outputs a string for each line
  2. Processing these lines
  3. Garbage collection (GC)

VisualVM shows GC cycles running like crazy, 10 times per second or more. What's
worse, there's some spilling into the Old generation, triggering background GC
runs as well. We have to decide what to attack first.

The first thing most of us in the challenge realized was that we have to
parallelize the I/O. In the above code, a single thread does all the file
reading and emits a stream of lines. This includes finding the newline and
allocating a string for each line. On the other hand, the entire input file fits
into the disk cache. It's a perfect target for parallelization.

One tried-and-true approach is to split the file into as many chunks as there
are threads, and process each chunk independently. Unfortunately, in order to
take that step, we have to say goodbye to the concise Streams API and do
everything by hand.

We could read the chunks using the RandomAccessFile API, but since it
doesn't natively support buffered reading, it would be a quirky implementation
that would involve copying from native memory into a Java buffer.

Instead, everyone went for the mmap approach. This means working
with the file as if it were a large in-memory array. Java has supported mmap
for a long time, relying on the ByteBuffer API to read the native memory. That API
uses int for indexing, limiting the mapped region size to 2 GB.

The JDK team is currently introducing a newer API based on long indexing,
MemorySegment. In the spirit of 1BRC, which encourages using the latest and
greatest Java features, let's go with that:
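A minimal sketch of the mapping step (the method and field names here are illustrative, not the exact code from the repo):

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Map the whole file as one long-indexed MemorySegment; each worker thread then
// gets its own [chunkStart, chunkLimit) range within it.
static MemorySegment mapFile(Path file, Arena arena) throws IOException {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
    }
}

// Reading a single byte at a long offset from the mapped segment:
static byte byteAt(MemorySegment inputData, long offset) {
    return inputData.get(ValueLayout.JAVA_BYTE, offset);
}
```

With an Arena.ofShared(), the mapped segment can be read by all the worker threads at once.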

There are some finicky details involved in finding the right place to split the
file, launching threads, waiting on them, and so on. Tending to these details sees
our code explode from the initial 17 lines to 120. You can review it
here
(for now we're using the commented-out "Variant 1" at
line 84).

Let's focus on a few key snippets.

First, the hot loop now looks like this:
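In outline, it's something like the sketch below (assuming a mapped inputData segment, a per-thread cursor and inputSize, and a HashMap<String, StationStats> named stats):

```java
while (cursor < inputSize) {
    long semicolonPos = findByte(cursor, ';');
    long newlinePos = findByte(semicolonPos + 1, '\n');
    String name = stringAt(cursor, semicolonPos);
    double temperature = Double.parseDouble(stringAt(semicolonPos + 1, newlinePos));
    stats.computeIfAbsent(name, StationStats::new).observe(temperature);
    cursor = newlinePos + 1;
}
```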

With the Streams API and BufferedReader gone, we run a hand-coded function,
findByte(), to find the separator characters. This avoids creating a string
for the whole line, but still creates strings for the name and the temperature,
using a method named stringAt(). Here are those two methods:
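Sketched out, they look roughly like this:

```java
private long findByte(long cursor, int b) {
    for (long i = cursor; i < inputSize; i++) {
        if (inputData.get(ValueLayout.JAVA_BYTE, i) == b) {
            return i;
        }
    }
    throw new RuntimeException(((char) b) + " not found");
}

private String stringAt(long start, long limit) {
    // Copies the bytes out of native memory and into a new heap-allocated String
    byte[] bytes = inputData.asSlice(start, limit - start).toArray(ValueLayout.JAVA_BYTE);
    return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);
}
```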

We also added code to collect the partial results from all the threads and merge
them. All put together, this brings the processing time down by a whopping 4x,
from 66 to 17 seconds!

On any other day, you might congratulate your professional self on this achievement. But
in the 1BRC, we were just getting started.

Let's check the perf stat report for this code.

We halved the number of instructions per row, down to 945.

Another flamegraph shows
that GC time has almost disappeared. But in VisualVM, we can see there are still a
lot of minor GC runs, although they use much less CPU because there's no more
promotion to the Old generation.

The CPU is now spending most of its time in stringAt(), primarily in string
creation and in copying the data from native memory into a heap-allocated
byte[]. Significant time is also spent in Map.computeIfAbsent(),
Double.parseDouble(), and findByte().

Let's attack temperature parsing first.

# Parse the temperature directly into an int

We can improve temperature parsing a lot. As it is, we first allocate a
String, then call parseDouble() on it, and then convert to int for
efficient storage and calculation. Instead, we should
create the integer directly:
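A sketch of that idea, reading the digits straight from the mapped segment and producing tenths of a degree as an int (helper and field names are illustrative):

```java
// Parse readings like "7.3", "-7.3", "12.3" or "-12.3" into tenths of a degree,
// without allocating a String or calling Double.parseDouble().
private int parseTemperature(long semicolonPos) {
    long off = semicolonPos + 1;
    int sign = 1;
    byte b = inputData.get(ValueLayout.JAVA_BYTE, off++);
    if (b == '-') {
        sign = -1;
        b = inputData.get(ValueLayout.JAVA_BYTE, off++);
    }
    int temp = b - '0';                                  // first (or only) integer digit
    b = inputData.get(ValueLayout.JAVA_BYTE, off++);
    if (b != '.') {
        temp = 10 * temp + (b - '0');                    // second integer digit
        off++;                                           // skip the '.' that must follow
    }
    b = inputData.get(ValueLayout.JAVA_BYTE, off);
    temp = 10 * temp + (b - '0');                        // the single fractional digit
    return sign * temp;
}
```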

This code is in the
same file as above,
only using "Variant 2" at
line 88.

With this change alone, our time drops by 1.6x, to 11 seconds. Not only do we
spend far fewer CPU cycles on parsing, we also eliminate the allocation of a
short-lived String.

And now, the perf stat report:

The instruction count dropped by another 1.6x, down to 607 per row.

The new flamegraph shows
that temperature parsing shrunk from 21% of total time to just 6%. The time
spent in stringAt() has also dropped, since we now call it only once per row. However,
it still looms large, at 35% of total time.

What can we do?

# Build a custom hashtable

We'd like to avoid calling stringAt() for the station name, because it
involves copying the data and initializing a new String instance. In almost all
cases, the only purpose of that instance is to find the existing entry in the
HashMap.

However, avoiding calls to stringAt() will be quite hard to accomplish if we
stick with HashMap. It expects us to pass in an instance of the key
class that is equal to an existing instance. We'd rather avoid that. So… maybe
we should build a custom hashtable?

This may sound crazy at first. Isn't Java supposed to already have a
super-optimized HashMap implementation? Were they all lying to us when they said
"don't be a smartass, use the standard library"?

Well, no. In general, they're absolutely right.

But here we have a highly specific, highly constrained problem, and we can do a
lot better with a custom implementation. Beyond our main motivation to get rid
of stringAt(), HashMap has to gracefully serve every use case, and
even defend against denial-of-service attacks.

We're fighting for every CPU cycle and want our implementation to do the bare
minimum that serves our purpose.

On top of all that, you'll find that implementing an open-addressed hashtable
with a predetermined size isn't all that much trouble. Here's most of the code
you need (full code
here):
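A sketch of the approach (names like StatsAcc and the table size are illustrative, not the exact code from the repo):

```java
// One accumulator per station. The name is identified by its location in the mapped
// file plus a cached hash, so lookups never need to allocate a String.
static class StatsAcc {
    long nameOffset, nameLen;
    int hash;
    int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
    long sum, count;

    StatsAcc(int hash, long nameOffset, long nameLen) {
        this.hash = hash;
        this.nameOffset = nameOffset;
        this.nameLen = nameLen;
    }

    void observe(int temperature) {
        min = Math.min(min, temperature);
        max = Math.max(max, temperature);
        sum += temperature;
        count++;
    }
}

private static final int TABLE_SIZE = 1 << 16;        // plenty of room for 413 stations
private static final int TABLE_MASK = TABLE_SIZE - 1;
private final StatsAcc[] table = new StatsAcc[TABLE_SIZE];

// Open addressing with linear probing: no resizing, no removal, no String keys.
private StatsAcc findAcc(long nameOffset, long nameLen, int hash) {
    int slot = hash & TABLE_MASK;
    while (true) {
        StatsAcc acc = table[slot];
        if (acc == null) {
            acc = new StatsAcc(hash, nameOffset, nameLen);
            table[slot] = acc;
            return acc;
        }
        if (acc.hash == hash && nameEquals(acc.nameOffset, acc.nameLen, nameOffset, nameLen)) {
            return acc;
        }
        slot = (slot + 1) & TABLE_MASK;                 // collision: probe the next slot
    }
}

// Byte-by-byte comparison of two names stored in the mapped file.
private boolean nameEquals(long off1, long len1, long off2, long len2) {
    if (len1 != len2) {
        return false;
    }
    for (long i = 0; i < len1; i++) {
        if (inputData.get(ValueLayout.JAVA_BYTE, off1 + i)
                != inputData.get(ValueLayout.JAVA_BYTE, off2 + i)) {
            return false;
        }
    }
    return true;
}
```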

And here's the main loop that uses it:
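Roughly like this (hashOf() stands in for a simple hash over the raw name bytes):

```java
while (cursor < inputSize) {
    long semicolonPos = findByte(cursor, ';');
    long newlinePos = findByte(semicolonPos + 1, '\n');
    int hash = hashOf(cursor, semicolonPos);
    StatsAcc acc = findAcc(cursor, semicolonPos - cursor, hash);
    acc.observe(parseTemperature(semicolonPos));        // no String is created anywhere
    cursor = newlinePos + 1;
}
```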

Instead of creating a string every time, we now just store the location of the
name within the file. With this, we have completely eliminated allocation inside
the hot loop. The GC thread is now idle.

This is great on its own, but the main benefit is something else: zero allocation
stops the churning of the CPU cache contents. Now the cache fills up with the
hashtable data, which is crucial because that's the part of our in-memory
state where we can't avoid random access.

Now our runtime is down to 6.6 seconds, a 1.7x speedup. Our approach was worth
it.

Let's check in with perf stat.

We improved instructions per row again by 1.6x. As predicted, cache misses
improved even more, by 2.2x.

And our flamegraph? It now shows
that 47% of total time is spent in findAcc(); we'll see if we can improve that.
It also points out that, as we optimized other things, 20% of our CPU time is
now spent parsing the temperature. Hmmmm. More to do!

But before we go on, let's check how GraalVM is working out for us. So far,
we've been running all our optimization steps on GraalVM. But how would this
code run on OpenJDK?

It would take 9.6 seconds, a full 3 seconds slower than GraalVM. In other words,
GraalVM gives us a 45% improvement!

# Beyond safe Java: sun.misc.Unsafe and SWAR

So far, we have improved on our initial solution by a clean order of magnitude:
from 66 seconds down to 6.6 seconds. Not bad. The techniques we applied are
relatively easy to digest and use standard, safe Java.

Even if you stop reading here, you'll walk away with a set of tricks that can
make you a performance hero in your own projects. But we want to push far
beyond this level of performance!

For that, 1BRC contestants looked to more esoteric optimizations.

Some of them are outright dangerous and can result in hard JVM crashes with a
low-level segmentation fault. Others are highly specific to this challenge.

Readability and maintainability also take a huge hit, while providing diminishing
returns in performance. But, a challenge is a challenge, and the contestants
pressed on without looking back!

For this next iteration, we'll apply several techniques that all the top solutions
shared:

  • Use sun.misc.Unsafe instead of MemorySegment to avoid bounds checks

  • Avoid re-reading the same input bytes: reuse the same loaded value for hashing
    and the semicolon search

  • Process the data 8 bytes at a time, using a SWAR technique to find the
    semicolon

  • Use @merykitty's magic SWAR (SIMD Within A
    Register) code to parse the temperature

Now check out the new hot loop in the linked solution. It's
looking pretty alien.

And wait till you see the methods it calls into.

That's a lot of bit-twiddling magic, but notice one general thing: there are
almost no if statements, and that's the point. We replaced branch instructions
with straight-through bitwise calculation.

The CPU tries to predict whether it will go into the "then" or the "else"
branch for each if statement, based on previous iterations. As a result, it
starts to decode the appropriate instructions before having all the data ready
to evaluate the condition.

So, every time it gets it wrong, it has to discard all that work and start
decoding the other instructions. As a rule of thumb, a single branch misprediction
costs as much as 10-15 instructions.


We also applied the SWAR idea: SIMD Within A Register, which means treating a
long number as a vector of 8 byte values, and performing the same operation on
each of them.

In our case, semicolonMatchBits() locates the ASCII semicolon byte and returns
a long with bits set to 1 where it was found. Then the method nameLen()
turns that bit pattern into the number telling us where it is. This comes from a
standard technique, used for example in C to efficiently determine the length of
a zero-terminated string.
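The two methods boil down to the classic zero-byte-detection bit trick; a sketch, assuming the 8 input bytes were loaded into the long in little-endian order (as on x86):

```java
private static final long SEMICOLON_PATTERN = 0x3B3B3B3B3B3B3B3BL;   // ';' in every byte

// Returns a long with the top bit set in each byte position that held a semicolon.
private static long semicolonMatchBits(long word) {
    long diff = word ^ SEMICOLON_PATTERN;               // bytes equal to ';' become 0x00
    return (diff - 0x0101010101010101L) & ~diff & 0x8080808080808080L;
}

// Turns the match bits into the number of name bytes preceding the first semicolon.
private static int nameLen(long matchBits) {
    return Long.numberOfTrailingZeros(matchBits) >>> 3;
}
```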

Read a detailed explanation of a very similar technique in this insightful post,
Finding Null Terminators without Branches,
by Richard Startin.

The method maskWord() takes a long containing 8 bytes of input data and
zeroes out all the bytes beyond the semicolon. We need this to perform a fast
name equality check.

The algorithm in parseTemperature() and dotPos() is a genius creation by
@merykitty (Quan Anh Mai), who made it
specifically for this challenge. It leverages the properties of the bit patterns
of ASCII '-' and '.', as well as a few other tricks, and produces the integer
value of the two or three temperature digits, accounting for all four possible
patterns (X.X, -X.X, XX.X and -XX.X) in a single go.

If you want to study it in more detail, keep in mind that the number string is
stored in the long in little-endian order. For example, this line:
long signed = (invNumberBytes << 59) >> 63; isolates bit number 4 of the first
byte (the one where the minus sign may appear) and sign-extends it across the
whole long.

This bit is 0 in the '-' sign, and 1 in all the digits. The operation is done
after flipping all the bits (~numberBytes), so this becomes either all 1's if
the byte is '-', or all 0's otherwise.

This parsing code deserves a blog post of its own, and it would distract us too
much to explain it in detail here. Instead, I've thrown in
@merykitty's original code, with his
comments expanded a bit more.

All these techniques put together result in a 2.8x speedup. From 6.6 seconds,
we're now down to 2.4 seconds. Our overall improvement is now 28x.

Here's what perf stat reports.

There's a huge drop in instruction count, by 3x. Since that's just 120
instructions per row now, we should look into making the same number of
instructions execute faster. One thing stands out: there are 0.66
branch misses per row.

Can we do something about that?

# Win with statistics: beat the branch predictor

The flamegraph for our current
solution indicates that the relative impact of the methods hasn't changed much. The
CPU spends 45% of its total time inside findAcc(), and nameEquals()
alone takes 19%.

On the surface, that looks quite efficient. It compares the station names eight
bytes at a time, and even uses the trick of reusing the mask operation on the
last block, passing it in as lastNameWord. However, it has a loop, which results
in unpredictable branching, and it re-reads the input name from memory.

How do we know the branch instruction in the loop will be unpredictable? The
answer lies in the statistics of the station names.

If most of the names are less than 8 bytes long, the condition that decides
whether to enter another iteration will almost always be false, and that would result
in a predictable branch instruction.

So, what's the actual distribution of the name lengths? While doing the
challenge, I wrote some code on the side, in a file called
Statistics.java,
to find out things like that.

The distribution() method prints out the statistical distribution of name
lengths. If you run it on a dataset generated using one of the scaffolding
scripts (create_measurements.sh) from the 1BRC repo, you'll see that there's
an almost even split between names up to 8 bytes long and names longer than that.

I also wrote a method that simulates the CPU's branch prediction
(branchPrediction()
in the same file). The CPU has a Branch History Table (BHT) that tracks the
behavior of each branch instruction in the hot loop.

An entry in the table is a 2-bit saturating counter which increments/decrements
depending on the outcome of the branch condition. Instead of overflowing, it
gets stuck at the min/max value (in other words, it saturates). When the counter
is 0 or 1, it predicts that the branch will be taken; if it's 2 or 3, it
predicts it won't be taken.

Running Statistics.branchPrediction() with the condition
nameLen > 8
results in 50% branch mispredictions. But if we change the condition in that
line of code to nameLen > 16, the mispredictions drop to just 2.5%.

Informed by this finding, it's clear that we have to write code that avoids
any branch instructions based on the condition nameLen > 8, and instead goes directly
for nameLen > 16.

To do that, we have to unroll the semicolon-searching loop along these lines:

  1. Perform the first two steps in a single go, without checking any conditions
  2. Use bit-twiddling logic to combine the results of finding the semicolon in
    each of the two long words
  3. Use the combined result in an if check, which now accounts for all of the
    initial 16 bytes

We also need specialized variants of findAcc() and nameEquals() for the
cases where the name is up to 16 bytes long, and for longer names.

In
my solution,
this reduces the time to 1.8 seconds: another 33% improvement, for a new
total improvement of ~40x.

And perf stat confirms our reasoning:

There's only a modest improvement in instructions per row, from 120 to 98. But
look at "branch-misses"! It dropped almost 8 times, from 0.657 to 0.084 per row.

This explains most of the speed improvement. Cache misses dropped as well, from
0.092 to 0.069 per row. This probably comes from the improved memory layout of
our stats accumulator, which now stores the first 16 name bytes inside the class
instance, and not in a separately allocated array.

Another metric that people like to track is instructions per cycle (IPC). We can
see that it improved in this last step from 1.89 to 2.19. Reducing branch
mispredictions means that the CPU has to do a lot less rework discarding the
speculatively executed instructions. This combines with the drop in
instructions per row to explain the overall 33% improvement.

# A note on measurement methodology

If we want to compare our current result of 1.8 seconds to the winner's 1.5
seconds, we have to take the measurement methodology into account.

All along this post, we have been reporting inner timings, which the code reports
for itself. The outer timing, as measured in the contest, includes both the JVM
startup and the cleanup at the end. This adds 200 milliseconds, so we're really
at 2.0 seconds compared to 1.5 seconds.

@thomaswue realized that around half of this time, 100 ms, is spent on unmapping
the memory-mapped file after the output has already been produced. He found a way to
avoid paying for this with a hack, which was immediately copied by all the other
top contenders. He started a subprocess that would actually do the work, so that
the parent process could exit as soon as it had forwarded all of the output. It would
then leave the subprocess to clean up in the background.

For this trick to work, contestants had to eliminate the JVM startup time as
well, otherwise they'd pay for it twice, which would cancel out all of the
improvement! As a result, this forced everyone to also use ahead-of-time
compilation into a native binary.

With these two tricks added, our outer timing becomes almost identical to the
inner timing we've been reporting, which means we're actually getting close!

# One last trick: work stealing with an atomic counter

At this level, we’re deep into low-level optimization territory. The tips that
additional enhance the timing are coupled to the detailed specifics of
the whole lot, reminiscent of:

  • CPU make and mannequin
  • Structure of the connection to the RAM subsystem
  • Minute particulars of the compiler

Explaining all these tricks would get us deep into the weeds, and wouldn't be
reusable knowledge. Instead, let me show you one last trick that's kind of cute,
and may come in handy in real-life scenarios.

The way we divide the work into a single chunk per thread, we can end up with
some threads getting "luckier" than others and finishing sooner. When that
happens, the CPU is underutilized for the rest of the computation.

To address this, we can introduce a small update that changes this to a larger
number of small, fixed-size chunks. Up front, we only calculate the number of
chunks, and then let the threads grab them and calculate their bounds when
they're ready.

The key element is ensuring that every chunk gets processed exactly once.

And the beauty of it is that it's almost a one-liner:
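In sketch form, using java.util.concurrent.atomic.AtomicInteger (chunkCount, CHUNK_SIZE and processChunk() are illustrative names):

```java
// Shared by all worker threads; each getAndIncrement() claims one chunk exactly once.
private static final AtomicInteger chunkSelector = new AtomicInteger();

// Inside each worker thread:
for (int chunk; (chunk = chunkSelector.getAndIncrement()) < chunkCount; ) {
    long chunkStart = fileStart + (long) chunk * CHUNK_SIZE;
    long chunkLimit = Math.min(chunkStart + CHUNK_SIZE, fileLimit);
    processChunk(chunkStart, chunkLimit);               // the same hot loop, over this slice
}
```

As before, each thread would still nudge its nominal chunk boundaries forward to the nearest newline so that no row gets split between chunks.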

And that's it! All the magic happens in the
atomic counter we increment.
This trick ekes out one last tenth of a second, down to 1.7 seconds.

# Conclusion

At the end of our 1BRC speedrun, we managed a 42x improvement over the parallel
streams implementation on OpenJDK, from 71 seconds down to 1.7. You may notice
that my official 1BRC result was quite a bit worse, at 2.3 seconds. The code in
this post is different from what I submitted; some of it I wrote just for the
post. It turned out that I had to choose between one last round of optimization
at 1BRC, or giving full attention to the challenge I received while getting hired at
QuestDB. I'm very glad I chose the latter!

The performance optimizations we went through are truly spectacular, but do keep
in mind that some of the gains come from dispensing with all the best practices
that apply to production code: validations, bounds checks, hashtable resizing,
and so on.

The sole purpose of this code was to be fast at processing one very particularly specified,
error-free input file. It has absolutely no tolerance for any kind of deviation;
for example, a single temperature reading that exceeds the three-digit maximum
would cause it to completely lose track, and probably crash.

But then, coding challenges are meant to be fun, and everybody knows input
validation is the opposite of fun!

Download QuestDB: open source under Apache 2.0. Blazing fast ingestion. SQL analytics.


