foojay – 12 Lessons Learned From Doing The 1BRC Challenge




Friends of OpenJDK Today

February 22, 2024

How fast can you process a text file with 1 billion rows in Java? That is the challenge that many Java developers tried to solve in January 2024.

Author(s)

  • Anthony Goubard

    Java developer since 1995, I have seen many technologies over the years: from J2ME to the cloud, from sockets to GraphQL. Now working as a freelancer for clients or for my … Learn more

The One Billion Row Challenge, or 1BRC, or 1️⃣🐝🏎, was a challenge to read a CSV file of 1 billion rows of “station name;temperature” data and compute the min/mean/max temperature per weather station as fast as possible.
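To make the task concrete, here is a minimal (and deliberately naive) sketch of what the challenge asks for, assuming the default “station name;temperature” line format and a file named measurements.txt; the competitive solutions are of course far more elaborate than this.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.DoubleSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class NaiveAverage {
    public static void main(String[] args) throws IOException {
        try (var lines = Files.lines(Path.of("measurements.txt"))) {
            // Group by station name; DoubleSummaryStatistics tracks min, average and max.
            Map<String, DoubleSummaryStatistics> stats = lines
                    .map(line -> line.split(";"))
                    .collect(Collectors.groupingBy(parts -> parts[0], TreeMap::new,
                            Collectors.summarizingDouble(parts -> Double.parseDouble(parts[1]))));
            stats.forEach((station, s) -> System.out.printf("%s=%.1f/%.1f/%.1f%n",
                    station, s.getMin(), s.getAverage(), s.getMax()));
        }
    }
}
```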

If you want to know which algorithm is the fastest, you can go to the 1BRC page. But the real answer is “it depends“.

It depends on:

  • The machine you are running it on
  • How far you are willing to go with hacks in your code
  • The JVM
  • The data in the file

It works on my machine

or to be more precise, it is faster on my machine.

CPU

My first submission ran in 23 seconds on my 2016 desktop (Intel), so I was quite surprised to hear that it took 41 seconds on the “super fast” CPU of the 1BRC server.

Lesson 1️⃣: Measure performance on production(-like) hardware.

To be fair, only 8 cores were used on the 1BRC server instead of the 32 available.
I then decided to use one HashMap per thread and combine them at the end instead of one global ConcurrentHashMap. This was 5 seconds faster on my desktop and 29 seconds faster on the 1BRC server.
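Here is a minimal sketch of that idea, with a hypothetical Agg accumulator and a tiny in-memory data set standing in for the chunks of the real file: each task fills its own HashMap and the maps are merged only once at the end.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadMaps {
    // Hypothetical per-station accumulator: min, max, sum and count of temperatures.
    record Agg(double min, double max, double sum, long count) {
        Agg merge(Agg o) {
            return new Agg(Math.min(min, o.min), Math.max(max, o.max), sum + o.sum, count + o.count);
        }
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory data set standing in for the chunks of the real file.
        List<String> lines = List.of("Amsterdam;12.3", "Zürich;8.1", "Amsterdam;15.0", "Zürich;-2.4");
        int threads = 2;
        int chunk = (lines.size() + threads - 1) / threads;

        List<Callable<Map<String, Agg>>> tasks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunk) {
            List<String> slice = lines.subList(i, Math.min(i + chunk, lines.size()));
            // Each task writes only to its own HashMap, so no synchronization is needed.
            tasks.add(() -> {
                Map<String, Agg> local = new HashMap<>();
                for (String line : slice) {
                    int sep = line.indexOf(';');
                    double temp = Double.parseDouble(line.substring(sep + 1));
                    local.merge(line.substring(0, sep), new Agg(temp, temp, temp, 1), Agg::merge);
                }
                return local;
            });
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, Agg> total = new HashMap<>();
        for (Future<Map<String, Agg>> result : pool.invokeAll(tasks)) {
            // Combine the per-thread maps once at the end.
            result.get().forEach((station, agg) -> total.merge(station, agg, Agg::merge));
        }
        pool.shutdown();

        total.forEach((station, a) -> System.out.printf("%s=%.1f/%.1f/%.1f%n",
                station, a.min(), a.sum() / a.count(), a.max()));
    }
}
```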

It also happened during the challenge that changes showing performance improvements on the Apple ARM CPU did not result in better performance on the 1BRC (AMD) CPU.

Lesson 2️⃣: Optimizations will produce different results depending on the hardware.

Hard disk

In the case of the 1BRC server, the file was in the disk cache since it is read many times. In a similar production scenario, it is quite likely that the big file will not be in the disk cache and that disk I/O will be more time consuming.

I had this problem locally, with my CPU at 50% while the hard disk was at 100%.

This was confirmed by my test: reading and parsing the file -> 18 seconds, reading the file (into byte[]) and doing nothing with the data -> 14 seconds.

On my laptop, the difference was even less significant: reading and parsing the file -> 5.6 seconds, reading the file and doing nothing with the data -> 5.1 seconds.

Lesson 3️⃣: The published results do not include file I/O, which may be the most time-consuming part on your machine.
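If you want to see how much of the time on your machine is pure file I/O, a simple way is to time a read-only pass that discards the data, along the lines of this sketch (the file name is assumed):

```java
import java.io.FileInputStream;
import java.io.IOException;

public class ReadOnlyTiming {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long totalBytes = 0;
        byte[] buffer = new byte[1 << 20]; // 1 MB buffer
        try (FileInputStream in = new FileInputStream("measurements.txt")) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                totalBytes += read; // read the bytes but do nothing with them
            }
        }
        System.out.printf("Read %d bytes in %.1f s%n", totalBytes, (System.nanoTime() - start) / 1e9);
    }
}
```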

Hack coding

The goal of the challenge is to get the result as fast as possible, with most hacks allowed.

Unsafe

At one point someone decided to use Unsafe to avoid the range checks when reading an array. Unsafe also lets you get more than 1 byte per call. As this is faster, it quickly went viral among the different solutions.
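For illustration only, here is a minimal sketch of reading 8 bytes at once with sun.misc.Unsafe, using the usual reflection trick to obtain the instance; the actual solutions typically read from memory-mapped data rather than from a small array like this.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeRead {
    public static void main(String[] args) throws Exception {
        // The usual reflection trick to get hold of the Unsafe instance.
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe unsafe = (Unsafe) field.get(null);

        byte[] data = "Amsterdam;12.3\n".getBytes();
        // Read 8 bytes in one call, without the bounds checks of data[i].
        long word = unsafe.getLong(data, (long) Unsafe.ARRAY_BYTE_BASE_OFFSET);
        System.out.printf("first 8 bytes as a long: 0x%016x%n", word);
    }
}
```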

Long

You can work with bytes, but you can also work with longs, which hold 8 bytes per value.

You can then use bit operations to find a specific byte inside the long. This can be quite efficient on a 64-bit (= 8 bytes) CPU.
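As an example, this sketch (my own illustration, not code from a specific submission) uses the classic SWAR bit trick to locate the first ';' inside a long read in little-endian byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SemicolonSwar {
    private static final long SEMIS = 0x3B3B3B3B3B3B3B3BL; // the byte ';' repeated 8 times

    // Nonzero iff one of the 8 bytes of 'word' is ';'. The lowest set bit marks the
    // first match, so numberOfTrailingZeros / 8 gives its byte index (little-endian).
    static long semicolonBits(long word) {
        long diff = word ^ SEMIS;                              // matching bytes become 0x00
        return (diff - 0x0101010101010101L) & ~diff & 0x8080808080808080L;
    }

    public static void main(String[] args) {
        byte[] bytes = "Zürich;5\n".getBytes(StandardCharsets.UTF_8);
        long word = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
        System.out.println("first ';' at byte " + Long.numberOfTrailingZeros(semicolonBits(word)) / 8);
    }
}
```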

Hash collisions

Since the default test file has fewer than 500 station names, a few participants thought it would be better to use the hash code as the key of a custom hash map instead of the station name.

Although Gunnar was generous in terms of hacks allowed (like starting a new process), this one was not allowed. But if you have a similar problem with a fixed set of keys, you could find a hash that does not conflict between the keys and use it, as in the sketch below.
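Here is a sketch of that idea (again my own illustration, not a 1BRC submission): for a small, fixed set of keys you can search for a table size where the default String hash sends every key to its own slot, after which lookups need no collision handling at all.

```java
import java.util.List;

public class CollisionFreeTable {
    public static void main(String[] args) {
        // A small, fixed set of keys known in advance (illustration only).
        List<String> stations = List.of("Amsterdam", "Zürich", "İzmir", "Paris");

        // Grow the table until every key's hash lands in its own slot,
        // so lookups never need collision handling or equals() on the name.
        int size = stations.size() - 1;
        String[] slots;
        do {
            size++;
            slots = new String[size];
            for (String station : stations) {
                int slot = (station.hashCode() & 0x7FFFFFFF) % size;
                if (slots[slot] != null) {
                    slots = null; // collision: try a bigger table
                    break;
                }
                slots[slot] = station;
            }
        } while (slots == null);

        System.out.println("collision-free table size: " + size);
        System.out.println("slot for Zürich: " + (("Zürich".hashCode() & 0x7FFFFFFF) % size));
    }
}
```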

Lesson 4️⃣: Accessing the data or copying the data can be quite CPU intensive.

Preview APIs

Preview APIs such as the Foreign Function & Memory API (FFM) and incubator APIs such as the new Vector API were used very often.

I joined the challenge to see how fast I could get the job done without any hacks. The answer is 10 seconds.

The JVM

The JVM parameters

The challenge allowed you to pass parameters when starting Java to fine-tune how the JVM should behave when running your program.

In the case of 1BRC, many decided to go with the Epsilon garbage collector, a garbage collector that does not do anything. As the program finishes in a few seconds, there is less need to worry about the accumulation of unused objects in memory.

Here is a list of some of the JVM parameters used by the participants.

-XX:+UnlockExperimentalVMOptions -XX:-EnableJVMCI -XX:+UseEpsilonGC -Xmx2G -Xms2G -XX:+AlwaysPreTouch
-XX:-TieredCompilation -XX:InlineSmallCode=10000 -XX:FreqInlineSize=10000
-XX:+UseTransparentHugePages
-XX:+UseShenandoahGC -XX:+UseStringDeduplication -XX:+UseTransparentHugePages -da
-XX:+TrustFinalNonStaticFields -XX:CICompilerCount=2 -XX:CompileThreshold=1000
-dsa -XX:+UseNUMA

Lesson 5️⃣: You have many options to get better performance without the need to change a single line of code.

The JVM Used

The top 3 results use a GraalVM native binary, which is known for its very fast startup time. It also has the advantage of optimizing the code at compile time, instead of using a JIT (Just In Time) compiler.

After that, it is a mix of 21.0.1-graal and 21.0.1-open.

In general, the GraalVM JIT gave slightly better results. In my case, GraalVM (JIT) gave mixed results: sometimes similar times, sometimes a few seconds (~10) slower. This was observed both on my computer and on the 1BRC server.

The fastest “no hacks” implementation, from Sam Pullara, is faster with GraalVM CE.

Specifying the JVM was quite easy as SDKMAN was used.

Lesson 6️⃣: You have many JVM distributions at your disposal; maybe one will better suit your needs.

The Data

Some of the very fast implementations are a few times slower when using the 10K station names instead of the default 500. Some of the submissions even failed.

Lesson 7️⃣: Different input data will result in different performance.

The 1BRC project includes a weather_stations.csv as an example. I assume that most of the implementations would fail with this file, as it starts with comment lines and has 4 digits after the comma for the temperatures.

Lesson 8️⃣: Be aware that optimization may come at the price of flexibility.

My implementation

My strategy here was to start with the K.I.S.S. (Keep It Simple, Stupid) optimizations first and profile for bottlenecks. Using JFR to profile gave me some ideas on where to optimize, but it did not result in the expected win very often.

-XX:StartFlightRecording=duration=15s,settings=profile,name=CalculateAverage_japplis,filename=flight-recorder.jfr,dumponexit=true
And -Dorg.eclipse.swt.browser.DefaultType=edge in jmc.ini on Windows

What you do not see in the proposed solutions are the many attempts and the code that produced worse performance.

So here are the lessons learned from my implementation:

  • 9️⃣ Using FileInputStream and new String(byte[], StandardCharsets.UTF_8) for text is faster than using FileReader
  • 1️⃣0️⃣ Using a HashMap and then sorting the keys when displaying the result is faster than using a TreeMap (see the sketch after this list)
  • 1️⃣1️⃣ Using one HashMap per thread and then combining the results is faster (especially on some machines) than one global ConcurrentHashMap
  • 1️⃣2️⃣ Putting a try {} catch {} outside a loop is much faster than having it inside the loop
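As a small sketch of lesson 1️⃣0️⃣ (my illustration, not the actual submission code): do all the work in a plain HashMap and sort the keys only once, when printing the result.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortAtTheEnd {
    public static void main(String[] args) {
        // Hot path: plain HashMap, no ordering cost on every insert.
        Map<String, Double> maxTemps = new HashMap<>();
        maxTemps.put("Zürich", 8.1);
        maxTemps.put("Amsterdam", 15.0);
        maxTemps.put("İzmir", 24.2);

        // Only when printing the result: sort the (few hundred) keys once.
        List<String> stations = new ArrayList<>(maxTemps.keySet());
        Collections.sort(stations);
        for (String station : stations) {
            System.out.println(station + "=" + maxTemps.get(station));
        }
    }
}
```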

Conclusion

  • Performance depends on multiple factors
  • Collaboration on ideas between participants gave the best results
  • With a bit of effort, you can find several performance improvements, as many of the no-hacks implementations were under 20″ compared to the baseline implementation at 4’49”.

Other interesting facts:

  • Adding ‘.parallel()‘ to the baseline implementation’s stream makes it 73% faster (2’15” -> 1’18” on my laptop)
  • The ChatGPT implementation (Antonio Goncalves’ entry) was slower than most of the results
  • Running on all 32 available cores made the fastest implementation run in 0.3″ instead of 1.5″
  • All of the results are incorrect: the station names should be sorted alphabetically, but the last station shown is İzmir while it should be Zürich.

And of course a big thank you to Gunnar Morling for organizing the challenge and spending a lot of time running the different proposed solutions.

The community was also great, either helping Gunnar with the organization or helping one another get the best performance during the challenge.

1BRC Links:
