A few cores too many


2023-10-12 11:26:56

:: performance, benchmarking, lost time

By: Ben Greenman

Performance matters for software systems, but performance is not always easy to measure. At the PRL we recently had a scare with some unreliable measurements. Here is the story.

Last year, we proposed a method for evaluating the performance of gradual type systems based on measuring every possible configuration of typed and untyped code that a programmer might explore (pdf). Given the freedom that gradual typing offers, this is the only realistic way to measure the performance of a gradual type system.

But it is a lot to measure! While developing the method, we spent over 3 months benchmarking a total of 75,844 configurations. Each configuration is a full program, and some gradual typings caused programs to slow by 50x or even 100x, so many of these configurations took minutes to run.
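To see where the 2^N blowup comes from: each of a program's N modules is independently typed or untyped, and the method benchmarks every combination. A minimal sketch (the module names are hypothetical, not from any benchmark in the paper):

```python
from itertools import product

def configurations(modules):
    """Enumerate every gradually typed configuration of a program.

    Each module is independently typed (True) or untyped (False),
    so a program with N modules has 2^N configurations.
    """
    for bits in product([False, True], repeat=len(modules)):
        yield dict(zip(modules, bits))

# A 4-module program already has 2^4 = 16 configurations to benchmark.
configs = list(configurations(["main", "parser", "eval", "env"]))
```

At 75,844 configurations, even a few minutes per run adds up to months of machine time, which is what motivated both lines of work below.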

The next question we asked was naturally "how can we scale this method to large software projects?" In our case, the number of gradually typed configurations scaled exponentially with the number of modules. Existing gradual type systems for Python and JavaScript are exponential in the number of variables in the program.

We explored two options:

  1. Max New began work on a prediction model (inspired by work on software product lines) to estimate the performance of 2^N configurations after polynomially-many measurements.
  2. Asumu Takikawa and I shopped for a multi-core computer.

By Thanksgiving, we had bought a Linux machine with 2 AMD Opteron 6376 2.3GHz processors (16 cores each) and put it to work running benchmarks on 29 cores simultaneously. Life was good.

Later that winter, Max implemented a prediction algorithm. The basic idea was to focus on boundaries between modules and isolate their effect on performance. If two modules are untyped, their boundary should have zero cost. If the same two modules are typed, their boundary might yield an overall performance improvement thanks to type-driven optimizations. And if one module is typed and the other untyped, their boundary will suffer some cost of type checking at runtime. In general a program with N modules has at most N(N - 1) / 2 internal boundaries, so it is far more time-efficient to measure only the boundaries than to benchmark 2^N gradually typed configurations.
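A simplified sketch of this boundary idea (this is not Max's actual algorithm, just the additive model the paragraph describes; the module names and cost numbers are made up): predict a configuration's time as the fully untyped baseline plus the measured cost of each internal boundary, keyed by whether each endpoint is typed.

```python
def predict_time(baseline, boundary_cost, typed):
    """Predict a configuration's running time from pairwise boundary costs.

    baseline      -- running time of the fully untyped configuration
    boundary_cost -- maps (module_a, module_b, a_typed, b_typed) to a
                     measured cost; may be negative (a speedup)
    typed         -- maps each module name to True (typed) or False
    Missing keys default to 0, e.g. an untyped/untyped boundary.
    """
    total = baseline
    mods = sorted(typed)
    for i, a in enumerate(mods):
        for b in mods[i + 1:]:
            total += boundary_cost.get((a, b, typed[a], typed[b]), 0.0)
    return total

costs = {
    ("client", "server", True, False): 3.0,   # typed/untyped: runtime checks
    ("client", "server", False, True): 2.5,
    ("client", "server", True, True): -0.5,   # typed/typed: optimization win
}
mixed = predict_time(10.0, costs, {"client": True, "server": False})
```

Measuring the N(N - 1) / 2 boundaries takes polynomially-many runs, after which every one of the 2^N configurations can be predicted for free.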

Fast-forward to March: we had a prototype prediction algorithm and it was time to test. Again using 29 cores (because, why not), we gathered cost/benefit numbers for one 4-module benchmark and used them to predict performance for its 16 configurations. The results were not very good.

Figure 1: True running time vs. predicted running time for 16 configurations

The green circles are the ground truth, the average running time after 5 iterations of each config. The blue triangles are what we predicted. Except for configurations 0 and 8, the triangles are FAR off from the truth. Some are even negative … clearly the algorithm needs work.

But then, out of frustration, desperation, or just good luck, Max compared the predictions to ground truth data gathered on a single core, leaving the other 31 cores idle.

Figure 2: Predictions made using measurements from a single core

First off, the red "sequential truth" dots are slightly closer to the predicted triangles. Second, and this is the scary part, the red dots are very different from the green dots. Running on 1 core vs. 29 cores should not change the measurements!

From here we tried increasing the running time of the benchmark, removing I/O and system calls, checking for hyperthreading (AMD cores do not support it), and even changing the cores' CPU governor. The hope was that results taken from 1 core would match results from N cores, for some N > 1. It turned out N = 2 was stable, but even for N = 3 we found graphs like the following:
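For reference, one way to pin a benchmark to a single core on Linux (we are not showing the actual harness here; this mirrors what `taskset -c 0` does, using Python's Linux-only affinity API):

```python
import os

# Restrict the current process to one core before timing, as in
# `taskset -c 0`, then restore the original mask afterwards.
original = os.sched_getaffinity(0)   # set of cores we may run on
one_core = {min(original)}
os.sched_setaffinity(0, one_core)    # pin to a single core
# ... run benchmark iterations here, free of cross-core interference ...
os.sched_setaffinity(0, original)    # restore the original affinity
```

Pinning rules out OS-level migration between cores, but as the graphs below show, it does not help when other cores on the same chip are busy running sibling benchmarks.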


Figure 3: Exact running times. Same-colored dots in each column should be tightly clustered.

This data is for the same 16 configurations as the previous two graphs. Green dots are exact running times measured with 25 cores. Red dots are running times measured with 1 core. The red dots are much closer together, and always unimodal. The green dots are evidence that maybe the 32-core machine has, as Jan Vitek put it, 30 cores too many.

"Oh my. You think it'll never happen to you. Well, now I've learned my lesson."

And so, we said goodbye to the last 4 months of data and started over, running at most two cores. The new results are all stable, but still we keep pinching ourselves.

P.S. the results from POPL 2016 are just fine, as they were not taken on the new machine running more than 2 cores. If you have time to check, that data is in our artifact and in the gradual-typing-performance repo.


