
How a Single Line of Code Made a 24-core Server Slower Than a Laptop

2023-06-17 21:08:10


Imagine you wrote a program for a pleasingly parallel problem,
where each thread does its own independent piece of work,
and the threads don't need to coordinate except for joining the results at the end.
Obviously you'd expect that the more cores it runs on, the faster it is.
You benchmark it on a laptop first and indeed you find it scales
almost perfectly on all of the 4 available cores. Then you run it on a big, fancy, multiprocessor
machine, expecting even better performance, only to see that it actually runs slower
than the laptop, no matter how many cores you give it. Uh. That has just happened to me recently.

I've been working recently on a Cassandra benchmarking tool, Latte,
which is probably the most efficient Cassandra benchmarking tool you can get, both in terms of CPU use and memory use.
The whole idea is very simple: a small piece of code generates data and executes a bunch of
asynchronous CQL statements against Cassandra.
Latte calls that code in a loop and records how long each iteration took.
Finally, it performs a statistical analysis and displays the results in various forms.

Benchmarking seems to be a really nice problem to parallelize.
As long as the code under benchmark is stateless, it can be fairly trivially called
from multiple threads. I've blogged about how to achieve that in Rust
already here and
here.

However, at the time I wrote those earlier blog posts, Latte's workload definition capabilities were rather limited.
It came with only two predefined, hardcoded workloads, one for reading and another one for writing.
There were just a few things you could parameterize, e.g. the number or the sizes of table columns, but nothing really fancy.
No secondary indexes. No custom filtering clauses. No control over the CQL text. Really nothing.
So, overall, Latte at that time was more of a proof-of-concept than a universal tool for doing real work.
Sure, you could fork it and write a new workload in Rust, then compile everything from source. But who wants to waste time
learning the internals of a niche benchmarking tool?

Rune scripting

So last year, in order to be able to measure the performance of storage attached indexes in Cassandra,
I decided to integrate Latte with a scripting engine that would allow me to easily define workloads without recompiling
the whole program. After playing a bit with embedding CQL statements in TOML config files (which turned out to be both messy and limited at the same time),
and through having some fun with embedding Lua (which is probably great in the C world, but didn't play as nicely with Rust as I expected, although it kinda worked),
I eventually ended up with a design similar to that of sysbench but
with an embedded Rune interpreter instead of Lua.

The main selling points of Rune that convinced me were painless Rust integration and support for async code.
Thanks to async support, the user can execute CQL statements directly in the workload scripts, leveraging the asynchronous nature
of the Cassandra driver. Additionally, the Rune team is amazingly helpful and removed anything that blocked me in virtually no time.

Here is an example of a complete workload that measures the performance of selecting rows by random keys:

const ROW_COUNT = latte::param!("rows", 100000);

const KEYSPACE = "latte";
const TABLE = "basic";

pub async fn schema(ctx) {
    ctx.execute(`CREATE KEYSPACE IF NOT EXISTS ${KEYSPACE} 
                    WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }`).await?;
    ctx.execute(`CREATE TABLE IF NOT EXISTS ${KEYSPACE}.${TABLE}(id bigint PRIMARY KEY)`).await?;
}

pub async fn erase(ctx) {
    ctx.execute(`TRUNCATE TABLE ${KEYSPACE}.${TABLE}`).await?;
}

pub async fn prepare(ctx) {
    ctx.load_cycle_count = ROW_COUNT;
    ctx.prepare("insert", `INSERT INTO ${KEYSPACE}.${TABLE}(id) VALUES (:id)`).await?;
    ctx.prepare("select", `SELECT * FROM ${KEYSPACE}.${TABLE} WHERE id = :id`).await?;
}

pub async fn load(ctx, i) {
    ctx.execute_prepared("insert", [i]).await?;
}

pub async fn run(ctx, i) {
    ctx.execute_prepared("select", [latte::hash(i) % ROW_COUNT]).await?;
}

You can find more information on how to write these scripts in the README.

Benchmarking the benchmarking program

Although the scripts are not JIT-compiled to native code yet, they are acceptably fast, and thanks to the limited amount of code they
typically contain, they don't show up at the top of the profile. I've empirically found that the overhead of the Rust-Rune FFI was lower than that of
Rust-Lua provided by mlua, probably due to the safety checks employed by mlua.

Initially, to assess the performance of the benchmarking loop, I created an empty script:

pub async fn run(ctx, i) {
}

Even though there is no function body there, the benchmarking program needs to do some work to actually run it (a rough sketch of that loop follows the list):

  • schedule N parallel asynchronous invocations using buffer_unordered
  • set up fresh local state (e.g. the stack) for the Rune VM
  • invoke the Rune function, passing the parameters from the Rust side
  • measure the time it took to complete each returned future
  • collect logs, update HDR histograms and compute other statistics
  • and run all of that on M threads using the Tokio threaded scheduler
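For illustration only, here is a minimal sketch of such a loop, assuming a hypothetical run_once function standing in for the workload invocation (this is not Latte's actual code):

use futures::stream::{self, StreamExt};
use std::time::{Duration, Instant};

// Hypothetical stand-in for a single workload invocation.
async fn run_once(_iteration: u64) {}

// Runs `iterations` calls, keeping up to `parallelism` of them in flight at once,
// and collects the latency of each call.
async fn run_loop(parallelism: usize, iterations: u64) -> Vec<Duration> {
    stream::iter(0..iterations)
        .map(|i| async move {
            let start = Instant::now();
            run_once(i).await;      // invoke the workload function
            start.elapsed()         // record how long this call took
        })
        .buffer_unordered(parallelism)
        .collect()
        .await
}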

The results on my old 4-core laptop with an Intel Xeon E3-1505M v6 locked at 3 GHz looked very promising:

Because there are 4 cores, the throughput increases linearly up to 4 threads. Then it increases slightly more
up to 8 threads, thanks to hyper-threading that squeezes a bit more performance out of each core. Obviously there is no
performance improvement beyond 8 threads, because all CPU resources are saturated at that point.

I was also happy with the absolute numbers I got. A few million empty calls per second on a laptop sounds like
the benchmarking loop is lightweight enough not to cause significant overhead in real measurements. A local Cassandra server
launched on the same laptop can only do about 200k requests per second when fully loaded, and that's only if those requests are
stupidly simple and all the data fits in memory.

By the way, after adding some real code for data generation in the body, but still with no calls to the database, everything got
proportionally slower as expected, but not more than 2x slower, so it was still in the "millions of ops per second" range.

That was easy. I could have stopped here and declared victory. However, I was curious how fast it could go on a bigger machine with more cores.

Running an empty loop on 24 cores

A server with two Intel Xeon CPU E5-2650L v3 processors, each with 12 cores running at 1.8 GHz, should obviously be a lot faster than an old 4-core laptop, shouldn't it?
Well, maybe with 1 thread it would be slower because of the lower CPU frequency (1.8 GHz vs 3 GHz), but it should make up for that by having many more cores.

Let the numbers speak for themselves:

You'll agree there is something wrong here. Two threads are better than one… and that's basically it.
I couldn't squeeze more throughput than about 2 million calls per second,
which was about 4x worse than the throughput I got on the laptop.
Either the server was a lemon or my program had a serious scalability issue.

Investigation

When you hit a performance problem, the most common way of investigating it is to run the code under a profiler.
In Rust, it is very easy to generate flamegraphs with cargo flamegraph.
Let's compare the flamegraphs collected when running the benchmark with 1 thread vs 12 threads:

flamegraph 1 thread

flamegraph 12 threads

I was expecting to find a single thing acting as a bottleneck, e.g. a contended mutex or something similar, but to my surprise there was nothing obvious there.
There wasn't even a single bottleneck! Rune's VM::run code seemed to take about 1/3 of the time, but the rest was simply taken
by polling futures, and quite likely the culprit got inlined and disappeared from the profile.

Anyway, because of VM::run and the path rune::shared::assert_send::AssertSend also leading to Rune, I decided to disable the code responsible for
calling the Rune function, and I reran the experiment with just a loop running an empty future, albeit with the timing and statistics code still enabled:

/// Executes a single iteration of a workload.
/// This should be idempotent –
/// the generated action should be a function of the iteration number.
/// Returns the end time of the query.
pub async fn run(&self, iteration: i64) -> Result<Instant, LatteError> { ... }

That scaled fine to over 100M calls per second on 48 threads!
So the problem had to be somewhere below the Program::async_call function:

/// Compiled workload program
pub struct Program {
    sources: Sources,
    context: Arc<RuntimeContext>,
    unit: Arc<Unit>,
}

/// Executes the given async function with args.
/// If execution fails, emits diagnostic messages, e.g. a stacktrace, to the standard error stream.
/// Also signals an error if the function execution succeeds, but the function returns
/// an error value.
pub async fn async_call(
    &self,
    fun: FnRef,
    args: impl Args + Send,
) -> Result<Value, LatteError> {
    let handle_err = |e: VmError| {
        let mut out = StandardStream::stderr(ColorChoice::Auto);
        let _ = e.emit(&mut out, &self.sources);
        LatteError::ScriptExecError(fun.name, e)
    };
    let execution = self.vm().send_execute(fun.hash, args).map_err(handle_err)?;
    let result = execution.async_complete().await.map_err(handle_err)?;
    self.convert_error(fun.name, result)
}

/// Initializes a fresh virtual machine needed to execute this program.
/// This is extremely lightweight.
fn vm(&self) -> Vm {
    Vm::new(self.context.clone(), self.unit.clone())
}

The async_call function does a few things:

  • it prepares a fresh Rune VM – this is supposed to be a very lightweight operation that essentially just sets up a fresh stack; the VMs are not shared between calls nor threads, so they can run fully independently
  • it invokes the function by passing its identifier and parameters
  • finally, it receives the result and converts some errors; we can safely assume that in an empty benchmark this is a no-op

My next idea was to simply remove the send_execute and async_complete calls and leave just the VM preparation.
So essentially I wanted to benchmark this line:

Vm::new(self.context.clone(), self.unit.clone())

The code looks fairly innocent. No locks, no mutexes, no syscalls, no shared mutable data here.
There are some read-only structures, context and unit, shared behind an Arc, but read-only sharing
shouldn't be a problem.

Vm::new is also trivial:

impl Vm {

    /// Construct a new virtual machine.
    pub const fn new(context: Arc<RuntimeContext>, unit: Arc<Unit>) -> Self {
        Self::with_stack(context, unit, Stack::new())
    }

    /// Construct a new virtual machine with a custom stack.
    pub const fn with_stack(context: Arc<RuntimeContext>, unit: Arc<Unit>, stack: Stack) -> Self {
        Self {
            context,
            unit,
            ip: 0,
            stack,
            call_frames: vec::Vec::new(),
        }
    }
}

However, no matter how innocent the code looks, I like to double-check my assumptions.
I ran that with different numbers of threads and, although it was now faster than before,
it again didn't scale at all – it hit a throughput ceiling of about 4 million calls per second!

The problem

Although at first sight it doesn't look like there is any sharing of mutable data in the code above, there actually
is something slightly hidden that is shared and mutated: the Arc reference counters themselves.
Those counters are shared between all the invocations, performed from many threads, and
they are the source of the congestion here.
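To make the hidden mutation explicit, here is a rough conceptual sketch (simplified, not the real std::sync::Arc implementation) of what every clone and drop of an Arc does under the hood:

use std::sync::atomic::{AtomicUsize, Ordering};

// Simplified model of Arc's shared state: a single counter that lives on one
// cache line and is touched by every thread that clones or drops the Arc.
struct SharedCounter {
    strong: AtomicUsize,
}

impl SharedCounter {
    // What `Arc::clone` does, conceptually: an atomic read-modify-write
    // on memory shared by all threads holding a clone.
    fn increment(&self) {
        self.strong.fetch_add(1, Ordering::Relaxed);
    }

    // What dropping an Arc does, conceptually: another atomic read-modify-write
    // on the same shared cache line.
    fn decrement(&self) {
        self.strong.fetch_sub(1, Ordering::Release);
    }
}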

Some may argue that atomically incrementing or decrementing a shared atomic counter shouldn't be a problem because these
are "lockless" operations. They even translate to single assembly instructions (e.g. lock xadd)! If something is a single assembly
instruction, it can't be slow, can it? That reasoning is unfortunately flawed.

The root of the problem is not really the computation, but the cost of maintaining the shared state.

The amount of time required to read or write data is mostly determined by how far the CPU core needs to reach out for the data.
Here are the typical latencies for Intel Haswell Xeon CPUs according to this site:

  • L1 cache: 4 cycles
  • L2 cache: 12 cycles
  • L3 cache: 43 cycles
  • RAM: 62 cycles + 100 ns

L1 and L2 caches are typically local to a core (L2 may be shared by two cores). The L3 cache is shared by all cores of a CPU.
There is also a direct interconnect between the L3 caches of different processors on the main board for managing L3 cache coherency, so L3 is
logically shared between all processors.

As long as you don't update the cache line and only read it from multiple threads, the line will be loaded by multiple cores
and marked as shared. Frequent accesses to such data are likely to be served from the L1 cache, which is very fast. Therefore sharing
read-only data is perfectly fine and scales well. Even using atomics only for reading will be plenty fast in that case.

However, once we introduce updates to the shared cache line, things start to get complicated. The x86-amd64 architecture has coherent data caches.
This basically means that what you write on one core, you can read back on another one. It is not possible to store a cache line with conflicting data
on multiple cores. Once a thread decides to update a shared cache line, that line gets invalidated on all the other cores, so subsequent loads
on those cores have to fetch the data from at least L3. That is obviously a lot slower, and even slower still if there
is more than one processor on the main board.

The fact that our reference counters are atomic is an additional problem that makes things even more complex for the processor.
Although using atomic instructions is often called "lockless programming", this is slightly misleading –
in fact, atomic operations require some locking to happen at the hardware level. This locking is very fine-grained and cheap as long as there is no congestion,
but as usual with locking, you can expect very poor performance if many things try to fight for the same lock at the same time. And it is of course
much worse if those "things" are whole CPUs and not just single cores that are close to each other.
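As a rough illustration of this effect (a toy experiment I'm sketching here, not a measurement from the original investigation), a benchmark like the following would be expected to show one shared counter scaling far worse than per-thread counters, especially on a multi-socket machine:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

const THREADS: usize = 24;
const ITERS: u64 = 10_000_000;

fn bump(counter: &AtomicU64, iters: u64) {
    for _ in 0..iters {
        counter.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    // Contended case: all threads update the same atomic, i.e. the same cache line.
    let shared = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let c = Arc::clone(&shared);
            thread::spawn(move || bump(&c, ITERS))
        })
        .collect();
    handles.into_iter().for_each(|h| h.join().unwrap());
    println!("shared counter:      {:?}", start.elapsed());

    // Uncontended case: each thread owns its counter; the operations are just as
    // atomic, but no cache line ever bounces between cores.
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| thread::spawn(|| bump(&AtomicU64::new(0), ITERS)))
        .collect();
    handles.into_iter().for_each(|h| h.join().unwrap());
    println!("per-thread counters: {:?}", start.elapsed());
}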

The repair

The obvious fix is to avoid sharing the reference counters. Latte has a very simple, hierarchical lifecycle structure,
so all these Arc updates looked like overkill to me, and they could probably be replaced with plain references and Rust lifetimes.
However, this is easier said than done. Unfortunately Rune requires the references to the Unit and the RuntimeContext
to be passed wrapped in Arc for managing their lifetime (probably needed in more complex scenarios), and it also uses some Arc-wrapped values internally as part of those
structures. Rewriting Rune just for my tiny use case was out of the question.

Therefore the Arc had to stay. Instead of using a single Arc value, we can use one Arc per thread.
That also requires separating the Unit and RuntimeContext values, so each thread gets its own.
As a side effect, this ensures there is no sharing at all, so even if Rune clones an Arc stored internally as part of those values, that problem is also fixed.
The downside of this solution is higher memory use. Fortunately, Latte workload scripts are usually tiny, so higher memory use is likely not a big problem.

To be able to use separate Unit and RuntimeContext values, I submitted a patch to Rune to make them Clone.
Then, on the Latte side, the whole fix was really just introducing a new function for "deep" cloning the Program struct and then making sure
each thread gets its own copy:

    /// Makes a deep copy of context and unit.
    /// Calling this method instead of `clone` ensures that the Rune runtime structures
    /// are separate and can be moved to different CPU cores efficiently without accidental
    /// sharing of Arc references.
    fn unshare(&self) -> Program {
        Program {
            sources: self.sources.clone(),
            context: Arc::new(self.context.as_ref().clone()),   // clones the value under Arc and wraps it in a new counter
            unit: Arc::new(self.unit.as_ref().clone()),         // clones the value under Arc and wraps it in a new counter
        }
    }

BTW: The sources field is not used during execution, except for emitting diagnostics, so it could be left shared.

Note that the original line where I initially found the slowdown didn't need any changes at all!

Vm::new(self.context.clone(), self.unit.clone())

This is because self.context and self.unit are no longer shared between threads.
Atomic updates to counters that aren't shared are fortunately very fast.
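For completeness, here is a hypothetical sketch of how the unshared copies could be handed out to worker tasks (the wrapper function and its parameters are made up for illustration; this is not Latte's actual setup code):

// Hypothetical wiring: compile the workload once, then give every Tokio task
// its own deep copy, so no Arc reference counters are shared across threads.
async fn spawn_workers(program: Program, thread_count: usize) {
    let mut handles = Vec::new();
    for _ in 0..thread_count {
        // `unshare` deep-clones context and unit, giving this task private Arcs.
        let program = program.unshare();
        handles.push(tokio::spawn(async move {
            // ... run the benchmarking loop using this task-local copy of `program`
            drop(program);
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}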

Final results

Now the throughput scales linearly up to 24 threads, as expected:

Takeaways

  • The cost of a shared Arc can be absurdly high on some hardware configurations if it is updated frequently by many threads.
  • Don't assume that a single assembly instruction can't become a performance problem.
  • Don't assume that if something scales fine on a single-CPU computer, it will still scale on a multi-CPU computer.
