How many CPU cores can you actually use in parallel?

2023-12-18 10:04:00

When you're writing a CPU-intensive parallel program, you usually want a thread or process pool sized to the number of CPU cores on your machine.
Fewer threads and you're not taking advantage of all the cores; more than that and your program will start running slower as multiple threads compete for the same core.
Or that's the theory, anyway.

So how do you check how many cores your computer has?
And is this actually good advice?

It turns out to be surprisingly tricky to nail down how many threads to run:

  • The Python standard library provides multiple APIs to get this information, but none are sufficient.
  • Even worse, because of CPU features like instruction-level parallelism and simultaneous multithreading (aka Hyper-Threading on Intel CPUs), the number of cores you can effectively use depends on the code you have written!

Let's see why it's so difficult to figure out how many CPU cores your program can use, and then consider a potential solution.

Getting the number of CPU cores from Python

If you read the Python standard library documentation, it has an os.cpu_count() function that returns "the number of logical CPUs in the system".
What does logical mean?
We'll get to that in a bit.

The documentation also tells you that "len(os.sched_getaffinity(0)) gets the number of logical CPUs the calling thread of the current process is restricted to".
Scheduler affinity is a way to restrict a process to particular cores.
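To make this concrete, here is a minimal, Linux-only sketch (not from the original text) showing that restricting our own process's affinity changes what sched_getaffinity() reports, while os.cpu_count() is unaffected:

```python
import os

# Pin this process to a single core, then observe the APIs.
original = os.sched_getaffinity(0)
os.sched_setaffinity(0, {min(original)})

print(len(os.sched_getaffinity(0)))  # 1: only one allowed core now
print(os.cpu_count())                # unchanged: all logical CPUs

os.sched_setaffinity(0, original)    # restore the original affinity
```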

Unfortunately, this API is not sufficient either.
For example, on Linux the cgroups API, used to implement Docker and other container systems, has a variety of ways to limit CPU usage.
Here we restrict the container to the equivalent of 2.25 cores; the mechanism is different, but the effects will be similar:

$ docker run -i -t --cpus=2.25 python:3.12-slim
Python 3.12.1 (main, Dec  9 2023, 00:21:37) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.cpu_count()
20
>>> len(os.sched_getaffinity(0))
20

We can only use the equivalent of 2.25 cores at a time, but neither API knows about this.
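If you do need to detect a container CPU limit yourself, one option on modern Linux is to read the cgroup v2 cpu.max file. A hedged sketch (the path assumes cgroup v2 mounted at /sys/fs/cgroup; cgroup v1 systems lay this out differently):

```python
from pathlib import Path

def cgroup_cpu_limit():
    """Return the cgroup v2 CPU limit in cores, or None if unrestricted."""
    try:
        quota, period = Path("/sys/fs/cgroup/cpu.max").read_text().split()
    except (FileNotFoundError, ValueError):
        return None  # no cgroup v2 limit file, or unexpected contents
    if quota == "max":
        return None  # no quota configured
    return int(quota) / int(period)  # e.g. 225000 / 100000 = 2.25

print(cgroup_cpu_limit())
```

Inside the container above this would report 2.25; on an unrestricted machine it reports None.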

What's a logical CPU?

Operating system restrictions are just the beginning of our troubles, but before we see an example we need to understand what physical and logical CPU cores are.
My computer uses an Intel i7-12700K processor, which has:

  • 12 physical cores (8 performance cores, and 4 less powerful ones).
  • 20 logical cores.

Modern CPU cores can execute multiple instructions in parallel.
But what happens if the CPU is stuck waiting for some data to be loaded from RAM?
It may not be able to do any work until that happens.

To allow usage of these potentially wasted resources, a physical CPU core's computational resources can be exposed as multiple cores to the operating system.
On my CPU, each of the 8 faster cores can be exposed as two cores, for a total of 16 logical cores.
Each pair of logical cores shares the computational resources of a single physical core.
For example, if a logical core isn't fully utilizing all the internal arithmetic logic units, say because it's waiting for a memory load, the code running on the paired logical core can still use those idle resources.

This technology is called simultaneous multithreading, or Hyper-Threading in Intel's terminology.
If you have a PC, you can usually disable it in the BIOS.

So now we have a new question.
Putting aside scheduler affinity and the like, should we use the number of physical or logical cores as our thread pool size?
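As an aside (not from the original text): the standard library only reports logical cores, but the third-party psutil package can report both counts, which is handy for experimenting with this question:

```python
import psutil  # third-party: pip install psutil

logical = psutil.cpu_count(logical=True)    # same as os.cpu_count()
physical = psutil.cpu_count(logical=False)  # may be None if undetectable

print(logical, physical)
```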

An embarrassingly-parallel example

Let's consider two functions that are compiled to machine code with Numba.
We make sure to release the GIL to enable parallelism.

Both functions do the same thing, but one is much faster than the other.
We can run these functions in parallel on multiple threads, and in theory get linear improvements in throughput until we run out of cores, just by processing more images in parallel.

from numba import njit
import numpy as np

@njit(nogil=True)
def slow_threshold(img, noise_threshold):
    noise_threshold = img.dtype.type(noise_threshold)
    result = np.empty(img.shape, dtype=np.uint8)
    for i in range(result.shape[0]):
        for j in range(result.shape[1]):
            result[i, j] = img[i, j] // 256
    for i in range(result.shape[0]):
        for j in range(result.shape[1]):
            if result[i, j] < noise_threshold // 256:
                result[i, j] = 0
    return result

@njit(nogil=True)
def fast_threshold(img, noise_threshold):
    noise_threshold = np.uint8(noise_threshold // 256)
    result = np.empty(img.shape, dtype=np.uint8)
    for i in range(result.shape[0]):
        for j in range(result.shape[1]):
            value = img[i, j] >> 8
            value = (
                0 if value < noise_threshold else value
            )
            result[i, j] = value
    return result

We'll run the functions on an image and measure how long they take to run:

rng = np.random.default_rng(12345)

def make_image(size=256):
    noise = rng.integers(0, high=1000, size=(size, size), dtype=np.uint16)
    signal = rng.integers(0, high=5000, size=(size, size), dtype=np.uint16)
    # A noisy, hard to predict image:
    return noise | signal

NOISY_IMAGE = make_image()
assert np.array_equal(
    slow_threshold(NOISY_IMAGE, 1000),
    fast_threshold(NOISY_IMAGE, 1000)
)

Here's how long it takes to run each of the functions on a single core:

%timeit slow_threshold(NOISY_IMAGE, 1000)
90.6 µs ± 77.7 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

and

%timeit fast_threshold(NOISY_IMAGE, 1000)
24.6 µs ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Curious why the fast function is so much faster?
You may want to read a book I am working on about optimizing low-level code.

Scaling to multiple threads

Now that we have a couple of functions, we'll set up a way to process a given list of images with a thread pool:

from multiprocessing.dummy import Pool as ThreadPool

def apply_in_thread_pool(
    num_threads, function, images
):
    with ThreadPool(num_threads) as pool:
        result = pool.map(
            lambda img: function(img, 1000),
            images
        )
        assert len(result) == len(images)

Next, we'll graph how long it takes to run with varying numbers of threads for the different functions, using the benchit library (you can also use perfplot, but note it's GPL-licensed):


import benchit
benchit.setparams(rep=1)

# 400 images to run through the pool:
IMAGES = [make_image() for _ in range(400)]

def slow_threshold_in_pool(num_threads):
    apply_in_thread_pool(num_threads, slow_threshold, IMAGES)

def fast_threshold_in_pool(num_threads):
    apply_in_thread_pool(num_threads, fast_threshold, IMAGES)

# Measure the two functions with 1 to 24 threads:
timings = benchit.timings(
    [slow_threshold_in_pool, fast_threshold_in_pool],
    range(1, 25),
    input_name="Number of threads"
)
timings.plot(logy=True, logx=False)

[Graph: run time vs. number of threads for both variants. Run time declines as threads increase, up to a minimum; past that, adding threads slows things down. The slow and fast functions reach their minimum at different thread counts.]

Notice how the run time declines as the number of threads increases… up to a point.
After that, run time starts getting worse again.
So far this is what we expected.
But there's something unexpected too: the optimal number of threads is different for each function.

timings.to_dataframe().idxmin(axis="rows")

Functions        Optimal number of threads
slow_threshold   19
fast_threshold   8

A reader pointed out that setting the chunksize option on map() can make the thread pool more efficient, by reducing the overhead and potential bottleneck of interacting with the pool's internal task queue.
When each thread handles 5 images at a time, the optimal number of threads was 20 and 10, respectively.
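A minimal illustration of the chunksize parameter (using a toy workload, not the image functions above):

```python
from multiprocessing.dummy import Pool as ThreadPool

def double(x):
    return x * 2

with ThreadPool(4) as pool:
    # chunksize=5: each task pulled off the pool's internal queue
    # covers 5 items, so threads touch the queue far less often.
    result = pool.map(double, range(100), 5)

print(result[:5])  # [0, 2, 4, 6, 8]
```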

The optimal level of parallelism also depends on your code

Our slower function was able to take advantage of essentially all the logical cores.
Probably it isn't fully utilizing all the available processing in a given physical core, so logical cores allow for more parallelism.

In contrast, our faster function could take advantage of no more than 8 cores; beyond that it started slowing down.
Perhaps it started hitting some bottleneck other than computation, like memory bandwidth.

There is no thread pool size that's optimal for both functions.

A different approach: empirical measurement

We've encountered multiple problems in getting the optimal number of threads:

  1. It's difficult to get an accurate number of cores that takes into account all the different ways the operating system can restrict CPU usage.
  2. The optimal parallelism level, e.g. number of threads, is workload dependent.
    More optimized code may not be able to take advantage of the extra logical cores.
  3. The number of cores isn't the only bottleneck.
  4. Bonus problem: If you're running in the cloud, you're using "vCPUs", whatever that means.
    Different instances may have different CPU models, for one thing.

So here's another approach: empirically discover the optimal number of threads, at runtime.
In the example above we measured the optimal number of threads for a particular piece of code.
If you have a long-running data processing job that will be running the same code for a while on multiple threads, you can do the same.
That is, you can spend a little bit of time at the beginning to empirically measure the optimal number of threads, perhaps with some heuristics to compensate for noise.
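One possible shape for such a measurement, sketched here with illustrative helpers (best_thread_count() and the toy workload are assumptions, not code from the article). A real implementation would use your actual workload, GIL-releasing code like the Numba functions above, and repeated runs to smooth out noise:

```python
from multiprocessing.dummy import Pool as ThreadPool
from time import perf_counter

def workload(_):
    # Stand-in for real work; note pure-Python code holds the GIL,
    # so it won't actually speed up with more threads.
    return sum(i * i for i in range(1_000))

def time_with_threads(num_threads, items=32):
    # Time one batch of work at a given pool size.
    start = perf_counter()
    with ThreadPool(num_threads) as pool:
        pool.map(workload, range(items))
    return perf_counter() - start

def best_thread_count(max_threads=8):
    # Increase the thread count until timings stop improving.
    best, best_time = 1, time_with_threads(1)
    for n in range(2, max_threads + 1):
        elapsed = time_with_threads(n)
        if elapsed >= best_time:
            break  # adding threads stopped helping
        best, best_time = n, elapsed
    return best
```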

For runtime purposes, if you're using empirical measurement you don't have to care about why a particular number of threads is optimal.
Regardless of the hardware, operating system configuration, or cloud environment, you will be using the optimal level of parallelism.
