Estimating your memory bandwidth – Daniel Lemire's blog

2024-01-14 03:04:30

One of the limitations of a computer is its memory bandwidth. For the scope of this article, I define "memory bandwidth" as the maximal number of bytes you can bring from memory to the CPU per unit of time. E.g., if your system has 5 GB/s of bandwidth, you can read up to 5 GB from memory in one second.

To measure this memory bandwidth, I propose to read data sequentially. E.g., you may use a function where we sum the byte values in a large array. It is not necessary to sum every byte value; you can skip some because the processor operates in units of cache lines. I do not know of a system that uses cache lines smaller than 64 bytes, so reading one value every 64 bytes ought to be enough.

uint64_t sum(const uint8_t *data,
    size_t start, size_t len, size_t skip) {
  uint64_t sum = 0;
  for (size_t i = start; i < len; i += skip) {
    sum += data[i];
  }
  return sum;
}

This may not be enough to maximize the bandwidth usage: your system surely has several cores. Thus we should use multiple threads. The following C++ code divides the input into consecutive segments and assigns one thread to each segment, dividing up the work as fairly as possible:

size_t segment_length = data_volume / threads_count;
size_t cache_line = 64;
for (size_t i = 0; i < threads_count; i++) {
  threads.emplace_back(sum, data, segment_length*i,
       segment_length*(i+1), cache_line);
}
for (std::thread &t : threads) { t.join(); }

I ran this code on a server with two Intel Ice Lake processors. I find that the more threads I use, the more bandwidth I get, up to around 15 threads. I start out at 15 GB/s and go up to over 130 GB/s. Once I reach about 20 threads, it is no longer possible to get more bandwidth out of the system. The system has a total of 64 cores, over two CPUs. My program does not do any fiddling with pinning threads to cores; it is not optimized for NUMA. I have transparent huge pages enabled by default on this Linux system.

My benchmark should make it easy for the processor to maximize bandwidth usage, so I would not expect more sophisticated software to hit a bandwidth limit with as few as 20 threads.

My source code is available.

This machine has two NUMA nodes; splitting up the work by thread affinity would probably improve the bandwidth.

Further reading: Many tools to measure bandwidth.
