Cache Line Alignment in C++ — How It Makes Your Program Faster | by Ryonald Teofilo | Sep, 2023

As mentioned in Memory and Data Alignment in C++, memory alignment is often used for optimisation. One of these techniques is cache line alignment.
Cache Lines
As a prerequisite: data is moved between the CPU cache and main memory in fixed-size blocks, known as cache lines. The typical size of a cache line is 64 bytes!
C++17 provides a portable way of obtaining the cache line size via std::hardware_destructive_interference_size.
Note: I didn't use the aforementioned in this demo, because the GCC version I was using doesn't have it implemented yet (it arrived in GCC 12.1).
Sharing Cache Lines
If data items are close to each other in memory, it is very likely that they will end up in the same cache line. This can adversely affect performance when multiple cores need to access that data, because the cores have to bounce the cache line between their local caches!
When a core needs to modify data that happens to live in the same cache line as data being used by another core, time is wasted waiting for the other core to release the line. This is commonly known as false sharing.
Don't confuse this with synchronising access, like guarding shared data/memory with a mutex. Here the cores are accessing completely separate data; the data just happens to live in the same cache line.
Cache Line Alignment
A way to combat this is to make the data cache line aligned. If the cache line size is 64 bytes, this means allocating the data on a 64-byte boundary (I highly recommend reading my story on memory alignment if this isn't clear to you!). This ensures the data will not share a cache line with anything else.
The performance benefit of this can be demonstrated with a simple application.
#include <thread>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>

struct A
{
    int mInt = 0;
};

int main()
{
    // Initialise array
    A* arr = new A[2];

    // Seed rand
    std::srand(static_cast<unsigned int>(std::time(nullptr)));

    // Increment the variable repeatedly
    auto process = [](int* num) {
        for(int i = 0; i < 100000000; i++)
            *num = *num + std::rand();
    };

    // Starting time
    auto startTime = std::chrono::high_resolution_clock::now();

    // Spawn and wait for threads to finish
    std::thread t1(process, &arr[0].mInt);
    std::thread t2(process, &arr[1].mInt);
    t1.join();
    t2.join();

    // Finish time
    auto endTime = std::chrono::high_resolution_clock::now();

    // Print results
    std::cout << "Duration: "
              << std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime).count() / 1000.f
              << " ms" << std::endl;

    // Deallocate
    delete[] arr;
    return 0;
}
Here, we're incrementing two integers in two separate threads. Each thread has its own integer to increment.
For all the nerds out there, I'm using the following compiler 🙂
$ g++ --version
g++ (GCC) 11.4.0
$ g++ -dumpmachine
x86_64-pc-cygwin
Without cache line alignment, the code runs in 517.87 ms
$ g++ cachelinealignment.cpp -o cachelinealignment
$ ./cachelinealignment
Duration: 517.87 ms
In order to align to the cache line, I'll make the following changes
// Align to 64-byte boundary
struct alignas(64) A
{
    int mInt = 0;
};

// Just for completeness, assert the expected alignment
static_assert(alignof(A) == 64);
With cache line alignment, we see an improvement in performance: 265.304 ms. Note that alignas(64) also pads sizeof(A) to 64 bytes, so each array element now occupies its own cache line.
$ g++ cachelinealignment.cpp -o cachelinealignment
$ ./cachelinealignment
Duration: 265.304 ms
For completeness, here is the final source.
#include <thread>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>

struct alignas(64) A
{
    int mInt = 0;
};

int main()
{
    static_assert(alignof(A) == 64);

    // Initialise array
    A* arr = new A[2];

    // Seed rand
    std::srand(static_cast<unsigned int>(std::time(nullptr)));

    // Increment the variable repeatedly
    auto process = [](int* num) {
        for(int i = 0; i < 100000000; i++)
            *num = *num + std::rand();
    };

    // Starting time
    auto startTime = std::chrono::high_resolution_clock::now();

    // Spawn and wait for threads to finish
    std::thread t1(process, &arr[0].mInt);
    std::thread t2(process, &arr[1].mInt);
    t1.join();
    t2.join();

    // Finish time
    auto endTime = std::chrono::high_resolution_clock::now();

    // Print results
    std::cout << "Duration: "
              << std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime).count() / 1000.f
              << " ms" << std::endl;

    // Deallocate
    delete[] arr;
    return 0;
}