# Decreasing the Number of Memory Accesses 1/2

*by* Phil Tadros

*We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.*

When we try to speed up a memory-bound loop, there are several different paths. We could try decreasing the dataset size. We could try increasing the available instruction-level parallelism. We could try modifying the way we access data. Some of these techniques are very advanced. But sometimes we should start with the basics.

One of the ways to improve the memory boundness of a piece of code is the old-fashioned way: decrease the total number of memory accesses (loads or stores). Once a piece of data is in a register, using it is very cheap, to the point of being free (thanks to the CPU’s ability to execute up to four instructions in a single cycle and its out-of-order nature). So all techniques that try to lower the total number of loads and stores should result in speedups.

## Methods

*Like what you are reading? Follow us on LinkedIn, Twitter or Mastodon and get notified as soon as new content becomes available. Need help with software performance? Contact us!*

### Loop Fusion

If we have two consecutive loops that operate on the same dataset, fusing them into one loop decreases the number of memory loads and stores and consequently improves performance. This transformation is called *loop fusion* or *loop jamming*. Consider the example of finding the minimum and the maximum in an array. One of the ways to do it is with two separate loops:

```cpp
double min = a[0];
for (int i = 1; i < n; i++) {
    if (a[i] < min) { min = a[i]; }
}

double max = a[0];
for (int i = 1; i < n; i++) {
    if (a[i] > max) { max = a[i]; }
}
```

Both loops process the same dataset. We can merge the two loops into a single loop that finds both the minimum and the maximum. This cuts the number of data loads in half.

```cpp
double min = a[0];
double max = a[0];
for (int i = 1; i < n; i++) {
    if (a[i] < min) { min = a[i]; }
    if (a[i] > max) { max = a[i]; }
}
```

We measure the effect of loop fusion in the experiments section.

A few notes about loop fusion:

- *Loop fusion is also a compiler optimization technique*, so in principle the compiler can do it automatically. But this doesn’t happen often, and when it does happen, the optimization tends to break easily.
- *There are cases where loop fusion fails to deliver speed improvements.* If one or both loops are vectorized before the fusion, but the fused loop is not vectorized, the transformation can result in a slowdown, not a speedup.
- *Loop fusion is a close relative of loop sectioning.* The main difference is that loop fusion reuses the data while it is in registers, whereas loop sectioning reuses the data while it is in the fast data caches. Therefore, loop fusion is more memory efficient than loop sectioning, but fusing two loops is more complex than sectioning them. Also, preserving vectorization is easier with loop sectioning.
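To make the contrast concrete, here is a hypothetical sketch of loop sectioning applied to the same min/max problem. The section size of 4096 elements is an assumption, chosen so a section fits comfortably in the L1 data cache; the two passes stay separate, but each runs over one cache-sized section at a time.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Loop sectioning sketch: the min and max passes stay separate, but each
// runs over one cache-sized section at a time, so the second pass finds
// its data still resident in the fast data caches.
std::pair<double, double> min_max_sectioned(const std::vector<double>& a) {
    const std::size_t section = 4096;  // assumed to fit in the L1d cache
    double min = a[0];
    double max = a[0];
    for (std::size_t start = 0; start < a.size(); start += section) {
        const std::size_t end = std::min(start + section, a.size());
        for (std::size_t i = start; i < end; i++)  // pass 1: minimum
            if (a[i] < min) min = a[i];
        for (std::size_t i = start; i < end; i++)  // pass 2: maximum
            if (a[i] > max) max = a[i];
    }
    return {min, max};
}
```

Because each pass keeps its own loop body, a vectorizing compiler can still treat the two passes independently, which is the main practical advantage of sectioning over fusion.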

#### C++ Ranges

For people using C++ STL algorithms, it is important to be aware of C++ ranges. The original STL contains many algorithms, but they are not composable. Composability means that the result of one algorithm can be fed directly into another algorithm. Consider the example:

```cpp
struct user_t {
    int id;
    std::string name;
    int age;
};

std::vector<int> get_ids_adult_users(std::vector<user_t>& users) {
    std::vector<user_t> adult_users;
    std::copy_if(std::cbegin(users), std::cend(users),
                 std::back_inserter(adult_users),
                 [](auto const& user) { return user.age >= 18; });
    std::vector<int> ids;
    std::transform(std::cbegin(adult_users), std::cend(adult_users),
                   std::back_inserter(ids),
                   [](auto const& user) { return user.id; });
    return ids;
}
```

The function `get_ids_adult_users` returns a vector containing the `id`s of all the users who are adults, i.e. whose age is 18 or more. To achieve this using STL, we use two algorithms: `std::copy_if`, which filters out the minor users to create the list of adult users, and `std::transform`, which extracts only the ids from the vector of `user_t` structs.

This approach forces the code to iterate over two collections instead of one: the first collection is the original collection of users, and the second is the temporary collection holding the adult users. To avoid this, C++ developers have *STL ranges* at their disposal. Here is the same example rewritten using ranges:

```cpp
std::vector<int> get_ids_adult_users(std::vector<user_t>& users) {
    auto ids = users
             | std::views::filter([](auto const& user) { return user.age >= 18; })
             | std::views::transform([](auto const& user) { return user.id; });
    return {ids.begin(), ids.end()};
}
```

This code, apart from being cleaner, also performs fewer memory loads and stores. The filter adapter performs the filtering, and the result of the filtering is fed directly into the transform adapter, element by element. This avoids running through the vector twice, and it is equivalent to loop fusion.

#### Kernel Fusion

*Kernel fusion* is just loop fusion brought to a much higher level. Consider the following: imagine you have N image processing algorithms forming an image processing pipeline. The output of algorithm X is the input of algorithm X+1. One of the ways to implement the pipeline is to have the algorithms run one after another, from zero to N-1. Each algorithm must finish before the next one starts.

With this kind of setup, processing an image will typically be memory inefficient. The whole image is loaded from the slower parts of the memory subsystem into CPU registers, processed, and then the output is written back to the memory subsystem. And this is repeated for each algorithm in the pipeline: load input, do some modifications, store output.

In this example, each algorithm is a *kernel*. And by kernel fusion, we mean that two algorithms are fused. Algorithm X generates a single pixel and feeds it directly to algorithm X+1, then to algorithm X+2, and so on. The benefit of such an approach is that all the relevant data never leaves the CPU, which avoids unnecessary data movements.
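A toy sketch of the idea (the brighten and threshold kernels are made-up examples, not from any real pipeline): instead of two full passes over the image, the fused loop applies both kernels to each pixel while it is still in a register.

```cpp
#include <cstdint>
#include <vector>

// Two hypothetical image kernels: one brightens a pixel, one thresholds it.
inline std::uint8_t brighten(std::uint8_t p)  { return p > 225 ? 255 : p + 30; }
inline std::uint8_t threshold(std::uint8_t p) { return p >= 128 ? 255 : 0; }

// Unfused pipeline: the whole image makes a round trip through the
// memory subsystem between the two kernels.
void pipeline_unfused(std::vector<std::uint8_t>& img) {
    for (auto& p : img) p = brighten(p);
    for (auto& p : img) p = threshold(p);
}

// Fused pipeline: each pixel goes through both kernels back to back;
// the intermediate image is never written to memory.
void pipeline_fused(std::vector<std::uint8_t>& img) {
    for (auto& p : img) p = threshold(brighten(p));
}
```

Both versions produce the same image, but the fused version performs half the loads and stores.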

However, there are two problems with this approach:

- Writing such a processing pipeline is not easy, and it has to be planned from day one. In fact, there is a special programming language, *Halide*, designed for writing fast and portable image and tensor processing code.
- The algorithms that can profit from this approach have to be memory bound, i.e. light in computation. Algorithms that are computationally intensive would benefit little or not at all, because the computational bottleneck will hide the memory latency.

If you happen to know more about this topic, please leave a comment or send me an e-mail; I would like to know more as well (and also keep this post updated).


#### Loop Fusion inside a Loop Nest

Loop fusion inside a loop nest (also called *unroll-and-jam*) is a more advanced kind of loop fusion, applicable to loop nests. The technique is easy to apply to many sorting algorithms, but not only to them. Consider the selection sort algorithm. Here is the source code:

```cpp
for (int i = 0; i < n; i++) {
    double min = a[i];
    int index_min = i;
    for (int j = i + 1; j < n; j++) {
        if (a[j] < min) {
            min = a[j];
            index_min = j;
        }
    }
    std::swap(a[i], a[index_min]);
}
```

The algorithm is very simple: in the `i`-th iteration, it scans the elements from `i` to `n` to find the smallest value, and puts that value in place `a[i]`.

For loop fusion inside a loop nest, we need to unroll the outer loop two or more times to make the fusion opportunity explicit. Here is the selection sort algorithm where the outer loop has been unrolled twice:

```cpp
for (int i = 0; i < n; i += 2) {   // assumes n is even, for brevity
    double min = a[i];
    int index_min = i;
    for (int j = i + 1; j < n; j++) {
        if (a[j] < min) {
            min = a[j];
            index_min = j;
        }
    }
    std::swap(a[i], a[index_min]);

    min = a[i + 1];
    index_min = i + 1;
    for (int j = i + 2; j < n; j++) {
        if (a[j] < min) {
            min = a[j];
            index_min = j;
        }
    }
    std::swap(a[i + 1], a[index_min]);
}
```

The first and second inner loops iterate over almost identical datasets. There are a few statements preventing a simple loop fusion, but they can be moved out of the way. With some tricks, we fuse the two inner loops together. Here is the result:

```cpp
for (int i = 0; i < n; i += 2) {   // assumes n is even, for brevity
    double min_1 = a[i];
    double min_2 = a[i + 1];
    int index_min_1 = i;
    int index_min_2 = i + 1;
    // min_1 must be smaller than min_2.
    // Swap the two values if this doesn't hold.
    if (min_2 < min_1) {
        std::swap(min_1, min_2);
        std::swap(index_min_1, index_min_2);
    }
    // Look up the two smallest values in the array.
    // The smaller is kept in min_1, the larger in min_2.
    for (int j = i + 2; j < n; j++) {
        if (a[j] < min_2) {
            if (a[j] < min_1) {
                min_2 = min_1;
                index_min_2 = index_min_1;
                min_1 = a[j];
                index_min_1 = j;
            } else {
                min_2 = a[j];
                index_min_2 = j;
            }
        }
    }
    std::swap(a[i], a[index_min_1]);
    // If the second minimum was at position i, the first swap moved it.
    if (index_min_2 == i) index_min_2 = index_min_1;
    std::swap(a[i + 1], a[index_min_2]);
}
```

The loop fusion in this case is not trivial, but it is doable. The inner loop looks up the two smallest values in the array and puts them at the beginning of the section currently being processed. The total number of memory accesses is about two times lower compared to the simple selection sort algorithm.

Note: this optimization closely resembles *outer loop vectorization*, where the outer loop runs several instances of the inner loop in parallel.

### Decreasing the Number of Data Passes by Doing More Work

As we have seen in the previous examples, loop fusion allows the elimination of some memory accesses by fusing two neighboring loops running over the same data. But this is not the only way. Many ideas that decrease the number of data passes will result in fewer memory accesses and better performance.

Consider the simple selection sort algorithm from the previous section. The original algorithm searches for the minimum in the remaining array. To decrease the total number of memory accesses, we can scan for both the minimum and the maximum. We then put the minimum element at the beginning of the remaining array and the maximum element at its end. The algorithm looks like this:

```cpp
for (int i = 0, j = n - 1; i < j; i++, j--) {
    double min = a[i];
    int index_min = i;
    double max = a[j];
    int index_max = j;
    for (int k = i; k <= j; k++) {
        if (a[k] < min) { min = a[k]; index_min = k; }
        if (a[k] > max) { max = a[k]; index_max = k; }
    }
    std::swap(a[i], a[index_min]);
    // If the maximum was moved by the first swap, it now sits at index_min.
    if (a[index_min] != max) {
        std::swap(a[j], a[index_max]);
    } else {
        std::swap(a[j], a[index_min]);
    }
}
```

This version has fewer iterations, and therefore fewer memory loads, but inside each iteration it does twice as much work. From the algorithmic standpoint, it performs roughly the same number of operations as the first version. But performance-wise, it is more efficient. In the experiments section we will see how much more efficient.

Another important algorithm with a similar reduction in memory accesses is a variant of Quicksort called Multi-Pivot Quicksort. Before explaining MPQ, here is a quick reminder about Quicksort. The Quicksort algorithm consists of two steps. The first step is array partitioning: picking a pivot and then partitioning the array into a left part that is smaller than the pivot and a right part that is larger. The second step is the recursive call: the partitioning is performed recursively on the left and right parts of the input array, until the size of the array becomes 1.

With Multi-Pivot Quicksort, the partitioning step is performed by picking two pivots and partitioning the array into three parts. Then partitioning is recursively performed on the left, middle and right parts of the array. If an array has N elements, with plain Quicksort we expect the average number of memory accesses for each element of the array to be O(log₂ N). With Multi-Pivot Quicksort, the average number of memory accesses will be O(log₃ N).
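The two-pivot partitioning step can be sketched as follows. This is a Yaroslavskiy-style sketch, not the author's implementation; the pivot choice (first and last element) and the trivial recursion cutoff are simplifying assumptions.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// One two-pivot partitioning step: elements smaller than p1 go left,
// elements larger than p2 go right, the rest stay in the middle.
// Then the three parts are sorted recursively.
void multi_pivot_quicksort(std::vector<double>& a, int lo, int hi) {
    if (hi - lo < 1) return;                  // 0 or 1 elements: done
    if (a[lo] > a[hi]) std::swap(a[lo], a[hi]);
    double p1 = a[lo], p2 = a[hi];            // p1 <= p2
    int lt = lo + 1;                          // a[lo+1 .. lt-1]  <  p1
    int gt = hi - 1;                          // a[gt+1 .. hi-1]  >  p2
    int i = lo + 1;
    while (i <= gt) {
        if (a[i] < p1)      std::swap(a[i++], a[lt++]);
        else if (a[i] > p2) std::swap(a[i], a[gt--]);
        else                i++;              // p1 <= a[i] <= p2
    }
    std::swap(a[lo], a[--lt]);                // place pivot p1
    std::swap(a[hi], a[++gt]);                // place pivot p2
    multi_pivot_quicksort(a, lo, lt - 1);     // left part:   < p1
    multi_pivot_quicksort(a, lt + 1, gt - 1); // middle part: p1..p2
    multi_pivot_quicksort(a, gt + 1, hi);     // right part:  > p2
}
```

Each recursion level still touches every element once, but there are about log₃ N levels instead of log₂ N, which is where the saving in memory accesses comes from.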


## Experiments

All the experiments were executed on an Intel Core i5-10210U CPU, 4 cores, 2 threads per core. Each core has 32 kB of L1i cache, 32 kB of L1d cache and 256 kB of L2 cache. There is a total of 6 MB of L3 cache shared by all the cores. The source code used for all experiments is available here.

### Loop Fusion

The first experiment is related to loop fusion. We measure the runtime of the two separate loops and compare it with the runtime of the fused loop. The examples we use for testing are the loops that calculate the min and max, first in two separate loops, then merged.

Here are the runtimes (5 repetitions, average values):

| Array Size | Original | Fused |
|---|---|---|
| 32 kB | Runtime: 0.159 s, Instr: 671 M, CPI: 0.793 | Runtime: 0.068 s, Instr: 402 M, CPI: 0.665 |
| 256 kB | Runtime: 0.136 s, Instr: 671 M, CPI: 0.800 | Runtime: 0.068 s, Instr: 402 M, CPI: 0.667 |
| 2 MB | Runtime: 0.136 s, Instr: 671 M, CPI: 0.801 | Runtime: 0.068 s, Instr: 402 M, CPI: 0.667 |
| 16 MB | Runtime: 0.171 s, Instr: 671 M, CPI: 0.855 | Runtime: 0.085 s, Instr: 402 M, CPI: 0.739 |
| 128 MB | Runtime: 0.175 s, Instr: 671 M, CPI: 0.855 | Runtime: 0.086 s, Instr: 402 M, CPI: 0.742 |

The table shows that, on average, the fused version is about two times faster than the original version. The fused version also executes fewer instructions and is more hardware efficient (the cycles-per-instruction metric is better). Fewer instructions are executed because (1) there is only one loop instead of two, which means fewer iterator increments, iterator comparisons and jumps, and (2) one redundant load is removed, since the piece of data is already in a register.

### Selection Sort

We are going to experiment with selection sort, as described in the section about decreasing the number of data passes. To measure the effect, we compare the version of selection sort that finds only the minimum with the version that finds both the minimum and the maximum. The first version scans the remaining part of the array to find the minimum and puts it at the beginning. The second version scans to find both the minimum and the maximum, and places them at the beginning and at the end of the array respectively. We expect the second version to be faster because it needs to perform half as many scans.

Here are the numbers (5 runs, average values):

| Array Size | Only Min | Min and Max |
|---|---|---|
| 8 kB, 16384 repeats | Runtime: 8.74 s, Instr: 60.3 B, CPI: 0.57 | Runtime: 4.44 s, Instr: 43.2 B, CPI: 0.405 |
| 32 kB, 1024 repeats | Runtime: 8.72 s, Instr: 60.1 B, CPI: 0.573 | Runtime: 4.36 s, Instr: 43.0 B, CPI: 0.401 |
| 128 kB, 64 repeats | Runtime: 8.69 s, Instr: 60.1 B, CPI: 0.572 | Runtime: 4.37 s, Instr: 43.0 B, CPI: 0.402 |
| 512 kB, 4 repeats | Runtime: 8.69 s, Instr: 60.1 B, CPI: 0.572 | Runtime: 4.39 s, Instr: 42.9 B, CPI: 0.405 |

The Min and Max version is both faster and more hardware efficient in all cases. It also executes fewer instructions, because both the inner and the outer loops are shorter, so they perform fewer memory accesses.

## Conclusion

Loop fusion is a simple and powerful technique to decrease the total number of memory accesses in a program. Although we described here the simplest version, loop fusion is possible even when the datasets overlap only partially.

Generally, any idea that can result in a decrease of memory accesses has the potential to speed up your code. If you have any ideas that are not mentioned in this post, feel free to leave a comment so we can update it.

In the next post we will talk about another way to decrease the total number of memory accesses. Those memory accesses are “unwanted”, in the sense that the compiler created them without your intention: memory accesses related to pointer aliasing and memory accesses related to register spilling. See you soon!

Need help with software performance? Contact us!