
Memory Profiling Part 1. Introduction

2024-02-14 02:01:04


Subscribe to my newsletter, support me on Patreon or by PayPal donation.



I’d love to hear your feedback!


I wrote this blog series for the second edition of my book titled “Performance Analysis and Tuning on Modern CPUs”. It is open-sourced on Github: perf-book. The book mainly targets mainstream C and C++ developers who want to learn low-level performance engineering, but devs in other languages may also find some useful information.


After you read this write-up, let me know which parts you find useful/boring/complicated, and which parts need better explanation. Send me suggestions about the tools that I use and whether there are better ones.


Tell me what you think in the comments or send me an email, which you can find here. Also, you’re welcome to write your thoughts on Github; here is the corresponding pull request.


Please keep in mind that it’s an excerpt from the book, so some phrases may sound too formal.


P.S. If you’d rather read this in the form of a PDF document, you can download it here.

Memory Profiling Introduction

In this series of blog posts, you will learn how to collect high-level information about a program’s interaction with memory. This process is usually called memory profiling. Memory profiling helps you understand how an application uses memory over time and helps you build the right mental model of a program’s behavior. Here are some questions it can answer:

  • What is a program’s total memory consumption and how does it change over time?
  • Where and when does a program make heap allocations?
  • What are the code places with the largest amount of allocated memory?
  • How much memory does a program access every second?

When developers talk about memory consumption, they implicitly mean heap usage. The heap is, in fact, the biggest memory consumer in most applications since it accommodates all dynamically allocated objects. But the heap is not the only memory consumer. For completeness, let’s mention others:

  • Stack: Memory used by frame stacks in an application. Each thread inside an application gets its own stack memory space. Usually, the stack size is only a few MB, and the application will crash if it exceeds the limit. The total stack memory consumption is proportional to the number of threads running in the system.
  • Code: Memory that is used to store the code (instructions) of an application and its libraries. Usually, it does not contribute much to the memory consumption, but there are exceptions. For example, the Clang C++ compiler and the Chrome browser have large codebases and tens of MB of code sections in their binaries.

Next, we will introduce the terms memory usage and memory footprint and see how to profile both.

Memory Usage and Footprint

Memory usage is frequently described by Virtual Memory Size (VSZ) and Resident Set Size (RSS). VSZ includes all memory that a process can access, e.g., stack, heap, the memory used to encode instructions of an executable, and instructions from linked shared libraries, including the memory that is swapped out to disk. On the other hand, RSS measures how much of the memory allocated to a process resides in RAM. Thus, RSS does not include memory that is swapped out or was never yet touched by that process. Also, RSS does not include memory from shared libraries that were not loaded into memory.

Consider an example. Process A has 200K of stack and heap allocations, of which 100K resides in main memory; the rest is swapped out or unused. It has a 500K binary, of which only 400K was touched. Process A is linked against 2500K of shared libraries and has loaded only 1000K into main memory.

VSZ: 200K + 500K + 2500K = 3200K
RSS: 100K + 400K + 1000K = 1500K

An example of visualizing the memory usage and footprint of a hypothetical program is shown in Figure 1. The intention here is not to study the statistics of a particular program, but rather to set the framework for analyzing memory profiles. Later in this chapter, we will examine a few tools that let us collect such information.

Let’s first look at the memory usage (upper two lines). As we would expect, the RSS is always less than or equal to the VSZ. Looking at the chart, we can spot four phases in the program. Phase 1 is the ramp-up of the program, during which it allocates its memory. Phase 2 is when the algorithm starts using this memory; notice that the memory usage stays constant. During phase 3, the program deallocates part of the memory and then allocates a slightly higher amount of memory. Phase 4 is much more chaotic than phase 2, with many objects allocated and deallocated. Notice that the spikes in VSZ are not necessarily followed by corresponding spikes in RSS. That can happen when memory was reserved by an object but never used.


Figure 1. Example of the memory usage and footprint (hypothetical scenario).

Now let’s switch to memory footprint. It defines how much memory a process touches during a period, e.g., in MB per second. In our hypothetical scenario, visualized in Figure 1, we plot memory usage per 100 milliseconds (10 times per second). The solid line tracks the number of bytes accessed during each 100 ms interval. Here, we don’t count how many times a certain memory location was accessed. That is, if a memory location was loaded twice during a 100 ms interval, we count the touched memory only once. For the same reason, we cannot aggregate time intervals. For example, we know that during phase 2, the program was touching roughly 10 MB every 100 ms. However, we cannot aggregate ten consecutive 100 ms intervals and say that the memory footprint was 100 MB per second, because the same memory location could be loaded in adjacent 100 ms time intervals. That would be true only if the program never repeated memory accesses within each of the 1 s intervals.

The dashed line tracks the size of the unique data accessed since the start of the program. Here, we count the number of bytes accessed during each 100 ms interval that have never been touched before by the program. For the first second of the program’s lifetime, most of the accesses are unique, as we would expect. In the second phase, the algorithm starts using the allocated buffer. During the time interval from 1.3s to 1.8s, the program accesses most of the locations in the buffer, e.g., it was the first iteration of a loop in the algorithm. That’s why we see a big spike in newly seen memory locations from 1.3s to 1.8s, but we don’t see many unique accesses after that. From the timestamp 2s up until 5s, the algorithm mostly uses an already-seen memory buffer and doesn’t access any new data. However, the behavior of phase 4 is different. First, during phase 4, the algorithm is more memory-intensive than in phase 2, as the total memory footprint (solid line) is roughly 15 MB per 100 ms. Second, the algorithm accesses new data (dashed line) in relatively large bursts. Such bursts may be related to the allocation of new memory regions, working on them, and then deallocating them.

We will show how to obtain such charts in the following two case studies, but for now, you may wonder how this data can be used. Well, first, if we sum up the unique bytes (dashed line) accessed during every interval, we will get the total memory footprint of a program. Also, by looking at the chart, you can observe phases and correlate them with the code that is running. Ask yourself: “Does it look according to your expectations, or is the workload doing something sneaky?” You may encounter unexpected spikes in memory footprint. Memory profiling techniques that we will discuss in this series of posts don’t necessarily point you to the problematic places the way regular hotspot profiling does, but they certainly help you better understand the behavior of a workload. On many occasions, memory profiling helped identify a problem or served as an additional data point to support the conclusions made during regular profiling.

In some scenarios, memory footprint helps us estimate the pressure on the memory subsystem. For instance, if the memory footprint is small, say, 1 MB/s, and the RSS fits into the L3 cache, we might suspect that the pressure on the memory subsystem is low; remember that available memory bandwidth in modern processors is in GB/s and is getting close to 1 TB/s. On the other hand, when the memory footprint is rather large, e.g., 10 GB/s, and the RSS is much bigger than the size of the L3 cache, then the workload might put significant pressure on the memory subsystem.

-> part 2
