
CPU Utilization is Wrong

2023-07-03 18:11:35

The metric we all use for CPU utilization is deeply misleading, and getting worse every year. What is CPU utilization? How busy your processors are? No, that's not what it measures. Yes, I'm talking about the "%CPU" metric used everywhere, by everyone. In every performance monitoring product. In top(1).

What you may think 90% CPU utilization means:

What it may really mean:

Stalled means the processor was not making forward progress with instructions, and usually happens because it is waiting on memory I/O.
The ratio I drew above (between busy and stalled) is what I typically see in production. Chances are, you're mostly stalled, but don't know it.

What does this mean for you? Knowing how much your CPUs are stalled can direct performance tuning efforts between reducing code and reducing memory I/O.
Anyone looking at CPU performance, especially on clouds that auto scale based on CPU, would benefit from knowing the stalled component of their %CPU.

What is CPU utilization, really?

The metric we call CPU utilization is really "non-idle time": the time the CPU was not running the idle thread. Your operating system kernel (whichever it is) usually tracks this during context switch. If a non-idle thread begins running, then stops 100 milliseconds later, the kernel considers that CPU utilized that entire time.
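On Linux, the raw counters behind this are visible in /proc/stat. Here is a minimal sketch of the idea (an illustration, not how any particular tool implements it), assuming the usual column order of user, nice, system, idle, iowait, irq, softirq, steal, and counting iowait as idle time:

prev=$(awk '/^cpu /{print $2+$3+$4+$7+$8+$9, $5+$6}' /proc/stat)   # total busy jiffies, total idle+iowait jiffies
sleep 5
curr=$(awk '/^cpu /{print $2+$3+$4+$7+$8+$9, $5+$6}' /proc/stat)
echo $prev $curr | awk '{busy=$3-$1; idle=$4-$2; printf "%%CPU = %.1f\n", 100*busy/(busy+idle)}'

Anything that was not the idle thread gets counted as "utilized", no matter what those cycles were actually doing.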

This metric is as old as time sharing systems. The Apollo Lunar Module guidance computer (a pioneering time sharing system) called its idle thread the "DUMMY JOB", and engineers tracked cycles running it versus real tasks as an important computer utilization metric. (I wrote about this before.)

So what is wrong with this?

Nowadays, CPUs have become much faster than main memory, and waiting on memory dominates what is still called "CPU utilization". When you see high %CPU in top(1), you may think of the processor as being the bottleneck – the CPU package under the heat sink and fan – when it's really those banks of DRAM.

This has been getting worse. For a long time processor manufacturers were scaling their clock speed faster than DRAM was scaling its access latency (the "CPU DRAM gap"). That levelled out around 2005 with 3 GHz processors, and since then processors have scaled using more cores and hyperthreads, plus multi-socket configurations, all putting more demand on the memory subsystem. Processor manufacturers have tried to reduce this memory bottleneck with larger and smarter CPU caches, and faster memory busses and interconnects. But we are still usually stalled.

How to tell what the CPUs are really doing

By using Performance Monitoring Counters (PMCs): hardware counters that can be read using Linux perf, and other tools. For example, measuring the whole system for 10 seconds:

# perf stat -a -- sleep 10

 Performance counter stats for 'system wide':

     641398.723351      task-clock (msec)         #   64.116 CPUs utilized            (100.00%)
           379,651      context-switches          #    0.592 K/sec                    (100.00%)
            51,546      cpu-migrations            #    0.080 K/sec                    (100.00%)
        13,423,039      page-faults               #    0.021 M/sec                  
 1,433,972,173,374      cycles                    #    2.236 GHz                      (75.02%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
 1,118,336,816,068      instructions              #    0.78  insns per cycle          (75.01%)
   249,644,142,804      branches                  #  389.218 M/sec                    (75.01%)
     7,791,449,769      branch-misses             #    3.12% of all branches          (75.01%)

      10.003794539 seconds time elapsed

The key metric here is instructions per cycle (insns per cycle: IPC), which shows on average how many instructions we were completing for each CPU clock cycle. The higher, the better (a simplification). The above example of 0.78 sounds not bad (78% busy?) until you realize that this processor's top speed is an IPC of 4.0. This is also known as 4-wide, referring to the instruction fetch/decode path. Which means the CPU can retire (complete) four instructions with every clock cycle. So an IPC of 0.78 on a 4-wide system means the CPUs are running at 19.5% of their top speed. Newer Intel processors may move to 5-wide.
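The arithmetic behind those percentages, using the counts from the perf output above (just bc, nothing hidden):

echo "scale=4; 1118336816068 / 1433972173374" | bc    # .7798, the 0.78 insns per cycle perf printed
echo "scale=4; 0.78 / 4.0" | bc                       # .1950, i.e. ~19.5% of a 4-wide top speed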

There are hundreds more PMCs you can use to dig further: measuring stalled cycles directly, by different types.
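For example, the generic stalled-cycle events can be requested by name (they appeared as <not supported> on the system above; whether they work depends on your processor and kernel version):

# perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend -a -- sleep 10

Roughly speaking, frontend stalls are instruction fetch/decode waits, and backend stalls include waiting on loads from memory.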

In the cloud

If you are in a virtual environment, you might not have access to PMCs, depending on whether the hypervisor supports them for guests. I recently posted about The PMCs of EC2: Measuring IPC, showing how PMCs are now available for dedicated host types on the AWS EC2 Xen-based cloud.

Interpretation and actionable items

If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.
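One way to check a memory-stall suspicion is to look at last-level cache misses alongside IPC. These generic event aliases may map differently, or not at all, on your platform:

# perf stat -e cycles,instructions,LLC-loads,LLC-load-misses -a -- sleep 10

A high LLC-load-miss ratio together with a low IPC points at DRAM, not code execution.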

If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.
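A typical CPU flame graph workflow with perf, assuming you have the FlameGraph scripts from https://github.com/brendangregg/FlameGraph in the current directory:

# perf record -F 99 -a -g -- sleep 30
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg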

For my above rules, I split on an IPC of 1.0. Where did I get that from? I made it up, based on my prior work with PMCs. Here's how to get a value that's custom for your system and runtime: write two dummy workloads, one that is CPU bound, and one memory bound. Measure their IPC, then calculate their mid point.
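As a sketch of that calibration, assuming stress-ng is installed (its cpu stressor is a small-working-set compute loop, and its stream stressor generates STREAM-like memory traffic):

# perf stat -e cycles,instructions stress-ng --cpu 1 --timeout 10s
# perf stat -e cycles,instructions stress-ng --stream 1 --timeout 10s

Take the IPC that perf reports for each, and use the midpoint of the two as your dividing line instead of my 1.0.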


What performance monitoring products should tell you

Every performance tool should show IPC along with %CPU. Or break down %CPU into instruction-retired cycles vs stalled cycles, e.g., %INS and %STL.

As for top(1), there is tiptop(1) for Linux, which shows IPC by process:

tiptop -                  [root]
Tasks:  96 total,   3 displayed                               screen  0: default

  PID [ %CPU] %SYS    P   Mcycle   Minstr   IPC  %MISS  %BMIS  %BUS COMMAND
 3897   35.3  28.5    4   274.06   178.23  0.65   0.06   0.00   0.0 java
 1319+   5.5   2.6    6    87.32   125.55  1.44   0.34   0.26   0.0 nm-applet
  900    0.9   0.0    6    25.91    55.55  2.14   0.12   0.21   0.0 dbus-daemo

Other reasons CPU utilization is misleading

It isn't just memory stall cycles that make CPU utilization misleading. Other factors include:

  • Temperature trips stalling the processor.
  • Turbo boost varying the clock rate.
  • The kernel varying the clock rate with speed step (see the turbostat sketch after this list).
  • The problem with averages: 80% utilized over 1 minute, hiding bursts of 100%.
  • Spin locks: the CPU is utilized, and has high IPC, but the app isn't making logical forward progress.
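For the clock rate items, turbostat(8) from the Linux kernel tools reports the actual frequency while a command runs; for example:

# turbostat sleep 10

Columns such as Busy% and Bzy_MHz show how far the effective clock rate strayed while the CPUs appeared "utilized".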

Update: is CPU utilization actually wrong?

There have been hundreds of comments on this post, here (below) and elsewhere (1, 2). Thanks to everyone for taking the time and the interest in this topic. To summarize my responses: I'm not talking about iowait at all (that's disk I/O), and there are actionable items if you know you are memory bound (see above).

But is CPU utilization actually wrong, or just deeply misleading? I think many people interpret high %CPU to mean that the processing unit is the bottleneck, which is wrong (as I said earlier). At that point you don't yet know, and it is often something external. Is the metric technically correct? If the CPU stall cycles can't be used by anything else, aren't they therefore "utilized waiting" (which sounds like an oxymoron)? In some cases, yes, you could say that %CPU as an OS-level metric is technically correct, but deeply misleading. With hyperthreads, however, those stalled cycles can now be used by another thread, so %CPU may count cycles as utilized that are in fact available. That is wrong. In this post I wanted to focus on the interpretation problem and suggested solutions, but yes, there are technical problems with this metric as well.

You could simply say that utilization as a metric was already broken, as Adrian Cockcroft has discussed previously.

Conclusion

CPU utilization has become a deeply misleading metric: it includes cycles waiting on main memory, which can dominate modern workloads. Perhaps %CPU should be renamed to %CYC, short for cycles. You can figure out what %CPU really means by using additional metrics, including instructions per cycle (IPC). An IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound. I covered IPC in my previous post, along with an introduction to the Performance Monitoring Counters (PMCs) needed to measure it.

Performance monitoring products that show %CPU – which is all of them – should also show PMC metrics to explain what that means, and not mislead the end user. For example, they can show %CPU with IPC, and/or instruction-retired cycles vs stalled cycles. Armed with these metrics, developers and operators can choose how to better tune their applications and systems.
