Measuring Stack Utilization the Arduous Means

2024-02-12 05:40:20

Embedded programs typically require a cautious eye to the place reminiscence assets are being
spent, particularly runtime reminiscence utilization like stack and heap recollections.

This text is meant to shed some mild on methods for measuring stack
reminiscence utilization on a small embedded system.

How stacks work

Earlier than we dive into the measurement methods, let’s take a fast take a look at how
stacks normally behave on all these small embedded chips (skip to
“Measuring stack usage” if that is previous information). For this
article, we’ll be assuming an ARM Cortex-M embedded system with “full
descending” stacks, however related methods will apply to different programs.

Stack grows down, heap grows up

In a “full descending” stack, objects are “allotted” into stack reminiscence in
reducing reminiscence order.

If our code is allocating a 32-bit worth on stack:

int foo(int a) {
  uint32_t var;  // the compiler could select to allocate this on stack
  ...
}

First the “stack pointer” is decremented by 32 bits (4 bytes), then the tackle
for var is ready to the brand new stack pointer tackle. Masses and shops of the var
variable will then entry the stack-allocated 32-bit area.

In a typical single-threaded embedded program that makes use of each stack and heap
(malloc and associates), the reminiscence ordering may appear like this:

Yow will discover some extra rationalization right here:

https://courses.engr.illinois.edu/cs225/fa2022/resources/stack-heap/

Aspect notice, there’s an fascinating technique for single-threaded applications that
entails a two-pass linking step, and outcomes on this reminiscence format:

Learn extra about that right here:

https://blog.japaric.io/stack-overflow-protection/

RTOS stacks

Issues get slightly extra difficult when there’s an RTOS concerned: every thread
(in a pre-emptive RTOS) normally has its personal devoted stack. See this hyperlink for a
nice description:

https://www.digikey.com/en/maker/projects/introduction-to-rtos-solution-to-part-4-memory-management/6d4dfcaa1ff84f57a2098da8e6401d9c

When a thread is operating, stack allocations happen simply as in a single-threaded
system, solely the stack is ready up (by the RTOS kernel) particularly for the
thread.

Alright, let’s bounce into the stack utilization measurement methods!

Measuring stack utilization

Static evaluation with `-fstack-usage`

Gcc and clang will emit stack utilization per operate by way of the -fstack-usage
argument. It seems like this in apply:

// stack-usage-example.c
#embody <stdio.h>

int foo_2(int c) {
  int array[4];
  array[1] = c;
  array[3] = c* c;
  return array[3] - array[1];
}

int foo(int a, int b) {
  int array[10];
  array[1] = a;
  array[9] = foo_2(b);

  return array[1] * array[9];
}

int predominant() {
  printf("%dn", foo(1,2));
}

Compiling with -fstack-usage:

$ gcc -c -fstack-usage stack-usage-example.c
$ cat stack-usage-example.su
instance/stack-usage/stack-usage-example.c:3:5:foo_2     64      static
instance/stack-usage/stack-usage-example.c:10:5:foo      80      static
instance/stack-usage/stack-usage-example.c:18:5:predominant     16      static

You’ll be able to see above that the stack-usage-example.su file accommodates a listing of
capabilities and the stack utilization in every.

That is fairly helpful when a single operate, however is tough to make use of
when attempting to investigate nested operate calls. Within the above instance, the complete
stack utilization in foo() is foo() + foo_2(), or 80 + 64 = 144 bytes, nevertheless it’s
not apparent from the .su knowledge.

There are some instruments to assist fill on this hole, which I’ll hyperlink under for
reference, however they’ll require a little bit of fiddling to get working accurately.

Honorable point out to the puncover
device (disclaimer, I assist keep the device), which might help visualize the information
within the .su recordsdata for a complete program:

Runtime evaluation with stack portray

This technique is quite common in embedded programs, and is utilized by each FreeRTOS
and Zephyr RTOS.

The concept is to “paint” the stack with a recognized worth, after which to measure stack
utilization by analyzing the very best location within the stack not set to the recognized worth
(aka “excessive watermark”).

Each FreeRTOS and Zephyr RTOS apply this method for measuring stack utilization at
runtime:

Right here’s a brief instance of how this works in apply:

#embody <stdint.h>

#outline STACK_FILL_VALUE 0xaa

// Helper operate to initialize a stack with the fill sample
void stack_initialize_fill_value(uintptr_t stack_top, size_t stack_size) {
  memset((void *)stack_top, STACK_FILL_VALUE, stack_size);
}

// Operate to return the utmost stack utilization
uintptr_t stack_high_watermark(uintptr_t stack_top, size_t stack_size) {
  uint8_t *stack = (uint8_t *)stack_top;
  uint8_t *stack_end = (uint8_t *)(stack_top + stack_size);


  // To compute the stack utilization, iterate over the stack reminiscence and discover the
  // highest location that does not match the fill worth
  whereas (stack < stack_end && *stack == STACK_FILL_VALUE) {
    stack++;
  }

  return (uintptr_t)stack;
}

Use this technique after the system executes core performance to gather a
moderately correct measurement of the utmost stack utilization. utilization.

Be aware: since this strategy depends on values being written to the stack, if a
buffer is allotted on the stack however solely partially used, the complete buffer dimension
is not going to be counted as used. Any reminiscence nonetheless containing the fill sample will
be counted as unused

One other utility of this method is to make use of it to measure stack consumption
of a specific operate name, with the next process:

Simply earlier than the operate name, re-apply the stack fill sample throughout the
unused portion of the stack
Save the present stack pointer worth, and name the operate of curiosity
Measure the stack excessive watermark, and compute the distinction between the
saved stack pointer worth and the excessive watermark. That is the stack utilization of
the operate

Runtime evaluation with GDB single-stepping

It is a approach I utilized lately to measure stack utilization with out modifying
the goal system code.

The concept is to make use of GDB to single-step by way of this system, and to dump the stack
pointer register, $sp, at each step. I used a small GDB Python script to do
this work, a small snippet is proven under:

# save the beginning stack pointer worth
sp_start = int(gdb.parse_and_eval("$sp"))
print("Beginning SP: 0x{:08x}".format(sp_start))

# document the bottom (reminiscence index) value- that is the max stack utilization for the
# operate. initialize it to the utmost addressable tackle.
sp_min = 0xFFFFFFFF
whereas True:
    # single instruction step, so we seize _any_ sp change
    gdb.execute("si")
    sp = int(gdb.parse_and_eval("$sp"))
    if sp < sp_min:
        sp_min = sp
    # exit this loop when a breakpoint is hit
    if hit_breakpoint():
      break
print("Min SP: 0x{:08x}".format(sp_min))
print("Delta SP: {}".format(sp_start - sp_min))

The script is very gradual to run because it interrupts program stream at each
instruction, nevertheless it offers a really correct outcome.

It may be enhanced to supply extra knowledge, for instance recording the max utilization for
every body in a name stack:

# write the verbose knowledge to a file
with open("stack_usage.log", "w") as f:
    sp_min = 0xFFFFFFFF
    current_function = None
    current_function_max_su = 0
    whereas True:
        gdb.execute("si")
        sp = int(gdb.parse_and_eval("$sp"))
        su = sp_start - sp
        identify = gdb.selected_frame().identify()
        # if we simply modified operate identify, document the excessive water mark
        # from the final operate
        if current_function is not None and identify != current_function:
            f.write(f"{current_function}|{current_function_max_su}n")
            current_function_max_su = 0
        current_function = identify
        if su > current_function_max_su:
            current_function_max_su = su
        if sp < sp_min:
            sp_min = sp
        # test if we have reached a breakpoint
        if hit_breakpoint():
            break
    f.write(f"{current_function}|{current_function_max_su}n")

Plotting that utilization may be fascinating:

# utilizing the superb gnuplot to plot a visualization of the stack utilization
❯ reduce -d | -f 2 stack_usage.log | 
     | gnuplot -e "set title 'Stack Utilization'; plot '-' with traces" -persist