All my favourite tracing tools: eBPF, QEMU, Perfetto, new ones I built and more

2023-12-05 16:56:29

Ever wanted more ways to understand what’s going on in a program? Here I catalogue a huge variety of tracing methods you can use for different kinds of problems. Tracing has been such a long-standing interest (and job) of mine that some of these will be novel and interesting to anyone who reads this. I’ll guarantee it by including 2 novel tracing tools I’ve made and haven’t shared before (look for this: Tooling drop!).

What I see as the key parts of tracing are collecting timestamped data on what happened in a system, and then ideally visualizing it in a timeline UI instead of just as a text log. First I’ll cover my favourite ways of really easily getting trace data into a nice timeline UI, because it’s a superpower that makes all the other tracing tools more interesting. Then I’ll go over ways to get that data, everything from instrumentation to binary patching to processor hardware features.

I’ll also give a real-life example of combining eBPF tracing with Perfetto visualization to diagnose tail latency issues in huge traces by using a number of neat tricks. Look for the “eBPF Example” section.

Note: I’m hiring for my accelerator optimization team at Anthropic! See the bottom of the post for more detail.

Getting event data onto a nice zoomable timeline UI is way easier than most people think. Here’s my favourite method, which I use all the time and which can take you from logging your data to visualizing it in minutes:

# from:
print("%d: %s %d" % (timestamp, event_name, duration))
# to:
with open('trace.json', 'w') as f:
  f.write("[\n")
  # "ts" and "dur" are in microseconds; separate multiple events with commas
  f.write('{"name": "%s", "ts": %d, "dur": %d, "cat": "hi", "ph": "X", "pid": 1, "tid": 1, "args": {}}\n' %
    (event_name, timestamp, duration))
  f.write("]\n") # this closing ] isn't actually required

That’s the power of the Chromium Event JSON Format. It’s a super simple JSON format that supports a bunch of different kinds of events, and is supported by a lot of different profile visualizer tools.
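For more than one event, a small helper built on the json module takes care of string escaping and of the commas between array entries. This is only a minimal sketch: the helper names (complete_event, write_trace), the example events and the output path are made up for illustration, while the field names are the ones from the format shown above.

```python
import json
import os
import tempfile

def complete_event(name, ts_us, dur_us, tid=1, **args):
    """Build one "complete" (ph="X") event; ts and dur are in microseconds."""
    return {"name": name, "cat": "hi", "ph": "X", "ts": ts_us, "dur": dur_us,
            "pid": 1, "tid": tid, "args": args}

def write_trace(path, events):
    """Write a list of event dicts as one JSON array."""
    with open(path, "w") as f:
        json.dump(events, f)

# Hypothetical events: three spans on two threads.
events = [
    complete_event("parse", 0, 120, file="a.txt"),
    complete_event("compile", 130, 900),
    complete_event("link", 1040, 300, tid=2),
]
path = os.path.join(tempfile.gettempdir(), "trace.json")
write_trace(path, events)
```

Keyword arguments end up in "args", so they show up in the details pane when you click the event in the viewer.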

You can view the resulting trace files in Google’s Perfetto trace viewer, or in the older Catapult viewer (which is nicer for some traces) by going to chrome://tracing in Chrome. You can play around with the UI by going to Perfetto and clicking “Open Chrome Example” in the sidebar. Here’s a screenshot showing an event annotated with arguments and flow event arrows:

Perfetto Screenshot

My coworkers and I do this all the time at work, whipping up trace visualizations for new data sources in under an hour and adding them to our growing set of trace tools. We have a Python utility to turn a trace file into a clickable permanently-saved intranet link we can share with coworkers in Slack. This is easy to set up by building a copy of Perfetto and uploading it to a file hosting server you control, and then putting trace files on that server and generating links using Perfetto’s ?url= parameter. We also write custom trace analysis scripts by loading the simple JSON into a Pandas dataframe.
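As a rough idea of what such an analysis script can look like, here is a sketch that loads a trace in the JSON format above and summarizes span durations per event name. It deliberately uses only the standard library so it runs anywhere; with Pandas you would typically build a DataFrame from the same list of dicts. The sample events and the summarize helper are hypothetical.

```python
import json
from collections import defaultdict

def summarize(events):
    """Return {name: (count, total_dur_us, max_dur_us)} for complete ("X") events."""
    stats = defaultdict(lambda: [0, 0, 0])
    for ev in events:
        if ev.get("ph") != "X":
            continue  # skip instant events, metadata and so on
        s = stats[ev["name"]]
        s[0] += 1
        s[1] += ev["dur"]
        s[2] = max(s[2], ev["dur"])
    return {name: tuple(s) for name, s in stats.items()}

# Hypothetical trace contents; normally this would be read from a file.
trace_text = """[
{"name": "send", "ph": "X", "ts": 0, "dur": 40, "pid": 1, "tid": 1},
{"name": "recv", "ph": "X", "ts": 50, "dur": 10, "pid": 1, "tid": 1},
{"name": "send", "ph": "X", "ts": 70, "dur": 900, "pid": 1, "tid": 1}
]"""
summary = summarize(json.loads(trace_text))
for name, (count, total, worst) in sorted(summary.items()):
    print(f"{name}: n={count} total={total}us max={worst}us")
```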

I like Perfetto because its use of WebAssembly lets it scale to about 10x more events than Catapult (although it gets laggy), and you have the escape hatch of the native backend for even bigger traces. Its SQL query feature also lets you find events and annotate them in the UI using arbitrary predicates, including special SQL functions for dealing with trace stacks.

UI protip: Press ? in Perfetto to see the shortcuts. I use both WASD and CTRL+scroll to move around.

Advanced Format: Fuchsia Trace Format

The Chromium JSON format can produce gigantic files and be very slow for large traces, because it repeats both the field names and string values for every event. Perfetto also supports the Fuchsia Trace Format (FTF) which is a simple compact binary format with an incredible spec doc that makes it easy to produce binary traces. It supports interning strings to avoid repeating event names, and is designed around 64 bit words and supports clock bases so that you can directly write timestamp counters and have the UI compute the true time.

When I worked at Jane Street I used this to log instrumentation events to a buffer directly in FTF as they occurred in <10ns per span (it would have been closer to 4ns if it wasn’t for OCaml limitations).
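To see why interning matters, here is a toy comparison. Note that this is NOT the FTF record layout (use the spec doc for that); it is a made-up binary encoding whose only purpose is to show the size difference between repeating a name in every record and writing it once, then referring to it by index.

```python
import struct

def encode_plain(events):
    """Every record repeats its name: 2-byte length, name, then ts and dur."""
    out = bytearray()
    for name, ts, dur in events:
        raw = name.encode()
        out += struct.pack("<H", len(raw)) + raw + struct.pack("<QQ", ts, dur)
    return bytes(out)

def encode_interned(events):
    """Each distinct name is written once; records refer to it by a 16-bit index."""
    table, out = {}, bytearray()
    for name, ts, dur in events:
        if name not in table:
            table[name] = len(table)
            raw = name.encode()
            # "S" record: string index, length, bytes (toy layout)
            out += b"S" + struct.pack("<HH", table[name], len(raw)) + raw
        # "E" record: string index, timestamp, duration (toy layout)
        out += b"E" + struct.pack("<HQQ", table[name], ts, dur)
    return bytes(out)

events = [("handle_request", t * 100, 42) for t in range(1000)]
plain, interned = encode_plain(events), encode_interned(events)
print(len(plain), len(interned))
```

The saving grows with the length of the names and with how often they repeat, which is the common case for traces with a handful of hot functions.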

Advanced Format: Perfetto Protobuf

Another format which is similarly compact, and also supports more features, is Perfetto’s native Protobuf trace format. It’s documented only in comments in the proto files and is a bit trickier to figure out, but might be a bit easier to generate if you have access to a protobuf library. It enables access to advanced Perfetto features like including callstack samples in a trace, which aren’t available with other formats. It’s slower to write than FTF, although Perfetto has a ProtoZero library to make it somewhat faster.

This can be really tricky to get right though, and I had to reference the Perfetto source code a lot to figure out error codes in the “info and stats” tab. The biggest gotchas are that you need to set trusted_packet_sequence_id on every packet, have a TrackDescriptor for every track, and set sequence_flags=SEQ_INCREMENTAL_STATE_CLEARED on the first packet.

Other tools

Some other nice trace visualization tools are Speedscope which is better for a hybrid between profile and trace visualization, pprof for pure profile call graph visualization, and Rerun for multimodal 3D visualization. Other profile viewers I like less but which have some good features include Trace Compass and the Firefox Profiler.

Now let’s go over all sorts of different neat tracing methods! I’ll start with some obscure and interesting low-level ones but I promise I’ll get to some more broadly usable ones after.

Hardware breakpoints

For ages, processors have supported hardware breakpoint registers which let you put in a small number of memory addresses and have the processor interrupt itself when any of them are accessed or executed.

perf and perftrace

Linux exposes this functionality through ptrace but also through the perf_event_open syscall and the perf record command. You can record a process like perf record -e mem:0x1000/8:rwx my_command and view the results with perf script. It costs about 3us of overhead every time a breakpoint is hit.

Tooling drop! I wrote a tiny Python library called perftrace with a C stub which calls the perf_event_open syscall to record timestamps and register values when the breakpoints are hit.

It currently only supports execution breakpoints but you can also breakpoint on reads or writes of any memory and it would be easy to modify the code to do that. Hardware breakpoints are basically the only way to watch for accesses to a specific memory address at a fine granularity without adding overhead to code which doesn’t touch that memory.

GDB scripting

In addition to using it manually, you can automate the process of following the execution of a program using debugger breakpoints by using GDB’s Python scripting interface. This is slower than perf breakpoints but gives you the ability to inspect and modify memory when you hit breakpoints. GEF is an extension to GDB that in addition to making it much nicer in general, also extends the Python API with a bunch of handy utilities.

Tooling drop! Here’s an example GDB script I wrote using GEF which gives examples of how to puppeteer, trace and inspect a program
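For a feel of the plain GDB Python API (without GEF), here is a minimal sketch of a breakpoint that logs every hit as an instant event in the Chromium JSON format and lets the program keep running. It is not the script linked above. The target function (malloc), the output file name and the record_hit/dump helpers are assumptions chosen for illustration, and the gdb module only exists inside GDB’s embedded interpreter, so the breakpoint part is skipped when the file is run with ordinary Python.

```python
import json
import time

try:
    import gdb  # only importable inside GDB's embedded Python interpreter
except ImportError:
    gdb = None

events = []

def record_hit(name, ts_us):
    """Append an instant ("i") event, thread-scoped, in the Chromium JSON format."""
    events.append({"name": name, "ph": "i", "s": "t", "ts": ts_us, "pid": 1, "tid": 1})

def dump(path):
    """Write all recorded events as a JSON array."""
    with open(path, "w") as f:
        json.dump(events, f)

if gdb is not None:
    class TraceBreakpoint(gdb.Breakpoint):
        def stop(self):
            frame = gdb.selected_frame()
            record_hit(frame.name() or "?", int(time.monotonic() * 1e6))
            return False  # returning False tells GDB not to stop here

    TraceBreakpoint("malloc")  # hypothetical function to trace
    # Write the trace out when the debugged program exits.
    gdb.events.exited.connect(lambda event: dump("gdb_trace.json"))
```

You would load it with something like gdb -x trace_bp.py --args ./my_program and then type run; the timestamps include GDB’s own overhead, so treat them as ordering information more than as precise timings.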

Intel Processor Trace

Intel Processor Trace is a hardware technology on Intel chips since Skylake which allows recording a trace of every instruction the processor executes by recording enough info to reconstruct the control flow in a super-compact format, along with fine-grained timing info. It has extremely low overhead since it’s done by hardware and writes bypass the cache, so the only overhead is reducing main memory bandwidth by about 1GB/s. I see no noticeable overhead at all on most program benchmarks I’ve tested.

You can access a dump of the assembly instructions executed in a recorded region using perf, lldb and gdb.


magic-trace

However, assembly traces aren’t useful to most people, so when at Jane Street I created magic-trace along with my intern Chris Lambert, which generates a trace file (using FTF and Perfetto as described above) which visualizes every function call in a program execution. Jane Street generously open-sourced it so anyone can use it! Since then it’s been extended to support tracing into the kernel as well. I wrote a blog post about how it works for the Jane Street tech blog.

magic-trace demo

Processor Trace can record to a ring buffer, and magic-trace uses the hardware breakpoint feature described earlier to let you trigger capture of the last 10ms whenever some function that signals an event you want to look at is called, or when the program ends. This makes it great for a bunch of scenarios:

  • Debugging rare tail latency events: Add a trigger function call after something takes unusually long, and then leave magic-trace attached in production. Because it captures everything, you’ll never be stuck without enough logged data to identify the slow part.
  • Everyday performance analysis: A full trace timeline can be easier to interpret than a sampling profiler visualization, especially because it displays the difference between a million fast calls to a function and one slow call.
    • It’s typical to find performance problems on systems that had only ever been analyzed with a sampling profiler by noticing the first time you magic-trace the program that many functions are being called more times than expected or in places you didn’t expect.
  • Debugging crashes: When a program crashes for reasons you don’t understand, you can just run it under magic-trace and see every function call leading up to the crash, which is often enough to figure out why the crash happened without adding extra logging or using a debugger!

If you want to modify magic-trace to suit your needs, it’s open-source OCaml. And if you like Rust more than OCaml someone made a simple Rust port called perf2perfetto.

Unfortunately, Processor Trace isn’t supported on many virtual machines that use compatible Intel hardware. Complain to your cloud provider to add support in their hypervisor or try bare-metal instances!

Instrumentation-based tracing profilers

What most people use to get similar benefits to magic-trace traces, especially in the gamedev industry, is low-overhead instrumentation-based profilers with custom UIs. One major advantage of instrumentation-based traces is that they can contain extra information about data and not just control flow; putting arguments from your functions into the trace can be key for figuring out what’s going on. These tools often support including other data sources such as OS scheduling info, CPU samples and GPU trace data. Here are my favourite tools like this and their pros/cons:


Tracy

Tracy screenshot

  • Cross platform, including good Linux sampling and scheduling capture
  • Overhead of only 2ns/span, supports huge traces with hundreds of millions of events
  • Really nice and fast UI with tons of features (check out the demo videos in the readme)
  • Integrates CPU sampling with detailed source and assembly analysis
  • Popular so there are bindings in non-C++ languages like Rust and Zig.
  • Con: Only supports a single string/number argument to events
  • Con: Timeline is overly aggressive in collapsing small events into squiggles (see my post on this).


Optick

Optick screenshot

  • Cross-platform, lots of features, very nice UI
  • Supports multiple named arguments per event
  • Con: Not as fleshed-out for non-game applications
  • Con: Sampling integration only works on Windows


Perfetto

  • Perfetto UI is nice, events can include arguments and flow event arrows
  • Integrates with other Perfetto data sources like OS events and sampling
  • Con: Higher overhead of around 600ns/span when tracing is enabled
  • Con: UI doesn’t scale to traces as large as the above two programs

Other programs

There’s a bunch more similar small programs that generally come with their own instrumentation library and their own WebGL profile viewer. These tend to be more lightweight and can be easier to integrate. For example Spall, microprofile, Remotery, Puffin (Rust-native), gpuviz. I also have to mention the OCaml tracing instrumentation library I wrote for Jane Street which has overheads under 10ns/span via a compile-time macro like the C++ libraries.


eBPF

If you want to trace things using the Linux kernel there’s a new game in town, and it’s awesome. The eBPF subsystem allows you to attach complex programs to all sorts of different things in the kernel and efficiently shuttle data back to userspace, basically subsuming all the legacy facilities like ftrace and kprobes such that I won’t talk about them.

Things you can trace include: syscalls, low overhead tracepoints throughout the kernel, hardware performance counters, any kernel function call and arbitrary breakpoints or function calls/returns in userspace code. Combined these basically let you see anything on the system in or out of userspace.

You normally write BPF programs in C but there are perhaps even nicer toolkits for using Zig and Rust.

There’s a whole bunch of ways to use eBPF and I’ll talk about some of my favourites here. Some other favourites I won’t go into in detail are Wachy and retsnoop.

BCC: Easy Python API for eBPF

The BPF Compiler Collection (BCC) is a library with really nice Python bindings for compiling eBPF programs from C source code, injecting them, and getting the data back. It has a really nice feature where you can write a C struct to hold the event data you want to record, and then it will parse that and expose it so you can access the fields in Python. Check out how simple this syscall tracing example is.

I really like having the full power of Python to control my tracing scripts. BCC scripts often use Python string templating to do compile time metaprogramming of the C to compose the exact probe script you want, and then do data post-processing in Python to present things nicely.
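Here is a minimal sketch of that pattern, not the example linked above: a Python template composes the C probe (here with an optional PID filter), the C struct describes the event, and the fields are read back in Python. The probe name trace_entry, the struct layout, the choice of the openat syscall and the helpers build_program and main are all assumptions for illustration. Running main() needs the bcc package and root privileges, which is why the import lives inside the function.

```python
from string import Template

PROBE_TEMPLATE = Template(r"""
#include <uapi/linux/ptrace.h>

struct event_t {
    u32 pid;
    u64 ts_ns;
    char comm[16];
};
BPF_PERF_OUTPUT(events);

int trace_entry(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    $pid_filter
    struct event_t ev = {};
    ev.pid = pid;
    ev.ts_ns = bpf_ktime_get_ns();
    bpf_get_current_comm(&ev.comm, sizeof(ev.comm));
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
""")

def build_program(target_pid=None):
    """Compose the C source; the filter is baked in when the program is built."""
    if target_pid is None:
        pid_filter = ""
    else:
        pid_filter = "if (pid != %d) { return 0; }" % target_pid
    return PROBE_TEMPLATE.substitute(pid_filter=pid_filter)

def main(target_pid=None):
    from bcc import BPF  # requires the bcc package and root privileges

    b = BPF(text=build_program(target_pid))
    b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="trace_entry")

    def on_event(cpu, data, size):
        ev = b["events"].event(data)  # fields parsed from the C struct above
        print(ev.ts_ns, ev.pid, ev.comm.decode())

    b["events"].open_perf_buffer(on_event)
    while True:
        b.perf_buffer_poll()
```

Calling main() as root prints one line per openat call; passing a PID compiles a probe that drops every other process inside the kernel, before any data is copied to userspace.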

bpftrace: terse DSL for eBPF tracing

If you want a terser way to compose tracing programs, in the style of dtrace, check out bpftrace. It lets you write one liners like these:

# Files opened by process
bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }'

# Count LLC cache misses by process name and PID (uses PMCs):
bpftrace -e 'hardware:cache-misses:1000000 { @[comm, pid] = count(); }'

ply: simpler bpftrace

If you want something like bpftrace but simpler and faster with no LLVM dependencies, check out ply.

# Which processes are receiving errors when reading from the VFS?
ply 'kretprobe:vfs_read if (retval < 0) { @[pid, comm, retval] = count(); }'

eBPF Example: Anthropic’s Perfetto-based packet and user event tracing

For work at Anthropic I wanted to analyze tail latency of some networking code so I used BCC and hooked into low-overhead kernel probe points to trace info from every single packet into a ring buffer. I could even include fields pulled from the packet header and NIC queue information, all at 1 million packets per second with no noticeable overhead.

Trick for tracing userspace events with low overhead in eBPF

I wanted to correlate packets with userspace events from a Python program, so I used a fun trick: Find a syscall which has an early-exit error path and bindings in most languages, and then trace calls to that which have specific arguments which produce an error. I traced the faccessat2 syscall such that in Python os.access(event_name, -932, dir_fd=-event_type), where event_type was an enum for start, stop and instant events, would log spans to my Perfetto trace. This had an overhead of around 700ns/event, which is in a similar league to Perfetto’s full-userspace C++ instrumentation, and a lot of that is Python call overhead. The os.access function is especially nice because when the syscall errors it doesn’t incur overhead by generating a Python exception like most other syscall wrappers do.
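The userspace half of this trick can be wrapped up in a few lines. This is a sketch under assumptions: the numeric values of the event-type enum and the emit/span helper names are made up, and it assumes a platform (such as Linux) where os.access supports dir_fd. The eBPF probe that recognizes these calls in the kernel is not shown.

```python
import os
from contextlib import contextmanager

MAGIC_MODE = -932  # invalid access mode from the text, so the syscall fails early
EVENT_START, EVENT_STOP, EVENT_INSTANT = 1, 2, 3  # hypothetical enum values

def emit(event_name, event_type):
    """Make the marker syscall; os.access returns False instead of raising."""
    return os.access(event_name, MAGIC_MODE, dir_fd=-event_type)

@contextmanager
def span(event_name):
    """Emit a start marker on entry and a stop marker on exit."""
    emit(event_name, EVENT_START)
    try:
        yield
    finally:
        emit(event_name, EVENT_STOP)

with span("send_batch"):
    emit("checkpoint", EVENT_INSTANT)
```

Without a probe attached, the calls simply fail and return False, so the instrumentation can be left in place permanently.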

How to process events more quickly using a C helper with BCC

With 1 million packets per second I had a problem: because tail latency events are rare, my traces quickly got huge and lagged Perfetto. I wanted to only keep data from shortly before one of my userspace send events took too long. Normally you’d do this with a circular buffer that gets snapshotted, and it would be possible to implement that in eBPF. But I didn’t want to implement my own ringbuf and the included ones don’t support wraparound overwriting. So instead I used the internal _open_ring_buffer function to register a ctypes C function as a ringbuffer callback instead of a Python function, and wrote an efficient C callback to filter out packets near a tail latency event before passing those to Python.
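The filtering logic itself is small. The sketch below shows the “keep only what happened shortly before a slow event” idea in plain Python with a deque; it illustrates the logic only, not the C callback described above, and the tuple layout, window and threshold values are assumptions.

```python
from collections import deque

def keep_before_slow(events, window_us, threshold_us):
    """Yield only events within window_us before (and including) a slow event.

    events: iterable of (ts_us, dur_us, name) tuples in timestamp order.
    """
    recent = deque()
    for ev in events:
        ts, dur, _name = ev
        recent.append(ev)
        # Drop buffered events that are too old to matter any more.
        while recent and recent[0][0] < ts - window_us:
            recent.popleft()
        if dur >= threshold_us:
            # Snapshot: flush the buffer, then start again from empty.
            yield from recent
            recent.clear()

# Hypothetical stream: fast packets plus one slow userspace send.
stream = [(0, 1, "pkt"), (100, 1, "pkt"), (900, 1, "pkt"),
          (1000, 500, "send"), (5000, 1, "pkt")]
kept = list(keep_before_slow(stream, window_us=200, threshold_us=100))
print(kept)
```

Because old events are dropped as new ones arrive, memory use stays bounded by the window size no matter how long the capture runs.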

Perks of Perfetto visualization

I used the Perfetto Protobuf format with interned strings in order to keep trace size down to a few bytes per packet.

I could use Perfetto’s SQL support in the resulting trace to query for send events above a certain time threshold after startup in a specific process. Here’s a screenshot showing a long send event coinciding with packets starting to be paced out with larger gaps on one of the queues, including the ability to have line graph tracks:

Perfetto Packet Trace

I think it’s kinda crazy that we have all these different mostly-text-based BPF tools rather than a framework that lets you put all sorts of different kinds of system events into a trace UI, including easily scripting your own new events. It’s so much easier to investigate this kind of thing with a timeline UI. I started building that framework at Anthropic, but only spent a week on it since I’ve had higher priority things to do since I did the packet latency investigation.

Binary Instrumentation

If you’re instrumenting userspace programs in a way where the overhead of kernel breakpoints is too high, but you don’t have access to the source code, perhaps because you’re reverse-engineering something, then it may be time for binary instrumentation.

bpftime: eBPF-based binary instrumentation

One easy way that’s a good segue is bpftime which takes your existing eBPF programs with userspace probes, and runs them much faster by patching the instructions to run the BPF program inside the process rather than incurring 3us of kernel interrupt overhead every time.


E9Patch

For more sophisticated binary patching on x86, look to E9Patch.

On some architectures, patching can be really simple since you just patch the instruction you want to trace with a jump to a piece of “trampoline” code which has your instrumentation, and then the original instruction and a jump back.

It’s much harder on x86 since instructions are variable length, so if you just patch a jump over a target instruction, sometimes that’ll cause problems since some other instruction jumps to an instruction your longer jump had to stomp over.

People have invented all kinds of clever tricks to get around these issues including “instruction punning” where you put your patch code at addresses which are also valid x86 nop or trap instructions. E9Patch implements very advanced versions of these tricks such that the patching should basically always work.

It comes with an API as well as a tool called E9Tool which lets you patch using a command line interface:

# print all jump instructions in the xterm binary
$ e9tool -M jmp -P print xterm
jz 0x4064d5
jz 0x452c36


Frida

The other way to get around the difficulty of static patching, where you have to be conservative around how jumps you don’t know about could be messed up by your patches, is dynamic binary instrumentation, where you basically puppeteer the execution of the program. This is the technique used by JIT VMs like Rosetta and QEMU to basically recompile your program as you run it.

Frida exposes this extremely powerful technique in a general way you can script in Javascript using its “Stalker” interface, allowing you to attach JS snippets to pieces of code or rewrite the assembly as it’s run. It also lets you do more standard patching, although it doesn’t work as well on x86 as E9Patch.


LD_PRELOAD

If you just want to trace a function in a dynamic library like libc, you can use LD_PRELOAD to inject a library of your own to replace any functions you like. You can use dlsym(RTLD_NEXT, "fn_name") to get the old implementation in order to wrap it. Check out this tutorial post for how.

Distributed Tracing

Distributed Tracing is where you can trace across different services by attaching special headers to requests and sending all the timing data back to a trace server. Some popular solutions are OpenTelemetry (of which there are many implementations and UIs) and Zipkin.

There’s some cool new solutions like Odigos that use eBPF to add distributed tracing support without any instrumentation.
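To make the “special headers” concrete: one widely used header is the W3C Trace Context traceparent header, which OpenTelemetry can propagate. The sketch below only shows how the IDs flow from an incoming request to an outgoing one; a real service would normally let its tracing library handle this, and the helper names here are made up.

```python
import secrets

def new_traceparent():
    """Start a new trace: version 00, random trace id and span id, sampled flag."""
    return "00-%s-%s-01" % (secrets.token_hex(16), secrets.token_hex(8))

def child_traceparent(parent):
    """Keep the trace id from the incoming header but mint a new span id."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return "-".join([version, trace_id, secrets.token_hex(8), flags])

# A service receives `incoming` on a request and attaches a child header
# to the requests it makes downstream.
incoming = new_traceparent()
outgoing_headers = {"traceparent": child_traceparent(incoming)}
```

Because every hop keeps the same trace id, the trace server can stitch spans reported by different services into one timeline.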

Sampling Profilers

Sampling profilers take a sample of the full call stack of your program periodically. Typical profiler UIs don’t have the time axis I’d think of as part of “tracing”, but some UIs do. For example Speedscope accepts many profiler data formats and can visualize with a time axis, and Samply is an easy to use profiler which uses the Firefox Profiler UI, which also has a timeline view.

One neat sampling method used by py-spy and rbspy is to use the process_vm_readv syscall to read memory out of a process without interrupting it. If, like an interpreter, the process stores info about what it’s doing in memory, this can allow you to follow it with no overhead on the target process. You could even use this trick for low-overhead native program instrumentation: set up a little stack data structure where you push and pop pointers to span names or other context info, and then sample it from another program when needed using eBPF or process_vm_readv.

QEMU Instrumentation

When all other tracing tools fail, sometimes you have to fall back on the most powerful tool in the tracing toolbox: Full emulation and hooking into QEMU’s JIT compiler. This theoretically allows you to trace and patch both control flow and memory, in both userspace and the kernel, including snapshot and restore, across many architectures and operating systems.

However, actually doing this isn’t for the faint of heart and the tooling for it only barely exists.


Cannoli

Cannoli is a tracing engine for qemu-user (so no kernel stuff) which patches QEMU to log execution and memory events to a high-performance ringbuffer read by a Rust extension you compile. This lets it trace with very low overhead by spreading the load of following the trace over many cores, at the cost of not being able to modify the execution.

It’s a bit tricky to use: you have to compile QEMU and Cannoli yourself at the moment, and it’s kind of a prototype, so when I’ve used it in the past for CTFs I’ve often had to add new features to it.

QEMU TCG Plugins

QEMU has recently added plugin support for its TCG JIT. Like Cannoli this is read-only for now, and it’s likely slower than Cannoli, but it works in qemu-system mode and exposes slightly different functionality.


usercorn

My friend has an old project called usercorn that’s mostly bitrotted but has the ability to trace programs using QEMU and analyze them with Lua scripts and all sorts of fancy trace analysis. Someone (possibly him eventually) could theoretically revive it and rebase it on top of something like QEMU TCG plugins.

If you made it to the bottom and enjoyed all these different tracing methods, you may also be interested in working on my team!

I lead the performance optimization team at Anthropic (we build one of the world’s leading large language models, and have a heavy focus on figuring out how future more powerful models can go well for the world). We’ll be doing accelerator kernel optimization across GPUs, TPUs and Trainium. TPUs and Trainium are cool in that they’re simpler architectures where optimization is more like a cycle-counting puzzle, and they also have amazing tracing tools. Almost nobody knows these new architectures, so we’re currently hiring high potential people with other kinds of low-level optimization experience who are willing to learn.

I plan for us to do a bunch of optimization work as compiler-style transformation passes over IRs, but simpler via being bespoke to the ML architecture we’re optimizing. These will parallelize architectures across machines, within a machine, and within a chip in similar ways. We also work closely with an amazing ML research team to do experiments together and come up with architectures that jointly optimize for ML and hardware performance.

Anthropic recently received ~$6B in funding commitments, and is investing it heavily in compute. We currently have ~5 performance specialists, with each one making an immense contribution in helping us have models that exhibit interesting capabilities for our alignment researchers and policy teams.

AI now is still missing a lot, but progress is incredibly fast. It’s hard for me to say the coming decade of progress won’t lead to AI as good as us at nearly all jobs, which would be the biggest event in history. Anthropic is unusually full of people who joined because they really care about ensuring this goes well. I think we have the world’s best alignment, interpretability research, and AI policy teams, and I personally work on performance optimization here because I think it’s the best way to leverage my comparative advantage to help the rest of our efforts succeed at steering towards AI going well for the world in the event it keeps up this pace.

If you too would like to do fun low-level optimization on what I think will be the most important technology of this decade and want to chat: Email me at [email protected] with a link or paragraph about the most impressive low-level or performance thing you’ve done. And feel free to check out some of my other performance writing.
