Grokking Large Unfamiliar Codebases – Jeremy’s Weblog
There are many reasons you may end up in a large unfamiliar codebase.
You may have simply started a new position.
You may be doing technical due diligence on a potential acquisition or contract with an external company.
You may be evaluating a third-party framework or engine.
Regardless of how you got there, learning to navigate unfamiliar code effectively is a useful skill that’s seldom taught or discussed.
Having evaluated and worked in many (sometimes wildly different) codebases over time, I’ve developed a personal process for understanding new code that I wanted to share here.
This advice is clearly somewhat personal, so it should be read through the lens of a game engine programmer (who mostly does rendering-esque work).
While the content here is geared primarily towards understanding large codebases, I think many of these tips will generalize to smaller codebases or codebases in other domains as well.
General mindset
There are a few essential guidelines that I feel make sense throughout the process of exploring a new codebase. In no particular order:
- Think critically (without being critical, yet!).
- Be comfortable treating certain systems or subcomponents as “black boxes” until you are ready to revisit them later.
- Rely on documentation where possible, but don’t blindly trust it.
- Lean on tools liberally (debuggers, profilers, custom tools, built-in tools).
- If help is available, don’t be afraid to ask if you get stuck, ideally with some context on what you were trying to understand, and what was tried already.
- Let your instincts guide you if you think something feels much more difficult than you’d expect, drawing from prior experience as necessary.
- On the flipside, exercise judgement to identify key areas where what’s observed differs from what your experience would lead you to expect.
What this boils down to is that it’s important to maintain an active mindset rather than a passive one.
Equipped with an active mindset, you are in the driver’s seat. You form hypotheses and then seek to prove or disprove them.
You compose guiding questions and then probe to answer them. You probe for strengths and weaknesses of a presented solution
and then work backwards to understand why an approach was taken. At the end of an active learning session, it’s generally clear
what was accomplished, and how to resume study in the next session.
In contrast, a passive mindset is a fairly unproductive one. Instead of following a targeted and intentional path through uncharted territory,
I find that without actively maintained goals and subgoals, my time is spent meandering.
After a few hours in this state, it’s not always easy to understand what was accomplished in that time, or where to even pick up when resuming.
High-level goal setting
Before beginning your journey in a new codebase, it’s important to clarify what it is you’re trying to accomplish.
In my case, I am usually doing one of several tasks:
- Auditing a codebase for performance problems and general code quality
- Learning a new codebase that will be my “home” for months to years as part of a new role
- Understanding a specific facet of the codebase in order to extend or modify it for a client or employer
While identifying a goal or two isn’t hard to do, once you dive into things it can be easy to get side-tracked and overwhelmed, especially when the codebase in question is many millions of source lines of code (SLOC) in weight.
This goal can change over the course of your engagement with any codebase, and it’s possible to have many goals on the docket.
That said, even if you have several goals in mind, I highly recommend having a narrow set of goals active at any one time in order to maintain your focus.
Build and run, debugger and profiler in tow
Regardless of what your high-level goals are, you need to be completely confident in your ability to:
- Compile the code
- Make changes to the codebase (source/header edits, adding new source files, etc.)
- Observe the code running in a debugger
- Observe the code running in a profiler (possibly on more than one platform if needed)
As a result, at a minimum, the first order of business is a high-level understanding of the codebase’s build system.
In some cases, this may be a custom build system (e.g. UnrealBuildTool), and in others, it may be a CMake or Makefile based setup, or something else entirely.
At this point, I make sure to get a frame capture of a representative frame (this part is clearly game or game engine specific).
Ideally, the frame capture is instrumented so you can see high-level markers placed to indicate, e.g., where physics sim is running, or where rendering starts.
You will need these anchor points constantly, so I’ll usually take the frame capture at least once and basically leave it open at all times.
Finding the markers is usually as easy as a textual search of the marker name. This avoids the cold-start problem of “where do I set my breakpoint” since you
will always at least have these markers as a starting point for your search.
It stands to reason, by the way, that understanding your tools pays for itself many times over. In my field, these tools are Visual Studio, RenderDoc, PIX,
VTune, RGP, RGA, Superluminal, Nsight, Aftermath, and a whole host of other tools depending on the platform I’m working on.
The supporting tools and frameworks matter too. Git, GitHub, GitLab, CMake, Clang, ASAN, etc. are all too ubiquitous in my line of work to ignore for too long.
I don’t have to like them, but having more than a cursory understanding of how to operate within these tools and toolchains gives me a leg up in getting onboarded.
Asking the right questions and getting answers
With a goal in mind, I like having a few active questions in mind at any moment.
For example, suppose I’m trying to understand how memory is managed by a hypothetical renderer.
The candidate questions I might have in mind are:
- At a high level, how is memory versioned from frame to frame on the CPU and the GPU?
- How are CPU and GPU writes synchronized/fenced?
- What threads are involved?
- What suballocation schemes are used for buffer and texture memory?
- How are UMAs and discrete memory architectures handled differently?
- etc.
The actual substance of the questions isn’t that important, but the point is that I might have several dozen of these questions on a given topic.
From there, I pare that set down to one or maybe two targeted questions. At this point, using the tools mentioned above, I can start making specific probes to answer the questions.
Suppose I’m trying to find out how CPU and GPU memory is synchronized. I know that I’m looking for:
- Waits on a fence of some kind (depending on RHI backend)
- Some kind of write-combined or readback memory, depending on whether I’m investigating the upload or readback case
As a result, I know that for a specific backend (D3D12 in this example), I can identify all locations where ID3D12Fence::SetEventOnCompletion and ID3D12Fence::Signal are invoked, and set breakpoints accordingly.
Working backwards from each unique callstack, I can work out the family of code paths that result in events that pertain to the GPU’s DMA engine. I may need to filter out “false positives” (uses of those APIs for
synchronization unrelated to memory), but the search will quickly narrow.
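As a minimal sketch of that first step, a short script can list every line that mentions the APIs in question, which gives a concrete set of places to put breakpoints. Python is used here only for brevity. The function name `find_call_sites`, the extension list, and the whole-word matching are assumptions made for illustration, and an IDE’s find-in-files does the same job.

```python
import re
from pathlib import Path

# Hypothetical default: adjust to whatever extensions the codebase actually uses.
SOURCE_EXTENSIONS = {".h", ".hpp", ".inl", ".c", ".cpp"}


def find_call_sites(root, identifiers, extensions=SOURCE_EXTENSIONS):
    """Return (path, line_number, line_text) for each source line mentioning an identifier."""
    # \b keeps "Signal" from matching inside names like "SignalAll" or "OnSignaled".
    names = "|".join(re.escape(name) for name in identifiers)
    pattern = re.compile(r"\b(" + names + r")\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        # errors="replace" tolerates the occasional file that is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), line_number, line.strip()))
    return hits
```

Calling `find_call_sites("path/to/engine", ["SetEventOnCompletion", "Signal"])` returns candidate locations only. A purely textual match also picks up comments and unrelated functions that share a name, and it misses calls hidden behind macros or wrappers, so the debugger still has the final say on which hits matter.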
To generalize the above example, the basic idea is to lift from a low-level abstraction to a high-level one.
Use what you know about the platforms, underlying framework, or third-party APIs to start your search and work bottom-up to learn the codebase’s higher-level abstractions.
Candidate breakpoints include platform APIs (e.g. CreateFile, CreateThread, vkCreateBuffer, etc.), APIs of known dependencies related to your search (e.g. ASIO, Unreal Engine APIs, protocol buffers, etc.), or
functions that showed up directly in your frame capture.
Now suppose you have a bunch of call stacks. You have the beginnings of a high-level grasp of the execution flow of the program.
The next order of business is to transition that understanding from execution to data.
How are objects modeled? How are they stored? Who owns what and how are lifetimes managed?
After identifying useful representative callstacks (this is the bottom-up approach), the next thing to do is approach the codebase top-down by studying the data structures themselves and understanding why and how
a given callstack was formed.
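One way to organize a pile of collected call stacks before moving on to the data structures is to deduplicate them and see where they diverge. The sketch below is not tied to any particular debugger. It assumes each stack has already been copied out as a list of frame names, outermost frame first, and the helper names `unique_stacks` and `shared_entry_path` are hypothetical.

```python
from collections import Counter


def unique_stacks(stacks):
    """Count identical call stacks; each stack is a sequence of frame names, outermost first."""
    return Counter(tuple(stack) for stack in stacks)


def shared_entry_path(stacks):
    """Return the leading frames that every stack has in common."""
    shared = []
    # zip stops at the shortest stack, so only positions present in every stack are compared.
    for frames in zip(*stacks):
        if len(set(frames)) != 1:
            break
        shared.append(frames[0])
    return shared
```

The frame after the shared prefix is where the code paths split, which is often a useful place to start reading the surrounding data structures.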
After going top-down, or from execution to data, you may find other cases where you need to go bottom-up again, or identify new execution flows you want to examine.
This flow of study is extremely cyclical, and I find I’m never in a particular mode of thinking for too long.
Still, it’s often helpful to understand that this cycle is happening and will repeat.
Transitioning modes (i.e. execution to data, data to execution, bottom-up, top-down) is a great way to get “unstuck” and find new ideas.
The aggregation of the learning gathered across all these cycles is what forms the basis of your intuition about how a codebase operates as a whole.
In the diagram above, you are generally trying to fill in details about the bottom box, using your experience with the top box as a guide to seed investigation.
As you progress, you often transition back up again as discussed before to get a different angle on how things in the bottom box work.
You can imagine this diagram being replicated for every system in the codebase you are learning, and sometimes you need to make progress in one system before you revisit one that was previously opaque.
Document and verify
Taking notes is non-negotiable for me, not strictly as a memory aid, but as a means to communicate findings and verify them later.
No matter how hard you try, navigating new codebases is inherently messy, and mistakes will be made.
Perhaps you compiled the codebase without a specific setting that would change the behavior in an atypical way?
Perhaps you inadvertently tested against legacy content and analyzed a defunct codepath?
Perhaps you simply made an error when reading code over the course of many hours of study?
Ultimately, there are plenty of reasons why we can go astray, in spite of our best efforts, and we should simply treat mistakes as an inevitability.
Anticipating these mistakes then, I’ve found it’s always useful to compile a set of formatted notes to present to peers if possible.
The general format is a few paragraphs on how I think something works, some things I noticed, and outstanding questions I might have.
Even if I get it completely right, this process is often useful. Occasionally, peers I’ve presented information like this to are surprised (“oh, we do what?”).
On other occasions, peers themselves needed a refresher on a system. Of course, this feedback often catches mistakes as well.
The point though is to create a feedback loop so that we aren’t learning in a vacuum.
Summary
If you find yourself in front of a large codebase and don’t know where to start, hopefully this is helpful to you.
Admittedly, experience can play a huge factor. It’s a lot easier to understand what’s there if you know what you’re looking for!
Absent experience though, I would suggest simply asking for help in getting started on your independent exploration.
Even as a junior engineer, it should be possible to glean what’s there, maybe with some initial assistance on getting a useful
frame capture, or some representative breakpoints to start with.
Paramount though, is simply having an inquisitive and active mindset. Reading and evaluating code isn’t quite the same thing as writing code, but it’s an unquestionably useful skill for all software engineers to have.