Debugging a Crash in OpenRCT2
2023-01-19
Last year, my interest in RollerCoaster Tycoon was renewed by Marcel Vos’s in-depth videos on
the game. Like most players today, I picked up OpenRCT2, which offers cross-platform support,
higher resolutions, and other improvements. This is the definitive way to
experience the game, certainly exceeding the disappointing official
ports.
While playing on Linux I experienced occasional crashes when placing coasters:
While the bug itself was trivial and easily fixed, reproducing and locating
it was harder. If you’re impatient you can skip to
the solution that finally found the problem, or read on for the full
investigation.
Preliminary Observations
Before debugging I collected some thoughts and researched whether this was a
known issue. I found a single
similar-sounding bug report, also from a Linux user, though only seen
in multiplayer.
The crash only affected specific combinations of saved coaster designs and
scenarios. Not every run of the game would crash, so one could save and
reload several times until getting lucky.
The bug’s inconsistent appearance, and possible platform dependence,
suggested an invalid access was always present, but whether it crashed depended
on the memory layout. Many build and runtime factors affecting layout would
make reproduction difficult.
I was primed to suspect a use-after-free, after having seen other glitches
in-game suggesting memory being incorrectly re-used during coaster
construction. That would turn out to be a different bug entirely, unrelated to
this article.
I had briefly looked at parts of the source before, to learn game rules and logic, but
was otherwise unfamiliar with the codebase and had no idea where to start. My
previous experience with 1990s game code primed me to expect undisciplined
global variable use, which any port would necessarily inherit. I had seen the
project contained a mix of coding styles, with some logic
directly adapted from the original game’s assembly.
Debugging the Release Build
The first step was to run the game under gdb. This proved
immediately informative, as the game did not crash. To my frustration, I
repeatedly placed a coaster which had previously been a consistent source of
faults.
I had limited previous experience with gdb, but knew programs could
behave subtly differently inside a debugger. If a program was loaded
differently, this could complicate reproduction of a layout-dependent crash. I
suspected gdb was affecting ASLR,
and a search confirmed gdb disables it by default:
With ASLR re-enabled (set disable-randomization off), the game crashed within the debugger as desired.
Since this is a release build, gdb can’t show function or variable names to tell
us where we are.
Some more exploration was possible from here. Repeating the crash several
times and collecting the faulting addresses (the location the program
incorrectly tried to access), we see many common low bits:
The target address having a common page offset suggests, but doesn’t prove,
the program was trying to read a global variable, rather than a stack or
heap location. It also implies that if the program was reading outside the
bounds of an array, it was doing so with the same (bad) index every time.
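The page-offset reasoning can be shown with a little arithmetic. This sketch assumes 4 KiB pages, as on x86-64 Linux. ASLR slides the program image by whole pages, so a global variable, or a fixed offset past one, keeps the same low 12 bits from run to run. The addresses in the usage below are made up for illustration. They are not the ones from my crashes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumes 4 KiB pages (x86-64 Linux): the low 12 bits of an address are its
// offset within the page, and ASLR only shifts mappings by whole pages.
constexpr std::uint64_t kPageSize = 4096;

constexpr std::uint64_t page_offset(std::uint64_t addr) {
    return addr & (kPageSize - 1);
}

// True if every address shares one page offset, as the faulting addresses
// from repeated crashes did. An empty list trivially qualifies.
bool same_page_offset(const std::vector<std::uint64_t>& addrs) {
    for (std::uint64_t a : addrs) {
        if (page_offset(a) != page_offset(addrs.front())) {
            return false;
        }
    }
    return true;
}
```

For example, the hypothetical addresses 0x55d0c1a3f9f4 and 0x5601ee2139f4 differ in their high bits but share the page offset 0x9f4.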
We can switch to the disassembly within gdb to see what the program was doing
at the time of the crash. With luck we might match this up to something in the
source code:
The faulting instruction is trying to load a value from an array. If,
like me, you don’t frequently read assembly in detail, here are the components
of an x86 memory reference (presented in AT&T syntax, the default of the tools
I was using):
A memory reference has up to four parts, and resembles using an array of
structs in C. There’s a base address, a scale for the size of the array
elements (1, 2, 4, or 8 bytes), an index to be multiplied by the scale, and a
displacement to add to the base, useful for example to select a member from
within a larger structure. The base and index may be specified by register. The
type of the value to be loaded is indicated by the suffix of the assembly
mnemonic.
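Here is a concrete case. This sketch computes an AT&T-style disp(base, index, scale) effective address by hand and checks it against the address C++ gives for a struct member inside an array. The Entry struct and the table are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Invented struct for the example: 8 bytes on typical x86-64 ABIs.
struct Entry {
    std::uint32_t id;    // offset 0
    std::uint8_t flags;  // offset 4, followed by padding
};

Entry table[4];

// AT&T syntax writes a memory operand as disp(base, index, scale), meaning
// base + index * scale + disp. Hardware only encodes scales of 1, 2, 4, or 8
// and a signed 32-bit displacement; this helper doesn't enforce either.
std::uintptr_t effective_address(std::uintptr_t base, std::uintptr_t index,
                                 std::uintptr_t scale, std::intptr_t disp) {
    // Converting a negative disp to unsigned wraps, so the sum behaves like
    // signed addition modulo 2^64.
    return base + index * scale + static_cast<std::uintptr_t>(disp);
}
```

Here, table[2].flags corresponds to base = the address of table, index = 2, scale = sizeof(Entry), and disp = offsetof(Entry, flags). In a compiled access, that displacement is the piece a compiler may fold into a constant base when both are known.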
Play around with this Compiler
Explorer example to see how the parts of a memory reference correspond to
source code. While the full syntax gives the assembly programmer a concise way
to express common operations, compilers seeking to minimize code size will use
available information to achieve a more compact representation, such as
by combining base and displacement values when both are known in advance.
The faulting instruction in the debugger is a simple indexed load:
The zbl suffix indicates the value being moved is one byte, and
it’s being placed into a 32 bit register, with the remaining bits filled with
zeros (“zero-extended”). The scale factor is 1, and there is no displacement,
so the target could be1 an array of
bytes.
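The kind of source that produces such a load is simple: a byte array is indexed and the value is returned in a wider integer. With optimization, GCC and Clang on x86-64 typically compile this to a single movzbl with a scale of 1, though register choice varies. The function and array names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical function: an indexed byte load widened to 32 bits. Optimizing
// x86-64 compilers typically emit one movzbl with scale 1 and no displacement,
// e.g. movzbl (%rdi,%rsi,1),%eax, because the byte must be zero-extended.
std::uint32_t load_byte(const std::uint8_t* base, std::size_t index) {
    return base[index];
}

const std::uint8_t kBytes[] = {0x01, 0xFF};
```

Zero extension means load_byte(kBytes, 1) yields 0x000000FF rather than 0xFFFFFFFF, which is what a sign-extending load of a signed char would produce.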
The base and index are supplied by register. What were the contents of
these registers at the time of the crash?
The maxed-out 16 bit index value is immediately attention-grabbing. We’re
indexing far off the end of something. Though the crash is inconsistent, that
bad index is probably always there, and such a loudly wrong value should be
easier to chase down than some fuzzy distribution of junk numbers. Seeing
0xFFFF suggests integer underflow, but this turned out not to be the case.
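Why is 0xFFFF suggestive but not conclusive? Unsigned 16-bit arithmetic wraps modulo 2^16, so decrementing zero produces it. But 0xFFFF is also simply the largest uint16_t, and code may use that value deliberately. This short sketch shows both routes to the same bit pattern.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Route 1: unsigned wraparound. v - 1 is computed as int (-1 when v == 0) and
// converting back to uint16_t takes it modulo 2^16, giving 0xFFFF.
constexpr std::uint16_t decrement(std::uint16_t v) {
    return static_cast<std::uint16_t>(v - 1);
}

// Route 2: the type's maximum, used on purpose (for example as a special value).
constexpr std::uint16_t kMaxIndex = std::numeric_limits<std::uint16_t>::max();

static_assert(decrement(0) == 0xFFFF, "0 - 1 wraps to 0xFFFF");
static_assert(kMaxIndex == 0xFFFF, "the maximum is the same bit pattern");
```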
At this point I could have also checked the base address to verify the target
was a global variable, but I was quickly distracted and didn’t get around to
confirming this until later.
The next instruction is a return, so we’re looking at the end of a function.
We need to look at the preceding context to see what’s going on and where that
index value is coming from. This brought me to the first of many tooling bugs
encountered during this exercise:
GDB had a bug where it couldn’t scroll up to show previous instructions. And
this bug only occurred when trying to debug a crash:
A debugger that can’t debug programs!
The bug had gone unfixed for nearly a year, and was preserved in the
LTS release of Ubuntu for me to experience another year later. Installing a different
version of the development tools on my main machine just for this was not something I wanted to
jump into, so I turned my attention away from the release binary.
Debug Build
While poking at the faulting release build on my main machine, I had been setting
up a clean VM to use for building the game. Reproducing the crash there would
locate the source immediately.
After the usual mess of installing dependencies, I had everything needed to build, but
encountered another problem:
GCC hit an internal compiler error trying to produce a debug build. I found
no help for this error and was not willing to start changing compiler versions.
Perhaps I was impatient after GitHub also went down at the same time I was
setting up the build:
I switched to Clang, also supported, which compiled without issue. I copied
over the game assets and finally had a running (but still crash-free) debug build:
This was less than ideal. My crashing release build had been made with
gcc.2 With a fault so sensitive
to layout, there was no guarantee different builds would exhibit the crash.
Changing to a different compiler could obscure the problem further. It would
also reduce potential similarity in the disassembly between the builds, making
it harder to match the faulting location in the release build to source code by
searching for corresponding instruction patterns in the debug build.
Source Exploration
Without reproducing the crash in debug, I don’t know what I expected to learn
stumbling around an unfamiliar codebase. An obvious bug would likely have been
fixed already. Getting familiar enough with the code to
narrow down a search for the culprit would take time and patience I didn’t
have.
The port overall is quality work, with familiar patterns and new features
implemented in a maintainable style. But much of the game’s core necessarily
conforms to the original RCT. The closer one gets to this old logic, the less
clean things are. As a result there were many unencapsulated global arrays and
frequent lack of bounds checking. With so many suspicious areas there was
no obvious place to start looking.
Failed Attempts
Array Instrumentation
Assuming the bad index was always present, I thought I might catch it by
instrumenting accesses to global arrays. Using ASan was a
possibility, but I thought enabling checking globally on a program full of
unrelated memory issues would be too noisy to be of use.
I made a simple wrapper class with a bounds-checked index operator, to be
used as a drop-in replacement without affecting point-of-use code. Starting
with arrays that sounded related to ride construction and drawing, I began
substituting the checked type and repeatedly re-playing the same scenario as
before.
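A wrapper along these lines might look like the sketch below. It is a hypothetical reconstruction, not the class I actually wrote. This version is backed by std::array and throws std::out_of_range naming the offending index, so the failure is easy to spot in a debugger or log.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical bounds-checked replacement for a fixed-size global array:
// `static int table[16];` becomes `static CheckedArray<int, 16> table;` and
// existing table[i] expressions keep compiling.
template <typename T, std::size_t N>
struct CheckedArray {
    std::array<T, N> elems{};

    T& operator[](std::size_t i) { return elems[check(i)]; }
    const T& operator[](std::size_t i) const { return elems[check(i)]; }
    static constexpr std::size_t size() { return N; }

    // Throws with the offending index, which is exactly what we want to catch.
    static std::size_t check(std::size_t i) {
        if (i >= N) {
            throw std::out_of_range("CheckedArray index " + std::to_string(i) +
                                    " out of range for size " + std::to_string(N));
        }
        return i;
    }
};
```

The substitution is only "drop-in" for subscripting. Code that relies on the array decaying to a pointer, or on sizeof tricks, would need small changes.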
I soon felt this wasn’t going anywhere. RCT has many static lookup tables
and none I touched were the problem. It turned out I was not even looking for
the right thing: the problematic variable, I would later discover, was an
array of bools, not a char or integer type. My naive find and replace wouldn’t
have found it.
Disassembly Comparison
GDB had been unable to show the preceding instructions, but after dumping the
disassembly using objdump and grep-ing around for the instructions
I had already seen, I found the full context within the release binary:
We’ll discuss what this code is doing in a moment, but first I wanted to see if
I could find something similar in the debug build. I soon realized this was not viable
either.
With only these few, generic instructions, there was almost no hope of finding
a corresponding location within the debug build. Checking occurrence
counts of various patterns, I found the two builds differed drastically in how
often these two instructions appeared together, as well as in the way memory
was referenced. This shouldn’t have been surprising given the different
compilers, and the many possible ways to generate logically equivalent code.
Even if I could get a gcc debug build to compare against, this approach was
probably not going to work without additional context to narrow the
search.
Misc. Failed Reproduction Attempts
Other ways I tried to provoke the crash under debug were:
- Loading a save file from a crashing scenario, from the failing machine, on the debug build.
- Copying a crashing custom coaster design from the failing environment, and loading it under debug.
- Running the debug build on the failing machine, and vice versa.
- Various combinations of build options.
- Enabling/disabling ASLR.
None of these worked. I ended up having to solve the problem without ever
reproducing the crash on a debug build.
An Unlucky Setback
While I was failing to reproduce the crash under debug, things got worse: the
crash stopped occurring on my release build too, no matter what I changed. What
had once been a semi-regular frustration had gone away just when I needed it. I
was now tragically stuck with a working game.
I never caught the crash in a debugger again. What follows had to be done
with only earlier observations. To find the location of the crash
within the release disassembly, I resorted to referring back to a screen
recording I had from my initial debugging session:
Luckily, at this point we have enough information to get where we
need to go without waiting for the fault to recur, and to continue the search
from there.
Dead Reckoning
From the disassembly, we see in the instruction preceding the faulting load
that the base address of the target array is somewhere in the bss
section, and is being accessed by relative addressing. Let’s review
what this means.
A program’s memory contains segments for code and data loaded from the
binary image, as well as mappings provided at runtime for purposes such as the
stack and heap.3 Their corresponding memory
pages are marked with permissions such as read, write, and execute. Many
of these sections have annoying historical names, so “text” means the program’s
executable code, and “bss” means zero-initialized static variables.
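On Linux you can see these mappings and their permissions directly in /proc/self/maps. The sketch below is Linux-specific. It assumes the usual layout, where a zero-initialized global lands in a read-write mapping and a function lands in a read-execute one. The variable and function names are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>

// Zero-initialized static storage: normally placed in .bss.
char bss_buffer[4096];

// An ordinary function: its machine code lives in .text.
int text_function(int x) { return x + 1; }

// Returns the permission string ("r-xp", "rw-p", ...) of the mapping in
// /proc/self/maps that contains addr, or "" if none does. Linux-only.
std::string mapping_permissions(std::uintptr_t addr) {
    std::ifstream maps("/proc/self/maps");
    std::string line;
    while (std::getline(maps, line)) {
        unsigned long lo = 0, hi = 0;
        char perms[5] = {};
        // Each line starts "start-end perms ...", with addresses in hex.
        if (std::sscanf(line.c_str(), "%lx-%lx %4s", &lo, &hi, perms) == 3 &&
            addr >= lo && addr < hi) {
            return perms;
        }
    }
    return "";
}
```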
The program’s code and data from the binary are loaded together. Their
size is fixed, as is their position relative to one another.4 This means we always know where static variables
can be found relative to instructions, unlike dynamic or stack allocations
whose locations are only known at runtime. For security, the program’s
base address, and its heap and stack locations, are randomized by the OS
(ASLR), which is possible because programs are not permitted to assume anything
about their location.
Programs once hard-coded most data addresses, but the preferred modern approach is
to reference memory relatively, as offsets from the current instruction. Such
addressing is called “rip-relative” after the name of the x86-64
instruction pointer register. This approach facilitates dynamic loading and
security by allowing the program to be loaded anywhere in virtual memory
without having to fix up its memory references, a slow process.
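Here is a worked example of rip-relative addressing. The sketch decodes the common 7-byte form of lea disp32(%rip), %reg: a REX.W prefix, opcode 0x8D, a ModRM byte selecting RIP-relative mode, then a 32-bit little-endian displacement. It then computes the target. The displacement is relative to the address of the next instruction, so the target is the instruction's address plus 7 plus the signed displacement. The sketch deliberately handles only this one encoding; a real decoder covers far more.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Decodes only REX.W 8D ModRM(mod=00, rm=101) disp32, i.e. a 64-bit
// lea disp32(%rip), %reg. Returns false for anything else.
bool lea_rip_target(const std::uint8_t insn[7], std::uint64_t insn_addr,
                    std::uint64_t& target) {
    bool is_rex_w = (insn[0] & 0xF8) == 0x48;         // REX prefix with W set
    bool is_rip_relative = (insn[2] & 0xC7) == 0x05;  // mod=00, rm=101
    if (!is_rex_w || insn[1] != 0x8D || !is_rip_relative) {
        return false;
    }
    std::int32_t disp;
    std::memcpy(&disp, insn + 3, sizeof disp);  // little-endian, as on x86
    // RIP points past this 7-byte instruction when the displacement applies.
    target = insn_addr + 7 +
             static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
    return true;
}
```

For example, the bytes 48 8D 05 10 00 00 00 at address 0x1000 encode lea 0x10(%rip),%rax and resolve to 0x1017. This is the same computation a disassembler does when it annotates such an instruction with an absolute address.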
Now look at the lea instruction in our disassembly that computes
the base address for the following load. The presence of a register name
in the syntax suggests the target location is variable, but that register is the
instruction pointer, whose value at a given point in the program is always the same
relative to where the program was loaded. So we’re
actually looking at a fixed memory reference to a static variable. An
annotation provided by the disassembler gives the relative location of that
variable within the bss section:
If we read out of bounds of the array, we would read unrelated data from
elsewhere in the bss section, and if we go far enough, we might end up in an
unmapped page and crash. But we’re not guaranteed a crash. Because the index is
only 16 bits, depending on the size and layout of the bss section, we might not
actually go outside of it. Even if we do, there might happen to be another
valid mapping nearby, so we still might not fault. This is why the crash
depended on a combination of platform, build, and runtime conditions, making it
rare and difficult to reproduce.
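The "might not go outside of it" point is simple arithmetic. The section sizes and offsets below are made up, not values from the real binary. A one-byte bool array read at index 0xFFFF touches the byte 65,535 bytes past the array's start. That read stays inside the section only if that much room is left. Even a read past the section's end only faults if the address falls in an unmapped page.

```cpp
#include <cassert>
#include <cstdint>

// True if arr[index] (elements of elem_size bytes) still lies entirely within
// a section, given the array's offset from the section start and the
// section's size. Leaving the section is necessary for a fault, not sufficient.
constexpr bool read_stays_in_section(std::uint64_t array_offset,
                                     std::uint64_t section_size,
                                     std::uint16_t index,
                                     std::uint64_t elem_size) {
    return array_offset + (static_cast<std::uint64_t>(index) + 1) * elem_size <=
           section_size;
}

// Illustrative numbers: a bool array 4 KiB into the section, indexed with 0xFFFF.
static_assert(read_stays_in_section(0x1000, 0x20000, 0xFFFF, 1),
              "large section: the bad read stays inside");
static_assert(!read_stays_in_section(0x1000, 0x8000, 0xFFFF, 1),
              "small section: the bad read lands past its end");
```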
Fortunately, because the array being referenced is static, rather
than one provided by the caller as a pointer, we don’t need to wait for
the conditions of the crash to recur to find it.
What Actually Worked
With no reproduction, I thought I could try to establish a correspondence
between the two builds by searching their runtime data, rather than
looking for code patterns. There was no guarantee this would work: different
versions of the game might load assets in a different order, or assign
different IDs, or do something else that made their array contents incomparable.
But I was lucky, and a naive search for an exact pattern proved
completely successful.
Here is an overview of the search steps that found the bug, which are
described below:
I first got back to the crash location within the release game running under
gdb. An easy way to do this was to directly search the memory of the process
for the instructions I had seen in my earlier screen capture.5
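Conceptually, gdb's find command is a byte-pattern scan over an address range. The sketch below does the same over an in-memory buffer with std::search. The example bytes encode movzbl (%rax,%rdi,1),%eax and were chosen for illustration, not taken from the game. The usage shows why adding a little more context, here the SIB byte, narrows the matches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the offset of every occurrence of `pattern` inside `region`,
// including overlapping ones, roughly what a memory search reports.
std::vector<std::size_t> find_all(const std::vector<std::uint8_t>& region,
                                  const std::vector<std::uint8_t>& pattern) {
    std::vector<std::size_t> hits;
    if (pattern.empty()) {
        return hits;
    }
    auto it = region.begin();
    while ((it = std::search(it, region.end(), pattern.begin(), pattern.end())) !=
           region.end()) {
        hits.push_back(static_cast<std::size_t>(it - region.begin()));
        ++it;  // resume one byte later so overlapping matches are found too
    }
    return hits;
}
```

In a region holding 0f b6 04 | 0f b6 04 38 | c3, the three-byte prefix 0f b6 04 matches twice, while the full four-byte instruction matches once.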
To do this, with the game running, we can ask gdb where the executable’s sections
are located in memory (info files):
Then we use the memory search command (find), with the target address range
being the text (code) section, and the search pattern being the bytes of the
instruction we saw in the disassembly:
Adding a bit more context narrows it down to two results. I chose the first
address since it had the same low bits as the offset from the binary disassembly:
We can ask gdb to print out the contents of the memory at that address, interpreting
it as an instruction, and see that it matches the disassembly:
At this point we’re back at the crash site. If we wanted to, we could
set a breakpoint conditional on the bad index register value, and it would be
as if we had just reproduced the crash. But knowing I was unlikely
to be able to make sense of the backtrace within the release build, I proceeded
to the memory examination.
Subtracting seven bytes from this address gets us to the preceding
lea instruction that computes the base of the array. GDB
has helpfully annotated the instruction with the resulting virtual address:
Printing out a chunk of memory starting from that address, we see … nothing.
But this is a mutable data section. I had the game stopped at the main
menu, so maybe this array is only used within a scenario. I resumed, loaded the
same scenario as before, then checked again. The target memory was now populated with
values:
Every byte is either zero or one. I hadn’t even considered the possibility
the array was of bool type. I briefly searched the source to see if this
pattern occurred anywhere, such as in an initializer list, but found nothing. Hopefully the
same pattern could be found in the debug build at runtime.
I loaded the same scenario in the debug build in its VM, and halted it at the same point, having
just placed the test coaster. As before, we ask gdb where the debug
build’s sections are located in memory:
We can then perform another byte search, this time with the target range being the
bss section, and the search pattern being the contents of the array seen in
the release build. There’s one match.
GDB has already used its debug information to annotate the result with the
name of the variable corresponding to this address, which is exactly what we
are looking for. The hard part was over, and I could finally start looking at
relevant source code.
Here is the declaration of that array:
Great, so all I need to do is drop in the checked array wrapper and wait for
it to tell me:
The array type is bool. My hastily written test class wrapped a
std::vector. Many of you already know what happens when you try
to return a reference to a bool inside a vector:
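For anyone who hasn't run into this: std::vector<bool> is a space-optimized specialization that packs bits. Its operator[] therefore returns a proxy object (std::vector<bool>::reference) rather than bool&. A wrapper whose operator[] is declared to return T& fails to compile for T = bool. One workaround is to return the container's own reference type instead. The sketch below shows this with a hypothetical CheckedVector.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// For most T, vector<T>::reference is T&; for bool it is a proxy class.
static_assert(std::is_same<std::vector<int>::reference, int&>::value,
              "ordinary vectors hand out real references");
static_assert(!std::is_same<std::vector<bool>::reference, bool&>::value,
              "vector<bool> hands out a proxy, not bool&");

// Hypothetical vector-backed checked wrapper that works for bool as well,
// because it returns whatever reference type the container provides.
template <typename T>
struct CheckedVector {
    std::vector<T> data;

    typename std::vector<T>::reference operator[](std::size_t i) {
        assert(i < data.size() && "CheckedVector index out of range");
        return data[i];
    }
};
```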
Right, vector<bool> is completely
broken. Fortunately, there was no need to try to make this work, as I was
done with the broad searching, and this array was only used in a couple of relevant
places:
We know the program is crashing on a read, so I went to the one indexed read location,
which was immediately suspicious:
The function ride_entry_is_invented does what the name
suggests, checking if a particular ride asset is unlocked in the current
scenario. Neither this helper function nor the array type used
has any bounds checking, so anyone can pass a crazy number and crash the
game. This function was called from many places.
An assertion placed here fired under the same conditions in which the release build had
crashed:
Printing out the index, we see the invalid value repeatedly as the ride construction
preview is being drawn:
Adding a breakpoint finally yields a debug backtrace:
The Bug
The bad index value was coming from a method of a class that’s part of the
command pattern. An
object representing a game action has member functions to query whether a potential
change is valid (“you can’t build that here”) or to execute it. TrackDesignAction
had the same bug in both functions, which is why the crash happens when
hovering the mouse over the map, which invokes the Query. The bug is shown
below:
The code uses a global object manager to look up an ObjectEntryIndex for the
track design’s vehicle type. It checks if the index is equal to a magic null
value, and if so, calls the “is invented” function with the index. If the entry
is not invented, it sets the index to null again. Clearly, the intent had been
to only do the invention check if the index was not null, but the
test was inverted, ensuring the function was only called with null values.
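The shape of the bug is easiest to see in a stripped-down sketch. This is a hypothetical reconstruction, not OpenRCT2's code. Only two details come from the real code: the uint16_t index type, and the use of its maximum as the null value. The names, the table size, and the call counter are invented. The sketch contrasts the inverted test with the corrected one, and its lookup includes a bounds check of the kind the eventual fix added.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

using ObjectEntryIndex = std::uint16_t;
constexpr ObjectEntryIndex kIndexNull = std::numeric_limits<ObjectEntryIndex>::max();

constexpr std::size_t kTableSize = 128;  // assumed size, for illustration
bool g_invented[kTableSize] = {};
int g_lookups = 0;  // counts lookups, to show when the check actually runs

// Bounds-checked lookup: an out-of-range index counts as "not invented"
// instead of being read past the end of g_invented.
bool IsInvented(ObjectEntryIndex index) {
    ++g_lookups;
    return index < kTableSize && g_invented[index];
}

// Buggy: the inverted test means the lookup only ever runs for the null
// index, and a valid but uninvented index passes through unchecked.
ObjectEntryIndex ResolveBuggy(ObjectEntryIndex index) {
    if (index == kIndexNull && !IsInvented(index)) {
        index = kIndexNull;
    }
    return index;
}

// Fixed: only real indices are looked up; uninvented ones become null.
ObjectEntryIndex ResolveFixed(ObjectEntryIndex index) {
    if (index != kIndexNull && !IsInvented(index)) {
        index = kIndexNull;
    }
    return index;
}
```

Without the bounds check in IsInvented, the buggy path's only lookup would read g_invented[0xFFFF].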
What’s the null value? It’s defined as the maximum value of the index
type, which is uint16_t:
For any vehicle type not loaded in the current scenario, the OOB read
would occur. The bug had been present for several years, but may not have crashed
until later changes disturbed the memory layout. As for in-game signs of
the bug, probably nothing visibly wrong resulted from an action trying to build a
ride with an uninvented vehicle type, perhaps due to another fallback check
elsewhere. The game does warn about the vehicle type when placing the design, a clue
I might have noticed earlier.
If the bug sounds unexciting and basic, it is. I opened a GitHub issue
with my findings, and a fix was committed ten days later. The inverted test was
fixed, and a bounds check was added to the array access. In the meantime I commented out
the problem code in my own build.
Conclusions
This article ended up being longer than I expected, but I believe the detail
is important. I was using several debugging features for the first time,
looking up how to do almost everything as I went, so I wanted to cover every
step to make it easier for the next person. GDB’s ability to arbitrarily search
and reinterpret memory has broad utility, but I hadn’t seen these commands
demonstrated before, either in tutorials or while using it for more basic troubleshooting.
Debugging
I spent about 8-10 hours on this bug. Some of this was unavoidable, like setting up
the build environment and dealing with tooling bugs.
I wasted time trying indirect, broad searches for the bug. Searching around
the source code without knowing how it was organized was not going to guide me
to the right location. Making small changes, adding assertions everywhere,
and repeatedly trying to provoke the debug build to crash was unlikely to help,
since, after all, it didn’t crash for most people most of the time anyway.
What worked was taking small but direct steps with the pieces of information
I had before moving on, even when it didn’t seem like much to go on. I should
have tried to get the full disassembly of the release build sooner, rather than
getting distracted by the debug build and source exploration. As soon as I saw
the previously obscured lea instruction, I had more context and a literal pointer
telling me where to look next. The source code couldn’t tell me anything
until I’d already isolated the problem using other tools.
Bug Prevention
Opinionated ways to discourage bugs like this:
Avoid C Arrays
The existing code had inconsistent bounds checks on global array accesses, even
in similar, adjacent functions. Checks were probably added where crashes had occurred before.
We know manual checks don’t scale, and they add visual noise for code readers.
Container types can build in checks for range and
value, log their usage, use custom allocation, etc. These features can be
conditionally compiled out as needed.
One potential advantage of native arrays is that they’re known to the compiler
and other tools, which may have their own memory checking instrumentation
options. But the fact that bugs like this persist for many years
is evidence these checks are not widely used, probably because they
are slow and noisy. Games often need targeted debug features for specific
systems or types that can be toggled by their own compile-time switches and don’t
depend on individual compiler extensions.
Container classes have their own annoyances. In particular, writing a
drop-in replacement static array class template that deduces its size from an
initializer, doesn’t require helper functions, and doesn’t use heap allocation
requires advanced template knowledge. The fact that CTAD
is all-or-nothing remains a significant obstacle to common use cases of
std::array and similar fixed-size container types.
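Here is a rough illustration of the deduction problem. It uses a hypothetical StaticArray aggregate with a user-written deduction guide, plus a bounds check that a project-wide macro can compile out. The macro name, GAME_CHECKED_ARRAYS, is invented. With the guide, StaticArray table{1, 2, 3} deduces both the element type and the size. What CTAD still cannot do is let you fix one and deduce the other: StaticArray<std::uint8_t> table{1, 2, 3} is ill-formed, just as it is for std::array. C++20's std::to_array covers that case for std::array, but only through a helper function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical project-wide switch: define it to 0 to compile the check out.
#ifndef GAME_CHECKED_ARRAYS
#define GAME_CHECKED_ARRAYS 1
#endif

// An aggregate, like std::array: no constructors, one public array member.
template <typename T, std::size_t N>
struct StaticArray {
    T elems[N];

    T& operator[](std::size_t i) { return elems[check(i)]; }
    const T& operator[](std::size_t i) const { return elems[check(i)]; }
    static constexpr std::size_t size() { return N; }

    static std::size_t check(std::size_t i) {
#if GAME_CHECKED_ARRAYS
        if (i >= N) {
            std::fprintf(stderr, "StaticArray: index %zu out of range (size %zu)\n", i, N);
            std::abort();
        }
#endif
        return i;
    }
};

// Deduction guide: the element type comes from the first initializer and the
// size from the number of initializers, so no helper function is needed.
template <typename T, typename... U>
StaticArray(T, U...) -> StaticArray<T, 1 + sizeof...(U)>;
```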
Avoid In-Band Signal Values
Modern standards discourage using magic values to indicate null, failure, or
other special conditions. Relying on programmer discipline to add manual checks
doesn’t scale, but was perhaps sufficient for the original game as developed by
one person.
While it wasn’t the cause of this bug, the faulting function also had an
incorrect parameter type, which was corrected in the same commit. Integer type
mix-ups are a similar source of subtle bugs in C. Typedefs improve readability
but are transparent to the language, permitting unwanted conversions.
Type systems can be used to enforce correct unit usage, and to separate checked and
unchecked domains. A type returned by an object manager system, which could be
null, can be prevented from being passed to a query function that only makes
sense for valid entity identifiers.
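Here is a minimal sketch of that idea, with invented names. The only way to obtain an EntryIndex is through Lookup, which returns std::optional and never wraps the 0xFFFF sentinel or any other out-of-range value. IsInvented takes EntryIndex rather than a raw uint16_t or an optional. So passing an unchecked lookup result to it becomes a compile error rather than an out-of-bounds read.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

constexpr std::uint16_t kEntryCount = 128;  // assumed number of loaded entries
bool g_invented[kEntryCount] = {};

class EntryIndex;
std::optional<EntryIndex> Lookup(std::uint16_t raw);

// A validated index: the constructor is private, so only Lookup() can make one.
class EntryIndex {
public:
    std::uint16_t value() const { return value_; }

private:
    explicit EntryIndex(std::uint16_t v) : value_(v) {}
    friend std::optional<EntryIndex> Lookup(std::uint16_t raw);
    std::uint16_t value_;
};

// Raw values outside the table, including the 0xFFFF sentinel, come back as
// std::nullopt instead of being passed along.
std::optional<EntryIndex> Lookup(std::uint16_t raw) {
    if (raw >= kEntryCount) {
        return std::nullopt;
    }
    return EntryIndex(raw);
}

// Takes EntryIndex, not std::optional<EntryIndex> or a raw integer, so an
// unchecked lookup result can't reach this array read.
bool IsInvented(EntryIndex index) { return g_invented[index.value()]; }
```

A call like IsInvented(Lookup(raw)) does not compile. The caller has to test the optional first, for example with if (auto idx = Lookup(raw)) { IsInvented(*idx); }.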
The argument against this practice is, again, that C++ can make such library
types cumbersome to use compared to languages like Rust, which give option
and enum types first-class support. Within the standard library,
std::optional is decent; std::variant is pretty ugly.
Enforcement of checking usually relies on exceptions, or on non-obvious control flow
where all possible cases are handled by, e.g., overloaded function call operators on
a visitor object, or callables passed together as arguments to a selector function.
auto Result = ObjectManager::GetVehicleType(TrackDesign.VehicleType);
// this kinda sucks
Result.Match(
    [](ObjectEntryIndex Index) {
        // do something with the index
    },
    []() {
        // null case
        // Because this is a lambda, we can't affect enclosing
        // control flow with break, continue, or return.
        // Functional style alternatives are possible but not common in C++.
    }
);
An intermediate solution that preserves normal control flow is debug-only
enforcement that asserts or throws, conditional on a flag indicating whether the
validity of the contained value was checked before a get function was
called.
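Here is a sketch of that intermediate approach, using a hypothetical Checked<T> wrapper with NDEBUG as the debug switch. Calling has_value() records that the caller looked. In debug builds, get() throws if the caller didn't check, or if the value is absent. Release builds compile the bookkeeping away, and get() becomes a plain member access that returns a default-constructed T when empty, so T must be default-constructible. The caller keeps ordinary if/else control flow, so break, continue, and return work as usual.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

template <typename T>
class Checked {
public:
    Checked() = default;
    explicit Checked(T v) : value_(std::move(v)), present_(true) {}

    // Querying presence marks the value as checked (debug builds only).
    bool has_value() const {
#ifndef NDEBUG
        checked_ = true;
#endif
        return present_;
    }

    const T& get() const {
#ifndef NDEBUG
        if (!checked_) throw std::logic_error("Checked::get() without has_value()");
        if (!present_) throw std::logic_error("Checked::get() on empty value");
#endif
        return value_;
    }

private:
    T value_{};
    bool present_ = false;
#ifndef NDEBUG
    mutable bool checked_ = false;
#endif
};
```

Typical use reads like ordinary code: if (!index.has_value()) return; followed by index.get().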
Aside: Assembly Programming
RollerCoaster Tycoon’s development has been the subject of much programming
lore. Social media posts express amazement at the superhuman mind required to
create such an intricate game using primitive tools.
This notion is detrimental to the learning of newcomers, and to the spirit
of programming. Generations of programmers used assembly, and they weren’t
superhuman. They created games, operating systems, and other complex software,
using organizational principles similar to those programmers use today. Even you, when
learning a modern language, probably once felt lost in its details until you
built up a mental repertoire of patterns, idioms, and organizing strategies
that imbued that text with meaning beyond its keywords and grammar. The
experienced assembly coder contemplates that language with similar
richness.
So it is with domain knowledge as well. Even a skilled programmer, if
unfamiliar with graphics and simulation, might regard games as magic. But only
a few books will break this illusion. With moderate experience, you can look at
a game and imagine its plausible data structures, processing steps, and
organization. The experienced coder in any domain knows this is the
essence of programming, and that the realization of these relationships as code,
while laborious, is not magic. You could simulate a theme park game in
FORTRAN, or write it on punch cards; it’s only a matter of familiarity,
tooling, and time.
Besides, the creation of RCT didn’t really consist of writing a great
edifice of raw assembly instructions. It was written in a
quite capable
macro assembler, which enabled familiar program organization. This
assembler supported structs, macros, variable names, function calls, math
expressions, string manipulation, and more. I wouldn’t choose to write software
in it today, but it was a reasonable tool for the job, one the author was
already familiar with. In practice Chris Sawyer was likely working at a level of
abstraction comparable to C. If you know C today, or know people who know C,
you know creating impressive software this way isn’t magic. It’s just
work, and you can do it.