Pictures of a Working Garbage Collector
blog | oilshell.org
2023-01-11
The last post, Garbage Collector Problems, was a status update on Oil's C++ runtime, about 4 months in.
Since then, we've solved both the "rooting" and fork()-friendliness problems. So this post shows a demo of a fast shell in C++, with a working garbage collector! Remember, the oil-native tarball contains no Python code at all, and now it uses less memory.
More in this post:
- I describe how we arrived at a simpler design
- I give a flavor of the bugs we had to fix, along with some engineering tips
- I talk about performance
This topic can be dense and wordy, so I break it up with screenshots of the many tools we used: ASAN, GDB, perf and flame graphs, R for benchmarking, and more. I want to give contributors a sense of what working on Oil is like.
We can still use more help! Good news: we now have a second grant with NLnet. Let me know if working on this sounds fun. Follow our new account @oilsforunix on Mastodon.
Big thanks to Jesse Hughes, Melvin Walls, and CoffeeTableEspresso for essential help on the C++ runtime. Thanks to NLnet for funding us.
Screencast
If you click on this screenshot, you'll see OSH running ./configure from CPython's tarball, with GC debug output.
That's:
- 16K lines of gnarly shell generated by GNU autoconf
- Running in our shell interpreter, written in ~40K lines of typed Python.
- But it's translated to ~80K lines of pure C++!
- That generated C++ runs on top of a ~4K line runtime of garbage-collected data structures, and ~3K lines of OS bindings (May 2022 blog post with background).
- The debug output is from our OIL_GC_STATS=1 instrumentation.
It was a thrill to get this working! Remember that the interpreter took years to write.
Process Metrics
Here's another way to look at it. We now run benchmarks on every commit, including memory measurements with getrusage().
In November, our "leaky" prototype used ~550 MB of memory running Python's ./configure, which is 20x more than bash!
When we turned on the garbage collector, memory usage was reduced by 10x.
Note that this is one of our hardest workloads. On 3 other hand-written ./configure scripts, OSH already matches the speed and memory usage of bash. Most shell scripts are I/O bound.
We've also made progress since then: as of the latest CI run, the autoconf-generated script uses 1.6x more memory under OSH, not 2.0x. It's now 1.4x slower, not 1.9x slower. After we fix a few "dumb things", it should be just as fast.
Where vs. When: Rooting Policy vs. "Safe Points"
How did we get it working? What were the wrong turns? This section summarizes part of this long journey.
In the last post, I mentioned that we would change our rooting policy. The rooting problem is: Where does the collector start tracing?
A major motivation was "the f(g(), h()) problem", which we found with ASAN. (I just noticed that LLVM's docs call out exactly this problem, naming it h(f(), g())!)
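To see why this pattern is dangerous, here's a toy mark-sweep simulation in Python (all names invented for illustration; Oil's real collector is the C++ runtime described in this post). The temporary returned by g() is rooted nowhere, so a collection triggered inside h() sweeps it before f() ever runs:

```python
class ToyHeap:
    """A tiny mark-sweep heap: objects survive only if reachable from roots."""

    def __init__(self):
        self.objects = []   # every allocated object
        self.roots = []     # explicitly rooted objects

    def alloc(self, value):
        obj = {'value': value, 'freed': False}
        self.objects.append(obj)
        return obj

    def collect(self):
        # Mark everything reachable from roots; sweep (free) the rest.
        live = set(id(o) for o in self.roots)
        for obj in self.objects:
            if id(obj) not in live:
                obj['freed'] = True
        self.objects = [o for o in self.objects if not o['freed']]


heap = ToyHeap()

def g():
    return heap.alloc('temporary')   # not rooted anywhere!

def h():
    heap.collect()                   # a collection happens mid-expression
    return heap.alloc('other')

def f(a, b):
    return a, b

a, b = f(g(), h())    # Python evaluates g() first, then h()
assert a['freed']     # g()'s result was swept: a use-after-free
assert not b['freed']
```

The same sequence is harmless if the collector can only run at manual collection points, since no such point occurs in the middle of an expression.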
I proposed a policy called "Return Value Rooting", ran it by a few people, and mentioned it on the blog.
Some contributors were convinced by the logic, so I went ahead and implemented it, testing it on the same mycpp examples. It fixed 5 failures, but our recursive descent parsing test case parse.py failed in a new way.
Parsing 1+2 =>
=================================================================
==15704==ERROR: AddressSanitizer: heap-use-after-free on address 0x6020000023f0 at pc 0x55b1eeba4d01 bp 0x7ffde9fe6e90 sp 0x7ffde9fe6e80
READ of size 1 at 0x6020000023f0 thread T0
#0 0x55b1eeba4d00 in ObjHeader(Obj*) /home/andy/git/oilshell/oil/mycpp/gc_obj.h:114
#1 0x55b1eeba5f31 in MarkSweepHeap::MarkObjects(Obj*) mycpp/mark_sweep_heap.cc:125
#2 0x55b1eeba6099 in MarkSweepHeap::MarkObjects(Obj*) mycpp/mark_sweep_heap.cc:136
#3 0x55b1eeba54f3 in RootSet::MarkRoots(MarkSweepHeap*) mycpp/mark_sweep_heap.cc:11
#4 0x55b1eeba687a in MarkSweepHeap::Collect() mycpp/mark_sweep_heap.cc:202
I realized that the problem is fundamental: it doesn't handle mutating member variables correctly!
I was discouraged. This garbage collector seemed to have more failed attempts than any program I've ever written! I went back to the drawing board, reviewing my large collection of links, notes, and books.
My first thought was to pursue the simplest possible conservative/imprecise stack scanner, just to get something working (on one architecture). It would temporarily "unblock" the project.
I had conjectured with contributors that Boehm GC was basically a way of saying "screw it" to the rooting problem in C++, and thus to precise garbage collection. That is, it's a hack that language implementers came up with ~30 years ago to work around exactly the problems we were having.
Here's one thread of notes: #oil-dev > Conservative GC Design Links.
It includes a long 2015 conversation on an option for conservative GC in Julia, which basically scared me off again. There's a fair amount of controversy about conservative collection. (It now appears to be a workaround for integrating with unusual libraries, not Julia's main approach.)
To make a long story short, I went back to an even simpler solution, which I call:
Manual Collection Points
We simply insert mylib.MaybeCollect() manually in the interpreter's Python source. It's preserved as mylib::MaybeCollect() in C++.
You could call these ad hoc GC safe points. A safe point answers the question: When does the GC have enough information to collect?
I noted that the term "safe point" doesn't appear in the index of the Garbage Collection Handbook. It's not an algorithm issue, but an implementation issue, which is nonetheless difficult.
What's surprising is that we only need a handful of manual collection points in our 40K-line interpreter: mainly the Python loop that interprets shell for and while loops! If you think about it, loops are the primary way for programs to create unbounded garbage.
(It's possible for a single MaybeCollect() call to run after every shell statement, not just loops. Max Bernstein pointed out that recursion in shell is possible, so we'll probably do that.)
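The idea can be sketched in a few lines of Python (the class and method names here are invented; the real interpreter calls mylib.MaybeCollect()). Because the interpreter owns the loops that execute unbounded shell code, one collection point inside them keeps garbage bounded:

```python
class ToyGC:
    """A toy collector that only runs at explicit safe points."""

    def __init__(self, threshold=8):
        self.live = 0
        self.threshold = threshold
        self.collections = 0

    def alloc(self):
        self.live += 1

    def maybe_collect(self):
        # A safe point: called only when interpreter state is consistent.
        if self.live > self.threshold:
            self.collections += 1
            self.live = 0   # pretend everything was garbage


gc = ToyGC()

def execute_statement():
    gc.alloc()              # each statement allocates a bit of garbage

def execute_while_loop(n):
    for _ in range(n):      # the interpreter's loop for a shell 'while'
        execute_statement()
        gc.maybe_collect()  # the manual collection point

execute_while_loop(100)
assert gc.collections > 0        # collection happened inside the loop
assert gc.live <= gc.threshold   # garbage stays bounded
```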
Three Benefits of Our Simple Solution
Either way, this solution has a number of nice properties:
(1) It drastically reduces the state space explosion that collecting at every allocation creates.
The old policy turned every allocation into a branch (collect, or don't collect), which needs test coverage.
(2) It avoids the f(g(), h()) problem, and more problems like #oil-dev > Allocating in a constructor, which CoffeeTableEspresso ran into.
// Bug: if Foo::Foo() allocates a member variable, the instance
// may be swept away before it's bound to myobj and thus rooted
myobj = Alloc<Foo>();
More problems: #oil-dev > Interactions Between C++ and GC
(3) It let us delete all rooting annotations from hand-written code!
To tell the GC where to start tracing, mycpp generates a line like this for every function:
StackRoots _r({ &local1, &local2, ... });
But we also have ~7,000 lines of hand-written code, which means that hundreds of functions had such manual annotations.
With the new manual collection points, we can throw them all out!
That's because hand-written functions like Str::strip() and posix::fork() can never be lower on the stack than a manual collection point. In the old scheme, Str::strip() could call a function that might collect.
So they're now a bit like "leaf nodes" of the call tree.
This slightly unexpected property makes it easier to work on Oil. We try to write the simplest code possible.
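As a toy model of what we deleted (a sketch of the concept only; the real mechanism is mycpp's generated C++ shown above): StackRoots registers a function's locals as roots for the lifetime of its frame.

```python
# A toy root stack.  Each function registers its locals as roots while
# its frame is live.  Manual collection points made this unnecessary
# for hand-written leaf functions.

roots = []

class StackRoots:
    def __init__(self, locals_):
        self.locals_ = locals_

    def __enter__(self):
        roots.extend(self.locals_)

    def __exit__(self, *exc):
        del roots[-len(self.locals_):]   # pop this frame's roots


def some_function():
    local1, local2 = object(), object()
    with StackRoots([local1, local2]):
        assert local1 in roots       # rooted while the frame is live
    assert local1 not in roots       # popped on return

some_function()
assert roots == []
```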
Drawbacks?
Our runtime is instrumented with OIL_GC_STATS=1, and we collect data on real shell programs in our CI. If anything, we're still checking for collection too often, not too rarely.
So I don't see any real downsides!
While in theory the shell may use more memory than necessary in some cases, typical workloads consist of many small objects. And there are many other ways we should optimize memory usage, mentioned below.
So I think I misconceptualized our bugs as a rooting issue (being careful about where to start tracing), but it's more useful to think about when to collect.
Related Papers
In the Reddit comments on my last post, two readers pointed out closely related research. (Thanks! This is why I write these posts.)
I wrote Zulip notes on these papers, as well as on Boehm GC, but here I'll just say that reading them made me feel better.
- They're from the early 2000's, not the 1970's! Now I don't feel as bad about stumbling around this problem for several months.
- When talking with contributors, I said that GC rooting for C++ was an "I didn't know what I didn't know" problem. (Skimming this 2016 blog post, the analogous issues in Rust seem even harder!)
- One paper is by my former teammate at Google, Fergus Henderson! I later noticed that it's cited in LLVM's documentation on its GC support.
- We don't use these techniques directly, since one relates to generating C from a Prolog-like language, and the other isn't quite portable. But they were useful reference points.
Debugging the Collector
So after switching to manual GC points, all of our tests passed under ASAN and #ifdef GC_ALWAYS. That feels good after months of latent segfaults, but it isn't too surprising, since there are many fewer code paths.
Does the shell run? No, there were still quite a few crashes.
I'll recap that thread after giving some color on GC bugs in general, and our problem in particular.
GC Bugs Are Hard to Diagnose, and They Lurk
From a former V8 lead, and WebAssembly co-author:
I absolutely hate debugging garbage collection bugs. Other than stress mode, fuzzing, and tracing output, there isn't really much to help you. A good debugger is a must. But even then, I feel like it's a special kind of hell.
— Ben L. Titzer (@TitzerBL) August 30, 2022
I definitely had this experience with the moving Cheney collector. It had the additional complication of being at the metalanguage level, i.e. where the stack is literally the C stack, laid out by a compiler we don't control.
The C++ translation basically stalled because it was too difficult to debug!
Hans Boehm motivates conservative garbage collection:
It is difficult to design a compiler such that the generated code is guaranteed to preserve garbage collection invariants. [Errors] in the object code can potentially go unnoticed for long periods of time. Even when they do appear, the problem is difficult to isolate, since symptoms usually appear a long time after the erroneous code was executed.
Garbage collection bugs tend to be among the last to be removed from a new programming language implementation.
— Garbage Collection in an Uncooperative Environment (Boehm, 1988)
So how did we find and fix our GC bugs? Are there more lurking?
Leaning Hard on ASAN
Again, the most useful technique is to use ASAN as much as possible. It's built into both Clang and GCC, and trivial to use (pass -fsanitize=address). It compiles your program with runtime instrumentation of allocations.
It's mainly used to find classic C and C++ errors with manual memory management, but it works just as well for a garbage collector that uses malloc() and free(). (We'll have a custom allocator for performance, but will always retain this option!)
There were two common failure modes:
(A) Heap Use After Free. A typical sequence of events:
- You forget a GC root in hand-written code. There's an object referenced only by a C++ global.
- The mark phase can't find this object.
- So the sweep phase calls free() on it.
- It's later used by the program, and ASAN flags the use-after-free.
(B) Seg Faults, e.g. due to missing object headers. ASAN basically gives you a stack trace, so in many cases you don't have to bother starting the debugger.
A "Reality Sandwich" of Statically Typed Bread
Although ASAN reliably turns use-after-free into crashes, and these bugs ultimately had simple causes, they were not trivial to diagnose. The core problem is that by the time the crash happens, the debugger lacks type information.
I want to re-emphasize this "reality sandwich" view of garbage collection, described in May:
- The top layer is your application types: user-defined classes, Dict, List, Str, etc.
- The middle layer is reality: the bits in memory.
- The bottom layer is the garbage collector's homogeneous view of memory. For us, it's a graph of ObjHeader*, with children traced via HeapTag::FixedSize and HeapTag::Scanned.
The middle layer is carefully designed to be compatible with both the top and bottom views. Precise garbage collection requires this.
While you usually debug your program thinking about the top layer, the crashes happen in the bottom layer.
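Here's a toy Python model of that bottom layer (the tag names echo the post's HeapTag, but the three-tuple layout is invented for illustration). The collector can find every object's children without knowing any application types:

```python
# Every object is (tag, mask, fields); the collector never sees
# user-defined classes, only this homogeneous view.

OPAQUE = 'Opaque'          # no children, e.g. the bytes of a Str
SCANNED = 'Scanned'        # every field is a child pointer, e.g. a List
FIXED_SIZE = 'FixedSize'   # children located by a bit mask, e.g. a class

def children(obj):
    tag, mask, fields = obj
    if tag == OPAQUE:
        return []
    if tag == SCANNED:
        return [f for f in fields if f is not None]
    # FixedSize: bit i set means "field i is a pointer"
    return [fields[i] for i in range(len(fields))
            if mask & (1 << i) and fields[i] is not None]

s = (OPAQUE, 0, [])                   # a Str-like leaf
lst = (SCANNED, 0, [s, s])            # a List-like object
node = (FIXED_SIZE, 0b01, [lst, 42])  # field 0 is a pointer, field 1 an int

assert children(s) == []
assert children(lst) == [s, s]
assert children(node) == [lst]
```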
Techniques (GDB Watchpoints)
Here are more debugging techniques I used:
- Validate the roots as soon as they're registered, not just on collection. Assert everything you can about the object headers, e.g. that they have valid tags like HeapTag::FixedSize.
- GDB watchpoints, using the command awatch -l myvar.
This is useful because ASAN gives you 3 stack traces:
- Where the use-after-free happened.
  - This is in the app code: the top layer.
- Where it was freed.
  - This may be in the Sweep() phase of the collector: the bottom layer.
- Where it was originally allocated.
So by putting a watchpoint on a memory location, you can see where an object is touched by both "bread" layers.
Gap Between Testing and Production: 9 Bugs
Now that we understand the debugging task, here's a summary of that Zulip thread: #oil-dev > Summary of GC Bugs
Logic errors generating GC metadata:
- The C++ compiler warned us that some of our constexpr field masks overflowed 16 bits! In other words, a few classes had more than 16 members.
  - I fixed this by putting all pointer fields first in classes without inheritance, which means they can use Tag::Scanned instead of Tag::FixedSize.
- Off-by-one error for Tag::Scanned objects in mycpp. This was due to a legacy of the Cheney collector, which has now been cleaned up.
- Bad Tag::Opaque optimization.
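The mask arithmetic behind that first bug can be modeled in a few lines of Python (a sketch; the real masks are constexpr values in generated C++):

```python
# A 16-bit field mask can only describe the first 16 fields of a class.

MASK_BITS = 16

def field_mask(pointer_field_indices):
    mask = 0
    for i in pointer_field_indices:
        if i >= MASK_BITS:
            raise OverflowError('field %d does not fit in a %d-bit mask'
                                % (i, MASK_BITS))
        mask |= 1 << i
    return mask

assert field_mask([0, 2]) == 0b101

# A class with a pointer in field 17 can't be described by FixedSize...
try:
    field_mask([17])
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed

# ...but if all pointer fields come first, Scanned works: the collector
# only needs the count of leading pointers, not a per-field mask.
```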
Missing GC metadata:
- Field masks in the frontend/flag_gen.py code generator.
- Object headers in the core/optview_gen.py code generator.
  - Remember, we expanded Python reflection into textual code generation in C++!
- Object header in the hand-written ParserSpec class.
Missing global roots:
- Shared signal handler state. (We had it under the abandoned "return value rooting" scheme, but not the new, simpler design.)
- Global stdout and stderr. (These are the only 3 globals in Oil!)
- A trivial memory leak in the getline() binding, which ASAN flagged.
  - The only issue here was that my machine didn't have debug symbols for libc, which made ASAN's stack trace confusing.
A Delicate Octopus With Thousands of Arms
So none of the bugs were in the collector itself; they were issues with the metadata used for rooting and tracing.
So I now envision a garbage collector as an octopus!
- It has a small head, which is say 500 or 2000 lines of elegant marking and sweeping logic.
- It has tentacles that touch all of your program! That's 50,000 or 500,000 lines of code that interact with garbage-collected objects.
  - Stack frames are being pushed and popped.
  - Variables are being mutated.
- Meanwhile, the octopus is tracing all objects with its arms, and throwing out what it can't reach.
If there's a single mistake in the logic of the second part, the octopus's brain explodes.
So this was the cause of essentially all 9 bugs. Precise garbage collection requires precise metadata, all over the program.
After this experience, I'm now confident our collector will be reliable. Why?
An unexpected benefit of writing the shell in Python is that the vast majority of GC metadata lives in generated source code. We don't have dozens or hundreds of bugs like Mozilla described in Clawing Our Way Back to Precision (2013).
So our source code isn't littered with any of:
- Rooting annotations (remember that we deleted all of them in hand-written code)
- Manually created object headers (with field masks)
- Manual memory management
- Reference counting annotations like Py_INCREF and Py_DECREF
So I expect that the C++ runtime will change slowly, and new shell features will be added in Python, where it's impossible to make mistakes with memory.
This great PyPy retrospective makes a similar point:
RPython is a garbage collected language, and the interpreter doesn't have to care much about GC in general. When the C source code is generated, a GC is automatically inserted. This is a source of great flexibility. Over time we experimented with a number of different GC approaches, from reference counting to Boehm to our current incremental generational collector
In fact, I revived the Cheney collector, as an experiment in performance. It may work well with our new manual GC points.
Performance
I have a lot to say about performance, including describing 3 immediate fixes:
- Fixed thrashing in the collection policy
- Removed a hash table in favor of a mark bitmap
- A "lazy free()" algorithm tweak, which fell out of recycling object IDs and improved performance
- Non-recursive marking (which surprisingly didn't improve performance, but makes the collector more robust)
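The second fix can be sketched in Python (an illustration of the data structure only, not Oil's C++ implementation): once every object has a small integer ID, testing and setting a mark bit is a shift and an OR, rather than a hash table operation.

```python
class MarkBitmap:
    """One bit per object ID: marked or not."""

    def __init__(self, num_objects):
        self.words = bytearray((num_objects + 7) // 8)

    def mark(self, obj_id):
        self.words[obj_id // 8] |= 1 << (obj_id % 8)

    def is_marked(self, obj_id):
        return bool(self.words[obj_id // 8] & (1 << (obj_id % 8)))


bitmap = MarkBitmap(100)
bitmap.mark(3)
bitmap.mark(64)
assert bitmap.is_marked(3) and bitmap.is_marked(64)
assert not bitmap.is_marked(4)
```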
And future plans:
- True lazy sweeping, for O(num live objects) rather than O(num allocated), like Cheney
- A custom allocator to take over small allocations
- Consistently put the ObjHeader before any vtable pointer
- Optimizations on the app side (a big parser refactoring)
- Small string optimization (hat tip to Max Bernstein), which interacts with GC
- A "boxless" optimization for the interpreter
This work is guided by many benchmarks:
This post is already long, so I won't go into detail. Each piece of work above has a thread on Zulip.
Follow our new Mastodon account or Twitter account if you want to learn more.
Summary
Writing this collector was a lot of work, but we ended up with something that's simple and reliable.
- Concepts
  - Where vs. when: rooting vs. safe points.
  - Precise GC metadata for precise collection (vs. "conservative" collection).
- Analogies
  - The "reality sandwich" for modeling memory.
  - The delicate octopus. The octopus is also the bottom layer of bread!
- Tools
  - ASAN is essential for flagging bugs during testing.
  - GDB watchpoints.
  - I hope to describe performance tools in a future post. I converged on an idiom for tables in shell, similar to what I called the "PPPT" pattern.
What's Next?
It's now clear that the OSH part of the project will "exist". We still have some work to do to replace the Python tarball with the oil-native tarball, but it will happen.
But the Oil language has stalled for nearly 6 months. I've gotten great feedback from people like Samuel Hierholzer and James Sully, which I hope to address soon.
I've also been falling behind on shell advocacy (#shell-the-good-parts). For now, I'll point out that shell was the 6th fastest growing language on GitHub in 2022. We still need a new shell.
And the #1 fastest growing language is the HashiCorp Config Language. A new part of Oil is very similar to it: interleaved with a principled shell, and decoupled from any application:
To get these efforts going again, I'll try to get more people involved with the project. Let me know if you want to be paid to work on open source code!
Appendix
FAQ: Why Should a Shell Have a Garbage Collector?
- It will allow recursive JSON-like data structures in Oil.
- It means that the Oil interpreter itself is memory safe, not just Oil programs.
  - This is in contrast with both POSIX shells (bash, dash) and most language implementations (CPython, Ruby, V8, Lua). They all have memory management logic littered throughout.
More answers for the curious:
For example, why not write it in Go? Go's runtime implements goroutines with OS threads, while shells use fork()-based concurrency. Go also doesn't use libc, which shells rely on extensively.
We Must Relax a Constraint
Here's another way of looking at our "manual collection point" design choice.
I believe the original problem was over-constrained, to the point that it was impossible to solve:
- We will generate C++ source code to make Oil fast.
  - For example, we don't use LLVM IR, because Oil should compile with a plain C++ compiler. Shells run on every conceivable architecture.
- We should use and generate standard, portable C++ (to the degree possible).
  - Conservative stack scanning is inherently unportable. It depends on how the compiler generates code for a particular CPU architecture, and on what optimizations it performs.
- We should collect exactly / precisely.
  - Conservative collection should really be called imprecise collection: there's a low probability of missing garbage.
- The generated code should be readable.
  - In the last post, I mentioned this as a reason for not doing bytecode or SSA-like transformations. This seems like a trivial issue, until you're handed a blob of 90K lines of var345 and var346 to debug! In contrast, our generated code is readable, and multiple contributors have debugged it with normal tools.
- mycpp basically prints MyPy's typed AST.
  - Again, we're not doing compiler-like transformations on the code.
- We can collect at any allocation.
When viewed this way, it's pretty clear to me that the sixth constraint is the one to relax.
Are there any other options? I don't think so, other than changing the C++ language itself, which seems to have stalled:
Manual collection points aren't fully general, but they solve exactly the problem we have. That is, we're able to precisely collect garbage in a subset of portable C++. It's easy to read, write, and generate code in this style.
Algorithm Animations
If you got this far and were disappointed by the pictures, I like these animations: