Writing a TLA⁺ tree-sitter grammar
2021 noticed the completion of my first substantial free software program challenge: a TLA⁺ grammar for tree-sitter, the error-tolerant incremental parser generator.
The challenge stabilized & discovered customers over the course of 2022, then over the vacations I used it to construct the TLA⁺ Unicode Converter.
The brand new yr is a time to mirror on the previous and look to the longer term, so right here in early 2023 appears best to publish my expertise.
Each TLA⁺ and tree-sitter itself ensured the challenge was fascinating on a technical stage;
much more fascinating had been the social and psychological points of my first actual involvement with the free software program neighborhood!
I’ll go over why I wished to create this challenge and the primary technical challenges I confronted doing so, then talk about the circumstances that enabled me to create it and the way free software program growth modified the way in which I feel.
In the event you’d reasonably get the technical half in video type (with demos!), you possibly can watch the presentation I gave at TLA⁺ Conf 2021 (slides: pdf, odp):
You will discover the grammar here or take it for a spin in your browser.
Why write this parser?
I like TLA⁺.
The language is simply implausible; it has enabled me to design, cause about, and implement so many techniques that will in any other case be riddled with concurrency bugs.
Nevertheless, it isn’t actually a secret that the consumer expertise is a bit clunky; like many instruments with roots in analysis & academia, it has been gradual to maneuver on from monolithic customized GUI-based IDEs.
Though the TLA⁺ Toolbox works fairly nicely in my view, lately individuals like well-designed CLIs with git-friendly supply & config recordsdata, CI/CD-friendly instruments, and light-weight zero-config language assist of their personalised editor.
There was some progress right here with the TLA⁺ VS Code extension, however I feel a giant stumbling block is that SANY, the only real feature-complete TLA⁺ parser, is considerably inscrutable for language tooling functions.
Accessing and querying the parse tree shouldn’t be simple.
I wished to create a TLA⁺ parser the place this performance was entrance & centre, with a concentrate on language tooling developer expertise as an alternative of finish consumer expertise.
I additionally wished to lastly convey actually nice syntax highlighting to TLA⁺.
Typical syntax highlighting makes use of common expressions to half-assedly parse non-regular languages.
Though it requires extra work up entrance, a syntax highlighting system utilizing context-free (and even context-sensitive!) grammars to parse languages in the identical stage of the Chomsky hierarchy makes lot extra sense.
Enter tree-sitter!
Language servers vs. tree-sitter
First, a digression: should you’ve hung round any developer information or dialogue boards over the previous few years you’ve in all probability heard about tree-sitter.
Sadly the following dialogue in all probability instantly went off on a tangent speaking about language servers, so I’ll take a while right here to explain the similarities & variations and hopefully transfer the dialogue ahead.
Mainly, tree-sitter is a software program challenge that helps you write parsers for programming languages.
Individuals lately hardly ever write parsers by hand except they actually wish to get into the weeds and create an industrial-strength implementation with full management over conduct, optimization, and error reporting; way more frequent is using a parser generator, the place you employ a pleasant DSL resembling a formal grammar and the generator emits the required gnarly string processing code in C or another language.
There are lots of parser mills on the market, so what’s so particular about parsers generated by tree-sitter?
4 issues:
- They’re error-tolerant, to allow them to parse recordsdata even when the consumer is midway via writing a block of code
- They’ve incremental parsing, to allow them to in a short time re-parse solely the code that has been edited by the consumer
- They’ve a standardized question API, so you possibly can rapidly discover syntax occurrences
- They’ve minimal uniform runtime necessities, so applications can eat many various tree-sitter grammars with out a lot overhead or customization
These properties make them best for language tooling.
I wouldn’t use a tree-sitter grammar because the spine of an interpreter, however they’re nice for circumstances the place an approximately-correct parse works high-quality, like:
- Syntax highlighting or semantic highlighting
- Image/reference discovering (GitHub is utilizing tree-sitter with stack graphs for this)
- Code evaluation, linting, or transformation instruments
- Coding with various enter strategies for accessibility (cursorless makes use of tree-sitter)
- Code folding
All of those might be applied with little greater than intelligent use of tree-sitter’s standardized query API, customizing the instrument’s performance for every supported language with a file that maps question captures to recognized names.
For instance, these files are all that had been essential to get TLA⁺ highlighting and code folding working in Neovim.
Distinction this with language servers.
The Language Server Protocol (LSP) was created to deal with the issue of needing to write down a separate extension for every programming language and every editor; by implementing a standard capabilities-based API, you possibly can write a single language server in your programming language and have or not it’s supported out of the field on all kinds of editors, thus turning an (N instances M) drawback into an (N) drawback:
The API contains generic high-level requests like “go to image definition” or “discover all image references”.
Every language server has nonstandard runtime necessities and is written in no matter programming language the creator wishes; usually instances it’s written in the identical language it’s serving (the language server for R is written in R!).
You could possibly even use tree-sitter inside your language server implementation, as was achieved with the bash language server.
Basically language servers are a lot higher-level, heavyweight, distinctive beasts than tree-sitter grammars.
They’re the last word type of language tooling, whereas tree-sitter grammars are themselves used to construct language instruments.
Language servers and tree-sitter grammars are wildly completely different; nevertheless, the small quantity of conceptual overlap and the tendency for his or her assist to be announced as side-by-side headline features ensures confusion will reign for fairly a while.
Writing a tree-sitter grammar
I’d by no means written a parser earlier than.
One among my nice undergrad CS regrets isn’t making it via the compilers course.
Nonetheless, I’d achieved fairly nicely within the formal languages class in order that was one thing; the fundamental Extended Backus-Naur form (EBNF) is intuitive sufficient.
However how will you translate language syntax to EBNF?
I’ll let you know the way you shouldn’t do it: by writing the language in EBNF as you realize it from expertise, then fleshing it out by guessing & checking whether or not the gold-standard parser accepts some syntax.
Put within the legwork and search for the docs!
Almost all languages could have some sort of previous language spec doc floating round.
Relying on how forward-thinking the language designer was, this would possibly even itself comprise an in depth EBNF grammar!
Probably the language spec is previous and out-of-date, nevertheless it’s a lot, a lot higher as a place to begin than your personal expertise utilizing some subset of the language syntax.
Fortunately TLA⁺ has such a doc, and as you may need anticipated for a proper specification language the grammar spec is written in TLA⁺ itself!
This received me about 50% of the way in which there.
Sadly implementation slowly diverged from specification, as usually occurs.
TLA⁺ actually solely has one widely-used parser known as SANY (the TLA⁺ Proof System has its own parser written in OCaml however this isn’t used wherever else).
SANY was written in Java round a decade in the past, utilizing the JavaCC parser generator.
JavaCC grammars are written in a form of EBNF, and SANY’s EBNF grammar diverges from the supposed TLA⁺ normal.
Sadly that isn’t the top of it, as a result of the code generated by JavaCC was subsequently hand-edited: the implementation diverges from the specification diverging from the specification!
My job thus grew to become synthesizing these fractured accounts of what TLA⁺ is right into a coherent complete, then hopefully backporting this understanding to the unique language spec so no one else must undergo what I went via.
This was by far essentially the most tedious a part of your complete challenge.
The times stretched into weeks and at instances I couldn’t see the way in which via – would I actually handle to climb out of the quagmire?
Lastly, after painstakingly chasing down each correspondence of parse tree node to grammar rule I claimed victory and filed a PR in opposition to the unique language grammar with all of the adjustments.
Could no one ever once more undergo on this approach.
At this level I believed I used to be about 90% achieved.
The opposite 90%
In software program engineering individuals usually speak of ending the primary 90% of a challenge, then having to work on the second 90%.
So it was with this grammar.
After wrapping up the EBNF work I had completed all of the context-free elements of the language.
TLA⁺ isn’t a context-free language, although.
It has context-sensitive elements.
These phrases have precise formal definitions, however mainly you possibly can consider context-sensitive language constructs as altering their that means relying on surrounding language constructs.
To be able to correctly parse them you need to hold monitor of what you’ve seen earlier than that time.
A easy instance is python’s use of indentation stage to mark code blocks: the that means of a given line of code relies on its columnar place and the columnar positions of earlier traces of code.
As you parse the code you need to carry some additional state together with you, a stack of nested code block indentation ranges (really this would possibly transfer past context-sensitive and require full Turing completeness as a result of to write down this in EBNF you’d have to parameterize grammar guidelines with their indentation stage, however I’ll go away this one as much as the formal language theorists).
TLA⁺ makes use of indentation ranges equally when writing conjunction & disjunction lists:
op ≜ ∧ A
∧ B
∧ ∨ C
∨ D
∧ E
Right here the conjunct and disjunct bullet alignments guarantee this expression is interpreted as (A ∧ B ∧ (C ∨ D) ∧ E)
.
This can be a very helpful option to write lengthy conjunctions & disjunctions, one of many massive language options of TLA⁺, so I needed to deal with it correctly.
The best way to write down context-sensitive language guidelines in tree-sitter is to make use of an exterior scanner: a hand-written C or C++ file encoding your personal customized lexing logic.
A few years in the past I’d sworn to myself I might by no means once more write C or C++.
Sadly I now discovered I needed to break that oath; tree-sitter generates a big parser.c
file, and (currently) the one supported technique of writing an exterior scanner is to write down an accompanying scanner.cc
file that parser.c
calls into.
Fortunately the single-file nature of this system and necessity of heavily restricting dependencies prevented most C++ unpleasantness; it additionally supplied a superb alternative to get my toes moist with actual hand-written lexing & parsing code.
It was spooky how simple it was.
I’d assumed lexing & parsing was this unbelievably troublesome area, nevertheless it was unusual how little logic it took to begin accurately parsing some fairly difficult circumstances!
That is one thing I had examine earlier than, and is type of analogous to the expertise of writing a recursive operate: you don’t really feel as if you’ve wrapped your head round it or dealt with all the pieces (simply the bottom case and basic case), nevertheless it simply… works?
After dispatching with parsing conjunction & disjunction lists I solely had yet one more context-sensitive dragon to slay: proofs.
TLA⁺ contains a whole language for writing formally-verified proofs that most individuals by no means encounter in the event that they’re solely writing specs for modelchecking.
Proofs are structured hierarchically, with statements damaged down into subproofs and people subproofs damaged down into additional subproofs, and so forth, following the type specified by Leslie Lamport’s paper How to Write a 21st Century Proof.
This appears to be like like:
THEOREM X
PROOF
<*> A
PROOF
<+> B
<1> QED
PROOF
<*> C
<3> D
<*> QED
<2> QED
<0> QED
Don’t be misled, most proofs aren’t this ugly – I’m simply attempting to depict cross-section of syntax.
Proof step ranges are denoted by a complicated combination of specific numbers (<3>
) and contextually-relevant symbols (<*>
, <+>
).
Determining all these guidelines was fairly a problem and I ultimately ended up with this somewhat-deranged state diagram for a way the exterior scanner ought to modify its state in response to tokens as they seem:
It was additionally difficult getting proofs to play good with conjunction & disjunction lists.
All of it labored out ultimately although, and now you are able to do issues like fold TLA⁺ proofs under a sure stage in neovim!
Definitely an inexpensive commerce for a whole month of my life.
At that time I used to be basically achieved; there have been 100 small fixes and refinements to make however the actual work was completed.
It was gratifying to lastly ship a usable TLA⁺ tree-sitter grammar.
One thing individuals can use, one thing that didn’t exist on this planet earlier than and was solely introduced into being via an individual’s labor.
It was my first actual free software program challenge!
After presenting the project at TLA⁺ conf 2021, I let it stabilize so it might evolve in response to consumer necessities.
2022 noticed two spectacular items of labor by others: Vasiliy Morkovkin prolonged the grammar to parse PlusCal, a TLA⁺ variant for specs which are extra sequential than event-driven.
Then, William Schultz applied a web-based TLA⁺ interpreter on high of the tree-sitter parse tree enabling you to discover the state graph by selecting which motion to carry out; this idea was discussed on the TLA⁺ conf so it was cool to see it applied so rapidly!
One different notable integration was GitHub itself selecting the grammar for syntax highlighting: now each TLA⁺ file (example) and code snippet on the location is highlighted with the grammar.
Turning into a free software program developer
To me, turning into a free software program developer was not completed after I started writing a free software program challenge.
I don’t imply this within the gatekeeping sense.
I imply there was a metamorphosis in worldview that occurred over time as I responded to incentives within the system.
If I had been to sum up the transformation, it will be this – initially, I might take a look at a free software program challenge pushed by a handful of builders and suppose: that is their software program.
Ultimately, I checked out that challenge and thought: that is our software program.
I can break this transformation down into 5 discrete steps of interplay with free software program tasks, all of which I moved via as I developed the grammar and interacted with dependencies:
- Writing web feedback complaining about bugs or lacking options in a free software program challenge.
- Reproducing a bug or spec’ing a function, opening a problem within the challenge’s situation tracker, then ready for them to repair it whereas often asking for updates.
- Forking the challenge’s repo, writing the adjustments myself, opening a PR in opposition to the repo, then ready for them to merge it whereas often asking for updates.
- Creating my very own launch of the software program from the fork containing my adjustments, so I or others can use it till the adjustments are upstreamed.
- Creating forks of different items of software program that depend upon the upstream challenge, updating their dependency to level to my first fork, then releasing the second fork so I or others can use it till the adjustments are upstreamed and the dependency up to date.
I’m not attempting to create an ethical hierarchy right here.
I’ve written my share of web feedback complaining about free software program tasks, even lately!
You could possibly say I’m nonetheless a bit half-baked.
Maybe we have to add an as-yet-unachieved sixth step:
- Apply perspective evolution of steps 1-5 to all free software program tasks, not simply tasks straight vital to perform my very own tasks.
I’m solely responding to incentives: wanting software program to operate in a sure approach, wanting that sooner reasonably than later, and having (or developing!) the technical know-how to make it do this.
So why did these incentives not work their magic on me earlier than?
I’ve been an expert software program engineer for a few decade now, and studied pc science in college for years previous to that.
The reply is time and company.
Whereas working inside a big company I used to be educated to push points to the proudly owning workforce, with a transitory go to to that workforce’s PM who would do their greatest to bully me into making it my workforce’s drawback – a mistake I made solely as soon as, earlier than realizing the PM’s job was to not obtain a mutually-beneficial separation of labor however reasonably to scare away obligations (my PM associates had been very entertained by this story of naivete – I had performed the flawed language recreation).
At work and in my private life I nearly solely used proprietary software program.
With proprietary software program you might be trapped within the function of easy client.
In order for you adjustments, you possibly can normally solely get to step 1 of the above interplay tiers – complaining on-line.
In the event you’re anomalously fortunate, the corporate growing the proprietary software program may need a community suggestions site so that you can make a half-assed try at step 2.
And that’s the place issues died – I used to be educated to don’t have any company in my interactions with software program.
I learn a nice post by Protesilaos Stavrou who phrased this by way of autonomy: rule by one’s self, vs. heteronomy: rule by one other.
I readily translated this discovered helplessness to free software program, regardless that the explanations behind it had been not current.
On the massive company, my assigned work duties absolutely consumed my time and, extra importantly, my productive vitality.
But time is a humorous factor – there’s a phrase heard in mountain climbing and elsewhere, “gradual is easy and easy is quick”.
This was the enduring expertise of engaged on this challenge.
I used to be on sabbatical, with no one to inform me what to do with my time & the right way to do it.
Most days I felt like an previous man wandering round, poking issues with a stick.
I might take the time to actually learn documentation, writing tiny prototypes to ensure I understood how issues ticked.
This was nice in itself.
It might by no means have flown on the company, with its manager-led each day standup and demoralizing expectation that something apart from uniform incremental each day progress was an aberration to be corrected.
I might not have acquired clearance to spend three weeks studying rust to implement a function in a dependency.
Growth in an organization is frenetic.
You might be solely to learn the way dependencies work on the shallowest stage, to facilitate completion of labor gadgets in preparation for accepting additional work gadgets.
These enduring gaps in understanding, the fixed acquisition after which disposal of uninteresting surface-level information, is to me a recipe for burnout.
I would like my studying to compound.
However issues definitely seem to maneuver rapidly!
In contrast if we are saying my growth tempo on this sabbatical was glacial, the comparability is apt: gradual, however inexorable and all-encompassing by the use of advance.
On the finish of six months I had completed greater than I ever had throughout any six-month interval at an organization, by a a number of.
Is it doable to translate this type of growth inside an organization?
I actually don’t know.
However I feel it’s the event type essential to thrive in free software program.
It appears to be like gradual.
Nevertheless it compounds, and turns into very quick.
Fixing upstream points isn’t a burden, it’s an accelerator.
I’m going to attempt to write fewer imply feedback about free software program tasks on the web.
With my phrases I’m simply disparaging myself.
These tasks provide one thing necessary to me that no proprietary software program can ever match.
Conclusion
Veteran free software program maintainers will observe I’m nonetheless within the honeymoon interval the place every extra consumer of my tasks brings happiness as an alternative of piling on high of an already-crushing assist burden.
I additionally haven’t but handled some contentious social state of affairs forcing me to decide on between suppressing my wishes or beginning & sustaining a fork of a dependency.
I additionally haven’t even touched the subject of sustainable free software program growth, aka “how will you afford to spend time on this”.
At the moment my reply is by doing consulting contracts, incomes sufficient to dwell comfortably-if-frugally whereas additionally devoting a considerable chunk of the yr to free software program growth (and burnout restoration).
So should you’re on the lookout for somebody to use formal strategies to your distributed system, among many other things, drop me a line at consulting@disjunctive.llc and hopefully I can hold this virtuous cycle turning!