Postgres Language Server: implementing the Parser
During our last Launch Week we announced the development of a Postgres Language Server. The response was more enthusiastic than we imagined and we received some excellent ideas.
This is an update on the Parser, which is the foundational primitive of a Language Server. We have been through several iterations of the parser and we want to detail our explorations.
Want to learn some more acronyms?
There will be a few acronyms in this post. We don't want this post to be "inside baseball", so here are a few concepts that you should know:
- LSP / Language Server Protocol: something that helps code editors understand code better. It provides functionality like linting and error detection.
- Postgres Language Server: a server that will help with Postgres programming – writing SQL, type inference, etc.
- Parser: a parser converts written code into a form that tools can work with. For example, it identifies the variables in code so that tools can easily access those variables. We'll describe this in more detail below.
Background: Why build a Language Server?
Postgres is gaining popularity, and yet the tooling around it still needs a lot of work. Writing SQL in editors like VSCode is a pain. One of the unique features of Supabase is the ability to access your Postgres database from a browser or mobile app through our Data APIs. This means that developers are writing more PL/pgSQL.
While code editors have great support for most programming languages, SQL support is underwhelming. We want to make Postgres as simple as Python.
A Language Server's Core: the Role of the Parser
At the highest level, a language server is a thing which
- accepts input source code from the user,
- turns it into a semantic model of the code,
- and provides language-specific smarts back to the user.
The parser is the core of every language server. It takes the raw string, turns it into a stream of tokens, and builds a syntax tree. This syntax tree can then be used to extract structural information.
Usually, the parser builds a concrete syntax tree (CST) before turning it into an abstract syntax tree (AST). A CST preserves all syntax elements present in the source code with the goal of being able to re-create the exact original source, while an AST contains only the meaning of the source. For example, take the simple expression `2 * (7 + 3)`:
Implementing a Parser for Postgres
Implementing a parser for Postgres is challenging because of the ever-evolving and complex syntax of Postgres. Even `select` statements are very complex to parse. Then there are common table expressions, sub-queries and the like. This is one of the reasons why existing Postgres tooling is scarce, badly maintained, and often doesn't work well.
We decided not to create a custom parser. Instead, we leverage libpg_query to parse SQL code reliably. The pganalyze team has published a great blog post on why this approach is preferable.
However, libpg_query is designed to parse *executable* SQL — not to provide language intelligence. Using it for a language server means adapting it to our specific use case. Let's explore how we did that:
Tokenization
Before any syntax tree can be built, the input string needs to be converted into a stream of tokens. libpg_query exposes a `scan` API that returns all non-whitespace tokens of the source, even for invalid SQL. For example, a simple statement `select '1';` returns:
    ScanToken { ..., keyword_kind: ReservedKeyword }  // "select"
    ScanToken { ..., keyword_kind: NoKeyword }        // "'1'"
    ScanToken { ..., keyword_kind: NoKeyword }        // ";"
Every `ScanToken` contains its variant, a range, and a keyword kind. To simplify the implementation of the parser, we extract the text for each token using the range, for now. We arrive at the following `Token` struct.
    pub struct Token {
        /// The kind of the token
        pub kind: SyntaxKind,
        /// Text from the input
        pub text: String,
        /// Range within the input
        pub span: TextRange,
        /// Variants from `ScanToken.keyword_kind` + `Whitespace`
        pub token_type: TokenType,
    }
To have a complete token stream, we merge the results of `scan` with a list of all whitespace tokens in the source, where the latter are extracted by a simple regular expression. For the simple statement `select '1';`, the following tokens are returned by the lexer:
    Token { ..., token_type: ReservedKeyword }  // "select"
    Token { ..., token_type: Whitespace }       // " "
    Token { ..., token_type: NoKeyword }        // "'1'"
    Token { ..., token_type: NoKeyword }        // ";"
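A minimal sketch of the whitespace extraction, under the assumption that the `regex` crate is used (the helper name is hypothetical, not the project's actual API):

```rust
use regex::Regex;

/// Return the byte range of every whitespace run in the input.
/// These ranges are merged with the non-whitespace tokens from `scan`.
fn whitespace_ranges(input: &str) -> Vec<(usize, usize)> {
    let re = Regex::new(r"\s+").expect("valid regex");
    re.find_iter(input).map(|m| (m.start(), m.end())).collect()
}

fn main() {
    // for `select '1';` the only whitespace sits between `select` and `'1'`
    assert_eq!(whitespace_ranges("select '1';"), vec![(6, 7)]);
}
```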
Conversion to a Syntax Tree
To transform the stream of tokens into a syntax tree, we face the first challenges with libpg_query. In a language server, it is important to handle incomplete or improperly formatted input gracefully. When an error occurs you don't want the parser to "stop". You want it to check for all errors in your code:
    create table posts (
        id serial primary key # <- ERR: missing comma, Parser should continue
        ...
    );

    create table comments (
        id serial primary key,
        post_id int references posts # <- ERR: missing comma, second error returned
        ...
    );
Unfortunately, the `parse` API from libpg_query only parses the entire input — if any SQL statement contains a syntax error, an error is returned for the entire input.
To overcome this, we implemented a resilient source parser. This parser breaks the input into individual SQL statements and parses them one by one, allowing us to handle syntax errors within each statement independently. It is implemented as a top-down LL parser. Specifically, the parser iterates the token stream left-to-right, and checks whether the cursor is currently at the start of a new statement. Once a statement is entered, it walks the tokens until a `;` or another statement start is encountered, skipping sub-statements.
Luckily, Postgres statements always start with distinct keywords: an update statement is identifiable by `update`, a delete statement by `delete from`. There are a few statements that need more than the first few tokens to be distinguishable, but we only care about whether there is a statement, not which statement it is exactly, so ambiguity is very much acceptable.
For the implementation, we only need to supply the distinct keywords each statement begins with and compare them to the current tokens using a lookahead mechanism.
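A minimal sketch of that lookahead check (the names and the keyword table are hypothetical; the real implementation covers every statement kind):

```rust
/// Returns true if the tokens at the cursor begin a new statement.
fn at_statement_start(lookahead: &[&str]) -> bool {
    // each entry lists the keyword(s) a statement starts with
    const STATEMENT_STARTS: &[&[&str]] = &[
        &["select"],
        &["update"],
        &["delete", "from"],
        &["insert", "into"],
        &["create", "table"],
        // ... one entry per statement kind
    ];

    STATEMENT_STARTS.iter().any(|start| {
        lookahead.len() >= start.len()
            && start
                .iter()
                .zip(lookahead)
                .all(|(kw, tok)| tok.eq_ignore_ascii_case(kw))
    })
}

fn main() {
    assert!(at_statement_start(&["update", "posts", "set"]));
    assert!(!at_statement_start(&["from", "posts"]));
}
```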
Reverse-Engineering the CST
The second limitation we encountered: libpg_query only exposes an API for the AST, not for the CST. To provide language intelligence on the source code, both are required. Since we don't want to implement our own parser, we have to work with what we have to build the CST: the AST and a stream of tokens. The goal is to reverse-engineer the AST into the CST. This involves re-using the AST nodes as CST nodes, and figuring out which token belongs under which node. The exemplary statement `select '1';` should be parsed into a tree along the lines of the one below.
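A sketch of that target CST (exact node and token names may differ in the implementation):

```
SelectStmt@0..11
  Select@0..6 "select"
  Whitespace@6..7 " "
  ResTarget@7..10
    AConst@7..10
      Sconst@7..10 "'1'"
  Ascii59@10..11 ";"
```

Every token is attached to an AST node, and every node carries the range it covers in the source.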
To do that, we need to know the range of every node that is part of the AST.
We went through numerous iterations over the past few months to figure out how to accomplish this with minimal manual implementations. Before diving into details, let's take a closer look at the `parse` API of libpg_query. For the exemplary statement above, it returns a nested structure (simplified for readability).
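A sketch of the shape of that output (field names follow the libpg_query protobuf; most fields omitted):

```
SelectStmt {
    target_list: [
        ResTarget {
            val: AConst {
                sval: String { sval: "1" },
                location: 7,
            },
            location: 7,
        },
    ],
}
```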
There are a few limitations:
- Some nodes do have a `location` property that indicates where the node starts in the source, but not all of them.
- There is no information on the range of a node within the source.
- Some locations are not what you would expect. For example, the location of an expression node is the location of the operator, not the start of the left expression.
To summarize: we need a range for every node, and we only have a start position, but not always, and sometimes it is wrong.
Our very first iterations were naive. We explored what information we could extract from `scan` and `parse`, and whether anything could help in reverse-engineering the CST.
It turns out the most reliable way of determining the range of a node is by knowing all properties of that node, and its position in the tree.
A property is any text for which a Token can potentially be found in the source code. For example, a `SelectStmt` node has the `select` keyword as a property, and, if there is a `from_clause`, a `from` keyword. For the exemplary statement above, the properties are as follows.
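A sketch (the exact representation in the implementation differs):

```
SelectStmt: [Select]
ResTarget:  []
AConst:     ['1']
```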
Note that we do not have an extra `String` node; instead we add its properties to the parent `AConst` node. The reason is that a `String` node does not bring any value to the CST, since we already know that `'1'` is a string from the kind of the respective Token. The same is true for all nodes that just carry type information such as `Integer`, `Boolean` and `Float`.
The position of any node is the same in the AST and the CST, and can therefore be mirrored from it.
Implementation
Before the actual parsing begins, the AST returned by `parse` is converted into a uniform tree structure where every node holds its kind, a list of properties, its depth and, if available, its location.
For example, for `select '1';`:
    nodes:
        0: SelectStmt  [Select]
        1: ResTarget   []
        2: AConst      ['1']
    edges: (0, 1), (1, 2)
To implement such a conversion we need a way to
- get an `Option<usize>` with the location for every node type, if any,
- compile a list of all properties that a node contains,
- walk down the AST until the leaves are reached.
Due to the strict type system of Rust, a manual implementation would be a large and repetitive effort. With languages such as JavaScript, getting the location of a node can be as simple as `node?.location`. In Rust, a large match statement covering all possible nodes is required to do the same. Luckily, libpg_query exports a protobuf definition containing all AST nodes and their fields. For example, an `AExpr` node is defined as:
    message A_Expr {
        A_Expr_Kind kind = 1 [json_name="kind"];
        repeated Node name = 2 [json_name="name"];
        Node lexpr = 3 [json_name="lexpr"];
        Node rexpr = 4 [json_name="rexpr"];
        int32 location = 5 [json_name="location"];
    }
We introspect that definition to generate code at build time using procedural macros.
Leveraging the powerful repetition feature of the [quote](https://github.com/dtolnay/quote) crate, the `match` statement of a `get_location` function can be implemented with just a few lines of code:
    pub fn get_location_mod(_item: proc_macro2::TokenStream) -> proc_macro2::TokenStream {
        // Parse the proto file using a custom parser
        let parser = ProtoParser::new("libpg_query/protobuf/pg_query.proto");
        let proto_file = parser.parse();

        // get all node identifiers for the left side of the match arm
        let node_identifiers = node_identifiers(&proto_file.nodes);

        // get a TokenStream for each node that returns the location
        // if it is part of the node properties
        let location_idents = location_idents(&proto_file.nodes);

        quote! {
            /// Returns the location of a node
            pub fn get_location(node: &NodeEnum) -> Option<usize> {
                match node {
                    #(NodeEnum::#node_identifiers(n) => #location_idents),*
                }
            }
        }
    }
The `location_idents` function iterates over all nodes, searches for a `location` property in the protobuf definition of each node, and returns a `TokenStream` with either `Some(n.location)` or `None` for each.
    fn location_idents(nodes: &[Node]) -> Vec<TokenStream> {
        nodes
            .iter()
            .map(|node| {
                if node
                    .fields
                    .iter()
                    // has location property?
                    .find(|n| n.name == "location" && n.field_type == FieldType::Int32)
                    .is_some()
                {
                    quote! { Some(n.location) }
                } else {
                    quote! { None }
                }
            })
            .collect()
    }
Similarly, we can generate code to recursively walk down the `FieldType == Node` and `FieldType == Vec<Node>` properties of the AST nodes. No manual work required.
Even the function that returns all properties of a node can be generated, at least partly. All AST fields of `FieldType == String` can always be added to the list of properties. In the example above, the `sval: "1"` of the `String` node makes up the properties of its parent, the `AConst` node. What remains is mostly just the keywords that have to be defined for every node. A `SelectStmt` node has the `select` keyword as a property, and, if there is a `from_clause`, a `from` keyword.
    fn custom_handlers(node: &Node) -> TokenStream {
        match node.name.as_str() {
            "SelectStmt" => quote! {
                tokens.push(TokenProperty::from(Token::Select));
                if n.from_clause.len() > 0 {
                    tokens.push(TokenProperty::from(Token::From));
                }
                // ...
            },
            // ...
        }
    }
Parsing a Statement
After the tree has been generated, the parser goes through the tokens and finds the node in whose properties the current token can be found. But not every node is a possible successor. Let's look at how the parser builds the CST for the statement `select '1' from contact` at a high level.
We start with the full tree and all tokens:
    Remaining Tokens: ["select", "'1'", "from", "contact"]

    SelectStmt (0, [Select, From])
    ├── ResTarget (1, [])
    │   └── AConst (3, ['1'])
    └── RangeVar (2, ['contact'])
Starting with the root node, the parser first searches the current node for the token. In this case, with success: `Select` is removed from `SelectStmt`.
In the next iteration, we search for `'1'`. Since it is not in the current node, a breadth-first search is used to find the property within the children nodes. We arrive at `AConst`, open all nodes we encountered along the way, and advance.
    Remaining Tokens: ["from", "contact"]

    SelectStmt (0, [From])
    ├── ResTarget (1, [])
    │   └── AConst (3, [])
    └── RangeVar (2, ['contact'])
Since we arrived at a leaf node with no properties left, the next token cannot be part of this node or any of its children. It can be closed immediately after advancing the token. We remove it from the tree and set the current node to its parent. The same now applies to `ResTarget`, so we arrive back at `SelectStmt`:
    Remaining Tokens: ["from", "contact"]

    SelectStmt (0, [From])
    └── RangeVar (2, ['contact'])
The `from` token can once again be found within the current node and we simply advance the parser. Since `SelectStmt` is not a leaf node, we stay where we are.
    Remaining Tokens: ["contact"]

    SelectStmt (0, [])
    └── RangeVar (2, ['contact'])

    CST so far:
      ...
      Whitespace@10..11 " "
      From@11..15 "from"
From here, it repeats itself: `contact` is found within `RangeVar` using a breadth-first search. It becomes a leaf node that is closed after applying the token. Since no tokens are left, we finish parsing by closing `SelectStmt`, resulting in:
    SelectStmt@0..23
      Select@0..6 "select"
      Whitespace@6..7 " "
      ResTarget@7..10
        AConst@7..10
          Sconst@7..10 "'1'"
      Whitespace@10..11 " "
      From@11..15 "from"
      Whitespace@15..16 " "
      RangeVar@16..23
        Ident@16..23 "contact"
Remember, this illustration only gives you an outline of the process. If you are interested in the details, take a look at the source code here.
You may have noticed that neither `location` nor `depth` have been mentioned. Both are only used to improve performance and as a safeguard. Among other things, branches with nodes behind the current position of the parser are skipped. Further, the parser panics when a node is opened and either its position or its depth does not match the current state of the parser. This means that the returned CST is guaranteed to be correct.
Limitations
If the SQL is invalid and `parse` returns an error, the returned CST is just a flat list of tokens. Consequently, the statement parser is not resilient. This is not great, but we have intentionally implemented it this way so that custom, resilient implementations can be added statement by statement later.
Ultimately, we want the libpg_query-based parser to just serve as a fallback. For now however, our goal is to provide a usable language server as fast as possible. And even if some intelligence features will only work on valid statements, we believe it is still better than what we have today: no intelligence at all.
Next Steps
There are some minor improvements remaining for the parser. But the largest part is the manual implementations missing in `get_node_properties`. It is a time-consuming effort, and we would like to get help from the community here. Check out this issue if you would like to help.
After that, we will move on to the semantic data model and the language server itself. Other parser features such as support for PL/pgSQL function body parsing will be added later. We want to get this project into a usable state as fast as possible.