
Postgres Language Server: implementing the Parser

2023-12-08 06:33:41

During our last Launch Week we announced the development of a Postgres Language Server. The response was more enthusiastic than we imagined, and we received some excellent ideas.

This is an update on the Parser, which is the fundamental primitive of a Language Server. We have been through a few iterations of the parser and we want to detail our explorations.

Want to learn some more acronyms?

There will be a few acronyms in this post. We don't want this post to be "inside baseball", so here are a few concepts you need to know:

  • LSP / Language Server Protocol: Something that helps code editors understand code better. It provides functionality like linting and error detection.
  • Postgres Language Server: a server that will help with Postgres programming – writing SQL, type inference, etc.
  • Parser: A parser converts written code into a form that tools can work with. For example, it identifies variables in code so that tools can easily access those variables. We describe this in more detail below.

Background: Why build a Language Server?

Postgres is gaining popularity, and yet the tooling around it still needs a lot of work. Writing SQL in editors like VSCode is a pain. One of the unique features of Supabase is the ability to access your Postgres database from a browser or mobile app through our Data APIs. This means that developers are writing more PL/pgSQL.

While code editors have great support for most programming languages, SQL support is underwhelming. We want to make Postgres as simple as Python.

The Language Server's Core: The Role of the Parser

At the highest level, a language server is a thing which

  1. accepts input source code from the user
  2. turns it into a semantic model of the code
  3. and provides language-specific smarts back to the user.

The parser is the core of every language server. It takes the raw string, turns it into a stream of tokens, and builds a syntax tree. This syntax tree can then be used to extract structural information.

Usually, the parser builds a concrete syntax tree (CST) before turning it into an abstract syntax tree (AST). A CST preserves all syntax elements present in the source code, with the goal of being able to re-create the exact original source, while an AST contains only the meaning of the source. For example, take the simple expression 2 * (7 + 3).
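As a rough illustration (the node and token names here are invented for this sketch, not taken from any particular parser), the two trees could look like this:

CST, with every piece of syntax preserved:

BinaryExpr
  Literal    "2"
  Whitespace " "
  Star       "*"
  Whitespace " "
  LParen     "("
  BinaryExpr
    Literal    "7"
    Whitespace " "
    Plus       "+"
    Whitespace " "
    Literal    "3"
  RParen     ")"

AST, with only the meaning:

Multiply
  2
  Add
    7
    3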


Implementing a Parser for Postgres

Implementing a parser for Postgres is difficult because of its ever-evolving and complex syntax. Even select statements are very complex to parse. Then there are common table expressions, sub-queries and the like. This is one of the reasons why existing Postgres tooling is scarce, badly maintained, and often does not work well.

We decided not to create a custom parser. Instead, we leverage libpg_query to parse SQL code reliably. The pganalyze team has published a great blog post on why this approach is preferable.

However, libpg_query is designed to parse executable SQL — not to provide language intelligence. Using it for a language server means adapting it to our specific use case. Let's explore how we adapted it:

Tokenization

Before any syntax tree can be built, the input string needs to be converted into a stream of tokens. libpg_query exposes a scan API that returns all non-whitespace tokens of the source, even for invalid SQL. For example, a simple statement select '1'; returns

keyword_kind: ReservedKeyword    ("select")
keyword_kind: NoKeyword          ("'1'")
keyword_kind: NoKeyword          (";")

Each ScanToken contains its variant, a range, and a keyword kind. To simplify the implementation of the parser, we extract the text for every token using the range, for now. We arrive at the following Token struct.

pub struct Token {
    /// The kind of the token
    pub kind: SyntaxKind,
    /// Text from the input
    pub text: String,
    /// Range within the input
    pub span: TextRange,
    /// Variants from `ScanToken.keyword_kind` + `Whitespace`
    pub token_type: TokenType,
}

To get a complete token stream, we merge the results of scan with a list of all whitespace tokens in the source, where the latter are extracted by a simple regular expression. For the simple statement select '1';, the following tokens are returned by the lexer.

token_type: ReservedKeyword    ("select")
token_type: Whitespace         (" ")
token_type: NoKeyword          ("'1'")
token_type: NoKeyword          (";")
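For illustration, here is a minimal sketch of that merge step in Rust, using simplified stand-in Token and TokenType types rather than the project's actual definitions:

// A minimal sketch of the merge described above; the real Token type and
// scan bindings in the project differ.
use regex::Regex;

#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenType {
    ReservedKeyword,
    NoKeyword,
    Whitespace,
}

#[derive(Debug)]
struct Token {
    text: String,
    start: usize,
    token_type: TokenType,
}

/// Merge the non-whitespace tokens returned by `scan` with whitespace tokens
/// extracted via a regular expression, ordered by their position in the input.
fn lex(input: &str, mut scanned: Vec<Token>) -> Vec<Token> {
    let whitespace = Regex::new(r"\s+").expect("valid regex");
    for m in whitespace.find_iter(input) {
        scanned.push(Token {
            text: m.as_str().to_string(),
            start: m.start(),
            token_type: TokenType::Whitespace,
        });
    }
    // sort by start position so the stream covers the whole source in order
    scanned.sort_by_key(|t| t.start);
    scanned
}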

Conversion to a Syntax Tree

To transform the stream of tokens into a syntax tree, we face the first challenges with libpg_query. In a language server, it is important to handle incomplete or improperly formatted input gracefully. When an error occurs, you don't want the parser to "stop". You want it to check for all errors in your code:

create table posts (
    id serial primary key # <- ERR: missing comma, parser should continue
    ...
);

create table comments (
    id serial primary key,
    post_id int references posts # <- ERR: missing comma, second error returned
    ...
);

Unfortunately, the parse API from libpg_query only parses the entire input — if any SQL statement contains a syntax error, an error is returned for the entire input.

To overcome this, we implemented a resilient source parser. This parser breaks the input into individual SQL statements and parses them one at a time, allowing us to handle syntax errors within each statement independently. It is implemented as a top-down LL parser. Specifically, the parser iterates the token stream left-to-right and checks whether the cursor is currently at the start of a new statement. Once a statement is entered, it walks the tokens until a ; or another statement start is encountered, while skipping sub-statements.

Luckily, Postgres statements always start with distinct keywords. An update statement is identifiable by update, a delete statement by delete from. There are a few statements that need more than the first few tokens to be distinguishable, but we only care about whether there is a statement, not which statement it is exactly, so ambiguity is very much acceptable.

For the implementation, we only need to provide the distinct keywords each statement starts with and compare them to the current tokens using a lookahead mechanism, as in the sketch below.
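A minimal sketch of that check, assuming plain string tokens and an illustrative, incomplete keyword table; the real parser works on typed tokens and covers every statement kind:

// Returns true if the tokens at `pos` look like the start of a new statement.
fn at_statement_start(tokens: &[&str], pos: usize) -> bool {
    // the distinct keyword sequence each statement kind starts with
    const STATEMENT_STARTS: &[&[&str]] = &[
        &["select"],
        &["insert", "into"],
        &["update"],
        &["delete", "from"],
        &["create", "table"],
    ];

    STATEMENT_STARTS.iter().any(|start| {
        start.iter().enumerate().all(|(i, keyword)| {
            tokens
                .get(pos + i)
                .map(|t| t.eq_ignore_ascii_case(keyword))
                .unwrap_or(false)
        })
    })
}

A statement splitter can then walk the token stream and open a new statement range whenever this check succeeds outside of a sub-statement.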

Reverse-Engineering the CST

The second limitation we encountered: libpg_query only exposes an API for the AST, not for the CST. To provide language intelligence on the source code, both are required. Since we don't want to implement our own parser, we need to work with what we have to build the CST: the AST and a stream of tokens. The goal is to reverse-engineer the AST into the CST. This involves re-using the AST nodes as CST nodes and figuring out which token belongs under which node. The exemplary statement select '1'; should be parsed into a tree along the lines of the sketch below.
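Reconstructed for illustration (the exact node and token kinds may differ):

SelectStmt@0..11
  Select@0..6 "select"
  Whitespace@6..7 " "
  ResTarget@7..10
    AConst@7..10
      Sconst@7..10 "'1'"
  Ascii59@10..11 ";"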

To do that, we need to know the range of every node within the AST.

We made numerous iterations over the past few months to figure out how to accomplish this with minimal manual implementation. Before diving into details, let's take a closer look at the parse API of libpg_query. For the exemplary statement above, it returns a structure like the following (simplified for readability).
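Reconstructed here for illustration; the field names follow the libpg_query protobuf definitions, and most fields are omitted:

SelectStmt {
    target_list: [
        ResTarget {
            val: AConst {
                val: String { sval: "1" },
                location: 7,
            },
            location: 7,
        },
    ],
}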

There are a few limitations:

  1. Some nodes do have a location property that indicates where the node starts in the source, but not all of them.
  2. There is no information on the range of a node within the source.
  3. Some locations are not what you would expect. For example, the location of an expression node is the location of its operator, not the start of its left expression.

To summarise: we need a range for every node, and we only have a start position, but not always, and sometimes it's wrong.

Our very first iterations were naive. We explored what information we could extract from scan and parse, and whether anything could help in reverse-engineering the CST.

It turns out that the most reliable way of determining the range of a node is knowing all properties of that node and its position in the tree.

A property is any text for which a Token can potentially be found in the source code. For example, a SelectStmt node has the select keyword as a property, and, if there is a from_clause, a from keyword. For the exemplary statement above, the properties are shown below.
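Sketched for illustration, with one property list per node:

SelectStmt: [Select]
ResTarget:  []
AConst:     ['1']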

Note that we do not have an extra String node; instead, we add its properties to the parent AConst node. The reason is that a String node does not bring any value to the CST, since we already know that '1' is a string from the kind of the respective Token. The same is true for all nodes that just contain type information, such as Integer, Boolean and Float.

The position of any node is the same in the AST and the CST, and can thereby be mirrored from it.

Implementation

Before the actual parsing begins, the AST returned by parse is converted into a uniform tree structure where every node holds the kind of the node, a list of properties, the depth and, if available, the location.
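A minimal sketch of what such a uniform node could hold, with SyntaxKind and TokenProperty as stand-ins for the project's real types and the field names as assumptions:

// Illustrative stand-ins only; not the project's actual definitions.
#[derive(Debug)]
enum SyntaxKind {
    SelectStmt,
    ResTarget,
    AConst,
}

#[derive(Debug)]
struct TokenProperty(String);

// One entry per AST node in the uniform tree.
#[derive(Debug)]
struct UniformNode {
    kind: SyntaxKind,
    properties: Vec<TokenProperty>,
    depth: usize,
    location: Option<usize>,
}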

For example, for select '1';, the tree contains three nodes (SelectStmt, ResTarget and AConst) connected by the edges (0, 1) and (1, 2).

To implement such a conversion we need a way to

  • get an Option<usize> with the location for every node type, if there is one
  • compile a list of all properties that a node contains
  • walk down the AST until the leaves are reached

Due to the strict type system of Rust, a manual implementation would be a large and repetitive effort. In languages such as JavaScript, getting the location of a node can be as simple as node?.location. In Rust, a large match statement covering all possible nodes is required to do the same. Luckily, libpg_query exports a protobuf definition containing all AST nodes and their fields. For example, an AExpr node is defined as

message A_Expr
{
    A_Expr_Kind kind = 1 [json_name="kind"];
    repeated Node name = 2 [json_name="name"];
    Node lexpr = 3 [json_name="lexpr"];
    Node rexpr = 4 [json_name="rexpr"];
    int32 location = 5 [json_name="location"];
}

We introspect that definition to generate code at build time using procedural macros.

Leveraging the powerful repetition feature of the quote crate (https://github.com/dtolnay/quote), the match statement of a get_location function can be implemented with just a few lines of code.

pub fn get_location_mod(_item: proc_macro2::TokenStream) -> proc_macro2::TokenStream {
    // Parse the proto file using a custom parser
    let parser = ProtoParser::new("libpg_query/protobuf/pg_query.proto");
    let proto_file = parser.parse();

    // get all node identifiers for the left side of the match arm
    let node_identifiers = node_identifiers(&proto_file.nodes);

    // get a TokenStream for each node that returns the location
    // if it is part of the node properties
    let location_idents = location_idents(&proto_file.nodes);

    quote! {
        /// Returns the location of a node
        pub fn get_location(node: &NodeEnum) -> Option<usize> {
            match node {
                #(NodeEnum::#node_identifiers(n) => #location_idents),*
            }
        }
    }
}

The location_idents function iterates over all nodes, searches for a location property in the protobuf definition of each node, and returns a TokenStream with either Some(n.location) or None for each.

fn location_idents(nodes: &[Node]) -> Vec<TokenStream> {
    nodes
        .iter()
        .map(|node| {
            if node
                .fields
                .iter()
                // has location property?
                .find(|n| n.name == "location" && n.field_type == FieldType::Int32)
                .is_some()
            {
                quote! { Some(n.location) }
            } else {
                quote! { None }
            }
        })
        .collect()
}

Similarly, we can generate code to recursively walk down the FieldType == Node and FieldType == Vec<Node> properties of the AST nodes. No manual work required.

Even the function that returns all properties of a node can be generated, at least partially. All AST fields of FieldType == String can always be added to the list of properties. In the example above, the sval: "1" of the String node makes up the properties of its parent, the AConst node. What remains are mostly just the keywords that need to be defined for every node. A SelectStmt node has the select keyword as a property, and, if there is a from_clause, a from keyword.

match node.name.as_str() {
    "SelectStmt" => quote! {
        tokens.push(TokenProperty::from(Token::Select));
        if n.from_clause.len() > 0 {
            tokens.push(TokenProperty::from(Token::From));
        }
    },
    // ...
}

Parsing a Statement

After the tree has been generated, the parser goes through the tokens and finds the node in whose properties the current token can be found. But not every node is a possible successor. Let's look at how the parser builds the CST for the statement select '1' from contact at a high level.

We start with the full tree and all tokens:

Remaining Tokens: ["select", "'1'", "from", "contact"]

0 (SelectStmt, [Select, From])
  1 (ResTarget, [])
    3 (AConst, ['1'])
  2 (RangeVar, ['contact'])

Starting with the root node, the parser first searches the current node for the token. In this case, with success: Select is removed from SelectStmt.

In the next iteration, we search for '1'. Since it is not in the current node, a breadth-first search is used to find the property within the children nodes. We arrive at AConst, open all nodes we encountered along the way, and advance.

Remaining Tokens: ["from", "contact"]

0 (SelectStmt, [From])
  1 (ResTarget, [])
    3 (AConst, [])
  2 (RangeVar, ['contact'])

Since we arrived at a leaf node with no properties left, the next token cannot be part of this node or any child. It can be closed immediately after advancing the token. We remove it from the tree and set the current node to its parent. The same now applies to ResTarget, so we arrive back at SelectStmt:

Remaining Tokens: ["from", "contact"]

0 (SelectStmt, [From])
  2 (RangeVar, ['contact'])

The from token can once again be found within the current node, and we simply advance the parser. Since SelectStmt is not a leaf node, we stay where we are.

Remaining Tokens: ["contact"]

0 (SelectStmt, [])
  2 (RangeVar, ['contact'])

From here, it repeats itself: contact is found within RangeVar using a breadth-first search. It becomes a leaf node that is closed after applying the token. Since no tokens are left, we finish parsing by closing SelectStmt, resulting in:

SelectStmt@0..23
  Select@0..6 "select"
  Whitespace@6..7 " "
  ResTarget@7..10
    AConst@7..10
      Sconst@7..10 "'1'"
  Whitespace@10..11 " "
  From@11..15 "from"
  Whitespace@15..16 " "
  RangeVar@16..23
    Ident@16..23 "contact"

Keep in mind that this illustration shows only an outline of the process. If you are interested in the details, take a look at the source code here.

You may have noticed that neither location nor depth have been mentioned. Both are only used to improve performance and as safeguards. Among other things, branches with nodes behind the current position of the parser are skipped. Further, the parser panics when a node is opened and either its position or its depth does not match the current state of the parser. This means that the returned CST is guaranteed to be correct.

Limitations

If the SQL is invalid and parse returns an error, the returned CST is just a flat list of tokens. Consequently, the statement parser is not resilient. This is not great, but we have intentionally implemented it this way so that custom, resilient implementations can be added statement by statement later.

Ultimately, we want the libpg_query-based parser to serve only as a fallback. For now, however, our goal is to provide a usable language server as fast as possible. And even if some intelligence features only work on valid statements, we believe it is still better than what we have today: no intelligence at all.

Next Steps

There are some minor improvements remaining for the parser. But the biggest part is the manual implementations missing in get_node_properties. It's a time-consuming effort, and we would love to get help from the community here. Check out this issue if you would like to help.

After that, we will move on to the semantic data model and the language server itself. Other parser features, such as support for parsing PL/pgSQL function bodies, will be added later. We want to get this project into a usable state as fast as possible.
