
What I Talk About When I Talk About Query Optimizer (Part 1): IR Design

2024-01-29 08:45:08

I recently came across an insightful article on SQL query optimizers by @leiysky on Zhihu, and I must say it was excellent! To make it accessible to a wider audience, I've translated the piece into English. Enjoy reading!

Please note that @leiysky deserves full credit for the quality of this article. Any errors are solely due to my inadequate translation skills.

Alright, let's begin!


During a recent conversation, Mr. Chi and I discussed his latest project at CMU called optd, a query optimizer library built on the Cascades framework. We ended up griping together about various design and implementation issues of database optimizers. It was at that moment that I realized how intriguing certain technical topics are and decided to write them down.

So, I've decided to launch a series covering everything about query optimizers: from algorithm fundamentals to engineering techniques, from technological evolution to project implementation, and even some insider industry gossip.

The title is inspired by Haruki Murakami's book "What I Talk About When I Talk About Running", which I highly recommend. The book is a collection of essays Murakami wrote about his experience as a runner. Its title in turn references a collection of short stories by Raymond Carver, "What We Talk About When We Talk About Love", which Murakami translated into Japanese.

Software development is a skill just as running is: everyone can do it with practice, but not everyone will feel the same way about it. I'd like to share some personal thoughts and experiences about query optimizers, and I hope you will find them interesting.

Today, let's begin with the topic of IR design, focusing on common design patterns in optimizers and the considerations behind them.

What is a Query Optimizer?

Before we formally start the discussion, I'd like to clarify the definition of a query optimizer.

Generally, a query optimizer is the database component that optimizes the execution plan of queries. However, different databases optimize queries in various ways. For instance, some may rewrite the AST directly, perform transformations during AST lowering, or rewrite plans dynamically during query execution.

To unify the concept, I'll refer to all components from the SQL parser to the SQL executor collectively as the Query Optimizer.

What is IR?

Friends familiar with compiler technology should know the term IR well. IR stands for Intermediate Representation and is commonly used in compilers for different programming languages, like Rust's HIR & MIR and LLVM's LLVM IR. An IR is a structural representation of a program, enabling the compiler to conduct various analyses and optimizations more effectively.

If SQL is considered a programming language, then relational databases function as virtual machines that execute SQL, much like the JVM executes Java. The query optimizer is responsible for translating SQL statements (Java code) into execution plans (Java bytecode) for the executor (Java runtime) to run. Consequently, it's essential to design appropriate IRs for SQL when building a query optimizer.

What does SQL IR look like?

Typical database projects are divided into several modules: Parser, Analyzer/Binder, Optimizer, and Executor. These components process SQL statements in sequence to turn them into query results. In our context, the Optimizer comprises both the Analyzer and Optimizer modules mentioned above.

AST

SQL is a declarative language that mimics natural-language syntax. It is based on relational algebra and describes operations on sets, mapping them to queries over tabular data (tables).

To facilitate processing, like the vast majority of compilers, we first parse the SQL text into an AST (Abstract Syntax Tree). A typical SQL AST is illustrated as follows:

Query AST

In the SQL AST, we usually divide nodes into two kinds: Statement (Stmt) and Expression (Expr). The root node of every SQL AST is always a Statement, which may contain some Clauses as well as Expressions. An Expression is a recursive structure that includes various operators and function calls, and even nested Statements (subqueries).

One interesting aspect of SQL is the blurred boundary between Statements and Expressions in SELECT statements. This arises because SELECT statements are recursive and must handle operator precedence (UNION/EXCEPT/INTERSECT). Moreover, only SELECT statements can interact with Expressions recursively, which should be kept in mind when designing a SQL AST.
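To make this mutual recursion concrete, here is a minimal, hypothetical AST sketch (the type and field names are illustrative, not from any real parser): a SELECT contains Exprs, and an Expr can contain a nested SELECT.

```rust
// A minimal, hypothetical SQL AST showing the Statement/Expression recursion.
pub struct SelectStmt {
    pub projection: Vec<Expr>,
    pub from: String,
    pub selection: Option<Expr>,
}

pub enum Expr {
    ColumnRef(String),
    Literal(i64),
    BinaryOp {
        op: String,
        left: Box<Expr>,
        right: Box<Expr>,
    },
    // A nested Statement inside an Expression: the blurred boundary.
    Subquery(Box<SelectStmt>),
}

// Count how deeply subqueries nest (following only the WHERE clause, for
// brevity), to show the recursion is unbounded.
pub fn subquery_depth(expr: &Expr) -> usize {
    match expr {
        Expr::ColumnRef(_) | Expr::Literal(_) => 0,
        Expr::BinaryOp { left, right, .. } => subquery_depth(left).max(subquery_depth(right)),
        Expr::Subquery(stmt) => {
            1 + stmt
                .selection
                .as_ref()
                .map(|e| subquery_depth(e))
                .unwrap_or(0)
        }
    }
}
```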

Relational Algebra

The theoretical foundation of the SQL language is relational algebra, and every query statement corresponds to a relational algebra representation. For example:

Relational algebra

Since a relational algebra expression is also a recursive tree structure, many systems naturally convert the SQL AST into an execution plan similar to the one shown below. We call each node an operator, and the whole operator tree a query plan.

Algebra plan

Of course, there are exceptions among the many systems. For example, IBM, as a pioneer, introduced the Query Graph Model (QGM) in its Starburst system. This representation is quite abstract and hardcodes many properties into QGM, making it exceptionally hard to understand. Its claimed extensibility is also questionable.

Due to space limitations, I won't elaborate here; if you're interested, you can read the related papers Extensible Query Processing in Starburst and Extensible/Rule Based Query Rewrite Optimization in Starburst.

QGM

At present, mainstream databases have largely adopted the relational algebra representation (for example IBM's System R and DB2, Oracle's various database products, Microsoft's SQL Server family, and the open-source PostgreSQL and MySQL 8.0). On this foundation, they have developed numerous optimization and execution frameworks. Therefore, choosing the relational algebra abstraction when designing a SQL IR is a foolproof decision.

By applying the axioms and theorems of relational algebra, we can perform a wide range of transformations on the SQL IR to optimize it while guaranteeing correctness. Specific optimization rules and algorithms will be discussed in subsequent articles.

(My) Best Engineering Practices

There is a cognitive bias known as the curse of knowledge: during communication, one assumes that others possess the same background knowledge as oneself.

This phenomenon is quite common in software development. People who have experience writing a certain kind of code and those who don't often struggle to communicate effectively, even when they share the same theoretical foundation (algorithms, programming languages, or domain knowledge). The reason lies in the enormous flexibility of software engineering: there are many ways to implement the same functionality, each with its own set of challenges.

To eliminate such communication barriers, various technical fields have developed their own idioms or design patterns. New projects built on these practices can avoid a lot of unnecessary trouble. The same is true for the database field; however, due to its niche nature and high degree of commercialization, publicly circulated knowledge is very scarce, and engineering practices are scattered across various open-source projects.

In this article, I will build a SQL IR from scratch following my own best practices, which lets me share some design considerations step by step. Out of personal preference, I will write the code in Rust. Friends who aren't familiar with Rust needn't worry; as long as you have a basic understanding of C/C++, you can follow the logic of the Rust code.

Hello, world!

When we learn a new programming language, the first program we typically encounter is "hello world".

fn main() {
    println!("Hello, world!");
}

Therefore, we will also start building our IR from the SQL version of "hello world".

create table t(a int);
select * from t;

Translating this SQL statement into relational algebra is very simple; we denote it as Get(t), which means returning all the data in set t. To represent such a query, we can define a simple struct called Get.

pub struct Get {
    pub table: String,
}

fn plan() -> Get {
    // select * from t;
    Get {
        table: "t".to_string(),
    }
}

This simple SQL IR is now complete. With Get, we can represent any query of the form select * from xxx. Isn't that easy?

Select & Project

Next, we can add more features to this IR to support additional SQL clauses. For example:

create table t(a int, b int);
select a from t where a = 1;

Translated into relational algebra, this query can be denoted as Project(Select(Get(t), a = 1), a). The Select operator filters data with the provided predicate, while the Project operator trims the set down to the required attributes. To represent such a query, we need a few more struct definitions.

pub struct Get {
    pub table: String,
}

pub struct Select {
    pub get: Get,
    pub predicate: String,
}

pub struct Project {
    pub select: Select,
    pub project: String,
}

fn plan() -> Project {
    // select a from t where a = 1;
    Project {
        select: Select {
            get: Get {
                table: "t".to_string(),
            },
            predicate: "a = 1".to_string(),
        },
        project: "a".to_string(),
    }
}

At this point we are confronted with several questions: According to the theory of relational algebra, can a Project act as a child of a Select? Given that Select is optional in an SQL query, how should that be reflected in the code?

To address these issues, we can introduce some form of dynamic dispatch. In C++/Java, inheritance is typically used to represent an Operator, for example:

class Operator {};
class Get : public Operator {};
class Select : public Operator {
    Operator* _child;
};
class Project : public Operator {
    Operator* _child;
};

In Rust, we have a more convenient option that lets us enjoy the benefits of both static typing and dynamic dispatch: enum. Rust's enum is an ADT (Algebraic Data Type), also known as a tagged union, and it can represent our operators very conveniently:

pub enum Operator {
    Get {
        table: String,
    },
    Select {
        child: Box<Self>,
        predicate: String,
    },
    Project {
        child: Box<Self>,
        projects: String,
    },
}

fn plan() -> Operator {
    // select a from t where a = 1;
    Operator::Project {
        child: Box::new(Operator::Select {
            child: Box::new(Operator::Get {
                table: "t".to_string(),
            }),
            predicate: "a = 1".to_string(),
        }),
        projects: "a".to_string(),
    }
}

With this, we can freely represent operator trees of various shapes, and the design of the IR starts to get on the right track.
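As a quick illustration of that flexibility, here is a hedged sketch of a recursive explain-style printer over this enum (the enum is repeated so the snippet is self-contained; `explain` is a hypothetical helper, not part of the article's IR):

```rust
// A self-contained copy of the three-variant Operator enum from above.
pub enum Operator {
    Get { table: String },
    Select { child: Box<Operator>, predicate: String },
    Project { child: Box<Operator>, projects: String },
}

// Recursively render the operator tree, indenting each level, to show how
// naturally the enum composes into trees of arbitrary shape.
pub fn explain(op: &Operator, depth: usize) -> String {
    let indent = "  ".repeat(depth);
    match op {
        Operator::Get { table } => format!("{indent}Get({table})\n"),
        Operator::Select { child, predicate } => {
            format!("{indent}Select({predicate})\n{}", explain(child, depth + 1))
        }
        Operator::Project { child, projects } => {
            format!("{indent}Project({projects})\n{}", explain(child, depth + 1))
        }
    }
}
```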

Scalar expression

Although we've introduced the Select and Project operators, the Select predicate and the Project expressions still exist as plain strings, which cannot meet the requirements of analysis and optimization. Therefore, we need to design an IR for these expressions as well.

Looking back: after being processed by the Parser, SQL strings are transformed into an AST, and the expressions within them become Expr nodes, roughly like this:

pub enum Expr {
    ColumnRef(ColumnRef),
    Literal(Literal),
    Function(Function),
    BinaryOp(BinaryOp),
    UnaryOp(UnaryOp),
    Subquery(SelectStmt),
}

An expression is itself a recursive structure, and the AST's Expr node is also recursive. Can we lazily use the Expr node directly as part of our SQL IR? Let's give it a try first.

After replacing the strings with Expr, we obtain:

pub enum Operator {
    Get {
        table: String,
    },
    Select {
        child: Box<Self>,
        predicate: Expr,
    },
    Project {
        child: Box<Self>,
        projects: Vec<Expr>,
    },
}

Next, let's attempt some common analyses on the following SQL statement to see how well it works:

select a from t
where exists (select * from t1 where t.a = t1.a)
  1. Q: Which tables and columns does the Expr in Project depend on? A: It uses a column called a, but I don't know which table it belongs to; maybe this column doesn't even exist.
  2. Q: What is the return type of the Expr in Project? A: I don't know; there is no type information included in Expr.
  3. Q: Is the subquery in Select a correlated subquery? A: I don't know; the subquery in Expr is just an unprocessed AST.

Okay, it seems that Expr is not as useful as we imagined. In order to conduct the above analysis, we need to design a more informative IR. To distinguish it from Expr, we will name it ScalarExpr.

Summarizing the above analysis, our requirements for ScalarExpr are as follows:

  1. All identifiers must be resolved to fully qualified names.
  2. Type information must be injected and type checking performed.
  3. All subqueries must be transformed into SQL IR form.

Combining the above requirements, along with some desugaring, ScalarExpr would look something like this:

pub enum ScalarExpr {
    ColumnRef(Vec<Identifier>, Type),
    Literal(Value, Type),
    Function(Signature, Vec<Self>),
    Subquery(Quantifier, Box<Operator>),
}
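As a sketch of what requirement 1 (and part of requirement 2) involves, here is a hypothetical resolver that turns an unqualified column name into a fully qualified name plus its type, given the tables visible in the current scope. All names and the tiny `Type` enum here are assumptions for illustration, not the article's actual binder:

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
pub enum Type {
    Int,
    Text,
}

// Resolve an unqualified column against the tables visible in the current
// scope, producing a fully qualified name and its type. Returns None if the
// column is unknown or ambiguous (found in more than one table).
pub fn resolve_column(
    scope: &BTreeMap<String, Vec<(String, Type)>>, // table -> (column, type)
    column: &str,
) -> Option<(String, Type)> {
    let mut found = None;
    for (table, columns) in scope {
        for (name, ty) in columns {
            if name == column {
                if found.is_some() {
                    return None; // ambiguous reference
                }
                found = Some((format!("{table}.{name}"), ty.clone()));
            }
        }
    }
    found
}
```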

In this way, the design of the expression IR has taken shape as well. Let's now put the whole SQL IR together.

The IR

After the above design, we have:

  • Operator, a tree structure capable of flexibly expressing various SQL queries.
  • ScalarExpr, providing rich semantic information.

Some key operators are still missing, such as Join, Union, and Aggregate. However, since the overall framework is already clear, we can follow the same pattern and add them as well.

After integration, we have a pretty good SQL IR.

pub enum ScalarExpr {
    ColumnRef(Vec<Identifier>, Type),
    Literal(Value, Type),
    Function(Signature, Vec<Self>),
    Subquery(Quantifier, Box<Operator>),
}

pub enum Operator {
    Get {
        table: String,
    },
    Select {
        child: Box<Self>,
        predicate: ScalarExpr,
    },
    Project {
        child: Box<Self>,
        projects: Vec<ScalarExpr>,
    },
    Join {
        kind: JoinKind,
        condition: ScalarExpr,
        left: Box<Self>,
        right: Box<Self>,
    },
    UnionAll {
        left: Box<Self>,
        right: Box<Self>,
    },
    Aggregate {
        group_by: Vec<ScalarExpr>,
        aggr_exprs: Vec<ScalarExpr>,
        child: Box<Self>,
    },
}

Because it's so good, I've decided to give this IR an imposing name: The IR.

Property Derivation

When we want to analyze and optimize the IR, we always need to obtain certain properties of it. We could calculate these properties by writing an analyzer that traverses the entire IR, but that takes a lot of effort to maintain the state of the context each IR node resides in.

Fortunately, SQL, as a declarative query language over data flows, is quite simple, and we can exploit its features to calculate properties.

The data flow and the parent-child relationships between operators in The IR are closely related and can be presented as a DAG (directed acyclic graph), where all data flows from child nodes to parent nodes.


Under this characteristic, computing a property of an IR node is straightforward: recursively compute the property of each child node, then calculate the node's own property from those. We refer to this process as property derivation.

pub struct Property;

fn derive_property(op: &Operator) -> Property {
    // Calculate the properties of the child operators.
    let children_property: Vec<Property> = op
        .children()
        .iter()
        .map(|child| derive_property(child))
        .collect();

    // Calculate the property from the children's properties.
    op.calculate_property(&children_property)
}
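The sketch above leans on a children() accessor that the article has not defined. One plausible shape for that helper, with the enum trimmed to three variants so the snippet is self-contained, might look like this:

```rust
// A trimmed copy of the Operator enum, just for this self-contained sketch.
pub enum Operator {
    Get { table: String },
    Select { child: Box<Operator>, predicate: String },
    Join { left: Box<Operator>, right: Box<Operator> },
}

impl Operator {
    // Collect references to the child operators of this node.
    fn children(&self) -> Vec<&Operator> {
        match self {
            Operator::Get { .. } => vec![],
            Operator::Select { child, .. } => vec![child],
            Operator::Join { left, right } => vec![left, right],
        }
    }

    // Example of a derivation built on children(): count the operators in the
    // subtree rooted at this node.
    fn node_count(&self) -> usize {
        1 + self
            .children()
            .iter()
            .map(|child| child.node_count())
            .sum::<usize>()
    }
}
```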

In SQL optimization, commonly used properties fall into two categories: relational/logical properties, which describe the characteristics of a data set, and physical properties, which describe the physical characteristics of the data.

Common relational properties include:

  • Information about the attributes/columns contained in the dataset
  • Cardinality of the dataset, indicating the number of records in the dataset
  • Statistics, representing the data distribution of attributes
  • Constraints on attributes, such as NOT NULL
  • Functional dependencies, indicating the functional dependency relationships between attributes

Common physical properties include:

  • Order
  • Degree of parallelism (DOP)
  • Data distribution
  • Data partitioning
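To show how one of these properties is derived in practice, here is a hedged sketch of bottom-up cardinality estimation over a trimmed operator enum. The fixed 0.1 selectivity is a made-up placeholder, not a real estimator:

```rust
// A trimmed operator enum carrying base-table row counts, for illustration only.
pub enum Operator {
    Get { table: String, row_count: f64 },
    Select { child: Box<Operator> },
    Join { left: Box<Operator>, right: Box<Operator> },
}

// Derive the cardinality property bottom-up, mirroring derive_property.
pub fn derive_cardinality(op: &Operator) -> f64 {
    match op {
        Operator::Get { row_count, .. } => *row_count,
        // Placeholder assumption: every predicate keeps 10% of the rows.
        Operator::Select { child } => derive_cardinality(child) * 0.1,
        // Without further information, bound a join by the cross product.
        Operator::Join { left, right } => derive_cardinality(left) * derive_cardinality(right),
    }
}
```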

Combining these with relational algebra, we can describe how the two kinds of properties differ.

Assume there are relations R and S: the relational property of R is RP_R and its physical property is PP_R; the relational property of S is RP_S and its physical property is PP_S.

We can obtain: R = S implies RP_R = RP_S, but R = S does not imply PP_R = PP_S.

It isn't difficult to see that the equivalence of two relations determines the equivalence of their relational properties, while the equivalence of physical properties is not determined by the equivalence of the relations.

How to combine properties with specific query optimization algorithms will be discussed in detail in subsequent articles.

With property derivation, we can use theorems from relational algebra to optimize The IR while guaranteeing correctness.

So the next question is: what should a property look like?

Relational properties

The most important part of a relational property is the representation of attributes. In naive relational algebra, each relation is a set of tuples, and each attribute in a tuple has its own unique name. It's natural to consider directly using the tuple schema as the representation of attributes.

Let's first recall how a table is created.

In SQL, we use DDL (Data Definition Language) to create and manage tables. When creating a table, we need to specify its table schema, which contains the definition of each column in the table, equivalent to the attributes in relational algebra. The table schema might look something like this:

pub struct TableSchema {
    pub name: String,
    pub columns: Vec<ColumnDefinition>,
}

pub struct ColumnDefinition {
    pub name: String,
    pub column_type: Type,
    pub not_null: bool,
}

Since ColumnDefinition and attribute have a one-to-one correspondence, can we directly use ColumnDefinition to represent the attribute property?

We can try adding support for attributes in The IR first.

fn derive_attributes(op: &Operator) -> Vec<ColumnDefinition> {
    // Calculate the attributes of the child operators.
    let children_attributes: Vec<Vec<ColumnDefinition>> = op
        .children()
        .iter()
        .map(|child| derive_attributes(child))
        .collect();

    // Calculate attributes from the children's attributes.
    op.calculate_attributes(&children_attributes)
}

First, we need to modify The IR a little and add table schema information to the Get operator.

pub enum Operator {
    Get {
        table: String,
        schema: Vec<ColumnDefinition>,
    },
    // Nothing changed for the other variants
}

Then we implement attribute derivation for the Operator.

impl Operator {
    fn calculate_attributes(&self, children: &[Vec<ColumnDefinition>]) -> Vec<ColumnDefinition> {
        match self {
            Operator::Get { schema, .. } => schema.clone(),
            Operator::Select { .. } => children[0].clone(),
            Operator::Join { .. } => {
                let mut attributes = children[0].clone();
                attributes.extend(children[1].clone());
                attributes
            }
            Operator::UnionAll { .. } => children[0].clone(),

            Operator::Project { .. } => todo!(),
            Operator::Aggregate { .. } => todo!(),
        }
    }
}

Most of the operator implementations went smoothly, but Project and Aggregate are marked as todo. At this point we find that Project and Aggregate cannot derive their own attributes directly from their children's attributes.

Going back to relational algebra, the purpose of Project is to trim the shape of tuples or rename attributes. A SQL expression like SELECT a + 1 AS b FROM t cannot be expressed as a naive Project; as for Aggregate, it isn't even present in basic relational algebra, it's an extension to it.

The theory of relational algebra no longer holds!

Nevertheless, the project still needs to move forward. We need to introduce some village rules to broaden the definition of relational algebra. Here are the formal definitions of Project and Aggregate in The IR.

Project takes the attributes of relation R as input and outputs tuples consisting of n function mappings f_1 to f_n.

Aggregate groups the tuples of relation R by m attributes k_1 to k_m, applies n function mappings f_1 to f_n to each group, and finally outputs the grouped tuples.

The biggest change in these village rules is the introduction of derived columns. Columns that come directly from tables we call base table columns; columns calculated by Project/Aggregate we call derived columns.

Before introducing derived columns, we could be sure that all data sources eventually pointed to a Get operator. After their introduction, this convention is broken, and a concept similar to scope in programming languages emerges. We need to be more careful when optimizing.

With the village rules in place, we can implement attribute derivation for Project and Aggregate too. At the same time, we need to make some changes to the structure of The IR.

pub enum Operator {
    Project {
        child: Box<Self>,
        projects: Vec<(ScalarExpr, String)>,
    },
    // Others
}

impl Operator {
    fn calculate_attributes(&self, children: &[Vec<ColumnDefinition>]) -> Vec<ColumnDefinition> {
        match self {
            Operator::Project { projects, .. } => {
                let attributes: Vec<ColumnDefinition> = projects
                    .iter()
                    .map(|(expr, alias)| ColumnDefinition {
                        name: alias.clone(),
                        column_type: expr.column_type(),
                        // not_null is the negation of nullability.
                        not_null: !expr.nullable(),
                    })
                    .collect();

                attributes
            }
            Operator::Aggregate {
                group_by,
                aggr_exprs,
                ..
            } => {
                let mut attributes: Vec<ColumnDefinition> = group_by
                    .iter()
                    .map(|expr| ColumnDefinition {
                        name: expr.name(),
                        column_type: expr.column_type(),
                        not_null: !expr.nullable(),
                    })
                    .collect();

                attributes.extend(aggr_exprs.iter().map(|expr| ColumnDefinition {
                    name: expr.name(),
                    column_type: expr.column_type(),
                    not_null: !expr.nullable(),
                }));

                attributes
            }
            // Others
        }
    }
}

In this way, we can calculate the attributes property for all operators. Come and try it out!

First, let's look at the most common and effective optimization in SQL: predicate pushdown. It reduces the computational workload of other operators by pushing the Select operator down through them, while guaranteeing that the overall query result stays unchanged. Very concise and elegant.

Select push down

Let's try to implement this optimization on The IR. The idea is very simple: just swap the positions of Select and Project based on the corresponding relational algebra theorem. However, since we introduced derived columns, we must check whether the predicate of the Select depends on columns generated by the Project.

fn push_down_select_project(op: &Operator) -> Option<Operator> {
    match op {
        Operator::Select {
            child: project @ box Operator::Project { child, projects },
            predicate,
        } => {
            let project_attributes: Vec<ColumnDefinition> = derive_attributes(project);
            let predicate_used_columns: Vec<String> = predicate.used_columns();

            // Check if the predicate uses any column generated by the Project.
            let used_derived_columns = predicate_used_columns.iter().any(|used_column| {
                project_attributes
                    .iter()
                    .any(|attr| attr.name == *used_column)
            });

            if used_derived_columns {
                None
            } else {
                Some(Operator::Project {
                    child: Box::new(Operator::Select {
                        child: child.clone(),
                        predicate: predicate.clone(),
                    }),
                    projects: projects.clone(),
                })
            }
        }
        _ => None,
    }
}

It seems basically usable now, which is delightful. Let's try a more complex example, such as SQL with a Join:

Select push down


Because Join doesn't generate extra derived columns the way Project does, the checking logic is comparatively simpler. Let's first implement an optimization that attempts to push the Select down to the left child of the Join:

fn push_down_select_join_left(op: &Operator) -> Option<Operator> {
    match op {
        Operator::Select {
            child: join @ box Operator::Join { left, right, .. },
            predicate,
        } => {
            let left_attributes: Vec<ColumnDefinition> = derive_attributes(left);
            let predicate_used_columns: Vec<String> = predicate.used_columns();

            // Check if the predicate only uses columns from the left child.
            let only_left = predicate_used_columns
                .iter()
                .all(|used_column| left_attributes.iter().any(|attr| attr.name == *used_column));

            if only_left {
                Some(Operator::Join {
                    left: Box::new(Operator::Select {
                        child: left.clone(),
                        predicate: predicate.clone(),
                    }),
                    right: right.clone(),
                    ..join.clone()
                })
            } else {
                None
            }
        }
        _ => None,
    }
}

Everything looks great, but the devil often hides in the details. Let's look at the output of this example in PostgreSQL:

leiysky=# create table t(a int);
CREATE TABLE
leiysky=# create table t1(a int);
CREATE TABLE
leiysky=# insert into t values(1);
INSERT 0 1
leiysky=# insert into t1 values(1);
INSERT 0 1
leiysky=# select * from t, t1 where t.a = 1;
 a | a
---+---
 1 | 1
(1 row)

The final result has two attributes named a. In the current implementation of The IR, we cannot know which side the Select should be pushed down to: when we check which side the predicate on a can go, we find that both sides of the Join satisfy it. Although multiple columns with the same name are not allowed within one table, there is no such restriction across different tables.
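The failure mode can be shown in a few lines. This hypothetical helper mirrors the membership check inside push_down_select_join_left, stripped down to plain string column names:

```rust
// With unqualified string names, the pushdown check can match both join
// sides at once, so the rule cannot decide where the Select belongs.
pub fn pushable_sides(left: &[&str], right: &[&str], column: &str) -> (bool, bool) {
    let matches = |attrs: &[&str]| attrs.iter().any(|attr| *attr == column);
    (matches(left), matches(right))
}
```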

As the open-source database with the best support for ANSI SQL, PostgreSQL naturally handles this kind of problem very well. Through the EXPLAIN statement, we can see that it pushes the Select down to the correct place.

leiysky=# explain(verbose) select * from t, t1 where t.a = 1;
                              QUERY PLAN
----------------------------------------------------------------------
 Nested Loop  (cost=0.00..491.78 rows=33150 width=8)
   Output: t.a, t1.a
   ->  Seq Scan on public.t1  (cost=0.00..35.50 rows=2550 width=4)
         Output: t1.a
   ->  Materialize  (cost=0.00..41.94 rows=13 width=4)
         Output: t.a
         ->  Seq Scan on public.t  (cost=0.00..41.88 rows=13 width=4)
               Output: t.a
               Filter: (t.a = 1)
(9 rows)

As a perfect SQL IR, The IR must have its own solution too. If we observe this query carefully, we find that the predicate of the Select is represented with a qualified name. If an unqualified name is used, PostgreSQL throws an error:

leiysky=# select * from t, t1 where a = 1;
ERROR:  column reference "a" is ambiguous
LINE 1: select * from t, t1 where a = 1;

In the current context, a is ambiguous but t.a is not. Let's try using qualified names to represent the attribute property to resolve this problem. For this purpose, we need to make some code modifications.

pub struct QualifiedName(pub Vec<String>);

impl QualifiedName {
    /// Whether this name can be used to refer to another name, e.g. `t.a`
    /// can refer to `db.t.a`. One plausible rule: the shorter name's parts
    /// must match the tail of the longer name's parts.
    pub fn can_refer(&self, other: &Self) -> bool {
        let (short, long) = if self.0.len() <= other.0.len() {
            (&self.0, &other.0)
        } else {
            (&other.0, &self.0)
        };
        short.iter().rev().zip(long.iter().rev()).all(|(a, b)| a == b)
    }
}

pub struct ColumnDefinition {
    /// Use qualified name
    pub name: QualifiedName,
    pub column_type: Type,
    pub not_null: bool,
}

fn resolve_attribute(
    attributes: &[ColumnDefinition],
    name: &QualifiedName,
) -> Option<ColumnDefinition> {
    let candidates: Vec<&ColumnDefinition> = attributes
        .iter()
        .filter(|attr| attr.name.can_refer(name))
        .collect();

    if candidates.len() == 1 {
        Some(candidates[0].clone())
    } else if candidates.len() > 1 {
        panic!("Watch out, ambiguous reference found!")
    } else {
        None
    }
}

fn push_down_select_join_left(op: &Operator) -> Option<Operator> {
    match op {
        Operator::Select {
            child: join @ box Operator::Join { left, right, .. },
            predicate,
        } => {
            let left_attributes: Vec<ColumnDefinition> = derive_attributes(left);
            let predicate_used_columns: Vec<QualifiedName> = predicate.used_columns();

            // Check if the predicate only uses columns from the left child.
            let only_left = predicate_used_columns
                .iter()
                .all(|used_column| resolve_attribute(&left_attributes, used_column).is_some());

            if only_left {
                Some(Operator::Join {
                    left: Box::new(Operator::Select {
                        child: left.clone(),
                        predicate: predicate.clone(),
                    }),
                    right: right.clone(),
                    ..join.clone()
                })
            } else {
                None
            }
        }
        _ => None,
    }
}

In this way, the above problem is solved, and we have the ability to handle complex attribute references. However, there is still a long way to go before reaching a perfect solution. Let's take another example:

leiysky=# select * from (select * from t1) as t, t1 where t.a = 1;
 a | a
---+---
 1 | 1
(1 row)

Although SQL does not allow multiple identical table names in the same FROM clause, we can bypass this by using an inlined view or CTE. According to our current implementation, when processing t.a = 1 we will find two t1.a attributes instead of t.a, because we did not handle the alias of the inlined view. Therefore, we need to add a Project specifically for renaming attributes.

So the problem arises again: because we only renamed some columns and treated them as derived columns, this adds a lot of burden to our Select pushdown. Therefore, we must modify the definition of The IR and a good deal of related code to serve the mapping of names.

pub enum Operator {
    Project {
        child: Box<Self>,
        // (Expression, Source name, Alias)
        projects: Vec<(ScalarExpr, QualifiedName, QualifiedName)>,
    },
    // Others
}

These problems can be solved by writing a little more code, but take a look at the next example. I believe most people will go crazy just like I did:

leiysky=# select a from t natural join t1;
 a
---
 1
(1 row)

leiysky=# select t.a from t natural join t1;
 a
---
 1
(1 row)

leiysky=# select t1.a from t natural join t1;
 a
---
 1
(1 row)

leiysky=# select * from t natural join t1;
 a
---
 1
(1 row)

leiysky=# select a from t join t1 on t.a = t1.a;
ERROR:  column reference "a" is ambiguous
LINE 1: select a from t join t1 on t.a = t1.a;

Of course, we can add all sorts of strange restrictions to the code, creating difficult-to-maintain special cases to preserve this property, and then verify its correctness throughout optimization. But for lazy programmers, finding a simpler design is the better choice.

Welcome to the Deep Water Zone.

The IR made easy

The initial version of The IR was concise and elegant, but in order to achieve more functionality and support more complex requirements, we added a lot of information that we do not actually want to care about.

Generally, the ideal state of The IR should be:

  • Having a concise algebraic structure
  • Operator nodes being completely independent from each other
  • Not having to deal with names (except for debugging and display purposes)

Let's take a moment to reflect: does the IR really depend on names? We initially used names to represent attributes mostly out of intuition, reusing the table schema. However, a lot of useless information is embedded in a name, and it is of no help to our optimization. It is similar to the symbol names in programming languages, which eventually become memory addresses and register numbers during program execution.

Without names, attributes cannot be distinguished. But is a name really such an inconvenient thing?

Is Name such an inconvenient thing?

NOTE: the above picture means "Is a name really such an inconvenient thing?"

Ultimately, what we need is to assign a unique id to each attribute, whether it is an integer or a string. Our sole purpose is to distinguish and reference attributes through these ids. All name resolution will be handled during AST lowering; I only want the attribute id!

After the redesign, we have changed the way attributes are represented and also made some modifications to The IR's definition. By default, we use int64 as the attribute id type.

pub type Id = i64;

pub struct Attribute {
    pub id: Id,
    pub column_type: Type,
    pub nullable: bool,
}

pub enum ScalarExpr {
    ColumnRef(Id),
    // Others
}

The design of an id usually cannot be separated from its context. In SQL IR, the common design approaches for attribute ids fall into two categories:

  • One is based on the abstraction of tuple attributes that we have used before, taking the index of the attribute within the tuple as its id. We call this kind of id a local id. The characteristic of this design is that the id of the same logical attribute changes with the operator it belongs to. The advantage is that ids can be inferred from the operator tree without relying on external state for maintenance. A disadvantage, however, is that frequent remapping of ids is required when transforming operators.
  • The other method is to maintain a global id generator and assign a unique id to every attribute in the SQL IR. We call this kind of id a global id. The advantage of this design is that it decouples attributes from the tuple schema and allows them to be represented with an unordered collection structure like HashMap<Id, Attribute>. It also supports property derivation through set operations and reduces maintenance complexity. A disadvantage, however, is that operator trees using global ids depend on external state and cannot exist independently.

Choosing between these two designs has a significant impact on the actual implementation of the optimizer.

For example, consider this optimization:

Split disjunction

When suitable indexes are available, this optimization can avoid full table scans and improve performance.

With the local id design, implementing this optimization is very simple: just copy the entire operator tree and finally connect the copies with UnionAll.

But with the global id design, this is a non-trivial operation; it can even be called very painful. In order to distinguish the different attributes, we must generate new ids for all attributes while copying the operator tree, and then replace every place that references those attributes with the new ids. This causes a lot of trouble when the query is more complex.

Another example is optimizing join order:

Commute join

According to the commutative law of Join operators, we can legally swap the left and right children of a Join.

With the global id design, because attributes can be represented as an unordered set, this operation has no impact on property derivation.

However, with the local id design, this operation becomes extremely painful.

Apart from optimization-related aspects, there are also significant differences in how correlated subqueries are represented. A correlated subquery is a special kind of subquery that can access attributes outside its own scope. Accessing such a special attribute is called an outer reference.

Correlated subquery

Many programming languages support a similar mechanism, allowing a function to access variables that are not defined within it by binding them to a specific environment. Such a function is called a closure.

fn main() {
    let a = 1;
    let f = || {
        let b = a; // a is captured from outside
        println!("{}", b);
    }; // f is a closure

    f(); // stdout: 1
}

With the global id design, we can determine whether a subquery is correlated through attribute property calculation alone. With the local id design, however, we generally need to maintain an additional scope id in the ColumnRef of the scalar expression, which is very cumbersome to implement.

Correlated subqueries are a very big topic, and we may discuss them in later articles.

As we can see, both designs have their own advantages and disadvantages, and in engineering practice we need to choose the one that suits our needs. Personally, I think global id is the better design because it solves the common problems more easily.

After the transformation to global ids, the code of The IR can be greatly simplified.

pub type Id = i64;

pub struct Context {
    pub id_gen: Id,
}

pub struct Attribute {
    pub id: Id,
    pub column_type: Type,
    pub nullable: bool,
}

pub type AttributeSet = HashMap<Id, Attribute>;

pub enum ScalarExpr {
    ColumnRef(Id),
    Literal(Value, Type),
    Function(Signature, Vec<Self>),
    Subquery(Quantifier, Box<Operator>),
}

pub enum Operator {
    Get {
        table: String,
        output_columns: AttributeSet,
    },
    Select {
        child: Box<Self>,
        predicate: ScalarExpr,
    },
    Project {
        child: Box<Self>,
        projects: Vec<(ScalarExpr, Id)>,
    },
    Join {
        kind: JoinKind,
        condition: ScalarExpr,
        left: Box<Self>,
        right: Box<Self>,
    },
    UnionAll {
        left: Box<Self>,
        right: Box<Self>,
    },
    Aggregate {
        group_by: Vec<ScalarExpr>,
        aggr_exprs: Vec<(ScalarExpr, Id)>,
        child: Box<Self>,
    },
}

Having moved the complexity into the AST lowerer, we can confidently say that The IR is now a production-ready SQL IR. It supports all SQL operations and common optimizations, has a user-friendly API, and is also very easy to understand.

What matters more is that no one understands The IR better than the readers of this article, and any reader can easily extend The IR to fit their own needs.

Afterword

Finally, we have reached the end of this article.

As the opening of the series, this article simply discussed some focal points of SQL IR design without delving into the details of the various algorithms.

Still, sharing the design process of an IR is an interesting thing. Understanding an IR is like understanding why a roadside tree grows crooked. To someone seeing it for the first time, the tree's unusual shape is puzzling. But locals who have lived around it know its backstory: when it was young, the tree became bent because of the habit of hanging preserved meat on its branches.

This little thing is an important reason for the final result, but it is too insignificant for those who know it to share voluntarily; and of course, in reality, hardly anyone cares about the reasons behind it anyway.

Database development is a niche field with many engineering practices and experiences. These experiences rarely circulate, and I do not want them to disappear with the changing times like America's moon landing technology; hence my original intention to write this series of articles.

In the next article, I will share some content about optimizer architecture; please stay tuned.

Source Link
