Query Engines: Push vs. Pull


26 Apr 2021

People talk a lot about "pull" vs. "push" based query engines, and it's
pretty obvious what that means colloquially, but some of the
details can be a bit hard to figure out.

Important people have clearly thought hard about this distinction, judging by this paragraph from Snowflake's SIGMOD paper:

Push-based execution refers to the fact that relational operators push their
results to their downstream operators, rather than waiting for those
operators to pull data (classic Volcano-style model). Push-based execution
improves cache efficiency, because it removes control flow logic from tight
loops. It also enables Snowflake to efficiently process DAG-shaped plans, as
opposed to just trees, creating additional opportunities for sharing and
pipelining of intermediate results.

And…that's all they really have to say on the matter.
It leaves me with two major unanswered questions:

  1. Why does a push-based system "enable Snowflake to efficiently process DAG-shaped plans" in a way not supported by a pull-based system, and who cares? (DAG stands for directed acyclic graph.)
  2. Why does this improve cache efficiency, and what does it mean to "remove control flow logic from tight loops"?

In this post, we're going to talk about some of the philosophical differences
between how pull- and push-based query engines work, and then talk about the
practical reasons you might prefer one over the other, guided by
the questions we're trying to answer.

Consider this SQL query.

SELECT DISTINCT customer_first_name FROM customer WHERE customer_balance > 0

Query planners typically compile a SQL query like this into a sequence of
discrete operators:

A simple query plan

Distinct
<- Map(customer_first_name)
<- Select(customer_balance > 0)
<- customer

In a pull-based system, consumers drive the system. Each operator
produces a row when asked for it: the user will ask the root node
(Distinct) for a row, which will ask Map for a row, which will ask
Select for a row, and so on.

In a push-based system, producers drive the system. Each operator, when it has some data,
tells its downstream operators about it. customer, being a base table in this query,
will tell Select about all of its rows, which will cause it to tell Map about its rows, and so on.

Let's start by building a very simple implementation of each kind of query engine.

A basic pull-based query engine

A pull-based query engine is also often said to use the Volcano or Iterator model.
This is the oldest and most well-known query execution model, and is named for the paper
that standardized its conventions, Goetz Graefe's "Volcano - An Extensible and Parallel Query Evaluation System".

First, we'll start with a relation and a way to turn that into an iterator:

let customer = [
  { id: 1, firstName: "justin", balance: 10 },
  { id: 2, firstName: "sissel", balance: 0 },
  { id: 3, firstName: "justin", balance: -3 },
  { id: 4, firstName: "smudge", balance: 2 },
  { id: 5, firstName: "smudge", balance: 0 },
];

function* Scan(coll) {
  for (let x of coll) {
    yield x;
  }
}

Once we have our hands on an iterator, we can repeatedly ask it for its next element.

let iterator = Scan(customer);
console.log(iterator.next());
console.log(iterator.next());
console.log(iterator.next());
console.log(iterator.next());
console.log(iterator.next());
console.log(iterator.next());
This outputs:

{ value: { id: 1, firstName: 'justin', balance: 10 }, done: false }
{ value: { id: 2, firstName: 'sissel', balance: 0 }, done: false }
{ value: { id: 3, firstName: 'justin', balance: -3 }, done: false }
{ value: { id: 4, firstName: 'smudge', balance: 2 }, done: false }
{ value: { id: 5, firstName: 'smudge', balance: 0 }, done: false }
{ value: undefined, done: true }

We can then create some operators that transform one iterator into another.

function* Select(p, iter) {
  for (let x of iter) {
    if (p(x)) {
      yield x;
    }
  }
}

function* Map(f, iter) {
  for (let x of iter) {
    yield f(x);
  }
}

function* Distinct(iter) {
  let seen = new Set();
  for (let x of iter) {
    if (!seen.has(x)) {
      seen.add(x);
      yield x;
    }
  }
}

Then we can translate our original query:

SELECT DISTINCT customer_first_name FROM customer WHERE customer_balance > 0

into this:

console.log([
  ...Distinct(
    Map(
      (c) => c.firstName,
      Select((c) => c.balance > 0, Scan(customer))
    )
  ),
]);

which outputs, as expected:

[ 'justin', 'smudge' ]

A basic push-based query engine

A push-based query engine, sometimes called the Reactive, Observer, Stream, or
callback hell model, is, as you might expect, like our previous example, but
turned on its head.

Let's start by defining an appropriate Scan operator.

let customer = [
  { id: 1, firstName: "justin", balance: 10 },
  { id: 2, firstName: "sissel", balance: 0 },
  { id: 3, firstName: "justin", balance: -3 },
  { id: 4, firstName: "smudge", balance: 2 },
  { id: 5, firstName: "smudge", balance: 0 },
];

function Scan(relation, out) {
  for (let r of relation) {
    out(r);
  }
}

We model "this operator tells a downstream operator about a row" as a closure that it calls.

Scan(customer, (r) => console.log("row:", r));

Which outputs:

row: { id: 1, firstName: 'justin', balance: 10 }
row: { id: 2, firstName: 'sissel', balance: 0 }
row: { id: 3, firstName: 'justin', balance: -3 }
row: { id: 4, firstName: 'smudge', balance: 2 }
row: { id: 5, firstName: 'smudge', balance: 0 }

We can define the rest of our operators similarly:

function Select(p, out) {
  return (x) => {
    if (p(x)) out(x);
  };
}

function Map(f, out) {
  return (x) => {
    out(f(x));
  };
}

function Distinct(out) {
  let seen = new Set();
  return (x) => {
    if (!seen.has(x)) {
      seen.add(x);
      out(x);
    }
  };
}

Our query is now written:

let result = [];
Scan(
  customer,
  Select(
    (c) => c.balance > 0,
    Map(
      (c) => c.firstName,
      Distinct((r) => result.push(r))
    )
  )
);
console.log(result);


Outputting, as expected,

[ 'justin', 'smudge' ]


In a pull-based system, the operators sit idle until someone asks them for a row.
This means it's obvious how to get results out of the system: you ask it
for a row and it gives one to you.
However, it also means that the behaviour of the system is very tightly coupled to its consumers;
it does work when asked to, and not otherwise.

In the push-based system, the system sits idle until someone tells it about a row.
Thus, the work the system does and the consumption of its results are decoupled.

You might have noticed that, compared to our pull-based system, in our
push-based system above we had to do an odd dance of creating a buffer
(result) that we told the query to shove its results into.
This is how push-based systems tend to feel: they don't exist in
relation to a designated consumer, they sort of just exist, and when things happen,
they do stuff in response.

DAG, yo

Let's return to our first major question:

Why does a push-based system "enable Snowflake to efficiently process DAG-shaped plans" in a way not supported by a pull-based system, and who cares?

By "DAG-shaped plans" they mean operators that output their data to multiple downstream operators.
This turns out to be more useful than it sounds, even in the context of SQL, which we
typically think of as inherently tree-structured.

SQL has a construct called WITH that allows users to reference a result set multiple times in a query.
This means the following query is valid SQL:

WITH foo AS (<some complex query>)
SELECT * FROM
  (SELECT * FROM foo WHERE c) AS foo1
JOIN
  foo AS foo2
ON foo1.a = foo2.b

Which has a query plan that looks something like this:

A DAG query plan

Outside of this particular example, a smart query planner can often make use of
DAG-ness to reuse results.
For instance, Jamie Brandon has a post
describing a general strategy for decorrelating subqueries in SQL that makes extensive use
of DAG query plans in order to be efficient.
Because of all this, it's valuable to be able to handle
these cases without simply duplicating a branch of the plan tree.

There are two main problems that make this difficult in a pull model: scheduling and lifetimes.


Scheduling

In a setting where every operator has exactly one output, when to run an
operator to produce some output is obvious: when your consumer needs it.
This becomes, at the very least, messier with multiple outputs, since "requests for
rows" and "computations to produce rows" are no longer one-to-one.

By comparison, in a push-based system, scheduling of operators was never tied
to their outputs in the first place, so losing that one-to-one correspondence makes no difference.
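To make this concrete, here is a minimal sketch of fan-out in the push model, using the closure-style operators from above plus a hypothetical Tee helper (my own illustration, not from the Snowflake paper): forwarding a row to two downstream operators is just calling two closures, with no extra scheduling machinery.

```javascript
// A hypothetical Tee operator for the push model: a DAG node is just a
// closure that calls both of its downstream closures.
function Tee(out1, out2) {
  return (x) => {
    out1(x);
    out2(x);
  };
}

function Scan(relation, out) {
  for (let r of relation) {
    out(r);
  }
}

let customer = [
  { id: 1, firstName: "justin", balance: 10 },
  { id: 2, firstName: "sissel", balance: 0 },
];

let names = [];
let balances = [];
// One scan feeds two consumers; the producer never needs to know.
Scan(
  customer,
  Tee(
    (c) => names.push(c.firstName),
    (c) => balances.push(c.balance)
  )
);
console.log(names);    // [ 'justin', 'sissel' ]
console.log(balances); // [ 10, 0 ]
```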


Lifetimes

The other tricky thing with DAGs in a pull-based model is that an operator in
such a system is at the mercy of its downstream operators: a row that might
be read in the future by any of its consumers must be kept around (or must be
able to be re-computed).
One general solution is for an operator to buffer all of the rows
it outputs so it can hand them out again, but introducing potentially
unbounded buffering at every operator boundary is undesirable (though it is, by
necessity, what Postgres and CockroachDB do for a WITH clause with multiple consumers).
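For contrast, here is a sketch of what that buffering workaround can look like in the pull model, with a hypothetical Tee that splits one iterator into two (an illustration of the problem, not how Postgres implements it): every row must be held for the slower consumer, however far behind it is.

```javascript
// A hypothetical pull-model Tee: each output gets its own iterator, and
// every row from the source is buffered until both outputs have read it.
function Tee(iter) {
  let buffers = [[], []];
  function* output(n) {
    for (;;) {
      if (buffers[n].length > 0) {
        yield buffers[n].shift();
      } else {
        let { value, done } = iter.next();
        if (done) return;
        // The row must be saved for BOTH outputs, however far behind
        // their consumers are: this is the potentially unbounded buffering.
        buffers[0].push(value);
        buffers[1].push(value);
        yield buffers[n].shift();
      }
    }
  }
  return [output(0), output(1)];
}

function* Scan(coll) {
  for (let x of coll) yield x;
}

let [a, b] = Tee(Scan([1, 2, 3]));
let aRows = [...a]; // by now every row is also sitting in b's buffer
let bRows = [...b];
console.log(aRows); // [ 1, 2, 3 ]
console.log(bRows); // [ 1, 2, 3 ]
```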

This isn't as much of a problem in a push-based system: because operators
now drive when their consumers process a row, they can effectively force
them to take ownership of that row and deal with it.
In most cases, this will either result in some kind of essential buffering that would
have been needed even in the absence of a DAG (say, for a Distinct or hash join
operation), or the row will simply be processed and passed on immediately.

Cache Efficiency

Now let's talk about the second question.


Why does this improve cache efficiency, and what does it mean to "remove control flow logic from tight loops"?

To start with, the Snowflake paper cites a
paper by Thomas Neumann ("Efficiently Compiling Efficient Query Plans for Modern Hardware")
in support of this claim.
I don't really think that paper supports the claim in isolation, though;
if I had to sum it up, it's more like,
"we want to compile queries to machine code in service of improved
cache efficiency, and to that end, a push-based paradigm is preferable."
The paper is very interesting and I recommend you give it a read, but it seems to me that
its conclusions don't really apply unless you're starting from a position of wanting to
compile your queries using something like LLVM (which, from some cursory
research, it's not clear to me whether Snowflake does).

In doing research for this section I found a paper
by Shaikhha, Dashti, and Koch ("Push versus Pull-Based Loop Fusion in Query
Engines") that does a great job of highlighting some
of the strengths and weaknesses of each model.
In fact, they reference the Neumann paper:

More recently, an operator chaining model has been proposed that shares the
advantage of avoiding materialisation of intermediate results but which
reverses the control flow; tuples are pushed forward from the source
relations to the operator producing the final result. Recent papers
seem to suggest that this push-model consistently leads to better query
processing performance than the pull model, even though no direct, fair
comparisons are provided.

One of the main contributions of this paper is to debunk this myth. As we
show, if compared fairly, push and pull based engines have very similar
performance, with individual strengths and weaknesses, and neither is a clear
winner. Push engines have in essence only been considered in the context of
query compilation, conflating the potential advantages of the push paradigm
with those of code inlining. To compare them fairly, one has to decouple these
aspects.

They conclude that there's no clear winner here, but note that
compiling a push-based query makes for simpler code.
The main idea is that it turns out to be extremely easy to unroll a
synchronous, push-based query into the equivalent code you'd write by hand.
Take our query from before:

let result = [];
Scan(
  customer,
  Select(
    (c) => c.balance > 0,
    Map(
      (c) => c.firstName,
      Distinct((r) => result.push(r))
    )
  )
);


This very naturally unrolls to:

let result = [];
let seen = new Set();
for (let c of customer) {
  if (c.balance > 0) {
    if (!seen.has(c.firstName)) {
      seen.add(c.firstName);
      result.push(c.firstName);
    }
  }
}


If you try to unroll the equivalent pull-based query, you'll find the resulting code is much less natural.
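As a rough illustration, here is one way to hand-unroll the pull version (a sketch; the generators above hide this scaffolding for us): the iterator protocol's explicit cursor state and per-row re-entry survive into the unrolled code.

```javascript
let customer = [
  { id: 1, firstName: "justin", balance: 10 },
  { id: 2, firstName: "sissel", balance: 0 },
  { id: 3, firstName: "justin", balance: -3 },
  { id: 4, firstName: "smudge", balance: 2 },
  { id: 5, firstName: "smudge", balance: 0 },
];

// Unrolling the pull version means writing out the iterator protocol by
// hand: an explicit cursor, and a next() that is re-entered once per row
// rather than making a single pass over the data.
let i = 0;
let seen = new Set();
function next() {
  while (i < customer.length) {
    let c = customer[i++];
    if (!(c.balance > 0)) continue;
    let name = c.firstName;
    if (seen.has(name)) continue;
    seen.add(name);
    return { value: name, done: false };
  }
  return { value: undefined, done: true };
}

let result = [];
for (let r = next(); !r.done; r = next()) {
  result.push(r.value);
}
console.log(result); // [ 'justin', 'smudge' ]
```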

I think it's hard to come to any real conclusions about which is "better" based on this, and I think
the most sensible thing is to make decisions based on the needs of any particular
query engine.


Impedance Mismatch

One problem that can come up with these systems is a mismatch at their boundaries.
Crossing a boundary from a pull system to a push system requires polling its state, and
crossing a boundary from a push system to a pull system requires materializing its state.
Neither of these is a dealbreaker, but both incur some cost.
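A tiny sketch of the push-to-pull direction, reusing the push-style Scan from earlier (the plain array buffer here stands in for a real materialization layer):

```javascript
function Scan(relation, out) {
  for (let r of relation) {
    out(r);
  }
}

let customer = [
  { id: 1, firstName: "justin", balance: 10 },
  { id: 2, firstName: "sissel", balance: 0 },
];

// Crossing push -> pull: run the push side to completion into a buffer...
let buffer = [];
Scan(customer, (r) => buffer.push(r.firstName));

// ...and only then can a pull-style consumer iterate at its own pace.
let iter = buffer[Symbol.iterator]();
console.log(iter.next()); // { value: 'justin', done: false }
console.log(iter.next()); // { value: 'sissel', done: false }
```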

This is why in a streaming system like Flink or Materialize you'll typically
see push-based designs used: the inputs to such a system are inherently
push-based, since you're listening to incoming Kafka streams, or something similar.

In a streaming setting, if you want your end consumer to actually be able to
interact with the system in a pull-based way (say, by running queries against
it when it needs to), you need to introduce some kind of materialization
layer where you build an index out of the results.

Conversely, in a system that doesn't expose some kind of streaming/tailing mechanism,
if you want to know when some data has changed, your only option is to poll
it periodically.


Some algorithms are simply not appropriate for use in a push system.
As discussed in the Shaikhha paper, the merge join algorithm fundamentally
works by traversing two iterators in lockstep, which isn't
practical in a push system, where the consumer has little-to-no control.
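To see why, here is a minimal pull-style merge join sketch (my own illustration, ignoring duplicate join keys for brevity): it works only because the consumer can decide, row by row, which of the two iterators to advance, which is exactly the control a push system's consumer gives up.

```javascript
// Merge join of two inputs already sorted on the join key: on each step we
// pull from whichever side is behind, which requires advancing each
// iterator at will.
function* MergeJoin(key, leftIter, rightIter) {
  let l = leftIter.next();
  let r = rightIter.next();
  while (!l.done && !r.done) {
    if (key(l.value) < key(r.value)) {
      l = leftIter.next();
    } else if (key(l.value) > key(r.value)) {
      r = rightIter.next();
    } else {
      yield [l.value, r.value];
      l = leftIter.next();
      r = rightIter.next();
    }
  }
}

function* Scan(coll) {
  for (let x of coll) yield x;
}

let left = [{ id: 1 }, { id: 3 }, { id: 4 }];
let right = [{ id: 3 }, { id: 4 }, { id: 5 }];
let joined = [...MergeJoin((x) => x.id, Scan(left), Scan(right))];
console.log(joined);
// [ [ { id: 3 }, { id: 3 } ], [ { id: 4 }, { id: 4 } ] ]
```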

Similarly, LIMIT operators can be problematic in the push model.
Short of introducing bidirectional communication, or fusing the LIMIT into
the underlying operator (which isn't always possible), the producing
operators can't know that they can stop doing work once their consumer has been satisfied.
In a pull system this isn't a problem, since the consumer can simply stop
asking for more results once it doesn't need any more.
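In the pull model, by contrast, LIMIT is a tiny operator in the style of the generators from earlier (a sketch, with a Counting wrapper added just to observe how much work upstream does):

```javascript
function* Scan(coll) {
  for (let x of coll) yield x;
}

// A pull-style Limit: once n rows have been yielded, stop pulling from the
// input entirely, so upstream operators never run again.
function* Limit(n, iter) {
  if (n <= 0) return;
  for (let x of iter) {
    yield x;
    if (--n <= 0) return;
  }
}

// Instrumentation only: count how many rows are pulled from upstream.
let pulled = 0;
function* Counting(iter) {
  for (let x of iter) {
    pulled++;
    yield x;
  }
}

let limited = [...Limit(2, Counting(Scan([10, 20, 30, 40, 50])))];
console.log(limited); // [ 10, 20 ]
console.log(pulled);  // 2 — upstream stopped as soon as the consumer did
```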


Without going into too much detail, having not just DAGs but fully cyclic graphs in either
of these models is nontrivial, but the most famous system to solve this is
Naiad: A Timely Dataflow System,
whose modern incarnation is
Timely Dataflow.
Both of these are push systems, and as with DAGs, many problems here just
work better in a push model.


Introductory database materials overwhelmingly focus on the iterator model, but
modern systems, especially analytic ones, are starting to explore the push model more.
As noted in the Shaikhha paper, it's hard to find apples-to-apples comparisons, since
a lot of the migration to push models is motivated by a desire to compile
queries to lower-level code, and the benefits that come from that cloud the results.

Despite that, there are some qualitative differences that make each model appropriate in
different scenarios, and if you're interested in databases it's worth having a general idea
of how they both work.
In the future I'd like to go into more detail about how these systems are built and
try to expose some of the magic that makes them work.
