Question finest practices: When must you use the IN as a substitute of the OR operator?
The unique concept of a declarative question language like SQL is that you just ask a database administration system (DBMS) to provide the reply you need it to and never the way to compute it. It’s as much as the DBMS’s question optimizer (aka planner, compiler) to find out probably the most environment friendly execution plan for a SQL question. Ideally, the DBMS ought to select the identical optimum plan in the event you ask the identical query utilizing completely different SQL instructions.
However this isn’t all the time the case, particularly in beloved open-source DBMSs like PostgreSQL. Question efficiency usually varies relying on the way you write queries. We now have seen some OtterTune clients the place a easy change of SQL results in a exceptional 10,000x speedup.
To higher perceive this drawback, on this article, we give attention to queries using the IN and OR operators to compute a disjunction of their WHERE clauses. Beneath are two instance queries, each yielding an identical outcomes, but their efficiency differs wildly:
SELECT * FROM merchandise WHERE id IN (1, 2, 3)
SELECT * FROM merchandise WHERE id = 1 OR id = 2 OR id = 3
We’ll analyze the efficiency outcomes obtained from our experiments in PostgreSQL, evaluating IN operator and OR operator queries below varied situations. By exploring the explanations behind these efficiency variations, our aim is to suggest finest practices for writing environment friendly queries.
Experiment setup
We use the next instance desk containing merchandise info. Every row within the desk consists of the merchandise identify, worth, amount, and a novel ID serving as the first key. We additionally create a composite index on (identify, amount).
Right here is the DDL instructions to create this desk and the INSERT assertion to populate it with 10m tuples of random knowledge:
CREATE TABLE merchandise (
id INT PRIMARY KEY,
identify VARCHAR(32),
worth INT,
amount INT
);
CREATE INDEX idx_item_name_quantity ON merchandise (identify, amount);
INSERT INTO merchandise
SELECT id,
SUBSTR(MD5(RANDOM()::textual content), 0, 32) AS identify,
(RANDOM() * 10000)::int AS worth,
(RANDOM() * 10000)::int AS amount
FROM generate_series(1, 10000000) AS t(id)
We used a PostgreSQL v14 database on a db.t3.medium Amazon RDS occasion outfitted with 2 CPUs and 4GB RAM. The storage capability is 200GB (gp2 storage). All experiments are carried out with PostgreSQL’s Just-in-Time Compilation (JIT) function enabled to compile predicates into machine code utilizing LLVM. We conduct three consecutive runs for every question and report the common latency.
We use OtterTune’s AI-powered service to optimize PostgreSQL’s configuration knobs (e.g., AWS’s ParameterGroup). We additionally pre-load all the desk into reminiscence utilizing pg_prewarm:
CREATE EXTENSION pg_prewarm;
SELECT pg_prewarm('merchandise')
Single attribute filters
We first look at queries that reference a single attribute of their WHERE clause. These characterize the vast majority of the queries that we see in OtterTune’s automated query tuning function.
Single listed attribute
We measure PostgreSQL’s efficiency on queries that retrieve a number of merchandise information on an listed attribute. We first use a single IN operator within the question’s WHERE clause after which run the identical lookups utilizing a number of OR clauses. We run a number of variations of every question with an growing variety of id lookups.
SELECT * FROM merchandise WHERE id IN (?); –- IN Clause
SELECT * FROM merchandise WHERE id = ? OR id = ? OR ... ;
The chart beneath reveals the efficiency outcomes. When the variety of predicates within the filter situation is small, the efficiency distinction between the IN clause and OR clauses is negligible. Nevertheless, because the variety of predicates per question will increase, PostgreSQL executes the IN clause question a lot sooner than the question with the OR clauses. Once more, each queries are computing the identical reply on the identical desk so the run instances ought to be the identical. However the efficiency hole between the 2 queries expands with a bigger variety of predicates.
At 5000 predicates, the IN clause question is roughly 69x sooner than the OR clause (7.56ms vs. 521.19ms). To know why that is the case, we look at the question plans for each the IN clause and OR clause utilizing PostgreSQL’s EXPLAIN function. For brevity, we think about the queries when utilizing solely three predicates:
ottertune=> EXPLAIN SELECT * FROM merchandise WHERE id IN (1,2,3);
QUERY PLAN
Index Scan utilizing item_pkey on merchandise (price=0.43..17.36 rows=3 width=44)
Index Cond: (id = ANY ('{1,2,3}'::integer[]))
ottertune=> EXPLAIN SELECT * FROM merchandise WHERE id = 1 OR id = 2 OR id = 3;
QUERY PLAN
Bitmap Heap Scan on merchandise (price=12.90..24.52 rows=3 width=70)
Recheck Cond: ((id = 1) OR (id = 2) OR (id = 3))
Heap Blocks: actual=1
-> BitmapOr (price=12.90..12.90 rows=3 width=0)
-> Bitmap Index Scan on item_pkey (price=0.00..4.30 rows=1 width=0)
Index Cond: (id = 1)
-> Bitmap Index Scan on item_pkey (price=0.00..4.30 rows=1 width=0)
Index Cond: (id = 2)
-> Bitmap Index Scan on item_pkey (price=0.00..4.30 rows=1 width=0)
Index Cond: (id = 3)
The IN clause question makes use of PostgreSQL’s Index Scan entry technique to retrieve knowledge by way of an index. The DBMS scans the first key index (item_pkey) to seek out which rows fulfill the question situation (i.e., id is any worth in {1,2,3}) after which retrieves the matching rows from the desk.
The OR clause question makes use of PostgreSQL’s Bitmap Index Scan entry technique to scan the index and assemble a bitmap to trace matching rows. Every entry within the bitmap represents a row place within the desk and signifies whether or not that row satisfies the question circumstances because the DBMS scans the desk’s major key index (item_pkey).
After populating the bitmap, the DBMS retrieves the desk pages containing matching information. The DBMS creates three bitmaps for every predicate within the disjunction clause. That’s, PostgreSQL creates a bitmap to trace (id = 1), one other bitmap for (id = 2), and a 3rd one for (id = 3). It then performs three index scans to populate every bitmap. However sustaining separate bitmaps is pointless as a result of id is the first key, which signifies that every predicate will match one and just one tuple. PostgreSQL then combines these bitmaps utilizing a bitwise OR operation.
The variety of bitmaps that PostgreSQL makes use of for these OR clause queries will increase proportionally with the variety of predicates within the WHERE clause. For instance, when there are 5000 OR operations, the DBMS creates 5000 bitmaps, which causes a substantial slowdown in comparison with the IN clause that solely requires one index scan. Because the variety of circumstances will increase, the DBMS OR clause incurs bigger overhead in bitmap creation and mixture, leading to a notable efficiency hole in comparison with the IN clause.
IN vs. OR: single unindexed attribute
We subsequent examined queries that filtered based mostly on a single unindexed attribute (worth). We once more do the identical comparability the place we generate two variations of the identical question: one utilizing a single IN clause and one utilizing a number of OR clauses with equality predicates. We then scale up the variety of comparisons per question.
SELECT * FROM merchandise WHERE worth IN (?); –- IN Clause
SELECT * FROM merchandise WHERE worth = ? OR worth = ? OR worth = ? OR ... ;
As illustrated beneath, as soon as once more the IN clause queries outperforms the a number of OR clauses variants. The distinction is way bigger than the queries on the listed attribute above and the hole widens much more with a rise within the variety of predicates. With 5000 predicates, PostgreSQL executes the IN clause question roughly 288x sooner than the OR clause (1.8 seconds vs. 518.2 seconds).
Let’s additionally examine two question plans to watch their variations:
ottertune=> EXPLAIN SELECT * FROM merchandise WHERE worth IN (1,2,3);
QUERY PLAN
Collect (price=1000.00..152048.39 rows=2988 width=44)
Staff Deliberate: 2
-> Parallel Seq Scan on merchandise (price=0.00..150749.59 rows=1245 width=44)
Filter: (worth = ANY ('{1,2,3}'::integer[]))
ottertune=> EXPLAIN SELECT * FROM merchandise WHERE worth = 1 OR worth = 2 OR worth = 3;
QUERY PLAN
Collect (price=1000.00..167673.27 rows=2987 width=44)
Staff Deliberate: 2
-> Parallel Seq Scan on merchandise (price=0.00..166374.57 rows=1245 width=44)
Filter: ((worth = 1) OR (worth = 2) OR (worth = 3))
In contrast to within the earlier experiment the place the 2 question plans are completely different (Index Scan vs. Bitmap Index Scan), this time the plans are the identical. As proven above, each queries execute Sequential Scans that learn each web page within the desk and consider whether or not the value satisfies the filtering circumstances on a row-by-row foundation. Nevertheless, the style by which PostgreSQL performs the worth comparability for filtering differs between the 2 plans, which explains the big efficiency benefit of IN versus OR.
For the OR clause question, the DBMS sequentially compares every situation’s worth one after the other. Within the given instance, it performs three comparisons for every row to find out if (worth = 1), (worth = 2), after which (worth = 3). This analysis method signifies that for N predicates, the complexity of the filtering operation per row is O(N).
However, with the IN clause, PostgreSQL builds a brief hash desk populated with the weather within the question’s IN clause (ExecEvalHashedScalarArrayOp). Then because the DBMS scans every tuple, it probes this hash desk to see whether or not the tuple’s attribute matches with any entry. This hash-based analysis has a extra environment friendly complexity of O(1) because the DBMS solely must carry out one lookup within the hash desk per tuple.
Therefore, because the variety of predicates will increase in a question, the OR clause incurs vital overhead as a result of the DBMS evaluates predicates for every row individually, in distinction to the extra environment friendly IN clause.
A number of attribute filters
We additionally ran checks utilizing queries that filter information on utilizing equality predicates on two attributes. We first consider the situations when each attributes are both listed or unindexed. We then examined queries the place there may be an index for under one of many attributes.
The spoiler for this half is that PostgreSQL generates the identical plan for the IN question because it does for the OR question in every situation. Because of this their efficiency is identical. Thus, it doesn’t matter whether or not you employ the IN clause or the OR clause when querying on a number of attributes since they each generate the identical question plan.
Two listed attributes
The queries to filter the desk on two attributes (identify, amount) utilizing the the composite index (idx_item_name_quantity) are as follows:
SELECT * FROM merchandise WHERE (identify, amount) IN (('a', 1), ('b', 2));
SELECT * FROM merchandise WHERE (identify='a' AND amount=1) OR (identify='b' AND amount=2)
The question plan is as follows:
Bitmap Heap Scan on merchandise (price=9.14..13.16 rows=1 width=44)
Recheck Cond: ((((identify)::textual content = 'a'::textual content) AND (amount = 1)) OR (((identify)::textual content = 'b'::textual content) AND (amount = 2)))
-> BitmapOr (price=9.14..9.14 rows=1 width=0)
-> Bitmap Index Scan on idx_item_name_quantity (price=0.00..4.57 rows=1 width=0)
Index Cond: (((identify)::textual content = 'a'::textual content) AND (amount = 1))
-> Bitmap Index Scan on idx_item_name_quantity (price=0.00..4.57 rows=1 width=0)
Index Cond: (((identify)::textual content = 'b'::textual content) AND (amount = 2))
These outcomes present that the execution time is about the identical for each queries. In each situations, PostgreSQL executes a separate Bitmap Index Scan per predicate within the WHERE clause because it did with the OR clause question with a single listed attribute. That’s, the DBMS doesn’t apply any optimizations for the IN clause question.
Two unindexed attributes
We subsequent think about queries that filter on two unindexed attributes (worth, amount):
SELECT * FROM merchandise WHERE (worth, amount) IN ((1, 1), (2, 2));
SELECT * FROM merchandise WHERE (worth=1 AND amount=1) OR (worth=2 AND amount=2)
As soon as once more we see that PostgreSQL generates the identical plans for each queries:
Collect (price=1000.00..177791.33 rows=1 width=44)
Staff Deliberate: 2
-> Parallel Seq Scan on merchandise (price=0.00..176791.23 rows=1 width=44)
Filter: (((worth = 1) AND (amount = 1)) OR ((worth = 2) AND (amount = 2)))
The above outcomes present that the efficiency of the 2 queries are once more basically the identical. It is because PostgreSQL doesn’t construct a brief hash desk to hurry up IN clause analysis when there may be multiple attribute. PostgreSQL notably additionally doesn’t help multi-attribute hash indexes both (see this psql-hackers email thread discussing this limitation method again in 1998!).
One listed + one unindexed
Lastly, we check queries the place there may be solely an index for one of many attributes (id) and one unindexed attribute (worth):
SELECT * FROM merchandise WHERE (id, worth) IN ((1, 1), (2, 2));
SELECT * FROM merchandise WHERE (id=1 AND worth=1) OR (id=2 AND worth=2)
And like the opposite checks, PostgreSQL chooses the identical plan for each queries:
Bitmap Heap Scan on merchandise (price=8.89..16.90 rows=1 width=36)
Recheck Cond: ((id = 1) OR (id = 2))
Filter: (((id = 1) AND (worth = 1)) OR ((id = 2) AND (worth = 2)))
-> BitmapOr (price=8.89..8.89 rows=2 width=0)
-> Bitmap Index Scan on item_pkey (price=0.00..4.44 rows=1 width=0)
Index Cond: (id = 1)
-> Bitmap Index Scan on item_pkey (price=0.00..4.44 rows=1 width=0)
Index Cond: (id = 2)
And once more we see that the queries execution instances are unsurprisingly the identical.
Finest practices
Based mostly on our outcomes, we discover that PostgreSQL constantly reveals equal or higher efficiency for queries with an IN clause than OR clauses when filtering on a single attribute. That is very true for queries with numerous predicates. For small situation sizes (e.g., lower than 10), the efficiency distinction between the 2 question varieties is negligible.
When filtering on a number of attributes, each the IN and OR clauses have the identical efficiency as a result of PostgreSQL chooses the identical plan for each queries.
Thus, we advise that you need to all the time use IN clauses with PostgreSQL, making it a most well-liked selection for higher effectivity.