The Many Faces of DISTINCT in PostgreSQL
I began my programming profession as an Oracle DBA. It took a couple of years however ultimately I acquired fed up with the company world and I went about doing my very own factor.
After I not had the cozy cushion of Oracle enterprise version I found PostgreSQL. After I gotten over not having correct partitions and MERGE assertion (aka UPSERT), I discovered some good distinctive options in PostgreSQL. Oddly sufficient, numerous them contained the phrase DISTINCT.
DISTINCT
I created a easy Worker desk with identify, division and wage utilizing mock knowledge from this site:
haki=# d worker
Column | Sort | Modifiers
------------+-----------------------+-----------
id | integer | not null
identify | character various(30) |
division | character various(30) |
wage | integer |
haki=# choose * from worker restrict 5;
id | identify | division | wage
----+----------------+----------------------+--------
1 | Carl Frazier | Engineering | 3052
2 | Richard Fox | Product Administration | 13449
3 | Carolyn Carter | Engineering | 8366
4 | Benjamin Brown | Enterprise Growth | 7386
5 | Diana Fisher | Companies | 10419
What is DISTINCT?
SELECT DISTINCT eliminates duplicate rows from the end result.
The only use of distinct is, for instance, to get a novel listing of
departments:
haki=# SELECT DISTINCT division FROM worker;
division
--------------------------
Companies
Assist
Coaching
Accounting
Enterprise Growth
Advertising and marketing
Product Administration
Human Sources
Engineering
Gross sales
Analysis and Growth
Authorized
(straightforward CS college students, I do know it is not normalized…)
We are able to do the identical factor with group by
SELECT division FROM worker GROUP BY division;
However we’re speaking about DISTINCT.
DISTINCT ON
A basic job interview query is discovering the worker with the best wage in every division.
That is what they train within the college:
haki=# SELECT
*
FROM
worker
WHERE
(division, wage) IN (
SELECT
division,
MAX(wage)
FROM
worker
GROUP BY
division
)
ORDER BY
division;
id | identify | division | wage
----+------------------+--------------------------+--------
30 | Sara Roberts | Accounting | 13845
4 | Benjamin Brown | Enterprise Growth | 7386
3 | Carolyn Carter | Engineering | 8366
20 | Janet Corridor | Human Sources | 2826
14 | Chris Phillips | Authorized | 3706
10 | James Cunningham | Authorized | 3706
11 | Richard Bradley | Advertising and marketing | 11272
2 | Richard Fox | Product Administration | 13449
25 | Evelyn Rodriguez | Analysis and Growth | 10628
17 | Benjamin Carter | Gross sales | 6197
24 | Jessica Elliott | Companies | 14542
7 | Bonnie Robertson | Assist | 12674
8 | Jean Bailey | Coaching | 13230
Authorized has two workers with the identical excessive wage. Relying on the use case, this question can get fairly nasty.
For those who graduated some time again, you already know a couple of issues about databases and also you heard about analytic and window functions, you would possibly do that:
WITH ranked_employees AS (
SELECT
ROW_NUMBER() OVER (
PARTITION BY division ORDER BY wage DESC
) AS rn,
*
FROM
worker
)
SELECT
*
FROM
ranked_employees
WHERE
rn = 1
ORDER BY
division;
The end result is similar with out the duplicates:
rn | id | identify | division | wage
----+----+------------------+--------------------------+--------
1 | 30 | Sara Roberts | Accounting | 13845
1 | 4 | Benjamin Brown | Enterprise Growth | 7386
1 | 3 | Carolyn Carter | Engineering | 8366
1 | 20 | Janet Corridor | Human Sources | 2826
1 | 14 | Chris Phillips | Authorized | 3706
1 | 11 | Richard Bradley | Advertising and marketing | 11272
...
Up till now, that is what I might have carried out.
Now for the true deal with, PostgreSQL has a particular nonstandard clause to search out the primary row in a bunch:
SELECT DISTINCT ON (division)
*
FROM
worker
ORDER BY
division,
wage DESC;
That is wild! Why no person instructed me that is doable?
The docs clarify DISTINCT ON:
SELECT DISTINCT ON ( expression [, …] ) retains solely the primary row of every set of
rows the place the given expressions consider to equal
And the explanation I have never heard about it’s:
Nonstandard Clauses
DISTINCT ON ( … ) is an extension of the SQL commonplace.
PostgreSQL does all of the heavy lifting for us. The one requirement is that we ORDER BY the sector we group by (division
on this case). It additionally permits for “grouping” by a couple of area which solely makes this clause much more highly effective.
IS DISTINCT FROM
Evaluating values in SQL may end up in three outcomes – true
, false
or unknown
:
WITH t AS (
SELECT 1 AS a, 1 AS b UNION ALL
SELECT 1, 2 UNION ALL
SELECT NULL, 1 UNION ALL
SELECT NULL, NULL
)
SELECT
a,
b,
a = b as equal
FROM
t;
a | b | equal
------+------+-------
1 | 1 | t
1 | 2 | f
NULL | 1 | NULL
NULL | NULL | NULL
The results of evaluating NULL with NULL utilizing equality (=) is UNKNOWN (marked as NULL within the desk).
In SQL 1 = 1 and NULL IS NULL however NULL != NULL.
It is vital to concentrate on this subtlety as a result of evaluating nullable fields would possibly yield sudden outcomes.
The total situation to get both true or false when evaluating nullable fields is:
(a is null and b is null)
or
(a is not null and b is not null and a = b)
And the end result:
a | b | equal | full_condition
------+------+-------+----------
1 | 1 | t | t
1 | 2 | f | f
NULL | 1 | NULL | f
NULL | NULL | NULL | t
That is the end result we wish however it is rather lengthy. Is there a greater manner?
PostgreSQL implements the SQL commonplace for safely evaluating nullable fields:
haki=# SELECT
a,
b,
a = b as equal,
a IS DISTINCT FROM b AS is_distinct_from
FROM
t;
a | b | equal | is_distinct_from
------+------+-------+------------------
1 | 1 | t | f
1 | 2 | f | t
NULL | 1 | NULL | t
NULL | NULL | NULL | f
PostgreSQL wiki clarify IS DISTINCT FROM
:
IS DISTINCT FROM and IS NOT DISTINCT FROM … deal with NULL as if it was a
recognized worth, somewhat than a particular case for unknown.
A lot better – brief and verbose.
How Other Databases Handle This?
- MySQL – A special operator
<=>
with related performance. - Oracle – Offers a perform known as LNNVL to match nullable fields (good luck with that…).
- MSSQL – Could not discover a related perform.
ARRAY_AGG (DISTINCT)
ARRAY_AGG
was one of many main promoting factors of PostgreSQL once I was
transitioning from Oracle.
ARRAY_AGG
aggregates values into an array:
haki=# SELECT
division,
ARRAY_AGG(identify) AS workers
FROM
worker
GROUP BY
division;
division | workers
----------------------+-------------------------------------
Companies | {"Diana Fisher","Jessica Elliott"}
Assist | {"Bonnie Robertson"}
Coaching | {"Jean Bailey"}
Accounting | {"Phillip Reynolds","Sean Franklin"}
Enterprise Growth | {"Benjamin Brown","Brian Hayes"}
Advertising and marketing | {"Richard Bradley","Arthur Moreno"}
Product Administration | {"Richard Fox","Randy Wells"}
Human Sources | {"Janet Corridor"}
Engineering | {"Carl Frazier","Carolyn Carter"}
Gross sales | {"Benjamin Carter"}
Analysis and Develo.. | {"Donna Reynolds","Ann Boyd"}
Authorized | {"James Cunningham","George Hanson"}
I discover ARRAY_AGG
helpful principally within the CLI for getting a fast view of the info, or when used with an ORM.
PostgreSQL took it the additional mile and carried out the DISTINCT choice for this combination perform as effectively. Utilizing DISTINCT we will, for instance, shortly view the distinctive salaries in every division:
haki=# SELECT
division,
ARRAY_AGG(DISTINCT wage) AS salaries
FROM
worker
GROUP BY
division;
division | salaries
--------------------------+---------------
Accounting | {11203}
Enterprise Growth | {2196,7386}
Engineering | {1542,3052}
Human Sources | {2826}
Authorized | {1079,3706}
Advertising and marketing | {5740}
Product Administration | {9101,13449}
Analysis and Growth | {6451,10628}
Gross sales | {6197}
Companies | {2119}
Assist | {12674}
Coaching | {13230}
We are able to instantly see that everybody within the help division are making the identical wage.
How Other Databases Handle This?
- MySQL – Has the same perform known as GROUP_CONCAT.
- Oracle – Has an combination perform known as ListAgg. It has no help for DISTINCT. Oracle launched the perform in model 11.2 and up till then the world huge internet was crammed with customized implementations.
- MsSQL – The closest I discovered was a perform known as STUFF that accepts an expression.
Take away
The take away from this text is that you need to at all times return to the fundamentals!