Now Reading
The Many Faces of DISTINCT in PostgreSQL

The Many Faces of DISTINCT in PostgreSQL

2023-05-22 04:46:19


I began my programming profession as an Oracle DBA. It took a couple of years however ultimately I acquired fed up with the company world and I went about doing my very own factor.

After I not had the cozy cushion of Oracle enterprise version I found PostgreSQL. After I gotten over not having correct partitions and MERGE assertion (aka UPSERT), I discovered some good distinctive options in PostgreSQL. Oddly sufficient, numerous them contained the phrase DISTINCT.

DISTINCT

I created a easy Worker desk with identify, division and wage utilizing mock knowledge from this site:

haki=# d worker

    Column   |         Sort          | Modifiers
------------+-----------------------+-----------
    id         | integer               | not null
    identify       | character various(30) |
    division | character various(30) |
    wage     | integer               |

haki=# choose * from worker restrict 5;

    id |      identify      |      division      | wage
----+----------------+----------------------+--------
    1 | Carl Frazier   | Engineering          |   3052
    2 | Richard Fox    | Product Administration   |  13449
    3 | Carolyn Carter | Engineering          |   8366
    4 | Benjamin Brown | Enterprise Growth |   7386
    5 | Diana Fisher   | Companies             |  10419

What is DISTINCT?

SELECT DISTINCT eliminates duplicate rows from the end result.

The only use of distinct is, for instance, to get a novel listing of
departments:

haki=# SELECT DISTINCT division FROM worker;

        division
--------------------------
    Companies
    Assist
    Coaching
    Accounting
    Enterprise Growth
    Advertising and marketing
    Product Administration
    Human Sources
    Engineering
    Gross sales
    Analysis and Growth
    Authorized

(straightforward CS college students, I do know it is not normalized…)

We are able to do the identical factor with group by

SELECT division FROM worker GROUP BY division;

However we’re speaking about DISTINCT.


DISTINCT ON

A basic job interview query is discovering the worker with the best wage in every division.

That is what they train within the college:

haki=# SELECT
    *
FROM
    worker
WHERE
    (division, wage) IN (
        SELECT
            division,
            MAX(wage)
        FROM
            worker
        GROUP BY
            division
    )
ORDER BY
    division;

 id |       identify       |        division        | wage
----+------------------+--------------------------+--------
 30 | Sara Roberts     | Accounting               |  13845
  4 | Benjamin Brown   | Enterprise Growth     |   7386
  3 | Carolyn Carter   | Engineering              |   8366
 20 | Janet Corridor       | Human Sources          |   2826
 14 | Chris Phillips   | Authorized                    |   3706
 10 | James Cunningham | Authorized                    |   3706
 11 | Richard Bradley  | Advertising and marketing                |  11272
  2 | Richard Fox      | Product Administration       |  13449
 25 | Evelyn Rodriguez | Analysis and Growth |  10628
 17 | Benjamin Carter  | Gross sales                    |   6197
 24 | Jessica Elliott  | Companies                 |  14542
  7 | Bonnie Robertson | Assist                  |  12674
  8 | Jean Bailey      | Coaching                 |  13230

Authorized has two workers with the identical excessive wage. Relying on the use case, this question can get fairly nasty.

For those who graduated some time again, you already know a couple of issues about databases and also you heard about analytic and window functions, you would possibly do that:

WITH ranked_employees AS (
    SELECT
        ROW_NUMBER() OVER (
           PARTITION BY division ORDER BY wage DESC
        ) AS rn,
        *
    FROM
        worker
)
SELECT
    *
FROM
    ranked_employees
WHERE
    rn = 1
ORDER BY
    division;

The end result is similar with out the duplicates:

 rn | id |       identify       |        division        | wage
----+----+------------------+--------------------------+--------
  1 | 30 | Sara Roberts     | Accounting               |  13845
  1 |  4 | Benjamin Brown   | Enterprise Growth     |   7386
  1 |  3 | Carolyn Carter   | Engineering              |   8366
  1 | 20 | Janet Corridor       | Human Sources          |   2826
  1 | 14 | Chris Phillips   | Authorized                    |   3706
  1 | 11 | Richard Bradley  | Advertising and marketing                |  11272
...

Up till now, that is what I might have carried out.

Now for the true deal with, PostgreSQL has a particular nonstandard clause to search out the primary row in a bunch:

SELECT DISTINCT ON (division)
    *
FROM
    worker
ORDER BY
    division,
    wage DESC;
This is wild!
That is wild!

That is wild! Why no person instructed me that is doable?

The docs clarify DISTINCT ON:

SELECT DISTINCT ON ( expression [, …] ) retains solely the primary row of every set of
rows the place the given expressions consider to equal

And the explanation I have never heard about it’s:

Nonstandard Clauses
DISTINCT ON ( … ) is an extension of the SQL commonplace.

PostgreSQL does all of the heavy lifting for us. The one requirement is that we ORDER BY the sector we group by (division on this case). It additionally permits for “grouping” by a couple of area which solely makes this clause much more highly effective.


IS DISTINCT FROM

Evaluating values in SQL may end up in three outcomes – true, false or unknown:

WITH t AS (
    SELECT 1 AS a, 1 AS b UNION ALL
    SELECT 1, 2 UNION ALL
    SELECT NULL, 1 UNION ALL
    SELECT NULL, NULL
)
SELECT
    a,
    b,
    a = b as equal
FROM
    t;

  a   |  b   | equal
------+------+-------
    1 |    1 | t
    1 |    2 | f
 NULL |    1 | NULL
 NULL | NULL | NULL

The results of evaluating NULL with NULL utilizing equality (=) is UNKNOWN (marked as NULL within the desk).

In SQL 1 = 1 and NULL IS NULL however NULL != NULL.

It is vital to concentrate on this subtlety as a result of evaluating nullable fields would possibly yield sudden outcomes.

The total situation to get both true or false when evaluating nullable fields is:

(a is null and b is null)
or
(a is not null and b is not null and a = b)

And the end result:

See Also

  a   |  b   | equal | full_condition
------+------+-------+----------
    1 |    1 | t     | t
    1 |    2 | f     | f
 NULL |    1 | NULL  | f
 NULL | NULL | NULL  | t

That is the end result we wish however it is rather lengthy. Is there a greater manner?

PostgreSQL implements the SQL commonplace for safely evaluating nullable fields:

haki=# SELECT
    a,
    b,
    a = b as equal,
    a IS DISTINCT FROM b AS is_distinct_from
FROM
    t;

  a   |  b   | equal | is_distinct_from
------+------+-------+------------------
    1 |    1 | t     | f
    1 |    2 | f     | t
 NULL |    1 | NULL  | t
 NULL | NULL | NULL  | f

PostgreSQL wiki clarify IS DISTINCT FROM:

IS DISTINCT FROM and IS NOT DISTINCT FROM … deal with NULL as if it was a
recognized worth, somewhat than a particular case for unknown.

A lot better – brief and verbose.

How Other Databases Handle This?

  • MySQL – A special operator <=> with related performance.
  • Oracle – Offers a perform known as LNNVL to match nullable fields (good luck with that…).
  • MSSQL – Could not discover a related perform.

ARRAY_AGG (DISTINCT)

ARRAY_AGG was one of many main promoting factors of PostgreSQL once I was
transitioning from Oracle.

ARRAY_AGG aggregates values into an array:

haki=# SELECT
    division,
    ARRAY_AGG(identify) AS workers
FROM
    worker
GROUP BY
    division;

    division        |        workers
----------------------+-------------------------------------
Companies              | {"Diana Fisher","Jessica Elliott"}
Assist               | {"Bonnie Robertson"}
Coaching              | {"Jean Bailey"}
Accounting            | {"Phillip Reynolds","Sean Franklin"}
Enterprise Growth  | {"Benjamin Brown","Brian Hayes"}
Advertising and marketing             | {"Richard Bradley","Arthur Moreno"}
Product Administration    | {"Richard Fox","Randy Wells"}
Human Sources       | {"Janet Corridor"}
Engineering           | {"Carl Frazier","Carolyn Carter"}
Gross sales                 | {"Benjamin Carter"}
Analysis and Develo.. | {"Donna Reynolds","Ann Boyd"}
Authorized                 | {"James Cunningham","George Hanson"}

I discover ARRAY_AGG helpful principally within the CLI for getting a fast view of the info, or when used with an ORM.

PostgreSQL took it the additional mile and carried out the DISTINCT choice for this combination perform as effectively. Utilizing DISTINCT we will, for instance, shortly view the distinctive salaries in every division:

haki=# SELECT
    division,
    ARRAY_AGG(DISTINCT wage) AS salaries
FROM
    worker
GROUP BY
    division;

division                | salaries
--------------------------+---------------
 Accounting               | {11203}
 Enterprise Growth     | {2196,7386}
 Engineering              | {1542,3052}
 Human Sources          | {2826}
 Authorized                    | {1079,3706}
 Advertising and marketing                | {5740}
 Product Administration       | {9101,13449}
 Analysis and Growth | {6451,10628}
 Gross sales                    | {6197}
 Companies                 | {2119}
 Assist                  | {12674}
 Coaching                 | {13230}

We are able to instantly see that everybody within the help division are making the identical wage.

How Other Databases Handle This?

  • MySQL – Has the same perform known as GROUP_CONCAT.
  • Oracle – Has an combination perform known as ListAgg. It has no help for DISTINCT. Oracle launched the perform in model 11.2 and up till then the world huge internet was crammed with customized implementations.
  • MsSQL – The closest I discovered was a perform known as STUFF that accepts an expression.

Take away

The take away from this text is that you need to at all times return to the fundamentals!

Source Link

What's Your Reaction?
Excited
0
Happy
0
In Love
0
Not Sure
0
Silly
0
View Comments (0)

Leave a Reply

Your email address will not be published.

2022 Blinking Robots.
WordPress by Doejo

Scroll To Top