Do not use DISTINCT as a “join-fixer”

2023-10-20 17:52:17

I’ve quietly resolved performance issues by rewriting slow queries to avoid DISTINCT. Often, the DISTINCT is there only to serve as a “join-fixer,” and I can explain what that means using an example.

Let’s say we have the following grossly simplified schema, representing customers, products, and product categories:

And then we have tables for orders and order details:

And some sample data:
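The schema and data listings are not reproduced here, so the following is only a minimal sketch of what they might look like. It uses SQLite through Python's sqlite3 module as a stand-in for SQL Server so that it can be run anywhere. The table names follow the article; the columns other than CategoryID, and every sample row, are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: only the table names and CategoryID come from the article.
SCHEMA = """
CREATE TABLE Customers   (CustomerID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Categories  (CategoryID INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE Products    (ProductID  INTEGER PRIMARY KEY, Name TEXT NOT NULL,
                          CategoryID INTEGER NOT NULL REFERENCES Categories(CategoryID));
CREATE TABLE Orders      (OrderID    INTEGER PRIMARY KEY,
                          CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID));
CREATE TABLE OrderDetails(OrderID    INTEGER NOT NULL REFERENCES Orders(OrderID),
                          ProductID  INTEGER NOT NULL REFERENCES Products(ProductID),
                          PRIMARY KEY (OrderID, ProductID));
"""

# Hypothetical sample rows: one customer buys a beauty product, the other buys bread.
SAMPLE = """
INSERT INTO Customers    VALUES (1, 'Aaron'), (2, 'Bob');
INSERT INTO Categories   VALUES (1, 'Beauty'), (2, 'Grocery');
INSERT INTO Products     VALUES (1, 'Lipstick', 1), (2, 'Mascara', 1), (3, 'Bread', 2);
INSERT INTO Orders       VALUES (1, 1), (2, 2);
INSERT INTO OrderDetails VALUES (1, 1), (2, 3);
"""

def build_db():
    """Create an in-memory database holding the sketch schema and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA + SAMPLE)
    return conn

conn = build_db()
tables = ("Customers", "Categories", "Products", "Orders", "OrderDetails")
row_counts = {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables}
print(row_counts)
```

Note that only a single OrderDetails row points at a beauty product, which is exactly the kind of thin test data that hides the problem described next.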

Marketing says we want to send an e-mail or give a discount code to all the customers who’ve ordered a product from the beauty category. The initial attempt at a query for this might be something like this:
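As a rough sketch of such a first attempt, the query below joins all five tables and filters on the category name. It runs against SQLite via Python's sqlite3 purely for convenience; the column names and rows are assumptions, and in T-SQL you would typically add the dbo. schema prefix and an N'Beauty' literal.

```python
import sqlite3

# Hypothetical tables and rows (assumptions, not the article's original listings).
SETUP = """
CREATE TABLE Customers   (CustomerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Categories  (CategoryID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Products    (ProductID INTEGER PRIMARY KEY, Name TEXT, CategoryID INTEGER);
CREATE TABLE Orders      (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
CREATE TABLE OrderDetails(OrderID INTEGER, ProductID INTEGER);
INSERT INTO Customers    VALUES (1, 'Aaron'), (2, 'Bob');
INSERT INTO Categories   VALUES (1, 'Beauty'), (2, 'Grocery');
INSERT INTO Products     VALUES (1, 'Lipstick', 1), (2, 'Mascara', 1), (3, 'Bread', 2);
INSERT INTO Orders       VALUES (1, 1), (2, 2);
INSERT INTO OrderDetails VALUES (1, 1), (2, 3);
"""

# Join everything, then filter on the category name.
NAIVE_JOIN = """
SELECT c.CustomerID, c.Name
FROM Customers AS c
INNER JOIN Orders       AS o   ON o.CustomerID   = c.CustomerID
INNER JOIN OrderDetails AS od  ON od.OrderID     = o.OrderID
INNER JOIN Products     AS p   ON p.ProductID    = od.ProductID
INNER JOIN Categories   AS cat ON cat.CategoryID = p.CategoryID
WHERE cat.Name = 'Beauty'
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)
rows = conn.execute(NAIVE_JOIN).fetchall()
print(rows)  # [(1, 'Aaron')]
```

With only one matching OrderDetails row, the output looks correct.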

The plan doesn’t look so bad (yet):

A simple plan

And in local or test data, the output might look right, since we may have inserted a single row into OrderDetails to match our criteria (and to make our tests pass). But what if I’ve ordered two products from the beauty category (in the same order, or across multiple orders)?

Now the query returns that customer twice! We really don’t want to send them two e-mails, or issue multiple discount codes to the same customer. And the plan, on its own, can’t really show any obvious clues that there are duplicate rows:
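The duplication is easy to reproduce in a small sketch. Below, the hypothetical customer Aaron places a second order containing another beauty product, and the same five-table join then returns him once per matching OrderDetails row. SQLite via Python's sqlite3 stands in for SQL Server, and all column names and rows are assumptions.

```python
import sqlite3

# Hypothetical tables and rows (assumptions, not the article's original listings).
SETUP = """
CREATE TABLE Customers   (CustomerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Categories  (CategoryID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Products    (ProductID INTEGER PRIMARY KEY, Name TEXT, CategoryID INTEGER);
CREATE TABLE Orders      (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
CREATE TABLE OrderDetails(OrderID INTEGER, ProductID INTEGER);
INSERT INTO Customers    VALUES (1, 'Aaron'), (2, 'Bob');
INSERT INTO Categories   VALUES (1, 'Beauty'), (2, 'Grocery');
INSERT INTO Products     VALUES (1, 'Lipstick', 1), (2, 'Mascara', 1), (3, 'Bread', 2);
INSERT INTO Orders       VALUES (1, 1), (2, 2);
INSERT INTO OrderDetails VALUES (1, 1), (2, 3);
"""

NAIVE_JOIN = """
SELECT c.CustomerID, c.Name
FROM Customers AS c
INNER JOIN Orders       AS o   ON o.CustomerID   = c.CustomerID
INNER JOIN OrderDetails AS od  ON od.OrderID     = o.OrderID
INNER JOIN Products     AS p   ON p.ProductID    = od.ProductID
INNER JOIN Categories   AS cat ON cat.CategoryID = p.CategoryID
WHERE cat.Name = 'Beauty'
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Aaron places a second order (OrderID 3) for a second beauty product (Mascara).
conn.execute("INSERT INTO Orders VALUES (3, 1)")
conn.execute("INSERT INTO OrderDetails VALUES (3, 2)")

rows = conn.execute(NAIVE_JOIN).fetchall()
print(rows)  # [(1, 'Aaron'), (1, 'Aaron')]
```

The join is doing what it was asked to do: it produces one row per qualifying order line, not one row per customer.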

Hidden duplicates

But you sure will notice if you inspect the results, or an end user will notice if you unleash this in production. The quick fix tends to be: slap a big ol’ DISTINCT on there which, indeed, fixes the symptom by eliminating duplicates:
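Here is a small sketch of that quick fix, again using SQLite via Python's sqlite3 as a stand-in and assumed column names and rows. It counts the rows the join produces and then shows what DISTINCT leaves behind, which makes the point that the duplicates are still generated and only removed afterwards.

```python
import sqlite3

# Hypothetical tables and rows (assumptions); Aaron has two beauty purchases.
SETUP = """
CREATE TABLE Customers   (CustomerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Categories  (CategoryID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Products    (ProductID INTEGER PRIMARY KEY, Name TEXT, CategoryID INTEGER);
CREATE TABLE Orders      (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
CREATE TABLE OrderDetails(OrderID INTEGER, ProductID INTEGER);
INSERT INTO Customers    VALUES (1, 'Aaron'), (2, 'Bob');
INSERT INTO Categories   VALUES (1, 'Beauty'), (2, 'Grocery');
INSERT INTO Products     VALUES (1, 'Lipstick', 1), (2, 'Mascara', 1), (3, 'Bread', 2);
INSERT INTO Orders       VALUES (1, 1), (2, 2), (3, 1);
INSERT INTO OrderDetails VALUES (1, 1), (2, 3), (3, 2);
"""

# Shared FROM/WHERE clause so both statements run the same join.
JOINS = """
FROM Customers AS c
INNER JOIN Orders       AS o   ON o.CustomerID   = c.CustomerID
INNER JOIN OrderDetails AS od  ON od.OrderID     = o.OrderID
INNER JOIN Products     AS p   ON p.ProductID    = od.ProductID
INNER JOIN Categories   AS cat ON cat.CategoryID = p.CategoryID
WHERE cat.Name = 'Beauty'
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Rows the join produces before any de-duplication.
joined_count = conn.execute("SELECT COUNT(*) " + JOINS).fetchone()[0]

# The "join-fixer": same join, with DISTINCT removing the duplicates afterwards.
distinct_rows = conn.execute("SELECT DISTINCT c.CustomerID, c.Name " + JOINS).fetchall()

print(joined_count)   # 2
print(distinct_rows)  # [(1, 'Aaron')]
```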

But at what cost? A distinct sort, that’s what!

Pain caused by that DISTINCT

If I’m testing changes to this query in my local environment, and maybe just checking the output and that it returned the data quickly, I might miss clues in the plan and be quite happy that adding DISTINCT fixed the issue without impacting performance.

This will only get worse with more data.

And while we could spend a lot of time tuning indexes on all the involved tables to make that sort hurt less, this multi-table join is always going to produce rows you never ultimately need. Think about SQL Server’s job: yes, it needs to return correct results, but it also should do that in the most efficient way possible. Reading all the data (and then sorting it), only to throw away some or most of it, is very wasteful.

Can we express the query without DISTINCT?

When I know I need to “join” to tables but only care about existence of rows and not any of the output from those tables, I turn to EXISTS. I also try to eliminate looking up values that I know are going to be the same on every row. In this case, I don’t need to join to Categories every time if CategoryID is effectively a constant.

One way to express this same query, guaranteeing no duplicate customers and, hopefully, reducing the cost of sorting:
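A minimal sketch of that approach, under the same assumptions as before (SQLite via Python's sqlite3 standing in for SQL Server, hypothetical columns and rows): the category ID is looked up once and passed in as a parameter, and the order tables are only probed through EXISTS. In T-SQL the lookup would typically go into a local variable such as a hypothetical @CategoryID.

```python
import sqlite3

# Hypothetical tables and rows (assumptions); Aaron has two beauty purchases.
SETUP = """
CREATE TABLE Customers   (CustomerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Categories  (CategoryID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Products    (ProductID INTEGER PRIMARY KEY, Name TEXT, CategoryID INTEGER);
CREATE TABLE Orders      (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
CREATE TABLE OrderDetails(OrderID INTEGER, ProductID INTEGER);
INSERT INTO Customers    VALUES (1, 'Aaron'), (2, 'Bob');
INSERT INTO Categories   VALUES (1, 'Beauty'), (2, 'Grocery');
INSERT INTO Products     VALUES (1, 'Lipstick', 1), (2, 'Mascara', 1), (3, 'Bread', 2);
INSERT INTO Orders       VALUES (1, 1), (2, 2), (3, 1);
INSERT INTO OrderDetails VALUES (1, 1), (2, 3), (3, 2);
"""

# Each customer is returned at most once: EXISTS only asks "is there a match?".
EXISTS_QUERY = """
SELECT c.CustomerID, c.Name
FROM Customers AS c
WHERE EXISTS (
    SELECT 1
    FROM Orders AS o
    INNER JOIN OrderDetails AS od ON od.OrderID  = o.OrderID
    INNER JOIN Products     AS p  ON p.ProductID = od.ProductID
    WHERE o.CustomerID = c.CustomerID
      AND p.CategoryID = ?
)
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Look up the constant once instead of joining to Categories for every row.
category_id = conn.execute(
    "SELECT CategoryID FROM Categories WHERE Name = ?", ("Beauty",)
).fetchone()[0]

rows = conn.execute(EXISTS_QUERY, (category_id,)).fetchall()
print(rows)  # [(1, 'Aaron')]
```

No DISTINCT is needed because the outer query reads Customers, where each customer appears once, and the subquery never multiplies rows.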

There’s a simple, additional index seek against Categories, of course, but the plan for the overall query has been made drastically more efficient (we’re down to 2 scans and 2 seeks):

A plan using EXISTS

Another way to express the same query is to force Orders to be scanned later:
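One possible shape for that rewrite, offered only as a guess at the structure since the original listing is not shown: start the subquery from OrderDetails and Products, and check Orders last in a nested EXISTS. The sketch uses SQLite via Python's sqlite3 with assumed columns and rows, so it only confirms that the result is the same; the scan order described in the text is SQL Server optimizer behavior that this sketch does not demonstrate.

```python
import sqlite3

# Hypothetical tables and rows (assumptions); Aaron has two beauty purchases.
SETUP = """
CREATE TABLE Customers   (CustomerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Categories  (CategoryID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Products    (ProductID INTEGER PRIMARY KEY, Name TEXT, CategoryID INTEGER);
CREATE TABLE Orders      (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
CREATE TABLE OrderDetails(OrderID INTEGER, ProductID INTEGER);
INSERT INTO Customers    VALUES (1, 'Aaron'), (2, 'Bob');
INSERT INTO Categories   VALUES (1, 'Beauty'), (2, 'Grocery');
INSERT INTO Products     VALUES (1, 'Lipstick', 1), (2, 'Mascara', 1), (3, 'Bread', 2);
INSERT INTO Orders       VALUES (1, 1), (2, 2), (3, 1);
INSERT INTO OrderDetails VALUES (1, 1), (2, 3), (3, 2);
"""

# Filter order lines by category first; Orders is consulted only in the inner EXISTS.
NESTED_EXISTS = """
SELECT c.CustomerID, c.Name
FROM Customers AS c
WHERE EXISTS (
    SELECT 1
    FROM OrderDetails AS od
    INNER JOIN Products AS p ON p.ProductID = od.ProductID
    WHERE p.CategoryID = ?
      AND EXISTS (
          SELECT 1
          FROM Orders AS o
          WHERE o.OrderID    = od.OrderID
            AND o.CustomerID = c.CustomerID
      )
)
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

category_id = conn.execute(
    "SELECT CategoryID FROM Categories WHERE Name = ?", ("Beauty",)
).fetchone()[0]

rows = conn.execute(NESTED_EXISTS, (category_id,)).fetchall()
print(rows)  # [(1, 'Aaron')]
```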

This can be useful if you have more Orders than Customers (I really hope that’s the case). Notice in the plan that we still have two scans and two seeks, but Orders is scanned later, and Products is scanned instead of Customers.

A slightly different EXISTS plan

Overall, fewer rows move through the plan, and this is reflected in slightly lower reads. This will be amplified by more rows in the table, wider rows in general, and more rows filtered out earlier.

Conclusion

DISTINCT is often hiding flaws in the underlying logic, and it can really pay off to explore other ways to write your queries without it. There was another interesting use case I wrote about a few years ago that showed how changing DISTINCT to GROUP BY, even though it carries the same semantics and produces the same results, can help SQL Server filter out duplicates earlier and have a serious impact on performance.
