Do not use DISTINCT as a “join-fixer”

I’ve quietly resolved efficiency points by re-writing sluggish queries to keep away from DISTINCT
. Typically, the DISTINCT
is there solely to function a “join-fixer,” and I can clarify what which means utilizing an instance.
Let’s say we now have the next grossly simplified schema, representing prospects, merchandise, and product classes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
CREATE TABLE dbo.Clients ( CustomerID int NOT NULL, Title nvarchar(255) NOT NULL, CONSTRAINT PK_Customers PRIMARY KEY (CustomerID) );
CREATE TABLE dbo.Classes ( CategoryID int NOT NULL, Title nvarchar(255) NOT NULL, CONSTRAINT PK_Categories PRIMARY KEY (CategoryID), CONSTRAINT UQ_Categories UNIQUE (Title) );
CREATE TABLE dbo.Merchandise ( ProductID int NOT NULL, CategoryID int NOT NULL, Title nvarchar(255) NOT NULL, CONSTRAINT PK_Products PRIMARY KEY (ProductID) );
|
After which we now have tables for orders and order particulars:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
CREATE TABLE dbo.Orders ( OrderID int NOT NULL, CustomerID int NOT NULL, OrderDate date, OrderTotal decimal(12,2), CONSTRAINT PK_Orders PRIMARY KEY (OrderID) );
CREATE TABLE dbo.OrderDetails ( OrderID int NOT NULL, LineItemID int NOT NULL, ProductID int NOT NULL, Amount int NOT NULL, CONSTRAINT PK_OrderDetails PRIMARY KEY (OrderID, LineItemID), INDEX IX_OrderDetails_OrderID_ProductID (OrderID, ProductID) );
|
And a few pattern information:
INSERT dbo.Clients (CustomerID, Title) VALUES (1,N‘Aaron’), (2,N‘Bob’);
INSERT dbo.Classes (CategoryID, Title) VALUES(1,N‘Magnificence’), (2,N‘Grocery’);
INSERT dbo.Merchandise (ProductID, CategoryID, Title) VALUES (1,1,N‘Lipstick’), (2,1,N‘Mascara’), (3,2,N‘Strawberries’);
INSERT dbo.Orders (OrderID, CustomerID, OrderDate, OrderTotal) VALUES (1,1,getdate(),32.50), (2,2,getdate(),47.05);
INSERT dbo.OrderDetails (OrderID, LineItemID, ProductID, Amount) VALUES (1,1,1,5), (2,1,3,10);
|
Advertising says we need to ship an e-mail or give a reduction code to all the shoppers who’ve ordered a product from the wonder class. The preliminary try at a question for this is perhaps one thing like this:
SELECT c.CustomerID, c.Title FROM dbo.Clients AS c INNER JOIN dbo.Orders AS o ON c.CustomerID = o.CustomerID INNER JOIN dbo.OrderDetails AS od ON o.OrderID = od.OrderID INNER JOIN dbo.Merchandise AS p ON od.ProductID = p.ProductID INNER JOIN dbo.Classes AS cat ON p.CategoryID = cat.CategoryID WHERE cat.Title = N‘Magnificence’;
|
The plan doesn’t look so dangerous (but):
And in native or take a look at information, the output may look proper, since we could have inserted a single row into OrderDetails
to match our standards (and to make our assessments go). However what if I’ve ordered two merchandise from the wonder class (in the identical order, or throughout a number of orders)?
INSERT dbo.OrderDetails (OrderID, LineItemID, ProductID, Amount) VALUES(1,2,2,1);
|
Now the question returns that buyer twice! We actually don’t need to ship them two e-mails, or concern a number of low cost codes to the identical buyer. And the plan, by itself, can’t actually present any apparent clues that there are duplicate rows:
However you positive will discover if you happen to examine the outcomes, or an finish consumer will discover if you happen to unleash this in manufacturing. The short repair tends to be: slap a giant ol’ DISTINCT
on there which, certainly, fixes the symptom by eliminating duplicates:
SELECT DISTINCT c.CustomerID, c.Title ——-^^^^^^^^ FROM dbo.Clients AS c INNER JOIN dbo.Orders AS o ON c.CustomerID = o.CustomerID INNER JOIN dbo.OrderDetails AS od ON o.OrderID = od.OrderID INNER JOIN dbo.Merchandise AS p ON od.ProductID = p.ProductID INNER JOIN dbo.Classes AS cat ON p.CategoryID = cat.CategoryID WHERE cat.Title = N‘Magnificence’;
|
However at what price? A distinct sort, that’s what!
If I’m testing adjustments to this question in my native surroundings, and perhaps simply testing the output and that it returned the info rapidly, I would miss clues within the plan and be fairly glad that including DISTINCT
fastened the difficulty with out impacting efficiency.
This can solely worsen with extra information.
And whereas we might spend a number of time tuning indexes on all of the concerned tables to make that kind harm much less, this multi-table be part of is at all times going to supply rows you by no means in the end want. Take into consideration SQL Server’s job: sure, it must return appropriate outcomes, however it additionally ought to try this in probably the most environment friendly means attainable. Studying all the info (after which sorting it), solely to throw away some or most of it, could be very wasteful.
Can we specific the question with out DISTINCT?
Once I know I must “be part of” to tables however solely care about existence of rows and never any of the output from these tables, I flip to EXISTS
. I additionally attempt to eradicate trying up values that I do know are going to be the identical on each row. On this case, I don’t want to hitch to Classes
each time if CategoryID
is successfully a continuing.
One solution to specific this identical question, guaranteeing no duplicate prospects and, hopefully, decreasing the price of sorting:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
DECLARE @CategoryID int;
SELECT @CategoryID = CategoryID FROM dbo.Classes WHERE Title = N‘Magnificence’;
SELECT c.CustomerID, c.Title FROM dbo.Clients AS c WHERE EXISTS ( SELECT 1 FROM dbo.OrderDetails AS od INNER JOIN dbo.Orders AS o ON od.OrderID = o.OrderID INNER JOIN dbo.Merchandise AS p ON od.ProductID = p.ProductID WHERE o.CustomerID = c.CustomerID AND p.CategoryID = @CategoryID );
|
There’s a easy, extra index search towards Classes
, in fact, however the plan for the general question has been made drastically extra environment friendly (we’re all the way down to 2 scans and a couple of seeks)
One other solution to specific the identical question is to pressure Orders
to be scanned later:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
DECLARE @CategoryID int;
SELECT @CategoryID = CategoryID FROM dbo.Classes WHERE Title = N‘Magnificence’;
WITH cte AS ( SELECT CustomerID FROM dbo.Orders AS o WHERE EXISTS ( SELECT 1 FROM dbo.OrderDetails AS od INNER JOIN dbo.Merchandise AS p ON od.ProductID = p.ProductID WHERE od.OrderID = o.OrderID AND p.CategoryID = @CategoryID ) ) SELECT cust.CustomerID, cust.Title FROM dbo.Clients AS cust INNER JOIN cte ON cte.CustomerID = cust.CustomerID;
|
This may be helpful in case you have extra Orders
than Clients
(I actually hope that’s the case). Discover within the plan that we nonetheless have two scans and two seeks, however Orders
is scanned later, and Merchandise
is scanned as an alternative of Clients
.
Total, fewer rows transfer via the plan, and that is mirrored in barely decrease reads. This can be amplified by extra rows within the desk, wider rows basically, and extra rows filtered out earlier.
Conclusion
DISTINCT
is commonly hiding flaws within the underlying logic, and it may possibly actually repay to discover different methods to put in writing your queries with out it. There was one other attention-grabbing use case I wrote about a few years ago that confirmed how altering DISTINCT
to GROUP BY
– regardless that it carries the identical semantics and produces the identical outcomes – may help SQL Server filter out duplicates earlier and have a critical affect on efficiency.