Using A Non-correlated Subquery To Avoid DISTINCT

donotduplicate

Nobody wants duplicate data. It’s messy, wastes space, and looks really bad on a report – I get that. What I don’t get is why so many people (developers, I’m lookin’ at you) think DISTINCT is the only way to avoid it. I’ve even run into folks who use DISTINCT on every query, you know, just in case. I realize it’s not all your fault, T-SQL makes it too easy to do things the wrong way. I’m writing this to show you the right way. You can avoid duplicate data without using DISTINCT, and in almost every case, you’ll find that your queries run faster and more efficiently.

Let’s look at an example. You’ll need the AdventureWorks2012 database. If you don’t have it running somewhere, first, what’s wrong with you, and second, you can download it by clicking here.

Imagine this typical, real-world scenario – your boss says “Build me a report that shows me which salespeople have sold red helmets, I want to take them all out on my yacht. And make it run as fast as possible, my time is valuable and I don’t like to wait.” You start coding, and in a matter of minutes, you’ve come up with this query:

-- Return list of salespersons who have sold red helmets
SELECT Person.LastName, Person.FirstName
FROM Production.Product
INNER JOIN Sales.SalesOrderDetail
 ON Product.ProductID = SalesOrderDetail.ProductID
INNER JOIN Sales.SalesOrderHeader
 ON SalesOrderDetail.SalesOrderID = SalesOrderHeader.SalesOrderID
INNER JOIN Person.Person
 ON SalesOrderHeader.SalesPersonID = Person.BusinessEntityID
WHERE Product.Name = 'Sport-100 Helmet, Red';

 

Does it work? Sort of – it returns a list of 853 salespeople, but the list looks something like this:

The results contain several duplicate rows. Technically it’s what the boss asked for, but probably not very useful when compiling a list of invitations for a yacht outing. You need to get rid of those dupes. Lucky for you, T-SQL offers the magic DISTINCT operator to do just that. Ten quick keystrokes later, your query now looks like this:

-- Return same list without duplicates.
SELECT DISTINCT Person.LastName, Person.FirstName
FROM Production.Product
INNER JOIN Sales.SalesOrderDetail
 ON Product.ProductID = SalesOrderDetail.ProductID
INNER JOIN Sales.SalesOrderHeader
 ON SalesOrderDetail.SalesOrderID = SalesOrderHeader.SalesOrderID
INNER JOIN Person.Person
 ON SalesOrderHeader.SalesPersonID = Person.BusinessEntityID
WHERE Product.Name = 'Sport-100 Helmet, Red';

 

This one, like the first, returns a list of salespeople, and this time only 17 of them, without dupes:

I think we’re done here, right? Except – the boss said to make it as fast as possible. Have you done that? Nope, sorry, you haven’t. Let’s turn on a couple of metrics and re-run the original query:

SET STATISTICS TIME ON
SET STATISTICS IO ON

-- Return list of salespersons who have sold red helmets
SELECT Person.LastName, Person.FirstName
FROM Production.Product
INNER JOIN Sales.SalesOrderDetail
 ON Product.ProductID = SalesOrderDetail.ProductID
INNER JOIN Sales.SalesOrderHeader
 ON SalesOrderDetail.SalesOrderID = SalesOrderHeader.SalesOrderID
INNER JOIN Person.Person
 ON SalesOrderHeader.SalesPersonID = Person.BusinessEntityID
WHERE Product.Name = 'Sport-100 Helmet, Red';

 

This produces (or should) output that looks something like this:

In this example, your original query, the one that returns duplicate values, generated just under 2700 database reads (that’s I/O, folks, or memory churn). On my VM, it took 90ms to complete.

Now do the same with your second query, the one with DISTINCT:

SET STATISTICS TIME ON
SET STATISTICS IO ON

-- Return same list without duplicates.
SELECT DISTINCT Person.LastName, Person.FirstName
FROM Production.Product
INNER JOIN Sales.SalesOrderDetail
 ON Product.ProductID = SalesOrderDetail.ProductID
INNER JOIN Sales.SalesOrderHeader
 ON SalesOrderDetail.SalesOrderID = SalesOrderHeader.SalesOrderID
INNER JOIN Person.Person
 ON SalesOrderHeader.SalesPersonID = Person.BusinessEntityID
WHERE Product.Name = 'Sport-100 Helmet, Red';

 

Similar performance stats – same number of reads, a few milliseconds faster, essentially the same performance as the original.

 

It’s tempting to give up at this point, convinced that you’ve done all that you can do, this query is as fast as it’s going to get. Sadly, giving up means risking the scorn and ridicule of your resident DBA, not the mention the guilt of knowing that you didn’t give your boss the fastest report possible. What to do?

Let’s try a third variation. No, no, trust me, you want to see this. Think about the question we’re really asking – which salespeople (identified by the BusinessEntityID) exist within the subset of sales orders that include red helmets? A query that roughly matches that question might look like this:

SET STATISTICS TIME ON
SET STATISTICS IO ON

SELECT Person.LastName, Person.FirstName
FROM Person.Person
WHERE Person.BusinessEntityID IN
 (
  SELECT SalesOrderHeader.SalesPersonID
  FROM Sales.SalesOrderHeader
  INNER JOIN Sales.SalesOrderDetail
   ON SalesOrderHeader.SalesOrderID = SalesOrderDetail.SalesOrderID
  INNER JOIN Production.Product
   ON SalesOrderDetail.ProductID = Product.ProductID
  WHERE Product.Name = 'Sport-100 Helmet, Red'
 );

 

Once again, 17 salespeople are returned, just like your DISTINCT query:

What about those I/O metrics? The first two queries produced 2700 reads, completing in approximately 80-90 milliseconds. The third one is a little more impressive:

 

Just over 100 reads (less than 4% of the first two queries). Execution time of 14ms, let’s call it 18% of the time required for the first two queries to run. A significant improvement in efficiency.

Ok, yes, I realize we’re talking about a difference of 60 MILLISECONDS here, but this is a tiny sample database – the first query returned a mere 853 duplicate rows, one for each sales transaction. Imagine a large-scale production database with millions or even billions of sales transactions. Do the math – if those 853 rows become 853 million, what happens to those I/O counts? What happens to the execution time? We’re suddenly no longer talking about milliseconds – we’re talking about real time now, minutes, maybe hours. That leads to resource contention, less-than-optimal performance, and unhappy users.
DISTINCT has its place, and it’s available for a reason, but avoid using it to cover up a bad query. Fix the query.