The past few years have seen the rise of a strange beast on Minnesota’s roads – the roundabout. If you’ve done much driving in Minnesota, particularly in the Twin Cities, you’ve probably noticed that some drivers are absolutely flummoxed (I’ve always wanted to use that word in a blog post) by these creations. Some folks have reportedly gotten confused and spent days driving round-and-round the same roundabout, unable to figure out how to exit. It’s not that a roundabout is hard to navigate; it’s quite simple to get from one side to the other. Some people just can’t resist the urge to go in circles. Sort of like some T-SQL developers, usually the folks who grew up writing code in procedural languages and are now venturing into database-land for the first time.
There’s nothing inherently wrong with procedural coding; in fact, it’s quite appropriate in certain cases. It usually gets the job done, and it’s easily readable by most anyone. The issue is one of efficiency.
Think of your data as a pile of dirt. Your job is to move that pile from one place to another, from a table or set of tables into a report. There are a variety of tools to choose from for moving a pile of dirt. If the pile is small, you might choose a shovel, and that might actually be the best tool for the job. If the pile is more substantial, you might want to use a different tool. Something big and yellow, capable of moving the entire pile of dirt in one pass. You could, of course, stick with the shovel and follow the process of bend/scoop/turn/throw/bend/scoop/turn/throw, moving the pile one scoop at a time, but you’re going to get sweaty and dirty, and you’ll be sore in the morning.
What does any of this have to do with SQL Server? Patience, young data-juggler, I’m going to show you. If you have a copy of AdventureWorks available, you can try the following examples yourself. If not, just follow along with my screenshots.
Let’s propose that we’ve been asked to produce a report showing all sales reps with a quota greater than $250,000, and for each of those reps, the total dollar amount for items shipped during the first half of 2008. Sticking with our dirty metaphor and the notion of going in circles, let’s look at a couple of different approaches for producing this report.
For the first approach, we’re going to use a shovel. We’re going to dig into our pile of data, removing one row at a time, turning, then throwing that row into our process. We’re going to repeat this process (going in circles) until we’ve reached the bottom of the pile. We start by pulling out the sales reps that we’re interested in, those with a quota greater than $250,000. We turn, then “throw” those sales reps into a temp table. Then we start going in circles. For each of the sales reps in that temp table, we’re going to scoop up that rep’s sales information, total up the amount, and then throw that total back into the temp table. Once we find our way out of the circle, we’re going to display the contents of the temp table.
```sql
-- A procedural method
CREATE TABLE #SalesPersonShipTotals
(
    SalesPersonID INT,
    SalesQuota    MONEY,
    TotalShipped  MONEY
);

INSERT INTO #SalesPersonShipTotals (SalesPersonID, SalesQuota)
SELECT BusinessEntityID, SalesQuota
FROM Sales.SalesPerson
WHERE SalesQuota > 250000;

DECLARE @SalesPersonID INT;
DECLARE @TotalShipped MONEY;

DECLARE SalesPersonCursor CURSOR FOR
    SELECT SalesPersonID FROM #SalesPersonShipTotals;

OPEN SalesPersonCursor;

FETCH NEXT FROM SalesPersonCursor INTO @SalesPersonID;

WHILE @@FETCH_STATUS = 0
BEGIN

    SELECT @TotalShipped = SUM(SubTotal)
    FROM Sales.SalesOrderHeader
    WHERE SalesPersonID = @SalesPersonID
      AND ShipDate >= '2008-01-01'
      AND ShipDate <  '2008-07-01';

    UPDATE #SalesPersonShipTotals
    SET TotalShipped = @TotalShipped
    WHERE SalesPersonID = @SalesPersonID;

    FETCH NEXT FROM SalesPersonCursor INTO @SalesPersonID;

END

CLOSE SalesPersonCursor;
DEALLOCATE SalesPersonCursor;

SELECT SalesPersonID, SalesQuota, TotalShipped
FROM #SalesPersonShipTotals
ORDER BY SalesPersonID;
```
Make sense? Good. If you’re thinking “I wouldn’t have done it that way”, even better, you’re a step ahead of me.
A real man, the guy driving that big yellow machine, would take a completely different tack. Hopefully you and your shovel are out of the way. This guy is going to come in, scoop up the entire pile of data in one pass, ready to dump wherever you want.
```sql
-- A set-based method
SELECT
    SalesPersonID = SalesPerson.BusinessEntityID,
    SalesQuota    = SalesPerson.SalesQuota,
    TotalShipped  = SUM(SalesOrderHeader.SubTotal)
FROM Sales.SalesPerson
INNER JOIN Sales.SalesOrderHeader
    ON SalesPerson.BusinessEntityID = SalesOrderHeader.SalesPersonID
WHERE SalesPerson.SalesQuota > 250000
  AND SalesOrderHeader.ShipDate >= '2008-01-01'
  AND SalesOrderHeader.ShipDate <  '2008-07-01'
GROUP BY SalesPerson.BusinessEntityID, SalesPerson.SalesQuota
ORDER BY SalesPerson.BusinessEntityID;
```
This is nothing extraordinary, just a couple of joins and an aggregate. Basic T-SQL stuff. Both methods return the same results:
Some of you are scoffing at this point, thinking “If both methods return the correct results, why do I care which one I use? What’s the difference?” I won’t tell your DBA that you said that, but I will show you what the difference is.
The second method, the one using the joins, produces a nice, clean query plan: a couple of fast index seeks piped through a couple of other steps, pretty straightforward. Nobody will complain about a query plan like this one.
The first method produces a much different query plan:
I won’t bother explaining it, the picture says it all. You can clearly see that we went in circles a few times, running the same steps repeatedly. Wasteful and inefficient.
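If you want to compare the plans yourself without relying on my screenshots, SQL Server can return a text rendering of the estimated plan. A minimal sketch (the `GO` separators matter, since `SET SHOWPLAN_TEXT` must be alone in its batch, and note that queries are compiled but not executed while it’s on):

```sql
SET SHOWPLAN_TEXT ON;
GO

-- The estimated plan for this query is returned as rows of text;
-- the query itself is not executed while SHOWPLAN_TEXT is on.
SELECT BusinessEntityID, SalesQuota
FROM Sales.SalesPerson
WHERE SalesQuota > 250000;
GO

SET SHOWPLAN_TEXT OFF;
GO
```

In Management Studio you can get the same information graphically with “Include Actual Execution Plan” (Ctrl+M) before running the query.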
Let’s go a little deeper and look at what this means in terms of server I/O. Lots of I/O generally translates into poor performance. The second method, using joins and aggregation, hits three tables, resulting in a trivial number of reads against the database:
When we look at the I/O stats for the first method, we can again clearly see that we’re repeating some of the same work over and over again. There’s a lot more I/O generated, lots of repeated hits against the same tables. Bend, scoop, turn, throw, bend, scoop, turn, throw…
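To reproduce these I/O numbers yourself, turn on I/O statistics for your session before running either version; the per-table read counts land on the Messages tab in Management Studio (your exact numbers will vary with your AdventureWorks version and indexes):

```sql
-- Report logical/physical reads per table for every statement that follows.
SET STATISTICS IO ON;

-- Run the cursor version or the set-based version here.
-- The Messages tab then shows lines along the lines of:
-- Table 'SalesOrderHeader'. Scan count 1, logical reads ..., physical reads ...

SET STATISTICS IO OFF;
```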
Have I made my point? I hope so. If it’s not clear, that point is this: always think in terms of sets. Always try to process an entire data set as a whole instead of repeating the same work over and over again. Your DBAs will appreciate it, your end-users will appreciate it, and your code will run faster.
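One parting subtlety worth knowing about: the cursor version lists every rep with a quota over $250,000 even if they shipped nothing during the window (their TotalShipped simply stays NULL), while the INNER JOIN version silently drops those reps. If the two methods returned identical rows for you, it’s only because every qualifying rep happened to have shipments in that period. A LEFT JOIN keeps the set-based approach while preserving those rows; note that the date filter has to move into the ON clause so unmatched reps aren’t filtered away:

```sql
-- Set-based variant that, like the cursor version, keeps reps
-- with no shipments in the window (TotalShipped will be NULL for them).
SELECT
    SalesPersonID = SalesPerson.BusinessEntityID,
    SalesQuota    = SalesPerson.SalesQuota,
    TotalShipped  = SUM(SalesOrderHeader.SubTotal)
FROM Sales.SalesPerson
LEFT JOIN Sales.SalesOrderHeader
    ON  SalesOrderHeader.SalesPersonID = SalesPerson.BusinessEntityID
    AND SalesOrderHeader.ShipDate >= '2008-01-01'
    AND SalesOrderHeader.ShipDate <  '2008-07-01'
WHERE SalesPerson.SalesQuota > 250000
GROUP BY SalesPerson.BusinessEntityID, SalesPerson.SalesQuota
ORDER BY SalesPerson.BusinessEntityID;
```

Moving the date predicates from WHERE to ON is the key design choice here: in a WHERE clause they would turn the LEFT JOIN back into an INNER JOIN by discarding the NULL-extended rows.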