Well, do you know? You’re the DBA, the guy in charge, the guy who gets pounced on first thing in the morning with complaints of “my report took FOREVER to run last night, why?”, or “customers are saying the web site was really slow at 3:00am this morning, why?”. You’re expected to be able to answer questions like that. Can you? Do you know what your servers are doing when you’re not looking?
As SQL Server DBAs, we’re fortunate to have several different available methods of answering these questions. There are commercial third-party products that will show your server activity as pretty pictures that managers can understand. There are commercial products that can analyze your transaction logs to show you what happened during a specific period of time.
The masochists in the crowd might enjoy running server-side traces around the clock to capture every SQL statement – free and built into SQL Server, but cumbersome to work with and store, especially on a busy system. Plus, traces are being phased out in favor of Extended Events, which are more powerful and flexible, but yet another complicated thing to learn. Personally, I’ve found capturing trace logs around the clock to be extreme overkill for most needs. My busy OLTP system generates hundreds of gigabytes of trace log files every day; mining through them to find “that problem” is impossible.
There are tons of free tools out there, T-SQL scripts, applications, you name it. It’s easy to cobble together a complicated monitoring “system” that tries to deliver the world. The result is usually a giant collection of “data” that is useless for anything, let alone answering the “what happened at 3:00am this morning?” questions. For most of us, we just need a simple, lightweight tool to tell us when the system is (or was) unhappy, and what was happening during that time to make it unhappy. It’s crazy simple to build such a tool – I’m going to show you how.
My system hinges on knowing what SQL Server is waiting on. If the server is waiting on something, it’s unhappy. Many people focus on knowing what the “top” queries are, but that, in my opinion, is the wrong approach. There will ALWAYS be a top query – so what? If the server is able to execute all of the queries submitted to it (even the “top” one) without waiting on anything, do we really care? No – we only care if the server has to wait on something, anything, while trying to run our queries. Having said that, the server will always be waiting on something. The key is to understand what resources it’s waiting on, understand why it’s waiting on them, and then either accept or eliminate those waits.
SQL Server will tell us exactly what it’s waiting on; all we have to do is ask it. The dynamic management view (DMV) sys.dm_os_wait_stats tells us all of the wait types, including the number and duration of those waits, that have accumulated since the SQL instance was last started, OR since the wait stats were last reset using DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR). By itself, this DMV is useful for assessing the overall health of the instance. If the instance is generally I/O bound, or suffering from CPU contention, you should be able to see that here. This doesn’t help answer the “3:00am” question though. For this, we need to do something a little differently with this DMV.
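Asking is a one-liner. Something along these lines shows the top accumulated waits; note that the excluded wait types here are just a small, non-exhaustive sample of benign background waits, so treat that filter list as an assumption to expand for your own environment:

```sql
-- Top waits accumulated since instance start (or since the last reset).
-- The NOT IN list below is a minimal sample of harmless "idle" waits;
-- real filter lists are much longer.
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'SQLTRACE_BUFFER_FLUSH', N'XE_TIMER_EVENT',
                        N'BROKER_TASK_STOP', N'WAITFOR')
ORDER BY wait_time_ms DESC;
```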
Rather than knowing what the instance has waited on since it was last started, we need to know what it was waiting on during specific periods of time. We can get this by capturing the contents of the DMV, saving them someplace, resetting the DMV, and then collecting its contents again after a period of time. In order to do this, you need:
- a table to store the collected data
- a way to clean up old data, so that our collection doesn’t fill up our database (monitoring should not affect the system being monitored)
- a method of running the data collection in recurring intervals
This T-SQL script, when run from a SQL Agent job, provides all of these things:
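The original script isn’t reproduced here, so here is a minimal sketch of the idea. The WaitStats_Log table and CollectionTime column match the names used in the lookup query later in this article; the specific column list and the 30-day retention window are my assumptions:

```sql
-- One-time setup: the collection table (names follow the article's
-- lookup query; the column choices are an assumption).
IF OBJECT_ID(N'dbo.WaitStats_Log') IS NULL
    CREATE TABLE dbo.WaitStats_Log
    (
        CollectionTime      datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
        wait_type           nvarchar(60) NOT NULL,
        waiting_tasks_count bigint       NOT NULL,
        wait_time_ms        bigint       NOT NULL,
        signal_wait_time_ms bigint       NOT NULL
    );

-- Step 1 (recurring, from the Agent job): capture the waits
-- accumulated since the last reset.
INSERT INTO dbo.WaitStats_Log
    (wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;

-- Step 2: reset the DMV so the next collection covers only the next interval.
DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR);

-- Step 3: purge old rows so the monitoring doesn't fill the database
-- (30 days is an arbitrary retention choice).
DELETE FROM dbo.WaitStats_Log
WHERE CollectionTime < DATEADD(DAY, -30, SYSDATETIME());
```

Each collected batch of rows then represents the waits for exactly one interval, which is what makes the “what was it waiting on at 3:00am?” question answerable.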
I run this on a 15-minute recurring schedule. That gives me enough granularity to see at which points in time I had a problem, with a narrow enough window in which to look for the source of the problem. You might adjust this based on your needs, but I wouldn’t recommend collecting any less frequently than every 15 minutes.
With this collection in place, the next time somebody says to you “the system was slow at 3:00am, why?”, you’ll respond with “let me see what it was waiting on at that time”, and you’ll check your collection table by running a query like this:
SELECT * FROM WaitStats_Log WHERE CollectionTime >= '2013-11-24 02:45' AND CollectionTime <= '2013-11-24 03:15' ORDER BY wait_time_ms DESC;
From the output of that query, you can see exactly what SQL Server spent the most time waiting on. Some of the common wait types that you’ll see, and what they mean:
- CPU/memory – on a healthy system, this will be the most common wait type reported. Disks are fast enough, there is minimal contention in the database, everything that is requested is quickly retrieved – and goes where? Into memory, where it is processed via CPU.
- CXPACKET – this is a parallelism wait. When a big, ugly, inefficient query comes in, and SQL Server is allowed to use multiple processors to “do work”, it will break that big, ugly, inefficient query up into chunks and process those chunks in parallel. As each chunk finishes, it has to wait on the other chunks to catch up – this is recorded as CXPACKET wait time. How do you eliminate CXPACKET waits? By finding out what the slow chunks are waiting on (another high wait type) and eliminating that, usually by tuning code.
- WRITELOG – this CAN indicate the need to put your transaction logs on faster disks, but on an otherwise healthy system, this will be one of your higher wait types.
- ASYNC_NETWORK_IO – if you see a query waiting on this, it’s an application problem. This wait type means that SQL is trying to push results out to the app that requested them, but the app isn’t consuming them fast enough.
- PAGEIOLATCH_xx – this is an I/O wait. SQL Server is trying to read a page of data from disk and has to wait for the page to be loaded into memory. This could indicate a disk problem, but more likely it indicates that the server is being slammed by queries that are doing full scans or retrieving too much data. This is a good indicator of a missing index, or a bad WHERE clause in your query. (Its cousin PAGELATCH_xx, without the “IO”, is not a disk wait – it’s contention on pages already in memory, classically tempdb allocation contention.)
- LCK_XXX – anything that starts with “LCK” is a lock wait, which means blocking. A query that is waiting on a lock is being blocked by another query. Before trying to fix the waiting query, find the one that is blocking and fix it first. This can also indicate missing or inadequate indexes.
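With the log populated, a query along these lines (a sketch against the WaitStats_Log table used above; the top-5 cutoff is arbitrary) surfaces the heaviest waits in each collection interval, rather than for the whole day:

```sql
-- Top 5 waits per collection interval, using a window function to rank
-- waits within each CollectionTime batch.
WITH ranked AS
(
    SELECT CollectionTime,
           wait_type,
           wait_time_ms,
           ROW_NUMBER() OVER (PARTITION BY CollectionTime
                              ORDER BY wait_time_ms DESC) AS rn
    FROM dbo.WaitStats_Log
)
SELECT CollectionTime, wait_type, wait_time_ms
FROM ranked
WHERE rn <= 5
ORDER BY CollectionTime, rn;
```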
Put this in place, and after a few hours you’ll have a pretty good idea of where your resource bottlenecks are.
But wait, there’s more! You now have the ability to see what your server was waiting on at 3:00am, but can you see what it was actually doing? No, not yet, but I have one more toy in the toybox to show you. This one comes to us courtesy of Adam Machanic, in the form of his sp_WhoIsActive script. If you’re not already using this on your servers, turn in your DBA card now – you’re doing it wrong.
Adam’s done all the work; all I’ve done is plug his script into the same rotating-collection, recurring-job framework outlined above. This script uses Adam’s sp_WhoIsActive to collect current processes for later analysis:
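Again, the original script isn’t shown here, so this is a sketch. The @get_plans, @return_schema/@schema, and @destination_table parameters are real sp_WhoIsActive parameters; the WhoIsActive_Log table name and the 30-day retention are my assumptions:

```sql
-- One-time setup: let sp_WhoIsActive generate the destination table DDL
-- so the table matches the columns the chosen parameters produce.
DECLARE @schema nvarchar(max);
EXEC dbo.sp_WhoIsActive
    @get_plans = 1,
    @return_schema = 1,
    @schema = @schema OUTPUT;
SET @schema = REPLACE(@schema, N'<table_name>', N'dbo.WhoIsActive_Log');
IF OBJECT_ID(N'dbo.WhoIsActive_Log') IS NULL
    EXEC (@schema);

-- Recurring step (from the Agent job): snapshot whatever is running now,
-- including query text and plans, into the log table.
EXEC dbo.sp_WhoIsActive
    @get_plans = 1,
    @destination_table = N'dbo.WhoIsActive_Log';

-- Purge old rows; collection_time is the timestamp column sp_WhoIsActive
-- adds when writing to a destination table.
DELETE FROM dbo.WhoIsActive_Log
WHERE collection_time < DATEADD(DAY, -30, SYSDATETIME());
```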
I’m not interested in capturing every query that runs, I only want to know about the long-running ones. Your definition of “long-running” will vary depending on your needs. For my OLTP system, anything that runs for longer than 1 minute is a potential problem. For that reason, I run this script in 1-minute intervals.
This collection tells me everything about the processes that were running. I can see the query text, the query plan, the I/O and CPU statistics – everything. If my wait-stats collection tells me that at 3:00am the top wait was one of the LCK waits, and my sp_WhoIsActive collection shows that at 3:00am a long-running query was doing a large table scan, it’s a pretty safe bet that this query was the cause of the slowdown. The next step is to tune the query: rewrite the code, add or tweak an index, whatever is necessary.
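Putting the two collections side by side for the problem window can be done in one query. This is a sketch using the table names from the earlier examples; session_id, sql_text, and blocking_session_id are standard sp_WhoIsActive output columns, and the 15-minute join window matches the wait-stats collection interval:

```sql
-- Pair each wait-stats interval in the problem window with the long-running
-- requests captured during that same interval.
SELECT w.CollectionTime,
       w.wait_type,
       w.wait_time_ms,
       a.session_id,
       a.blocking_session_id,
       a.sql_text
FROM dbo.WaitStats_Log   AS w
JOIN dbo.WhoIsActive_Log AS a
  ON a.collection_time >= DATEADD(MINUTE, -15, w.CollectionTime)
 AND a.collection_time <  w.CollectionTime
WHERE w.CollectionTime BETWEEN '2013-11-24 02:45' AND '2013-11-24 03:15'
ORDER BY w.wait_time_ms DESC;
```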
This is not intended to be a full-blown monitoring solution. It’s intended to be a lightweight, better-than-nothing solution. If you’re a DBA of the “accidental” variety, one who just inherited SQL Server duty because you can spell “SQL”, chances are this solution is better than what you’re doing now. Give it a try – you might be surprised to see what your server is doing when you’re not looking.