Availability Monitoring – Part 1

This is part 1 of a three part series on using home-grown up time monitoring for your SQL Servers.

Part 2 can be found here.

Part 3 can be found here.

I was recently asked by my company’s management to provide some key performance indicators (KPIs) for our critical SQL Servers. One of the KPIs I wanted to provide was an availability percentage of our critical SQL Servers. We already have tools in place to notify us when a server goes down, but what I was looking for was something that could provide some records I could report off of to determine my system availability – you know, the five nines-type reporting management loves to see.

There are many third party tools that will do this for you. They all cost money. I was looking for a method I could create myself and that could be used on any of my SQL Servers, including the two old moldy SQL 2000 servers I still have kicking around. I also wanted something lightweight. This process was going to run on our critical servers, so it had to have minimal impact on performance. I’ve come up with what I think is a pretty good solution. It’s not perfect and it has some limitations, but it works fairly well, I think. I’m going to spend a couple posts talking about the challenges I encountered and my solution.

The Self-Monitoring Problem

Determining when SQL Server started is not too hard and there are various ways of doing it:

Use the system DMV sys.dm_os_sys_info. Starting with SQL 2008, this view returns a column named sqlserver_start_time, which tells you when SQL Server started. If you are running an earlier version of SQL Server, this method won’t work.
You can check the file date of the tempdb database files. Since those are re-created each time SQL Server starts, this is one way to get the start up time.
Again, if you are running SQL 2005, 2008, or 2008 R2, you can use the login_time column of the sys.sysprocesses DMV and the knowledge that SPID 1 is always created when SQL Server starts. In fact, for this column, BOL states “For system processes, the time at which the SQL Server startup occurred is stored.” Except that line does not appear in the SQL 2012 BOL version, so I wouldn’t count on this anymore.
Use can use the sp_readerrorlog system stored procedure and parse the output for the messages that always appear at startup. However, this is an undocumented procedure.
If you want to really get into it, you could probably write some sort of batch file or Windows procedure that reads the Windows Event Log. You would likely need to enable xp_cmdshell on your server though, and this could be a security risk.

The bottom line is, it may not be pretty, but getting the SQL Server start time can be done.

In order to calculate an availability percentage, you not only need to know how long a server has been up, you need to know how long it’s been down. It’s difficult for a machine to monitor itself for downtime. If it’s down, it can’t do anything! The problem arises not from normal shutdown processes – those are usually logged somewhere and you can parse those logs when the system comes back up. The problem comes when you have to monitor system crashes. A crash, by definition, is a non-normal shutdown. Typically, no information is logged. When the machine comes back up, you might be able to determine the system crashed and is going through crash recovery at startup, but you have no way of determining if the crash lasted 5 minutes or 5 days.

Further, I’m really not interested in machine down time. I’m interested in the SQL Server service down time. The service could crash and the machine might keep on running. Because it crashed, there won’t be any notification written to a log. So what can a poor boy to do (except sing for a rock n roll band)?

Our server monitoring tool, Solarwind’s Orion, can monitor the SQL Server service, but it does not have any out-of-the-box report that shows downtime information. The canned reports also only show information for the last 24 hours and there is no correspondence between an up and down event. Orion does has the ability to let you create custom reports, but I don’t really want to learn another report creation tool and I’m not sure if I could overcome the 24 hour limitation. I don’t know off-hand what our Orion data retention policy is (although I’m sure it could be adjusted). If I want to report this information over a 1 year period, telling Orion to save 1 year’s worth of data could result in some serious database growth, as it monitors all our servers, not just our SQL Servers.

The Only Solution

Short of using another machine to monitor your server, like a how monitoring server is used in database mirroring, you really only have one option – you have to have a SQL job that runs every x minutes and logs the date and time to indicate SQL Server is still running. Even this isn’t perfect. Doing it this way means you are really just monitoring the uptime of the SQL Agent service, not the SQL Server engine. But if you set SQL Agent to run at startup and to restart if it stops unexpectedly, you’ll be pretty close to monitoring the actual SQL engine uptime.

In part 2, I’ll talk about my logging process and in part 3, I’ll cover how to use the data to report off of (which is actually pretty complicated).

4 thoughts on “Availability Monitoring – Part 1”

Pingback: Availability Monitoring – Part 2 - SQL Server - SQL Server - Toad World
Barry says:

September 9, 2013 at 10:54 pm

I still think that it is much better to choose the corresponding software to monitor databses than to invent some other ways of doing it. There are expensive tools and there are cheap or reasonable price software e.g. Anturis. You need only to look for some solutions.

Shaun says:

September 10, 2013 at 6:54 am

That’s a valid option. Part of the reason I worked to come up with my own method was to see if I could. I was interested in challenging my skills and possibly learning something new.

Pingback: Availability Monitoring – Part 3 - SQL Server - SQL Server - Toad World

Shaun J Stuart

Just another SQL Server weblog

Availability Monitoring – Part 1

The Self-Monitoring Problem

The Only Solution

4 thoughts on “Availability Monitoring – Part 1”

Leave a Reply to Barry Cancel reply