Availability Monitoring – Part 2

Last time, I talked about the issues with trying to get a server to monitor itself for downtime. In part 2 of this series, I’ll talk about my solution for monitoring.  In the final part, I’ll discuss how to use the logged data for reporting.

Part 1 can be found here.

Part 3 can be found here.

At this point, I’ve decided the best way to monitor a SQL Server for both up time and down time is to create a SQL job that runs every x minutes to log the fact that the server is still running. How often that job runs will determine, among other things, how accurate our reporting will be.

The Nines Of Availability

Here is an edited table from the Wikipedia article on high availability that translates the “nines” of availability percentages to down times per year:

Availability %          Downtime per year
90% (“one nine”)        36.5 days
99% (“two nines”)       3.65 days
99.9% (“three nines”)   8.76 hours
99.95%                  4.38 hours
99.99% (“four nines”)   52.56 minutes
99.999% (“five nines”)  5.26 minutes

We can tell right away that if we are looking to establish that our servers are running at a 99.999% availability level, we’re going to need to run our logging process at least every 5 minutes. It turns out, as I will demonstrate later, that to achieve that we would actually have to run the process every 2.5 minutes. (Microsoft recommends jobs be scheduled to run no more frequently than every 10 seconds.) I’ve opted to go for a maximum reporting resolution of “4 nines”, or 99.99%. This translates to 52.56 minutes of downtime per year.
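
If you want to check the arithmetic yourself, here is a quick back-of-the-envelope T-SQL sketch (my own illustration, not part of the monitoring process) that converts an availability percentage into downtime minutes per year and shows how the 2.5-minute figure falls out once you allow for the worst-case error of twice the logging interval, which I cover in the Margin Of Error section below.

-- Back-of-the-envelope only; assumes a 365-day year, like the table above
DECLARE @MinutesPerYear INT
SET @MinutesPerYear = 365 * 24 * 60      -- 525,600 minutes

SELECT  @MinutesPerYear * (1 - 0.9999)  AS FourNinesDowntimeMinutes,  -- 52.56
        @MinutesPerYear * (1 - 0.99999) AS FiveNinesDowntimeMinutes   -- 5.256

-- Worst case, the logging can be off by 2x the interval per down/up cycle
-- (see Margin Of Error below), so a five-nines budget of roughly 5.26
-- minutes per year forces the interval down to about 5.26 / 2 = 2.6 minutes.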

The other limiting factor to reporting our “nines level” is the datatype of the field we use to store our time stamp. Because I’m only looking to report a level of four nines, I’ve chosen to use a smalldatetime datatype for storing this information. I know smalldatetime is not recommended for new development, but because I want this process to work on servers as old as SQL 2000, this is what I am limited to. The accuracy of this datatype is one minute, so setting my logging process to run more often than once a minute would be pointless.
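
As a quick illustration of that one-minute accuracy (a throwaway demo query, not part of the process), smalldatetime simply rounds the seconds away:

-- smalldatetime stores no seconds; values are rounded to the nearest minute
SELECT  CAST('2013-08-21 12:00:29' AS SMALLDATETIME) AS RoundsDown,  -- 2013-08-21 12:00:00
        CAST('2013-08-21 12:00:30' AS SMALLDATETIME) AS RoundsUp     -- 2013-08-21 12:01:00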

Logging The Data

The next question we need to answer is: how do we want to log this data? I want this process to be very lightweight, not only in execution time but also in storage space. Although I’m only looking to report out to 4 nines of accuracy, I’ve decided my process will run every 5 minutes. I definitely don’t want to insert a new row of data into a table every 5 minutes, especially if I want to keep a year or more of historical data. That is definitely not lightweight on storage!

(Why did I choose 5 minutes? One reason is that we need to take into account the fact that a server may go up and down multiple times. If our logging period is too large, we might miss these events.)

To keep the table size down, we can use some sleight of hand with the logging and tailor our reporting process to take this into account later. I’ve decided to perform logging as follows:

  1. Every 5 minutes, run a procedure that checks my logging table. If the latest date in the logging table is 5 minutes or less old (the inequality, rather than an exact match, is there to account for rounding with the smalldatetime datatype), update that row: set the log time to the current time and add the difference between the current time and the previous log time stamp to the running count of minutes the server has been up.
  2. If the time difference is greater than five minutes, we will assume we have been down, and insert a new row into the table with the current time, starting the uptime minutes counter back at zero.

By doing it this way, if we have no downtime for days, we will still only have 1 row of data in the table, even when logging every 5 minutes. In fact, we will only generate a new row of data when we have down time. I think this does a pretty good job of keeping our log table small.

The table itself only needs two columns – a column for the log time and a column for the count of minutes of uptime. I’m also going to make the log time column a clustered primary key. I know, I know. Datetime columns make bad clustered indexes! But in this case, we’ll be OK. Let’s look at why.

Desired Properties Of Clustered Indexes

You typically want your clustered index keys to be four things: narrow, static, unique, and ever-increasing.

Narrow – Because the clustered index key is also included in every non-clustered index, we want the fields in a clustered index to use a minimum amount of storage space. The smalldatetime datatype uses 4 bytes of storage, the same amount an integer field uses. There also will not be any other indexes on this table, so we don’t have to worry about this key being duplicated in non-clustered indexes.
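
(If you want to verify the 4-byte claim yourself, a trivial demo query will do it:)

-- Both a smalldatetime and an int occupy 4 bytes
SELECT  DATALENGTH(CAST(GETDATE() AS SMALLDATETIME)) AS SmalldatetimeBytes,  -- 4
        DATALENGTH(CAST(1 AS INT))                   AS IntBytes             -- 4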

Static – Changing clustered index keys may result in page splits, which slow things down as SQL has to wait for large amounts of data to be moved around on disk. In our case, the only time a clustered index key changes is during an update of the last row of data. Because this will always be the last record on a data page (see Ever-Increasing, below), we will never encounter page splits.

Unique – Our log time stamp values will be unique. See below.

Ever-Increasing – I’m pretty sure Steve Miller was referring to the current_timestamp function when he sang that “time keeps on slipping into the future.” Time marches on and our log time stamps will always be increasing.

Putting It All Together

With all that out of the way, we can now concentrate on creating our logging table and the stored procedure that will be called every 5 minutes. Here’s the code to create the logging table and the stored procedure to update it. I’ve written the procedure to automatically create the table if it is not present. This was really just to make things easier for me, both when I was developing this process and when I deployed it to multiple servers.

CREATE TABLE [dbo].[UpTimeTracking](
	[LogDate] [smalldatetime] NOT NULL,
	[UptimeMinutes] [int] NULL,
 CONSTRAINT [PK_UpTimeTracking] PRIMARY KEY CLUSTERED
(
	[LogDate] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
 ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

CREATE PROCEDURE [dbo].[up_UpTimeTracking]
	(@TimeCheckIntervalInMinutes SMALLINT = 5)

AS

SET NOCOUNT ON

DECLARE	@CurrentTime SMALLDATETIME

 	/* @TimeCheckIntervalInMinutes  - This is how often this code will be run as a scheduled
		job. If code execution is more than this many minutes apart, the server will
		be considered to have experienced downtime and a new record is started. Minimum
		interval is 1 minute. */

SET	@CurrentTime = current_timestamp

IF object_id('UpTimeTracking') IS NULL
BEGIN
	CREATE TABLE UpTimeTracking
		(LogDate SMALLDATETIME NOT NULL
		 ,UptimeMinutes int)
	ALTER TABLE UpTimeTracking ADD CONSTRAINT
		PK_UpTimeTracking PRIMARY KEY CLUSTERED
		(LogDate)
END

IF NOT EXISTS (SELECT 1 FROM UpTimeTracking)
BEGIN
	INSERT INTO UpTimeTracking
		(LogDate,
		 UptimeMinutes)
		 VALUES
		(@CurrentTime,
		0)
END

IF	((SELECT DATEDIFF(mi,MAX(LogDate),@CurrentTime) FROM UpTimeTracking) > (@TimeCheckIntervalInMinutes +1))
BEGIN

	/* insert new row - assume downtime occurred */

	INSERT INTO UpTimeTracking
		(LogDate,
		 UptimeMinutes)
		 VALUES
		(@CurrentTime,
		0)
END
ELSE
BEGIN

	/* update last row - no downtime occurred (or downtime was
	   less than @TimeCheckIntervalInMinutes) */

	UPDATE UpTimeTracking
	SET	LogDate = @CurrentTime,
		UptimeMinutes = UptimeMinutes + DATEDIFF(mi,LogDate,@CurrentTime)
	WHERE	LogDate = (SELECT MAX(LogDate) FROM UpTimeTracking)
END

GO

So far, so good. We’ve got a table to store our log times and a procedure we can run every 5 minutes to update the table.
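
Here is roughly what the SQL Agent job might look like. Treat this as a sketch, not my exact deployment script: the job name, schedule name, and the database the objects live in (I’m using a placeholder DBA database called DBAdmin here) are assumptions you would change to fit your environment.

USE msdb
GO

-- Sketch only: job, schedule, and database names are placeholders
EXEC dbo.sp_add_job
	@job_name = N'Uptime Tracking'

EXEC dbo.sp_add_jobstep
	@job_name = N'Uptime Tracking',
	@step_name = N'Log uptime',
	@subsystem = N'TSQL',
	@database_name = N'DBAdmin',    -- wherever you created UpTimeTracking and the procedure
	@command = N'EXEC dbo.up_UpTimeTracking @TimeCheckIntervalInMinutes = 5'

EXEC dbo.sp_add_jobschedule
	@job_name = N'Uptime Tracking',
	@name = N'Every 5 minutes',
	@freq_type = 4,               -- daily
	@freq_interval = 1,           -- every day
	@freq_subday_type = 4,        -- subday unit: minutes
	@freq_subday_interval = 5     -- every 5 minutes

EXEC dbo.sp_add_jobserver
	@job_name = N'Uptime Tracking',
	@server_name = N'(local)'
GO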

Interpreting The Data

I mentioned that I’ve made the logging process lightweight on purpose. The tradeoff to this comes when we need to report on the data. We need to really think through what exactly the data is telling us. First, let me give you some sample data so we have something to work against:

LogDate               UptimeMinutes
2013-08-21 12:00:00   100
2013-08-21 14:15:00   115
2013-08-22 08:15:00   1070
2013-09-03 08:45:00   17300

Let’s look a bit closer at this. Recall how our data is logged – every 5 minutes, the LogDate value is checked and, if the old time is within 5 minutes of the current time, the LogDate value is updated to the current time and UptimeMinutes is incremented by the difference between the current time and the old LogDate value. This means the time stored in the LogDate field is the end of an uptime window. In this case, the first record is stating the uptime window started at 8/21/13 10:20 AM, or 100 minutes prior to 8/21/13 12:00. This is very important to keep in mind when reporting off of this table. (As an aside, the last row represents the current time period, which, of course, will be out of date when this article is published. For the sake of discussion, let’s assume the time right now is 9/3/13 8:46 AM.)
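
Because that “end of an uptime window” detail is easy to forget, here is a small illustrative query (not part of the logging process) that recovers the start of each window – which is exactly where the 10:20 AM figure above comes from:

-- LogDate marks the END of an uptime window; back out UptimeMinutes to get its start
SELECT  DATEADD(mi, -UptimeMinutes, LogDate) AS UptimeWindowStart,
        LogDate                              AS UptimeWindowEnd,
        UptimeMinutes
FROM    UpTimeTracking
ORDER BY LogDate
-- First sample row: 2013-08-21 10:20:00 through 2013-08-21 12:00:00 (100 minutes)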

I’ve found pictures to be very helpful in analyzing this data. Here is the above table represented in a graphical format:

[Figure: Timeline of the uptime windows in the sample data]

(This is not to scale, obviously.) This is very helpful when we start talking about reporting downtimes in the next article. The blue periods represent times the server is up and running.

Margin Of Error

Let’s take a look at the margins of error we might encounter when logging data this way. We want to identify what situation would give us the maximum erroneous data. In this case, “erroneous data” can be one of two things: an uptime period that was shorter than was reported or a downtime period that is reported as being longer than it actually was.

The first scenario could occur when the server goes down. If we are logging our data every x minutes, our worst-case scenario would be if the system crashed immediately after we wrote our log record. Because we only write the record every x minutes, we are assuming that we are up and running for those x minutes between log updates. Thus, crashing immediately after a record is written means we would assume we were up for x minutes when we were actually down for those x minutes.

What about when the server comes back up? If the logging is set to run via a SQL Agent scheduled job, we’d get our most erroneous data if the server came up immediately after the job was supposed to run. In this case, the server would be up and running for x minutes before the job ran and logged the fact that it was up, thus our downtime would appear to be x minutes more than it actually was. This is the second scenario.

Putting these together, we can see that when the server goes down, our logging could be off by x minutes and when the server comes up, we could be off by x minutes, for a total margin of error of 2x minutes.

(Note that you can cut this in half by setting your logging job to also run on SQL Server startup. This would eliminate the error on the server startup side, and your margin of error would then be only x minutes. I am not in favor of this approach because it means you would have to put the logging procedure and table in the master database, a place where, according to best practices, user data should not be stored. Why master? Because startup jobs may run before SQL has brought all the user databases online. SQL needs the master database to be up before it can run, so that is the only database you can guarantee will be available when a startup job executes.)
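
(For reference, the “run on startup” variant is mechanically just a second schedule on the same job – a sketch below, reusing the placeholder job name from the earlier Agent example – but the caveat above about the step having to target master still applies.)

-- Sketch only: attach an "on Agent startup" schedule to the placeholder job above
EXEC msdb.dbo.sp_add_jobschedule
	@job_name = N'Uptime Tracking',
	@name = N'On Agent startup',
	@freq_type = 64    -- run when the SQL Server Agent service starts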

This margin of error is one reason why I set my logging job to run every 5 minutes – it gives me a maximum error of 10 minutes. Note that this is the error per down/up cycle! If your reporting period spans multiple up/down cycles, your error will be a maximum of 2x minutes per cycle!

My final article in the series will talk about how we must keep all this information in mind when we create reports using this data and how we have to detect and account for up/down cycles when reporting.
