This is the final almost final part of my series of articles about creating a process to monitor your SQL Server for downtime. Part 1 can be found here. Part 2 can be found here.
Reporting
Now that we’ve got a process in place to gather data, we need to create a way to report off of that data. We designed our data recording process to be as lightweight as possible at the expense of adding possible complexity on the reporting side. Now is when we have to pay that bill.
Let me reprint the image representing our sample data:
This is a visual representation of the data in the following table:
LogDate |
UptimeMinutes |
2013-08-21 12:00:00 | 100 |
2013-08-21 14:15:00 | 115 |
2013-08-22 08:15:00 | 1070 |
2013-09-03 08:45:00 | 17300 |
Again, for the sake of our discussion, let’s assume the current time is 9/3/13 8:46 AM and the server is still running.
The blue bars represent periods where our SQL Server is up and running. The times between are times the server is down. We want to query the data and find out, for a given date range, what was the availability percentage of the server. Calculating this can be broken down into 4 possible scenarios:
- Both the start and end times of the reporting window fall within a period of the server being up
- The report start time falls within a server downtime period and the report end time falls in a server uptime period
- The report start time falls within a server uptime period and the report end time falls in a server downtime period
- Both the start and end times of the reporting window fall within a server downtime period.
Why do we need to break things out this way? For a couple reasons. First, as we demonstrated last time, in order to calculate how accurate our monitoring data is we need to know how many up/down periods occurred within the reporting interval. Second, if the report interval falls completely within one type of period (either up or down), the calculations are greatly simplified.
I’ll walk through the calculation process for each scenarios before presenting the code.
Scenario 1
In this scenario, both the start and end times of the reporting interval fall within a period where the server was up. The generic situation looks like this:
The red circle represents the start of the reporting period and the red rectangle isn the end of the reporting period. There could be more than 1 uptime period in the reporting interval, although I only show one here. The total minutes the system has been available between these two points can be found by adding:
- The total uptime of all complete uptime periods completely within the report window (8/21 between 12:20 and 14:14, or 115 minutes, in this case) and
- The uptime from the start of the reporting interval to the end of the uptime period that falls in (from the red circle to 8/21 12:00, in this case) and
- The uptime from the start of the uptime period the report end time is in to the report end time (8/21 14:25 to the red rectangle, in this case)
These calculations require us to determine which time period the report endpoints fall in. In doing this, we can determine if we are in following simplified case:
Here, the total uptime is simply the difference between the report start and end dates.
Our calculations for our margin of error, also require us to count the total number of uptime periods the report covers.
Scenario 2
In this scenario, the start time of the reporting interval falls during a server downtime period and the end time falls within a period where the server was up. The generic situation looks like this:
In this case, the total uptime can be found by adding:
- The total uptime of all uptime periods completely within the reporting interval and
- The uptime from the start of the uptime interval the report end time falls in to that endtime.
Scenario 3
In this scenario, the start time of the reporting interval falls during a server uptime period and the end time falls within a period where the server was down. The generic situation looks like this:
In this case, the total uptime can be found by adding:
- The total uptime of all uptime periods completely within the reporting interval and
- The uptime from the start of the report interval to the end of the uptime period the report start time fall in.
Scenario 4
In this final scenario, both the start and end times of the reporting interval fall within a period where the server was down. The generic situation looks like this:
In this case, the total uptime can be found by adding:
- The total uptime of all uptime periods completely within the reporting interval
Like the first scenario, there is a simplified case here, which looks like this:
Obviously, in this case, the total uptime will be zero.
That is all fairly straightforward. The tricky part comes where we try to translate those statements into T-SQL code. An extra degree of difficultly is added because our log table does not store the start time of each period. It only stores the end time and the length of the period.
Here is the code to calculate our availability percentage. The output of this procedure will be four columns containing the following text and information:
Uptime Minutes: x
Margin Of Error (Minutes): x
Uptime Percentage for server <servername>: xx.xx%
Uptime Percentage (including margin of error for <servername>: xx.xx%
The comments in the code should be helpful in explaining what is going on. The code takes two input parameters – the start and end times of the desired reporting interval. If these are not supplied, the code reports over the entire timeframe represented in the table.
The variable @MarginOfErrorMultiplier should be set to 2 unless you also have the procedure run on SQL Server startup, in which case it should be set to 1. I talked about this in the Margin Of Error section in part 2. Also related to the margin of error calculation is what I refer to as a “margin of error modifier” in the code. Recall that our margin of error comes into play every time the server goes up or down. So whenever our report interval crosses an up/down or down/up transition, we can encounter some error. I’ve defined “one margin of error unit” to be one complete cycle – up/down and then down /up (or vice-versa). Therefore, for scenarios 2 and 3 where the report interval includes incomplete cycles, we have to add in a half of a margin of error unit. And for the special cases of scenarios 1 and 4, there are no up/down or down/up transitions included, so there is no error at all.
You’ll need to modify the code to set the @ProcedureRunTimeIntervalInMinutes variable to be however often you run the logging procedure. Here, it is set to 5 minutes.
CREATE PROCEDURE [dbo].[up_CalculateServerUpTime] ( @ReportStartDateTime SMALLDATETIME = NULL ,@ReportEndDateTime SMALLDATETIME = NULL ) AS /* shaunjstuart.com */ DECLARE @ReportStartDateTimeInUptimeWindow BIT DECLARE @ReportEndDateTimeInUptimeWindow BIT DECLARE @NextWindowStartDateTimeReportStart SMALLDATETIME DECLARE @NextWindowStartDateTimeReportEnd SMALLDATETIME DECLARE @NextWindowUptimeMinutes INT DECLARE @TotalReportPeriodUptimeMinutes INT DECLARE @UptimePercentage DECIMAL(10, 3) DECLARE @UptimePercentageWithError DECIMAL(10, 3) DECLARE @MarginOfErrorMultiplier TINYINT = 2 /* This should be 1 if this procedure is set to run on server startup, 2 otherwise */ DECLARE @IntervalCount DECIMAL(3, 1) /* Number of up/down intervals in reporting period. Used for margin of error calculation */ DECLARE @ProcedureRunTimeIntervalInMinutes SMALLINT /* Set this to be how often this procedure is run. Used in margin of error calculation */ /* If no parameters are passed in, assume the oldest and most recent values for use */ IF @ReportStartDateTime IS NULL SELECT @ReportStartDateTime = DATEADD(mi, -UptimeMinutes, LogDate) FROM UpTimeTracking WHERE logDate = (SELECT MIN(logDate) FROM UptimeTracking ) IF @ReportEndDateTime IS NULL SELECT @ReportEndDateTime = MAX(LogDate) FROM UpTimeTracking /* Validate inputs and Initialize Variables */ IF @ReportStartDateTime >= @ReportEndDateTime BEGIN SELECT 'Start time must be before end time' RETURN END IF @ReportStartDateTime < (SELECT DATEADD(mi, -UptimeMinutes, LogDate) FROM UpTimeTracking WHERE LogDate = (SELECT MIN(LogDate) FROM UpTimeTracking ) ) BEGIN SELECT 'Start time is before logging started.' RETURN END IF @ReportEndDateTime > (SELECT MAX(LogDate) FROM UpTimeTracking ) BEGIN SELECT 'Requested report end time is beyond the last logging period. No data is available.' RETURN END SET @IntervalCount = 0.0 SET @ProcedureRunTimeIntervalInMinutes = 5 IF @ProcedureRunTimeIntervalInMinutes IS NULL BEGIN SELECT 'Must define how often logging occurs.' RETURN END IF @ProcedureRunTimeIntervalInMinutes <= 0 BEGIN SELECT '@ProcedureRunTimeIntervalInMinutes must be greater than zero.' RETURN END /* Check where ReportStartDateTime falls */ SELECT TOP 1 @NextWindowStartDateTimeReportStart = LogDate ,@NextWindowUptimeMinutes = UptimeMinutes FROM UpTimeTracking WHERE LogDate >= @ReportStartDateTime ORDER BY LogDate ASC IF DATEADD(mi, -@NextWindowUptimeMinutes, @NextWindowStartDateTimeReportStart) <= @ReportStartDateTime SET @ReportStartDateTimeInUptimeWindow = 1 ELSE SET @ReportStartDateTimeInUptimeWindow = 0 /* Check where ReportEndDateTime falls */ SELECT TOP 1 @NextWindowStartDateTimeReportEnd = LogDate ,@NextWindowUptimeMinutes = UptimeMinutes FROM UpTimeTracking WHERE LogDate >= @ReportEndDateTime ORDER BY LogDate ASC IF DATEADD(mi, -@NextWindowUptimeMinutes, @NextWindowStartDateTimeReportEnd) <= @ReportEndDateTime SET @ReportEndDateTimeInUptimeWindow = 1 ELSE SET @ReportEndDateTimeInUptimeWindow = 0 /* Now perform uptime calculations based on where the report boundaries lie: ReportStartDatetime during up period and ReportEndDateTime during up period ReportStartDatetime during up period and ReportEndDateTime during down period ReportStartDatetime during down period and ReportEndDateTime during up period ReportStartDatetime during down period and ReportEndDateTime during down period */ /* ReportStartDatetime during up period and ReportEndDateTime during up period */ IF (@ReportStartDateTimeInUptimeWindow = 1 AND @ReportEndDateTimeInUptimeWindow = 1) BEGIN /* SELECT 'Section: Up Up' */ /* First, get total uptime of all periods completely within the report time frame */ SELECT @TotalReportPeriodUptimeMinutes = ISNULL(SUM(UptimeMinutes), 0) FROM UpTimeTracking WHERE LogDate < @ReportEndDateTime AND DATEADD(mi, -UptimeMinutes, LogDate) > @ReportStartDateTime /* SELECT @TotalReportPeriodUptimeMinutes */ /* Now add in minutes from the starting time period the report start date falls in */ SELECT TOP 1 @TotalReportPeriodUptimeMinutes = @TotalReportPeriodUptimeMinutes + DATEDIFF(mi, @ReportStartDateTime, LogDate) FROM UpTimeTracking WHERE LogDate >= @ReportStartDateTime ORDER BY LogDate ASC /* SELECT @TotalReportPeriodUptimeMinutes */ /* Now add in minutes from the end time period the report end date falls in */ SELECT TOP 1 @TotalReportPeriodUptimeMinutes = @TotalReportPeriodUptimeMinutes + (UptimeMinutes - DATEDIFF(mi, @ReportEndDateTime, LogDate)) FROM UpTimeTracking WHERE LogDate >= @ReportEndDateTime ORDER BY LogDate ASC /* SELECT @TotalReportPeriodUptimeMinutes */ /* Calculate the margin of error. This is the number of up/down intervals in our reporting period plus some adjustments based on if the period start or ends in an up or down time segment. Here, we know both are in an uptime period, so we add one period as the modifier. */ SELECT @IntervalCount = ISNULL(COUNT(1), 0) FROM UpTimeTracking WHERE LogDate <= @ReportEndDateTime AND DATEADD(mi, -UptimeMinutes, LogDate) >= @ReportStartDateTime AND LogDate > (SELECT MIN(LogDate) FROM UpTimeTracking ) SELECT @IntervalCount = @IntervalCount + 1.0 /* This is our modifier */ /* Now some special handling for the case where both report start and end dates are in the same log period. This is a simple case where the uptime is just the difference between the report dates. */ IF @NextWindowStartDateTimeReportStart = @NextWindowStartDateTimeReportEnd BEGIN SELECT @TotalReportPeriodUptimeMinutes = DATEDIFF(mi, @ReportStartDateTime, @ReportEndDateTime) /* SELECT 'Both are in same log period period' */ /* This situation also means the margin of error is calculated differently. Because we are completely within one log period, there is no margin of error to worry about. */ SELECT @IntervalCount = 0.0 END END /* ReportStartDatetime during up period and ReportEndDateTime during down period */ IF (@ReportStartDateTimeInUptimeWindow = 1 AND @ReportEndDateTimeInUptimeWindow = 0) BEGIN /* SELECT 'Section: Up Down' */ /* First, get total uptime of all periods completely within the report time frame */ SELECT @TotalReportPeriodUptimeMinutes = ISNULL(SUM(UptimeMinutes), 0) FROM UpTimeTracking WHERE LogDate BETWEEN @ReportStartDateTime AND @ReportEndDateTime AND LogDate > (SELECT MIN(LogDate) FROM UpTimeTracking ) /* Need the AND clause to handle case where ReportStartDate is in the first log period in the table. Otherwise, it gets counted twice */ /* Now add in minutes from the starting time period the report start date falls in */ SELECT TOP 1 @TotalReportPeriodUptimeMinutes = @TotalReportPeriodUptimeMinutes + (DATEDIFF(mi, @ReportStartDateTime, LogDate)) FROM UpTimeTracking WHERE LogDate >= @ReportStartDateTime ORDER BY LogDate ASC /* Now add in minutes from the end time period the report end date falls in */ /* Oops! We already know the report end date does not fall in an up period, so there is nothing to add here. Move along. */ /* Calculate the margin of error. This is the number of up/down intervals in our reporting period plus some adjustments based on if the period start or ends in an up or down time segment. Here, we know one is in an uptime period and one isn't, so we add one-half of one period as the modifier. */ SELECT @IntervalCount = ISNULL(COUNT(1), 0) FROM UpTimeTracking WHERE LogDate BETWEEN @ReportStartDateTime AND @ReportEndDateTime AND LogDate > (SELECT MIN(LogDate) FROM UpTimeTracking ) SELECT @IntervalCount = @IntervalCount + 0.5 /* This is our modifier */ END /* ReportStartDatetime during down period and ReportEndDateTime during up period */ IF (@ReportStartDateTimeInUptimeWindow = 0 AND @ReportEndDateTimeInUptimeWindow = 1) BEGIN /* SELECT 'Section: Down Up'*/ /* First, get total uptime of all periods completely within the report time frame */ SELECT @TotalReportPeriodUptimeMinutes = ISNULL(SUM(UptimeMinutes), 0) FROM UpTimeTracking WHERE LogDate BETWEEN @ReportStartDateTime AND @ReportEndDateTime /* Now add in minutes from the starting time period the report start date falls in */ /* Oops! We already know the report start date does not fall in an up period, so there is nothing to add here. Move along. */ /* Now add in minutes from the end time period the report end date falls in */ SELECT TOP 1 @TotalReportPeriodUptimeMinutes = @TotalReportPeriodUptimeMinutes + (UptimeMinutes - DATEDIFF(mi, @ReportEndDateTime, LogDate)) FROM UpTimeTracking WHERE LogDate >= @ReportEndDateTime ORDER BY LogDate ASC /* Calculate the margin of error. This is the number of up/down intervals in our reporting period plus some adjustments based on if the period start or ends in an up or down time segment. Here, we know one is in an uptime period and one isn't, so we add one-half of one period as the modifier. */ SELECT @IntervalCount = ISNULL(COUNT(1), 0) FROM UpTimeTracking WHERE LogDate BETWEEN @ReportStartDateTime AND @ReportEndDateTime AND LogDate > (SELECT MIN(LogDate) FROM UpTimeTracking ) SELECT @IntervalCount = @IntervalCount + 0.5 /* This is our modifier */ END /* ReportStartDatetime during down period and ReportEndDateTime during down period */ IF (@ReportStartDateTimeInUptimeWindow = 0 AND @ReportEndDateTimeInUptimeWindow = 0) BEGIN /* SELECT 'Section: Down Down' */ /* First, get total uptime of all periods completely within the report time frame */ SELECT @TotalReportPeriodUptimeMinutes = ISNULL(SUM(UptimeMinutes), 0) FROM UpTimeTracking WHERE LogDate BETWEEN @ReportStartDateTime AND @ReportEndDateTime /* Now add in minutes from the starting time period the report start date falls in */ /* Oops! We already know the report start date does not fall in an up period, so there is nothing to add here. Move along. */ /* Now add in minutes from the end time period the report end date falls in */ /* Oops! We already know the report end date does not fall in an up period, so there is nothing to add here. Move along. */ /* Calculate the margin of error. This is the number of up/down intervals in our reporting period plus some adjustments based on if the period start or ends in an up or down time segment. Here, we know neither are in an uptime period, so we don't add any modifier. */ SELECT @IntervalCount = ISNULL(COUNT(1), 0) FROM UpTimeTracking WHERE LogDate BETWEEN @ReportStartDateTime AND @ReportEndDateTime AND LogDate > (SELECT MIN(LogDate) FROM UpTimeTracking ) SELECT @IntervalCount = @IntervalCount + 0.0 /* This is our modifier */ END /* Calculations done. Now output results. Outputting as a single result set for easier consumption for reports. */ /*Uptime percentage over reporting period = total uptime during reporting period / total time of that period */ SET @UptimePercentage = 100 * (CAST(@TotalReportPeriodUptimeMinutes AS DECIMAL(20, 3)) / CAST(DATEDIFF(mi, @ReportStartDateTime, @ReportEndDateTime) AS DECIMAL(20, 3))) SET @UptimePercentageWithError = 100 * (CAST((@TotalReportPeriodUptimeMinutes - (@IntervalCount * @MarginOfErrorMultiplier * @ProcedureRunTimeIntervalInMinutes)) AS DECIMAL(20, 3)) / CAST(DATEDIFF(mi, @ReportStartDateTime, @ReportEndDateTime) AS DECIMAL(20, 3))) SELECT 'Uptime Minutes: ' + CAST(@TotalReportPeriodUptimeMinutes AS VARCHAR(20)) AS UpTimeMinutes ,'Margin Of Error (Minutes): ' + CAST((@IntervalCount * @MarginOfErrorMultiplier * @ProcedureRunTimeIntervalInMinutes) AS VARCHAR(20)) AS MarginOfErrorMinutes ,'Uptime Percentage for server ' + @@servername + ': ' + CAST(@UptimePercentage AS VARCHAR(16)) AS UpTimePercentage ,'Uptime Percentage (including margin of error) for server ' + @@servername + ': ' + CAST(@UptimePercentageWithError AS VARCHAR(16)) AS UpTimePercentageWithError
Note that, in most companies, down time due to planned maintenance does not count against your up time availability percentage. This process does not account for down time due to planned maintenance.
I had planned to end the series here, but I think one more post is in order. This is a pretty complicated reporting method, so I wanted to make sure it worked for all scenarios I could think of. My final post on this will go over my testing methodology.
One thought on “Availability Monitoring – Part 3”