Availability Monitoring – Part 3

This is the final almost final part of my series of articles about creating a process to monitor your SQL Server for downtime. Part 1 can be found here. Part 2 can be found here.

Reporting

Now that we’ve got a process in place to gather data, we need to create a way to report off of that data. We designed our data recording process to be as lightweight as possible at the expense of adding possible complexity on the reporting side. Now is when we have to pay that bill.

Let me reprint the image representing our sample data:

TimelineThis is a visual representation of the data in the following table:

LogDate

UptimeMinutes

2013-08-21 12:00:00 100
2013-08-21 14:15:00 115
2013-08-22 08:15:00 1070
2013-09-03 08:45:00 17300

Again, for the sake of our discussion, let’s assume the current time is 9/3/13 8:46 AM and the server is still running.

The blue bars represent periods where our SQL Server is up and running. The times between are times the server is down. We want to query the data and find out, for a given date range, what was the availability percentage of the server.  Calculating this can be broken down into 4 possible scenarios:

  1. Both the start and end times of the reporting window fall within a period of the server being up
  2. The report start time falls within a server downtime period and the report end time falls in a server uptime period
  3. The report start time falls within a server uptime period and the report end time falls in a server downtime period
  4. Both the start and end times of the reporting window fall within a server downtime period.

Why do we need to break things out this way? For a couple reasons. First, as we demonstrated last time, in order to calculate how accurate our monitoring data is we need to know how many up/down periods occurred within the reporting interval. Second, if the report interval falls completely within one type of period (either up or down), the calculations are greatly simplified.

I’ll walk through the calculation process for each scenarios before presenting the code.

Scenario 1

In this scenario, both the start and end times of the reporting interval fall within a period where the server was up. The generic situation looks like this:

Timeline Scenario 1

The red circle represents the start of the reporting period and the red rectangle isn the end of the reporting period. There could be more than 1 uptime period in the reporting interval, although I only show one here. The total minutes the system has been available between these two points can be found by adding:

  • The total uptime of all complete uptime periods completely within the report window (8/21 between 12:20 and 14:14, or 115 minutes, in this case) and
  • The uptime from the start of the reporting interval to the end of the uptime period that falls in (from the red circle to 8/21 12:00, in this case) and
  • The uptime from the start of the uptime period the report end time is in to the report end time (8/21 14:25 to the red rectangle, in this case)

These calculations require us to determine which time period the report endpoints fall in. In doing this, we can determine if we are in following simplified case:

Timeline Scenario 1a

Here, the total uptime is simply the difference between the report start and end dates.

Our calculations for our margin of error, also require us to count the total number of uptime periods the report covers.

Scenario 2

In this scenario, the start time of the reporting interval falls during a server downtime period and the end time falls within a period where the server was up. The generic situation looks like this:

Timeline Scenario 2In this case, the total uptime can be found by adding:

  • The total uptime of all uptime periods completely within the reporting interval and
  • The uptime from the start of the uptime interval the report end time falls in to that endtime.

Scenario 3

In this scenario, the start time of the reporting interval falls during a server uptime period and the end time falls within a period where the server was down. The generic situation looks like this:

Timeline Scenario 3

In this case, the total uptime can be found by adding:

  • The total uptime of all uptime periods completely within the reporting interval and
  • The uptime from the start of the report interval to the end of the uptime period the report start time fall in.

Scenario 4

In this final scenario, both the start and end times of the reporting interval fall within a period where the server was down. The generic situation looks like this:

Timeline Scenario 4

In this case, the total uptime can be found by adding:

  • The total uptime of all uptime periods completely within the reporting interval

Like the first scenario, there is a simplified case here, which looks like this:

Timeline Scenario 4a

Obviously, in this case, the total uptime will be zero.

That is all fairly straightforward. The tricky part comes where we try to translate those statements into T-SQL code. An extra degree of difficultly is added because our log table does not store the start time of each period. It only stores the end time and the length of the period.

Here is the code to calculate our availability percentage. The output of this procedure will be four columns containing the following text and information:

Uptime Minutes: x

Margin Of Error (Minutes): x

Uptime Percentage for server <servername>: xx.xx%

Uptime Percentage (including margin of error for <servername>: xx.xx%

The comments in the code should be helpful in explaining what is going on. The code takes two input parameters – the start and end times of the desired reporting interval. If these are not supplied, the code reports over the entire timeframe represented in the table.

The variable @MarginOfErrorMultiplier should be set to 2 unless you also have the procedure run on SQL Server startup, in which case it should be set to 1. I talked about this in the Margin Of Error section in part 2.  Also related to the margin of error calculation is what I refer to as a “margin of error modifier” in the code. Recall that our margin of error comes into play every time the server goes up or down. So whenever our report interval crosses an up/down or down/up transition, we can encounter some error. I’ve defined “one margin of error unit” to be one complete cycle – up/down and then down /up (or vice-versa). Therefore, for scenarios 2 and 3 where the report interval includes incomplete cycles, we have to add in a half of a margin of error unit. And for the special cases of scenarios 1 and 4, there are no up/down or down/up transitions included, so there is no error at all.

You’ll need to modify the code to set the @ProcedureRunTimeIntervalInMinutes variable to be however often you run the logging procedure. Here, it is set to 5 minutes.

CREATE PROCEDURE [dbo].[up_CalculateServerUpTime]
       (
        @ReportStartDateTime SMALLDATETIME = NULL
       ,@ReportEndDateTime SMALLDATETIME = NULL
	   )
AS

/* shaunjstuart.com */

       DECLARE @ReportStartDateTimeInUptimeWindow BIT
       DECLARE @ReportEndDateTimeInUptimeWindow BIT
       DECLARE @NextWindowStartDateTimeReportStart SMALLDATETIME
       DECLARE @NextWindowStartDateTimeReportEnd SMALLDATETIME
       DECLARE @NextWindowUptimeMinutes INT
       DECLARE @TotalReportPeriodUptimeMinutes INT
       DECLARE @UptimePercentage DECIMAL(10, 3)
       DECLARE @UptimePercentageWithError DECIMAL(10, 3)
       DECLARE @MarginOfErrorMultiplier TINYINT = 2
	/* This should be 1 if this procedure is set to
		run on server startup, 2 otherwise */
       DECLARE @IntervalCount DECIMAL(3, 1)
	/*	Number of up/down intervals in reporting period. Used for
		margin of error calculation */

       DECLARE @ProcedureRunTimeIntervalInMinutes SMALLINT
	/*	Set this to be how often this procedure
		is run. Used in margin of error calculation */

	/* If no parameters are passed in, assume the oldest and most recent values for use */

       IF @ReportStartDateTime IS NULL
          SELECT    @ReportStartDateTime = DATEADD(mi, -UptimeMinutes, LogDate)
          FROM      UpTimeTracking
          WHERE     logDate = (SELECT   MIN(logDate)
                               FROM     UptimeTracking
                              )
       IF @ReportEndDateTime IS NULL
          SELECT    @ReportEndDateTime = MAX(LogDate)
          FROM      UpTimeTracking

/* Validate inputs and Initialize Variables */

       IF @ReportStartDateTime >= @ReportEndDateTime
          BEGIN
                SELECT  'Start time must be before end time'
                RETURN
          END

       IF @ReportStartDateTime < (SELECT    DATEADD(mi, -UptimeMinutes,
                                                    LogDate)
                                  FROM      UpTimeTracking
                                  WHERE     LogDate = (SELECT MIN(LogDate)
                                                       FROM   UpTimeTracking
                                                      )
                                 )
          BEGIN
                SELECT  'Start time is before logging started.'
                RETURN
          END

       IF @ReportEndDateTime > (SELECT  MAX(LogDate)
                                FROM    UpTimeTracking
                               )
          BEGIN
                SELECT  'Requested report end time is beyond the last logging period. No data is available.'
                RETURN
          END

       SET @IntervalCount = 0.0
       SET @ProcedureRunTimeIntervalInMinutes = 5

       IF @ProcedureRunTimeIntervalInMinutes IS NULL
          BEGIN
                SELECT  'Must define how often logging occurs.'
                RETURN
          END

       IF @ProcedureRunTimeIntervalInMinutes <= 0
          BEGIN
                SELECT  '@ProcedureRunTimeIntervalInMinutes must be greater than zero.'
                RETURN
          END

/* Check where ReportStartDateTime falls */

       SELECT TOP 1
                @NextWindowStartDateTimeReportStart = LogDate
               ,@NextWindowUptimeMinutes = UptimeMinutes
       FROM     UpTimeTracking
       WHERE    LogDate >= @ReportStartDateTime
       ORDER BY LogDate ASC

       IF DATEADD(mi, -@NextWindowUptimeMinutes,
                  @NextWindowStartDateTimeReportStart) <= @ReportStartDateTime
          SET @ReportStartDateTimeInUptimeWindow = 1
       ELSE
          SET @ReportStartDateTimeInUptimeWindow = 0

/* Check where ReportEndDateTime falls */

       SELECT TOP 1
                @NextWindowStartDateTimeReportEnd = LogDate
               ,@NextWindowUptimeMinutes = UptimeMinutes
       FROM     UpTimeTracking
       WHERE    LogDate >= @ReportEndDateTime
       ORDER BY LogDate ASC

       IF DATEADD(mi, -@NextWindowUptimeMinutes,
                  @NextWindowStartDateTimeReportEnd) <= @ReportEndDateTime
          SET @ReportEndDateTimeInUptimeWindow = 1
       ELSE
          SET @ReportEndDateTimeInUptimeWindow = 0

/* Now perform uptime calculations based on where the report
	boundaries lie:
		ReportStartDatetime during up period and ReportEndDateTime during up period
		ReportStartDatetime during up period and ReportEndDateTime during down period
		ReportStartDatetime during down period and ReportEndDateTime during up period
		ReportStartDatetime during down period and ReportEndDateTime during down period
*/

/* ReportStartDatetime during up period and ReportEndDateTime during up period */

       IF (@ReportStartDateTimeInUptimeWindow = 1
           AND @ReportEndDateTimeInUptimeWindow = 1)
          BEGIN

        /*  SELECT 'Section: Up Up'  */
		/* First, get total uptime of all periods completely within the report time frame */

                SELECT  @TotalReportPeriodUptimeMinutes = ISNULL(SUM(UptimeMinutes),
                                                              0)
                FROM    UpTimeTracking
                WHERE   LogDate < @ReportEndDateTime
                        AND DATEADD(mi, -UptimeMinutes, LogDate) > @ReportStartDateTime

           /* SELECT @TotalReportPeriodUptimeMinutes  */

	/* Now add in minutes from the starting time period the report start date falls in */

                SELECT TOP 1
                        @TotalReportPeriodUptimeMinutes = @TotalReportPeriodUptimeMinutes
                        + DATEDIFF(mi, @ReportStartDateTime, LogDate)
                FROM    UpTimeTracking
                WHERE   LogDate >= @ReportStartDateTime
                ORDER BY LogDate ASC

       /*   SELECT @TotalReportPeriodUptimeMinutes   */

	/* Now add in minutes from the end time period the report end date falls in */

                SELECT TOP 1
                        @TotalReportPeriodUptimeMinutes = @TotalReportPeriodUptimeMinutes
                        + (UptimeMinutes - DATEDIFF(mi, @ReportEndDateTime,
                                                    LogDate))
                FROM    UpTimeTracking
                WHERE   LogDate >= @ReportEndDateTime
                ORDER BY LogDate ASC

          /* SELECT @TotalReportPeriodUptimeMinutes   */

	/* Calculate the margin of error. This is the number of up/down intervals in our reporting
		period plus some adjustments based on if the period start or ends in an up or down
		time segment. Here, we know both are in an uptime period, so we add one period as the
		modifier. */

                SELECT  @IntervalCount = ISNULL(COUNT(1), 0)
                FROM    UpTimeTracking
                WHERE   LogDate <= @ReportEndDateTime
                        AND DATEADD(mi, -UptimeMinutes, LogDate) >= @ReportStartDateTime
                        AND LogDate > (SELECT   MIN(LogDate)
                                       FROM     UpTimeTracking
                                      )

                SELECT  @IntervalCount = @IntervalCount + 1.0	/* This is our modifier */

/* Now some special handling for the case where both report start and end dates are in the
	same log period. This is a simple case where the uptime is just the difference
	between the report dates. */

                IF @NextWindowStartDateTimeReportStart = @NextWindowStartDateTimeReportEnd
                   BEGIN
                         SELECT @TotalReportPeriodUptimeMinutes = DATEDIFF(mi,
                                                              @ReportStartDateTime,
                                                              @ReportEndDateTime)
                  /* SELECT    'Both are in same log period period' */

	/* This situation also means the margin of error is calculated differently. Because
		we are completely within one log period, there is no margin of error to worry about. */

                         SELECT @IntervalCount = 0.0

                   END

          END

/* ReportStartDatetime during up period and ReportEndDateTime during down period */

       IF (@ReportStartDateTimeInUptimeWindow = 1
           AND @ReportEndDateTimeInUptimeWindow = 0)
          BEGIN
         /* SELECT 'Section: Up Down' */

	/* First, get total uptime of all periods completely within the report time frame */

                SELECT  @TotalReportPeriodUptimeMinutes = ISNULL(SUM(UptimeMinutes),
                                                              0)
                FROM    UpTimeTracking
                WHERE   LogDate BETWEEN @ReportStartDateTime
                                AND     @ReportEndDateTime
                        AND LogDate > (SELECT   MIN(LogDate)
                                       FROM     UpTimeTracking
                                      )
				/* Need the AND clause to handle case where ReportStartDate is in the first
					log period in the table. Otherwise, it gets counted twice */

	/* Now add in minutes from the starting time period the report start date falls in */

                SELECT TOP 1
                        @TotalReportPeriodUptimeMinutes = @TotalReportPeriodUptimeMinutes
                        + (DATEDIFF(mi, @ReportStartDateTime, LogDate))
                FROM    UpTimeTracking
                WHERE   LogDate >= @ReportStartDateTime
                ORDER BY LogDate ASC

	/* Now add in minutes from the end time period the report end date falls in */

	/* Oops! We already know the report end date does not fall in an up period, so there
		is nothing to add here. Move along. */

	/* Calculate the margin of error. This is the number of up/down intervals in our reporting
		period plus some adjustments based on if the period start or ends in an up or down
		time segment. Here, we know one is in an uptime period and one isn't, so we add one-half
		of one period as the modifier. */

                SELECT  @IntervalCount = ISNULL(COUNT(1), 0)
                FROM    UpTimeTracking
                WHERE   LogDate BETWEEN @ReportStartDateTime
                                AND     @ReportEndDateTime
                        AND LogDate > (SELECT   MIN(LogDate)
                                       FROM     UpTimeTracking
                                      )
                SELECT  @IntervalCount = @IntervalCount + 0.5	/* This is our modifier */

          END

/* ReportStartDatetime during down period and ReportEndDateTime during up period */

       IF (@ReportStartDateTimeInUptimeWindow = 0
           AND @ReportEndDateTimeInUptimeWindow = 1)
          BEGIN
         /* SELECT 'Section: Down Up'*/

	/* First, get total uptime of all periods completely within the report time frame */

                SELECT  @TotalReportPeriodUptimeMinutes = ISNULL(SUM(UptimeMinutes),
                                                              0)
                FROM    UpTimeTracking
                WHERE   LogDate BETWEEN @ReportStartDateTime
                                AND     @ReportEndDateTime
	/* Now add in minutes from the starting time period the report start date falls in */

	/* Oops! We already know the report start date does not fall in an up period, so there
		is nothing to add here. Move along. */

	/* Now add in minutes from the end time period the report end date falls in */

                SELECT TOP 1
                        @TotalReportPeriodUptimeMinutes = @TotalReportPeriodUptimeMinutes
                        + (UptimeMinutes - DATEDIFF(mi, @ReportEndDateTime,
                                                    LogDate))
                FROM    UpTimeTracking
                WHERE   LogDate >= @ReportEndDateTime
                ORDER BY LogDate ASC

  	/* Calculate the margin of error. This is the number of up/down intervals in our reporting
		period plus some adjustments based on if the period start or ends in an up or down
		time segment. Here, we know one is in an uptime period and one isn't, so we add one-half
		of one period as the modifier. */

                SELECT  @IntervalCount = ISNULL(COUNT(1), 0)
                FROM    UpTimeTracking
                WHERE   LogDate BETWEEN @ReportStartDateTime
                                AND     @ReportEndDateTime
                        AND LogDate > (SELECT   MIN(LogDate)
                                       FROM     UpTimeTracking
                                      )

                SELECT  @IntervalCount = @IntervalCount + 0.5	/* This is our modifier */

          END

/* ReportStartDatetime during down period and ReportEndDateTime during down period */

       IF (@ReportStartDateTimeInUptimeWindow = 0
           AND @ReportEndDateTimeInUptimeWindow = 0)
          BEGIN
         /* SELECT 'Section: Down Down' */

	/* First, get total uptime of all periods completely within the report time frame */

                SELECT  @TotalReportPeriodUptimeMinutes = ISNULL(SUM(UptimeMinutes),
                                                              0)
                FROM    UpTimeTracking
                WHERE   LogDate BETWEEN @ReportStartDateTime
                                AND     @ReportEndDateTime

	/* Now add in minutes from the starting time period the report start date falls in */

	/* Oops! We already know the report start date does not fall in an up period, so there
		is nothing to add here. Move along. */

	/* Now add in minutes from the end time period the report end date falls in */

	/* Oops! We already know the report end date does not fall in an up period, so there
		is nothing to add here. Move along. */

	/* Calculate the margin of error. This is the number of up/down intervals in our reporting
		period plus some adjustments based on if the period start or ends in an up or down
		time segment. Here, we know neither are in an uptime period, so we don't add any modifier. */

                SELECT  @IntervalCount = ISNULL(COUNT(1), 0)
                FROM    UpTimeTracking
                WHERE   LogDate BETWEEN @ReportStartDateTime
                                AND     @ReportEndDateTime
                        AND LogDate > (SELECT   MIN(LogDate)
                                       FROM     UpTimeTracking
                                      )

                SELECT  @IntervalCount = @IntervalCount + 0.0	/* This is our modifier */

          END

/* Calculations done. Now output results. Outputting as a single result set for easier
	consumption for reports. */

/*Uptime percentage over reporting period = total uptime during reporting period / total time of that period */

       SET @UptimePercentage = 100
           * (CAST(@TotalReportPeriodUptimeMinutes AS DECIMAL(20, 3))
              / CAST(DATEDIFF(mi, @ReportStartDateTime, @ReportEndDateTime) AS DECIMAL(20,
                                                              3)))

       SET @UptimePercentageWithError = 100
           * (CAST((@TotalReportPeriodUptimeMinutes - (@IntervalCount
                                                       * @MarginOfErrorMultiplier
                                                       * @ProcedureRunTimeIntervalInMinutes)) AS DECIMAL(20,
                                                              3))
              / CAST(DATEDIFF(mi, @ReportStartDateTime, @ReportEndDateTime) AS DECIMAL(20,
                                                              3)))

       SELECT   'Uptime Minutes: '
                + CAST(@TotalReportPeriodUptimeMinutes AS VARCHAR(20)) AS UpTimeMinutes
               ,'Margin Of Error (Minutes): '
                + CAST((@IntervalCount * @MarginOfErrorMultiplier
                        * @ProcedureRunTimeIntervalInMinutes) AS VARCHAR(20)) AS MarginOfErrorMinutes
               ,'Uptime Percentage for server ' + @@servername + ': '
                + CAST(@UptimePercentage AS VARCHAR(16)) AS UpTimePercentage
               ,'Uptime Percentage (including margin of error) for server '
                + @@servername + ': '
                + CAST(@UptimePercentageWithError AS VARCHAR(16)) AS UpTimePercentageWithError

Note that, in most companies, down time due to planned maintenance does not count against your up time availability percentage. This process does not account for down time due to planned maintenance.

I had planned to end the series here, but I think one more post is in order. This is a pretty complicated reporting method, so I wanted to make sure it worked for all scenarios I could think of. My final post on this will go over my testing methodology.

Share

One thought on “Availability Monitoring – Part 3

Leave a Reply

Your email address will not be published. Required fields are marked *

I am a real person and can prove it by doing this math problem: (required) Time limit is exhausted. Please reload CAPTCHA.