Availability Monitoring – Part 4

For the final part of this series on uptime monitoring, I’d like to go over my testing methodology. The reporting process is somewhat complicated, so I want to be confident that the figures my procedure outputs are correct. Here, again, is the sample data I am working with:

Timeline

LogDate              UptimeMinutes
2013-08-21 12:00:00  100
2013-08-21 14:15:00  115
2013-08-22 08:15:00  1070
2013-09-03 08:45:00  17300

Again, for the sake of our discussion, let’s assume the current time is 9/3/13 8:46 AM and the server is still running.
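To make the sample data easier to reason about, here is a minimal Python sketch (not the reporting procedure from this series) that turns each log row into an up period. It assumes each row records the last logged time of an up period together with the minutes the server had been up at that point, so the period starts at LogDate minus UptimeMinutes. The names LOG_ROWS and uptime_intervals are made up for this illustration.

```python
from datetime import datetime, timedelta

# Sample log rows from the data above: (LogDate, UptimeMinutes).
LOG_ROWS = [
    ("2013-08-21 12:00:00", 100),
    ("2013-08-21 14:15:00", 115),
    ("2013-08-22 08:15:00", 1070),
    ("2013-09-03 08:45:00", 17300),
]


def uptime_intervals(rows):
    """Turn each (LogDate, UptimeMinutes) row into an (up_start, up_end) pair.

    Assumption: LogDate is the end of the up period and UptimeMinutes is
    its length, so the start is LogDate minus UptimeMinutes.
    """
    intervals = []
    for log_date, minutes in rows:
        up_end = datetime.strptime(log_date, "%Y-%m-%d %H:%M:%S")
        intervals.append((up_end - timedelta(minutes=minutes), up_end))
    return intervals


for up_start, up_end in uptime_intervals(LOG_ROWS):
    print(up_start, "->", up_end)
```

Under that assumption, the gaps between one period's end and the next period's start are the down times.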

The blue bars represent periods when our SQL Server is up and running. The gaps between them are periods when the server is down.

To fully verify this procedure, I want to run it using various reporting start and end times in order to test the different scenarios I talked about in Part 3, and then compare its output to my manually calculated results. Here is the data I used for testing and my manual calculations:

Report Start  Report End  Scenario          Uptime (min)  Total Report Time (min)  Availability Percentage  Margin Of Error (min)
8/21 10:25    8/21 10:35  1 (special case)  10            10                       100                      0
8/21 10:25    8/21 12:15  3                 95            110                      86.36                    5
8/21 12:15    8/21 12:30  2                 10            15                       66.67                    5
8/21 12:15    8/22 8:20   4                 1185          1205                     98.34                    20
8/21 14:20    8/21 14:22  4 (special case)  0             2                        0                        0
8/22 8:30     8/22 8:45   1 (special case)  15            15                       100                      0
8/21 12:30    8/21 13:00  1 (special case)  30            30                       100                      0
8/21 12:30    8/22 8:30   1                 1180          1200                     98.33                    20
8/21 10:25    8/22 8:30   1                 1285          1325                     96.98                    30

The procedure gives the same values for these reporting periods, so I’m pretty confident the code is good. You’ll notice there are a couple of cases where I test a scenario multiple times. This was done so that specific up/down time intervals (the first, the last, and so on) would be included in the report. I also ran a test using no report time parameters, so the output would cover all logged time, but those numbers will vary depending on when you run it (because the final period keeps growing as time goes on), so I left that off the chart.
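For anyone who wants to double-check the manual figures, here is a rough Python sketch of the arithmetic. It is not the stored procedure itself, just one way to do the same calculation by hand: clip each up period to the report window, add up the overlap, and divide by the length of the window. The up periods are written out under the assumption that each one starts at LogDate minus UptimeMinutes. The names UP_PERIODS and availability are hypothetical, and the margin of error column is left out because its rule comes from the earlier parts of this series.

```python
from datetime import datetime

# Up periods implied by the sample log, assuming start = LogDate - UptimeMinutes.
# Written out by hand so this block stands on its own.
UP_PERIODS = [
    (datetime(2013, 8, 21, 10, 20), datetime(2013, 8, 21, 12, 0)),
    (datetime(2013, 8, 21, 12, 20), datetime(2013, 8, 21, 14, 15)),
    (datetime(2013, 8, 21, 14, 25), datetime(2013, 8, 22, 8, 15)),
    (datetime(2013, 8, 22, 8, 25), datetime(2013, 9, 3, 8, 45)),
]


def availability(report_start, report_end, up_periods=UP_PERIODS):
    """Return (uptime_min, total_min, availability_pct) for a report window."""
    total = (report_end - report_start).total_seconds() / 60
    uptime = 0.0
    for up_start, up_end in up_periods:
        # Clip the up period to the report window; count only positive overlap.
        overlap_start = max(up_start, report_start)
        overlap_end = min(up_end, report_end)
        if overlap_end > overlap_start:
            uptime += (overlap_end - overlap_start).total_seconds() / 60
    pct = round(100 * uptime / total, 2) if total else 0.0
    return int(uptime), int(total), pct


# Report windows taken from the test table: (Report Start, Report End).
CASES = [
    (datetime(2013, 8, 21, 10, 25), datetime(2013, 8, 21, 10, 35)),
    (datetime(2013, 8, 21, 10, 25), datetime(2013, 8, 21, 12, 15)),
    (datetime(2013, 8, 21, 12, 15), datetime(2013, 8, 21, 12, 30)),
    (datetime(2013, 8, 21, 12, 15), datetime(2013, 8, 22, 8, 20)),
    (datetime(2013, 8, 21, 14, 20), datetime(2013, 8, 21, 14, 22)),
    (datetime(2013, 8, 22, 8, 30), datetime(2013, 8, 22, 8, 45)),
    (datetime(2013, 8, 21, 12, 30), datetime(2013, 8, 21, 13, 0)),
    (datetime(2013, 8, 21, 12, 30), datetime(2013, 8, 22, 8, 30)),
    (datetime(2013, 8, 21, 10, 25), datetime(2013, 8, 22, 8, 30)),
]

for start, end in CASES:
    print(start, end, availability(start, end))
```

Each printed tuple should line up with the Uptime, Total Report Time, and Availability Percentage columns in the table above.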

Final Thoughts

So what did I learn from all this? First, it’s not a trivial task to get a server to monitor itself for downtime. Second, I really enjoyed this challenge. Coming up with a data logging procedure was a fun exercise, and the calculation portion required some serious thought. I’m sure there are other logging methods that could achieve the same result (for instance, you could probably log only the start and end times of the logging period, instead of the end time and elapsed minutes), but I can’t see one as having a clear advantage over the other. If anyone has other ideas or thoughts, please leave a comment.
