Backup Verification, Statistical Sampling, And You

I’m always eager to expand my knowledge about SQL Server and I have a list of about 10 SQL-related blogs that I check on a daily basis. I’ve picked up lots of good info from them and occasionally, I’ll come across something that sticks in my head because I think it’s just all kinds of awesome. Over a year ago, I came across a post like that on Thomas LaRock’s blog.

It was a short blog entry that just referred his readers to an article he wrote for Simple-Talk.com. In it, he talked about statistical sampling and how to use that to test your backups. (Because we all know a backup is never good until it has been successfully restored, right?) In his case, he had about 3,000 databases he was administering. That number made it impossible to verify each and every backup to make sure it was valid, so he turned to statistical analysis to determine how many he needed to test to say with 95% confidence that all his backups were good. I loved this idea and was eager to put it into practice at my company. But real life got in the way and this became one of those wish-list items that always seemed to get put off because something else more important came up. At the time, I was also at a company that didn’t have a large number of databases, so statistical sampling was more of a thought exercise than a necessity.

Now I am at a different company, where I am one of two DBAs administering close to 1,000 databases. Sampling is suddenly an attractive proposition. So I dug up LaRock’s post again and started to dive into it. One thing I quickly discovered is the same thing the commenters on his article discovered: in simplifying the statistical calculations, LaRock left out an important step, namely how someone might translate his work to their own environment. He made no mention of how his population of 3,000 databases played into the equations, and he did not explain how he used a Z table (information he omitted on purpose). Without those two key pieces of information, there was little anyone could do to adapt his technique to their own environment.

Well, it just so happens I have a friend who is an associate professor of mathematics at UCSD, so I sent him an email asking for a brief tutorial on how to use a Z table. He was kind enough to respond, and I now have a good understanding of how they work.

So I once again sat down to develop a routine to generate a sample size, only to realize I still didn’t know how the population size comes into play! I did a bit more hunting and finally came across the Raosoft sample size calculator (http://www.raosoft.com/samplesize.html), which helpfully suggests that users examine the web page’s source code to see how it performs its calculations. So I did. And I was pleased to discover no Z tables are involved.
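
If you trace through that source (or the stored procedure below, which is a translation of it), the calculation boils down to two formulas: x = Z^2 * r * (100 - r) and n = N * x / ((N - 1) * E^2 + x), rounded up, where Z is the critical value for your confidence level, r is the response distribution (as a percentage), E is the margin of error (as a percentage), and N is the population size. That second formula is where the population finally makes its appearance.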

So I translated the JavaScript from the web page into T-SQL and produced the stored procedure shown below. It’s a bit ugly, in part because T-SQL does not support arrays, but it gets the job done. I plugged in the values LaRock used and came up with the same result he did, so I’m going to say that it works. I also ran some other values through my procedure and compared them to the results the website gives, and they matched as well. The stored procedure works on SQL 2005 and 2008. I have not tested it on 2000, but it should work there as well.

For my use, I accepted LaRock’s assumptions:

  • Response Distribution = 0.1%
  • Confidence Level = 95%
  • Margin Of Error = 1%

Supply your total number of databases as the @Population parameter. The output tells you how many you need to sample. Note that the percentage input parameters should be entered as percentages (without the percent sign), not as decimals; e.g., 95% is entered as 95, not 0.95.
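
For example, here is roughly how you might call it on a single server, pulling the population from sys.databases (excluding the four system databases is just my habit; adjust as needed):

DECLARE @Population FLOAT
DECLARE @SampleSize FLOAT

-- Count the databases being backed up (here, everything except the system databases)
SELECT @Population = COUNT(*)
FROM sys.databases
WHERE database_id > 4

EXEC up_GetSampleSize	@ResponseDistribution = .1,
			@Population = @Population,
			@ConfidenceLevel = 95,
			@MarginOfError = 1,
			@SampleSize = @SampleSize OUTPUT

SELECT @SampleSize AS DatabasesToTest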

We’ve incorporated this into a routine my co-worker developed that gets a list of current database backups from across our enterprise, picks a random sampling, restores them to a test server, and runs DBCC CHECKDB on each. If anything fails – the restore, DBCC, etc. – an error is logged and a notification email is sent.
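
Our routine is too tied to our environment to post in full, but the random selection step is easy enough to sketch. The #LatestBackups table below is just a stand-in for wherever you keep your list of current backups; the trick is that ORDER BY NEWID() shuffles the rows, so TOP returns a random sample:

-- Hypothetical table holding the most recent backup of each database.
-- Populate it however you track backups (msdb.dbo.backupset is one source).
CREATE TABLE #LatestBackups
	(DatabaseName SYSNAME,
	 BackupFile NVARCHAR(260))

DECLARE @Population FLOAT
DECLARE @SampleSize FLOAT

SELECT @Population = COUNT(*) FROM #LatestBackups

EXEC up_GetSampleSize	@ResponseDistribution = .1,
			@Population = @Population,
			@ConfidenceLevel = 95,
			@MarginOfError = 1,
			@SampleSize = @SampleSize OUTPUT

-- NEWID() generates a random value per row, so sorting by it and taking
-- the TOP (n) rows yields a random sample of the required size
SELECT TOP (CAST(@SampleSize AS INT)) DatabaseName, BackupFile
FROM #LatestBackups
ORDER BY NEWID()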

We’ve now got a completely automated backup testing routine that scales as our database population varies. I can tell my boss with 95% confidence that all our backups are good :-)


CREATE PROCEDURE up_GetSampleSize
	@ResponseDistribution float,
	@Population float,
	@ConfidenceLevel float,
	@MarginOfError float,
	@SampleSize float OUTPUT

AS

/*	This procedure is used to determine the sample size you need to test out of a given
	population for a given set of response distribution, confidence level, and margin of error.
	This routine was developed for use in database backup verification as described at
	http://www.simple-talk.com/content/article.aspx?article=1028. It avoids messing about with
	Z tables. The math is taken from http://www.raosoft.com/samplesize.html and has been 
	translated from JavaScript to T-SQL.
	
	For database verification purposes, reasonable values are:
		@ResponseDistribution = .1%
		@ConfidenceLevel = 95%
		@MarginOfError = 1%
	
	Example usage: 
				
				DECLARE	@SampleSize FLOAT
				EXEC up_GetSampleSize	@ResponseDistribution = .1,
										@Population = 3000,
										@ConfidenceLevel = 95,
										@MarginOfError=1,
										@SampleSize=@SampleSize OUTPUT
				
				
	
	--	SJS
		9/7/11
		shaunjstuart.com

*/

SET NOCOUNT ON
DECLARE @Y FLOAT
DECLARE @Pr FLOAT
DECLARE @Real1 FLOAT
DECLARE @Real2 FLOAT
DECLARE @P1 FLOAT
DECLARE @P2 FLOAT
DECLARE @P3 FLOAT
DECLARE @P4 FLOAT
DECLARE @P5 FLOAT
DECLARE @Q1 FLOAT
DECLARE @Q2 FLOAT
DECLARE @Q3 FLOAT
DECLARE @Q4 FLOAT
DECLARE @Q5 FLOAT
DECLARE @ProbCriticalNormal FLOAT
DECLARE @d1 FLOAT
DECLARE @d2 FLOAT

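-- Coefficients of the rational approximation used to compute the critical value
-- of the standard normal distribution (this is what stands in for a Z table lookup)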
SET @P1 = -0.322232431088
SET @P2 = -1.0
SET @P3 = -0.342242088547
SET @P4 = -0.0204231210245
SET @P5 = -0.453642210148E-4
SET @Q1 = 0.0993484626060
SET @Q2 = 0.588581570495
SET @Q3 = 0.531103462366
SET @Q4 = 0.103537752850
SET @Q5 = 0.38560700634E-2

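-- Convert the confidence level from a percentage to a proportion,
-- then find the probability left in one tail of the distribution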
SET @ConfidenceLevel = @ConfidenceLevel / 100.0

SET @Pr = 0.5 - (@ConfidenceLevel/2.0)

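-- Compute the critical Z value for that tail probability. The first two branches
-- handle the edge cases of (effectively) 100% and 0% confidence.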
IF @Pr < 1.0E-8
	BEGIN
		SET @ProbCriticalNormal = 6.0
	END
	ELSE
	BEGIN
		IF @Pr = 0.5
			BEGIN
				SET @ProbCriticalNormal = 0.0
			END
			ELSE
			BEGIN
				SET @Y = SQRT(LOG(1.0/(@Pr * @Pr)))
				SET @Real1 = @P5
				SET @Real2 = @Q5
				
				SET @Real1 = @Real1 * @Y + @P4
				SET @Real2 = @Real2 * @Y + @Q4
				
				SET @Real1 = @Real1 * @Y + @P3
				SET @Real2 = @Real2 * @Y + @Q3
				
				SET @Real1 = @Real1 * @Y + @P2
				SET @Real2 = @Real2 * @Y + @Q2
				
				SET @Real1 = @Real1 * @Y + @P1
				SET @Real2 = @Real2 * @Y + @Q1
				
				SET @ProbCriticalNormal = @Y + (@Real1 / @Real2)
				
			END
	END


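-- Sample size formula (same as the Raosoft page):
--   n = N * x / ((N - 1) * E^2 + x), where x = Z^2 * r * (100 - r)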
SET @d1 = @ProbCriticalNormal * @ProbCriticalNormal * @ResponseDistribution * (100.0 - @ResponseDistribution)
SET @d2 = (@Population - 1.0) * (@MarginOfError * @MarginOfError) + @d1
IF @d2 > 0.0
	BEGIN
		SET @SampleSize = CEILING( @Population * @d1/@d2)
	END
	

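-- Return the sample size as a result set in addition to the OUTPUT parameter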
SELECT @SampleSize
