I’m always eager to expand my knowledge about SQL Server and I have a list of about 10 SQL-related blogs that I check on a daily basis. I’ve picked up lots of good info from them and occasionally, I’ll come across something that sticks in my head because I think it’s just all kinds of awesome. Over a year ago, I came across a post like that on Thomas LaRock’s blog.
It was a short blog entry that just referred his readers to an article he wrote for Simple-Talk.com. In it, he talked about statistical sampling and how to use it to test your backups. (Because we all know a backup is never good until it has been successfully restored, right?) In his case, he was administering about 3,000 databases. That number made it impossible to verify each and every backup to make sure it was valid, so he turned to statistical analysis to determine how many he needed to test to say with 95% confidence that all his backups were good. I loved this idea and was eager to put it into practice at my company. But real life got in the way and this became one of those wish-list items that always seemed to get put off because something else more important came up. At the time, I was also at a company that didn’t have a large number of databases, so statistical sampling was more of a thought exercise than a necessity.
Now I am at a different company, where I am one of two DBAs administering close to 1,000 databases. Sampling is now an attractive proposition. So I dug up LaRock’s post again and started to dive into it. One thing I quickly discovered is the same thing the commenters on his article discovered – in simplifying the statistical calculations, LaRock seemed to have left out an important step: exactly how someone might translate his work to their own environment. He made no mention of how his population of 3,000 databases played into the equations. He also did not mention how he used a Z table (information he omitted on purpose). Without those two key pieces of information, there was little anyone could do to adapt his technique to their own environment.
Well, it just so happens I have a friend who is an associate professor of mathematics at UCSD, so I sent him an email asking for a brief tutorial on how to use a Z table. He was kind enough to respond, and I now have a good understanding of how Z tables work.
So I once again sat down to develop a routine to generate a sample size, only to realize I still didn’t know how the population size came into play! I did a bit more hunting and finally came across Raosoft’s sample size calculator (http://www.raosoft.com/samplesize.html), which helpfully suggests that users examine the page’s source code to see how it performs its sample size calculation. So I did. And I was pleased to discover no Z tables are involved.
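For reference, the calculation on that page boils down to the standard sample size formula for a proportion with a finite population correction, which is also exactly what the stored procedure below computes:

SampleSize = CEILING( N * z^2 * p * (100 - p) / ( (N - 1) * E^2 + z^2 * p * (100 - p) ) )

where N is the population size, p is the response distribution (as a percentage), E is the margin of error (as a percentage), and z is the critical value of the standard normal distribution for the chosen confidence level. That (N - 1) term in the denominator is where the population size comes into play.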
So I translated the JavaScript from that page into T-SQL and made the stored procedure shown below. It’s a bit ugly, in part because T-SQL does not support arrays, but it gets the job done. I plugged in the values LaRock used and came up with the same result he did, so I’m going to say it works. I also ran some other values through my procedure and compared them to the results the website gives, and they matched as well. The stored procedure works on SQL 2005 and 2008. I have not tested it on 2000, but it should work there as well.
For my use, I accepted LaRock’s assumptions:
- Response Distribution = 0.1%
- Confidence Level = 95%
- Margin Of Error = 1%
Supply your total number of databases as the @Population parameter. The output tells you how many you need to sample. Note that the percentage input parameters should be entered as percentages (minus the percent sign), not as decimals. For example, 95% is entered as 95, not 0.95.
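Here’s what a call looks like with those assumptions and a population of 1,000 databases (just swap in your own @Population):

DECLARE @SampleSize FLOAT

EXEC up_GetSampleSize
    @ResponseDistribution = 0.1,    -- percent, not decimal
    @Population = 1000,             -- total number of databases
    @ConfidenceLevel = 95,          -- percent, not decimal
    @MarginOfError = 1,             -- percent, not decimal
    @SampleSize = @SampleSize OUTPUT

SELECT @SampleSize AS DatabasesToTest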
We’ve incorporated this into a routine my co-worker developed that gets a list of current database backups from across our enterprise, picks a random sampling, restores them to a test server, and runs DBCC CHECKDB on each. If anything fails – the restore, DBCC, etc. – an error is logged and a notification email is sent.
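The random selection piece is simple enough to sketch here. Something along these lines does the job, assuming a hypothetical dbo.BackupInventory table with one row per current backup (the table and column names are just stand-ins for whatever your own backup list looks like):

-- @SampleSize comes from up_GetSampleSize, as shown above
SELECT TOP (CAST(@SampleSize AS INT))
       DatabaseName,
       BackupFile
FROM   dbo.BackupInventory
ORDER BY NEWID()    -- NEWID() gives every row a random sort key, so this picks a random sample

ORDER BY NEWID() isn’t the fastest way to sample a huge table, but for a list of a thousand or so backups it’s more than fast enough.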
We’ve now got a completely automated backup testing routine that scales as our database population varies. I can tell my boss with 95% confidence that all our backups are good :-)
CREATE PROCEDURE up_GetSampleSize
    @ResponseDistribution FLOAT,
    @Population FLOAT,
    @ConfidenceLevel FLOAT,
    @MarginOfError FLOAT,
    @SampleSize FLOAT OUTPUT
AS
/*
   This procedure is used to determine the sample size you need to test out of
   a given population for a given set of response distribution, confidence
   level, and margin of error.

   This routine was developed for use in database backup verification as
   described at http://www.simple-talk.com/content/article.aspx?article=1028.
   It avoids messing about with Z tables. The math is taken from
   http://www.raosoft.com/samplesize.html and has been translated from
   JavaScript to T-SQL.

   For database verification purposes, reasonable values are:
       @ResponseDistribution = .1%
       @ConfidenceLevel = 95%
       @MarginOfError = 1%

   Example usage:
       DECLARE @SampleSize FLOAT
       EXEC up_GetSampleSize @ResponseDistribution = .1, @Population = 3000,
            @ConfidenceLevel = 95, @MarginOfError = 1,
            @SampleSize = @SampleSize OUTPUT

   -- SJS 9/7/11  shaunjstuart.com
*/
SET NOCOUNT ON

DECLARE @Y FLOAT
DECLARE @Pr FLOAT
DECLARE @Real1 FLOAT
DECLARE @Real2 FLOAT
DECLARE @P1 FLOAT
DECLARE @P2 FLOAT
DECLARE @P3 FLOAT
DECLARE @P4 FLOAT
DECLARE @P5 FLOAT
DECLARE @Q1 FLOAT
DECLARE @Q2 FLOAT
DECLARE @Q3 FLOAT
DECLARE @Q4 FLOAT
DECLARE @Q5 FLOAT
DECLARE @ProbCriticalNormal FLOAT
DECLARE @d1 FLOAT
DECLARE @d2 FLOAT

-- Coefficients of the rational approximation used to find the critical value
-- of the standard normal distribution for the given confidence level
SET @P1 = -0.322232431088
SET @P2 = -1.0
SET @P3 = -0.342242088547
SET @P4 = -0.0204231210245
SET @P5 = -0.453642210148E-4
SET @Q1 = 0.0993484626060
SET @Q2 = 0.588581570495
SET @Q3 = 0.531103462366
SET @Q4 = 0.103537752850
SET @Q5 = 0.38560700634E-2

-- Convert the confidence level to a tail probability
SET @ConfidenceLevel = @ConfidenceLevel / 100.0
SET @Pr = 0.5 - (@ConfidenceLevel / 2.0)

IF @Pr < 1.0E-8
    BEGIN
        SET @ProbCriticalNormal = 6.0
    END
ELSE
    BEGIN
        IF @Pr = 0.5
            BEGIN
                SET @ProbCriticalNormal = 0.0
            END
        ELSE
            BEGIN
                -- Evaluate the numerator and denominator polynomials in @Y
                SET @Y = SQRT(LOG(1.0 / (@Pr * @Pr)))
                SET @Real1 = @P5
                SET @Real2 = @Q5
                SET @Real1 = @Real1 * @Y + @P4
                SET @Real2 = @Real2 * @Y + @Q4
                SET @Real1 = @Real1 * @Y + @P3
                SET @Real2 = @Real2 * @Y + @Q3
                SET @Real1 = @Real1 * @Y + @P2
                SET @Real2 = @Real2 * @Y + @Q2
                SET @Real1 = @Real1 * @Y + @P1
                SET @Real2 = @Real2 * @Y + @Q1
                SET @ProbCriticalNormal = @Y + (@Real1 / @Real2)
            END
    END

-- Sample size with finite population correction, rounded up
SET @d1 = @ProbCriticalNormal * @ProbCriticalNormal * @ResponseDistribution * (100.0 - @ResponseDistribution)
SET @d2 = (@Population - 1.0) * (@MarginOfError * @MarginOfError) + @d1

IF @d2 > 0.0
    BEGIN
        SET @SampleSize = CEILING(@Population * @d1 / @d2)
    END

SELECT @SampleSize
Shaun,
Nicely done. At first I started feeling bad that I left out such details about how to use a Z-table, but I am happy to see you didn’t let that stop you.
Best,
Tom