Shaun J Stuart

Just another SQL Server weblog

Browsing Posts tagged maintenance

I'm always eager to expand my knowledge about SQL Server and I have a list of about 10 SQL-related blogs that I check on a daily basis. I've picked up lots of good info from them and occasionally, I'll come across something that sticks in my head because I think it's just all kinds of awesome. Over a year ago, I came across a post like that on Thomas LaRock's blog.

It was a short blog entry that just referred his readers to an article he wrote for Simple-Talk.com. In it, he talked about statistical sampling and how to use that to test your backups. (Because we all know a backup is never good until it has been successfully restored, right?) In his case, he had about 3,000 databases he was administering. That number made it impossible to verify each and every backup to make sure it was valid, so he turned to statistical analysis to determine how many he needed to test to say with 95% confidence that all his backups were good. I loved this idea and was eager to put it into practice at my company. But real life got in the way and this became one of those wish-list items that always seemed to get put off because something else more important came up. At the time, I was also at a company that didn't have a large number of databases, so statistical sampling was more of a thought exercise than a necessity.

Now I am at a different company and I am one of two DBAs administering close to 1,000 databases. Sampling is now an attractive proposition. So I dug up LaRock's post again and started to dive into it. One thing I quickly discovered is the same thing the commenters on his article discovered - in simplifying the statistical calculations, LaRock seemed to have left out the important step of exactly how someone might translate his work to their own environment. He made no mention of how his population of 3,000 databases played into the equations. He also did not mention how he used a Z table (information he omitted on purpose). Without those two pieces of key information, there was little anyone could do to adapt his technique to their own environment.

Well, it just so happens I have a friend who is an associate professor of mathematics at UCSD, so I sent him an email asking for a brief tutorial on how to use a Z table. He was kind enough to respond and I now have a good understanding of how to use a Z table.

So I once again sat down to develop a routine to generate a sample size when I realized I still didn't know how the population size comes into play! So I did a bit more hunting and finally came across this website, which helpfully suggests users examine the web page source code to see how they perform their calculations to determine a sample size. So I did. And I was pleased to discover no Z tables are involved.

So I translated the code from the Javascript of the web page to T-SQL and made the stored procedure shown below. It's a bit ugly, in part because T-SQL does not support arrays, but it gets the job done. I plugged in the values LaRock used and came up with the same result he did, so I'm going to say that it works. I also tested some values with my procedure and compared them to the results the website gives and they also matched. The stored procedure works on SQL 2005 and 2008. I have not tested it on 2000, but it should work there as well.

For my use, I accepted LaRock's assumptions:

  • Response Distribution = 0.1%
  • Confidence Level = 95%
  • Margin Of Error = 1%

Supply your total number of databases as the @Population parameter. The output tells you how many you need to sample. Note that the percentage input parameters should be entered as percentages (minus the percent sign), not as decimals. i.e. 95% is entered as 95, not 0.95.

We've incorporated this into a routine my co-worker developed that gets a list of current database backups from across our enterprise, picks a random sampling, restores them to a test server, and runs DBCC CHECKDB on each. If anything fails - the restore, DBCC, etc. - an error is logged and a notification email is sent.

We've now got a completely automated backup testing routine that scales as our database population varies. I can tell my boss with 95% confidence that all our backups are good :-)

 

CREATE PROCEDURE up_GetSampleSize
	@ResponseDistribution float,
	@Population float,
	@ConfidenceLevel float,
	@MarginOfError float,
	@SampleSize float OUTPUT

AS

/*	This procedure is used to determine the sample size you need to test out of a given
	population for a given set of response distribution, confidence level, and margin of error.
	This routine was developed for use in database backup verification as described at
	http://www.simple-talk.com/content/article.aspx?article=1028. It avoids messing about with
	Z tables. The math is taken from http://www.raosoft.com/samplesize.html and has been 
	translated from Java to T-SQL.
	
	For database verification purposes, reasonable values are:
		@ResponseDistribution = .1%
		@ConfidenceLevel = 95%
		@MarginOfError = 1%
	
	Example usage: 
				
				DECLARE	@SampleSize FLOAT
				EXEC up_GetSampleSize	@ResponseDistribution = .1,
										@Population = 3000,
										@ConfidenceLevel = 95,
										@MarginOfError=1,
										@SampleSize=@SampleSize OUTPUT
				
				
	
	--	SJS
		9/7/11
		shaunjstuart.com

*/

SET NOCOUNT ON
DECLARE @Y FLOAT
DECLARE @Pr FLOAT
DECLARE @Real1 FLOAT
DECLARE @Real2 FLOAT
DECLARE @P1 FLOAT
DECLARE @P2 FLOAT
DECLARE @P3 FLOAT
DECLARE @P4 FLOAT
DECLARE @P5 FLOAT
DECLARE @Q1 FLOAT
DECLARE @Q2 FLOAT
DECLARE @Q3 FLOAT
DECLARE @Q4 FLOAT
DECLARE @Q5 FLOAT
DECLARE @ProbCriticalNormal FLOAT
DECLARE @d1 FLOAT
DECLARE @d2 FLOAT

SET @P1 = -0.322232431088
SET @P2 = -1.0
SET @P3 = -0.342242088547
SET @P4 = -0.0204231210245
SET @P5 = -0.453642210148E-4
SET @Q1 = 0.0993484626060
SET @Q2 = 0.588581570495
SET @Q3 = 0.531103462366
SET @Q4 = 0.103537752850
SET @Q5 = 0.38560700634E-2

SET @ConfidenceLevel = @ConfidenceLevel / 100.0

SET @Pr = 0.5 - (@ConfidenceLevel/2.0)

IF @Pr < 1.0E-8
	BEGIN
		SET @ProbCriticalNormal = 6.0
	END
	ELSE
	BEGIN
		IF @Pr = 0.5
			BEGIN
				SET @ProbCriticalNormal = 0.0
			END
			ELSE
			BEGIN
				SET @Y = SQRT(LOG(1.0/(@Pr * @Pr)))
				SET @Real1 = @P5
				SET @Real2 = @Q5
				
				SET @Real1 = @Real1 * @Y + @P4
				SET @Real2 = @Real2 * @Y + @Q4
				
				SET @Real1 = @Real1 * @Y + @P3
				SET @Real2 = @Real2 * @Y + @Q3
				
				SET @Real1 = @Real1 * @Y + @P2
				SET @Real2 = @Real2 * @Y + @Q2
				
				SET @Real1 = @Real1 * @Y + @P1
				SET @Real2 = @Real2 * @Y + @Q1
				
				SET @ProbCriticalNormal = @Y + (@Real1 / @Real2)
				
			END
	END


SET @d1 = @ProbCriticalNormal * @ProbCriticalNormal * @ResponseDistribution * (100.0 - @ResponseDistribution)
SET @d2 = (@Population - 1.0) * (@MarginOfError * @MarginOfError) + @d1
IF @d2 > 0.0
	BEGIN
		SET @SampleSize = CEILING( @Population * @d1/@d2)
	END
	

SELECT @SampleSize

Share

One of the problems DBAs encounter when working with very large databases (VLDBs) is an obvious one – lack of disk space. These databases grow over time, seeming to swallow whatever disk space you can throw at them. As they get bigger, it also takes longer to perform basic maintenance tasks. If the database gets big enough, it’s possible that the time it takes to perform the maintenance exceeds the scheduled maintenance window. When that happens, it’s tempting to just stop performing maintenance, especially if you have never encountered any problems with the database in the past. This is a recipe for disaster.

As I went through all the databases here at my new company, I discovered a VLDB that was about 600 GB. This particular server is running on local disk drives, not SANs, and there are no more drive bays available to add disk space. The former DBA here, in an effort to stop running out of space, implemented a job to shrink the database weekly. That temporarily solved the disk space issue, but created others. Notably, performance dropped. Not only does shrinking the database completely fragment the indexes, it creates poor performance because every time the database needs to grow, it has to zero out about 60 GB of disk space. (The autogrowth setting was set to 10% of the file size.)

Did I also mention there was no maintenance being performed (other than the shrink)?

This database was a mess. The first thing I did was kill the shrink job. I then changed the autogrowth settings to something more reasonable – a fixed 500 MB. This particular database didn’t seem to be growing as much as I initially thought it would, so I started looking at getting some index maintenance going to fix all the fragmentation I was sure was present. As I dove into this more deeply, I discovered this database had over 65,000 tables with each table having about 2 to 5 indexes. Checking them, I did indeed find near 99% fragmentation in almost all of indexes. I also found something else – the Page Fullness of many of the indexes was between 8% and 12%! Now, if this database was handling thousands of DML statements daily, this might be an appropriate value. But after looking at the server load, it was clear to me this was not the case. There was no need for the index pages to be that sparsely populated. So as part of my index maintenance, I also rebuilt the indexes with a fill factor of 90%. (It also helped that this was on an Enterprise edition of SLQ Server, so on-line index rebuilds were supported and I was therefore not so chained to my maintenance window.)

Turns out, this simple maintenance solved two problems at once. Not only did performance of the database improve due to the index defrags, but increasing the page density of the indexes freed up a ton of space, helping to curb the need for additional drive space. Before I started my maintenance, the database was 605 GB in size with 75 GB of free space. After maintenance was complete, the database was  still 605 GB in size but it now had 102 GB of free space. (Now granted, the log file backups and differential backup sizes increased during the few days I was performing my maintenance – all those index rebuilds are logged after all -  but those are temporary increases that will drop off the hard disks as the backup files age.)  This free space is internal to the database files, so the disk drive available space hasn’t changed, but at least now the database shouldn’t need to grow for a while.

So keep in mind that routine maintenance not only helps performance, but can also help manage disk space as well. Do not neglect basic maintenance just because it takes a long time. That will only cause you more trouble down the road!

Share

Back in April, I ran across what I consider to be a horrendous piece of code in a stored procedure. This code cost me several hours of after hours work on a Saturday night plus wasted the time of at least four other people who were involved in this particular project. It also caused the project to be set back by 2 months.

Back then, we were looking to upgrade a piece of software we use. The database back end for this software was a huge 500 GB database. Before doing the upgrade, which was to occur on a Saturday night, I was asked to create a backup of the database. A full backup of this database takes 13 hours, so I made a full backup Friday night and planned to make a differential Saturday night. Once the differential was done, I was to email some other people and they would proceed with the upgrade.

Backups here are done using a home-grown stored procedure. This procedure is rather nice in that it checks to see if the server it is running on has LiteSpeed installed. If it does, it uses LiteSpeed to create the backup and if not, if uses native SQL commands. You pass in a parameter to specify if you want to perform a full, differential, or transaction log backup.

Friday night, the full backup was made without incident. Saturday night rolled around and I launched the procedure, requesting it to make a differential backup. Based on past history, I expected this to take anywhere from one to two hours. This is what I communicated to the other people on the project and they had planned their night accordingly.

Two hours passed and the backup was still running. Three hours passed. Still running. It was now approaching midnight. I open a new query window in SSMS and queried sys.dm_exec_requests to check on the progress. I was shocked to see that backup was only about 20% complete. I launched activity monitor and looked at the command that was being executed. It was making a full backup! I checked my job again to make sure I specified a differential and, sure enough, I had. Yet the procedure was making a full backup.

At this point, I decided to check the stored procedure and got my second surprise of the evening: I discovered the stored procedure was encrypted! At the time, I was still fairly new at this company and didn't know where I might find the source code for the procedure and since it was midnight on a Saturday, I didn't have many options. I contacted the rest of the team and told them what was going on. The decision was made to abort the upgrade and reschedule.

When I got back into the office the following Monday, I talked to the other DBA about this. He said he knew the stored procedure was encrypted but didn't know why. It was something that was done by a previous DBA who was no longer with the company. But he was able to point me to the source code. I reviewed it and found the following little nugget:

		IF				@FullDifTran = 'dif'
		AND				DATEPART( dw, getdate()) = 7

			SET				@FullDifTran = 'full'

If the procedure is called requesting a differential backup and the day of the week is Saturday, the procedure ignores what the user requested and makes a full backup.

WTF?!?!

This cost my company about 20 man hours. And since the software company's support team wasn't available again for another 2 months, we lost two months on this project.

Never never never write code like this!!! Your code should never change what the user told it to do. If you are trying to protect against the user doing something, then print a message or throw an error and abort and make the user resubmit with an additional flag to indicate "Yes, I really want to do this." But never change the command in the background without any sort of warning!

Share

One of the challenges of my new job is working with a couple very large databases (VLDBs) and very large tables. One database in particular has one table with 61 million rows and another with 78 million rows. They are growing by about 80,000 rows per day. My job is to purge data older than 1 year from these tables. Did I mention these tables support a 24 x 7 financial system? And blocking of more than a few seconds sets off all kinds of alarms? This presents something of a challenge.

Unfortunately, the tables are not partitioned. If they were (and if they were portioned correctly), this task would be simple, fast, and easy. Although I won’t go into details of partitioning here, suffice it to say that deleting or moving all the data in a partition is just a meta-data change that SQL Server can do in seconds without moving any actual rows of data. Going forward, we will need to create partitions on these tables, but for now, I needed to come up with a way to purge vast amounts of data (roughly 2/3 of each table) without causing excessive blocking or transaction log growth.

I was handed a script that was developed by a previous DBA here as a starting point. This script worked, but it was slow and caused lots of blocking. Here is a portion of it:

DECLARE	@DeletePriorToDate	DATETIME
SET	@DeletePriorToDate = DATEADD(yy,-1,current_timestamp)

/*Get the TOP x PKs to delete */

DECLARE @PKsToDelete TABLE (id INT)

INSERT   INTO @PKsToDelete
			(id)
		SELECT  TOP 1000 user_node_log_id
        FROM    v_t_u_logs WITH (NOLOCK)
        WHERE   create_time <= @DeletePriorToDate

DELETE	vtul
FROM	v_t_u_logs vtul WITH (ROWLOCK)
		INNER JOIN @PKSToDelete ptd
		ON vtul.user_node_log_id = ptd.id

That removed records from the smaller of the two tables. The next step in the script was to delete records from the larger table. In this case, the smaller table is related to the larger table via a field called node_id. Further, there is a many to many relationship between the smaller table and the larger table, so that you can’t just get a list of the node_ids to delete from the smaller table and delete them all from the larger table. Some might need to be retained. So the script I was given did the following:

CREATE TABLE #NodeIDs (Node_Id NUMERIC(18, 0))
INSERT   INTO [#NodeIDs]
        (Node_Id)
        SELECT  DISTINCT
                vtn.NODE_ID
        FROM    dbo.v_t_NODE vtn WITH (NOLOCK)
        LEFT JOIN dbo. v_t_u_logs vtul
                WITH (NOLOCK)
                ON vtn.NODE_ID = vtul.NODE_ID
        WHERE   vtul.NODE_ID IS NULL


CREATE INDEX tmpidxNodeID ON [#NodeIDs] (Node_ID ASC)

DELETE   v
FROM     dbo. v_t_n_history v
INNER JOIN [#NodeIDs] nid
        ON v.Node_Id = nid.Node_ID

Kudos for throwing an index on the temporary table to speed things up, but this still didn’t work well. In words, the process worked like this: find the first 1000 records from vtul that are older than 1 year and store their primary key values in a table variable. Then, delete those records from vtul by joining to that table variable. Next, look through vtn and find all the records that do not have a corresponding node_id left in vtul. Store those node_ids in a temp table. Delete records with those node_ids from vtnh.

This approach has one benefit – if the query is canceled at any time (due to blocking issues), subsequent runs will pick up any records that still need to be deleted. This is thanks to the left join on vtul and selecting records where vtul.node_id is null. But this approach has multiple problems. First, the temp table is populated by joining the 61 million row table with vtn, a 3.5 million row table. That’s a lot of work and creates blocking, even with the NOLOCK hint. (Much of the blocking simply came from disk I/O waits, as all those rows needed to be read.) Second, the deletion from the vtnh table, the one with 78 million rows, was also causing blocking. This one was harder to control because there was no way of controlling how many records might be in the temp table. If you had canceled the script part way through several times, the temp table might contain 10,000 or more node_ids that needed to be deleted. Clearly, this process needed to be re-worked.

My first attempt was to try to limit the number of rows being deleted with a TOP X clause and looping. This was marginally successful. Blocking still occurred and to completely eliminate it, I had to drop down to deleting the TOP 5 rows, which is pretty close to RBAR – row by agonizing row. This was surprising, as I felt the TOP clause surely would have been useful here. But in doing some research, I found a post on Robert Davis’ (twitter | blog) blog about archiving and purging from large tables. His post contained this statement: “One more side note: simply adding TOP to the delete statement will NOT minimize the amount of locks taken initially. Please don’t do this!!” Yup. I could confirm that. (I’m guessing SQL Server does the whole join first anyway, and then simply picks the top X to delete, which is why you don’t see a performance benefit.)

Reading the rest of his post showed me I was at least heading in the right direction. I was selecting records to delete based on the primary key and then doing a targeted delete of those records. But I needed to make some modifications to make this script truly useful. My first target for optimization was that join of two big tables to determine which node_ids needed to be kept. This was done to allow the routine to pick up any records that should have been deleted in prior runs, but might not have been due to the query being cancelled. I knew this delete process was a one-time task. Once we got caught up, we were going to implement a scheduled job to keep up  with deleting old records on a daily basis. So I made the decision to not worry about records that might be “orphaned” due to canceling a query. My thinking was to get most of the records deleted first, then, once we are caught up, I could run a query to find all those stragglers that were missed. The tables would be much smaller by that time, so the query should run quicker then. Another purpose of that join was to retain any records that might still have node_ids in the vtn table (which was in a many-to many relation, remember). My solution was to grab my list of primary keys of records to delete and the associated node_ids. I would delete the records identified by the primary keys, then scan the table again to see if any of the node_ids I had grabbed  were still present in the table. These needed to be retained, so I removed those ids from my table of ids to delete.

Finally, I needed some way to control the number of records I was deleting at once in order to both minimize or eliminate blocking and to prevent the transaction log from growing out of control with millions of deletions. I did this by using two TOP statements. The first limits the number of records from vtul to be deleted. The second limits to number of records to delete at a time. I already knew that TOP didn’t really seem to limit locking when joining tables in a DELETE statement, so instead, I inserted my node_ids to delete in a second table variable which would be used to join on for the DELETE. By using the TOP statement to limit the number of rows inserted into this table to a subset of my total records, I would effectively be able to limit the number of records deleted at one time. By looping through the DELETE statement multiple times, I would be able to process all the records selected from vtul. The looping would also help eliminate blocking  by periodically freeing up the table for use by other processes – also known as “trickle delete.” Here is the code I ended up using (which also includes some stuff for reporting):

DECLARE @DeletePriorToDate DATETIME
	-- Delete all records prior to this date (usually 1 year ago)
DECLARE @RowsDeletedInTable INT
	-- Counter variable for reporting
DECLARE @StartTime DATETIME
	-- For tracking execution time
DECLARE @EndTime DATETIME
	-- For tracking execution time
DECLARE @CycleLoopCounter TINYINT
DECLARE @MaxCycleLoops TINYINT
	-- Number of times to loop through this whole routine
DECLARE @WorkTable TABLE (id INT)
 -- Table containing small set of PK of records to delete
DECLARE @LastPKDeleted INT
 -- Variable used to load @WorkTable with values

SET @LastPKDeleted = 0
SET @MaxCycleLoops = 40
SET @CycleLoopCounter = 1
WHILE @CycleLoopCounter <= @MaxCycleLoops 
    BEGIN

        PRINT '=========================='
        PRINT ' Starting pass ' + CONVERT(VARCHAR(2), @CycleLoopCounter) + ' of ' + CONVERT(VARCHAR(2), @MaxCycleLoops)
        PRINT '=========================='
        SET @StartTime = CURRENT_TIMESTAMP


        SET @DeletePriorToDate = DATEADD(yy, -1, CURRENT_TIMESTAMP)
        SET @RowsDeletedInTable = 0

        DECLARE @PKsToDelete TABLE
            (
             id INT,
             node_id NUMERIC(18, 0) NOT NULL
            )

-- Get PKs to delete from vtul.

        INSERT  INTO @PKsToDelete
                (id,
                 node_id
                )
                SELECT TOP 5000
                        user_node_log_id,
                        node_id
                FROM    v_t_u_logs WITH (NOLOCK)
                WHERE   create_time <= @DeletePriorToDate


        PRINT 'Deleting data from v_t_u_LOGS (1/4)..'

/*
	Only delete in small batches to prevent blocking. Note, using the TOP keyword in a delete statement
	does not prevent blocking. So take a subset of the primary keys, put those in their own table
	and delete by joining on that 
	
*/
        WHILE EXISTS ( SELECT   vtul.*
                       FROM     dbo.v_t_u_logs vtul WITH (NOLOCK)
                                INNER JOIN @PKSToDelete pk2d ON vtul.user_node_log_id = pk2d.id ) 
            BEGIN

                INSERT  INTO @WorkTable
                        (id
                        )
                        SELECT TOP 100
                                id
                        FROM    @PKsToDelete
                        WHERE   ID > @LastPKDeleted
                        ORDER BY ID ASC

                SELECT  @LastPKDeleted = MAX(id)
                FROM    @WorkTable

                DELETE  vtul
                FROM    v_t_u_logs vtul WITH (ROWLOCK)
                        INNER JOIN @WorkTable wt ON vtul.user_node_log_id = wt.id

                SET @RowsDeletedInTable = @RowsDeletedInTable + @@rowcount

                DELETE  FROM @WorkTable

                PRINT 'Total rows deleted in table: ' + CONVERT (VARCHAR(10), @RowsDeletedInTable)
                PRINT 'Deletion completed for v_t_u_LOGS.'


            END

/* Some records with node_ids created more than 1 year ago might still be in other tables due to
	updates. We need to keep these records. So purge any node_ids that need to remain */
	
        DELETE  FROM @PKsToDelete
        WHERE   node_id IN (SELECT  node_id
                            FROM    dbo.v_t_u_logs WITH (NOLOCK))


        PRINT 'Starting Purging for v_t_n_HISTORY (2/4)..'
        SET @RowsDeletedInTable = 0

-- trickle delete from vtnh to prevent blocking

        SET @LastPKDeleted = 0
        
        WHILE EXISTS ( SELECT   v.*
                       FROM     dbo.v_t_n_history v WITH (NOLOCK)
                                INNER JOIN @PKSToDelete pk2d ON v.Node_Id = pk2d.Node_ID ) 
            BEGIN

                INSERT  INTO @WorkTable
                        (id
                        )
                        SELECT TOP 100
                                node_id
                        FROM    @PKsToDelete
                        WHERE   node_ID > @LastPKDeleted
                        ORDER BY node_ID ASC

                SELECT  @LastPKDeleted = MAX(id)
                FROM    @WorkTable

                DELETE  v
                FROM    dbo.v_t_n_history v WITH (ROWLOCK)
                        INNER JOIN @WorkTable wt ON wt.Id = v.Node_ID

                SET @RowsDeletedInTable = @RowsDeletedInTable + @@rowcount

                DELETE  FROM @WorkTable

            END        

        PRINT 'Total rows deleted in table: ' + CONVERT (VARCHAR(10), @RowsDeletedInTable)    
        PRINT 'Purging completed for v_t_n_HISTORY.'


        PRINT 'Process completed.'
        SET @EndTime = CURRENT_TIMESTAMP

        PRINT 'Elapsed time for one delete cycle: ' + CONVERT(VARCHAR(5), DATEDIFF(s, @StartTime, @EndTime))
            + ' seconds.'

        DELETE FROM @PKsToDelete
        SET @CycleLoopCounter = @CycleLoopCounter + 1
    END

(I should note that in actuality, I have to delete from 4 tables altogether, not just the two I have shown here, but the others are handled the same way. They were left out for clarity and this is why some of the messages say 1/4, 2/4, etc.)

One other thing to note. Even though the DELETE statements specify the ROWLOCK hint, this is only a hint and SQL Server will ignore it when it wants to. For example, if a data page contains many rows to be deleted, SQL Server will escalate the row locks into a page lock. This is because each lock takes up memory and resources to manage and SQL Server will opt for a page lock instead of multiple row locks on a page to conserve resources. This is one reason why I was seeing blocking with larger delete batches.

After some experimentation, I discovered that selecting the TOP 5000 records to delete from vtul and then deleting from the other table in batches of 100 seems to work well. This is  a huge improvement over deleting in batches of 5 rows from my first attempt. I’ve added a loop counter so that I can loop the whole script multiple times. I’ve been running this script continuously now for a couple of days and have not encountered any blocking issues. Also, the deletes are happening slow enough that our regular transaction log backups can keep up with the pace and the transaction log does not grow at all. (If this were not the case, I would have to add a transaction log backup as part of the script.)

When working with very large tables, partitioning is clearly the way to go. Unfortunately, DBAs often find themselves dealing with large tables that were not set up with partitions initially and are stuck with purging data the “old-fashioned” way. However, once you have caught up on your purging, the first thing you should do is implement partitioning so that going forward, you won’t have to go through this tedious process again.

Share

Got a call from the company I used to work for and they needed some help with restoring a backup. No problem. Sounded easy. Turns out, it wasn't quite so simple. In fact, I had to put on my robot hat again.

The company was trying to restore a database to a point in time. This involved restoring a full backup, then a differential backup. The full backup would restore but when they tired to restore the differential, they got this error:

This differential backup cannot be restored because the database has not been restored to the correct earlier state.

Hmm. First thing I did was try the restore myself. Sure enough, I got the same error. Ok. So the error is telling me that the differential I am trying to restore is based on a different full backup than the one I restored. Except, there was no other full backup made between the full backup I had and the differential. There were no other full backup files in the backup directory. I checked the backupset table in msdb. It only showed the full backup I had. Something strange was going on.

I used the following comands to look at the two backup files I had:

RESTORE HEADERONLY FROM DISK = '<path and filename of fullbackup>'

RESTORE HEADERONLY FROM DISK = '<path and filename of a differential backup>'

 
If the differential was based on the full backup I had, the number in the DifferentialBaseLsn column from the second command should be the same as the number in the FirstLsn column from the first command. They did not match. Clearly, there was another full backup somewhere. But nothing was logged in the msdb backup tables, so how was that possible?

Time to step back and look at the bigger picture. The database in question was a mirrored database. We were trying to restore it to a differently named database for some testing. I also discovered they had implemented some sort of SAN backup job that was taking snapshots of the SAN. This wasn't working quite right and was causing the mirrored databases to fail over to the mirror server. But each day, the admin would fail everything back to the original principal server.

Could the SAN backup somehow be making a SQL backup without SQL logging it? Maybe. Then something my old co-worker said got me thinking. I've written about the backup jobs I use before. They are written such that if a differential or transaction log backup is needed and there is no full backup, it will make a full backup so the job won't fail.

Here's what I surmised was happening: The SAN backup causes the mirrored database to fail over to the other server - where this particular database is normally in the restoring state because it's usually the mirror copy. On that server, the differential backup job runs. It discovers the failed over database, which is no longer in the restoring state but is now active, does not have a full backup made on that server (my job queries the msdb backup tables to determine this and since it is usually in the restoring state, it is never backed up, hence no backup log entry in msdb), so it makes one. Hours later, the admin comes in and fails the databases back to the other server. That means a full backup was made on what was now the mirror server and left there. Bingo! We checked the backup directories on that server and found a full backup. We used that and were able to restore with the differential we had. This also explains why the principle server had no record of the full backup being made in the msdb tables - it was actually made on the other server.

Share