Shaun J Stuart

Just another SQL Server weblog

What annoys you? For me, a few things I find annoying are people talking on cell phones in restaurants, people chewing with their mouth open, and getting calls from telemarketers. SQL jobs failing also annoy me. OK, maybe not as much as Chatty Cathy yakking with her friend on the phone during dinner, but it's close.

I hate it when my maintenance jobs fail for some reason. I REALLY hate it when they fail intermittently. That means there is usually some strange situation that I didn't think about when writing my code that pops its head up once in a while and crashes my routines. Unless it's something serious, I can ignore it for a while. If it happens infrequently, it's tempting to just let it be. But it always sticks in my craw, bothering me like a tiny sliver in my finger.

This is how I discovered the issue with TRY / CATCH blocks and restoring databases that I wrote about last time. I finally got so fed up with the seemingly random failures, I had to dig in and fix the thing once and for all.

This is also the driving fforce behind the iterative revisions of my backup routines. I got tired of my transaction log backup failing every time a new database was created or a database was restored onto a new server. Rather than having the routine fail with a "no current database backup" error, I wrote code to check to make sure there was a full backup. If there wasn't, the script would make one. Problem solved.

Or mostly solved. Occasionally, the script would still fail with the same error. Failures were much rarer, but they were still happening. It was annoying.

Some research led me to discover that the latest cause of failure was databases that were in full recovery mode, switched to simple mode, then back to full recovery. If you try to make a log backup in this situation, the backup will fail with a "no current database backup" error. The question was, how can I determine if the recovery mode of a database changed?

The answer lies in the sys.database_recovery_status table. This table has one row for each database and if the last_log_backup_lsn column is NULL, a full backup needs to be performed.

So I revised my log backup script once again, this time adding this:

  IF EXISTS ( SELECT  1
      FROM    sys.database_recovery_status
      WHERE   database_id = DB_ID(@Database)
             AND last_log_backup_lsn IS NULL )
        /* make a full backup, then continue */

Now, at long last, my log backup script should be fairly bullet proof. It will detect if a database is new, if it has been restored from another server, or if the recovery mode has changed. None of these will cause it to fail with a "no current database backup" error. And I can't think of another reason this error might be raised,  so I am hopeful I will never see this error again.

Note: Reading the MSDN article about this table does not seem to state that this column behaves this way. However, I've verified this behavior on SQL 2008 R2 and this tip was given to me by Paul Randal, so I'm pretty comfortable using it.

Share

I've been experiencing an intermittent problem with my automated backup testing routine for some time now. I would occasionally get a failure during a restore, but when I manually ran the exact same restore command, the restore worked without issue. Because my testing routine selects a random sample of backups to test, I was not hitting this error all the time. Being somewhat busy, I put this on my back burner to investigate later.

After the problem happened a couple more times, I realized that the error always seemed to occur with a particular group of databases. The error message in the SQL log said "During upgrade, database raised exception 4147, severity 25, state 1, address 000000000069A96B. Use the exception number to determine the cause." Googling this exception didn't return anything useful except for this forum post. At first, I didn't put two and two together and I didn't see how this might help me. Then I finally had some inspiration.

Suppose the databases that were failing the restore contained some old style non-ANSI joins, such as *= or =*. These have been deprecated. The message from the forum seemed to imply that restoring a database with these commands manually in SSMS would work, but if you did the same restore in a TRY / CATCH block, it would fail. I was starting to think I had found my problem. When I manually restored the databases, they restored fine. But if my automated routine restored them, which is does inside of a TRY / CATCH block, it fails. I decided to see if I could prove this out. It turns out, I could.

If you want to follow along at home, you'll need a SQL 2005 server and a SQL 2008 R2 server. This may work on a 2008 server, but I have not tried it. It will probably also work on a SQL 2012 server.

First, on the SQL 2005 machine, run the following to create a test database, two test tables, some fake data, and a stored procedure that uses the old-style ANSI joins. Note that I am explicitly setting the database to SQL 2000 compatibility mode.

/* run this on a SQL 2005 server */

CREATE DATABASE [JoinTest];
GO
USE [master]
GO

EXEC dbo.sp_dbcmptlevel
    @dbname = N'JoinTest'
   ,@new_cmptlevel = 80
GO

USE JoinTest;
CREATE TABLE dbo.Table_1
       (
        Column1 INT NULL
       ,Column2 NVARCHAR(50) NULL
       );
GO

CREATE TABLE dbo.Table_2
       (
        Column1 INT NULL
       ,Column2 NVARCHAR(50) NULL
       );
GO

INSERT  INTO table_1
VALUES  (1, 'Table one-one');
INSERT  INTO table_1
VALUES  (2, 'Table one-two');
INSERT  INTO table_2
VALUES  (1, 'Table two-one');
GO

CREATE PROCEDURE ProcJoinTest
AS
       SELECT   *
       FROM     Table_1 o
               ,table_2 t
       WHERE    o.Column1 *= t.Column1;
GO

If you do not set the compatibility mode to SQL 2000, the CREATE PROCEDURE statement would throw an error: "The query uses non-ANSI outer join operators ("*=" or "=*"). To run this query without modification, please set the compatibility level for current database to 80 or lower, using stored procedure sp_dbcmptlevel. It is strongly recommended to rewrite the query using ANSI outer join operators (LEFT OUTER JOIN, RIGHT OUTER JOIN). In the future versions of SQL Server, non-ANSI join operators will not be supported even in backward-compatibility modes."

However, suppose this database has been around a while and has been moved from a SQL 2000 server to SQL 2005 or the server itself was upgraded, so the compatibility mode was at 80 when the procedure was created. After the upgrade, a DBA realized this was an old database and changed the compatibility mode to 2005 (90). Here's the rub - just changing the compatibility mode will not automatically raise the same error. If that join with the non-ANSI syntax ever gets executed, then the error will be raised, but until then, no warnings are given. So it's possible you can have this little time bomb sitting in your database for a long time before it blows up with an error.

So, continuing with our experiment, now let's change the compatibility mode to 90 and make a backup. We'll drop the database after we are done, as we won't need it again on this server. Be sure to modify the path to suit your environment.

EXEC dbo.sp_dbcmptlevel
     @dbname = N'JoinTest'
    ,@new_cmptlevel = 90
GO

BACKUP DATABASE [JoinTest]
TO  DISK = 'C:\Program Files\Microsoft SQL Server\MSSQL.4\MSSQL\Backup\JoinTest.bak';
GO

USE master;
DROP DATABASE JoinTest;
GO

Now copy your backup file over to your SQL 2008 R2 server. We're going to restore this two ways - 1) using a straight RESTORE statement and 2) using a RESTORE in a TRY / CATCH block.  Again, change the paths to suit your environment.

/* the following command works */
RESTORE DATABASE JoinTest
	FROM DISK = 'E:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Backup\JoinTest.bak'
	WITH MOVE 'JoinTest' TO 'E:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Data\JoinTest.mdf',
	MOVE 'JoinTest_log' TO 'F:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Data\JoinTest.ldf';
GO

That works. If you look at the output, you can see SQL going through the upgrade steps to convert the database for use on a SQL 2008 R2 system:

WorkingUpdate

Now let's drop the database and restore it using a TRY / CATCH block:


USE master;
DROP DATABASE JoinTest;
GO

/* the following should fail and hit the catch block */
BEGIN TRY
      RESTORE DATABASE [JoinTest]
		FROM DISK = 'E:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Backup\JoinTest.bak'
		WITH MOVE 'JoinTest' TO 'E:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Data\JoinTest.mdf',
		MOVE 'JoinTest_log' TO 'F:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Data\JoinTest.ldf';
END TRY

BEGIN CATCH
      SELECT    ERROR_MESSAGE();
END CATCH

USE master;
DROP DATABASE JoinTest;
GO

Look at the output from this command. The upgrade stopped during the conversion from version 628 to 629 and the command aborted.

NonWorkingUpdate

The only other mention of this behavior I could find on the web was this post, also in a forum. It's very strange that SQL Server raises an error when doing a restore one way but not another. Hopefully, this post can raise people's awareness of this little quirk.

 

Share

I've been playing around some more with SQL 2012 and I discovered another nice little improvement. This isn't anything huge, but rather a nice little touch that makes my life as a DBA a little bit easier.

If you have your SQL Agent jobs configured to send you email when they fail (or succeed or complete), in the past, you got an email with a subject line that read:

SQL Server Job System: '<job name> completed on \\<instance name>

That just tells you that the job ran. You had to actually open the email and see if SQL reported failure  or success.

With SQL 2012, email notification subject lines now look like this:

[The job succeeded.] SQL Server Job System: '<job name>' completed on \\<instance name>

or

[Failed] SQL Server Job System: '<job name>' completed on \\<instance name>

To me, that's a huge quality of life improvement. Now it's really easy to see if a job failed or succeeded. It's also very easy to write automatic routing rules in your email client to route the messages appropriately. It also can make things less confusing for end users. Sometimes another member of the IT team will be updating some software during a maintenance window and they want a backup of the database made before they start. In this case, I would often schedule a backup job at the start of the maintenance window and have SQL send an email to the person doing the upgrade when it completed. When they got this email, they were OK to start the update. That enabled me to stay out of the process (which is nice when the maintenance window starts at 1 AM). But the problem is I would have to make sure to tell them that subject line just says the job completed and they would need to open the email to see if it actually succeeded or failed. Now, they can just read the subject line.

Share

Last weekend was patch weekend and several of my SQL Server machines got Windows Updates applied and were rebooted. Today, I discovered that on a couple of severs, SQL Agent did was not running. The first thing I did was verify that the service was set to autostart. It was. Hmm. Time to dig into the logs.

SQLAgentTriggerStartupError

The user that generated the login failure was the SQL Server agent account. Database 5 was my DBA utility database. Now things were starting to make sense. I also realized that all the servers where SQL Agent did not start after a reboot were ones where I had a logon trigger installed. The trigger is an auditing tool that logs each user's login time to my DBA utility database.

What was going on? SQL Agent was set to autostart, so it was coming up as soon as SQL Server itself was running. But looking at the log, I could see that SQL Agent was starting up before SQL Server had brought all the databases online. In fact, prior to the entries shown above, the only "Starting up database..." message was for the master database and tempdb. All the other databases came online after the "login failure" errors. Notice the 904 error  - 'Database 5 cannot be autostarted during server shutdown or startup." This is telling me that the database was not started when the trigger tried to access it. Because that database has autoclose disabled, I know the only reason it was not started yet was that SQL Server was still in the process of bringing it online during system startup.

So how do I fix this problem? There are a couple of possible solutions. Because the master database is the first one brought online by SQL Server, I could move the trigger logging tables into that database. This would ensure that the logon trigger will always work because the master database will always be online before SQL Agent starts. However, I strongly dislike putting any user code in any system databases, so I wanted to find another solution.

Another option would be to change the logon trigger to not log anything by the SQL Agent service account. This would require hard coding the service account into the stored procedure, something I do not like doing. If we ever change that account, I'd have to remember to also change the stored procedure on every server where it was used. (My actual trigger does have a list of accounts to ignore, but that data is stored in a table in the utility database, so it is not accessible during this boot up period.)

The fix I ended up using turned out to be the Windows Delayed Start option for services. Windows Server 2008 includes a new service startup option called "Automatic (Delayed Start)." I changed my SQL Agent service to use this setting and my problem went away. SQL Agent did not start for about 1 to 2 minutes after SQL Server came up, which allowed plenty of time for my DBA utilities database to be brought online. The trigger could then access it when the Agent service started.

SQLAgentDelayedStart

Note that this may have some other effects on your system. If you have any jobs that are set to run when SQL Agent starts, you'll need to test these and make sure they still do what they are supposed to do when the service starts a bit later than normal. However, in my case, the delayed start option was a perfect solution to this problem.

Share

Brent Ozar Unlimited has released a new version of their awesome sp_Blitz server checkup script. I'm pleased to report that this version includes some checks that I submitted: a check for the last successful DBCC CHECKDB, a check to see if @@servername is null, and an update to the backup check to make sure the backup was made on the server the script is running on. These checks were all based on situations I ran into and have blogged about here.

I'm glad to have contributed to this effort!

Share