Too Many Bits

A commenter recently reminded me that, although I had mentioned that I attempted to do a migration from SQL 2000 to 2008 a while back, failed, and had to go back to the old version, I never really explained what happened. So now I will!

We had a SQL Server cluster running SQL 2000. One of the machines in the cluster failed. The other machine was old as well, so we decided to build a new cluster using new machines and running SQL Server 2008. This cluster is heavily used by our external customers and it was approaching our peak busy season. (We’re in the education field, so the August / September back to school season is our peak usage time.) We wanted to replace the cluster and move to the new version of SQL before the big rush started.

This cluster supports several applications. First and foremost, it supports our own proprietary software that is used in schools. It also houses databases for our accounting software, our development bug-tracking software, and some internal and external websites. (I don’t like the setup. The server is used by both internal and external programs and I want to segregate those to different servers. It’s on my to-do list, but that was not within the scope of this move.) Our software development team had looked into migrating to SQL 2008 previously and, although they never did it, they didn’t think there would be any problems.

I did some research to see if I could identify other potential problems before the actual move. I downloaded the Microsoft SQL Server 2008 Upgrade Advisor and fed it some traces of typical SQL usage that I gathered. It came back with several possible problems, which I narrowed down to the four I considered the likeliest risks:

  1. One database referenced the undocumented system table spt_values and the procedure sp_spaceused.
  2. Full text indexing used the command sp_fulltext_service ‘load_os_resources’, 1.
  3. Possible data type conflicts in one database.
  4. Trace flag 3226 was in use.

Item 1 was a concern simply because system tables definitely changed between versions 2000 and 2008. We had other SQL 2008 servers already running and I checked those. I found spt_values still existed in SQL 2008. The warning about sp_spaceused was just a note that SQL 2000 did not always store space allocation information correctly and that I should run DBCC UPDATEUSAGE after converting the database. Noted. There was still some uncertainty about the spt_values table. I wasn’t sure it contained the same data or columns as it did in SQL 2000, or was used in the same way. However, the database these warnings came from was for a BlackBerry program and there was a newer version available that supported SQL 2008. If we ran into problems, we would upgrade to the newer version.
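A quick sanity check on any existing 2008 instance is simply to query the table directly; this isn’t anything the Upgrade Advisor asks for, just a way to confirm the table still exists and looks familiar:

-- Does the undocumented table still exist, and does it still have the familiar columns?
SELECT OBJECT_ID('master.dbo.spt_values') AS spt_values_object_id;

SELECT TOP (10) name, number, type
FROM master.dbo.spt_values;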

On closer inspection, Item 2 didn’t really seem to be as big a problem as I thought it might be. By default, this feature is turned off, and this command was turning it on, which is why the Upgrade Advisor flagged it. According to this, we could run into problems if the verify_signature flag was set and our operating system binaries were not signed. That flag didn’t show up in the Advisor report, so I didn’t think it was being used and therefore wouldn’t be a problem.
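For reference, the flagged statement and the related signature setting look like this; the verify_signature line is shown only as the setting we confirmed we were not using:

-- The statement the Upgrade Advisor flagged: load the OS word breakers, filters, and protocol handlers
EXEC sp_fulltext_service 'load_os_resources', 1;

-- Only relevant if you need to load unsigned binaries; we were not setting this
-- EXEC sp_fulltext_service 'verify_signature', 0;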

Item 3 could be an issue, but this was in a database that wasn’t really in use anyway. It was used by ScriptLogic’s Active Administrator, an Active Directory tool. Someone in our IT department had installed it to test it out and we were not currently using it. If it stopped working, it was no big deal.

Item 4 referred to a trace flag that stops the writing of backup completion messages to the Windows event log. I already had this flag set on our other SQL 2008 machines, so I knew it would not cause a problem.
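For anyone unfamiliar with it, checking and enabling the flag at runtime looks like this (it can also be added as a -T3226 startup parameter so it survives restarts):

-- Is trace flag 3226 already active globally?
DBCC TRACESTATUS (3226, -1);

-- Enable it globally to suppress backup-success messages in the logs
DBCC TRACEON (3226, -1);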

One last thing I needed to do was to make sure I could transfer the logins from the old server to the new server. We had a couple of SQL Server logins that were in use and whose passwords had been lost, or at least were not readily available, so I would not be able to recreate them. I wrote about how I transferred these logins previously. I also had to set up replication, which I also wrote about previously.

So from my research, it looked like we should be OK to move to SQL 2008, and we set a date for the move. We had most of our IT department involved. As the database guy, I was definitely involved. Our network guy and system admin guys were involved because, unfortunately, our company’s legacy software that used this database server had some IP addresses hard-coded, so we needed to put in some DNS aliases to redirect to the new server once the database had been transferred. We also had several members of our QA department lined up to put the new system through their testing and make sure things still worked.

The weekend of the migration came. We started on a Friday night. We first put our websites into a maintenance mode so no one could access the databases. I made backups of all the databases, then copied those backups to the new server, then shut the old cluster down. There was a large amount of data to copy, so it took a couple of hours to move everything to the new server. Then I restored the backups to the new server. Everything went well. I ran DBCC UPDATEUSAGE on all the databases. It found lots of errors, but fixed them all. The DNS redirects were put in place and, after waiting a bit for that info to propagate through the network, our QA team started testing. It was Saturday morning at this point.
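I no longer have the exact script, but the UPDATEUSAGE pass amounted to something like this one-liner (sp_MSforeachdb is undocumented, so treat it as a convenience sketch rather than the definitive approach):

-- Correct the page and row counts carried over from SQL 2000 in every database
EXEC sp_MSforeachdb 'DBCC UPDATEUSAGE ([?]);';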

We ran into a couple of problems but they mostly involved the DNS redirects. We had to restart some of our web servers so that they got the new DNS information. All in all, things were going surprisingly well. While QA was testing our software, I started working on getting replication back up and running. I already wrote about how that went, so you know that took several hours of my time. But things were looking good overall. All our third party programs seemed to be functioning correctly with the new server. Things were looking up.

Then Saturday night rolled around and QA found a problem. Our company’s product allows customers to upload data from ASCII files into our database. That was failing. We were getting an error like this:

Cannot initialize the data source object of OLE DB provider “MSDASQL” for linked server “<servername>”. OLE DB provider “MSDASQL” for linked server “<servername>” returned message “[Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified”. (Microsoft SQL Server, Error: 7303)

Hmm.. Linked servers? I didn’t think we had any of those on the old cluster. It was offline now and we couldn’t bring it back up to check because we’d get IP conflicts. So I made some linked servers to the other servers I knew had any sort of interaction with this server. The error persisted. We spent several hours chasing down possible problems but nothing seemed to work. It was very late at this point, early morning actually, and we decided to call it a night and get some rest. Hopefully, a little sleep would help us find something we were missing.

Sunday morning came. We gave upper management a status update. We had all functionality working except this one upload routine. We were given a 1 PM deadline. If we couldn’t get it working by then, we had to roll back to the old server. At least rolling back would take very little time. It pretty much only involved removing the DNS redirects and bringing the old server back online. But I still didn’t have any idea what the problem was.

As is so often the case, the text of the error message was leading us down the wrong path. It turned out the problem had nothing to do with linked servers, at least, not the linked servers any DBA would think of.

We spent a couple more hours spinning our wheels and getting nowhere. I started thinking about all the things that were different between the two servers. The version of SQL Server, obviously. The hardware, but that shouldn’t be an issue. The OS was a newer version of Windows Server. Hmm. And the new cluster and new SQL were 64-bit while the old one was 32-bit. That was my only lead at this point, so I started investigating that.

And it turned out, that was where the problem was. The upload routine that was failing was called from a web page. The web server serving up that page was a 32-bit machine. I found this post with a Microsoft employee saying that 64-bit SQL Server can only use a 64-bit ODBC driver. The one installed on the web server was 32-bit. The only solution I could think of at the time was to upgrade the web server to 64-bit. That was not something we were prepared to do, so that was out. We were rapidly approaching our 1 PM deadline.
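(As an aside, if you ever need to confirm which flavor of SQL Server you are connected to, the edition string spells out the bitness:)

-- The Edition property includes "(64-bit)" on x64 builds
SELECT SERVERPROPERTY('Edition') AS Edition, @@VERSION AS VersionInfo;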

At this point, I thought we were out of options. I assumed that the upload process was driven by some code running somewhere outside the database. Then one of my colleagues said something that got me thinking – maybe a stored procedure was involved. So I started hunting around and, sure enough, I found 5 stored procedures that were used in the upload process! What’s more, those procedures all used the OPENROWSET command! And MSDN says this about OPENROWSET: “This method is an alternative to accessing tables in a linked server…” Linked server?! Bingo!! Even though we were importing from an ASCII file, OPENROWSET typically deals with linked servers, which is why the error message referenced linked servers! And it was in a stored procedure, which I could change, not program code somewhere that I couldn’t!

At this point it was about 12:15 PM. I was sure I had found the problem, but time was running out. I communicated my thoughts to the rest of the team. Could I change the stored procedures to not use OPENROWSET? I tried on one, rewriting the procedure to use BULK INSERT instead. This was a native T-SQL command and shouldn’t have any ODBC issues. Once I made the change, I had QA test again. It failed again, but this time on a different file – meaning it got through the one stored procedure I changed and hit one of the other four that I hadn’t. We were very close to 1 PM now and I quickly changed the remaining procedures to use BULK INSERT. QA tested again and it worked! Cheers all around!!
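The rewrites amounted to something like the following. This is a sketch, not our actual procedure; the staging table, file path, and delimiter are placeholders for illustration:

-- Load the uploaded ASCII file straight into a staging table, no OLE DB/ODBC provider involved
BULK INSERT dbo.ImportStaging          -- hypothetical staging table
FROM 'C:\temp\import.csv'              -- hypothetical file path
WITH (
    FIELDTERMINATOR = ',',             -- delimiter used by the uploaded file
    ROWTERMINATOR = '\n'
);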

A short time later, we got an email from QA. Another test failed. Our spirits plummeted.

It turns out we allow our customers to import files with or without some optional fields. BULK INSERT can’t handle this; it needs the files to be in the exact same format each time, and that was the cause of the latest failure. We were stuck and there was no magic T-SQL code I could pull out of my hat to fix this one. But I did feel this was a change that our software department could make fairly quickly on their end – simply format the file to a fixed layout before calling the stored procedure. It was a bit after 1 PM by this time, so we put another call in to upper management. Everything worked except this one case. We gave the reason for the failure, as well as my opinion that a fix from software engineering would be relatively easy to make. (As a former database developer, I felt I had a reasonable idea of the work involved in the change.) Should we stay on the new server with 99.9% of our functionality and give software engineering a day or two to develop a fix that would bring us to 100%, or should we roll back to the old server (which, we pointed out, was no longer part of a cluster, leaving us exposed to another hardware failure that would cripple us)? The answer came back: roll back to the old server. It was not a decision I agreed with, but one I had to work with.

We spent another fairly depressing hour moving everything back to the old server. As expected, that part went pretty easily. We were all disheartened to have come so close, only to have to turn back.

The next day, Monday, I wrote up a summary of what we found for my boss and included an executive summary that presented the key issues. The good news was that BULK INSERT is not a 2008-specific command, so our software team could develop the fix and implement it on the current 2000 server at any time. Then, when we tried the migration again, we wouldn’t run into this issue and things should go smoothly. As it stands at the time of this writing, we are still waiting for software development to make the change. We are well into our busy season now, so we’ll have to wait until we are through that before we try the migration again.

All in all, I think we were remarkably successful in everything we tried to do. We had over a dozen programs that used this database server and we got 99.9% of them working on the new cluster before we had to turn back. I still can’t believe Microsoft would allow this 64-bit limitation to have gone out the door. Surely, they can’t think companies would upgrade all their machines to 64-bit at the same time? Nevertheless, this was a good example that, no matter how well you might prepare, there is usually something you’ll miss that can mess you up and it will likely come from a direction you don’t expect. If you are able to quickly and accurately diagnose the problem, you’ll stand a better chance of fixing it during your maintenance window and continue moving forward.

Update: After writing this, our software development team looked into the problem and came up with a different and simpler solution. We need to install the Microsoft Access Database Engine Redistributable driver (found here) and change the stored procedures to use this driver instead of the MSDASQL driver they were using. When you install the Access driver, you will see a new Provider in SSMS:

(If you are doing this on a cluster, be sure to install the driver on all machines in the cluster.) Once that is installed, we had to make two changes to the server. First, we needed to enable the new Provider. Right-click the provider name and choose Properties. Put a check in the Allow inprocess checkbox:
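If you prefer to script that change rather than click through SSMS, the same option can reportedly be set with the undocumented sp_MSset_oledb_prop procedure; I used the dialog, so treat this purely as an equivalent sketch:

-- Equivalent of ticking "Allow inprocess" on the ACE provider
EXEC master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'AllowInProcess', 1;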

The second change was to run sp_configure and enable “Ad Hoc Distributed Queries”.
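In T-SQL, that second change is just the usual sp_configure dance (the option is an advanced one, so it has to be exposed first):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;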

Finally, the stored procedures that were using the OPENROWSET command needed to be changed to use this new driver. The old statement was:

SELECT *
FROM OPENROWSET('MSDASQL', 'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=c:\temp\;', 'select * from import.csv')

The new statement is:

SELECT *
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0', 'Text;Database=C:\temp\;', 'select * from import.csv')

There is one more issue you may encounter, depending on what account your SQL Server is running under and how you have permissions set up on your machine. When I tried to run the above statement, I was getting this error:

OLE DB provider “Microsoft.ACE.OLEDB.12.0” for linked server “(null)” returned message “Unspecified error”.
Msg 7303, Level 16, State 1, Line 1
Cannot initialize the data source object of OLE DB provider “Microsoft.ACE.OLEDB.12.0” for linked server “(null)”.

Believe it or not, this is a file permissions issue and it has nothing to do with the location of the file you are trying to import. What is happening is that some temp files are created during the import, and it is the location of these temp files that is causing the problem. The account that is executing the OPENROWSET command needs to have access to the user temp directory of the account SQL Server is running under. This post led me to this solution. The actual directory in my case (Windows Server 2008 R2) was C:\Users\<SQL Service account name>\AppData\Local\Temp. I granted full access to everyone on this directory and my query worked. Again, if you are setting this up on a cluster, be sure to set the permissions on all machines in the cluster.

Phew. Talk about going down the rabbit hole!

We have not yet tried the entire migration again, but we have installed the above driver on our new cluster and have successfully loaded a test file, so it looks like this fix will work.
