The big replication project that I spent a couple weeks planning and testing for arrived last weekend. To recap, I needed to set replication of a rather large database with SQL 2000 as the publisher and distributor and a new SQL 2008 cluster as the subscriber. As I wrote previously, I had tested this out fairly thoroughly on a test system and was pretty sure everything would go smoothly. Of course, that rarely happens.
To start the project, I got out my checklist. You do use checklists, don’t you? There’s a good reason why airline pilots use them. When something is pretty complicated, the chances of forgetting a step are pretty high. So when I was testing, I wrote a step by step checklist of everything that needed to be done, in the order it needed to be done. When the time came to set this up for real, I went through my checklist without any problems. Looking good! Then I tested the replication.
Whenever I set up replication, I always make a small test table and include it in the publication. That way, I can do inserts, delete, and updates in that table and verify replication is functioning. It’s also a good way to show others that replication is functioning. Rather than just show someone an icon that says a job is running, show them that you can make a change in a table on one server and, voilà, the change appears on the other server. I usually name my test table aaasjs, aaa so it shows up at the top of an alphabetical list and sjs, my initials, so I know this table is used by me only and I can delete or alter it at will. So I made a change in my test table on the publisher and looked to see if it was replicated to the subscriber. Nope. Checked the distribution agent on the publisher and saw it had failed with this error: Could not find the stored procedure ‘sp_MSins_aaasjs’.
Hmm.. I had run this whole process through about 5 times during my testing and I was pretty sure the process worked ok. I even tested using a copy of the actual database being replicated, so I didn’t expect any surprises from weird data or tables. Now, I was initializing replication using a backup copy of the source database. This was because the database was simply too large to send a snapshot over the network for initialization in a reasonable amount of time. But still, I tested this multiple times and it worked. The one difference was that my testing was done with the subscriber being SQL 2008 and here I was actually using SQL 2008 R2 CU1. I didn’t think that was the issue though.
I won’t rehash all the steps in the process, as I outlined them in detail here. But just to make sure I didn’t make a mistake somewhere, I re-did the whole procedure from scratch. Same results. I decided to take a closer look at the error the distribution agent was throwing.
If you didn’t know, SQL Server implements transactional replication by, in part, using stored procedures. When you set up replication, SQL Server creates three stored procedures for each table that is being replication. One procedure handles data inserts, one handles data deletions, and one handles updates. These procedures are named sp_MSins_tablename, sp_MSdel_tablename, and sp_MSupd_tablename, where tablename is the name of the table that is being replicated. These procedures are created on the subscriber in the replicated database. My error message was telling me the insert stored procedure for table aaasjs, which is my test table, was missing. I looked at the stored procedures and found the update and delete procedures for that table were present, as were the insert procedures for other tables. Something strange was going on.
Because I was initializing from a database backup, part of the setup process involves running scripts that create all these procedures. I look at the script that creates the insert tables. (This is Step 8 in the Details section of my previous post and the step marked with an asterisk in my checklist above.) Notice that, due to a bug in SQL Server, this script is expected to generate errors when it is run. Specifically, it will generate one error for every table that has an identity column in it. The fix for this is to run a second script that makes update procedures that replace the ones that generated the errors in the step before.
I took a closer look at the errors this script generated. Most were the expected error: “cannot update identity column…” But interspersed with those, I saw a couple errors that said “column not found: msrepl_tran_version” Wow.. That’s a unique identifier column that SQL will add to tables that are being replicated when queued updating is enabled. This is where things get really weird. The tables in my publication database all had this column. Some of the tables (about 40 out of 780) in my subscriber database did not! This despite the fact that the subscriber database was created from a backup of the publisher database! I was doing this work on a Saturday and the new SQL Server wasn’t yet available to others, so I know I was the only one using it. So either some process dropped those columns or something I don’t want to think about was going on. (I did run DBCC UPDATEUSAGE on the subscriber database and it fixed some errors, but I don’t think that would cause columns to drop. More on this function later.)
I didn’t really want to try starting from scratch again. As I said this was a huge database. It took about 30 minutes to copy the compressed backup over the network from the publisher to the subscriber, plus about 30 or so to restore it. I was working on a Saturday afternoon and I didn’t want to stay any longer than I had to. Since the list of tables missing the columns was fairly short, I decided to go ahead and manually add the column to those tables. After that was done, I re-ran the scripts to generate the replication procedures, and my tests then worked.
My workload right now doesn’t permit me time to go back and investigate how those columns got dropped from some tables. I had time enough only to fix the errors and continue on with the process. (Unfortunately, other portions of the migration, unrelated to databases, failed and we had to roll the whole thing back and return to the old server.) But this is a great example of where knowing how SQL Server works can be a big help. If I didn’t know that SQL uses three procedures per table to implement transactional replication, I would have been in a world of hurt. Additionally, my use of checklists enabled me to easily start over and to back out my changes without fear of missing a step somewhere.