I’ve been spending some time lately working with statistics, specifically auto-created statistics. When auto-create statistics is enabled, SQL Server will create column statistics when it needs them. No surprises there. Conventional wisdom also states that SQL Server never creates multi-column statistics, only single column statistics. In fact, this MSDN page says this of the auto create statistics option:
“It applies strictly to single-column statistics for the full table.”
Seems pretty clear. SQL Server will not automatically create multi-column statistics. I’ve heard this over and over from many different sources.
So imagine my surprise when, while testing some code, I came upon entries in the sys.stats_column system table that looked like this:
Wow.. That sure looks like three auto-created, multi-column statistics! We have three stats: stats_ids 3, 4, and 5. The sys.stats_column table contains one row for each column that is in a statistic, so multiple rows for a single statistic (i.e., a single stats_id value), indicate multiple columns in that stat. Indeed, the column_id values indicate the table columns contained the stat. So stats_id 3 contains columns VersionMajor and ApplicationID (column_ids 3 and 1), stats_id 4 contains columns VersionMinor and ApplicationID (column_ids 4 and 1), and stats_id 5 contains columns VersionRevision and ApplicationID (column_ids 5 and 1). And, clearly, the auto_created flag is true, so these three stats were auto-created. What’s going on?
I puzzled over this for quite some time got nowhere. I finally reached out to the SQL Server community for help. Erin Stellato (b | t) said she had never seen that before, but she did mention the naming of the statistics seemed strange. I noticed this as well. As Paul Randal has blogged, auto-created statistics are named as _WA_Sys_<hex value of column>_<hex value of table>. Typically, the names look something like _WA_Sys_0000000D_00000029.
But these stats don’t follow that format. Instead of the hex value for the column, they have the column name.
A couple hours later, Paul White (b | t) responded to my tweet:
Aha! It seemed highly likely that this was the cause of what I was seeing. Although I was working with a database on a SQL 2012 server, I was pretty sure that database was originally created on an older version of SQL. I wanted to confirm this however.
To verify what SQL version the database was created with, I used the DBCC DBINFO command. The value of the dbi_createVersion field indicates what SQL version was used to create the database.
Thanks to Paul Randal’s TechNet post, we know that a value of 539 means the database was created with SQL 2000. Looks like Paul White might be on to something.
The next piece I wanted to verify was that auto create stats for SQL 2000 included the table’s clustering key. Unfortunately, deep dive details for auto-create stats for SQL 2000 are somewhat hard to come by these days and I was not able to find anything that definitely stated the clustering key was included. However, in all the auto-created stats in all the old databases I found on my servers, this was the case. I also verified that if the clustering key was more than one column, all the clustering columns were included in the auto created stats. I’m going to go ahead and assume this is the way it worked back in the day.
And finally, I wanted to find something confirming the naming convention. Again, details were hard to find, but I did find this ebook which states that the names of the auto-created statistics in SQL 2000 use the column name, not the hex value of the column.
This is enough evidence for me, so I’m going to call it: I’m going to go ahead and say that auto created column statistics can be created on multiple columns. Granted, you have to be running SQL 2000 and the additional columns are always the clustering key columns, so it’s very specific case and it no longer happen in modern versions of SQL Server.
Still, if you are ever asked if SQL Server will automatically create multi-column statistics, you can now give the classic SQL Server answer: It depends :-)
If SQL doesn’t behave this way anymore, why is this important? This issue caught my eye when I was writing some code to examine the statistics tables and when I joined to the sys.stats_columns table, I was getting back more rows than I expected. Because I was joining to other tables as well, this messed up my final result set. If you’ve got databases that have been migrated from older versions of SQL, you might run into this problem as well. Come back next week for the story on that…
P.S. Just for kicks, I deleted one of the old, two column stats and ran a query so that a new stat would be created. Would the new statistic follow the old or new SQL Server rule? Turns out, not surprisingly, the newly auto-created stat followed the new rules: it was single column, did not include the clustering key, and followed the new naming convention using the hex value for the column name. This was tested using SQL 2012.
I also wanted to see what would happen if the database was running in SQL 2000 compatibility mode. On a SQL 2005 server, I found a database that was created in SQL 2000 and had some of these strange auto-created stats. I changed the database to SQL 2000 compatibility mode, deleted one of the auto-created stats, and ran a query to force a new one to be created. The result? A single column stat that did not include the clustering key and that was named using the hex id of the column. In other words, the stat was made using the method we see in newer versions of SQL Server. So the database compatibility mode does not affect this. The only way multi-column auto created stats can be generated is if you are using the SQL Server 2000 (or earlier, most likely) database engine.
I know this is an old post, and it took me a while to find it – but I am so glad I did. I could not understand what I was seeing in a database system that probably started out in SQL7 and which the vendor just upgraded over the years. I was busy assessing the indexes and was unable to drop an index due to the presence of the statistics. I was surprised to see the same column appearing multiple times in multi-column statistics and was wondering whether the statistics would just return if I dropped them, in which case there is not much point in dropping them; I started doubting that one just needed one set of stats per column (as everyone was stating) – perhaps each index needed separate stats (which is what it almost looked like). Although I have been working with SQL for many years, I did not have to focus much on statistics details – so I did not realize that SQL Server had changed its position on this matter. So thanks for taking the time to document this!