It’s a SQL Server best practice to have Auto Create Statistics enabled on your databases. When this setting is on, SQL Server creates single-column statistics when it needs them. However, if someone later adds an index, that auto-created statistic might no longer be needed.
The Potential Problem
When an index is created, SQL Server creates a statistic on the key columns of that index. If the first column of that index is a column SQL Server had previously auto-created a statistic on, you’ve now got duplicate statistics on that column. Is this bad? Maybe.
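If you want a quick inventory of the auto-created statistics in a database before worrying about duplicates, a simple catalog query works. This is a minimal sketch; auto_created = 1 flags the statistics SQL Server created on its own.

```sql
-- List auto-created, single-column statistics and the column each covers.
SELECT  OBJECT_NAME(s.[object_id]) AS TableName ,
        s.name AS StatName ,
        COL_NAME(sc.[object_id], sc.column_id) AS ColName
FROM    sys.stats s
        JOIN sys.stats_columns sc ON s.[object_id] = sc.[object_id]
                                     AND s.stats_id = sc.stats_id
WHERE   s.auto_created = 1
ORDER BY TableName ,
        StatName;
```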
A duplicate statistic can impact storage space and performance. Statistics are relatively small objects in a database – they contain a maximum of 200 rows in the histogram and a very small amount of data in the density vector and header. The storage impact is therefore small, but not zero.
The biggest impact of duplicate statistics is decreased performance. We all know too many indexes can slow down inserts, updates, and deletes because SQL Server has to maintain every index each time data is modified. The same situation exists with statistics: the more there are, the more SQL Server has to update as data in the table changes. It’s not quite as bad as with indexes, because data modifications have to reach a certain threshold before a statistic is updated, whereas every single modification of an indexed column requires an index update. Nevertheless, duplicate statistics can make SQL Server do more work than it needs to.
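Where supported (the sys.dm_db_stats_properties DMV requires SQL 2008 R2 SP2 / SQL 2012 SP1 or later), you can watch this maintenance cost accumulate – its modification_counter column tracks changes to the leading statistics column since the statistic was last updated. A minimal sketch, where 'dbo.YourTable' is a placeholder table name:

```sql
-- Show how many modifications each statistic on a table has accumulated
-- since its last update. Duplicate statistics accrue this cost twice.
SELECT  s.name AS StatName ,
        sp.last_updated ,
        sp.rows ,
        sp.modification_counter
FROM    sys.stats s
        CROSS APPLY sys.dm_db_stats_properties(s.[object_id], s.stats_id) sp
WHERE   s.[object_id] = OBJECT_ID('dbo.YourTable');
```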
But that’s not the only way performance can be affected. As Erin Stellato points out in the comments to her blog post, SQL Server may pick a column statistic over an index statistic (or vice versa) when generating a query plan. If the one it picks is out of date, a less-than-optimal query plan might be generated.
So it may be worth checking your databases for duplicate statistics, especially if you have a highly active database that you have recently added indexes to.
Being the good DBA that I am, I decided to do this check on my SQL Servers. I found a handy script at https://mattsql.wordpress.com/2013/08/28/duplicate-statistics/. The code looked ok, so I ran it against one of my smaller databases and removed the statistics it said were duplicates. I re-ran the query and, sure enough, it returned no rows, meaning it found no more duplicates. Then I saw something shiny and got distracted.
When my focus returned to SQL, I ran the query again – accidentally against the same database I had run it against earlier. Strangely, this time it reported some duplicate statistics. Hmm.
That shouldn’t have happened. If SQL auto-created statistics again, it meant that it needed them – which meant that the stats on the indexes I thought were duplicated were not actually duplicates.
I puzzled over this for a bit and got sidetracked by the strange way SSMS displays statistics columns on the Property page. Then it got to be the end of the day and I went home. The next day, I got a comment on my previous post from Aaron Bertrand, who mentioned there is a related bug with the stats_column_id column of the sys.stats_columns view: it does not contain what the MSDN documentation says it contains. The Connect item for this, along with a workaround, can be found here.
The script I was using did not reference that column, but it did get me thinking that perhaps the script was not correctly identifying the first column in an index.
What The Script Does
In looking at the script, I could see it looks for any statistics created for indexes. Through a join to the sys.stats table, the script selects any auto-created statistics on the same column that is the first column in an index. The key line is in the WHERE clause of the CTE:
and ic.index_column_id = 1
You would think that would select the first column in the index. That’s even what you might expect from reading the MSDN page for the view. Except that’s not always the case. To find the first key column in an index, you have to select the row where key_ordinal = 1, not index_column_id = 1.
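In other words, the corrected predicate for the script’s CTE would be:

```sql
-- key_ordinal reflects the column's position in the index definition;
-- index_column_id does not.
and ic.key_ordinal = 1
```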
A Short Example To Illustrate This
Let’s go to a database we can mess around in and see if we can prove this. First, let’s create a table and a clustered index. (I chose this structure because it is the same as the table that caused me to see this problem in the first place.)
CREATE TABLE [dbo].[IndexStatTest]
    (
      [ID] [INT] IDENTITY(1, 1) NOT NULL ,
      [TypeLID] [INT] NOT NULL ,
      [Hidden] [TINYINT] NOT NULL ,
      [ObjectID] [INT] NULL ,
      [ObjectTypeLID] [INT] NULL ,
      [TMGUID] [CHAR](32) NOT NULL ,
      [LMG] [CHAR](32) NOT NULL ,
      [LMD] [DATETIME] NOT NULL ,
      [LMU] [NVARCHAR](100) NOT NULL ,
      [IsVisibleInCM] [TINYINT] NOT NULL ,
      CONSTRAINT [PK_IndexStatTest_ID] PRIMARY KEY NONCLUSTERED ( [ID] ASC ) ON [PRIMARY]
    )
ON  [PRIMARY];
GO

CREATE UNIQUE CLUSTERED INDEX [UK_IndexStatTest_02] ON [dbo].[IndexStatTest]
    (
      [ObjectID] ASC,
      [ObjectTypeLID] ASC,
      [TypeLID] ASC
    )
ON  [PRIMARY];
GO
Clearly, we defined the index with ObjectID as the first column. Now, let’s run a query that joins the sys.index_columns view with sys.indexes and sys.objects (so we can see some names instead of just object IDs) and see what we get.
SELECT  ic.object_id ,
        ic.index_id ,
        ic.index_column_id ,
        ic.column_id ,
        ic.key_ordinal ,
        COL_NAME(o.object_id, ic.column_id) AS ColName ,
        i.name AS IndexName ,
        OBJECT_NAME(ic.object_id) AS TableName
FROM    sys.indexes i
        JOIN sys.objects o ON o.object_id = i.object_id
        JOIN sys.index_columns ic ON ic.index_id = i.index_id
                                     AND ic.object_id = i.object_id
WHERE   i.name = 'UK_IndexStatTest_02';
Recall the column order we defined for our index: ObjectID, ObjectTypeLID, and TypeLID. We can see that index_column_id does not number the columns in that order. It is the numbering in the key_ordinal column that indicates the order the columns are defined in the index.
This explains the behavior I was seeing. Because the script I used incorrectly identified the initial column the index statistic was created on, it incorrectly identified an auto-created statistic as a duplicate. Luckily, because I have auto create statistics enabled, I didn’t cause much chaos when I deleted the incorrect statistic – SQL Server just made a new one when it needed it.
One More Issue
In testing, I also ran across one more issue – namely, auto-created statistics that have more than one column. I know this is supposedly not possible, but I discovered a weird case where it can happen. I wrote about it in some detail last week, so check that out if you are interested. The bottom line is that this situation might occur if you are working with a database that was created in SQL 2000 (or possibly earlier).
What this meant for my script is that, due to the join on the sys.stats_columns table, some statistics might be incorrectly flagged as being duplicated when they really aren’t. Again, this would only happen in the strange case where you were working with an old database.
In deciding how to handle this, I came to the conclusion that I would just let those stats be selected for deletion anyway. Because auto-create statistics is enabled in the database, if any deleted stat is needed, SQL will go ahead and re-create it. (If auto-create statistics is not enabled in the database, there would be no reason to run this script in the first place.)
Nevertheless, if you are concerned about this – perhaps because performance in your environment is highly dependent on statistics and the overhead of auto-creating them in production could cause issues – review the script output before running it. Any suspect statistics will be named _WA_Sys_<column name>_<hex ID of table> instead of the newer format _WA_Sys_<hex ID of column>_<hex ID of table>. Don’t simply run the commands for statistics named with the column name instead of the hex ID. Instead, run those commands during an outage window and then manually re-create the statistics, taking care to identify which column each stat is for and which columns were the included clustering keys. This will re-create those stats in the newer single-column format.
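One way to spot the old-format names is to look for auto-created statistics whose segment after the prefix is not an eight-digit hex column ID. This is a rough sketch: the LIKE patterns assume the standard _WA_Sys_ naming convention and may need tweaking for your environment.

```sql
-- Flag auto-created stats whose name segment after '_WA_Sys_' is not
-- an 8-digit hex column ID (i.e., the old column-name-based format).
SELECT  OBJECT_NAME(s.[object_id]) AS TableName ,
        s.name AS StatName
FROM    sys.stats s
WHERE   s.auto_created = 1
        AND s.name LIKE '[_]WA[_]Sys[_]%'
        AND s.name NOT LIKE '[_]WA[_]Sys[_][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][_]%';
```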
The Corrected And Improved Script
I took Matt’s script as a starting point and made the correction to the CTE. Because I wanted to run it against a bunch of databases, I also modified it to cycle through all the databases on the SQL Server instance and to output the commands to drop the duplicate statistics.
The script below will cycle through all user databases on a server, looking for auto-created statistics on a column that is also the first column in an index on the same table. It ignores disabled and filtered indexes and outputs a list of commands that drop the duplicate statistics. You can then take this output and run it against your server to remove those statistics. The script itself makes no changes to any database. If you want to see the details of which statistic is duplicated by which index, simply uncomment the second-to-last SELECT statement in the script.
I suspect this script may not work with partitioned indexes – I did not test it against those.
Also note that this script only finds duplicates between stats that were auto-created and those from an index. It will not find duplicates between two column statistics. For example, if SQL auto-created a stat on a column and someone later manually created one on the same column, this script would not find it.
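If you also want to catch that case, one idea is to self-join single-column statistics on the same column, pairing an auto-created stat with a user-created one. This is a rough sketch, not part of the main script, and it has not been tested against the same range of versions:

```sql
-- Pair auto-created single-column stats with user-created stats on the
-- same column of the same table.
SELECT  OBJECT_NAME(s1.[object_id]) AS TableName ,
        COL_NAME(sc1.[object_id], sc1.column_id) AS ColName ,
        s1.name AS AutoCreatedStat ,
        s2.name AS UserCreatedStat
FROM    sys.stats s1
        JOIN sys.stats_columns sc1 ON s1.[object_id] = sc1.[object_id]
                                      AND s1.stats_id = sc1.stats_id
        JOIN sys.stats_columns sc2 ON sc1.[object_id] = sc2.[object_id]
                                      AND sc1.column_id = sc2.column_id
        JOIN sys.stats s2 ON sc2.[object_id] = s2.[object_id]
                             AND sc2.stats_id = s2.stats_id
WHERE   s1.auto_created = 1
        AND s2.auto_created = 0
        AND s2.user_created = 1;
```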
This script has been tested against SQL Server 2008 R2, 2012, SQL 2014, and SQL 2016. It does NOT work on SQL 2005 and earlier.
/*
   Script to find auto-created statistics that are no longer needed due to
   an index existing with the first column the same as the column the
   statistic is on.

   Based on script from mattsql.wordpress.com/2013/08/28/duplicate-statistics/

   The script will loop through all the user databases on a server and
   output a list of commands to drop the duplicate statistics. For more
   details on the duplicates, uncomment the second-to-last SELECT statement.

   Disabled indexes are ignored.
   Filtered indexes are ignored.

   This script makes no changes to any databases.

   ShaunJStuart.com

   8-8-16   Initial release
*/

DECLARE @SQLCmd VARCHAR(MAX);
DECLARE @LoopCounter INT;
DECLARE @WorkingDB sysname;

IF OBJECT_ID('tempdb..##UnneededStats') IS NOT NULL
    DROP TABLE ##UnneededStats;

CREATE TABLE ##UnneededStats
    (
      [ID] INT ,
      [Schema] sysname NOT NULL ,
      [Table] sysname NOT NULL ,
      [Column] sysname NOT NULL ,
      [AutoCreatedStatistic] sysname NOT NULL ,
      [Index] sysname NOT NULL ,
      [DropStatCommand] VARCHAR(4000) NOT NULL
    );

DECLARE @DatabasesTable TABLE ( DBName sysname );

INSERT  INTO @DatabasesTable
        SELECT  name
        FROM    sys.databases
        WHERE   source_database_id IS NULL  /* no database snapshots */
                AND is_read_only = 0        /* no read-only dbs (log shipping) */
                AND state = 0               /* online databases */
                AND name NOT IN ( 'master', 'model', 'msdb', 'tempdb' )
                                            /* non-system databases only */
        ORDER BY name;

SELECT  @LoopCounter = COUNT(*)
FROM    @DatabasesTable;

WHILE @LoopCounter > 0
    BEGIN
        SELECT TOP 1
                @WorkingDB = DBName
        FROM    @DatabasesTable;

        SET @SQLCmd = '
USE [' + @WorkingDB + '];

WITH    stats_on_indexes ( [object_id], [table_column_id], [index_name] )
          AS ( SELECT   o.[object_id] AS [object_id] ,
                        ic.[column_id] AS [table_column_id] ,
                        i.name
               FROM     sys.indexes i
                        JOIN sys.objects o ON i.[object_id] = o.[object_id]
                        JOIN sys.stats st ON i.[object_id] = st.[object_id]
                                             AND i.name = st.name
                        JOIN sys.index_columns ic ON i.index_id = ic.index_id
                                                     AND i.[object_id] = ic.[object_id]
               WHERE    o.is_ms_shipped = 0
                        AND i.has_filter = 0
                        AND ic.key_ordinal = 1
                        AND i.is_disabled = 0
             )
    INSERT  INTO ##UnneededStats
            ( [ID] ,
              [Schema] ,
              [Table] ,
              [Column] ,
              [AutoCreatedStatistic] ,
              [Index] ,
              [DropStatCommand]
            )
            SELECT  o.[object_id] AS [ID] ,
                    sch.name AS [Schema] ,
                    o.name AS [Table] ,
                    c.name AS [Column] ,
                    s.name AS [AutoCreatedStatistic] ,
                    stats_on_indexes.index_name AS [Index] ,
                    ''USE [' + @WorkingDB + ']; DROP STATISTICS ['' + sch.name + ''].['' + o.name + ''].['' + s.name + '']'' AS DropStatCommand
            FROM    sys.stats s
                    JOIN sys.stats_columns sc ON s.stats_id = sc.stats_id
                                                 AND s.[object_id] = sc.[object_id]
                    JOIN sys.objects o ON sc.[object_id] = o.[object_id]
                    JOIN sys.schemas sch ON o.schema_id = sch.schema_id
                    JOIN sys.columns c ON sc.[object_id] = c.[object_id]
                                          AND sc.column_id = c.column_id
                    JOIN stats_on_indexes ON o.[object_id] = stats_on_indexes.[object_id]
                                             AND stats_on_indexes.table_column_id = c.column_id
            WHERE   s.auto_created = 1
                    AND s.has_filter = 0;';

        EXEC(@SQLCmd);

        DELETE  FROM @DatabasesTable
        WHERE   DBName = @WorkingDB;

        SET @LoopCounter -= 1;
    END;

/* SELECT * FROM ##UnneededStats; */

SELECT DISTINCT
        DropStatCommand
FROM    ##UnneededStats;
/* DISTINCT is needed because sometimes more than one index will start with
   the same column, resulting in two DROP statements being generated for the
   same statistic, which would cause an error when executed. */

DROP TABLE ##UnneededStats;
4 thoughts on “Removing Duplicate Statistics”
WHERE s.auto_created = 1
AND sc.stats_column_id = 1
AND s.has_filter = 0;
Stats only done on first column of statistic.
Same as the index ordinal=1.
This may not work. See the paragraph in the post (“That’s strange…”) regarding the bug with this column that Aaron Bertrand found. Unfortunately, the Connect item is no longer available due to Connect being retired, but if I recall, there was an issue where stats_column_id = 1 was not necessarily the first column in the index.
Nice one family.
1) When creating the temporary table ##UnneededStats, I think you want to use nvarchar.
1a) Currently, it reads [DropStatCommand] VARCHAR(4000)
1b) Please use [DropStatCommand] NVARCHAR(4000)
This will help you with unicode characters.
2) Also, I think a local temporary table will suffice
2a) Currently, Global Temporary Table ( ##UnneededStats)
2b) Localized version #UnneededStats
A local temporary table can be used because the call to EXEC runs in the same session.
The result set is therefore available to insert into the local temporary table.