Baselining SQL Server (Part 1)

If you are reading this, chances are you like SQL Server. You may even love SQL Server. You might be addicted to SQL Server. I’m here to enable your addiction. Now I can’t say mainlining SQL is good for you, but I can say baselining your SQL servers is.

There are several good blog posts on the benefits of generating baseline performance data for your SQL servers, so I won’t go into that here. And there is a great post by Feodor Georgiev here on how to collect performance data using the typeperf Windows utility. I’m going to expand a bit on his post because I decided that was the way I was going to create my baselines. Feodor’s post links to a couple of other articles about using PerfMon data in the first paragraph, one being Brent Ozar’s post on how to do this and use Excel to massage the data and create reports. With all due respect to Brent, bleh. Excel? Really? You know, since SQL 2005, SQL Server has shipped with this handy little piece called SQL Server Reporting Services. Why not use that? Well, probably because Brent’s method of collecting PerfMon data involved saving the data to a csv file. Feodor’s method, which is in turn based on Jonathan Allen’s method, uses typeperf instead of PerfMon itself to collect data. Like Brent, however, Jonathan’s method still output its data to a csv file. Feodor’s brilliant contribution was to realize that csv files are très 1990 and that typeperf can write its data directly into SQL Server. Ah! Now we’re getting somewhere! Couple that with SSRS and we’ve got the makings of a real monitoring tool!

Feodor did all of the hard work and included tons of screenshots in his post, so I recommend reading that to get started. I’m going to write about the issues I encountered when I tried to implement his plan.

  • Typeperf is like a stubborn child. Just like a bratty kid who says “I want ice cream” over and over at dinner time, typeperf seems to have one basic error message: “Error: no valid counters.” This message refers to the file of PerfMon counters you feed it. But it can mean there are no valid counters in the file (perhaps you mistyped a counter name), or, as was my case, that typeperf can’t even find the file, either because it doesn’t exist or because you mistyped the file name. Typically, you would expect a program to return a “file not found” error in that case, but not typeperf!
  • The ODBC driver matters. Feodor’s post used the SQL Server driver for his ODBC connection. I wanted to use the SQL Server Native Client 10 driver. Why? No real reason, other than the fact that it was newer. I couldn’t get typeperf to work using this, so I had to go back to the standard SQL Server driver.
  • Security matters. If you are planning on automating this process by running it as a SQL job (and you probably are), you need to make sure the account your SQL Agent runs under belongs to the Performance Monitor Users security group on all the servers you want to monitor. This will allow it access to the PerfMon counters.
  • Security still matters. I had some issues with not being able to see all the PerfMon counters. Specifically, sometimes I was not able to see the SQL counters. I could see them when using PerfMon as myself, but when I ran PerfMon as my SQL Agent service account, I could not. Adding the SQL Agent account to the local Administrators security group seemed to solve this. I will admit right now that this is a potential security issue; I have not fully investigated it and there may be some other solution.
  • Sharpen your DOS batch file programming skills. Feodor’s process calls typeperf from a batch file. This requires some mad DOS batch programming skilz. I forgot all my batch file commands 20 years ago, but luckily Google has plenty of examples. I’ll post my batch file and some of the hoops I had to jump through in just a minute.
  • Named-instance SQL Servers have different object name values. If you just have the default instance installed on your server, your counter object names will be of the form “SQL Server: <counter name>”. But named instances will have object names of the form “MSSQL$<instance name>: <counter name>”. This makes reporting a bit trickier.
  • Typeperf automatically creates tables in the database. And changes them too. When you first run typeperf using the SQL file type so that it writes to a SQL database, it will automagically create three tables. Those are described nicely in Feodor’s post. However, if you try to index the tables, you might be in for a surprise. In the Counter Details table, I wanted to index the MachineName, ObjectName, and CounterName fields. However, each of these columns is created with a varchar(1024) datatype, and the maximum index key length is 900 bytes. If you try to make an index with one or more of these fields, you will get a warning that the index key could exceed the 900-byte maximum and that inserts or updates with long values could fail.

OK. I know that none of my machine names, counter object names, or counter names are that long, so I went ahead and changed those fields to smaller values and made my indexes. But – typeperf will change those fields back to 1024 characters! Again, it’s quite the stubborn child. Now, as long as you are sure your indexed fields aren’t going to exceed 900 bytes, you’re OK ignoring this warning. But just be aware of this behavior.
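Since typeperf collapses both “missing counter file” and “bad counter names” into that one “no valid counters” message, it can save some head-scratching to sanity-check the counter file yourself before launching typeperf. Here’s a minimal sketch of the idea in Python (the function name and checks are mine, not anything typeperf provides):

```python
import os

def preflight_counter_file(path):
    # typeperf says "Error: no valid counters" whether the -cf file
    # is missing or merely full of typos, so distinguish the two
    # cases ourselves before ever calling typeperf.
    if not os.path.isfile(path):
        raise FileNotFoundError("Counter file not found: %s" % path)
    with open(path) as f:
        counters = [line.strip() for line in f if line.strip()]
    if not counters:
        raise ValueError("Counter file is empty: %s" % path)
    return counters
```

If this passes and typeperf still complains, at least you know the problem is a counter name rather than the file path.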

Now, on to the DOS batch file goodness.

If you notice, when writing to a SQL server, typeperf wants you to use “!<textstring>” after the DSN name. This is a tag that gets saved to the table so that you can label your collection set. When using the SQL file type, typeperf requires this string. (Again with the stubbornness.)

Now let’s think a bit about what we are trying to do here. I want to create an automated process that collects perfmon data periodically. I’m going to be calling this from a batch file, and I want to collect 4 data samples, 15 seconds apart, every 30 minutes. I’m going to report on how this data changes over time, so I’m going to need some way to group each collection set. The typeperf tables do include a date and time, but it changes with each sample collected, making it hard to group on. Ideally, I’d like to use this textstring for that purpose – use it to make my own datetime stamp which does not change for each of the 4 data collection samples, but does change with each execution every 30 minutes. That means I need some way to change it each time the batch file executes, which means using some variables and programming. And, as an added twist, DOS batch files use the exclamation point as a keyword for delayed environment variable expansion, so we need to escape it in our code.
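In other words, the tag acts as a run identifier. If it helps to see the grouping logic free of batch syntax, here’s the same idea as a quick Python sketch (purely illustrative; the real implementation is the batch file):

```python
from datetime import datetime

def collection_tag(now):
    # One tag per batch-file execution, e.g. "10072011_0850".
    # All four samples collected in that run share this tag; the
    # next run, 30 minutes later, gets a new one, so reports can
    # group each collection set on the tag.
    return now.strftime("%m%d%Y_%H%M")
```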

In order to create our own datetime stamp, we need to get the current date and time and parse that into variables. In batch files, that’s done using the FOR /f “tokens=…” command. You remember how to do that, right? Me neither. Google saves the day. Below is my batch file. I’ll explain it, line by line.

@echo off
echo ------------------- Starting -------------------
echo %date% %time%
rem @echo off
setlocal enableextensions
setlocal enabledelayedexpansion
for /f "tokens=1-5 delims=/ " %%s in ("%date%") do (set CurrDate=%%t%%u%%v)
for /f "tokens=1-5 delims=:. " %%d in ("%time%") do (set CurrTime=%%d%%e)
if %CurrTime% LSS 1000 SET CurrTime=0%CurrTime%

@echo on
TYPEPERF -f SQL -cf "D:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\JOBS\CountersList.txt" -si 15 -sc 4 -o SQL:SQLServerDS^^!%CurrDate%_%CurrTime%
@echo off
echo.
echo --- End ---
echo.

I capture the output of this job to a log file, and lines 1 through 4 are used just to make the logfile easier to read. Lines 5 and 6 are DOS commands needed to work the variable-parsing magic.

Line 7 takes the output of the DATE command and splits it into 4 pieces using “/” and ” ” (space) as delimiters. (This command, and the next one, are dependent on your local system settings for the format.) On my system, the DATE command returns the date in the format “Fri 10/07/2011”. Getting back to the command, the four pieces are parsed into the variables %%s through %%v. I grab %%t (the month), %%u (the day), and %%v (the year).

Line 8 does the same thing as line 7 except for the TIME command. Again, this will vary based on your local system settings. In my case, the time is formatted as HH:MM:SS.mmm. A couple of things to note here. First, if the hour is less than 10, the command returns a space plus a one digit hour. If you only use the colon as a delimiter, your hour variable will include a leading space. When all the variables are built into the final typeperf command line, this will result in the ! tag being something like !10072011 850. Anything after the space will be ignored by typeperf, so your tag will end up being just 10072011 and your reporting will be off because all data sets collected while the hour is only one digit will be grouped together. To overcome this, we need to also specify a space as a delimiter. And finally, I want my hour number to always be two digits, so I need to add a leading zero where needed. This is what line 9 does. (LSS is “less than” in batch file programming.)
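The same parse-and-pad logic, sketched in Python for clarity (a hypothetical helper, not part of the batch file):

```python
import re

def hhmm_from_time(time_str):
    # Split on ':', '.', and space and drop empty tokens, the same
    # way the batch FOR /f does with delims=":. ", then zero-pad
    # the hour so " 8:50:12.34" becomes "0850" rather than " 850".
    tokens = [t for t in re.split(r"[:. ]", time_str) if t]
    return tokens[0].zfill(2) + tokens[1]
```

Without the space in the delimiter list, the leading space would survive into the tag and typeperf would silently truncate everything after it.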

After all that setup is done, line 12 is where the actual typeperf command is built and executed. Notice we have to escape the exclamation point by using two carets (^^).

The typeperf flags are well covered in Feodor’s post, but I’ll run through them quickly here. The -f flag specifies the file type to use. A value of “SQL” indicates a SQL ODBC datasource. -cf indicates the text file of counters to collect. The format of this file is also well covered at Feodor’s blog. Be aware of the difference in counter names between SQL default and named instances that I mentioned earlier. The -si flag tells typeperf the time between collection cycles (in seconds) and the -sc flag tells it how many samples to collect. The -o flag is the output file. In the case of the SQL filetype, this is the name of the ODBC datasource to use to connect to the SQL server. In the above example, my datasource is named SQLServerDS.
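One consequence of that default-versus-named-instance naming difference: when reporting, you’ll probably want to strip the instance prefix from ObjectName so the same counter groups together across servers. A rough Python sketch of that normalization (the prefix patterns are assumptions based on the object name formats described above; verify them against your own data):

```python
import re

def normalize_object_name(object_name):
    # "SQL Server: Buffer Manager" (default instance) and
    # "MSSQL$SQL02: Buffer Manager" (named instance) both become
    # "Buffer Manager"; anything else passes through untouched.
    m = re.match(r"^(SQL ?Server|MSSQL\$[^:]+):\s*(.*)$", object_name)
    return m.group(2) if m else object_name
```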

Lines 13 through 16 are for my log file, to make things easier to read. One thing to note: if you actually run typeperf manually in a command window, you’ll see it displays a spinning text cursor while running. When you run it from a batch file and capture the output to a text file, this shows up as long sequences of “[|][/][-][\]”. To keep your logfile from growing like crazy, be sure to have each job execution overwrite the existing file instead of appending to it.

So that’s the nitty gritty of collecting the data. I simply scheduled a SQL Server job that executes this batch file every 30 minutes. I’m collecting 17 counters from about 40 different servers and it takes about 5 minutes to run.
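Collecting from many servers just means prefixing \\computername\ to each counter path in the -cf file, so generating that file for a list of servers is easy to script. A quick Python sketch (the server and counter names here are made up):

```python
def build_counter_file(servers, counters):
    # typeperf's -cf file accepts remote counters in the form
    # \\computername\Object\Counter, so expand every counter
    # across every server being monitored.
    lines = []
    for server in servers:
        for counter in counters:
            lines.append("\\\\" + server + counter)
    return "\n".join(lines) + "\n"
```

Just remember that the account running the batch file needs permission to read the counters on each of those remote machines.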

Next week, I’ll go over the SQL statements I use to pull data from the tables (again, based on Feodor’s code). I’ll also show you the nifty reports I made with SSRS that are guaranteed to impress your manager!


16 thoughts on “Baselining SQL Server (Part 1)”

    1. Ah, one more thing: please do NOT add the SQL Server Agent account as a local admin. Instead, use a proxy in your SQL Server and use a separate account which has ONLY enough permissions for collecting performance data.

  1. It is the easy work, actually. :) The hard work is to make the named instances and the default instances fit together (as you know, there are some machine counters which are common for the server, and then each instance has its independent counters as well). Also, the hard work really comes down to ‘how do we interpret the data’ and how we make sense of it in the timeslice context.

    It gets even more fun after this: how do we run profiler and typeperf and how do we correlate them.

    Ah, too many questions. :)

  2. Hi

    Thanks for the great blog. Quick question: I’m trying to implement this, but wondering how you would go about executing this on multiple servers?

    Thanks :)

  3. I’m not sure I understand the question. Do you want to run this on multiple servers or collect data from multiple servers? If you want to collect data from multiple servers, you store a list of servers and counters in a text file that is used by typeperf. See Feodor’s post for info on this. To collect from other computers, you just have to append \\computername\ before each counter and have your security set up so the account running the batch file has permission to read those counters. If you want to run this on multiple servers, just load the batch file on multiple servers and set a SQL job to execute it.

  4. Awesome that answers my question!! :) Was having a blonde moment… Thanks for the quick response much appreciated!!

    And thanks for the great Blog!!!

  5. Very powerful, easy to setup and requiring minimum changes to environment. Worked like a charm, many thanks Shaun & Feodor.

  6. I have been experimenting with typeperf. I’m very impressed with it. I wish there was more documentation for the application. I too noticed that it takes about 5 minutes to collect the counters from about 20 servers. However, once started, the sets are logged about every 15 seconds with si at 15. The CounterDateTime logged in one set spans 5 minutes. In steady state, the counters are collected every 15 seconds on average; however, the counters collected from single 15 second interval are spread over many RecordIndex sets. For example, the first counter in a set can be about five minutes before the set was logged. The last counter can be from very close to when the set was logged. Because it takes 5 minutes to reach steady state, it is best not to stop and start the collection very often.

    The lag is due to collecting so many counters from many different servers. If the lag is too long (want alerts sooner?), you can run one job per server. There is very little lag with the collection and logging from a single server.

    The table is not optimized for space. I am trying to move data to a new table for processing. Tracking what has been moved and deleted can be tricky. When running multiple instances, the RecordIndex values are specific to the instance of TypePerf. If you want to use the DisplayToID table to track these, then each instance will need its own GUID. This can be done by using a collection set name (the part after the SQL:dsn!) per instance.

    I’m still not sure what I will end up with.

  7. I have also noticed that running the same typeperf command several times does not produce additional logging. Running 4 typeperfs to log once a minute produces 1 set per minute. Not sure how it works under the covers. Perhaps only one instance can collect the set. After killing several instances, including the one doing the logging, I noticed a slight delay before another instance started logging the data. Documentation would have been very helpful. Somebody put a lot of work into typeperf, but we have to guess what it does.

  8. Well, I have deployed the typeperf scheduled task in our environment, which has more than 50 servers; on SQL Server we collect about 50~60 counters. But I sometimes experience “Communication link failure” errors, which fill up the event log very quickly (I run typeperf every 1 minute); sometimes it pops up a “Cannot alter CounterDetail table layout in SQL database.” error, and sometimes it shouts out “Duplicate item in primary key”. We are on the latest version of Windows and SQL Server, so it is quite frustrating. Just wondering if anyone has the same issue.

    I end up with developing my own program to collect performance counter values now, so far it looks good and I keep improving it for now.

    1. Do you run a new instance of typeperf every minute – or just make sure it’s running with that schedule? It takes about 5 minutes for my job to collect a full set with that many servers. There is a lot of overhead starting up that should be avoided by not restarting very often. Try using the si parameter to set how often the counters are collected, and don’t specify a sample limit. Start the job every minute, but only if it’s not already running. My typeperf has been running for about 2 months without stopping. Restarting does work fine – it picks up with the same counters.

      I am now trying to create one job per server (automatically via powershell scheduledtasks module and a servers config table). This will avoid the 5 minute delay between collecting the counters and logging the counters. Typeperf waits until a single set of counters from all servers has been collected before bulk inserting them to the database. The delay can be very large with a large number of servers, but it is very small with one server. Not sure if it’s worth the extra effort, but I want the option of “live” alerts.

      BTW, a single job per server in the SQL Agent is not an option because there is a subsystem thread limit that can be a pain to get around. Jobs will be queued and never run.

      1. Albert, another thing to consider when running identical typeperfs at the same time is that they will use the same GUID. I believe the GUID maps to the “DisplayString” in the DisplayToID table. This is the value specified after the DSN (e.g., -o “SQL:mydsn!mydisplaystring”). I did test this case and it seemed to work alright, but only one typeperf would actually do anything. Another typeperf would pick up the load only after I killed the instance actually logging. This was okay in my case because each collected all counters every 15 seconds and there was no limit on the number to collect. I only wanted one to be working. I expect that if I started and stopped these often during the test, I too would have seen contention. Perhaps I’d have to also set a collection count limit as well to see an issue.

        If you want to run one instance per minute, one after the other, to collect only one set and stop, then perhaps use a different display string and GUID for each? A timestamp or cycled number of some sort would work. The result might be that 5-10 instances are running at the same time. Every minute a new typeperf starts. Also, every minute the oldest typeperf finishes collecting from the last server, logs, and exits. On average in steady state, all counters are logged once a minute, but the counter time from the first server in the list might be many minutes prior to the log time. If a different GUID is not used for each, then they will all be using the same DisplayToID record. That could be the source of contention. If using a timestamp, the DisplayToID table will get loaded with many records. Something to clean up along with old counter data.

        I still think it’s better to just let it run.

  9. Hi,
    I’m facing a problem with typeperf; maybe somebody could help. MSSQL counters do not work for me from the CLI, but they work in perfmon. I can find them, but I get “Error: No valid counters” when I try to access them:

    PS C:\Program Files\zabbix_agent> typeperf -q | find /I "buffer manager"
    \MSSQL$SQL02:Buffer Manager\Buffer cache hit ratio
    \MSSQL$SQL02:Buffer Manager\Page lookups/sec
    \MSSQL$SQL02:Buffer Manager\Free list stalls/sec
    \MSSQL$SQL02:Buffer Manager\Free pages

    PS C:\Program Files\zabbix_agent> typeperf "\MSSQL$SQL02:Buffer Manager\Buffer cache hit ratio"
    Error: No valid counters.

    Thank you so much for your help.

  10. Ivo – sorry for taking so long to get your comment posted. It was stuck in my spam folder.

    That error “No valid counters” seems to be the only error typeperf returns, so it can mean a multitude of things. Look at my bullet points in the post above and see if any of them are the cause, particularly the first one.
