I came back into the office after New Year’s to find roughly 275,000 email messages in my inbox. Yup. Over a quarter of a million emails. It took over two hours to download them all and even longer to delete them. Obviously, something was going on. (Normally, I would have seen these alerts on my work cell phone over the weekend and started investigating sooner. Unfortunately, my work cell phone number had been posted in a Craigslist ad and I was constantly getting calls about a car I wasn’t selling, so I turned my phone off until Craigslist took down the ad. As a result, I didn’t see the issue until I was back in the office.)
The emails were alerts for severity 17 errors – out of disk space. I checked the server and noticed that one database had increased its growth rate slightly, but nothing major. Log backups were still happening, so the space issue wasn’t caused by out-of-control log file growth. In short, it didn’t seem like my databases were the things eating the disk space.
Further investigation showed my culprit to be the SQL error log, which is on the same drive as my database files. The error log had grown to 15 GB! I have a job that rolls the log to a new file on the first of each month, so the current error log wasn’t huge yet and I was able to open it in a reasonable amount of time. I saw the problem right away:
There were tons of these errors. Forty-five errors each second. For days. I traced the problem back to a machine that was trying to access a database that we had moved to a new server a couple of weeks earlier. I contacted the person in charge of that program and had them change the configuration to point to the new machine, and the errors went away. Because my error logs had been rolled, I was able to delete the 15 GB file and free up a ton of disk space, and my severity 17 errors went away too.
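I found the offending machine by reading the log directly. If the file had been too big to open, a small script could have done the same job. The following is only a rough Python sketch, not what I actually ran. It assumes failed-login lines contain text like "Login failed for user '...'" followed by a "[CLIENT: ...]" tag, and that the log file is stored as UTF-16; both are assumptions to check against your own logs. The function names are made up for this example.

```python
import re
from collections import Counter

# Assumed shape of a failed-login line in the error log, for example:
#   <timestamp> Logon  Login failed for user 'some_user'. Reason: ... [CLIENT: 10.1.2.3]
# Verify this against your own log before trusting the counts.
FAILED_LOGIN = re.compile(
    r"Login failed for user '(?P<user>[^']*)'.*\[CLIENT: (?P<client>[^\]]+)\]"
)


def count_failed_logins(lines):
    """Count failed-login entries per (client, user) pair."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[(match.group("client"), match.group("user"))] += 1
    return counts


def summarize_error_log(path, encoding="utf-16"):
    """Stream an error log file and return its failed-login counts.

    The utf-16 default is an assumption; change it if your log files
    are stored in a different encoding.
    """
    with open(path, encoding=encoding, errors="replace") as handle:
        # Iterating the handle reads one line at a time, so even a
        # multi-gigabyte log is never loaded into memory all at once.
        return count_failed_logins(handle)
```

Calling `summarize_error_log(path).most_common(5)` on a rolled log file would list the five busiest client and login pairs, which should point straight at a misconfigured machine like the one I had.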
I’m not sure why this wasn’t picked up when we moved the database initially. I do have alerts set up for severity 20 errors, which is what gets raised when a login fails using Windows Authentication (“SSPI handshake failed”). But a failed SQL login is a severity 14 error, so my alerts never fired for this. Apparently the machine that was trying to log in was never checked after the database move. It would appear it is not performing a critical function :-)