Last week I wrote about some bad programming I discovered in Solarwind’s Orion monitoring software. One of their maintenance utilities reindexed the tables in the database, then performed a database shrink. This immediately creates about 100% index fragmentation. I don’t mean to pick on Orion, but I have been doing some more digging and discovered another problem in one of their maintenance routines. While the previous problem I found might not technically be considered a bug, this one definitely is.
Orion has a module called Netflow Traffic Analyzer that stores historical traffic patterns for your IP addresses. There is a utility that allows you to delete what they call “expired data flows,” which are data about IP address that are no longer in use. In our case, the utility reported over 800,000 expired data flows. So I ran their utility to delete them. One of the configuration options allows you to set a time limit for the length of processing, so I expected the command to take a while. Instead, when I ran it, it returned almost immediately. I started the utility again and it still reported over 800,000 expired data flows. Something was obviously not getting deleted.
So I fired up SQL Profiler to find out what exactly the utility was doing. I discovered what stored procedure it was calling to do the deletes and I tried manually running the procedure myself via SSMS. It returned a primary key violation error. So Orion’s bug is not having their client program check the return status of the command from SQL and just assuming everything worked. I searched the company’s Knowledge Base for anything related to a problem like this, but nothing turned up. I decided to dig a little deeper myself.
I investigated the stored procedure that was called and discovered it creates a temporary table to store IP addresses and makes a primary key of the IP address column. But some IPs had multiple records, so duplicate primary key violations were occurring when it tried to insert data into this temp table. I looked at the rest of the procedure to understand what it was doing and it looked like I could fix the error by simply including a DISTINCT clause in the IP address SELECT statement. I put this in and the routine then ran and correctly deleted our expired data.
If anyone else encounters the same problem, the stored procedure was called swps_RemoveExpiredFlows and the DISTINCT clause was added to the command stored in the @Script variable. In all fairness, we are not running the latest version of the Orion software, so this, and the issue I wrote about last week, might be fixed now. But these bugs exist in the version we are running: Netflow Performance Monitor version 9.5 SP5 and Netflow Traffic Analyzer version 3.5 SP1. We are still under a maintenance agreement, so we are looking into upgrading to the latest version. Unfortunately, I am unable to find a comprehensive list of bug fixes for each version, so I don’t know if these issues have been resolved or not.
So was this process part of a DBA’s job? Probably not. Since we have a maintenance agreement, I probably should have just called their tech support and let them troubleshoot it. But I am willing to bet it would have taken a lot longer and they probably wouldn’t have found the underlying cause of records not getting deleted. They probably would have come back and told me to upgrade to the latest version and hope that fixed the problem, which it might or might not. By investigating this myself, I saved a bunch of time and found the real source of the problem and fixed it. I think you can be a much more valuable DBA if you are willing to roll up your sleeves and dive into this type of problem solving.