Tuesday, April 26, 2011

Don't Fear the Trace

Not a week goes by that I don't run across either a comparison of server-side (file) and Profiler (rowset) trace performance, with rowset ending up with the short end of the stick; and/or someone who's afraid (or not allowed) to run traces against a production server because it will undoubtedly kill SQL Server performance. The amount of misinformation out there on this topic is mind-boggling – IMO it's one of the most misrepresented aspects of SQL Server. Unfortunately, it seems many have taken the misinformation at face value, and have perpetuated it without performing sufficient diligence.

There are a some key points that always should be, but rarely seem to be, considered when discussing the performance impact of tracing:

  • Disk speed matters for server-side traces. File traces are "lossless", so can and will block SQL Server activity if the disk where the trace file is being written is not keeping up. A nice overview by Adam Machanic (b|t) is here: http://technet.microsoft.com/en-us/library/cc293610.aspx. This means that if you are writing too many events to a busy or otherwise slow disk, it can bring SQL Server to a screeching halt. This is easy to repro – try writing a heavy file trace to a USB thumb drive. This may sound extreme, but it's probably not far off many SQL Server disk systems I see where busy data and log files are mixed on the same set of spindles, causing severe contention.
  • Rowset traces will drop events to avoid stopping SQL Server activity. The longest a rowset trace will wait before dropping events is 20 seconds. This is less than the commonly used application query timeout of 30 seconds. This can be a bit harder to repro with Profiler, but we've done it in our lab by creating a .NET-based rowset trace consumer with artificial delays in code to slow trace event consumption. In this sense, rowset traces can be considered "safer" than file traces, since although they can slow SQL Server, they won't stop it completely.
  • Filters matter, for both trace types. Although the above points are important to understand, they only come into play in extreme scenarios that most of us should never see. You should almost never need to run an unfiltered trace, or use "Starting" or other events that aren't filterable. For example, if you use only RPC:Completed and SQL:BatchCompleted events and restrict the rows returned using selective filter thresholds against integer-based columns like Duration, CPU, etc., and avoid text filters, especially with wildcards, you can minimize the impact to performance of either trace type.
  • Profiler <> Rowset. Profiler uses a rowset trace, and so frequently when Profiler is knocked, rowset is condemned by extension. They are not one and the same. SQL Sentry Performance Advisor (and some other tools) also use a rowset trace. We stream the trace binary data back and decode it, as described at the bottom of this page: http://technet.microsoft.com/en-us/library/cc293613.aspx. While Profiler has a relatively heavy interface for working with trace data, our trace server, a Windows service, does not. It uses a highly optimized multi-threaded scheme for consuming trace rows as quickly as possible, and efficiently storing the trace data for later retrieval. So although the network overhead may be roughly equivalent for the same trace run by Profiler (remotely) or our trace server, that's where any similarity ends. If the trace uses smart filtering, the overhead should be a non-issue anyway – by default we use a conservative filter of Duration>=5000ms, with a floor of 100ms for safety if no other CPU or I/O filters have been applied.

A Simple Test

We recently conducted some cursory tests, primarily to confirm the blocking/non-blocking behavior described above. We simulated high activity via a lightweight T-SQL statement in a 1 million cycle loop, and used a trace with only "Completed" events and no filters. The goal was to generate 1 million trace events as quickly as possible, and assess the impact consuming those events by each trace type had on the total load run time. I want to emphasize that this is an isolated test scenario, and the results should not be over-generalized. We plan on conducting much more comprehensive testing in the near future. Nonetheless, the results are interesting:

  • Both server-side and rowset traces always caused the load to run slower than when there were no traces.
  • A server-side trace writing to slow local disk caused the load to run much slower than a rowset trace consumed locally.
  • A server-side trace writing to a fast local disk caused the load to run slightly faster than a rowset trace being consumed remotely over a 1000Mbps network.
  • After factoring in the additional overhead required to copy the trace file over the network and load it into Profiler, the overall time for the server-side trace from the last test was greater.

It seems that with remote tracing, you can pay me now with rowset, or pay me later with server-side. Although the results were repeatable, I'm not publishing test details at this point because these tests were not strictly controlled. The point here is that trace impact on performance is always an "it depends" scenario. Unfortunately it rarely seems to be presented as such – blanket statements such as "Profiler is dangerous" and "server-side traces are always faster" are commonplace. Sure, consumer type and location, network speed, etc., can certainly affect the performance impact of rowset traces, but so can writing a server-side trace file to the same spindles as a busy transaction log... and the impact can potentially be much more detrimental to your SQL Server.

The Source

Many thousands of SQL Servers are monitored 24/7 by our rowset-based trace server and we haven't had a single report where its impact on the target was even noticeable, so we're always puzzled by the anti-rowset sentiments. I've tried to assess how the confusion and fear started, and much of it seems to have a common source: the "Standard" Profiler trace:

std_trace_props

It's been with us forever, and I'm not sure who created it as the default trace... but best case it is of little to no value, and worst case it is dangerous to your SQL Server. Presenting this trace as the default to a SQL Server newcomer is analogous to putting a loaded weapon in front of a toddler. There are no filters applied, and it includes Audit Login and SQL:BatchStarting events, neither of which are filterable by any performance metrics (Duration, CPU, Reads, Writes), meaning you will be capturing all of them. It also includes Audit Logout, which is effectively not filterable by Reads or Writes when connection pooling is in use, since those metrics are cumulative. On a busy OLTP system this trace is bad, very bad!

std_trace_filters

So unless you've changed the default trace (easy to do via Edit Template), every time you open Profiler you've got a very high overhead trace as the starting point. As a result, this trace is referenced in books, trace performance benchmarks have been conducted and published using it, and countless users have naively run it as-is against busy servers. I'm convinced that this is a big reason that Profiler is so frequently kicked around, and why so many are now afraid to use it at all.

Bottom line: Don't use this trace! Certainly don't use it alone for benchmarking the relative performance impact of the two trace types.

A true comparative test should consider:

  • Events and columns used
  • Trace consumer location (local or remote)
  • Disk location (local or remote)
  • Disk speed and type (shared or dedicated)
  • Network speed
  • Load type (T-SQL text size, DML mix, etc.)
  • Transactions/Batches per second on the target
  • CPU and memory resources available
  • Filter types
  • Filter thresholds
  • ...

As I mentioned, we have more comprehensive testing coming which covers these factors, and we'll publish all of the details, with hopes of clearing up a lot of confusion and fear around tracing once and for all.

In the meantime, don't fear the trace. If you target the specific events and columns you need, and use conservative integer-based filters to restrict the rows returned, the benefits gained through intelligent analysis of the data should far outweigh any performance impact on your SQL Servers.

No comments:

Post a Comment