Tuesday, April 26, 2011

Don't Fear the Trace

Not a week goes by that I don't run across a comparison of server-side (file) and Profiler (rowset) trace performance with rowset ending up with the short end of the stick, or someone who's afraid (or not allowed) to run traces against a production server because it will supposedly kill SQL Server performance. The amount of misinformation out there on this topic is mind-boggling – IMO it's one of the most misrepresented aspects of SQL Server. Unfortunately, it seems many have taken the misinformation at face value and perpetuated it without performing sufficient due diligence.

There are some key points that should always be, but rarely seem to be, considered when discussing the performance impact of tracing:

  • Disk speed matters for server-side traces. File traces are "lossless", so can and will block SQL Server activity if the disk where the trace file is being written is not keeping up. A nice overview by Adam Machanic (b|t) is here: http://technet.microsoft.com/en-us/library/cc293610.aspx. This means that if you are writing too many events to a busy or otherwise slow disk, it can bring SQL Server to a screeching halt. This is easy to repro – try writing a heavy file trace to a USB thumb drive. This may sound extreme, but it's probably not far off many SQL Server disk systems I see where busy data and log files are mixed on the same set of spindles, causing severe contention.
  • Rowset traces will drop events to avoid stopping SQL Server activity. The longest a rowset trace will wait before dropping events is 20 seconds. This is less than the commonly used application query timeout of 30 seconds. This can be a bit harder to repro with Profiler, but we've done it in our lab by creating a .NET-based rowset trace consumer with artificial delays in code to slow trace event consumption. In this sense, rowset traces can be considered "safer" than file traces, since although they can slow SQL Server, they won't stop it completely.
  • Filters matter, for both trace types. Although the above points are important to understand, they only come into play in extreme scenarios that most of us should never see. You should almost never need to run an unfiltered trace, or use "Starting" or other events that aren't filterable. For example, if you use only RPC:Completed and SQL:BatchCompleted events, restrict the rows returned using selective filter thresholds against integer-based columns like Duration, CPU, etc., and avoid text filters (especially with wildcards), you can minimize the performance impact of either trace type – see the sketch after this list.
  • Profiler <> Rowset. Profiler uses a rowset trace, and so frequently when Profiler is knocked, rowset is condemned by extension. They are not one and the same. SQL Sentry Performance Advisor (and some other tools) also use a rowset trace. We stream the trace binary data back and decode it, as described at the bottom of this page: http://technet.microsoft.com/en-us/library/cc293613.aspx. While Profiler has a relatively heavy interface for working with trace data, our trace server, a Windows service, does not. It uses a highly optimized multi-threaded scheme for consuming trace rows as quickly as possible, and efficiently storing the trace data for later retrieval. So although the network overhead may be roughly equivalent for the same trace run by Profiler (remotely) or our trace server, that's where any similarity ends. If the trace uses smart filtering, the overhead should be a non-issue anyway – by default we use a conservative filter of Duration>=5000ms, with a floor of 100ms for safety if no other CPU or I/O filters have been applied.
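
For reference, here's a minimal sketch of the kind of selectively filtered server-side trace described in the filters bullet above. The file path, event/column choices, and the 5-second Duration threshold are illustrative only – adjust them for your environment, and note that on SQL Server 2005 and later the Duration filter value is specified in microseconds.

-- Minimal sketch of a selectively filtered server-side (file) trace.
-- Path, events, columns, and thresholds are illustrative – adjust as needed.
DECLARE @TraceID int, @maxfilesize bigint, @on bit, @duration bigint;
SET @maxfilesize = 512;      -- MB per rollover file
SET @on = 1;
SET @duration = 5000000;     -- Duration >= 5 seconds (value is in microseconds on 2005+)

-- Create the trace (option 2 = TRACE_FILE_ROLLOVER); .trc is appended automatically.
EXEC sp_trace_create @TraceID OUTPUT, 2, N'D:\Traces\filtered_demo', @maxfilesize, NULL;

-- RPC:Completed (10) and SQL:BatchCompleted (12) only, with a few useful columns:
-- 1 = TextData, 12 = SPID, 13 = Duration, 16 = Reads, 17 = Writes, 18 = CPU.
EXEC sp_trace_setevent @TraceID, 10, 1,  @on;
EXEC sp_trace_setevent @TraceID, 10, 12, @on;
EXEC sp_trace_setevent @TraceID, 10, 13, @on;
EXEC sp_trace_setevent @TraceID, 10, 16, @on;
EXEC sp_trace_setevent @TraceID, 10, 17, @on;
EXEC sp_trace_setevent @TraceID, 10, 18, @on;
EXEC sp_trace_setevent @TraceID, 12, 1,  @on;
EXEC sp_trace_setevent @TraceID, 12, 12, @on;
EXEC sp_trace_setevent @TraceID, 12, 13, @on;
EXEC sp_trace_setevent @TraceID, 12, 16, @on;
EXEC sp_trace_setevent @TraceID, 12, 17, @on;
EXEC sp_trace_setevent @TraceID, 12, 18, @on;

-- Integer-based filter on Duration (column 13): 0 = AND, 4 = greater than or equal.
EXEC sp_trace_setfilter @TraceID, 13, 0, 4, @duration;

-- Start the trace (1 = start; 0 = stop, 2 = close and delete the definition).
EXEC sp_trace_setstatus @TraceID, 1;
SELECT @TraceID AS TraceID;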

A Simple Test

We recently conducted some cursory tests, primarily to confirm the blocking/non-blocking behavior described above. We simulated high activity via a lightweight T-SQL statement in a 1 million cycle loop, and used a trace with only "Completed" events and no filters. The goal was to generate 1 million trace events as quickly as possible, and assess the impact that consuming those events via each trace type had on the total load run time. I want to emphasize that this is an isolated test scenario, and the results should not be over-generalized. We plan on conducting much more comprehensive testing in the near future. Nonetheless, the results are interesting:

  • Both server-side and rowset traces always caused the load to run slower than when there were no traces.
  • A server-side trace writing to slow local disk caused the load to run much slower than a rowset trace consumed locally.
  • A server-side trace writing to a fast local disk caused the load to run slightly faster than a rowset trace being consumed remotely over a 1000Mbps network.
  • After factoring in the additional overhead required to copy the trace file over the network and load it into Profiler, the overall time for the server-side trace from the last test was greater.

It seems that with remote tracing, you can pay me now with rowset, or pay me later with server-side. Although the results were repeatable, I'm not publishing test details at this point because these tests were not strictly controlled. The point here is that trace impact on performance is always an "it depends" scenario. Unfortunately it rarely seems to be presented as such – blanket statements such as "Profiler is dangerous" and "server-side traces are always faster" are commonplace. Sure, consumer type and location, network speed, etc., can certainly affect the performance impact of rowset traces, but so can writing a server-side trace file to the same spindles as a busy transaction log... and the impact can potentially be much more detrimental to your SQL Server.
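
If you want to get a feel for these behaviors on a test server, a load generator can be as simple as the SSMS batch loop sketched below (this is illustrative only, not our actual test harness). Run it while a file trace writes to a deliberately slow disk, then again while a rowset trace consumes the same events, and compare total run times.

-- Illustrative load generator – run from SSMS against a TEST server only.
-- Each iteration below is sent as a separate batch, so each one fires a
-- SQL:BatchCompleted event; 1,000,000 iterations = ~1,000,000 trace events.
SET NOCOUNT ON;
GO
DECLARE @x int;
SELECT @x = 1;   -- trivial work; the goal is event volume, not load per batch
GO 1000000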

The Source

Many thousands of SQL Servers are monitored 24/7 by our rowset-based trace server and we haven't had a single report where its impact on the target was even noticeable, so we're always puzzled by the anti-rowset sentiments. I've tried to assess how the confusion and fear started, and much of it seems to have a common source: the "Standard" Profiler trace:

std_trace_props

It's been with us forever, and I'm not sure who created it as the default trace... but best case it is of little to no value, and worst case it is dangerous to your SQL Server. Presenting this trace as the default to a SQL Server newcomer is analogous to putting a loaded weapon in front of a toddler. There are no filters applied, and it includes Audit Login and SQL:BatchStarting events, neither of which is filterable by any performance metrics (Duration, CPU, Reads, Writes), meaning you will be capturing all of them. It also includes Audit Logout, which is effectively not filterable by Reads or Writes when connection pooling is in use, since those metrics are cumulative over the life of the connection. On a busy OLTP system this trace is bad, very bad!

std_trace_filters

So unless you've changed the default trace (easy to do via Edit Template), every time you open Profiler you've got a very high overhead trace as the starting point. As a result, this trace is referenced in books, trace performance benchmarks have been conducted and published using it, and countless users have naively run it as-is against busy servers. I'm convinced that this is a big reason that Profiler is so frequently kicked around, and why so many are now afraid to use it at all.

Bottom line: Don't use this trace! Certainly don't use it alone for benchmarking the relative performance impact of the two trace types.
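
Incidentally, if you're not sure what traces are already running against a server – including the system default trace and any forgotten Profiler sessions – the trace catalog views will show each active trace, whether it's file- or rowset-based, and whether it has any filters. A quick sanity-check query (generic, not tied to any particular tool):

-- List all running traces: file vs. rowset, event volume, and dropped events.
SELECT t.id,
       t.path,                    -- NULL for rowset traces (e.g. Profiler)
       t.is_rowset,
       t.is_default,              -- 1 = the system default trace
       t.event_count,
       t.dropped_event_count,     -- rowset traces drop events rather than block
       t.start_time
FROM sys.traces AS t
WHERE t.status = 1;               -- 1 = running

-- For a given trace, check whether any filters are defined (no rows = unfiltered).
-- Replace the 2 with the trace id of interest from the query above.
SELECT f.columnid, f.logical_operator, f.comparison_operator, f.value
FROM sys.fn_trace_getfilterinfo(2) AS f;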

A true comparative test should consider:

  • Events and columns used
  • Trace consumer location (local or remote)
  • Disk location (local or remote)
  • Disk speed and type (shared or dedicated)
  • Network speed
  • Load type (T-SQL text size, DML mix, etc.)
  • Transactions/Batches per second on the target
  • CPU and memory resources available
  • Filter types
  • Filter thresholds
  • ...

As I mentioned, we have more comprehensive testing coming which covers these factors, and we'll publish all of the details, with hopes of clearing up a lot of confusion and fear around tracing once and for all.

In the meantime, don't fear the trace. If you target the specific events and columns you need, and use conservative integer-based filters to restrict the rows returned, the benefits gained through intelligent analysis of the data should far outweigh any performance impact on your SQL Servers.

Friday, April 8, 2011

SQL Sentry v6.1: Something Old, Several Things New

Since releasing v6 a couple of months ago, we've been hard at work on some cool new features for v6.1. Our x.1 releases usually consist mostly of fixes for bugs not caught during the major release (x.0) beta, plus a few new features. This x.1 release is different in that it contains many exciting, high-impact new features, and a higher feature:fix ratio than normal. This is likely due to the lengthy v6 beta, as well as some new QA processes we've put in place. Change list is here.

We've just published the v6.1 Release Candidate to the website (new users|current users), and our team is showing it off now at SQLBits 8 in Brighton, UK. Read on for an overview...

SNMP Support

I'm covering this one first, as it's been on the feature list the longest. Believe it or not, we obtained our SNMP "enterprise number" and started on our MIBs in 2004, not long after the original release of Event Manager v1! Our enterprise number is 20707, and the latest enterprise number issued as of this post was... gulp... 37722. (I think I'll filter this feature from our "average turnaround time" calculation ;-) We didn't actually start coding SNMP support until recently though – for a few reasons it kept getting prioritized down: high complexity, relatively low demand, and the fact that users have been able to work around it and get alerts into other monitoring systems via our "execute" actions (Execute Process and Execute SQL). SNMP is not a feature that everyone uses, but those that do really depend on it.

Fortunately, earlier this year, the most knowledgeable and capable SNMP developer on the planet, Eric Eicke (b|t), joined our development team. Eric developed the first .NET-based SNMP component several years ago, and it's used by many large organizations, including Microsoft and HP. Eric was also the one who helped us get started on SNMP back in 2004. There was no fumbling around with RFCs – Eric knew exactly what needed to be done and did it, including v3 support, authentication, and encryption!

MIBs are installed as part of the new setup. Once you've loaded them into your enterprise monitoring system, activating SNMP in SQL Sentry is straightforward from the new SNMP Config tab:

snmp_settings

Once configured, simply tick the Send SNMP Trap action for any condition (along with a ruleset if desired):

snmp_action

Qure Integration

This next feature is not quite as old, but is equally exciting ;-) Performance Advisor has always had "Top SQL" collection and normalization/aggregation, and this year we became the first to introduce integrated query plan capture and analysis. So we automatically catch your heaviest queries, and give you best-in-class features for making manual query analysis as quick and easy as possible. If your server has 5 or 10 "problem" queries, and you've got some experience dealing with execution plans and indexes, no problem – you can use Performance Advisor to straighten everything out in an afternoon. But what if you're dealing with 20, 50, 100, or more problem queries? Or what if it's not any individual query that's the problem, but rather their cumulative impact?

For many systems, automated workload-level analysis may make more sense. That's where Qure comes in. Developed by DBSophic, Qure is the brainchild of MVP Ami Levin, and is the leading software available for SQL Server workload analysis. It uses a copy of your production database and makes index, schema, and/or query changes, then tests those changes with real workloads to see which work best, before making any recommendations. Of course, we all trust the query optimizer to behave predictably and make good decisions <ahem>, but there's just no substitute for good, old-fashioned brute force validation.

SQL Sentry and Qure provide complementary functionality with very little overlap, so bringing the two together made a lot of sense for both companies. As such, DBSophic has just released Qure for SQL Sentry, a new version of Qure that works exclusively with trace data from the SQL Sentry repository database (v6 or above):

qure_se

Here are my top 3 reasons to try Qure for SQL Sentry:

  • Faster Analysis. With the new special edition of Qure, you don't have to deal with manual workload traces. You simply point Qure to your existing SQL Sentry repository database, select the SQL Server and database to optimize, and hit "Go". Qure automatically pulls in all Top SQL trace and QuickTrace events previously captured by SQL Sentry, and uses it for the analysis.
  • Easier Validation. With the Performance Advisor dashboard's history mode, the Performance Counter Date Range Comparison report, query runtime graphs, and query plan history, SQL Sentry makes it easy to validate the results of a Qure optimization from several angles.
  • Cost-effective. The new combined SKU, Power Suite + Qure Quick Start Pack, gives you 5 perpetual Performance Advisor and Event Manager licenses plus 5 full Qure optimizations, all at a significant discount over purchasing the products separately. If you're already a SQL Sentry customer, you can purchase the new Qure edition separately as well.

You can run an unlimited number of sample analyses with the Qure for SQL Sentry trial software, and it will give you some good, actionable recommendations for FREE.

"Too Much Data" Begone!

Longtime Event Manager users will really appreciate this feature. Prior to this release, if you had too many event instances in a range to render them on the calendar in a meaningful fashion, we would show a block like this for the entire range:

too_much_data

When we introduced Performance Advisor and began showing performance events on the calendar, Top SQL in particular, these blocks became much more prevalent, and could really hamper visibility as well as increase time spent on filters.

Now, instead of a big "too much data" block, you'll see a small hatched rectangle on the far right of the range:

too_much_data_new

When you hover over it the entire range is highlighted, and a tooltip is displayed which shows how many other events exist in the range. The events actually displayed in the range have been prioritized by status and runtime – failures and longer running events are shown first, the rest are filtered from view. Just like before, you can double-click to zoom into a smaller range and see the other events.

Custom Calendar Highlighting

Previously, when you selected an event instance on the calendar, we would highlight other related instances using basic logic. For jobs and tasks we used the server and job name, and for Top SQL we used the "Text Data". So other Top SQL instances would only highlight if the SQL matched exactly. Depending on the scenario, this could be very limiting.

Now, you can pick and choose exactly which attribute or combination of attributes to use for highlighting related instances. Here are some examples:

  • You want to see all SQL, blocks and deadlocks for a particular Application
  • You want to see all SQL, blocks and deadlocks associated with a SQLAgent job
  • You want to see all SQLAgent jobs that are part of the same chain

All of these scenarios and more are possible. You simply right-click a calendar event and select the common attribute(s) using the new Highlight context menus:

calendar_highlighting

Tab Undocking / Multi-monitor Support

You now have the ability to drag any SQL Sentry main level tab outside of the Console. This is especially nice for multiple monitor setups. This shot shows a main tab being undocked:

tab_undocking

Fusion-io Drive Support

SSDs are definitely on the upswing in the SQL Server world. Over the past couple of years we've seen much greater acceptance and adoption in mission-critical SQL Server environments, and the company making the most waves here is Fusion-io. Previously, however, none of their drives would show up in our Disk Activity view, due to their PCI-based architecture and the fact that they're represented differently in the Windows subsystems we use to gather drive data.

We've addressed this in v6.1, so if you have any Fusion-io drives, single or duo, they'll be rendered on Disk Activity just like all other drive types. This gives you a great way to validate their performance – latency on these drives should be low enough that the flow lines are always green. If they are not, there is a problem.

Here's a shot of a high volume OLTP system in the U.K. with lots of database files, running on a Fusion-io ioDrive and ioDrive Duo card with Windows RAID-0, courtesy of Christian Bolton (b|t) of Coeo:

Fusion-DiskActivity

New Plan Explorer Features

The new version of Plan Explorer is 1.2, and the features I'll cover below apply to both the full SQL Sentry v6.1 and Plan Explorer v1.2. The codebase is the same between them, but there are additional capabilities that open up when you use the full software, as covered in my post on SQL Sentry v6 Plan Analysis Features.

Actual Plan Retrieval

Previously you could retrieve the estimated plan from a server, but to view an actual plan you had to run the query in SSMS and copy the plan into Plan Explorer. Now you'll see an Actual Plan toolbar button that allows you to retrieve the actual plan for any query:

actual_plan

Just as in SSMS, the query must be executed against the target in order to get the actual plan. You will see a progress bar while the query is running, but you won't see any query results when it completes – only the actual plan info, along with actual CPU and read I/O metrics.

Note the new Command Text tab above. This is an editable view of the query text, and this is what gets executed. The Text Data tab is no longer editable – that SQL comes directly from the plan, and there were several problems with making it executable that are beyond the scope of this post (and now water under the bridge ;-) I think you'll find the new design much simpler, and more robust.

We've had some lively discussion on whether or not to go down the path of showing results, but where we ended up is that Plan Explorer is a plan analysis tool, not a query tool. More often than not, viewing results isn't required to make good plan optimization decisions. We of course already show the number of estimated and actual rows, and in most cases this is sufficient.

One big advantage of not returning results is that you'll often get the actual plan back much faster than you would otherwise. For example, a query that returns 200,000 wide rows and takes 30 seconds to load in SSMS may take only 2-3 seconds in Plan Explorer.
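
For comparison, the standard way to capture an actual plan from a query window is sketched below – the query runs to completion and every result row still streams back to the client, which is exactly the overhead you avoid when the rows are discarded. (This just shows the generic T-SQL mechanism; it isn't necessarily what Plan Explorer does internally, and the table name is hypothetical.)

-- Generic T-SQL actual-plan capture: the plan XML is returned as an extra
-- result set, but all of the query's rows still come back to the client.
SET STATISTICS XML ON;

SELECT TOP (200000) *             -- hypothetical wide result set
FROM Sales.SalesOrderDetail;      -- hypothetical table – substitute your own query

SET STATISTICS XML OFF;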

Expressions Tab

Prior to this release, we showed expressions only in operator tooltips. The higher you go up the plan tree, the more levels of nesting you can have, and tracing an expression all the way back to its source using tooltips could be daunting.

This is why we've added a new dedicated tab for expressions. The tab only appears if expressions exist in the plan, and it shows standard and expanded views of all expressions, along with the entire references tree for each:

pe_expressions_tab

I should warn you, there is a lot going on behind the scenes with expressions... the optimizer generates many expressions that you'd never normally see, which can be overwhelming at first. More on this in a future post.

Join Diagram Tab

This tab only appears if the query has joins. It presents a view similar to the Query Designer diagram in SSMS, although only joined columns are shown for each table. Join information exists on the Plan Diagram and Query Columns tabs, but this is a different look at it which can be especially valuable if your query has views or nested views. The query optimizer flattens all views down to their base tables as part of the plan generation process, but in SSMS you see joins for the views, not the base tables. This can make it difficult to decipher which tables and columns are actually involved in a join. Fortunately, the plan contains this data, so we're able to reconstruct and show the base table joins from the plan XML. This can be very helpful for making good indexing decisions to support those joins.

pe_joins_tab
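
If you want to see the flattening for yourself outside of the product, here's a trivial, hypothetical example – the actual plan for the query below will show a join between the base tables (dbo.Orders and dbo.Customers), even though the query only references the view:

-- Hypothetical schema for illustration only.
CREATE VIEW dbo.vOrderCustomers
AS
SELECT o.OrderID, o.OrderDate, c.CustomerID, c.CustomerName
FROM dbo.Orders AS o
JOIN dbo.Customers AS c
    ON c.CustomerID = o.CustomerID;
GO

SELECT OrderID, CustomerName
FROM dbo.vOrderCustomers
WHERE OrderDate >= '20110101';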

This is our first cut at joins, and getting it to this point has not been trivial. What looks like a simple join on the plan diagram can be oh so much more behind the scenes. We're still finding scenarios that we haven't seen before, but we think it's far enough along that it can provide value. If you run into anything weird, please do let us know about it – email the plan to support at sqlsentry.net.

It's your feedback that enables us to continue to improve and evolve the software, so please keep it coming!