Wednesday, February 24, 2010

SQL Sentry v5.5 Need to Know

SQL Sentry v5.5 Release Candidate 2 was just published to the website.  Although it's a dot release, don't be fooled: v5.5 stands to be a watershed release for us in more ways than one.  Not only does it represent the culmination of many months of hard work by the team, but it contains what have easily been the most heavily requested enterprise features since the original release of Performance Advisor for SQL Server in 2008:  the Global Server Status View and Performance Reporting.  There are also many very cool secondary features such as Automatic Disk Misalignment Detection, Mount Point Support, as well as a variety of performance optimizations.  Read on for more details…

Global Server Status View

Finally, you can view key performance metrics for more than one server at a time from the SQL Sentry Console!  You simply right-click the Devices root node, or any device group node, and select the "Open Performance Advisor Overview" context item:

open_pa_overview

There are a couple of different ways to look at the data.  The default is Windows metrics at the top level:

overview_windows

From there you drill in to see metrics for each SQL Server or SSAS instance on the server.  Just like on the PA dashboard, we also show how much of the server's Windows resources are being consumed by the SQL Server and SSAS processes for network, CPU, and memory:

overview_windows_instance

The level below that shows the actual instance metrics, which will of course change depending on whether you’re looking at SQL Server or SSAS.

overview_windows_instance_expanded

If you only want to see SQL Server and/or SSAS metrics without the Windows, simply uncheck Windows in the dropdown at top.  This gives you a very clean look across all SQL/SSAS instances in your environment:

overview_instances

As shown in the shots above, if a performance metric exceeds commonly accepted preset thresholds (such as Page life expectancy < 300 secs), the cell and associated server name will turn red to highlight it.  And to answer the question you are about to ask… no, there is currently no way to customize this, but this is coming ;-)  For this release the primary focus was getting the alarming framework in place with some basic presets.

However, to hold you over until then, you can set custom filters for any column or combination of columns.  For example, if you want to see only servers with "Transactions/sec > 100 OR Blocks > 0 OR Deadlocks > 0", simply click the column header and set a filter:
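In pseudocode terms, a filter like that is just an OR'd predicate evaluated against each server row on every refresh.  Here's a hypothetical sketch (the column names match the example above; the server names and the evaluation code are purely illustrative, not how the grid actually implements it):

```python
# Hypothetical sketch of the filter predicate from the example above.
# Any server row matching at least one condition stays visible.
def matches_filter(row):
    return (row["Transactions/sec"] > 100
            or row["Blocks"] > 0
            or row["Deadlocks"] > 0)

servers = [
    {"Name": "srv1", "Transactions/sec": 250, "Blocks": 0, "Deadlocks": 0},
    {"Name": "srv2", "Transactions/sec": 40,  "Blocks": 0, "Deadlocks": 0},
]

visible = [s["Name"] for s in servers if matches_filter(s)]
print(visible)  # ['srv1']
```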

overview_filter_set

As the view auto-refreshes and metrics change you’ll see server rows pop in and out.  This way you can stay focused on the servers requiring your immediate attention, based on limits you’ve defined.  Here's what a filter looks like once set:

overview_filter_multi

Note there are now only two visible servers, the rest are hidden by the filter.

When you combine this with the ability to adjust the interval displayed, filtering becomes an even more powerful feature:

overview_intervals

Let me elaborate.  By default the grid shows the last raw sample value collected, but since many metrics are sampled every 10 seconds there may be temporary spikes that you want to ignore.  You may not want the filtered server list changing every refresh.  If you simply increase the interval to anything other than "Last Sample" -- say "10 Minutes" -- the metrics displayed will be the last calculated average over that interval.  This can effectively smooth out many spikes, and thus prevent a lot of volatility with the server list.
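The smoothing effect is easy to see with a little arithmetic.  This sketch uses made-up Transactions/sec samples (not real product data) to show how a raw spike that would trip a filter disappears once you average over the interval:

```python
# Illustrative only: six 10-second samples covering one minute, with a
# single transient spike in the middle.
samples = [40, 35, 250, 45, 38, 42]  # hypothetical Transactions/sec values
threshold = 100                      # the filter limit from the example

last_sample = samples[-1]
interval_avg = sum(samples) / len(samples)  # average over the whole interval

print(max(samples) > threshold)   # True: the raw spike would match the filter
print(interval_avg)               # 75.0
print(interval_avg > threshold)   # False: the averaged value stays under the limit
```

So with a longer interval selected, the server only appears in the filtered list when it is elevated in a sustained way, not for a single noisy sample.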

Even without the use of filters, this is a great feature for getting a quick snapshot of how all servers have been performing over different ranges, up to the past 3 days.

One last item on the global view -- if you don’t care for the default dark skin, you can change to a standard light grid color using a new user preference:

overview_light_skin

Performance Reporting

There has always been an enormous amount of valuable performance data contained in the SQLSentry database, but aside from the PA dashboard there really wasn’t an easy way to get at it.  With the new SSRS-based performance reporting, now there is!  To see all of the new reports, select "Reports –> Performance Advisor" from the menu bar:

pa_reporting

Here are a couple of quick shots of the "Waits Analysis" and "File Utilization by Database" reports:

reports_wait_stats  reports_file_database

Not only do we have a great list of the most commonly requested reports, but we have two ways to generate completely customized reports – the Performance Counter History report, which lets you select any combination of servers, counters and instances:

reporting_params_counter_hist

reporting_custom

…as well as the Performance Counter Date Range Comparison report, which lets you do pretty much the same thing, but using two different date ranges.

One of the coolest aspects of our custom reports is that you’re not limited to only standard Windows perf counters – you can report on any of the performance data we collect, including data from DMVs, WMI, and even internally calculated metrics like Waits by Category!  And just like the PA dashboard, you can select any date range, from 3 minutes to 3 years… and it won't kill your server in the process.  A lot of effort has been put into ensuring that the reports generate quickly and will minimally impact the performance of the SQLSentry database.

You can save any of the custom reports by clicking "Save" on the parameter selection window, as well as deploy the reports to any SSRS instance using the "Deploy Reports" menu item.

Disk Activity View Enhancements

We’ve introduced two significant enhancements to our patent-pending Disk Activity view:  Auto-Misalignment Detection and Mount Point support.  Which you will like best will depend a lot on your environment.

Automatic Disk Misalignment Detection

As you probably know, misalignment is no longer an issue with new Windows Server 2008 installs since it uses a default offset of 1024KB.  However, what we have found is that many servers out there are either still running Windows Server 2003, or were upgraded from Server 2003 to 2008.  As a result this is still a big issue in many environments, and we’ve just made it dramatically easier to detect – you simply watch a server with PA, then look at the Disk Activity screen.  If the logical disk has a red background, it means the offset is suspect:
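For the curious, the alignment test itself boils down to simple modular arithmetic: a partition's starting offset should divide evenly into the stripe unit size (commonly 64KB on SAN arrays).  Here's a hypothetical sketch of that check -- the offsets and the 64KB stripe unit are illustrative assumptions, not what PA necessarily uses internally:

```python
# Hypothetical sketch of a partition alignment check.  32256 bytes
# (63 sectors x 512 bytes) was the classic Windows Server 2003 default
# offset; 1048576 bytes (1024KB) is the Windows Server 2008 default.
KB = 1024

def is_misaligned(starting_offset_bytes, stripe_unit_bytes=64 * KB):
    # An offset that is not a whole multiple of the stripe unit means
    # IOs can straddle stripe boundaries, doubling some physical IOs.
    return starting_offset_bytes % stripe_unit_bytes != 0

print(is_misaligned(32256))      # True  -- the old 2003-era default
print(is_misaligned(1024 * KB))  # False -- the 2008 default is aligned
```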

disk_misalignment

PA shows you exactly what the offset is in the tooltip for the logical disk.  I wish I could say we will then auto-fix it for you, but you probably realize that would not be possible ;-)  However, on this same screen you can clearly see whether or not latency is actually a problem on the misaligned disk, and from there determine whether or not you should actually go through the process of moving database files and rebuilding the partition with a proper offset.  Here's another server where all partitions are misaligned, and where the disks are experiencing pretty severe latency:

disk_misalignment_multi2

BTW, if you haven't seen this screen before, all of those color-coded bars are database files -- data files and transaction logs.  Same color means same database, diagonal hash marks mean transaction log.  Bottom connector pipe for each disk is writes, top pipe is reads.  Wide red fill in the pipes means high latency, thin green means IO activity but acceptable latency.

For more details on the performance implications of disk misalignment and what you can do about it, check out Microsoft's Disk Partition Alignment Best Practices for SQL Server whitepaper and Paul Randal's blog post on this topic.

Mount Point Support

Another common request that we’ve been getting more and more over the past year or so is for mount point support, in large part because of the explosion of VMs and associated LUNs in the environment.  Now when you look at the Disk Activity screen, you’ll see mount points displayed alongside all other disks:

mount_points

PA shows the same latency and volume information for mount points and database files on mount points as it always has for other disk types.

Technically these disk-related changes were no small feat to pull off, but once again the dev team rose to the occasion!

Performance Optimizations

We are always looking for new ways to increase the performance of SQL Sentry and minimize our footprint in the environment… a footprint, I might add, that is already dramatically lower than any competitive product.  (We will actually be proving this very soon… stay tuned.)  However, in this release we’ve made some changes that would be considered significant, even by us.

Redesigned Threading Model

Two changes have been made to the threading model.  Previously, dedicated threads were spawned for each monitored server, and each would consume memory, so the more servers you were monitoring, the more memory would be used by the SQL Sentry Server service process (you need one of these services for approximately every 100 monitored servers).  A new thread pooling system is now in place that eliminates most of these dedicated threads, dramatically reducing the process’ memory footprint and increasing scalability at the same time.

Next, for some of the collection subsystems we’ve moved from a fork/join threading model to more of an asynchronous work queue model.  Fork/join threading is similar to query parallelism in SQL Server.  There are some benefits to this approach, but a primary drawback is that the process can only be as fast as the slowest thread.  The new model effectively eliminates the possibility of collection delays caused by offline servers, network issues, etc., that we’ve sometimes seen affect other critical SQL Sentry processes like chaining and alerting.
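To make the contrast concrete, here's a minimal sketch (definitely not SQL Sentry's implementation -- the server names and delays are invented) showing the work-queue idea: each collection result is handled as soon as it completes, so one slow or offline server no longer gates the whole batch the way a fork/join barrier would:

```python
# Illustrative work-queue model: results are consumed as they complete
# rather than waiting for the slowest worker at a join point.
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def collect(server, delay):
    time.sleep(delay)  # simulate a collection round-trip to the server
    return server

jobs = [("slow-server", 0.5), ("fast-1", 0.01), ("fast-2", 0.01)]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(collect, s, d) for s, d in jobs]
    # as_completed yields each result the moment it finishes; the fast
    # servers are processed long before the slow one returns.
    completed = [f.result() for f in as_completed(futures)]

print(completed[-1])  # slow-server
```

Under a fork/join model, by contrast, nothing would be processed until all three returned, so every server's data would be delayed by the slowest one.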

Reduced Plan Cache Footprint

Prior to v5.5, ad hoc SQL was being used in some places by both the SQL Sentry Console and Server, and in certain scenarios over time this could lead to plan cache bloat and some buffer pressure on the server where the SQLSentry database was installed.  In v5.5 we’ve eliminated almost all uses of ad hoc SQL, which dramatically reduces the plan cache footprint of the SQLSentry database.  This will not impact monitored servers, only the server where the SQLSentry database resides (unless of course it is monitored as well ;-)  After you upgrade, I recommend that you run DBCC FREEPROCCACHE on that server so the buffer pool can immediately reclaim any of the space previously used for the ad hoc cache.
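The mechanics behind the bloat are worth a quick illustration.  SQL Server caches a separate plan for each distinct ad hoc statement text, so embedding literal values produces one cached plan per value, while a parameterized statement compiles once and is reused.  This sketch just counts distinct statement texts (the table and parameter names are hypothetical, purely for illustration):

```python
# Illustrative only: each distinct ad hoc statement text would get its own
# plan cache entry, while one parameterized statement is reused for all values.
adhoc_statements = {
    f"SELECT * FROM dbo.Event WHERE EventID = {i}"   # literal baked in
    for i in range(1000)
}
parameterized_statements = {
    "SELECT * FROM dbo.Event WHERE EventID = @EventID"  # one reusable text
}

print(len(adhoc_statements))          # 1000 distinct plans to cache
print(len(parameterized_statements))  # 1 plan, reused for every value
```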

I really hope you enjoy using v5.5.  New users can get bits here, existing customers, please log in to your portal account.  As always, we fully support upgrading to this build from any earlier SQL Sentry version, and from this build to the final release of v5.5.  Thanks to everyone who has shared their ideas with us, our great beta testers, and of course to the entire SQL Sentry team for your efforts in finally making these great new features a reality!

Friday, February 5, 2010

Analysis Services and the Windows File System Cache

In my last post I revealed a problem one of our customers encountered when SSAS memory limits were inadvertently set to KB instead of bytes, leading to abnormally low limits and some pretty serious performance issues.  I alluded to a “saving grace” that prevented things from being much worse than they actually were, and that’s what I’ll cover in this post.

A little background first, or rather an expansion of something I touched on in my original post -- unlike the SQL Server Database Engine, much of the SSAS data can be resident in physical memory by virtue of the Windows file system cache.  This is simply because SSAS databases consist of a collection of Windows files of various types, and SSAS does not directly manage memory/disk access to those files via Windows APIs the way a complex memory manager like SQL Server does.

This means that even if the SSAS internal caches are effectively non-existent, you can still have a significant amount of raw SSAS file data loaded into RAM which will not show up as part of the SSAS process memory, or in any way be governed by the SSAS memory limits.  This was the saving grace here.  If the storage engine actually had to read the files from disk for every query, performance would have been significantly worse than it already was.

I’m going to use some Performance Advisor for Analysis Services dashboard charts to fully illustrate this phenomenon.  Since the customer’s server has 32GB of RAM, I’ve blown up the System Memory chart so it’s a little easier to see what’s going on:

memory_file_cache

The thin grey line shows the physical memory used by the file cache (about 700MB), and the even thinner orange line shows the memory used by the SSAS instance with invalid memory limits (about 450MB).  The large teal-colored area is memory used by a SQL Server instance on the same server.

Assuming there wasn’t much file serving happening on this server, most of the file cache likely consists of SSAS database files.  Without file serving activity, usually you will not see the file cache get this large, even though on 64-bit machines like this one it is capable of growing to use up all available memory.  (On 32-bit systems the max file cache size is about 1 GB.)  Read more about the Windows file cache here.

Ok, interesting theory, but how can we confirm that SSAS was able to get its data from the file cache?  Fortunately the SSAS Storage\File Read/Write charts make this pretty easy:

memory_file_read

The File Queries chart on the right shows SSAS is querying database files at a rate of about 20 per sec, and the File Read KB chart shows that even though there is a lot of file read activity on the system (about 40MB per sec), almost none of it is coming off of disk.  This is clear because the dashed orange Total Disk line is flat, and there are also no Windows read faults at that time:

memory_read_faults

Anytime SSAS files are read from disk into physical memory, you should see hard faults.  To better demonstrate this, take a look at a similar chart for a different environment:

memory_file_read_high_disk 

The off-white Total System series shows ALL file read activity flowing through the Windows system (both from memory and disk), the light orange Total SSAS series shows the portion of total read activity directly related to the SSAS process (barely visible here), and the blue series shows the portion of SSAS read activity from data files.  The dashed Total Disk read line is overlaid on top of the stacked area chart so you can quickly see how everything correlates.  It’s clear that a sizable chunk of the SSAS data file reads are coming off disk.  Also note how this time we have significant Windows read faults:

memory_file_read_high_faults

To get a different look at the data, as well as the actual file cache hit ratios, click a point on the chart in the middle of the read activity spike, then click the Sample button on the toolbar to load the data for that specific point in time:

memory_file_read_high_disk_sample 

What this shows us is that approximately 81% of the file read activity for this sample came from the file system cache.  This is not bad, but in general you always want Total Disk to be at or near zero, which will be the case whenever SSAS is pulling data from either its internal caches or the file system cache.
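The arithmetic behind that 81% is straightforward: whatever portion of total file read activity did not come off disk must have been served from the file system cache.  A quick sketch, using made-up numbers in roughly the same ballpark as the sample above (the real values come from the collected counters, of course):

```python
# Sketch of the hit ratio arithmetic: reads not satisfied from disk
# were satisfied from the Windows file system cache.
def file_cache_hit_ratio(total_read_kb, disk_read_kb):
    if total_read_kb == 0:
        return 0.0
    return (total_read_kb - disk_read_kb) / total_read_kb

# Hypothetical sample: 40,000 KB/sec total reads, 7,600 KB/sec from disk.
ratio = file_cache_hit_ratio(40_000, 7_600)
print(round(ratio * 100))  # 81
```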

How can you tell whether the data is coming from SSAS’s internal caches versus the file cache?  That’s easy -– you won’t see any storage engine file queries or associated file read activity at all, and the SSAS cache hit ratios (visible in sample mode, or when using the new Performance Advisor v5.5 reporting) will be high, such as in the shot below where the measure group cache hit ratio is 99%:

memory_high_ssas_cache_hit_ratios 

In addition to the cache hit ratio meters, there is also a visual hit ratio indicator right on the column chart -– the “fullness” of the columns reflects the hit ratios.  The more gray fill you see between the column hash lines, the higher the hit ratio:

memory_high_ssas_cache_hit_ratios_low 

Note that the measure group column is only about half full, reflecting the 52% hit ratio.  The calculation and flat cache hit ratios are both zero, so those columns are empty.  Clever, eh? ;-)  Perhaps even cleverer, on the measure group column there are two types of hash lines, horizontal and diagonal -- the horizontal hashes represent direct cache hits, while diagonal hashes represent filtered hits.  These occur when existing cache entries have to be filtered first to obtain the desired data, and are slightly higher overhead than direct hits.

In an ideal world, aside from when the cache is first warming up, you don’t want to see SSAS file queries at all, except perhaps for the occasional aggregation hit, which is also shown on the File Queries chart.  Bottom line, if you see file queries there were cache misses.  Here’s a quick custom Performance Counter History report using Performance Advisor v5.5 which shows this clearly:

memory_hit_ratio_report 

Even so, the Windows file cache hit ratio may be 100% if all of the referenced files were in the file cache.  But in that scenario, as we saw in the last post, even though performance will be better than if they were read from disk, it will not be nearly as good as if the data had been found in the SSAS cache because of the additional overhead involved when the storage engine retrieves file data from Windows.

If you want to see how your Analysis Services system is performing in this regard, you can download a fully functional 15-day trial of Performance Advisor for Analysis Services here.  Cheers.