Tuesday, March 9, 2010

SQLSaturday Charlotte: Lessons Learned

We just wrapped up what have been the busiest, most stress-filled, yet rewarding couple of weeks in recent memory.  Overall the feedback we've received about the SQLSaturday Charlotte event has been amazing, thanks to all for the kind words that have been tweeted and blogged thus far!  I do think for the most part the event went off about as well as it could have despite our lack of experience planning technical conferences.  However, we definitely learned a lot that we'll be able to apply should we ever do it again.  I had many questions when I first started working on this event, so I wanted to get my thoughts down while still fresh with hopes that it might help future SQLSaturday planners.

Scheduling Concerns

One of my primary tasks was trying to coordinate the flood of speakers and sessions we had for the event, a flood created by none other than CSSUG president and SQL Sentry partner channel manager Peter Shire, whose skills at promoting can only be described as P.T. Barnum-like.  (I thought for sure that P.T.'s first name must have been "Peter", but turns out it was "Phineas")  Peter did a phenomenal job securing the support of an amazing list of 35 speakers, and getting the word out to the SQL Server community to the tune of about 400 registered attendees, 280 of whom actually made it to the event.

The starting point for the schedule planning was the facility.  The Microsoft campus here is a fantastic facility in a great location close to the airport, downtown, etc., however, it wasn't necessarily designed to run an event like this.  The initial room breakdown was:

Building AP1:
     Room 1: 200 seats
     Rooms 2,3,4,5: 20 seats each
     Cafeteria: 220 seats

Building AP2:
     Room 6,7,8: 50 seats each

The two buildings are about a 3-minute walk apart.  Because of the relative size difference between the largest and smallest rooms, my biggest fears were putting a hot session in one of the smaller training rooms causing massive overflow, and putting a not-so-hot one of the big rooms… either of which can exacerbate the other.  The original layout had us using the 4 training rooms in AP1, the same building with the cafeteria and the 200 seat room, however, about a week before the event we got word from Microsoft that we'd been bumped by some internal training and had to use 4 training rooms in AP2 instead.  So that left us with:

Building AP1:
     Room 1: 200 seats
     Cafeteria: 220 seats

Building AP2:
     Rooms 2,3,4,5: 20 seats
     Room 6,7,8: 50 seats

The AP2 training rooms are an exact mirror of those in AP1, however it seemed to make for a very lopsided layout since it left us with a single large room in one building, and 8 rooms in the other building.  This caused concern since we'd been counting on a lot of interplay between the 80-100 people in the training rooms and the 200-seat room where I'd put all of the "hottest" sessions.  We didn't understand why the other group couldn't just use the AP2 training rooms, but eventually found out – they were doing lab sessions, and apparently the workstations in AP2 aren't the same class as those in AP1.  We weren't using the workstations so it didn't really matter to us, but apparently it did to them ;)

As it turned out, it probably did end up causing lower attendance for the featured sessions and some overflow in some of the smaller rooms, mainly because some seemed reluctant to leave AP2 when there were so many other good sessions going on next right there.  But did it even come close to ruining the conference?  Certainly not.

Another issue with the change was that we didn't get the new room names in time to put them on the signage, so we ended up using numbered rooms that were offset from the tracks by one – Track 2 was in Training 1, Track 3 in Training 2, etc.  This was very confusing, even to me.  At one point I even directed Wayne Snyder into the wrong room for his session.  Sorry Wayne!

The first thing I did was try and categorize the 80+ submitted sessions so I could get a better handle on the actual areas of interest we had to work with.  What I was really looking for at that point was whether or not we could have any focused tracks.  I ended up with 4 major categories and 24 subcategories:

Category – Subcategory Session Count
Admin - Broker 2
Admin - Clustering 2
Admin - DB Design 2
Admin - Hardware 2
Admin - Maintenance 1
Admin - Memory 1
Admin - Performance 11
Admin - PowerShell 5
Admin - Replication 3
Admin - Troubleshooting 2
Admin - Virtualization 2
BI - Dev 5
BI - SSAS 1
BI - SSIS 5
BI - SSRS 4
Dev - DB Design 2
Dev - Performance 4
Dev - Tools 2
Dev - TSQL 2
Dev - XML 1
General - Career 2
General - General 4
General - Social 3
General - Tools 2

It was apparent that although for a couple like Performance and BI, focused tracks might work, but for others we only had one or two sessions so trying to do clean, interest-based tracks would be fruitless.  I should mention that although for some sessions the categorization was easy, for others it was not clean at all, so I just took my best stab at it.  Even so, there were a few others that seemed to defy categorization, like Kevin Kline's "Top 10 Mistakes on SQL Server" and Sergey Pustovit's "SQL Server 2008/R2 Overview", and those ended up in "General – General" ;)

I was also hoping that by having these categories, I'd also be able to send them out to a few people that had planned other recent SQLSaturdays and ask them to order the list by most to least popular at their event.  I got some extremely valuable feedback from those planners, but since they didn't have accurate attendance numbers for all topics we didn't get any rankings back.  Since this info could have been invaluable to me and helped to avoid some of the session/room mismatches we experienced, I'm posting all of our attendance numbers here.  Bear in mind they are only estimates and not exact.  If you see anything that looks way off, let me know.

Next, to my sessions spreadsheet I added a few columns, and used them to come up with a universal rating measure to help decide where sessions should go.  Here they are:

Hot Topic – Yes/No – Is the topic particularly popular right now, and are there a lot of people talking about it in the media?  For example, I don't think that anyone would argue that PowerPivot is a hot topic right now.  If Yes, 2 points were added to the rating.
Premiere Speaker – Yes/No – Certain speakers have significant name recognition, and will draw attendees regardless of the topic.  If Yes, 3 points were added to the rating.
Appeal – This is a 1-10 point value added to the rating for how broad of an audience would possibly be interested in the topic, or the size of the base population.  For example, PowerPivot may indeed be a "Hot Topic", but if the potential base of users is small the appeal number will be lower.

I used the above to calculate a simple rating value, and the higher the number the bigger the room.  Certainly not very scientific, but probably good enough for our purposes.  Ultimately there are so many other variables involved, IMO if you try to be any more exacting you run the risk of skewing things too far one way or another.  For example, if I'd actually tried to estimate the number of "BI" users or "Admin" users that would show in lieu of the general "Appeal" number, and my estimates for those populations were off, it could cause trouble.  Now, keep in mind that this approach is really most applicable to a first event, where you really have no idea who is going to show up.  Next time we'll have a much better idea of the different populations here in our region and will use actual numbers as a basis for predicting session popularity.

Another idea that came up was surveying people in advance to see which sessions or categories they are most interested in.  If we had had more time, and/or if there was a capability to do this built into the sqlsaturday.com website, we likely would have made use of it.  As it was, there was just too much else going on with the planning to even think about attempting something like that.

Below I'll discuss a few of the bigger session surprises in a bit more detail.

Hotter-Than-Expected
TSQL
Specifically Mike Walsh's "You Can Improve Your Own SQL Code" and Geoff Hiten's "Bad SQL" both ended up in a smaller 20-seat room, and both could have easily been in a 50-seat room.

SSRS/SSIS
The 25-seat room we had set aside for these BI topics had 46 people crammed into it at one point.  I didn't put this track into one of the larger rooms mainly because I just wasn't quite convinced it would compete with Performance, Virtualization, PowerShell, and other "hot" topics I had in the 50-person rooms.  I also had a lot of people that know telling me that BI is always a smaller crowd at these things, so I went with our only "in between" space, the 25-seat conference room that we knew could handle up to 35 with some standing.  Good call for not putting it in a 20-seat room I guess, but bad call for not giving it a 50-seater.  If you were one of those that had to act like a sardine for an hour, my sincerest apologies, we'll give you more room next time.

Not-So-Hot
Virtualization

This one shocked me.  With as much talk (and uncertainty) as there is right now about this topic and SQL Server, I was worried that a 50-seat room would overflow for Aaron Nelson's and Denny Cherry's sessions.  I was way off.  They would have both been fine in one of the 20-seat rooms.

SSDs (Solid State Drives)
Again, I was shocked when Kendal Van Dyke's session wasn't packed.  Great, well-known speaker with what I thought would be a hot topic, but it would have been fine in a small room.

Data Compression
The two sessions we had on this topic were the most lightly attended of all, only 4-5 people in each.  This was in a smaller room already, but I had expected to see more interest since it's new to 2008 and disk space is always a concern, however the Enterprise-only caveat or the other concurrent sessions may have been a big factor. 

Too Much of a Good Thing?

The "good thing" in this case being PowerShell.  I guesstimated that this would be a hot topic as it has only seemed to gather steam with DBAs over the past year or so, and with remoting and other cool features in Windows Server 2008 R2 and PowerShell 2.0 there's a lot of new stuff to learn about as well.  Turns out I was right… kind of.  Aaron Nelson's first two sessions in a 50-seat room were pretty much full, but Allen White's two sessions immediately following were, well, not so much.  Again, it certainly wasn't Allen as a speaker since he's highly regarded and authoritative on this topic.  However, I did hear from more than one person that two PowerShell sessions probably would have been sufficient for this crowd.  I think Allen was a little disappointed, but being a glass is half full kind of guy, he said it was good practice for TechEd.  Thanks Allen!

Premature Raffling

One of our biggest mix-ups of the day was the fact that we started the vendor giveaways right after the last session was supposed to end at 5pm.  We still had a session going in the other building that went a bit long, so those attendees and the speaker missed out on the raffles.  To make matters worse, that particular speaker would have won an iTouch had he been present!  Sincere apologies here.  We felt bad, and he did seem a bit upset -- I should have reminded him that he won a WinMo device at our very first SQL Sentry giveaway several years ago, although I'm not sure it would have helped at that point ;)

Bottom line, ensure all sessions have finished and folks have had time to get to the raffle before you start.  Apologies to anyone else that missed out on this – if you did, send us an email and we'll send you either a SQL Sentry T-shirt, USB 2.0 hub, or iPhone cover.

Where'd the Speaker Go?

We had a particularly well-attended session on IO Performance… only problem is we had no speaker.  I made a quick call over the two-way to registration to see if anyone had seen this particular speaker.  They had not.  I let the attendees know this, and they quickly filed out to look for another session, some with looks of disappointment.  As it turns out, the speaker had cancelled a couple of weeks earlier, but apparently it got lost in the hundreds of other emails that were flying around about the event.  Mistakes like this happen and stuff just gets missed sometimes… which is why there should have been a mechanism in place to catch this.  Had we done a speaker roll check earlier in the day we would have caught it, and probably could have gotten another speaker to fill in, or if not, at least made attendees aware of the cancellation in advance.  We just assumed all speakers were there – not safe with 35 speakers.

Bear in mind that some speakers may show up later if their first session isn't until later in the day, so you may not want to check roll first thing in the a.m., but either way it's generally an easy matter to at least confirm whether or not the speaker was in town.  In this case they were not.

Plan for Cancellations

Jorge Segarra (aka, @SQLChicken), one of the organizers for the Tampa SQLSaturdays told me to expect between 5-7 cancellations.  I thought that sounded high – how can that many speakers commit for something like this, then just cancel?!?  Well, he was right.  There were many different reasons we heard, but what can you do.  Fortunately we had enough speakers with enough sessions to make up for it.  I think the 5-7 backups is a good number to expect for an event of this size, probably less for smaller events.  If you saw a session on the schedule that seemed a bit out of place for the room or other sessions around it, it was probably a backfill from a cancellation.

A Final Word

It's important to note that a lot of these scheduling dilemmas were due to the broad range of room sizes we had here.  If your event has 300 people and 4-5 rooms that hold 75 people each you've got a lot more margin for error, and you're probably not going to need to go to this level of detail.  Anyway, I hope some future SQLSaturday planners will find something here that's useful.  If any of you planners have any questions about what we did or why, I'll be glad to help however I can.  Please post them here or shoot me a DM or email.

A Huge Thanks to All

I wanted to take this opportunity to thank all of those that helped with the planning and organization for this great event.  First, I wanted to thank Bill Walker, head of the SQL Server CSS team here in Charlotte, and his counterpart Lynne Moore for their immediate and unwavering support.  From day one, they threw all of their considerable resources behind the event and without them it just wouldn't have happened.  All of us were truly amazed at the forces they were able to mobilize so quickly to just get it done.  I also wanted to thank Sergey Pustovit, Chris Skorlinski and Evan Basalik from the SQL Server CSS team for putting on some great sessions, and all of the other CSS members who volunteered their time whose names I don't know – they ran the "SQL Clinic" booth and helped out in innumerable other ways.

Next I wanted to thank all of the people here at SQL Sentry who volunteered their time, in no particular order -- Peter Shire, Karen Gonzalez, Nick Harshbarger, Brooke Philpott, Jason Hall, Natalie Wieland, Jason Ackerman, Steve Wright, and also Ken Teeter for the great photography.

Again in no certain order, sincere thanks go out to Jorge Segarra, Aaron Bertrand, Andy Kelly, Grant Fritchey, Tim Ford, John Welch, Rafael Salas, Geoff Hiten and of course Andy Warren for allowing me to pick their brains during the planning process.  Not sure what I would have done without their input and words of wisdom.

And last but certainly not least, thanks to all of the speakers who came from near and far to be here.  We are greatly appreciative that you decided to take the time to provide such a fantastic educational opportunity for our SQL Server community here in the Carolinas.

Wednesday, February 24, 2010

SQL Sentry v5.5 Need to Know

SQL Sentry v5.5 Release Candidate 2 was just published to the website.  Although it's a dot release, don't be fooled, v5.5 stands to be a watershed release for us in more ways than one.  Not only does it represent the culmination of many months of hard work by the team, but it contains what have easily been the most heavily requested enterprise features since the original release of Performance Advisor for SQL Server in 2008:  the Global Server Status View and Performance Reporting.  There are also many very cool secondary features such as Automatic Disk Misalignment Detection, Mount Point Support, as well as a variety of performance optimizations.  Read on for more details…

Global Server Status View

Finally, you can view key performance metrics for more than one server at a time from the SQL Sentry Console!  You simply right-click the Devices root node, or any device group node, and select the "Open Performance Advisor Overview" context item:

open_pa_overview

There are a couple of different ways to look at the data.  The default is Windows metrics at the top level:

overview_windows

From there you drill in to see metrics for each SQL Server or SSAS instance on the server.  Just like on the PA dashboard, we also show how much Windows resources are being consumed by the SQL Server and SSAS processes for network, CPU, and memory:

overview_windows_instance

The level below that shows the actual instance metrics, which will of course change depending on whether you’re looking at SQL Server or SSAS.

overview_windows_instance_expanded

If you only want to see SQL Server and/or SSAS metrics without the Windows, simply uncheck Windows in the dropdown at top.  This gives you a very clean look across all SQL/SSAS instances in your environment:

overview_instances

As shown in the shots above, if a performance metric exceeds commonly accepted preset thresholds (such as Page life expectancy < 300 secs), the cell and associated server name will turn red to highlight it.  And to answer the question you are about to ask… no, there is currently no way to customize this, but this is coming ;-)  For this release the primary focus was getting the alarming framework in place with some basic presets.

However, to hold you over until, you can set custom filters for any column or combination of columns.  For example, if you want to see only servers with "Transactions/sec > 100 OR Blocks > 0 OR Deadlocks > 0", simply click the column header and set a filter:

overview_filter_set

As the view auto-refreshes and metrics change you’ll see server rows pop in and out.  This way you can stay focused on the servers requiring your immediate attention, based on limits you’ve defined.  Here's what a filter looks like once set:

overview_filter_multi

Note there are now only two visible servers, the rest are hidden by the filter.

When you combine this with the ability to adjust the interval displayed, filtering becomes an even more powerful feature:

overview_intervals

Let me elaborate.  By default the grid shows the last raw sample value collected, but since many metrics are sampled every 10 seconds there may be temporary spikes that you want to ignore.  You may not want the filtered server list changing every refresh.  If you simply increase the interval to anything other than "Last Sample" -- say "10 Minutes" -- the metrics displayed will be last calculated average over that interval.  This can effectively smooth out many spikes, and thus prevent a lot of volatility with the server list.

Even without the use of filters, this is a great feature for getting quick snapshot of how all servers have been performing over different ranges, up to the past 3 days.

One last item on the global view -- if you don’t care for the default dark skin, you can change to a standard light grid color using a new user preference:

overview_light_skin

Performance Reporting

There has always been an enormous amount of valuable performance data contained in the SQLSentry database, but aside from the PA dashboard there really wasn’t an easy way to get at it.  With the new SSRS-based performance reporting, now there is!  To see all of the new reports, select "Reports –> Performance Advisor" from the menu bar:

pa_reporting

Here are a couple of quick shots of the "Waits Analysis" and "File Utilization by Database" reports:

reports_wait_stats  reports_file_database

Not only do we have a great list of the most commonly requested reports, but we have two ways to generate completely customized reports – the Performance Counter History report, which lets you select any combination of servers, counters and instances:

reporting_params_counter_hist

reporting_custom

…as well as the Performance Counter Date Range Comparison report, which lets you do pretty much the same thing, but using two different date ranges.

One of the coolest aspects of our custom reports is that you’re not limited to only standard Windows perf counters – you can report on any of the performance data we collect, including data from DMV’s, WMI, and even internally calculated metrics like Waits by Category!  And just like the PA dashboard, you can select any date range, from 3 minutes to 3 years… and it won't kill your server in the process.  A lot of effort has been put into ensuring that the reports generate quickly and will minimally impact the performance of the SQLSentry database.

You can save any of the custom reports by clicking "Save" on the parameter selection window, as well as deploy the reports to any SSRS instance using the "Deploy Reports" menu item.

Disk Activity View Enhancements

We’ve introduced two significant enhancements to our patent-pending Disk Activity view:  Auto-Misalignment Detection and Mount Point support.  Which you will like best will depend a lot on your environment.

Automatic Disk Misalignment Detection

As you probably know, misalignment is no longer an issue with new Windows Server 2008 installs since it uses a default offset of 1024KB.  However, what we have found is that many servers out there are either still running Windows Server 2003, or were upgraded from Server 2003 to 2008.  As a result this is still a big issue in many environments, and we’ve just made it dramatically easier to detect – you simply watch a server with PA, then look at the Disk Activity screen.  If the logical disk has a red background, it means the offset is suspect:

disk_misalignment

PA shows you exactly what the offset is in the tooltip for the logical disk.  I wish I could say we will then auto-fix it for you, but you probably realize that would not be possible ;-)  However, on this same screen you can clearly see whether or not latency is actually a problem on the misaligned disk, and from there determine whether or not you should actually go through the process of moving database files and rebuilding the partition with a proper offset.  Here's another server where all partitions are misaligned, and where the disks are experiencing pretty severe latency:

disk_misalignment_multi2

BTW, if you haven't seen this screen before, all of those color-coded bars are database files -- data files and transaction logs.  Same color means same database, diagonal hash marks mean transaction log.  Bottom connector pipe for each disk is writes, top pipe is reads.  Wide red fill in the pipes means high latency, thin green means IO activity but acceptable latency.

For more details on the performance implications of disk misalignment and what you can do about it, check out Microsoft's Disk Partition Alignment Best Practices for SQL Server whitepaper and Paul Randal's blog post on this topic.

Mount Point Support

Another common request that we’ve been getting more and more over the past year or so is for mount point support, in large part because of the explosion of VMs and associated LUNs in the environment.  Now when you look at the Disk Activity screen, you’ll see mount points displayed alongside all other disks:

mount_points

PA shows the same latency and volume information for mount points and database files on mount points as it always has for other disk types.

Technically these disk-related changes were no small feat to pull off, but once again the dev team rose to the occasion!

Performance Optimizations

We are always looking for new ways to increase the performance of SQL Sentry and minimize our footprint in the environment… a footprint, I might add, that is already dramatically lower than any competitive product.  (We will actually be proving this very soon… stay tuned.)  However, in this release we’ve made some changes that would be considered significant, even by us.

Redesigned Threading Model

Two changes have been made to the threading model.  Previously, dedicated threads were spawned for each monitored server, and each would consume memory, so the more servers you were monitoring, the more memory would be used by the SQL Sentry Server service process (you need one of these services for approximately every 100 or so monitored servers).  A new thread pooling system is now in place that eliminates most of these dedicated threads, dramatically reducing the process’ memory footprint and increasing scalability at the same time.

Next, for some of the collection subsystems we’ve moved from a fork/join threading model to more of an asynchronous work queue model.  Fork/join threading is similar to parallelism with SQL Server.  There are some benefits to this approach, but a primary drawback is that the process can only be as fast as the slowest thread.  The new model effectively eliminates the possibility of collection delays caused by offline servers, network issues, etc., that we’ve sometimes seen affect other critical SQL Sentry processes like chaining and alerting.

Reduced Plan Cache Footprint

Prior to v5.5, ad hoc SQL was being used in some places by both the SQL Sentry Console and Server, and in certain scenarios over time this could lead to plan cache bloat and some buffer pressure on the server where the SQLSentry database was installed.  In v5.5 we’ve eliminated almost all uses of ad hoc SQL, which dramatically reduces the plan cache footprint of the SQLSentry database.  This will not impact monitored servers, only the server where the SQLSentry database resides (unless of course it is monitored as well ;-)  After you upgrade, I recommend that you run DBCC FREEPROCCACHE on that server so the buffer can immediately reclaim any of the space previously used for the ad hoc cache.

I really hope you enjoy using v5.5.  New users can get bits here, existing customers please login to your portal account.  As always, we fully support upgrading to this build from any earlier SQL Sentry version, and from this build to the final release of v5.5.  Thanks to everyone who has shared their ideas with us, our great beta testers, and of course to the entire SQL Sentry team for your efforts in finally making these great new features a reality!

Friday, February 5, 2010

Analysis Services and the Windows File System Cache

In my last post I revealed a problem one of our customers encountered when SSAS memory limits were inadvertently set to KB instead of bytes, leading to abnormally low limits and some pretty serious performance issues.  I alluded to a “saving grace” that prevented things from being much worse than they actually were, and that’s what I’ll cover in this post.

A little background first, or rather an expansion of something I touched on in my original post -- unlike the SQL Server Database Engine, much of the SSAS data can be resident in physical memory by virtue of the Windows file system cache.  This is simply because SSAS databases consist of a collection of Windows files of various types, and SSAS does not directly manage memory/disk access to those files using Windows APIs as complex memory managers like SQL Server do.

This means that even if the SSAS internal caches are effectively non-existent, you can still have a significant amount of raw SSAS file data loaded into RAM which will not show up as part of the SSAS process memory, or in any way be governed by the SSAS memory limits.  This was the saving grace here.  If the storage engine actually had to read the files from disk for every query, performance would have been significantly worse than it already was.

I’m going to use the some Performance Advisor for Analysis Services dashboard charts to fully illustrate this phenomena.  Since the customer’s server has 32GB of RAM, I’ve blown up the System Memory chart so it’s a little easier to see what’s going on: 

memory_file_cache

The thin grey line shows the physical memory used by the file cache (about 700MB), and the even thinner orange line shows the memory used by the SSAS instance with invalid memory limits (about 450MB).  The large teal-colored area is memory used by a SQL Server instance on the same server.

Assuming there wasn’t much file serving happening on this server, most of the file cache is likely comprised of SSAS database files.  Without file serving activity, usually you will not see the file cache get this large, even though on 64 bit machines like this one it is capable of growing to use up all available memory.  (On 32 bit systems the max file cache size is about 1 GB.)  Read more about the Windows file cache here.

Ok, interesting theory, but how can we confirm that SSAS was able to get it’s data from the file cache?  Fortunately the SSAS Storage\File Read/Write charts make this is pretty easy:

memory_file_read

The File Queries chart on the right shows SSAS is querying database files at a rate of about 20 per sec, and the File Read KB chart shows that even though there is a lot of file read activity on the system (about 40MB per sec), almost none of it is coming off of disk.  This is clear because the dashed orange Total Disk line is flat, and there are also no Windows read faults at that time:

memory_read_faults

Anytime SSAS files are read from disk into physical memory, you should see hard faults.  To better demonstrate this, take a look at a similar chart for a different environment:

memory_file_read_high_disk 

The off-white Total System series shows ALL file read activity flowing through the Windows system (both from memory and disk), the light orange Total SSAS series shows the portion of total read activity directly related to the SSAS process (barely visible here), and the blue series shows the portion of SSAS read activity from data files.  The dashed Total Disk read line is overlaid on top of the stacked area chart so you can quickly see how everything correlates.  It’s clear that a sizable chunk of the SSAS data file reads are coming off disk.  Also note how this time we have significant Windows read faults:

memory_file_read_high_faults

To get a different look at the data as well as the actual file cache hit ratios, if we click on a point on the chart in the middle of the read activity spike, then click the Sample button on the toolbar it loads up the data for that specific point in time:

memory_file_read_high_disk_sample 

What this shows us is that approximately 81% of the file read activity for this sample came from the file system cache.  This is not bad, but in general you always want Total Disk to be at or near zero, which will be the case whenever SSAS is pulling data from either its internal caches or the file system cache.

How can you tell whether the data is coming from SSAS’s internal caches versus the file cache?  That’s easy -– you won’t see any storage engine file queries or associated file read activity at all, and the SSAS cache hit ratios (visible in sample mode, or when using the new Performance Advisor v5.5 reporting) will be high, such as in the shot below where the measure group cache hit ratio is 99%:

memory_high_ssas_cache_hit_ratios 

In addition to the cache hit ratio meters, there is also a visual hit ratio indicator right on the column chart -– the “fullness” of the columns reflect the hit ratios.  The more gray fill you see between he column hash lines, the higher the hit ratio:

memory_high_ssas_cache_hit_ratios_low 

Note that the measure group column is only about half full, reflecting the 52% hit ratio.  The calculation and flat cache hit ratios are both zero, so those columns are empty.  Clever, eh? ;-)  Perhaps even cleverer, on the measure group column there are two types of hash lines, horizontal and diagonal -- the horizontal hashes represent direct cache hits, while diagonal hashes represent filtered hits.  These occur when existing cache entries have to be filtered first to obtain the desired data, and are slightly higher overhead than direct hits.

In an ideal world, aside from when the cache is first warming up, you don’t want to see SSAS file queries at all, except perhaps for the occasional aggregation hit, which is also shown on the File Queries chart.  Bottom line, if you see file queries there were cache misses.  Here’s a quick custom Performance Counter History report using Performance Advisor v5.5 which shows this clearly :

memory_hit_ratio_report 

Even so, the Windows file cache hit ratio may be 100% if all of the referenced files were in the file cache.  But in that scenario, as we saw in the last post, even though performance will be better than if they were read from disk, it will not be nearly as good as if the data had been found in the SSAS cache because of the additional overhead involved when the storage engine retrieves file data from Windows.

If you want to see how your Analysis Services system is performing in this regard, you can download a fully functional 15-day trial of Performance Advisor for Analysis Services here.  Cheers.

Friday, June 19, 2009

Analysis Services Memory Limits, Part II

In my last post I discussed some of the basics of SSAS memory management -- how it utilizes limits to control the amount of memory consumed by the SSAS process, and how the memory limits are set as a percentage of the total physical memory.  I also mentioned that we had a beta tester who reported what they thought was an odd graphical anomaly on the Performance Advisor for Analysis Services “Memory Usage” chart, and I offered up a challenge to see if anyone could explain it.  (Details about the winner at bottom.)

I’ll begin with an admission -– I intentionally left out a rather critical aspect of memory limit configuration.  The omission was that, like several other SSAS settings, once the configured value goes over 100% it changes to an explicit setting.  In the case of memory limits the explicit setting is translated as bytes.

Unfortunately, for some reason this is not covered in the SSAS documentation… however, it is covered elsewhere, including this doc: 
SQL Server 2005 Analysis Services (SSAS) Server Properties

In this case, the first thing we had the tester check was the actual configured setting values, as well as the values being returned from the associated Windows performance counters as a sanity check.  Here is what they sent back:

Memory Limit Setting –> Configured Value (bytes) –> Perf Counter Value (KB)
Memory\LowMemoryLimit –> 2097152 –> 8192
Memory\TotalMemoryLimit –> 13531488 –> 13312

Hopefully the problem is starting to become clearer.  Apparently the configuring user had tried to use an explicit amount of memory, however they had inadvertently entered kilobytes (KB) instead of bytes!

These values translated to approx 8MB and 13MB for the Low and Total limits respectively, which explains what they were seeing on the chart.  It was no graphical anomaly.  (Note that even though the configured Low value should have been approx 2MB, apparently the minimum SSAS will use is 8MB, which explains the associated perf counter value of 8192 KB.)

Ok, so we figured out what was going on on the chart, but what about the impact these abnormally low settings would have on SSAS performance?  As I mentioned before, this client had had many problems with SSAS performance, and as a result had engaged Microsoft to come in and help.

Here’s where things get very interesting.  First, take a look at the dashboard shot the client sent to us, which shows the memory issue along with the execution of a single MDX query that ran for a couple of minutes:

memory_limit_issue(click to enlarge)

Here’s a closeup of the Memory Usage chart:

mem_cleaner_anomaly

What’s wrong with this picture?  Well, a couple of things.  First, SSAS memory usage is almost completely flat, and is almost entirely nonshrinkable.  Typically SSAS memory will grow and shrink continuously as memory is allocated and deallocated for querying and processing operations, and will stabilize with a much larger base proportion of shrinkable memory than we see here.

To illustrate this better, on a test SSAS instance I configured the limits artificially low and let it run for a while, then increased the limits much higher so that there was no memory pressure:

mem_cleaner_increased_limits

The same system, zoomed into a 10 minute range:

mem_cleaner_no_pressure

What the flat memory line means is that the unusually low limits are effectively preventing SSAS from loading any data into either of its caches -- the Formula Engine (FE) cache (aka, “query engine cache”) or Storage Engine (SE) cache.

This is further confirmed by looking at a few other charts.  Notice the Cache Activity chart shows almost continual evictions and memory shrinkage while the query is running:

cache_activity

This is not normal!  You rarely want to see these values above zero, because when you do it means that memory cleaner threads have kicked in because of internal memory pressure and are moving data out of memory.  In this case the data is being moved out as fast as it is moved in.

Also have a look at the Avg Time chart:

avg_time

What this shows is that SSAS is spending almost all of it’s time on non-cached storage engine operations (SE non-cached), which is where you typically want it to spend the least amount of time.  This means that every time a query needs data it’s unable to find it in the FE cache, so the formula engine issues requests to the storage engine, which is unable to find the data in its cache, so it has to go to the file system to get it from either measure group partitions or aggregations.  Partitions can be orders of magnitude slower than retrieving the data directly from cache, and is almost certainly why the query took so long.

As an aside, originally I’d actually called this the “Wait Time” chart since what we are showing here is similar to what you see with SQL Server wait types, but Carl Rabeler convinced me otherwise since what the chart is actually showing is “time spent” on the different operations, versus time spent waiting.  You could probably get into a semantical debate about this since many of the SQL Server wait types pretty much show “time spent” as well… but it’s just not worth any potential confusion from injecting some Database Engine terminology into the Analysis Services lexicon, no matter how tempting it may be ;)

In any event, this chart is critical because you can see exactly where SSAS is spending most of it’s time over any date range, which can be extremely important for profiling a server and determining where the bottlenecks lie.

Next have a look at the File Queries chart, the Total metric in particular:

file_queries

As I mentioned, the formula engine generates requests to the storage engine for measure group data when it’s not in the FE cache.  The total queries metric shows how many times the storage engine had to go to the file system to read data from a measure group partition (or aggregation) when the data wasn’t in the SE cache.  Here again is another value you typically want to be as close to zero as possible, because when it’s >0 it means that SSAS couldn’t find the data in either the FE cache OR the SE cache.

Note that you will see values >0 when an aggregation is first hit and loaded into the SE cache, since Total covers both Agg hits and partition queries.  But since agg hits are much lighter weight than partition queries this is generally not a problem.

It is also perfectly normal to see values >0 when the cache is cold and queries first start coming in.  For example, if you process your cubes overnight the associated cache data will be invalidated, and you’ll probably see file queries spike up in the morning when users come in and start issuing MDX queries, then settle down as the data is cached.  If you see that total queries are continually above zero throughout the day, memory pressure is a likely culprit.

Impact on MDX Query Performance

So what happened after the invalid memory limit settings were corrected?  Here is some actual query data from the Performance Advisor Top Queries tab which our beta tester was nice enough to export for me:

query_performance (click to enlarge)

The queries in red ran with the invalid low memory limits.  Notice every single one had to go to the storage engine for the data every time.  All had 53 non-cached operations, high cpu cycles, durations of >10 seconds, and all spent 94-97% of that time in the storage engine, for obvious reasons.

Once the limits were changed, the 2nd execution of the query (green) took about 1.5 seconds, with zero non-cached ops, very low cpu, and 100% of the time was spent in the formula engine!  This is a fantastic example of how things are supposed to work when there is memory available and limits are configured properly.  Most future runs of this query actually dropped under our default trace filter of >=1000ms, and so no longer showed up at all in Top Queries.

Note that the first run of the query after changing the memory limits (orange) actually took longer than the runs with the invalid limits, although everything else was about the same.  At first I was puzzled by this, but then it occurred to me that this may be because SSAS was finally able to actually load the data into cache, and the extra 6 seconds may just be overhead related to the associated memory allocations.  Or it may have just been an anomaly related to other activity on the server at the time, not sure.

Even though the performance before the change was pretty bad, things actually could have been much worse.  There was one very big saving grace for this client that allowed SSAS to run as well as it did with the invalid memory limits.  It’s actually very clear on the dashboard what that is, but I’ll save that one for another post ;)

Mini-contest Winner!

Congrats to Norbert Kessler from WebTelligence, a SQL Server consulting firm in Germany and one of our beta testers, for nailing the answer pretty much right on the head.  He wins the free license of Performance Advisor for Analysis SerivcesJohn Welch was very close, and actually brought up another pretty serious misconfiguration -- when the high and low limits are configured with the same value.  This is a common practice in the SQL Server world, but a big no-no with SSAS.  Another topic for a future post perhaps…

Monday, June 15, 2009

Analysis Services Memory Limits

One of the least understood areas of Analysis Services performance seems to be that of memory limits – how they work, and how to configure them effectively.  I was going to start with an overview of Performance Advisor for Analysis Services (aka, PA for SSAS), but during the beta I ran across such a good example of the software in action in diagnosing a problem with memory, I think it’ll work just as well to go through that issue step-by-step.

First some background.  SSAS has two general categories of memory, shrinkable and nonshrinkable, and they work pretty much like it sounds.  Shrinkable memory can be easily reduced and returned back to the OS.  Nonshrinkable memory, on the other hand, is generally used for more essential system-related stuff such as memory allocators and metadata objects, and is not easily reduced.

PA for SSAS shows both types on the SSAS Memory Usage chart:

Cleaner MemoryIf you switch to “Memory by Category” mode via right-click menu, you get a more detailed breakdown.  Just like the chart above, the memory with hash marks is nonshrinkable:

Memory by Category

SSAS uses memory limit settings to determine how it allocates and manages its internal memory.  Memory\LowMemoryLimit defaults to 65% of the total available physical memory on the machine (75% on AS2005), and Memory\TotalMemoryLimit (also sometimes called the High Memory Limit) defaults to 80%.  This is the total amount of memory that the SSAS process itself (msmdsrv.exe) can consume.  Note on the charts above, the orange line represents the Low memory limit and the red line the Total memory limit.  This makes it very easy to see how close SSAS’s actual memory consumption is to these limits.  Once memory usage hits the Low limit, memory cleaner threads will kick in and start moving data out of memory in a relatively non-aggressive fashion.  If memory hits the Total limit, the cleaner goes into crisis mode… it spawns additional threads and gets much more aggressive about memory cleanup, and this can dramatically impact performance.

Between the Low and Total limits, SSAS uses an economic memory management model to determine which memory to cleanup.  This model can be adjusted by some other parameters related to the memory price, but I’m not going to get into those here.

When it comes to memory management SSAS is entirely self-governing, and unlike SQL Server it doesn’t consider external low physical memory conditions (which can be signaled by Windows) or low VAS memory.  This may be partly because SSAS is already much more subject to Windows own memory management than is SQL Server, since Analysis Services databases are a collection of files on the file system and can make heavy use of the file system cache, whereas SQL Server does not.  This means that you can have a significant amount of SSAS data loaded into RAM by virtue of the file cache, and this will not show up as part of the SSAS process memory, or in any way be governed by the SSAS memory limits.  (For this reason we now show system file cache usage and file cache hit ratios on the dashboard as well so you can see this.)

On a side note, a byproduct of this design is that if available memory is low and other processes on the server are competing for memory resources, SSAS will pretty much ignore them, and look only to its own memory limit settings to determine what to do.  This means, for example, that memory for a SQL Server using the default “Minimum server memory” setting of 0 can be aggressively and dramatically reduced by an SSAS instance on the same machine with default memory limits, to the point where the SQL Server can become unresponsive. SQL Server is signaled by Windows that memory is low, and responds accordingly by releasing memory back to the OS, which is then gobbled up by SSAS.

Graphical Anomaly, or Serious Memory Problem?

Very early in our beta for PA for SSAS, a beta tester reported what they thought was an odd graphical anomaly in the bottom right of the SSAS Memory Usage chart:

mem_cleaner_anomaly

To the untrained eye, this would certainly look like an anomaly, but once you’ve seen what this chart is supposed to look like, it’s fairly obvious that it’s an “H” on top of an “L”.

This particular tester had been eager to get the beta because they had suffered from poor performance and other strange behaviors for quite some time.  They had actually already scheduled Microsoft to come in and help troubleshoot the issues.

What exactly did this “anomaly” mean?  What caused it, how it was resolved, and what was the net effect on performance, if any?  All great questions… and this will be the subject of my next post, so stay tuned ;-)

If anyone has any guesses in the meantime feel free to comment.  The first person to get it right will win a free license of Performance Advisor for Analysis Services! (that’s a $2495 value)

Thursday, June 4, 2009

Performance Advisor for… Analysis Services?

It’s been some time since my last post, and several folks have asked “Hey Greg, after such a promising start, what happened?”  The short answer is, “Performance Advisor for Analysis Services.”  Yes, you heard that right.

The long answer is… not long after my last post I was having a chat with John Welch at Mariner, a great Microsoft BI consulting firm also based here in Charlotte, and I was trying explain how Performance Advisor for SQL Server might be used to help troubleshoot Analysis Services performance issues.  Let’s just say that conversation didn’t go quite as well as I’d hoped ;-)  Sure, PA for SQL shows many of the pertinent Windows counters on the dashboard, and the “Other files” element on the disk activity and space tabs will capture SSAS-related IO and space, and Event Manager handles SSIS and SSRS… but it became apparent very quickly that this was not nearly enough.  John asked, “What about temp file spillage… or file cache utilization… or aggregation usage?”

Although PA for SQL may be the best tool out there for optimizing SQL Server performance, clearly an SSAS performance tool it was not.  Needless to say, I left feeling a bit down about that point… but energized by another tidbit I discovered during the meeting: apparently there was no commercial software on the market for Analysis Services performance!  When John said this I really couldn’t believe it.  Up to that point our focus had always been on the SQL database engine and directly related technologies, and I guess I just assumed that, as big of an industry as BI is for Microsoft, that there had to be something out there.  After a lot of looking around, I confirmed that there wasn’t.

Shortly thereafter I went to the dev team and said, “Remember what I said last week about focusing exclusively on the enhanced performance alerting features for PA for SQL that we didn’t get into the initial release… well, forget that for the moment.  An opportunity just fell into our laps, and this type of opportunity doesn’t come along every day!”  Blank stares.  Although I’d like to think that by this point I’ve earned the confidence of the team, let’s just say this sudden change in direction was met with a certain amount of skepticism.

Here we are about 9 months later and within a couple of weeks of the release of PA for SSAS, and I can safely say there are no more skeptics.  When I stop to think about it I’m amazed we were able to get this software to market so quickly.  It took about two years of focused effort to build PA for SQL Server, and if we had started from scratch it would have taken at least as long to build PA for SSAS, probably longer.  The big difference was that we’d already built this fantastic architecture for collecting, managing, and presenting performance data from diverse sources, so a lot of it was pretty much plug-and-play.  The most difficult aspect of the project was definitely the R&D phase, which took several months.  If you’ve worked with both the SQL database engine and Analysis Services even a little, you know they are completely different animals that bear almost no resemblance to each other.  From the way they manage memory and disk access, to the threading models, to usage patterns, they almost look like they were developed by two different companies.  (Probably because originally they were – Microsoft acquired the original OLAP engine from Panorama Software in 1996, around the time of SQL Server 6.5.)

The goal for PA for SSAS was pretty much the same as PA for SQL Server – build a product which presents key, actionable performance metrics in an intuitive, easy-to-consume fashion.  However, due to the vast architectural differences between SQL and SSAS, almost none of the extensive R&D done for SQL Server was reusable.  Determining exactly which SSAS metrics to present and how, so that literally within a few minutes someone who may not even be very familiar with SSAS can understand how it works under the covers, was one of the most challenging undertakings I’ve ever been involved with.  I can remember leaving the office many times during that phase feeling like a tiny ant on the backside of the Analysis Services elephant.

Little by little it all came together though, and I think you’ll be very pleased with the end result.  I can honestly say that it would not have been possible without the extensive work by and encouragement from Carl Rabeler (Microsoft SQLCAT), Chris Webb, and John Welch; Brooke Philpott and our phenomenal technical team; and our great beta testers.  Many, many thanks to all of you!

Wednesday, October 8, 2008

More Fun with v4.1...

I mentioned in my last post how some users weren't even aware of the Performance Advisor dashboard's History mode because of issues with the toolbar buttons. What I didn't mention was that for some an even more significant "hidden" feature was Performance Advisor itself!

First some background.  Both Event Manager and Performance Advisor share the same code base, so you only need one console, service and database for both products.  There were many reasons for taking this approach, including ease of installation via a unified setup program, ease of configuration, and seamless integration possibilities between the products.  For example, we can show you all performance events (heavy SQL, blocks, & deadlocks) on the Event Manager calendar without any configuration (as long as you have a license for both products).

calendar_jump_to_pa

We also enable you to quickly jump back and forth between Event Manager and Performance Advisor via context menus and toolbar buttons.  These menus are shown in both the Event Manager calendar shot above where a block is selected, and below with the same block selected in Performance Advisor.  (I should mention that the easiest way to get from a block on the calendar to the same block in Performance Advisor is actually to double-click it, or right-click and select Open.)

blocking_jump_to_cal

Note how the two products give you a completely different yet complementary view of the same event.  Performance Advisor shows you all of the great details of all versions of the block chain, while Event Manager shows you in chronological fashion how the block was interacting with all other events around the same timeframe (heavy SQL, SQLAgent jobs, SSIS, maint plans, etc.).  You can stare at a grid for hours and not gain the same level of understanding about a performance event as you get after looking at the calendar for a few seconds.  You'll see relationships between events that you never new existed... which enables you to be exponentially more effective when troubleshooting and resolving difficult performance problems.

Hopefully by now you can see some of the great benefits of a single SQL Sentry console for both products.  However, a downside of the single console approach is that, well, there is only one console.  Let me explain.  If each product was its own Windows application, the separation between the products would be quite clear.  There would be little chance of confusion regarding how to monitor (or "watch" in SQL Sentry lingo) a server with each product, how to open the primary screen for each product, etc.

But since we leverage the same Navigator pane that we've always had in Event Manager, and it looks pretty much the same as before, some users had difficulty figuring out how to pull up Performance Advisor.  At first we were puzzled by this, since if you use the context menus it's clear that there are now two sets of Watch and Open child menu items on computer (aka Device) and SQL Server nodes, and you are forced to pick one or the other:

open_context_menus

Ah, but there was yet another way to open the Event Manager calendar in previous versions -- via double-click of the left mouse button.  As it turns out, many existing users had become accustomed to using this approach since it's faster/easier.  Therein lied the "problem".  Although in v4.0 we had added a User Preference to control the "default view" (or product) when left-clicking a node, we had set this to "Event Manager" for upgrades to avoid changing behavior for existing users.  So if you were a left-clicker, unless you went into the User Prefs and adjusted this, you'd continue to get the Event Manager calendar and never see Performance Advisor!  Right.  2+ years of effort down the drain :O

After many hours of careful deliberation, to address this problem we decided to introduce a new feature in v4.1 called the "Product Selector" ;)  Now what you will get by default when you double-click a computer or SQL Server in the Navigator is this screen:

product_selector

The screen actually serves a few purposes:  1) to let you know you have multiple products available, 2) to allow you to continue to double-click and let you choose which product to open, and 3) to allow you to adjust the new "default view" User Pref via the "Default" column checkbox.

NOTE: All Event Manager users who upgrade to v4.x get a fully functional 30-day, 5-server trial for Performance Advisor for your efforts, so #1 applies to you as well.

#3 only comes into play if you deselect the "Always prompt me" checkbox, after which point you will no longer see the Product Selector... we'll just open the product you checked as the default.

We've already found that this in itself is causing some confusion though!  Some users expected the "Default" checkbox to control which product is pre-selected the next time you see the form, which it does not.  To avoid this, in v4.2 we've simplified the form slightly; we've eliminated the Default column and just use the currently selected product as the default if you ever deselect the "Always prompt me" option.  In addition, the last product selected will always be pre-selected the next time you open the form.

It's truly amazing to me that I was able to fill up an entire blog post with a relatively insignificant topic... although it is indeed typical of types of issues I spend much of my days on.  Hopefully someone out there will find it helpful ;)  Please bear with, this whole blogging thing is still new to me.  I promise to extricate myself from the day-to-day details of UI design & usability issues soon and delve more into how to use SQL Sentry to solve real world performance issues.  Until next time...