Wednesday, June 2, 2010

SQLCruise Contest Winner

As you may have heard, as part of our sponsorship of the inaugural SQLCruise, we decided to run a contest to provide one lucky winner a free ticket for the cruise and training.  When we started the contest we expected that most of the submissions would be in the form of tweets or short blog posts, and we did receive many of those which were quite good.  We were surprised and impressed, however, by the effort, ingenuity and humor that went into four submissions in particular.  So much so that we had a tremendously hard time selecting a winner… how could we possibly choose between:

  • Jorge Segarra (blog | twitter) – Here's a guy who, and I can't think of any other way to put it, was born to go on this cruise.  Not only was his video hilarious, it was replete with gratuitous SQL Sentry references, and was actually submitted before the contest officially started.  Talk about enthusiasm!  It was a shot across the SQLCruise bow – "If you want to go on the cruise, you're gonna have to go through me!!"  It also served to ensure we wouldn't spend any time reviewing half-hearted submissions (thanks man! ;-)  How could we in good faith not send him, and deprive the other SQLCruisers of four days of laughs and learning with the Chicken?
  • Matthew Velic (blog | twitter) – A guy who's new to SQL Server and who desperately wants training, and who would apply that training daily for a very worthy cause, fighting Pediatric AIDS.  He clearly spent a lot of time on his video as well, as the production values were through the roof.  How could we not send Matthew?
  • Rebecca Mitchell (blog | twitter) – A DBA in Colorado who not only has the technical skills required to successfully manage lots of SQL Servers, but the creativity to write a poem about SQL Server (yes, that's right, and probably the longest poem I've ever seen, let alone read), with cleverly infused references to SQL Sentry's deadlock analysis and every DBA’s favorite breakfast meat to boot.  How the same person can have both abilities is frankly beyond my comprehension.  How could we not send this rare talent on the cruise?
  • Erin Stellato (twitter) – Erin lives and works in Cleveland.  I mean, how could we not send her? ;-)  Well, her video also happened to be extremely well done, with Hollywood quality acting, score, humor and special effects.  Seriously, she too needed to go!

So after going round and round for hours, we couldn't decide, which left us with a couple of options: either put it out for a community vote (which we may actually do next time), or just bag the whole idea.  Unfortunately we had a hard deadline from Carnival so there was no time to put it out to vote.  Punting on the contest was out too as we certainly couldn't punish everyone for being "too good"… that's just not how we operate here.

After some more agonizing, another idea came to us.  It was extreme… but clearly it was the best option, and a quick review of the official contest rules confirmed that it was indeed "legal":  Jorge, Matthew, Rebecca, Erin, pack your bags, you are all going on SQLCruise!!! ;-)

We have no doubt you will all make the most of this awesome opportunity.  Have a great time, and be sure to spread the word!

Thanks to everyone who participated, I do wish we could send all of you… maybe someday.  Meantime, I hear there are still a few seats left!

Monday, April 5, 2010

Getting Immersed in SQL Server

I had the pleasure of spending the past week at the SQL Server Immersion event in Boston.  I was joined by our lead developer, Brooke Philpott (blog | twitter).  This was an intensive course on SQL Server internals, and much more, put on by the esteemed Paul Randal (blog | twitter) and Kimberly Tripp (blog | twitter), and hosted by Adam Machanic (blog | twitter).  I knew going in that Paul and Kimberly were authorities in their respective areas – Paul in the storage engine and related areas like backup/recovery, and Kimberly in query tuning and index optimization.  What I didn’t expect was the sheer breadth and depth of their knowledge, and that they also share quite a bit of knowledge in common.  They are of course married, so I expect this leads to a lot of information-sharing at the dinner table that wouldn’t occur otherwise.

The material covered is far too extensive to get into here, but I will say that every day, and just about every hour of every day, was a challenge… and I have worked with SQL Server for almost 15 years.  This was not basic stuff.   A couple of times in the past I’ve made the mistake of paying for training where the outline looks great, but you get in class only to find that it’s far too basic and high level, and therefore not worth the time away to be there.  This was not that class.  Even though GAMS, IAMS, PFS pages, headers, slots and associated bits and bytes could be considered “basic” in the sense that they are the building blocks of SQL Server, they are certainly not simple.  The amazing thing is that this class really costs no more than many other lower quality week-long SQL Server training courses, when in reality it should probably cost twice as much.  It is an absolute bargain.

When it comes to the instructors themselves, Kimberly and Paul simply have no equal.  Not only are they both ultimate authorities, but they have endless enthusiasm about the material, and spot on delivery.  If either ever got tired they never showed it, even after going all day and all week.  We witnessed countless demos over the course of the week, some extremely involved, multi-step processes, and I can’t recall a one that didn’t go the way it was supposed to.  When has that ever happened?  It certainly hasn’t for me.  Seems as if whenever I demonstrate something in public there is invariably some glitch or another.  These are not beginner presenters, and they put an extreme amount of preparation and attention to detail into everything that they do.  Completely, utterly professional.

You might think that with this extreme level of skill comes extreme levels of egotism and lack of patience.  Nothing could be further from the truth.  There were no stupid questions the entire week – well, there were of course, but the response from Paul or Kimberly usually started with “Excellent question…” even when the question wasn’t so excellent.  They simply know how to teach, and are approachable, humble, and patient.

Anyone, from the accidental DBA (there were actually several in our class) to the experienced MVP, will gain significantly from this training.  Not everyone has 3 weeks and $20K to spend on MCM training, the first week of which is Paul and Kimberly on similar material.  Is Immersion the same as the first week of MCM?  According to Kimberly it is not quite as deep, which surprised me… I mean, how much deeper can you go? ;-) There is also no test at the end of the week at Immersion, so probably a lot less pressure as a result.  However, if you are thinking about going after the MCM and want to stick some toes in the water first, I suspect Immersion will give you a good idea of what to expect, as well as serve as a serious primer for much of the first week.

What I think it comes down to is that Paul and Kimberly truly enjoy what they do.  They have a passion for sharing the unique knowledge they have accumulated over the years while working at Microsoft and consulting with some of the largest SQL Server enterprises in the world, and it shows.  The SQL Server community should be thankful that they are so willing to share this knowledge, and that they are able to do it so well.  I count myself privileged to have been able to spend a week with them, and learn many things that will undoubtedly help me and my company, today and years into the future.

This was the only public Immersion event they will hold in the U.S. this year.  Keep a close eye on the SQLskills.com upcoming events page for announcements about future events.  I believe there is one in Dublin later this year, but I'm not seeing it on the site, so it may be sold out.

Tuesday, March 9, 2010

SQLSaturday Charlotte: Lessons Learned

We just wrapped up what have been the busiest, most stress-filled, yet rewarding couple of weeks in recent memory.  Overall the feedback we've received about the SQLSaturday Charlotte event has been amazing, thanks to all for the kind words that have been tweeted and blogged thus far!  I do think for the most part the event went off about as well as it could have despite our lack of experience planning technical conferences.  However, we definitely learned a lot that we'll be able to apply should we ever do it again.  I had many questions when I first started working on this event, so I wanted to get my thoughts down while still fresh with hopes that it might be helpful for future SQLSaturday planners.

Scheduling Concerns

One of my primary tasks was trying to coordinate the flood of speakers and sessions we had for the event, a flood created by none other than CSSUG president and SQL Sentry partner channel manager Peter Shire, whose skills at promoting can only be described as P.T. Barnum-like.  (I thought for sure that P.T.'s first name must have been "Peter", but turns out it was "Phineas")  Peter did a phenomenal job securing the support of an amazing list of 35 speakers, and getting the word out to the SQL Server community to the tune of about 400 registered attendees, 280 of whom actually made it to the event.

The starting point for the schedule planning was the facility.  The Microsoft campus here is a fantastic facility in a great location close to the airport, downtown, etc., however, it wasn't necessarily designed to run an event like this.  The initial room breakdown was:

Building AP1:
     Room 1: 200 seats
     Rooms 2,3,4,5: 20 seats each
     Cafeteria: 220 seats

Building AP2:
     Room 6,7,8: 50 seats each

The two buildings are about a 3-minute walk apart.  Because of the relative size difference between the largest and smallest rooms, my biggest fears were putting a hot session in one of the smaller training rooms causing massive overflow, and putting a not-so-hot one of the big rooms… either of which can exacerbate the other.  The original layout had us using the 4 training rooms in AP1, the same building with the cafeteria and the 200 seat room, however, about a week before the event we got word from Microsoft that we'd been bumped by some internal training and had to use 4 training rooms in AP2 instead.  So that left us with:

Building AP1:
     Room 1: 200 seats
     Cafeteria: 220 seats

Building AP2:
     Rooms 2,3,4,5: 20 seats
     Room 6,7,8: 50 seats

The AP2 training rooms are an exact mirror of those in AP1, however it seemed to make for a very lopsided layout since it left us with a single large room in one building, and 8 rooms in the other building.  This caused concern since we'd been counting on a lot of interplay between the 80-100 people in the training rooms and the 200-seat room where I'd put all of the "hottest" sessions.  We didn't understand why the other group couldn't just use the AP2 training rooms, but eventually found out – they were doing lab sessions, and apparently the workstations in AP2 aren't the same class as those in AP1.  We weren't using the workstations so it didn't really matter to us, but apparently it did to them ;)

As it turned out, it probably did end up causing lower attendance for the featured sessions and some overflow in some of the smaller rooms, mainly because some seemed reluctant to leave AP2 when there were so many other good sessions going on next right there.  But did it even come close to ruining the conference?  Certainly not.

Another issue with the change was that we didn't get the new room names in time to put them on the signage, so we ended up using numbered rooms that were offset from the tracks by one – Track 2 was in Training 1, Track 3 in Training 2, etc.  This was very confusing, even to me.  At one point I even directed Wayne Snyder into the wrong room for his session.  Sorry Wayne!

The first thing I did was try and categorize the 80+ submitted sessions so I could get a better handle on the actual areas of interest we had to work with.  What I was really looking for at that point was whether or not we could have any focused tracks.  I ended up with 4 major categories and 24 subcategories:

Category – Subcategory Session Count
Admin - Broker 2
Admin - Clustering 2
Admin - DB Design 2
Admin - Hardware 2
Admin - Maintenance 1
Admin - Memory 1
Admin - Performance 11
Admin - PowerShell 5
Admin - Replication 3
Admin - Troubleshooting 2
Admin - Virtualization 2
BI - Dev 5
BI - SSAS 1
BI - SSIS 5
BI - SSRS 4
Dev - DB Design 2
Dev - Performance 4
Dev - Tools 2
Dev - TSQL 2
Dev - XML 1
General - Career 2
General - General 4
General - Social 3
General - Tools 2

It was apparent that although for a couple like Performance and BI, focused tracks might work, but for others we only had one or two sessions so trying to do clean, interest-based tracks would be fruitless.  I should mention that although for some sessions the categorization was easy, for others it was not clean at all, so I just took my best stab at it.  Even so, there were a few others that seemed to defy categorization, like Kevin Kline's "Top 10 Mistakes on SQL Server" and Sergey Pustovit's "SQL Server 2008/R2 Overview", and those ended up in "General – General" ;)

I was also hoping that by having these categories, I'd be able to send them out to a few people that had planned other recent SQLSaturdays and ask them to order the list by most to least popular at their event.  I got some extremely valuable feedback from those planners, but since they didn't have accurate attendance numbers for all topics we didn't get any rankings back.  Since this info could have been invaluable to me and helped to avoid some of the session/room mismatches we experienced, I'm posting all of our attendance numbers here.  Bear in mind they are only estimates and not exact.  If you see anything that looks way off, let me know.

Next, to my sessions spreadsheet I added a few columns, and used them to come up with a universal rating measure to help decide where sessions should go.  Here they are:

Hot Topic – Yes/No – Is the topic particularly popular right now, and are there a lot of people talking about it in the media?  For example, I don't think that anyone would argue that PowerPivot is a hot topic right now.  If Yes, 2 points were added to the rating.
Premiere Speaker – Yes/No – Certain speakers have significant name recognition, and will draw attendees regardless of the topic.  If Yes, 3 points were added to the rating.
Appeal – This is a 1-10 point value added to the rating for how broad of an audience would possibly be interested in the topic, or the size of the base population.  For example, PowerPivot may indeed be a "Hot Topic", but if the potential base of users is small the appeal number will be lower.

I used the above to calculate a simple rating value, and the higher the number the bigger the room.  Certainly not very scientific, but probably good enough for our purposes.  Ultimately there are so many other variables involved, IMO if you try to be any more exacting you run the risk of skewing things too far one way or another.  For example, if I'd actually tried to estimate the number of "BI" users or "Admin" users that would show in lieu of the general "Appeal" number, and my estimates for those populations were off, it could cause trouble.  Now, keep in mind that this approach is really most applicable to a first event, where you really have no idea who is going to show up.  Next time we'll have a much better idea of the different populations here in our region and will use actual numbers as a basis for predicting session popularity.

Another idea that came up was surveying people in advance to see which sessions or categories they are most interested in.  If we had had more time, and/or if there was a capability to do this built into the sqlsaturday.com website, we likely would have made use of it.  As it was, there was just too much else going on with the planning to even think about attempting something like that.

Below I'll discuss a few of the bigger session surprises in a bit more detail.

Hotter-Than-Expected
TSQL
Specifically Mike Walsh's "You Can Improve Your Own SQL Code" and Geoff Hiten's "Bad SQL" both ended up in a smaller 20-seat room, and both could have easily been in a 50-seat room.

SSRS/SSIS
The 25-seat room we had set aside for these BI topics had 46 people crammed into it at one point.  I didn't put this track into one of the larger rooms mainly because I just wasn't quite convinced it would compete with Performance, Virtualization, PowerShell, and other "hot" topics I had in the 50-person rooms.  I also had a lot of people that know telling me that BI is always a smaller crowd at these things, so I went with our only "in between" space, the 25-seat conference room that we knew could handle up to 35 with some standing.  Good call for not putting it in a 20-seat room I guess, but bad call for not giving it a 50-seater.  If you were one of those that had to act like a sardine for an hour, my sincerest apologies, we'll give you more room next time.

Not-So-Hot
Virtualization

This one shocked me.  With as much talk (and uncertainty) as there is right now about this topic and SQL Server, I was worried that a 50-seat room would overflow for Aaron Nelson's and Denny Cherry's sessions.  I was way off.  They would have both been fine in one of the 20-seat rooms.

SSDs (Solid State Drives)
Again, I was shocked when Kendal Van Dyke's session wasn't packed.  Great, well-known speaker with what I thought would be a hot topic, but it would have been fine in a small room.

Data Compression
The two sessions we had on this topic were the most lightly attended of all, only 4-5 people in each.  This was in a smaller room already, but I had expected to see more interest since it's new to 2008 and disk space is always a concern, however the Enterprise-only caveat or the other concurrent sessions may have been a big factor. 

Too Much of a Good Thing?

The "good thing" in this case being PowerShell.  I guesstimated that this would be a hot topic as it has only seemed to gather steam with DBAs over the past year or so, and with remoting and other cool features in Windows Server 2008 R2 and PowerShell 2.0 there's a lot of new stuff to learn about as well.  Turns out I was right… kind of.  Aaron Nelson's first two sessions in a 50-seat room were pretty much full, but Allen White's two sessions immediately following were, well, not so much.  Again, it certainly wasn't Allen as a speaker since he's highly regarded and authoritative on this topic.  However, I did hear from more than one person that two PowerShell sessions probably would have been sufficient for this crowd.  I think Allen was a little disappointed, but being a glass is half full kind of guy, he said it was good practice for TechEd.  Thanks Allen!

Premature Raffling

One of our biggest mix-ups of the day was the fact that we started the vendor giveaways right after the last session was supposed to end at 5pm.  We still had a session going in the other building that went a bit long, so those attendees and the speaker missed out on the raffles.  To make matters worse, that particular speaker would have won an iTouch had he been present!  Sincere apologies here.  We felt bad, and he did seem a bit upset -- I should have reminded him that he won a WinMo device at our very first SQL Sentry giveaway several years ago, although I'm not sure it would have helped at that point ;)

Bottom line, ensure all sessions have finished and folks have had time to get to the raffle before you start.  Apologies to anyone else that missed out on this – if you did, send us an email and we'll send you either a SQL Sentry T-shirt, USB 2.0 hub, or iPhone cover.

Where'd the Speaker Go?

We had a particularly well-attended session on IO Performance… only problem is we had no speaker.  I made a quick call over the two-way to registration to see if anyone had seen this particular speaker.  They had not.  I let the attendees know this, and they quickly filed out to look for another session, some with looks of disappointment.  As it turns out, the speaker had cancelled a couple of weeks earlier, but apparently it got lost in the hundreds of other emails that were flying around about the event.  Mistakes like this happen and stuff just gets missed sometimes… which is why there should have been a mechanism in place to catch this.  Had we done a speaker roll check earlier in the day we would have caught it, and probably could have gotten another speaker to fill in, or if not, at least made attendees aware of the cancellation in advance.  We just assumed all speakers were there – not safe with 35 speakers.

Bear in mind that some speakers may show up later if their first session isn't until later in the day, so you may not want to check roll first thing in the a.m., but either way it's generally an easy matter to at least confirm whether or not the speaker was in town.  In this case they were not.

Plan for Cancellations

Jorge Segarra (aka, @SQLChicken), one of the organizers for the Tampa SQLSaturdays told me to expect between 5-7 cancellations.  I thought that sounded high – how can that many speakers commit for something like this, then just cancel?!?  Well, he was right.  There were many different reasons we heard, but what can you do.  Fortunately we had enough speakers with enough sessions to make up for it.  I think the 5-7 backups is a good number to expect for an event of this size, probably less for smaller events.  If you saw a session on the schedule that seemed a bit out of place for the room or other sessions around it, it was probably a backfill from a cancellation.

A Final Word

It's important to note that a lot of these scheduling dilemmas were due to the broad range of room sizes we had here.  If your event has 300 people and 4-5 rooms that hold 75 people each you've got a lot more margin for error, and you're probably not going to need to go to this level of detail.  Anyway, I hope some future SQLSaturday planners will find something here that's useful.  If any of you planners have any questions about what we did or why, I'll be glad to help however I can.  Please post them here or shoot me a DM or email.

A Huge Thanks to All

I wanted to take this opportunity to thank all of those that helped with the planning and organization for this great event.  First, I wanted to thank Bill Walker, head of the SQL Server CSS team here in Charlotte, and his counterpart Lynne Moore for their immediate and unwavering support.  From day one, they threw all of their considerable resources behind the event and without them it just wouldn't have happened.  All of us were truly amazed at the forces they were able to mobilize so quickly to just get it done.  I also wanted to thank Sergey Pustovit, Chris Skorlinski and Evan Basalik from the SQL Server CSS team for putting on some great sessions, and all of the other CSS members who volunteered their time whose names I don't know – they ran the "SQL Clinic" booth and helped out in innumerable other ways.

Next I wanted to thank all of the people here at SQL Sentry who volunteered their time, in no particular order -- Peter Shire, Karen Gonzalez, Nick Harshbarger, Brooke Philpott, Jason Hall, Natalie Wieland, Jason Ackerman, Steve Wright, and also Ken Teeter for the great photography.

Again in no certain order, sincere thanks go out to Jorge Segarra, Aaron Bertrand, Andy Kelly, Grant Fritchey, Tim Ford, John Welch, Rafael Salas, Geoff Hiten and of course Andy Warren for allowing me to pick their brains during the planning process.  Not sure what I would have done without their input and words of wisdom.

And last but certainly not least, thanks to all of the speakers who came from near and far to be here.  We are greatly appreciative that you decided to take the time to provide such a fantastic educational opportunity for our SQL Server community here in the Carolinas.

Wednesday, February 24, 2010

SQL Sentry v5.5 Need to Know

SQL Sentry v5.5 Release Candidate 2 was just published to the website.  Although it's a dot release, don't be fooled, v5.5 stands to be a watershed release for us in more ways than one.  Not only does it represent the culmination of many months of hard work by the team, but it contains what have easily been the most heavily requested enterprise features since the original release of Performance Advisor for SQL Server in 2008:  the Global Server Status View and Performance Reporting.  There are also many very cool secondary features such as Automatic Disk Misalignment Detection, Mount Point Support, as well as a variety of performance optimizations.  Read on for more details…

Global Server Status View

Finally, you can view key performance metrics for more than one server at a time from the SQL Sentry Console!  You simply right-click the Devices root node, or any device group node, and select the "Open Performance Advisor Overview" context item:

open_pa_overview

There are a couple of different ways to look at the data.  The default is Windows metrics at the top level:

overview_windows

From there you drill in to see metrics for each SQL Server or SSAS instance on the server.  Just like on the PA dashboard, we also show how much Windows resources are being consumed by the SQL Server and SSAS processes for network, CPU, and memory:

overview_windows_instance

The level below that shows the actual instance metrics, which will of course change depending on whether you’re looking at SQL Server or SSAS.

overview_windows_instance_expanded

If you only want to see SQL Server and/or SSAS metrics without the Windows, simply uncheck Windows in the dropdown at top.  This gives you a very clean look across all SQL/SSAS instances in your environment:

overview_instances

As shown in the shots above, if a performance metric exceeds commonly accepted preset thresholds (such as Page life expectancy < 300 secs), the cell and associated server name will turn red to highlight it.  And to answer the question you are about to ask… no, there is currently no way to customize this, but this is coming ;-)  For this release the primary focus was getting the alarming framework in place with some basic presets.

However, to hold you over until, you can set custom filters for any column or combination of columns.  For example, if you want to see only servers with "Transactions/sec > 100 OR Blocks > 0 OR Deadlocks > 0", simply click the column header and set a filter:

overview_filter_set

As the view auto-refreshes and metrics change you’ll see server rows pop in and out.  This way you can stay focused on the servers requiring your immediate attention, based on limits you’ve defined.  Here's what a filter looks like once set:

overview_filter_multi

Note there are now only two visible servers, the rest are hidden by the filter.

When you combine this with the ability to adjust the interval displayed, filtering becomes an even more powerful feature:

overview_intervals

Let me elaborate.  By default the grid shows the last raw sample value collected, but since many metrics are sampled every 10 seconds there may be temporary spikes that you want to ignore.  You may not want the filtered server list changing every refresh.  If you simply increase the interval to anything other than "Last Sample" -- say "10 Minutes" -- the metrics displayed will be last calculated average over that interval.  This can effectively smooth out many spikes, and thus prevent a lot of volatility with the server list.

Even without the use of filters, this is a great feature for getting quick snapshot of how all servers have been performing over different ranges, up to the past 3 days.

One last item on the global view -- if you don’t care for the default dark skin, you can change to a standard light grid color using a new user preference:

overview_light_skin

Performance Reporting

There has always been an enormous amount of valuable performance data contained in the SQLSentry database, but aside from the PA dashboard there really wasn’t an easy way to get at it.  With the new SSRS-based performance reporting, now there is!  To see all of the new reports, select "Reports –> Performance Advisor" from the menu bar:

pa_reporting

Here are a couple of quick shots of the "Waits Analysis" and "File Utilization by Database" reports:

reports_wait_stats  reports_file_database

Not only do we have a great list of the most commonly requested reports, but we have two ways to generate completely customized reports – the Performance Counter History report, which lets you select any combination of servers, counters and instances:

reporting_params_counter_hist

reporting_custom

…as well as the Performance Counter Date Range Comparison report, which lets you do pretty much the same thing, but using two different date ranges.

One of the coolest aspects of our custom reports is that you’re not limited to only standard Windows perf counters – you can report on any of the performance data we collect, including data from DMV’s, WMI, and even internally calculated metrics like Waits by Category!  And just like the PA dashboard, you can select any date range, from 3 minutes to 3 years… and it won't kill your server in the process.  A lot of effort has been put into ensuring that the reports generate quickly and will minimally impact the performance of the SQLSentry database.

You can save any of the custom reports by clicking "Save" on the parameter selection window, as well as deploy the reports to any SSRS instance using the "Deploy Reports" menu item.

Disk Activity View Enhancements

We’ve introduced two significant enhancements to our patent-pending Disk Activity view:  Auto-Misalignment Detection and Mount Point support.  Which you will like best will depend a lot on your environment.

Automatic Disk Misalignment Detection

As you probably know, misalignment is no longer an issue with new Windows Server 2008 installs since it uses a default offset of 1024KB.  However, what we have found is that many servers out there are either still running Windows Server 2003, or were upgraded from Server 2003 to 2008.  As a result this is still a big issue in many environments, and we’ve just made it dramatically easier to detect – you simply watch a server with PA, then look at the Disk Activity screen.  If the logical disk has a red background, it means the offset is suspect:

disk_misalignment

PA shows you exactly what the offset is in the tooltip for the logical disk.  I wish I could say we will then auto-fix it for you, but you probably realize that would not be possible ;-)  However, on this same screen you can clearly see whether or not latency is actually a problem on the misaligned disk, and from there determine whether or not you should actually go through the process of moving database files and rebuilding the partition with a proper offset.  Here's another server where all partitions are misaligned, and where the disks are experiencing pretty severe latency:

disk_misalignment_multi2

BTW, if you haven't seen this screen before, all of those color-coded bars are database files -- data files and transaction logs.  Same color means same database, diagonal hash marks mean transaction log.  Bottom connector pipe for each disk is writes, top pipe is reads.  Wide red fill in the pipes means high latency, thin green means IO activity but acceptable latency.

For more details on the performance implications of disk misalignment and what you can do about it, check out Microsoft's Disk Partition Alignment Best Practices for SQL Server whitepaper and Paul Randal's blog post on this topic.

Mount Point Support

Another common request that we’ve been getting more and more over the past year or so is for mount point support, in large part because of the explosion of VMs and associated LUNs in the environment.  Now when you look at the Disk Activity screen, you’ll see mount points displayed alongside all other disks:

mount_points

PA shows the same latency and volume information for mount points and database files on mount points as it always has for other disk types.

Technically these disk-related changes were no small feat to pull off, but once again the dev team rose to the occasion!

Performance Optimizations

We are always looking for new ways to increase the performance of SQL Sentry and minimize our footprint in the environment… a footprint, I might add, that is already dramatically lower than any competitive product.  (We will actually be proving this very soon… stay tuned.)  However, in this release we’ve made some changes that would be considered significant, even by us.

Redesigned Threading Model

Two changes have been made to the threading model.  Previously, dedicated threads were spawned for each monitored server, and each would consume memory, so the more servers you were monitoring, the more memory would be used by the SQL Sentry Server service process (you need one of these services for approximately every 100 or so monitored servers).  A new thread pooling system is now in place that eliminates most of these dedicated threads, dramatically reducing the process’ memory footprint and increasing scalability at the same time.

Next, for some of the collection subsystems we’ve moved from a fork/join threading model to more of an asynchronous work queue model.  Fork/join threading is similar to parallelism with SQL Server.  There are some benefits to this approach, but a primary drawback is that the process can only be as fast as the slowest thread.  The new model effectively eliminates the possibility of collection delays caused by offline servers, network issues, etc., that we’ve sometimes seen affect other critical SQL Sentry processes like chaining and alerting.

Reduced Plan Cache Footprint

Prior to v5.5, ad hoc SQL was being used in some places by both the SQL Sentry Console and Server, and in certain scenarios over time this could lead to plan cache bloat and some buffer pressure on the server where the SQLSentry database was installed.  In v5.5 we’ve eliminated almost all uses of ad hoc SQL, which dramatically reduces the plan cache footprint of the SQLSentry database.  This will not impact monitored servers, only the server where the SQLSentry database resides (unless of course it is monitored as well ;-)  After you upgrade, I recommend that you run DBCC FREEPROCCACHE on that server so the buffer can immediately reclaim any of the space previously used for the ad hoc cache.

I really hope you enjoy using v5.5.  New users can get bits here, existing customers please login to your portal account.  As always, we fully support upgrading to this build from any earlier SQL Sentry version, and from this build to the final release of v5.5.  Thanks to everyone who has shared their ideas with us, our great beta testers, and of course to the entire SQL Sentry team for your efforts in finally making these great new features a reality!

Friday, February 5, 2010

Analysis Services and the Windows File System Cache

In my last post I revealed a problem one of our customers encountered when SSAS memory limits were inadvertently set to KB instead of bytes, leading to abnormally low limits and some pretty serious performance issues.  I alluded to a “saving grace” that prevented things from being much worse than they actually were, and that’s what I’ll cover in this post.

A little background first, or rather an expansion of something I touched on in my original post -- unlike the SQL Server Database Engine, much of the SSAS data can be resident in physical memory by virtue of the Windows file system cache.  This is simply because SSAS databases consist of a collection of Windows files of various types, and SSAS does not directly manage memory/disk access to those files using Windows APIs as complex memory managers like SQL Server do.

This means that even if the SSAS internal caches are effectively non-existent, you can still have a significant amount of raw SSAS file data loaded into RAM which will not show up as part of the SSAS process memory, or in any way be governed by the SSAS memory limits.  This was the saving grace here.  If the storage engine actually had to read the files from disk for every query, performance would have been significantly worse than it already was.

I’m going to use the some Performance Advisor for Analysis Services dashboard charts to fully illustrate this phenomena.  Since the customer’s server has 32GB of RAM, I’ve blown up the System Memory chart so it’s a little easier to see what’s going on: 

memory_file_cache

The thin grey line shows the physical memory used by the file cache (about 700MB), and the even thinner orange line shows the memory used by the SSAS instance with invalid memory limits (about 450MB).  The large teal-colored area is memory used by a SQL Server instance on the same server.

Assuming there wasn’t much file serving happening on this server, most of the file cache is likely comprised of SSAS database files.  Without file serving activity, usually you will not see the file cache get this large, even though on 64 bit machines like this one it is capable of growing to use up all available memory.  (On 32 bit systems the max file cache size is about 1 GB.)  Read more about the Windows file cache here.

Ok, interesting theory, but how can we confirm that SSAS was able to get it’s data from the file cache?  Fortunately the SSAS Storage\File Read/Write charts make this is pretty easy:

memory_file_read

The File Queries chart on the right shows SSAS is querying database files at a rate of about 20 per sec, and the File Read KB chart shows that even though there is a lot of file read activity on the system (about 40MB per sec), almost none of it is coming off of disk.  This is clear because the dashed orange Total Disk line is flat, and there are also no Windows read faults at that time:

memory_read_faults

Anytime SSAS files are read from disk into physical memory, you should see hard faults.  To better demonstrate this, take a look at a similar chart for a different environment:

memory_file_read_high_disk 

The off-white Total System series shows ALL file read activity flowing through the Windows system (both from memory and disk), the light orange Total SSAS series shows the portion of total read activity directly related to the SSAS process (barely visible here), and the blue series shows the portion of SSAS read activity from data files.  The dashed Total Disk read line is overlaid on top of the stacked area chart so you can quickly see how everything correlates.  It’s clear that a sizable chunk of the SSAS data file reads are coming off disk.  Also note how this time we have significant Windows read faults:

memory_file_read_high_faults

To get a different look at the data as well as the actual file cache hit ratios, if we click on a point on the chart in the middle of the read activity spike, then click the Sample button on the toolbar it loads up the data for that specific point in time:

memory_file_read_high_disk_sample 

What this shows us is that approximately 81% of the file read activity for this sample came from the file system cache.  This is not bad, but in general you always want Total Disk to be at or near zero, which will be the case whenever SSAS is pulling data from either its internal caches or the file system cache.

How can you tell whether the data is coming from SSAS’s internal caches versus the file cache?  That’s easy -– you won’t see any storage engine file queries or associated file read activity at all, and the SSAS cache hit ratios (visible in sample mode, or when using the new Performance Advisor v5.5 reporting) will be high, such as in the shot below where the measure group cache hit ratio is 99%:

memory_high_ssas_cache_hit_ratios 

In addition to the cache hit ratio meters, there is also a visual hit ratio indicator right on the column chart -– the “fullness” of the columns reflect the hit ratios.  The more gray fill you see between he column hash lines, the higher the hit ratio:

memory_high_ssas_cache_hit_ratios_low 

Note that the measure group column is only about half full, reflecting the 52% hit ratio.  The calculation and flat cache hit ratios are both zero, so those columns are empty.  Clever, eh? ;-)  Perhaps even cleverer, on the measure group column there are two types of hash lines, horizontal and diagonal -- the horizontal hashes represent direct cache hits, while diagonal hashes represent filtered hits.  These occur when existing cache entries have to be filtered first to obtain the desired data, and are slightly higher overhead than direct hits.

In an ideal world, aside from when the cache is first warming up, you don’t want to see SSAS file queries at all, except perhaps for the occasional aggregation hit, which is also shown on the File Queries chart.  Bottom line, if you see file queries there were cache misses.  Here’s a quick custom Performance Counter History report using Performance Advisor v5.5 which shows this clearly :

memory_hit_ratio_report 

Even so, the Windows file cache hit ratio may be 100% if all of the referenced files were in the file cache.  But in that scenario, as we saw in the last post, even though performance will be better than if they were read from disk, it will not be nearly as good as if the data had been found in the SSAS cache because of the additional overhead involved when the storage engine retrieves file data from Windows.

If you want to see how your Analysis Services system is performing in this regard, you can download a fully functional 15-day trial of Performance Advisor for Analysis Services here.  Cheers.

Friday, June 19, 2009

Analysis Services Memory Limits, Part II

In my last post I discussed some of the basics of SSAS memory management -- how it utilizes limits to control the amount of memory consumed by the SSAS process, and how the memory limits are set as a percentage of the total physical memory.  I also mentioned that we had a beta tester who reported what they thought was an odd graphical anomaly on the Performance Advisor for Analysis Services “Memory Usage” chart, and I offered up a challenge to see if anyone could explain it.  (Details about the winner at bottom.)

I’ll begin with an admission -– I intentionally left out a rather critical aspect of memory limit configuration.  The omission was that, like several other SSAS settings, once the configured value goes over 100% it changes to an explicit setting.  In the case of memory limits the explicit setting is translated as bytes.

Unfortunately, for some reason this is not covered in the SSAS documentation… however, it is covered elsewhere, including this doc: 
SQL Server 2005 Analysis Services (SSAS) Server Properties

In this case, the first thing we had the tester check was the actual configured setting values, as well as the values being returned from the associated Windows performance counters as a sanity check.  Here is what they sent back:

Memory Limit Setting –> Configured Value (bytes) –> Perf Counter Value (KB)
Memory\LowMemoryLimit –> 2097152 –> 8192
Memory\TotalMemoryLimit –> 13531488 –> 13312

Hopefully the problem is starting to become clearer.  Apparently the configuring user had tried to use an explicit amount of memory, however they had inadvertently entered kilobytes (KB) instead of bytes!

These values translated to approx 8MB and 13MB for the Low and Total limits respectively, which explains what they were seeing on the chart.  It was no graphical anomaly.  (Note that even though the configured Low value should have been approx 2MB, apparently the minimum SSAS will use is 8MB, which explains the associated perf counter value of 8192 KB.)

Ok, so we figured out what was going on on the chart, but what about the impact these abnormally low settings would have on SSAS performance?  As I mentioned before, this client had had many problems with SSAS performance, and as a result had engaged Microsoft to come in and help.

Here’s where things get very interesting.  First, take a look at the dashboard shot the client sent to us, which shows the memory issue along with the execution of a single MDX query that ran for a couple of minutes:

memory_limit_issue(click to enlarge)

Here’s a closeup of the Memory Usage chart:

mem_cleaner_anomaly

What’s wrong with this picture?  Well, a couple of things.  First, SSAS memory usage is almost completely flat, and is almost entirely nonshrinkable.  Typically SSAS memory will grow and shrink continuously as memory is allocated and deallocated for querying and processing operations, and will stabilize with a much larger base proportion of shrinkable memory than we see here.

To illustrate this better, on a test SSAS instance I configured the limits artificially low and let it run for a while, then increased the limits much higher so that there was no memory pressure:

mem_cleaner_increased_limits

The same system, zoomed into a 10 minute range:

mem_cleaner_no_pressure

What the flat memory line means is that the unusually low limits are effectively preventing SSAS from loading any data into either of its caches -- the Formula Engine (FE) cache (aka, “query engine cache”) or Storage Engine (SE) cache.

This is further confirmed by looking at a few other charts.  Notice the Cache Activity chart shows almost continual evictions and memory shrinkage while the query is running:

cache_activity

This is not normal!  You rarely want to see these values above zero, because when you do it means that memory cleaner threads have kicked in because of internal memory pressure and are moving data out of memory.  In this case the data is being moved out as fast as it is moved in.

Also have a look at the Avg Time chart:

avg_time

What this shows is that SSAS is spending almost all of it’s time on non-cached storage engine operations (SE non-cached), which is where you typically want it to spend the least amount of time.  This means that every time a query needs data it’s unable to find it in the FE cache, so the formula engine issues requests to the storage engine, which is unable to find the data in its cache, so it has to go to the file system to get it from either measure group partitions or aggregations.  Partitions can be orders of magnitude slower than retrieving the data directly from cache, and is almost certainly why the query took so long.

As an aside, originally I’d actually called this the “Wait Time” chart since what we are showing here is similar to what you see with SQL Server wait types, but Carl Rabeler convinced me otherwise since what the chart is actually showing is “time spent” on the different operations, versus time spent waiting.  You could probably get into a semantical debate about this since many of the SQL Server wait types pretty much show “time spent” as well… but it’s just not worth any potential confusion from injecting some Database Engine terminology into the Analysis Services lexicon, no matter how tempting it may be ;)

In any event, this chart is critical because you can see exactly where SSAS is spending most of it’s time over any date range, which can be extremely important for profiling a server and determining where the bottlenecks lie.

Next have a look at the File Queries chart, the Total metric in particular:

file_queries

As I mentioned, the formula engine generates requests to the storage engine for measure group data when it’s not in the FE cache.  The total queries metric shows how many times the storage engine had to go to the file system to read data from a measure group partition (or aggregation) when the data wasn’t in the SE cache.  Here again is another value you typically want to be as close to zero as possible, because when it’s >0 it means that SSAS couldn’t find the data in either the FE cache OR the SE cache.

Note that you will see values >0 when an aggregation is first hit and loaded into the SE cache, since Total covers both Agg hits and partition queries.  But since agg hits are much lighter weight than partition queries this is generally not a problem.

It is also perfectly normal to see values >0 when the cache is cold and queries first start coming in.  For example, if you process your cubes overnight the associated cache data will be invalidated, and you’ll probably see file queries spike up in the morning when users come in and start issuing MDX queries, then settle down as the data is cached.  If you see that total queries are continually above zero throughout the day, memory pressure is a likely culprit.

Impact on MDX Query Performance

So what happened after the invalid memory limit settings were corrected?  Here is some actual query data from the Performance Advisor Top Queries tab which our beta tester was nice enough to export for me:

query_performance (click to enlarge)

The queries in red ran with the invalid low memory limits.  Notice every single one had to go to the storage engine for the data every time.  All had 53 non-cached operations, high cpu cycles, durations of >10 seconds, and all spent 94-97% of that time in the storage engine, for obvious reasons.

Once the limits were changed, the 2nd execution of the query (green) took about 1.5 seconds, with zero non-cached ops, very low cpu, and 100% of the time was spent in the formula engine!  This is a fantastic example of how things are supposed to work when there is memory available and limits are configured properly.  Most future runs of this query actually dropped under our default trace filter of >=1000ms, and so no longer showed up at all in Top Queries.

Note that the first run of the query after changing the memory limits (orange) actually took longer than the runs with the invalid limits, although everything else was about the same.  At first I was puzzled by this, but then it occurred to me that this may be because SSAS was finally able to actually load the data into cache, and the extra 6 seconds may just be overhead related to the associated memory allocations.  Or it may have just been an anomaly related to other activity on the server at the time, not sure.

Even though the performance before the change was pretty bad, things actually could have been much worse.  There was one very big saving grace for this client that allowed SSAS to run as well as it did with the invalid memory limits.  It’s actually very clear on the dashboard what that is, but I’ll save that one for another post ;)

Mini-contest Winner!

Congrats to Norbert Kessler from WebTelligence, a SQL Server consulting firm in Germany and one of our beta testers, for nailing the answer pretty much right on the head.  He wins the free license of Performance Advisor for Analysis SerivcesJohn Welch was very close, and actually brought up another pretty serious misconfiguration -- when the high and low limits are configured with the same value.  This is a common practice in the SQL Server world, but a big no-no with SSAS.  Another topic for a future post perhaps…

Monday, June 15, 2009

Analysis Services Memory Limits

One of the least understood areas of Analysis Services performance seems to be that of memory limits – how they work, and how to configure them effectively.  I was going to start with an overview of Performance Advisor for Analysis Services (aka, PA for SSAS), but during the beta I ran across such a good example of the software in action in diagnosing a problem with memory, I think it’ll work just as well to go through that issue step-by-step.

First some background.  SSAS has two general categories of memory, shrinkable and nonshrinkable, and they work pretty much like it sounds.  Shrinkable memory can be easily reduced and returned back to the OS.  Nonshrinkable memory, on the other hand, is generally used for more essential system-related stuff such as memory allocators and metadata objects, and is not easily reduced.

PA for SSAS shows both types on the SSAS Memory Usage chart:

Cleaner MemoryIf you switch to “Memory by Category” mode via right-click menu, you get a more detailed breakdown.  Just like the chart above, the memory with hash marks is nonshrinkable:

Memory by Category

SSAS uses memory limit settings to determine how it allocates and manages its internal memory.  Memory\LowMemoryLimit defaults to 65% of the total available physical memory on the machine (75% on AS2005), and Memory\TotalMemoryLimit (also sometimes called the High Memory Limit) defaults to 80%.  This is the total amount of memory that the SSAS process itself (msmdsrv.exe) can consume.  Note on the charts above, the orange line represents the Low memory limit and the red line the Total memory limit.  This makes it very easy to see how close SSAS’s actual memory consumption is to these limits.  Once memory usage hits the Low limit, memory cleaner threads will kick in and start moving data out of memory in a relatively non-aggressive fashion.  If memory hits the Total limit, the cleaner goes into crisis mode… it spawns additional threads and gets much more aggressive about memory cleanup, and this can dramatically impact performance.

Between the Low and Total limits, SSAS uses an economic memory management model to determine which memory to cleanup.  This model can be adjusted by some other parameters related to the memory price, but I’m not going to get into those here.

When it comes to memory management SSAS is entirely self-governing, and unlike SQL Server it doesn’t consider external low physical memory conditions (which can be signaled by Windows) or low VAS memory.  This may be partly because SSAS is already much more subject to Windows own memory management than is SQL Server, since Analysis Services databases are a collection of files on the file system and can make heavy use of the file system cache, whereas SQL Server does not.  This means that you can have a significant amount of SSAS data loaded into RAM by virtue of the file cache, and this will not show up as part of the SSAS process memory, or in any way be governed by the SSAS memory limits.  (For this reason we now show system file cache usage and file cache hit ratios on the dashboard as well so you can see this.)

On a side note, a byproduct of this design is that if available memory is low and other processes on the server are competing for memory resources, SSAS will pretty much ignore them, and look only to its own memory limit settings to determine what to do.  This means, for example, that memory for a SQL Server using the default “Minimum server memory” setting of 0 can be aggressively and dramatically reduced by an SSAS instance on the same machine with default memory limits, to the point where the SQL Server can become unresponsive. SQL Server is signaled by Windows that memory is low, and responds accordingly by releasing memory back to the OS, which is then gobbled up by SSAS.

Graphical Anomaly, or Serious Memory Problem?

Very early in our beta for PA for SSAS, a beta tester reported what they thought was an odd graphical anomaly in the bottom right of the SSAS Memory Usage chart:

mem_cleaner_anomaly

To the untrained eye, this would certainly look like an anomaly, but once you’ve seen what this chart is supposed to look like, it’s fairly obvious that it’s an “H” on top of an “L”.

This particular tester had been eager to get the beta because they had suffered from poor performance and other strange behaviors for quite some time.  They had actually already scheduled Microsoft to come in and help troubleshoot the issues.

What exactly did this “anomaly” mean?  What caused it, how it was resolved, and what was the net effect on performance, if any?  All great questions… and this will be the subject of my next post, so stay tuned ;-)

If anyone has any guesses in the meantime feel free to comment.  The first person to get it right will win a free license of Performance Advisor for Analysis Services! (that’s a $2495 value)