Friday, June 19, 2009

Analysis Services Memory Limits, Part II

In my last post I discussed some of the basics of SSAS memory management -- how it utilizes limits to control the amount of memory consumed by the SSAS process, and how the memory limits are set as a percentage of the total physical memory.  I also mentioned that we had a beta tester who reported what they thought was an odd graphical anomaly on the Performance Advisor for Analysis Services “Memory Usage” chart, and I offered up a challenge to see if anyone could explain it.  (Details about the winner at bottom.)

I’ll begin with an admission -– I intentionally left out a rather critical aspect of memory limit configuration.  The omission was that, like several other SSAS settings, once the configured value goes over 100% it changes to an explicit setting.  In the case of memory limits the explicit setting is translated as bytes.

Unfortunately, for some reason this is not covered in the SSAS documentation… however, it is covered elsewhere, including this doc: 
SQL Server 2005 Analysis Services (SSAS) Server Properties

In this case, the first thing we had the tester check was the actual configured setting values, as well as the values being returned from the associated Windows performance counters as a sanity check.  Here is what they sent back:

Memory Limit Setting –> Configured Value (bytes) –> Perf Counter Value (KB)
Memory\LowMemoryLimit –> 2097152 –> 8192
Memory\TotalMemoryLimit –> 13531488 –> 13312

Hopefully the problem is starting to become clearer.  Apparently the configuring user had tried to use an explicit amount of memory, however they had inadvertently entered kilobytes (KB) instead of bytes!

These values translated to approx 8MB and 13MB for the Low and Total limits respectively, which explains what they were seeing on the chart.  It was no graphical anomaly.  (Note that even though the configured Low value should have been approx 2MB, apparently the minimum SSAS will use is 8MB, which explains the associated perf counter value of 8192 KB.)

Ok, so we figured out what was going on on the chart, but what about the impact these abnormally low settings would have on SSAS performance?  As I mentioned before, this client had had many problems with SSAS performance, and as a result had engaged Microsoft to come in and help.

Here’s where things get very interesting.  First, take a look at the dashboard shot the client sent to us, which shows the memory issue along with the execution of a single MDX query that ran for a couple of minutes:

memory_limit_issue(click to enlarge)

Here’s a closeup of the Memory Usage chart:

mem_cleaner_anomaly

What’s wrong with this picture?  Well, a couple of things.  First, SSAS memory usage is almost completely flat, and is almost entirely nonshrinkable.  Typically SSAS memory will grow and shrink continuously as memory is allocated and deallocated for querying and processing operations, and will stabilize with a much larger base proportion of shrinkable memory than we see here.

To illustrate this better, on a test SSAS instance I configured the limits artificially low and let it run for a while, then increased the limits much higher so that there was no memory pressure:

mem_cleaner_increased_limits

The same system, zoomed into a 10 minute range:

mem_cleaner_no_pressure

What the flat memory line means is that the unusually low limits are effectively preventing SSAS from loading any data into either of its caches -- the Formula Engine (FE) cache (aka, “query engine cache”) or Storage Engine (SE) cache.

This is further confirmed by looking at a few other charts.  Notice the Cache Activity chart shows almost continual evictions and memory shrinkage while the query is running:

cache_activity

This is not normal!  You rarely want to see these values above zero, because when you do it means that memory cleaner threads have kicked in because of internal memory pressure and are moving data out of memory.  In this case the data is being moved out as fast as it is moved in.

Also have a look at the Avg Time chart:

avg_time

What this shows is that SSAS is spending almost all of it’s time on non-cached storage engine operations (SE non-cached), which is where you typically want it to spend the least amount of time.  This means that every time a query needs data it’s unable to find it in the FE cache, so the formula engine issues requests to the storage engine, which is unable to find the data in its cache, so it has to go to the file system to get it from either measure group partitions or aggregations.  Partitions can be orders of magnitude slower than retrieving the data directly from cache, and is almost certainly why the query took so long.

As an aside, originally I’d actually called this the “Wait Time” chart since what we are showing here is similar to what you see with SQL Server wait types, but Carl Rabeler convinced me otherwise since what the chart is actually showing is “time spent” on the different operations, versus time spent waiting.  You could probably get into a semantical debate about this since many of the SQL Server wait types pretty much show “time spent” as well… but it’s just not worth any potential confusion from injecting some Database Engine terminology into the Analysis Services lexicon, no matter how tempting it may be ;)

In any event, this chart is critical because you can see exactly where SSAS is spending most of it’s time over any date range, which can be extremely important for profiling a server and determining where the bottlenecks lie.

Next have a look at the File Queries chart, the Total metric in particular:

file_queries

As I mentioned, the formula engine generates requests to the storage engine for measure group data when it’s not in the FE cache.  The total queries metric shows how many times the storage engine had to go to the file system to read data from a measure group partition (or aggregation) when the data wasn’t in the SE cache.  Here again is another value you typically want to be as close to zero as possible, because when it’s >0 it means that SSAS couldn’t find the data in either the FE cache OR the SE cache.

Note that you will see values >0 when an aggregation is first hit and loaded into the SE cache, since Total covers both Agg hits and partition queries.  But since agg hits are much lighter weight than partition queries this is generally not a problem.

It is also perfectly normal to see values >0 when the cache is cold and queries first start coming in.  For example, if you process your cubes overnight the associated cache data will be invalidated, and you’ll probably see file queries spike up in the morning when users come in and start issuing MDX queries, then settle down as the data is cached.  If you see that total queries are continually above zero throughout the day, memory pressure is a likely culprit.

Impact on MDX Query Performance

So what happened after the invalid memory limit settings were corrected?  Here is some actual query data from the Performance Advisor Top Queries tab which our beta tester was nice enough to export for me:

query_performance (click to enlarge)

The queries in red ran with the invalid low memory limits.  Notice every single one had to go to the storage engine for the data every time.  All had 53 non-cached operations, high cpu cycles, durations of >10 seconds, and all spent 94-97% of that time in the storage engine, for obvious reasons.

Once the limits were changed, the 2nd execution of the query (green) took about 1.5 seconds, with zero non-cached ops, very low cpu, and 100% of the time was spent in the formula engine!  This is a fantastic example of how things are supposed to work when there is memory available and limits are configured properly.  Most future runs of this query actually dropped under our default trace filter of >=1000ms, and so no longer showed up at all in Top Queries.

Note that the first run of the query after changing the memory limits (orange) actually took longer than the runs with the invalid limits, although everything else was about the same.  At first I was puzzled by this, but then it occurred to me that this may be because SSAS was finally able to actually load the data into cache, and the extra 6 seconds may just be overhead related to the associated memory allocations.  Or it may have just been an anomaly related to other activity on the server at the time, not sure.

Even though the performance before the change was pretty bad, things actually could have been much worse.  There was one very big saving grace for this client that allowed SSAS to run as well as it did with the invalid memory limits.  It’s actually very clear on the dashboard what that is, but I’ll save that one for another post ;)

Mini-contest Winner!

Congrats to Norbert Kessler from WebTelligence, a SQL Server consulting firm in Germany and one of our beta testers, for nailing the answer pretty much right on the head.  He wins the free license of Performance Advisor for Analysis SerivcesJohn Welch was very close, and actually brought up another pretty serious misconfiguration -- when the high and low limits are configured with the same value.  This is a common practice in the SQL Server world, but a big no-no with SSAS.  Another topic for a future post perhaps…

Monday, June 15, 2009

Analysis Services Memory Limits

One of the least understood areas of Analysis Services performance seems to be that of memory limits – how they work, and how to configure them effectively.  I was going to start with an overview of Performance Advisor for Analysis Services (aka, PA for SSAS), but during the beta I ran across such a good example of the software in action in diagnosing a problem with memory, I think it’ll work just as well to go through that issue step-by-step.

First some background.  SSAS has two general categories of memory, shrinkable and nonshrinkable, and they work pretty much like it sounds.  Shrinkable memory can be easily reduced and returned back to the OS.  Nonshrinkable memory, on the other hand, is generally used for more essential system-related stuff such as memory allocators and metadata objects, and is not easily reduced.

PA for SSAS shows both types on the SSAS Memory Usage chart:

Cleaner MemoryIf you switch to “Memory by Category” mode via right-click menu, you get a more detailed breakdown.  Just like the chart above, the memory with hash marks is nonshrinkable:

Memory by Category

SSAS uses memory limit settings to determine how it allocates and manages its internal memory.  Memory\LowMemoryLimit defaults to 65% of the total available physical memory on the machine (75% on AS2005), and Memory\TotalMemoryLimit (also sometimes called the High Memory Limit) defaults to 80%.  This is the total amount of memory that the SSAS process itself (msmdsrv.exe) can consume.  Note on the charts above, the orange line represents the Low memory limit and the red line the Total memory limit.  This makes it very easy to see how close SSAS’s actual memory consumption is to these limits.  Once memory usage hits the Low limit, memory cleaner threads will kick in and start moving data out of memory in a relatively non-aggressive fashion.  If memory hits the Total limit, the cleaner goes into crisis mode… it spawns additional threads and gets much more aggressive about memory cleanup, and this can dramatically impact performance.

Between the Low and Total limits, SSAS uses an economic memory management model to determine which memory to cleanup.  This model can be adjusted by some other parameters related to the memory price, but I’m not going to get into those here.

When it comes to memory management SSAS is entirely self-governing, and unlike SQL Server it doesn’t consider external low physical memory conditions (which can be signaled by Windows) or low VAS memory.  This may be partly because SSAS is already much more subject to Windows own memory management than is SQL Server, since Analysis Services databases are a collection of files on the file system and can make heavy use of the file system cache, whereas SQL Server does not.  This means that you can have a significant amount of SSAS data loaded into RAM by virtue of the file cache, and this will not show up as part of the SSAS process memory, or in any way be governed by the SSAS memory limits.  (For this reason we now show system file cache usage and file cache hit ratios on the dashboard as well so you can see this.)

On a side note, a byproduct of this design is that if available memory is low and other processes on the server are competing for memory resources, SSAS will pretty much ignore them, and look only to its own memory limit settings to determine what to do.  This means, for example, that memory for a SQL Server using the default “Minimum server memory” setting of 0 can be aggressively and dramatically reduced by an SSAS instance on the same machine with default memory limits, to the point where the SQL Server can become unresponsive. SQL Server is signaled by Windows that memory is low, and responds accordingly by releasing memory back to the OS, which is then gobbled up by SSAS.

Graphical Anomaly, or Serious Memory Problem?

Very early in our beta for PA for SSAS, a beta tester reported what they thought was an odd graphical anomaly in the bottom right of the SSAS Memory Usage chart:

mem_cleaner_anomaly

To the untrained eye, this would certainly look like an anomaly, but once you’ve seen what this chart is supposed to look like, it’s fairly obvious that it’s an “H” on top of an “L”.

This particular tester had been eager to get the beta because they had suffered from poor performance and other strange behaviors for quite some time.  They had actually already scheduled Microsoft to come in and help troubleshoot the issues.

What exactly did this “anomaly” mean?  What caused it, how it was resolved, and what was the net effect on performance, if any?  All great questions… and this will be the subject of my next post, so stay tuned ;-)

If anyone has any guesses in the meantime feel free to comment.  The first person to get it right will win a free license of Performance Advisor for Analysis Services! (that’s a $2495 value)

Thursday, June 4, 2009

Performance Advisor for… Analysis Services?

It’s been some time since my last post, and several folks have asked “Hey Greg, after such a promising start, what happened?”  The short answer is, “Performance Advisor for Analysis Services.”  Yes, you heard that right.

The long answer is… not long after my last post I was having a chat with John Welch at Mariner, a great Microsoft BI consulting firm also based here in Charlotte, and I was trying explain how Performance Advisor for SQL Server might be used to help troubleshoot Analysis Services performance issues.  Let’s just say that conversation didn’t go quite as well as I’d hoped ;-)  Sure, PA for SQL shows many of the pertinent Windows counters on the dashboard, and the “Other files” element on the disk activity and space tabs will capture SSAS-related IO and space, and Event Manager handles SSIS and SSRS… but it became apparent very quickly that this was not nearly enough.  John asked, “What about temp file spillage… or file cache utilization… or aggregation usage?”

Although PA for SQL may be the best tool out there for optimizing SQL Server performance, clearly an SSAS performance tool it was not.  Needless to say, I left feeling a bit down about that point… but energized by another tidbit I discovered during the meeting: apparently there was no commercial software on the market for Analysis Services performance!  When John said this I really couldn’t believe it.  Up to that point our focus had always been on the SQL database engine and directly related technologies, and I guess I just assumed that, as big of an industry as BI is for Microsoft, that there had to be something out there.  After a lot of looking around, I confirmed that there wasn’t.

Shortly thereafter I went to the dev team and said, “Remember what I said last week about focusing exclusively on the enhanced performance alerting features for PA for SQL that we didn’t get into the initial release… well, forget that for the moment.  An opportunity just fell into our laps, and this type of opportunity doesn’t come along every day!”  Blank stares.  Although I’d like to think that by this point I’ve earned the confidence of the team, let’s just say this sudden change in direction was met with a certain amount of skepticism.

Here we are about 9 months later and within a couple of weeks of the release of PA for SSAS, and I can safely say there are no more skeptics.  When I stop to think about it I’m amazed we were able to get this software to market so quickly.  It took about two years of focused effort to build PA for SQL Server, and if we had started from scratch it would have taken at least as long to build PA for SSAS, probably longer.  The big difference was that we’d already built this fantastic architecture for collecting, managing, and presenting performance data from diverse sources, so a lot of it was pretty much plug-and-play.  The most difficult aspect of the project was definitely the R&D phase, which took several months.  If you’ve worked with both the SQL database engine and Analysis Services even a little, you know they are completely different animals that bear almost no resemblance to each other.  From the way they manage memory and disk access, to the threading models, to usage patterns, they almost look like they were developed by two different companies.  (Probably because originally they were – Microsoft acquired the original OLAP engine from Panorama Software in 1996, around the time of SQL Server 6.5.)

The goal for PA for SSAS was pretty much the same as PA for SQL Server – build a product which presents key, actionable performance metrics in an intuitive, easy-to-consume fashion.  However, due to the vast architectural differences between SQL and SSAS, almost none of the extensive R&D done for SQL Server was reusable.  Determining exactly which SSAS metrics to present and how, so that literally within a few minutes someone who may not even be very familiar with SSAS can understand how it works under the covers, was one of the most challenging undertakings I’ve ever been involved with.  I can remember leaving the office many times during that phase feeling like a tiny ant on the backside of the Analysis Services elephant.

Little by little it all came together though, and I think you’ll be very pleased with the end result.  I can honestly say that it would not have been possible without the extensive work by and encouragement from Carl Rabeler (Microsoft SQLCAT), Chris Webb, and John Welch; Brooke Philpott and our phenomenal technical team; and our great beta testers.  Many, many thanks to all of you!