Monitoring Cube Performance and Usage - Expert Cube Development with SSAS Multidimensional Models (2014)

Chapter 11. Monitoring Cube Performance and Usage

We need to be able to monitor activity on Analysis Services if we're going to be able to detect possible performance issues, to provide good service levels and to optimize individual queries or processing operations. In order to learn how to do this, first of all we need to know how Analysis Services interacts with the operating system. Resources consumed by Analysis Services are controlled by the operating system and we need to understand these relationships. After an introduction to the tools and performance counters provided by the operating system, we will look at Analysis Services Memory Manager and how it can be configured. In the last part of this chapter, we will examine how to monitor specific operations in Analysis Services such as processing and querying.

Analysis Services and the operating system

The first concept we have to understand when learning about monitoring is that Analysis Services is a process running as a service on a Microsoft Windows operating system. Running as a service, it usually starts when the operating system starts and is running even when no users are connected. The process name is msmdsrv.exe, and knowing this name is very important because it allows us to recognize the Analysis Services process in lists of all the processes running on a server, for example when we're using Task Manager as in the following screenshot. You might have more than one Analysis Services process running at the same time; later in this chapter we will see how to recognize which instance is linked to which process.

However, let's first concentrate on the following screenshot:

Analysis Services and the operating system

As Analysis Services is a Windows process, the first thing we can analyze is the interaction between that process and the hosting operating system. It's the operating system that provides a running process with the resources that it needs, and that are shared with other processes: CPU, memory, and I/O.

Resources shared by the operating system

It is important to understand the interaction between Analysis Services and the resources provided by the operating system, because at the end of the day, almost every bottleneck is caused by an abnormal consumption of resources by a process.


We will examine the behavior of Analysis Services using several tools. It is important to point out that we should run our tests when there is no other activity on our server, so we can reduce the amount of noise in our measurements. The only exception is when we're trying to reproduce real-world conditions, such as processing a partition when we have both SQL Server and Analysis Services running on the same server.


Analysis Services consumes CPU during two operations: processing and querying. Many of the operations performed by Analysis Services can scale across multiple cores. We won't go into detail here about the extent to which each single operation can scale on multiple cores, simply because this might change in future releases or service packs. However, Analysis Services can scale up because it runs multiple operations in parallel (for example, in Enterprise Edition, multiple objects can be processed in parallel), but also because a single operation (for example, processing a single partition) can be executed across several threads.

During partition processing, CPU is used mainly to build aggregations. If a partition doesn't have any aggregations designed, then CPU usage will be minimal. However, as we will explain in more detail later on, CPU consumption will only be high during partition processing if there is enough RAM available for Analysis Services, and memory paging is kept to a minimum. If we don't have enough physical memory for Analysis Services, we will see long processing times even if there is little CPU usage. In general, having high CPU usage during processing is a good thing because it generally means that there are no other bottlenecks in memory or I/O operations. If we want to reduce processing times when we already have high CPU usage, we need to either reduce the number of aggregations in the cube design or add more CPU cores on the server running Analysis Services.

During querying, CPU is used by Analysis Services to perform calculations on the data that has been requested by a query. Most of these calculations scale on a multicore server, but sometimes (and this happens more often than during processing) it might not be possible to parallelize an operation across several threads. In such cases, a single query cannot take advantage of a multicore server. For example, the Formula Engine is single-threaded whereas the Storage Engine is multithreaded. If the query has complex MDX calculations, the Formula Engine can only use a single CPU. This scenario can be recognized by observing the CPU usage of the msmdsrv.exe process while a single MDX query is executing. If there is a bottleneck in the Formula Engine, the process does not consume more than the equivalent of one core.

In other words, on an 8-core server we will see a constant consumption of one-eighth of our available CPU, that is to say 12 to 13 percent (you see a slightly higher percentage in the following screenshot because of the time spent by other processes).
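This arithmetic can be sketched in a few lines (a hypothetical helper of ours, not part of any Microsoft tool), which is handy when eyeballing Task Manager on servers with different core counts:

```python
def single_thread_cpu_ceiling(cores: int) -> float:
    """Highest total-CPU percentage a single-threaded operation,
    such as a Formula Engine-bound MDX query, can show for the
    msmdsrv.exe process in Task Manager."""
    return 100.0 / cores

# On an 8-core server, a Formula Engine bottleneck plateaus at
# about 12.5 percent of total CPU; on a quad-core, at 25 percent.
```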


In such cases, there are two ways we can try to optimize query response time. On one hand, we can try to change our cube design or our MDX query to enable greater parallelism during query execution; on the other hand, we can try to reduce the time needed for a sequential operation, for example, by reducing the size of an intermediate fact table used in a many-to-many dimension relationship, or by rewriting MDX to use bulk mode instead of cell-by-cell mode (as discussed in Chapter 8, Query Performance Tuning).

In general, Analysis Services can consume a lot of CPU resources during processing, and depending on the conditions, also while running queries. We need to bear this in mind when specifying servers for Analysis Services to run on, or when deciding if Analysis Services should be installed on the same server as other CPU-intensive applications.


Analysis Services uses memory for a lot of different purposes. It is not a goal of this book to explain all the details of Analysis Services' memory settings or how to tune them; in any case, for most scenarios the default settings are good enough. However, it is important to understand what happens when Analysis Services requests memory from the operating system, because this memory is not always physical RAM and this could have important consequences such as increased paging of memory to disk.

Analysis Services, like any other process in Windows, requires memory from the operating system, which in turn provides blocks of virtual memory. Each process has a separate address space that is called a virtual address space. Each allocation made by a Windows process inside the virtual address space gets a part of the operating system virtual memory, and that might correspond to either physical RAM or disk paging file. The decision whether a page of memory (which corresponds to 4K) is in physical RAM or is moved to the paging disk file is up to the operating system. This concept is very important, especially when we have several other services running on the same server, such as Reporting Services, Integration Services, and the relational engine of SQL Server itself.
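To illustrate the 4 KB page granularity just described, a small sketch (the helper is ours) shows how many pages a given allocation spans; the operating system decides the fate of each page independently:

```python
PAGE_SIZE = 4 * 1024  # the 4 KB page size mentioned above

def pages_spanned(alloc_bytes: int) -> int:
    """Number of virtual-memory pages covering an allocation; each
    page can independently sit in physical RAM or in the paging file."""
    return -(-alloc_bytes // PAGE_SIZE)  # ceiling division
```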


The memory allocated by Analysis Services might be paged to disk due to other processes' activity, and this is partially controlled by some memory settings. An explanation of these settings is available in Books Online, and a guide to their optimization can be found in the Microsoft SQL Server Analysis Services Multidimensional Performance and Operations Guide white paper. We will also cover a few of these settings later, when talking about the memory differences between 32-bit and 64-bit systems.

In order to understand how much virtual and physical memory a process is using, it is important that we know how to read the numbers provided by Task Manager.

The total amount of virtual memory requested by a process is displayed in a column called Commit Size.

The total amount of physical RAM used by a process is displayed in a column called Working Set (Memory).


A better column to watch if we want to monitor the physical RAM consumed by a process is Memory (Private Working Set), which is a subset of the Working Set (Memory) counter that includes only the data that cannot be shared with other processes. However, these two numbers shouldn't be too different for Analysis Services.
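Reading Commit Size and Working Set (Memory) together gives a rough paging indicator. The following sketch (our own helper, with illustrative numbers) estimates how much committed memory is not resident in RAM:

```python
GB = 1024 ** 3

def nonresident_bytes(commit_size: int, working_set: int) -> int:
    """Rough estimate of committed memory that is not resident in
    physical RAM, and so has been (or can be) paged to disk."""
    return max(commit_size - working_set, 0)

# An msmdsrv.exe showing a 6 GB Commit Size but a 4 GB working set
# has roughly 2 GB of data that may live in the paging file.
```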

The virtual memory manager in Windows is a complex system that aims to optimize the usage of physical memory, sharing data between processes whenever possible while keeping each virtual address space securely isolated from all the others. For these reasons it is not easy to interpret the counters we just mentioned, so it is worth recapping how virtual memory allocation works in Windows.

As we mentioned, data can be shared across different processes. For example, if two different processes use the same DLL, the physical memory containing DLL code is shared between both processes using read-only pages, regardless of the virtual position of the code in each virtual address space. However, this feature is not particularly important for Analysis Services because most of the memory is consumed for data and it is private to the process. For this reason, for the rest of this section we will concentrate on private memory and not on shared memory.

When a process allocates private memory, as Analysis Services does when it requires space for its data, it is requested from virtual memory. When that memory is written, the operating system ensures that the page is in physical RAM. When there is not enough RAM to hold all the virtual memory pages used by running processes, the operating system moves older pages from RAM to disk. These pages will be recalled from disk as soon as a process needs to read or write data there. This activity is called memory paging and we want it to happen as little as possible. One way to stop it happening would be to remove the paging file from the operating system. This is possible using the No paging file setting, but we do not recommend using this option on a server that is running SQL Server or Analysis Services.


There has been a lot of discussion over the years about what the "right" size for the Windows paging file is, and a number of the suggestions made are simply wrong. There are, for example, several formulas around based on the amount of physical RAM; the most common uses the value reported in the Virtual Memory dialog box, 1.5 times the physical RAM in the system. However, most of these suggestions are baseless, inherited from an era when the RAM of a server was measured in hundreds of megabytes.

Today we might have tens of gigabytes of RAM, and paging 1.5 times this amount of data seems crazy, even just for the time required to read and write that much data (we are talking about minutes, not seconds here). There are only two conditions that justify a big paging file size: if we need a full memory dump in the event of a system crash/blue screen, then the size of the paging file must be at least the amount of physical RAM; and if a kernel memory dump is required, then a variable amount of paging file is required, up to the amount of physical RAM, but generally lower than that.

Where only a small memory dump is required as a result of a blue screen, the size of the paging file only has an impact on the performance of the dump, and a few gigabytes should be enough even on servers with tens of gigabytes of RAM. If you want to work out a good custom paging file size for your server, we suggest reading Mark Russinovich's articles on virtual memory, which give a more complete description of the subject.
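The sizing logic above can be summarized as follows; the "small dump" figure and the kernel-dump midpoint are illustrative starting points of ours, not official guidance:

```python
GB = 1024 ** 3

def suggested_paging_file(ram_bytes: int, dump_type: str = "small") -> int:
    """Paging-file size suggested by the discussion above."""
    if dump_type == "full":
        return ram_bytes       # a full dump needs at least a RAM-sized file
    if dump_type == "kernel":
        return ram_bytes // 2  # up to RAM, generally lower; assumed midpoint
    return 4 * GB              # small dump: a few gigabytes suffice
```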

Thus, we have a paging file and we need to optimize its use; ideally Analysis Services should not use it at all. If Analysis Services were the only process running on the system, it would be sufficient to set its memory limits to a value that doesn't exceed the amount of physical RAM of the system. In fact, the default settings of Analysis Services are below this limit, but they don't take into account that other memory-hungry processes may run concurrently on the same machine. For example, it's quite common to have both SQL Server and Analysis Services running on the same machine. Think what would happen when we processed a cube, which of course would mean that Analysis Services would need to query the fact table in SQL Server: both services require memory and paging to disk could be unavoidable. There is a difference between SQL Server and Analysis Services in terms of memory management, in that SQL Server can adapt the amount of virtual memory it requests from the operating system to the amount of physical RAM available to it. Analysis Services is not as sophisticated as SQL Server and does not dynamically reduce or increase the size of its requests for memory to the operating system based on current available physical memory.


Another reason for paging physical memory to disk is the System File Cache. By default, Windows uses the available unused physical memory as the System File Cache. This would not be an issue if "unused" really meant memory not used by anyone. However, if applications or file sharing generate a lot of sustained cached read I/O, the operating system gives physical RAM to the file cache and, if necessary, pages out the memory of idle processes. In other words, if we were working on a 16 GB RAM server and we copied a 20 GB backup file to an external disk, all the memory used by Analysis Services could be paged out if it was not in use during the copy. To avoid this kind of situation, we could reduce the size of the paging file (if virtual memory cannot be paged to disk, it has to be kept in physical RAM) or use a utility such as SetCache to limit the size of the System File Cache.

The memory requested by a process is always requested as virtual memory. In situations where the virtual memory allocated by Analysis Services is much larger than the available physical RAM, some Analysis Services data will be paged to disk. As we said, this could happen during cube processing. We should avoid these situations by configuring Analysis Services' memory settings (more on this later) so that they limit the amount of memory it can allocate. However, when no other processes are asking for memory, we might find that by limiting Analysis Services' memory usage in this way, we are preventing it from using extra memory when it needs it, even when that memory is not used by anything else.

Later in this chapter, we will explore the available memory options for Analysis Services and see how to monitor its memory usage.

I/O Operations

Analysis Services generates I/O operations both directly and indirectly. It makes direct I/O requests when it needs to read data from or write data to disk, and when it sends query results back to the client, which involves inter-process communication, typically made through network I/O operations. The indirect I/O requests generated by Analysis Services come from disk paging operations, and it is very important to be aware that they can happen, because they cannot be seen using the performance counters we might typically monitor for Analysis Services. Paging operations are not visible to the Analysis Services process and can be seen only by using the appropriate operating system performance counters, such as Memory: Pages/sec.

As we said, there are two different types of directly generated I/O operations that we need to worry about when monitoring Analysis Services. First, there are disk-related operations, which are made when Analysis Services processes cubes and dimensions (mainly to write data to disk) and when Analysis Services reads cube data to answer queries. Monitoring reads to the disk is important if we are to spot situations where more RAM or more aggregations are needed to improve performance. Another I/O operation generated by Analysis Services is the transfer of query results to the client. Usually this is not a slow operation, but if a query returns a very large number of cells in a cellset (for example, a list of one million customers with several measure values for each one), the query response time might be affected by the time needed to transfer the cellset from the server to the client. Take a look at network traffic to understand if this is a possible issue.


Most of the I/O bottlenecks for Analysis Services operations are related to disk access. It is outside the scope of this book to talk about hardware configurations in detail. However, it is important to highlight that for Analysis Services, a set of local disks (especially in RAID 0 or 10) can be much faster than accessing volumes on a SAN. Getting the same bandwidth from a SAN as we can get with local disks can be very expensive, and bandwidth is very important when Analysis Services needs to do something such as scanning all the partitions in a measure group. Microsoft has published an interesting white paper about hardware optimization for a multiterabyte data warehouse and cube.

Tools to monitor resource consumption

The first tools we're going to look at for monitoring resource consumption are Task Manager and Performance Monitor. These tools are available on any server we might use to host an Analysis Services instance and they can also be used to monitor any other process that might compete with Analysis Services for the use of system resources.

Windows Task Manager

The Windows Task Manager is a utility included with the Microsoft Windows operating system that provides information about running applications and their system resource usage (for example, CPU, memory, and network activity). Task Manager also allows us to kill a process and change its priority; however, we are only interested in the monitoring features of this tool. There have been some changes and improvements in Task Manager moving from the XP/2003 version to the Vista/2008 version, and from here on we will focus only on the latest release of the tool.

The Services tab displays a list of services along with the process ID (PID) of each service. In the following screenshot, we can see that there are two instances of Analysis Services running on the same machine: one is named MSOLAP$K8 (a named instance called K8) and has the PID 2568; the other is named MSSQLServerOLAPServices, is the default instance, and has the PID 2768.

Windows Task Manager

On the Processes tab, we can distinguish between the two running msmdsrv.exe processes based on the PIDs that we saw on the Services tab. The following screenshot also shows several other columns (chosen through the View | Select Columns menu) with information about CPU and memory usage, which we'll now explain.

Windows Task Manager

The CPU column shows the percentage each process uses of the total processing power available on the system. On a dual-core machine, a single thread can consume up to 50 percent, but the same thread can only consume up to 25 percent on a quad-core system. Analysis Services scales well on multiple cores in most scenarios, so we should expect to see very high CPU values for Analysis Services during cube processing and query resolution.

Working Set (Memory) and Commit Size are the most important counters that we can analyze, and they correspond to the physical RAM and virtual memory used by the process. As we said earlier, Memory (Private Working Set) is a value that, for Analysis Services, should be slightly lower than Working Set (Memory); for our purposes, they can be used almost interchangeably. What is interesting is the comparison between virtual memory usage and physical RAM usage. When virtual memory usage is higher than physical RAM usage, part of the Analysis Services cache has probably been paged to disk. This condition is not an issue if Analysis Services does not have a high workload, but if it happens when users are running queries and does not improve over time (the physical RAM usage should grow until it almost matches the virtual memory usage), then we should investigate the cause. The server might have insufficient physical RAM, or other processes might be consuming physical RAM at the same time. In both cases, adding RAM would be the best way to solve the problem. We can also mitigate the issue by moving other processes to different servers, if we can afford to do so, or by reducing the amount of memory available to Analysis Services (as we'll see later in this chapter). Page Faults Delta is another counter that we can monitor to check which process is responsible for any paging activity.

Performance counters

The Windows operating system gathers data about system performance and exposes it through performance counters, which can be displayed through a Microsoft Management Console (MMC) snap-in called Performance Monitor. In reality, performance counters are available through a set of APIs and there are third-party tools that can access them too. However, in this book our goal is to highlight the most important performance counters to monitor on a server running Analysis Services. We will use Performance Monitor, but the concepts we are going to explain are valid regardless of the tool used to display them. Moreover, in this section we will cover only operating system counters, but there are also other counters specific to Analysis Services instances that we will look at later on in this chapter.

Performance Monitor can be found by clicking on the Administrative Tools | Performance Monitor shortcut, as shown in the following screenshot. There are differences in the user interface of this tool between versions of Windows, but they are not significant for the purposes of this chapter. Performance Monitor can display performance counter data captured in real time, or it can display a trace session of performance counter data recorded using its Data Collector Sets feature. This trace data is very useful for monitoring a production server to detect bottlenecks and measure average workload, and we suggest reading the Data Collector Sets documentation to understand how to make good use of them.

Performance counters

The memory counters are divided into two groups. The first set of counters, included in the Memory category, provides information about the overall consumption of memory regardless of the consumer (which could be a process, a driver, or the operating system itself):

· Pages/sec: This is the number of pages read from and written to disk to resolve hard page faults. An access to a page of virtual memory that has been paged to disk generates this type of event; the cause is insufficient physical RAM to satisfy all the memory requests.

· Committed Bytes: This is the amount of virtual memory requested by all running processes. If this value is higher than the amount of available physical RAM, data has been paged to disk. However, if the Pages/sec rate is low or zero, the data paged to disk is not used often; as long as it does not need to be accessed, there will be no performance problems caused by this situation.

· % Committed Bytes in Use: This is the ratio of Committed Bytes to Commit Limit, which is the maximum amount of virtual memory that can be allocated (a value slightly below the sum of physical RAM and the paging file).
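The last counter can be reproduced from the other values; a minimal sketch of the arithmetic (the function name is ours):

```python
def pct_committed_bytes_in_use(committed: int, ram: int, paging_file: int) -> float:
    """% Committed Bytes in Use: Committed Bytes as a share of the
    commit limit, approximated here as physical RAM plus paging file."""
    return 100.0 * committed / (ram + paging_file)

# 12 GB committed on a server with 16 GB RAM and an 8 GB paging
# file sits at 50 percent of the commit limit.
```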

A second group is made up of performance counters that are available for any running process. This group is included in the Process category and includes a few counters that we have already seen in Task Manager, as well as other counters that are useful for analyzing the state of an Analysis Services instance, as shown:

· Virtual Bytes: This is generally not so useful, because it represents the amount of virtual memory used for both private allocations and file mapping (including executable files).

· Page File Bytes and Private Bytes: These are usually very similar for Analysis Services, and correspond to the Commit Size counter we have seen in Task Manager.

· Page File Bytes Peak: This is very important because it reports the maximum value reached by Page File Bytes since the start of process execution, regardless of when we started monitoring. If this value is higher than Page File Bytes, there has been a reduction in the amount of memory requested by Analysis Services; but since this peak has been reached once, it could be reached again in the future.

· Working Set: This corresponds to the Working Set (Memory) value in Task Manager.

· Working Set—Private: This corresponds to the Memory (Private Working Set) value in Task Manager.

· Working Set Peak: This is the maximum value reached by Working Set since the start of process execution, regardless of when we started monitoring. This value can be higher than the actual Working Set counter, because even if a process has not released memory previously allocated, part of that memory may have been paged as a result of memory activity by other processes. If this happens with Analysis Services, we could investigate which other processes are requesting memory concurrently on the same server.
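The peak counters lend themselves to a simple heuristic, entirely our own and with an arbitrary 95 percent threshold: if the working set fell well below its peak while committed private bytes stayed near theirs, the process was probably trimmed by outside memory pressure rather than releasing memory itself.

```python
def working_set_trimmed(working_set: int, working_set_peak: int,
                        private_bytes: int, private_bytes_peak: int) -> bool:
    """True when the working set dropped from its peak while the
    committed private bytes did not, suggesting pages were trimmed
    to disk by memory pressure from other processes."""
    released_memory = private_bytes < 0.95 * private_bytes_peak
    return working_set < working_set_peak and not released_memory
```

A process that genuinely shrank (both counters well below their peaks) does not trigger the heuristic.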

The Disk I/O counters at the operating system level are divided up into two categories: LogicalDisk and PhysicalDisk. The difference between these counters is significant when there are multiple partitions on a physical disk or when we use operating system support to create RAID volumes. When the RAID service is provided by the disk controller, the physical disk and logical disk should be the same unless we have several partitions on a disk.


Logical and Physical Disk Performance counters can be enabled or disabled by using the Diskperf command line utility available from the operating system. We can check the state of these counters by simply running this utility without parameters from the command line.

Here are the important counters available in these categories:

· Current Disk Queue Length: This represents the current number of queued requests to the disk. If this number is constantly above the number of spindles plus 2, the disk subsystem could be improved by adding spindles (to a RAID 0/10 configuration) or by using faster disks. If this number is below that limit, even if there is a continuous workload on the disk, it will not be the primary cause of any performance issues.

· %Disk Time: This is a counter that should represent the percentage of time spent accessing the disk, but the calculation used could provide a value higher than 100 percent.

· Disk Read Bytes/sec: This is important to determine if the speed of the disk subsystem corresponds to its theoretical limits. There could be issues that make this value lower: disk fragmentation, RAID overhead (for example, during a RAID rebuild), under-dimensioned LUN channels, and so on. If this number is near to the physical limits of the disk, we can improve performance by reducing data, distributing data to different disks, or by using faster disks.

· Disk Write Bytes/sec: This is also important to determine if the speed of the disk subsystem corresponds to its theoretical limits. Write performance is extremely important for Analysis Services processing operations. Other than the issues we already described for the Disk Read Bytes/sec counter, a common source of problems during writing is the use of a RAID 5 volume: writing to a RAID 5 volume is slower than writing to other volume types, because each logical write generates several physical I/O operations to maintain parity.
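Both rules of thumb above can be sketched as follows; the four-I/Os-per-write RAID 5 penalty is the commonly cited figure (read data, read parity, write data, write parity), and the helpers are ours:

```python
def queue_suggests_upgrade(queue_length: float, spindles: int) -> bool:
    """Sustained disk queue length above spindles + 2 suggests
    adding spindles or moving to faster disks."""
    return queue_length > spindles + 2

def raid5_write_iops(per_spindle_iops: float, spindles: int) -> float:
    """Each logical write on RAID 5 costs about four physical I/Os,
    roughly quartering the array's raw write throughput."""
    return per_spindle_iops * spindles / 4

# A 4-spindle volume with a sustained queue of 10 is saturated,
# and eight 150-IOPS spindles in RAID 5 sustain about 300 write IOPS.
```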


Know your hardware configuration!

Too many times, we have seen performance issues caused by a lack of knowledge about the hardware configuration. For example, RAID 5 disks are very common, but they can be a performance nightmare for ETL and cube processing. The data we collect from performance counters only makes sense if we can compare it to the known physical limits of our hardware. Therefore, we need to know how to interpret terms such as LUN, RAID, and spindle; we also need to know exactly what hardware we're using in our own environment.

For disk-related terminology, a storage glossary is a good place to start, and descriptions of all the possible RAID levels are easy to find online.

If we want to see data on I/O operations broken down by process, we can use some of the counters available in the Process category, as shown:

· IO Data Bytes/sec: This is the amount of data transferred in I/O operations. For Analysis Services, this includes reads and writes to files and communication between client tools and Analysis Services. Monitoring it can help us understand if a lack of CPU activity while queries are running on Analysis Services is the result of I/O operations or not.

· IO Write Bytes/sec: This is the amount of data transferred in write I/O operations.

· IO Read Bytes/sec: This is the amount of data transferred in read I/O operations.

Finally, we have the Processor category of counters, which provides data on the usage of each processor as well as aggregated values for all processors on the server. When monitoring Analysis Services, the only significant counter is % Processor Time, which allows us to compare the CPU usage of Analysis Services to the total CPU usage of all processes on the server. If, for example, Analysis Services is consuming 50 percent of CPU and overall CPU usage is at 100 percent, then Analysis Services is sharing the CPU with other processes. In this kind of situation, to free up resources for Analysis Services, we either need to move other processes to a different server, or we can add more processing power to the server. To monitor the CPU usage of Analysis Services, we can use the % Processor Time counter from the Process category, and select the msmdsrv process.

Resource Monitor

Another useful tool to monitor the core performance counters is Resource Monitor, which can be found as a shortcut in Administrative Tools or by clicking on the Open Resource Monitor button in the Performance tab of Task Manager. The following screenshot shows Resource Monitor:

Resource Monitor

As well as giving us a concise view of the most important performance counters on a server, for each type of resource (Memory, CPU, and I/O, which is divided into Disk and Network), we can drill down into the details of each process' resource consumption, as shown in the following screenshot:

Resource Monitor

In the case of the Disk section, the details show one line for each open file. The Network section shows each open connection; Memory has a line for each process and CPU has a line for each process or service running on the system (if a single process hosts several services, their consumption is split across several lines—however, Analysis Services is hosted in a dedicated process and so this is not relevant for us).

Analysis Services memory management

Analysis Services has its own Memory Manager, which provides allocation services for all of its components, distributing the memory that is allocated from the operating system. It is important to understand the implications of this for virtual and physical memory because, in extreme conditions, Analysis Services might continue to allocate virtual memory even if there is no more physical RAM available, increasing disk paging, and possibly carrying on until it reaches the limit of allocatable virtual memory (for example, when the paging file cannot grow beyond a certain quota). Even if we don't get a memory allocation error, if allocation requests go out of control we'll end up with a very slow system, because it will always be busy paging memory to disk.

Memory differences between 32 bit and 64 bit

Having talked about memory, it is important to make a distinction between the 32-bit and 64-bit versions of Analysis Services. Nowadays, any serious Analysis Services deployment should use the 64-bit version. However, we might still encounter legacy servers, or simply small deployments running on virtual machines that use a 32-bit operating system. For all these reasons, it makes sense to discuss the differences between the 32-bit and 64-bit versions of Analysis Services regarding memory management.

Regardless of the physical RAM installed on a system, if Analysis Services is running on a 32-bit operating system, it can only address a maximum of 2 GB or 3 GB of virtual memory (depending on whether the /3GB switch is set in the boot.ini file). This limit can be an issue for scalability and performance. First of all, even if there is more physical memory available, a process cannot use more virtual memory than is available in its virtual address space. Unlike SQL Server, the Analysis Services memory manager cannot use mechanisms such as AWE to access more memory. So, as we can see, it is important that the memory manager doesn't try to allocate more virtual memory than can be addressed, because this will cause errors in Analysis Services.

With the 64-bit version of Analysis Services, there are no practical limits for virtual memory allocation. However, this doesn't prevent memory allocation errors, because we can still reach the limit of virtual memory that can be allocated, which is the sum of the total physical memory and paging file. This condition is much more frequent on a 64-bit server than on a 32-bit one, but if we understand how this can happen, we can also apply the right countermeasures.

Controlling the Analysis Services Memory Manager

The default settings of the Analysis Services Memory Manager work well in most cases. However, there might be scenarios where more tuning is required to optimize performance or simply to prevent low memory conditions.

The main server properties that control the Analysis Services Memory Manager are shown in the following screenshot:

Controlling the Analysis Services Memory Manager


All Analysis Services server property values are held in a configuration file called msmdsrv.ini. Whenever we edit server property values in SQL Server Management Studio, the changes are written back to this file. Some Analysis Services server properties can only be seen by selecting the Show Advanced (All) Properties checkbox, but even then there are still some properties in msmdsrv.ini that aren't visible at all in SQL Server Management Studio. The only way to access these properties is to edit the msmdsrv.ini file manually. However, all the settings we discuss in this book are visible in SQL Server Management Studio.

In the Memory category, there are two important properties: LowMemoryLimit and TotalMemoryLimit. For these properties, values greater than 100 are interpreted as absolute values in bytes (scientific notation is supported in msmdsrv.ini). Values of 100 or less are interpreted as a percentage, where the reference value of 100 percent is calculated as the minimum of the physical memory and the virtual memory available to the process. So, for example, a value of 100 would be interpreted as follows: on a 32-bit server with 8 GB of RAM, there is still a 2 GB limit on addressable memory (or 3 GB if we have the /3GB option enabled), so it would be interpreted as 2 GB; on a 64-bit server with 8 GB of RAM, it would be interpreted as 8 GB; on a 32-bit server with only 1 GB of RAM, it would be interpreted as 1 GB. Here's what each of these properties actually represents:

· LowMemoryLimit: This defines the point at which the server is low on memory. The default setting for a single instance installation is 65 (percent), and that is appropriate in most cases, but if we had other processes on the same machine, we might want to use a lower value. If we want to know the actual value that a percentage setting represents, we can look at the MSAS Memory / Memory Limit Low KB performance counter.

· TotalMemoryLimit: This is the maximum percentage of virtual memory that Analysis Services will use. The default is 80 (percent) and again this is appropriate in most cases. If we want to know the actual value that a percentage setting represents, we can look at the MSAS Memory / Memory Limit High KB performance counter.

The Analysis Services Memory Manager will carry on requesting memory from the operating system until memory usage reaches the value specified in TotalMemoryLimit. At the same time, though, it also tries to reduce memory usage when it reaches the LowMemoryLimit: when memory usage is between LowMemoryLimit and TotalMemoryLimit, Analysis Services tries to reduce memory requests and to free up memory previously allocated.

However, it takes time to free up memory, and there are also some types of internal allocation that cannot be freed. For these reasons, it is not always possible to control the memory allocated by Analysis Services, and increasing the gap between LowMemoryLimit and TotalMemoryLimit gives Analysis Services more leeway to react to changes in memory usage. Making the gap between these two properties' values too small is not a good idea, and the default values in Analysis Services 2012 generally provide a good balance (Analysis Services 2005 had a LowMemoryLimit of 75 percent, which proved to be too high in some cases).

Finally, remember that the properties we have looked at here only control the Analysis Services Memory Manager, and not any other virtual memory allocations made by components related to Analysis Services: for example, an Analysis Services stored procedure could allocate its own memory directly from the operating system, outside of the control of the Memory Manager.


Other Memory Manager Settings

The Memory Manager can be fine-tuned using other server properties such as MidMemoryPrice and HighMemoryPrice. More details on these properties, and the inner workings of the Memory Manager, can be found in Chapter 27 of Microsoft SQL Server 2008 Analysis Services Unleashed, Gorbach, Berger, and Melomed, Sams Publishing.

The most important thing to remember is that Analysis Services requests virtual memory from the operating system and cannot control whether it is using physical RAM or not. This is intentional. Analysis Services might have a pattern of use very different from SQL Server (which has a completely different model for handling memory allocation), and paging memory to disk can be a good compromise during periods of low usage, especially if other services such as SQL Server are running on the same server. However, if paging activity becomes an issue, we can try to reduce the number of memory requests made by Analysis Services to the operating system. First, we can set the LowMemoryLimit property to a lower value, making the Memory Manager more aggressive in freeing up memory. If this is not enough, we can set the TotalMemoryLimit property to a lower value too. Changing these properties doesn't require restarting the service.

Out of memory conditions in Analysis Services

So far, we have explained how to control the Analysis Services Memory Manager at a high level, but we also said that in most cases, it is better to not change the default settings except in certain circumstances. Another reason why we might want to change these settings is when we find Analysis Services has run out of memory. This can happen when the operating system cannot provide the memory requested by Analysis Services, as in the following scenarios:

· Processes other than Analysis Services are consuming virtual memory. Regardless of the actual use of memory made by Analysis Services, if other processes consume virtual memory, new allocations might be denied to Analysis Services when it requests them.

· Analysis Services consumes too much memory, using up all the virtual memory available. This might happen under heavy load, and we cannot completely control this by changing the memory settings of Analysis Services.

If we experience out-of-memory conditions, increasing the size of the paging file is a workaround that will at least prevent Analysis Services from crashing, even if it will not help performance. If the problem is caused by too many concurrent processes, another solution could be limiting the memory usage of these other processes (for example, SQL Server does have settings to limit its memory allocation). However, if the cause of the out-of-memory condition is a program or a service other than Analysis Services with memory leaks, a better idea would be to investigate the cause of the excessive memory allocation, replacing the leaking application version with a fixed version or at least isolating it on another server.

As we said, it's not all that common to find an Analysis Services instance that consumes too much memory, because TotalMemoryLimit is based on available physical RAM and virtual memory is typically larger than physical RAM. Therefore, if there are no other processes consuming significant amounts of RAM and Analysis Services is the only "big service" running on a server, a reasonable amount of paging file should be enough to avoid such a situation. However, in the real world, we hardly ever find servers with only a single service running on them.

The TotalMemoryLimit property is not an absolute limit for the Analysis Services Memory Manager: as we said, virtual memory allocated by other libraries (such as stored procedures) used by Analysis Services is not under the control of the Memory Manager. The Memory Manager itself might be unable to respect the TotalMemoryLimit setting. Under heavy load, memory requests cannot be avoided; sometimes memory requests come too fast to allow cleaning of older, rarely-used allocations in a timely manner.

In both cases, the result is that the memory allocated exceeds the TotalMemoryLimit. This is not an issue in itself, but it could break the balance of memory usage between all the processes running on the server. Therefore, we cannot assume that Analysis Services will always respect the limit we have defined, especially when the gap between LowMemoryLimit and TotalMemoryLimit is tight.

Here are some possible causes of the type of heavy load on Analysis Services that could result in abnormal virtual memory consumption:

· Parallel processing: If a server has many processors available, parallel processing of cubes and dimensions will require a larger amount of memory. To reduce memory usage, we can try to decrease the amount of parallelism during processing.

· Too many queries and/or concurrent users: In this case, we should check the usage of cache and aggregations (adding more aggregations and rewriting queries and calculations if necessary), and make sure that sufficient memory is available to Analysis Services to support the number of concurrent sessions.

· Security: Security has implications on Analysis Services' ability to share cache between users, which means more work is likely to be taking place for each query. Using the Enable Visual Totals option with Dimension Security also has an impact on aggregation usage. In both cases, we should check if we can make structural changes to the cube or to how we have implemented security that can improve aggregation and cache usage.

A final consideration is that an out-of-memory condition should be prevented by dealing with memory issues in advance. Memory issues cause paging and paging causes bad performance. Therefore, monitoring operating system performance counters is the first step to ensuring that Analysis Services performs as well as possible.

Sharing SQL Server and Analysis Services on the same machine

Many Analysis Services installations share a server with SQL Server. This happens mainly because of licensing. The license required for Analysis Services is in fact a SQL Server license, so having both installed on the same machine requires only one server license to be purchased. However, this configuration has a hidden cost; since memory and sometimes CPU must be shared between these services, many of the scenarios we describe in this chapter are much more likely to occur.

It is not always true that it is a good idea to install Analysis Services on a different server to SQL Server, though. If we don't have a multiterabyte data warehouse and the hardware necessary to support it, chances are that the bandwidth available between two servers could be a bottleneck compared with the faster transfer rate we would have if Analysis Services and SQL Server were installed on the same server. This is particularly important during cube processing. There is also the question of cost. Often Analysis Services and SQL Server use neither CPU nor disks at the same time, and only require a modest amount of memory to cache data. Nowadays, 64-bit servers allow us to double the memory on a server at a fraction of the cost necessary to provide a new server with the same amount of RAM. In our experience, adding RAM to a server has a much better ROI than investing in a new server. In the end, having both services on the same server results in faster processing and cheaper hardware costs, although it can, of course, increase the chance of memory issues.

As we saw in previous sections, we can try to control the memory usage of Analysis Services and SQL Server on servers that host both services. However, our suggestion is to make changes only if it is really necessary. By default, SQL Server allocates memory as long as no other process is claiming it. When Analysis Services is running and requests memory, SQL Server might release some of the memory it has previously allocated, because it can detect when a system starts paging. This is not true for Analysis Services: memory allocated by Analysis Services up to the LowMemoryLimit threshold is never released, although it can be paged out. Sometimes, it is better to page out a section of the Analysis Services cache than to drop it completely and have to recalculate the values it contains for future queries. Therefore, even if Analysis Services has allocated memory and SQL Server has pushed some of this memory out to the paging file, this is not a bad thing in itself. Only if paging activity (measured at the operating system level) is continuous, for example, because both Analysis Services and SQL Server are used intensively at the same time, might we want to try to limit the memory used by one or both services. Last of all, remember that any attempt to optimize memory settings should be tested properly: we should compare system performance before and after making any changes, measuring the impact on things such as query response time, and not only looking at how they affect performance counter values.

Monitoring processing performance

Processing Analysis Services dimensions and partitions is a resource-intensive operation that can use all of the resources (CPU, memory, and disk I/O) available to Analysis Services. For this reason, it is important to be able to understand the impact of processing operations on the operating system, as we saw earlier in this chapter.

When a database goes into production, Analysis Services processing usually takes place on a rigid schedule. We should monitor processing for the following two reasons:

· Optimize processing times: Usually we want to reduce the time required to process partitions and dimensions.

· Check for any increase in the processing window over time: The amount of time processing takes could increase over time, especially if we are running a full process on our database. We want to ensure that it won't exceed our assigned processing timeslot. To do that, we need to log the start and finish time of each processing operation and, if possible, details of the number of rows processed.

Analysis Services doesn't provide a ready-to-use logging system for processing operations. However, there are a number of ways to gather this data and make it available for analysis.

Monitoring processing with trace data

Analysis Services generates trace data about its internal operations. This data allows us to analyze the start and end times of many operations, including processing operations. There are several tools that we can use to retrieve this data.

SQL Server Profiler

The first tool we'll look at is SQL Server Profiler, which (despite its name) is a tool that can collect both SQL Server and Analysis Services trace data. After connecting to Analysis Services to initiate a new trace session, we have to choose the events we want to capture and where to save them (for example, either to a file or to a SQL Server table). The events that we need to monitor for processing operations are shown in the following screenshot:

SQL Server Profiler

The events chosen in a profiling session are in fact classes of events. For each class, there are many actual events that can be generated, and these are shown in the EventSubClass column in Profiler:

SQL Server Profiler

Looking at these events in Profiler itself is not particularly easy, and so saving trace data to a SQL Server table is a good idea since it allows us to query and report on it much more easily. We can save a captured trace session by using the File | Save As | Trace Table... menu or we can choose to save a trace session in advance by using the Save to table option in the Trace Properties dialog box that is shown when we define a new trace session.

The trace events that we are interested in are listed next. Note that event classes and subclasses are identified by an integer value when saved in SQL Server log tables; we provide these integer values in parentheses after the events in the following list. A script that inserts all these IDs into SQL Server lookup tables is also available online.

· Command Begin (15)/Command End (16): This contains only one interesting subclass event:

· Batch (12) contains the XMLA command sent to Analysis Services to process one or more objects.

· Progress Report Begin (5)/Progress Report End (6): This contains several subclass events that apply to both processing and query operations. We list only the subclass events relevant to processing, as follows:

· Process (1)

· ExecuteSQL (25)

· WriteData (16)

· ReadData (17)

· BuildIndex (20)

· BuildAggsAndIndexes (28)

It is important to note that there is a sort of nesting of events that can be seen in trace data. For example, the Process event for a database initiates several other Process events for related objects, such as the cubes and dimensions in that database. The outermost events have an execution time (the Duration column, in milliseconds) which includes the time taken for all the operations executed within those events. Therefore, the values in the Duration column for different events cannot easily be summed, because we have to be careful not to sum events that might include each other. A good approach is to filter rows by event class, event subclass and ObjectPath (which uniquely identifies the object that the event refers to). For example, if we want to find how long it took to process the Adventure Works cube, we need to find the row in the trace that has the event class Progress Report End, the event subclass Process (1) and the object path ServerName.Adventure Works DW 2008.Adventure Works.
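As a sketch of this approach, assuming the trace has been saved to a SQL Server table (the table name dbo.ProcessingTrace is hypothetical; the column names follow Profiler's save-to-table format), the row we are looking for could be retrieved like this:

```sql
-- Hypothetical trace table; EventClass 6 = Progress Report End,
-- EventSubclass 1 = Process, Duration is in milliseconds
SELECT ObjectPath, Duration
FROM dbo.ProcessingTrace
WHERE EventClass = 6
  AND EventSubclass = 1
  AND ObjectPath = N'ServerName.Adventure Works DW 2008.Adventure Works';
```

The same filtering pattern, with different ObjectPath values, gives the duration of each individual partition or dimension.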


Using SQL Server Profiler to capture trace data is a good option if we want to create a trace manually, but it is not the best way to automate trace data capture on a production server. A useful tool here is ASTrace, which is part of the Microsoft SQL Server Community Samples for Analysis Services. ASTrace captures an Analysis Services trace and logs it to a SQL Server table.

This utility runs as a Windows service that connects to Analysis Services, creates a trace, and logs trace events into a SQL Server table using the SQL Server Profiler format. To customize the trace (for example, to filter on certain events), we can use a standard trace template authored using SQL Server Profiler. Because it runs as a service, this tool does not require a logged-in user, unlike SQL Server Profiler.


We can also initiate a trace by executing an XMLA command. The columns and events used in the trace are defined in the XMLA command itself. An example of such an XMLA script is available online.
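However a trace was created, we can check which traces are currently running on an instance with a Dynamic Management View query (DMVs are covered later in this chapter); a minimal sketch:

```sql
-- Lists all traces currently active on the instance,
-- including the Flight Recorder trace
SELECT * FROM $SYSTEM.DISCOVER_TRACES
```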

Flight Recorder

Flight Recorder is a feature of Analysis Services that maintains a log of all events that happened in the recent past, which might be useful when investigating crashes or performance problems; it works by running a trace. By default, it doesn't capture all events and only keeps data for a limited time (so as not to fill the disk with trace data), but we can customize it by changing both the length of time it keeps data and the events that it records. We have to remember though that Flight Recorder can affect performance: the more events it records, the more I/O operations are required to update the trace files it generates.

Flight Recorder trace files can be opened with SQL Server Profiler and are stored in the OLAP\Log folder (usually found at C:\Program Files\Microsoft SQL Server\MSSQL.2\OLAP\Log). To customize the trace definition it uses, we can use a SQL Profiler template in the same way we did with ASTrace.


Flight Recorder and Trace Architecture

A more complete description of the Trace architecture of Analysis Services and the behavior of Flight Recorder is available in Chapter 38 of Microsoft SQL Server 2008 Analysis Services Unleashed, Gorbach, Berger, and Melomed, Sams Publishing.

Monitoring processing with Performance Monitor counters

In the first part of this chapter, we saw how to use Performance Monitor to monitor the operating system. Analysis Services has its own set of Performance Monitor counters too. The most important ones for analyzing Analysis Services processing operations can be found under the MSOLAP Processing category: Total Rows Converted, Total Rows Read, and Total Rows Written are, respectively, the number of rows converted, the number of rows read from all relational data sources, and the number of rows written during processing. These values allow us to monitor the amount of data processed, and can be useful for evaluating whether the time taken for processing is related to the amount of data processed.

There is another set of performance counters relating to the Memory Manager. These counters are found in the MSOLAP Memory category:

· Cleaner Memory non-shrinkable KB: This is the amount of memory that cannot be purged by the Memory Manager.

· Cleaner Memory shrinkable KB: This is the amount of memory that can be purged by the Memory Manager.

· Cleaner Memory KB: This is the sum of the amount of shrinkable and non-shrinkable memory returned by the two counters discussed, and this is the total amount of memory controlled by the Memory Manager.

· Memory Limit High KB: This returns the actual amount of memory that corresponds to the threshold set by the TotalMemoryLimit property.

· Memory Limit Low KB: This returns the actual amount of memory that corresponds to the threshold set by the LowMemoryLimit property.

· Memory Usage KB: This is the total private virtual memory allocated by Analysis Services, and corresponds to the Process\Private Bytes counter provided by the operating system. This number should always be greater than or equal to the Cleaner Memory KB counter. We should also look out for any paging caused by Analysis Services' virtual memory.

As we previously said in the Controlling the Analysis Services Memory Manager section, it is important to compare the value of Cleaner Memory KB to the values of Memory Limit Low KB and Memory Limit High KB, but also to look for any paging caused by Analysis Services virtual memory. We might consider increasing our memory limits only if there is no paging and there is enough physical RAM available. On the other hand, we might want to decrease our memory limits if paging is having a negative impact on performance (for example, if cache is paged to disk, it is much less efficient).

Monitoring processing with Dynamic Management Views

Dynamic Management Views (DMVs) were introduced in Analysis Services 2008 to track server resources used, and can be queried using a SQL-like syntax. We can run a DMV query in SQL Server Management Studio in an MDX query window. For example, the following query shows the top ten objects that used the most CPU:




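One possible form of this query, using the DISCOVER_OBJECT_ACTIVITY schema rowset, is the following sketch:

```sql
-- Top ten objects by CPU time consumed since the counters were last reset
SELECT TOP 10 *
FROM $SYSTEM.DISCOVER_OBJECT_ACTIVITY
ORDER BY OBJECT_CPU_TIME_MS DESC
```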
The result of this query will display a large number of columns; in a sample result set, the first column contained object parent paths such as HPVISTA\K8.Databases.Adv Works - DW.Dimensions.

DMV queries support only a subset of standard SQL. We can use WHERE conditions, DISTINCT and ORDER BY, but constructs such as GROUP BY, JOIN, LIKE, CAST, and CONVERT are not supported. However, we can also project data using expressions and calling some functions, such as Format.

Unfortunately, the documentation on DMVs is incomplete, although there is a lot of useful information on the blogs of Vidas Matelis, Darren Gosbell, and Vincent Rainardi. We can return a list of schema rowsets that can be queried through DMVs by running the following query:



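One form of that query is simply:

```sql
-- Returns one row for each schema rowset that can be queried as a DMV
SELECT * FROM $SYSTEM.DISCOVER_SCHEMA_ROWSETS
```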
Despite the lack of documentation, DMVs are very useful for monitoring Analysis Services. When we want to monitor processing activity through DMVs, we probably want to see which objects are consuming the most memory and CPU. Each allocation in Analysis Services can be one of two types: shrinkable and non-shrinkable. The shrinkable objects that consume the most memory can be returned using a query such as this:



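A sketch of such a query, using the DISCOVER_OBJECT_MEMORY_USAGE schema rowset:

```sql
-- Top ten objects by shrinkable (purgeable) memory
SELECT TOP 10 OBJECT_PARENT_PATH, OBJECT_ID, OBJECT_MEMORY_SHRINKABLE
FROM $SYSTEM.DISCOVER_OBJECT_MEMORY_USAGE
ORDER BY OBJECT_MEMORY_SHRINKABLE DESC
```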

Shrinkable objects can be purged by the Memory Manager if new allocations require memory. However, it can also be useful to look at non-shrinkable objects: because they cannot be purged, we might want to know which are the most expensive objects, as this gives us an indication of which databases should be moved to another server to reduce resource consumption:



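A sketch along the same lines, this time ordering by the non-shrinkable memory column:

```sql
-- Top ten objects by non-shrinkable memory
-- (memory that cannot be purged by the Memory Manager)
SELECT TOP 10 OBJECT_PARENT_PATH, OBJECT_ID, OBJECT_MEMORY_NONSHRINKABLE
FROM $SYSTEM.DISCOVER_OBJECT_MEMORY_USAGE
ORDER BY OBJECT_MEMORY_NONSHRINKABLE DESC
```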

Much of the information provided by DMVs is not available from other sources. For this reason, it is important to know of their existence and to know how to use them in our monitoring infrastructure.

Monitoring query performance

We have already introduced many tools that can be used to get data on processing operations, using trace, performance counters, and dynamic management views. The same tools can also be used to monitor query performance, and in this section we'll see how this can be done.

Monitoring queries with trace data

Regardless of the tool we use to collect it (SQL Server Profiler, ASTrace, XMLA, or Flight Recorder), trace data is the most important source of information on query-related operations. Trace data provides information on the internal operations of the Storage Engine and the Formula Engine, for example, showing if aggregations are used or not, and if calculations are evaluated in bulk mode or not.

The most important trace events for analyzing query performance are as follows (once again, the integer identifier for each event is shown in parentheses after its name):

· Progress Report Begin (5)/Progress Report End (6): There is only one subclass event that is relevant for query operations:

· Query (14): This shows when the Storage Engine accesses a partition or aggregation to get data. This will only happen if the data required is not available in the Storage Engine cache.

· Query Begin (9)/Query End (10): These are raised at the start and end of query evaluation. They usually contain only one interesting subclass event:

· MDXQuery (0): This shows the MDX statement sent to Analysis Services. For the Query End event, the Duration column shows the overall amount of time taken to run the query and return its results back to the client.

· Calculate Non Empty Begin (72)/Calculate Non Empty End (74): These have no related subclass events. The Calculate Non Empty events are raised when Analysis Services performs non-empty filtering operations, for example, when the NonEmpty MDX function or the Non Empty statement is used in a query, and these operations are often the cause of slow query performance. When the value of the Duration column for Calculate Non Empty End is large, we should check the IntegerData column to see whether bulk mode is being used. When the value of IntegerData is 11, Analysis Services iterates over all the tuples in a set to perform non-empty filtering, and this can be very slow. For all other IntegerData values (usually 1), Analysis Services is operating in bulk mode, which is usually much faster.

· Execute MDX Script Begin (78)/Execute MDX Script End (80): These have no related subclass events. They are raised at the beginning and end of the evaluation of the MDX Script on a cube, which is usually a very quick operation. However, if there are any complex named sets defined on the cube then MDX Script evaluation can take a long time.

· Get Data From Aggregation (60): This has no related subclass events. This event is raised when the Storage Engine reads data from an aggregation.

· Get Data From Cache (61): This is raised when data is read from cache. There are several subclass events that identify the type of cache used. This event will be seen very frequently on a production server, because after a query has been run, all of the data it returns should be kept in cache. If this is not the case, it usually indicates that either there is not enough memory available for caching, or that calculations have been written so that the values they return cannot be cached. The subclass events are as follows:

· Get data from measure group cache (1)

· Get data from flat cache (2)

· Get data from calculation cache (3)

· Get data from persisted cache (4)

· Query Cube Begin (70)/Query Cube End (71): These have no related subclass events. After a Query Begin event has been raised, in some scenarios, the MDX Script of a cube must be evaluated before query evaluation proper can start; in these scenarios Query Cube Begin is raised after the Execute MDX Script End event and shows when query evaluation actually starts.

· Query Dimension (81): This event is raised when queries retrieve members from dimension hierarchies. These members may or may not be in cache already; if they aren't, we would expect to see some I/O read operations executed:

· Cache data (1)

· Non-cache data (2)

· Query Subcube (11): As we saw in Chapter 8, Query Performance Tuning, a single MDX query might result in several different subcubes of data being requested from the Storage Engine. A Query Subcube event is raised for each of these requests, and may be one of the following types:

· Cache data (1)

· Non-cache data (2)

· Internal data (3)

· SQL data (4)

· Query Subcube Verbose (12): This is functionally identical to Query Subcube, but it adds more information about the multidimensional space of the subcube that is queried:

· Cache data (1)

· Non-cache data (2)

· Internal data (3)

· SQL data (4)

· Calculation Evaluation (110)/Calculation Evaluation Detailed Information (111): These provide information about internal operations in the Formula Engine. Several subclass events are available:

· InitEvalNode Start (1)

· InitEvalNode End (2)

· BuildEvalNode Start (3)

· BuildEvalNode End (4)

· RunEvalNode Start (7)

· RunEvalNode End (8)

· BuildEvalNode Eliminated Empty Calculations (100)

· BuildEvalNode Subtracted Calculation Spaces (101)

· BuildEvalNode Applied Visual Totals (102)

· Serialize Results Begin (75)/Serialize Results End (77): These have no related subclass events. They mark the start and end of query results that are sent back to the client. The Serialize Results Begin event might be raised before the calculation of all of the cell values in a cellset has been completed, although it often starts just after all calculation is finished (for example, a Non Empty statement in an MDX query could force this behavior).

Monitoring queries with Performance Monitor counters

Analysis Services Performance Monitor counters are more useful for monitoring general query behavior than understanding what is happening when individual queries are run. Logging these counters can help us to understand the characteristics of all of the queries our users are running, and therefore help us to find query performance bottlenecks. Here's a list of all of the relevant categories and the important counters in them:

· MSOLAP: Cache category: The following are the counters present:

· CurrentEntries: This is the number of entries in the cache.

· Total Direct Hits: This is the number of subcube queries answered from existing cache entries. This number should be compared with Total Misses to evaluate the percentage of subcube queries that are resolved from cache.

· Total Misses: This is the number of cache misses. This number should grow after the cache has been cleared (for example, after the Analysis Services service has been restarted, or after cube processing) and should not grow much once the cache is warm. If it continues to grow, we should investigate whether there is not enough memory available for the cache, or whether the queries being run cannot take advantage of the existing cache.

· MSOLAP:MDX: This contains several counters with very detailed information, the most important of which is:

· Total NON EMPTY un-optimized: This is the number of Calculate Non Empty operations (as described earlier in the section on trace data) that are using an un-optimized algorithm. If this number continues to grow, there may be MDX queries that are running slowly for this reason and we might want to find them and optimize them.

· MSOLAP: Memory: We can use the same counters used to monitor processing operations to monitor query operations.

· MSOLAP: Storage Engine Query: This contains several counters with detailed information that we can use to measure the workload of Analysis Services; the most generic counter here is:

· Total Queries Answered: This shows the total number of queries run against Analysis Services; if this number does not increase, no queries are being run.
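The comparison between Total Direct Hits and Total Misses described above can be reduced to a single cache hit ratio. Here is a minimal sketch; sampling the counter values themselves is left to whatever collection tool is being used:

```python
def cache_hit_ratio(total_direct_hits, total_misses):
    """Fraction of subcube queries answered directly from cache.

    The inputs are the MSOLAP:Cache counters 'Total Direct Hits' and
    'Total Misses'; both are cumulative since the service started.
    """
    total = total_direct_hits + total_misses
    if total == 0:
        return 0.0  # no subcube queries yet, e.g. just after a restart
    return total_direct_hits / total
```

A ratio that stays low, or keeps falling after the cache should be warm, suggests either memory pressure on the cache or a query workload that cannot reuse existing cache entries.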

Monitoring queries with Dynamic Management Views

DMVs are not really useful for finding slow queries or understanding what happens when a query is run; trace data is much more useful for this purpose. That said, the data that DMVs return on memory usage, for example, is very useful for understanding the state of Analysis Services at a given point in time, which will have an impact on query performance.

Monitoring usage

Apart from monitoring the performance of processing and queries, another important activity is monitoring the overall usage of Analysis Services. At any given point in time, we might want to know who is connected to Analysis Services and who is running queries, for example, because we might want to warn users that we're about to restart the service for maintenance reasons. Also, in the long term, it is interesting to know which cubes and hierarchies are used and which are not, what type of queries are being run, who the most active users are, which users never use the system, and so on.

To monitor usage, we can use the same tools that we've used in the previous sections.

Monitoring usage with trace data

Collecting trace data in a SQL Server table can be a very useful way to determine the most active users, or the most frequent or expensive queries, over a long period of time. The most important event to capture is Query End/MDX Query. The Duration column for this event can be used to identify the most expensive queries, and it allows us to see the actual MDX that was run for a query. However, it's very often the case that users will run several very similar MDX queries—for example, they might run the same query, but slice by a different date—and it is very difficult to identify this kind of query pattern with just the MDX, so we may also choose to capture lower-level trace events too, such as Get Data From Aggregation or even Query Subcube.
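One way to surface such query patterns, once the trace rows have been saved to a SQL Server table, is to normalize each MDX statement before grouping, so that two queries that differ only in the member they slice by collapse to the same pattern. The sketch below is deliberately simple and only handles literal key references of the form &[...]; a real implementation would need more normalization rules:

```python
import re
from collections import Counter

def normalize_mdx(mdx):
    """Reduce an MDX statement to a pattern so that queries differing
    only in literal member keys group together."""
    # Replace key references such as [Date].[Date].&[20140131] with &[?]
    pattern = re.sub(r"&\[[^\]]*\]", "&[?]", mdx)
    # Collapse whitespace so formatting differences don't matter
    return re.sub(r"\s+", " ", pattern).strip()

def most_frequent_patterns(mdx_statements, top=10):
    """Count how often each normalized pattern occurs in the trace."""
    return Counter(normalize_mdx(q) for q in mdx_statements).most_common(top)
```

Grouping the Query End/MDX Query rows by normalized pattern, and summing or averaging the Duration column per pattern, gives a much better picture of which query shapes are expensive than looking at individual statements.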

Another important piece of information in the Query End/MDX Query event is the NTUserName column, which identifies the user who ran the query. The only issue here is that if we are using some form of application-level security, it might be that the same user is used for all Analysis Services queries. In this case, we would need to create a dedicated log at the application level because Analysis Services doesn't have enough information in the trace to identify the end user.

Monitoring usage with Performance Monitor counters

From the Performance Monitor point of view, there are specific counters that give us information about current connections and running queries. Of course, the entire Memory category of counters we saw earlier also plays an important part in any usage monitoring strategy. We may also add the following counters:

· MSOLAP: Connection

· Current connections: This is the number of active client connections at the current point in time.

· Current user sessions: This is the number of active user sessions; a connection is always tied to a session, but there could be sessions without active user connections.

More detailed information about current connections is available through the DMVs, which are discussed next.

Monitoring usage with Dynamic Management Views

DMVs are a very useful way to get a snapshot of the current state of Analysis Services. As DMVs can be queried by any client that can connect to Analysis Services, we can build Reporting Services reports that access this information directly, with minimal latency. For example, we can get a detailed list of active client connections by using the following command:

SELECT * FROM $SYSTEM.DISCOVER_CONNECTIONS

The result of this query includes the username, the client application, the date/time when the connection started, and the date/time of the last command. For each connection, there is always a user session; a list of these sessions can be obtained with the following command:

SELECT * FROM $SYSTEM.DISCOVER_SESSIONS

In the resultset, for each session we can see the last command issued, the current database, the date/time when the session started, and the date/time of the last command.

Finally, we can see a list of all currently executing commands with the following statement:

SELECT * FROM $SYSTEM.DISCOVER_COMMANDS

The resultset will contain non-user-related activity such as active traces, queries for metadata, and so on. We will probably want to concentrate on only those commands that are consuming CPU or that are engaged in read/write activity (the columns COMMAND_CPU_TIME_MS, COMMAND_READS, COMMAND_READ_KB, COMMAND_WRITES, and COMMAND_WRITE_KB provide this information).
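Filtering out the idle rows can be done on the client side once the DMV resultset has been fetched. The sketch below assumes the rows have already been retrieved (for example, via an ADOMD.NET or XMLA client) into a list of dictionaries keyed by the column names mentioned above:

```python
def active_commands(rows):
    """Keep only commands that are consuming CPU or doing I/O.

    Each row is a dict keyed by DISCOVER_COMMANDS column names;
    columns missing from a row are treated as zero.
    """
    busy_columns = (
        "COMMAND_CPU_TIME_MS",
        "COMMAND_READS", "COMMAND_READ_KB",
        "COMMAND_WRITES", "COMMAND_WRITE_KB",
    )

    def is_busy(row):
        return any(row.get(col, 0) > 0 for col in busy_columns)

    return [row for row in rows if is_busy(row)]
```

Running this against the full resultset leaves only the commands that are actually doing work, which is usually what we care about when investigating a busy server.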

Activity Viewer

A better way to query DMVs is to use a tool specifically designed for this purpose. Unfortunately, SQL Server Management Studio does not have any features to do this for Analysis Services, but there are several versions of a tool called Activity Viewer (one each for Analysis Services 2005, 2008, 2012 and 2014), which can be used to monitor Analysis Services state.

The first version of Activity Viewer was designed for Analysis Services 2005 and is part of the SQL Server Analysis Services samples. Strangely enough, it is still included in the Analysis Services community samples, even though it doesn't actually work on an Analysis Services 2008 instance.

Versions for later releases of Analysis Services can be downloaded online.




The following screenshot shows Activity Viewer displaying a list of users currently connected to an instance of Analysis Services. The tool also offers interesting functionality such as custom alerts, which can be defined by selecting the Rules option in the left-hand pane:

Activity Viewer

These tools allow us to see current connections, sessions, and commands. They also allow us to cancel a running query and this single feature makes Activity Viewer a very important addition to any cube administrator's toolkit.

Building a complete monitoring solution

In this chapter, we saw a set of tools and techniques that we can use to monitor an Analysis Services instance. We already mentioned that it is a good idea to save trace and performance counter information in logfiles or SQL Server tables, so we can perform more detailed analysis on them. The logical next step would be to build a complete monitoring solution that could provide reports, statistics, and detailed information on all aspects of performance and usage, and that would be always running in the background, collecting data.

We could use all of this data as a source for a "performance and usage data mart", and use our favorite tools, such as Reporting Services and Analysis Services, to build a BI solution for ourselves. If we have specific requirements, or we have to integrate the monitoring solution with an existing system, designing a custom solution could be a good idea: for example, we might simply build Reporting Services reports that access DMVs, or that query trace data collected using ASTrace and stored in SQL Server. However, if we don't want to start from scratch, we can use the MonitoringSSAS solution that is part of the community samples.

This solution automates the collection of performance data from trace, performance counters, and Dynamic Management Views (DMVs). Data is collected and stored using several Integration Services packages and other pieces of code, including some T-SQL stored procedures. The solution also includes some Reporting Services reports that perform basic analysis, and show performance counter values at the time when specific queries were run. We suggest customizing this solution rather than trying to build everything from scratch ourselves because it includes all the source code and could save a lot of time even if only part of the solution is used.


Summary

In this chapter, we spent most of our time describing the key resources that we have to monitor for Analysis Services: Memory, CPU, and I/O. These resources are controlled by the operating system, and for this reason, we looked at how Analysis Services interacts with the operating system and with other processes that may also be competing for resources. We discovered what the most important operating system performance counters are for monitoring an Analysis Services installation, and what tools we have available for accessing them.

After that, we explored various techniques for monitoring processing, query activity, and usage of Analysis Services. We focused on using tools such as SQL Server Profiler, ASTrace, Activity Viewer, and MDX Studio; we also looked at the detailed information provided by traces, performance counters, and Dynamic Management Views for each of the monitoring scenarios.

Finally, we discussed building a complete monitoring solution and pointed out that such a solution already exists as part of the SQL Server Community Samples, complete with source code, and that it can be used as a starting point for building your own custom monitoring solution.