
Chapter 10. Going in Production

When the Analysis Services cube development is completed, its life – from the developer's point of view – is over. Nevertheless, its real life, from the user's point of view, has just begun. During development we probably didn't care too much about what would happen once the cube was in production, but of course there are a whole new set of problems that must be dealt with when a cube has gone live. We are going to address these problems in this chapter.

The focus in this chapter is on the following topics:

· Making changes to a cube in production: These changes will still need to be made to the cube once it's in production, but we need to be very careful about how we make them in case we accidentally break reports, overwrite properties, or unprocess objects.

· Managing partitions: Partitioning our cube is good for two reasons: to improve query performance and to reduce the amount of processing we need to do. In production, new data will need to be loaded into the cube on a regular basis and this means we'll need a way of automatically creating new partitions when they are necessary.

· Processing: Having a lot of data to process forces us to find the most efficient way of processing it. If, as is usually the case, we don't have enough time available to do a full process on our Analysis Services database every time we load new data into it, we need to understand what other options are available for cube and dimension processing.

· Copying databases from one server to another: On larger implementations, we'll have several instances of Analysis Services, for example, in a network load balanced cluster. In this scenario, we will not want to process the same database more than once; instead, we'll need to look at the methods we can use for copying a newly processed database from one instance to another.

Making changes to a cube in production

The first problem we'll face once a cube has gone into production is how to make changes to it without causing disruption to our users. In a development environment, we can generally make any changes to a cube we like because we are the only users of our cube. However, we need to be more careful when deploying changes to a cube into production for two reasons:

· Some changes may result in cubes or dimensions becoming unprocessed. For example, adding a new calculated measure to a cube doesn't require any processing at all, but on the other hand, adding a new dimension means that the cube will need a Full Process. If we have a lot of data in our cube, a Full Process might take a long time, and our users will probably not want to have to wait to run their queries while this happens.

· It's likely that our solution will not match the deployed version of the cube in some important respects. We saw in Chapter 9, Securing the Cube, that roles may be added or updated on an Analysis Services database after it has been deployed; similarly, we may be using different data sources in our development environment compared to our production environment, and as we'll see later on in this chapter, it's possible that the measure groups in deployed cubes will contain extra, dynamically created partitions. When we deploy a solution in SQL Server Data Tools, we have to deploy every object in it – and if we overwrite the roles, data sources, or partitions on the production server, we will end up denying access to users and dropping data from the cube.

One method we definitely do not want to use for deploying changes to a cube in production is to edit the cube directly in Online mode. Although this might seem a quick and easy option, it will of course mean we are bypassing any source control we are using. For the same reason, we don't recommend making changes by running XMLA statements to update an object's definition.

There are two safe and commonly-used techniques for deploying changes to a production environment:

· We can use Visual Studio's project configurations feature, as detailed in this blog entry: http://tinyurl.com/gregprojconfig. For a single project, this allows us to configure different property settings, such as the connection strings used in Data Sources, for different build types like Development or Production. This is a very powerful feature, but it still does not solve the problem of overwriting roles or partitions on the production server.

· The recommended approach is to use the Deployment Wizard, which allows us complete control over what actually gets deployed to the production server. To use it, first we have to build our project in SQL Server Data Tools; we can then run the wizard, point it to the .asdatabase file that was created in our project's bin directory when we did the build, and then choose what happens to roles, partitions, connection strings, and various other properties.

Managing partitions

As we saw in Chapter 8, Query Performance Tuning, partitioning a measure group simply involves slicing it up into smaller chunks that are both easier to maintain and to query. In that chapter, we introduced and explained the basic concepts of partitioning; now, we are interested in how to manage partitions when the cube is up and running.

Measure groups are usually partitioned by the Time dimension, for example, with one partition holding one month of data. Although there are rare cases where we might want to partition a measure group based on a different dimension (for example, Geography is sometimes used), the vast majority of projects we have worked on use Time as the slicer. This follows on from the fact that new data usually needs to be loaded into the measure group at regular intervals in time, and this new data needs to be processed and stored alongside the existing data.

Clearly, since partitioning is so closely linked to the concept of time, we need to be able to build and process new partitions dynamically when we have new data to load. We could certainly build extra partitions in our Analysis Services database to handle our future needs — maybe create 48 empty monthly partitions to last for four years – but not only would this be time-consuming to do, it would also mean that in four years' time, someone would need to remember to create some more partitions. And even then, we'd still need a way of processing only the partition that contained the latest data, rather than all partitions.

Note

Note that if our partitioning strategy is not dynamic, for example, if we're partitioning by Geography, this section is of no relevance. Static partitions can be defined inside SSDT because they do not change over time.

In this section, we'll demonstrate how you can use SQL Server Integration Services (SSIS) plus some .NET code to create partitions dynamically. Here's a breakdown of what we'll do:

· We'll create a template partition within our measure group that we will use as the basis for all of the new partitions.

· We'll create an Integration Services package that loads data from a configuration database, uses this data to decide if any new partitions need to be created, and finally creates them by copying the template partition and modifying its properties.

There will be two types of partition slice: data for the most recent 12 months will be stored in 12 monthly partitions, while older data will be stored in yearly partitions. This will give us the opportunity to show how to handle more advanced problems like merging partitions.

The measure group will contain different aggregation designs for the different types of partition: monthly and yearly. Our package will apply the right aggregation design to any new partitions it creates.

Once again, all of the code for this chapter is available as part of the sample projects and databases for the book; we're only going to highlight the most interesting parts of that code here.

Relational versus Analysis Services partitioning

The fact tables upon which we build our Analysis Services measure groups are normally partitioned. Partitioning in the relational database follows rules and needs dictated by the relational data structure. A common relational partitioning model is by month, so each partition holds a month of data. When data gets too old, it is deleted or moved somewhere else, to leave space for new data.

Partitioning in Analysis Services is different in several respects:

· Partitions in the same measure group can, and often do, have different slices. For example, as in our case, older data may be partitioned at the year level, while more recent data may be partitioned at the month level. As we said, in the relational database, partitioning is normally done with only one type of slice.

· Similarly, different partitions may have different aggregation designs. Older data tends to be queried less often and at a higher level, and so partitions containing older data may need different aggregations to partitions containing newer data.

· Sometimes, not all the data from the relational database needs to be loaded into the cube. Very old data might be useful in the relational database for relational reporting purposes but dropped from the cube, for example, to make it faster to process, query, or back up.

· We may change our partitioning strategy if our requirements change. If, for example, it turns out that many queries are being run at the month level for old data, we might change our partitioning strategy so that we use monthly partitions for all of our data. This, of course, would not require any changes to the relational database, although it would mean that we would have to process all of the newly-created partitions, which might take a long time.

Building a template partition

The first step towards implementing a dynamic partitioning strategy is to create a template partition in our measure group. Since in a partitioned measure group all of the partitions are identical except for the slice of the data they contain and possibly the aggregation design they use, what we're going to do is to create a single partition that will be used as a blueprint by our Integration Services package. When the package creates a real partition in the measure group, it will make a copy of the template partition and change some of its properties so that it will contain, when processed, the appropriate slice of the fact data.

We will be implementing dynamic partitioning on the Sales measure group in our sample cube here. To do this, we need to make sure there is only one partition in that measure group, and we need to change its Binding type to Query Binding. Since this is only a template partition, we don't want it to be able to store any data, so we add a condition such as WHERE 1=0 to the WHERE clause of the SELECT statement that this partition is bound to, as shown in the following screenshot. This will ensure that, if it is processed, no rows will be read during processing and that it will contain no data.

[Screenshot: the template partition's query binding with the WHERE 1=0 condition]

Note

If we want to use the Design Aggregation wizard at this point, we will need to either remove the WHERE clause from the query, deploy the cube, and process it before launching the wizard; or set the EstimatedRows property on the partition to the average number of rows we expect to have in each partition.

Generating partitions in Integration Services

The technique we are going to use to build partitions in Integration Services requires the creation of a script task and some .NET code that uses Analysis Management Objects (AMO) to connect to the cube and create or delete partitions. AMO is a .NET class library that contains the objects necessary to perform administrative tasks on an Analysis Services database. Since script tasks do not by default contain a reference to the AMO library, the first thing we need to do after creating our script task is right-click on the References node in the Project Explorer in the Script Editor window, and add a reference to the Analysis Management Objects component. Full documentation for AMO can be found in Books Online, so we won't go into any detail on what each object does.

Here is an outline of what happens within the script task:

1. First, we retrieve information on what partitions exist in our fact table from the relational database. This will tell us what partitions need to be present in our measure group.

2. Then, we connect to Analysis Services and our measure group and find out what partitions currently exist there.

3. Next, we work out what Analysis Services partitions should be present in the measure group based on what relational partitions are currently present in the fact table.

4. The last and most complex step is to compare the list of Analysis Services partitions that should be present with the list of partitions that currently exist, and apply any changes that are necessary. This will include deleting old partitions, merging existing partitions, and creating new partitions. When creating new partitions, we simply take the template partition and call the Clone method on it to create a new, identical partition.

Each new Analysis Services partition is bound to a dynamically generated SQL query that will return all of the data from the relational partitions it covers, and other properties such as Slice are also set appropriately.
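To make step 4 more concrete, here is a minimal AMO sketch, written as the kind of C# that would sit inside the script task, showing how a new monthly partition could be created from the template. The server, database, cube, measure group, data source, and aggregation design names used here are illustrative assumptions, not the book's actual sample code.

using Microsoft.AnalysisServices;

public static class PartitionBuilder
{
    // Clones the empty template partition and binds the copy to one month of fact data
    public static void CreateMonthlyPartition(string yearMonth)
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");
        Database db = server.Databases.GetByName("Adventure Works DW");
        MeasureGroup mg = db.Cubes.GetByName("Sales").MeasureGroups.GetByName("Sales");

        // Copy the template partition and give the copy its own identity
        Partition newPartition = (Partition)mg.Partitions.GetByName("Sales").Clone();
        newPartition.ID = "Sales of Month " + yearMonth;
        newPartition.Name = newPartition.ID;

        // Bind the new partition to a query that returns just this month's rows
        newPartition.Source = new QueryBinding(
            "Adv DM",
            "SELECT * FROM CubeSales.Sales WHERE OrderYearMonth=" + yearMonth);

        // Apply the aggregation design used for monthly partitions ("MonthlyAggs" is hypothetical)
        newPartition.AggregationDesignID = mg.AggregationDesigns.GetByName("MonthlyAggs").ID;

        mg.Partitions.Add(newPartition);
        mg.Update(UpdateOptions.ExpandFull);   // send the modified measure group to the server
        server.Disconnect();
    }
}

The real package does the same kind of thing in a loop, driven by the list of relational partitions retrieved in step 1.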

Once this has happened, we will need to process any new partitions, which we'll see how to do in the next section in this chapter.

At this point, the package looks like this:

[Screenshot: the Integration Services package]

Here's what the cube looks like immediately after the deployment, with only the empty template partition present:

[Screenshot: the measure group containing only the empty template partition]

If we run the package to create partitions up to June 2004, this is the result:

[Screenshot: the measure group after the package has created the new partitions]

The package has added a lot of new partitions:

· Yearly partitions for the years 2001, 2002, and 2003. Note that the 2003 partition does not contain a whole year but only the months from January to June. Other months are covered by monthly partitions.

· Monthly partitions for the last 12 months up to June 2004.

The Sales partition, that is, the template partition, is still there and still empty.

Note

In this simple implementation, we use only a single template partition. A more complex implementation could use two different template partitions: one for monthly partitions and one for yearly partitions, perhaps with different aggregation designs associated to them.

If we script the Sales of Year 2003 partition into an XMLA query editor window in SQL Management Studio we can see its definition and understand what the package has done; here's the part of the definition where all the important properties are set:

<ObjectDefinition>
  <Partition>
    <ID>Sales of Year 2003</ID>
    <Name>Sales of Year 2003</Name>
    <Annotations>
      <Annotation>
        <Name>MonthsInsidePartition</Name>
        <Value>200301200302200303200304200305200306</Value>
      </Annotation>
      <Annotation>
        <Name>PartitionType</Name>
        <Value>YEAR</Value>
      </Annotation>
    </Annotations>
    <Source xsi:type="QueryBinding">
      <DataSourceID>Adv DM</DataSourceID>
      <QueryDefinition>
        SELECT * FROM CubeSales.Sales WHERE OrderYearMonth=200301
        UNION ALL
        SELECT * FROM CubeSales.Sales WHERE OrderYearMonth=200302
        UNION ALL
        SELECT * FROM CubeSales.Sales WHERE OrderYearMonth=200303
        UNION ALL
        SELECT * FROM CubeSales.Sales WHERE OrderYearMonth=200304
        UNION ALL
        SELECT * FROM CubeSales.Sales WHERE OrderYearMonth=200305
        UNION ALL
        SELECT * FROM CubeSales.Sales WHERE OrderYearMonth=200306
      </QueryDefinition>
    </Source>
  </Partition>
</ObjectDefinition>

There are two things to point out here:

· In order to store the list of relational partitions that are covered by this Analysis Services partition, we used an Annotation. Annotations are string values that can be stored inside many Analysis Services objects programmatically, and we can use them to store additional information not covered by the standard Analysis Services properties.

In this case, we have used annotations to store both the partition type (Yearly or Monthly) and the list of months that are stored inside the partition. For the month names, we used a string consisting of multiple six-character groups in the form YYYYMM. We will use these annotations when we execute the package in order to determine which months each partition covers without having to parse the source query.

· The QueryDefinition tag contains the SQL query that will be sent to the relational data source (in our case SQL Server) when the partition is processed. The query has been constructed to use the partition key in the most effective way: each SELECT statement returns data from exactly one partition and the results are UNIONed together.

Now, if we run the script again, moving a month forward, the script will update the partitioning structure once again. Here's what the log of the Integration Services package shows:

Year partition Sales of Year 2001 is already ok.

Year partition Sales of Year 2002 is already ok.

Merging partition Sales of Month 200307 into Sales of Year 2003

Partition Sales of Month 200308 is already ok.

Partition Sales of Month 200309 is already ok.

Partition Sales of Month 200310 is already ok.

Partition Sales of Month 200311 is already ok.

Partition Sales of Month 200312 is already ok.

Partition Sales of Month 200401 is already ok.

Partition Sales of Month 200402 is already ok.

Partition Sales of Month 200403 is already ok.

Partition Sales of Month 200404 is already ok.

Partition Sales of Month 200405 is already ok.

Partition Sales of Month 200406 is already ok.

Partition Sales of Month 200407 has been created.

The script has detected that the following changes need to be made and made them:

· The 2003 yearly partition needs to be extended by one month to include July 2003. Since this month is already present as a monthly partition, it is merged into the 2003 partition. This operation does not require that the 2003 partition be reprocessed: Analysis Services is able to merge two already-processed partitions without the need for any further processing.

When merging two partitions, Analysis Services will use the aggregation design of the destination partition (in this case, the 2003 yearly partition), but will disable any aggregations that are not present in the source partition (July 2003 here). A ProcessDefault on the destination partition will rebuild the invalidated aggregations.

· A new monthly partition for July 2004 is needed, and so it has been created.

After having performed the merge operation, the script updates all the annotations inside the partition and also the SELECT statement the partition is bound to. This ensures that the next time this partition is processed it contains the correct data.
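Expressed in AMO, the merge-and-rebind step might look something like the following sketch. The measure group is assumed to have been obtained as in the earlier partition-creation example, and the query manipulation is deliberately simplified (the real package rebuilds the query from the partition's annotations); the object names are illustrative.

using Microsoft.AnalysisServices;

public static class PartitionMerger
{
    public static void MergeMonthIntoYear(MeasureGroup measureGroup)
    {
        Partition yearly = measureGroup.Partitions.GetByName("Sales of Year 2003");
        Partition monthly = measureGroup.Partitions.GetByName("Sales of Month 200307");

        // Absorb the already-processed monthly partition into the yearly one;
        // no reprocessing is needed, but some aggregations may be left disabled
        yearly.Merge(monthly);

        // Rebind the yearly partition so that any future Process Full loads July 2003 too
        QueryBinding binding = (QueryBinding)yearly.Source;
        yearly.Source = new QueryBinding(
            binding.DataSourceID,
            binding.QueryDefinition +
            " UNION ALL SELECT * FROM CubeSales.Sales WHERE OrderYearMonth=200307");
        yearly.Update();

        // Rebuild the aggregations that the merge disabled
        yearly.Process(ProcessType.ProcessDefault);
    }
}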

There is no way to delete data from within an existing partition without reprocessing it. In the previous scenario, when we created a new monthly partition for July 2004, we might also have wanted to delete one month of data from the yearly partition for 2001 to maintain a consistent time window of data in the measure group. This would have meant we had to reprocess the 2001 partition, though - a potentially expensive operation. Instead, what the script does is only delete a partition at the start of the time window when it contains a whole year's data. This has the added benefit that any 'Same Period Previous Year' calculations that use the ParallelPeriod MDX function described in Chapter 6, Adding Calculations to the Cube, will always work properly for the second year in the time window.

Managing processing

During development, we can process cubes or dimensions whenever we want; in a production environment, processing needs more thought though. On a very simple project, we might be able to get away with doing a Full Process on our entire Analysis Services database if it only takes a few minutes to complete. However, we usually have large data volumes to manage and a limited amount of time to perform any processing, so a Full Process simply isn't feasible - we need to think carefully about how we can only process the data that needs to be processed, and do that processing in the most efficient way possible.

Analysis Services processing can be broken down into two different tasks:

· Dimension Processing: This involves loading data into a dimension and building indexes on it.

· Partition Processing: This is more complex. Before we can query a cube, all of the dimensions in the cube need to be processed, and we need to process the cube itself too. A cube is made up of measure groups, and a measure group is made up of partitions. Since partitions are where the data in a cube is actually stored, then when we talk about processing a cube, what we are really talking about is processing all of the partitions in a cube. Processing a partition involves loading data into it, building indexes, and building any aggregations specified in the aggregation design associated with this partition.

Tip

MOLAP, HOLAP, or ROLAP?

So far, we've assumed that we're always going to use MOLAP storage mode for all of our partitions. That's because 99 percent of the time we will be! MOLAP storage involves storing all data in Analysis Services' own data structures; it gives us the fastest query performance but involves the longest processing times. ROLAP storage leaves all data in the relational database, and when MDX queries are run, Analysis Services generates SQL statements to retrieve the data it needs; querying is relatively slow as a result, but processing is very fast. HOLAP storage is a cross between the two and involves storing aggregations and indexes in MOLAP mode but everything else in ROLAP mode.

ROLAP storage for partitions is only very rarely useful, typically when we really do need to see real-time data inside a cube. Even then, the poor query performance of ROLAP partitions means it can only be used with relatively small data volumes. Since fast query performance is the number one priority for any Analysis Services cube, the extra time needed for processing MOLAP partitions is usually a price worth paying. We have never seen a scenario where HOLAP storage is the right choice, so we can safely ignore it.

Similarly, MOLAP storage is almost always the right choice for dimensions. As we saw in Chapter 4, Measures and Measure Groups, ROLAP storage is usually only used for very large fact dimensions that would take too long to process in MOLAP mode.

Dimension processing

In this section, we will discuss the various different types of processing that can be performed on a dimension and what they actually do. For obvious reasons, we'll want to choose the option that loads the data we need into the dimension in the quickest and most efficient way. However, when processing a dimension, we also need to be aware of the side effects that this processing might have on any cubes that the dimension is present in. If, for example, we perform a Process Update on a dimension, this might invalidate aggregations that include hierarchies from this dimension, meaning they need to be rebuilt too. In this case and others, the side effects are more significant than dimension processing itself: the processing of a single dimension is usually a relatively fast operation, while rebuilding aggregations can take much longer.

Note

Clicking on the Impact Analysis button in the Process dialog in either SQL Management Studio or SQL Server Data Tools will list all of the objects that are affected by any type of processing on any object.

Here is a complete list of all of the options we have for dimension processing, along with a brief description of what each one does and any side effects:

· Full: The dimension's structure is deleted and rebuilt, and all dimension data is reloaded from scratch. In addition, all partitions of all measure groups related to this dimension require a Full Process before they can be queried.

· Add: Compares the contents of the relational dimension table with the contents of the Analysis Services dimension and adds any new members that are found, but does not delete or update existing members: this means that you run the risk that important changes to existing dimension members are not made. It does not invalidate existing partition data or aggregations. Also known as incremental processing, this option is not visible in SQL Management Studio or SSDT. For more information on how to use Process Add, see http://tinyurl.com/gregprocessadd.

· Data: Loads data into the dimension but does not process dimension indexes. It has the same side effects as Process Full.

· Index: Usually used after Process Data, this builds the bitmap indexes of the dimension and does not require any access to the relational database.

· Update: Compares the contents of the relational dimension table with the contents of the Analysis Services dimension, deletes any members from the Analysis Services dimension that no longer exist in the relational table, updates any members that have changed, and adds any new members. Since existing members on hierarchies with at least one flexible attribute relationship in the chain of attribute relationships down to the key attribute can change, all aggregations that include these hierarchies will be dropped; they can be rebuilt later by running a Process Default on the affected cubes or measure groups. Over time, the performance of a dimension will degrade after multiple Process Updates. As a result, if you do use Process Update regularly, it's a good idea to schedule an occasional Process Full, for example, at a weekend or public holiday when it will cause minimal disruption.

· Unprocess: Deletes all structures and data for the dimension.

· Default: Performs whatever processing is necessary to bring the dimension up to a fully processed state.

With these options in mind, here, in order of increasing complexity, are some of the possible strategies we can use for processing dimensions:

· We can run a Process Full on all of our dimensions, which of course means we then have to run a Process Full on all of our cubes. As we have already said, this is the simplest option but is only feasible if we have enough time to do it - we need to be sure that we will have enough time in the future as the data volumes in our data warehouse grow. Remember that adding hardware can have a dramatic impact on processing performance, and may be cheaper than developing and maintaining a more complex processing solution.

· We can run a Process Update on any dimensions that have changed their contents, and then run a Process Default to rebuild any invalidated aggregations. Note that Process Update can take a significant amount of time to complete on large dimensions.

· If existing members on dimensions never change, we can run a Process Add to add any new members that appear. Although Process Add can be much more difficult to configure than Process Update, it can perform significantly better. However, it can only be used in certain circumstances; for example, in scenarios where we are only making Type 2 changes to a dimension, it is safe to use. The fact that Process Add does not invalidate existing partition data or aggregations is also very important, because it means that partitions do not need to be reprocessed and aggregations do not need to be rebuilt.

· If existing members do change but we don't mind waiting for these changes to appear in the Analysis Services dimension, we can run regular Process Adds and then run a Process Update or even a Process Full on a less regular basis. This can be useful if we need to add new members to a dimension during the day with minimal impact to our users, but can wait until our nightly processing window for our dimensions to be completely updated.

For very large dimensions, consisting of several million members or a large number of attributes, splitting a Process Full up into separate Process Data and Process Index operations may allow us to make better use of available memory. Whether it actually improves processing performance or not will depend on the structure of the dimension and the server's hardware; only thorough testing will determine whether it is worth doing this. Similarly, as mentioned in Chapter 3, Designing More Complex Dimensions, setting the AttributeHierarchyOptimizedState property to NotOptimized and the AttributeHierarchyOrdered property to False may also improve processing performance, at the expense of query performance and member ordering respectively.
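For example, here is a minimal AMO sketch (the server, database, and dimension names are illustrative) of splitting the processing of a large dimension into separate Data and Index steps:

using Microsoft.AnalysisServices;

public static class LargeDimensionProcessor
{
    public static void ProcessInTwoSteps()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");
        Database db = server.Databases.GetByName("Adventure Works DW");
        Dimension customer = db.Dimensions.GetByName("Customer");

        // Load the dimension data first; indexes are not built in this step,
        // which may reduce memory pressure compared to a single Process Full
        customer.Process(ProcessType.ProcessData);

        // Then build the bitmap indexes in a second pass
        customer.Process(ProcessType.ProcessIndexes);

        server.Disconnect();
    }
}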

Last of all, remember that many structural changes to a dimension (for example, adding a new hierarchy) will mean that a Process Full is necessary. As a result, such structural changes will need to be carefully planned for.

Partition processing

As with dimensions, partitions can be processed in several ways. Processing a large partition is a very CPU intensive operation, and as a result, partition processing normally takes longer than dimension processing. We therefore have to think more carefully about how to process our partitions in the most efficient way. Luckily, partition processing is also a lot less complex than dimension processing, in that processing a partition has no side effects on any other objects.

The different options for processing a partition are as follows:

· Full: The partition is completely rebuilt. All data inside the partition is deleted and reloaded from the relational database; indexes and aggregations are rebuilt. It is equivalent to an Unprocess, followed by a Process Data and a subsequent Process Index.

· Add: Also known as incremental processing, this adds new data to the partition. After we've specified a table or SQL query containing new fact rows, Analysis Services creates a new partition, fully processes it, and then merges the new partition with the existing one. Over time, the performance of a partition will degrade after multiple Process Adds. Once again, it's a good idea to schedule an occasional Process Full if we are using Process Add regularly.

· Data: Behaves the same as Process Full but does not process indexes and aggregations.

· Index: Builds indexes and aggregations for the partition. This is normally used after a Process Data.

· Default: Does whatever processing is necessary to bring the partition up to a fully processed state. If the partition itself is in an unprocessed state, a Process Full is performed. If, on the other hand, the partition is processed but some indexes or aggregations inside it are not, only those indexes and aggregations will be processed.

· Unprocess: Deletes all structures, data, indexes, and aggregations from the partition.

There are a different set of issues to consider when choosing a partition processing strategy compared to a dimension processing strategy. Dimension data changes naturally, and these changes are something we have to plan for. On the other hand, existing data in our fact tables does not usually change over time. New rows will appear in fact tables and recently added rows may be updated. Unless older data is incorrect or we need to recalculate measure values based on new business requirements, this older data is normally static.

With these considerations in mind, let's take a look at some scenarios where the different partition processing options can be used:

· For new partitions, or where data in existing partitions needs to be completely reloaded, we have to do a Process Full. The important thing here is to set the slices for our partitions so that when we do need to do a Process Full, we do the minimum amount of processing necessary. For example, if new data is loaded into our fact tables every month, every month we can create a new Analysis Services partition to hold that new month's data and run a Process Full on just that new partition. Assuming that the data for previous months does not change, no other partitions need to be processed.

· Process Add is normally only used with the Standard Edition of Analysis Services, where we are only allowed to have one partition per measure group, and so strategies like the one described in the previous bullet are not possible. With only one partition to store all of our data, a Process Full will involve reloading all of the data from our fact table whether it is new or not, and this could take a long time. Using Process Add to load only new rows is going to take much less time, even if we still need to schedule an occasional Process Full as noted earlier.

· If our partitions are very large, processing data and building aggregations might require a great deal of memory. It might be better to perform a Process Data in parallel for all the partitions that need to be processed, and then when this has finished and the memory has been released, we can run a Process Index on them. In this way, we can separate the two processing steps and reduce memory pressure on the server while also having greater control over when load is placed on the relational database.

It is very difficult to say whether this approach will benefit processing performance – thorough testing is necessary. If memory is not an issue and we have a powerful server, then processing both data and indexes at the same time might increase parallelism because when the CPUs are waiting on the I/O operations involved in a Process Data, they can be used to build aggregations and indexes instead.

· If dimensions have been updated using Process Update, some aggregations on existing processed partitions may no longer be valid and will need to be rebuilt. In this case, running a Process Default on all these partitions will rebuild only the invalid aggregations. Therefore, if we are using Process Update and loading data into newly created partitions at the same time, our workflow needs to be as follows (a code sketch of the workflow appears after this list):

1. Run a Process Update on all dimensions that need it.

2. Build any new partitions that are necessary and perform any other partition management tasks.

3. Run a Process Default on all the partitions in the cube. This will have the same effect as running a Process Full on the new partitions, and rebuild any invalid aggregations on the existing partitions.
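A minimal AMO sketch of this workflow is shown below. It uses the CaptureXml/ExecuteCaptureLog pattern so that each batch of commands is sent to the server in a single, optionally parallel, transaction; the server, database, and dimension names are illustrative, and step 2 is only indicated by a comment.

using Microsoft.AnalysisServices;

public static class ProcessingWorkflow
{
    public static void Run()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");
        Database db = server.Databases.GetByName("Adventure Works DW");

        // Step 1: Process Update the dimensions whose contents may have changed
        server.CaptureXml = true;
        db.Dimensions.GetByName("Customer").Process(ProcessType.ProcessUpdate);
        db.Dimensions.GetByName("Product").Process(ProcessType.ProcessUpdate);
        server.CaptureXml = false;
        server.ExecuteCaptureLog(true, true);   // transactional, parallel

        // Step 2: partition management (creating, merging, deleting) would happen here

        // Step 3: Process Default every partition - new partitions are fully processed,
        // existing partitions only rebuild any invalidated aggregations
        server.CaptureXml = true;
        foreach (Cube cube in db.Cubes)
            foreach (MeasureGroup mg in cube.MeasureGroups)
                foreach (Partition p in mg.Partitions)
                    p.Process(ProcessType.ProcessDefault);
        server.CaptureXml = false;
        server.ExecuteCaptureLog(true, true);

        server.Disconnect();
    }
}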

In order to be able to query a cube, we also need to run a Process Structure on it, which usually takes just a few seconds. This can be run the first time a cube is deployed to production and will only be necessary afterwards if the cube's structure is changed. If we do this and do not process any partitions, the cube can be queried but will contain no data; we can then process partitions individually, and as each partition finishes processing, its data will appear in the cube. This behavior can be useful in a development environment if we want an exact copy of our production cube but don't want to load all of the data into it. In this case, we can run a Process Structure on the cube and then a Process Full on just one or two partitions.

Note

Processing an object will lead to the Storage Engine caches related to that object and the entire Formula Engine cache being cleared: if the data in the cube has changed, any data in the cache may no longer be valid. This is especially important if we have to perform any kind of processing during the day because query performance will suffer as a result, and the more often we process, the less benefit we will get from caching.

Lazy Aggregations

Lazy Aggregations is a processing option for partitions. Normally, when we run a Process Full on a partition, it will be available when both the Process Data and Process Index steps have completed. If we want the partition to be available sooner, we can set the Processing Mode property on the partition to Lazy Aggregations. This means that the partition becomes available as soon as the Process Data step has completed; the Process Index step will then be run in the background as and when resources are available. The price we pay for having the partition available earlier is that, until the aggregations and indexes have been built, queries against the partition will perform worse.

Using the Lazy Aggregations option, we lose control over when the aggregations and indexes for a specific partition will be built because it will depend upon several factors, most of which are out of our control. Also, Lazy Aggregations interfere with our ability to synchronize a cube, as we will see later. As a result of these drawbacks, we do not recommend the use of this option.
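If Lazy Aggregations has been enabled anywhere, it is worth being able to find out. The following sketch, which assumes a Database object obtained as in the earlier examples, simply lists any partitions still set to the Lazy Aggregations processing mode:

using System;
using Microsoft.AnalysisServices;

public static class LazyAggregationCheck
{
    // Lists any partition whose ProcessingMode is still set to Lazy Aggregations
    public static void Warn(Database db)
    {
        foreach (Cube cube in db.Cubes)
            foreach (MeasureGroup mg in cube.MeasureGroups)
                foreach (Partition p in mg.Partitions)
                    if (p.ProcessingMode == ProcessingMode.LazyAggregations)
                        Console.WriteLine("Partition {0} uses Lazy Aggregations", p.Name);
    }
}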

Processing reference dimensions

As we saw in Chapter 1, Designing the Data Warehouse for Analysis Services and Chapter 4, Measures and Measure Groups, materialized reference relationships result in a join being made between the fact table and the intermediate dimension table in the SQL generated for partition processing. This highlights the important point that materialized reference relationships are resolved during partition processing and not dimension processing.

Apart from the negative impact that materialized relationships have on partition processing performance, using them also carries a less obvious but more serious penalty: if the intermediate dimension in the relationship ever changes its structure, we have to run a Process Full on all of our partitions.

Consider the scenario we looked at in Chapter 4, Measures and Measure Groups, where a Region dimension joins to a measure group through a Country attribute on a Customer dimension using a referenced relationship. If that relationship is materialized, during partition processing, the Customer dimension table will be joined to the fact table and the key of the Country that each Customer lives in will be stored in the partition. But what happens if we run a Process Update on the Customer dimension and Customers change the Country they live in? The materialized referenced relationship data on existing partitions will not be refreshed, so Region values for these partitions will be incorrect. Even worse, if we then create new partitions and run a Process Full on them, these new partitions will contain the correct materialized relationship data, so Region values will be completely inconsistent. Only a Process Full on all partitions can ensure that queries using the Region dimension will return correct data.

This problem does not occur if the referenced relationship is not materialized. As a result, this is one more good reason not to use materialized referenced relationships.

Handling processing errors

In an ideal world, Analysis Services processing would never fail. Certainly, as we argued in Chapter 1, Designing the Data Warehouse for Analysis Services, we should always try to ensure that all data integrity issues are handled during the ETL for our relational data warehouse rather than in Analysis Services. However, in the real world, there will always be some problems with the data in our data warehouse, and we need to configure Analysis Services processing to handle them.

We can control what happens for each Analysis Services object when there is a processing error using its ErrorConfiguration property. The default setting is for processing to fail whenever an error occurs, but we can override this for different types of error and either ignore the error completely, ignore anything up to a given number of errors and then fail processing, or log the error to a text file and continue processing. We also have two options that govern what happens when we ignore an error: we can either discard the problem row in the fact table completely, or assign its values to a special, automatically generated member called the Unknown Member on a hierarchy. The existence of Unknown Members is controlled by the UnknownMember property of a dimension: it can either be set to None, Visible, or Hidden.

We can also set these properties for each processing operation by clicking on the Change Settings button in the Process dialog, and then going to the Dimension key errors tab.

[Screenshot: the Dimension key errors tab of the Change Settings dialog]

We recommend handling processing errors by setting the ErrorConfiguration property on objects, rather than having to remember to set these options every time we run processing. Discarding rows when they contain errors is probably a bad thing to do since it means our cube could potentially contain incorrect values; it is usually better to assign these values to a visible Unknown Member. This way, even if we can't see these values assigned correctly for a particular hierarchy, the totals seen at the All Member will still be correct, and when browsing the cube, we'll be able to see where exactly these errors are occurring. Remember, though, that this use of Unknown Members is purely an insurance policy in case errors get past the checks we have implemented in our ETL.

We probably don't want processing to completely fail if an error is encountered either; it's better to let processing continue and then reprocess objects later once we have fixed the data problems. Therefore, we should not set a limit on the number of errors, use the Report and Continue option for each type of error, and log these errors to a text file.
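These settings can be applied from the Properties pane in SSDT; as a sketch of the same configuration applied from AMO code (the object names, file path, and the choice of objects to configure are illustrative assumptions), it might look like this:

using Microsoft.AnalysisServices;

public static class ErrorConfigurationSetup
{
    public static void Configure(Database db)
    {
        // Make the Unknown Member visible on the dimension that will receive the bad keys
        Dimension customer = db.Dimensions.GetByName("Customer");
        customer.UnknownMember = UnknownMemberBehavior.Visible;
        customer.Update();

        // Report and continue on key errors, never stop on an error count,
        // convert the offending rows to the Unknown Member, and log to a file
        MeasureGroup mg = db.Cubes.GetByName("Sales").MeasureGroups.GetByName("Sales");
        ErrorConfiguration ec = new ErrorConfiguration();
        ec.KeyErrorAction = KeyErrorAction.ConvertToUnknown;
        ec.KeyNotFound = ErrorOption.ReportAndContinue;
        ec.KeyErrorLimit = -1;                      // -1 is intended to mean no limit on the error count
        ec.KeyErrorLogFile = @"C:\Logs\KeyErrors.log";
        mg.ErrorConfiguration = ec;
        mg.Update();
    }
}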

Managing processing with Integration Services

Just as we used Integration Services to manage our partitions, we can also use it to manage our dimension and partition processing very easily. Using Integration Services gives us the following benefits:

· Complete control over the order in which objects are processed and the type of processing used. It's very easy to implement even very complex processing strategies using the Integration Services Control Flow.

· Logging. Using Integration Services' own logging features (which we should already be using in our ETL), we will be able to record exactly what gets processed, how long each step takes, and any errors that are raised.

· If processing fails for whatever reason, Integration Services makes it very easy to do things like send an e-mail to the cube administrator to tell them that this has happened.

· Creation of new partitions. As we have shown, using Integration Services, we can use a script task to automatically manage our partitions using AMO. As a result, it makes sense to manage partition processing in the same place.

There are four main methods we can use to manage processing within Integration Services:

· Using the Analysis Services Processing Task is our recommended method. It's very easy to configure and allows us to process multiple objects either sequentially or in parallel, in a single batch. However, there may be scenarios where we might not choose to use it. For example, the list of objects that we can set the task to process is static and cannot be changed, even with Integration Services expressions. In some cases, we might only want to run a Process Update on a large dimension if we know its structure has changed; we could certainly build this kind of conditional logic into the Control Flow and have one Analysis Services Processing Task for each dimension, but there may well be easier ways of doing this, such as writing AMO code.

· We can also execute XMLA processing commands directly using the Execute Analysis Services DDL Task. This gives us complete flexibility with our processing options, but XMLA commands are not easy for humans to read, and therefore they are difficult to maintain. We can generate XMLA processing commands very easily by right-clicking on the object we want to process in the Object Explorer pane in SQL Server Management Studio, selecting Process, and then in the Process dialog, clicking on the Script button instead of the OK button.

[Screenshot: scripting a processing command from the Process dialog in SQL Server Management Studio]

· We can write .NET code using AMO to process objects, just as we did to create and delete partitions. While this works, it usually gives us no particular advantages over the other methods, will take slightly longer to develop, and will be harder to maintain. It's a good option when we have very complex logic determining what needs processing, and it would be too difficult to implement this logic in the Control Flow.

· External applications that can manage processing, such as the ASCMD utility described in the following note, can be invoked using the Execute Process Task.

Note

The ASCMD utility is a command-line tool that can execute XMLA, MDX, or DMX commands against Analysis Services. A detailed list of its functionality can be found in its readme at: http://tinyurl.com/ascmdreadme and the Analysis Services 2008 version can be downloaded from http://msftasprodsamples.codeplex.com/releases/. It's extremely useful in situations where we can't use Integration Services, but if we do have a choice, then using Integration Services is always the easiest option.

A simple Integration Services package that performs processing might look something like this:

[Screenshot: a simple Integration Services package that performs processing]

In this example, we're running our partition management tasks in parallel with our dimension processing, where we're running a Process Update on each dimension that may have changed. After both of these operations have completed, we then use another processing task to process, in parallel, all of the measure groups in our cube with a Process Default so that new partitions are fully processed and any existing partitions with invalid aggregations have their aggregations rebuilt.

Push-mode processing

So far, when we've processed dimensions or partitions, we've been sending XMLA commands to Analysis Services to initiate this. Analysis Services then generates whatever SQL is necessary and runs it against a relational data source. Finally, it loads the data returned by these queries into the object that is being processed. This is known as pull-mode processing: Analysis Services is pulling data from a data source.

There is also another, much less often used, method for processing objects: push-mode processing. In push-mode processing, we can use Integration Services to push data from a Data Flow into a partition or dimension using a Partition Processing or a Dimension Processing destination. No SQL is generated by Analysis Services in this case – we are pushing data into Analysis Services ourselves.

Push-mode processing is useful when we need to load data into Analysis Services from data sources that are not officially supported and for which Analysis Services would not be able to generate valid SQL (such as text files, or databases like Sybase or MySQL). It is also an alternative to running a Process Add on a partition, or a Process Add or a Process Update on a dimension, in situations where data needs to be loaded into Analysis Services; however, like Process Add and Process Update, regular use can lead to performance degradation, and so running a Process Full occasionally is necessary.

Proactive caching

The proactive caching features in Analysis Services allow processing to be kicked off automatically in response to changes in the relational data. When they were first introduced in Analysis Services 2005, they were very heavily hyped, but the truth is that they are almost never used in a production environment. This isn't because there are any major problems with the functionality, more that there are easier ways of handling the requirement for 'real-time' data.

In an environment where Analysis Services cubes are built on top of a traditional data warehouse, when we know when data is loaded into the data warehouse, then it's easier to schedule Analysis Services processing as part of the overall data warehouse ETL. That way, we can kick off processing when we're sure that the fact and dimension tables have finished loading, and have finished successfully. Even if the cube needs to be updated regularly during the day, scheduling that processing at regular intervals using SQL Server Agent and Integration Services means that we can be sure when that processing will take place, that it does not take place too often and therefore impact query performance, and be able to log and handle any errors appropriately.

We only recommend the use of proactive caching on less formal Analysis Services implementations, for example, when cubes are built on tables in an OLTP database. In these cases, where there is no ETL or data warehouse in place and where data volumes are relatively small, using proactive caching is often easier than having to develop and maintain something like an Integration Services package.

SSAS Data Directory maintenance

By default, all of the files used by Analysis Services to store MOLAP data are stored in the Analysis Services data directory. The location of this directory is set up during installation, and can be changed by setting the value of the DataDir server property.

As with any other type of file handled by Windows, the files in the Analysis Services data directory are prone to fragmentation. During processing, many files are created, written, and deleted. If the disk is very fragmented, these files may be split across different portions of the disk, slowing down access to them.

A regular defragmentation of the drive on which the data directory exists will improve the overall speed of the disk subsystem, which will of course have benefits for Analysis Services query and processing performance. As a result, we suggest regular disk defragmentation of the Analysis Services data directory's drive as part of the server's maintenance plan, making sure of course that it only takes place at a time when we are not processing or when users are querying the cube.

Performing database backup

It may be stating the obvious, but after processing has finished, we should always back up our Analysis Services database. This can be done very easily from Integration Services using an Execute Analysis Services DDL task; once again, you can generate the XMLA command needed to back up a database by right-clicking on it in SQL Management Studio, selecting Back Up, and then pressing the Script button on the Back Up Database dialog.
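As an alternative to scripting the XMLA, the same backup can be run from AMO, which also makes it easy to build the file name dynamically. The following is a minimal sketch; the server, database, and path names are illustrative.

using Microsoft.AnalysisServices;

public static class DatabaseBackup
{
    public static void Run()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");
        Database db = server.Databases.GetByName("Adventure Works DW");

        // The second argument allows an existing backup file to be overwritten
        db.Backup(@"D:\Backup\Adventure Works DW.abf", true);

        server.Disconnect();
    }
}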

Copying databases between servers

There are several scenarios where we may need to copy an Analysis Services database from one server to a different server. Here are a couple of examples:

· We have several frontend servers directly accessible by users in the DMZ area of the network and one backend server inside the internal network. When the cube is processed on the internal server, it needs to be duplicated to the frontend servers so it can be queried by our users.

· We have two powerful Analysis Services servers and several cubes. Half of the cubes are processed on one server and half are processed on the other. When both servers have finished processing, the databases are synchronized between the two servers so they both have the same data and can share the query load.

Using Analysis Services, we have three options to synchronize data between servers:

· Use the Synchronize XMLA command. This is the preferred method in most situations. When executed, the Synchronize command tells one Analysis Services instance to synchronize a database with a database on another instance. During the synchronization process, Analysis Services checks for any differences between the version of the database on the destination server and the version on the source server. Updated objects are copied over the network from the source server to the destination server until the destination is identical to the source.

The XMLA needed for a Synchronize command can be generated in SQL Management Studio by right-clicking on the Databases node of an instance in the Object Explorer pane and running the Synchronize Database wizard, choosing at the end to generate a script rather than running the synchronization immediately. This XMLA command can be run from an Execute Analysis Services DDL Task in Integration Services.

The synchronization operation might be slow in situations where there is a lot of data in a cube or where there are a lot of differences between the source and the destination. However, the major advantage of using this approach is that users can continue to query the destination server while synchronization takes place, although when synchronization has completed, there will be a (usually short) period where new users cannot connect and new queries cannot be run, while the old version of the database is replaced by the new version. Note that synchronization cannot be used at the same time as lazy aggregation building is taking place because there is a risk of database corruption.

· Use backup/restore. Even if this approach seems crude when compared with the more elegant synchronize, it has some advantages. For a start, we're going to be backing up our database anyway after it has been processed, so using the backup to copy the database to another server will require little extra development. Secondly, we can take the same backup file and copy it to multiple destination servers, where they can be restored in parallel.

A problem with this approach is that if we try to restore over our existing database on the destination server, we cannot query that database while the restore is taking place. One way of working around this would be to restore to a different database to the one that users are currently querying, and then to redirect all new queries to this second database after the restore has completed. This would be feasible if the only client tool used was Reporting Services, where connection strings can be parameterized, but not if a client tool like Excel was used. Another issue is that we have to copy the entire database over the network even if there is only a small amount of new data, and if the backup file is very large, this might take a relatively long time and require extra disk space.

· Use attach/detach. This method simply involves detaching the database from the source server, copying the database files over to the destination server, detaching the old version of the database on the destination server, and then reattaching the new version of the database. Since attaching and detaching a database is a very fast operation, this method involves the minimum of downtime; it also has the advantage that if a database is attached to one instance with its ReadWriteMode property set to ReadOnly, the same database files can be attached to multiple other instances in the same way.

The main problem with this approach is that when we detach a database, it cannot be queried at all, and any open connections to it are closed. This may lead to errors in some client tools and, at best, all users will have to reconnect to the server once the attach operation has completed.

Our recommendation is to use the Synchronize command if it runs fast enough and doesn't result in queries being blocked for too long. If we can't use Synchronize because it takes too long, or we don't mind that connections will be dropped, then attach/detach is the best option.
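To make the first and third options more concrete, here is a minimal AMO sketch showing a Synchronize command sent from code and a read-only attach. The server names, database name, file path, and synchronization settings are illustrative assumptions; the same XMLA could equally be run from an Execute Analysis Services DDL Task.

using Microsoft.AnalysisServices;

public static class DatabaseCopy
{
    // Run against the destination instance: it pulls any changed objects from the source
    public static void Synchronize()
    {
        Server frontend = new Server();
        frontend.Connect("Data Source=FrontEndServer");
        string xmla = @"
          <Synchronize xmlns='http://schemas.microsoft.com/analysisservices/2003/engine'>
            <Source>
              <ConnectionString>Provider=MSOLAP;Data Source=ProcessingServer</ConnectionString>
              <Object>
                <DatabaseID>Adventure Works DW</DatabaseID>
              </Object>
            </Source>
            <SynchronizeSecurity>SkipMembership</SynchronizeSecurity>
            <ApplyCompression>true</ApplyCompression>
          </Synchronize>";
        frontend.Execute(xmla);
        frontend.Disconnect();
    }

    // Detach the old copy and attach the new files (copied over beforehand) in read-only mode
    public static void AttachReadOnly()
    {
        Server frontend = new Server();
        frontend.Connect("Data Source=FrontEndServer");

        Database old = frontend.Databases.FindByName("Adventure Works DW");
        if (old != null) old.Detach();   // any open connections to this database are dropped

        frontend.Attach(@"D:\Data\Adventure Works DW.0.db", ReadWriteMode.ReadOnly);
        frontend.Disconnect();
    }
}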

Summary

In this chapter, we have discussed several topics:

· How to deploy changes to a database already in production, using Visual Studio configurations or the Deployment Wizard.

· Automating partition management using Integration Services and AMO code to create, merge, and delete partitions.

· Processing strategies. Although we always want to minimize the amount of time spent processing, there is no one processing strategy that is appropriate for all projects. In most cases, a Process Update will handle any updates to dimensions; this can then be followed by a Process Default on our partitions, which will rebuild any aggregations that have been invalidated by dimension processing on existing partitions, and fully process any newly created partitions.

· How processing can be managed using Integration Services. The Analysis Services Processing Task is the easiest way of doing this, but other Control Flow tasks such as the Execute Analysis Services DDL Task can also be used. It's also possible to push data directly into partitions and dimensions in the Integration Services Data Flow.

· Methods for copying databases between different instances of Analysis Services. The Synchronize command is the easiest way of doing this, but attaching and detaching databases is another option.