Expert Cube Development with SSAS Multidimensional Models (2014)
Chapter 3. Designing More Complex Dimensions
At this point in the cube design process we should now have a simple cube deployed and processed, and our users will be telling us what they like and don't like. Once we've taken their feedback on board, it's time to move on to tackle some of the more complex problems. In this chapter, we'll take a look at some specific issues related to dimension design:
· Grouping and banding
· Handling the different types of slowly changing dimensions
· Junk dimensions
· Modeling ragged hierarchies with parent/child hierarchies and the HideMemberIf property
Grouping and banding
Often, we'll need to create groups of some kind on a dimension either to group long lists of members up into more user-friendly groups, or to group numeric attributes such as Age or measure values into bands or ranges. Analysis Services offers some functionality to help us do this. But as usual, we'll get much more flexibility if we design these groups into the dimension ourselves.
First of all, let's consider why we might want to group members on a large attribute hierarchy. Some dimensions are not only very large—there are a lot of rows in the dimension table—but they are also very flat, so they have very few attributes on them that are related to each other and have very few natural hierarchies. We might have a Customer dimension with millions of individual customers on it, and we might also have City and Country attributes, but even then it might be the case that for a large city, a user might drill down and see hundreds or thousands of customers. In this situation, a user looking for an individual customer might have problems finding the one they want if they need to search through a very long list; some client tools might also be slow to respond if they have to display such a large number of members in a dialog or dimension browser. Therefore, it makes sense to create extra attributes on such dimensions to group members together to reduce the chance of this happening.
Analysis Services can automatically create groups for you, using the DiscretizationMethod and DiscretizationBucketCount properties on an attribute. The DiscretizationMethod property allows you to choose how groups should be created: the EqualAreas option will try to create groups with a roughly equal number of members in them, the Clusters option will use a data mining algorithm to create groups of similar members, and the Automatic option will try to work out which of the preceding two options fits the data best; theDiscretizationBucketCount property specifies the number of groups that should be created. Full details of how this functionality works can be found at http://tinyurl.com/groupingatts and while it does what it is supposed to do, it rarely makes sense to use it. The reason can be seen from the following screenshot, which shows the result of using the EqualAreas option to group a Weight attribute:
Clearly, this isn't very user-friendly, and while you could try to tweak property values to get the groups and group names you want, frankly, it is much easier to create new columns in the views we're building our dimensions from to get exactly the kind of grouping that you want. Nothing is going to be more flexible than SQL for this job, and writing the necessary SQL code is not hard—usually a simple CASE statement will be sufficient. An expression such as the following in TSQL, when used to create a new column in a view or a named calculation:
CASE WHEN Weight IS NULL OR Weight<0 THEN 'N/A'
WHEN Weight<10 THEN '0-10Kg'
WHEN Weight<20 THEN '10-20Kg'
ELSE '20Kg or more'
This yields much better results in the dimension when you build an attribute from it, as shown in the following screenshot:
In this case, the names happen to sort in the order you'd want to see them, and you might need an additional column to use as the key for the new attribute. The point is that in this situation, as in many others, a little extra time spent modeling the relational data to get it the way you want it pays dividends even when Analysis Services seems to offer you a quicker way of getting things done.
Similarly, we might need to create an entire dimension that acts as a way of grouping measure values on a fact table. For example, we might have a measure that gives us the total value of an order, and we might want to find the total number of orders whose values fall into some predefined bandings such as High Value, Medium Value, or Low Value. In this case, again we would need to create a dimension table to hold these bandings, but one problem we might have to face is that the ranges used for the bandings might change frequently as the users' requirements change—one day a High Value order might be one for more than €10,000, the next it might be more than €15,000.
If we modeled our new dimension using meaningless surrogate keys, we would have to perform a lookup during our ETL to find out which band each row in the fact table fell into and assign it the appropriate surrogate key, as shown in the following diagram:
But what would happen if the user changed the bandings? If a user does this, then we would have to reload our entire fact table, because potentially any order might now fall into a new banding. A more flexible approach is to hardcode only the granularity of the bandings into the fact table: for example, we could say that our bandings could only have boundaries divisible by €1,000. This would then allow us to use an expression in our fact table ETL such as Floor(OrderValue/100) to create a meaningful key; in our dimension, we would then create one row per €100 up to what we think the maximum value of an order might be, and then group these €100 ranges into the bandings our users wanted, as follows:
The advantage of this is that so long as the granularity of the bandings doesn't change, we will never need to reload our fact table. In Analysis Services terms, this dimension would have two attributes: one built from the meaningful key, and one to hold the name of the band; a Process Update would be all that was necessary when the banding boundaries changed because only the dimension table would have been changed.
Modeling Slowly Changing Dimensions
Slowly Changing Dimensions (SCDs) are a fact of life in almost all data warehouses; we discussed how to deal with them in our relational model in Chapter 1, Designing the Data Warehouse for Analysis Services, and we'll now look at the issues that arise with each different type when building Analysis Services dimensions. In many cases, of course, different attributes on the same dimension can be subject to different types of change and we will use a combination of the techniques outlined in the next section, but it's helpful to discuss the three most common types of change separately for the sake of clarity.
Type I SCDs
Since changes in Type I SCDs involve overwriting existing values in the dimension table, no special design is needed in Analysis Services to support this. Running a Process Update on the dimension will ensure that any changes that have been made in the dimension table are reflected in your Analysis Services dimension. We'll discuss processing in more detail in Chapter 10, Going in Production.
It's at this point that we'll see an error if any attributes have changed that should not have changed, that's to say, if any members on attribute hierarchies which have Rigid relationship types have changed position in the dimension relative to each other. For example, if on the Time dimension, there were Date and Month attributes that had a Rigid attribute relationship defined between them, and the Date January 1st 2001 somehow got moved from under the Month January 2001 to the Month March 2002, we'd expect to get an error because something has clearly gone wrong. It's also possible that a Type I change will result in the data in the dimension not being consistent with the attribute relationships we have defined, and what we thought to be a one-to-many relationship, turns out to be a many-to-many relationship. For example, we might have defined a chain of attribute relationships on a Customer dimension going from Continent to Country to City, but then realize that there is an exception to this rule when the cities of Istanbul and Ankara are added: both are in Turkey, but Istanbul could be listed as being in Europe and Ankara in Asia.
You might think that an update to a dimension that invalidates an attribute relationship would raise an error during dimension processing, but it doesn't always. If you run a Process Full command on a dimension with invalid attribute relationships, and you specifically configure processing error handling to trap duplicate key errors, then processing will fail. However, a Process Update will complete successfully in this situation, regardless of how you configure processing error handling, leaving the dimension in a dangerous state where incorrect data could be returned from the cube.
Another thing we'll see is that any aggregations on measure groups that this dimension joins to that might have been invalidated by the processing are dropped; they'll need to be rebuilt later on in your processing schedule. These aggregations are called Flexibleaggregations, and they're aggregations that include at least one attribute that has a Flexible relationship somewhere in the chain of attribute relationships between them and the key attribute. We can see whether an aggregation is Flexible or Rigid in the Edit Aggregations dialog of the BIDS Helper Aggregation Manager, as shown in the following screenshot:
One problem that may result from Type I changes and which may not become apparent immediately, is that previously saved queries and reports will either stop working or not return the data you expect. In MDX, a member is identified by its unique name and this is how it is referenced when you write an MDX query; the following are some examples of unique names:
· [Customer].[Customer Geography].[City].&[Alexandria]&[NSW]
· [Date].[Calendar Year].&
The format of the unique name will vary depending on the type of hierarchy and various property settings on the dimension, but as we can see from the preceding examples, the uniqueness of a a unique name usually derives from values in the columns you've used in the KeyColumns property of the attribute hierarchy. If a Type I change alters one of these values then that will carry through to a change in the unique names of one or more members on hierarchies in the dimension. If a saved query references the old version of the unique name, one of two things will happen:
· If we have set the MDXMissingMemberMode property of the dimension to Ignore (and this is what the Default option means, too), then any queries which refer to the old version of the unique name will simply ignore that member. This would result in rows or columns disappearing from queries or reports with no warning.
· If we have set MDXMissingMemberMode to Error, then any queries that refer to the old version of the unique name will raise an error when they are run. At best, this will mean you have to manually fix the MDX query so that the old version is replaced with the new version; in some Analysis Services client tools, it will mean you have to delete the entire query or report and recreate it from scratch.
Type II SCDs
Type II SCDs represent a more difficult problem: unlike some other OLAP servers, Analysis Services has no explicit support for different versions of the same dimension, and it's up to us as developers to decide how we should design our dimension to handle such changes.
When a Type II change takes place, a new row is added to a dimension table and this in turn results in the addition of a new member on the key attribute of the Analysis Services dimension. This has some positive implications: if we only allow Type II changes on a dimension, because we are only adding new rows and not updating existing rows, we can do a Process Add on our dimension which will be faster than a Process Update, and we can set all our attribute relationships to Rigid so we do not need to rebuild any aggregations after processing. However, modeling the attributes and attribute relationships on a Type II SCD can require some thought.
Modeling attribute relationships on a Type II SCD
Let's consider a Customer dimension table, which features two columns common to many Type II SCDs:
· A surrogate key column, CustomerKey, which is the primary key on the table.
· A business key column, CustomerAlternateKey, which identifies a single customer. If several Type II changes take place there may be several rows, each with different surrogate keys, for each customer; each of these rows, though, will have the same value forCustomerAlternateKey.
Two attributes should be built on the Analysis Services Customer dimension built from the discussed table:
· One which uses the surrogate key in its KeyColumns property, let's call Customer SCD in the following discussion
· One which uses the business key in its KeyColumns property, which we'll call Customer
Customer SCD will be the key attribute of the dimension and an attribute relationship should be built to reflect the one-to-many relationship between Customer and Customer SCD. We can then build other attributes on the dimension and configure attribute relationships so that those that will never change have attribute relationships that run through the Customer attribute and those that will be subject to Type II changes have attribute relationships that bypass Customer and go straight to Customer SCD. Here's an example of what a first attempt at modeling these attribute relationships might look like:
Given that the same customer may appear many times in the Customer SCD attribute hierarchy after Type II changes have taken place, and that the end user may be confused by this, it makes sense to hide the Customer SCD attribute hierarchy by setting itsAttributeHierarchyVisibleState to False. The user can instead use the Customer attribute hierarchy to view individual customers. However, it's also likely that we would want to build a user hierarchy that would allow users to drill from Country to City and down toCustomer. With the attribute relationships modeled previously, we would have two options, neither of which would be ideal:
· Build the user hierarchy with levels based on the Country attribute hierarchy, the City attribute hierarchy, and the Customer SCD attribute hierarchy. This would be a natural user hierarchy so you would get optimal performance, especially important on a dimension such as Customer that can be very large. But, what would happen if a customer got married, and this created a new row in the dimension table? A new member would be created on Customer SCD, so when the user drilled down from the City level they might see the same Customer listed twice, once in their unmarried state and once in their married state; clearly not a good thing.
· Build the user hierarchy with levels based on the Country attribute hierarchy, the City attribute hierarchy, and the Customer attribute hierarchy. This would ensure the user only ever saw one instance of a customer per city, but it would also be an unnatural user hierarchy and would therefore perform worse.
One solution to this problem is to create an extra attribute hierarchy to represent unique combinations of individual customers and cities, with a composite key based on the CustomerAlternateKey business key and the key used for the City attribute hierarchy. This would allow us to model the attribute relationships as follows:
We would then set AttributeHierarchyVisible to False on this new attribute hierarchy and use it as the bottom level of your user hierarchy. This would guarantee that we only saw one instance of a Customer appear under each City and since it would be a natural user hierarchy, we would have the best possible performance. The price we would pay for this would be that your dimension processing time would increase because of this new, potentially large attribute; we would have to decide whether this was a price worth paying.
Another benefit of this solution is that the Customer SCD attribute hierarchy is never made available for direct querying to the end user. It's still necessary to have it there from a modeling point of view, but the fact it is hidden can be useful: since its key is based on the surrogate key of the dimension table, the unique names of the members on it therefore will incorporate surrogate key values. Although not necessarily the best practice, it is sometimes the case that a data warehouse needs to be completely emptied and reloaded, and if that happens, it's highly likely that a completely new set of surrogate key values will be used in a dimension. If this were to happen it would break existing reports, or worse, cause them to return unexpected data, and for this reason it's a good idea to avoid exposing surrogate key values through MDX unique names where possible.
Handling member status
Type II SCDs typically also have three columns in their dimension tables to track the status of a row: a column that tells us whether this is the most recent, or current version of the row, perhaps containing either the values Current or Previous; and start and end dates that show you the time span for which each row was the current row. It can be useful to include this information in your Analysis Services dimension, although it's by no means always necessary or even a good idea.
Building an attribute from the Current version column is straightforward: we get an attribute hierarchy with an All Member and two members underneath it for the True and False values. Doing this will, for instance, allow us to write queries and calculations that filter our Customer SCD attribute to retrieve only the current versions of each Customer so we can answer questions, such as "How many of my customers are married?" We may not want to make this attribute visible to our users though because it will also slice measure values so they only show values for the current version of each Customer, which probably isn't very helpful.
For the start and end date columns, at this point the only recommendation to make is that rather than storing them as DateTime types, it's better to use the same integer surrogate key format we're using for your Time dimension surrogate keys. This will give us the option of creating role-playing dimensions based on our Time dimension which can then be joined to your SCD on these columns with a Referenced relationship, which again will allow us to filter our Customer SCD attribute to only return those members that were current within a specific date range. It will also, if required, provide the platform for building even more sophisticated models for analyzing the changes made to a dimension over time, such as those described in the Cross Time and Transition Matrix sections of the white paper The Many-to-Many Revolution, available at http://tinyurl.com/m2mrev.
The Type property of an attribute hierarchy can be set to SCDStatus, SCDStartDate, and SCDEndDate for these three columns. These properties have no effect on the behavior of Analysis Services and are purely descriptive.
Type III SCDs
A Type III SCD uses two columns to capture the current and either the previous or original states of something—it allows us to store history, but only a limited amount; probably as a result this approach is very rarely used in the real world. In Analysis Services, this translates to two different sets of attributes; for example, in our Customer dimension example, we might have four attributes for Current City, Current Country, Previous City, and Previous Country, with the attribute relationships modeled as follows:
The two sets of attributes we have described for a Type III dimension, Current and Previous, are good candidates for being grouped together in folders using the AttributeHierarchyDisplayFolder property. Folders can be an important way of improving the usability of dimensions with large numbers of attribute and user hierarchies.
Modeling junk dimensions
As we've already seen, junk dimensions are built from groups of attributes that don't belong on any other dimension, generally columns from fact tables that represent flags or status indicators. When designing an Analysis Services solution, it can be quite tempting to turn each of these columns into their own dimension, having just one attribute, but from a manageability and usability point of view, creating a single junk dimension is preferable to cluttering up your cube with lots of rarely-used dimensions. Creating a junk dimension can be important for query performance too. Typically, when creating a junk dimension, we create a dimension table containing only the combinations of attribute values that actually exist in the fact table—usually a much smaller number of combinations than the theoretical maximum, because there are often dependencies between these attributes, and knowing these combinations in advance can greatly improve the performance of MDX queries that display more than one of these attributes cross-joined together.
As with a Type II SCD, when building an Analysis Services dimension from a junk dimension table we will create a key attribute based on the surrogate key of the dimension, which we can then hide by setting AttributeHierarchyVisible to False. All other attributes on the dimension can be directly related to this attribute, unless of course we can be 100 percent certain that attribute relationships can be defined between other attributes. The attribute relationships might look something like the following:
It's sometimes the case that users will request that one or more attributes from a junk dimension will be split off into a new dimension, and you will need to be careful if you do this because of the risk that existing queries that reference the junk dimension will break as a result of invalid MDX unique name references.
Modeling ragged hierarchies
Ragged hierarchies are another common design problem to deal with when building an Analysis Services dimension. The hierarchies we've dealt with so far can be easily separated out into distinct levels and can be thought of as pyramid-shaped: all of the members on the hierarchy have at least one child, except the members on the lowest level, which have no children at all. Ragged hierarchies, on the other hand, are bush-shaped. The members at any given level may or may not have children. Common examples of ragged hierarchies are those that represent a chart of accounts or the organizational structure of a company.
Modeling parent/child hierarchies
One way of modeling a ragged hierarchy in a data warehouse is with a table with a self-join: every row in the table represents an item somewhere on the hierarchy, and every row has a key column and a foreign key that joins back onto the key column in order to store the key of the parent item. Here's an example taken from the Adventure Works data warehouse:
Analysis Services can build a special kind of hierarchy from this type of table called a parent/child hierarchy, and will do so automatically if you run the New Dimension Wizard on a table with the appropriate relationship defined in the DSV and build attributes from the key and parent key columns. Once we've done this, we'll have a dimension with two attribute hierarchies:
· The key attribute built from the primary key of the table, which is a flat list of all the members in the table.
· The attribute built from the parent key column, which is the parent/child hierarchy and has the same members but is organized into a ragged hierarchy. You'll see that the Usage property of this attribute has been set to Parent.
Typically, we will set the AttributeHierarchyVisible property of the key attribute to False and only let users see the parent/child hierarchy in the dimension.
We can give the levels on a parent/child hierarchy names by setting the NamingTemplate property. This property takes the form of a semicolon delimited list of names, which are used as the names of levels, starting from the top of the hierarchy: asterisks in the names are replaced with the ordinal of the level. Thankfully, there is a dialog box that allows you to construct these strings easily.
There is, unfortunately, a long list of limitations when it comes to using parent/child hierarchies:
· You can only have one parent/child hierarchy per dimension.
· A parent/child hierarchy has to be built from the key attribute of the dimension.
· You cannot use a parent/child hierarchy as a level in a user hierarchy.
· Dimensions with parent/child hierarchies are slower to process than those without.
· Parent/child hierarchies can be very difficult to work with when writing MDX calculations and queries because they do not obey the same rules as regular hierarchies. Scoped assignments, auto-exists, and any calculation that uses the Currentmember function are a particular problem.
· The unique names of parent/child hierarchies include surrogate key values, which as we've already noted, can be a problem if you need to reload the dimension table.
· Use of a parent/child hierarchy in a query can lead to very poor query performance. In general, performance of small parent/child hierarchies of up to a few hundred members is fine, but as soon as you have more than a few thousand members on a parent/child hierarchy, the performance can suffer badly.
· Following on from the previous point, you cannot build an aggregation at the granularity of a parent/child attribute hierarchy; you can only build aggregations above the granularity of the hierarchy or ones at the granularity of the key attribute. It is not possible to build aggregations on any of the intermediate levels that appear in a parent/child hierarchy. This makes it more difficult to tune query performance, although it should be noted that it is not the sole cause of the poor query performance of parent/child hierarchies.
We can work around some of these problems (notably, not being able to build multiple parent/child hierarchies on the same dimension; having to build them from the key attribute; and surrogate keys being used in unique names) by creating multiple Analysis Services dimensions from the same physical dimension table. We would create a dimension to hold just the surrogate key and then build separate dimensions for each parent/child hierarchy and use non-surrogate key columns on the dimension table as the key attribute for them; these then would be linked to the main measure group by using a Referenced relationship (which we'll talk about in the next chapter) via the new surrogate key dimension. Doing this, however, is likely to have an impact on the processing performance if you materialize the referenced relationships or query performance if you don't materialize the referenced relationships.
We recommend that you avoid setting the IsAggregatable property of a parent/child hierarchy to False whenever you can. If you do, then every query will be forced to the granularity of the key attribute of the dimension with the parent/child hierarchy, even when the query doesn't use the parent/child hierarchy. Setting IsAggregatable to True though means that an All Member will be created on the hierarchy and an aggregation can be built at that granularity. For the same reason, we also recommend that you think very carefully before setting the DefaultMember on the parent/child hierarchy.
Avoid using parent/child hierarchies!
For all of the reasons discussed, we recommend that you avoid using parent/child hierarchies unless you have absolutely no other choice but to use them. In some cases you will have to use them, and in some cases using them will not cause any problems, but in our experience when problems occur with parent/child hierarchies (and this is usually when they have a large number of members) then they can be quite serious. In situations where you know there is a fixed maximum number of levels in the hierarchy, you should attempt to flatten out the relational data so that you can build separate attributes for each level. The Analysis Services Parent/Child Dimension Naturalizer utility, available for download at http://tinyurl.com/pcdimnat and also incorporated in BIDS Helper, will help you do this. Note that designing a flattened dimension equivalent to a parent/child dimension can be a very time-consuming task, so consider building a true parent/child hierarchy, then using the discussed tool to naturalize it.
Ragged hierarchies with HideMemberIf
The alternative to using a parent/child hierarchy is to build a regular user hierarchy and then hide certain members in it to make it look ragged by setting the HideMemberIf property on the levels of the user hierarchy. HideMemberIf has a number of possible settings: hide a member if it has no name, hide a member if it has the same name as its parent, hide a member if it has no name and is the only child, and hide a member if it has the same name as its parent and is the only child. A common situation when you would want to use this functionality is on a geography hierarchy with levels such as Country, State, City and Customer; some countries are subdivided into states but some aren't, so it would make no sense to show the State level for those countries; some countries are so small that it makes no sense to show either State or City.
The confusing thing about this functionality is that it needs the data in a dimension to be modeled in a particular way for it to work properly in all client tools. Here's an example of the relational source data for a dimension modeled in the correct way, as shown in the following screenshot:
We can then set HideMemberIf to OnlyChildWithParentName on all of the levels of the user hierarchy, and so be able to make the hierarchy appear as follows in your client tool:
Notice how, for Monaco, the hierarchy only shows three levels, whereas for the UK and Italy there is a fourth level visible showing the State. The important point to make here is that this only worked because the value Prince Albert has been copied up fromCustomer to City, and the value Monte Carlo has been copied up from City to State. If we model the data differently, for example, as shown in the following screenshot:
Here, we will find that we may get our ragged hierarchies to appear correctly in some client tools, but not in the dimension browser in SSDT and not in Excel. This is because the connection string property MDX Compatibility=2 needs to be set to make ragged hierarchies work properly in all cases, and you cannot set this property in Excel—even if you modify the data source file manually.
Since the same column is used as the key and the name of an attribute in many cases, it's better to use the HideMemberIf options, which hide members when they have the same name as their parent rather than blank names. Blank names could mean empty strings are being used as keys, and this could in turn restrict your ability to build good attribute relationships. For example, if you had two customers in different countries but left both with empty strings for their State and City, the empty strings would be interpreted as the same distinct member (with an empty string as the name and key) which appeared under both countries. Therefore, you would be wrong to say there were one-to-many relationships between Country and State, and State and City.
Using HideMemberIf on a ragged hierarchy also carries a query performance penalty, although in general, it still performs better than using a parent/child hierarchy.
In this chapter, we discussed how to deal with some common problems we might face when designing our Analysis Services dimensions: grouping and banding members and numeric values; slowly changing dimensions; junk dimensions; ragged hierarchies; and parent/child hierarchies. We found that we can create new attributes on a dimension to group members and create bandings for numeric values using some simple SQL. This can have a positive impact on the user's experience. We also found that with a bit of thought, the three different types of SCDs can all be modeled satisfactorily. With ragged hierarchies, we learned that the parent/child hierarchies are very flexible, but have several problems, especially with query performance, and should be avoided where possible. We also saw that using a regular user hierarchy and the HideMemberIf property is a better option, although not without its own limitations.
In the next chapter, we will stay on the topic of modeling and look at how common business requirements can be handled with measures and measure groups.