CHAPTER 7 Important Roles of Data Stewards

Discusses the key roles that Data Stewards play in improving data quality, managing reference data, resolving identities in Master Data Management, classifying data for information security, supporting quality assurance (QA), and managing metadata.

Keywords

roles; data quality; identity resolution; metadata management; reference data; rules

Introduction

Data Stewards play an extremely important role in many of the enterprise processes and initiatives that work with the data. These roles include inspecting and improving data quality, managing reference data, establishing the identification of master data records, establishing survivorship rules for Master Data Management (MDM), handling exceptions that come out of the MDM engine, specifying how data elements should be classified for security and privacy purposes, supporting quality assurance (QA), and overall metadata management. Without the involvement of Data Stewards, these efforts can make incorrect decisions, go down blind alleys, and even implement “solutions” that don’t meet the needs of the data users in the company.

In this chapter, we will discuss how Data Stewards contribute and guide these processes and play a key role in the initiatives.

Role of Data Stewardship in Improving Data Quality

Even if you start out with perfect data (and who does?), data quality can deteriorate over time without rigorous monitoring and input from the Data Stewards. In many cases, no one is even sure what the quality of the data is because the data quality has not been measured and compared against a defined set of data quality rules.

Stating a set of requirements (rules) for data quality is a crucial part of establishing the quality of the data. That’s because regardless of who you listen to (including the ISO 9001 standard), quality is measured against a set of requirements. The requirements are driven by how the business will use the data, or how the business wants to use the data. As Jack Olson says in his book Data Profiling: The Accuracy Dimension (Morgan Kaufmann Publishers, 2003): “You cannot tell if something is wrong unless you can define what being right is.”

The basic tenets of enforcing data quality—and the stewardship role in that endeavor—are illustrated by a story from my career as a data quality manager at a large bank. A while back, a friend of mine (a technical manager in IT) was in charge of the enterprise data warehouse at the bank. One day at a meeting, one of the business leads asked her why the data warehouse couldn’t deliver good-quality data. She responded that she would love to, but to do so, she needed to know three things:

- What the business meant by data quality. That is, she needed the rules to apply to tell her when the data did not meet the business requirements.

- What to do with the data that did not meet the data quality rules. Did she continue to load the data? Stop the load? Do something else?

- Who to tell about the data quality issues discovered. She needed to know who was responsible for dealing with what was discovered. She realized that just “knowing” of the failure didn’t actually accomplish anything—someone had to do something with that information.

The Business Data Steward responsible for the data in question supplies the answer to the first question (data quality rules) and is the person notified in answer to the third question (who to tell). The Business Data Steward also has to provide input on the second question, deciding which failures are important enough to disrupt the load.

There are two key points to this story. The first is that to measure data quality, you need to define what you mean by quality. The second is that you need accountable decision makers to respond to violations of the data quality rules. Just knowing isn’t enough—you need to correct!

Note

The quality of the data is often stated in terms of data quality dimensions. For example, the first bullet in the section Preventing the Deterioration of Data Quality discusses two of these dimensions: accuracy and completeness. Table 7.1 defines the most common dimensions of data quality.

Table 7.1

Data Quality Dimensions for Data Stewards


Preventing the Deterioration of Data Quality

There are many reasons why data quality deteriorates or is unknown:

- In many cases, data producers are incented to be fast, but not necessarily accurate. Any time data producers are paid solely based on how many new customers, new policies, new accounts, and so on they can enter, they will figure out how to “game” the system. Fields that can be left blank will be left blank, default values will be left unchanged, and so on. There are few cases in which data producers have an incentive to produce better data for downstream users. To change this behavior, Data Stewards must champion a change in business priorities so that data producers are incented to be accurate and complete as well as fast. Completeness (one of the data quality dimensions) needs to be defined by the stewards so that the data that gets filled in serves the needs of all the data users, and not just the direct beneficiaries of the data entry effort.

- Individual data users may make their own “corrections.” If data consumers perceive that data quality is insufficient for their needs, they may extract data and apply their own changes, thinking that those changes improve the quality. They may then pass that data along to coworkers, propagating their errors. The problem is that the changes may not improve the quality at all because the data still does not meet the needs of the downstream users. The data consumer may not understand what constitutes good quality, or even what the data actually is. Of course, if they have the meaning of the data wrong (i.e., poor-quality metadata), then anything they surmise or change will be incorrect. To change this situation, Business Data Stewards need to make sure that the data is defined and that data quality rules are defined as well. If the data is insufficient for a data consumer, there needs to be a way for the consumer to engage with the Data Steward to raise an issue and have the quality improved.

- Poor-quality data is not detected proactively. Without Data Stewardship, poor-quality data is discovered only at the point where someone wants to use the data and finds that it won’t meet their needs. There generally follows a mad scramble to “fix” the data (often without fixing the root cause of the poor quality) or to find other data to use. Meanwhile, a business opportunity may be evaporating due to data that is of insufficient quality to take advantage of the opportunity. Business Data Stewards can make the business case (and provide the impetus) to enforce data quality rules at the point of origin and during data loads, as well as to put in place a process for remediating the quality issues that come to light when the data does not conform to the defined rules (data quality rule enforcement).

- Data quality rules are not defined. It is pretty hard to determine the quality of data if the rules that define the quality have never been defined! Without rigorous, measurable rules, data quality is largely anecdotal, with varying opinions on the level of quality. Business Data Stewards can define the data quality rules, and work with the stakeholders to ascertain the level of quality needed for various purposes (in context).

What is Data Quality Improvement in Context?

Most people talk about improving the quality of data as an absolute—the data starts out with bad quality, and you improve it to make it have good quality. But it is not quite that simple. This approach raises two critical questions:

- How do you allocate scarce resources to improve data quality? As data quality issues arise, you may find that you don’t have the resources to fix them all, so you will need to prioritize them. A business case (including cost–benefit analysis and return on investment, ROI) needs to be made to establish those priorities and correct data quality issues. The most compelling case is usually to correct regulatory or compliance issues, followed by gaining competitive advantage by being able to use data that was not usable prior to the quality improvement effort. Similar to the guidelines for choosing key business data elements (discussed in Chapter 6), other priorities include issues raised by company executives and issues that need resolution by high-profile projects.

- How do you know when you’re done? That is, at what point has the data quality been improved enough that the data becomes usable for the intended purpose? This threshold will vary for different purposes. For example, cleaning up addresses for the purpose of sending out marketing fliers probably has a lower quality threshold than if the clean addresses are used to correctly identify patients for medical purposes. Thus, setting the level of quality in context (for the intended purpose) is crucial. Obviously, spending time and money to improve the quality beyond what is needed is not an intelligent use of scarce resources. By having the Business Data Stewards (in concert with the stakeholders) establish the level of quality needed, you can tell when you’re done and should move on to the next task.

To define the data quality rules and establish the thresholds in context, Business Data Stewards need to work with data stakeholders. It is the stakeholders who care about the quality, and who are impacted by poor quality. It is also the stakeholders who can establish the value to the company of improving the quality, and how that value is realized. For example, improving the quality for the country of residence for account holders at a bank could reduce or eliminate anti-money laundering fines.

Note

It is often the case that Business Data Stewards are themselves stakeholders. In fact, choosing a Business Data Steward who is a major stakeholder in the data he or she stewards is a good idea!

Importance of Improving Data Quality to the Overall Data Stewardship Effort

The improvement of data quality is one of the most visible and impactful results of a Data Stewardship effort—in fact, many feel it is the fundamental purpose of Data Stewardship. Whereas deliverables like repeatable processes, logistics, and workflow mainly benefit the efficiency of the Data Stewards (which is important, make no mistake), improved data quality (and improved metadata quality) enables the company to operate more efficiently and take advantage of opportunities to get more value out of its data. Indeed, much of the ROI for Data Stewardship and Data Governance revolves around improvement in data quality.

Understanding the Dimensions of Data Quality

As the Data Stewards move forward to define the data quality rules, it is helpful to understand the dimensions of data quality. Data quality dimensions are ways to categorize types of data quality measurements. In general, different dimensions require different measurements. For example, measuring completeness requires analyzing whether there is a value (any value) in a field, whereas measuring validity requires comparing existing format, patterns, data types, ranges, values, and more to the defined set of what is allowable.

While many dimensions have been listed and defined in various texts on data quality, there are key dimensions for which Business Data Stewards can relatively easily define data quality rules. Once the conformance to those rules has been measured by a data profiling tool, data quality can be assessed by Data Stewards. As mentioned earlier, the data quality necessary for business purposes can be defined, and data profiling will tell you when you’ve achieved that goal.

The data quality dimensions that are most often used when defining data quality rules are listed in Table 7.1.

Gauging Metadata Quality

Many of the same or similar dimensions of data quality can be used to gauge the quality of the metadata collected and provided by Business Data Stewards. Measuring the metadata quality by these dimensions helps to quantify the progress and success of the metadata component of Data Stewardship. The metadata quality dimensions may include:

- Completeness: Are all the required metadata fields populated? For example, a derived data element may have a definition but not a derivation, and is thus not complete.

- Validity: Does the content meet the requirements for the particular type of metadata? For example, if the metadata is a definition, does the definition meet your published standards and make sense?

- Accessibility: Is all that metadata you’ve worked so hard on easily available and usable to those who need it? Do they understand how to access metadata (e.g., clicking a link to view an associated definition)? Do they understand the value of using the metadata?

- Timeliness: Is the metadata available and up to date at the time someone is attempting to use it to understand the data he or she is accessing?

- Consistency: Is all the metadata consistent, or does it conflict with other metadata? For example, does a usage business rule demand that data be used when the creation business rules state it hasn’t been created yet?

- Accuracy: Is the metadata from a verifiable source, and does it describe the data it is attached to correctly? For example, if the metadata is a definition, does the definition correctly define the data element in question?
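As a simple illustration (not taken from the book or any particular tool), metadata quality along the completeness and validity dimensions could be scored with a short script such as the sketch below; the required fields, the minimum definition length, and the sample glossary entries are all assumptions made for the example.

```python
# Minimal sketch: scoring metadata completeness and validity for glossary entries.
# The required fields, the validity heuristic, and the sample entries are illustrative assumptions.

REQUIRED_FIELDS = ["name", "definition", "steward"]

def completeness(entry: dict) -> float:
    """Fraction of required metadata fields that are populated."""
    filled = sum(1 for field in REQUIRED_FIELDS if entry.get(field))
    return filled / len(REQUIRED_FIELDS)

def is_valid_definition(entry: dict, min_words: int = 5) -> bool:
    """Crude validity check: the definition exists, meets a minimum length,
    and is not just a restatement of the term's name."""
    definition = (entry.get("definition") or "").strip()
    return len(definition.split()) >= min_words and definition.lower() != entry.get("name", "").lower()

entries = [
    {"name": "Marital Status Code", "definition": "Code indicating the customer's marital status.", "steward": "J. Smith"},
    {"name": "Credit Score", "definition": "", "steward": None},
]

for entry in entries:
    print(entry["name"],
          f"completeness={completeness(entry):.2f}",
          f"valid_definition={is_valid_definition(entry)}")
```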

Specifying the Data Quality Rules

Data quality rules serve as the starting point for inspecting what is actually in the database (data profiling). A data quality rule consists of two parts:

- The business statement of the rule. The business statement explains what quality means in business terms (see following example). It may also state the business process to which the rule is applied and why the rule is important to the organization.

- The data quality rule specification. The data quality rule specification explains what is considered good quality at the physical database level. This is because data profiling examines the data in the database. The analysis portion of the data profiling effort then compares the database contents to the data quality rule.

For example, a rule for valid values of marital status code for a customer might look like:

Business statement: The marital status code may have values of single, married, widowed, and divorced. It may not be left blank. A value must be picked when entering a new customer. The values for widowed and divorced are tracked separately from single because risk factors are sensitive to whether the customer was previously married and is not married anymore.

Data quality rule specification: “Customer.Mar_Stat_Cd” may be “S,” “M,” “W,” or “D.” Blank is considered an invalid value.

The data quality rule has to be very specific as to what data element it applies to. In the previous example, it would have been necessary to specify in which system (and possibly in which database) the table and column existed that were being tested. In a different system (with a different business process), the data quality rule may look quite different, as in this example for an employee:

Business statement: The marital status code may have values of single and married. It may not be left blank. A value must be picked when entering a new employee. Only single and married are needed as a check for benefits selected.

Data quality rule specification: “EmployeeDemographics.Marital_Cd” may be “Sng” or “Mrd.” Blank is considered an invalid value.
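To make the connection between the rule specification and the profiling step concrete, the following sketch shows how a valid-values rule such as the ones above might be checked in code. The column name follows the customer example; the sample records and the reporting format are assumptions made for illustration.

```python
# Minimal sketch: checking a valid-values data quality rule such as
# "Customer.Mar_Stat_Cd may be 'S', 'M', 'W', or 'D'; blank is invalid."
# The sample data and reporting format are illustrative assumptions.

def check_valid_values(rows, column, allowed):
    """Return the rows whose value in `column` violates the rule (blank counts as invalid)."""
    violations = []
    for row in rows:
        value = (row.get(column) or "").strip()
        if value not in allowed:          # blank or unexpected code
            violations.append(row)
    return violations

customers = [
    {"Customer_Id": 1, "Mar_Stat_Cd": "M"},
    {"Customer_Id": 2, "Mar_Stat_Cd": ""},     # blank: invalid
    {"Customer_Id": 3, "Mar_Stat_Cd": "X"},    # not in the allowed set
]

bad = check_valid_values(customers, "Mar_Stat_Cd", {"S", "M", "W", "D"})
print(f"{len(bad)} of {len(customers)} records violate the rule:", bad)
```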

Another key point when specifying data quality rules is to specify all the rules that should exist. At the physical level, data quality rules break down into three main types, all of which can be important in evaluating the data quality:

- Simple-column content rules. These are considered “simple” because you only need to inspect the contents of a single column and check to see if the content meets the rules. Samples of this type of rule include:

  - Valid values, range, data type, pattern, and domain.

  - Optional versus mandatory (evaluates completeness).

  - Reasonable distribution of values. For example, in a customer database, you would expect a fairly even distribution of birthdays; a much larger number of birthdays on a given day of the year probably indicates a problem.

- Cross-column validation rules. These rules require inspecting values in multiple columns (typically in a single table) to determine whether the data meets the quality rules. Samples of this type of rule include:

  - Valid values that depend on other column values. An overall list of location codes might pass the simple-column content rules, but only a smaller list of locations is valid if, for example, the region code is set to “West.”

  - Optional becomes mandatory when other columns contain certain data. The Value of Collateral field may be optional, but if the loan type is “mortgage,” a positive value must be filled into the Value of Collateral field.

  - Mandatory becomes null when other columns contain certain data. The Writing Insurance Agent Name field might normally be mandatory, but if the Origination Point is “web” (indicating the customer applied for the policy online), the Writing Insurance Agent Name field must then be blank.

  - Cross-validation of content. An example cross-validates the name of a city with the name of a state in an address table—that is, Minneapolis is not in Wisconsin.

- Cross-table validation rules (see Figure 7.1; a brief code sketch covering all three rule types follows the figure). As the name suggests, these data quality rules check columns (and combinations of columns) across tables. Samples of this type of rule include:

  - Mandatory presence of foreign-key relationships. For example, if an account must have a customer, then the account table must have a value in the Customer ID column that matches a value in the Customer ID column of the Customer table.

  - Optional presence of foreign-key relationships depending on other data. For example, if the Loan_Type is “mortgage” in the Loan table, there must be a matching value for Loan_ID in the Collateral table. On the other hand, if the Loan_Type is “unsecured,” then there must not be a matching value for Loan_ID in the Collateral table because “unsecured” means there is no collateral for the loan.

  - Columns in different tables are consistent. For example, if the Collateral_Value column contains a value above a certain level, the Appraisal_Type must be “in person” because of the high value of the property.


FIGURE 7.1 Cross-table validation rule relationships.
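The following sketch shows how one rule of each physical type could be expressed in code. The Loan and Collateral tables, the column names, the domain list, and the sample rows are assumptions that echo the examples above, not an actual implementation.

```python
# Illustrative sketch of the three physical rule types; table and column names are assumptions.

loans = [
    {"Loan_ID": 101, "Loan_Type": "mortgage",  "Value_of_Collateral": 250000},
    {"Loan_ID": 102, "Loan_Type": "unsecured", "Value_of_Collateral": None},
    {"Loan_ID": 103, "Loan_Type": "mortgage",  "Value_of_Collateral": None},   # violates the cross-column rule
]
collateral = [{"Loan_ID": 101}]  # Loan 103 is a mortgage but has no collateral row (cross-table violation)

# Simple-column content rule: Loan_Type must come from a defined domain.
valid_types = {"mortgage", "unsecured"}
simple_violations = [loan for loan in loans if loan["Loan_Type"] not in valid_types]

# Cross-column rule: optional becomes mandatory -- a mortgage must have a positive collateral value.
cross_column_violations = [
    loan for loan in loans
    if loan["Loan_Type"] == "mortgage" and not (loan["Value_of_Collateral"] or 0) > 0
]

# Cross-table rule: a mortgage loan must have a matching row in the Collateral table.
collateral_ids = {row["Loan_ID"] for row in collateral}
cross_table_violations = [
    loan for loan in loans
    if loan["Loan_Type"] == "mortgage" and loan["Loan_ID"] not in collateral_ids
]

print("simple-column violations:", simple_violations)
print("cross-column violations:", cross_column_violations)
print("cross-table violations:", cross_table_violations)
```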

Note

It is not a coincidence that the categories of these physical data quality rules look much like the data quality dimensions discussed earlier in this chapter.

Supporting Improvements in Data Quality and Data Profiling

To understand how Business Data Stewards participate in data profiling, it is helpful to understand the overall Data Profiling workflow, as shown in Figure 7.2. Business Data Stewards get heavily involved (as discussed later in this chapter) beginning in step 4, reviewing the potential issues, evaluating and classifying them to make decisions, and determining next steps. These steps are the real “heart” of data profiling (the rest is largely extracting data and running the tool) and produce the true value that comes out of the data profiling effort.


FIGURE 7.2 The value gained from data profiling depends on the input and decisions from the Business Data Stewards.

Note

In some organizations, the role of the data profiling analyst (steps 2 and 3) may be handled by the Business Data Stewards.

Data quality rules come from two main sources. First of all, they may be documented prior to beginning the data profiling effort and can be specified as an input to the tool. The data profiling tool then reports how well the data matches the rule (called goodness of fit). The second possible source of data quality rules can come from the tool itself. Many profiling tools are capable of suggesting possible data quality rules based on the data itself—what Jack Olson calls discovery. For example, the tool may detect that a particular column in a large table virtually always contains a unique value. The tool would then propose (through the output) that this column should have a uniqueness data quality rule. However, there may be a few rows that are not unique; these would be flagged as outliers (violations of the proposed rule).
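A minimal sketch of what such discovery might look like appears below; the 99% threshold, the sample phone numbers, and the proposal logic are assumptions made for illustration, not a description of any specific profiling tool. In this example the column is "almost" unique, so a uniqueness rule would be proposed, with the duplicated number flagged as an outlier.

```python
# Sketch of rule "discovery": propose a uniqueness rule when a column is almost always unique.
# The 99% threshold and sample data are illustrative assumptions.
from collections import Counter

def propose_uniqueness_rule(values, threshold=0.99):
    """If the fraction of distinct values exceeds the threshold, propose a uniqueness
    rule and return the outliers (values that appear more than once)."""
    counts = Counter(values)
    distinct_ratio = len(counts) / len(values)
    outliers = [value for value, count in counts.items() if count > 1]
    return (distinct_ratio >= threshold), distinct_ratio, outliers

phone_numbers = ["555-0101", "555-0102", "555-0103", "555-0103"] + [f"555-1{n:03d}" for n in range(600)]
proposed, ratio, outliers = propose_uniqueness_rule(phone_numbers)
print(f"propose uniqueness rule: {proposed} (distinct ratio {ratio:.3f}), outliers: {outliers}")
```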

In the case of tool-proposed data quality rules, Business Data Stewards must make two decisions:

- Is this really a rule? Sometimes, the proposed rule is simply a coincidence, and is not a valid rule at all. In that case, the outliers may not be a problem. One example of this is when the column contains the customer’s phone number. The phone numbers may be largely unique, but occasionally two customers have the same phone number. Since there is no requirement for the phone number to be unique, the duplicates are not an issue.

- If the rule is valid, what action should be taken about the outliers? If there is a small number of outliers in old records (which can happen with systems that have gone through many revisions), Business Data Stewards may choose to do nothing, especially if these are not active records any more. But Business Data Stewards need to make this decision.

Note

Even when the choice is to “do nothing,” the issue and decision to not correct the data should be recorded along with the data profiling results for future reference.

The second major role played by Business Data Stewards is in reviewing the results of the data profiling against the stated business rules—what Jack Olson calls assertion testing. If the data does not conform well to the rule (poor goodness of fit), it may mean that the rule has been misstated, often because of some special condition being left out. For example, in one company where I worked, a data quality rule stated that the credit score for a customer should be an integer between 300 and 850. This was generally true, but about 5% of the values were 999, clearly violating the rule. Upon being notified of a data quality rule violation on a significant number of records, the Business Data Steward recalled that the value 999 meant that the score was unknown. These “special values” have to be stated as part of the rule or the fit of the data to the rule will be less than perfect.

Business Data Stewards provide input and decisions when there are data quality problems. They must address the following items:

- Rule violation. Is a rule being violated? As discussed previously, unexpected values, patterns, data types, and ranges may indicate that either the data quality is poor, the rule is invalid or incomplete, or both!

- When a rule is violated, whether the violation is worth remediating. For example, at a company I worked for, an auto insurance policy system showed over 13,000 unique values for the auto body type (e.g., four-door sedan, pickup truck, etc.). The expected number of valid values (according to the data quality rule) was only about 20. Analysis showed that the root cause was that the auto body type code field was freeform text, allowing any value to be entered. Upon examining the business process for writing a new policy, however, it was discovered that the field was not used even though auto body type is a key component of establishing the price of the policy. Instead, the auto body type was specified by the vehicle’s VIN (vehicle identification number), which has the body type built into the value of the VIN. Since the auto body type code field was not ever used, the decision was made that it wasn’t worth correcting the values in the field.

- Where the data comes from and how it originates. This is important information in figuring out what is going wrong, because (as noted earlier) data production is often not focused on the downstream uses of the data. Of course, in an environment with a metadata repository, you may be able to inspect the data’s lineage to find this out.

- Characteristics of the data that may impact whether the quality is a concern. These characteristics may include:

  - The age of the data. For example, very often the outliers (or rule violations) are very old records and it isn’t worth correcting the data.

  - Whether the data describes an active record. If the data quality issue is in an inactive record, such as an insurance policy that expired many years ago, it may not be worth correcting.

- The threshold for data quality below which the issue doesn’t matter. As mentioned earlier, the threshold for quality will depend on what the data will be used for (“in context”).

- Workarounds to get good-quality data. A workaround is a method to obtain good-quality data without correcting the poor-quality data. The workaround itself may require spending money and effort, however, and thus must be evaluated and prioritized. In one example, an invalid timestamp was present on data coming from an outside vendor. It was possible to correct the timestamp with a simple calculation. In another case, a set of birthdates was known to be incorrect, and we were able to get the birthdates from another system. However, it was necessary to create an ETL (Extract, Transform, and Load) job to access the high-quality birthdates.

- Evaluation of the impacts to the business if the data issue is not addressed. The business impact is a crucial ingredient in all of these decisions. If the field in question has little business impact (or no business impact, as in the previous example of the auto body type code), it shouldn’t have a high priority for resources to improve the data quality. On the other hand, if the quality of data in a particular field is keeping an enterprise-critical system from going into production, it needs to be fixed right away.

- Evaluation of whether the data quality rule is valuable enough to set up automated enforcement and notification. For important data quality rules on key data elements, it may be important to use automated methods to measure the quality of the data and institute business processes to keep the quality from deteriorating in the future. This decision should not be made lightly, as discussed in the next section.

Enforcing Data Quality during a Load

Enforcement of important data quality rules requires defining the rules, managing any necessary reference data (as discussed later in this chapter), and programming those rules into some sort of a rule engine (see Figure 7.3).


FIGURE 7.3 Enforcement of data quality during a load requires specification of the rules by Data Stewards and development of the enforcement mechanism (rule engine and error-handling) by IT development resources.

Normally, IT has to be involved in programming the data quality rule, as well as the handling of data that violates the data quality rules (the invalid records). Handling the violations includes:

- Recognizing the violations and “splitting” the stream of data into valid and invalid records. Note that this does not mean the invalid records won’t be loaded, but whether they are loaded or not, the records that are invalid must be recognized.

- Handling the data that violates the data quality rules. A number of options are available, such as not loading the invalid records, loading the records but keeping track of them for remediation, or even stopping the load. These are described next.

- Writing out information for invalid records and routing that information to the appropriate decision makers. Some data needs to be written out to help with reconciling the data with the data quality rules that the data violates (see Figure 7.4). The reconciliation process requires the Business Data Stewards (and possibly the Technical Data Stewards) to perform analysis and prioritization of the errors, determine which of those require a project to fix them, and which of those can be immediately corrected (though “immediately” may actually take a while if resources are limited). In some cases, the correction may merely require an update to the rule (perhaps allowing for a wider range or different data type) or an adjustment to the reference data (e.g., adding a new valid location code).


FIGURE 7.4 The data quality reconciliation process requires analysis and updates from the Data Stewards.

Finally, it is necessary to execute an action if there is invalid data. Although IT is responsible for the development effort to accomplish the intended action, Business Data Stewards and Technical Data Stewards need to provide input on what action(s) should be taken. There are essentially four possible things that can happen when invalid records (i.e., records containing data that violates one or more data quality rules) are detected during a data load:

- Stop the load. This is the most drastic action, as it leaves the target database either unusable or populated with the contents of the last load (depending on how the load works). However, in some cases, such as a requirement for uniqueness that is tied to the table’s primary key, there is no other choice. The load may also be stopped if less drastic errors occur more than a specified number of times—that is, the errors exceed the threshold for a data quality rule.

- Skip the invalid records and write them out to error tables. In this scenario, the invalid records are not loaded into the database. Instead, they are loaded into a set of tables specifically designed for error analysis. After correction of the data (or determination that correction is not needed), the data is then loaded into the database, filling in the “holes” in the data. Unfortunately, this can be a complex process, especially if delaying the load of invalid records triggers other rules (e.g., if a missing parent record means that dependent child records also cannot be loaded). Further, if the errors are not addressed right away (which can certainly happen), the holes in the database grow with each new load, which may make the target database unusable.

- Load the invalid records into the database but also keep a log of the invalid records. One way is to write out the invalid records to error tables. Keeping a log is the simplest solution, in that the target database is fully loaded and a record of the data that failed the data quality rules is kept, perhaps as in Table 7.2. Once the corrections are made, however, the target database must be updated, either via direct edits or by reloading the corrected data from the source. In addition, the error log table must be updated as the records are corrected. This may involve removing the error records or updating them with a status to indicate they have been corrected. The option to update with a status allows you to keep a historical record of the errors.

Table 7.2

Sample Invalid Records Table Structure


Note: In this table, the identifier for the invalid record is recorded, as well as the invalid data and the data quality rule violated.

- If your database structure supports it, you can flag the invalid records inline (right in the database), then find them later via a query/report. However, your target database must have a column to hold the rule violation (which serves as a flag as well). This approach only works well if you are recording a single rule violation (a given record can violate more than one rule) or have a way to insert multiple rule violations into a single column, and parse the multiple rule violations during the query/report.

In any of these cases, a necessary additional action is to notify the responsible Business Data Steward, typically by sending an email notification (or creating a task in the workflow engine) for that steward and the Enterprise Data Steward. Ideally, both the notifications and the creation of tasks in the workflow engine are fully automated.
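The sketch below illustrates how such load-time enforcement might hang together: the stream is split into valid and invalid records, one error-log row is written per violation (loosely following the layout described in the note under Table 7.2), and the responsible steward is notified. The rule definitions, the error-log fields, and the notify step are assumptions made for illustration.

```python
# Sketch of load-time enforcement: split the stream into valid and invalid records,
# log the violations, and notify the responsible steward.
# Rule definitions, the error-log layout, and notify() are illustrative assumptions.

RULES = {
    "MAR_STAT_VALID": lambda rec: (rec.get("Mar_Stat_Cd") or "") in {"S", "M", "W", "D"},
    "BIRTHDATE_PRESENT": lambda rec: bool(rec.get("Birth_Dt")),
}

def enforce_load(records):
    """Return (valid_records, error_log); valid records would go on to be loaded."""
    valid, error_log = [], []
    for rec in records:
        failed = [name for name, rule in RULES.items() if not rule(rec)]
        if failed:
            # One error-log row per violated rule: record id, offending data, rule name, status.
            error_log.extend(
                {"record_id": rec["Customer_Id"], "invalid_data": rec, "rule": name, "status": "open"}
                for name in failed
            )
        else:
            valid.append(rec)
    return valid, error_log

def notify(error_log):
    # Placeholder for emailing the Business Data Steward or creating a workflow task.
    if error_log:
        print(f"Notify steward: {len(error_log)} rule violation(s) logged for review")

records = [
    {"Customer_Id": 1, "Mar_Stat_Cd": "M", "Birth_Dt": "1980-02-14"},
    {"Customer_Id": 2, "Mar_Stat_Cd": "X", "Birth_Dt": None},
]
valid, errors = enforce_load(records)
notify(errors)
```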

Role of Data Stewardship in Managing Reference Data

According to Danette McGilvray in her book Executing Data Quality Projects (Morgan Kaufmann Publishers, 2008), reference data is sets of values or classification schemas that are referred to by systems, applications, data stores, processes, and reports, as well as by transactional and master records. Examples include lists of valid values, code lists, status codes, state abbreviations, demographic fields, product types, gender, chart of accounts, and product hierarchy.

Organizations often create internal reference data to characterize or standardize their own information. Reference data sets are also defined by external groups, such as government or regulatory bodies, to be used by multiple organizations. For example, currency codes are defined and maintained by ISO.

The simplest case of reference data is enumerated attributes—that is, data limited to specific valid values. For example, the marital status code might be limited to the values “M” (married) and “S” (single). To manage this type of reference data properly, the code value (e.g., “M” or “S”) and the description that goes with each code (e.g., married or single) must be defined and documented. Each code/description pair should have a business definition, though in practice this is rarely done, especially for the simpler values. For more complex values, however, such as the status of an account or insurance policy, it is very important to have a definition for when, for example, the status is “active” or “inactive.” This is particularly important when the code must be derived. For example, the status might be derived as “active” only when the account start date is less than the system date, and the account end date is null (or some date in the future).
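A derivation rule like that might be expressed as in the short sketch below; the field names and function are assumptions consistent with the description, not code from any particular system.

```python
# Sketch of deriving an account status code from dates, per the rule described above.
# Field names and the function signature are illustrative assumptions.
from datetime import date
from typing import Optional

def derive_status(start_date: date, end_date: Optional[date], today: Optional[date] = None) -> str:
    """Derive the status: "active" only when the start date is before the system date
    and the end date is null or in the future."""
    today = today or date.today()
    if start_date < today and (end_date is None or end_date > today):
        return "active"
    return "inactive"

print(derive_status(date(2020, 1, 1), None, today=date(2024, 6, 1)))              # active
print(derive_status(date(2020, 1, 1), date(2023, 1, 1), today=date(2024, 6, 1)))  # inactive
```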

Reference data can be thought of in two forms: a business description and a physical code. The business description concerns itself only with the description and definition of that description, and doesn’t worry about the code value that goes with it. The physical code can vary from one system to another even when the business description is identical between systems. For example, one system may record the gender code as “M” and “F,” while another system may record it as “1” and “2.” The fact that both “M” and “1” have the description of “male” is simply a difference in how the systems are implemented. Business Data Stewards must focus on ensuring that the list of descriptions meets their business needs, while Technical Data Stewards can provide insight into the code values.

Business Data Stewards have an important role in specifying the descriptions, which are, in effect, a definition of the associated code. If the description is insufficiently rigorous, data users will often misinterpret what the code means and use the data incorrectly. In addition, inadequate descriptions of codes make it very difficult to map equivalent codes between systems.

In many companies, there is no “system of record” for much of the reference data. This is often the case with the enumerated attributes with a small list of values, such as marital status code or gender. In that case the business glossary/metadata repository can serve to hold this information. However, it is often not appropriate to store reference data in the glossary or repository. For example, the chart of accounts would have the accounting system as its system of record, since that is where the list must be maintained and where the validation rules are in place to prevent creating invalid accounts. There are also extremely large standardized lists (e.g., the ICD10 codes for medical practice) that need to be stored and managed in a central area where systems that use that information can access it easily. In cases such as the chart of accounts or ICD10 codes, the business glossary should state where the reference data is stored (e.g., the table that contains ICD10 codes), rather than storing the reference data itself.

Note

Different systems may have different sets of reference codes (and their descriptions) even for data elements that appear to be the same. The difference in the reference data occurs due to differences in the data element definitions between the two systems. For example, one insurance system had a policy status code with values of “A” (active) and “I” (inactive). The second insurance system had a policy status code that used the codes “A” (active), “I” (inactive), and “M” (make renewal call). The definition of the policy status code in the second system has to include the fact that the status code was also a trigger for the business process of calling the customer to renew.

General Maintenance of Reference Data

Business Data Stewards play a key role in the maintenance of reference data in critical systems. They need to:

- Document and record the existing valid values to ensure that the values are well understood. Data profiling can help with this process, and assistance from the Technical Data Steward may be required to access the data. However, the biggest challenge in understanding some sets of values is getting information from the business Subject Matter Experts (SMEs) who produce the data!

- Evaluate the need for a proposed new valid value to ensure that the value:

  - Does not overlap an existing value. For example, the value of “widowed” for the marital status code overlaps the value “single.” The steward may choose to allow this overlap, but it should be done knowingly, rather than accidentally. Also realize that if the marital status is “widowed” and the person remarries, a rule has to be in place to determine whether to change the marital status to “married,” losing the fact that the person was once widowed.

  - Does not duplicate information that is recorded elsewhere. For example, the value of “deceased” for the person status overlaps the deceased flag (a checkbox) recorded elsewhere. Again, this may be allowed, but any time you record information multiple times, the possibility exists for the data to conflict.

  - Is consistent with the meaning of the field. All too often, a new value is added to an existing valid value set to avoid adding a new field, a practice referred to as overloading. For example, a new insurance policy status of “make renewal call” was added to the existing values of “in force” and “inactive.” This new status is supposed to tell the agent that the policy is within 30 days of expiring, and that he or she should contact the customer. But it is not a status; the policy is still “in force.” This information should be recorded in its own field or derived and presented on a report.

- Evaluate the impact of the new value and ensure that the stakeholders are informed and consulted on these impacts. For example, a transformation may fail because it does not handle the newly added value. Again, assistance may be needed from the Technical Data Steward to perform this evaluation.

- Approve and document the addition of the new value.

- Detect the unauthorized use of new values. It is important to detect when new values have been added to the list of valid values without going through the proper processes. Ideally, the time to detect unauthorized values is before they are used. Detecting these values may be possible if the source system(s) maintain their own list of values, which can be queried on a regular basis and compared to the authorized list. Another way to detect unauthorized values is to perform data validation during data movement (ETL), such as when loading a data warehouse.
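As a minimal sketch of that last detection step, the comparison can be as simple as checking the values observed in a source system against the authorized list; the code lists below are assumptions made for illustration.

```python
# Sketch: detect valid-value codes that appear in a source system but were never authorized.
# The authorized list and the queried source values are illustrative assumptions.

authorized_policy_statuses = {"A": "active", "I": "inactive"}

# Values as they might come back from querying the source system's code table.
observed_in_source = {"A", "I", "M"}

unauthorized = observed_in_source - set(authorized_policy_statuses)
if unauthorized:
    print("Unauthorized codes found, route to the Business Data Steward:", sorted(unauthorized))
```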

Aligning Reference Data Values across Systems

Physical codes in reference data are often more complex than the business descriptions. Even when equivalent data elements in different systems have the same description, the code values are often different. For example, another system may use “Ma” and “Si” (rather than “M” and “S”) for the marital status codes. However, comparing reference data from different systems is often far more complex than just converting equivalent values (e.g., “M” to “Ma”) because there is not always a direct correspondence between the data elements. To use the marital status code as an example, one system might have a “married indicator” showing that you are either married or not. All the codes that indicate married would map to “Y,” and all those that indicate not married would map to “N.” The granularity of “divorced” and “widowed” would be lost unless you have a method to preserve that information, as will be discussed shortly.

During the mapping process, you are mapping the values from various source systems to create an “enterprise-aligned” data element with “enterprise-aligned” values, illustrated in Figure 7.5. This figure shows the source system data model down the left side, with the source element and its multiple values. The right side shows the shared (or “aligned”) elements with their values. In the center are the mapping rules that show how a value of an element in a source system is mapped to a value in an aligned element.


FIGURE 7.5 Mapping source system data elements (and their values) to aligned elements and aligned values.

It is always advisable to profile the reference data to determine what actual values exist in the database, as well as the distribution of those values. The Business Data Steward can then make decisions about what values must be converted (also called harmonized), as well as what to do with values that are not converted.

In addition to mapping equivalent values directly, several different mapping cases need to be dealt with:

- Data elements with the same name that mean something entirely different. Column names tend to get specified by a developer, database administrator, or modeler—and two different systems may use the same terminology for different purposes. For example, in one automobile insurance system, “Sex_Code” referred to the gender of the primary policyholder. In the equivalent homeowner’s policy system, “Sex_Code” referred to the gender of the oldest person in the household (used for discounting purposes). Thus, despite being named the same and even seeming to contain the same format of data, these fields did not represent the same thing.

- Data elements with different names that have the same meaning. This case is very common for the same reasons as the previous one. Older systems tend to use short names, and newer systems may use a longer, more descriptive field name. For example, one system may call the marital status code “MS_CD,” while another might call it “Marital_Status_Code.”

- Data elements that mix several different elements together. Figure 7.6 illustrates this case (a mapping sketch follows the figure). In source system 1, some of the valid values (widowed, divorced) are not marital status codes at all; instead they describe why someone’s marital status is “single.” Thus, in the enterprise-aligned view, those values should be mapped to a Marital Status Reason Code, and they should also set the value of marital status to “single.” Source system 3 shows the case where a new valid value for “registered domestic partner” was simply added to the Marital Status Code when the company decided to record this information and provide the same benefits to registered domestic partners as they did to married couples. However, “registered domestic partner” is not technically a marital status; it describes an entirely different situation. Thus, in the enterprise-aligned view, it is mapped to an entirely different data element (“Domestic_Partner_Flag”), which has values of “yes” and “no.”


FIGURE 7.6 Mapping mixed sets of valid values to the enterprise-aligned view. Each system calls the column something different. In addition, the valid values are different from one system to another, and there is a difference in the granularity and meaning of the columns.
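The mapping sketch below follows the Figure 7.6 discussion: source system 1’s “widowed” and “divorced” values map to a Marital Status Reason Code with marital status set to “single,” and source system 3’s overloaded “registered domestic partner” value maps to a separate Domestic_Partner_Flag. The dictionary structure, the source codes, and the element names are assumptions made for illustration.

```python
# Sketch: map source-system marital status values to enterprise-aligned elements.
# The mapping-table structure, source codes, and aligned element names are illustrative assumptions.

# Source system 1 mixes a marital status reason into the marital status values.
SOURCE1_MARITAL_MAP = {
    "married":  {"Marital_Status": "married", "Marital_Status_Reason": None},
    "single":   {"Marital_Status": "single",  "Marital_Status_Reason": None},
    "widowed":  {"Marital_Status": "single",  "Marital_Status_Reason": "widowed"},
    "divorced": {"Marital_Status": "single",  "Marital_Status_Reason": "divorced"},
}

# Source system 3 overloaded the field with "registered domestic partner," which maps to its own
# aligned element; what Marital_Status should then hold is a Business Data Steward decision.
SOURCE3_MARITAL_MAP = {
    "M": {"Marital_Status": "married", "Domestic_Partner_Flag": "no"},
    "S": {"Marital_Status": "single",  "Domestic_Partner_Flag": "no"},
    "R": {"Marital_Status": None,      "Domestic_Partner_Flag": "yes"},
}

def harmonize(source_value, mapping):
    """Translate one source value into the enterprise-aligned elements and values."""
    aligned = mapping.get(source_value)
    if aligned is None:
        raise ValueError(f"Unmapped source value {source_value!r}; raise an issue with the Business Data Steward")
    return aligned

print(harmonize("widowed", SOURCE1_MARITAL_MAP))
print(harmonize("R", SOURCE3_MARITAL_MAP))
```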

Note

A case could be made that “registered domestic partner” actually does represent a marital status, because, at least for the purposes of this company, the business processes and benefits afforded to married couples were being extended to registered domestic partners. Again, it would be up to the Business Data Steward to make a recommendation on how to handle this new business case.

The process of bringing reference data together from multiple systems (and dealing with the issues just discussed) is called harmonizing values. This is necessary any time you need to combine data, such as when building a data warehouse across multiple business functions or when using reference data as one of the determining attributes for performing identity resolution as part of MDM. A simple example involves creating a “central patient” record for a healthcare company. The gender of the patient is often a determining attribute in resolving multiple patient records. But the gender code might be recorded as “M” and “F” in one system and “1” and “2” in another. These values have to be understood and mapped together to match up the patient records. The role of Business Data Stewards in identity resolution in MDM is discussed in more detail in the next section.

Business Data Stewards play an important role in managing reference data when harmonizing values, as shown in Figure 7.7. They must:


FIGURE 7.7 Responsibilities for the process steps involved in managing harmonized reference data valid values.

- Review and understand the source system reference data element and each of the valid values associated with the element.

- Determine what reference data is needed in the enterprise-aligned view, and what those values should be.

- Map and approve the source value to the enterprise-aligned (harmonized) values.

- Manage the harmonized values and mappings.

Role of Data Stewardship in Identity Resolution for MDM

One of the most important parts of an MDM effort is resolving multiple instances of data that represent the same entity into a single record for that entity. This is called identity resolution. For example, an insurance company had separate systems for selling automobile insurance, homeowner’s insurance, and personal liability insurance. A customer might own all three types of insurance, but unless the insurance company has a way to resolve the identities of the customers in all three systems, they would never know that the same person was a customer for all three types of insurance. This could lead to missed multipolicy discounts and effort wasted trying to sell a person insurance they already have (which certainly would not help the credibility of the company either). Another example is a chain drugstore, where the same person might have prescriptions filled in multiple different stores. Without being able to tell that it is the same person, there is a danger of missing drug–drug interactions and failing to detect fraud.

Business Data Stewards have a number of important decision-making responsibilities during the identity resolution process. Figure 7.8 shows the overall workflow for identity resolution, and as you can see (and as we will discuss in the next few sections), Business Data Stewards (as well as IT support, which may include Technical Data Stewards) have many roles to play and many decisions to make in arriving at the “golden copy”—the single record that represents a real-world entity.


FIGURE 7.8 The workflow for identity resolution. Processes that include the Business Data Steward “swim lane” require input and decisions from Business Data Stewards.

Identifying the Identifying Attributes

Identifying attributes are those fields that, when used in combination, uniquely identify records as describing the same real-world instance of a master entity. For example, the identifying attributes for a customer might be a combination of first name, last name, gender, birthdate, address, and email address. That is, if two people had the same values for all these attributes, they are considered to be the same person. Figure 7.9 shows how you might use identifying attributes to distinguish between two very similar records. The identifying attributes are compared across the two records and differences between the two records help to make the decision.

Note

In the ideal case, the data might contain a field that is unique to an individual, such as the Social Security Number or Social Insurance Number. That single attribute would then be the only needed identifying attribute. Another example might be a company-issued identifier for registered members, such as a user ID. However, keep in mind that there need to be checks, even in the case of a unique identifier. The checks are needed because numbers can be miskeyed or otherwise mistakenly entered. In the case of a person, for example, it would be good to do a check on the name and birthdate. Such a check would also pick up “normal” occurrences, such as when a woman changes her name after marriage.


FIGURE 7.9 Compare the values in the identifying attributes of two records to discern whether they represent two different entities.

It can be tricky to figure out the identifying attributes. On the one hand, you want the minimum set of attributes, as they must exist in all the systems from which you want to aggregate records to create the golden copy. Balancing that, however, is the need to have a high degree of confidence that you are indeed compiling a set of records that represent the same entity. On top of these needs is the need to have good-quality data in the identifying attribute fields. For example, you might have a field for the Social Security Number for your customers (a great unique identifier), but if the field is mostly empty, it is not a good identifying attribute.

Business Data Stewards should have knowledge of what data exists, what issues there are with the data, and how to find it (more on this later). Gathering the set of identifying attributes can become an iterative process as you map the fields between systems, examine the completeness and other quality dimensions of the data, and attempt to create the golden copy. And even though there are software tools that can catalog potential identifying attributes and propose a set of candidates to use, Business Data Stewards are your best decision makers on the final set of identifying attributes.

Another decision that must be made is the sensitivity of the golden copy to false negatives and false positives. A false negative is when two instances of the same entity are classified as being different entities. For example, if the marketing system has two records for a customer and one has the wrong birthdate, it might be concluded that these are really different people. The end result might be that the customer gets two copies of a catalog, or fails to get a volume discount because purchases are not aggregated to the same customer. A false positive is when instances of different entities are incorrectly merged to be the same entity. For example, two patients in a hospital system are determined to be the same patient when they are not. The end result could be incorrect dispensing of medications, improper surgeries, and misdiagnoses of diseases. Clearly, the consequences of a false positive in the case of the hospital patient are far greater than for the marketing customer, and thus the confidence level at which the merging of hospital records can take place is far higher than that for the marketing customer. Business Data Stewards have important input in determining this sensitivity. A high required confidence level usually leads to a larger number of higher-quality identifying attributes being required before a golden copy is created. It also leads to more effort being required to cleanse and standardize the data to enable it to be used for identity resolution.

There are many software tools that can assist with identity resolution. Many of these tools use a score to indicate how closely two records (when compared to each other) match. The higher the score, the higher the probability that the records represent the same entity (based on the tool’s algorithms). Figure 7.10 shows what the output from such a tool might look like. In this example, the tool is comparing records from two systems (system 1 and system 2).


FIGURE 7.10 The upper and lower trust limits specify the records that are considered to be automatic matches (upper) and automatic nonmatches (lower). Records that fall between these limits require further analysis to determine whether they represent the same entity.

Business Data Stewards must provide input to:

- Set the score above which two records are considered an automatic match. This score is sometimes called the upper trust limit. Pairs of records with scores above the upper trust limit are considered automatic matches. If there are too many false positives, Business Data Stewards may recommend raising this limit. In Figure 7.10, the record pairs above the upper trust limit bar would be automatically considered to represent the same entity (matches).

- Set the score below which two records are considered an automatic nonmatch. This score is sometimes called the lower trust limit. Pairs of records with scores below the lower trust limit are considered automatic nonmatches. If there are too many false negatives, Business Data Stewards may recommend lowering this limit. In Figure 7.10, the records below the lower trust limit would be considered to represent different entities (nonmatches).

Pairs of records that fall between the upper trust limit and the lower trust limit are candidates for further analysis. Oftentimes a human being (perhaps an Operational Data Steward) can examine these records and make a match/nonmatch decision. However, if there are too many of these records, it can be difficult and time consuming to get through them all. Adjusting the trust limit scores can reduce the number of these candidates, but the trade-off is that it may also increase false positives or false negatives.
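A minimal sketch of how scored pairs might be routed against the two trust limits is shown below; the limit values, the scores, and the record identifiers are assumptions made for illustration rather than output from any specific matching tool.

```python
# Sketch: classify scored record pairs using upper and lower trust limits.
# The scores, limits, and record identifiers are illustrative assumptions.

UPPER_TRUST_LIMIT = 0.90   # at or above: automatic match
LOWER_TRUST_LIMIT = 0.60   # below: automatic nonmatch

def classify(score: float) -> str:
    if score >= UPPER_TRUST_LIMIT:
        return "match"
    if score < LOWER_TRUST_LIMIT:
        return "nonmatch"
    return "review"   # route to an Operational Data Steward for manual analysis

scored_pairs = [("S1-1001", "S2-2001", 0.95), ("S1-1002", "S2-2044", 0.72), ("S1-1003", "S2-2187", 0.41)]
for rec1, rec2, score in scored_pairs:
    print(rec1, rec2, score, "->", classify(score))
```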

Note

In addition to adjusting the scores that represent the upper and lower trust limits, Business Data Stewards may recommend additional (or different) identifying attributes. They may also recommend additional data cleansing, standardization, or enrichment.

Finding the Records and Mapping the Fields

Business Data Stewards and Technical Data Stewards need to work together to:

- Figure out what systems contain records that must be integrated into the golden copy. If you are going to have a master record (golden copy), then all the systems that contain records corresponding to the entity must be mapped so they can be used properly.

- Figure out the metadata that describes the fields in each system. Field names may give a clue, but are usually not the whole story. For example, the “birthdate” field in a homeowner’s insurance policy system actually contained the date of birth of the oldest person living in the home, and so not necessarily the policyholder’s birthdate. Assuming that the field contained the policyholder’s birthdate would have led to a false negative and would have missed the fact that the customer had a homeowner’s policy. This effort exposes a lot of information about the contents of the identifying attributes, as illustrated in Figure 7.11.


FIGURE 7.11 Understanding the metadata and field contents of the potential identifying attributes.

- Profile and examine the data in the identifying attributes. Analyzing what is stored in the fields can help to understand what the field means, as well as how the field is being used in a particular system. It is also crucial to understand what data is stored and in what format. For example, a full name field might contain multiple full names, in which case a matching algorithm would not match the full name with the full name stored in another system. The problem here is that to match records, you need to be able to clearly define the content and structure of each field in the record. Another example is addresses. Profiling the address fields exposed numerous misspellings and “impossible” addresses (Detroit is not in Illinois). But by knowing about these errors (which would not have been matched to the addresses stored in other systems), we were able to apply an address standardization algorithm and obtain the matches. As you can see from the address example, it is critical to clean up data before attempting to match data between systems.
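As a small illustration of that cleanup step, the sketch below standardizes street abbreviations and checks a city/state combination before matching; the tiny abbreviation table and city/state list stand in for a real address-standardization service and are purely assumptions.

```python
# Sketch: lightweight standardization of identifying attributes before matching.
# The abbreviation table and city/state reference list are illustrative assumptions
# standing in for a real address-standardization algorithm.

STREET_ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}
CITY_STATE = {"minneapolis": "MN", "detroit": "MI"}   # tiny stand-in reference list

def standardize_address(street: str, city: str, state: str):
    """Normalize street abbreviations and flag/correct an impossible city/state combination."""
    words = [STREET_ABBREVIATIONS.get(w.lower().strip(".,"), w.lower().strip(".,")) for w in street.split()]
    city_clean = city.strip().lower()
    expected_state = CITY_STATE.get(city_clean)
    if expected_state and expected_state != state.upper():
        state = expected_state   # e.g., Detroit is not in Illinois
    return " ".join(words), city_clean.title(), state.upper()

print(standardize_address("123 Main Street", "Detroit", "IL"))   # -> ('123 main st', 'Detroit', 'MI')
```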

Enrich the Data from Outside Sources

It is often possible to enrich the data (and identifying attributes) you have by using outside sources (see Figure 7.12). For example, an automobile insurance policy system may have unreliable addresses, but highly accurate driver license numbers for the policyholders. The DMV, on the other hand, has very reliable addresses linked to the driver license numbers, as the registration and other paperwork must be mailed to the driver. By retrieving the addresses from the DMV using the driver license number, it is possible to enrich the addresses with accurate information. Business Data Stewards will usually know where this is being done or is possible, and may even be able to suggest what outside sources will work best. They may also know where there are issues with the quality of the data such that enrichment may be impractical or unreliable. In one case, for example, fake driver license numbers were being filled in by the agents to get the policy written without needing to force the policyholder to provide a driver’s license. In cases where the fake driver license numbers corresponded to real ones (for someone else), the enriched data that came back from the DMV was incorrect and represented a significant data quality problem.

Note

As difficult as it is to perform identity resolution for individual human beings, it can be even more difficult when the customers are not individuals, such as corporations, business units of corporations, and so on.

image

FIGURE 7.12 Enriching and enhancing data.

Role of Data Stewardship in MDM Survivorship

Another key aspect of MDM is called survivorship. David Loshin defines this concept in his book Master Data Management (Morgan Kaufmann, 2009, p. 193) as “the process applied when two (or more) records representing the same entity contain conflicting information to determine which record’s value survives in the resulting merged record.” Determining the values that survive in the golden-copy record requires a set of business rules to resolve the conflicting values and choose one. Loshin describes it this way: “The master data attribute’s value is populated as directed by a source-to-target mapping based on the quality and suitability of every candidate source. Business rules delineate valid source systems, their corresponding priority, qualifying conditions, transformations, and the circumstances under which these rules are applied.” The business rules must be determined by the Business Data Steward(s) for the mastered data, based on the variety of considerations that Loshin discusses.

Figure 7.13 illustrates a simplified view of the survivorship process. Business Data Stewards and Technical Data Stewards (as well as other contributors) identify the various sources of each master data attribute. This is typically done in the early stages of the MDM effort, alongside the effort to locate the identifying attributes. Once the sources of each attribute are documented, the survivorship business rules need to be specified. The business rules establish the priorities of the available values—that is, which value should be chosen to populate the golden copy. Examples of establishing priorities include:

- Value missing or blank. Clearly, if the value is missing or blank, it can’t be used, and the value must be found elsewhere. This can occur when a system has the attribute, but the attribute is not always populated.

- By date. If multiple systems have an available value for an attribute, a possible rule would be to choose the value with the most recent date. This is not always a good choice, especially if the most recently populated attribute value is incomplete or violates other data quality rules, such as being invalid, of the wrong format, or with a suspect value distribution. For example, one system had a more recent birthdate than any other available source, but the distribution of the birthdates indicated that at least one of the dates (12/31) wasn’t reliable.

- By data quality. The values in some systems may be chosen (even over more recently entered data) if they meet the specific data quality rules, as noted in the preceding bullet.

image

FIGURE 7.13 Business and Technical Data Stewards set the survivorship rules that are then applied by the MDM tool.

Once the business rules are established, they are used by the MDM tool to resolve attribute values in the survivorship process. The tool identifies the master records from multiple sources that have conflicting attribute values, applies the resolution rules, and creates (or updates) the golden copy with the selected value.
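
As a rough illustration of how such priority rules might be expressed, the sketch below chooses a surviving birthdate from candidate source values by skipping missing values, rejecting values that fail a simple quality check, and then preferring the most recently entered of what remains. The rule order and the suspect 12/31 check are assumptions for illustration; they are not the rules of any particular MDM tool.

```python
from datetime import date
from typing import Optional

def passes_quality_rules(value: Optional[date]) -> bool:
    """Illustrative rule: distrust the suspicious 12/31 default birthdates."""
    return value is not None and not (value.month == 12 and value.day == 31)

def surviving_birthdate(candidates) -> Optional[date]:
    """Pick the birthdate value that survives into the golden copy.

    Each candidate is a dict: {"value": date or None, "entered_on": date}.
    Rules, in priority order: skip missing values, skip values that fail the
    quality rules, then take the most recently entered of what remains.
    """
    usable = [c for c in candidates if passes_quality_rules(c["value"])]
    if not usable:
        return None
    return max(usable, key=lambda c: c["entered_on"])["value"]

candidates = [
    {"value": None, "entered_on": date(2013, 6, 1)},                # missing
    {"value": date(1970, 12, 31), "entered_on": date(2013, 9, 1)},  # suspect default
    {"value": date(1968, 4, 17), "entered_on": date(2013, 3, 1)},   # survives
]
print(surviving_birthdate(candidates))  # 1968-04-17
```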

Note

Keep in mind that it may be possible to keep multiple values in some cases, such as addresses. This reflects the real world, where people can have multiple valid addresses. However, even with multiple values (or value sets, as in addresses), a rule is needed to select the value/value set to use in circumstances where only a single value is allowed (e.g., a system that can store only a single address).

Role of Data Stewardship in MDM Exception Handling

An MDM engine (or hub) takes in records and performs a large number of processes, including cleansing, matching to existing records, inserting new records, updating existing records, merging records that represent the same entity, unmerging records that were merged in error, and deleting obsolete records. The processes follow a set of rules that have been carefully established and tested against incoming data whose quality (and known quality issues) was established by profiling. However, data (and data quality) changes over time, and the rules may fail under these circumstances. Examples of source data changes that can cause rule failures include the following (a brief sketch of detecting several of these exception types follows the list):

- Invalid lookup. Many times an incoming code or other data value in the MDM data will trigger a lookup to translate the code or value into either a harmonized code (a code that has been standardized across systems), a different (standardized) data value, or a description. An invalid lookup can occur when the incoming code or data value is not found in the lookup table, resulting in an error because the lookup transformation cannot take place.

- Data quality exception. The MDM processing engine expects data to have certain traits, such as a given length, data type (e.g., numeric or string), format, range, or pattern. A data quality exception can occur when the arriving data is not as expected. For example, if the data type is different from what was anticipated (e.g., numeric is expected but alphabetic characters are present), this can cause a data quality exception that must be remediated. Another example is a value that falls outside the expected range, causing a conversion to fail or to produce a result that violates the allowed values.

- Missing parent exception. Records in the MDM hub may be arranged in a hierarchy; a missing parent exception occurs when a child record arrives without its parent record being present.

- Invalid relationship exception. Records in the MDM hub may have mandatory relationships that can include hierarchical relationships. If the record from either end of the relationship is missing, this can cause an exception. For example, an individual in the employee master may have a mandatory relationship to organization. If the organization record is missing or invalid, an invalid relationship exception occurs.

- Missing data exception. If data is missing from a field, this can cause an exception. This can also be thought of as a violation of the data quality completeness dimension. For example, if an address record is supposed to have all the fields (street, city, state, country, zip, etc.) populated, and one or more fields are empty, a missing data exception occurs.
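
The sketch below shows, under assumed field names, lookup tables, and formats, how an incoming record might be screened for several of these exception types before it reaches the matching and merging steps. It is an illustration of the categories above, not a description of how any particular MDM engine works.

```python
VALID_STATE_CODES = {"CA", "IL", "MI"}          # illustrative lookup table
EXISTING_PARENT_IDS = {"ORG-001", "ORG-002"}    # organization (parent) records on hand
MANDATORY_FIELDS = ("name", "state", "zip", "parent_id")

def classify_exceptions(record: dict) -> list:
    """Return the exception types raised by an incoming record, if any."""
    exceptions = []
    # Missing data exception: a mandatory field is absent or empty.
    for field in MANDATORY_FIELDS:
        if not record.get(field):
            exceptions.append("missing data: " + field)
    # Invalid lookup: the incoming code is not in the lookup table.
    state = record.get("state")
    if state and state not in VALID_STATE_CODES:
        exceptions.append("invalid lookup: state code " + state)
    # Data quality exception: zip is expected to be a five-digit string.
    zip_code = record.get("zip", "")
    if zip_code and not (zip_code.isdigit() and len(zip_code) == 5):
        exceptions.append("data quality: zip '" + zip_code + "' is not five digits")
    # Missing parent exception: the child references a parent that is not present.
    parent_id = record.get("parent_id")
    if parent_id and parent_id not in EXISTING_PARENT_IDS:
        exceptions.append("missing parent: " + parent_id)
    return exceptions

print(classify_exceptions({"name": "Acme West", "state": "ZZ",
                           "zip": "9481A", "parent_id": "ORG-999"}))
```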

Business Data Stewards are heavily involved in analyzing and mitigating exceptions, as shown in Figure 7.14. The processes involved are:

- Receive and log exception. The MDM engine keeps track of all exceptions and produces a report for analysis. At a minimum, the report must include the data that violated the processing rules, as well as the rule that was violated. Business Data Stewards (or designated personnel, such as Operational Data Stewards) receive the report and keep track of the exception as an issue.

- Conduct root-cause analysis. The next step is to figure out what caused the exception. The reason is almost always data that did not have the expected traits. Of course, the important question is why the data was not as expected. Assuming that the data was originally profiled and cleansed when the MDM processes were specified, built, and tested, something must have occurred to change the traits of the data. For example, a new code might have been added to the list of valid values for a field, or a change in the source system may have allowed a new data type to be entered. Root-cause analysis delves into the reasons for the violation. Since information about source systems and (possibly) MDM processing may be needed, IT may have to assist with root-cause analysis.

- Remediate the problem. Once the root cause has been located, changes need to be made to remediate it. This may involve new instructions on how to enter data, new constraints in the source system (which will require help from IT), or changes to the MDM exception rules to handle the new data traits.

- Update documentation. Where necessary, the documentation on how the MDM engine processes data (including conversion and transformation rules) must be updated to reflect any changes.

- Update MDM/DQ exception rules. If the exception rules must be updated to handle the new data traits, IT will need to make those changes and test the new exception rules.

image

FIGURE 7.14 Business Data Stewards must investigate the causes for MDM exceptions and propose either corrections to the source data or changes to the exception rules.

Role of Data Stewardship in Information Security

Protecting the privacy and security of data is another important activity where Data Stewards (as well as many others) play an important role. The overall process is illustrated in Figure 7.15. The process typically starts with a standard or policy that establishes the categories into which the data should be classified, and how each category can be used and must be protected. The categories will have names like “public,” “company private,” and “confidential.” In many industries (e.g., healthcare and insurance), the policies are almost always driven by legislation. The legislation often sets how the data must be treated, from which the appropriate information security category can be derived. For example, in the automobile insurance business in California (and many other states), the driver license number must be encrypted both at rest (in the database) and in motion (ETL). This led the insurance company where I worked to classify the driver license number as “highly confidential,” as that category required these same protections. Legislation—and thus policy—may also specify privacy rules for handling groups of data elements. For example, while the customer’s last name by itself might be considered “public,” the combination of the last name, first name, and zip code could well be considered “confidential.”

image

FIGURE 7.15 Establishing and enforcing information security requires participation from many different roles.

In many cases, the information security policy leaves it to the business to determine what security classification the data falls into. This breaks down into two steps. In the first step, the Business Data Steward establishes the classification for the business data element or grouping of business data elements, again, often as a result of applying government regulations to the data. For example, the Customer Data Steward might establish that the combination of name and address is considered “confidential,” though just the name (when used alone) is “company private.” In the second step, the Business Data Steward works with the Technical Data Stewards (and other IT support) to determine where the data exists physically in the database(s). The “confidential” fields are then suitably protected by IT support to ensure that people who do not have a need to know do not have access.
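
As a simple illustration of the first step, the sketch below expresses classifications for individual elements and for element combinations, and picks the most restrictive category that applies. The category names and groupings echo the examples above, but the structure itself is an assumption for illustration, not a prescribed format.

```python
# Classification of single elements and of element combinations (illustrative).
ELEMENT_CLASSIFICATION = {
    frozenset(["last_name"]): "company private",
    frozenset(["driver_license_number"]): "highly confidential",
    frozenset(["last_name", "first_name", "zip"]): "confidential",
}

# Categories ordered from least to most restrictive.
SENSITIVITY_ORDER = ["public", "company private", "confidential", "highly confidential"]

def classify(elements) -> str:
    """Return the most restrictive classification triggered by the elements present."""
    result = "public"
    for group, category in ELEMENT_CLASSIFICATION.items():
        if group <= set(elements) and (
            SENSITIVITY_ORDER.index(category) > SENSITIVITY_ORDER.index(result)
        ):
            result = category
    return result

# The last name alone versus the name-plus-zip combination:
print(classify({"last_name"}))                       # company private
print(classify({"last_name", "first_name", "zip"}))  # confidential
```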

The Data Stewardship Council may also set up the process by which employees receive the necessary permissions (usually by authority from their supervisors or other responsible person) to have access to “confidential” data. This process may be managed by Human Resources, Data Governance, or some other entity, but the process itself should have the approval of the Business Data Stewards.

Note

Human Resources is often a good choice to manage the process of granting access permissions because it maintains the reporting hierarchy, is responsible for the hiring and termination processes, and maintains job descriptions, which are often consulted when deciding whether access should be granted.

Role of Data Stewardship in Metadata Management

Like data, metadata is a resource that must be managed. Business Data Stewards have several important roles in that management. As discussed earlier, they have to manage definitions. This includes first defining the key business data elements. As part of that definition, they need to identify and link related terms that are used in the definition. If there are very similar terms, which might be mistaken for the defined term, it is important to ensure that the similar terms are clearly delineated and that users of the defined terms understand the difference.

Another critical activity is for the Business Data Stewards, in concert with the Technical Data Stewards, to link the business data elements to their physical counterparts. A robust metadata repository that supports lineage can help with this process by showing how the data flows through the various systems and ETL. Profiling the data also exposes the content to ensure that the links are accurate.
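
As a simple illustration, the linkage between a business data element and its physical counterparts can be recorded as a small set of mapping entries, as in the sketch below. The system, table, and column names are hypothetical, and the structure is only an assumption about what a lineage-capable repository might hold.

```python
from dataclasses import dataclass

@dataclass
class PhysicalLink:
    """One physical location of a business data element."""
    system: str
    table: str
    column: str
    verified_by_profiling: bool  # True once profiling has confirmed the content

# Illustrative links for a single business data element.
business_element = "Customer Birth Date"
links = [
    PhysicalLink("PolicyAdmin", "POLICY_HOLDER", "BIRTH_DT", verified_by_profiling=True),
    PhysicalLink("DataWarehouse", "DIM_CUSTOMER", "DOB", verified_by_profiling=False),
]

for link in links:
    status = "verified" if link.verified_by_profiling else "needs profiling"
    print(f"{business_element} -> {link.system}.{link.table}.{link.column} ({status})")
```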

If the Data Governance toolset includes a metadata repository, Business Data Stewards play an important role in customizing that tool, both to add functionality and to create an interface that works for them. Regular meetings should be held (or an ongoing discussion group created) to discuss what additional functionality is needed and where the interface confuses users or is harder to use than it needs to be.

Role of Data Stewardship in Supporting Quality Assurance

The Quality Assurance (QA) phase of a project is where the application or modification is tested to ensure that it meets the business requirements that were laid out. Very often, however, no QA test cases are written to test the specified level of data quality or the meaning of the data behind the fields.

Business Data Stewards can help this situation by working with QA to write test cases in these two areas. QA runs the test cases just as it would any other tests, and the results and defects are recorded in the QA tracking system, just like any other test cases. They are then prioritized for correction along with any other discovered defects.

Test cases that test for violations of data quality rules look for things like the following (a brief automated sketch follows the list):

- Invalid values for fields that have a valid value set.

- Values that fall outside of the associated range, pattern, and type.

- Missing relationships (a home equity loan must have a collateral record) based on field values.

- Unpopulated fields that have been declared mandatory.

- Whether a record can be created even though mandatory data is missing (it should not be possible).
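
A minimal sketch of what such test cases could look like when automated is shown below. The `can_create_loan` helper is a stand-in for the application under test, and the rules it enforces are illustrative assumptions; the point is simply that data quality rules translate directly into assertions.

```python
# Stand-in validation logic for illustration; real test cases would exercise
# the application itself rather than a local helper.
VALID_LOAN_TYPES = {"HOME_EQUITY", "AUTO", "PERSONAL"}

def can_create_loan(record: dict) -> bool:
    """Reject records that violate the illustrative data quality rules."""
    if record.get("loan_type") not in VALID_LOAN_TYPES:
        return False  # invalid value for a field with a valid value set
    if not (0 < record.get("amount", 0) <= 1_000_000):
        return False  # value outside the allowed range
    if record.get("loan_type") == "HOME_EQUITY" and not record.get("collateral_id"):
        return False  # missing mandatory relationship (loan must have collateral)
    return True

def test_rejects_invalid_loan_type():
    assert not can_create_loan({"loan_type": "BOAT", "amount": 500})

def test_rejects_home_equity_without_collateral():
    assert not can_create_loan({"loan_type": "HOME_EQUITY", "amount": 50_000})

def test_cannot_create_record_with_missing_data():
    assert not can_create_loan({"amount": 50_000})  # loan_type missing entirely

if __name__ == "__main__":
    test_rejects_invalid_loan_type()
    test_rejects_home_equity_without_collateral()
    test_cannot_create_record_with_missing_data()
    print("all illustrative data quality test cases passed")
```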

Test cases that are written using the metadata look for things like:

- Does the user interface (e.g., screens or reports) show the expected values based on definitions?

- Do the screens show multiple fields that are actually the same field (for example, the same element appearing under both its full name and an acronym)?

- Are calculated fields correctly derived using the derivation rule for the element?

- Are field labels named correctly based on the official data element name?

Summary

Data Stewards play a pivotal role in many enterprise efforts that involve improving the quality and usability of the data asset. While it is not impossible to improve data quality, manage and harmonize reference data, perform identity resolution, and manage metadata without Data Stewardship, it is more difficult, takes longer, and may require extensive rework. In addition, you are less likely to get the results you want. That is because there may be no one responsible and accountable for making decisions about the data, so the results can be suboptimal and may not take into account the needs of all the stakeholders.

Improving data quality involves participation from Business Data Stewards to understand the uses of the data, establish the data quality rules that define when the data is suitable for those uses, and review the results of data profiling to discern where the quality is not up to the levels needed to run the business. Improving data quality is both the main reason to have Data Stewardship and one of the more visible ways that Business Data Stewards show their value to the company.

Valid values are established within the source systems based on how the source systems need to transact business. Business Data Stewards are key in not only determining what the valid values should be, but also in managing those values by making decisions about the meaning of each value and adding values when appropriate. Further, any time reference data from different systems needs to be brought together (harmonized) into a data warehouse or other mechanism for combining data from multiple systems, Business Data Stewards must map the source values to the aligned values that enable an overall enterprise view of the data.

With the increasing importance of MDM, the need to perform identity resolution has become critical. Business Data Stewards help to identify the identifying attributes, locate records that represent instances of the master entity in different systems, examine the contents of the identifying attributes to determine whether they are usable, and map the different attributes from different systems together to enable the creation of the unique golden copy of the entity. They must also provide recommendations on the sensitivity of the matches to achieve results consistent with the data usage.

MDM also involves survivorship and exception handling. For survivorship, Business Data Stewards must establish the rules that determine what data to persist in the golden copy when records representing the same entity contain conflicting information. These rules may actually change over time as new systems come online or the trustworthiness of different source systems increases or decreases. For exception handling, Business Data Stewards (with lots of assistance from IT) must analyze new processing exceptions and figure out the root cause, and either fix the data or change the processing rules to handle the new data traits.

Although many companies establish an information security policy independently from Data Stewardship, it is Business Data Stewards who need to assign and document the security classification of the key business data elements, as well as help locate the physical data elements that IT must protect from misuse per the policy.

In companies that manage their metadata rigorously (especially in a metadata repository), Business Data Stewards have responsibilities to establish definitions, derivations, data quality rules, and creation and usage business rules. They must also participate in the effort to link the business data elements to their physical counterparts. In addition, if the metadata toolset needs additional functionality or improvements to the user interface, Business Data Stewards have a very definite say in specifying the additions and improvements.

QA test cases can test for violations of data quality rules and creation rules to ensure that the specified level of quality is being maintained by the application, and that records cannot be created if data is missing. The metadata test cases check for correctness of field labels, values in those fields (based on definitions), and other violations of the metadata specified for the data element.