Information Management: Strategies for Gaining a Competitive Advantage with Data (2014)

Chapter 4. Data Quality

Passing the Standard

We might as well not do any data storage if we are not storing and passing high quality data. This chapter defines data quality and a program to maintain high standards throughout the enterprise.

Keywords

data quality; referential integrity; system of origination; data profiling

Before we get too far into the platforms that will store the information and analytics our companies need, I need to issue a warning. Data anywhere in the enterprise ecosystem that is considered of poor quality will be detrimental to the information manager’s goals, to any system affected, and to overall company goals. Though there is a lot of work involved in the platform itself, focusing strictly on the platform and ignoring the quality of the data that goes into it is done at great peril.

How many applications could successfully maximize their function with “dirty data” stemming from poorly designed data models? Could cross-selling and upselling be maximized if the customers were not unique, their data was not complete, or the products were at different levels of granularity or had incorrect attribution? Could you do credit card fraud detection or churn management correctly if you did not have the customer’s transaction pattern correct? Many of these applications have failed in the past because they were not supported with sufficiently clean, consistent data about business entities, such as common definitions of customer, prospect, product, location, part, etc.

It is important to understand the flow of information in the business ecosystem and to inject data quality into the ecosystem at the most leverageable points: those systems that serve the function of data distribution. The master data management hub, as well as other systems of origination, are some of the most leverageable components and should epitomize data quality. The data warehouse is leverageable to the post-operational analytics environment. In reality, each environment (even big data ones) needs data to be up to a standard prior to its use.

Data quality should be suitable to a particular purpose, and that purpose changes from operational systems to Master Data Management (MDM) to the data warehouse. For example, classification data is often neglected in operational systems because it doesn’t impact effective transaction processing. It is addressed more effectively in MDM solutions, where classification data is more of a focus, and in data warehouse solutions that integrate identification data from multiple domains of values (e.g., stock ticker and CUSIP as two separate domains of identification values for the same entities). A key concept here is suitability to purpose.

That is the mentality for how I’ll define data quality—data that is void of intolerable defects. These intolerable defects should be defined according to the uses of the data. There is no national, industry, or vendor standard that you should aspire to. Data quality takes effort. It must be time (and money) well spent. Chasing what someone else decided is good for the system is not the best use of time—unless that someone else is the data steward.

Data Quality Defined

Data quality is the absence of intolerable defects. It is not the absence of defects; every enterprise will have those. It is the absence of defects that cause us to fall short of a standard in a way that has real, measurable negative business impact. Those negative effects could include mistreating customers, stocking shelves erroneously, creating foolish marketing campaigns, or missing chances for expansion. Proper data quality management is also a value proposition: it will ultimately fall short of perfection, yet provide more value than it costs.

The proper investment in data quality is based on a bell curve on which the enterprise seeks to achieve the optimal ROI at the top of the curve. Here is a simple example: an e-commerce/direct mail catalog/brick-and-mortar enterprise (e.g., Victoria’s Secret) regularly interacts with its customers. For e-commerce sales, address information is updated with every order. Brick-and-mortar sales may or may not capture the latest consumer address, and direct mail catalog orders will capture the latest address.

However, if I place an order and move two weeks later, my data is out-of-date: short of perfection. Perfection is achievable, but not economically achievable. For instance, an enterprise could hire agents in the field to knock on their customers’ doors and monitor the license plates of cars coming and going to ensure that they know to the day when a customer moves. This would come closer to perfect data on the current address of consumers, but at tremendous cost (not to mention that it would irritate the customer). The point: the data needs to provide more value than it costs.

The data steward represents the business interest in the data and will articulate the rules and provide most of the manual changes necessary to data. The sidebar further describes the roles of the data steward in information management. Please note this is not an abdication of overall data quality responsibility or an excuse for the information manager to not lead or bring ideas to the table. The bottom line responsibility for data quality lies with the information manager.

Data Stewardship Roles and Responsibilities

Data stewards should come from business areas, be assigned on a subject-area basis and perform the following roles for information in the enterprise.

1. Determining the data sources

2. Arbitrating the transformation rules for data

3. Verifying the data

4. Contributing the business description of the data1

5. Supporting the data user community

6. Contributing to the overall information management program governance

7. Ensuring data quality


1The “business metadata”

Since every organization will already have information management components in place, it may be necessary to prioritize remediation efforts. Many times the systems needing improved data quality are obvious. Other times, identifying them requires some profiling. Profiling should be done as a check against the current state of the data in relation to potential defects. It’s a hunt for specific data quality violations, not a general hunt for data flaws.

Data Quality Defect Categories

So, together with the data steward(s) and with an understanding of how the data will be used (both directly and downstream), here are the categories of data quality rule violations:

• Referential Integrity

• Uniqueness

• Cardinality

• Subtype/Supertype

• Reasonable Domains

• Multiple Meaning Columns

• Formatting Errors

• Optional Data

• Derived Data

• Complete Data

• Incorrect Data

• Data Codification

Referential Integrity

Referential integrity (RI) refers to the integrity of reference between data in related tables. For example, product identifiers in sales records need to also be found in the product tables. RI is essential in preventing users from navigating to a “dead end.” Many transactional systems utilize database-enforced RI, but many other systems like data warehouses2 do not, due to the performance and order restrictions it places on data loading. But RI is still important and must be enforced somehow. In the target marketing system, for example, incorrect RI may mean sales with an undefined product and/or to an undefined customer, which effectively takes that sale out of the analysis.
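A minimal sketch of such a check, assuming hypothetical sales and product tables keyed by a product_id column, is a query for orphaned sales rows:

-- Hypothetical tables: sales(product_id, ...) and product(product_id, ...).
-- Sales rows whose product_id has no match in the product table violate RI.
SELECT s.*
FROM sales s
LEFT JOIN product p ON s.product_id = p.product_id
WHERE p.product_id IS NULL;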

Uniqueness

Column(s) assigned to uniquely identify a business entity, like a product identifier, should in fact be unique. It constitutes a violation when the same identifier is used to identify multiple products. This also has implications for RI. Not only do we not want dead-end navigations in the data, but we also want a singular reference. In the target marketing system, for example, a single product for product identifier 123 is expected.
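A sketch of a uniqueness check, again against a hypothetical product table:

-- Product identifiers that appear more than once violate uniqueness.
SELECT product_id, COUNT(*) AS occurrence_count
FROM product
GROUP BY product_id
HAVING COUNT(*) > 1;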

Cardinality

Cardinality restricts the volume of referenceability for any given value. For example, for the targeted marketing system, we may expect to find between one and three addresses for a customer—no more and no less. Other relationships may be based on one-to-one, one-to-many (infinite), or zero-to-many. The zero-to-many is interesting because it allows for the entry of a product without a sale. There are reasonable conditions on the cardinality in this case, whereby these products would be “new” products—perhaps products with an entry date in the last 14 days only.
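A sketch of the one-to-three address rule, assuming a hypothetical customer_address table:

-- Customers carrying more than three addresses violate the cardinality rule.
-- Customers with zero addresses produce no groups here and must be found
-- separately with an outer join from the customer table.
SELECT customer_id, COUNT(*) AS address_count
FROM customer_address
GROUP BY customer_id
HAVING COUNT(*) > 3;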

Subtype/Supertype

Subtype/supertype constructs may also be found in the model. For example, the target marketing system may represent a customer’s marital status as married, divorced, widowed, or single. Different constructs may apply to each marital state; for example, former-spouse data is relevant only for divorced persons. Similarly, an employee may also be a customer. Since most employees are not customers, we would not carry the customer fields for all employees, just for the few who are both.

Subtype/supertype bounds the states of a given business entity. The logical progression through the various states for a business entity (that is: prospect, customer, good customer, former customer) is another quality check.

Reasonable Domain

Reasonable domains for values are especially useful for analyzing numeric data, whose data types can allow an enormous range of values, only a subset of which are reasonable for the field. These include items like incomes > $2,000,000 or < 0, and offers mailed > 1,000,000 or < 1.

Valid values for credit scores are another example. The FICO range is 300 to 900, and values outside of this range would be considered unreasonable and quite possibly incorrect. That’s why we want to flag them for investigation.
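A sketch of a reasonable-domain check, assuming a hypothetical customer table with annual_income and fico_score columns:

-- Values outside the reasonable ranges are flagged for investigation,
-- not silently corrected.
SELECT customer_id, annual_income, fico_score
FROM customer
WHERE annual_income < 0
   OR annual_income > 2000000
   OR fico_score < 300
   OR fico_score > 900;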

Multiple Meaning Columns

Data columns should not be used for multiple meanings. Data types constrain the values in a column, but only to a degree. Any mix of 0 to 20 characters can go into a character(20) data type column. However, if this is a Name field, there are some sequences that you do not expect to find, such as all numbers, which may indicate the field is occasionally being used for a separate meaning. In that case, the field is flagged as containing inappropriate data.
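One way to hunt for that kind of misuse, assuming a hypothetical customer table and a database that supports the SQL standard SIMILAR TO predicate (pattern-matching operators vary by product):

-- Names consisting entirely of digits suggest the column is being
-- used for a separate meaning.
SELECT customer_id, full_name
FROM customer
WHERE full_name SIMILAR TO '[0-9]+';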

Formatting Errors

There are various formatting errors that can be found in a field. Typical formatting errors found in name columns include, for example:

• A space in front of a name

• Two spaces between first and last name and/or middle initial

• No period after middle initial

• Inconsistent use of middle initial (sometimes used, sometimes not)

• Use of ALL CAPS

• Use of “&” instead of “and” when indicating plurality

• Use of a slash (/) instead of a hyphen (-)

In environments where original data entry is free-form, unconstrained, and without the use of master data as a reference, many of the violations mentioned above will be found.
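Several of the violations above can be expressed directly as predicates. A sketch against a hypothetical customer table (string functions and collation behavior vary by database):

-- Each predicate corresponds to one of the formatting errors listed above.
SELECT customer_id, full_name
FROM customer
WHERE full_name LIKE ' %'             -- leading space
   OR full_name LIKE '%  %'           -- doubled spaces
   OR full_name = UPPER(full_name)    -- ALL CAPS (assumes a case-sensitive collation)
   OR full_name LIKE '%&%'            -- "&" instead of "and"
   OR full_name LIKE '%/%';           -- slash instead of hyphen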

Optional Data

Another type of typical data quality error involves columns that should only contain data if certain data values (or null) are present elsewhere. For example, in the customer table where some customers are organizations and others are individuals, only organizations have SIC codes, organization size, and so on (unless the presence of that data actually represents data for the individual’s organization).
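A sketch of this kind of conditional check, assuming a hypothetical customer table with a customer_type discriminator:

-- Organization-only attributes should not be populated for individuals.
SELECT customer_id, customer_type, sic_code
FROM customer
WHERE customer_type = 'INDIVIDUAL'
  AND sic_code IS NOT NULL;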

Derived Data

Some systems derive data from data in other systems. For example, one system may contain discount amount, unit price, and the discount percentage—a simple division calculation. Checking the veracity of calculations done in other systems is a common quality check.
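A sketch of such a check, assuming a hypothetical order_line table that carries both the inputs and the stored derivation:

-- Recompute the percentage and flag rows where the stored value
-- disagrees beyond a small tolerance.
SELECT order_id, line_id, discount_pct,
       discount_amount / unit_price AS recomputed_pct
FROM order_line
WHERE unit_price <> 0
  AND ABS(discount_pct - (discount_amount / unit_price)) > 0.001;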

Complete Data

Data should exist consistently through all of its history, without gaps, so that users are not left to wonder what happened during missing periods. Checking for such gaps is another common quality check.

Incorrect Data

There can also be incorrect data—for example, the inconsistent use of initials and misspellings or inappropriate spellings (such as William McNight instead of William McKnight).

Data Codification

Finally, does the data conform to the expected set of “clean” values for the column? For example, in the gender column we expect to find M, F, U (unknown), or the like.
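A sketch of a codification check on the gender example, assuming the same hypothetical customer table (whether null is tolerated is itself a rule for the data steward to decide):

-- Any value outside the clean set is surfaced along with its frequency.
SELECT gender, COUNT(*) AS row_count
FROM customer
WHERE gender NOT IN ('M', 'F', 'U')
   OR gender IS NULL
GROUP BY gender;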

Together, these categories feed data quality rule determination: the process of establishing the rules that govern our data.

Data Profiling

Hopefully, categorizing data quality like this takes a bit of the load off. You have a list of categories from which violations come, to review in conjunction with the data steward(s) and, eventually, against the data itself.

Once the rules are determined, it’s time to check them against the data itself in a process known as Data Profiling. Rules are chosen based on the business impact of the violation. Judgment needs to be used in determining the rules that you will take action on. In profiling, for example, you may learn that an important rule is not being violated, so you would not choose to take action on it. Depending on the depth of the data quality problem and considering agile principles (Chapter 16), the actions themselves should be staggered into the environment.

To enforce the “new” business rules consistently across all the data will require a cleanup effort. This brings us to the starting gate for making physical changes to the data to improve enterprise data quality.

Existing structures and their data must be analyzed to fit into the environment after the modeling cures are applied. It is highly desirable that data quality be consistent across all history and into the future. This means remedying historical data (data created before the cleansing rules were modeled) to the go-forward standard. This can also present a set of challenges.

If, for example, you want to constrain future data entry for last names to be at least 2 characters, but you have several that are only 1 character today, something must be done with that data to bring it to standard. If, for example, you wanted to constrain future birthdates to be less than the system date, but you have several that indicate a birth year of 2099, what are you going to do with that data?
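Before deciding what to do, it helps to size the historical problem. A sketch against a hypothetical customer table (CHAR_LENGTH and CURRENT_DATE are standard SQL, though function names vary by product):

-- Count how many historical rows fall short of the go-forward standard.
SELECT SUM(CASE WHEN CHAR_LENGTH(last_name) < 2 THEN 1 ELSE 0 END) AS short_last_names,
       SUM(CASE WHEN birth_date > CURRENT_DATE THEN 1 ELSE 0 END) AS future_birth_dates
FROM customer;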

When there are fields like name in a customer record that, if incomprehensible, would invalidate the rest of the record, you cannot just nullify the data. There is a level of poor-quality data, such as a name of “xxx,” that leaves you with nowhere to go by itself. From here, you can:

1. Give up and delete the record

2. Salvage the rest of the record (if it’s of high quality) by making the name something like “Unknown”3

3. Pass the entire record to a human, who can do research with the other fields to come up with the name

No one field’s fate should necessarily be decided based solely on that field’s value. The adjacent fields should be considered as well.

There are other fields that you would expend less energy on getting correct. One key overriding principle in data quality is that you will never get it 100% perfect. You must balance perfection against timelines, budget, and the value of the improvement to the data. A field like automobile make may be only mildly interesting, and it may also be impossible to ascertain should the value be “xxx.” For those invalid fields, you would nullify them.

You will never get data quality 100% perfect

When developing the cleanup strategy, it is worth reiterating that the primary focus of the decisions in support of data quality should be the business interests. These are best represented directly by those in the business with daily, direct interaction with the data in question. This is a good quality for the data stewards to have.

Data profiling can be performed with software, but do not simply train the software on the data and say “have at it.” Enter your rules. Find out which ones are being violated and to what degree. Data profiling can also be performed using SQL. GROUP BY statements can show the spread of data in any column.

For example, we may find that, despite desiring/anticipating birthdates between 1900 and 1999 in our B2C customer table, there are some birthdates of 1869, 1888, 2000, 2999, and 9999. If data entry is unconstrained, as it often is, you will have these situations. Another example for consumer data entered online is a larger than expected number of consumers in Alabama, because it’s the first choice on the list.

There are possibly some typos in this list (1869, 1888), but there is also some weirdness (2999 and 9999), which possibly indicates some rushed or lazy input. There is also 2000, which possibly challenges our assumption that we don’t have any customer that young. Maybe we do. Maybe we let one through. Maybe the rule should be changed. Or maybe it’s a valid rule and we want to know about any time it’s violated. Remember, we are profiling in order to put into place rules that will be run on a regular basis on the data. We could do individual investigation on each violation, but it’s more important to “stop the bleeding” going forward.
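As noted above, a simple GROUP BY is enough for this kind of profiling. A sketch against a hypothetical customer table with a birth_date column:

-- The spread of birth years; out-of-range years such as 1869, 2999,
-- or 9999 stand out immediately in the result.
SELECT EXTRACT(YEAR FROM birth_date) AS birth_year,
       COUNT(*) AS customer_count
FROM customer
GROUP BY EXTRACT(YEAR FROM birth_date)
ORDER BY birth_year;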

Many times, data quality work covers up for shortcomings in the originating systems. As master data management and other systems take up more data origination processes within organizations, information with better quality will flow into the organization.

Sources of Poor Data Quality

The following are seven sources of data quality issues.

1. Entry quality: Did the information enter the system correctly at the origin?

2. Process quality: Proper checks and quality control at each touchpoint along the path can help ensure that problems are rooted out, but these checks are often absent in legacy processes.

3. Identification quality: Data quality processes can largely eliminate this problem by matching records, identifying duplicates, and placing a confidence score4 on the similarity of records.

4. Integration quality: Is all the known information about an object integrated to the point of providing an accurate representation of the object?

5. Usage quality: Is the information used and interpreted correctly at the point of access?

6. Aging quality: Has enough time passed that the validity of the information can no longer be trusted?

7. Organizational quality: The biggest challenge to reconciliation is getting the various departments to agree that their A equals the other’s B equals the other’s C plus D.

Master Data Management as a System of Data Origination

A growing source of important information in our enterprises is found in master data management (MDM) systems, which will be discussed in Chapter 7. Organizations are increasingly turning to MDM systems to improve data origination processes by utilizing MDM’s workflow, data quality, and business rule capabilities. MDM systems provide for the management of complex hierarchies within the data, providing access to those hierarchies at any point in historical time.

If used as the system of origination or data enrichment, an MDM system generates the single version of the truth for the data it masters before any other systems gain access to the data. Downstream environments then work with the corporately adjudicated, high-quality master data, as opposed to environments in which each system is responsible for its own data.

This architecture can have a dramatic effect on enterprise data quality. However, the MDM environment must be modeled well in order to achieve the benefits.

You could be moving all kinds of interesting data around the organization with MDM, but if it does not adhere to a high standard of quality, it can all be for naught. Actually, that would be an MDM implementation that would not be worth doing at all.

Of course, again, the best place to ensure data quality is at the original point of entry. Remediation efforts after that point are more costly and less effective. Many data entry systems, even MDM, allow for free-form data entry, which is a real inhibitor to system success. The modeling may be more important in an MDM system than in any other system in the enterprise.

Cures for Poor Data Quality

It is much more costly to fix data quality errors in downstream systems than it is to fix them at the point of origin. Primarily, if data entry in operational systems can be constrained to enter valid sets of values, this is a great help.

For example, if a name is being entered, automatic fill-in is a handy data quality assist. If a product is being entered, it should be from an up-to-date drop-down list. If appropriate, new data being entered can be compared against previously entered similar (including phonetically similar) data, and the entry function can be verified with “are you sure” type prompts.

If 50% of the data being entered into a field is unusable, our transformation strategy has no hope of salvaging much use out of this field. This is often the case with call-center data entry, where agents are rewarded based on the volume of calls taken. When this is the case, it’s easy to ignore certain fields on the screen and accept the default (or worse—enter what’s convenient but incorrect).

Fixing early entry systems is not always possible, since many organizations deal with complex, older operational systems that are not easily changed. Many of these systems were written without an understanding of data quality, or with data quality scoped solely to the operational purpose of the system. Changing them takes many months and may involve a major cross-functional effort. Downstream systems, such as data warehouses, benefit when quality is improved at the point of origin.

Wherever the changes need to take place, there is a reasonable set of actions that can be performed on data to increase its quality:

Hold Out: Record(s) that are brought into the environment from other environments are held out of the main data areas in cases of gross rule violation. They may be placed into “holding” tables for manual inspection and action. If adopting this approach, be sure procedural reviews are held quickly, because data is held out of the system until it is accounted for.

Report on It: The data quality violation is reported on, but the data is loaded and remains in the data store for the system. Typically the handling of these violations involves nothing more than a post-load report on just-loaded data that creates visibility into the situation. These reports are not useful unless they are reviewed by the appropriate systems and business personnel.

Change Data: Transform data to a value in a master set of “good” values (for example, Texus is changed to Texas) or otherwise apply rules to improve the data.
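A sketch of the change-data action, assuming a hypothetical state_correction table that holds steward-approved mappings from bad values to good values:

-- Replace known-bad values with their steward-approved corrections.
UPDATE customer_address
SET state_name = (SELECT good_value
                  FROM state_correction
                  WHERE bad_value = customer_address.state_name)
WHERE state_name IN (SELECT bad_value FROM state_correction);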

It is also quite possible that referential sets of data, not singular rows, are affected when other forms of data quality defects are detected, and the appropriate action should be taken on that data as well as the data directly affected by the quality defect. As with all data quality checks, the data just loaded is the only data that needs to be checked.

It used to be that operational systems were hands-off to any interests beyond support of quick data entry. Well-intended speedup measures like removing drop-down lists and referential integrity had the predictable knock-on effect of lowering overall data quality. When the organization’s lifeblood was the speed of those transaction entries, that made sense. Now that information is the battleground, that strategy, in isolation, doesn’t make sense. We must find ways to speed up transaction entry while providing data quality at the same time.

To those who are skittish about touching operational systems and performance, try harder. The ghost was given up far too easily in the past.

The major data modeling constructs relevant to data entry data quality, which tie back directly to the data quality defect categories, are:

1. Uniqueness

2. Reasonable Domain

3. Referential Integrity

4. Completeness

5. Derived Data

Defaults and null constraints are usually more problematic to data quality when used than when not used because they allow for an abstract value (or null, which is no value) to be used in place of a customized, relevant value.

These major constructs relevant to data quality are enforced in the organization in its data models. Implemented data models are the most leverageable place in the entire architecture to enforce change. If you get the data model correct by following the above constructs, the data has a much higher chance of having quality. Get the data model wrong and you spawn innumerable downstream workarounds and organization gyrations that run cover for the model shortcomings.

Data Quality Scoring

Once the systems and the data quality rules are identified and the data is characterized, scoring the data quality needs to be performed. Scoring represents the state of the data quality for that rule. System scores are an aggregate of the rule scores for that system and the overall score is a prorated aggregation of the system scores.

Scoring is a relative measure of conformance to rules. For a given rule, it could be as simple as the percentage of opportunities for rule enforcement that are positively met. For example, if 94% of genders conform to the desired values, that score is 94%.

Since many of the results will (hopefully) be above 99%, to provide a higher level of granularity to the scoring, you may set a floor at 50 (or any number) and measure the adherence above that number. For example, using the gender example with a floor of 50, the score is (94 - 50) / (100 - 50), or 88%, instead of 94%.
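A sketch of both scores for the gender rule, against the same hypothetical customer table:

-- raw_score: percent of rows conforming to the clean set.
-- floored_score: the same measure rescaled above a floor of 50.
SELECT 100.0 * SUM(CASE WHEN gender IN ('M', 'F', 'U') THEN 1 ELSE 0 END) / COUNT(*)
         AS raw_score,
       (100.0 * SUM(CASE WHEN gender IN ('M', 'F', 'U') THEN 1 ELSE 0 END) / COUNT(*) - 50)
         / (100 - 50) * 100 AS floored_score
FROM customer;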

Action Plan

• Survey and prioritize your company systems according to their use of and need for quality data

• Develop business data stewards—subject matter experts and extended information team members—for important subject areas

• Determine, with the data stewards, the rules the data should live up to

• Profile data against the rules

• Score the data in the system

• Measure impact of various levels of data quality improvement

• Improve data quality

• Review systems of data origination and consider master data management

www.mcknightcg.com/bookch4


2This is prevalent in data warehousing data stores since database RI duplicates RI enforced via the ETL process, thus unnecessarily requiring additional hardware and processing time.

3While carrying the poor quality data in a separate column or table for reference

4Column indicating a measure of confidence (1–100) in another column of data