
Chapter 2. A Risk-Based De-Identification Methodology

Before we can describe how we de-identified any health data sets, we need to describe some basic methodology. It’s a necessary evil, but we’ll keep the math to a bare minimum. The complete methodology and its justifications have been provided in detail elsewhere.[16] Here we’ll just provide a high-level description of the key steps. The case studies that we go through in subsequent chapters will illustrate how each of these steps applies to real data. This will help you understand in a concrete way how de-identification actually works in practice.

Basic Principles

Some important basic principles guide our methodology for de-identification. These principles are consistent with existing privacy laws in multiple jurisdictions.

The risk of re-identification can be quantified

Having some way to measure risk allows us to decide whether it’s too high, and how much de-identification needs to be applied to a data set. This quantification is really just an estimate under certain assumptions. The assumptions concern data quality and the type of attack that an adversary will likely launch on a data set. We start by assuming ideal conditions about data quality for the data set itself and the information that an adversary would use to attack the data set. This assumption, although unrealistic, actually results in conservative estimates of risk (i.e., setting the risk estimate a bit higher than it probably is) because the better the data is, the more likely it is that an adversary will successfully re-identify someone. It’s better to err on the conservative side and be protective, rather than permissive, with someone’s personal health information. In general, scientific evidence has tended to err on the conservative side—so our reasoning is consistent with some quite strong precedents.

The Goldilocks principle: balancing privacy with data utility

It’s important that we produce data sets that are useful. Ideally, we’d like to have a data set that has both maximal privacy protection and maximal usefulness. Unfortunately, this is impossible. Like Goldilocks, we want to fall somewhere in the middle, where privacy is good, but so is data utility. As illustrated in Figure 2-1, maximum privacy protection (i.e., zero risk) means very little to no information being released. De-identification will always result in some loss of information, and hence a reduction in data utility. We want to make sure this loss is minimal so that the data can still be useful for data analysis afterwards. But at the same time we want to make sure that the risk is very small. In other words, we strive for an amount of de-identification that’s just right to achieve these two goals.

Figure 2-1. The trade-off between perfect data and perfect privacy

The re-identification risk needs to be very small

It’s not possible to disclose health data and guarantee zero risk of records being re-identified. Requiring zero risk in our daily lives would mean never leaving the house! What we want is access to data and a very small risk of re-identification. It turns out that the definition of very small will depend on the context. For example, if we’re releasing data on a public website, the definition of very small risk is quite different from when we’re releasing data to a trusted researcher who has good security and privacy practices in place. A repeatable process is therefore needed to account for this context when defining acceptable risk thresholds.

De-identification involves a mix of technical, contractual, and other measures

A number of different approaches can be used to ensure that the risk of re-identification is very small. Some techniques can be contractual, some can be related to proper governance and oversight, and others can be technical, requiring modifications to the data itself. In practice, a combination of these approaches is used. It’s considered reasonable to combine a contractual approach with a technical approach to get the overall risk to be very small. The point is that it’s not necessary to use only a technical approach.

Steps in the De-Identification Methodology

There are some basic tasks you’ll have to perform on a data set to achieve the degree of de-identification that’s acceptable for your purposes. Much of the book will provide detailed techniques for carrying out these steps.

Step 1: Selecting Direct and Indirect Identifiers

The direct identifiers in a data set are those fields that can be directly used to uniquely identify individuals or their households. For example, an individual’s Social Security number is considered a direct identifier, because there’s only one person with that number. Indirect identifiers are other fields in the data set that can be used to identify individuals. For example, date of birth and geographic location, such as a ZIP or postal code, are considered indirect identifiers. There may be more than one person with the same birthdate in your ZIP code, but maybe not! And the more indirect identifiers you have, the more likely it becomes that an attacker can pinpoint an individual in the data set. Indirect identifiers are also referred to as quasi-identifiers, a term we’ll use throughout.

EXAMPLES OF DIRECT AND INDIRECT IDENTIFIERS

Examples of direct identifiers include name, telephone number, fax number, email address, health insurance card number, credit card number, Social Security number, medical record number, and social insurance number.

Examples of quasi-identifiers include sex, date of birth or age, location (such as postal code, census geography, and information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as dates of admission, discharge, procedure, death, specimen collection, or visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality.

Both types of identifying fields characterize information that an adversary can know and then use to re-identify the records in the data set. The adversaries might know this information because they’re acquaintances of individuals in the data set (e.g., relatives or neighbors), or because that information exists in a public registry (e.g., a voter registration list). The distinction between these two types of fields is important because the method you’ll use for anonymization will depend strongly on such distinctions.

NOTE

We use masking techniques to anonymize the direct identifiers, and de-identification techniques to anonymize the quasi-identifiers. If you’re not sure whether a field is a direct or indirect identifier, and it will be used in a statistical analysis, then treat it as an indirect identifier. Otherwise you lose that information entirely, because masking doesn’t produce fields that are useful for analytics, whereas a major objective of de-identifying indirect identifiers is to preserve analytic integrity (as described in Step 4: De-Identifying the Data).
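
To make this concrete, here’s a minimal sketch, in Python, of how you might tag the fields in a data set before anonymizing it. The column names are made up for illustration, and the routing simply follows the note above: direct identifiers get masked, quasi-identifiers get de-identified, and everything else is left alone for the analysts.

    # Hypothetical field classification for a health data set.
    DIRECT_IDENTIFIERS = {"name", "ssn", "medical_record_number", "email"}
    QUASI_IDENTIFIERS = {"sex", "date_of_birth", "zip_code", "admission_date",
                         "diagnosis_code"}

    def anonymization_plan(column):
        """Route each field to masking, de-identification, or no treatment."""
        if column in DIRECT_IDENTIFIERS:
            return "mask"
        if column in QUASI_IDENTIFIERS:
            return "de-identify"
        return "leave as is"  # e.g., lab results an adversary wouldn't know

    for col in ["name", "zip_code", "hemoglobin_a1c"]:
        print(col, "->", anonymization_plan(col))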

Step 2: Setting the Threshold

The risk threshold represents the maximum acceptable risk for sharing the data. This threshold needs to be quantitative and defensible. There are two key factors to consider when setting the threshold:

§ Is the data going to be in the public domain (a public use file, for example)?

§ What’s the extent of the invasion of privacy when this data is shared as intended?

A public data set has no restrictions on who has access to it or what users can do with it. For example, a data set that will be posted on the Internet, as part of an open data or an open government initiative, would be considered a public data set. For a data set that’s not going to be publicly available, you’ll know who the data recipient is and can impose certain restrictions and controls on that recipient (more on that later).

The invasion of privacy evaluation considers whether the data release would be considered an invasion of the privacy of the data subjects. Things that we consider include the sensitivity of the data, potential harm to patients in the event of an inadvertent disclosure, and what consent mechanisms existed when the data was originally collected (e.g., did the patients consent to this use or disclosure?). We’ve developed a detailed checklist for assessing and scoring invasion of privacy elsewhere.[16]
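
As a rough sketch of how these two factors might be encoded, the snippet below picks a threshold from the ranges discussed later in this chapter (see Risk Thresholds), sliding toward the low end as the invasion-of-privacy score increases. The 0-to-1 scoring scale and the linear interpolation are our own simplification for illustration, not part of the published checklist.

    def risk_threshold(public_release, invasion_of_privacy_score):
        """invasion_of_privacy_score: 0.0 (minimal invasion) to 1.0 (maximal)."""
        low, high = (0.05, 0.09) if public_release else (0.05, 0.1)
        # A greater invasion of privacy pushes the threshold toward the low end.
        return high - (high - low) * invasion_of_privacy_score

    print(risk_threshold(public_release=True, invasion_of_privacy_score=0.8))   # 0.058
    print(risk_threshold(public_release=False, invasion_of_privacy_score=0.2))  # 0.09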

Step 3: Examining Plausible Attacks

Four plausible attacks can be made on a data set. The first three are relevant when there’s a known data recipient, and the last is relevant only to public data sets:

1. The data recipient deliberately attempts to re-identify the data.

2. The data recipient inadvertently (or spontaneously) re-identifies the data.

3. There’s a data breach at the data recipient’s site and the data is “in the wild.”

4. An adversary can launch a demonstration attack on the data.

If the data set will be used and disclosed by a known data recipient, then the first three attacks need to be considered plausible ones. These three cover the universe of attacks we’ve seen empirically. There are two general factors that affect the probability of these three types of attacks occurring:

Motives and capacity

Whether the data recipient has the motivation, resources, and technical capacity to re-identify the data set

Mitigating controls

The security and privacy practices of the data recipient

We’ve developed detailed checklists for assessing and scoring these factors elsewhere.[16]

Motives can be managed by having enforceable contracts with the data recipient. Such an agreement will determine how likely a deliberate re-identification attempt would be.

MANAGING THE MOTIVES OF RE-IDENTIFICATION

It’s important to manage the motives of re-identification for data recipients. You do that by having contracts in place, and these contracts need to include very specific clauses:

§ A prohibition on re-identification

§ A requirement to pass on that prohibition to any other party the data is subsequently shared with

§ A prohibition on attempting to contact any of the patients in the data set

§ An audit requirement that allows you to conduct spot checks to ensure compliance with the agreement, or a requirement for regular third-party audits

Without such a contract, there are some very legitimate ways to re-identify a data set. Consider a pharmacy that sells prescription data to a consumer health portal. The data is de-identified using the HIPAA Safe Harbor de-identification standard and contains patient age, gender, dispensed drug information, the pharmacy location, and all of the physician details (we’ve discussed the privacy of prescription data elsewhere).[17], [18] So, in the eyes of HIPAA there are now few restrictions on that data.

The portal operator then matches the prescription data from the pharmacy with other data collected through the portal to augment the patient profiles. How can the portal do that? Here are some ways:

§ The prescriber is very likely the patient’s doctor, so the data will match that way.

§ Say the portal gets data from the pharmacy every month. By knowing when a data file is received, the portal will know that the prescription was dispensed in the last month, even if the date of the prescription is not provided as part of the data set.

§ The patient likely lives close to the prescriber—so the portal would look for patients living within a certain radius of the prescriber.

§ The patient likely lives close to the pharmacy where the drug was dispensed—so the portal would also look for patients living within a certain radius of the pharmacy.

§ The portal can also match on age and gender.

With the above pieces of information, the portal can then add to the patients’ profiles with their exact prescription information and deliver competing drug advertisements when patients visit the portal. This is an example of a completely legitimate re-identification attack on a data set that uses Safe Harbor. Unless there is a contract with the pharmacy explicitly prohibiting such a re-identification, there is nothing keeping the portal from doing this.

Mitigating controls will have an impact on the likelihood of a rogue employee at the data recipient being able to re-identify the data set. A rogue employee may not necessarily be bound by a contract unless there are strong mitigating controls in place at the data recipient’s site.

A demonstration attack, the fourth in our list of attacks, occurs when an adversary wants to make a point of showing that a data set can be re-identified. The adversary is not looking for a specific person, but the one or more that are easiest to re-identify—it is an attack on low-hanging fruit. It’s the worst kind of attack, producing the highest probability of re-identification. A demonstration attack has some important features:

§ It requires only a single record to be re-identified to make the point.

§ Because academics and the media have performed almost all known demonstration attacks,[19] the available resources to perform the attack are usually scarce (i.e., limited money).

§ Publicizing the attack is important for its success as a demonstration, so illegal or suspect behaviors are unlikely to be performed as part of the attack (e.g., using stolen data or misrepresentation to get access to registries).

The first of these features can lead to an overestimation of the risks in the data set (remember, no data set is guaranteed to be free of re-identification risk). But the latter two features usually limit the why and how to a smaller pool of adversaries, and point to ways that we can reduce their interest in re-identifying a record in a data set (e.g., by making sure that the probability of success is sufficiently low that it would exhaust their resources). There’s even a manifesto for privacy researchers on ethically launching a demonstration attack.[20] Just the same, if a data set will be made publicly available, without restrictions, then this is the worst case that must be considered because the risk of an attack on low-hanging fruit is, in general, possible.

In a public data release our only defense against re-identification is modifying the data set. There are no other controls we can use to manage re-identification risk, and the Internet has a long memory. Unfortunately, this will result in a data set that has been modified quite a bit. When disclosing data to a known data recipient, other controls can be put in place, such as the contract and security and privacy practice requirements in that contract. These additional controls will reduce the overall risk and allow fewer modifications to the data.

The probabilities of these four types of attacks can be estimated in a reasonable way, as we’ll describe in Probability Metrics, allowing you to analyze the actual overall risk in each case. For a nonpublic data set, if all three risk values for attacks T1–T3 are below the threshold determined in Step 2: Setting the Threshold, the overall re-identification risk can be considered very small.

Step 4: De-Identifying the Data

The actual process of de-identifying a data set involves applying one or more of three different techniques:

Generalization

Reducing the precision of a field. For example, the date of birth or date of a visit can be generalized to a month and year, to a year, or to a five-year interval. Generalization maintains the truthfulness of the data.

Suppression

Replacing a value in a data set with a NULL value (or whatever the data set uses to indicate a missing value). For example, in a birth registry, a 55-year-old mother would have a high probability of being unique. To protect her we would suppress her age value.

Subsampling

Releasing only a simple random sample of the data set rather than the whole data set. For example, a 50% sample of the data may be released instead of all of the records.

These techniques have been applied extensively in health care settings and we’ve found them to be acceptable to data analysts. They aren’t the only techniques that have been developed for de-identifying data, but many of the other ones have serious disadvantages. For example, data analysts are often very reluctant to work with synthetic data, especially in a health care context. The addition of noise can often be reversed using various filtering methods. New models, like differential privacy, have some important practical limitations that make them unsuitable, at least for applications in health care.[21] And other techniques have not been applied extensively in health care settings, so we don’t yet know if or how well they work.
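
As a quick illustration of these three techniques, here’s a minimal sketch using pandas on a toy data set. The column names, the values, and the choice of what to generalize or suppress are all hypothetical.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "date_of_birth": pd.to_datetime(["1958-03-14", "1990-07-02", "2001-11-23"]),
        "age": [55, 23, 11],
        "zip_code": ["90210", "10001", "60601"],
    })

    # Generalization: reduce precision (date of birth to year, age to 5-year bands).
    df["birth_year"] = df["date_of_birth"].dt.year
    df["age_group"] = ((df["age"] // 5) * 5).astype(float)

    # Suppression: replace a risky value with a missing value.
    df.loc[df["age"] >= 50, "age_group"] = np.nan  # e.g., suppress an outlier age

    # Subsampling: release a simple random sample rather than every record.
    released = df.sample(frac=0.5, random_state=42)
    print(released[["birth_year", "age_group", "zip_code"]])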

Step 5: Documenting the Process

From a regulatory perspective, it’s important to document the process that was used to de-identify the data set, as well as the results of enacting that process. The process documentation would be something like this book or a detailed methodology text.[16] The results documentation would normally include a summary of the data set that was used to perform the risk assessment, the risk thresholds that were used and their justifications, assumptions that were made, and evidence that the re-identification risk after the data has been de-identified is below the specified thresholds.

Measuring Risk Under Plausible Attacks

To measure re-identification risk in a meaningful way, we need to define plausible attacks. The metrics themselves consist of probabilities and conditional probabilities. We won’t go into detailed equations, but we will provide some basic concepts to help you understand how to capture the context of a data release when deciding on plausible attacks. You’ll see many examples of these concepts operationalized in the rest of the book.

T1: Deliberate Attempt at Re-Identification

Most of the attacks in this section take place in a relatively safe environment, where the institution we give our data to promises to keep it private. Consider a situation where we’re releasing a data set to a researcher. That researcher’s institution, say a university, has signed a data use agreement that prohibits re-identification attempts. We can assume that as a legal entity the university will comply with the contracts that it signs. We can then say that the university does not have the motivation to re-identify the data. The university may have some technical capacity to launch a re-identification attack, though. These two considerations make up the dimension of motives and capacity.

Our assumptions can’t rule out the possibility that someone at the university will deliberately attempt to re-identify the data. There may be a rogue staff member who wants to monetize the data for personal financial gain. In that case, the probability of a re-identification attack will depend on the controls that the university has in place to manage the data (the dimension of mitigating controls).

These two dimensions affect the Pr(attempt) factor in the following equation. The overall probability, Pr(re-id, attempt), is the probability that the university or its staff will both attempt to and successfully re-identify the data. Assuming that an attempt does occur, we can then measure the probability that the adversary successfully re-identifies a record. We call the overall probability of re-identification under this scenario attack T1:

Attack T1

Pr(re-id, attempt) = Pr(attempt) × Pr(re-id | attempt)

The first factor on the righthand side, the probability of attempt, captures the context of the data release; the second factor on the righthand side, the probability of re-identification given an attempt,[22] captures the probability of an adversary actually re-identifying a record in the data set. Each probability is a value between 0.0 and 1.0, so their product will also be in that range.

NOTE

The value of Pr(attempt) needs to be estimated based on common sense and experience. It’s a subjective estimate that’s derived from expert opinion. One scheme for estimating that value has been described elsewhere.[16] The key characteristics of that scheme are that it’s conservative—in that it errs on the side of assigning a higher probability to an attempted re-identification even if the data recipient has good controls in place—and it gives results that are consistent with the levels of risk that people have been releasing data with, today and historically.

Another scheme is to say that if motives and capacity are managed, then we need to focus on the possibility of a rogue employee. If the data set will be accessed by, say, 100 people, how many rogue employees will there be? If there is likely to be only one rogue employee, then we can say that Pr(attempt) is 1/100. If we want to be conservative we can say that 10 of these employees may go rogue—so Pr(attempt) = 0.1.

Going back to our example, if the university doesn’t sign a data use agreement, or if the agreement doesn’t have a prohibition on re-identification attempts by the university, the value for Pr(attempt) will be high. These types of conditions are taken into account when assessing the motives and capacity.

The value of Pr(re-id | attempt), which is the probability of correctly re-identifying a record given that an attempt was made, is computed directly from the data set. Specific metrics come later, in Probability Metrics.
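
Here’s a minimal sketch of the T1 calculation with hypothetical numbers, using the rogue-employee scheme above for Pr(attempt) and a made-up value for the data-derived probability.

    # Hypothetical inputs for attack T1.
    n_with_access = 100          # staff who can access the data set
    likely_rogue = 1             # expert judgment: perhaps one rogue employee
    pr_attempt = likely_rogue / n_with_access   # 0.01 (0.1 if we assume 10 rogues)

    pr_reid_given_attempt = 0.2  # would be computed from the data set itself
    pr_t1 = pr_attempt * pr_reid_given_attempt
    print(pr_t1)                 # 0.002 under these assumptions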

T2: Inadvertent Attempt at Re-Identification

Now let’s consider the second attack, T2. Under this attack, a staff member at the university inadvertently recognizes someone in the data set. This can be, for example, a data analyst who is working with the data set. The data analyst may recognize an acquaintance, such as a neighbor or a relative, in the data set through the individual’s age and ZIP code. Attack T2 is represented as:

Attack T2

Pr(re-id, acquaintance) = Pr(acquaintance) × Pr(re-id | acquaintance)

The value of Pr(acquaintance) captures the probability that a staff member knows someone who is potentially in the data set. For example, if the data set is of breast cancer patients, this probability is that of a randomly selected member of the population knowing someone who has breast cancer. This probability can be computed in a straightforward manner by considering that on average people tend to have 150 friends. This is called the Dunbar number. Although estimates of the Dunbar number vary somewhat, it has consistently been shown to be close to 150 for offline and online relationships alike.[16] We can also get the prevalence of breast cancer in the population from public data sets. Knowing the prevalence, ρ, and the Dunbar number m = 150, we can use the estimate Pr(acquaintance) = 1 − (1 − ρ)^(m/2) (we divide the number of friends in half because breast cancer predominantly affects women).

If we’re considering a data set of breast cancer patients living in California and the adversary is in Chicago, we’re only interested in the prevalence of breast cancer in California, not in Chicago. Therefore, the prevalence needs to be specific to the geography of the data subjects. We can assume that the prevalence is more or less the same in California as it is nationally, and use the national number, or we can look up the prevalence for California in order to be more accurate. We wouldn’t know if the adversary has 150 acquaintances in California, but we can make the worst-case assumption and say that he does. If the prevalence can’t be found, then as a last resort it can be estimated from the data itself. These assumptions would have to be documented.

The next factor in this equation—Pr(re-id | acquaintance), the probability of correctly re-identifying a record given that the adversary knows someone in the population covered by the data set—is computed from the data itself. There will be more on that in Probability Metrics.
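
Here’s a minimal sketch of the T2 calculation. The prevalence figure is a placeholder (look up the real number for your population), and the data-derived probability is made up as well.

    DUNBAR = 150                 # average number of friends
    prevalence = 0.013           # hypothetical prevalence of the condition

    # Halve the friend count because the condition predominantly affects women.
    pr_acquaintance = 1 - (1 - prevalence) ** (DUNBAR / 2)
    print(pr_acquaintance)       # about 0.63 with these illustrative numbers

    pr_reid_given_acq = 0.05     # would be computed from the data set itself
    print(pr_acquaintance * pr_reid_given_acq)   # overall T2 risk, about 0.03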

T3: Data Breach

The third attack, T3, can take place if the university loses the data set—in other words, in the case of a data breach, which was the attack described in Step 3: Examining Plausible Attacks. Current evidence suggests that most breaches occur through losses or thefts of mobile devices. But other breach vectors are also possible. Based on recent credible evidence, we know that approximately 27% of providers that are supposed to follow the HIPAA Security Rule have a reportable breach every year. The HIPAA Security Rule is really a basic set of security practices that an organization needs to have in place—it’s a minimum standard only. Attack T3 is represented as:

Attack T3

Pr(re-id, breach) = Pr(breach) × Pr(re-id | breach)

It should be noted that the 27% breach rate, or Pr(breach) = 0.27, is likely to change over time. But at the time of writing, this is arguably a reasonable estimate of the risk. As with attacks T1 and T2, Pr(re-id | breach) is computed from the data itself, which we cover in Probability Metrics.
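
And here’s a minimal sketch of the T3 calculation, along with the check from Step 3 that all three risks for a nonpublic release fall below the threshold chosen in Step 2. All of the probability values are hypothetical except the 27% breach rate cited above.

    threshold = 0.1                    # chosen in Step 2

    pr_t1 = 0.01 * 0.33                # Pr(attempt) x Pr(re-id | attempt)
    pr_t2 = 0.63 * 0.05                # Pr(acquaintance) x Pr(re-id | acquaintance)
    pr_t3 = 0.27 * 0.2                 # Pr(breach) x Pr(re-id | breach)

    acceptable = all(p <= threshold for p in (pr_t1, pr_t2, pr_t3))
    print(pr_t1, pr_t2, pr_t3)
    print("release OK" if acceptable else "de-identify further")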

IMPLEMENTING THE HIPAA SECURITY RULE

There are a bunch of mitigating controls that need to be considered in dealing with personal health information.[23] These are considered the most basic forms of controls. Think of them as minimum standards only! We can give you only a taste of what’s expected, because it’s pretty detailed (although this summary covers a lot of ground):

Controlling access, disclosure, retention, and disposition of personal data

It should go without saying (but we’ll say it anyway) that only authorized staff should have access to data, and only when they need it to do their jobs. There should also be data sharing agreements in place with collaborators and subcontractors, and all of the above should have to sign nondisclosure or confidentiality agreements. Of course, data can’t be kept forever, so there should also be a data retention policy with limits on long-term use, and regular purging of data so it’s not sitting around waiting for a breach to occur. If any data is going to leave the US, there should also be enforceable data sharing agreements and policies in place to control disclosure to third parties.

Safeguarding personal data

It’s important to respond to complaints or incidents, and that all staff receive privacy, confidentiality, and security training. Sanctions are usually doled out to anyone that steps out of line with these policies and procedures, and there’s a protocol for privacy breaches that has been put to good use. Authentication measures must be in place with logs that can be used to investigate an incident. Data can be accessed remotely, but that access must be secure and logged. On the technical side, a regularly updated program needs to be in place to prevent malicious or mobile code from being run on servers, workstations and mobile devices, and data should be transmitted securely. It’s also necessary to have physical security in place to protect access to computers and files, with mandatory photo ID.

Ensuring accountability and transparency in the management of personal data

There needs to be someone senior accountable for the privacy, confidentiality, and security of data, and there needs to be a way to contact that person. Internal or external auditing and monitoring mechanisms also need to be in place.

T4: Public Data

The final attack we consider, T4, is when data is disclosed publicly. In that case we assume that there is an adversary who has background information that can be used to launch an attack on the data, and that the adversary will attempt a re-identification attack. We therefore only consider the probability of re-identification from the data set:

Attack T4

Pr(re-id), based on data set only

So, if we know who, specifically, is getting the data set, we de-identify our data set in the face of assumed attacks T1, T2, and T3; if we don’t know who is getting the data set, as in a public data release, we de-identify our data set assuming attack T4.

Measuring Re-Identification Risk

A recent text provided a detailed review of various metrics that can be used to measure re-identification risk.[16] We’ll focus on the key risk metrics that we use in our case studies, and explain how to interpret them.

Probability Metrics

When we release a data set, we can assign a probability of re-identification to every single record. To manage the risk of re-identification, however, we need to assign an overall probability value to the whole data set. This allows us to decide whether the whole data set has an acceptable risk. There are two general approaches that we can use: maximum risk and average risk.

With the maximum risk approach, we assign the overall risk to the record that has the highest probability of re-identification. This is, admittedly, a conservative approach. The key assumption is that the adversary is attempting to re-identify a single person in the data set, and that the adversary will look for the target record that has the highest probability of being re-identified. For instance, the adversary will try to re-identify the record that’s the most extreme outlier because that record is likely to be unique in the population. This assumption is valid under attack T4, where we assume a demonstration attack is attempted.

Attack T1 would be a breach of contract, so an adversary is not likely to want to promote or advertise the attempted re-identification. The most likely scenarios are that the adversary is trying to re-identify someone she knows (i.e., a specific target, maybe a friend or relative, or someone famous), or the adversary is trying to re-identify everyone in the data set.

If the adversary is trying to re-identify an acquaintance, any one of the records can be the target, so it’s fair to use a form of average risk. This risk is also called journalist or prosecutor risk, for obvious reasons. If the adversary is trying to re-identify everyone in the data set, it’s also a form of average risk, but it’s called marketer risk. Journalist or prosecutor risk is always greater than or equal to marketer risk, so we only need to focus on journalist or prosecutor risk to manage both scenarios.[16]

In the case of attack T2, a spontaneous recognition (“Holy smokes, that’s my neighbor!”), we follow the same logic. Since we’re talking about the re-identification of a single record, but every record is potentially at risk, we again consider the average risk. It’s a question of who, on average, might recognize someone they know in the data set.

Finally, for attack T3, we assume that anyone who gets their hands on a breached data set realizes that using it in any way whatsoever is probably illegal. It’s therefore pretty unlikely that they would launch a targeted demonstration attack. If they do anything besides report having found the lost data, they won’t want to advertise it. So again we can use average risk, because anyone or everyone might be at risk for re-identification.

For each attack, we estimate the same Pr(re-id | T) from the data itself, where T is any one of attempt, acquaintance, or breach. In other words, we assume that there is a deliberate attempt, or that there is an acquaintance, or that there is a breach, and treat each of these conditional probabilities as the same. What we’ve done is operationalize the original concept of Pr(re-id, attempt) so that it considers three different forms of “attempts.” And we use this to ensure that data sets are less than the re-identification risk threshold, regardless of the form of the attempt.

WHAT’S THE DATA WORTH?

If a malicious adversary gets his hands on a data set, what’s a single re-identified record worth (e.g., for identity theft)? Not much, apparently. Full identities have been estimated to be worth only $1–$15.[24] In one case where hackers broke into a database of prescription records, the ransom amount worked out to only $1.20 per patient.[25] If a data set is de-identified, it hardly seems worth the effort to re-identify a single record.

On the other hand, the data set as a whole may be of value if the adversary can hold it for ransom or sell it. The hackers mentioned above demanded $10 million from the Virginia Department of Health Professions. This database was intended to allow pharmacists and health care professionals to track prescription drug abuse, such as incidents of patients who go “doctor-shopping” to find more than one doctor to prescribe narcotics.

In a different case, hackers threatened to publicize the personal information of clients of Express Scripts,[26] a company that manages pharmacy benefits. The company was using production data containing personal information for software testing, and there was a breach on the testing side of the business. Unfortunately, using production data for testing is common. The initial extortion attempt was based on the breach of 75 records. It turned out later that 700,000 individuals might have been affected by the breach.

The problem with average risk is that it allows records that are unique to be released. For example, say we have a data set with a record about Fred, a 99-year-old man living in a particular ZIP code. If Fred is the only 99-year-old in that ZIP code, and this is captured in the data set, it’s pretty easy for anyone that knows Fred to re-identify his health records. Worse, assume he’s the only male over 90 in that ZIP code; then it’s even easier for someone to re-identify his records, perhaps from the voter registration list. It’s perfectly reasonable to assume that the average risk for that data set would be quite low because very few records in the data set stand out. So we can release the data set—but what about Fred?

We therefore need a more stringent definition of average risk; one that takes uniques into account. Let’s call it strict average risk. This is a two-step risk metric:

1. We need to make sure that the maximum risk is below a specific threshold, say 0.5 (i.e., at least two records match) or preferably 0.33 (i.e., at least three records match). Meeting this threshold means getting rid of unique records in the data set, like Fred, through de-identification.

2. Then we evaluate average risk. If the maximum risk in the first step is above the threshold of 0.5 or 0.33, that would be the risk level for the data set. Otherwise, if the maximum risk in the first step is below the threshold, the risk level for the data set is the regular average risk.

Adhering to strict average risk ensures that there are no unique individuals in the data set, and is consistent with best practices in the disclosure control community. It keeps the Freds safe!
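
Here’s a minimal sketch of maximum, average, and strict average risk computed from equivalence class sizes (equivalence classes and the 1/k probabilities are explained in Meeting Thresholds). The class sizes here are made up.

    import numpy as np

    def risk_summary(class_sizes, max_risk_threshold=0.33):
        sizes = np.asarray(class_sizes)
        record_risks = np.repeat(1.0 / sizes, sizes)   # 1/k_i for every record
        maximum = record_risks.max()                   # driven by the smallest class
        average = record_risks.mean()
        # Strict average risk: fall back to the maximum if any record is too risky.
        strict_average = average if maximum <= max_risk_threshold else maximum
        return maximum, average, strict_average

    print(risk_summary([1, 10, 20, 30]))   # a unique record (like Fred's) dominates
    print(risk_summary([5, 10, 20, 30]))   # no uniques: strict average = average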

Information Loss Metrics

When de-identifying a data set, we inevitably lose information (e.g., through generalization or suppression). Information loss metrics give us a concrete way of evaluating the distortion that has been applied to a data set. Ideal information loss metrics would evaluate the extent to which the analytics results change before and after de-identification. But that’s not generally practical because it requires precise knowledge of the analytics that will be run before the de-identification can even be applied. What we need are more general information loss metrics.

A lot of these metrics have been cooked up in the academic literature. We’ve found two that in practice give us the most useful information:

Entropy

A measure of uncertainty. In our context it reflects the amount of precision lost in the data and can account for changes due to generalization, suppression, and subsampling.[27] The greater the entropy, the more the information loss. Pretty straightforward. Even though entropy is a unitless measure, by definition, it can be converted to a percentage by defining the denominator as the maximum possible entropy for a data set. This makes changes in entropy, before and after de-identification, clearer.

Missingness

A measure of cells or records that are missing. Again straightforward, and an important measure in its own right. High amounts of missingness can reduce the statistical power in a data set, because some records may have to be dropped. And bias can be introduced if missing values are not random. If we take a simple example of a single quasi-identifier, rare and extreme values are more likely to be suppressed for the sake of de-identification. Therefore, by definition, the pattern of suppression will not be completely random.

These two metrics show us different perspectives on information loss. As we use it here, entropy is affected only by the generalizations that are applied, so it measures how much information was lost to generalization during de-identification. Missingness is affected only by suppression. Therefore, the combination of these two metrics gives us an overall picture of information loss.
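
The snippet below is an illustrative, much simplified way to express the two metrics as percentages; it is not the exact formulation from the literature. Generalization loss is counted in bits relative to the worst case of collapsing each field to a single category, and missingness is just the proportion of suppressed quasi-identifier cells.

    import numpy as np
    import pandas as pd

    def entropy_loss_pct(values_collapsed_per_cell, domain_size_per_cell):
        """Bits of precision lost to generalization, as a % of the maximum possible."""
        lost = np.sum(np.log2(values_collapsed_per_cell))
        max_loss = np.sum(np.log2(domain_size_per_cell))
        return 100.0 * lost / max_loss

    def cell_missingness_pct(df, quasi_identifiers):
        """Percentage of quasi-identifier cells that are missing (suppressed)."""
        return 100.0 * df[quasi_identifiers].isna().to_numpy().mean()

    # Example: 1,000 dates of birth generalized from full date to year. Roughly
    # 365 possible dates collapse into each year, out of a domain of about
    # 36,500 possible dates (a 100-year span).
    print(entropy_loss_pct([365] * 1000, [36500] * 1000))   # about 56%

    toy = pd.DataFrame({"sex": ["M", None, "F"], "age_group": [20, 25, None]})
    print(cell_missingness_pct(toy, ["sex", "age_group"]))  # about 33%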

THE IMPACT OF MISSINGNESS

A common way to deal with missing cells in data analysis is Complete Case Analysis (CCA), which simply ignores records with any missingness whatsoever, even in variables that are not included in an analysis. It’s the quick and dirty way to get rid of missingness. It’s also the default in many statistical packages, and is common practice in epidemiology. But it’s pretty well known that this approach to data analysis will result in a hefty loss of data. For example, with only 2% of the values missing at random in each of 10 variables, you would lose 18.3% of the observations on average. With five variables having 10% of their values missing at random, 41% of the observations would be lost on average.[28]
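
The quoted numbers follow directly from assuming values go missing independently and completely at random, as this quick check shows:

    # Expected fraction of records dropped by CCA under independent,
    # completely-at-random missingness.
    print(1 - (1 - 0.02) ** 10)   # ~0.183: 10 variables, 2% missing each
    print(1 - (1 - 0.10) ** 5)    # ~0.410: 5 variables, 10% missing each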

Another common way to deal with missing cells in a data set is Available Case Analysis (ACA), which uses records with complete values for the variables used in a particular analysis. It’s much more careful to include as much data as possible, unlike the brute force approach of CCA. For example, in constructing a correlation matrix, different records are used for each pair of variables depending on the availability of both values, but this can produce nonsense results.[28]

Both CCA and ACA are only fully justified under the strong assumption that missingness is completely at random.[29], [30] And suppression done for the sake of de-identification can’t be considered random. Of course, most (if not all) missingness in real data is not completely at random either—it’s a theoretical ideal. As a result, the amount of suppression, leading to increased missingness, is an important indicator of the amount of distortion caused to the data by the de-identification algorithm.

Our aim is to minimize missingness as much as possible. Of course, you can always collect more data to compensate for loss of statistical power, but that means time and money. When the type of bias is well understood, it’s possible to use model-based imputation to recover the missing values (estimating the missing values by, for example, averaging surrounding values or filling in common values). But it’s not always possible to create accurate models, and this also adds to the complexity of the analysis.

NOTE

We should clarify that missingness can have two interpretations. It could be the percentage of records (rows) that have had any suppression applied to them on any of the quasi-identifiers, or it could be the percentage of cells (row and column entries) in the quasi-identifiers that have had some suppression applied to them.

Let’s say we have a data set of 100 records and two quasi-identifiers. Three records have “natural” missingness, due to data collection errors, on the first quasi-identifier. These are records 3, 5, and 7. So we have 3 records out of 100 that have missing values, or 3% record missingness. Let’s say that de-identification adds some suppression to the second quasi-identifier in records 5, 10, and 18 (i.e., not the same quasi-identifier). In this case, only 2 new records had suppression applied to them due to de-identification (records 10 and 18), so we now have 5 records out of 100 with missing values, or 5% record missingness. Therefore, the information loss is 2% record missingness.

Now consider the second type of missingness, based on the percentage of suppressed cells that are for quasi-identifiers. Our previous example started with 3 missing cells out of 200 (100 cells per quasi-identifier), or 1.5% cell missingness. After de-identification, we had 6 missing cells out of 200, or 3% cell missingness. Therefore, the information loss is 1.5% cell missingness.
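
Here’s a minimal sketch that reproduces this worked example with pandas; the quasi-identifier names are placeholders.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"qi1": 1.0, "qi2": 1.0}, index=range(100))
    df.loc[[2, 4, 6], "qi1"] = np.nan          # records 3, 5, 7: natural missingness

    before_record = 100 * df.isna().any(axis=1).mean()   # 3.0% record missingness
    before_cell = 100 * df.isna().to_numpy().mean()      # 1.5% cell missingness

    df.loc[[4, 9, 17], "qi2"] = np.nan         # suppression hits records 5, 10, 18

    after_record = 100 * df.isna().any(axis=1).mean()    # 5.0%
    after_cell = 100 * df.isna().to_numpy().mean()       # 3.0%

    print(after_record - before_record)        # 2.0% record-missingness loss
    print(after_cell - before_cell)            # 1.5% cell-missingness loss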

Risk Thresholds

We’ve covered our methodology and how to measure risk, but we also need to discuss the practical aspects of choosing, and meeting, risk thresholds.

Choosing Thresholds

Measuring Risk Under Plausible Attacks presented four attacks with four equations. What’s the maximum acceptable probability of re-identification for the whole data set? It’s reasonable to set the same risk threshold for all four attacks, because a patient or regulator won’t care which type of attack reveals an identity.

WHAT IF A DATA SUBJECT SELF-REVEALS?

There’s a lot of interest in making data from academic and industry-sponsored clinical trials more generally available, including having the data made publicly available. But participants in a clinical trial can self-reveal that they’re part of such a trial, maybe through their posts to an online social network.

If a data subject self-reveals, that subject’s records have a higher probability of being re-identified. This applies whether we use average or maximum risk. Re-identification of a record may reveal additional information about the data subject, such as co-morbidities, other drugs she may be taking, or sensitive information about her functioning or mental well-being.

Because there’s a risk of at least one participant in a clinical trial self-revealing that she has participated, should we assume a higher risk for all data subjects in the database? In general, that would be a very conservative approach. There needs to be evidence that many participants in the trial are likely to self-reveal before it would be reasonable to assume an elevated risk of re-identification.

As a cautionary measure, however, it would be important to inform participants in the clinical trial that there is a plan to share trial data broadly, and explain the risks from self-revealing. If participants have been informed of these risks, then arguably it’s reasonable to leave the risk as is; if participants have not been informed of these risks, then it may be more prudent to assume an elevated risk.

There are many precedents going back multiple decades for what is an acceptable probability of releasing personal information.[16] The values from these precedents, shown in Figure 2-2, have not changed recently and remain in wide use. In fact, they are recommended by regulators and in court cases. However, all of these precedents are for maximum risk.

Figure 2-2. The different maximum risk thresholds that have been used in practice

As the precedents indicate, when data has been released publicly, the thresholds range between 0.09 and 0.05 for maximum risk. Therefore, it’s relatively straightforward to define a threshold for attack T4, where we assume there is the risk of a demonstration attack. If we choose the lowest threshold, we can set the condition that Pr(re-id) ≤ 0.05.

To decide which threshold to use within this 0.09 to 0.05 range, we can look at the sensitivity of the data and the consent mechanism that was in place when the data was originally collected—this is the invasion of privacy dimension that needs to be evaluated.[16] For example, if the data is highly sensitive, we choose a lower threshold within that range. On the other hand, if the patients gave their explicit consent to releasing the data publicly while understanding the risks, we can set a higher threshold within that range.

WHAT’S CONSIDERED SENSITIVE INFORMATION?

All health data is sensitive. But some types of health information are considered especially sensitive because their disclosure is stigmatized or can be particularly harmful to the individual. What are these particularly harmful diagnoses? The National Committee on Vital Health Statistics has summarized what it believes can be considered “sensitive” information.[31]

Under federal law:

§ Genetic information, including a disease or disorder in family members

§ Psychotherapy notes recorded by a mental health professional providing care

§ Substance abuse treatment records from almost any program

§ Cash payments where the patient asks that it not be shared with a health insurance company

Under (many) state laws:

§ HIV information, or information on any sexually transmitted diseases

§ Mental health information of any sort, including opinions formed or advice given regarding a patient’s mental or emotional condition

§ Information in the records of children and adolescents

For the sake of patient trust:

§ Mental health information, beyond what’s defined in state laws

§ Sexuality and reproductive health information

When an entire record may be deemed sensitive:

§ In cases of domestic violence or stalking

§ When dealing with public figures and celebrities, or victims of violent crimes

§ Records of adolescents, including childhood information

For data that’s not publicly released, we have argued that the overall risk threshold should also be in the range of 0.1 to 0.05.[16] Let’s take as an example a T1 attack. If the Pr(attempt) = 0.3, and the overall threshold was determined to be 0.1, then the threshold for Pr(re-id | attempt) = 0.1/0.3 = 0.33. This value is consistent with the precedents summarized in Figure 2-2. Choosing a threshold value in the 0.1 to 0.05 range is also a function of the invasion of privacy assessment.
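
In code, this back-calculation is just a division:

    overall_threshold = 0.1
    pr_attempt = 0.3
    print(overall_threshold / pr_attempt)   # ~0.33, the ceiling for Pr(re-id | attempt)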

By definition, average risk is equal to or lower than maximum risk. Arguably we should set thresholds for average risk that are equal to or lower than thresholds for maximum risk. On the other hand, attacks T1, T2, and T3 are on data sets that are not disclosed publicly, and, as we’ve discussed, they’re very unlikely to result in a demonstration attack. So the thresholds used should really be higher than for T4. Based on a previous analysis, it’s been recommended that the average risk thresholds vary between 0.1 and 0.05, depending on the results of an invasion of privacy score.[16] This balances the two competing requirements noted above. A summary is provided in Table 2-1.

Table 2-1. Attacks, risk metrics, and thresholds

Attack ID | Probability | Threshold range | Risk metric

T1 | Pr(re-id, attempt) | 0.1 to 0.05 | Average risk

T2 | Pr(re-id, acquaintance) | 0.1 to 0.05 | Average risk

T3 | Pr(re-id, breach) | 0.1 to 0.05 | Average risk

T4 | Pr(re-id) | 0.09 to 0.05 | Maximum risk

Meeting Thresholds

Once we’ve decided on a risk threshold, we then need a way to meet that threshold if the risk is found to be above it. We discussed what techniques to use to de-identify a data set in Step 4: De-Identifying the Data. At this point we’re primarily interested in generalization and suppression. But how do you go about using these techniques to meet the risk threshold? First we need a couple of definitions:

Equivalence class

All the records that have the same values on the quasi-identifiers. For example, all the records in a data set about 17-year-old males admitted on 2008/01/01[32] are an equivalence class.

Equivalence class size

The number of records in an equivalence class. Equivalence class sizes potentially change during de-identification. For example, there may be three records for 17-year-old males admitted on 2008/01/01. When the age is recoded to a five-year interval, then there may be eight records for males between 16 and 20 years old admitted on 2008/01/01.

k-anonymity

The most common criterion to protect against re-identification. This states that the size of each equivalence class in the data set is at least k.[33] Many k-anonymity algorithms use generalization and suppression.

A simple example of de-identification using k-anonymity is illustrated in Figure 2-3. Here the objective was to achieve 3-anonymity through generalization and suppression. Each record pertains to a different patient, and there may be additional variables not shown because they’re not quasi-identifiers. The de-identified data set in panel (b) has two equivalence classes (records 1, 2, 3, and 5, and records 7, 8, and 9), and three records were suppressed. We assume record suppression for simplicity, but in practice cell suppression could be used.

Figure 2-3. The original data set (a) de-identified into a 3-anonymous data set (b)

The basic attack that k-anonymity protects against assumes that the adversary has background information about a specific patient in the data set, and the adversary is trying to figure out which record belongs to that patient. The adversary might have the background information because he knows the patient personally (e.g., the patient is a neighbor, co-worker, or ex-spouse), or because the patient is famous and the information is public. But if there are at least k patients in the de-identified data set with the same values on their quasi-identifiers, the adversary has at most a 1/k chance of matching correctly to that patient. Now if we let ki be the size of equivalence class i, the probability of re-identifying a record in that class is 1/ki. The maximum risk is therefore 1/min(ki) across all equivalence classes in the data set, and the average risk is the average of 1/ki taken over all records (which works out to the number of equivalence classes divided by the number of records).
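
Here’s a minimal sketch of these calculations with pandas, on a toy data set whose values are made up. It computes the equivalence class sizes, checks 3-anonymity, and derives the maximum and average risk just described.

    import pandas as pd

    df = pd.DataFrame({
        "sex": ["M", "M", "M", "F", "F", "F", "F"],
        "age_group": ["15-19", "15-19", "15-19", "30-34", "30-34", "30-34", "30-34"],
    })
    quasi_identifiers = ["sex", "age_group"]

    class_sizes = df.groupby(quasi_identifiers).size()
    k = class_sizes.min()
    print(k, "-> satisfies 3-anonymity:", k >= 3)

    max_risk = 1.0 / class_sizes.min()           # the smallest class dominates
    avg_risk = len(class_sizes) / len(df)        # equals the mean of 1/k_i over records
    print(max_risk, avg_risk)                    # 0.33 and 0.29 for this toy data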

Risky Business

In many jurisdictions, demonstrating that a data set has a very small risk of re-identification is a legal or regulatory requirement. Our methodology provides a basis for meeting these requirements in a defensible, evidence-based way. But you need to follow all of the steps!

De-identification is no mean feat, and it requires a lot of forethought. The steps discussed in this chapter were formulated based on best practices that have evolved from a lot of academic and practical work. What we’ve done is break them down into digestible pieces so that you have a process to follow, leading to good data sharing practices of your own.

But the story doesn’t end here, and there are plenty of details you can dig into.[16] We’ve only presented the highlights! We’ll work through many examples throughout the book that shed light on the practical application of de-identification to real data sets, and we’ll grow our repertoire of tools to deal with complex data sets and scenarios.


[16] K. El Emam, A Guide to the De-identification of Personal Health Information (Boca Raton, FL: CRC Press/Auerbach, 2013).

[17] K. El Emam, A. Brown, and P. AbdelMalik, “Evaluating Predictors of Geographic Area Population Size Cut-Offs to Manage Re-Identification Risk,” Journal of the American Medical Informatics Association 16 (2009): 256–266.

[18] K. El Emam and P. Kosseim, “Privacy Interests in Prescription Data Part 2: Patient Privacy,” IEEE Security and Privacy 7:7 (2009): 75–78.

[19] K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, “A Systematic Review of Re-identification Attacks on Health Data,” PLoS ONE 6:12 (2011): e28071.

[20] Y. Erlich, “Breaking Good: A Short Ethical Manifesto for the Privacy Researcher,” Bill of Health, 23 May 2013.

[21] F. Dankar and K. El Emam, “Practicing Differential Privacy in Health Care: A Review,” Transactions on Data Privacy 6:1 (2013): 35–67.

[22] The conditional probability Pr(A | B) is read as the probability of A given B, or the probability that A will occur on the condition that B occurs.

[23] K.M. Stine, M.A. Scholl, P. Bowen, J. Hash, C.D. Smith, D. Steinberg, and L.A. Johnson, “An Introductory Resource Guide for Implementing the Health Insurance Portability and Accountability Act (HIPAA) Security Rule,” NIST Special Publication 800-66 Revision 1, October 2008.

[24] Symantec, Symantec Global Internet Threat Report—Trends for July-December 07 (Symantec Enterprise Security, 2008).

[25] B. Krebs, “Hackers Break into Virginia Health Professions Database, Demand Ransom,” Washington Post, 4 May 2009.

[26] S. Rubenstein, “Express Scripts Data Breach Leads to Extortion Attempt,” The Wall Street Journal, 7 November 2008.

[27] K. El Emam, F. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J.-P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, and J. Bottomley, “A Globally Optimal k-Anonymity Method for the De-identification of Health Data,” Journal of the American Medical Informatics Association 16:5 (2009): 670–682.

[28] J. Kim and J. Curry, “The Treatment of Missing Data in Multivariate Analysis,” Sociological Methods & Research 6 (1977): 215–240.

[29] R. Little and D. Rubin, Statistical Analysis with Missing Data (New York: John Wiley & Sons, 1987).

[30] W. Vach and M. Blettner, “Biased Estimation of the Odds Ratio in Case-Control Studies Due to the Use of Ad Hoc Methods of Correcting for Missing Values for Confounding Variables,” American Journal of Epidemiology 134:8 (1991): 895–907.

[31] National Committee on Vital Health Statistics, letter to Kathleen Sebelius, Secretary to Department of Health and Human Services, 10 November 2010, “Re: Recommendations Regarding Sensitive Health Information.”

[32] We have adopted the ISO 8601 standard format for dates, yyyy/mm/dd, throughout the book.

[33] P. Samarati and L. Sweeney, “Protecting Privacy When Disclosing Information: k-Anonymity and Its Enforcement Through Generalisation and Suppression,” Technical Report SRI-CSL-98-04 (Menlo Park, CA: SRI International, 1998).