Dates, Long Tails, and Correlation: Insurance Claims Data - Anonymizing Health Data: Case Studies and Methods to Get You Started (2013)

Anonymizing Health Data: Case Studies and Methods to Get You Started (2013)

Chapter 5. Dates, Long Tails, and Correlation: Insurance Claims Data

Insurance claims data represents a treasure trove of health information for all manner of analytics and research. In it you can find diagnostic and treatment information, details on providers used, and a great deal regarding finance and billing. The data is very precise, due to the need for precise accounting of medical charges and reimbursements, but it’s primarily administrative data, which can pose all kinds of data cleaning and formatting problems. More importantly for our purposes, however, is that it can also present some unique challenges to de-identification that you wouldn’t necessarily see in data collected primarily for research.

The Heritage Provider Network (HPN) presented us with a unique challenge: to de-identify a large claims data set, with up to three-year longitudinal patient profiles, for an essentially public data release. This work motivated a lot of the methods we present here. The methods are useful whenever you deal with longitudinal data, but especially when there are a large number of observations per patient.

The Heritage Health Prize

The Heritage Provider Network is a provider of health care services in California. They initiated the Heritage Health Prize (HHP) competition to develop a predictive algorithm that can “identify patients who will be admitted to a hospital within the next year using historical claims data.”[45]The data provided to competitors consisted of up to three years’ worth of data for some 113,000 patients, comprising a total 2,668,990 claims. The competition ran from 2011-04 to 2013-04, with a potential reward of $3,000,000 for the best algorithm. We’ll use data from the HHP, before de-identification and subsampling, to describe the methods presented in this chapter.

Date Generalization

Insurance claims will undoubtedly have dates: the date of service, when drugs were dispensed, or when specimens were collected. Knowing a sequence of dates gives an adversary a way to re-identify individuals. For example, if a relative or neighbor knows that you had surgery on Tuesday, a claim for anesthesia on Tuesday would definitely be consistent with your surgery. Multiple surgery-related charges on Tuesday would further strengthen the evidence that this was your health record.

This might seem like a stretch—obviously a lot of people would have had surgery that day—but it’s an important piece of information that has to be included in the risk assessment. The more quasi-identifiers you have in a data set, the more unique patients become. And keep in mind that a pseudonym is used to link patient information across claims in a de-identified data set, so these sequences of dates are not lost.

It’s also reasonable to assume that acquaintances, at the very least, will know the precise (or general) date for health care services. The example of a neighbor is a good one, but so is a co-worker or a friend. And people talk, so word gets around, especially if the treatment is for something serious (like the treatment for a major illness, or a major surgery). It’s even worse for famous people, as the information will be spread by the media and stored in print and online.

Randomizing Dates Independently of One Another

A simple way to de-identify date information is to add noise to each date (essentially shifting each date independently of one another). If the noise added is from a Gaussian or Laplace distribution, it leaves open opportunities for attacking the data using modeling techniques that can average out the noise. It’s therefore better to generate the noise from a uniform distribution. We could, for example, randomly shift each date by up to 15 days in either direction (earlier or later).

Consider a situation where a patient named Bob has a unique pattern of dates that results in a high risk of re-identification. Maybe Bob was seeking care during a prolonged length of time, making him stand out. Randomizing dates independently of one another, adding or subtracting up to 15 days as in Table 5-1, makes the dates of his claims fall into the same range of uncertainty as those of every other patient in the data.

Table 5-1. Date sequence for Bob’s medical claims, randomized by +/– 15 days

Original sequence

Randomized

2001/04/08

2001/04/23

2002/05/07

2002/05/18

2002/08/12

2002/08/06

2003/07/27

2003/07/29

2003/08/11

2003/07/28

Adding noise to dates could, however, distort the order of claims by date, as was the case for the last two dates in Table 5-1. We could end up with a claim for post-op recovery charges being dated before the surgery itself, which wouldn’t make much sense (or might make you think the health care provider is doing some seriously funky stuff). Another example would be admission and discharge dates. Randomizing these dates independently of one another could result in a patient discharge before admission—although that may sound like good medicine, it doesn’t make much sense!

NOTE

Generally speaking, de-identification should produce data that still makes sense, or analysts won’t trust it. Of course, they’ll probably be used to seeing “bad data,” with all sorts of formatting and data problems, but de-identification shouldn’t make this worse. Cleaning data is not the purview of de-identification, but maintaining its integrity should be.

What maintaining the integrity of a data set really means depends on the type of analysis that’s needed. In some cases the exact order of events won’t matter, just the general trend—maybe monthly, or seasonal, or even yearly for summary statistics and frequency plots. If a health care facility wants to run some statistics on admissions for the past year, randomizing dates to within a month won’t matter, even if some discharges come before admissions.

Shifting the Sequence, Ignoring the Intervals

Another approach that might seem to make sense—in order to preserve the order of claims by date—is to shift the whole sequence of dates for a patient so that the order of claims is preserved. In Table 5-2 all dates for Bob’s health care claims have been shifted by –15 days. We still used a random number between –15 and 15, from a uniform distribution, but this time the same value was applied to all dates in order to preserve their order.

Table 5-2. Date sequence for Bob’s medical claims, shifted

Original sequence

Shifted sequence

2001/04/08

2001/03/24

2002/05/07

2002/04/22

2002/08/12

2002/07/28

2003/07/27

2003/07/12

2003/08/11

2003/07/27

The problem with this approach is that the intervals between dates are preserved exactly. So even if you don’t recognize the dates, you could look at the intervals and determine when they match Bob’s health claims. If the dates can be known, so can the intervals between them (an easy and direct inference). The sequence of intervals may even be easier to recognize than the dates for treatments that have a known pattern, like dialysis treatments or chemotherapy. We didn’t have to worry about this when we randomized dates because when the end points are unknown, so are the intervals between them.

What if there are unusually large intervals between claims? In Table 5-2, the interval between Bob’s first two claims is maintained at 394 days. This could stand out among other records in the data set. Bob could be the only 55-year-old male with an interval of more than 365 days in the database. Of course, we have to treat dates as longitudinal, so the intervals would be considered using a longitudinal risk assessment. But the point is that the intervals themselves pose a re-identification risk, and must therefore be managed somehow. So shifting a sequence of dates is not an approach we recommend.

Generalizing Intervals to Maintain Order

A better approach that can be used to protect claims while maintaining their order by date is to generalize the intervals between them. To provide a data set with dates, instead of intervals, we can use the first (randomized) date in the claim sequence as an anchor and add a random value from the generalized intervals to all subsequent claims.

Let’s go through this process step by step, again using Bob’s sequence of dates. In Table 5-3 we converted the original sequence to intervals between claims, with the first date kept as the anchor.

Table 5-3. Date sequence for Bob’s medical claims converted to intervals

Original sequence

Converted sequence

2001/04/08

2001/04/08

2002/05/07

394 days

2002/08/12

97 days

2003/07/27

349 days

2003/08/11

15 days

Next we generalized the converted sequence. Following our previous examples, we generalized the first date, our anchor, to month, as shown in Table 5-4. But many analysts don’t want to deal with generalized dates, so we then randomized the intervals, as we did for dates (i.e., selecting an integer at random from the generalized interval). Since the anchor is generalized to month, all subsequent dates will also be randomized to within a month. But we needed to do something with the intervals, or else we’d have the same problems as before. So, we generalized them to within a week.

Table 5-4. Date sequence for Bob’s medical claims as de-identified intervals

Converted sequence

Generalized sequence

Randomized

2001/04/08

2001/04

2001/04/23

394 days

[393-399]

398 days

97 days

[92-98]

96 days

349 days

[344-350]

345 days

15 days

[15-21]

20 days

Finally, we converted the intervals back to dates by adding each randomized interval to the previous date, with the generalized anchor date providing the starting point, as shown in Table 5-5. Because we added the intervals between dates to the previous dates in the sequence, we guaranteed that the order of claims was maintained, as well as the approximate distances between them in time.

Table 5-5. Date sequence for Bob’s medical claims de-identified

Original sequence

Intermediate step

De-identified sequence

2001/04/08

2001/04/23

2001/04/23

2002/05/07

2001/04/23 + 398 days

2002/05/26

2002/08/12

2002/05/26 + 96 days

2002/08/30

2003/07/27

2002/08/30 + 345 days

2003/08/10

2003/08/11

2003/08/10 + 20 days

2003/08/30

If admission and discharge dates were included in the claims data, they would be included as quasi-identifiers for each visit. We could calculate length of stay and time since last service, then apply the same approach just shown to de-identify these values.

Rather than generalizing all the date intervals to the same level (global recoding), as we’ve described, we could instead generalize intervals specifically based on need for each record (local recoding). For example, for patients in large equivalence classes with much lower risk of re-identification, perhaps the date intervals could be five days long instead of the seven days we used. Normally when generalizing information this is bad, because you end up with inconsistent formats between records (e.g., one record with interval “[1-5],” and another with “[1-7]”). But in this case we’re only using the generalization to determine the interval from which we’ll draw a random number, so in the end the dates have the same format (day, month, year). The point would be to decrease information loss, at the expense of some computational overhead.

NOTE

The generalization for the anchor date and intervals should be given to data analysts. They may need this information to choose models and interpret results. As a simple example, if the date were generalized to weeks, the analysts would know that their results would be accurate to within a week at best. If some form of local recoding were used to de-identify the data set, then they would need to know the equivalent generalization under global recoding (or the max generalization used under local recoding, which would be the same).

Dates and Intervals and Back Again

We don’t want to lose sight of the fact that the intervals between dates were derived from, well, dates. If we drop the anchor, or reorder the intervals before generalizing, then we’ve lost the original dates. That would be bad for de-identification, because dates of service are quasi-identifiers, plain and simple. We need this information to do a proper risk assessment. Not only that, but there’s information contained in the pattern of dates you don’t want to lose. The sequence of intervals (4 months, 3 months, 2 months, 1 month) might represent a patient that’s getting worse, whereas the sequence of intervals (1 month, 2 months, 3 months, 4 months) might represent a patient that’s getting better. The point here is that the anchor date and the order of the intervals need to be preserved during the risk assessment, and during de-identification.

When we sample random interval values from generalized intervals, note that we aren’t using the original dates (or intervals) any longer. You have two dates in the same generalized interval? Then sample two random values in the generalized interval! It doesn’t matter what values are returned (if they’re ordered or unordered), because they’re added sequentially so that the order of events is preserved. This is very different from the scenario in which we randomized the original dates by +/–15 days.

Now you might worry that generalizing intervals (e.g., going from days to seven-day intervals), then adding it all up at the end to produce a new sequence of dates, might skew the sequence to be much longer than the original. But because we’re drawing the sample from a uniform distribution, we expect the mean to be the center of the interval’s range. It’s reasonable to assume that the dates are randomly distributed among the intervals—i.e., they’re drawn from a statistical distribution. Therefore, when we add up the randomized intervals, the last interval of the sequence is expected to be within the final generalized interval for the patient.

Now, if there’s bias in the dates (e.g., they’re always at the start of the month), it’s possible that when the dates are added up the final dates will get pushed outside this range. For example, generalizing to seven-day intervals, on average we expect that the sequence (1, 8, 15) would be randomized to (4, 11, 19). The original intervals sum up to 24 days whereas the randomized intervals sum up to 34 days, which means the final date would be off by 10 days. No big deal if your anchor date was generalized to month, but do that for 20 visits and the bias grows significantly.

This raises another important point—what about consecutive days in hospital? Intervals of one day, because of an overnight stay, can’t be changed to intervals greater than one day. Not only would this greatly bias the final date if the patient has had many days in hospital, but it breaks the cardinal rule of not producing data that doesn’t make sense! Thus, we would keep consecutive visits together, and randomize the intervals between visits starting from two days and onwards.

A Different Anchor

When we have records for infants or children in a data set, date generalization can produce dates of service that are before the date of birth. Say an infant is born on 2012/01/15, and generalized to within a year of that date in the process of de-identification. Maybe the new anonymized date of birth is now 2012/07/23. Now say the infant’s first date of service in the data set is 2012/01/18. If the anchor date were generalized to month, the first date of service would end up being before the date of birth. This would of course be very confusing to anyone analyzing the data.

What we need to do, then, is use the date of birth as the anchor and compute the first interval from the date of birth rather than from the first date of service for the patient. That way the order is preserved with respect to the date of birth.

A similar situation arises with the date of death. We don’t want to have a date of service that occurs after the date of death. Where there’s a date of death, we would therefore add that at the end of the sequence so that it’s calculated based on the interval from the last date of service. This way we can ensure order consistency for start and end of life anchors.

Other Quasi-Identifiers

Let’s not forget that there are other quasi-identifiers to deal with. We mentioned this at the outset, but it bears repeating. Randomizing dates or intervals between dates won’t hide the exact diseases or surgeries reported. Say Bob is the only person in the data set that has had appendicitis in the first half of 2001. Shifting dates by a month or two won’t change this. So we need to include dates as part of the risk assessment, even though we’ve simplified our discussion by ignoring other quasi-identifiers.

As with any quasi-identifier, the generalization of dates needs to be done in the context of other quasi-identifiers, and with a well-defined generalization hierarchy (e.g., days, weeks, months). But dates are longitudinal, and managing the risk from the sequence, or even the set, of dates for a patient can be challenging. If we increase the level of generalization we may seriously deteriorate the quality of the data. Although we did point out that it depends on what analyses will be done with the data, at some point the level of generalization will cross the border into Crazy Town. It’s the trade-off between re-identification risk and data quality.

Consider a situation where Bob has a unique pattern of dates that results in a high risk of re-identification. During the period when Bob was seeking care, no one else in his age group and ZIP code was also seeking care. Sure, there were people going to see their health care providers now and again, but Bob had persistent problems requiring medical treatment. This makes him stand out, because dates are longitudinal quasi-identifiers. The only way to reduce the risk of re-identification for Bob would be to increase the level of generalization of the quasi-identifiers in the data set, or suppress dates.

Connected Dates

The concept of connected fields is also important when we deal with dates. Sometimes we see dates in a claim that are so highly correlated that we need to treat them as connected variables. For example, there may be a date of service and a log date that indicates when the information was entered into the electronic medical record. The log date will almost always be on the same day as the day of service.

We can’t ignore such connections because it’s easy to infer the correct date of service from the log date. So it’s necessary to shift the log date by the same amount as the date of service. This ensures data consistency and eliminates the possibility of an inappropriate inference.

Another example, to drive the point home, is the start date of a treatment plan, and the dates the treatments are administered. These would be two connected quasi-identifiers. And how about a log date of when information was entered into the system, and a log date of any modifications to the record? These would be two connected dates, but not quasi-identifiers. However, they would need to be connected to the quasi-identifier dates just described! Yikes, this is getting complicated—but it’s the reality of working with health data, which can be as complicated as you can imagine.

Long Tails

Now let’s move on to a property of claims data that can pose a unique challenge to the de-identification of a data set. We know that claims data is longitudinal, but most patients will only have a handful of health care visits. Those that have severe or chronic conditions will undoubtedly have more than a handful of claims, because they’ll have more visits, interventions, or treatments. But then there will be some that have even more claims, because they are extreme cases.

It’s important to understand that every procedure and drug, even medical supplies and transportation, has an associated charge. Insurance data tracks mundane things like giving the patient an aspirin, or anesthesia, or an ambulance ride. During a single visit, there may be tens of charges, and each will be captured in a row of data.

The result of this wide range of patients, and their resulting claims, is well demonstrated in Figure 5-1. This is the long tail of claims data, driven by the very sick. All together, there were 5,426,238 claims in this data set, for 145,650 patients. The vast majority of patients had less than 10 claims—but some had more than 1,300 claims.

The long tail of claims data for HPN

Figure 5-1. The long tail of claims data for HPN

The Risk from Long Tails

An adversary might know the number of claims that a patient has had, or at least the range. It’s the patients in the long tail of the distribution who are most at risk. All an adversary needs to know is that they have some lower bound in the number of claims—if that lower bound is high enough, the number of patients in that category will be quite low. Say an adversary knows someone has been through a lot of procedures during multiple hospital stays, or one very long hospital visit. The adversary doesn’t need to know the exact number of procedures, but may be able to accurately guess it was over 1,000. Combining that information with basic demographics results in only a handful of patients per equivalence class.

To protect against such attacks, we could cut the tail of that distribution at, say, the 95th or 99th percentile. At the 95th percentile, patients with more than 139 claims would have claims truncated (those above 139), resulting in the removal of 11% of claims; at the 99th percentile, patients with more than 266 claims would have claims truncated (those above 266), resulting in the removal of 2.8% of claims. That’s a lot of claims to remove from the data set for an ad hoc method that hasn’t factored risk into the approach.

Ideally the truncation method we use should be a risk-based approach and minimize the number of claims that need to be removed. That means that the number of claims needs to be treated as a quasi-identifier, and we need a way to minimize truncation to meet a predefined risk threshold.

Threat Modeling

We’ll use a maximum risk of 0.1 to demonstrate truncation, since the data set from the Heritage Health Prize is essentially public, with some important constraints, and a low risk of invasion of privacy. All patients with sensitive diagnoses—e.g., mental health disorders or sexually transmitted diseases—were removed with exact definitions documented online.[46] The fields we used here, listed in Table 5-6, are only a subset of the full data set. The main driver of the claims data is the Current Procedural Terminology (CPT)[47] code, which captures all medical, surgical, and diagnostic services.[48]

Table 5-6. Quasi-identifiers requested by an analyst

Level

Field

Description

1

Age

Patient’s age in years

1

Sex

Patient’s sex

2

CPTCode

CPT code for the claim

2

Date

Date of service for the claim

Number of Claims to Truncate

An adversary is very unlikely to know the exact number of claims that a patient has in a data set. For the HHP data set we assumed that an adversary would know the number of claims to within a range of 5. So, if a patient had exactly 14 claims, the adversary would only know that it was somewhere in the range of 11 to 15, but not that it was exactly 14.

This assumption is quite conservative—in practice we could use a range of 10 claims. We could also add a cutoff at 500 claims, so that the adversary would just know that the patient had more than 500 claims, but not the exact number. For a public data release we might prefer to be conservative, so we’ll carry on with the assumption of accuracy to within five claims.

We divide the number of claims into a series of bins. Each bin is a discrete interval of the number of claims a patient can have, with a range of five claims each. We sort these in increasing order and count the number of patients in each bin. The result is a frequency table, shown in Table 5-7. This is just an example based on a subset of the HPN data—the original HPN data is simply too large for demonstration purposes.

Table 5-7. Frequency table for the number of claims

Number of claims

Number of patients

[1-5]

100

[6-10]

50

[11-15]

40

[16-20]

30

[21-25]

7

[26-30]

4

[31-35]

11

With a maximum risk threshold of 0.1, we don’t want the number of patients in a bin (determined by the number of claims) to be less than 10. In other words, we’re treating the number of claims as a quasi-identifier, and we’ve generalized the number of claims so that they’re only considered within a range of five claims.

The truncation algorithm works backward, from the last bin, with the highest number of claims ([31-35]), to the first bin, with the lowest number of claims ([1-5]). We look at the number of patients in a bin, and if it doesn’t meet our risk threshold, we move those patients up to the next bin by removing some of their claims. Let’s walk through an example.

There are 11 patients with [31-35] claims. This number is larger than the minimum bin size for a maximum risk threshold of 0.1, so we leave it as is. Next, there are four patients with [26-30] claims. Here we have a problem, so we “move” these patients to the next bin upwards, the [21-25] bin. We do this by removing claims for these four patients so that they have between 21 and 25 claims. The exact number of claims they’ll end up with will be randomly chosen from the range [21-25].

By doing this, we’ve added four new patients to the [21-25] bin. After that movement, there are now 11 patients in the [21-25] bin, which makes it large enough for the 0.1 threshold. The remaining bins are also large enough, so we’re done truncating claims.

This binning approach allowed us to minimize the amount of truncation to only 0.06% of claims in the HHP data set. Contrast that with the 11% and 2.6% of claims that were truncated using 95th and 99th percentiles, respectively. Not to mention that we hadn’t treated the number of claims as a quasi-identifier in the percentile method!

Which Claims to Truncate

So how do we decide which claims to truncate? To reduce the risk of re-identification, we want to remove claims with the rarest values on the quasi-identifiers. For the HHP data set, the level 2 quasi-identifiers we’re concerned with are the date and CPT code. If a person has a claim with a rare procedure, that claim is a good candidate for truncation because the rare procedure makes that claim, and therefore that person, stand out.

The number of people with a particular quasi-identifier value is called its support. The lower the support, the more that quasi-identifier value stands out, and the higher the risk of re-identification is for the people with that value. We could sort claims by their support for CPT codes, for example, and truncate those with the lowest support to reach our desired level of truncation.

A rare CPT code could, however, be coupled with a very common disease. That creates a conflict, because now we have to decide whether a claim is a good candidate for truncation or not. One way to resolve this conflict is to consider the variability in the support for the set of quasi-identifiers in a particular claim. If a claim has a low mean support but some quasi-identifiers with relatively high support, the claim has high variability. But if a claim has a low mean support and all its quasi-identifiers have relatively low support, the claim has low variability. Obviously, the low-variability claims should be truncated before the high-variability ones.

To capture these factors, we define a score based on the mean support, adjusted by the variability in support. The point of the score is to capture events that are the most rare among the claims, and therefore single out the best candidates for truncation. The basic calculation is pretty straightforward: first we compute the support for each quasi-identifier in a claim, and then we compute the mean (μ) and standard deviation (σ) of the support values. From this we derive an equation that linearizes this relationship, from a mean support of 1 to the max support. Two examples are shown in Figure 5-2: low variability and high variability. Next, we order claims by score and truncate those with the lowest scores first.

NOTE

Because it’s a bit messy, we’ve included the equation we use in Figure 5-2. Although this isn’t a book of equations, we felt that this time it was warranted. Please forgive us.

In order to get a score that ranges from 0 to 1 for all events, we need to scale our linear equation with the max standard deviation (σmax) and max support (supmax). The max support could be the max number of patients in the data set (a quick and dirty solution), or the max support across all quasi-identifiers for all of a patient’s claims.

Balancing support with variability to find claims to truncate

Figure 5-2. Balancing support with variability to find claims to truncate

With this scheme we can minimize the amount of truncation, and truncate only the claims that are most identifying. In practice, truncation is effective for claims data or other transactional data where the number of claims or transactions per patient is large. A plot of the number of claims or transactions in a data set that shows a long tail is a clear indication that truncation would be appropriate.

Correlation of Related Items

One major issue to consider when de-identifying detailed medical information is the correlation among variables. De-identifying a data set may not be effective if other quasi-identifiers can be inferred from the data. For example, let’s say that we’re disclosing only basic demographics and prescription drug information. There’s no diagnosis information in the data set at all. Without the diagnosis information, we may believe that the risk of re-identification is very small. But can an adversary use the demographics and drug information to predict the missing diagnosis information? If that’s the case, the predicted diagnosis information could potentially increase the re-identification risk significantly.

Unfortunately, the answer turns out to be a qualified yes—an adversary could predict missing information and compromise the intended de-identification. An adversary could ask a health care professional, such as a pharmacist, to determine the missing diagnosis information. It’s a pharmacist’s job to know which drugs are prescribed for which diseases, after all. Or an adversary could use a second data set that is not de-identified to build a predictive model that can anticipate the missing diagnosis information. In the case of an insurance claims database, this second data set could, for example, belong to a competitor.

Expert Opinions

For HHP, we needed to determine whether or not to disclose drug information on top of the claim data already provided. We were concerned, however, with inferences of diagnosis above and beyond what was being provided for the competition data set. Because this was a public data release, the detailed diagnosis codes (of which there are literally thousands, based on the International Classification of Diseases, or ICD[49]) were generalized into 45 primary condition groups, which have been determined to be good predictors of mortality.[50] Providing drug information would have risked breaking the generalization on which our risk assessment was based!

To illustrate this, we conducted a simple experiment with 31 Ontario pharmacists. Pharmacists are trained to determine drugs for specific diagnoses and, to a certain extent, the reverse as well. So they’re in a good position to provide extra information about, or predict, diagnoses from drug profiles. We gave each pharmacist longitudinal data on 20 patients, with their basic demographics and generic drugs that they took. They were asked to predict each patient’s primary condition group, ICD chapter, and three-digit ICD code (i.e., increasing the level of detail at each step). Their success rates are shown in Table 5-8.

Table 5-8. Accuracy of pharmacists in predicting diseases from drugs

Primary condition

ICD chapter

Three-digit ICD

Accuracy

0.622

0.381

0.323

The pharmacists did a decent job at predicting the primary condition, but less so when they had to predict the ICD chapter and three-digit ICD-9 code. However, if the intention was not to provide any kind of diagnosis information, their ability to predict the primary condition could still increase the re-identification risk of the patients above the assumed levels, and above the threshold used. Of course, you don’t have to go to this extent and conduct your own experiments when de-identifying data sets. But you should ask experts their opinions about what information can be inferred from the data you intend to release.

Predictive Models

Another approach we’ve used is to build predictive models to test what information can be reasonably inferred from a de-identified data set. This can identify sources of risk, but only if you believe an adversary can have access to a data set similar to yours, prior to de-identification.

For HHP, we used machine learning models to predict detailed diagnoses (based on ICD codes) using the other information in the claims data set, including the primary condition groups. We divided the HPN data set into 40 different subsets, each representing a primary condition group. In total, we used 1,458,420 claims, with an average of 36,450 claims per subset. They had attributes that described a patient’s personal profile, including hospitalization, laboratory tests, and prescribed drug codes, which were used to predict the ICD-9 code of the patient.

For each subset we computed the proportion (percentage) of claims that were correctly classified into the three-digit ICD-9 code by the machine learning algorithm. For comparison purposes, we used as the baseline the assignment of the most frequent diagnosis code. That way we could evaluate how much better the machine learning model was compared to a simple guess based on a majority rule. The results are shown in Table 5-9. For a simple machine learning algorithm, the Naive Bayes classifier, predictions were better than the baseline for 32 of the 40 primary condition groups.

Table 5-9. Accuracy of machine learning models in predicting diseases from HHP data

Accuracy

Three-digit ICD

Five-digit ICD

Baseline

0.428

0.273

Machine learning

0.654

0.506

Implications for De-Identifying Data Sets

The results of these analyses, both the expert predictions and the learning models, had an impact on the level of detail of the drug data that was disclosed as part of HHP. Since many organizations get patient claims data, it seemed plausible that an adversary could seek the help of an expert or build an accurate model to predict more detailed diagnosis information than was included in the competition data set.

The bottom line is that you must be careful of correlated information in medical data. There are plausible inference channels from one part of the data to another. This means that if you generalize or suppress one part of the record, you need to verify that other information in the record cannot be used to impute or provide details that undo the de-identification. You can do this by running modeling experiments similar to the ones described here.

Final Thoughts

There’s a lot of information in claims data, and a lot that needs to be done for de-identification. The data will contain dates, and those dates will be longitudinal. Analysts usually want to work with the day, month, and year of real dates. Generalizing dates and providing a random date from the generalization works, but may mix up the order of claims. Converting to intervals between dates, while maintaining an anchor and their order, provides a simple yet effective way to de-identify dates while preserving the order of events.

A bigger challenge with claims data is the enormous number of claims some patients may have—our so-called “long tail.” Treating the number of claims as a quasi-identifier, with a predefined level of precision, allows you to incorporate this risk into the de-identification process. Truncating claims allows you to manage this risk, and our support-based approach allows the number of truncated claims to be minimized so as little information is lost as possible.

Probably the biggest unknown, however, is the risk from correlation. For this we need to talk to medical experts to determine what can be learned from the data we want to disclose. We simply can’t assume that withholding data elements will prevent their disclosure. If drugs are going to be included in a released data set, you might want to include diseases in the risk assessment. And if you’re going to provide detailed information for some data elements, think hard about whether or not it can be used to infer beyond the generalizations provided elsewhere. If you can, build your own predictive models to test your assumptions!


[45] Heritage Provider Network Health Prize

[46] K. El Emam, L. Arbuckle, G. Koru, B. Eze, L. Gaudette, E. Neri, S. Rose, J. Howard, and J. Gluck, “De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset,” Journal of Medical Internet Research 14:1 (2012): e33.

[47] CPT is a registered trademark of the American Medical Association.

[48] American Medical Association, CPT 2006 Professional Edition (CPT/Current Procedural Terminology (Professional Edition)). (Chicago, IL: AMA Press, 2005).

[49] World Health Organization/Centers for Disease Control and Prevention. International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM).

[50] G.J. Escobar, J.D. Greene, P. Scheirer, M.N. Gardner, D. Draper, and P. Kipnis, “Risk-Adjusting Hospital Inpatient Mortality Using Automated Inpatient, Outpatient, and Laboratory Databases, Medical Care 46:3 (2008): 232–39.