
Anonymizing Health Data: Case Studies and Methods to Get You Started (2013)

Chapter 10. Medical Codes: A Hackathon

There are a few standard coding systems used in health data, for procedures, diseases, and drugs. We’ve mentioned them a few times already, but a chapter about codes sounds pretty boring. Well, we’re not about to list the codes and leave it at that. You can get books about these codes elsewhere (and they are oh, so interesting to read… to be fair, they’re references, not books to curl up with in an armchair). No, what we’ll look at are the specific ways to anonymize these codes in health data. By now you know the drill: generalization and suppression. But—spoiler alert—we have another trick up our sleeve to keep the original codes within generalized groups. This is a major aid in increasing the utility of data. But we’ll save that one for last, just to add to the anticipation.

We had the chance to apply all of these methods to a data set used for a hackathon known as the Cajun Code Fest[69] (how awesome is that name?). A hackathon is a competition in which programmers are (figuratively) caged up in some common space to code day and night to accomplish a predefined goal (no programmers were harmed in the making of this hackathon). For the Cajun Code Fest, in Lafayette, Louisiana, registrants were given de-identified claims data for the state and told to come up with something that would improve health care.

Taking advantage of health data requires more than just programming skills, so the organizers of the Cajun Code Fest encouraged people who had knowledge of health care to take part, even if they knew nothing about programming. Once people arrived, they would be put into multidisciplinary teams. This hackathon is probably the best example of health data being made available for all the right reasons. Competitors also got to enjoy the Festival International de Louisiane, with music and—it should go without saying in the South—lots of food (will code for crawfish!). A hackathon is a fun and engaging way to get people involved, as the Cajun Code Fest proves.

Codes in Practice

The point of medical codes is not to provide clarity to human readers, at least not directly, because they look a lot like random alphanumeric strings. What they do is standardize the medical categories assigned and consumed by health data users. And they are very precise. Usually, health data sets will place a code and its description in adjacent fields (the description itself is human readable, although it will sometimes be in short form, or terse, to keep the text field to a minimum number of characters).

Let’s look at the most common sets of coding systems we see in medical data. If you have something different in your data, don’t fret. The methods we’ll look at can be applied to different coding systems, or be mapped to these coding systems:

International Classification of Diseases (ICD)

This system, maintained by the World Health Organization (WHO), provides diagnostic codes for classifying diseases, as well as signs, symptoms, complaints, and external causes of diseases or injuries. It’s extremely detailed, with literally thousands of codes arranged in a hierarchy that we can take advantage of for anonymization. The US uses ICD-9 with a Clinical Modification (ICD-9-CM), which includes procedures.[70] But the ICD-9 system (which has about 14,000 disease codes and 4,000 procedure codes) is about 30 years old, and most countries have already adopted ICD-10 (which has about 10 times more codes!). US providers will be required to use ICD-10-CM about a year after this book goes to press, if there are no delays.

Common Procedural Terminology (CPT)

Maintained (and copyright protected) by the American Medical Association (AMA), this coding system is used to describe medical, surgical, and diagnostic services in the US.[71] It’s primarily used for billing purposes, with about 8,000 codes, so we see it in claims data all the time. Like the ICD system, it’s hierarchical, which we’ll make use of in anonymization. Healthcare Common Procedure Coding System (HCPCS) codes are often included in the same field as CPT codes. Once ICD-10 codes are adopted in the US, CPT and HCPCS codes will continue to be used in outpatient and office settings (whereas ICD-10-PCS will be used for inpatient procedures).

National Drug Code (NDC)

This is the coding system used in the US to identify drug products. You can think of the NDC as a serial number for a drug product, assigned by the Food and Drug Administration (FDA), which will change for different manufacturers and distributors, different doses, and different forms of the drug (e.g., liquid, pill, or inhaler). In order to manage the risk of identity disclosure from drug information, we need to generalize NDCs into a hierarchy, which they don’t have. Who has a system we can map these to for the sake of anonymization? No, wait, we mean the WHO has such a system, thankfully.

The Cajun Code Fest included all the coding systems just described in the competition data. The introductory webcast to discuss the competition and, specifically, the data that competitors would have access to gave the frequency of the top diseases, procedures, and drugs. The slides were then posted online for anyone to read. This might not sound like a big deal, but it means that those particular codes could not be given in a generalized form, or suffer a lot of suppression, and that masking could not change the frequency of those codes.

Here’s why. Imagine that we mapped code 410 to 414 to mask the original code. Well, that won’t work because an adversary could use the frequency of code 410 provided in the documentation to figure out that code 414 is in fact 410 (i.e., a frequency attack). So we needed something a little smarter to de-identify the data and preserve its utility.
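To make the attack concrete, here’s a minimal sketch of how published frequencies can undo a one-to-one code masking. The inputs are hypothetical count dictionaries, and ties in the counts would make a real match ambiguous:

```python
def frequency_attack(published_freq, masked_counts):
    """Match published per-code frequencies against the frequencies
    observed in the masked data to reverse a one-to-one masking.

    published_freq: {original_code: count} taken from the documentation.
    masked_counts: {masked_code: count} observed in the released data.
    """
    # Invert the masked data: look up a masked code by its frequency.
    by_count = {count: code for code, count in masked_counts.items()}
    # For each published code, find the masked code with the same count.
    return {orig: by_count.get(count) for orig, count in published_freq.items()}

# If the documentation says code 410 appeared 120 times, and the masked
# data shows code 414 appearing 120 times, the adversary learns 414 is 410.
frequency_attack({"410": 120}, {"414": 120, "427": 80})
```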


Generalization

Normally generalization wouldn’t work here because, as we just explained, we needed to preserve the original codes. If we grouped ICD-9-CM codes from 410 to 414 into Ischemic Heart Disease, the original codes would be lost. But we’re not about to throw out the baby with the bathwater. We need generalization to group like codes together. An adversary is unlikely to know the exact codes that a patient has in her data (because they’re so very detailed and precise).

Remember that we said there are literally thousands of ICD-9 codes—the same applies to the other codes as well. So whether we’re going to produce generalized groups in place of the original codes or not, we still need generalization. This is something we need to apply somewhat differently to each coding system (although the common thread is having a hierarchy to work with).

The Digits of Diseases

Let’s focus primarily on ICD-9-CM codes, since that’s what we had for the Cajun Code Fest, and they’re still used in the US. ICD-9 disease codes are up to five digits long and are arranged in a hierarchy. More digits means more precision. The fewest digits you’ll see for disease codes in their original form is three (codes below 100 are padded with leading zeros to keep the length at three). Then there’s supposed to be a period (although some data sets drop it), followed by one or two optional digits that define the disease more precisely.

Our first level of generalization for ICD-9 codes is therefore the three-digit codes, made up of the first three digits of the original code (this is something we’ve evaluated elsewhere).[72] A higher level of generalization would be two-digit codes (using the first two digits of the original code), and then even higher would be one-digit codes (the first digit of the original code, of course). It’s not perfect, and some will argue with the diagnostic accuracy of this generalization approach, but it’s quick to implement. Three digits also works for E and V codes (external causes of injury and supplemental classification), and procedure codes can use the first two digits instead (all part of the clinical modification). This can get a bit confusing, so let’s focus on disease codes.
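Cropping is simple enough to sketch in a few lines. This is a simplified version that strips the period and truncates; a fuller version would keep the E or V prefix on supplemental codes, as described above:

```python
def generalize_icd9(code: str, digits: int = 3) -> str:
    """Crop an ICD-9 disease code to its leading digits.

    A simplified sketch: assumes plain disease codes like '410.71'
    or '41071' (E and V codes would need their prefix preserved).
    """
    # Some data sets drop the period, so remove it before truncating.
    return code.replace(".", "").strip()[:digits]
```

So `generalize_icd9("410.71")` gives the three-digit code, and passing `digits=2` or `digits=1` moves up to the coarser levels of the hierarchy.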

For an even better generalization than three-digit ICD-9 codes, one that offers more or less the same number of disease and procedure categories but better diagnostic precision, there’s the publicly available dictionary produced by a team at Vanderbilt.[73] The Vanderbilt ICD-9 code groups were developed for Phenome-Wide Association Studies (PheWAS), and loosely follow the three-digit category and section groupings defined with the ICD-9 code system itself. For the Cajun Code Fest, we started with these code groups (see Table 10-1).

Table 10-1. Three’s company when it comes to ICD-9 disease codes. Broader generalizations that group ranges of ICD-9 codes are provided by the definitions in the ICD-9 codebook.[70]

Level of generalization  | Group                | Example
ICD-9 chapter definition | Broad 3-digit group  | Diseases of the circulatory system (390–459)
ICD-9 section definition | 3-digit group        | Ischemic heart disease (410–414)
3 digits                 | Minimum ICD-9 code   | Other acute and subacute forms (411)
4/5 digits               | Full ICD-9 code      | Acute coronary occlusion without myocardial infarction (411.81)

Really, to generalize ICD-9 codes more broadly than with three-digit codes, it’s probably better to move up the generalization hierarchy using the Clinical Classifications Software (CCS). As odd as the name may be, CCS is actually a system for grouping diseases and procedures into clinically meaningful categories, developed as part of the Healthcare Cost and Utilization Project (HCUP).[74] This is more sophisticated than simply cropping ICD-9 codes and will produce more meaningful data than ICD-9 section definitions. We won’t discuss the use of CCS as a generalization scheme any further in this book, but it’s worth knowing that it’s out there in case you ever need it.

ICD sections further attempt to group like diseases into broad categories. Alternatively, primary condition groups can be used to gather like diseases into broad categories, a generalization scheme we’ve used before (diseases are grouped into 45 broad diagnostic categories based on relative similarity and mortality rates).[75] But this raises an important point: primary condition groups were created in part because health experts don’t necessarily agree with the way some codes are grouped in the ICD-9 hierarchy. Like CCS, we won’t discuss these further, but see the references if you ever need a broader generalization scheme than what ICD-9 codes provide.[76]

So really we have three generalizations we can use to produce meaningful health data, depending on how much we need to reduce risk: Vanderbilt ICD-9 code groups, CCS, and primary condition groups. That isn’t to say we can’t use cropping, or ICD-9 section definitions, but these three generalizations are better in terms of preserving the utility of the data for analytics.

The Digits of Procedures

CPT codes are similar to ICD-9 codes: they’re five digits long, although we can, and most often do, use three digits for generalization. There are also 17 broad CPT categories that sometimes have to be used in cases where the risk of re-identification for the data set is much higher than the risk threshold. Another option, as with ICD-9 codes, is to crop at the second, or even first, digit of the code.

Unfortunately, because CPT codes are used almost exclusively in claims data for the purposes of billing, and they are copyright protected (i.e., you have to pay to use them), they simply aren’t seen much in research. As a result, researchers haven’t developed grouping systems for CPT codes, and we haven’t seen any other dictionaries available to generalize them differently. So, there’s not much more we can say about them here.

The (Alpha)Digits of Drugs

The NDC system used in the US doesn’t have a hierarchy, so we need to map the codes to something we can work with. We already mentioned that the WHO has such a system, called the Anatomic Therapeutic Chemical (ATC) classification. This system provides a hierarchy for drug utilization research,[77] but we can take advantage of this hierarchy for our own purposes. We’re sure they won’t mind, given that our purposes are genuinely for the greater good. And like the ICD, the ATC is a publicly available resource.

The ATC classification system is divided into 14 main anatomical groups, depending on the organ or system on which the drug is meant to act (e.g., blood and blood-forming organs). The main groups are followed by therapeutic, pharmacological, and chemical subgroups, before coding the chemical substance.

Although drugs are “classified according to the main therapeutic use of the main active ingredient,” this classification structure implies that a drug can be given multiple ATC codes depending on how it’s used. And similar to ICD and CPT codes, we truncate the codes to generalize them based on their hierarchy. The equivalent to three-digit ICD-9 or CPT codes is, in this case, a four-character ATC code (see the example in Table 10-2).

Table 10-2. This is your brain on ATC codes. Similar to ICD codes, but for drugs. Let’s break down C01DA02, which is used to treat heart disease.


Code type | Meaning                                       | Example (C01DA02)
1 letter  | Anatomical group                              | C for cardiovascular
2 digits  | Therapeutic group                             | 01 for cardiac therapy
1 letter  | Therapeutic/pharmacological subgroup          | D for vasodilators
1 letter  | Chemical/therapeutic/pharmacological subgroup | A for organic nitrates
2 digits  | Chemical substance                            | 02 for glyceryl trinitrate
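Like ICD and CPT cropping, generalizing ATC codes is just truncation over the hierarchy. A minimal sketch (the uppercasing is our own normalization assumption):

```python
def generalize_atc(code: str, chars: int = 4) -> str:
    """Truncate an ATC code to a higher level of its hierarchy.

    Four characters (the default) is roughly equivalent to a
    three-digit ICD-9 or CPT code.
    """
    return code.strip().upper()[:chars]
```

For the example in Table 10-2, `generalize_atc("C01DA02")` keeps the subgroup (vasodilators), `chars=3` keeps the therapeutic group (cardiac therapy), and `chars=1` keeps only the anatomical group (cardiovascular).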

Although the WHO is an international body, it doesn’t provide a dictionary for mapping from a specific country’s drug code system to its own—it’s up to countries themselves to do this. Because the ATC classification system is mainly used for research, the FDA hasn’t provided a map from the NDC system. And digital pharma systems used by pharmacies haven’t provided this dictionary either, at least as far as we know. So, we developed our own conversion table from NDCs to ATC codes that we’ll use until something better comes along.[78]


Converting NDCs to ATC codes requires knowledge of pharmaceutical drugs to differentiate between generic names, routes of administration, and main anatomical groups, as well as knowing where to find relevant information.

We had a team of six pharmacy employees who independently participated in this work, and they all commented that the combination drugs, those with multiple active ingredients, were the most difficult to map. All the same, they ended up with almost perfect agreement on a sample of codes.

With every data set we work with, however, we need to convert more NDCs over to the ATC classification system, both because there will be different drugs used in different jurisdictions, for different insurance providers, different populations, etc., and also because of the way NDCs are assigned: by manufacturer or distributor, dose, or form. A different color of pill results in a different NDC.


Suppression

Even with generalization, you’ll probably need some amount of suppression. For the Cajun Code Fest, which had user agreements in place and did not make the data set public, we used an attack simulator to measure average risk using the measure of adversary power explained in Adversary Power. For the level 1 demographic data, we used the same approach we always use for cross-sectional data sets, just like in Chapter 3.

For the level 2 longitudinal data we also used a form of k-anonymity by counting the number of distinct patients in a medical code group, then suppressing codes from the competition data set in which there were fewer than k patients. Notice that we said distinct patients—the same patient could have multiple claims with the same medical code, and we don’t want to count these more than once. We included the generalized level 1 demographics when we evaluated equivalence classes, though, as well as some level 2 nesting quasi-identifiers that were needed to keep patients within relevant groups.

Picture this: we group patients by their level 1 demographic data, and include the type of place where they received care (e.g., outpatient care, inpatient care, or maybe even a mental health facility). The place of care is a level 2 quasi-identifier and is tied to the claim, but it’s at a higher level and is needed to ensure we group patients appropriately based on care. Then we count how many distinct patients have a specific medical code, taking this information into account.

So, a 30- to 39-year-old man (level 1) with an in-hospital (level 2) diagnosis of myocardial infarction (level 2 medical code) who is all alone in this equivalence class will have this medical code suppressed anywhere it appears in his claims.

And we did this for each medical code—Vanderbilt ICD-9 code groups, three-digit CPT codes, and four-character ATC groups—one at a time. That’s right, one at a time. Because the risk was low enough, we didn’t need to suppress data by considering k patients with a particular combination of disease, procedure, and drugs. For the ATC groups we used the dosage form as a nesting variable to ensure that similar drugs were considered together (there’s a big difference between a suppository and a pill).

Now, if there’s a strong correlation between a disease and a procedure, and one medical code is suppressed because it’s not represented by k patients, then we expect the other to be suppressed as well. After all, they would have the same nesting variables. But it will depend on how strongly correlated they are, and how many patients have the particular codes. We could treat them as connected variables, but in truth they are different and we don’t want to suppress medical codes unless we really have to. Under this scheme, “really have to” means fewer than k patients.

We apply this form of longitudinal suppression to every data set we anonymize. You can think of it as a form of strict average risk, which we saw in Probability Metrics, where k will be at least two or three distinct patients. Often we’ll set k to be the inverse of our overall risk threshold. So, if we have a threshold of 0.2, we use k = 5; if we have a threshold of 0.1, we use k = 10.
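That rule of thumb can be written down directly. A sketch; the floor of two distinct patients is our assumption, based on the “at least two or three” above:

```python
import math

def k_from_threshold(threshold: float) -> int:
    """Set k as the inverse of the overall risk threshold,
    with a minimum of two distinct patients."""
    return max(2, math.ceil(1 / threshold))
```

A threshold of 0.2 gives k = 5, and a threshold of 0.1 gives k = 10, matching the examples above.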

We do this to each level 2 field with medical codes individually, linked to the generalized level 1 data and any level 2 quasi-identifier that is of a higher level (e.g., place of service). Obviously, the suppression has to apply to any connected fields as well (such as description fields, or other quasi-identifiers). Suppressing the code for myocardial infarction but not its description would pretty much defeat the purpose of that suppression!
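The counting-and-suppression step can be sketched as follows. The field names (`patient`, `demo`, `place`, `code_group`, `code`, `description`) are our assumptions for illustration, not the competition data’s actual schema:

```python
from collections import defaultdict

def suppress_rare_codes(claims, k=5):
    """Suppress codes shared by fewer than k DISTINCT patients within
    the same equivalence class (level 1 demographics plus level 2
    nesting quasi-identifiers) and generalized code group.
    """
    # Count distinct patients per (equivalence class, code group);
    # a set ensures a patient with many claims is counted once.
    patients = defaultdict(set)
    for c in claims:
        key = (c["demo"], c["place"], c["code_group"])
        patients[key].add(c["patient"])

    out = []
    for c in claims:
        key = (c["demo"], c["place"], c["code_group"])
        if len(patients[key]) < k:
            # Suppress the code and its connected fields together.
            c = dict(c, code=None, description=None)
        out.append(c)
    return out
```

Run with k = 2 on three claims where two distinct men share an inpatient code group and a third patient is alone in hers, only the lone patient’s code (and its description) is suppressed.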


Shuffling

It seems like every day we’re shuffling. Shuffling is popular with health care users because they want to maintain the formats of their original fields and end up with realistic-looking data sets. But when it comes to a hackathon, you really want everybody to have a good time, and y’all can’t do better than the Cajun Code Fest!

We already mentioned that the organizers gave away the frequencies of the top diseases, procedures, and drugs in a webcast and an online document describing the competition data set. And we’ve discussed the generalization and suppression we used. But we needed to bring one more trick to the table, so that we could provide the original medical codes, with their original frequencies (bar some suppression to the rarer codes). It just so happens that the trick is something used in pretty much every game of cards.

We start the process off the same way we did for suppression: link all patient claims to their generalized level 1 demographic data. Then we consider all the medical codes in a field that have the same level 1 data plus any level 2 nesting quasi-identifiers. For suppression, we counted all the distinct patients in this set that had the same generalized medical code.

But this time we treat the original codes—within the generalized code group—as a deck of cards, and shuffle them between patients. In other words, we are randomly exchanging exact diagnoses between patients in a code group and their equivalence class. If we didn’t do this, then only the code group would be provided in the de-identified data set (since that was the level of generalization deemed suitable through de-identification). You can see that the nesting quasi-identifiers (forming the equivalence class) can be pretty important here. You don’t want to swap codes between males and females, or adults and children. The specific codes in a group might also be specific to the place of service, or dosage form, or whatever.


Shuffling is a form of data swapping.[79] The medical code is the swapping attribute—the thing we want to shuffle among similar records—and the demographics and generalized code groups we use to nest the shuffling are called the swapping keys. Statistics on the swapping keys don’t change, because they aren’t affected by the swapping of medical codes between records. But swapping attributes confuses an adversary that is trying to link records to external information.

The purpose of shuffling is to keep the original codes in the data set, but we want the highest level of data integrity we can get. And of course, we need to use the same nesting quasi-identifiers in shuffling that we used in suppression. The suppression will ensure we have at least k patients in a specific code group and equivalence class, but with shuffling, the adversary can’t know which patient in a code group had which original code. That’s not to say that shuffling is perfect, though: it can create illogical pairings, especially if overly broad code groups are used (e.g., two-digit ICD codes).


If an organization has a data set it needs to de-identify for internal use only, we believe it’s acceptable to perform risk mitigation using Vanderbilt ICD-9 code groups and three-digit CPT codes, but leave the original codes in place (no shuffling required). Unknowable pseudonyms still need to be used for the patient identifiers and place of service identifiers, though, and for any other identifiers that can be tied uniquely to patients.

Also, there need to be restrictions in place against matching your data set to others. When we do a risk assessment with these three-digit codes, we are guarding against attacks based on what a normal person can reasonably know—five-digit ICD codes are too precise for the average person (unless they have another data set to match against). Plus, those three-digit codes already represent hundreds of possible medical categories, which is still very precise for the average person.

Let’s say we had four 30- to 39-year-old men with an inpatient diagnosis of ischemic heart disease (Vanderbilt ICD-9 code group 411). Their original diagnoses were intermediate coronary syndrome (411.1), intermediate coronary syndrome again, acute coronary occlusion without myocardial infarction (411.81), and other acute and subacute form of ischemic heart disease (411.89). The original deck therefore had {411.1, 411.1, 411.81, 411.89}, but the shuffled deck had {411.81, 411.1, 411.89, 411.1} (where the position of the code indicates the patient that gets this code, because we’re shuffling their diagnoses and not the patients themselves).

For most card games, we can probably get away with a few good riffle shuffles to get good enough randomness. This is how you shuffle a deck of cards in the classic casino scenes of most movies (with half the deck of cards in each hand, the thumbs are lifted so that the cards are interleaved when they are merged into a single deck). In our case we’re swapping codes between patients, but it’s no different than dealing out a deck of shuffled cards.

But we can do a lot better than “good enough” randomness, given that we have computers at our disposal! Instead, we can program the computer to do the shuffling for us, and be assured that the dealer isn’t stacking the deck.


That’s not to say that we can use just any shuffling algorithm implemented on a computer. Even in a game of cards, top players have been known to take advantage of nonrandomness in a deck. Poorly designed or poorly seeded pseudorandom algorithms can likewise fall into patterns that weaken their effectiveness. What we want is a uniformly distributed random permutation of the rows.

The quick and dirty way to shuffle is to add a column of random numbers, ensuring the probability of getting the same number twice is low, and then sort the rows by that column. Of course, you want to shuffle within the equivalence class (the level 1 quasi-identifiers and the level 2 nesting quasi-identifiers), so you can order the rows by the generalized groups first and then by the random number.

Because the number of codes in each generalized group will be small, you don’t need a huge range of random numbers (remember, you want a low probability of getting the same number twice). A more sophisticated way to permute rows is to use the Fisher-Yates shuffle, which at each iteration picks a random position from among the rows not yet shuffled and swaps the last unshuffled row with the one at that position. But you have to do this within each equivalence class, which would slow down this implementation.
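A minimal sketch of per-group shuffling (Python’s random.shuffle is itself a Fisher-Yates implementation; the dict keys `demo`, `code_group`, and `code` are our assumed schema, not the competition data’s):

```python
import random
from collections import defaultdict

def shuffle_codes(claims, seed=None):
    """Apply a uniformly random permutation to the original codes within
    each (equivalence class, generalized code group) combination."""
    rng = random.Random(seed)

    # Collect the row indices belonging to each group.
    groups = defaultdict(list)
    for i, c in enumerate(claims):
        groups[(c["demo"], c["code_group"])].append(i)

    # Shuffle the deck of original codes within each group and deal
    # them back out to the same rows.
    for rows in groups.values():
        codes = [claims[i]["code"] for i in rows]
        rng.shuffle(codes)  # Fisher-Yates under the hood
        for i, code in zip(rows, codes):
            claims[i]["code"] = code
    return claims
```

Run on the example deck {411.1, 411.1, 411.81, 411.89}, the multiset of codes (and so their frequencies) is preserved; only which patient holds which card changes.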

And voilà, you get to keep the original medical codes, with pretty much the same frequencies (barring some suppression in cases where there just aren’t enough patients to justify keeping this data, given the risk of re-identification). Perception is reality, and providing medical codes in this way makes the data look like the original!

By shuffling codes, we made it possible to apply all the tools developed at the Cajun Code Fest directly to the original data. The same can be said for testing software, or for public data sets and all that can be done with them. Not only is the utility of the data increased (because we can preserve the original frequencies), but people’s perception of the data being provided is improved, because they can see that it’s better data than generalized groups. Even when risk thresholds need to be low, for example in public data sets, shuffling makes the data far more useful and realistic.

Final Thoughts

In a way, we treat medical codes no differently than any other quasi-identifier. To reduce the risk of re-identification, we generalize and suppress. But medical codes are very precise, which means there can be thousands of them. Luckily, some are designed with a logical hierarchy that is easy to take advantage of (e.g., ICD, CPT, and ATC codes), and others we can map to a coding scheme that has the logical hierarchy we need (e.g., NDCs).

And with shuffling we can even go a step further and provide the original codes, not just the generalized groups, with their original frequencies, so that the final de-identified data set has better utility all around. As if the Festival International de Louisiane during Cajun Code Fest wasn’t reason enough to dance, this should be.

[69] Cajun Code Fest, “Own Your Own Health.”

[70] World Health Organization/Centers for Disease Control and Prevention, International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM).

[71] American Medical Association, CPT 2012 Professional Edition (Chicago, IL: AMA Press, 2011).

[72] K. El Emam, D. Paton, F. Dankar, and G. Koru, “De-Identifying a Public Use Microdata File from the Canadian National Discharge Abstract Database,” BMC Medical Informatics and Decision Making 11 (2011): 53.

[73] J.C. Denny, M.D. Ritchie, M. Basford, J. Pulley, L. Bastarache, K. Brown-Gentry, D. Wang, D.R. Masys, D.M. Roden, and D.C. Crawford, “PheWAS: Demonstrating the Feasibility of a Phenome-Wide Scan to Discover Gene–Disease Associations,” Bioinformatics 26:9 (2010): 1205–1210.

[74] A. Elixhauser, C. Steiner, and L. Palmer, Clinical Classifications Software (CCS) (US Agency for Healthcare Research and Quality), 2011.

[75] G.J. Escobar, J.D. Greene, P. Scheirer, M.N. Gardner, D. Draper, and P. Kipnis, “Risk-Adjusting Hospital Inpatient Mortality Using Automated Inpatient, Outpatient, and Laboratory Databases,” Medical Care 46 (2008): 232–239.

[76] K. El Emam, L. Arbuckle, G. Koru, B. Eze, L. Gaudette, E. Neri, S. Rose, J. Howard, and J. Gluck, “De-Identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset,” Journal of Medical Internet Research 14:1 (2012): e33.

[77] WHO Collaborating Centre for Drug Statistics Methodology, “Guidelines for ATC Classification and DDD Assignment 2011” (Oslo, 2010).

[78] N.C. Santanello and E. Bortnichak, “Creation of Standardized Coding Libraries—A Call to Arms for Pharmacoepidemiologists,” PharmacoEpi and Risk Management Newsletter 2 (2009): 6–7.

[79] G.T. Duncan, M. Elliot, and J.-J. Salazar-González, Statistical Confidentiality (New York: Springer, 2011).