Chapter 8. Free-Form Text: Electronic Medical Records

It might seem like a somewhat trivial task to find and anonymize all personal health information in a document, and there’s research in the academic literature about this, but here we’re interested in where the rubber meets the road. It’s not enough for a system to catch 80% of the names in a medical document—it has to catch all of them or it’s considered a breach. That means our standards have to be much higher than what you’d typically find.

NOTE

You’ll notice that we’re dealing with both direct and indirect identifiers when processing free-form text. Both types of identifiers are extracted from the text and dealt with in the same way (e.g., tagging or redaction). So does that make text de-identification a form of masking or a form of de-identification, as we’ve defined them earlier in this book? In fact, it’s both. That’s why we call it text anonymization.

Not So Regular Expressions

Many modern clinical systems have free-form text: nurses’ notes, consultation letters, radiology reports, pathology reports, and so on. This text gets into electronic systems through health care providers who input the data. Even though electronic medical records allow the entry of structured information, sometimes using these systems takes longer than just writing in the information (for example, choosing a diagnosis from a long drop-down list), and if no analytics will be performed on this data that will directly inform their practice (which is commonly the case), there is no incentive to take the time to enter structured data. So we end up with a lot of free-form text.

This text is a very rich source of information for analytics purposes. But before we can pass such data through an analytics pipeline it needs to be anonymized, as is the case for structured data.

Text found in clinical systems is different from what you’d find from other sources (for example, in newspaper articles or research articles). For one thing, most of it is never edited, which means a lot of typos, shorthand, incomplete sentences, spelling errors, and poor grammar. The text might even come from automated transcription of a dictation, or be the output of optical character recognition software transferring handwritten notes. Sometimes these methods may work out great, but other times they may perform poorly. Remember that many facilities have legacy systems and hardware, and that there are still many old and less than perfect tools in use today.

Still, what we call free-form text data will have some structure in a health care setting. For example, an extract from an electronic medical record may have the name of the patient at the top, as part of a header, with the patient’s medical record number, the date of the visit, and the name of the physician writing the note. These four fields would be the structured component of the text data. After those fields would come the true free-form text component, consisting of the notes from medical staff.

Another example of somewhat structured free-form text data is an XML file with pre-defined elements and attributes. An XML document may be exported from an electronic medical record system or be part of a feed from a medical device. Some of these XML documents may specifically tag the name of the patient, but some may be comments or notes as free-form text. It really depends on the context, and the variety of text formats themselves can add to the challenge of anonymization.

When there’s some predefined structure to a data set, we can take advantage of it during the anonymization. For example, if we know from the structured field that a patient’s name is “David Greybeard,” then whenever we encounter “David” or “Dave” in the free-form text portion we’ll know that it’s likely referring to that same patient.
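To make this concrete, here is a minimal Python sketch of that idea: it expands a name taken from the structured header into likely variants (using a small, hypothetical nickname map) and flags where those variants appear in the note. The NICKNAMES table and the function names are illustrative only, not part of any particular system.

import re

# Hypothetical nickname map; a real system would use a curated list.
NICKNAMES = {"david": {"dave", "davy"}, "margaret": {"peggy", "meg"}}

def name_variants(full_name):
    """Expand a structured-field name into likely variants found in free text."""
    variants = set()
    for part in full_name.lower().split():
        variants.add(part)
        variants.update(NICKNAMES.get(part, set()))
    return variants

def flag_known_patient(text, full_name):
    """Return spans in the note that match the patient's name or a nickname."""
    pattern = r"\b(" + "|".join(re.escape(v) for v in sorted(name_variants(full_name))) + r")\b"
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text, re.IGNORECASE)]

print(flag_known_patient("Dave was seen today. David reports no pain.", "David Greybeard"))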

When the free-form text data comes from a known system, such as an electronic medical record, we can also pull out the list of names of all patients and all physicians, and create a list of names and aliases that can be used as a lookup dictionary. These lists, for example, can be used as part of the anonymization since the vast majority of names in the data set will be of those patients and physicians.

But we want to give you a general presentation of text anonymization here, without getting bogged down in the wide variety of electronic medical record systems and how they are structured. So, we’ll ignore the techniques we just mentioned that take advantage of structured information, and focus on the anonymization of the free-form text component itself.

General Approaches to Text Anonymization

There are two general approaches to solving this problem: model based, and heuristic or rule based (which we’ll simply refer to as rule based, because it sounds a lot less complicated). A model-based approach derives a statistical or machine learning model from a training data set, then applies that model to new data. In a rule-based approach, the anonymization system applies a predefined set of criteria and constraints directly on the data set without the need for any training. (Well, not exactly—the training in this case came from researchers and analysts that figured out the rules. But we can take the rules themselves, once formulated, and just apply them directly to the free-form text data.) Another option is to combine the model-based and rule-based approaches into a hybrid system, to get the best of their strengths and, hopefully, lessen their weaknesses.

Anonymization tools could be deployed in a large institution, or in many small practices and clinics. By “large institution” we mean something like the Veterans Administration or a teaching hospital. A large institution would be a good place for a model-based system because it would have the resources to create an annotated data set to be used for training the model. We’ve found that many small practices and clinics, though, don’t seem to fit with the model-based approach. They just don’t have the capacity or expertise to invest in training the model. And because of this, they often aren’t willing to try using model-based systems. That poses some unique challenges, as rule-based systems may have limits in terms of what they can realistically do.

Training the model is really important because you need details from the local environment to calibrate its settings. Training a model on one institution’s data and then using it on another institution’s data will not work very well—the model will just not be that accurate. Also, the document types matter. Imagine a model that was trained by anonymizing the free-form text in pathology reports but was then applied to radiology reports. Again, the contexts are completely different. The shorthand used will vary; the medical terms will vary; even the structure of sentences may vary. Context matters to a model-based system. If a training data set isn’t available, the only way a model will meet your needs is if it starts with a reasonable set of rules (yes, we’re back to the hybrid system).

To get a model-based system to work to its full potential, though, we need a good local training data set. That means extracting a sample of historical data that is representative of the whole, and manually annotating it. Annotation is a time-consuming and labor-intensive exercise, and it’s best done by someone with some medical training who can correctly recognize medical terms in the text. The amount of annotation will depend on the modeling technique, but we estimate that in general at least 100 documents need to be annotated manually to get the system up and running. But that’s only a starting point. A much larger number of documents would need to be manually annotated to build a training data set that ensures the model is stable across document types.

Ideally, you’d have a team of people with medical training doing the manual annotation to build a really good training data set. They could work together, pooling resources and brainpower, with each person responsible for a specific set of documents. You’d probably also want one highly reliable senior person to perform manual annotation on a sample of each person’s documents, to see how much they agree (there are statistical ways to do this). Or, if you have the resources, you could have people work on the same documents, and allow them to discuss discrepancies between their work and resolve differences. One highly reliable senior person would again be needed, this time to adjudicate on any unresolved differences.

However, we’ve found that many organizations, big and small, are unwilling to make the investment to produce good training sets. They don’t want to be bothered with assigning medically trained staff to manually annotate documents, given all of their more pressing needs. What they want is a system that works immediately, no muss or fuss. So we’ll instead focus on rule-based approaches. Do keep in mind, though, that model-based systems can have some performance advantages in detecting personal information in free-form text. What’s more important in the context of our discussion are the metrics for evaluating these systems, whether rule based or model based.

Ways to Mark the Text as Anonymized

The purpose of a text anonymization system is to extract personal information from text. There are different elements we want to extract. A list of some of the obvious direct and indirect identifiers is presented in Table 8-1. Once these elements are extracted, we can then anonymize them in a number of different ways: redaction by replacing them with some form of indicator, like “***”; tagging with their type; randomization within their type; and generalization into a less-specific type.

Table 8-1. Some identifiers we need to extract

Direct identifier       Indirect identifier
First and middle name   Age
Last name               Postal/ZIP code
Street                  Date
Email                   Health care facility
Phone number            City and state
ID                      Country

Redaction is the simplest post-processing approach, but it’s also the least informative. It’s the classic spy movie style of removing secret information from documents by blacking it out. It shows that there was some personal information in that space, and that it was removed. But anyone looking at the data gets no indication of the type of information that was removed. Sometimes redaction is acceptable if the type of analysis that will be performed on the anonymized text does not need to use direct or indirect identifiers. For example, if the analysis is to classify the patient into one of a number of possible diagnoses or by severity of disease, whether he was “Dave” or “Al” will not really make a difference to that task.

Tagging at least replaces the element with the type of information that was removed. For example, “David” can be replaced by “<FIRST_NAME>.” Because there could be multiple first names in the text, we can also index them. Every instance of “David” could be replaced with the tag “<FIRST_NAME:1>,” and if there’s another name in the text, such as “Jane,” all we need to do is replace it with another tag, “<FIRST_NAME:2>.” That way the reader of the anonymized text will know when the same person is being referred to repeatedly in another part of the same document.
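As a rough illustration of indexed tagging, the following Python sketch replaces detected first names with tags like <FIRST_NAME:1>, reusing the same index for repeated mentions. It assumes detection has already happened upstream; the detected_names list and the function name are just for this example.

import re

def tag_first_names(text, detected_names):
    """Replace each detected first name with an indexed tag like <FIRST_NAME:1>.

    detected_names is assumed to come from an upstream detection step.
    """
    index = {}  # name -> tag number, so repeated mentions share the same tag
    for name in detected_names:
        index.setdefault(name.lower(), len(index) + 1)
    for name, i in index.items():
        text = re.sub(r"\b" + re.escape(name) + r"\b",
                      f"<FIRST_NAME:{i}>", text, flags=re.IGNORECASE)
    return text

note = "David was admitted. Jane examined David at 3pm."
print(tag_first_names(note, ["David", "Jane"]))
# -> "<FIRST_NAME:1> was admitted. <FIRST_NAME:2> examined <FIRST_NAME:1> at 3pm."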

Randomization is the same scheme that’s used for dealing with direct identifiers in structured data. We simply replace every instance of “David” with a randomly selected name from a names database (for example, a names database obtained from the census). We can even be smart about it and replace “David” with a name that’s typically associated with the same sex, in this case a male name, like “Flint.” Generalization would also work the same way as with structured data, where a generalized value is used to replace the original value in the text.
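Here is a similarly hedged sketch of randomization within a type: each detected first name is mapped to a random surrogate of the same sex, and the same surrogate is reused for repeated mentions so the note stays readable. The name lists are placeholders; a real system would draw on something like census name tables.

import random

# Hypothetical name lists; a real system would load census-derived tables from files.
MALE_FIRST_NAMES = ["Flint", "Frodo", "Wilkie"]
FEMALE_FIRST_NAMES = ["Fifi", "Flo", "Gigi"]

def randomize_names(text, detections, seed=42):
    """Replace each detected first name with a random name of the same sex,
    reusing the same surrogate for repeated mentions so the text stays coherent.

    detections maps a detected name to its inferred sex ('M' or 'F')."""
    rng = random.Random(seed)
    surrogate = {}
    for name, sex in detections.items():
        pool = MALE_FIRST_NAMES if sex == "M" else FEMALE_FIRST_NAMES
        surrogate[name] = rng.choice(pool)
    for name, replacement in surrogate.items():
        text = text.replace(name, replacement)
    return text

print(randomize_names("David was seen by Jane. David is stable.",
                      {"David": "M", "Jane": "F"}))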

Evaluation Is Key

The challenge with text anonymization is the detection itself. Finding elements of personal information in the text is not always straightforward. Text anonymization systems are therefore evaluated based on how well they detect the various elements of personal information, and you need that evaluation to determine whether a system is right for the job.

WARNING

Strict but appropriate criteria help to guide modifications and improvements. If the evaluation is weak, personal information could be inadvertently leaked in the text that was supposed to be anonymized. If that text is then released, there’s a good chance it will be considered a breach.

We call any record with free-form text a document. The collection of documents that form a large and structured set of texts is called a corpus (Latin for “body”). We want to evaluate the performance of a text anonymization system on the corpus (certainly not a corpus vile, or worthless body, for our purposes).

We want to be sure that elements of personal information are detected with high accuracy per document. Assume a document has 10 last names in it: if we detect and redact 9 of those last names but fail to detect even 1 of them, this is considered a disclosure. It doesn’t matter that 90% of the last names were detected and redacted—that last name would result in the identification of an individual. The document would have to be considered to have personal information in it. This is considered an “all or nothing” approach to evaluation.

By way of comparison, consider a record in a structured data set. If there’s enough information left behind after anonymization that David Greybeard can be associated to the record, the record can be re-identified. It doesn’t matter that we protected most of David’s record—the record is still re-identifiable. That’s bad, and we wouldn’t accept this for structured data. The same rules have to apply to free-form text.

In the text anonymization system with 1 leaked name out of 10, we had a 100% failure rate on that document, not a 90% success rate. This distinction is important because it can have a huge impact on how we measure the performance of the text anonymization system on a corpus. If we had a second document where 80% of the identifiers were detected and redacted, the overall performance of the system would not be an average of 85%, but 0%. Of course, if these were the only two documents out of 100 in which names had not been detected, then the average performance of the system would be 98%. The point here is that we’re evaluating performance based on the number of documents that have leaks in the corpus, not the number of individual names themselves that are leaked.


Figure 8-1. If even one identifier is leaked in a document, it’s a disclosure. The two documents with leaks represent failures of the text anonymization system to detect personal information.
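The following Python sketch shows the all-or-nothing calculation for the example above: a document counts as successfully anonymized only if every identifier in it was detected. The (total, detected) pairs are invented to match the numbers in the text.

def all_or_nothing_recall(documents):
    """Corpus-level recall where a document counts only if *every* identifier
    in it was detected; documents is a list of (identifiers_total, identifiers_detected)."""
    clean = sum(1 for total, detected in documents if detected == total)
    return clean / len(documents)

# Two documents with leaks (9/10 and 8/10 detected) out of 100 total.
corpus = [(10, 9), (10, 8)] + [(5, 5)] * 98
print(all_or_nothing_recall(corpus))  # 0.98, as in the text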

This is far stricter than the criteria typically used when evaluating text anonymization systems, but it’s consistent with privacy regulations. Privacy regulations do not provide a “free pass” to free-form text. A data set needs to have a low risk of re-identification, irrespective of its format.

When we evaluate the performance of an anonymization system, we need to make a distinction between direct and indirect identifiers—they have a different standard for what’s deemed “acceptable risk.” For starters, if we’re able to detect one type of personal information very well and another type poorly, pooling their results might hide this distinction between the two types of personal information. We need to know whether one element is detected poorly. For direct identifiers, the standards of acceptable detection on the corpus have to be very high—close to perfection—because, well, they’re directly identifiable! We can’t miss these. For indirect identifiers, however, we can be a bit more lax. In this case we can use standards similar to what we use for structured data.

Let’s first examine the metrics to use before covering acceptable detection rates.

Appropriate Metrics, Strict but Fair

We’ll use the confusion matrix in Table 8-2 to help illustrate how we define three metrics. If the text anonymization system tagged “chimpanzee” as a FIRST_NAME when it isn’t one, that’s a false positive (FP). If the system didn’t detect “Jane” as a FIRST_NAME, that’s a false negative (FN). An element that’s correctly identified (“Greybeard” is a LAST_NAME) is a true positive (TP), and an element that’s not personal information that was ignored (“banana”) is a true negative (TN).

Table 8-2. A confusion matrix, just to confuse you (PI = personal information)

                    Recognized element of PI   Unrecognized element of PI
Element of PI       True positive (TP)         False negative (FN)
Not element of PI   False positive (FP)        True negative (TN)

With those definitions in place, we can describe some metrics using that confusion matrix. As you can imagine, there could be a very large number of true negatives in a document. We don’t want to include these in any metrics, as they would skew the evaluations. Of the elements that were recognized as personal information (the second column of the confusion matrix), we’re interested in the proportion that were correctly recognized, known as the precision. This can also be interpreted as the probability of a recognized element actually being personal information:

Precision (or Positive Predictive Value)

Precision = TP / (TP + FP)

We’re also interested in the proportion of the actual elements of personal information (the “Element of PI” row of the confusion matrix) that were correctly recognized, known as recall. Similar to precision, this can be interpreted as the probability that an element of personal information is recognized:

Recall (or Sensitivity or True Positive Rate)

Recall = TP / (TP + FN)

We can summarize these two results into a single metric, known as the F-measure, using the harmonic mean of precision and recall. It’s a form of balanced average appropriate for rates. Although this combined measure is an easy way to look at both precision and recall together, it’s always better to consider all three metrics:

F-measure (or F-score)

F-measure = 2 / ( (1 / Recall) + (1 / Precision) )

In practice, we can also weight precision and recall differently, using a weighted harmonic mean. A low precision value means that the text anonymization system is redacting information that it doesn’t need to. This reduces the utility of the anonymized data. A low recall value means that the system is not detecting elements of personal information and, depending on the threshold selected, could be considered a privacy breach. So we may prefer to use an alternative weighting. A popular choice is the F2-measure, where recall is weighted higher than precision. This places more emphasis on avoiding the risk of leaving in elements of personal information:

F2-measure

F2-measure = 5 / ( (4 / Recall) + (1 / Precision) )
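These metrics are straightforward to compute from the confusion matrix counts. The sketch below uses the general weighted form F-beta = (1 + beta²) × P × R / (beta² × P + R), which reduces to the F-measure at beta = 1 and the F2-measure at beta = 2; the counts in the example are invented.

def precision(tp, fp):
    """Proportion of recognized elements that really are personal information."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of actual elements of personal information that were recognized."""
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=2 weights recall higher."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = precision(tp=90, fp=10), recall(tp=90, fn=30)   # illustrative counts
print(p, r, f_beta(p, r), f_beta(p, r, beta=2))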

Text anonymization is, at heart, an information extraction task. These metrics have a long-standing tradition of use in the field of information retrieval,[59] and are also used in evaluating the detection of personal health information.[60], [61]

Standards for Recall, and a Risk-Based Approach

From a re-identification risk perspective, some experts in this area have set standards for recall (ensuring elements of personal information are in fact detected). In the case of direct identifiers, recall needs to be close to 95% or higher to be acceptable. This allows only a minimal number of direct identifiers to be missed.

For indirect identifiers, we can apply a risk-based approach that fits nicely with our previously described methodology. We can say that an adversary would need to have at least two indirect identifiers—not detected in the same document—to re-identify someone. Two may seem like an arbitrary number, but if only one were needed it would likely be a direct identifier, not an indirect one. There’s evidence that Canadians can be re-identified with a very high probability using only two indirect identifiers (date of birth and postal code),[62] and similar results have been seen in European countries. In the US, two pieces of information (date of birth and five-digit ZIP code) are believed to uniquely characterize approximately 63% of the population.[63]

Now we need to tie this in with our risk-based approach to anonymization. If we let r be the recall for indirect identifiers, the probability of two indirect identifiers being missed during anonymization is given by Pr(missed) = (1 – r)². Requiring more than two indirect identifiers to be missed would give a smaller probability, and would therefore be less conservative, which is why we limit ourselves to two. Let’s revisit our four attacks from Chapter 2 and reframe them in the context of free-form text anonymization with indirect identifiers.

For example, we can consider attack T1, a deliberate attempt at re-identification, and define the overall risk of re-identification. But now we include the probability of missing an indirect identifier, because it really depends on whether or not the text anonymization system detects it in the first place. The risk here is that it doesn’t detect an indirect identifier. If two indirect identifiers are missed in a document, then it’s possible that a patient could be re-identified.

Attack T1 (Text)

Pr(re-id | attempt, missed) × Pr(attempt | missed) × Pr(missed)

So, for a re-identification to occur, the anonymization system needs to have missed at least two indirect identifiers. If that happens, an adversary must attempt to re-identify the data. And if that happens, we’re interested in the probability of an individual being re-identified from these two pieces of information.

For the purpose of a simple example, let’s say we want this overall probability to be less than or equal to 0.05, assuming we’re treating it as a public data release:

Risk (Text)

Pr(re-id | attempt, missed) × Pr(attempt | missed) × (1 – r)² ≤ 0.05

If we assume that two indirect identifiers can re-identify someone with certainty, we can simplify the equation by letting Pr(re-id | attempt, missed) = 1. This is more or less true in the Canadian context, and a bit conservative in the US context (assuming date of birth and postal code or five-digit ZIP code).

For the BORN registry (introduced in Chapter 3), we determined that the probability of a re-identification attempt is 0.4. By rearranging the preceding risk equation, with Pr(attempt | missed) = 0.4, we find the acceptable recall for the anonymization of free-form text in this case is r ≥ 1 – √(0.05/0.4) = 0.65. As this simple example demonstrates, with this risk-based scheme the appropriate recall is determined by context.
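Rearranging the risk equation for r gives r ≥ 1 – √(threshold / (Pr(re-id | attempt, missed) × Pr(attempt | missed))). Here is a small Python sketch reproducing the BORN example; the function name is ours, and the inputs are the values quoted above.

from math import sqrt

def min_recall(risk_threshold, pr_attempt, pr_reid=1.0):
    """Minimum recall r such that pr_reid * pr_attempt * (1 - r)**2 <= risk_threshold."""
    return 1 - sqrt(risk_threshold / (pr_reid * pr_attempt))

print(round(min_recall(0.05, 0.4), 2))  # 0.65, the BORN example in the text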

The same reasoning can be applied to the remaining attacks to set the appropriate recall threshold. The maximum recall across all relevant attacks is the recall value that will be used to anonymize the data (anything lower will not meet the recall threshold for at least one attack, i.e., the one with the max recall). But remember that the acceptable thresholds to use for direct and indirect identifiers are very different when deciding whether the anonymization system is defensible.

Standards for Precision

Deciding what’s an acceptable minimum level of precision is more challenging because it’s largely a reflection of data quality. Some experts in this area have suggested a precision of 80% for direct identifiers and 70% for indirect identifiers, and we agree with these thresholds.

But unlike the “all-or-nothing” approach to calculating recall across documents, precision can be calculated as a “micro-average.” A micro-average pools all of the elements that have been extracted as protected health information (PHI) by the anonymization system and calculates precision directly, ignoring whether the elements of PHI came from different documents. Basically we treat the corpus as one big document. So it’s not precision per document, but precision across all documents. This makes sense because we’re interested in whether or not the system recognized PHI correctly, and the distinction between documents doesn’t matter in this case.
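A minimal sketch of the micro-average: pool the true and false positives from every document before dividing, rather than averaging per-document precisions. The counts are illustrative.

def micro_precision(per_document_counts):
    """Pool TP and FP across all documents and compute one precision value.

    per_document_counts is a list of (tp, fp) pairs, one per document."""
    tp = sum(t for t, _ in per_document_counts)
    fp = sum(f for _, f in per_document_counts)
    return tp / (tp + fp)

print(micro_precision([(8, 2), (15, 0), (3, 1)]))  # 26 / 29, roughly 0.90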

Anonymization Rules

There are a lot of things to consider when developing an anonymization system for free-form text.[64] We’ll examine a few of the major ones.

First and foremost, keep in mind that there will be all kinds of errors in the text. This isn’t a slight on doctors’ handwriting, but a recognition of reality. Medical staff have to use tablets with virtual keyboards, or enter information at terminal stations while hurriedly trying to look after multiple patients, and so on and so forth. Lookup lists of names and the like are an important part of the process of detecting personal information, but all these errors will limit their usefulness. And where lookup lists are used, they need to include shorthand, nicknames, and any other variants you can think of.

NOTE

When looking up names, it’s usually better to use a distance measure, or edit distance (in a broad sense), to capture words that are “close” to matching. That way some spelling errors will not completely trip up the system. The common edit distance is the Levenshtein distance, which basically measures the number of single-character edits required to get from one word to another.
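For the sake of illustration, here is a small Python sketch of a fuzzy name lookup built on the Levenshtein distance; the surname list and the distance cutoff of 1 are arbitrary choices for this sketch, not recommendations.

def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_name_match(token, name_list, max_distance=1):
    """Flag a token as a possible name if it is within max_distance of a known name."""
    return any(levenshtein(token.lower(), name.lower()) <= max_distance
               for name in name_list)

print(fuzzy_name_match("Greybard", ["Greybeard", "Goodall"]))  # True despite the typo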

There are a lot of examples of variants besides just typos and the like—there are also variants for drug and disease names. People often say aspirin, but the chemical name is acetylsalicylic acid (good luck trying to spell that under pressure), or ASA. Drugs can also be referred to by brand names (Advil) or generic names (ibuprofen), and they could have different names in different countries (acetaminophen in North America, or paracetamol in Europe). Shorthand can vary widely between departments, although there are some common ones, like hep B for hepatitis B, or HBV for the hep B virus, depending on the context.

Acronyms can be confused with common medical nomenclature. Confusion can also arise when dealing with postal codes, because Canadian postal codes alternate letters and digits: consider maternal pregnancy history depicted as G2P1A1, denoting gravida 2, parity 1, and abortions 1; C6C7T1, denoting cervical spine 6, cervical spine 7, and thoracic spine 1; or S1S2S4 nomenclature denoting heart sounds.

Names of people can also cause all sorts of confusion. They could clash with months of the year (April), and there are eponyms to worry about, where a person’s name could be a disease (Huntington), an index (Apgar), or a procedure (Bankart). These are direct identifiers, so it’s important to be able to distinguish between these examples and actual names, maybe using the context to figure it out. One way to do this is to have “exclusion lists” of procedure, index, and disease names that are not identified as people’s names.
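Here is a hedged sketch of the exclusion-list idea: a token that matches a known surname is flagged as a name only if it isn’t better explained as an eponym in its immediate context. The word lists and the crude one-token lookahead are purely illustrative.

# Hypothetical exclusion list: terms that look like surnames but are medical eponyms.
EPONYM_EXCLUSIONS = {"huntington", "apgar", "bankart", "parkinson"}
KNOWN_SURNAMES = {"greybeard", "huntington", "goodall"}

def looks_like_surname(token, next_token=None):
    """Treat a token as a surname only if it is not better explained as an eponym.

    A crude context check: 'Apgar score' or 'Huntington disease' is left alone,
    but a bare 'Huntington' with no medical noun after it is flagged as a possible name."""
    medical_context = next_token and next_token.lower() in {"disease", "score", "repair", "lesion"}
    if token.lower() in EPONYM_EXCLUSIONS and medical_context:
        return False
    return token.lower() in KNOWN_SURNAMES

print(looks_like_surname("Huntington", "disease"))  # False: eponym in context
print(looks_like_surname("Huntington", "reports"))  # True: probably a person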

Similar problems exist with location names (no one said this would be easy). Fergus Falls is a city in Minnesota, and “Fergus falls” is the description of some guy taking a fall. What if they’re not properly capitalized? Vienna, Wien, and Vena are all the capital of Austria (why it has three capitals is not clear). Villa Maria is the name of a villa, and it’s an area in Montreal. The Big Apple is New York or someone’s lunch, and the Pittsburgh of the South is the US city of Birmingham. All of these variants, composite names, and nicknames will add confusion to a text anonymization system.

Informatics for Integrating Biology and the Bedside (i2b2)

Informatics for Integrating Biology and the Bedside, better known as i2b2, is a research center in biomedical computing[65] that endeavors to be at the cutting edge of genomics and biomedical informatics. The center comprises six cores:

§ (Core 1) Science

§ (Core 2) Driving biology projects

§ (Core 3) Infrastructure

§ (Core 4) Education

§ (Core 5) Dissemination

§ (Core 6) Administration

It provides data other researchers can use in their own work, and in this case text data for anonymization. We’ll use this text data to show how the methods we’ve presented can be used in practice, but of course the results carry over to other data sets.

i2b2 Text Data Set

i2b2 has made publicly available a set of 889 discharge records from various medical departments of the health care system in Boston, MA, USA. These documents were manually annotated by a group of people and all personal health identifiers were replaced with supposedly realistic surrogates. But it’s far from perfect.

If you look at the names in the i2b2 documents closely you’ll find <PHI TYPE="PATIENT">Flytheleungfyfe</PHI>, <PHI TYPE="PATIENT">Cornea</PHI>, <PHI TYPE="DOCTOR">ah / BMot </PHI>, and <PHI TYPE="DOCTOR">A HAS</PHI>. Or how about <PHI TYPE="DOCTOR">Ala Aixknoleskaysneighslemteelenortkote</PHI>, a last name with 35 letters? Those are some very strange names, and this weirdness happens for at least 10% of the tagged PHI (maybe even as high as 25%).

We use this data set because it’s generally available, and it’s been used in a number of other evaluative studies. This would make it useful for comparisons, except that most, if not all, other free-form text anonymization systems have been evaluated using much less strict metrics.

The odd names tagged as PHI also make this rule-based system seem to perform far worse than it really would, because it’s very unlikely that any rules or lists will include these strange names. You can imagine that a model-based system might train on a lot of these oddities and therefore seem to perform better, but this might be very misleading. A rule-based system could be tuned to deal with such text, but that wouldn’t be useful for real text. i2b2 records are what they are, but we changed some of the annotations to make them more realistic and produce more meaningful results.

We divided the documents into two parts: a training set of 589 documents (66% of the corpus) and a test set of 300 documents (34% of the corpus). Let’s consider a few summary statistics to show you what the data is like (nothing fancy, just a frequency count, mean, median, standard deviation, and interquartile range of the number of identifiers per document). First we have the training documents, with summaries in Table 8-3.

Table 8-3. Per-element summary statistics of the i2b2 training subset of 589 documents (IQR = interquartile range)

PHI element            Count   Mean   Median   Std. dev.   IQR
First name             2237    3.79   4.00     2.17        3
Middle name            280     0.47   0.00     0.92        1
Last name              3257    5.52   5.00     3.74        4
Email                  357     0.60   1.00     0.48        1
Date                   5115    8.68   8.00     5.02        5
Phone number           133     0.22   0.00     0.56        0
Health care facility   1715    2.91   2.00     2.34        3
Age                    12      0.02   0.00     0.14        0
ID                     3118    5.29   5.00     1.32        2
Street                 24      0.04   0.00     0.22        0
City                   163     0.27   0.00     0.73        0
State                  66      0.11   0.00     0.40        0
ZIP code               11      0.01   0.00     0.16        0

Then there are the testing documents, with summaries in Table 8-4. Overall they’re very similar subsets of the corpus, but note that there’s a lot of variation in the number of names and dates per document (although this shouldn’t come as any surprise).

Table 8-4. Per-element summary statistics of the i2b2 testing subset of 300 documents (IQR = interquartile range)

PHI element            Count   Mean   Median   Std. dev.   IQR
First name             1092    3.64   4.00     1.44        1
Middle name            146     0.48   0.00     0.87        1
Last name              1411    4.70   4.00     2.26        3
Email                  261     0.87   1.00     0.33        0
Date                   1983    6.61   6.00     3.82        3
Phone number           99      0.33   0.00     0.54        1
Health care facility   685     2.28   2.00     1.43        2
Age                    4       0.01   0.00     0.14        0
ID                     1690    5.63   6.00     0.86        1
Street                 10      0.03   0.00     0.19        0
City                   50      0.16   0.00     0.66        0
State                  31      0.10   0.00     0.44        0
ZIP code               5       0.01   0.00     0.12        0

Risk Assessment

Let’s assume we’re disclosing the i2b2 data set to Researcher Ronnie (him again!), who’ll use it to develop machine learning models to classify patients based on their tendency to be readmitted. Given that the i2b2 data has been selected specifically for sharing, there are no sensitive diagnoses in there. So we’ll use a threshold of 0.1 for the overall risk.

Threat Modeling

We need to consider the three attacks T1, T2, and T3 from Step 3: Examining Plausible Attacks:

1. If Researcher Ronnie follows the HIPAA Security Rule (described in Implementing the HIPAA Security Rule), Pr(attempt) = 0.4, giving us a minimum recall of r ≥ 1 – √(0.1/0.4) = 0.5.

2. There were 845,806 discharges in the state in 2006,[66] which gives us a prevalence of 0.001 for the 889 i2b2 documents among those discharges, and a probability of knowing someone whose record is in the data set of Pr(acquaintance) = 0.145. This gives us a minimum recall of 0.17 under attack T2.

3. For a data breach we already have Pr(breach) = 0.27, based on historical data, giving us a minimum recall of 0.39.

Based on these three attacks, we have to take the strictest minimum recall, which is the 0.5 found for T1. Using this minimum recall makes sure that the overall risk of re-identification for this data set is at or below 0.1. For precision, we used the threshold of 80% defined in Standards for Precision.
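The three calculations above can be rolled into a few lines: compute the minimum recall implied by each attack and take the maximum. A minimal Python sketch, using the probabilities quoted above; the function name is just for this example.

from math import sqrt

def min_recall(risk_threshold, pr_context):
    """Minimum recall for one attack, from pr_context * (1 - r)**2 <= risk_threshold."""
    return max(0.0, 1 - sqrt(risk_threshold / pr_context))

threshold = 0.1
attacks = {"T1 (deliberate)": 0.4,       # Pr(attempt) under the Security Rule
           "T2 (acquaintance)": 0.145,   # Pr(acquaintance)
           "T3 (breach)": 0.27}          # Pr(breach)
per_attack = {name: round(min_recall(threshold, p), 2) for name, p in attacks.items()}
print(per_attack)                # minimum recalls of 0.5, 0.17, and 0.39
print(max(per_attack.values()))  # 0.5: the recall the anonymization system must meet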

A Rule-Based System

We built a rule-based system to illustrate how anonymization works in practice, and also to set a baseline for the kind of performance that you can expect from such a system. The basic rule-based system that we developed has a total of 125 rules, summarized in Table 8-5. By “basic” we mean a dozen pages of rules, and you could write a textbook about how this was done, so we’ll leave it at that.

Table 8-5. The number of rules by element type

PHI type                                             Number of rules
Person names (first, middle, last names)             36
Phone number                                         12
Date                                                 20
Email                                                2
ID                                                   18
Age                                                  1
Organization                                         14
Location (street, city, state, country, ZIP code)    26

Results

We evaluated the rule-based system’s ability to anonymize free-form text by applying it to the i2b2 documents, using our strict metrics. The rules were constructed using the 589 training documents, and the system was then evaluated on the 300 testing documents. As mentioned earlier, precision was evaluated using the micro-average, and recall using an “all-or-nothing” approach.

The recall and precision for each element of personal information were considered separately. The results in Table 8-6 and Table 8-7 apply to direct and indirect identifiers, respectively.

Table 8-6. Results of rule-based system on direct identifiers

PHI element      Precision          Recall              F2-measure         F2-measure
                 (micro-average)    (all-or-nothing)    (micro-average)    (all-or-nothing)
First name       1                  0.97                0.99               0.97
Middle name      1                  1                   1                  1
Last name        0.93               0.99                0.98               0.94
Email            1                  0.99                0.99               0.99
ID               0.98               0.95                0.99               0.95
Phone number     0.93               0.98                0.97               0.97

Table 8-7. Results of rule-based system on indirect identifiers

PHI element            Precision          Recall              F2-measure         F2-measure
                       (micro-average)    (all-or-nothing)    (micro-average)    (all-or-nothing)
Date                   0.95               0.90                0.97               0.92
Health care facility   1                  0.99                1                  0.99
Age                    0.8                1                   0.95               0.91
Street                 1                  0.78                0.83               0.81
City                   0.8                0.81                0.84               0.79
State                  0.8                0.9                 0.88               0.87
ZIP code               1                  1                   1                  1

For each element by itself, the overall risk levels fall below our thresholds. The precision for direct and indirect identifiers is above 80% and 70%, respectively. Recall is at or above 95% for the direct identifiers, and above 50% for indirect identifiers. Therefore, this text anonymization system would adequately manage the overall risk of disclosing the text data to Researcher Ronnie and can be used to prepare the data set for him.

What if Researcher Ronnie wanted to get a data set different from i2b2? Would this system still manage the risk adequately? There are two ways to address this question:

§ Evaluate the performance of the text anonymization system on the new data set.

§ Extrapolate from the i2b2 data set.

To evaluate the system on a different data set, we would need to select a sample of documents, annotate them, and then run the anonymization system on that annotated data set and evaluate the precision and recall as described earlier. If these values meet the thresholds defined for Researcher Ronnie, the rest of the documents can be anonymized using the same system and disclosed to him.

This approach is consistent with what we’ve been doing with structured data. Here, we would evaluate the precision and recall for the specific data set that we’re working with, and decide from that whether the risk is acceptable. But the result would be specific to the data set being disclosed.

Using the second approach would be the most defensible if there were some similarity between the i2b2 data set and the new data set that Researcher Ronnie has (e.g., if they’re both discharge data sets). That way there would be a strong basis for extending the results from the i2b2 evaluations to the second data set. In general, it’s easier to extrapolate to other data sets when the anonymization system has been evaluated on a wide variety of data sets to begin with—i.e., data from different document types and different facilities—because people write differently in different departments and facilities.

We would like to be able to extrapolate from other data sets to decide whether we can reliably anonymize a new data set using the text anonymization system, because this makes it much easier to use the same system repeatedly. But it can be challenging to get a variety of suitable data sets to work with. You’ll notice, however, that under the second approach we don’t evaluate the risk for free-form text in the same way as for structured data. In the second approach we build a system that has acceptable average performance and then apply it on new data sets without further evaluation. And that’s fine provided we extrapolate from similar data sets, to demonstrate that the system is reliable in the context of the data sets we want to anonymize.

Final Thoughts

Free-form text can be challenging to anonymize specifically because it’s free-form, as in “anything goes”: shorthand, acronyms, nicknames, spelling errors, you name it. Text anonymization systems exist in the academic literature, so it’s not an insurmountable problem. But evaluating the system so that it truly meets the requirements of health care, with assurances of low risk, needs to be done in an exacting way. For direct identifiers, there’s no leeway—recall, or sensitivity, needs to be 95% or higher. For indirect identifiers, we can use a risk-based method similar to what we did with a structured data set and evaluate the risk for our attack scenarios. This allows us to come up with a minimum recall that fits the context of our data release.

There’s evidence that rule-based systems tend to have better recall than model-based systems. But model-based systems tend to have better precision. A good way to design a text anonymization system is therefore to run a rule-based first pass with a model-based second pass to improve precision. We’ve shown you what you can expect from a rule-based system in terms of performance. A second-pass model-based system would need only to classify detected elements as false positives or true positives.


[59] C. van Rijsbergen, Information Retrieval, 2nd ed. (Oxford: Butterworth-Heinemann, 1979).

[60] M. Sokolova, “Evaluation Measures for Detection of Personal Health Information,” Proceedings of the Workshop on Biomedical Natural Language Processing in conjunction with the 8th International Conference on Recent Advances in Natural Language Processing (INCOMA, 2001), 19–26.

[61] O. Ferrandez, B. South, S. Shen, F.J. Friedlin, M. Samore, and S. Meystre, “Evaluating Current Automatic De-identification Methods with Veteran’s Health Administration Clinical Documents,” BMC Medical Research Methodology 12:(2012): 109.

[62] K. El Emam, D. Buckeridge, R. Tamblyn, A. Neisa, E. Jonker, and A. Verma, “The Re-identification Risk of Canadians from Longitudinal Demographics,” BMC Medical Informatics and Decision Making 11:1 (2011): 46.

[63] P. Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population,” Proceedings of the 5th ACM Workshop on Privacy in Electronic Society (New York: ACM, 2006), 77–80.

[64] GATE, A General Architecture for Text Engineering. Natural Language Processing Group, Sheffield University.

[65] Informatics for Integrating Biology and the Bedside (i2b2)

[66] HCUP Central Distributor SID. Number of Discharges by Year. Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality.