Anonymizing Health Data: Case Studies and Methods to Get You Started (2013)
Chapter 7. Data Reduction: Research Registry Revisited
We often see massive data sets that need to be de-identified. With such large data sets, we are looking for ways to speed up the computations. We noted earlier that the de-identification process is often iterative, and it’s important to expedite these iterations so we can reach a final data set rapidly. In enterprise settings there may be a time window for de-identification to complete so that subsequent analytics can be performed. This also demands more rapid execution of de-identification.
For large and complex data sets, traditional de-identification may also result in considerable information loss. We therefore need to find ways to reduce the size and complexity of the data that minimize information loss but still provide the same privacy assurances.
The Subsampling Limbo
A straightforward data reduction approach is subsampling, in which we randomly sample patients from the data set to create a subset of records; the subset is the data set that is then released. The sampling fraction is the proportion of patients that are included in the subset. In the case of a cross-sectional data set, we sample based on records; in the case of a longitudinal data set, we sample based on patients instead. Otherwise, we could end up with a lot of records from the patients with an extreme number of records, even though those patients are in the minority. That wouldn’t give us the correct sampling fraction, because we wouldn’t have a representative subsample of patients. What we want is a data set that is representative of the whole.
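To make the distinction concrete, here is a minimal sketch in Python (pandas) of patient-level versus record-level sampling. The column name patient_id and the data frame df are hypothetical stand-ins, not fields from any data set discussed here:

```python
import pandas as pd

def subsample_patients(df: pd.DataFrame, id_col: str, fraction: float,
                       seed: int = 42) -> pd.DataFrame:
    """Patient-level subsample for longitudinal data: sample patient IDs,
    then keep every record belonging to the sampled patients, so the
    sampling fraction applies to patients rather than records."""
    ids = df[id_col].drop_duplicates()
    sampled_ids = ids.sample(frac=fraction, random_state=seed)
    return df[df[id_col].isin(sampled_ids)]

# For a cross-sectional data set (one record per patient), record-level
# sampling is enough:
# subset = df.sample(frac=0.3, random_state=42)
```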
This seems simple enough, and subsampling can actually be a very powerful way to reduce the re-identification risk. That’s why national statistical agencies will generally only release a small subsample of census data: they do it as a way to manage re-identification risk. You may know that your neighbor A. Herring is in the census, because pretty much everyone is, but if the data being made publicly available is a random sample, then you don’t know whether A. Herring is included in the public data set or not.
How Low Can We Go?
Let’s be clear, though: not all data users will be happy with a sample. Generally, data analysts want as much data as they can get. But apart from the general desire for more data, subsampling can also reduce the statistical power of many tests (just to be clear, this has nothing to do with adversary power, discussed in Chapter 6). So subsampling won’t always be the best option for reducing re-identification risk.
Our cutoff is a sample of 3,000: if a subsample is smaller than that, then the statistical power of classical tests will take a nontrivial hit.[56] But for samples of more than 3,000, the statistical power will be above 90% for most tests. If a data set might contain outliers, or effects that are evident only in small subgroups, then 3,000 may be too low a number even if the overall statistical power is high.
On the flip side, it’s also possible to have too much data, resulting in models that are overfit and unreliable (i.e., they won’t validate on external data). Studies have suggested a maximum of 20 effective observations per candidate predictor, but more for narrowly distributed predictors.[57] Things get more complicated for some models,[58] so really a power analysis involving effect size should be used to determine an appropriate sample size. And if this doesn’t make any sense to you, then leave it to a statistician to sort out (they’re good that way).
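As a rough illustration of what such a power analysis looks like, here is a minimal sketch using the statsmodels package. The effect size, power, and significance level are assumed values chosen for illustration; the right numbers depend entirely on the planned analysis:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size needed per group to detect a small standardized effect
# (Cohen's d = 0.1) with 90% power in a two-sided t-test at alpha = 0.05.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.1, power=0.90, alpha=0.05)
print(round(n_per_group))  # roughly 2,100 per group under these assumptions
```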
Not for All Types of Risk
For average risk, subsampling has a significant impact and can be another technique to use in addition to generalization and suppression. If a data user wants more precision on a variable (i.e., less generalization and suppression), then we can offer to use a smaller sampling fraction in order to bring the risk down.
But for maximum risk, the reality is that subsampling has very little impact. We would have to create an extremely small subsample for it to have a meaningful impact on maximum risk. This means that subsampling will not be a very useful technique for the public release of data sets (although it may still be worth doing, depending on the context and the changes in risk that you’re looking for). For average risk, however, it can have a nontrivial impact and is a useful technique to employ.
Remember that the subsampling we’re talking about would be done completely at random, so it wouldn’t bias the results of the analysis, and wouldn’t require any special analytical methods. Of course, the subsampling can be done in a more complex way, where specific groups of observations are sampled less or more than others, but this also introduces complexities in risk measurement. And, to be honest, it would stretch the assumptions of current risk metrics and their estimators.
BORN to Limbo!
Let’s use the BORN data, which we first saw in Chapter 3, to see how subsampling affects risk. We’re disclosing the data to our good friend Researcher Ronnie, who now wants more precision in the variables and is willing to take a subsample to get it. The risk needs to be smaller than 0.115, and it’s average risk since it’s for a researcher. We use the same setup and quasi-identifiers as before, which we repeat in Table 7-1.
Table 7-1. Quasi-identifiers requested by Researcher Ronnie

| Field | Description |
| --- | --- |
| BSex | Baby’s sex |
| MDOB | Maternal date of birth |
| BDOB | Baby’s date of birth |
| MPC | Maternal postal code |
To give you a sense of things, we created random subsamples going down to only 5% of the data (OK, admittedly that’s not much), and for each subsample we measured the average re-identification risk. This process was repeated 1,000 times and the mean risk across all iterations was computed. We compared the original data set to a version with generalization (but less than before), as described in Table 7-2, to see how we can change the precision of variables when we’re subsampling.
Table 7-2. BORN again! Two versions of the data to see what subsampling will do to risk

| Data set | BSex | MDOB | BDOB | MPC |
| --- | --- | --- | --- | --- |
| Original | No change | dd/mm/year | dd/mm/year | 6 characters |
| With generalization | No change | year | week/year | 3 characters |
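Here’s a minimal sketch of this subsample-and-measure experiment in Python (pandas). The risk function is a deliberately simplified toy model, not the estimators we actually use: it treats the full registry as the population and discounts each sampled record’s match probability by the chance that the adversary’s target was sampled at all:

```python
import pandas as pd

QIS = ["BSex", "MDOB", "BDOB", "MPC"]  # the quasi-identifiers in Table 7-1

def average_risk(sample: pd.DataFrame, population: pd.DataFrame, qis) -> float:
    """Toy average risk: 1 over the population equivalence-class size,
    discounted by the sampling fraction (the chance the target is in
    the released sample at all), averaged over the sampled records."""
    class_sizes = population.groupby(qis).size().rename("F")
    merged = sample.join(class_sizes, on=qis)
    fraction = len(sample) / len(population)
    return (fraction / merged["F"]).mean()

def mean_risk(population: pd.DataFrame, qis, frac: float,
              iterations: int = 1000) -> float:
    """Draw many random subsamples and average the measured risk."""
    risks = [
        average_risk(population.sample(frac=frac, random_state=i),
                     population, qis)
        for i in range(iterations)
    ]
    return sum(risks) / len(risks)
```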
Now for the subsampling. Figure 7-1 shows that the average risk decreases rapidly as smaller samples are drawn. The question is, how low can we go? This is a dance we can only do with Researcher Ronnie, as he’s the one that needs to work with the final subsampled data set.
Figure 7-1. The lower Researcher Ronnie can go, the lower the risk
With all of the detailed variables and no generalization or suppression, subsampling achieves the 0.115 threshold with a sampling fraction of 0.3. The data set has 919,710 records, so a 30% subsample gives us 275,913 records. With generalization of the quasi-identifiers, we meet the risk threshold with a 35% subsample of 321,898 records. But the cost is a loss in precision on some of the quasi-identifiers.
If Researcher Ronnie wanted to perform detailed geospatial analysis, then he’d most likely want six-digit postal codes. With generalization on some of the other fields and suppression, we could even give him a larger subsample. It really comes down to what Researcher Ronnie needs, and how low he can go (in terms of sampling fraction, of course).
Many Quasi-Identifiers
Many health data sets have a large number of quasi-identifiers. We see that with registries and with survey data. A large number of variables might be collected directly from patients and their providers, or data might be linked with other data sets to create a much larger and more comprehensive data set. More quasi-identifiers means more opportunity to re-identify patients.
Having a large number of quasi-identifiers is an instance of “the curse of dimensionality”: there are so many variables that the risk of re-identification is going to be quite large. Managing the risk in such cases using traditional assumptions and methods would result in excessive generalization and suppression, with quite a lot of information removed from the data.
But we have to be realistic about what we assume an adversary can know about patients. When there are a large number of quasi-identifiers, we don’t have to consider all of them at the same time. If a data set has 20 quasi-identifiers, it’s not reasonable to assume that an adversary would have 20 different pieces of background knowledge about any individual in the database. This is a lot of information about individuals, especially when we’re referring to their physical or mental health. An adversary might only have five or seven pieces of background information that can be used for a re-identification attack.
This is the same concept of “adversary power” that we discussed in the context of the WTC registry, except now we are looking at adversary power from the perspective of a cross-sectional data set rather than a series of longitudinal events. The way we approach adversary power is a little different here because all of the patients will have the same number of quasi-identifiers (as opposed to each patient potentially having a different number of events), and we do not model diversity, as that concept is not really applicable here.
If the 20 quasi-identifiers are strongly correlated, then knowing one quasi-identifier might be enough to estimate a few more. Bootstrapping this idea further, the adversary might only need to know three or five things to estimate the full set of 20 quasi-identifiers. But that doesn’t change our approach, because evaluating the risk on a few quasi-identifiers would then be the same as looking at all the quasi-identifiers. So let’s assume they’re independent of one another.
Subsets of Quasi-Identifiers
Let’s assume that the adversary could only reasonably know 5 of the 20 quasi-identifiers. The question then is which 5 out of 20 do we consider when evaluating the re-identification probability? We need to choose a subset of 5 fields to measure re-identification risk and to de-identify exposed records. Since any combination of 5 fields is plausible, this becomes a combinatorial problem: there are 15,504 combinations of 5 items from a set of 20 (better known as 20 choose 5, or, of 20 items, how many combinations of 5 can we choose?).
We could use a brute force approach and measure the re-identification risk for all 15,504 possible combinations of five quasi-identifiers. If the measured risk were high for any of these 15,504 combinations, then we would apply our de-identification algorithms. The problem is that evaluating 15,504 combinations can take a long time. And this solution doesn’t scale well: for 21 fields, there are 20,349 combinations of 5; for 22 fields, there are 26,334 combinations; and so on. This would be a very inefficient way of solving the problem.
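A couple of lines of Python make the combinatorial blow-up easy to see (the field names q1 to q20 are placeholders):

```python
from itertools import combinations
from math import comb

# The brute force approach has to measure risk on every 5-field subset.
for n in (20, 21, 22, 25):
    print(f"{n} fields -> {comb(n, 5):,} combinations of 5")
# 20 fields -> 15,504 combinations of 5
# 21 fields -> 20,349 combinations of 5
# ...

quasi_identifiers = [f"q{i}" for i in range(1, 21)]
subsets = list(combinations(quasi_identifiers, 5))
print(len(subsets))  # 15504
```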
Let’s consider some shortcuts. We’ll refer to our quasi-identifiers as q1 to q20. We can consider each quasi-identifier individually; if the measured risk is high on any one individual quasi-identifier (say, q1), then it will by definition be high for all combinations of five quasi-identifiers that include q1. Date of birth is a good example, as it can be quite identifying. Grouped with state, gender, race, and body mass index, the main driver of risk would still be date of birth. The risk might be higher as a combination of five, but if it was already too high with date of birth alone then there’s no need to evaluate the combination of quasi-identifiers.
On the other hand, even if the measured risk on q1 is low, there may well be a combination of five quasi-identifiers that includes q1 that has a high measured risk. Take age generalized into 10-year bands. That seems much less identifying, but pair it with state, gender, race, and body mass index, and suddenly we’re not so sure. In this case we need to evaluate all combinations of five quasi-identifiers to ensure they have a low risk of re-identification. This isn’t much of a shortcut, then, because we still need to check the combinations of five for many quasi-identifiers.
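A sketch of this prescreen, with risk_fn and threshold standing in for whatever risk measurement and cutoff are being used (they are placeholders, not a specific implementation):

```python
from itertools import combinations

def high_risk_singles(df, qis, risk_fn, threshold):
    """Any quasi-identifier that is too risky on its own makes every
    5-subset containing it too risky, so flag it for treatment without
    evaluating those subsets."""
    return {q for q in qis if risk_fn(df, [q]) > threshold}

def subsets_still_to_check(qis, flagged, size=5):
    """Every 5-subset of the remaining low-risk fields still has to be
    evaluated, which is why this shortcut saves less work than we'd like."""
    rest = [q for q in qis if q not in flagged]
    return list(combinations(rest, size))
```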
Another shortcut, which is more efficient but incurs a slight cost in terms of information loss, is to use covering designs. We can fine-tune how much information loss is acceptable, balancing between precision and computation time.
Covering Designs
The idea of covering designs is to “cover” the combinations we’re interested in with a larger group of combinations. It’s easier to illustrate this with a small example. Let’s say that we have five quasi-identifiers and we want to evaluate the risk of re-identification for all combinations of three quasi-identifiers. But we also want to minimize the number of combinations that we evaluate. Table 7-3 shows all combinations of 3 from 5. This gives us 10 combinations.
Table 7-3. Every combination of five quasi-identifiers taken three at a time

| Set # | Combination |
| --- | --- |
| 1 | {q1, q2, q3} |
| 2 | {q1, q2, q4} |
| 3 | {q1, q2, q5} |
| 4 | {q1, q3, q4} |
| 5 | {q1, q3, q5} |
| 6 | {q1, q4, q5} |
| 7 | {q2, q3, q4} |
| 8 | {q2, q3, q5} |
| 9 | {q2, q4, q5} |
| 10 | {q3, q4, q5} |
Now let’s consider the fewest possible blocks of size 4, or combinations of four quasi-identifiers, that cover all combinations of three quasi-identifiers. This is shown in Table 7-4. Here we managed to reduce the number of combinations from 10 to 4, yet we still evaluate all possible combinations of three quasi-identifiers from a set of five. Table 7-4 is a (5,4,3) covering design (like 5 choose 3, except we add a middle piece, which is the block size, so that it’s more like 5 choose blocks of 4 to cover combinations of 3).
Table 7-4. Covering every combination of five quasi-identifiers taken three at a time in blocks of four

| Block | Set #’s covered |
| --- | --- |
| {q1, q2, q3, q4} | 1, 2, 4, 7 |
| {q1, q2, q3, q5} | 1, 3, 5, 8 |
| {q1, q2, q4, q5} | 2, 3, 6, 9 |
| {q1, q3, q4, q5} | 4, 5, 6, 10 |
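It’s easy to check mechanically that these four blocks really do cover all ten triples, as a short Python sketch shows:

```python
from itertools import combinations

qis = ["q1", "q2", "q3", "q4", "q5"]
blocks = [  # the (5,4,3) covering design from Table 7-4
    {"q1", "q2", "q3", "q4"},
    {"q1", "q2", "q3", "q5"},
    {"q1", "q2", "q4", "q5"},
    {"q1", "q3", "q4", "q5"},
]

# Every combination of three quasi-identifiers must be a subset of at
# least one block; then evaluating the 4 blocks covers all 10 triples.
triples = [set(t) for t in combinations(qis, 3)]
assert all(any(t <= b for b in blocks) for t in triples)
print(f"{len(blocks)} blocks cover all {len(triples)} triples")
```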
When we measure the re-identification probability for a covering design, we take the maximum probability across the combinations that are evaluated and use that as the measure for the whole data set. The rotten apple spoils the barrel. We’ve also developed a strategy to minimize the amount of information loss; compared to other strategies we tried, this one has resulted in the least distortion of the data:
1. Sort the blocks in the covering design by the percent of records that are high risk, with the largest percentages on top.
2. Starting from the top, apply generalization followed by suppression to the block until the risk is acceptably low.
3. Continue down the list of blocks, using generalization and suppression, until the risk is acceptably low for all of them.
Starting from the top of the list, applying generalization and suppression also lowers the risk for the blocks further down. In a lot of cases that means we don’t need to do anything to those later blocks, since the worst was dealt with first.
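In code, the strategy amounts to a worst-first loop. This is only a sketch: pct_high_risk, acceptable, and treat stand in for the risk measurement and the generalization and suppression steps, which depend on the tools at hand:

```python
def deidentify_by_blocks(df, blocks, pct_high_risk, acceptable, treat):
    """Sort covering-design blocks by the percent of high-risk records
    (worst first), then generalize and suppress each block until its
    measured risk is acceptably low."""
    ordered = sorted(blocks, key=lambda b: pct_high_risk(df, b), reverse=True)
    for block in ordered:
        while not acceptable(df, block):  # often already fine for later blocks
            df = treat(df, block)  # one round of generalization/suppression
    return df
```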
Covering BORN
We linked the BORN data set to socioeconomic data from the census, available at the postal code level. The census provides aggregate numbers, not exact matches to the individual records in BORN. There might be 30 people in postal code P0M1B0 that are visible minorities, but we don’t know which, if any, of these are in the BORN data. The only way to get the exact numbers would be to use secure linking, which we’ll discuss in Chapter 12. The alternative is to simulate the linkage. If we know there are 200 people in P0M1B0, then we randomly assign the status of visible minority to individual records with that same postal code, with a probability of 15% (30/200). For our purposes, we used the simulation approach to assign a value to each record in the BORN registry.
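A minimal sketch of that simulation in Python; the data frame, the column names, and the census_counts lookup are hypothetical stand-ins for the real BORN and census files:

```python
import numpy as np
import pandas as pd

def simulate_census_flag(df, postal_col, census_counts, seed=0):
    """Simulated linkage: within each postal code, flag each record with
    probability equal to the census proportion for that code (e.g., 30
    visible minorities out of 200 people = 15% for P0M1B0)."""
    rng = np.random.default_rng(seed)
    # census_counts maps postal code -> (count with attribute, total people)
    probs = df[postal_col].map(
        lambda pc: census_counts[pc][0] / census_counts[pc][1]
    )
    return rng.random(len(df)) < probs

# Hypothetical usage:
# born["MINORITY"] = simulate_census_flag(born, "MPC", census_counts)
```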
Linking BORN, with eight quasi-identifiers, to the census data added seven quasi-identifiers to the maternal-health registry, for a total of 15 (see Table 7-5). Using covering designs makes the de-identification of this data set a lot faster, because many fewer combinations of quasi-identifiers need to be evaluated than with a brute force approach of evaluating all combinations.
Table 7-5. BORN and census quasi-identifiers

| Origin | Variable | Description |
| --- | --- | --- |
| BORN | BSex | Baby’s sex |
| BORN | MDOB | Maternal date of birth |
| BORN | BDOB | Baby’s date of birth |
| BORN | MPC | Maternal postal code |
| BORN | PBabies | Number of previous live births |
| BORN | LHIN | The health region where the birth occurred |
| BORN | LANGUAGE | Language spoken at home (this field also appears in the census) |
| BORN | ABORIGN | Aboriginal status |
| Census | MINORITY | Visible minority |
| Census | MSTATUS | Marital status |
| Census | IMMIGRATION | Immigration status |
| Census | OCCUPATION | Occupation category |
| Census | PGROUP | Population group (e.g., West Indian, East Asian, Arab, African, and so on) |
| Census | INCOME | Income in $10k increments |
| Census | CERT | Highest educational certificate obtained |
In general, the number of blocks goes down as the block size increases (as you would expect). This can really speed up computations. However, the larger the blocks are, the more we end up overestimating the actual probability of re-identification, which results in more distortion to the data than is strictly necessary. We show this in Table 7-6, which is independent of the data. Note that the first entry in each column, a (15,k,k) covering, is equivalent to all combinations (15 choose k), the worst case.
Table 7-6. A cover up, to reduce the number of quasi-identifiers we need to consider. The numbers in this table are the numbers of blocks we need to evaluate.

| Block size (k) | (15,k,3) | (15,k,5) | (15,k,7) |
| --- | --- | --- | --- |
| 3 | 455 | – | – |
| 4 | 124 | – | – |
| 5 | 56 | 3003 | – |
| 6 | 31 | 578 | – |
| 7 | 15 | 189 | 6435 |
| 8 | 13 | 89 | 871 |
| 9 | 10 | 42 | 270 |
| 10 | 7 | 27 | 108 |
Final Thoughts
If you’re using average risk, subsampling is a simple and very effective way to lower the potential for re-identification. Not so much for maximum risk. Just make sure you have some idea of what kind of analytics will be performed, and you can probably limbo to a pretty low subsample and not have to worry. A power analysis would be a great way to be sure. In another dimension, you could have too many quasi-identifiers, and covering designs can help manage the risk in a more reasonable way than assuming the adversary knows all of them. All in all, it’s just more tools to help manage risk and our assumptions.
[56] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. (Hillsdale, NJ: LEA Publishers, 1988).
[57] F.E. Harrell Jr., Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis (New York: Springer, 2001).
[58] T. Snijders and R. Bosker, Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling (London: SAGE Publications, 1999).