Anonymizing Health Data: Case Studies and Methods to Get You Started (2013)

Chapter 3. Cross-Sectional Data: Research Registries

One of the first real-life de-identification challenges we faced was releasing data from a maternal-child registry. It’s common to create registries specifically to hold data that will be disclosed for secondary purposes. The Better Outcomes Registry & Network (BORN) of Ontario[34] integrates the data for all hospital and home births in the province, about 140,000 births per year. It was created for improving the provision of health care, and also for research. But without the ability to de-identify data sets, most research using this data would grind to a halt.

The data set from the BORN registry is cross-sectional, in that we cannot trace mothers over time. If a mother has a baby in 2009 and another in 2011, it’s simply not possible to know that it was the same woman. This kind of data is quite common in registries and surveys. We use the BORN data set a number of times throughout the book to illustrate various methods because it’s a good baseline data set to work with and one that we know well.

Process Overview

We’ll come back to BORN later. First we’ll discuss the general process of getting de-identified data from a research registry.

Secondary Uses and Disclosures

Once the data is in the registry, outside researchers can make requests for access to data sets containing individual-level records. There’s a process to approve these data requests, and it illustrates an important part of the de-identification process: health research studies need to be approved by Institutional Review Boards (IRBs) in the US or Research Ethics Boards (REBs) in Canada. We’ll use REB for short since BORN is in Canada, but really our discussion applies equally to an IRB. REBs are often the only institutional gatekeepers providing oversight of research projects. Their primary mandate is to deal with the ethics of research studies, which means that they often deal with privacy issues. For example, the primary issues with many study protocols that involve the secondary analysis of data are related to privacy.

BORN FOR RESEARCH

What are some of the drivers that influence the rate of late preterm labor in Ontario? This was a question asked by researchers, and luckily for them BORN was able to produce a de-identified data set they could use to answer it! As a result of this work they were able to develop demographic tables and risk estimates for obstetrical factors and how they’re associated with rates of late preterm birth (this work has been submitted for publication).

Does maternal exposure to air pollution influence the presence of obstetrical complications? BORN data was linked to pollution data to answer this question. Accurately linking the data required more information than would normally be allowed after de-identification, so instead it was done internally. The de-identified data set, with maternal and pollution data, was then provided to the research fellow working on this problem. The major finding was that the risk of preeclampsia is inversely associated with moderate levels of ambient carbon monoxide exposure,[35] which is consistent with previous studies (although it’s not yet clear why this relationship exists).

Does maternal H1N1 vaccination during pregnancy produce adverse infant and early childhood outcomes? Internal researchers at BORN linked maternal vaccination status from their own data to reported adverse childhood outcomes in the infants whose mothers were vaccinated (data provided by the Institute for Clinical Evaluative Sciences). This is ongoing research, but so far they’ve found that maternal vaccination during pregnancy is protective for the fetus and neonate.[36]

REBs shouldn’t perform the de-identification of a data set, and arguably they aren’t in a good position to determine whether a data set is de-identified or not. There are two main reasons for that:

§ In practice, many REBs don’t have privacy experts. It’s simply not a requirement to have a privacy expert on the board. And even if there is a privacy expert, identifiability issues are a subdiscipline within privacy. General Counsel Jasmine, a privacy lawyer on the board, might be able to address questions around legislation and data use agreements, but she might not have the technical expertise to deal with questions about the identifiability of the data.

§ The process of deciding whether a data set is identifiable and resolving re-identification risk concerns is iterative. Researcher Ronnie can easily agree to have dates of birth converted to ages, or to have admission and discharge dates converted to length of stay, if these are raised as presenting unacceptable re-identification risk. But at the same time, Researcher Ronnie might not be happy with suppressing some records in the data set. This process of deciding on the trade-offs to convert an original data set to a de-identified data set that is still useful for the planned research requires negotiations with the researcher. The REB process and workflow are not well equipped for these types of iterative interactions, and if they are attempted they can be very slow and consequently frustrating.

An alternative approach that tends to work better in practice is for the re-identification risk assessment to be done prior to submitting a research protocol[37] to the REB. With this approach, Privacy Pippa, an expert on re-identification, works with Researcher Ronnie to decide how to manage the risk of re-identification through security controls and de-identification. This process is quite interactive and iterative. For this to work, an individual expert like Privacy Pippa must be assigned to the researcher by the organization holding the data, rather than the process being conducted by a committee that reviews data requests at, say, monthly meetings. Privacy Pippa could be assigned to work with multiple researchers at the same time, but it’s always one expert per protocol. After Privacy Pippa and Researcher Ronnie have agreed on how to de-identify the data, a standard risk assessment report is produced and sent to the REB. We’ll show you some simple examples of such a report at the end of this chapter.

The risk assessment report or certificate of de-identification the REB receives explains how the re-identification risk has been managed. This issue is then considered dealt with, and the board can move on to any other issues they have with the protocol. They’re still accountable for the decision, but they essentially delegate the analysis to Privacy Pippa, who conducts the assessment and negotiations with Researcher Ronnie. Moving the re-identification risk assessment outside the board to the expert on de-identification ensures that the assessment is done thoroughly, that it’s based on defensible evidence, and that the final outcome on this issue is satisfactory to the board as well as the researcher.

The REB also defines the acceptability criteria for the de-identification. For example, the board might say that the probability of re-identification must be below 0.2 for data that will be provided to researchers within the country, and 0.1 for researchers in other countries. Privacy Pippa would then use these as parameters for the assessment and negotiations with whatever researcher she’s dealing with. The board might also stipulate other conditions, such as the requirement for Researcher Ronnie to sign a data sharing agreement. Privacy Pippa ensures that these conditions are met before the protocol is submitted to the board. This is the process that’s in place with the BORN registry.
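As a trivial sketch (the structure and names below are ours, not an actual BORN or REB artifact), these criteria amount to a couple of parameters that the de-identification expert then works against:

```python
# Hypothetical encoding of an REB's acceptability criteria; the thresholds are the
# ones quoted above, but the structure and names are illustrative only.
REB_CRITERIA = {
    "max_risk_domestic": 0.2,       # researchers within the country
    "max_risk_international": 0.1,  # researchers in other countries
    "conditions": ["signed data sharing agreement"],
}
```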

Getting the Data

There’s a process to getting health research data, summarized in Figure 3-1. The key players are Researcher Ronnie, a scientific review committee, a data access committee (DAC), the REB, and Database Administrator Darryl.

Figure 3-1. The process of getting health research data requires the involvement of a fair number of people

The scientific review committee might be a committee formed by a funding agency, or peers at a particular institution. The exact way that the scientific review is done isn’t the topic of this book, but it’s included here to illustrate a few key points. The DAC consists of de-identification and privacy experts who perform a re-identification risk assessment on the protocol.

The DAC needs to have access to tools that can perform a re-identification risk assessment. These tools also need to be able to analyze the original data being requested in order to perform that risk assessment. Database Administrator Darryl is responsible for securely holding the data with protected health information (PHI) and for having an appropriate de-identification tool in place to de-identify the data.

CREATING A FUNCTIONAL DAC

In setting up the DAC, there are a number of critical success factors to consider:

Expertise

The HIPAA Privacy Rule defines a de-identification expert as “a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable.” Only recently have applied courses started coming out that can help individuals develop the necessary expertise—these should start having an impact on the available pool of experts soon. Having a well-defined methodology to follow, like we describe in this book, should provide concrete guidance on the steps to consider and the issues to address specifically for health data.

Duration

One of the common complaints about the use of de-identification experts is that the process takes a long time. If we add another layer into the data release process, in the form of the DAC, will this further slow access to data? Automation could help speed up some of the iterative analysis that needs to be done, and it also reduces the amount of detailed knowledge that a DAC member needs to acquire. A key is to also have dedicated resources in the DAC.

Performance measurement

Measuring the performance of the DAC in general is quite important. The dedicated DAC resources can then be evaluated on a well-defined set of metrics, and this creates an incentive to optimize on these metrics. Useful metrics would include turnaround time and researcher satisfaction (as measured through surveys).

Formulating the Protocol

Researcher Ronnie submits the research protocol to the scientific review committee and the DAC. In practice there might be some iteration between the scientific review process and the DAC process.

The DAC performs a re-identification risk assessment and decides how to adequately de-identify the data set requested by Researcher Ronnie. This process might result in changing the precision of the data that’s requested. For example, the original protocol might request admission and discharge dates, but the risk assessment might recommend replacing that with length of stay in days. Such changes in the data might require changes in the protocol as well.

If the protocol changes, the scientific review might have to be revisited. Also, during the scientific review, methodological or theoretical issues might be raised that could affect the requested data elements. If the requested data elements change, the re-identification risk assessment might have to be revisited. Therefore, at least conceptually, there’s potentially some interaction and possibly iteration between the scientific review process and the re-identification risk assessment process performed by the DAC.

NOTE

In practice, the interaction between scientific review and data access review isn’t often possible because of the way peer review is structured (e.g., with the research funding agencies). Since there’s not likely to be any interaction or iteration between scientific review and data access review, we can save time by doing these activities in parallel, or sequence them and hope for the best!

If either the scientific review committee or the DAC doesn’t approve the protocol, it goes back to Researcher Ronnie for a revision. If the scientific review committee approves the protocol, it provides some kind of approval documentation, such as a letter.

Negotiating with the Data Access Committee

Researcher Ronnie provides the DAC with the protocol as well as a variable checklist. This checklist is quite important because it clarifies the exact fields that are requested. It also highlights to Researcher Ronnie which fields in the requested database are quasi-identifiers and might therefore undergo some kind of generalization and suppression.

The checklist allows Researcher Ronnie to indicate the level of data granularity that he’ll accept. For example, he might be willing to get the year of birth instead of the full date of birth. If this is explicitly specified up front in the checklist, it will most likely significantly reduce the number of iterations between Researcher Ronnie and the DAC. The checklist should also contain information about the importance of the quasi-identifiers—an important quasi-identifier should be minimally impacted by the de-identification. The more trade-offs that Researcher Ronnie is willing to make up front, the less time the re-identification risk analysis will require.
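As a minimal sketch of what such a checklist might capture (the field names and structure below are our own illustration, not an actual DAC template), each requested field gets flagged as a quasi-identifier or not, along with the least granular form the researcher will accept and how important the field is to the analysis:

```python
# Hypothetical variable checklist for a data request; field names and values are
# illustrative only, not an actual DAC template.
variable_checklist = [
    # field, is it a quasi-identifier?, least granular form the researcher will accept,
    # and how important the field is to the planned analysis (1 = low, 5 = high)
    {"field": "date_of_birth",  "quasi_identifier": True,  "acceptable_granularity": "year",           "importance": 2},
    {"field": "admission_date", "quasi_identifier": True,  "acceptable_granularity": "length of stay", "importance": 4},
    {"field": "postal_code",    "quasi_identifier": True,  "acceptable_granularity": "first 3 chars",  "importance": 5},
    {"field": "diagnosis_code", "quasi_identifier": False, "acceptable_granularity": "as collected",   "importance": 5},
]
```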

The DAC determines how to appropriately de-identify the data given the risks, and negotiates the chosen process with Researcher Ronnie. This negotiation could take a number of iterations, but these iterations should be relatively quick because a single individual from the DAC is assigned to negotiate with the researcher. The point is not to create another layer of bureaucracy, but to negotiate trade-offs. The output from this process consists of two things:

Risk assessment results

This consists of a report indicating the de-identification that will be applied as well as the residual risk in the data set that will be disclosed. We’ll provide an example in Risk Assessment.

Data sharing agreement

Because the amount of de-identification would normally be contingent on the security and privacy practices that Researcher Ronnie has in place, he must commit to implementing these practices in a data sharing agreement. This agreement isn’t always needed. For example, if Researcher Ronnie is an employee of a hospital and the data comes from the hospital, then he will be bound by an employment contract that should cover the handling of sensitive patient data. However, if Researcher Ronnie is external to the hospital or at a university, then a data sharing agreement would most certainly be recommended. A different data sharing agreement would be needed for every project because the specific terms may vary depending on the data required.

Once it gets these two items, the REB will have sufficient evidence that the residual risk of re-identification is acceptably low and will know the terms of the data sharing agreement that Researcher Ronnie will be signing for this particular data release. There’s evidence that many Canadian REBs will waive the requirement to obtain patient consent if they’re convinced that the requested data set is de-identified. And now the REB can perform a regular ethics review knowing that the privacy issues have been addressed.

If the REB approves the protocol, this information is conveyed to Database Administrator Darryl, who then creates a data set according to the risk assessment report. Database Administrator Darryl then provides the data to Researcher Ronnie in some secure format.

If the REB doesn’t approve the protocol for reasons not related to re-identification risk, Researcher Ronnie has to resubmit the protocol at some later point with the changes required by the REB. If the protocol isn’t approved because of an issue related to re-identification risk, Researcher Ronnie has to go through the process again with the DAC to perform another risk assessment.

BORN Ontario

Now back to our case study. BORN is a prescribed registry under Ontario’s health privacy legislation, and is allowed to use and disclose the data collected. But it’s obligated to protect the personal health information of mothers and their babies. Now, that doesn’t mean it has to lock it down—it wants the data to be used to improve the maternal-child health care system. But this legislation does mean that there must be safeguards in place, which we’ll get to.

The data is collected through a number of mechanisms, including manual data entry and automated extraction and uploads from health record systems. The registry includes information about the infants’ and mothers’ health. Several sources contribute to the data collected by BORN:

§ Prenatal screening labs

§ Hospitals (labor, birth, and early newborn care information including NICU admissions)

§ Midwifery groups (labor, birth, and early newborn care information)

§ Specialized antenatal clinics (information about congenital anomalies)

§ Newborn screening labs

§ Prenatal screening and newborn screening follow-up clinics

§ Fertility clinics

BORN Data Set

Researchers will request all kinds of data sets from BORN to answer interesting questions relating to mothers and their children. The sidebar BORN for Research gives a few examples, but let’s just take one simple example to demonstrate the process we’ve been describing.

Researcher Ronnie at an Ontario university made a request for the clinical records in the registry. The registry doesn’t collect any direct identifiers, such as names or social insurance numbers, so a researcher can only request indirect identifiers and clinical variables. There are a lot of variables in the registry, so we’ll focus on the fields that were requested.

No specific cohort was requested by Researcher Ronnie, so all records in the registry were included, from 2005–2011. At the time of analysis the registry had 919,710 records. The fields requested are summarized in Table 3-1. These are the quasi-identifiers, because fields that are direct identifiers can’t be released at all. Researcher Ronnie didn’t request highly sensitive data, such as data on congenital anomalies or maternal health problems, which plays a role in our selection of a risk threshold.

Table 3-1. Quasi-identifiers requested by Researcher Ronnie

Field   Description
BSex    Baby’s sex
MDOB    Maternal date of birth
BDOB    Baby’s date of birth
MPC     Maternal postal code

We can’t generalize the sex variable, but a generalization hierarchy was defined for each of the other three variables, shown in Table 3-2. A generalization hierarchy is a list of changes that the organization could potentially apply to the data, arranged from the most specific (and most risky) representation to a less specific (and safer) representation.

Table 3-2. Generalization hierarchy for the quasi-identifiers

Field   Generalization hierarchy
MDOB    dd/mm/year → week/year → mm/year → quarter/year → year → 5-year interval → 10-year interval
BDOB    dd/mm/year → week/year → mm/year → quarter/year → year → 5-year interval → 10-year interval
MPC     Cropping the last x character(s), where x is 1 → 2 → 3 → 4 → 5
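
To make the hierarchies in Table 3-2 concrete, here’s a minimal sketch of the kinds of transformations each level implies (the function names and level labels are ours; a real de-identification tool would apply these as configured generalization steps):

```python
from datetime import date

# Illustrative generalization functions for the hierarchies in Table 3-2.
# Level labels and function names are our own.

def generalize_dob(dob: date, level: str) -> str:
    """Generalize a date of birth from dd/mm/year up to a 10-year interval."""
    if level == "dd/mm/year":
        return dob.strftime("%d/%m/%Y")
    if level == "week/year":
        return f"{dob.isocalendar()[1]:02d}/{dob.isocalendar()[0]}"  # ISO week/year
    if level == "mm/year":
        return dob.strftime("%m/%Y")
    if level == "quarter/year":
        return f"Q{(dob.month - 1) // 3 + 1}/{dob.year}"
    if level == "year":
        return str(dob.year)
    if level == "5-year interval":
        start = dob.year - dob.year % 5
        return f"{start}-{start + 4}"
    if level == "10-year interval":
        start = dob.year - dob.year % 10
        return f"{start}-{start + 9}"
    raise ValueError(f"unknown generalization level: {level}")

def generalize_postal_code(postal_code: str, chars_to_crop: int) -> str:
    """Crop the last x characters of a six-character Canadian postal code."""
    return postal_code[: len(postal_code) - chars_to_crop]

# Example: one maternal date of birth and postal code at two generalization levels.
print(generalize_dob(date(1979, 8, 23), "quarter/year"))  # Q3/1979
print(generalize_postal_code("K1H8L1", 3))                # K1H
```

The farther down a hierarchy a field is pushed, the less information it carries, which is exactly the trade-off the risk assessment negotiates.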

Risk Assessment

Based on a detailed risk assessment, as described in Step 2: Setting the Threshold, we determined that an average risk threshold of 0.1 would be appropriate for Researcher Ronnie and the data he requested. Had he requested highly sensitive variables, such as data on congenital anomalies or maternal health problems, the risk threshold would have been set to 0.05.

Threat Modeling

Now we examine the three plausible attacks for a known data recipient that were discussed in Step 3: Examining Plausible Attacks:

1. For a deliberate attempt at re-identification, we need to set the value for the probability of attempt, Pr(attempt), based on expert opinion. In this case the researcher was at an academic institution with a data agreement in place, in the same country, and with no obvious reason to want to re-identify records in the registry (low motives and capacity). That being said, Researcher Ronnie didn’t have much in the way of security and privacy practices, just the bare minimum (low mitigating controls). So although the data wasn’t sensitive, the detailed risk assessment concluded that an appropriate estimate would be Pr(attempt) = 0.4.

2. Evaluating an inadvertent attempt at re-identification requires an estimate of the probability that a researcher will have an acquaintance in the data set, Pr(acquaintance). A first estimate would simply be the probability of knowing a woman that has had a baby in any one year between 2005–2011 (assuming that Researcher Ronnie at least knows the year of birth). The worst-case estimate would be based on the year 2008, in which there were 119,785 births out of 4,478,500 women between the ages of 14 and 60. That’s a prevalence of 0.027, which results in Pr(acquaintance) = 1 − (1 − 0.027)^(150/2) = 0.87. In this case we divided the estimated number of friends, 150, in half because we were only considering women (male pregnancies would most certainly be outliers!).

3. For a data breach we already have Pr(breach) = 0.27, based on historical data.

Our basic measure of risk boils down to Pr(re-id, T) = Pr(T) × Pr(re-id | T), where the factor Pr(T) is one of the previously mentioned probabilities. The higher the value of Pr(T), the more de-identification we need to perform on the data set. So in this case we were most concerned with an inadvertent attempt at re-identification, with Pr(re-id, acquaintance) = 0.87 × Pr(re-id | acquaintance) ≤ 0.1, where the last factor was determined directly from the data set. We therefore have Pr(re-id | acquaintance) ≤ 0.115.
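
Putting these numbers together, a short sketch of the arithmetic, using only the values quoted in this section, looks like this:

```python
# Worked arithmetic for the threat modeling above; the inputs are the values
# quoted in the text, and the code just restates the formulas.

pr_attempt = 0.4   # deliberate re-identification attempt (expert opinion)
pr_breach = 0.27   # data breach, from historical data

# Inadvertent attempt: probability of having an acquaintance in the data set.
prevalence = 119_785 / 4_478_500   # births among women aged 14-60, worst-case year
female_friends = 150 / 2           # half of an estimated 150 friends are women
pr_acquaintance = 1 - (1 - prevalence) ** female_friends
print(round(pr_acquaintance, 2))   # ~0.87

# Overall risk is Pr(re-id, T) = Pr(T) * Pr(re-id | T), and it must stay <= 0.1.
# The largest Pr(T) drives the de-identification, so the data itself must satisfy:
overall_threshold = 0.1
max_pr_t = max(pr_attempt, pr_breach, pr_acquaintance)
print(round(overall_threshold / max_pr_t, 3))   # ~0.115, i.e., Pr(re-id | acquaintance) <= 0.115
```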

Results

With a risk threshold in hand, and a generalization hierarchy to work with, we produced a de-identified data set with a minimized amount of cell suppression.[38] Our first attempt resulted in a data set with MDOB in year, BDOB in week/year, and MPCs of one character, leading to the missingness and entropy shown in Table 3-3. These are the same measures we first saw in Information Loss Metrics.

Table 3-3. First attempt at de-identification

         Cell missingness   Record missingness   Entropy
Before   0.02%              0.08%
After    0.75%              0.79%                58.26%
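
To clarify what the first two columns measure, here’s a rough sketch of how cell and record missingness could be computed from an extract, treating suppressed or missing values as None (the entropy column is the information loss metric described earlier in the book and isn’t reproduced here):

```python
from typing import Optional, Sequence

# Rough sketch of the missingness measures in Tables 3-3 to 3-5. A record here is
# a tuple of quasi-identifier values, with None marking a suppressed or missing cell.

Record = Sequence[Optional[str]]

def cell_missingness(records: Sequence[Record]) -> float:
    """Fraction of quasi-identifier cells that are missing or suppressed."""
    total_cells = sum(len(r) for r in records)
    missing_cells = sum(1 for r in records for v in r if v is None)
    return missing_cells / total_cells

def record_missingness(records: Sequence[Record]) -> float:
    """Fraction of records with at least one missing or suppressed cell."""
    return sum(1 for r in records if any(v is None for v in r)) / len(records)

# Tiny example: 4 records x 3 quasi-identifiers, with one suppressed cell.
sample = [("F", "Q3/2009", "K1H"),
          ("M", "Q1/2009", None),
          ("M", "Q2/2009", "K2P"),
          ("F", "Q4/2009", "K1H")]
print(f"{cell_missingness(sample):.2%}")    # 8.33%
print(f"{record_missingness(sample):.2%}")  # 25.00%
```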

Researcher Ronnie explained that geography was more important than we had originally thought, and suggested that the first three digits of postal codes would provide the geographic resolution he needed. He also clarified that the mother’s date of birth didn’t need to be overly precise. So we made it possible to increase the precision of the postal code, without increasing risk, by decreasing the precision of other measures. Our second attempt resulted in a data set with MDOB in 10-year intervals, BDOB in quarter/year, and MPCs of three characters, leading to the missingness and entropy values in Table 3-4.

Table 3-4. Second attempt at de-identification

         Cell missingness   Record missingness   Entropy
Before   0.02%              0.08%
After    0.02%              0.08%                64.24%

Entropy increased here because of the large change in the level of generalization of the mother’s date of birth. But given that no noticeable suppression was applied to the data, Researcher Ronnie asked that we try to increase the precision of the baby’s date of birth. Our third attempt resulted in a data set with MDOB in 10-year intervals, BDOB in month/year, and MPCs of three characters, leading to the missingness and entropy values in Table 3-5.

Table 3-5. Third attempt at de-identification

         Cell missingness   Record missingness   Entropy
Before   0.02%              0.08%
After    26.7%              26.7%                59.59%

Given the massive increase in missingness, Researcher Ronnie requested the second attempt at producing a de-identified data set, with MDOB in 10-year intervals, BDOB in quarter/year, and MPCs of three characters, leading to the missingness and entropy values in Table 3-4.

Year on Year: Reusing Risk Analyses

Let’s say Researcher Ronnie had only asked for a data set for one particular year, say 2005. The next year he decides to run a follow-up study and asks for the same data, but for 2006 (he no longer has the 2005 data, because it was deleted as part of the data retention rules). Can we use the same de-identification scheme that we used for the 2005 data, with the same generalizations, or do we need to do a new risk assessment on the 2006 data? If the de-identification specifications are different, it will be difficult to pool statistical results.

This question comes up a lot, so we decided to test it using BORN data sets from the years 2005–2010. The initial size and missingness for the data sets for each year are shown in Table 3-6. We applied our de-identification algorithm to the 2005 data set first, using the same risk threshold of 0.115 as before. With BSex unchanged, the result was MDOB in year, BDOB in week/year, and MPCs of three characters.

Table 3-6. Summary of data sets over six years

Year   No. of records   Cell missingness   Record missingness
2005   114,352          0.02%              0.07%
2006   119,785          0.01%              0.06%
2007   129,540          0.02%              0.06%
2008   137,122          0.02%              0.07%
2009   138,926          0.01%              0.05%
2010   137,351          0.03%              0.14%

We then applied exactly the same generalizations to each of the subsequent years, from 2006–2010, with suppression applied until the average risk was below our threshold of 0.115 (a rough sketch of this fixed-scheme approach appears after the list below). The results are shown in Table 3-7, and as you can see they’re quite stable across years. This means that, for this data set, there’s no need to re-evaluate the risk of re-identification for subsequent years. There are a couple of advantages to this result:

§ We can do re-identification risk assessments on subsets, knowing that we would get the same results if we ran the analysis on the full data set. This can reduce computation time.

§ It allows BORN to have a data sharing agreement in place and provide data sets to researchers on an annual basis using the same specification without having to re-do all of the analysis again.
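
Here’s a rough sketch of what reusing a fixed scheme looks like, assuming the common formulation where a record’s risk is one over the size of its equivalence class and average risk is the mean across records; this is a simplification for illustration, not BORN’s actual algorithm:

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[str, ...]   # generalized quasi-identifier values for one record

def average_risk(records: List[Record]) -> float:
    """Average re-identification risk: mean of 1/(equivalence class size)."""
    sizes = Counter(records)
    return sum(1 / sizes[r] for r in records) / len(records)

def suppress_until_below(records: List[Record], threshold: float) -> List[Record]:
    """Drop records from the smallest equivalence classes until the average
    risk of what remains is at or below the threshold."""
    kept = list(records)
    while kept and average_risk(kept) > threshold:
        sizes = Counter(kept)
        riskiest = min(sizes, key=lambda r: sizes[r])   # smallest class first
        kept.remove(riskiest)
    print(f"record missingness: {(len(records) - len(kept)) / len(records):.2%}")
    return kept

# Reusing the 2005 scheme on a later year: generalize with the same functions,
# then re-run only the suppression step against the 0.115 threshold.
# (generalize_2005_scheme and raw_2006 are hypothetical placeholders.)
# year_2006 = [generalize_2005_scheme(r) for r in raw_2006]
# deidentified_2006 = suppress_until_below(year_2006, 0.115)
```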

Table 3-7. De-identification over six years, using the same generalizations

Year   Cell missingness   Record missingness   Entropy
2005   3.41%              3.42%                61.04%
2006   2.54%              2.54%                61.27%
2007   1.64%              1.65%                61.28%
2008   1.24%              1.26%                61.26%
2009   1.15%              1.16%                61.25%
2010   1.42%              1.47%                61.23%

That being said, the BORN data pertains to very stable populations—newborns and their mothers. There haven’t been dramatic changes in the number or characteristics of births in Ontario from 2005–2010. The stability of the data contributes to the stability of de-identification. If we had changed the group of quasi-identifiers used from one year to the next, the results would not hold.

Final Thoughts

There are a lot of people overseeing access to data in the health care system, and that’s a good thing. But luckily, there’s usually a system in place to release the data for secondary purposes, especially research. Evidence-based health care requires data, and patients want their privacy. Together they have driven the need for a process of access with oversight.

Sometimes researchers expect that de-identification will not give them useful data. However, if done properly this is not going to be the case. The key point to remember, at least with regard to de-identification, is that the researcher needs to know the process and be prepared to negotiate with respect to the controls that need to be put in place, the available data elements, and the trade-offs needed to ensure the utility of information given the intended analysis.

After negotiating with a DAC, the de-identification risk assessment results need to be provided to a research ethics or institutional review board; this is a key reason for documenting the process of de-identification.

We get asked a lot about stability of de-identification, and with the BORN data we were able to show that, at least with a stable population, it seems to hold its own. But it’s a good idea to revisit a de-identification scheme every 18 to 24 months to ensure that there haven’t been any material changes to the data set and the processes that generate it. And to be clear, the de-identification scheme should be revisited if any quasi-identifiers are added or changed—that would be considered a material change.


[34] The Better Outcomes Registry & Network (BORN) of Ontario

[35] D. Zhai, Y. Guo, G. Smith, D. Krewski, M. Walker, and S.W. Wen, “Maternal Exposure to Moderate Ambient Carbon Monoxide Is Associated with Decreased Risk of Preeclampsia,” American Journal of Obstetrics & Gynecology, 207:1 (2012): 57.e1–9.

[36] D.B. Fell, A.E. Sprague, N. Liu, A.S. Yasseen, S.W. Wen, G. Smith, and M.C. Walker, “H1N1 Influenza Vaccination During Pregnancy and Fetal and Neonatal Outcomes,” American Journal of Public Health 102:6 (2012): e33–e40.

[37] A research protocol outlines the methods and procedures that will be used to conduct the research.

[38] K. El Emam, F. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J.-P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, and J. Bottomley, “A Globally Optimal k-Anonymity Method for the De-identification of Health Data,” Journal of the American Medical Informatics Association 16:5 (2009): 670–682.