LibGuides: Sensitive Research Data: De-identification

How to de-identify sensitive data?

When deciding which variables to de-identify, weigh, for each variable, the value of the information for data analysis against the risk it poses to data security. If the data is too sensitive to de-identify safely, you may decide to destroy the data or to share it only under restricted access.

The risk of re-identification can depend on:

the sensitivity of the dataset's topic (re-identification does more harm when the topic is more sensitive)
the specificity of the identifier (a detailed job title is easier to search for than a generic one)
the size (rows/observations) of the dataset overall (fewer rows make it easier to narrow down the options)
the size and composition of the population (in a smaller group, it is easier to guess who participated in the study)
the combination of information in the dataset (particular variables, taken together, can make a guess more feasible)
the recency of the data collected (newer data is more relevant to today)
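
To illustrate how a combination of variables can raise risk even when each variable looks safe on its own, here is a minimal Python sketch. It counts how many rows share each combination of values and reports the smallest group size (the "k" in k-anonymity). The function name, column names, and records are hypothetical and made up for illustration; a dedicated tool should be used for a real risk assessment.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    combination of values for the given columns (0 if there are no rows)."""
    if not rows:
        return 0
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical records, invented for this example only.
rows = [
    {"age_band": "30-39", "county": "A"},
    {"age_band": "30-39", "county": "A"},
    {"age_band": "30-39", "county": "B"},
    {"age_band": "40-49", "county": "B"},
    {"age_band": "40-49", "county": "B"},
    {"age_band": "40-49", "county": "A"},
]

# Each variable alone puts every person in a group of three...
k_age = smallest_group_size(rows, ["age_band"])
k_county = smallest_group_size(rows, ["county"])

# ...but together they single out some people (group size of one).
k_combined = smallest_group_size(rows, ["age_band", "county"])

print(k_age, k_county, k_combined)
```

A smallest group size of 1 means at least one person is unique on that combination of variables, which is a signal to generalize or remove one of them before sharing.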

Governmental guidance on de-identification methods
The US Department of Health and Human Services outlines the steps expected for de-identifying and protecting health information. This guidance does not teach you how to carry out the steps; it only outlines which steps you need to consider.
Techniques for Quantitative and Qualitative Data
Purdue University Libraries provides a list of different technique options and their definitions for de-identifying quantitative and qualitative data.
Example of de-identification per (quantitative) variable
This tip sheet from UC Davis provides an example of what would need to be de-identified per variable in a dataset.
De-identification guidelines for Qualitative data
The Qualitative Data Repository provides specific steps that can be taken to de-identify text (such as transcripts).
Example of de-identification on (qualitative) transcript
This transcript, marked up with tracked changes by the UK Data Service, shows what to consider changing when de-identifying qualitative text data.
Click the downloadable PDF file in this repository record.
Data Confidentiality Guide
This self-paced guide, created by the Australian Bureau of Statistics, walks through the components of handling sensitive data: it gives an overview of the Five Safes framework, describes various confidentiality techniques, and shows how to confidentialize your own data.
Anonymisation
This recorded training webinar (~1 hour) by CESSDA provides an overview of k-anonymity and the use of sdcMicro to anonymize quantitative data.
How to De-identify your data: balancing statistical accuracy and subject privacy in large social-science data sets
This is an ACM journal article (2015) by Harvard researchers on how to de-identify data.

Tools to help de-identify data

The Observatory of Anonymity
Check whether combining variables creates a chance of re-identification.
sdcMicro (R package)
This R package includes various risk estimation methods.
This is an open source tool
CliniDeID
CliniDeID® automatically de-identifies clinical notes and structured data according to the HIPAA Safe Harbor method. It accurately finds identifiers and tags or replaces them with realistic surrogates for better anonymity. It improves access to richer, more detailed, and more accurate clinical data for clinical researchers.
This is an open source tool
NLM-Scrubber
NLM-Scrubber is a freely available clinical text deidentification tool designed and developed at the National Library of Medicine. Our aim is to enable clinical scientists to access clinical health information that is not associated with the patient by following the Safe Harbor principles as outlined in the HIPAA Privacy Rule.
This is an open source tool
ARX Data Anonymization Tool
Developed at the Technical University of Munich, this tool helps transform structured (i.e., tabular) sensitive personal data using selected methods from the broad area of statistical disclosure control.
This is an open source tool
QualiAnon
QualiAnon is a tool developed by Qualiservice in cooperation with Pangaea to support researchers in the anonymization and pseudonymization of research data.
This is an open source tool
Amnesia
Amnesia, developed by Athena Innovation and supported by the European Union's OpenAIRE organization, offers statistical methods for pseudo-anonymization, masking, k-anonymity, km-anonymity, generalization, and suppression.
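
Two of the techniques named in this list, generalization and suppression, can be sketched in a few lines of Python. This is only an illustrative sketch under assumed field names ("name", "email", "age") and an assumed ten-year age band; it is not how any of the tools above work internally, and real datasets need a per-variable review first.

```python
def generalize_age(age, width=10):
    """Generalization: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record, direct_identifiers=("name", "email")):
    """Suppress direct identifiers and generalize age in one record.

    The field names here are hypothetical; adjust them to your own dataset.
    """
    # Suppression: drop fields that identify a person directly.
    cleaned = {key: value for key, value in record.items()
               if key not in direct_identifiers}
    # Generalization: keep the age information, but in a coarser form.
    if "age" in cleaned:
        cleaned["age_band"] = generalize_age(cleaned.pop("age"))
    return cleaned


# Hypothetical record, invented for this example only.
record = {"name": "Jane Doe", "email": "jane@example.org",
          "age": 34, "response": "agree"}

print(deidentify(record))
```

The original record is left unchanged, so the identifiable version can be stored separately under restricted access if it must be kept at all.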