Differentially Private Synthetic Microdata
A practical guide on how to create synthetic datasets using differential privacy.
Synthetic data at UNHCR
As part of its commitment to promoting open data and staying at the forefront of privacy technologies, UNHCR undertook a project to explore the use of synthetic data for some of the Microdata Library datasets.
Synthetic datasets are made up of randomly generated records that preserve the statistical properties of the original data, making them a valuable resource for research and analysis. Confidentiality is protected through differential privacy, a state-of-the-art privacy-enhancing technology that provides a rigorous mathematical framework for protecting sensitive information.
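As a rough illustration of the idea, the sketch below builds a differentially private histogram with Laplace noise and then samples synthetic records from it. It is a textbook baseline written for this page, not the algorithm used by the SmartNoise Synthesizer or by UNHCR. The column values, the toy records, the privacy budget `epsilon=1.0`, and the function names are all hypothetical. It assumes the set of possible values (`domain`) is public and fixed in advance, and that the number of synthetic records is chosen independently of the data.

```python
import itertools
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_synthetic_sample(records, domain, epsilon, n_synthetic, rng):
    """Sample synthetic records from a Laplace-noised histogram over `domain`."""
    counts = Counter(records)
    # Adding or removing one person changes exactly one cell count by 1,
    # so the histogram has L1 sensitivity 1 and the noise scale is 1 / epsilon.
    scale = 1.0 / epsilon
    noisy = [max(0.0, counts[cell] + laplace_noise(scale, rng)) for cell in domain]
    # Clamping and sampling are post-processing and do not weaken the guarantee.
    if sum(noisy) == 0.0:
        noisy = [1.0] * len(domain)  # fall back to uniform if every cell was clamped
    return rng.choices(domain, weights=noisy, k=n_synthetic)


# Hypothetical toy data: (age group, sex) pairs standing in for real records.
# Note: `random` is not a cryptographically secure source of noise; a production
# system should rely on a vetted differential privacy library instead.
rng = random.Random(0)
age_groups = ["0-17", "18-59", "60+"]
sexes = ["F", "M"]
domain = list(itertools.product(age_groups, sexes))
records = (
    [("0-17", "F")] * 40 + [("0-17", "M")] * 45
    + [("18-59", "F")] * 60 + [("18-59", "M")] * 50
    + [("60+", "F")] * 10 + [("60+", "M")] * 5
)

synthetic = dp_synthetic_sample(records, domain, epsilon=1.0, n_synthetic=1000, rng=rng)
print(Counter(synthetic).most_common())
```

The synthetic records follow roughly the same proportions as the toy data, while no single original record has more than a bounded influence on the output. A smaller `epsilon` adds more noise and gives stronger privacy at the cost of accuracy.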
UNHCR found this methodology particularly beneficial when working with datasets that represent full enumerations rather than sample surveys. This approach was applied to registration data, a census-like collection that includes all displaced individuals registered with the organization, enabling its secure sharing on this platform.
Drawing from the experience gained in this project, UNHCR developed a practical guide to applying this methodology. The guide is freely available for download at the top of this page.
Registration data
When people are forced to flee their homes due to war, persecution, or violence, registration and documentation by UNHCR are crucial first steps in ensuring protection and access to essential assistance. The data collected during registration provides comprehensive insights into displaced populations, supporting program planning for shelter, food, water, healthcare, sanitation, cash-based interventions, and other targeted aid.
For the first time, UNHCR is making registration data available on the UNHCR Microdata Library through the use of synthetic data. Synthetic registration datasets will be accessible in the dedicated collection.
Software
The synthetic datasets were generated using software provided and maintained by OpenDP, specifically the SmartNoise Synthesizer, which is part of the SmartNoise SDK: Tools for Differential Privacy on Tabular Data.
Authors and Contributors
This guidance was written by Nitin Kohli, PhD, from the University of California, Berkeley, who also tested differential privacy methods on UNHCR data.
The UNHCR Data Curation team reviewed and supervised the work.
Data Innovation Fund
This work was made possible through the funding and support of the UNHCR Data Innovation Fund.