Differentially Private Synthetic Microdata
A practical guide on how to create synthetic datasets using differential privacy.
Synthetic data at UNHCR
As part of its commitment to promoting open data and staying at the forefront of privacy technologies, UNHCR undertook a project to explore the use of synthetic data for some of the Microdata Library datasets.
Synthetic datasets consist of randomly generated records that approximately preserve the statistical properties of the original data without reproducing any real individual's record, making them a valuable resource for research and analysis. Privacy is protected through differential privacy, a state-of-the-art privacy-enhancing technology that provides a rigorous mathematical framework for protecting sensitive information.
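To make the idea concrete, the following is a minimal sketch of differentially private synthesis for a single categorical column, using only the Python standard library. It is an illustration under stated assumptions, not the method or software used in the UNHCR project: the function names (`laplace_noise`, `dp_synthetic_column`), the example categories, and the parameter values are all hypothetical. The sketch assumes that the category domain and the number of synthetic records are fixed in advance and do not depend on the data, and that neighbouring datasets differ by adding or removing one record.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_synthetic_column(values, domain, epsilon, n_synthetic, seed=None):
    """Illustrative sketch: sample synthetic values from a noisy histogram.

    `domain` must be chosen without looking at the data; values outside it
    are ignored. `n_synthetic` must also be data-independent.
    """
    rng = random.Random(seed)
    counts = Counter(values)
    # Adding or removing one record changes exactly one count by 1, so
    # Laplace noise with scale 1/epsilon on each count gives epsilon-DP.
    noisy = [counts.get(c, 0) + laplace_noise(1.0 / epsilon, rng) for c in domain]
    # Clamping negatives to zero is post-processing and costs no extra privacy.
    weights = [max(0.0, x) for x in noisy]
    if sum(weights) == 0:
        # Fall back to a uniform distribution if every noisy count is zero.
        weights = [1.0] * len(domain)
    # Sampling from the noisy distribution is also post-processing.
    return rng.choices(domain, weights=weights, k=n_synthetic)


# Hypothetical example data: not real registration records.
original = ["household_1_3"] * 50 + ["household_4_6"] * 30 + ["household_7_plus"] * 5
domain = ["household_1_3", "household_4_6", "household_7_plus"]
synthetic = dp_synthetic_column(original, domain, epsilon=1.0, n_synthetic=85, seed=0)
print(Counter(synthetic))
```

A smaller `epsilon` adds more noise and gives stronger privacy at the cost of accuracy. Sampling floating-point Laplace noise in this naive way is known to be unsafe against some attacks, so real releases should rely on a vetted library rather than on a sketch like this one.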
UNHCR found this methodology particularly beneficial when working with datasets that represent full enumerations rather than sample surveys. This approach was applied to registration data, a census-like collection that includes all displaced individuals registered with the organization, enabling its secure sharing on this platform.
Drawing from the experience gained in this project, UNHCR developed a practical guide to applying this methodology. The guide is freely available for download at the top of this page.
Registration data
When people are forced to flee their homes due to war, persecution, or violence, registration and documentation by UNHCR are crucial first steps in ensuring protection and access to essential assistance. The data collected during registration provides comprehensive insights into displaced populations, supporting program planning for shelter, food, water, healthcare, sanitation, cash-based interventions, and other targeted aid.
For the first time, UNHCR is making registration data available on the UNHCR Microdata Library through the use of synthetic data. Synthetic registration datasets will be accessible in the dedicated collection.
Software
Synthetic datasets were generated using software provided and maintained by OpenDP. Specifically, the project used the SmartNoise Synthesizer, which is part of the SmartNoise SDK: Tools for Differential Privacy on Tabular Data.
Project Members
This guide is the result of a project that tested differential privacy methods on UNHCR data.
The project was carried out by the following team members:
Federico Sanson - Project Lead
Margherita Leonelli - Data Engineer
Alejandra Moreno Ramirez - User Scoping
Nitin Kohli, PhD - Differential Privacy Consultant and Guide Author
Data Innovation Fund
This work was made possible through the funding and support of the UNHCR Data Innovation Fund.