Securing private data at scale with differentially private partition selection (opens in new tab)
Google Research has introduced a novel parallel algorithm called MaxAdaptiveDegree (MAD) to enhance differentially private (DP) partition selection, a critical process for identifying common data items in massive datasets without compromising individual privacy. By utilizing an adaptive weighting mechanism, the algorithm optimizes the utility-privacy trade-off, allowing researchers to safely release significantly more data than previous non-adaptive methods. This breakthrough enables privacy-preserving analysis on datasets containing hundreds of billions of items, scaling up to three orders of magnitude larger than existing sequential approaches.
The Role of DP Partition Selection
- DP partition selection identifies a meaningful subset of unique items from large collections based on their frequency across multiple users.
- The process ensures that no single individual's data can be identified in the final list by adding controlled noise and filtering out items that are not sufficiently common.
- This technique is a foundational step for various machine learning tasks, including extracting n-gram vocabularies for language models, analyzing private data streams, and increasing efficiency in private model fine-tuning.
The Weight, Noise, and Filter Paradigm
- The standard approach to private partition selection begins by computing a "weight" for each item, typically representing its frequency, while ensuring "low sensitivity" so no single user has an outsized impact.
- Random Gaussian noise is added to these weights to obfuscate exact counts, preventing attackers from inferring the presence of specific individuals.
- A threshold determined by DP parameters is then applied; only items whose noisy weights exceed this threshold are included in the final output.
Improving Utility via Adaptive Weighting
- Traditional non-adaptive methods often result in "wastage," where highly popular items receive significantly more weight than necessary to cross the selection threshold.
- The MaxAdaptiveDegree (MAD) algorithm introduces adaptivity by identifying items with excess weight and rerouting that weight to "under-allocated" items sitting just below the threshold.
- This strategic reallocation allows a larger number of less-frequent items to be safely released, significantly increasing the utility of the dataset without compromising privacy or computational efficiency.
Scalability and Parallelization
- Unlike sequential algorithms that process data one piece at a time, MAD is designed as a parallel algorithm to handle the scale of modern user-based datasets.
- The algorithm can process datasets with hundreds of billions of items by breaking the problem down into smaller parts computed simultaneously across multiple processors.
- Google has open-sourced the implementation on GitHub to provide the research community with a tool that maintains robust privacy guarantees even at a massive scale.
Researchers and data scientists working with large-scale sensitive datasets should consider implementing the MaxAdaptiveDegree algorithm to maximize the amount of shareable data while strictly adhering to user-level differential privacy standards.