google Aug 19, 2025

Securing private data at scale with differentially private partition selection (opens in new tab)

ai machine-learning differential-privacy scalability data-privacy partition-selection parallel-computing algorithms

Google Research has introduced a novel parallel algorithm called MaxAdaptiveDegree (MAD) to enhance differentially private (DP) partition selection, a critical process for identifying common data items in massive datasets without compromising individual privacy. By utilizing an adaptive weighting mechanism, the algorithm optimizes the utility-privacy trade-off, allowing researchers to safely release significantly more data than previous non-adaptive methods. This breakthrough enables privacy-preserving analysis on datasets containing hundreds of billions of items, scaling up to three orders of magnitude larger than existing sequential approaches.

The Role of DP Partition Selection

DP partition selection identifies a meaningful subset of unique items from large collections based on their frequency across multiple users.
The process ensures that no single individual's data can be identified in the final list by adding controlled noise and filtering out items that are not sufficiently common.
This technique is a foundational step for various machine learning tasks, including extracting n-gram vocabularies for language models, analyzing private data streams, and increasing efficiency in private model fine-tuning.

The Weight, Noise, and Filter Paradigm

The standard approach to private partition selection begins by computing a "weight" for each item, typically representing its frequency, while ensuring "low sensitivity" so no single user has an outsized impact.
Random Gaussian noise is added to these weights to obfuscate exact counts, preventing attackers from inferring the presence of specific individuals.
A threshold determined by DP parameters is then applied; only items whose noisy weights exceed this threshold are included in the final output.

Improving Utility via Adaptive Weighting

Traditional non-adaptive methods often result in "wastage," where highly popular items receive significantly more weight than necessary to cross the selection threshold.
The MaxAdaptiveDegree (MAD) algorithm introduces adaptivity by identifying items with excess weight and rerouting that weight to "under-allocated" items sitting just below the threshold.
This strategic reallocation allows a larger number of less-frequent items to be safely released, significantly increasing the utility of the dataset without compromising privacy or computational efficiency.

Scalability and Parallelization

Unlike sequential algorithms that process data one piece at a time, MAD is designed as a parallel algorithm to handle the scale of modern user-based datasets.
The algorithm can process datasets with hundreds of billions of items by breaking the problem down into smaller parts computed simultaneously across multiple processors.
Google has open-sourced the implementation on GitHub to provide the research community with a tool that maintains robust privacy guarantees even at a massive scale.

Researchers and data scientists working with large-scale sensitive datasets should consider implementing the MaxAdaptiveDegree algorithm to maximize the amount of shareable data while strictly adhering to user-level differential privacy standards.