Google

Securing private data at scale with differentially private partition selection

Google Research has introduced a novel parallel algorithm called MaxAdaptiveDegree (MAD) to enhance differentially private (DP) partition selection, a critical process for identifying common data items in massive datasets without compromising individual privacy. By using an adaptive weighting mechanism, the algorithm improves the utility-privacy trade-off, allowing researchers to safely release significantly more data than previous non-adaptive methods. This enables privacy-preserving analysis of datasets containing hundreds of billions of items, up to three orders of magnitude larger than those handled by existing sequential approaches.

The Role of DP Partition Selection

  • DP partition selection identifies a meaningful subset of unique items from large collections based on their frequency across multiple users.
  • The process ensures that no single individual's data can be identified in the final list by adding controlled noise and filtering out items that are not sufficiently common.
  • This technique is a foundational step for various machine learning tasks, including extracting n-gram vocabularies for language models, analyzing private data streams, and increasing efficiency in private model fine-tuning.
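A key ingredient of the process above is bounding each user's contribution before any counting happens. The sketch below is an illustrative simplification (the cap value, the even weight split, and the function name are assumptions, not the paper's exact scheme): each user contributes a total weight of 1, split evenly across a capped number of distinct items, so no individual can dominate any item's count.

```python
from collections import defaultdict

def bounded_item_weights(user_items, max_items_per_user=3):
    """Aggregate item weights with a per-user contribution cap.

    Hypothetical sketch: each user contributes at most
    `max_items_per_user` distinct items, splitting one unit of
    weight evenly among them. Capping the total weight per user
    keeps the sensitivity of every item's count low.
    """
    weights = defaultdict(float)
    for items in user_items:
        # Deduplicate while preserving order, then cap the contribution.
        capped = list(dict.fromkeys(items))[:max_items_per_user]
        if not capped:
            continue
        share = 1.0 / len(capped)  # each user's total added weight is 1
        for item in capped:
            weights[item] += share
    return dict(weights)
```

Because every user adds at most one unit of total weight, adding or removing a single user changes any item's weight by a bounded amount, which is what makes the later noise calibration possible.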

The Weight, Noise, and Filter Paradigm

  • The standard approach to private partition selection begins by computing a "weight" for each item, typically representing its frequency, while ensuring "low sensitivity" so no single user has an outsized impact.
  • Random Gaussian noise is added to these weights to obfuscate exact counts, preventing attackers from inferring the presence of specific individuals.
  • A threshold determined by DP parameters is then applied; only items whose noisy weights exceed this threshold are included in the final output.
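The three steps above can be sketched in a few lines. This is a minimal textbook-style illustration, not the paper's exact calibration: the noise scale and threshold formulas below are standard simplified forms for the Gaussian mechanism, and the function name is an assumption.

```python
import math
import random

def select_partitions(weights, epsilon=1.0, delta=1e-6, sensitivity=1.0):
    """Weight, noise, and filter: an illustrative sketch.

    Adds Gaussian noise calibrated to (epsilon, delta)-DP and keeps
    only items whose noisy weight clears a threshold. The formulas
    below are simplified textbook calibrations, not the exact
    analysis used by MAD.
    """
    # Standard Gaussian-mechanism noise scale for the given sensitivity.
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    # Threshold set high enough that items seen by few users are
    # very unlikely to be released.
    threshold = 1 + sigma * math.sqrt(2 * math.log(1 / (2 * delta)))
    released = []
    for item, weight in weights.items():
        noisy = weight + random.gauss(0.0, sigma)
        if noisy > threshold:
            released.append(item)
    return released
```

Note the asymmetry this creates: an item needs roughly `threshold` weight to survive, so anything far above that bar carries weight it no longer needs, which is exactly the slack the adaptive approach described below exploits.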

Improving Utility via Adaptive Weighting

  • Traditional non-adaptive methods often result in "wastage," where highly popular items receive significantly more weight than necessary to cross the selection threshold.
  • The MaxAdaptiveDegree (MAD) algorithm introduces adaptivity by identifying items with excess weight and rerouting that weight to "under-allocated" items sitting just below the threshold.
  • This strategic reallocation allows a larger number of less-frequent items to be safely released, significantly increasing the utility of the dataset without compromising privacy or computational efficiency.
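The reallocation idea can be illustrated with a toy sketch. The cap value, the uniform redistribution, and the function name are assumptions made for clarity: the actual MAD algorithm reroutes weight along user-item edges so that sensitivity bounds are preserved, which this simplification does not do.

```python
def reroute_excess(weights, threshold, cap_multiplier=1.0):
    """Toy illustration of MAD-style adaptive reweighting.

    Items far above the threshold hold "excess" weight that cannot
    change the selection outcome. Trimming that excess and sharing
    it with under-allocated items can lift borderline items over
    the bar. Uniform redistribution here is a simplification.
    """
    cap = threshold * (1 + cap_multiplier)
    # Weight above the cap is wasted: the item would be released anyway.
    excess = sum(max(0.0, w - cap) for w in weights.values())
    trimmed = {item: min(w, cap) for item, w in weights.items()}
    under = [item for item, w in trimmed.items() if w < threshold]
    if under and excess > 0:
        boost = excess / len(under)
        for item in under:
            trimmed[item] += boost
    return trimmed
```

In this toy example, an item holding ten times the threshold's weight donates most of it, pushing two borderline items well past the bar while the popular item still clears it comfortably.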

Scalability and Parallelization

  • Unlike sequential algorithms that process data one piece at a time, MAD is designed as a parallel algorithm to handle the scale of modern user-based datasets.
  • The algorithm can process datasets with hundreds of billions of items by breaking the problem down into smaller parts computed simultaneously across multiple processors.
  • Google has open-sourced the implementation on GitHub to provide the research community with a tool that maintains robust privacy guarantees even at a massive scale.
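The parallel structure described above follows a familiar map-reduce shape: per-shard weights are computed simultaneously, then merged. The sketch below mirrors that structure only; it is not the open-sourced implementation, and the function names and shard layout are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_shard(shard):
    """Count, within one shard, how many users mention each item."""
    counts = Counter()
    for items in shard:
        counts.update(set(items))  # each user counted at most once per item
    return counts

def parallel_weights(shards, max_workers=4):
    """Map-reduce style aggregation sketch.

    Each shard of users is processed independently (the "map"),
    and the partial counts are merged in the main thread (the
    "reduce"), mirroring how a parallel algorithm like MAD can
    split work across many processors.
    """
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_shard, shards):
            total.update(partial)
    return total
```

At production scale the same pattern runs on a distributed framework rather than threads, but the key property is identical: no step requires seeing the whole dataset at once.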

Researchers and data scientists working with large-scale sensitive datasets should consider implementing the MaxAdaptiveDegree algorithm to maximize the amount of shareable data while strictly adhering to user-level differential privacy standards.