dp-sgd

3 posts

google

Differentially private machine learning at scale with JAX-Privacy

Google DeepMind and Google Research have announced the release of JAX-Privacy 1.0, a high-performance library designed to scale differentially private (DP) machine learning. By leveraging JAX’s native parallelization and functional programming model, the toolkit enables researchers to train large-scale foundation models while maintaining rigorous privacy guarantees. This version introduces modular components for advanced algorithms and empirical auditing, making private training both computationally efficient and verifiable across distributed environments.

### Scaling Differential Privacy with JAX

* The library is built directly on the JAX ecosystem, integrating seamlessly with Flax for neural network architectures and Optax for optimization.
* It uses JAX’s `vmap` for automatic vectorization and `shard_map` for single-program multiple-data (SPMD) parallelism, allowing DP primitives to scale across multiple accelerators.
* Just-in-time (JIT) compilation mitigates the performance overhead traditionally associated with per-example gradient clipping and noise addition (a minimal sketch of this step follows the post).

### Core Components and Advanced Algorithms

* The toolkit provides fundamental building blocks for implementing standard DP algorithms such as DP-SGD and DP-FTRL, including specialized modules for data batch construction.
* It supports state-of-the-art methods such as DP matrix factorization, which improves the privacy-utility trade-off by injecting correlated noise across training iterations.
* Features like micro-batching and padding handle the massive, variable-sized batches often required to strike a good balance between privacy and model utility.

### Verification and Privacy Auditing

* JAX-Privacy incorporates rigorous privacy accounting based on Rényi Differential Privacy to track privacy budgets precisely (a simplified accounting sketch also follows the post).
* The library includes tools for empirical auditing, allowing developers to validate their privacy guarantees with techniques such as membership inference attacks and data poisoning.
* The design ensures correctness in distributed settings, with particular attention to consistent noise generation and gradient synchronization across clusters.

JAX-Privacy 1.0 is a robust solution for researchers and engineers who need to deploy production-grade private models. Its modular architecture and integration with high-performance computing primitives make it a strong choice for training foundation models on sensitive datasets without compromising scalability or security.
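To make the per-example clipping and noising step concrete, here is a minimal sketch of that pattern in plain JAX. It is not the JAX-Privacy API: the toy model, helper names, and hyperparameter values (`clip_norm`, `noise_multiplier`) are illustrative assumptions, but the `vmap`-over-`jax.grad` structure is the mechanism the post describes.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    # Toy linear model; stands in for a Flax model's apply function.
    pred = jnp.dot(x, params["w"]) + params["b"]
    return 0.5 * (pred - y) ** 2


def clip_gradient(grad, clip_norm):
    # Rescale one per-example gradient so its global L2 norm is at most clip_norm.
    leaves = jax.tree_util.tree_leaves(grad)
    global_norm = jnp.sqrt(sum(jnp.sum(g ** 2) for g in leaves))
    scale = jnp.minimum(1.0, clip_norm / (global_norm + 1e-12))
    return jax.tree_util.tree_map(lambda g: g * scale, grad)


@jax.jit
def dp_gradient(params, xs, ys, key, clip_norm=1.0, noise_multiplier=1.1):
    # vmap gives one gradient per example; params are shared (in_axes=None).
    per_example = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(params, xs, ys)
    clipped = jax.vmap(clip_gradient, in_axes=(0, None))(per_example, clip_norm)
    summed = jax.tree_util.tree_map(lambda g: jnp.sum(g, axis=0), clipped)
    # Add Gaussian noise calibrated to the clipping norm, one PRNG key per leaf.
    leaves, treedef = jax.tree_util.tree_flatten(summed)
    keys = jax.random.split(key, len(leaves))
    noised = [
        g + noise_multiplier * clip_norm * jax.random.normal(k, g.shape)
        for g, k in zip(leaves, keys)
    ]
    # Average over the batch to get the update handed to the optimizer.
    batch_size = xs.shape[0]
    return jax.tree_util.tree_map(
        lambda g: g / batch_size,
        jax.tree_util.tree_unflatten(treedef, noised),
    )


params = {"w": jnp.zeros(4), "b": jnp.zeros(())}
grads = dp_gradient(params, jnp.ones((8, 4)), jnp.ones(8), jax.random.PRNGKey(0))
```

In a real setup the averaged, noised gradient would be handed to an Optax optimizer, and the whole step would typically be sharded across accelerators (for example with `shard_map`), which is where consistent noise generation and gradient synchronization become the concerns the post highlights.

On the accounting side, a deliberately simplified sketch of Rényi-DP composition shows the shape of the bookkeeping: compose the per-step Rényi guarantee of the Gaussian mechanism, then convert to (ε, δ)-DP. The function names are illustrative, and the calculation ignores amplification by subsampling, so the resulting ε is far more pessimistic than what a production accountant such as the one in JAX-Privacy would report.

```python
import math


def gaussian_rdp(alpha: float, sigma: float) -> float:
    # Rényi divergence of order alpha for one Gaussian mechanism release
    # with L2 sensitivity 1 and noise standard deviation sigma.
    return alpha / (2.0 * sigma ** 2)


def epsilon_after_training(steps: int, sigma: float, delta: float) -> float:
    # RDP composes additively across steps; convert to (epsilon, delta)-DP
    # and report the tightest epsilon over a grid of orders.
    best = float("inf")
    for i in range(1, 1000):
        alpha = 1.0 + 0.1 * i
        rdp = steps * gaussian_rdp(alpha, sigma)
        eps = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best


# Pessimistic by construction: amplification by subsampling is ignored here.
print(epsilon_after_training(steps=100, sigma=10.0, delta=1e-5))
```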

google

A picture's worth a thousand (private) words: Hierarchical generation of coherent synthetic photo albums

Researchers at Google have developed a hierarchical method for generating differentially private (DP) synthetic photo albums, providing a way to share representative datasets while protecting sensitive individual information. By utilizing an intermediate text representation and a two-stage generation process, the approach maintains thematic coherence across multiple images in an album, a significant challenge for traditional synthetic data methods. This framework allows organizations to apply standard, non-private analytical techniques to safe synthetic substitutes rather than modifying every individual analysis method for differential privacy.

## The Hierarchical Generation Process

* The workflow begins by converting original photo albums into structured text; an AI model generates detailed captions for each image and a summary for the entire album.
* Two large language models (LLMs) are privately fine-tuned using DP-SGD: the first is trained to produce album summaries, and the second generates individual photo captions based on those summaries.
* Synthetic data is then produced hierarchically: the model first generates a global album summary to serve as context, followed by a series of individual photo captions that remain consistent with that context (this sampling loop is sketched after the post).
* The final step uses a text-to-image model to transform the private, synthetic text captions back into a set of coherent images.

## Benefits of Intermediate Text Representations

* Text summarization is inherently privacy-enhancing because it is a "lossy" operation: a text description is unlikely to capture the exact, unique details of an original photo.
* Using text as a midpoint allows for more efficient resource management, since generated albums can be filtered and curated at the text level before undergoing the computationally expensive process of image generation.
* The hierarchical approach ensures that photos within a synthetic album share the same characters and themes, because every caption in a set is derived from the same contextual summary.
* Training two separate models with shorter context windows is significantly more efficient than training one large model, because the computational cost of self-attention scales quadratically with context length.

This hierarchical, text-mediated approach demonstrates that high-level semantic information and thematic coherence can be preserved in synthetic datasets without sacrificing individual privacy. Organizations should consider this workflow of translating complex multi-modal data into structured text before synthesis as a way to scale differentially private data generation for advanced modeling and analysis.
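As a schematic illustration of the two-stage sampling described above, the sketch below shows the control flow with the three models passed in as placeholder callables. The post does not name concrete APIs, so every identifier here is hypothetical; the point is that one album summary is drawn first and then reused as the shared context for every caption, and that images are only rendered after the captions have been curated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyntheticAlbum:
    summary: str
    captions: List[str]


def generate_album(
    sample_summary: Callable[[], str],     # DP-fine-tuned album-summary LLM
    sample_caption: Callable[[str], str],  # DP-fine-tuned caption LLM
    photos_per_album: int,
) -> SyntheticAlbum:
    # Stage 1: one global summary per album keeps the set thematically coherent.
    summary = sample_summary()
    # Stage 2: every photo caption is conditioned on that same summary.
    captions = [sample_caption(summary) for _ in range(photos_per_album)]
    return SyntheticAlbum(summary=summary, captions=captions)


def render_album(
    album: SyntheticAlbum, text_to_image: Callable[[str], bytes]
) -> List[bytes]:
    # Stage 3: captions can be filtered and curated cheaply at the text level
    # before paying for the expensive image-generation step.
    return [text_to_image(caption) for caption in album.captions]
```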

google

Fine-tuning LLMs with user-level differential privacy

Researchers from Google investigated scaling user-level differential privacy (DP) to the fine-tuning of large language models in datacenter environments. While traditional example-level DP protects individual data points, user-level DP provides a stronger guarantee by masking the presence of an entire user's dataset, which is critical for privacy-sensitive, domain-specific tasks. The study explores how the flexibility of datacenter training can be used to optimize sampling strategies and contribution bounds to minimize the noise typically required for these stringent privacy guarantees.

## Limitations of Example-Level Privacy

* Standard differential privacy focuses on "example-level" protection, which prevents attackers from learning about specific individual data points.
* In many real-world scenarios, a single user contributes many examples to a dataset; if an attacker can analyze these multiple points together, they may still learn private information about the user even under example-level DP.
* User-level DP addresses this by ensuring a model remains essentially the same whether or not a specific user's entire data collection was used during training.
* While more robust, user-level DP is "strictly harder" to implement because it requires injecting significantly more noise into the training process, a problem that scales with the size of the model.

## Methodologies for User-Level DP Fine-Tuning

* Both primary algorithms require a "contribution bound" during pre-processing, which strictly limits the number of examples any single user can provide to the training set (a small sketch of this pre-processing and batching follows the post).
* Example-Level Sampling (ELS) samples random individual examples for a batch and then applies a modified version of DP-SGD with high noise to compensate for the potential presence of multiple examples from the same user.
* User-Level Sampling (ULS) samples random users and includes all of their (bounded) examples in a batch, which more closely resembles the structure of federated learning.
* The datacenter environment offers a unique advantage over federated learning because researchers can perform precise queries on both individual examples and whole users, allowing for better optimization of the noise-to-utility ratio.

## Optimization and Datacenter Flexibility

* The researchers focused on fine-tuning rather than full training because DP requires additional computation that is often unaffordable at base-model scale.
* A central challenge in this research is determining the optimal "contribution bound": if the bound is too low, valuable data is discarded, but if it is too high, more noise must be added to maintain privacy.
* Because the datacenter allows for random sampling of any user at any time (unlike federated learning, where devices must be online), the ULS algorithm can be tuned more effectively to achieve quality gains in the final model.

To maximize the utility of LLMs fine-tuned on private data, developers should prioritize User-Level Sampling (ULS) strategies and carefully calibrate the contribution bounds of their datasets. By leveraging the controlled environment of a datacenter to optimize these parameters, it is possible to achieve high-performance models that respect user privacy more effectively than traditional example-level methods.
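As a rough illustration of the pre-processing and batching described above, the sketch below caps each user's contribution at a fixed bound and then assembles a User-Level Sampling (ULS) batch by drawing whole users. All names and signatures are illustrative rather than taken from the paper's code; the point is where the contribution bound is enforced and what a ULS batch contains.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Example = Tuple[str, str]  # e.g. (input_text, target_text)


def bound_contributions(
    data: Sequence[Tuple[Hashable, Example]], max_per_user: int, seed: int = 0
) -> Dict[Hashable, List[Example]]:
    # Group examples by user, then keep at most max_per_user examples per user.
    by_user: Dict[Hashable, List[Example]] = defaultdict(list)
    for user_id, example in data:
        by_user[user_id].append(example)
    rng = random.Random(seed)
    return {
        user: rng.sample(examples, min(len(examples), max_per_user))
        for user, examples in by_user.items()
    }


def sample_uls_batch(
    bounded: Dict[Hashable, List[Example]],
    users_per_batch: int,
    rng: random.Random,
) -> List[Example]:
    # User-Level Sampling: draw users at random and take all of their
    # (already bounded) examples; ELS would instead flatten the data and
    # draw individual examples, paying for it with extra noise.
    users = rng.sample(list(bounded), min(users_per_batch, len(bounded)))
    return [example for user in users for example in bounded[user]]
```

Choosing `max_per_user` is exactly the trade-off the post describes: a low bound discards useful data, while a high bound raises each user's sensitivity and therefore the noise that must be added per batch.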