A picture's worth a thousand (private) words: Hierarchical generation of coherent synthetic photo albums (opens in new tab)
Researchers at Google have developed a hierarchical method for generating differentially private (DP) synthetic photo albums, providing a way to share representative datasets while protecting sensitive individual information. By utilizing an intermediate text representation and a two-stage generation process, the approach maintains thematic coherence across multiple images in an album—a significant challenge for traditional synthetic data methods. This framework allows organizations to apply standard, non-private analytical techniques to safe synthetic substitutes rather than modifying every individual analysis method for differential privacy.
The Hierarchical Generation Process
- The workflow begins by converting original photo albums into structured text; an AI model generates detailed captions for each image and a summary for the entire album.
- Two large language models (LLMs) are privately fine-tuned using DP-SGD: the first is trained to produce album summaries, and the second generates individual photo captions based on those summaries.
- Synthetic data is then produced hierarchically, where the model first generates a global album summary to serve as context, followed by a series of individual photo captions that remain consistent with that context.
- The final step uses a text-to-image AI model to transform the private, synthetic text captions back into a set of coherent images.
Benefits of Intermediate Text Representations
- Text summarization is inherently privacy-enhancing because it is a "lossy" operation, meaning the text description is unlikely to capture the exact unique details of an original photo.
- Using text as a midpoint allows for more efficient resource management, as generated albums can be filtered and curated at the text level before undergoing the computationally expensive process of image generation.
- The hierarchical approach ensures that photos within a synthetic album share the same characters and themes, as every caption in a set is derived from the same contextual summary.
- Training two separate models with shorter context windows is significantly more efficient than training one large model, because the computational cost of self-attention scales quadratically with the length of the context.
This hierarchical, text-mediated approach demonstrates that high-level semantic information and thematic coherence can be preserved in synthetic datasets without sacrificing individual privacy. Organizations should consider this workflow—translating complex multi-modal data into structured text before synthesis—to scale differentially private data generation for advanced modeling and analysis.