google Oct 19, 2025

A picture's worth a thousand (private) words: Hierarchical generation of coherent synthetic photo albums (opens in new tab)

ai llm gen-ai differential-privacy synthetic-data dp-sgd text-to-image hierarchical-generation

Researchers at Google have developed a hierarchical method for generating differentially private (DP) synthetic photo albums, providing a way to share representative datasets while protecting sensitive individual information. By utilizing an intermediate text representation and a two-stage generation process, the approach maintains thematic coherence across multiple images in an album—a significant challenge for traditional synthetic data methods. This framework allows organizations to apply standard, non-private analytical techniques to safe synthetic substitutes rather than modifying every individual analysis method for differential privacy.

The Hierarchical Generation Process

The workflow begins by converting original photo albums into structured text; an AI model generates detailed captions for each image and a summary for the entire album.
Two large language models (LLMs) are privately fine-tuned using DP-SGD: the first is trained to produce album summaries, and the second generates individual photo captions based on those summaries.
Synthetic data is then produced hierarchically, where the model first generates a global album summary to serve as context, followed by a series of individual photo captions that remain consistent with that context.
The final step uses a text-to-image AI model to transform the private, synthetic text captions back into a set of coherent images.

Benefits of Intermediate Text Representations

Text summarization is inherently privacy-enhancing because it is a "lossy" operation, meaning the text description is unlikely to capture the exact unique details of an original photo.
Using text as a midpoint allows for more efficient resource management, as generated albums can be filtered and curated at the text level before undergoing the computationally expensive process of image generation.
The hierarchical approach ensures that photos within a synthetic album share the same characters and themes, as every caption in a set is derived from the same contextual summary.
Training two separate models with shorter context windows is significantly more efficient than training one large model, because the computational cost of self-attention scales quadratically with the length of the context.

This hierarchical, text-mediated approach demonstrates that high-level semantic information and thematic coherence can be preserved in synthetic datasets without sacrificing individual privacy. Organizations should consider this workflow—translating complex multi-modal data into structured text before synthesis—to scale differentially private data generation for advanced modeling and analysis.