
Beyond billion-parameter burdens: Unlocking data synthesis with a conditional generator

The CTCL (Data Synthesis with ConTrollability and CLustering) framework provides a lightweight alternative to the computationally expensive process of fine-tuning billion-parameter models for differentially private synthetic data generation. Using a 140-million-parameter generator and a universal topic model, the system achieves high-quality distribution matching while remaining practical for resource-constrained applications. Because the generator itself is trained with differential privacy, the approach can produce unlimited synthetic samples at no additional privacy cost, and it consistently outperforms existing API-based and large-scale baselines under strict privacy guarantees.

Pre-training Universal Components

The framework relies on two core components developed using large-scale public corpora, which can be reused across different private domains:

  • CTCL-Topic: A universal topic model built from Wikipedia documents. It uses BERTopic to embed and cluster the corpus into approximately 1,000 distinct topics, each represented by its 10 most descriptive keywords (see the sketch after this list).
  • CTCL-Generator: A conditional language model based on the 140M-parameter BART-base architecture. It was pre-trained on 430 million description–document pairs from the SlimPajama dataset, with the descriptions generated by Gemma-2-2B, so that the model learns to generate text conditioned on a given input description.
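
To make the clustering step concrete, here is a minimal Python sketch of building a CTCL-Topic-style model with the open-source BERTopic library. The corpus loader is a hypothetical placeholder, and the embedding model and settings are illustrative assumptions rather than the framework's exact configuration.

```python
# Minimal sketch of a CTCL-Topic-style model using BERTopic.
# Assumptions: load_public_corpus() is a hypothetical helper returning a
# list of strings; the embedding model and nr_topics are illustrative
# stand-ins for the Wikipedia-scale setup described above.
from bertopic import BERTopic

public_docs = load_public_corpus()  # hypothetical: List[str] of documents

topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",  # sentence-transformers encoder
    nr_topics=1000,                      # reduce to roughly 1,000 topics
)
topics, _ = topic_model.fit_transform(public_docs)

# Summarize each topic by its top-10 keywords (BERTopic returns
# (word, score) pairs; topic -1 collects outliers and is skipped).
topic_keywords = {
    topic_id: [word for word, _ in topic_model.get_topic(topic_id)[:10]]
    for topic_id in set(topics)
    if topic_id != -1
}
```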

Learning the Private Domain

Once the universal components are established, the framework learns the specific characteristics of a private dataset through a two-step process:

  • Differentially Private (DP) Histograms: The system captures high-level distributional information by releasing a DP-protected histogram of the proportion of each topic in the private corpus (a sketch of this release step follows the list).
  • DP Fine-Tuning: Each document in the private dataset is paired with the keywords of its corresponding CTCL-Topic topic. The CTCL-Generator is then fine-tuned on these keyword–document pairs using differential privacy, so that individual documents remain protected (see the DP-SGD sketch below).
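
Here is a minimal sketch of the first step, assuming the Gaussian mechanism (the post does not specify which DP mechanism is used) and a hypothetical function name:

```python
import numpy as np

def dp_topic_histogram(topic_ids, num_topics, sigma):
    """Release topic proportions via the Gaussian mechanism.

    Each private document is assigned to exactly one topic, so adding or
    removing a document changes one count by 1 (L2 sensitivity of 1).
    sigma is the noise scale calibrated to the (epsilon, delta) budget;
    the calibration itself is omitted here.
    """
    counts = np.bincount(np.asarray(topic_ids), minlength=num_topics)
    noisy = counts + np.random.normal(0.0, sigma, size=num_topics)
    noisy = np.clip(noisy, 0.0, None)   # proportions cannot be negative
    return noisy / noisy.sum()          # normalize to a distribution
```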
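
For the second step, the following sketch fine-tunes BART-base on keyword–document pairs with DP-SGD via the Opacus library. Opacus is a standard choice, not necessarily what the authors used; the data loader and its batch keys are hypothetical, and per-sample gradient support for a given architecture may require extra care in practice.

```python
import torch
from opacus import PrivacyEngine
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical loader yielding batches with tokenized topic keywords as
# inputs and the matching private documents as targets.
train_loader = make_keyword_document_loader(tokenizer)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,  # noise scale, set by the privacy budget
    max_grad_norm=1.0,     # per-example gradient clipping bound
)

model.train()
for batch in train_loader:
    optimizer.zero_grad()
    out = model(input_ids=batch["keyword_ids"], labels=batch["document_ids"])
    out.loss.backward()    # Opacus clips per-example grads and adds noise
    optimizer.step()
```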

Controllable Data Generation

The final stage involves producing the synthetic dataset by sampling from the fine-tuned generator:

  • Proportional Sampling: The system draws generation requests so that topics match the proportions recorded in the DP histogram of the private domain.
  • Keyword Conditioning: For each drawn topic, the 10 associated keywords are fed to the DP fine-tuned generator as the prompt for producing a relevant document (both steps are sketched after this list).
  • Post-Processing Efficiency: Because the generator is already fine-tuned with DP, sampling from it is pure post-processing: the framework can generate an unlimited number of synthetic samples without spending further privacy budget, a significant advantage over iterative selection algorithms.
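
Putting the stages together, the sketch below draws topics in proportion to the DP histogram and prompts the generator with each drawn topic's keywords. All names are illustrative and carry over from the earlier sketches; the alignment of dp_proportions with topic_keywords is an assumption.

```python
import numpy as np

def generate_synthetic_corpus(model, tokenizer, topic_keywords,
                              dp_proportions, num_samples):
    """Sample topics to match the DP histogram, then prompt the
    DP fine-tuned generator with each topic's keywords."""
    topic_ids = list(topic_keywords.keys())
    # Proportional sampling: dp_proportions[i] is assumed to be the
    # DP-estimated share of topic_ids[i] and to sum to 1.
    drawn = np.random.choice(topic_ids, size=num_samples, p=dp_proportions)

    synthetic_docs = []
    for topic_id in drawn:
        # Keyword conditioning: the topic's 10 keywords form the prompt.
        prompt = ", ".join(topic_keywords[topic_id])
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(
            **inputs, do_sample=True, max_new_tokens=256
        )
        synthetic_docs.append(
            tokenizer.decode(output_ids[0], skip_special_tokens=True)
        )
    return synthetic_docs
```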

CTCL offers a scalable and efficient solution for organizations that need to synthesize private text data without the infrastructure demands of massive LLMs. Its ability to preserve the topic-wise distribution through keyword conditioning makes it well suited to specialized domains where maintaining the statistical utility of the data is as critical as protecting user privacy.