synthetic-data-generation

1 post

google

Beyond billion-parameter burdens: Unlocking data synthesis with a conditional generator

The CTCL (Data Synthesis with ConTrollability and CLustering) framework provides a lightweight alternative to the computationally expensive process of fine-tuning billion-parameter models for differentially private synthetic data generation. By using a 140-million-parameter generator and a universal topic model, the system achieves high-quality distribution matching while remaining accessible for resource-constrained applications. This approach allows for the generation of unlimited synthetic samples without incurring additional privacy cost, consistently outperforming existing API-based and large-scale baselines under strict privacy guarantees.

### Pre-training Universal Components

The framework relies on two core components developed using large-scale public corpora, which can be reused across different private domains:

* **CTCL-Topic:** A universal topic model derived from Wikipedia documents. It uses BERTopic to embed and cluster the data into approximately 1,000 distinct topics, each represented by 10 descriptive keywords.
* **CTCL-Generator:** A conditional language model based on the 140M-parameter BART-base architecture. It was pre-trained on 430 million description–document pairs from the SlimPajama dataset, with descriptions generated by Gemma-2-2B, so that the model learns to generate text conditioned on a given description.

### Learning the Private Domain

Once the universal components are established, the framework learns the specific characteristics of a private dataset in two steps:

* **Differentially Private (DP) Histogram:** The system captures high-level distributional information by creating a DP-protected histogram of the proportion of each topic in the private corpus.
* **DP Fine-Tuning:** Each document in the private dataset is paired with its corresponding keywords from the CTCL-Topic model, and the CTCL-Generator is then fine-tuned on these keyword–document pairs with differential privacy so that individual data points are protected.

### Controllable Data Generation

The final stage produces the synthetic dataset by sampling from the fine-tuned generator:

* **Proportional Sampling:** The system generates data according to the topic proportions recorded in the DP histogram of the private domain.
* **Keyword Conditioning:** For each topic, the model uses the associated 10 keywords as input to prompt the DP fine-tuned generator to produce relevant documents.
* **Post-Processing Efficiency:** Because the generator is already fine-tuned with DP, the framework can generate an unlimited number of synthetic samples without further privacy budget expenditure, a significant advantage over iterative selection algorithms.

CTCL offers a highly scalable and efficient solution for organizations that need to synthesize private text data without the infrastructure requirements of massive LLMs. Its ability to maintain the topic-wise distribution through keyword conditioning makes it a strong fit for specialized domains where preserving the statistical utility of the data is as important as protecting user privacy.
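The DP histogram step can be sketched with the standard Laplace mechanism: each document falls into exactly one topic bucket, so adding Laplace noise of scale 1/ε to each count satisfies ε-DP for the release. This is a minimal illustration under that assumption, not the paper's implementation; the function name and signature are made up for the sketch.

```python
import random
from collections import Counter

def dp_topic_histogram(topic_ids, num_topics, epsilon):
    """Release a differentially private histogram of topic proportions.

    Each document contributes to exactly one bucket, so the L1
    sensitivity is 1 and Laplace noise of scale 1/epsilon gives
    epsilon-DP (standard Laplace mechanism).
    """
    counts = Counter(topic_ids)
    noisy = []
    for t in range(num_topics):
        # Difference of two iid Exp(epsilon) draws is Laplace(0, 1/epsilon).
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        noisy.append(max(counts.get(t, 0) + noise, 0.0))  # clamp negatives
    total = sum(noisy) or 1.0
    return [c / total for c in noisy]  # normalized topic proportions
```

Normalizing after clamping is a post-processing step, so it spends no additional privacy budget.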
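The generation stage (proportional sampling plus keyword conditioning) can be sketched as below. The comma-joined keyword prompt and the helper name are assumptions for illustration; the real generator's conditioning format may differ.

```python
import random

def build_keyword_prompts(topic_proportions, topic_keywords, num_samples, seed=0):
    """Sample topics in proportion to the (DP) histogram and turn each
    sampled topic's keywords into a conditioning prompt.

    topic_proportions -- normalized DP histogram over topics
    topic_keywords    -- mapping from topic id to its descriptive keywords
    """
    rng = random.Random(seed)
    # Draw topics so the synthetic corpus matches the private topic mix.
    topics = rng.choices(range(len(topic_proportions)),
                         weights=topic_proportions, k=num_samples)
    # One keyword prompt per sample, fed to the DP fine-tuned generator.
    return [", ".join(topic_keywords[t]) for t in topics]
```

Because the generator itself carries the DP guarantee, this sampling loop can be run for any `num_samples` without touching the privacy budget again.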