The CTCL (Data Synthesis with ConTrollability and CLustering) framework provides a lightweight alternative to the computationally expensive process of fine-tuning billion-parameter models for differentially private synthetic data generation. By using a 140-million-parameter generator and a universal topic model, the system achieves high-quality distribution matching while remaining accessible for resource-constrained applications. It can generate an unlimited number of synthetic samples without incurring additional privacy cost, and it consistently outperforms existing API-based and large-scale baselines under strict privacy guarantees.
### Pre-training Universal Components
The framework relies on two core components developed using large-scale public corpora, which can be reused across different private domains:
* **CTCL-Topic:** A universal topic model derived from Wikipedia documents. It uses BERTopic to embed and cluster data into approximately 1,000 distinct topics, each represented by 10 descriptive keywords.
* **CTCL-Generator:** A conditional language model based on the 140M-parameter BART-base architecture. It was pre-trained on 430 million description–document pairs from the SlimPajama dataset. The descriptions were generated by Gemma-2-2B, so pre-training teaches the model to produce a document from a given input condition.
### Learning the Private Domain
Once the universal components are established, the framework learns the specific characteristics of a private dataset through a two-step process:
* **Differentially Private (DP) Histograms:** The system captures high-level distributional information by creating a DP-protected histogram that represents the percentage of each topic present in the private corpus.
* **DP Fine-Tuning:** Each document in the private dataset is associated with its corresponding keywords from the CTCL-Topic model. The CTCL-Generator is then fine-tuned on these keyword-document pairs using differential privacy to ensure individual data points are protected.
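The two steps above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' implementation: the Gaussian noise mechanism, the `sigma` value, and the helper names `dp_topic_histogram` and `make_keyword_document_pairs` are hypothetical, and the sketch assumes each document is assigned to exactly one topic. Calibrating `sigma` to a privacy budget and the DP fine-tuning step itself are not shown.

```python
import random
from collections import Counter


def dp_topic_histogram(topic_ids, num_topics, sigma, rng=None):
    """Return noisy topic proportions for a private corpus.

    Assumption: Gaussian noise with standard deviation `sigma` is added to
    each topic count; choosing `sigma` for a given budget is out of scope.
    """
    rng = rng or random.Random()
    counts = Counter(topic_ids)
    # Add noise to every bin (including empty ones), then clip negatives to zero.
    noisy = [
        max(counts.get(t, 0) + rng.gauss(0.0, sigma), 0.0)
        for t in range(num_topics)
    ]
    total = sum(noisy)
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution.
        return [1.0 / num_topics] * num_topics
    return [c / total for c in noisy]


def make_keyword_document_pairs(documents, topic_ids, topic_keywords):
    """Pair each private document with the keywords of its assigned topic."""
    return [
        (", ".join(topic_keywords[t]), doc)
        for doc, t in zip(documents, topic_ids)
    ]


# Toy example: 3 topics with 2 keywords each
# (the framework described here uses about 1,000 topics with 10 keywords).
topic_keywords = [["goal", "league"], ["court", "law"], ["cell", "protein"]]
documents = ["The match ended 2-1.", "The appeal was denied.", "A late goal won it."]
topic_ids = [0, 1, 0]

histogram = dp_topic_histogram(topic_ids, num_topics=3, sigma=0.5, rng=random.Random(0))
pairs = make_keyword_document_pairs(documents, topic_ids, topic_keywords)
```

Here `histogram` holds one noisy proportion per topic, and `pairs` holds the (keywords, document) examples that the generator would be fine-tuned on under differential privacy.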
### Controllable Data Generation
The final stage involves producing the synthetic dataset by sampling from the fine-tuned generator:
* **Proportional Sampling:** The system generates data so that the share of each topic matches the proportions recorded in the DP histogram of the private domain.
* **Keyword Conditioning:** For each topic, the model uses the associated 10 keywords as input to prompt the DP fine-tuned generator to produce relevant documents.
* **Post-Processing Efficiency:** Generation only samples from the DP fine-tuned generator, guided by the DP histogram. By the post-processing property of differential privacy, the framework can therefore produce an unlimited number of synthetic samples without spending any further privacy budget, a significant advantage over iterative selection algorithms.
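Proportional sampling and keyword conditioning can be sketched as follows. This is an illustration under assumptions rather than the framework's actual code: the function names, the largest-remainder rounding rule, and the comma-separated prompt format are hypothetical, and the call to the DP fine-tuned generator is left out.

```python
def allocate_samples(topic_props, total_samples):
    """Split `total_samples` across topics in proportion to `topic_props`.

    Assumes `topic_props` is non-negative and sums to 1. Uses
    largest-remainder rounding so the counts add up to `total_samples`.
    """
    raw = [p * total_samples for p in topic_props]
    counts = [int(r) for r in raw]  # floor of each share
    remainder = total_samples - sum(counts)
    # Hand the leftover samples to the topics with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    return counts


def build_prompts(topic_keywords, counts):
    """Repeat each topic's keyword string once per sample allocated to it."""
    prompts = []
    for topic_id, n in enumerate(counts):
        condition = ", ".join(topic_keywords[topic_id])
        prompts.extend([condition] * n)
    return prompts


# Toy example: 3 topics with 2 keywords each (hypothetical values).
topic_keywords = [["goal", "league"], ["court", "law"], ["cell", "protein"]]
dp_histogram = [0.5, 0.25, 0.25]

counts = allocate_samples(dp_histogram, total_samples=10)
prompts = build_prompts(topic_keywords, counts)
# Each prompt would then be passed to the DP fine-tuned generator.
```

Both functions read only the DP histogram and the public topic keywords, so running them for a larger `total_samples` touches no private data.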
CTCL lets organizations synthesize private text data with a 140M-parameter generator, without the infrastructure requirements of massive LLMs. Because keyword conditioning preserves the topic-wise distribution, it is well suited to specialized domains where maintaining the statistical utility of the data is as critical as protecting user privacy.