Synthetic and federated: Privacy-preserving domain adaptation with LLMs for mobile applications
Researchers at Google have developed a framework for improving both small language models (LMs) and large language models (LLMs) in mobile applications like Gboard by combining privacy-preserving synthetic data with federated learning. The approach pairs differential privacy (DP) with LLM-generated synthetic data to limit the risk of memorizing user data while delivering measurable gains in production metrics such as next-word prediction and proofreading. The result is a robust pipeline that lets models adapt to specific user domains without compromising individual privacy or requiring centralized storage of user data.
Strengthening Privacy with DP-FL
- Gboard has transitioned all production LMs trained on user data to a Federated Learning with Differential Privacy (DP-FL) framework, keeping raw data on-device and bounding how much a model can memorize about any individual user.
- The deployment uses the BLT-DP-FTRL algorithm, which offers a stronger trade-off between privacy guarantees and model utility than earlier mechanisms while being easier to deploy in production.
- Engineers adopted the SI-CIFG model architecture to facilitate efficient on-device training, ensuring the hardware can handle local updates while maintaining compatibility with DP constraints.
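The core privacy mechanism behind DP-FL training can be illustrated with a minimal clip-and-noise aggregation round. This is a generic sketch of the idea, not the BLT-DP-FTRL mechanism itself (which uses correlated noise across rounds); all function names and parameter values here are illustrative assumptions.

```python
import math
import random

def clip_update(update, clip_norm):
    """Bound one client's contribution by clipping its L2 norm (illustrative)."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return [x * scale for x in update]

def dp_federated_round(global_weights, client_updates,
                       clip_norm=1.0, noise_multiplier=1.0, lr=1.0, rng=None):
    """One sketched DP-FL round: clip per-client updates, average them,
    then add Gaussian noise calibrated to the clip norm before applying."""
    rng = rng or random.Random(0)
    n = len(client_updates)
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    avg = [sum(c[i] for c in clipped) / n for i in range(len(global_weights))]
    # Noise scale is tied to the per-client sensitivity (clip_norm / n);
    # production systems replace this with correlated-noise DP-FTRL variants.
    sigma = noise_multiplier * clip_norm / n
    noisy = [a + rng.gauss(0.0, sigma) for a in avg]
    return [w + lr * d for w, d in zip(global_weights, noisy)]
```

Because each client's influence is clipped before noise is added, no single user's data can move the aggregate by more than a bounded amount, which is what the DP guarantee formalizes.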
Synthetic Data Generation via Public LLMs
- Powerful LLMs trained on public web data are prompted to synthesize high-quality text that mimics mobile user interactions without ever accessing actual private user data.
- The process involves a two-step prompting strategy: first, filtering public datasets to identify topics common in mobile communication, and second, generating new, domain-specific text based on those patterns.
- This synthetic data serves as a bridge for pre-training small LMs, which are then refined through private post-training on-device to capture the nuances of user behavior.
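The two-step prompting strategy above can be sketched as a small pipeline over any text-generation backend. The prompt wording and the `llm` callable are assumptions for illustration; the published work does not specify exact prompts.

```python
def synthesize_domain_text(llm, public_samples, n_outputs=3):
    """Sketch of the two-step prompting strategy:
    (1) ask the LLM to identify topics common in public text samples,
    (2) ask it to generate fresh, domain-specific text about those topics.
    `llm` is any callable mapping a prompt string to a completion string."""
    # Step 1: distill topic patterns from public data (no private data involved).
    topic_prompt = ("List the conversation topics that appear in these "
                    "public text snippets:\n" + "\n".join(public_samples))
    topics = llm(topic_prompt)
    # Step 2: generate new mobile-style text conditioned on those topics.
    gen_prompt = (f"Write {n_outputs} short messages a mobile user might type "
                  f"about these topics:\n{topics}")
    return llm(gen_prompt)
```

The synthetic output of a pipeline like this would then pre-train a small on-device LM, with private federated post-training supplying the user-specific refinement.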
Adapting LLMs for Mobile Proofreading
- To support advanced features like Gboard's "Proofread," researchers developed a "Synthesize-then-Adapt" pipeline specifically for error correction.
- LLMs generate synthetic "corrupted" text to simulate common mobile typing errors, providing the necessary training pairs (error/correction) that are difficult to find in public datasets.
- Federated learning is then used to adapt these error-correction models to specific app domains (such as messaging or email) using on-device signals, ensuring the model understands the specific context of the user's typing.
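Generating the "corrupted" side of an error/correction pair can be as simple as injecting keyboard-adjacent substitutions into clean text. This is a toy sketch of the corruption idea, assuming a hand-picked neighbor map; the actual pipeline uses LLMs to produce richer, more realistic error distributions.

```python
import random

# Tiny illustrative map of QWERTY-adjacent keys (not exhaustive).
QWERTY_NEIGHBORS = {"a": "qws", "e": "wrd", "i": "uok", "o": "ipl", "t": "ryg"}

def corrupt(text, error_rate=0.15, rng=None):
    """Simulate mobile 'fat-finger' typos by swapping some characters for
    adjacent keys; returns a (corrupted, clean) training pair."""
    rng = rng or random.Random(0)
    chars = []
    for ch in text:
        if ch.lower() in QWERTY_NEIGHBORS and rng.random() < error_rate:
            chars.append(rng.choice(QWERTY_NEIGHBORS[ch.lower()]))
        else:
            chars.append(ch)
    return "".join(chars), text
```

Pairs produced this way give an error-correction model supervised examples that are scarce in public corpora, after which federated adaptation tunes the model to each app domain's typing context.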
The success of these techniques in Gboard demonstrates that synthetic data can effectively replace or augment private data throughout the machine learning lifecycle. For developers working with sensitive user information, adopting a "synthetic-first" approach combined with federated learning provides a scalable path to model improvement that adheres to the core principles of data minimization and anonymization.