Google Research / gemma

8 posts


Toward provably private insights into AI use

Google Research has introduced Provably Private Insights (PPI), a framework designed to analyze generative AI usage patterns while providing mathematical guarantees of user privacy. By integrating Large Language Models (LLMs) with differential privacy and trusted execution environments (TEEs), the system enables developers to derive aggregate trends from unstructured data without exposing individual user content. This approach ensures that server-side processing remains limited to privacy-preserving computations that are fully auditable by external parties.

### The Role of LLMs in Structured Summarization

The system employs "data expert" LLMs to transform unstructured generative AI data into actionable, structured insights.

* The framework utilizes open-source Gemma 3 models to perform specific analysis tasks, such as classifying transcripts into topics or identifying user frustration levels.
* This "structured summarization" occurs entirely within a TEE, ensuring that the model processes raw data in an environment inaccessible to human operators or external processes.
* Developers can update LLM prompts frequently to answer new research questions without compromising the underlying privacy architecture.

### Confidential Federated Analytics (CFA) Infrastructure

The PPI system is built upon Confidential Federated Analytics, a technique that isolates data through hardware-based security and cryptographic verification.

* User devices encrypt data and define specific authorized processing steps before uploading it to the server.
* A TEE-hosted key management service only releases decryption keys to processing steps that match public, open-source code signatures.
* System integrity is verified using Rekor, a public, tamper-resistant transparency log that allows external parties to confirm that the code running in the TEE is exactly what was published.

### Anonymization via Differential Privacy

Once the LLM extracts features from the data, the system applies differential privacy (DP) to ensure that the final output does not reveal information about any specific individual.

* The extracted categories are aggregated into histograms, with DP noise added to the final counts to prevent the identification of single users (a minimal sketch of this step follows the summary).
* Because the privacy guarantee is applied at the aggregation stage, the system remains secure even if a developer uses a prompt specifically designed to isolate a single user's data.
* All aggregation algorithms are open-source and reproducibly buildable, allowing for end-to-end verifiability of the privacy claims.

By open-sourcing the PPI stack through the Google Parfait project and deploying it in applications like Pixel Recorder, this framework establishes a new standard for transparent data analysis. Developers should look to integrate similar TEE-based federated analytics to balance the need for product insights with the necessity of provable, hardware-backed user privacy.
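
The histogram step above is where the formal guarantee is enforced: per-transcript categories produced by the "data expert" LLM are counted, and calibrated noise is added before anything leaves the TEE. Below is a minimal sketch of such a noisy-histogram release, assuming each user contributes a single category and a Laplace mechanism; the epsilon value and function names are illustrative and are not the PPI / Parfait implementation.

```python
# Minimal sketch of a differentially private histogram release. Assumes each
# user contributes exactly one category (L1 sensitivity of 1) and uses the
# Laplace mechanism; epsilon and all names are illustrative, not PPI/Parfait code.
import numpy as np

def dp_histogram(per_user_categories, num_categories, epsilon=1.0,
                 rng=np.random.default_rng(0)):
    counts = np.bincount(per_user_categories, minlength=num_categories).astype(float)
    noise = rng.laplace(0.0, 1.0 / epsilon, size=num_categories)  # scale = sensitivity / epsilon
    return np.clip(counts + noise, 0, None)  # only this noisy aggregate leaves the TEE

# Example: topic labels assigned by the "data expert" LLM for five transcripts.
print(dp_histogram([0, 2, 2, 1, 2], num_categories=3))
```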


VaultGemma: The world's most capable differentially private LLM

VaultGemma represents a significant milestone in privacy-preserving AI as the most capable large language model trained from scratch using differential privacy (DP). By establishing new scaling laws specifically for DP training, researchers have optimized the complex trade-offs between compute, privacy budgets, and model utility. The resulting 1-billion-parameter model demonstrates that high-performance generative AI can be achieved while maintaining rigorous mathematical guarantees against data memorization.

## Scaling Laws for Differentially Private Training

* Performance in DP-trained models is primarily governed by the "noise-batch ratio," which measures the amount of random privacy noise relative to the size of the training batches (see the sketch after this summary).
* Research suggests that for any given compute and privacy budget, there exists an optimal training configuration that balances model size, iterations, and batch size to achieve the lowest possible training loss.
* A critical finding indicates that DP training requires a departure from standard scaling practices, favoring significantly larger batch sizes and smaller model architectures than traditional non-DP training.

## Synergies in Privacy, Compute, and Data

* Increasing the privacy budget (epsilon) in isolation leads to diminishing returns unless it is paired with a proportional increase in compute (FLOPs) or data (tokens).
* Visualizations of the scaling laws show that different model sizes can provide similar utility if the number of training iterations and batch sizes are correctly adjusted.
* The optimal configuration shifts between investing in larger models versus more iterations depending on the specific constraints of the data and privacy budgets.

## Training at Scale with Algorithmic Advancements

* VaultGemma is built on the Gemma 2 architecture and utilizes a 1B-parameter setup optimized for the unique constraints of DP.
* To overcome hardware limitations when processing the massive batch sizes required for DP training, the team developed a "Virtual Batch" technique in JAX to aggregate gradients across multiple steps.
* Training from scratch allows the model to outperform traditional DP-finetuned models, which often struggle to balance utility with the noise introduced during the fine-tuning process.

## Performance and Evaluation

* VaultGemma achieves competitive results against standard 1B-parameter models while providing formal privacy protections.
* The model demonstrates superior privacy-utility trade-offs, proving that carefully scaled DP models can retain high levels of reasoning and language capability.
* The release includes the model weights and a comprehensive technical report to assist the community in developing the next generation of private-by-design AI.

VaultGemma provides a practical blueprint for developers who need to balance the power of large language models with strict data confidentiality requirements. By leveraging the provided scaling insights, organizations can now train models that are mathematically resistant to data leakage without sacrificing significant performance.
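
The noise-batch ratio is easiest to see in a generic DP-SGD step: noise is added at a scale fixed by the clipping norm and noise multiplier, so averaging over a larger batch shrinks its relative effect. The sketch below illustrates this with NumPy; the clip norm, noise multiplier, and helper names are assumptions for illustration, not VaultGemma's JAX training code or its "Virtual Batch" implementation.

```python
# Minimal sketch of a DP-SGD gradient estimate and the noise-batch ratio it
# induces. Clip norm, noise multiplier, and names are illustrative assumptions.
import numpy as np

def dp_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1,
                rng=np.random.default_rng(0)):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=clipped[0].shape)
    return noisy_sum / len(per_example_grads)

print(dp_gradient([np.ones(3), np.full(3, 2.0)]))

# The same absolute noise is spread over more examples as the batch grows,
# which is why the DP scaling laws favor very large batches and smaller models.
for batch_size in (1_000, 1_000_000):
    print(batch_size, "noise-batch ratio ~", 1.1 * 1.0 / batch_size)
```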


Speculative cascades — A hybrid approach for smarter, faster LLM inference

Speculative cascades represent a hybrid inference method that integrates the cost-efficiency of model cascades with the latency-reducing benefits of speculative decoding. By utilizing a smaller drafter model to generate token sequences that are verified in parallel by a larger expert model, this approach allows for high-speed generation while maintaining flexible quality standards. The result is a system that achieves superior cost-quality trade-offs and higher speed-ups than either traditional cascading or standard speculative decoding alone.

### Limitations of Cascades and Speculative Decoding

* **Sequential Bottlenecks in Cascades:** Traditional cascades use a deferral rule to decide if a small model can handle a prompt. If the small model is not confident, the system waits for it to finish before starting the large model from scratch, wasting significant time.
* **Strict Matching in Speculative Decoding:** This method requires the large model to verify the small model's tokens. Even if the small model produces a factually correct and high-quality response, the large model will reject the entire draft if the tokens do not match its own preferred output exactly.
* **Trade-off Divergence:** Cascades prioritize reducing computational costs but suffer from latency when deferring, while speculative decoding prioritizes speed but often performs redundant work because it mandates identical output to the larger model.

### The Speculative Cascades Mechanism

* **Parallel Verification with Deferral:** Speculative cascades use the parallel processing of speculative decoding but introduce a flexible decision rule. The system can choose to accept the smaller model's draft even if it differs from the larger model's prediction, provided it meets a confidence threshold.
* **Flexible Token Matching:** Unlike standard speculative decoding, which often relies on strict token-by-token matching, speculative cascades allow for "probabilistic matches" or quality-based acceptance to prevent unnecessary rejections (a minimal sketch of one such rule follows this summary).
* **Resource Optimization:** By strategically deferring to the smaller model for certain segments of the generation, the system reduces the total work required from the expensive expert model without losing the speed of parallel execution.

### Empirical Results and Performance

* **Model Testing:** The approach was validated using Gemma and T5 models across diverse language tasks, including reasoning, coding, translation, and question answering.
* **Superior Trade-offs:** Testing showed that speculative cascades consistently outperformed baselines in cost-quality metrics, providing faster inference without the strict "all-or-nothing" quality constraints of speculative decoding.
* **Task Versatility:** The hybrid method proved effective across both creative tasks (like summarization) and factual tasks (like math or coding), where different levels of "correctness" are acceptable.

Speculative cascades offer a practical path for scaling LLM deployments by balancing the high cost of large models with the need for low-latency user experiences. Developers looking to optimize inference should consider this hybrid approach to capture the efficiency of small models while retaining the oversight of larger, more capable ones.
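
To make the flexible acceptance idea concrete, here is a minimal sketch of one possible confidence-based rule, assuming the large model has already scored every drafted position in a single parallel pass. The threshold `alpha` and the specific rule are illustrative assumptions; the work evaluates several deferral rules, and this is not the authors' implementation.

```python
# Minimal sketch of cascade-style draft verification with a quality-based
# acceptance rule. `alpha` and the rule itself are illustrative assumptions.
import numpy as np

def verify_draft(draft_tokens, target_probs, alpha=0.3):
    """draft_tokens: tokens proposed by the small drafter model.
    target_probs: one large-model distribution over the vocabulary per drafted
    position, all computed in a single parallel verification pass."""
    accepted = []
    for tok, p_target in zip(draft_tokens, target_probs):
        # Strict speculative decoding only keeps tokens the large model would
        # sample itself; a cascade-style rule also keeps a *different but good
        # enough* token when the large model has no strong preference.
        if p_target[tok] >= alpha * p_target.max():
            accepted.append(tok)
        else:
            accepted.append(int(p_target.argmax()))  # defer to the large model
            break
    return accepted

# The large model tolerates token 2 at position 0 but overrides token 1 at position 1.
print(verify_draft([2, 1], np.array([[0.30, 0.10, 0.25, 0.35],
                                     [0.70, 0.10, 0.10, 0.10]])))  # -> [2, 0]
```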


Beyond billion-parameter burdens: Unlocking data synthesis with a conditional generator

The CTCL (Data Synthesis with ConTrollability and CLustering) framework provides a lightweight alternative to the computationally expensive process of fine-tuning billion-parameter models for differentially private synthetic data generation. By utilizing a 140-million-parameter generator and a universal topic model, the system achieves high-quality distribution matching while remaining accessible for resource-constrained applications. This approach allows for the generation of unlimited synthetic samples without incurring additional privacy costs, consistently outperforming existing API-based and large-scale baselines under strict privacy guarantees.

### Pre-training Universal Components

The framework relies on two core components developed using large-scale public corpora, which can be reused across different private domains:

* **CTCL-Topic:** A universal topic model derived from Wikipedia documents. It uses BERTopic to embed and cluster data into approximately 1,000 distinct topics, each represented by 10 descriptive keywords.
* **CTCL-Generator:** A conditional language model based on the 140M-parameter BART-base architecture. It was pre-trained on 430 million description–document pairs from the SlimPajama dataset, with descriptions generated by Gemma-2-2B to ensure the model can generate text based on specific input conditions.

### Learning the Private Domain

Once the universal components are established, the framework learns the specific characteristics of a private dataset through a two-step process:

* **Differentially Private (DP) Histograms:** The system captures high-level distributional information by creating a DP-protected histogram that represents the percentage of each topic present in the private corpus.
* **DP Fine-Tuning:** Each document in the private dataset is associated with its corresponding keywords from the CTCL-Topic model. The CTCL-Generator is then fine-tuned on these keyword-document pairs using differential privacy to ensure individual data points are protected.

### Controllable Data Generation

The final stage involves producing the synthetic dataset by sampling from the fine-tuned generator (a minimal sketch of this flow follows the summary):

* **Proportional Sampling:** The system generates data by targeting the exact topic proportions found in the private domain histogram.
* **Keyword Conditioning:** For each topic, the model uses the associated 10 keywords as input to prompt the DP fine-tuned generator to produce relevant documents.
* **Post-Processing Efficiency:** Because the generator is already fine-tuned with DP, the framework can generate an unlimited number of synthetic samples without further privacy budget expenditure, a significant advantage over iterative selection algorithms.

CTCL offers a highly scalable and efficient solution for organizations needing to synthesize private text data without the infrastructure requirements of massive LLMs. Its ability to maintain topic-wise distribution through keyword conditioning makes it an ideal choice for specialized domains where maintaining the statistical utility of the data is as critical as protecting user privacy.
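
Taken together, the private-domain and generation stages reduce to two small pieces: release a DP topic histogram, then sample topics in those noisy proportions and prompt the DP fine-tuned generator with each topic's keywords. The sketch below illustrates that flow; the Laplace mechanism, epsilon, and the `generator` stand-in are assumptions for illustration rather than CTCL's actual code.

```python
# Minimal sketch of CTCL's private-domain flow: DP topic histogram, then
# proportional, keyword-conditioned generation. The Laplace/epsilon choices and
# the `generator` callable are illustrative assumptions.
import numpy as np

def dp_topic_proportions(topic_counts, epsilon=1.0, rng=np.random.default_rng(0)):
    """Each private document falls into exactly one topic, so adding
    Laplace(1/epsilon) noise to each count is differentially private."""
    counts = np.asarray(topic_counts, dtype=float)
    noisy = np.clip(counts + rng.laplace(0.0, 1.0 / epsilon, size=counts.shape), 0, None)
    return noisy / noisy.sum()

def synthesize(proportions, topic_keywords, n_samples, generator,
               rng=np.random.default_rng(0)):
    """Sample topics in their noisy private proportions and condition the
    DP fine-tuned generator on each topic's keywords."""
    topics = rng.choice(len(proportions), size=n_samples, p=proportions)
    return [generator(", ".join(topic_keywords[t])) for t in topics]

# Toy usage with a stand-in generator (real topics carry 10 keywords each):
docs = synthesize(dp_topic_proportions([40, 10, 50]),
                  [["invoice", "tax"], ["flight", "hotel"], ["recipe", "oven"]],
                  n_samples=3, generator=lambda kw: f"<document about: {kw}>")
print(docs)
```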


REGEN: Empowering personalized recommendations with natural language

Google Research has introduced REGEN, a benchmark dataset designed to evolve recommender systems from simple item predictors into conversational agents capable of natural language interaction. By augmenting the Amazon Product Reviews dataset with synthetic critiques and narratives using Gemini 1.5 Flash, the researchers provide a framework for training models to understand user feedback and explain their suggestions. The study demonstrates that integrating natural language critiques significantly improves recommendation accuracy while enabling models to generate personalized, context-aware content.

### Composition of the REGEN Dataset

* The dataset enriches the existing Amazon Product Reviews archive by adding synthetic conversational elements, specifically targeting the gap in datasets that support natural language feedback.
* **Critiques** are generated for similar item pairs within hierarchical categories, allowing users to guide the system by requesting specific changes, such as a different color or increased storage.
* **Narratives** provide contextual depth through purchase reasons, product endorsements, and concise user summaries, helping the system justify its recommendations to the end user.

### Unified Generative Modeling Approaches

* The researchers framed a "jointly generative" task where models must process a purchase history and optional critique to output both a recommended item ID and a supporting narrative.
* The **FLARE (Hybrid)** architecture uses a sequential recommender for item prediction based on collaborative filtering, which then feeds into a Gemma 2B LLM to generate the final text narrative.
* The **LUMEN (Unified)** model functions as an end-to-end system where item IDs and text tokens are integrated into a single vocabulary, allowing one LLM to handle critiques, recommendations, and narratives simultaneously.

### Performance and Impact of User Feedback

* Incorporating natural language critiques consistently improved recommendation metrics across different architectures, demonstrating that language-guided refinement is a powerful tool for accuracy.
* In the Office domain, the FLARE hybrid model's Recall@10, a measure of how often the desired item appears in the top 10 results (sketched after this summary), increased from 0.124 to 0.1402 when critiques were included.
* Results indicate that models trained on REGEN can achieve performance comparable to state-of-the-art specialized recommenders while maintaining high-quality natural language generation.

The REGEN dataset and the accompanying LUMEN architecture provide a path forward for building more transparent and interactive AI assistants. For developers and researchers, utilizing these conversational benchmarks is essential for moving beyond "black box" recommendations toward systems that can explain their logic and adapt to specific user preferences in real time.
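
For reference, the Recall@10 numbers quoted above follow the standard definition of the metric; the toy data below is only for illustration.

```python
# Recall@K: the fraction of test interactions whose ground-truth item appears
# in the model's top-K recommendations (K=10 for the REGEN figures above).
def recall_at_k(ranked_lists, targets, k=10):
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_lists, targets))
    return hits / len(targets)

# Toy example: the desired item is ranked in the top 10 for two of three users.
print(recall_at_k([["pen", "desk"], ["chair"], ["lamp"]],
                  ["desk", "stapler", "lamp"]))  # 0.666...
```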


Google Research at Google I/O 2025

Google Research at I/O 2025 showcases the "research to reality" transition, highlighting how years of foundational breakthroughs are now being integrated into Gemini models and specialized products. By focusing on multimodal capabilities, pedagogy, and extreme model efficiency, Google aims to democratize access to advanced AI while ensuring it remains grounded and useful across global contexts.

## Specialized Healthcare Models: MedGemma and AMIE

* **MedGemma:** This new open model, based on Gemma 3, is optimized for multimodal medical tasks such as radiology image analysis and clinical data summarization. It is available in 4B and 27B sizes, performing similarly to much larger models on the MedQA benchmark while remaining small enough for efficient local fine-tuning.
* **AMIE (Articulate Medical Intelligence Explorer):** A research AI agent designed for diagnostic medical reasoning. Its latest multimodal version can now interpret and reason about visual medical information, such as skin lesions or medical imaging, to assist clinicians in diagnostic accuracy.

## Educational Optimization through LearnLM

* **Gemini 2.5 Pro Integration:** The LearnLM family of models, developed with educational experts, is now integrated into Gemini 2.5 Pro. This fine-tuning enhances STEM reasoning, multimodal understanding, and pedagogical feedback.
* **Interactive Learning Tools:** A new research-optimized quiz experience allows students to generate custom assessments from their own notes, providing specific feedback on right and wrong answers rather than just providing solutions.
* **Global Assessment Pilots:** Through partnerships like the one with Kayma, Google is testing the automatic assessment of short- and long-form content in regions like Ghana to scale quality educational tools.

## Multilingual Expansion and On-Device Gemma Models

* **Gemma 3 and 3n:** Research breakthroughs have expanded Gemma 3's support to over 140 languages. The introduction of **Gemma 3n** targets extreme efficiency, capable of running on devices with as little as 2GB of RAM while maintaining low latency and low energy consumption.
* **ECLeKTic Benchmark:** To assist the developer community, Google introduced this novel benchmark specifically for evaluating how well large language models transfer knowledge across different languages.

## Model Efficiency and Factuality in Search

* **Inference Techniques:** Google Research continues to set industry standards for model speed and accessibility through technical innovations like **speculative decoding** and **cascades**, which reduce the computational cost of generating high-quality responses.
* **Grounded Outputs:** Significant focus remains on factual consistency, ensuring that the AI models powering features like AI Overviews in Search provide reliable and grounded information to users.

As Google continues to shrink the gap between laboratory breakthroughs and consumer products, the emphasis remains on making high-performance AI accessible on low-cost hardware and across diverse linguistic landscapes. Developers and researchers can now leverage these specialized tools via platforms like HuggingFace and Vertex AI to build more targeted, efficient applications.


Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis

Cell2Sentence-Scale (C2S-Scale) is a new family of open-source large language models designed to transform complex single-cell transcriptomic data into a text-based format accessible to natural language processing. By representing gene expression profiles as "cell sentences," the framework allows researchers to use general-purpose LLM architectures to "read" and "write" biological information. This approach simplifies single-cell analysis, enabling conversational queries and automated data interpretation that were previously limited to specialized tools and expert users.

### The Cell2Sentence Mapping Method

* Translates single-cell RNA sequencing (scRNA-seq) measurements into sequences of text by ordering gene names according to their expression levels (a minimal sketch follows this summary).
* Enables the integration of cellular data with text-based biological context, such as cell types, experimental metadata, and scientific literature.
* Leverages the existing vocabulary of biology (gene names and functions) to make high-dimensional data interpretable by standard language model tokenizers.

### C2S-Scale Model Architecture and Training

* Built upon Google's Gemma open model family, maintaining the original architecture to benefit from existing scalability and infrastructure.
* Trained on a dataset exceeding 1 billion tokens derived from real-world transcriptomic data and biological metadata.
* Features a range of model sizes from 410 million to 27 billion parameters, allowing researchers to choose between computational efficiency for exploratory work and high performance for complex tasks.

### Functional Applications in Biology

* **Conversational Querying:** Researchers can interact with data through natural language to ask specific questions, such as predicting how a T cell might respond to a particular cancer therapy.
* **Automated Interpretation:** The models can generate biological summaries of experiments, describing everything from individual cell types to the characteristics of entire tissues.
* **Predictive Tasks:** The framework handles diverse tasks including cell type annotation and the generation of synthetic cells or tissues for research simulations.

### Performance and Biological Scaling Laws

* Research demonstrates that biological language models follow predictable scaling laws, where performance in tasks like cell type annotation improves as model size increases.
* Larger models show superior gene overlap and semantic similarity scores when interpreting datasets compared to smaller versions.
* Smaller models remain highly effective for parameter-efficient fine-tuning in resource-constrained environments.

C2S-Scale is available as an open-source resource on GitHub and HuggingFace, offering a flexible toolkit for the research community to apply large language models to next-generation genomic discovery.
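
The mapping itself is simple enough to show directly: rank a cell's genes by expression and emit the gene names in that order. The sketch below uses made-up expression values and a hypothetical `top_k` cutoff; the real pipeline's normalization and gene-count choices follow the Cell2Sentence papers rather than this toy.

```python
# Minimal sketch of the Cell2Sentence mapping: one cell's expression profile
# becomes a "cell sentence" of gene names ordered from most to least expressed.
# Gene names, values, and the top_k cutoff are toy assumptions.
def cell_to_sentence(expression, top_k=None):
    """expression: dict mapping gene name -> expression level for a single cell."""
    ranked = sorted((g for g, v in expression.items() if v > 0),
                    key=lambda g: expression[g], reverse=True)
    return " ".join(ranked[:top_k] if top_k else ranked)

print(cell_to_sentence({"CD3D": 12.0, "CD8A": 7.5, "GZMB": 3.1, "INS": 0.0}))
# -> "CD3D CD8A GZMB"
```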


Generating synthetic data with differentially private LLM inference

Researchers at Google have developed an inference-only method for generating differentially private (DP) synthetic data that avoids the high costs and data requirements associated with private fine-tuning. By prompting off-the-shelf large language models (LLMs) with sensitive examples in parallel and aggregating their outputs, the approach can generate thousands of high-quality synthetic data points while maintaining rigorous privacy guarantees. This method allows synthetic data to serve as a secure interface for model development, enabling teams to collaborate without requiring specialized knowledge of differential privacy.

## Differentially Private Prediction and Aggregation

The core of this method relies on "private prediction," where privacy is applied to the model's output rather than the model itself.

* Sensitive data points are distributed across multiple independent prompts, ensuring that no single individual's record can significantly influence the final output.
* The LLM generates next-token predictions for each prompt in parallel, which are then aggregated to mask individual contributions.
* The researchers designed a DP token sampling algorithm that treats the standard LLM "softmax" sampling process as a version of the exponential mechanism, a mathematical framework used to select the best option from a set while maintaining privacy (a minimal sketch follows this summary).

## Enhancing Efficiency via KV Caching

Previous attempts at private prediction were computationally expensive because they required a fresh batch of sensitive examples for every single token generated.

* A new privacy analysis allows the system to reuse a fixed batch of sensitive examples across an entire generation sequence.
* By maintaining the same context for each generation step, the system becomes compatible with standard inference optimization techniques like KV (Key-Value) caching.
* This improvement enables the generation of synthetic data at a scale two to three orders of magnitude larger than prior methods.

## Optimizing Privacy Spend with Public Drafters

To preserve the "privacy budget" (the limited amount of information that can be released before privacy is compromised), the method introduces a public drafter model.

* The drafter model predicts the next token based solely on previously generated synthetic text, without ever seeing the sensitive data.
* Using the sparse vector technique, the system only consumes the privacy budget when the public drafter's suggestion disagrees with the private aggregate of the sensitive data.
* This is particularly useful for structured data, where the drafter can handle formatting and syntax tokens, saving the privacy budget for the actual content.

By leveraging off-the-shelf models like Gemma, this approach provides a scalable way to transform sensitive datasets into useful synthetic versions. These synthetic datasets are high-quality enough to replace real data in downstream machine learning tasks, such as in-context learning or fine-tuning models like BERT, without the risk of leaking individual user information.
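
The private-prediction step can be pictured as follows: next-token scores from the parallel prompts (each holding different sensitive examples) are averaged, and a token is drawn with the exponential mechanism, which here amounts to softmax sampling at a privacy-controlled temperature. The sensitivity bound and per-token epsilon handling below are simplifying assumptions rather than the paper's exact accounting, and they ignore the public-drafter optimization.

```python
# Minimal sketch of private prediction for one token: aggregate per-prompt
# next-token log-probabilities and sample via the exponential mechanism
# (softmax at an epsilon-controlled temperature). Sensitivity and epsilon
# handling are simplifying assumptions, not the paper's accounting.
import numpy as np

def dp_next_token(per_prompt_logprobs, epsilon, sensitivity,
                  rng=np.random.default_rng(0)):
    """per_prompt_logprobs: array [num_prompts, vocab_size]; each row comes
    from the LLM conditioned on a disjoint subset of sensitive examples."""
    scores = per_prompt_logprobs.mean(axis=0)         # aggregate contributions
    scaled = epsilon * scores / (2.0 * sensitivity)   # exponential mechanism
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy example: three parallel prompts over a four-token vocabulary.
logps = np.log(np.array([[0.7, 0.1, 0.1, 0.1],
                         [0.6, 0.2, 0.1, 0.1],
                         [0.5, 0.3, 0.1, 0.1]]))
print(dp_next_token(logps, epsilon=2.0, sensitivity=1.0))
```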