Google Research / nlp

10 posts


Gemini provides automated feedback for theoretical computer scientists at STOC 2026

Google Research launched an experimental program for the STOC 2026 conference using a specialized Gemini model to provide automated, rigorous feedback on theoretical computer science submissions. By identifying critical logical errors and proof gaps within a 24-hour window, the tool demonstrated that advanced AI can serve as a powerful pre-vetting collaborator for high-level mathematical research. The overwhelmingly positive reception from authors indicates that AI can effectively augment the human peer-review process by improving paper quality before formal submission.

## Advanced Reasoning via Inference Scaling

- The tool utilized an advanced version of Gemini 2.5 Deep Think specifically optimized for mathematical rigor.
- It employed inference scaling methods, allowing the model to explore and combine multiple possible solutions and reasoning traces simultaneously.
- This non-linear approach to problem-solving helps the model focus on the most salient technical issues while significantly reducing the likelihood of hallucinations.

## Structured Technical Feedback

- Feedback was delivered in a structured format that included a high-level summary of the paper's core contributions (a minimal sketch of such a structure follows this summary).
- The model provided a detailed analysis of potential mistakes, specifically targeting errors within lemmas, theorems, and logical proofs.
- Authors also received a categorized list of minor corrections, such as inconsistent variable naming and typographical errors.

## Identified Technical Issues and Impact

- The pilot saw high engagement, with over 80% of STOC 2026 submitters opting in for the AI-generated review.
- The tool successfully identified "critical bugs" and calculation errors that had previously evaded human authors for months.
- Survey results showed that 97% of participants found the feedback helpful, and 81% reported that the tool improved the overall clarity and readability of their work.

## Expert Verification and Hallucinations

- Because the users were domain experts, they were able to act as a filter, distinguishing between deep technical insights and occasional model hallucinations.
- While the model sometimes struggled to parse complex notation or interpret figures, authors valued the "neutral tone" and the speed of the two-day turnaround.
- The feedback was used as a starting point for human verification, allowing researchers to refine their arguments rather than blindly following the model's output.

## Future Outlook and Educational Potential

- Beyond professional research, 75% of surveyed authors see significant educational value in using the tool to train students in mathematical rigor.
- The experiment's success has led to 88% of participants expressing interest in having continuous access to such a tool throughout their entire research and drafting process.

The success of the STOC 2026 pilot suggests that researchers should consider integrating specialized LLMs early in the drafting phase to catch "embarrassing" or logic-breaking errors. While the human expert remains the final arbiter of truth, these tools provide a necessary layer of automated verification that can accelerate the pace of scientific discovery.
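The post does not publish the exact format of the generated reviews, but the three-part structure it describes (a contribution summary, an analysis of potential mistakes in lemmas and proofs, and categorized minor corrections) can be captured in a simple container. The sketch below is illustrative only; all class and field names are assumptions rather than part of the actual system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PotentialMistake:
    statement: str   # the lemma, theorem, or claim in question
    issue: str       # suspected gap, logical error, or miscalculation
    severity: str    # e.g., "critical" vs. "needs clarification"

@dataclass
class MinorCorrection:
    category: str    # e.g., "inconsistent variable naming", "typo"
    location: str    # e.g., "Section 3, proof of Lemma 2"
    note: str

@dataclass
class PaperFeedback:
    summary: str     # high-level summary of the paper's core contributions
    potential_mistakes: List[PotentialMistake] = field(default_factory=list)
    minor_corrections: List[MinorCorrection] = field(default_factory=list)

    def to_report(self) -> str:
        """Render the three sections as a plain-text report for the authors."""
        lines = ["## Summary", self.summary, "", "## Potential mistakes"]
        lines += [f"- [{m.severity}] {m.statement}: {m.issue}" for m in self.potential_mistakes]
        lines += ["", "## Minor corrections"]
        lines += [f"- ({c.category}) {c.location}: {c.note}" for c in self.minor_corrections]
        return "\n".join(lines)
```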


AfriMed-QA: Benchmarking large language models for global health

AfriMed-QA is a comprehensive benchmarking suite designed to address the critical gap in medical LLM evaluation for African healthcare contexts. Developed through a partnership between Google Research and a pan-African consortium, the project demonstrates that current models often struggle with geographic distribution shifts in disease and localized linguistic nuances. The researchers conclude that diverse, region-specific datasets are essential for training equitable AI tools that can safely provide clinical decision support in low-resource settings.

## Limitations of Western-Centric Benchmarks

* Existing medical benchmarks like USMLE MedQA focus on Western clinical contexts, which may not generalize to other regions.
* Models trained on traditional datasets often fail to account for specific distribution shifts in disease types and cultural symptom descriptions.
* The lack of diverse data makes it difficult to assess how LLMs handle variations in language and linguistics, even when the primary language is English.

## The AfriMed-QA Dataset Composition

* The dataset contains approximately 15,000 clinically diverse questions and answers sourced from 16 African countries.
* It covers 32 medical specialties, ranging from neurosurgery and internal medicine to infectious diseases and obstetrics.
* The content is divided into three distinct formats: 4,000+ expert multiple-choice questions (MCQs), 1,200 open-ended short-answer questions (SAQs), and 10,000 consumer-style queries.
* Data was crowdsourced from 621 contributors across 60 medical schools to ensure a broad representation of the continent's medical landscape.

## Data Collection and Curation Methodology

* Researchers adapted a specialized web-based platform, originally built by Intron Health, to facilitate large-scale crowdsourcing across different regions.
* To protect privacy, consumer queries were generated by prompting users with specific disease scenarios rather than asking for personal health information.
* The curation process included custom user interfaces for quality reviews and blinded human evaluations by clinical experts to ensure the accuracy of reference answers.

## LLM Performance and Evaluation Results

* The study benchmarked 30 general and biomedical LLMs, evaluating them for accuracy, semantic similarity, and human preference (a minimal accuracy-evaluation sketch follows this summary).
* A significant performance gap exists between model sizes; larger models consistently outperformed smaller models on the AfriMed-QA benchmark.
* This trend highlights a challenge for low-resource settings, where smaller, specialized models are often preferred for on-device or edge deployment due to infrastructure constraints.
* The dataset has already been utilized to improve Google’s MedGemma, demonstrating its utility in training multimodal medical models.

The AfriMed-QA benchmark datasets and evaluation code have been open-sourced on Hugging Face and GitHub to support the global research community. Developers are encouraged to use these tools to build and refine medical AI that is more inclusive and effective for the Global South.
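Because the benchmark is released on Hugging Face, a typical use is to score a model's accuracy on the expert MCQs. The snippet below is a minimal sketch using the Hugging Face `datasets` library; the dataset identifier and column names are placeholders, so check the official AfriMed-QA repository for the actual schema.

```python
from datasets import load_dataset

DATASET_ID = "your-org/afrimed-qa-mcq"   # placeholder; use the official AfriMed-QA repo name

def mcq_accuracy(answer_fn, split="test", limit=None):
    """Score a model on multiple-choice questions.

    `answer_fn(question, options)` is any callable (e.g., a wrapper around an
    LLM API) that returns the chosen option letter, such as "A".
    """
    data = load_dataset(DATASET_ID, split=split)
    if limit is not None:
        data = data.select(range(limit))
    correct = 0
    for row in data:
        prediction = answer_fn(row["question"], row["options"])
        correct += int(prediction.strip().upper() == row["answer"].strip().upper())
    return correct / max(len(data), 1)
```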


A scalable framework for evaluating health language models

Researchers at Google have developed a scalable framework for evaluating health-focused language models by replacing subjective, high-complexity rubrics with granular, binary criteria. This "Adaptive Precise Boolean" approach addresses the high costs and low inter-rater reliability typically associated with expert-led evaluation in specialized medical domains. By dynamically filtering rubric questions based on context, the framework significantly improves both the speed and precision of model assessments.

## Limitations of Traditional Evaluation

* Current evaluation practices for health LLMs rely heavily on human experts, making them cost-prohibitive and difficult to scale.
* Standard tools, such as Likert scales (e.g., 1-5 ratings) or open-ended text, often lead to subjective interpretations and low inter-rater consistency.
* Evaluating complex, personalized health data requires a level of detail that traditional broad-scale rubrics fail to capture accurately.

## Precise Boolean Rubrics

* The framework "granularizes" complex evaluation targets into a larger set of focused, binary (Yes/No) questions.
* This format reduces ambiguity by forcing raters to make definitive judgments on specific aspects of a model's response.
* By removing the middle ground found in multi-point scales, the framework produces a more robust and actionable signal for programmatic model refinement.

## The Adaptive Filtering Mechanism

* To prevent the high volume of binary questions from overwhelming human raters, the researchers introduced an "Adaptive" layer.
* The framework uses the Gemini model as a zero-shot classifier to analyze the user query and LLM response, identifying only the most relevant rubric questions (see the sketch after this summary).
* This data-driven adaptation ensures that human experts only spend time on pertinent criteria, resulting in "Human-Adaptive Precise Boolean" rubrics.

## Performance and Reliability Gains

* The methodology was validated in the domain of metabolic health, covering topics like diabetes, obesity, and cardiovascular disease.
* The Adaptive Precise Boolean approach reduced human evaluation time by over 50% compared to traditional Likert-scale methods.
* Inter-rater reliability, measured through intra-class correlation coefficients (ICC), was significantly higher than the baseline, showing that simpler scoring can provide a higher-quality signal.

This framework demonstrates that breaking down complex medical evaluations into simple, machine-filtered binary questions is a more efficient path toward safe and accurate health AI. Organizations developing domain-specific models should consider adopting adaptive binary rubrics to balance the need for expert oversight with the requirements of large-scale model iteration.
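A minimal sketch of the two-stage flow described above: a zero-shot classifier first selects the rubric questions relevant to a given query/response pair, and a rater then answers only those as Yes/No. The callables and the fraction-of-Yes aggregation are illustrative assumptions, not the paper's exact implementation.

```python
def adaptive_precise_boolean_score(query, response, rubric, is_relevant, judge):
    """Filter a binary rubric adaptively, then score the response.

    rubric      : list of focused Yes/No criterion strings
    is_relevant : callable(query, response, criterion) -> bool
                  (e.g., an LLM used as a zero-shot relevance classifier)
    judge       : callable(query, response, criterion) -> bool
                  (a human expert or automated rater giving the Yes/No verdict)
    """
    relevant = [c for c in rubric if is_relevant(query, response, c)]
    if not relevant:
        return None, []                      # nothing applicable to grade
    verdicts = [(c, judge(query, response, c)) for c in relevant]
    score = sum(ok for _, ok in verdicts) / len(verdicts)   # fraction answered "Yes"
    return score, verdicts
```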


Enabling physician-centered oversight for AMIE

Guardrailed-AMIE (g-AMIE) is a diagnostic AI framework designed to perform patient history-taking while strictly adhering to safety guardrails that prevent it from providing direct medical advice. By decoupling data collection from clinical decision-making, the system enables an asynchronous oversight model where primary care physicians (PCPs) review and finalize AI-generated medical summaries. In virtual clinical trials, g-AMIE’s diagnostic outputs and patient communications were preferred by overseeing physicians and patient actors over human-led control groups.

## Multi-Agent Architecture and Guardrails

* The system utilizes a multi-agent setup powered by Gemini 2.0 Flash, consisting of a dialogue agent, a guardrail agent, and a SOAP note agent (a minimal sketch of this flow follows the summary).
* The dialogue agent conducts history-taking in three distinct phases: general information gathering, targeted validation of a differential diagnosis, and a conclusion phase for patient questions.
* A dedicated guardrail agent monitors and rephrases responses in real time to ensure the AI abstains from sharing individualized diagnoses or treatment plans directly with the patient.
* The SOAP note agent employs sequential multi-step generation to separate summarization tasks (Subjective and Objective) from more complex inferential tasks (Assessment and Plan).

## The Clinician Cockpit and Asynchronous Oversight

* To facilitate human review, researchers developed the "clinician cockpit," a web interface co-designed with outpatient physicians through semi-structured interviews.
* The interface is structured around the standard SOAP note format, presenting the patient’s perspective, measurable data, differential diagnosis, and proposed management strategy.
* This framework allows overseeing PCPs to review cases asynchronously, editing the AI’s proposed differential diagnoses and management plans before sharing a final message with the patient.
* The separation of history-taking from decision-making ensures that licensed medical professionals retain ultimate accountability for patient care.

## Performance Evaluation via Virtual OSCE

* The system was evaluated in a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) involving 60 case scenarios.
* g-AMIE’s performance was compared against primary care physicians, nurse practitioners, and physician assistants who were required to operate under the same restrictive guardrails.
* Overseeing PCPs and independent physician raters preferred g-AMIE’s diagnostic accuracy and management plans over those of the human control groups.
* Patient actors reported a preference for the messages generated by g-AMIE compared to those drafted by human clinicians in the study.

While g-AMIE demonstrates high potential for human-AI collaboration in diagnostics, the researchers emphasize that results should be interpreted with caution. The workflow was specifically optimized for AI characteristics, and human clinicians may require specialized training to perform effectively within such highly regulated guardrail frameworks.
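The post describes the agent roles but not their code; the following is a minimal sketch of how a single guarded intake turn and the two-step SOAP note drafting could be wired together. The function names, prompts, and the assumption that each agent is a plain text-in/text-out callable are all illustrative.

```python
def guarded_intake_turn(patient_message, history, dialogue_agent, guardrail_agent):
    """One history-taking turn: the dialogue agent drafts a reply, and the
    guardrail agent rewrites it so no individualized diagnosis or treatment
    advice reaches the patient."""
    history = history + [("patient", patient_message)]
    draft = dialogue_agent(history)
    safe_reply = guardrail_agent(draft)
    return history + [("g-amie", safe_reply)], safe_reply

def draft_soap_note(transcript, soap_agent):
    """Sequential multi-step generation: summarize Subjective/Objective first,
    then infer the Assessment/Plan for asynchronous clinician review."""
    s_o = soap_agent("Summarize the Subjective and Objective sections of this "
                     "intake transcript:\n" + transcript)
    a_p = soap_agent("Given these Subjective/Objective sections, propose an "
                     "Assessment and Plan for physician review only:\n" + s_o)
    return s_o, a_p
```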


REGEN: Empowering personalized recommendations with natural language

Google Research has introduced REGEN, a benchmark dataset designed to evolve recommender systems from simple item predictors into conversational agents capable of natural language interaction. By augmenting the Amazon Product Reviews dataset with synthetic critiques and narratives using Gemini 1.5 Flash, the researchers provide a framework for training models to understand user feedback and explain their suggestions. The study demonstrates that integrating natural language critiques significantly improves recommendation accuracy while enabling models to generate personalized, context-aware content.

### Composition of the REGEN Dataset

* The dataset enriches the existing Amazon Product Reviews archive by adding synthetic conversational elements, specifically targeting the gap in datasets that support natural language feedback.
* **Critiques** are generated for similar item pairs within hierarchical categories, allowing users to guide the system by requesting specific changes, such as a different color or increased storage.
* **Narratives** provide contextual depth through purchase reasons, product endorsements, and concise user summaries, helping the system justify its recommendations to the end-user.

### Unified Generative Modeling Approaches

* The researchers framed a "jointly generative" task where models must process a purchase history and optional critique to output both a recommended item ID and a supporting narrative.
* The **FLARE (Hybrid)** architecture uses a sequential recommender for item prediction based on collaborative filtering, which then feeds into a Gemma 2B LLM to generate the final text narrative.
* The **LUMEN (Unified)** model functions as an end-to-end system where item IDs and text tokens are integrated into a single vocabulary, allowing one LLM to handle critiques, recommendations, and narratives simultaneously.

### Performance and Impact of User Feedback

* Incorporating natural language critiques consistently improved recommendation metrics across different architectures, demonstrating that language-guided refinement is a powerful tool for accuracy.
* In the Office domain, the FLARE hybrid model's Recall@10 (a measure of how often the desired item appears in the top 10 results) increased from 0.124 to 0.1402 when critiques were included; a minimal Recall@k sketch follows this summary.
* Results indicate that models trained on REGEN can achieve performance comparable to state-of-the-art specialized recommenders while maintaining high-quality natural language generation.

The REGEN dataset and the accompanying LUMEN architecture provide a path forward for building more transparent and interactive AI assistants. For developers and researchers, utilizing these conversational benchmarks is essential for moving beyond "black box" recommendations toward systems that can explain their logic and adapt to specific user preferences in real time.
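For reference, the Recall@10 figures quoted above are averages of a simple per-example check: whether the held-out target item appears among the model's top ten recommendations. A minimal sketch:

```python
def recall_at_k(ranked_items, target_item, k=10):
    """Return 1.0 if the ground-truth item appears in the top-k recommendations, else 0.0."""
    return float(target_item in ranked_items[:k])

def mean_recall_at_k(ranked_lists, targets, k=10):
    """Average Recall@k over an evaluation set (e.g., the 0.124 -> 0.1402 figures above)."""
    scores = [recall_at_k(r, t, k) for r, t in zip(ranked_lists, targets)]
    return sum(scores) / max(len(scores), 1)
```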


Making complex text understandable: Minimally-lossy text simplification with Gemini

Google Research has introduced a novel system using Gemini models to perform minimally-lossy text simplification, a process designed to enhance readability while meticulously preserving original meaning and nuance. By utilizing an automated, iterative prompt-refinement loop, the system optimizes LLM instructions to achieve high-fidelity paraphrasing that avoids the information loss typical of standard summarization. A large-scale randomized study confirms that this approach significantly improves user comprehension across complex domains like law and medicine while simultaneously reducing cognitive load for the reader.

## Automated Evaluation and Fidelity Assessment

* The system moves beyond traditional metrics like Flesch-Kincaid by using a Gemini-powered 1-10 readability scale that aligns more closely with human judgment and comprehension ease.
* Fidelity is maintained through a specialized process using Gemini 1.5 Pro that maps specific claims from the original source text directly to the simplified output.
* This mapping method identifies and weights specific error types, such as information loss, unnecessary gains, or factual distortions, to ensure the output remains a faithful representation of the technical original.

## Iterative Prompt Optimization Loop

* To overcome the limitations and slow pace of manual prompt engineering, the researchers implemented a feedback loop where Gemini models optimize their own instructions (a minimal sketch of such a loop follows this summary).
* In this "LLMs optimizing LLMs" setup, Gemini 1.5 Pro analyzes the performance of simplification prompts and proposes refinements based on automated readability and fidelity scores.
* The optimization process ran for 824 iterations before performance plateaued, allowing the system to autonomously discover highly effective strategies for simplifying text without sacrificing detail.

## Validating Impact through Randomized Studies

* The effectiveness of the model was validated with 4,563 participants across 31 diverse text excerpts covering specialized fields like aerospace, philosophy, finance, and biology.
* The study utilized a randomized complete block design to compare the original text against simplified versions, measuring outcomes through nearly 50,000 multiple-choice question responses.
* Beyond accuracy, researchers measured cognitive effort using the NASA Task Load Index and tracked self-reported user confidence to ensure the simplification actually lowered the barrier to understanding.

This technology provides a scalable method for democratizing access to specialist knowledge by making expert-level discourse understandable to a general audience. The system is currently available as the "Simplify" feature within the Google app for iOS, offering a practical tool for users navigating complex digital information.
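The optimization loop itself is straightforward to picture: generate simplifications with the current prompt, score them with the automated readability and fidelity raters, and ask an optimizer model to propose a revised prompt. The sketch below assumes all five callables and a simple score combination; the actual system's scoring and stopping criteria are not published here.

```python
def optimize_simplification_prompt(seed_prompt, texts, simplify_fn, readability_fn,
                                   fidelity_fn, propose_fn, iterations=50):
    """Minimal sketch of an "LLMs optimizing LLMs" prompt-refinement loop.

    simplify_fn(prompt, text)      -> simplified text (the simplification model)
    readability_fn(text)           -> 1-10 readability score from an LLM judge
    fidelity_fn(original, output)  -> penalty for lost, added, or distorted claims
    propose_fn(prompt, scores)     -> a refined prompt suggested by the optimizer LLM
    All callables and the readability-minus-penalty objective are assumptions.
    """
    best_prompt, best_score = seed_prompt, float("-inf")
    prompt = seed_prompt
    for _ in range(iterations):
        outputs = [simplify_fn(prompt, t) for t in texts]
        readability = sum(readability_fn(o) for o in outputs) / len(outputs)
        fidelity_penalty = sum(fidelity_fn(t, o) for t, o in zip(texts, outputs)) / len(outputs)
        score = readability - fidelity_penalty
        if score > best_score:
            best_prompt, best_score = prompt, score
        prompt = propose_fn(prompt, {"readability": readability,
                                     "fidelity_penalty": fidelity_penalty})
    return best_prompt, best_score
```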


Amplify Initiative: Localized data for globalized AI

The Amplify Initiative by Google Research addresses the critical lack of linguistic and cultural diversity in generative AI training data by establishing an open, community-based platform for localized data collection. By partnering with regional experts to co-create structured, high-quality datasets, the initiative aims to ensure AI models are both representative and effective in solving local challenges across health, finance, and education. This approach shifts data collection from a top-down model to a participatory framework that prioritizes responsible, locally respectful practices in the Global South.

## The Amplify Platform Framework

The initiative is designed to bridge the gap between global AI capabilities and local needs through three core pillars:

* **Participatory Co-creation:** Researchers and local communities collaborate to define specific data needs, ensuring the resulting datasets address region-specific problems like financial literacy or localized health misinformation.
* **Open Access for Innovation:** The platform provides high-quality, multilingual datasets suitable for fine-tuning and evaluating models, specifically empowering developers in the Global South to build tools for their own communities.
* **Author Recognition:** Contributors receive tangible rewards, including professional certificates, research acknowledgments, and data authorship attribution, creating a sustainable ecosystem for expert participation.

## Pilot Implementation in Sub-Saharan Africa

To test the methodology, Google Research partnered with Makerere University’s AI Lab in Uganda to conduct an on-the-ground pilot program.

* **Expert Onboarding:** The program trained 259 experts across Ghana, Kenya, Malawi, Nigeria, and Uganda through a combination of in-person workshops and app-based modules.
* **Dataset Composition:** The pilot resulted in 8,091 annotated adversarial queries across seven languages, covering salient domains such as education and finance.
* **Adversarial Focus:** By focusing on adversarial queries, the team captured localized nuances of potential AI harms, including regional stereotypes and specialized advice that generic models often miss.

## Technical Workflow and App-Based Methodology

The initiative utilizes a structured technical pipeline to scale data collection while maintaining high quality and privacy.

* **Privacy-Preserving Android App:** A dedicated app serves as the primary interface for training, data creation, and annotation, allowing experts to contribute from their own environments.
* **Automated Validation:** The app includes built-in feedback loops that use automated checks to ensure queries are relevant and to prevent the submission of semantically similar or duplicate entries (a generic similarity-check sketch follows this summary).
* **Domain-Specific Annotation:** Experts are provided with specialized annotation topics tailored to their professional backgrounds, ensuring that the metadata for each query is technically accurate and contextually relevant.

The Amplify Initiative provides a scalable blueprint for building inclusive AI by empowering experts in the Global South to define their own data needs. As the project expands to India and Brazil, it offers a vital resource for developers seeking to fine-tune models for local contexts and improve the safety and relevance of AI on a global scale.
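One of the app's automated checks, rejecting semantically near-duplicate queries, can be approximated with embedding similarity. This is a generic sketch, not the Amplify app's actual logic; the embedding model, normalization, and threshold are assumptions.

```python
import numpy as np

def is_near_duplicate(new_vec, existing_vecs, threshold=0.9):
    """Flag a submitted query whose embedding is too close to an accepted one.

    Assumes unit-normalized embedding vectors, so the dot product equals
    cosine similarity; the 0.9 threshold is illustrative.
    """
    if len(existing_vecs) == 0:
        return False
    sims = np.asarray(existing_vecs) @ np.asarray(new_vec)
    return bool(sims.max() >= threshold)
```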


Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis

Cell2Sentence-Scale (C2S-Scale) is a new family of open-source large language models designed to transform complex single-cell transcriptomic data into a text-based format accessible to natural language processing. By representing gene expression profiles as "cell sentences," the framework allows researchers to use general-purpose LLM architectures to "read" and "write" biological information. This approach simplifies single-cell analysis, enabling conversational queries and automated data interpretation that were previously limited to specialized tools and expert users.

### The Cell2Sentence Mapping Method

* Translates single-cell RNA sequencing (scRNA-seq) measurements into sequences of text by ordering gene names according to their expression levels (a minimal sketch of this conversion follows the summary).
* Enables the integration of cellular data with text-based biological context, such as cell types, experimental metadata, and scientific literature.
* Leverages the existing vocabulary of biology (gene names and functions) to make high-dimensional data interpretable by standard language model tokenizers.

### C2S-Scale Model Architecture and Training

* Built upon Google’s Gemma open model family, maintaining the original architecture to benefit from existing scalability and infrastructure.
* Trained on a dataset exceeding 1 billion tokens derived from real-world transcriptomic data and biological metadata.
* Features a range of model sizes from 410 million to 27 billion parameters, allowing researchers to choose between computational efficiency for exploratory work and high performance for complex tasks.

### Functional Applications in Biology

* **Conversational Querying:** Researchers can interact with data through natural language to ask specific questions, such as predicting how a T cell might respond to a particular cancer therapy.
* **Automated Interpretation:** The models can generate biological summaries of experiments, describing everything from individual cell types to the characteristics of entire tissues.
* **Predictive Tasks:** The framework handles diverse tasks including cell type annotation and the generation of synthetic cells or tissues for research simulations.

### Performance and Biological Scaling Laws

* Research demonstrates that biological language models follow predictable scaling laws, where performance in tasks like cell type annotation improves as model size increases.
* Larger models show superior gene overlap and semantic similarity scores when interpreting datasets compared to smaller versions.
* Smaller models remain highly effective for parameter-efficient fine-tuning in resource-constrained environments.

C2S-Scale is available as an open-source resource on GitHub and HuggingFace, offering a flexible toolkit for the research community to apply large language models to next-generation genomic discovery.
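The core mapping is easy to illustrate: rank a cell's genes by expression and keep the names of the most expressed ones as a space-separated "sentence." The truncation length and the handling of unexpressed genes below are illustrative choices, not the exact C2S-Scale preprocessing.

```python
def cell_to_sentence(expression, gene_names, top_k=100):
    """Convert one cell's expression vector into a "cell sentence": gene names
    ordered from highest to lowest expression, keeping the top_k expressed genes."""
    ranked = sorted(zip(gene_names, expression), key=lambda pair: pair[1], reverse=True)
    expressed = [name for name, value in ranked if value > 0][:top_k]
    return " ".join(expressed)

# Toy example with four genes; prints "GENE_B GENE_D GENE_C".
print(cell_to_sentence([0.0, 5.2, 1.1, 3.7],
                       ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]))
```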


ECLeKTic: A novel benchmark for evaluating cross-lingual knowledge transfer in LLMs

ECLeKTic is a novel benchmark designed to evaluate how effectively large language models (LLMs) transfer knowledge between languages, addressing a common limitation where models possess information in a source language but fail to access it in others. By utilizing a closed-book question-answering format based on language-specific Wikipedia entries, the benchmark quantifies the gap between human-like cross-lingual understanding and current machine performance. Initial testing reveals that even state-of-the-art models have significant room for improvement, with the highest-performing model, Gemini 2.5 Pro, achieving only a 52.6% success rate.

## Methodology and Dataset Construction

The researchers built the ECLeKTic dataset by focusing on "information silos" within Wikipedia to ensure the models would need to perform internal transfer rather than simply recalling translated training data.

* The dataset targets 12 languages: English, French, German, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin Chinese, Portuguese, and Spanish.
* Researchers selected 100 articles per language from a July 2023 Wikipedia snapshot that existed exclusively in that specific language and had no equivalent articles in the other 11 targeted languages.
* This approach uses Wikipedia presence as a proxy to identify facts likely encountered by the model in only one language during its training phase.

## Human Refinement and Decontextualization

To ensure the quality and portability of the questions, the team employed native speakers to refine and verify the data generated by AI.

* Human annotators filtered Gemini-generated question-and-answer pairs to ensure they were answerable in a closed-book setting without referring to external context.
* Annotators performed "decontextualization" by adding specific details to ambiguous terms; for example, a reference to the "Supreme Court" was clarified as the "Israeli Supreme Court" to ensure the question remained accurate after translation.
* Questions were curated to focus on cultural and local salience rather than general global knowledge like science or universal current events.
* The final dataset consists of 384 unique questions, which were translated and verified across all 11 target languages, resulting in 4,224 total examples.

## Benchmarking Model Performance

The benchmark evaluates models using a specific metric called "overall success," which measures a model's ability to answer a question correctly in both the original source language and the target language (a minimal sketch of this computation follows the summary).

* The benchmark was used to test eight leading open and proprietary LLMs.
* Gemini 2.0 Pro initially set a high bar with 41.6% success, which was later surpassed by Gemini 2.5 Pro at 52.6%.
* The results demonstrate that while models are improving, they still struggle to maintain consistent knowledge across different linguistic contexts, representing a major hurdle for equitable global information access.

The release of ECLeKTic as an open-source benchmark on Kaggle provides a vital tool for the AI community to bridge the "knowledge gap" between high-resource and low-resource languages. Developers and researchers should use this data to refine training methodologies, aiming for models that can express their internal knowledge regardless of the language used in the prompt.
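The "overall success" metric reduces to a simple check once per-language correctness is known: a question counts only if the model answers it correctly both in its source language and in the target language. The record layout below is an assumption for illustration.

```python
def overall_success(records):
    """records: iterable of dicts with boolean fields `source_correct` and
    `target_correct`, one per (question, target-language) pair; a transfer
    succeeds only when both are True."""
    records = list(records)
    if not records:
        return 0.0
    wins = sum(r["source_correct"] and r["target_correct"] for r in records)
    return wins / len(records)

# Example: 2 of 3 transfers succeed, so the score is about 0.667.
print(overall_success([
    {"source_correct": True,  "target_correct": True},
    {"source_correct": True,  "target_correct": False},
    {"source_correct": True,  "target_correct": True},
]))
```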


Deciphering language processing in the human brain through LLM representations

Recent research by Google Research and collaborating universities indicates that Large Language Models (LLMs) process natural language through internal representations that closely mirror neural activity in the human brain. By comparing intracranial recordings from spontaneous conversations with the internal embeddings of the Whisper speech-to-text model, the study found a high degree of linear alignment between artificial and biological language processing. These findings suggest that the statistical structures learned by LLMs via next-word prediction provide a viable computational framework for understanding how humans comprehend and produce speech.

## Mapping LLM Embeddings to Brain Activity

* Researchers utilized intracranial electrodes to record neural signals during real-world, free-flowing conversations.
* The study compared neural activity against two distinct types of embeddings from the Transformer-based Whisper model: "speech embeddings" from the model’s encoder and "language embeddings" from the decoder.
* A linear transformation was used to predict brain signals based on these embeddings, revealing that LLMs and the human brain share similar multidimensional spaces for coding linguistic information (an encoding-model sketch follows this summary).
* The alignment suggests that human language processing may rely more on statistical structures and contextual embeddings than on traditional symbolic rules or syntactic parts of speech.

## Neural Sequences in Speech Comprehension

* When a subject listens to speech, the brain follows a specific chronological sequence that aligns with model representations.
* Initially, speech embeddings predict cortical activity in the superior temporal gyrus (STG), which is responsible for processing auditory speech sounds.
* A few hundred milliseconds later, language embeddings predict activity in Broca’s area (located in the inferior frontal gyrus), marking the transition from sound perception to decoding meaning.

## Reversed Dynamics in Speech Production

* During speech production, the neural sequence is reversed, beginning approximately 500 milliseconds before a word is articulated.
* Processing starts in Broca’s area, where language embeddings predict activity as the brain plans the semantic content of the utterance.
* This is followed by activity in the motor cortex (MC), aligned with speech embeddings, as the brain prepares the physical articulatory movements.
* Finally, after articulation, speech embeddings predict activity back in the STG, suggesting the brain is monitoring the sound of the speaker's own voice.

This research validates the use of LLMs as powerful predictive tools for neuroscience, offering a new lens through which to study the temporal and spatial dynamics of human communication. By bridging the gap between artificial intelligence and cognitive biology, researchers can better model how the brain integrates sound and meaning in real time.
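The "linear transformation" used to predict brain signals is a standard encoding-model analysis: fit a regularized linear regression from word-aligned embeddings to per-electrode activity and evaluate the correlation on held-out words. The sketch below uses scikit-learn's RidgeCV and assumes array shapes and a simple train/test split; it is not the study's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_encoding_model(embeddings, neural_activity):
    """Fit a linear map from word-level model embeddings to electrode responses.

    embeddings      : (n_words, n_dims) array of Whisper encoder or decoder states
    neural_activity : (n_words, n_electrodes) array of per-word neural responses
    Returns the fitted model and per-electrode correlation on held-out words.
    """
    split = int(0.8 * len(embeddings))          # simple chronological train/test split
    model = RidgeCV(alphas=np.logspace(-2, 4, 13))
    model.fit(embeddings[:split], neural_activity[:split])
    predicted = model.predict(embeddings[split:])
    actual = neural_activity[split:]
    # Pearson correlation per electrode between predicted and recorded activity.
    corrs = [np.corrcoef(predicted[:, e], actual[:, e])[0, 1]
             for e in range(actual.shape[1])]
    return model, np.asarray(corrs)
```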