


# Gemini provides automated feedback for theoretical computer scientists at STOC 2026

Google Research launched an experimental program for the STOC 2026 conference using a specialized Gemini model to provide automated, rigorous feedback on theoretical computer science submissions. By identifying critical logical errors and proof gaps within a 24-hour window, the tool demonstrated that advanced AI can serve as a powerful pre-vetting collaborator for high-level mathematical research. The overwhelmingly positive reception from authors indicates that AI can effectively augment the human peer-review process by improving paper quality before formal submission.

## Advanced Reasoning via Inference Scaling

- The tool utilized an advanced version of Gemini 2.5 Deep Think specifically optimized for mathematical rigor.
- It employed inference scaling methods, allowing the model to explore and combine multiple possible solutions and reasoning traces simultaneously.
- This non-linear approach to problem-solving helps the model focus on the most salient technical issues while significantly reducing the likelihood of hallucinations.

## Structured Technical Feedback

- Feedback was delivered in a structured format that included a high-level summary of the paper's core contributions.
- The model provided a detailed analysis of potential mistakes, specifically targeting errors within lemmas, theorems, and logical proofs.
- Authors also received a categorized list of minor corrections, such as inconsistent variable naming and typographical errors (this report structure is sketched below).

## Identified Technical Issues and Impact

- The pilot saw high engagement, with over 80% of STOC 2026 submitters opting in for the AI-generated review.
- The tool successfully identified "critical bugs" and calculation errors that had previously evaded human authors for months.
- Survey results showed that 97% of participants found the feedback helpful, and 81% reported that the tool improved the overall clarity and readability of their work.

## Expert Verification and Hallucinations

- Because the users were domain experts, they were able to act as a filter, distinguishing between deep technical insights and occasional model hallucinations.
- While the model sometimes struggled to parse complex notation or interpret figures, authors valued the "neutral tone" and the speed of the two-day turnaround.
- The feedback was used as a starting point for human verification, allowing researchers to refine their arguments rather than blindly following the model's output.

## Future Outlook and Educational Potential

- Beyond professional research, 75% of surveyed authors see significant educational value in using the tool to train students in mathematical rigor.
- The experiment's success has led to 88% of participants expressing interest in having continuous access to such a tool throughout their entire research and drafting process.

The success of the STOC 2026 pilot suggests that researchers should consider integrating specialized LLMs early in the drafting phase to catch "embarrassing" or logic-breaking errors. While the human expert remains the final arbiter of truth, these tools provide a necessary layer of automated verification that can accelerate the pace of scientific discovery.
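
The structured feedback format can be pictured as a simple record with the three parts described above. The sketch below is purely illustrative; the field names and types are assumptions, not the pilot's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewFeedback:
    """Hypothetical shape of one automated review; all names are assumed."""
    summary: str  # high-level summary of the paper's core contributions
    potential_mistakes: list[str] = field(default_factory=list)  # suspected errors in lemmas, theorems, proofs
    minor_corrections: list[str] = field(default_factory=list)   # e.g., inconsistent variables, typos

report = ReviewFeedback(
    summary="Improves the approximation ratio for MAX-CUT via a new rounding scheme.",
    potential_mistakes=["Lemma 3.2: the union bound seems to assume independence that is never established."],
    minor_corrections=["The variable k is reused with two different meanings in Sections 2 and 4."],
)
```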


# Generative UI: A rich, custom, visual interactive user experience for any prompt

Google Research has introduced a novel Generative UI framework that enables AI models to dynamically construct bespoke, interactive user experiences—including web pages, games, and functional tools—in response to any natural language prompt. This shift from static, predefined interfaces to AI-generated environments allows for highly customized digital spaces that adapt to a user's specific intent and context. Evaluated through human testing, these custom-generated interfaces are strongly preferred over traditional, text-heavy LLM outputs, signaling a fundamental evolution in human-computer interaction.

### Product Integration in Gemini and Google Search

The technology is currently being deployed as an experimental feature across Google’s main AI consumer platforms to enhance how users visualize and interact with data.

* **Dynamic View and Visual Layout:** These experiments in the Gemini app use agentic coding capabilities to design and code a complete interactive response for every prompt.
* **AI Mode in Google Search:** Available for Google AI Pro and Ultra subscribers, this feature uses Gemini 3’s multimodal understanding to build instant, bespoke interfaces for complex queries.
* **Contextual Customization:** The system differentiates between user needs, such as providing a simplified interface for a child learning about the microbiome versus a data-rich layout for an adult.
* **Task-Specific Tools:** Beyond text, the system generates functional applications like fashion advisors, event planners, and science simulations for topics like RNA transcription.

### Technical Architecture and Implementation

The Generative UI implementation relies on a multi-layered approach centered on the Gemini 3 Pro model to ensure the generated code is both functional and accurate (a schematic sketch of this pipeline follows the post).

* **Tool Access:** The model is connected to server-side tools, including image generation and real-time web search, to enrich the UI with external data.
* **System Instructions:** Detailed guidance provides the model with specific goals, formatting requirements, and technical specifications to avoid common coding errors.
* **Agentic Coding:** The model acts as both a designer and a developer, writing the necessary code to render the UI on the fly based on its interpretation of the user’s prompt.
* **Post-Processing:** Outputs undergo a series of automated checks to address common issues and refine the final visual experience before it reaches the browser.

### The Shift from Static to Generative Interfaces

This research represents a move away from the traditional software paradigm where users must navigate a fixed catalog of applications to find the tool they need.

* **Prompt-Driven UX:** Interfaces are generated from prompts as simple as a single word or as complex as multi-paragraph instructions.
* **Interactive Comprehension:** By building simulations on the fly, the system creates a dynamic environment optimized for deep learning and task completion.
* **Preference Benchmarking:** Research indicates that when generation speed is excluded as a factor, users significantly prefer these custom-built visual tools over standard, static AI responses.

To experience this new paradigm, users can select the "Thinking" option from the model menu in Google Search’s AI Mode or engage with the Dynamic View experiment in the Gemini app to generate tailored tools for specific learning or productivity tasks.
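
The pipeline above can be sketched schematically. Everything here is an assumption for illustration: `call_model` stands in for an agentic Gemini call, and the single post-processing check is a toy version of the automated checks the post describes.

```python
from typing import Callable

def call_model(system_instructions: str, prompt: str, tools: list[str]) -> str:
    """Stand-in for an agentic LLM call that returns self-contained HTML/JS."""
    raise NotImplementedError  # hypothetical: replace with a real LLM client

def validate_markup(html: str) -> str:
    """Illustrative post-processing pass; real checks are more involved."""
    return html if html.strip().startswith("<!DOCTYPE html>") else "<!DOCTYPE html>\n" + html

POST_PROCESSORS: list[Callable[[str], str]] = [validate_markup]

SYSTEM_INSTRUCTIONS = (
    "You are both designer and developer. Produce one complete, interactive "
    "web page that answers the user's prompt, following the formatting and "
    "technical specifications below..."  # detailed guidance elided
)

def generative_ui(prompt: str) -> str:
    # Agentic coding step: the model designs and writes the interface,
    # optionally calling server-side tools to enrich it with external data.
    html = call_model(SYSTEM_INSTRUCTIONS, prompt,
                      tools=["image_generation", "web_search"])
    # Post-processing step: automated checks that address common issues
    # before the generated UI reaches the browser.
    for fix in POST_PROCESSORS:
        html = fix(html)
    return html
```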


# StreetReaderAI: Towards making street view accessible via context-aware multimodal AI

StreetReaderAI is a research prototype designed to make immersive street-level imagery accessible to the blind and low-vision community through multimodal AI. By integrating real-time scene analysis with context-aware geographic data, the system transforms visual mapping data into an interactive, audio-first experience. This framework allows users to virtually explore environments and plan routes with a level of detail and independence previously unavailable through traditional screen readers.

### Navigation and Spatial Awareness

The system offers an immersive, first-person exploration interface that mimics the mechanics of accessible gaming.

* Users navigate using keyboard shortcuts or voice commands, taking "virtual steps" forward or backward and panning their view in 360 degrees.
* Real-time audio feedback provides cardinal and intercardinal directions, such as "Now facing North," to maintain spatial orientation.
* Distance tracking informs the user how far they have traveled between panoramic images, while "teleport" features allow for quick jumps to specific addresses or landmarks.

### Context-Aware AI Describer

At the core of the tool is a subsystem backed by Gemini that synthesizes visual and geographic data to generate descriptions (the prompt assembly is sketched below).

* The AI Describer combines the current field-of-view image with dynamic metadata about nearby roads, intersections, and points of interest.
* Two distinct modes cater to different user needs: a "Default" mode focusing on pedestrian safety and navigation, and a "Tour Guide" mode that provides historical and architectural details.
* The system utilizes Gemini to proactively predict and suggest follow-up questions relevant to the specific scene, such as details about crosswalks or building entrances.

### Interactive Dialogue and Session Memory

StreetReaderAI utilizes the Multimodal Live API to facilitate real-time, natural language conversations about the environment.

* The AI Chat agent maintains a context window of 1,048,576 tokens, allowing it to retain a "memory" of up to 4,000 previous images and interactions.
* This memory allows users to ask retrospective spatial questions, such as "Where was that bus stop I just passed?", with the agent providing relative directions based on the user's current location.
* By tracking every pan and movement, the agent can provide specific details about the environment that were captured in previous steps of the virtual walk.

### User Evaluation and Practical Application

Testing with blind screen reader users confirmed the system's utility in practical, real-world scenarios.

* Participants successfully used the prototype to evaluate potential walking routes, identifying critical environmental features like the presence of benches or shelters at bus stops.
* The study highlighted the importance of multimodal inputs—combining image recognition with structured map data—to provide a more accurate and reliable description than image analysis alone could offer.

While StreetReaderAI remains a proof-of-concept, it demonstrates that the integration of multimodal LLMs and spatial data can bridge significant accessibility gaps in digital mapping. Future implementation of these technologies could transform how visually impaired individuals interact with the world, turning static street imagery into a functional tool for independent mobility and exploration.
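
The AI Describer's combination of the field-of-view image with geographic metadata can be pictured as a simple prompt-assembly step. This is a minimal sketch with invented function and field names, not StreetReaderAI's actual code.

```python
from dataclasses import dataclass

@dataclass
class GeoContext:
    """Invented container for the dynamic metadata described above."""
    heading: str                  # e.g., "North"
    nearby_roads: list[str]
    points_of_interest: list[str]

def build_describer_prompt(geo: GeoContext, mode: str = "default") -> str:
    """Assemble the text half of a multimodal request; the current
    field-of-view image would be attached alongside this prompt."""
    persona = (
        "Focus on pedestrian safety and navigation."
        if mode == "default"
        else "Act as a tour guide: include historical and architectural details."
    )
    return (
        f"{persona}\n"
        f"The user is facing {geo.heading}.\n"
        f"Nearby roads: {', '.join(geo.nearby_roads)}.\n"
        f"Points of interest: {', '.join(geo.points_of_interest)}.\n"
        "Describe the scene in the attached image for a blind pedestrian."
    )

print(build_describer_prompt(
    GeoContext("North", ["Main St", "3rd Ave"], ["bus stop", "cafe"]),
    mode="tour_guide",
))
```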


# Google Earth AI: Unlocking geospatial insights with foundation models and cross-modal reasoning

Google Earth AI introduces a framework of geospatial foundation models and reasoning agents designed to solve complex, planetary-scale challenges through cross-modal reasoning. By integrating Gemini-powered orchestrators with specialized imagery, population, and environmental models, the system deconstructs multifaceted queries into actionable multi-step plans. This approach enables a holistic understanding of real-world events, such as disaster response and disease forecasting, by grounding AI insights in diverse geospatial data.

## Geospatial Reasoning Agents

* Utilizes Gemini models as intelligent orchestrators to manage complex queries that require data from multiple domains.
* The agent deconstructs a high-level question—such as predicting hurricane landfalls and community vulnerability—into a sequence of smaller, executable tasks (see the sketch below).
* It executes these plans by autonomously calling specialized foundation models, querying vast datastores, and utilizing geospatial tools to fuse disparate data points into a single, cohesive answer.

## Remote Sensing and Imagery Foundations

* Employs vision-language models and open-vocabulary object detection trained on a large corpus of high-resolution overhead imagery paired with text descriptions.
* Enables "zero-shot" capabilities, allowing users to find specific objects like "flooded roads" or "building damage" using natural language without needing to retrain the model for specific classes.
* Technical evaluations show a 16% average improvement on text-based image search tasks and more than double the baseline accuracy for detecting novel objects in a zero-shot setting.

## Population Dynamics and Mobility

* Focuses on the interplay between people and places using globally consistent embeddings across 17 countries.
* Includes monthly updated embeddings that capture shifting human activity patterns, which are essential for time-sensitive forecasting.
* Research conducted with the University of Oxford showed that incorporating these population embeddings into a dengue fever forecasting model in Brazil improved the R² metric from 0.456 to 0.656 for long-range 12-month predictions.

## Environmental and Disaster Forecasting

* Integrates established Google research into weather nowcasting, flood forecasting, and wildfire boundary mapping.
* Provides the reasoning agent with the data necessary to evaluate environmental risks alongside population density and infrastructure imagery.
* Aims to provide Search and Maps users with real-time, accurate alerts regarding natural disasters grounded in planetary-scale environmental data.

Developers and enterprises looking to solve high-level geospatial problems can now express interest in accessing these capabilities through Google Earth and Google Cloud. By leveraging these foundation models, organizations can automate the analysis of satellite imagery and human mobility data to better prepare for environmental and social challenges.
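
The plan-then-execute loop of the reasoning agent can be sketched abstractly. Everything below, including the tool names and the fixed planner stub, is an assumption for illustration and not the actual Earth AI interface.

```python
from typing import Callable

# Hypothetical specialized models/tools the orchestrator can call.
TOOLS: dict[str, Callable[[str], str]] = {
    "weather_forecast":  lambda q: f"[forecast for: {q}]",
    "imagery_search":    lambda q: f"[overhead imagery matching: {q}]",
    "population_lookup": lambda q: f"[population embedding stats for: {q}]",
}

def plan(query: str) -> list[tuple[str, str]]:
    """Stand-in for the Gemini planner that deconstructs a query into steps;
    here it returns a fixed plan regardless of the query."""
    return [
        ("weather_forecast",  "predicted hurricane landfall zone"),
        ("imagery_search",    "flooded roads near landfall zone"),
        ("population_lookup", "community vulnerability near landfall zone"),
    ]

def answer(query: str) -> str:
    # Execute each planned step, then fuse results into one cohesive answer.
    results = [TOOLS[tool](arg) for tool, arg in plan(query)]
    return "\n".join(results)

print(answer("Which communities are most vulnerable to the approaching hurricane?"))
```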


# Teaching Gemini to spot exploding stars with just a few examples

Researchers have demonstrated that Google’s Gemini model can classify cosmic events with 93% accuracy, rivaling specialized machine learning models while providing human-readable explanations. By utilizing few-shot learning with only 15 examples per survey, the model addresses the "black box" limitation of traditional convolutional neural networks used in astronomy. This approach enables scientists to efficiently process the millions of alerts generated by modern telescopes while maintaining a transparent and interactive reasoning process.

## Bottlenecks in Modern Transient Astronomy

* Telescopes like the Vera C. Rubin Observatory are expected to generate up to 10 million alerts per night, making manual verification impossible.
* The vast majority of these alerts are "bogus" signals caused by satellite trails, cosmic rays, or instrumental artifacts rather than real supernovae.
* Existing specialized models often provide binary "real" or "bogus" labels without context, forcing astronomers to either blindly trust the output or spend hours on manual verification.

## Multimodal Few-Shot Learning for Classification

* The research utilized few-shot learning, providing Gemini with only 15 annotated examples for each of three major surveys: Pan-STARRS, MeerLICHT, and ATLAS (see the prompt sketch below).
* Input data consisted of image triplets—a "new" alert image, a "reference" image of the same sky patch, and a "difference" image—each 100x100 pixels in size.
* The model successfully generalized across different telescopes with varying pixel scales, ranging from 0.25" per pixel for Pan-STARRS to 1.8" per pixel for ATLAS.
* Beyond simple labels, Gemini generates a textual description of observed features and an interest score to help astronomers prioritize follow-up observations.

## Expert Validation and Self-Assessment

* A panel of 12 professional astronomers evaluated the model using a 0–5 coherence rubric, confirming that Gemini’s logic aligned with expert reasoning.
* The study found that Gemini can effectively assess its own uncertainty; low self-assigned "coherence scores" were strong indicators of likely classification errors.
* This ability to flag its own potential mistakes allows the model to act as a reliable partner, alerting scientists when a specific case requires human intervention.

The transition from "black box" classifiers to interpretable AI assistants allows the astronomical community to scale with the data flood of next-generation telescopes. By combining high-accuracy classification with transparent reasoning, researchers can maintain scientific rigor while processing millions of cosmic events in real time.
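
The few-shot setup can be pictured as one multimodal prompt built from the 15 labeled triplets plus the query triplet. The sketch below is schematic; the message layout and field names are assumptions, not the study's actual code.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    """One 100x100-pixel alert: new, reference, and difference images."""
    new_img: bytes
    ref_img: bytes
    diff_img: bytes

def build_fewshot_prompt(examples: list[tuple[Triplet, str]], query: Triplet) -> list:
    """Interleave the annotated triplets with the unlabeled query, asking for
    a label, a textual explanation, and an interest score."""
    parts: list = [
        "Classify astronomical alerts as 'real' or 'bogus'. Explain the "
        "visual features you used and give an interest score (0-1)."
    ]
    for triplet, label in examples:  # few-shot demonstrations (15 per survey)
        parts += [triplet.new_img, triplet.ref_img, triplet.diff_img,
                  f"Label: {label}"]
    parts += [query.new_img, query.ref_img, query.diff_img,
              "Label:"]              # the model completes this final case
    return parts
```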


# XR Blocks: Accelerating AI + XR innovation

XR Blocks is an open-source, cross-platform framework designed to bridge the technical gap between mature AI development ecosystems and high-friction extended reality (XR) prototyping. By providing a modular architecture and high-level abstractions, the toolkit enables creators to rapidly build and deploy intelligent, immersive web applications without managing low-level system integration. Ultimately, the framework empowers developers to move from concept to interactive prototype across both desktop simulators and mobile XR devices using a unified codebase.

### Core Design Principles

* **Simplicity and Readability:** Drawing inspiration from the "Zen of Python," the framework prioritizes human-readable abstractions where a developer’s script reflects a high-level description of the experience rather than complex boilerplate code.
* **Creator-Centric Workflow:** The architecture is designed to handle the "plumbing" of XR—such as sensor fusion, AI model integration, and cross-platform logic—allowing creators to focus entirely on user interaction and experience.
* **Pragmatic Modularity:** Rather than attempting to be a perfect, all-encompassing system, XR Blocks favors an adaptable and simple architecture that can evolve alongside the rapidly changing fields of AI and spatial computing.

### The Reality Model Abstractions

* **The Script Primitive:** Acts as the logical center of an application, separating the "what" of an interaction from the "how" of its underlying technical implementation.
* **User and World:** Provides built-in support for tracking hands, gaze, and avatars while allowing the system to query the physical environment for depth, estimated lighting conditions, and object recognition.
* **AI and Agents:** Facilitates the integration of intelligent assistants, such as the "Sensible Agent," which can provide proactive, context-aware suggestions within the XR environment.
* **Virtual Interfaces:** Offers tools to augment blended reality with virtual UI elements that respond to the user's physical context.

### Technical Implementation and Integration

* **Web-Based Foundation:** The framework is built upon accessible, standard technologies including WebXR, three.js, and LiteRT (formerly TFLite) to ensure a low barrier to entry for web developers.
* **Advanced AI Support:** It features native integration with Gemini for high-level reasoning and context-aware applications.
* **Cross-Platform Deployment:** Developers can prototype depth-aware, physics-based interactions in a desktop simulator and deploy the exact same code to Android XR devices.
* **Open-Source Resources:** The project includes a comprehensive suite of templates and live demos covering specific use cases like depth mapping, gesture modeling, and lighting estimation.

By lowering the barrier to entry for intelligent XR development, XR Blocks serves as a practical starting point for researchers and developers aiming to explore the next generation of human-centered computing. Interested creators can access the source code on GitHub to begin building immersive, AI-driven applications that function seamlessly across the web and specialized XR hardware.


# AI as a research partner: Advancing theoretical computer science with AlphaEvolve

AlphaEvolve, an LLM-powered coding agent developed by Google DeepMind, facilitates mathematical discovery by evolving code to find complex combinatorial structures that are difficult to design manually. By utilizing a "lifting" technique, the system discovers finite structures that can be plugged into existing proof frameworks to establish new universal theorems in complexity theory. This methodology has successfully produced state-of-the-art results for the MAX-4-CUT problem and tightened bounds on the hardness of certifying properties in random graphs.

## The Role of AlphaEvolve in Mathematical Research

* The system uses an iterative feedback loop to morph code snippets, evaluating the resulting mathematical structures and refining the code toward more optimal solutions (sketched below).
* AlphaEvolve operates as a tool-based assistant that generates specific proof elements, which can then be automatically verified by computer programs to ensure absolute mathematical correctness.
* By focusing on verifiable finite structures, the agent overcomes the common "hallucination" issues of LLMs, as the final output is a computationally certified object rather than a speculative text-based proof.

## Bridging Finite Discovery and Universal Statements through Lifting

* Theoretical computer science often requires proofs that hold true for all problem sizes ($\forall n$), a scale that AI systems typically struggle to address directly.
* The "lifting" technique treats a proof as a modular structure where a specific finite component—such as a combinatorial gadget—can be replaced with a more efficient version while keeping the rest of the proof intact.
* When AlphaEvolve finds a superior finite structure, the improvement is "lifted" through the existing mathematical framework to yield a stronger universal theorem without requiring a human to redesign the entire logical architecture.

## Optimizing Gadget Reductions and MAX-k-CUT

* Researchers applied the agent to "gadget reductions," which are recipes used to map known intractable problems to new ones to prove computational hardness (NP-hardness).
* AlphaEvolve discovered complex gadgets that were previously unknown because they were too intricate for researchers to construct by hand.
* These discoveries led to a new state-of-the-art inapproximability result for the MAX-4-CUT problem, defining more precise limits on how accurately the problem can be solved by any efficient algorithm.

## Advancing Average-Case Hardness in Random Graphs

* The agent was tasked with uncovering structures related to the average-case hardness of certifying properties within random graphs.
* By evolving better combinatorial structures for these specific instances, the team was able to tighten existing mathematical bounds, providing a clearer picture of when certain graph properties become computationally intractable to verify.

This research demonstrates that LLM-based agents can serve as genuine research partners by focusing on the discovery of verifiable, finite components within broader theoretical frameworks. For researchers in mathematics and computer science, this "lifting" approach provides a practical roadmap for using AI to solve bottleneck problems that were previously restricted by the limits of manual construction.
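
The evaluate-and-refine loop can be summarized as an evolutionary search over programs. This is a schematic sketch under assumed interfaces (`mutate_with_llm`, `score` are invented stubs), not DeepMind's implementation.

```python
import random

def mutate_with_llm(code: str) -> str:
    """Stand-in for an LLM call that rewrites a code snippet."""
    raise NotImplementedError  # hypothetical: prompt an LLM with `code` and feedback

def score(code: str) -> float:
    """Stand-in evaluator: run the code, extract the combinatorial structure
    it builds (e.g., a gadget), and verify/score it programmatically."""
    raise NotImplementedError

def evolve(seed: str, generations: int = 100, population_size: int = 20) -> str:
    population = [seed]
    for _ in range(generations):
        # Morph existing snippets and keep the best-scoring variants.
        children = [mutate_with_llm(random.choice(population))
                    for _ in range(population_size)]
        population = sorted(population + children, key=score, reverse=True)[:population_size]
    # The winner is a program whose output is a computationally certified
    # object that can then be "lifted" into an existing proof framework.
    return population[0]
```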


# The anatomy of a personal health agent

Google researchers have developed the Personal Health Agent (PHA), an LLM-powered prototype designed to provide evidence-based, personalized health insights by analyzing multimodal data from wearables and blood biomarkers. By utilizing a specialized multi-agent architecture, the system deconstructs complex health queries into specific tasks to ensure statistical accuracy and clinical grounding. The study demonstrates that this modular approach significantly outperforms standard large language models in providing reliable, data-driven wellness support.

## Multi-Agent System Architecture

* The PHA framework adopts a "team-based" approach, utilizing three specialist sub-agents: a Data Science agent, a Domain Expert agent, and a Health Coach (a toy routing sketch follows the post).
* The system was validated using a real-world dataset from 1,200 participants, featuring longitudinal Fitbit data, health questionnaires, and clinical blood test results.
* This architecture was designed after a user-centered study of 1,300 health queries, identifying four key needs: general knowledge, data interpretation, wellness advice, and symptom assessment.
* Evaluation involved over 1,100 hours of human expert effort across 10 benchmark tasks to ensure the system outperformed base models like Gemini.

## The Data Science Agent

* This agent specializes in "contextualized numerical insights," transforming ambiguous queries (e.g., "How is my fitness trending?") into formal statistical analysis plans.
* It operates through a two-stage process: first interpreting the user's intent and data sufficiency, then generating executable code to analyze time-series data.
* In benchmark testing, the agent achieved a 75.6% score in analysis planning, significantly higher than the 53.7% score achieved by the base model.
* The agent's code generation was validated against 173 rigorous unit tests written by human data scientists to ensure accuracy in handling wearable sensor data.

## The Domain Expert Agent

* Designed for high-stakes medical accuracy, this agent functions as a grounded source of health knowledge using a multi-step reasoning framework.
* It utilizes a "toolbox" approach, granting the LLM access to authoritative external databases such as the National Center for Biotechnology Information (NCBI) to provide verifiable facts.
* The agent is specifically tuned to tailor information to the user’s unique profile, including specific biomarkers and pre-existing medical conditions.
* Performance was measured through board certification and coaching exam questions, as well as its ability to provide accurate differential diagnoses compared to human clinicians.

While currently a research framework rather than a public product, the PHA demonstrates that a modular, specialist-driven AI architecture is essential for safe and effective personal health management. Developers of future health-tech tools should prioritize grounding LLMs in external clinical databases and implementing rigorous statistical validation stages to move beyond the limitations of general-purpose chatbots.
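
The team-based architecture implies a routing step that dispatches each query to the right specialist. The sketch below is an illustrative guess at that control flow; the agent names match the post, but the keyword router and return values are assumptions (the real system would use an LLM to classify queries against the four identified needs).

```python
def data_science_agent(query: str) -> str:
    return "[statistical analysis plan + executed code over wearable data]"

def domain_expert_agent(query: str) -> str:
    return "[grounded medical answer citing external databases such as NCBI]"

def health_coach_agent(query: str) -> str:
    return "[personalized wellness guidance]"

def route(query: str) -> str:
    """Toy keyword router standing in for an LLM-based query classifier."""
    q = query.lower()
    if any(w in q for w in ("trend", "average", "my data", "fitness")):
        return data_science_agent(query)   # data interpretation need
    if any(w in q for w in ("symptom", "biomarker", "diagnos")):
        return domain_expert_agent(query)  # knowledge / symptom assessment need
    return health_coach_agent(query)       # wellness advice need

print(route("How is my fitness trending?"))
```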


# Towards better health conversations: Research insights on a “wayfinding” AI agent based on Gemini

Google Research has developed "Wayfinding AI," a research prototype based on Gemini designed to transform health information seeking from a passive query-response model into a proactive, context-seeking dialogue. By prioritizing clarifying questions and iterative guidance, the agent addresses the common struggle users face when attempting to articulate complex or ambiguous medical concerns. User studies indicate that this proactive approach results in health information that participants find significantly more helpful, relevant, and tailored to their specific needs than traditional AI responses.

### Challenges in Digital Health Navigation

* Formative research involving 33 participants highlighted that users often struggle to articulate health concerns because they lack the clinical background to know which details are medically relevant.
* The study found that users typically "throw words" at a search engine and sift through generic, impersonal results that do not account for their unique context.
* Initial UX testing revealed a strong user preference for a "deferred-answer" approach, where the AI mimics a medical professional by asking clarifying questions before jumping to a conclusion.

### Core Design Principles of Wayfinding AI

* **Proactive Conversational Guidance:** At every turn, the agent asks up to three targeted questions to reduce ambiguity and help users systematically share their "health story."
* **Best-Effort Answers:** To ensure immediate utility, the AI provides the best possible information based on the data available at that moment, while noting that the answer will improve as the user provides more context.
* **Transparent Reasoning:** The system explicitly explains how the user’s most recent answers have helped refine the previous response, making the AI’s internal logic understandable.

### Split-Stream User Interface

* To prevent clarifying questions from being buried in long paragraphs, the prototype uses a two-column layout (the turn structure is sketched below).
* The left column is dedicated to the interactive chat and specific follow-up questions to keep the user focused on the dialogue.
* The right column displays the "best information so far" and detailed explanations, allowing users to dive into the technical content only when they feel enough context has been established.

### Comparative Evaluation and Performance

* A randomized study with 130 participants compared the Wayfinding AI against a baseline Gemini 2.5 Flash model.
* Participants interacted with both models for at least three minutes regarding a personal health question and rated them across six dimensions: helpfulness, question relevance, tailoring, goal understanding, ease of use, and efficiency.
* The proactive agent outperformed the baseline significantly, with participants reporting that the context-seeking behavior felt more professional and increased their confidence in the AI's suggestions.

The research suggests that for sensitive and complex topics like health, AI should move beyond being a passive knowledge base. By adopting a "wayfinding" strategy that guides users through their own information needs, AI agents can provide more personalized and empowering experiences that better mirror expert human consultation.
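
Each turn pairs a best-effort answer with up to three clarifying questions, which the split-stream UI renders in separate columns. A minimal sketch of that turn structure follows; the field names are assumptions, not the prototype's actual data model.

```python
from dataclasses import dataclass

@dataclass
class WayfindingTurn:
    """One agent turn, split to match the two-column layout."""
    clarifying_questions: list[str]  # left column: at most three targeted questions
    best_effort_answer: str          # right column: best information so far
    reasoning_note: str              # how the latest answers refined the response

turn = WayfindingTurn(
    clarifying_questions=[
        "How long have you had the headache?",
        "Is it on one side or both?",
        "Any changes in vision?",
    ],
    best_effort_answer="Based on what you've shared so far, common causes include...",
    reasoning_note="Knowing the duration helped rule out several acute causes.",
)
assert len(turn.clarifying_questions) <= 3
```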


# Accelerating scientific discovery with AI-powered empirical software

Google Research has introduced an AI-powered system designed to accelerate scientific discovery by automating the creation and optimization of "empirical software." By leveraging the Gemini model and tree search optimization, the system can propose, implement, and iteratively improve code for complex multidisciplinary challenges, achieving results that match or exceed human expert performance. This approach transforms scientific hypothesis evaluation from a months-long manual coding process into an automated search that can be completed in hours or days.

### The Concept of Empirical Software and Scorable Tasks

* The system shifts focus from traditional functional correctness to "empirical software," where the primary objective is to maximize a predefined quality score.
* It targets "scorable tasks," which are defined by a problem description, a specific scoring metric, and a dataset for training and validation.
* This framework addresses the research bottleneck where scientists must manually test hundreds of models or parameters to achieve a breakthrough.

### System Architecture and Optimization Strategy

* The engine takes a task description and optional context—such as ideas from scientific literature—as input to generate novel methodological concepts.
* It utilizes a tree search strategy inspired by AlphaZero, employing an upper confidence bound to navigate and prioritize thousands of potential code variants (a generic UCB1 rule is sketched below).
* The LLM acts as an iterative rewriter, refining executable code within a sandbox to continuously improve the performance score.
* Outputs are designed to be fully verifiable, interpretable, and reproducible, providing scientists with the specific coded solutions used to reach a result.

### Demonstrated Performance Across Scientific Domains

* The system was tested on six diverse benchmarks, including genomics, public health, geospatial analysis, neuroscience, and time-series forecasting.
* In genomics, the system tackled the "batch integration" of single-cell RNA sequencing (scRNA-seq) data, a complex problem involving the removal of noise while preserving biological signals.
* The AI discovered 40 novel methods that outperformed top expert-developed tools within the OpenProblems V2.0.0 batch integration benchmark.
* Evaluation focused on advanced capabilities such as zero-shot generalization, high-dimensional signal processing, and uncertainty quantification.

This system represents a significant shift toward "research engines" that participate actively in the scientific method through iterative experimentation. Scientists can utilize these tools to explore a much broader range of hypotheses than manual coding allows, potentially leading to faster breakthroughs in data-heavy fields like genomics and climate modeling.
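
The post does not publish the system's exact search policy, but the standard UCB1 rule it alludes to balances exploiting high-scoring code variants against exploring rarely-tried ones. A generic sketch:

```python
import math

class Node:
    """One candidate code variant in the search tree."""
    def __init__(self, code: str):
        self.code = code
        self.visits = 0
        self.total_score = 0.0
        self.children: list["Node"] = []

def ucb1(node: Node, parent_visits: int, c: float = 1.4) -> float:
    """Upper confidence bound: mean score plus an exploration bonus that
    shrinks as a variant is evaluated more often."""
    if node.visits == 0:
        return float("inf")  # always try an unvisited variant first
    mean = node.total_score / node.visits
    return mean + c * math.sqrt(math.log(parent_visits) / node.visits)

def select_child(parent: Node) -> Node:
    """Pick the child to expand/rewrite next."""
    return max(parent.children, key=lambda n: ucb1(n, parent.visits))
```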


# How Google’s AI can help transform health professions education

To address a projected global deficit of 11 million healthcare workers by 2030, Google Research is exploring how generative AI can provide personalized, competency-based education for medical professionals. By combining qualitative user-centered design with quantitative benchmarking of the pedagogically fine-tuned LearnLM model, researchers have demonstrated that AI can effectively mimic the behaviors of high-quality human tutors. The studies conclude that specialized models, now integrated into Gemini 2.5 Pro, can significantly enhance clinical reasoning and adapt to the individual learning styles of medical students.

## Learner-Centered Design and Participatory Research

* Researchers conducted interdisciplinary co-design workshops featuring medical students, clinicians, and AI researchers to identify specific educational needs.
* The team developed a rapid prototype of an AI tutor designed to guide learners through clinical reasoning exercises anchored in synthetic clinical vignettes.
* Qualitative feedback from medical residents and students highlighted a demand for "preceptor-like" behaviors, such as the ability to manage cognitive load, provide constructive feedback, and encourage active reflection.
* Analysis revealed that learners specifically value AI tools that can identify and bridge individual knowledge gaps rather than providing generic information.

## Quantitative Benchmarking via LearnLM

* The study utilized LearnLM, a version of Gemini fine-tuned specifically for educational pedagogy, and compared its performance against Gemini 1.5 Pro.
* Evaluations were conducted using 50 synthetic scenarios covering a spectrum of medical education, ranging from preclinical topics like platelet activation to clinical subjects such as neonatal jaundice.
* Medical students engaged in 290 role-playing conversations, which were then evaluated based on four primary metrics: overall experience, meeting learning needs, enjoyability, and understandability.
* Physician educators performed blinded reviews of conversation transcripts to assess whether the AI adhered to medical education standards and core competencies.

## Pedagogical Performance and Expert Evaluation

* LearnLM was consistently rated higher than the base model by both students and educators, with experts noting it behaved "more like a very good human tutor."
* The fine-tuned model demonstrated a superior ability to maintain a conversation plan and use grounding materials to provide accurate, context-aware instruction.
* Findings suggest that pedagogical fine-tuning is essential for AI to move beyond simple fact-delivery and toward true interactive tutoring.
* These specialized learning capabilities have been transitioned from the research phase into Gemini 2.5 Pro to support broader educational applications.

By integrating these specialized AI behaviors into medical training pipelines, institutions can provide scalable, individualized support to students. The transition of LearnLM’s pedagogical features into Gemini 2.5 Pro provides a practical framework for developers to create tools that not only provide medical information but actively foster the critical thinking skills required for clinical practice.


# A scalable framework for evaluating health language models

Researchers at Google have developed a scalable framework for evaluating health-focused language models by replacing subjective, high-complexity rubrics with granular, binary criteria. This "Adaptive Precise Boolean" approach addresses the high costs and low inter-rater reliability typically associated with expert-led evaluation in specialized medical domains. By dynamically filtering rubric questions based on context, the framework significantly improves both the speed and precision of model assessments.

## Limitations of Traditional Evaluation

* Current evaluation practices for health LLMs rely heavily on human experts, making them cost-prohibitive and difficult to scale.
* Standard tools, such as Likert scales (e.g., 1-5 ratings) or open-ended text, often lead to subjective interpretations and low inter-rater consistency.
* Evaluating complex, personalized health data requires a level of detail that traditional broad-scale rubrics fail to capture accurately.

## Precise Boolean Rubrics

* The framework "granularizes" complex evaluation targets into a larger set of focused, binary (Yes/No) questions.
* This format reduces ambiguity by forcing raters to make definitive judgments on specific aspects of a model's response.
* By removing the middle ground found in multi-point scales, the framework produces a more robust and actionable signal for programmatic model refinement.

## The Adaptive Filtering Mechanism

* To prevent the high volume of binary questions from overwhelming human raters, the researchers introduced an "Adaptive" layer.
* The framework uses the Gemini model as a zero-shot classifier to analyze the user query and LLM response, identifying only the most relevant rubric questions (sketched below).
* This data-driven adaptation ensures that human experts only spend time on pertinent criteria, resulting in "Human-Adaptive Precise Boolean" rubrics.

## Performance and Reliability Gains

* The methodology was validated in the domain of metabolic health, covering topics like diabetes, obesity, and cardiovascular disease.
* The Adaptive Precise Boolean approach reduced human evaluation time by over 50% compared to traditional Likert-scale methods.
* Inter-rater reliability, measured through intra-class correlation coefficients (ICC), was significantly higher than the baseline, showing that simpler scoring can provide a higher-quality signal.

This framework demonstrates that breaking down complex medical evaluations into simple, machine-filtered binary questions is a more efficient path toward safe and accurate health AI. Organizations developing domain-specific models should consider adopting adaptive binary rubrics to balance the need for expert oversight with the requirements of large-scale model iteration.
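
The adaptive layer amounts to filtering a large bank of Yes/No questions down to the relevant ones before humans score them. A schematic sketch, where the rubric entries and the classifier stub are invented:

```python
RUBRIC = [
    "Does the response state the normal fasting glucose range correctly?",
    "Does the response account for the user's stated medication?",
    "Does the response recommend consulting a clinician where appropriate?",
    # ... many more granular Yes/No criteria
]

def is_relevant(question: str, user_query: str, llm_response: str) -> bool:
    """Stand-in for a zero-shot LLM call: 'Is this rubric question
    applicable to this query/response pair? Answer Yes or No.'"""
    raise NotImplementedError

def adaptive_rubric(user_query: str, llm_response: str) -> list[str]:
    # Human raters then answer only these filtered Yes/No questions,
    # yielding the "Human-Adaptive Precise Boolean" evaluation.
    return [q for q in RUBRIC if is_relevant(q, user_query, llm_response)]
```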


# Achieving 10,000x training data reduction with high-fidelity labels

Google Ads researchers have developed a scalable active learning curation process that reduces the volume of training data required for fine-tuning LLMs by up to four orders of magnitude. By iteratively identifying the most informative and diverse examples through clustering and expert review, the method achieves significantly higher human-model alignment than traditional large-scale crowdsourced datasets. This approach effectively addresses the high costs and complexities of classifying ambiguous content, such as unsafe ads, where high-fidelity data is scarce and concept drift is frequent.

### The Iterative Curation Process

* **Initial Labeling:** The process begins with a zero- or few-shot model (LLM-0) that generates a large, typically imbalanced dataset of "positive" and "benign" labels.
* **Clustering and Confusion Identification:** Separate clusters are created for each label set; overlapping clusters indicate areas where the model is confused.
* **Expert Sampling:** Human experts review pairs of examples located near the decision boundary of these overlapping clusters, prioritizing those that cover a larger area of the search space to ensure diversity.
* **Recursive Refinement:** Expert labels are split into fine-tuning and evaluation sets; the model is retrained and the process repeats until model-human alignment plateaus or matches internal expert agreement.

### Measuring Alignment via Cohen’s Kappa

* **Metric Selection:** Because ad safety is often subjective, the researchers use Cohen’s Kappa instead of precision and recall to measure how well two independent annotators align beyond chance (the computation is sketched below).
* **Performance Benchmarks:** A Kappa value above 0.8 is considered exceptional, while 0.4 is the minimum for acceptability.
* **Goal Alignment:** The curation process aims to move model performance toward the "ceiling" of internal human agreement (which measured between 0.78 and 0.81 in these experiments).

### Experimental Results and Efficiency

* **Model Scaling:** Experiments involved fine-tuning Gemini Nano-1 (1.8B parameters) and Nano-2 (3.25B parameters) on tasks of varying complexity.
* **Drastic Data Reduction:** The curated method reached performance plateaus using fewer than 500 expert-labeled examples, compared to a baseline of 100,000 crowdsourced labels.
* **Quality Gains:** Despite using 10,000x less data, the curated models saw up to a 65% improvement in alignment with human experts over the crowdsourced baselines.
* **Class Balancing:** The process naturally corrected for production imbalances, moving from <1% positive examples in raw traffic to ~40% in the final curated sets.

This curation method is a highly effective strategy for organizations managing high-stakes classification tasks where "ground truth" is subjective or data curation is prohibitively expensive. By shifting focus from data quantity to the quality and diversity of examples at the decision boundary, developers can maintain high-performing models that adapt quickly to evolving safety policies.
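
Cohen's Kappa is available directly in scikit-learn; the toy labels below simply illustrate the computation against the 0.4/0.8 thresholds the post cites.

```python
from sklearn.metrics import cohen_kappa_score

# Toy example: model labels vs. expert labels on the same ten ads.
model_labels  = ["unsafe", "benign", "unsafe", "benign", "benign",
                 "unsafe", "benign", "unsafe", "benign", "unsafe"]
expert_labels = ["unsafe", "benign", "unsafe", "benign", "unsafe",
                 "unsafe", "benign", "unsafe", "benign", "unsafe"]

kappa = cohen_kappa_score(model_labels, expert_labels)
print(f"kappa = {kappa:.2f}")  # > 0.8 exceptional; 0.4 is the floor for acceptability
```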


# Insulin resistance prediction from wearables and routine blood biomarkers

Researchers at Google have developed a novel machine learning approach to predict insulin resistance (IR) by integrating wearable device data with routine blood biomarkers. This method aims to provide a scalable, less invasive alternative to traditional "gold standard" tests like the euglycemic insulin clamp or specialized HOMA-IR assessments. The study demonstrates that combining digital biomarkers with common laboratory results can effectively identify individuals at risk for type 2 diabetes, particularly within high-risk populations.

## Barriers to Early Diabetes Screening

* Insulin resistance is a primary precursor to approximately 70% of type 2 diabetes cases, yet it often remains undetected until the disease has progressed.
* Current diagnostic standards are frequently omitted from routine check-ups due to high costs, invasiveness, and the requirement for specific insulin blood tests that are not standard practice.
* Early detection is vital because insulin resistance is often reversible through lifestyle modifications, making accessible screening tools a high priority for preventative medicine.

## The WEAR-ME Multimodal Dataset

* The research utilized the "WEAR-ME" study, which collected data from 1,165 remote participants across the U.S. via the Google Health Studies app.
* Digital biomarkers were gathered from Fitbit and Google Pixel Watch devices, tracking metrics such as resting heart rate, step counts, and sleep patterns.
* Clinical data was provided through a partnership with Quest Diagnostics, focusing on routine blood biomarkers like fasting glucose and lipid panels, supplemented by participant surveys on diet, fitness, and demographics.

## Predictive Modeling and Performance

* Deep neural network models were trained to estimate HOMA-IR scores by analyzing different combinations of the collected data streams (an outline of this comparison appears below).
* While models using only wearables and demographics achieved an area under the receiver operating characteristic curve (auROC) of 0.70, adding fasting glucose data boosted the auROC to 0.78.
* The most comprehensive models, which combined wearables, demographics, and full routine blood panels, achieved the highest accuracy across the study population.
* Performance was notably strong in high-risk sub-groups, specifically individuals with obesity or sedentary lifestyles.

## AI-Driven Interpretation and Literacy

* To assist with data translation, the researchers developed a prototype "Insulin Resistance Literacy and Understanding Agent" built on the Gemini family of large language models.
* The agent is designed to help users interpret their IR risk predictions and provide personalized, research-backed educational content.
* This AI integration aims to facilitate better communication between the data results and actionable health strategies, though it is currently intended for informational and research purposes.

By utilizing ubiquitous wearable technology and existing clinical infrastructure, this approach offers a path toward proactive metabolic health monitoring. Integrating these models into consumer or clinical platforms could lower the barrier to early diabetes intervention and enable more personalized preventative care.
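
The feature-set comparison can be reproduced in outline with scikit-learn. Everything below is a placeholder: synthetic data stands in for the WEAR-ME dataset, and a logistic model stands in for the study's deep neural networks; only the evaluation pattern (auROC per feature subset) mirrors the post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data: rows = participants, columns = features; y = elevated HOMA-IR (0/1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))  # e.g., wearables + demographics + lab values
y = (X[:, 0] + 0.5 * X[:, 5] + rng.normal(size=1000) > 0).astype(int)

# Nested feature subsets, mimicking the comparison described above.
feature_sets = {
    "wearables+demographics": list(range(0, 6)),
    "+ fasting glucose":      list(range(0, 7)),
    "+ full blood panel":     list(range(0, 12)),
}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, cols in feature_sets.items():
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te[:, cols])[:, 1])
    print(f"{name}: auROC = {auc:.2f}")
```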


# REGEN: Empowering personalized recommendations with natural language

Google Research has introduced REGEN, a benchmark dataset designed to evolve recommender systems from simple item predictors into conversational agents capable of natural language interaction. By augmenting the Amazon Product Reviews dataset with synthetic critiques and narratives using Gemini 1.5 Flash, the researchers provide a framework for training models to understand user feedback and explain their suggestions. The study demonstrates that integrating natural language critiques significantly improves recommendation accuracy while enabling models to generate personalized, context-aware content.

### Composition of the REGEN Dataset

* The dataset enriches the existing Amazon Product Reviews archive by adding synthetic conversational elements, specifically targeting the gap in datasets that support natural language feedback.
* **Critiques** are generated for similar item pairs within hierarchical categories, allowing users to guide the system by requesting specific changes, such as a different color or increased storage.
* **Narratives** provide contextual depth through purchase reasons, product endorsements, and concise user summaries, helping the system justify its recommendations to the end-user.

### Unified Generative Modeling Approaches

* The researchers framed a "jointly generative" task where models must process a purchase history and optional critique to output both a recommended item ID and a supporting narrative.
* The **FLARE (Hybrid)** architecture uses a sequential recommender for item prediction based on collaborative filtering, which then feeds into a Gemma 2B LLM to generate the final text narrative.
* The **LUMEN (Unified)** model functions as an end-to-end system where item IDs and text tokens are integrated into a single vocabulary, allowing one LLM to handle critiques, recommendations, and narratives simultaneously.

### Performance and Impact of User Feedback

* Incorporating natural language critiques consistently improved recommendation metrics across different architectures, demonstrating that language-guided refinement is a powerful tool for accuracy.
* In the Office domain, the FLARE hybrid model's Recall@10—a measure of how often the desired item appears in the top 10 results—increased from 0.124 to 0.1402 when critiques were included (a minimal implementation of the metric follows the post).
* Results indicate that models trained on REGEN can achieve performance comparable to state-of-the-art specialized recommenders while maintaining high-quality natural language generation.

The REGEN dataset and the accompanying LUMEN architecture provide a path forward for building more transparent and interactive AI assistants. For developers and researchers, utilizing these conversational benchmarks is essential for moving beyond "black box" recommendations toward systems that can explain their logic and adapt to specific user preferences in real time.
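
Recall@10 is exactly what the parenthetical above says: the fraction of test cases whose target item lands in the top 10 recommendations. A minimal implementation, with toy data for illustration:

```python
def recall_at_k(ranked_items: list[list[str]], targets: list[str], k: int = 10) -> float:
    """Fraction of test cases whose target item appears in the top-k list."""
    hits = sum(target in ranking[:k]
               for ranking, target in zip(ranked_items, targets))
    return hits / len(targets)

# Toy usage: two users; the first target is in the top 10, the second is not.
rankings = [["item7", "item3", "item9"] + [f"x{i}" for i in range(7)],
            ["item1", "item2"] + [f"y{i}" for i in range(8)]]
print(recall_at_k(rankings, targets=["item3", "item42"]))  # 0.5
```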