Google Research / gen-ai



Next generation medical image interpretation with MedGemma 1.5 and medical speech to text with MedASR

Google Research has introduced MedGemma 1.5 4B and MedASR, expanding its suite of open medical AI models to support more complex clinical workflows. These updates significantly enhance the interpretation of high-dimensional imaging and medical speech-to-text, providing a compute-efficient foundation for healthcare developers to build upon. By keeping the models open-access and available on Hugging Face and Vertex AI, Google aims to accelerate the integration of multimodal AI into real-world medical applications.

### Multimodal Advancements in MedGemma 1.5

The latest update to the MedGemma 4B model focuses on high-dimensional and longitudinal data, moving beyond simple 2D image interpretation.

* **3D Medical Imaging:** The model now supports volumetric representations from CT scans and MRIs, as well as whole-slide histopathology imaging.
* **Longitudinal Review:** New capabilities allow for the review of chest X-ray time series, helping clinicians track disease progression over time.
* **Anatomical Localization:** Developers can use the model to identify and localize specific anatomical features within chest X-rays.
* **Document Understanding:** Enhanced support for extracting structured data from complex medical lab reports and documents.
* **Edge Capability:** The 4B parameter size is small enough to run offline while remaining accurate enough for core medical reasoning tasks.

### Medical Speech-to-Text with MedASR

MedASR is a specialized automatic speech recognition (ASR) model designed to bridge the gap between clinical dialogue and digital documentation.

* **Clinical Dictation:** The model is fine-tuned for medical terminology and the unique nuances of clinical dictation.
* **Integrated Reasoning:** MedASR is designed to pair seamlessly with MedGemma, allowing transcribed text to be immediately processed for advanced medical reasoning or summarization.
* **Accessibility:** Like other HAI-DEF models, it is free for research and commercial use and hosted on both Hugging Face and Google Cloud's Vertex AI.

### Performance Benchmarks and Community Impact

Google is incentivizing innovation through improved performance metrics and community-driven challenges.

* **Accuracy Gains:** Internal benchmarks show MedGemma 1.5 improved disease-related CT classification by 3% and MRI classification by 14% compared to the previous version.
* **MedGemma Impact Challenge:** A Kaggle-hosted hackathon with $100,000 in prizes has been launched to encourage developers to find creative applications for these multimodal tools.
* **Model Collection:** The update complements existing tools like the MedSigLIP image encoder and the larger MedGemma 27B model, which remains the preferred choice for complex, text-heavy medical applications.

Developers and researchers are encouraged to use MedGemma 1.5 for tasks requiring efficient, offline multimodal processing, and MedASR to automate clinical documentation. By participating in the MedGemma Impact Challenge, the community can help define the next generation of AI-assisted medical diagnostics and workflows.
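Since the models are distributed through Hugging Face, local use might look like the minimal sketch below, which loads a MedGemma checkpoint with the `transformers` library for image-plus-text interpretation. The checkpoint name, input file, and prompt are illustrative assumptions, not details from the announcement.

```python
# Hedged sketch: local inference with a MedGemma checkpoint via transformers.
# The model ID and input file are assumptions for illustration.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/medgemma-4b-it"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("chest_xray.png")},
        {"type": "text", "text": "Describe any abnormal findings in this chest X-ray."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```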


Google Research 2025: Bolder breakthroughs, bigger impact

Google Research in 2025 has shifted toward an accelerated "Magic Cycle" that rapidly translates foundational breakthroughs into real-world applications across science, society, and consumer products. By prioritizing model efficiency, factuality, and agentic capabilities, the organization is moving beyond static text generation toward interactive, multi-modal systems that solve complex global challenges. This evolution is underpinned by a commitment to responsible AI development, ensuring that new technologies like quantum computing and generative UI are both safe and culturally inclusive.

## Enhancing Model Efficiency and Factuality

* Google introduced new efficiency-focused techniques like block verification (an evolution of speculative decoding) and the LAVA scheduling algorithm, which optimizes resource allocation in large cloud data centers.
* The Gemini 3 model achieved state-of-the-art results on factuality benchmarks, including SimpleQA Verified and the newly released FACTS benchmark suite, by emphasizing grounded world knowledge.
* Research into retrieval-augmented generation (RAG) led to the LLM Re-Ranker in Vertex AI, which helps models determine whether they have sufficient context to answer accurately.
* The Gemma open model expanded to support over 140 languages, aided by the TUNA taxonomy and the Amplify Initiative to improve socio-cultural intelligence and data representation.

## Interactive Experiences through Generative UI

* A novel implementation of generative UI allows Gemini 3 to dynamically create visual interfaces, web pages, and tools in response to user prompts rather than returning static text.
* This technology is powered by specialized models like "Gemini 3-interactive," which are trained to output structured code and design elements.
* These capabilities have been integrated into AI Mode within Google Search, allowing for more immersive and customizable user journeys.

## Advanced Architectures and Agentic AI

* Google is exploring hybrid model architectures, such as Jamba-style models that combine state space models (SSMs) with traditional attention mechanisms to handle long contexts more efficiently.
* The development of agentic AI focuses on models that can reason, plan, and use tools, exemplified by Project Astra, a prototype for a universal AI agent.
* Specialized models like Gemini 3-code have been optimized to act as autonomous collaborators for software developers, assisting in complex coding tasks and system design.

## AI for Science and Planetary Health

* In biology, research teams used AI to map human heart and brain structures and employed RoseTTAFold-Diffusion to design new proteins for therapeutic use.
* The NeuralGCM model combines traditional physics with machine learning for faster, more accurate weather and climate forecasting.
* Environmental initiatives include the FireSat satellite constellation for global wildfire detection and the expansion of AI-driven flood forecasting and contrail mitigation.

## Quantum Computing and Responsible AI

* Google achieved significant milestones in quantum error correction, developing low-overhead codes that bring the industry closer to a reliable, large-scale quantum computer.
* Security and safety remain central, with the expansion of SynthID, a watermarking tool for AI-generated text, audio, and video, to help users identify synthetic content.
* The team continues to refine the Secure AI Framework (SAIF) to defend against emerging threats while promoting the safe deployment of generative media models like Veo and Imagen.

To maximize the impact of these advancements, organizations should focus on integrating agentic workflows and RAG-based architectures so that their AI implementations are both factual and capable of performing multi-step tasks. Developers can leverage the Gemma open models to build culturally aware applications that scale across diverse global markets.
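Since block verification is described as an evolution of speculative decoding, a brief sketch of the base technique may help. The snippet below shows the standard draft-then-verify loop with greedy acceptance; `draft_model` and `target_model` are hypothetical stand-ins, and the full method uses a rejection-sampling acceptance rule (with block verification scoring whole draft blocks jointly) rather than this simplified check.

```python
# Hedged sketch of speculative decoding (batch size 1, greedy acceptance).
import torch

@torch.inference_mode()
def speculative_step(target_model, draft_model, input_ids, k=4):
    """Draft k tokens with the cheap model, then verify them in one target pass."""
    draft_ids = input_ids
    for _ in range(k):  # cheap autoregressive drafting
        logits = draft_model(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)

    # A single parallel pass of the expensive target model scores all drafts.
    target_logits = target_model(draft_ids).logits
    n = input_ids.shape[-1]
    accepted = input_ids
    for i in range(k):
        # Logits at position n+i-1 predict the token at position n+i.
        target_tok = target_logits[:, n + i - 1, :].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, target_tok], dim=-1)
        if target_tok.item() != draft_ids[0, n + i].item():
            break  # first disagreement: keep the corrected token and stop
    return accepted
```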


Gemini provides automated feedback for theoretical computer scientists at STOC 2026

Google Research launched an experimental program for the STOC 2026 conference using a specialized Gemini model to provide automated, rigorous feedback on theoretical computer science submissions. By identifying critical logical errors and proof gaps within a 24-hour window, the tool demonstrated that advanced AI can serve as a powerful pre-vetting collaborator for high-level mathematical research. The overwhelmingly positive reception from authors indicates that AI can effectively augment the human peer-review process by improving paper quality before formal submission.

## Advanced Reasoning via Inference Scaling

- The tool utilized an advanced version of Gemini 2.5 Deep Think specifically optimized for mathematical rigor.
- It employed inference scaling methods, allowing the model to explore and combine multiple possible solutions and reasoning traces simultaneously.
- This non-linear approach to problem-solving helps the model focus on the most salient technical issues while significantly reducing the likelihood of hallucinations.

## Structured Technical Feedback

- Feedback was delivered in a structured format that included a high-level summary of the paper's core contributions.
- The model provided a detailed analysis of potential mistakes, specifically targeting errors within lemmas, theorems, and logical proofs.
- Authors also received a categorized list of minor corrections, such as inconsistent variable naming and typographical errors.

## Identified Technical Issues and Impact

- The pilot saw high engagement, with over 80% of STOC 2026 submitters opting in for the AI-generated review.
- The tool successfully identified "critical bugs" and calculation errors that had previously evaded human authors for months.
- Survey results showed that 97% of participants found the feedback helpful, and 81% reported that the tool improved the overall clarity and readability of their work.

## Expert Verification and Hallucinations

- Because the users were domain experts, they were able to act as a filter, distinguishing between deep technical insights and occasional model hallucinations.
- While the model sometimes struggled to parse complex notation or interpret figures, authors valued the "neutral tone" and the speed of the two-day turnaround.
- The feedback was used as a starting point for human verification, allowing researchers to refine their arguments rather than blindly following the model's output.

## Future Outlook and Educational Potential

- Beyond professional research, 75% of surveyed authors see significant educational value in using the tool to train students in mathematical rigor.
- The experiment's success has led to 88% of participants expressing interest in having continuous access to such a tool throughout their entire research and drafting process.

The success of the STOC 2026 pilot suggests that researchers should consider integrating specialized LLMs early in the drafting phase to catch "embarrassing" or logic-breaking errors. While the human expert remains the final arbiter of truth, these tools provide a necessary layer of automated verification that can accelerate the pace of scientific discovery.
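The post does not publish the pipeline's internals, but the structured format it describes (summary, potential major errors, minor corrections) maps naturally onto constrained JSON output. Below is a hedged sketch using the public `google-genai` SDK with an ordinary Gemini model as a stand-in for the specialized Deep Think variant; the schema fields and file name are assumptions.

```python
# Hedged sketch: requesting STOC-style structured review feedback as JSON.
from google import genai
from pydantic import BaseModel

class ReviewFeedback(BaseModel):
    summary: str                       # high-level summary of contributions
    potential_major_errors: list[str]  # suspected gaps in lemmas/theorems/proofs
    minor_corrections: list[str]       # naming inconsistencies, typos, etc.

client = genai.Client()  # reads GOOGLE_API_KEY from the environment
paper_text = open("submission.tex").read()

response = client.models.generate_content(
    model="gemini-2.5-pro",  # stand-in; the pilot used a Deep Think variant
    contents=[
        "Review this theoretical CS paper. Check each lemma, theorem, and "
        "proof for logical gaps, and list minor presentation issues.",
        paper_text,
    ],
    config={
        "response_mime_type": "application/json",
        "response_schema": ReviewFeedback,
    },
)
feedback = response.parsed  # a ReviewFeedback instance
print(feedback.summary)
```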


Spotlight on innovation: Google-sponsored Data Science for Health Ideathon across Africa

Google Research, in partnership with several pan-African machine learning communities, recently concluded the Africa-wide Data Science for Health Ideathon to address regional medical challenges. By providing access to specialized open-source health models and technical mentorship, the initiative empowered local researchers to develop tailored solutions for issues ranging from maternal health to oncology. The event demonstrated that localized innovation, supported by high-performance AI foundations, can effectively bridge healthcare gaps in resource-constrained environments.

## Collaborative Framework and Objectives

* The Ideathon was launched at the 2025 Deep Learning Indaba in Kigali, Rwanda, in collaboration with SisonkeBiotik, Ro'ya, and DS-I Africa.
* The primary goal was to foster capacity building within the African AI community, moving beyond theoretical research toward the execution of practical healthcare tools.
* Participants received hands-on training on Google's specialized health models and were supported with Google Cloud Vertex AI compute credits and mentorship from global experts.
* Submissions were evaluated based on their innovation, technical feasibility, and contextual relevance to African health systems.

## Technical Foundations and Google Health Models

* Developers focused on a suite of open health AI models, including MedGemma for clinical reasoning, TxGemma for therapeutics, and MedSigLIP for medical vision-language tasks.
* The competition utilized a two-phase journey: an initial "Idea Development" stage where teams defined clinical problems and outlined AI approaches, followed by a "Prototype & Pitch" phase.
* Technical implementations frequently involved advanced techniques such as retrieval-augmented generation (RAG) to ensure alignment with local medical protocols and WHO guidelines.
* Fine-tuning methods, specifically Low-Rank Adaptation (LoRA), were utilized by teams to specialize large-scale models like MedGemma-27B-IT for niche datasets.

## Innovative Solutions for Regional Health

* **Dawa Health:** This first-place winner developed an AI-powered cervical cancer screening tool that uses MedSigLIP to identify abnormalities in colposcopy images uploaded via WhatsApp, combined with Gemini RAG for clinical guidance.
* **Solver (CerviScreen AI):** This team built a web application for automated cervical-cytology screening by fine-tuning MedGemma-27B-IT on the CRIC dataset to assist cytopathologists with annotated images.
* **Mkunga:** A maternal health call center that adapts MedGemma and Gemini to provide advice in Swahili using speech-to-text (STT) and text-to-speech (TTS) technologies.
* **HexAI (DermaDetect):** Recognized for the best proof-of-concept, this offline-first mobile app allows community health workers to triage skin conditions using on-device versions of MedSigLIP, specifically designed for low-connectivity areas.

The success of the Ideathon underscores the importance of "local solutions for local priorities." By making sophisticated models like MedGemma and MedSigLIP openly available, the technical barrier to entry is lowered, allowing African developers to build high-impact, culturally and linguistically relevant medical tools. For organizations looking to implement AI in global health, this model of providing foundational tools and cloud resources to local experts remains a highly effective strategy for sustainable innovation.
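As a concrete illustration of the fine-tuning approach the teams used, the sketch below applies LoRA adapters to a MedGemma-style checkpoint with the Hugging Face `peft` library. The checkpoint name, rank, and target modules are illustrative assumptions, not the competitors' actual settings.

```python
# Hedged sketch: LoRA fine-tuning setup for a large medical LLM.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "google/medgemma-27b-text-it"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

lora = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the adapters
    target_modules=["q_proj", "v_proj"],  # adapt the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
# Training then proceeds with a standard Trainer/SFTTrainer on the niche dataset.
```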


Toward provably private insights into AI use

Google Research has introduced Provably Private Insights (PPI), a framework designed to analyze generative AI usage patterns while providing mathematical guarantees of user privacy. By integrating large language models (LLMs) with differential privacy and trusted execution environments (TEEs), the system enables developers to derive aggregate trends from unstructured data without exposing individual user content. This approach ensures that server-side processing remains limited to privacy-preserving computations that are fully auditable by external parties.

### The Role of LLMs in Structured Summarization

The system employs "data expert" LLMs to transform unstructured generative AI data into actionable, structured insights.

* The framework utilizes open-source Gemma 3 models to perform specific analysis tasks, such as classifying transcripts into topics or identifying user frustration levels.
* This "structured summarization" occurs entirely within a TEE, ensuring that the model processes raw data in an environment inaccessible to human operators or external processes.
* Developers can update LLM prompts frequently to answer new research questions without compromising the underlying privacy architecture.

### Confidential Federated Analytics (CFA) Infrastructure

The PPI system is built upon Confidential Federated Analytics, a technique that isolates data through hardware-based security and cryptographic verification.

* User devices encrypt data and define specific authorized processing steps before uploading it to the server.
* A TEE-hosted key management service only releases decryption keys to processing steps that match public, open-source code signatures.
* System integrity is verified using Rekor, a public, tamper-resistant transparency log that allows external parties to confirm that the code running in the TEE is exactly what was published.

### Anonymization via Differential Privacy

Once the LLM extracts features from the data, the system applies differential privacy (DP) to ensure that the final output does not reveal information about any specific individual.

* The extracted categories are aggregated into histograms, with DP noise added to the final counts to prevent the identification of single users.
* Because the privacy guarantee is applied at the aggregation stage, the system remains secure even if a developer uses a prompt specifically designed to isolate a single user's data.
* All aggregation algorithms are open-source and reproducibly buildable, allowing for end-to-end verifiability of the privacy claims.

By open-sourcing the PPI stack through the Google Parfait project and deploying it in applications like Pixel Recorder, this framework establishes a new standard for transparent data analysis. Developers should look to integrate similar TEE-based federated analytics to balance the need for product insights with the necessity of provable, hardware-backed user privacy.
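The aggregation step is the easiest part to make concrete. Below is a minimal sketch of a differentially private histogram release in the spirit of PPI's final stage: each user contributes one extracted category, counts are tallied, and Laplace noise calibrated to the sensitivity is added before anything leaves the pipeline. The epsilon value and categories are illustrative; the production system's exact mechanism is not assumed here.

```python
# Hedged sketch: epsilon-DP histogram of LLM-extracted topic labels.
import numpy as np

def dp_histogram(user_topics, categories, epsilon=1.0, rng=None):
    """Each user contributes exactly one topic, so the L1 sensitivity is 1."""
    rng = rng or np.random.default_rng()
    counts = np.array([sum(t == c for t in user_topics) for c in categories], float)
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    noise = rng.laplace(0.0, 1.0 / epsilon, size=len(categories))
    return dict(zip(categories, counts + noise))

released = dp_histogram(
    ["coding help", "travel planning", "coding help"],
    categories=["coding help", "travel planning", "health"],
    epsilon=1.0,
)
print(released)  # noisy counts that are safe to release in aggregate
```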


A picture's worth a thousand (private) words: Hierarchical generation of coherent synthetic photo albums

Researchers at Google have developed a hierarchical method for generating differentially private (DP) synthetic photo albums, providing a way to share representative datasets while protecting sensitive individual information. By utilizing an intermediate text representation and a two-stage generation process, the approach maintains thematic coherence across multiple images in an album, a significant challenge for traditional synthetic data methods. This framework allows organizations to apply standard, non-private analytical techniques to safe synthetic substitutes rather than modifying every individual analysis method for differential privacy.

## The Hierarchical Generation Process

* The workflow begins by converting original photo albums into structured text; an AI model generates detailed captions for each image and a summary for the entire album.
* Two large language models (LLMs) are privately fine-tuned using DP-SGD: the first is trained to produce album summaries, and the second generates individual photo captions based on those summaries.
* Synthetic data is then produced hierarchically, where the model first generates a global album summary to serve as context, followed by a series of individual photo captions that remain consistent with that context.
* The final step uses a text-to-image AI model to transform the private, synthetic text captions back into a set of coherent images.

## Benefits of Intermediate Text Representations

* Text summarization is inherently privacy-enhancing because it is a "lossy" operation, meaning the text description is unlikely to capture the exact unique details of an original photo.
* Using text as a midpoint allows for more efficient resource management, as generated albums can be filtered and curated at the text level before undergoing the computationally expensive process of image generation.
* The hierarchical approach ensures that photos within a synthetic album share the same characters and themes, as every caption in a set is derived from the same contextual summary.
* Training two separate models with shorter context windows is significantly more efficient than training one large model, because the computational cost of self-attention scales quadratically with the length of the context.

This hierarchical, text-mediated approach demonstrates that high-level semantic information and thematic coherence can be preserved in synthetic datasets without sacrificing individual privacy. Organizations should consider this workflow (translating complex multi-modal data into structured text before synthesis) to scale differentially private data generation for advanced modeling and analysis.
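The two-stage sampling is simple to express in code. The sketch below shows the hierarchy under stated assumptions: `summary_model` and `caption_model` stand in for the two DP-SGD fine-tuned LLMs, and `generate` is a hypothetical helper that samples text from a model given a prompt.

```python
# Hedged sketch of hierarchical album synthesis: summary first, then captions
# that all condition on that same summary for thematic coherence.
def generate_synthetic_album(summary_model, caption_model, generate, n_photos=5):
    # Stage 1: a global album summary from the first private model.
    summary = generate(summary_model,
                       prompt="Write a one-paragraph summary of a photo album.")

    # Stage 2: every caption sees the same summary, so characters and themes
    # stay consistent across the album.
    captions = [
        generate(caption_model,
                 prompt=f"Album summary: {summary}\nCaption for photo {i + 1}:")
        for i in range(n_photos)
    ]
    # Albums can be filtered and curated here, cheaply, at the text level
    # before the expensive text-to-image step renders the final images.
    return summary, captions
```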


Teaching Gemini to spot exploding stars with just a few examples

Researchers have demonstrated that Google's Gemini model can classify cosmic events with 93% accuracy, rivaling specialized machine learning models while providing human-readable explanations. By utilizing few-shot learning with only 15 examples per survey, the model addresses the "black box" limitation of traditional convolutional neural networks used in astronomy. This approach enables scientists to efficiently process the millions of alerts generated by modern telescopes while maintaining a transparent and interactive reasoning process.

## Bottlenecks in Modern Transient Astronomy

* Telescopes like the Vera C. Rubin Observatory are expected to generate up to 10 million alerts per night, making manual verification impossible.
* The vast majority of these alerts are "bogus" signals caused by satellite trails, cosmic rays, or instrumental artifacts rather than real supernovae.
* Existing specialized models often provide binary "real" or "bogus" labels without context, forcing astronomers to either blindly trust the output or spend hours on manual verification.

## Multimodal Few-Shot Learning for Classification

* The research utilized few-shot learning, providing Gemini with only 15 annotated examples for three major surveys: Pan-STARRS, MeerLICHT, and ATLAS.
* Input data consisted of image triplets (a "new" alert image, a "reference" image of the same sky patch, and a "difference" image), each 100x100 pixels in size.
* The model successfully generalized across different telescopes with varying pixel scales, ranging from 0.25" per pixel for Pan-STARRS to 1.8" per pixel for ATLAS.
* Beyond simple labels, Gemini generates a textual description of observed features and an interest score to help astronomers prioritize follow-up observations.

## Expert Validation and Self-Assessment

* A panel of 12 professional astronomers evaluated the model using a 0–5 coherence rubric, confirming that Gemini's logic aligned with expert reasoning.
* The study found that Gemini can effectively assess its own uncertainty; low self-assigned "coherence scores" were strong indicators of likely classification errors.
* This ability to flag its own potential mistakes allows the model to act as a reliable partner, alerting scientists when a specific case requires human intervention.

The transition from "black box" classifiers to interpretable AI assistants allows the astronomical community to scale with the data flood of next-generation telescopes. By combining high-accuracy classification with transparent reasoning, researchers can maintain scientific rigor while processing millions of cosmic events in real time.
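A few-shot prompt of this shape is easy to assemble with the public `google-genai` SDK, which accepts interleaved text and PIL images. The sketch below is a hedged reconstruction; the file names, label strings, and model choice are assumptions, not the paper's exact setup.

```python
# Hedged sketch: few-shot real/bogus classification from image triplets.
from google import genai
from PIL import Image

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

def triplet(prefix):
    """Load the new / reference / difference cutouts for one alert."""
    return [Image.open(f"{prefix}_{kind}.png") for kind in ("new", "ref", "diff")]

contents = ["Classify each optical transient alert as REAL or BOGUS and "
            "explain the visual evidence in one or two sentences."]
few_shot = [
    ("example_real", "REAL: compact point source visible in new and difference images."),
    ("example_bogus", "BOGUS: linear streak consistent with a satellite trail."),
]
for prefix, label in few_shot:
    contents += triplet(prefix) + [label]   # annotated exemplars
contents += triplet("candidate") + ["Classify this alert."]

response = client.models.generate_content(model="gemini-2.5-flash", contents=contents)
print(response.text)
```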


Learn Your Way: Reimagining textbooks with generative AI

Google Research has introduced Learn Your Way, an AI-driven educational experiment that reimagines traditional textbooks as personalized, multimodal learning journeys. By leveraging the LearnLM family of models integrated into Gemini 2.5 Pro, the system transforms static source material into tailored content based on a student's specific grade level and interests. Early efficacy studies demonstrate that this approach significantly enhances retention, with students scoring 11 percentage points higher than those using standard digital readers.

### Pedagogical Foundations and Dual Coding

The research is built on the "dual coding theory," which suggests that forming mental connections between different representations of information strengthens conceptual understanding.

* The system moves away from a "one-size-fits-all" model toward a student-driven experience where learners can choose and intermix formats.
* Personalization is used as a tool to enhance situational interest and motivation by adapting content to specific student attributes.
* The framework incorporates active learning through real-time quizzing and feedback to address knowledge gaps as they arise.

### The Personalization Pipeline

The technical architecture begins with a layered pipeline that processes source material, such as a textbook PDF, to create a foundational text for all other formats.

* The original material is first "re-leveled" to match the learner's reported grade level while maintaining the integrity and scope of the curriculum.
* Generic examples within the text are strategically replaced with personalized examples based on user interests, such as sports, music, or food.
* This personalized base text serves as the primary input for generating all subsequent multimodal representations, ensuring consistency across formats.

### Multimodal Content Generation

To produce a wide variety of educational assets, the system utilizes a combination of large language models and specialized AI agents.

* **Agentic Workflows:** While tools like mind maps and timelines are generated directly by Gemini, complex assets like narrated slides use multi-step agentic workflows to ensure pedagogical effectiveness.
* **Custom Visuals:** Because general-purpose image models often struggle with educational accuracy, the researchers fine-tuned a dedicated model specifically for generating educational illustrations.
* **Diverse Representations:** The interface provides "immersive text" with embedded questions, audio lessons for auditory learning, and interactive slides that mimic recorded classroom sessions.

### Research Outcomes and Future Application

The project's effectiveness was validated through a study comparing the GenAI approach against standard digital reading materials.

* Students using the personalized AI tools showed a significant improvement in retention test scores.
* Beyond retention, the system aims to transform passive reading into an active, multimodal experience that follows established learning science principles.
* The "Learn Your Way" experiment is currently available on Google Labs, providing a practical look at how adaptive, learner-centric materials might replace static textbooks in future K-12 and higher education settings.
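The first two pipeline stages are plain sequential LLM rewrites, which the hedged sketch below makes concrete using the public `google-genai` SDK. The prompts, grade level, interest, and `rewrite` helper are illustrative assumptions; the production prompts are not published.

```python
# Hedged sketch of the layered personalization pipeline: re-level first, then
# personalize examples, producing the base text for all other formats.
from google import genai

client = genai.Client()

def rewrite(text, instruction, model="gemini-2.5-pro"):
    prompt = (f"{instruction}\n"
              "Preserve the curriculum's scope, facts, and terminology.\n\n" + text)
    return client.models.generate_content(model=model, contents=prompt).text

chapter = open("chapter.txt").read()
leveled = rewrite(chapter, "Rewrite this material for an 8th-grade reading level.")
personalized = rewrite(leveled, "Replace generic examples with basketball examples.")
# `personalized` now serves as the consistent input for mind maps, narrated
# slides, quizzes, and the other multimodal representations.
```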


How Google’s AI can help transform health professions education

To address a projected global deficit of 11 million healthcare workers by 2030, Google Research is exploring how generative AI can provide personalized, competency-based education for medical professionals. By combining qualitative user-centered design with quantitative benchmarking of the pedagogically fine-tuned LearnLM model, researchers have demonstrated that AI can effectively mimic the behaviors of high-quality human tutors. The studies conclude that specialized models, now integrated into Gemini 2.5 Pro, can significantly enhance clinical reasoning and adapt to the individual learning styles of medical students.

## Learner-Centered Design and Participatory Research

* Researchers conducted interdisciplinary co-design workshops featuring medical students, clinicians, and AI researchers to identify specific educational needs.
* The team developed a rapid prototype of an AI tutor designed to guide learners through clinical reasoning exercises anchored in synthetic clinical vignettes.
* Qualitative feedback from medical residents and students highlighted a demand for "preceptor-like" behaviors, such as the ability to manage cognitive load, provide constructive feedback, and encourage active reflection.
* Analysis revealed that learners specifically value AI tools that can identify and bridge individual knowledge gaps rather than providing generic information.

## Quantitative Benchmarking via LearnLM

* The study utilized LearnLM, a version of Gemini fine-tuned specifically for educational pedagogy, and compared its performance against Gemini 1.5 Pro.
* Evaluations were conducted using 50 synthetic scenarios covering a spectrum of medical education, ranging from preclinical topics like platelet activation to clinical subjects such as neonatal jaundice.
* Medical students engaged in 290 role-playing conversations, which were then evaluated based on four primary metrics: overall experience, meeting learning needs, enjoyability, and understandability.
* Physician educators performed blinded reviews of conversation transcripts to assess whether the AI adhered to medical education standards and core competencies.

## Pedagogical Performance and Expert Evaluation

* LearnLM was consistently rated higher than the base model by both students and educators, with experts noting it behaved "more like a very good human tutor."
* The fine-tuned model demonstrated a superior ability to maintain a conversation plan and use grounding materials to provide accurate, context-aware instruction.
* Findings suggest that pedagogical fine-tuning is essential for AI to move beyond simple fact-delivery and toward true interactive tutoring.
* These specialized learning capabilities have been transitioned from the research phase into Gemini 2.5 Pro to support broader educational applications.

By integrating these specialized AI behaviors into medical training pipelines, institutions can provide scalable, individualized support to students. The transition of LearnLM's pedagogical features into Gemini 2.5 Pro provides a practical framework for developers to create tools that not only provide medical information but actively foster the critical thinking skills required for clinical practice.


From massive models to mobile magic: The tech behind YouTube real-time generative AI effects

YouTube has successfully deployed over 20 real-time generative AI effects by distilling the capabilities of massive cloud-based models into compact, mobile-ready architectures. By utilizing a "teacher-student" training paradigm, the system overcomes the computational bottlenecks of high-fidelity generative AI while ensuring the output remains responsive on mobile hardware. This approach allows complex transformations, such as cartoon style transfer and makeup application, to run frame-by-frame on-device without sacrificing the user's identity.

### Data Curation and Diversity

* The foundation of the effects pipeline relies on high-quality, properly licensed face datasets.
* Datasets are meticulously filtered to ensure a uniform distribution across different ages, genders, and skin tones.
* The Monk Skin Tone Scale is used as a benchmark to ensure the effects work equitably for all users.

### The Teacher-Student Framework

* **The Teacher:** A large, powerful pre-trained model (initially StyleGAN2 with StyleCLIP, later transitioning to Google DeepMind's Imagen) acts as the "expert" that generates high-fidelity visual effects.
* **The Student:** A lightweight UNet-based architecture designed for mobile efficiency. It utilizes a MobileNet backbone for both the encoder and decoder to ensure fast frame-by-frame processing.
* The distillation process narrows the scope of the massive teacher model into a student model focused on a single, specific task.

### Iterative Distillation and Training

* **Data Generation:** The teacher model processes thousands of images to create "before and after" pairs. These are augmented with synthetic elements like AR glasses, sunglasses, and hand occlusions to improve real-world robustness.
* **Optimization:** The student model is trained using a sophisticated combination of loss functions, including L1, LPIPS, Adaptive, and Adversarial loss, to balance numerical accuracy with aesthetic quality.
* **Architecture Search:** Neural architecture search is employed to tune "depth" and "width" multipliers, identifying the most efficient model structure for different mobile hardware constraints.

### Addressing the Inversion Problem

* A major challenge in real-time effects is the "inversion problem," where the model struggles to represent a real face in latent space, leading to a loss of the user's identity (e.g., changes in skin tone or clothing).
* YouTube uses Pivotal Tuning Inversion (PTI) to ensure that the user's specific features are preserved during the generative process.
* By editing images in the latent space, a compressed numerical representation, the system can apply stylistic changes while maintaining the core characteristics of the original video stream.

By combining advanced model distillation with on-device optimization via MediaPipe, YouTube demonstrates a practical path for bringing heavy generative AI research into consumer-facing mobile applications.
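The loss combination is the most transferable detail. Below is a hedged PyTorch sketch of a distillation objective in the spirit described (L1 plus LPIPS plus an adversarial term); the weights are illustrative, the adaptive loss is omitted for brevity, and `discriminator` is a placeholder module rather than YouTube's actual network.

```python
# Hedged sketch: combined distillation loss for a teacher-student image model.
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net="alex")  # expects 3-channel images scaled to [-1, 1]

def distillation_loss(student_out, teacher_out, discriminator, w=(1.0, 0.5, 0.1)):
    l1 = F.l1_loss(student_out, teacher_out)                    # numerical accuracy
    lp = perceptual(student_out, teacher_out).mean()            # perceptual quality
    adv = -torch.log(discriminator(student_out) + 1e-8).mean()  # fool the critic
    return w[0] * l1 + w[1] * lp + w[2] * adv
```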


Enabling physician-centered oversight for AMIE

Guardrailed-AMIE (g-AMIE) is a diagnostic AI framework designed to perform patient history-taking while strictly adhering to safety guardrails that prevent it from providing direct medical advice. By decoupling data collection from clinical decision-making, the system enables an asynchronous oversight model where primary care physicians (PCPs) review and finalize AI-generated medical summaries. In virtual clinical trials, g-AMIE's diagnostic outputs and patient communications were preferred by overseeing physicians and patient actors over human-led control groups.

## Multi-Agent Architecture and Guardrails

* The system utilizes a multi-agent setup powered by Gemini 2.0 Flash, consisting of a dialogue agent, a guardrail agent, and a SOAP note agent.
* The dialogue agent conducts history-taking in three distinct phases: general information gathering, targeted validation of a differential diagnosis, and a conclusion phase for patient questions.
* A dedicated guardrail agent monitors and rephrases responses in real-time to ensure the AI abstains from sharing individualized diagnoses or treatment plans directly with the patient.
* The SOAP note agent employs sequential multi-step generation to separate summarization tasks (Subjective and Objective) from more complex inferential tasks (Assessment and Plan).

## The Clinician Cockpit and Asynchronous Oversight

* To facilitate human review, researchers developed the "clinician cockpit," a web interface co-designed with outpatient physicians through semi-structured interviews.
* The interface is structured around the standard SOAP note format, presenting the patient's perspective, measurable data, differential diagnosis, and proposed management strategy.
* This framework allows overseeing PCPs to review cases asynchronously, editing the AI's proposed differential diagnoses and management plans before sharing a final message with the patient.
* The separation of history-taking from decision-making ensures that licensed medical professionals retain ultimate accountability for patient care.

## Performance Evaluation via Virtual OSCE

* The system was evaluated in a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) involving 60 case scenarios.
* g-AMIE's performance was compared against primary care physicians, nurse practitioners, and physician assistants who were required to operate under the same restrictive guardrails.
* Overseeing PCPs and independent physician raters preferred g-AMIE's diagnostic accuracy and management plans over those of the human control groups.
* Patient actors reported a preference for the messages generated by g-AMIE compared to those drafted by human clinicians in the study.

While g-AMIE demonstrates high potential for human-AI collaboration in diagnostics, the researchers emphasize that results should be interpreted with caution. The workflow was specifically optimized for AI characteristics, and human clinicians may require specialized training to perform effectively within such highly regulated guardrail frameworks.
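The guardrail pattern itself is straightforward to prototype. The hedged sketch below shows one way a checker model could veto and rephrase a dialogue agent's draft; the prompts, model name, and `ask` helper are assumptions for illustration, not g-AMIE's actual agents.

```python
# Hedged sketch: a guardrail agent that screens draft replies for advice.
from google import genai

client = genai.Client()

def ask(system, user, model="gemini-2.0-flash"):
    return client.models.generate_content(
        model=model, contents=f"{system}\n\n{user}"
    ).text

def guarded_reply(patient_message, history):
    draft = ask("You take a patient history. Ask focused questions; "
                "never diagnose or recommend treatment.",
                f"History so far:\n{history}\nPatient: {patient_message}")
    verdict = ask("Answer YES or NO only: does this reply share an "
                  "individualized diagnosis or treatment plan?", draft)
    if verdict.strip().upper().startswith("YES"):
        draft = ask("Rephrase so it gathers information without giving "
                    "medical advice.", draft)
    return draft
```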


Zooming in: Efficient regional environmental risk assessment with generative AI

Google Research has introduced a dynamical-generative downscaling method that combines physics-based climate modeling with probabilistic diffusion models to produce high-resolution regional environmental risk assessments. By bridging the resolution gap between global Earth system models and city-level data needs, this approach provides a computationally efficient way to quantify climate uncertainties at a 10 km scale. This hybrid technique significantly reduces error rates compared to traditional statistical methods while remaining far less computationally expensive than full-scale dynamical simulations.

## The Resolution Gap in Climate Modeling

* Traditional Earth system models typically operate at a resolution of ~100 km, which is too coarse for city-level planning regarding floods, heatwaves, and wildfires.
* Existing "dynamical downscaling" uses regional climate models (RCMs) to provide physically realistic 10 km projections, but the computational cost is too high to apply to large ensembles of climate data.
* Statistical downscaling offers a faster alternative but often fails to capture complex local weather patterns or extreme events, and it struggles to generalize to unprecedented future climate conditions.

## A Hybrid Dynamical-Generative Framework

* The process begins with a "physics-based first pass," where an RCM downscales global data to an intermediate resolution of 50 km to establish a common physical representation.
* A generative AI system called "R2D2" (Regional Residual Diffusion-based Downscaling) then adds fine-scale details, such as the effects of complex topography, to reach the target 10 km resolution.
* R2D2 specifically learns the "residual" (the difference between the intermediate- and high-resolution fields), which simplifies the learning task and improves the model's ability to generalize to unseen environmental conditions.

## Efficiency and Accuracy in Risk Assessment

* The model was trained and validated using the Western United States Dynamically Downscaled Dataset (WUS-D3), which utilizes the "gold standard" WRF model.
* The dynamical-generative approach reduced fine-scale errors by over 40% compared to popular statistical methods like BCSD and STAR-ESDM.
* A key advantage of this method is its scalability; the AI requires training on only one dynamically downscaled model to effectively process outputs from various other Earth system models, allowing for the rapid assessment of large climate ensembles.

By combining the physical grounding of traditional regional models with the speed of diffusion-based AI, researchers can now produce granular risk assessments that were previously cost-prohibitive. This method allows for a more robust exploration of future climate scenarios, providing essential data for farming, water management, and community protection.
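The residual trick is compact enough to show directly. The hedged sketch below computes the training target and the final reconstruction under stated assumptions: bilinear upsampling, a 5x resolution ratio, and generic tensor shapes; R2D2's actual regridding and diffusion details are not assumed.

```python
# Hedged sketch of residual learning for downscaling: the generative model
# predicts only the fine-scale difference, not the full high-resolution field.
import torch
import torch.nn.functional as F

def residual_target(field_50km, field_10km):
    """field_50km: (B, C, H, W) RCM output; field_10km: (B, C, 5H, 5W) truth."""
    up = F.interpolate(field_50km, size=field_10km.shape[-2:], mode="bilinear")
    return field_10km - up  # the fine-scale detail the diffusion model must learn

def reconstruct(field_50km, predicted_residual):
    up = F.interpolate(field_50km, size=predicted_residual.shape[-2:],
                       mode="bilinear")
    return up + predicted_residual  # final 10 km field
```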


Google Research at Google I/O 2025

Google Research at I/O 2025 showcases the "research to reality" transition, highlighting how years of foundational breakthroughs are now being integrated into Gemini models and specialized products. By focusing on multimodal capabilities, pedagogy, and extreme model efficiency, Google aims to democratize access to advanced AI while ensuring it remains grounded and useful across global contexts.

## Specialized Healthcare Models: MedGemma and AMIE

* **MedGemma:** This new open model, based on Gemma 3, is optimized for multimodal medical tasks such as radiology image analysis and clinical data summarization. It is available in 4B and 27B sizes, performing similarly to much larger models on the MedQA benchmark while remaining small enough for efficient local fine-tuning.
* **AMIE (Articulate Medical Intelligence Explorer):** A research AI agent designed for diagnostic medical reasoning. Its latest multimodal version can now interpret and reason about visual medical information, such as skin lesions or medical imaging, to assist clinicians in diagnostic accuracy.

## Educational Optimization through LearnLM

* **Gemini 2.5 Pro Integration:** The LearnLM family of models, developed with educational experts, is now integrated into Gemini 2.5 Pro. This fine-tuning enhances STEM reasoning, multimodal understanding, and pedagogical feedback.
* **Interactive Learning Tools:** A new research-optimized quiz experience allows students to generate custom assessments from their own notes, providing specific feedback on right and wrong answers rather than just providing solutions.
* **Global Assessment Pilots:** Through partnerships like the one with Kayma, Google is testing the automatic assessment of short and long-form content in regions like Ghana to scale quality educational tools.

## Multilingual Expansion and On-Device Gemma Models

* **Gemma 3 and 3n:** Research breakthroughs have expanded Gemma 3's support to over 140 languages. The introduction of **Gemma 3n** targets extreme efficiency, capable of running on devices with as little as 2GB of RAM while maintaining low latency and low energy consumption.
* **ECLeKTic Benchmark:** To assist the developer community, Google introduced this novel benchmark specifically for evaluating how well large language models transfer knowledge across different languages.

## Model Efficiency and Factuality in Search

* **Inference Techniques:** Google Research continues to set industry standards for model speed and accessibility through technical innovations like **speculative decoding** and **cascades**, which reduce the computational cost of generating high-quality responses.
* **Grounded Outputs:** Significant focus remains on factual consistency, ensuring that the AI models powering features like AI Overviews in Search provide reliable and grounded information to users.

As Google continues to shrink the gap between laboratory breakthroughs and consumer products, the emphasis remains on making high-performance AI accessible on low-cost hardware and across diverse linguistic landscapes. Developers and researchers can now leverage these specialized tools via platforms like Hugging Face and Vertex AI to build more targeted, efficient applications.
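Of the efficiency techniques named above, cascades are the simplest to sketch: a cheap model answers first and defers to a large model only when its own confidence is low. Everything below is a hedged illustration; the model interface, the confidence signal, and the threshold are assumptions.

```python
# Hedged sketch of a two-model cascade with a log-probability deferral rule.
def cascade_generate(small_model, large_model, prompt, threshold=-0.5):
    # Hypothetical interface: each model returns (text, per-token logprobs).
    answer, logprobs = small_model(prompt)
    avg_logprob = sum(logprobs) / len(logprobs)
    if avg_logprob >= threshold:      # confident: keep the cheap answer
        return answer
    return large_model(prompt)[0]     # otherwise defer to the expensive model
```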


Amplify Initiative: Localized data for globalized AI

The Amplify Initiative by Google Research addresses the critical lack of linguistic and cultural diversity in generative AI training data by establishing an open, community-based platform for localized data collection. By partnering with regional experts to co-create structured, high-quality datasets, the initiative aims to ensure AI models are both representative and effective in solving local challenges across health, finance, and education. This approach shifts data collection from a top-down model to a participatory framework that prioritizes responsible, locally respectful practices in the Global South.

## The Amplify Platform Framework

The initiative is designed to bridge the gap between global AI capabilities and local needs through three core pillars:

* **Participatory Co-creation:** Researchers and local communities collaborate to define specific data needs, ensuring the resulting datasets address region-specific problems like financial literacy or localized health misinformation.
* **Open Access for Innovation:** The platform provides high-quality, multilingual datasets suitable for fine-tuning and evaluating models, specifically empowering developers in the Global South to build tools for their own communities.
* **Author Recognition:** Contributors receive tangible rewards, including professional certificates, research acknowledgments, and data authorship attribution, creating a sustainable ecosystem for expert participation.

## Pilot Implementation in Sub-Saharan Africa

To test the methodology, Google Research partnered with Makerere University's AI Lab in Uganda to conduct an on-the-ground pilot program.

* **Expert Onboarding:** The program trained 259 experts across Ghana, Kenya, Malawi, Nigeria, and Uganda through a combination of in-person workshops and app-based modules.
* **Dataset Composition:** The pilot resulted in 8,091 annotated adversarial queries across seven languages, covering salient domains such as education and finance.
* **Adversarial Focus:** By focusing on adversarial queries, the team captured localized nuances of potential AI harms, including regional stereotypes and specialized advice that generic models often miss.

## Technical Workflow and App-Based Methodology

The initiative utilizes a structured technical pipeline to scale data collection while maintaining high quality and privacy.

* **Privacy-Preserving Android App:** A dedicated app serves as the primary interface for training, data creation, and annotation, allowing experts to contribute from their own environments.
* **Automated Validation:** The app includes built-in feedback loops that use automated checks to ensure queries are relevant and to prevent the submission of semantically similar or duplicate entries.
* **Domain-Specific Annotation:** Experts are provided with specialized annotation topics tailored to their professional backgrounds, ensuring that the metadata for each query is technically accurate and contextually relevant.

The Amplify Initiative provides a scalable blueprint for building inclusive AI by empowering experts in the Global South to define their own data needs. As the project expands to India and Brazil, it offers a vital resource for developers seeking to fine-tune models for local contexts and improve the safety and relevance of AI on a global scale.
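One concrete piece of that workflow is the near-duplicate check. The hedged sketch below implements such a check with sentence embeddings and cosine similarity; the encoder choice and threshold are illustrative assumptions, not the app's actual implementation.

```python
# Hedged sketch: rejecting semantically similar queries at submission time.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def is_near_duplicate(new_query, accepted_queries, threshold=0.85):
    if not accepted_queries:
        return False
    new_emb = encoder.encode(new_query, convert_to_tensor=True)
    old_embs = encoder.encode(accepted_queries, convert_to_tensor=True)
    return bool(util.cos_sim(new_emb, old_embs).max() >= threshold)

print(is_near_duplicate("How do I budget for school fees?",
                        ["Best way to budget school fees?"]))
```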


Geospatial Reasoning: Unlocking insights with generative AI and multiple foundation models

Google Research is introducing Geospatial Reasoning, a new framework that integrates generative AI with specialized foundation models to streamline complex geographical problem-solving. By combining large language models like Gemini with domain-specific data, the initiative seeks to make large-scale spatial analysis accessible to sectors like public health, urban development, and climate resilience. This research effort moves beyond traditional data silos, enabling agentic workflows that can interpret diverse data types, from satellite imagery to population dynamics, through natural language.

### Specialized Foundation Models for Human Activity

* The Population Dynamics Foundation Model (PDFM) captures the complex interplay between human behaviors and their local environments.
* A dedicated trajectory-based mobility foundation model has been developed to process and analyze movement patterns.
* While initially tested in the US, experimental datasets are expanding to include the UK, Australia, Japan, Canada, and Malawi for selected partners.

### Remote Sensing and Vision Architectures

* New models utilize advanced architectures including masked autoencoders, SigLIP, MaMMUT, and OWL-ViT, specifically adapted for the remote sensing domain.
* Training involves high-resolution satellite and aerial imagery paired with text descriptions and bounding box annotations to enable precise object detection.
* The models support zero-shot classification and retrieval, allowing users to locate specific features, such as "residential buildings with solar panels," using flexible natural language queries.
* Internal evaluations show state-of-the-art performance across multiple benchmarks, including image segmentation and post-disaster damage assessment.

### Agentic Workflows and Industry Collaboration

* The Geospatial Reasoning framework utilizes LLMs like Gemini to manage complex datasets and orchestrate "agentic" workflows.
* These workflows are grounded in geospatial data to ensure that the insights generated are both useful and contextually accurate.
* Google is collaborating with inaugural industry partners, including Airbus, Maxar, Planet Labs, and WPP, to test these capabilities in real-world scenarios.

Organizations interested in accelerating their geospatial analysis should consider applying for the trusted tester program to explore how these foundation models can be fine-tuned for specific proprietary data and use cases.
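Zero-shot retrieval with a SigLIP-style model can be sketched with the public checkpoint in `transformers`; Google's remote-sensing variants are not assumed to share this checkpoint or preprocessing, so treat the snippet as illustrative.

```python
# Hedged sketch: scoring image tiles against a natural-language query with SigLIP.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-base-patch16-224"  # public stand-in checkpoint
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

tiles = [Image.open(p) for p in ("tile_001.png", "tile_002.png")]
query = "residential buildings with solar panels"

inputs = processor(text=[query], images=tiles,
                   padding="max_length", return_tensors="pt")
with torch.inference_mode():
    out = model(**inputs)
scores = torch.sigmoid(out.logits_per_image).squeeze(-1)  # SigLIP is sigmoid-based
best = int(scores.argmax())
print(f"best match: tile {best}, score {scores[best]:.3f}")
```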