Google Research / multimodal-ai

13 posts

google

Next generation medical image interpretation with MedGemma 1.5 and medical speech to text with MedASR

Google Research has introduced MedGemma 1.5 4B and MedASR, expanding its suite of open medical AI models to support more complex clinical workflows. These updates significantly enhance the interpretation of high-dimensional imaging and add medical speech-to-text, providing a compute-efficient foundation for healthcare developers to build upon. By keeping these models open-access and available on Hugging Face and Vertex AI, Google aims to accelerate the integration of multimodal AI into real-world medical applications.

### Multimodal Advancements in MedGemma 1.5

The latest update to the MedGemma 4B model focuses on high-dimensional and longitudinal data, moving beyond simple 2D image interpretation.

* **3D Medical Imaging:** The model now supports volumetric representations from CT scans and MRIs, as well as whole-slide histopathology imaging.
* **Longitudinal Review:** New capabilities allow for the review of chest X-ray time series, helping clinicians track disease progression over time.
* **Anatomical Localization:** Developers can use the model to identify and localize specific anatomical features within chest X-rays.
* **Document Understanding:** Enhanced support for extracting structured data from complex medical lab reports and documents.
* **Edge Capability:** The 4B parameter size is specifically designed to be small enough to run offline while remaining accurate enough for core medical reasoning tasks.

### Medical Speech-to-Text with MedASR

MedASR is a specialized automatic speech recognition (ASR) model designed to bridge the gap between clinical dialogue and digital documentation.

* **Clinical Dictation:** The model is specifically fine-tuned for medical terminology and the unique nuances of clinical dictation.
* **Integrated Reasoning:** MedASR is designed to pair seamlessly with MedGemma, allowing transcribed text to be immediately processed for advanced medical reasoning or summarization.
* **Accessibility:** Like other HAI-DEF models, it is free for research and commercial use and hosted on both Hugging Face and Google Cloud’s Vertex AI.

### Performance Benchmarks and Community Impact

Google is incentivizing innovation through improved performance metrics and community-driven challenges.

* **Accuracy Gains:** Internal benchmarks show MedGemma 1.5 improved disease-related CT classification by 3% and MRI classification by 14% compared to the previous version.
* **MedGemma Impact Challenge:** A Kaggle-hosted hackathon with $100,000 in prizes has been launched to encourage developers to find creative applications for these multimodal tools.
* **Model Collection:** The update complements existing tools like the MedSigLIP image encoder and the larger MedGemma 27B model, which remains the preferred choice for complex, text-heavy medical applications.

Developers and researchers are encouraged to utilize MedGemma 1.5 for tasks requiring efficient, offline multimodal processing (a minimal loading sketch follows below), while leveraging MedASR to automate clinical documentation. By participating in the MedGemma Impact Challenge, the community can help define the next generation of AI-assisted medical diagnostics and workflows.
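For orientation, here is a minimal sketch of querying a MedGemma checkpoint through the Hugging Face `transformers` image-text-to-text pipeline. The repo ID, image file, and prompt are placeholders (the exact MedGemma 1.5 identifier is not given here), so treat this as an illustrative pattern rather than official usage.

```python
# Minimal sketch: querying a MedGemma checkpoint with the transformers
# "image-text-to-text" pipeline. The repo ID below is a placeholder -- check the
# HAI-DEF collection on Hugging Face for the actual MedGemma 1.5 model name.
from transformers import pipeline
from PIL import Image

MODEL_ID = "google/medgemma-4b-it"  # placeholder; substitute the MedGemma 1.5 repo ID

pipe = pipeline("image-text-to-text", model=MODEL_ID, device_map="auto")

image = Image.open("chest_xray.png")  # example study; any RGB image works for the API
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe the key findings in this chest X-ray."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])
```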

google

Google Research 2025: Bolder breakthroughs, bigger impact

Google Research in 2025 has shifted toward an accelerated "Magic Cycle" that rapidly translates foundational breakthroughs into real-world applications across science, society, and consumer products. By prioritizing model efficiency, factuality, and agentic capabilities, the organization is moving beyond static text generation toward interactive, multimodal systems that solve complex global challenges. This evolution is underpinned by a commitment to responsible AI development, ensuring that new technologies like quantum computing and generative UI are both safe and culturally inclusive.

## Enhancing Model Efficiency and Factuality

* Google introduced new efficiency-focused techniques like block verification (an evolution of speculative decoding) and the LAVA scheduling algorithm, which optimizes resource allocation in large cloud data centers.
* The Gemini 3 model achieved state-of-the-art results on factuality benchmarks, including SimpleQA Verified and the newly released FACTS benchmark suite, by emphasizing grounded world knowledge.
* Research into Retrieval Augmented Generation (RAG) led to the development of the LLM Re-Ranker in Vertex AI, which helps models determine if they possess sufficient context to provide accurate answers (a toy sketch of this pattern follows below).
* The Gemma open model expanded to support over 140 languages, supported by the TUNA taxonomy and the Amplify initiative to improve socio-cultural intelligence and data representation.

## Interactive Experiences through Generative UI

* A novel implementation of generative UI allows Gemini 3 to dynamically create visual interfaces, web pages, and tools in response to user prompts rather than providing static text.
* This technology is powered by specialized models like "Gemini 3-interactive," which are trained to output structured code and design elements.
* These capabilities have been integrated into AI Mode within Google Search, allowing for more immersive and customizable user journeys.

## Advanced Architectures and Agentic AI

* Google is exploring hybrid model architectures, such as Jamba-style models that combine State Space Models (SSMs) with traditional attention mechanisms to handle long contexts more efficiently.
* The development of agentic AI focuses on models that can reason, plan, and use tools, exemplified by Project Astra, a prototype for a universal AI agent.
* Specialized models like Gemini 3-code have been optimized to act as autonomous collaborators for software developers, assisting in complex coding tasks and system design.

## AI for Science and Planetary Health

* In biology, research teams utilized AI to map human heart and brain structures and employed RoseTTAFold-Diffusion to design new proteins for therapeutic use.
* The NeuralGCM model has revolutionized Earth sciences by combining traditional physics with machine learning for faster, more accurate weather and climate forecasting.
* Environmental initiatives include the FireSat satellite constellation for global wildfire detection and the expansion of AI-driven flood forecasting and contrail mitigation.

## Quantum Computing and Responsible AI

* Google achieved significant milestones in quantum error correction, developing low-overhead codes that bring the industry closer to a reliable, large-scale quantum computer.
* Security and safety remain central, with the expansion of SynthID—a watermarking tool for AI-generated text, audio, and video—to help users identify synthetic content.
* The team continues to refine the Secure AI Framework (SAIF) to defend against emerging threats while promoting the safe deployment of generative media models like Veo and Imagen.

To maximize the impact of these advancements, organizations should focus on integrating agentic workflows and RAG-based architectures to ensure their AI implementations are both factual and capable of performing multi-step tasks. Developers can leverage the Gemma open models to build culturally aware applications that scale across diverse global markets.
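As a toy illustration of the retrieve-rerank-check pattern recommended above, the sketch below scores passages by simple token overlap and abstains when the context looks insufficient; the scorer, threshold, and corpus are stand-ins, not the Vertex AI LLM Re-Ranker.

```python
# Toy RAG pattern: retrieve candidate passages, re-rank them, and check whether the
# retrieved context looks sufficient before answering. Scoring here is token overlap;
# in the post this role is played by an LLM re-ranker, which is not shown here.
from typing import List, Tuple

def overlap_score(query: str, passage: str) -> float:
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def retrieve_and_rerank(query: str, corpus: List[str], k: int = 3) -> List[Tuple[float, str]]:
    scored = sorted(((overlap_score(query, p), p) for p in corpus), reverse=True)
    return scored[:k]

def has_sufficient_context(ranked: List[Tuple[float, str]], threshold: float = 0.3) -> bool:
    # Stand-in for the "does the model have enough context?" judgment.
    return bool(ranked) and ranked[0][0] >= threshold

corpus = [
    "LAVA is a scheduling algorithm for allocating resources in cloud data centers.",
    "Block verification extends speculative decoding to verify draft tokens in blocks.",
    "SynthID watermarks AI-generated text, audio, and video.",
]
query = "How does block verification relate to speculative decoding?"
ranked = retrieve_and_rerank(query, corpus)
if has_sufficient_context(ranked):
    print("Answer grounded in:", ranked[0][1])
else:
    print("Context insufficient; abstain or retrieve more.")
```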

google

From Waveforms to Wisdom: The New Benchmark for Auditory Intelligence

Google Research has introduced the Massive Sound Embedding Benchmark (MSEB) to unify the fragmented landscape of machine sound intelligence. By standardizing the evaluation of eight core auditory capabilities across diverse datasets, the framework reveals that current sound representations are far from universal and have significant performance "headroom" for improvement. Ultimately, MSEB provides an open-source platform to drive the development of general-purpose sound embeddings for next-generation multimodal AI.

### Diverse Datasets for Real-World Scenarios

The benchmark utilizes a curated collection of high-quality, accessible datasets designed to reflect global diversity and complex acoustic environments.

* **Simple Voice Questions (SVQ):** A foundational dataset featuring 177,352 short spoken queries across 17 languages and 26 locales, recorded in varying conditions like traffic and media noise.
* **Speech-MASSIVE:** Used for multilingual spoken language understanding and intent classification.
* **FSD50K:** A large-scale dataset for environmental sound event recognition containing 200 classes based on the AudioSet Ontology.
* **BirdSet:** A massive-scale benchmark specifically for avian bioacoustics and complex soundscape recordings.

### Eight Core Auditory Capabilities

MSEB is structured around "super-tasks" that represent the essential functions an intelligent auditory system must perform within a multimodal context.

* **Retrieval and Reasoning:** These tasks simulate voice search and the ability of an assistant to find precise answers within documents based on spoken questions (a toy retrieval evaluation is sketched below).
* **Classification and Transcription:** Standard perception tasks that categorize sounds by environment or intent and convert audio signals into verbatim text.
* **Segmentation and Clustering:** These involve identifying and localizing salient terms with precise timestamps and grouping sound samples by shared attributes without predefined labels.
* **Reranking and Reconstruction:** Advanced tasks that reorder ambiguous text hypotheses to match spoken queries and test embedding quality by regenerating original audio waveforms.

### Unified Evaluation and Performance Goals

The framework is designed to move beyond fragmented research by providing a consistent structure for evaluating different model architectures.

* **Model Agnostic:** The open framework allows for the evaluation of uni-modal, cascade, and end-to-end multimodal embedding models.
* **Objective Baselines:** By establishing clear performance goals, the benchmark highlights specific research opportunities where current state-of-the-art models fall short of their potential.
* **Multimodal Integration:** Every task assumes sound is the critical input but incorporates other modalities, such as text context, to better simulate real-world AI interactions.

By providing a comprehensive roadmap for auditory intelligence, MSEB encourages the community to move toward universal sound embeddings. Researchers can contribute to this evolving standard by accessing the open-source GitHub repository and utilizing the newly released datasets on Hugging Face to benchmark their own models.
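The sketch below shows the flavor of a retrieval super-task evaluation: rank documents by cosine similarity between embeddings and report recall@1. The random embeddings and the `recall_at_1` helper are illustrative stand-ins, not part of the MSEB API.

```python
# Minimal sketch of the kind of retrieval evaluation a benchmark like MSEB standardizes:
# embed spoken queries and candidate documents, rank by cosine similarity, and report
# recall@1. The embeddings here are random placeholders for a model under test.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def recall_at_1(query_emb: np.ndarray, doc_emb: np.ndarray, gold: np.ndarray) -> float:
    top1 = cosine_sim(query_emb, doc_emb).argmax(axis=1)
    return float((top1 == gold).mean())

# Random stand-ins for embeddings produced by the model being evaluated.
rng = np.random.default_rng(0)
queries, docs = rng.normal(size=(8, 128)), rng.normal(size=(20, 128))
gold = rng.integers(0, 20, size=8)   # index of the correct document per query
print(f"recall@1 = {recall_at_1(queries, docs, gold):.2f}")
```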

google

Generative UI: A rich, custom, visual interactive user experience for any prompt

Google Research has introduced a novel Generative UI framework that enables AI models to dynamically construct bespoke, interactive user experiences—including web pages, games, and functional tools—in response to any natural language prompt. This shift from static, predefined interfaces to AI-generated environments allows for highly customized digital spaces that adapt to a user's specific intent and context. Evaluated through human testing, these custom-generated interfaces are strongly preferred over traditional, text-heavy LLM outputs, signaling a fundamental evolution in human-computer interaction.

### Product Integration in Gemini and Google Search

The technology is currently being deployed as an experimental feature across Google’s main AI consumer platforms to enhance how users visualize and interact with data.

* **Dynamic View and Visual Layout:** These experiments in the Gemini app use agentic coding capabilities to design and code a complete interactive response for every prompt.
* **AI Mode in Google Search:** Available for Google AI Pro and Ultra subscribers, this feature uses Gemini 3’s multimodal understanding to build instant, bespoke interfaces for complex queries.
* **Contextual Customization:** The system differentiates between user needs, such as providing a simplified interface for a child learning about the microbiome versus a data-rich layout for an adult.
* **Task-Specific Tools:** Beyond text, the system generates functional applications like fashion advisors, event planners, and science simulations for topics like RNA transcription.

### Technical Architecture and Implementation

The Generative UI implementation relies on a multi-layered approach centered around the Gemini 3 Pro model to ensure the generated code is both functional and accurate.

* **Tool Access:** The model is connected to server-side tools, including image generation and real-time web search, to enrich the UI with external data.
* **System Instructions:** Detailed guidance provides the model with specific goals, formatting requirements, and technical specifications to avoid common coding errors.
* **Agentic Coding:** The model acts as both a designer and a developer, writing the necessary code to render the UI on the fly based on its interpretation of the user’s prompt.
* **Post-Processing:** Outputs undergo a series of automated checks to address common issues and refine the final visual experience before it reaches the browser (a simplified pipeline sketch follows below).

### The Shift from Static to Generative Interfaces

This research represents a move away from the traditional software paradigm where users must navigate a fixed catalog of applications to find the tool they need.

* **Prompt-Driven UX:** Interfaces are generated from prompts as simple as a single word or as complex as multi-paragraph instructions.
* **Interactive Comprehension:** By building simulations on the fly, the system creates a dynamic environment optimized for deep learning and task completion.
* **Preference Benchmarking:** Research indicates that when generation speed is excluded as a factor, users significantly prefer these custom-built visual tools over standard, static AI responses.

To experience this new paradigm, users can select the "Thinking" option from the model menu in Google Search’s AI Mode or engage with the Dynamic View experiment in the Gemini app to generate tailored tools for specific learning or productivity tasks.
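A simplified sketch of that layered flow appears below: system instructions, an agentic model call that returns interface code, and a post-processing pass. The `call_model` stub and the specific checks are assumptions for illustration, not the production implementation.

```python
# Illustrative sketch of the layered Generative UI flow: detailed system instructions,
# a model call that returns code for a bespoke interface, and a post-processing pass
# that flags common issues. `call_model` is a placeholder for a Gemini 3 Pro request.
from dataclasses import dataclass
from typing import List

SYSTEM_INSTRUCTIONS = (
    "You are both designer and developer. Return a single self-contained HTML "
    "document with inline CSS/JS. Avoid external dependencies and broken links."
)

@dataclass
class GeneratedUI:
    html: str
    warnings: List[str]

def call_model(system: str, prompt: str) -> str:
    # Placeholder: in a real system this would be a model request with tool access
    # (image generation, web search). Here we return a stub page.
    return f"<!DOCTYPE html><html><body><h1>{prompt}</h1></body></html>"

def post_process(html: str) -> GeneratedUI:
    warnings = []
    if "<script src=" in html:
        warnings.append("external script reference found")
    if not html.lstrip().lower().startswith("<!doctype html"):
        warnings.append("missing doctype")
    return GeneratedUI(html=html, warnings=warnings)

ui = post_process(call_model(SYSTEM_INSTRUCTIONS, "Explain the microbiome for a 9-year-old"))
print(ui.warnings or "passed basic checks")
print(ui.html[:60], "...")
```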

google

StreetReaderAI: Towards making street view accessible via context-aware multimodal AI

StreetReaderAI is a research prototype designed to make immersive street-level imagery accessible to the blind and low-vision community through multimodal AI. By integrating real-time scene analysis with context-aware geographic data, the system transforms visual mapping data into an interactive, audio-first experience. This framework allows users to virtually explore environments and plan routes with a level of detail and independence previously unavailable through traditional screen readers.

### Navigation and Spatial Awareness

The system offers an immersive, first-person exploration interface that mimics the mechanics of accessible gaming.

* Users navigate using keyboard shortcuts or voice commands, taking "virtual steps" forward or backward and panning their view in 360 degrees.
* Real-time audio feedback provides cardinal and intercardinal directions, such as "Now facing North," to maintain spatial orientation.
* Distance tracking informs the user how far they have traveled between panoramic images, while "teleport" features allow for quick jumps to specific addresses or landmarks.

### Context-Aware AI Describer

At the core of the tool is a subsystem backed by Gemini that synthesizes visual and geographic data to generate descriptions.

* The AI Describer combines the current field-of-view image with dynamic metadata about nearby roads, intersections, and points of interest (a prompt-assembly sketch follows below).
* Two distinct modes cater to different user needs: a "Default" mode focusing on pedestrian safety and navigation, and a "Tour Guide" mode that provides historical and architectural details.
* The system utilizes Gemini to proactively predict and suggest follow-up questions relevant to the specific scene, such as details about crosswalks or building entrances.

### Interactive Dialogue and Session Memory

StreetReaderAI utilizes the Multimodal Live API to facilitate real-time, natural language conversations about the environment.

* The AI Chat agent maintains a context window of 1,048,576 tokens (roughly one million), allowing it to retain a "memory" of up to 4,000 previous images and interactions.
* This memory allows users to ask retrospective spatial questions, such as "Where was that bus stop I just passed?", with the agent providing relative directions based on the user's current location.
* By tracking every pan and movement, the agent can provide specific details about the environment that were captured in previous steps of the virtual walk.

### User Evaluation and Practical Application

Testing with blind screen reader users confirmed the system's utility in practical, real-world scenarios.

* Participants successfully used the prototype to evaluate potential walking routes, identifying critical environmental features like the presence of benches or shelters at bus stops.
* The study highlighted the importance of multimodal inputs—combining image recognition with structured map data—to provide a more accurate and reliable description than image analysis alone could offer.

While StreetReaderAI remains a proof-of-concept, it demonstrates that the integration of multimodal LLMs and spatial data can bridge significant accessibility gaps in digital mapping. Future implementation of these technologies could transform how visually impaired individuals interact with the world, turning static street imagery into a functional tool for independent mobility and exploration.
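The sketch below illustrates how a context-aware describer prompt might be assembled from the current view and nearby geographic metadata; the `GeoContext` fields and prompt wording are illustrative assumptions, not the prototype's actual interface.

```python
# Sketch of assembling a context-aware describer prompt: the current field-of-view
# plus structured metadata about nearby roads and places. Field names and modes are
# illustrative placeholders, not StreetReaderAI's implementation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GeoContext:
    heading: str                                   # e.g., "North"
    nearby_roads: List[str] = field(default_factory=list)
    nearby_places: List[str] = field(default_factory=list)

def build_describer_prompt(ctx: GeoContext, mode: str = "default") -> str:
    style = (
        "Focus on pedestrian safety: sidewalks, crossings, obstacles."
        if mode == "default"
        else "Act as a tour guide: history and architecture of what is visible."
    )
    return (
        "You are describing a street-level panorama for a blind user.\n"
        f"Facing: {ctx.heading}. Roads nearby: {', '.join(ctx.nearby_roads)}. "
        f"Points of interest: {', '.join(ctx.nearby_places)}.\n{style}"
    )

ctx = GeoContext("North", ["Main St", "3rd Ave"], ["bus stop", "pharmacy"])
print(build_describer_prompt(ctx, mode="tour_guide"))
```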

google

Teaching Gemini to spot exploding stars with just a few examples

Researchers have demonstrated that Google’s Gemini model can classify cosmic events with 93% accuracy, rivaling specialized machine learning models while providing human-readable explanations. By utilizing few-shot learning with only 15 examples per survey, the model addresses the "black box" limitation of traditional convolutional neural networks used in astronomy. This approach enables scientists to efficiently process the millions of alerts generated by modern telescopes while maintaining a transparent and interactive reasoning process.

## Bottlenecks in Modern Transient Astronomy

* Telescopes like the Vera C. Rubin Observatory are expected to generate up to 10 million alerts per night, making manual verification impossible.
* The vast majority of these alerts are "bogus" signals caused by satellite trails, cosmic rays, or instrumental artifacts rather than real supernovae.
* Existing specialized models often provide binary "real" or "bogus" labels without context, forcing astronomers to either blindly trust the output or spend hours on manual verification.

## Multimodal Few-Shot Learning for Classification

* The research utilized few-shot learning, providing Gemini with only 15 annotated examples for three major surveys: Pan-STARRS, MeerLICHT, and ATLAS (a prompt-construction sketch follows below).
* Input data consisted of image triplets—a "new" alert image, a "reference" image of the same sky patch, and a "difference" image—each 100x100 pixels in size.
* The model successfully generalized across different telescopes with varying pixel scales, ranging from 0.25" per pixel for Pan-STARRS to 1.8" per pixel for ATLAS.
* Beyond simple labels, Gemini generates a textual description of observed features and an interest score to help astronomers prioritize follow-up observations.

## Expert Validation and Self-Assessment

* A panel of 12 professional astronomers evaluated the model using a 0–5 coherence rubric, confirming that Gemini’s logic aligned with expert reasoning.
* The study found that Gemini can effectively assess its own uncertainty; low self-assigned "coherence scores" were strong indicators of likely classification errors.
* This ability to flag its own potential mistakes allows the model to act as a reliable partner, alerting scientists when a specific case requires human intervention.

The transition from "black box" classifiers to interpretable AI assistants allows the astronomical community to scale with the data flood of next-generation telescopes. By combining high-accuracy classification with transparent reasoning, researchers can maintain scientific rigor while processing millions of cosmic events in real time.
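The sketch below shows one way such a few-shot prompt could be assembled from labeled image triplets; the message structure and `Triplet` type are illustrative assumptions rather than the study's actual prompting code.

```python
# Sketch of a few-shot prompt built from image triplets (new / reference / difference).
# The message schema is a generic placeholder, not the exact prompt or API used in the study.
from dataclasses import dataclass
from typing import List

@dataclass
class Triplet:
    new_img: str      # paths to 100x100-pixel cutouts
    ref_img: str
    diff_img: str
    label: str = ""   # "real" or "bogus" for the few-shot examples

def build_fewshot_messages(examples: List[Triplet], candidate: Triplet) -> list:
    messages = [{"role": "system",
                 "text": "Classify each alert as real or bogus, explain the features "
                         "you used, and give an interest score from 0 to 1."}]
    for ex in examples:
        messages.append({"role": "user",
                         "images": [ex.new_img, ex.ref_img, ex.diff_img],
                         "text": f"Example labelled {ex.label}."})
    messages.append({"role": "user",
                     "images": [candidate.new_img, candidate.ref_img, candidate.diff_img],
                     "text": "Classify this alert."})
    return messages

examples = [Triplet("n0.png", "r0.png", "d0.png", "real"),
            Triplet("n1.png", "r1.png", "d1.png", "bogus")]
print(len(build_fewshot_messages(examples, Triplet("n2.png", "r2.png", "d2.png"))))
```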

google

The anatomy of a personal health agent

Google researchers have developed the Personal Health Agent (PHA), an LLM-powered prototype designed to provide evidence-based, personalized health insights by analyzing multimodal data from wearables and blood biomarkers. By utilizing a specialized multi-agent architecture, the system deconstructs complex health queries into specific tasks to ensure statistical accuracy and clinical grounding. The study demonstrates that this modular approach significantly outperforms standard large language models in providing reliable, data-driven wellness support.

## Multi-Agent System Architecture

* The PHA framework adopts a "team-based" approach, utilizing three specialist sub-agents: a Data Science agent, a Domain Expert agent, and a Health Coach (a toy routing sketch follows below).
* The system was validated using a real-world dataset from 1,200 participants, featuring longitudinal Fitbit data, health questionnaires, and clinical blood test results.
* This architecture was designed after a user-centered study of 1,300 health queries, identifying four key needs: general knowledge, data interpretation, wellness advice, and symptom assessment.
* Evaluation involved over 1,100 hours of human expert effort across 10 benchmark tasks to ensure the system outperformed base models like Gemini.

## The Data Science Agent

* This agent specializes in "contextualized numerical insights," transforming ambiguous queries (e.g., "How is my fitness trending?") into formal statistical analysis plans.
* It operates through a two-stage process: first interpreting the user's intent and data sufficiency, then generating executable code to analyze time-series data.
* In benchmark testing, the agent achieved a 75.6% score in analysis planning, significantly higher than the 53.7% score achieved by the base model.
* The agent's code generation was validated against 173 rigorous unit tests written by human data scientists to ensure accuracy in handling wearable sensor data.

## The Domain Expert Agent

* Designed for high-stakes medical accuracy, this agent functions as a grounded source of health knowledge using a multi-step reasoning framework.
* It utilizes a "toolbox" approach, granting the LLM access to authoritative external databases such as the National Center for Biotechnology Information (NCBI) to provide verifiable facts.
* The agent is specifically tuned to tailor information to the user’s unique profile, including specific biomarkers and pre-existing medical conditions.
* Performance was measured through board certification and coaching exam questions, as well as its ability to provide accurate differential diagnoses compared to human clinicians.

While currently a research framework rather than a public product, the PHA demonstrates that a modular, specialist-driven AI architecture is essential for safe and effective personal health management. Developers of future health-tech tools should prioritize grounding LLMs in external clinical databases and implementing rigorous statistical validation stages to move beyond the limitations of general-purpose chatbots.
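A toy routing sketch for this team-of-specialists pattern is shown below; the keyword router and the three stub agents are illustrative assumptions, not the PHA implementation.

```python
# Minimal routing sketch for a team-of-specialists health agent: classify the query
# intent, then hand it to a data-science, domain-expert, or health-coach sub-agent.
# The keyword rules and stub agents are placeholders for illustration only.
def data_science_agent(query: str) -> str:
    return f"[data science] plan a statistical analysis for: {query}"

def domain_expert_agent(query: str) -> str:
    return f"[domain expert] answer with grounded clinical knowledge: {query}"

def health_coach_agent(query: str) -> str:
    return f"[health coach] propose a personalized wellness plan for: {query}"

def route(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("trend", "average", "my data", "resting heart rate")):
        return data_science_agent(query)
    if any(w in q for w in ("symptom", "diagnos", "biomarker", "cholesterol")):
        return domain_expert_agent(query)
    return health_coach_agent(query)

print(route("How is my resting heart rate trending this month?"))
print(route("What could elevated LDL cholesterol mean for me?"))
print(route("Help me build a better sleep routine."))
```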

google

Sensible Agent: A framework for unobtrusive interaction with proactive AR agents

Sensible Agent is a research prototype designed to move AR agents beyond explicit voice commands toward proactive, context-aware assistance. By leveraging real-time multimodal sensing of a user's environment and physical state, the framework ensures digital help is delivered unobtrusively through the most appropriate interaction modalities. This approach fundamentally reshapes human-computer interaction by anticipating user needs while minimizing cognitive and social disruption.

## Contextual Understanding via Multimodal Parsing

The framework begins by analyzing the user's immediate surroundings to establish a baseline for assistance.

* A Vision-Language Model (VLM) processes egocentric camera feeds from the AR headset to identify high-level activities and locations.
* YAMNet, a pre-trained audio event classifier, monitors environmental noise levels to determine if audio feedback is appropriate.
* The system synthesizes these inputs into a parsed context that accounts for situational impairments, such as when a user’s hands are occupied.

## Reasoning with Proactive Query Generation

Once the context is established, the system determines the specific type of assistance required through a sophisticated reasoning process.

* The framework uses chain-of-thought (CoT) reasoning to decompose complex problems into intermediate logical steps.
* Few-shot learning, guided by examples from data collection studies, helps the model decide between actions like providing translations or displaying a grocery list.
* The generator outputs a structured suggestion that includes the specific action, the query format (e.g., binary choice or icons), and the presentation modality (visual, audio, or both), as sketched below.

## Dynamic Modality and Interaction Management

The final stage of the framework manages how the agent communicates with the user and how the user can respond without breaking their current flow.

* The prototype, built on Android XR and WebXR, utilizes a UI Manager to render visual panels or generate text-to-speech (TTS) prompts based on the agent's decision.
* An Input Modality Manager activates the most discreet response methods available, such as head gestures (nods), hand gestures (thumbs up), or gaze tracking.
* This adaptive selection ensures that if a user is in a noisy room or a social setting, the agent can switch from verbal interaction to subtle visual cues and gesture-based confirmations.

By prioritizing social awareness and context-sensitivity, Sensible Agent provides a blueprint for AR systems that feel like helpful companions rather than intrusive tools. Implementing such frameworks is essential for making proactive digital assistants practical and acceptable for long-term, everyday use in public and private spaces.
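The sketch below shows what the structured suggestion and the adaptive modality choice might look like in code; the `Suggestion` fields and selection rules are illustrative assumptions, not the Sensible Agent implementation.

```python
# Sketch of a structured suggestion (action + query format + presentation modality)
# and of adaptive modality selection. Field names and rules are illustrative only.
from dataclasses import dataclass

@dataclass
class Suggestion:
    action: str         # e.g., "show_grocery_list", "offer_translation"
    query_format: str   # "binary_choice" | "icons" | "multi_choice"
    modality: str       # "visual" | "audio" | "visual+audio"
    input_method: str   # "head_nod" | "thumbs_up" | "gaze" | "voice"

def choose_suggestion(noisy: bool, socially_sensitive: bool, hands_busy: bool) -> Suggestion:
    if noisy or socially_sensitive:
        # Fall back to discreet visual cues with gesture or gaze confirmation.
        return Suggestion("offer_help", "binary_choice", "visual",
                          "gaze" if hands_busy else "head_nod")
    return Suggestion("offer_help", "binary_choice", "visual+audio", "voice")

print(choose_suggestion(noisy=True, socially_sensitive=False, hands_busy=True))
print(choose_suggestion(noisy=False, socially_sensitive=False, hands_busy=False))
```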

google

SensorLM: Learning the language of wearable sensors

SensorLM is a new family of foundation models designed to bridge the gap between high-dimensional wearable sensor data and natural language descriptions. By training on a massive dataset of nearly 60 million hours of de-identified health data, the models learn to interpret complex physiological signals to provide meaningful context for human activities. This research demonstrates that integrating multimodal sensor signals with language models enables sophisticated health insights, such as zero-shot activity recognition and automated health captioning, that significantly outperform general-purpose large language models.

## Dataset Scale and Automated Annotation

* The models were pre-trained on an unprecedented 59.7 million hours of multimodal sensor data collected from over 103,000 individuals across 127 countries.
* To overcome the high cost of manual annotation, researchers developed a hierarchical pipeline that automatically generates text descriptions by calculating statistics and identifying trends within the raw sensor streams.
* Data was sourced from Fitbit and Pixel Watch devices, representing nearly 2.5 million person-days of activity and health information.

## Hybrid Training Architecture

* SensorLM unifies two primary multimodal strategies: contrastive learning and generative pre-training.
* Through contrastive learning, the model learns to discriminate between different states—such as a "light swim" versus a "strength workout"—by matching sensor segments to corresponding text descriptions (a toy contrastive loss is sketched below).
* The generative component allows the model to "speak" for the sensors, producing nuanced, context-aware natural language captions directly from high-dimensional biometric signals.

## Activity Recognition and Cross-Modal Capabilities

* The model demonstrates state-of-the-art performance in zero-shot human activity recognition, accurately classifying 20 different activities without any specific fine-tuning.
* Its few-shot learning capabilities allow the model to adapt to new tasks or individual user patterns with only a handful of examples.
* SensorLM facilitates cross-modal retrieval, enabling users or experts to find specific sensor patterns using natural language queries or to generate descriptions based on specific sensor inputs.

## Generative Health Captioning

* Beyond simple classification, the model can generate hierarchical captions that describe the statistical, structural, and semantic dimensions of a user’s data.
* Experimental results using metrics like BERTScore show that SensorLM produces captions that are more factually correct and coherent than those created by powerful non-specialist LLMs.
* This capability allows for the translation of abstract data points, such as heart rate variability or step counts, into readable summaries that explain the "why" behind physiological changes.

By providing a framework where wearable data can be understood through the lens of human language, SensorLM paves the way for more intuitive and personalized health monitoring. This technology holds the potential to transform raw biometric streams into actionable insights, helping users better understand the relationship between their activities and their overall physical well-being.
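As a toy version of the contrastive objective described above, the sketch below computes a CLIP-style InfoNCE loss between (random) sensor-segment and text embeddings; the specifics are assumptions for illustration, not SensorLM's actual training code.

```python
# Toy contrastive objective: align each sensor-segment embedding with its paired text
# description and push apart mismatched pairs (CLIP-style InfoNCE). Embeddings here
# are random placeholders; real SensorLM training details differ.
import numpy as np

def info_nce(sensor_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    s = sensor_emb / np.linalg.norm(sensor_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature                   # (N, N) similarity matrix
    targets = np.arange(len(s))                      # matching pairs sit on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[targets, targets].mean())

rng = np.random.default_rng(0)
sensor = rng.normal(size=(16, 256))                  # e.g., encoded heart-rate/accelerometer windows
text = sensor + 0.1 * rng.normal(size=(16, 256))     # pretend-paired caption embeddings
print(f"contrastive loss: {info_nce(sensor, text):.3f}")
```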

google

Unlocking rich genetic insights through multimodal AI with M-REGLE

Google Research has introduced M-REGLE, a multimodal AI framework designed to analyze diverse health data streams simultaneously to uncover the genetic underpinnings of complex diseases. By jointly modeling complementary signals—such as electrocardiograms (ECG) and photoplethysmograms (PPG)—the method captures shared biological information and reduces noise more effectively than unimodal approaches. This integrated analysis significantly enhances the discovery of genetic associations and improves the prediction of cardiovascular conditions like atrial fibrillation.

## Technical Architecture and Workflow

M-REGLE utilizes a multi-step process to transform raw physiological waveforms into actionable genetic insights:

* **Multimodal Integration:** Instead of processing data types in isolation, the model combines multiple inputs, such as the 12 leads of an ECG or paired ECG and PPG data, to capture overlapping signals.
* **Latent Representation Learning:** The system employs a convolutional variational autoencoder (CVAE) to compress these high-dimensional waveforms into a low-dimensional "signature" or latent factors.
* **Statistical Refinement:** Principal component analysis (PCA) is applied to the CVAE-generated signatures to ensure the learned factors are independent and uncorrelated (a simplified workflow sketch follows below).
* **Genetic Mapping:** These independent factors are analyzed via genome-wide association studies (GWAS) to identify significant correlations between physiological signatures and specific genetic variations.

## Improved Data Reconstruction and Genetic Sensitivity

The transition from unimodal (U-REGLE) to multimodal modeling has led to substantial gains in both data accuracy and biological discovery:

* **Error Reduction:** M-REGLE achieved a 72.5% reduction in reconstruction error for 12-lead ECGs compared to analyzing each lead separately, indicating a much higher fidelity in capturing essential waveform characteristics.
* **Increased Discovery Power:** In a study involving over 40,000 participants from the UK Biobank, the multimodal approach identified 3,251 significant genetic loci associated with 12-lead ECGs, a notable increase over the 2,215 loci found by unimodal methods.
* **Novel Findings:** The model identified specific genetic links, such as the *RBM20* locus, which were previously missed by standard clinical measurements but are known to be critical for heart muscle function.

## Interpretability and Disease Prediction

Beyond identifying associations, M-REGLE offers generative capabilities that help clinicians understand the relationship between latent data and physical health:

* **Waveform Synthesis:** By altering specific coordinates within the learned embeddings, researchers can observe how individual latent factors correspond to physical changes in a patient's ECG T-wave or PPG peaks.
* **Clinical Utility:** The model identified specific embeddings (positions 4, 6, and 10) that distinguish patients with atrial fibrillation (AFib) from those without.
* **Predictive Performance:** M-REGLE’s embeddings outperformed traditional clinical polygenic risk scores (PRS) in predicting AFib, demonstrating the value of incorporating raw waveform data into risk assessments.

## Practical Applications

Researchers and clinicians can leverage M-REGLE to extract richer insights from existing biobank data and wearable device outputs. By integrating multiple modalities into a single analytical pipeline, the framework provides a more comprehensive view of organ system health, facilitating the identification of therapeutic targets and more accurate disease screening protocols.
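The sketch below mirrors the workflow at a toy scale: a placeholder encoder stands in for the convolutional VAE, and scikit-learn's PCA decorrelates the latent factors before association testing; none of it is the M-REGLE code itself.

```python
# Toy M-REGLE-style workflow: compress paired waveforms into a low-dimensional
# signature, then decorrelate the latent factors with PCA so each can be tested
# independently in a GWAS. The "encoder" here is a random projection standing in
# for the convolutional variational autoencoder, purely for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_people, ecg_len, ppg_len, latent_dim = 500, 600, 300, 12

# Placeholder "multimodal encoder": concatenate ECG + PPG and project to a latent space.
waveforms = np.hstack([rng.normal(size=(n_people, ecg_len)),
                       rng.normal(size=(n_people, ppg_len))])
projection = rng.normal(size=(waveforms.shape[1], latent_dim))
latents = waveforms @ projection

# Decorrelate latent factors before genome-wide association testing.
factors = PCA(n_components=latent_dim, whiten=True).fit_transform(latents)
print(factors.shape)                                            # (500, 12): one signature per participant
print(np.round(np.corrcoef(factors, rowvar=False)[:3, :3], 3))  # near-identity correlations
```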

google

Google Research at Google I/O 2025

Google Research at I/O 2025 showcases the "research to reality" transition, highlighting how years of foundational breakthroughs are now being integrated into Gemini models and specialized products. By focusing on multimodal capabilities, pedagogy, and extreme model efficiency, Google aims to democratize access to advanced AI while ensuring it remains grounded and useful across global contexts.

## Specialized Healthcare Models: MedGemma and AMIE

* **MedGemma:** This new open model, based on Gemma 3, is optimized for multimodal medical tasks such as radiology image analysis and clinical data summarization. It is available in 4B and 27B sizes, performing similarly to much larger models on the MedQA benchmark while remaining small enough for efficient local fine-tuning.
* **AMIE (Articulate Medical Intelligence Explorer):** A research AI agent designed for diagnostic medical reasoning. Its latest multimodal version can now interpret and reason about visual medical information, such as skin lesions or medical imaging, to assist clinicians in diagnostic accuracy.

## Educational Optimization through LearnLM

* **Gemini 2.5 Pro Integration:** The LearnLM family of models, developed with educational experts, is now integrated into Gemini 2.5 Pro. This fine-tuning enhances STEM reasoning, multimodal understanding, and pedagogical feedback.
* **Interactive Learning Tools:** A new research-optimized quiz experience allows students to generate custom assessments from their own notes, providing specific feedback on right and wrong answers rather than just providing solutions.
* **Global Assessment Pilots:** Through partnerships like the one with Kayma, Google is testing the automatic assessment of short and long-form content in regions like Ghana to scale quality educational tools.

## Multilingual Expansion and On-Device Gemma Models

* **Gemma 3 and 3n:** Research breakthroughs have expanded Gemma 3’s support to over 140 languages. The introduction of **Gemma 3n** targets extreme efficiency, capable of running on devices with as little as 2GB of RAM while maintaining low latency and low energy consumption.
* **ECLeKTic Benchmark:** To assist the developer community, Google introduced this novel benchmark specifically for evaluating how well large language models transfer knowledge across different languages.

## Model Efficiency and Factuality in Search

* **Inference Techniques:** Google Research continues to set industry standards for model speed and accessibility through technical innovations like **speculative decoding** and **cascades**, which reduce the computational cost of generating high-quality responses (a toy cascade is sketched below).
* **Grounded Outputs:** Significant focus remains on factual consistency, ensuring that the AI models powering features like AI Overviews in Search provide reliable and grounded information to users.

As Google continues to shrink the gap between laboratory breakthroughs and consumer products, the emphasis remains on making high-performance AI accessible on low-cost hardware and across diverse linguistic landscapes. Developers and researchers can now leverage these specialized tools via platforms like Hugging Face and Vertex AI to build more targeted, efficient applications.
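As a toy illustration of the cascade idea, the sketch below answers with a small stub model and escalates to a larger stub only when confidence is low; real cascades use learned deferral rules rather than this hand-written threshold.

```python
# Toy "cascade": try a small, cheap model first and only escalate to a larger model
# when the small model's confidence is low. Both model functions are placeholders.
from typing import Callable, Tuple

def small_model(prompt: str) -> Tuple[str, float]:
    # Returns (answer, confidence). Stub rule: confident only on short prompts.
    confidence = 0.9 if len(prompt.split()) < 8 else 0.4
    return f"small-model answer to: {prompt}", confidence

def large_model(prompt: str) -> str:
    return f"large-model answer to: {prompt}"

def cascade(prompt: str, threshold: float = 0.7,
            small: Callable = small_model, large: Callable = large_model) -> str:
    answer, confidence = small(prompt)
    return answer if confidence >= threshold else large(prompt)

print(cascade("What is Gemma 3n?"))
print(cascade("Summarize the differences between MedGemma 4B and 27B for clinical tasks."))
```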

google

AMIE gains vision: A research AI agent for multimodal diagnostic dialogue

Google Research and DeepMind have introduced multimodal AMIE, an advanced research AI agent designed to conduct diagnostic medical dialogues that integrate text, images, and clinical documents. By building on Gemini 2.0 Flash and a novel state-aware reasoning framework, the system can intelligently request and interpret visual data such as skin photos or ECGs to refine its diagnostic hypotheses. This evolution moves AI diagnostic tools closer to real-world clinical practice, where visual evidence is often essential for accurate patient assessment and management.

### Enhancing AMIE with Multimodal Perception

To move beyond text-only limitations, researchers integrated vision capabilities that allow the agent to process complex medical information during a conversation.

* The system uses Gemini 2.0 Flash as its core component to interpret diverse data types, including dermatology images and laboratory reports.
* By incorporating multimodal perception, the agent can resolve diagnostic ambiguities that cannot be addressed through verbal descriptions alone.
* Preliminary testing with Gemini 2.5 Flash suggests that further scaling the underlying model continues to improve the agent's reasoning and diagnostic accuracy.

### Emulating Clinical Workflows via State-Aware Reasoning

A key technical contribution is the state-aware phase transition framework, which helps the AI mimic the structured yet flexible approach used by experienced clinicians.

* The framework orchestrates the conversation through three distinct phases: History Taking, Diagnosis & Management, and Follow-up.
* The agent maintains a dynamic internal state that tracks known information about the patient and identifies specific "knowledge gaps."
* When the system detects uncertainty, it strategically requests multimodal artifacts—such as a photo of a rash or an image of a lab result—to update its differential diagnosis (a simplified state-machine sketch follows below).
* Transitions between conversation phases are only triggered once the system assesses that the objectives of the current phase have been sufficiently met.

### Evaluation through Simulated OSCEs

To validate the agent’s performance, the researchers developed a robust simulation environment to facilitate rapid iteration and standardized testing.

* The system was tested using patient scenarios grounded in real-world datasets, including the SCIN dataset for dermatology and PTB-XL for ECG measurements.
* Evaluation was conducted using a modified version of Objective Structured Clinical Examinations (OSCEs), the global standard for assessing medical students and professionals.
* In comparative studies, AMIE's performance was measured against primary care physicians (PCPs) to ensure its behavior, accuracy, and tone aligned with clinical standards.

This research demonstrates that multimodal AI agents can effectively navigate the complexities of a medical consultation by combining linguistic empathy with the technical ability to interpret visual clinical evidence. As these systems continue to evolve, they offer a promising path toward high-quality, accessible diagnostic assistance that mirrors the multimodal nature of human medicine.
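A simplified state-machine sketch of this phase-transition idea is shown below; the phases match the summary, but the state fields, gap list, and transition rule are illustrative assumptions rather than AMIE's implementation.

```python
# Sketch of a state-aware phase-transition loop: track knowledge gaps, request a
# multimodal artifact when a gap is visual, and advance phases only once the current
# phase's objectives are met. All fields and rules are illustrative placeholders.
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Phase(Enum):
    HISTORY_TAKING = 1
    DIAGNOSIS_AND_MANAGEMENT = 2
    FOLLOW_UP = 3

@dataclass
class DialogueState:
    phase: Phase = Phase.HISTORY_TAKING
    known_facts: List[str] = field(default_factory=list)
    knowledge_gaps: List[str] = field(default_factory=list)

VISUAL_GAPS = {"rash appearance", "ecg tracing", "lab report"}

def next_action(state: DialogueState) -> str:
    visual = [g for g in state.knowledge_gaps if g in VISUAL_GAPS]
    if visual:
        return f"request artifact: please share an image of your {visual[0]}"
    if state.knowledge_gaps:
        return f"ask about: {state.knowledge_gaps[0]}"
    # Objectives of this phase met -> transition to the next phase.
    if state.phase is not Phase.FOLLOW_UP:
        state.phase = Phase(state.phase.value + 1)
        return f"transition to {state.phase.name}"
    return "close consultation"

state = DialogueState(knowledge_gaps=["symptom duration", "rash appearance"])
print(next_action(state))   # asks for the visual artifact first
state.knowledge_gaps.clear()
print(next_action(state))   # advances to DIAGNOSIS_AND_MANAGEMENT
```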

google

Evaluating progress of LLMs on scientific problem-solving

Current scientific benchmarks for large language models (LLMs) often focus on simple knowledge recall and multiple-choice responses, which do not reflect the complex, context-rich reasoning required in real-world research. To bridge this gap, Google Research has introduced CURIE, alongside the SPIQA and FEABench datasets, to evaluate LLMs on their ability to understand long-form documents, analyze multimodal data, and solve multi-step problems. These benchmarks aim to move AI from merely surfacing facts to actively assisting scientists in workflows involving information extraction, algebraic manipulation, and tool use. ### The CURIE Multitask Benchmark * CURIE spans six diverse scientific disciplines: materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins. * The benchmark includes 10 challenging tasks, such as concept tracking, information aggregation, and cross-domain expertise, based on 429 full-length research documents. * The complexity of the benchmark is reflected in its scale, with input queries averaging 15,000 words and ground truth responses averaging 954 words. * Domain experts were involved in every phase of development, from sourcing papers to creating nuanced ground-truth answers in formats like JSON, LaTeX, and YAML. ### Multimodal Reasoning and Agentic Simulation * The SPIQA (Scientific Paper Image Question Answering) dataset evaluates the ability of multimodal LLMs to ground their answers in complex figures and tables found in scientific literature. * FEABench (Finite Element Analysis Benchmark) measures the ability of LLM agents to simulate and solve multiphysics, mathematics, and engineering problems. * These tools specifically test whether models can choose the correct computational tools and reason through the physical constraints of a given problem. ### Programmatic and Model-Based Evaluation * Because scientific answers are often descriptive or formatted heterogeneously, the evaluation uses programmatic metrics like ROUGE-L and Intersection-over-Union (IoU). * For free-form and complex technical generation, the framework incorporates model-based evaluations to ensure AI responses align with expert assessments. * Task difficulty is quantified by expert ratings, ensuring the benchmark measures high-level reasoning rather than just pattern matching. These new benchmarks provide a rigorous framework for developing LLMs that can act as true collaborators in the scientific process. By focusing on long-context understanding and tool-integrated reasoning, researchers can better track the progress of AI in handling the actual complexities of modern scientific discovery.