medical-ai

3 posts

google

AfriMed-QA: Benchmarking large language models for global health

AfriMed-QA is a comprehensive benchmarking suite designed to address the critical gap in medical LLM evaluation for African healthcare contexts. Developed through a partnership between Google Research and a pan-African consortium, the project demonstrates that current models often struggle with geographic distribution shifts in disease and localized linguistic nuances. The researchers conclude that diverse, region-specific datasets are essential for training equitable AI tools that can safely provide clinical decision support in low-resource settings.

## Limitations of Western-Centric Benchmarks

* Existing medical benchmarks such as the USMLE-based MedQA focus on Western clinical contexts, which may not generalize to other regions.
* Models trained on traditional datasets often fail to account for region-specific distribution shifts in disease types and cultural descriptions of symptoms.
* The lack of diverse data makes it difficult to assess how LLMs handle linguistic variation, even when the primary language is English.

## The AfriMed-QA Dataset Composition

* The dataset contains approximately 15,000 clinically diverse questions and answers sourced from 16 African countries.
* It covers 32 medical specialties, ranging from neurosurgery and internal medicine to infectious diseases and obstetrics.
* The content is divided into three distinct formats: 4,000+ expert multiple-choice questions (MCQs), 1,200 open-ended short-answer questions (SAQs), and 10,000 consumer-style queries.
* Data was crowdsourced from 621 contributors across 60 medical schools to ensure broad representation of the continent's medical landscape.

## Data Collection and Curation Methodology

* Researchers adapted a specialized web-based platform, originally built by Intron Health, to facilitate large-scale crowdsourcing across different regions.
* To protect privacy, consumer queries were generated by prompting users with specific disease scenarios rather than asking for personal health information.
* The curation process included custom user interfaces for quality reviews and blinded human evaluations by clinical experts to ensure the accuracy of reference answers.

## LLM Performance and Evaluation Results

* The study benchmarked 30 general and biomedical LLMs, evaluating them for accuracy, semantic similarity, and human preference.
* A significant performance gap exists between model sizes: larger models consistently outperformed smaller ones on the AfriMed-QA benchmark.
* This trend highlights a challenge for low-resource settings, where smaller, specialized models are often preferred for on-device or edge deployment due to infrastructure constraints.
* The dataset has already been used to improve Google’s MedGemma, demonstrating its utility in training multimodal medical models.

The AfriMed-QA benchmark datasets and evaluation code have been open-sourced on Hugging Face and GitHub to support the global research community. Developers are encouraged to use these tools to build and refine medical AI that is more inclusive and effective for the Global South.
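Since the benchmark data is published on Hugging Face, a thin evaluation harness can be written against it with the `datasets` library. The sketch below scores multiple-choice accuracy only (the study also measured semantic similarity and human preference); the dataset id, column names, and the stub model are placeholder assumptions for illustration, so check the official dataset card for the real schema before running it.

```python
# Minimal sketch of an MCQ accuracy harness for an AfriMed-QA-style split.
# The dataset id and column names below are assumptions, not the official ones.
from datasets import load_dataset


def predict_answer(question: str, options: list[str]) -> str:
    """Stub model: always picks the first option.

    Replace with a call to whichever general or biomedical LLM is being
    benchmarked (the study compared 30 such models).
    """
    return options[0]


def mcq_accuracy(dataset_id: str = "your-org/afrimed-qa-mcq",  # placeholder id
                 split: str = "test") -> float:
    ds = load_dataset(dataset_id, split=split)
    correct = 0
    for row in ds:
        prediction = predict_answer(row["question"], row["options"])  # assumed columns
        if prediction == row["answer"]:                               # assumed column
            correct += 1
    return correct / len(ds)
```

Accuracy on MCQs is only one of the three reported axes; the open-ended SAQs and consumer queries call for semantic-similarity scoring and blinded human review, which this harness does not attempt.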

google

Enabling physician-centered oversight for AMIE

Guardrailed-AMIE (g-AMIE) is a diagnostic AI framework designed to perform patient history-taking while strictly adhering to safety guardrails that prevent it from providing direct medical advice. By decoupling data collection from clinical decision-making, the system enables an asynchronous oversight model where primary care physicians (PCPs) review and finalize AI-generated medical summaries. In virtual clinical trials, g-AMIE’s diagnostic outputs and patient communications were preferred by overseeing physicians and patient actors to those of human-led control groups.

## Multi-Agent Architecture and Guardrails

* The system uses a multi-agent setup powered by Gemini 2.0 Flash, consisting of a dialogue agent, a guardrail agent, and a SOAP note agent.
* The dialogue agent conducts history-taking in three distinct phases: general information gathering, targeted validation of a differential diagnosis, and a conclusion phase for patient questions.
* A dedicated guardrail agent monitors and rephrases responses in real time to ensure the AI abstains from sharing individualized diagnoses or treatment plans directly with the patient.
* The SOAP note agent employs sequential multi-step generation to separate summarization tasks (Subjective and Objective) from more complex inferential tasks (Assessment and Plan). A minimal sketch of this orchestration appears at the end of this post.

## The Clinician Cockpit and Asynchronous Oversight

* To facilitate human review, researchers developed the "clinician cockpit," a web interface co-designed with outpatient physicians through semi-structured interviews.
* The interface is structured around the standard SOAP note format, presenting the patient’s perspective, measurable data, differential diagnosis, and proposed management strategy.
* This framework allows overseeing PCPs to review cases asynchronously, editing the AI’s proposed differential diagnoses and management plans before sharing a final message with the patient.
* The separation of history-taking from decision-making ensures that licensed medical professionals retain ultimate accountability for patient care.

## Performance Evaluation via Virtual OSCE

* The system was evaluated in a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) involving 60 case scenarios.
* g-AMIE’s performance was compared against primary care physicians, nurse practitioners, and physician assistants who were required to operate under the same restrictive guardrails.
* Overseeing PCPs and independent physician raters preferred g-AMIE’s diagnostic accuracy and management plans over those of the human control groups.
* Patient actors reported a preference for the messages generated by g-AMIE compared to those drafted by human clinicians in the study.

While g-AMIE demonstrates high potential for human-AI collaboration in diagnostics, the researchers emphasize that results should be interpreted with caution. The workflow was specifically optimized for AI characteristics, and human clinicians may require specialized training to perform effectively within such highly regulated guardrail frameworks.
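To make the multi-agent orchestration concrete, here is a minimal sketch of one conversational turn wired through a dialogue agent, a guardrail agent, and a SOAP note agent. `call_llm` is a placeholder standing in for Gemini 2.0 Flash, and the prompts, data structures, and function names are illustrative assumptions rather than the actual g-AMIE implementation.

```python
# Minimal sketch of a g-AMIE-style turn loop: the guardrail agent rewrites every
# draft reply, and the SOAP note agent separates summarization (S, O) from
# inference (A, P). Prompts and structures are illustrative assumptions.
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g., Gemini 2.0 Flash)."""
    return "..."


@dataclass
class Consultation:
    phase: str = "general_history"  # then "targeted_validation", "conclusion"
    transcript: list[str] = field(default_factory=list)


def dialogue_agent(state: Consultation, patient_msg: str) -> str:
    # Draft the next history-taking question for the current phase.
    return call_llm(
        f"Phase: {state.phase}\nTranscript so far: {state.transcript}\n"
        f"Patient: {patient_msg}\nAsk the next history-taking question."
    )


def guardrail_agent(draft: str) -> str:
    # Rephrase the draft so it abstains from individualized medical advice.
    return call_llm(
        "Rewrite this reply so it contains no individualized diagnosis or "
        f"treatment recommendation:\n{draft}"
    )


def soap_note_agent(state: Consultation) -> dict[str, str]:
    # Sequential generation: summarize S and O before inferring A and P.
    s_o = call_llm(f"Summarize Subjective and Objective from: {state.transcript}")
    a_p = call_llm(
        f"Given this S/O summary:\n{s_o}\n"
        "Draft an Assessment and Plan for physician review."
    )
    return {"subjective_objective": s_o, "assessment_plan": a_p}


def take_turn(state: Consultation, patient_msg: str) -> str:
    state.transcript.append(f"Patient: {patient_msg}")
    reply = guardrail_agent(dialogue_agent(state, patient_msg))
    state.transcript.append(f"g-AMIE: {reply}")
    return reply
```

Keeping the guardrail as a separate pass over every draft reply, rather than relying on a single system prompt, mirrors the post’s point that responses are monitored and rephrased in real time; the SOAP note output is what the overseeing PCP would review and edit in the clinician cockpit.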

google

AMIE gains vision: A research AI agent for multimodal diagnostic dialogue

Google Research and Google DeepMind have introduced multimodal AMIE, an advanced research AI agent designed to conduct diagnostic medical dialogues that integrate text, images, and clinical documents. By building on Gemini 2.0 Flash and a novel state-aware reasoning framework, the system can intelligently request and interpret visual data such as skin photos or ECGs to refine its diagnostic hypotheses. This evolution moves AI diagnostic tools closer to real-world clinical practice, where visual evidence is often essential for accurate patient assessment and management.

### Enhancing AMIE with Multimodal Perception

To move beyond text-only limitations, researchers integrated vision capabilities that allow the agent to process complex medical information during a conversation.

* The system uses Gemini 2.0 Flash as its core component to interpret diverse data types, including dermatology images and laboratory reports.
* By incorporating multimodal perception, the agent can resolve diagnostic ambiguities that cannot be addressed through verbal descriptions alone.
* Preliminary testing with Gemini 2.5 Flash suggests that further scaling the underlying model continues to improve the agent's reasoning and diagnostic accuracy.

### Emulating Clinical Workflows via State-Aware Reasoning

A key technical contribution is the state-aware phase transition framework, which helps the AI mimic the structured yet flexible approach used by experienced clinicians (a minimal sketch of such a loop appears at the end of this post).

* The framework orchestrates the conversation through three distinct phases: History Taking, Diagnosis & Management, and Follow-up.
* The agent maintains a dynamic internal state that tracks known information about the patient and identifies specific "knowledge gaps."
* When the system detects uncertainty, it strategically requests multimodal artifacts, such as a photo of a rash or an image of a lab result, to update its differential diagnosis.
* Transitions between conversation phases are only triggered once the system assesses that the objectives of the current phase have been sufficiently met.

### Evaluation through Simulated OSCEs

To validate the agent’s performance, the researchers developed a robust simulation environment to facilitate rapid iteration and standardized testing.

* The system was tested using patient scenarios grounded in real-world datasets, including the SCIN dataset for dermatology and PTB-XL for ECG measurements.
* Evaluation was conducted using a modified version of Objective Structured Clinical Examinations (OSCEs), the global standard for assessing medical students and professionals.
* In comparative studies, AMIE's performance was measured against that of primary care physicians (PCPs) to ensure its behavior, accuracy, and tone aligned with clinical standards.

This research demonstrates that multimodal AI agents can effectively navigate the complexities of a medical consultation by combining linguistic empathy with the technical ability to interpret visual clinical evidence. As these systems continue to evolve, they offer a promising path toward high-quality, accessible diagnostic assistance that mirrors the multimodal nature of human medicine.
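To make the state-aware phase-transition idea concrete, here is a minimal sketch of how such a loop could be organized. The three phase names come from the post; the `DialogueState` fields, the gap-driven artifact request, the `phase_objectives_met` heuristic, and the `call_llm` placeholder are illustrative assumptions rather than the actual AMIE implementation.

```python
# Minimal sketch of a state-aware phase-transition loop: track known facts and
# knowledge gaps, request a multimodal artifact when a gap needs visual evidence,
# and only advance the phase once its objectives are judged complete.
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    HISTORY_TAKING = 1
    DIAGNOSIS_AND_MANAGEMENT = 2
    FOLLOW_UP = 3


@dataclass
class DialogueState:
    phase: Phase = Phase.HISTORY_TAKING
    known_facts: list[str] = field(default_factory=list)
    knowledge_gaps: list[str] = field(default_factory=list)    # e.g., "appearance of the rash"
    artifacts: dict[str, bytes] = field(default_factory=dict)  # images, ECGs, lab documents


def call_llm(prompt: str) -> str:
    """Placeholder for a multimodal model call (e.g., Gemini 2.0 Flash)."""
    return "..."


def phase_objectives_met(state: DialogueState) -> bool:
    # Toy heuristic standing in for the agent's own self-assessment.
    return not state.knowledge_gaps and len(state.known_facts) >= 5


def next_message(state: DialogueState, patient_msg: str) -> str:
    state.known_facts.append(patient_msg)

    # If an open knowledge gap needs visual evidence, request an artifact
    # (e.g., a photo of the rash) instead of asking another text question.
    if state.knowledge_gaps:
        gap = state.knowledge_gaps[0]
        return f"Could you share an image or document that shows the {gap}?"

    # Otherwise generate the next utterance for the current phase.
    reply = call_llm(
        f"Phase: {state.phase.name}. Known facts: {state.known_facts}. "
        "Produce the next clinician-style message."
    )

    # Advance to the next phase only when the current one is complete.
    if phase_objectives_met(state) and state.phase is not Phase.FOLLOW_UP:
        state.phase = Phase(state.phase.value + 1)
    return reply
```

In the real system the gap detection and phase-completion judgments are made by the model itself rather than by hard-coded heuristics; the sketch only shows where artifact requests and phase transitions sit in the control flow.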