
MedGemma: Our most capable open models for health AI development

Google Research has expanded its Health AI Developer Foundations (HAI-DEF) collection with the release of MedGemma and MedSigLIP, a series of open, multimodal models designed specifically for medical research and application development. These models offer a high-performance, privacy-preserving alternative to closed systems, allowing developers to maintain full control over their infrastructure while leveraging state-of-the-art medical reasoning. By providing both 4B and 27B parameter versions, the collection balances computational efficiency with complex longitudinal data interpretation, even enabling deployment on single GPUs or mobile hardware.

MedGemma Multimodal Variants

The MedGemma collection utilizes the Gemma 3 architecture to process both image and text inputs, providing robust generative capabilities for healthcare tasks.

  • MedGemma 27B Multimodal: This model is designed for complex tasks such as interpreting longitudinal electronic health records (EHR) and achieves an 87.7% score on the MedQA benchmark, performing within 3 points of DeepSeek R1 at approximately one-tenth the inference cost.
  • MedGemma 4B Multimodal: A lightweight version that scores 64.4% on MedQA, outperforming most open models under 8B parameters; it is optimized for mobile hardware and specific tasks like chest X-ray report generation.
  • Clinical Accuracy: In unblinded studies, 81% of chest X-ray reports generated by the 4B model were judged by board-certified radiologists to be sufficient for patient management; the model achieved a RadGraph F1 score of 30.3.
  • Versatility: The models retain general-purpose capabilities from the original Gemma base, ensuring they remain effective at instruction-following and non-English language tasks while handling specialized medical data.
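Because MedGemma builds on Gemma 3, it accepts the same interleaved image-and-text chat format used by Gemma-style processors on Hugging Face. The sketch below assembles such a request; the model ID, payload keys, and pipeline task name follow general transformers conventions and should be verified against the model card before use.

```python
# Sketch: building a multimodal chat request for MedGemma in the Hugging Face
# chat format. The payload structure (role/content/type keys) follows the
# Gemma 3 convention; check the model card for the exact expected format.

def build_cxr_request(image_path: str, question: str) -> list[dict]:
    """Pair an image with a free-text question in chat-message form."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_cxr_request("cxr_001.png", "Describe any abnormal findings.")

# Hypothetical inference call (requires a GPU and the model download):
# from transformers import pipeline
# pipe = pipeline("image-text-to-text", model="google/medgemma-4b-it")
# report = pipe(text=messages, max_new_tokens=256)
```

Keeping the request-building step separate from inference makes it easy to batch questions or swap between the 4B and 27B checkpoints without changing application code.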

MedSigLIP Specialized Image Encoding

MedSigLIP serves as the underlying vision component for the MedGemma suite, but it is also available as a standalone 400M-parameter encoder for tasks that call for structured outputs, such as classification and retrieval.

  • Architecture: Based on the Sigmoid loss for Language Image Pre-training (SigLIP) framework, it bridges the gap between medical imagery and text through a shared embedding space.
  • Diverse Modalities: The encoder was fine-tuned on a wide variety of medical data, including fundus photography, dermatology images, histopathology patches, and chest X-rays.
  • Functional Use Cases: It is specifically recommended for tasks involving classification, retrieval, and search, where structured outputs are preferred over free-text generation.
  • Data Retention: Training protocols ensured the model retained its ability to process natural images, maintaining its utility for hybrid tasks that mix medical and non-medical visual information.
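The shared embedding space enables zero-shot classification: embed an image and a set of candidate text labels, then score each pair. Unlike softmax-based CLIP scoring, the SigLIP formulation treats each image-text pair independently through a sigmoid, so per-label probabilities need not sum to 1. The sketch below uses random stand-in embeddings and an illustrative scale and bias, not MedSigLIP's actual learned values.

```python
# Sketch of SigLIP-style zero-shot scoring with placeholder embeddings.
# Real usage would obtain embeddings from the MedSigLIP image and text towers.
import numpy as np

def siglip_scores(img_emb, txt_embs, logit_scale=10.0, logit_bias=-10.0):
    """Score an image against candidate labels with independent sigmoids."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = logit_scale * (txt @ img) + logit_bias
    return 1.0 / (1.0 + np.exp(-logits))  # one probability per label

rng = np.random.default_rng(0)
image_embedding = rng.normal(size=128)
label_embeddings = rng.normal(size=(3, 128))  # e.g. "normal", "pneumonia", "effusion"
probs = siglip_scores(image_embedding, label_embeddings)
```

Because each label is scored independently, the same machinery supports multi-label findings (several conditions present at once) and similarity-ranked retrieval over an image index.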

Technical Implementation and Accessibility

Google has prioritized accessibility for developers by ensuring these models can run on consumer-grade or limited hardware environments.

  • Hardware Compatibility: Both the 4B and 27B models are designed to run on a single GPU, while MedGemma 4B and MedSigLIP can additionally be adapted for edge computing and mobile devices.
  • Open Resources: To support the community, Google has released the technical reports, model weights on Hugging Face, and implementation code on GitHub.
  • Developer Flexibility: Because these are open models, researchers can fine-tune them on proprietary datasets without compromising data privacy or being locked into specific cloud providers.
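The single-GPU claim can be sanity-checked with back-of-the-envelope arithmetic: weight memory alone is parameter count times bytes per parameter. The figures below ignore activations and KV cache, so they are lower bounds rather than full deployment requirements.

```python
# Rough weight-memory estimate: params x bytes per param. Activations and
# KV cache add overhead on top, so these are lower bounds.

def weight_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (bf16/fp16 = 2 bytes per parameter)."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

medgemma_4b = weight_memory_gb(4)    # ~8 GB: within reach of consumer GPUs
medgemma_27b = weight_memory_gb(27)  # ~54 GB: a single 80 GB-class accelerator
```

Quantizing to int8 or int4 halves or quarters these footprints, which is how the 4B model and MedSigLIP become plausible on mobile-class hardware.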

For medical AI development, the choice of model should depend on the specific output requirement: MedGemma is the optimal starting point for generative tasks like visual question answering or report drafting, while MedSigLIP is the preferred tool for building high-speed classification and image retrieval systems.