dataset-curation

2 posts

google

AfriMed-QA: Benchmarking large language models for global health

AfriMed-QA is a comprehensive benchmarking suite designed to address the critical gap in medical LLM evaluation for African healthcare contexts. Developed through a partnership between Google Research and a pan-African consortium, the project demonstrates that current models often struggle with geographic distribution shifts in disease and with localized linguistic nuances. The researchers conclude that diverse, region-specific datasets are essential for training equitable AI tools that can safely provide clinical decision support in low-resource settings.

## Limitations of Western-Centric Benchmarks

* Existing medical benchmarks like the USMLE-based MedQA focus on Western clinical contexts, which may not generalize to other regions.
* Models trained on traditional datasets often fail to account for region-specific distribution shifts in disease types and cultural differences in how symptoms are described.
* The lack of diverse data makes it difficult to assess how LLMs handle variations in phrasing and linguistic style, even when the primary language is English.

## The AfriMed-QA Dataset Composition

* The dataset contains approximately 15,000 clinically diverse questions and answers sourced from 16 African countries.
* It covers 32 medical specialties, ranging from neurosurgery and internal medicine to infectious diseases and obstetrics.
* The content is divided into three distinct formats: 4,000+ expert multiple-choice questions (MCQs), 1,200 open-ended short-answer questions (SAQs), and 10,000 consumer-style queries.
* Data was crowdsourced from 621 contributors across 60 medical schools to ensure broad representation of the continent's medical landscape.

## Data Collection and Curation Methodology

* Researchers adapted a specialized web-based platform, originally built by Intron Health, to facilitate large-scale crowdsourcing across different regions.
* To protect privacy, consumer queries were generated by prompting users with specific disease scenarios rather than asking for personal health information.
* The curation process included custom user interfaces for quality reviews and blinded human evaluations by clinical experts to ensure the accuracy of reference answers.

## LLM Performance and Evaluation Results

* The study benchmarked 30 general and biomedical LLMs, evaluating them for accuracy, semantic similarity, and human preference (a minimal sketch of the automatic metrics follows this summary).
* A significant performance gap exists between model sizes: larger models consistently outperformed smaller ones on the AfriMed-QA benchmark.
* This trend highlights a challenge for low-resource settings, where smaller, specialized models are often preferred for on-device or edge deployment due to infrastructure constraints.
* The dataset has already been used to improve Google’s MedGemma, demonstrating its value for training multimodal medical models.

The AfriMed-QA benchmark datasets and evaluation code have been open-sourced on Hugging Face and GitHub to support the global research community. Developers are encouraged to use these tools to build and refine medical AI that is more inclusive and effective for the Global South.
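The evaluation section above names accuracy, semantic similarity, and human preference as the scoring dimensions. Below is a minimal sketch of how the two automatic metrics could be computed for MCQ and short-answer items; the embedding model, the exact-match convention, and the toy answers are illustrative assumptions, not the paper's official evaluation harness (that harness is in the open-sourced GitHub repository).

```python
# Sketch only: exact-match accuracy for MCQs and embedding-based semantic
# similarity for open-ended answers. Model choice and example data are
# placeholders, not the AfriMed-QA release's own evaluation code.
from sentence_transformers import SentenceTransformer, util


def mcq_accuracy(predictions, answer_keys):
    """Fraction of predicted option letters that exactly match the reference key."""
    matches = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answer_keys))
    return matches / len(answer_keys)


def semantic_similarity(predictions, references, model_name="all-MiniLM-L6-v2"):
    """Mean cosine similarity between model answers and expert reference answers."""
    model = SentenceTransformer(model_name)
    pred_emb = model.encode(predictions, convert_to_tensor=True)
    ref_emb = model.encode(references, convert_to_tensor=True)
    return util.cos_sim(pred_emb, ref_emb).diagonal().mean().item()


# Toy example with made-up answers, just to show the call pattern.
print(mcq_accuracy(["A", "C", "B"], ["A", "B", "B"]))
print(semantic_similarity(
    ["Malaria is treated with artemisinin-based combination therapy."],
    ["First-line treatment for uncomplicated malaria is an ACT."],
))
```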

google

Amplify Initiative: Localized data for globalized AI

The Amplify Initiative by Google Research addresses the critical lack of linguistic and cultural diversity in generative AI training data by establishing an open, community-based platform for localized data collection. By partnering with regional experts to co-create structured, high-quality datasets, the initiative aims to ensure AI models are both representative and effective in solving local challenges across health, finance, and education. This approach shifts data collection from a top-down model to a participatory framework that prioritizes responsible, locally respectful practices in the Global South.

## The Amplify Platform Framework

The initiative is designed to bridge the gap between global AI capabilities and local needs through three core pillars:

* **Participatory Co-creation:** Researchers and local communities collaborate to define specific data needs, ensuring the resulting datasets address region-specific problems like financial literacy or localized health misinformation.
* **Open Access for Innovation:** The platform provides high-quality, multilingual datasets suitable for fine-tuning and evaluating models, specifically empowering developers in the Global South to build tools for their own communities.
* **Author Recognition:** Contributors receive tangible rewards, including professional certificates, research acknowledgments, and data authorship attribution, creating a sustainable ecosystem for expert participation.

## Pilot Implementation in Sub-Saharan Africa

To test the methodology, Google Research partnered with Makerere University’s AI Lab in Uganda to conduct an on-the-ground pilot program.

* **Expert Onboarding:** The program trained 259 experts across Ghana, Kenya, Malawi, Nigeria, and Uganda through a combination of in-person workshops and app-based modules.
* **Dataset Composition:** The pilot resulted in 8,091 annotated adversarial queries across seven languages, covering salient domains such as education and finance.
* **Adversarial Focus:** By focusing on adversarial queries, the team captured localized nuances of potential AI harms, including regional stereotypes and specialized advice that generic models often miss.

## Technical Workflow and App-Based Methodology

The initiative uses a structured technical pipeline to scale data collection while maintaining high quality and privacy.

* **Privacy-Preserving Android App:** A dedicated app serves as the primary interface for training, data creation, and annotation, allowing experts to contribute from their own environments.
* **Automated Validation:** The app includes built-in feedback loops that use automated checks to ensure queries are relevant and to prevent the submission of semantically similar or duplicate entries (see the sketch after this summary).
* **Domain-Specific Annotation:** Experts are provided with specialized annotation topics tailored to their professional backgrounds, ensuring that the metadata for each query is technically accurate and contextually relevant.

The Amplify Initiative provides a scalable blueprint for building inclusive AI by empowering experts in the Global South to define their own data needs. As the project expands to India and Brazil, it offers a vital resource for developers seeking to fine-tune models for local contexts and improve the safety and relevance of AI on a global scale.
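The automated-validation step above mentions rejecting semantically similar or duplicate submissions. A minimal sketch of one way such a check could work is shown below, using sentence embeddings and a cosine-similarity threshold; the multilingual model, the 0.85 cutoff, and the `is_near_duplicate` helper are assumptions for illustration, not the Amplify app's actual implementation.

```python
# Sketch only: reject a new query if its embedding is too close to any
# already-accepted query. Model and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

# A multilingual model is used here because the pilot queries span seven languages.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")


def is_near_duplicate(new_query, accepted_queries, threshold=0.85):
    """Return True if the new query is semantically too close to an accepted one."""
    if not accepted_queries:
        return False
    new_emb = model.encode(new_query, convert_to_tensor=True)
    accepted_emb = model.encode(accepted_queries, convert_to_tensor=True)
    max_sim = util.cos_sim(new_emb, accepted_emb).max().item()
    return max_sim >= threshold


accepted = ["How do I know if a mobile-money loan offer is a scam?"]
print(is_near_duplicate("How can I tell whether a mobile money loan is a scam?", accepted))
print(is_near_duplicate("Which vaccines does a newborn need in the first year?", accepted))
```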