dataset-design


google

Benchmarking LLMs for global health

Google Research has introduced a benchmarking pipeline and a dataset of over 11,000 synthetic personas to evaluate how Large Language Models (LLMs) handle tropical and infectious diseases (TRINDs). While LLMs excel at standard medical exams like the USMLE, this study reveals significant performance gaps when models encounter the regional context shifts and localized health data common in low-resource settings. The research concludes that integrating specific environmental context and advanced reasoning techniques is essential for making LLMs reliable decision-support tools for global health.

## Development of the TRINDs Synthetic Dataset

* Researchers created a dataset of 11,000+ personas covering 50 tropical and infectious diseases to address the lack of rigorous evaluation data for out-of-distribution medical tasks.
* The process began with "seed" templates based on factual data from the WHO, CDC, and PAHO, which clinicians then reviewed for clinical relevance.
* The dataset was expanded using LLM prompting to include diverse demographic, clinical, and consumer-focused augmentations (a minimal augmentation sketch appears at the end of this post).
* To test linguistic distribution shifts, the seed set was manually translated into French to evaluate how language changes impact diagnostic accuracy.

## Identifying Critical Performance Drivers

* Evaluations of Gemini 1.5 models showed that accuracy on TRINDs is lower than reported performance on standard U.S. medical benchmarks, indicating a struggle with "out-of-distribution" disease types.
* Contextual information is the primary driver of accuracy; the highest performance was achieved only when specific symptoms were combined with location and risk factors (the prompt-composition sketch at the end of this post illustrates the ablation).
* The study found that symptoms alone are often insufficient for an accurate diagnosis, emphasizing that LLMs require localized environmental data to differentiate between similar tropical conditions.
* Linguistic shifts pose a significant challenge: model performance dropped by approximately 10% when processing the French version of the dataset compared to the English version.

## Optimization and Reasoning Strategies

* Implementing Chain-of-Thought (CoT) prompting, where the model is directed to explain its reasoning step by step, led to a significant 10% increase in diagnostic accuracy.
* Researchers used an LLM-based "autorater" to scale the evaluation process, scoring answers as correct if the predicted diagnosis was meaningfully similar to the ground truth (a minimal autorater sketch is included at the end of this post).
* In tests for social bias, the study found no statistically significant difference in performance across race or gender identifiers within this specific TRINDs context.
* Performance remained stable even when clinical language was swapped for consumer-style descriptions, suggesting the models are robust to variations in how patients describe their symptoms.

To improve the utility of LLMs for global health, developers should prioritize the inclusion of regional risk factors and location-specific data in prompts. Reasoning-heavy strategies such as Chain-of-Thought and expanded multilingual training sets are also critical for bridging the performance gap in underserved regions.
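
## Illustrative Sketches

To make the pipeline steps above concrete, here is a minimal sketch of LLM-driven persona augmentation: a clinician-reviewed seed template is rewritten into demographic, clinical, or consumer-language variants. The field names and `call_llm` stand-in are assumptions for illustration, not the paper's actual schema or prompt text.

```python
from dataclasses import dataclass

@dataclass
class SeedPersona:
    disease: str        # e.g. "dengue"
    symptoms: str       # canonical presentation from WHO/CDC/PAHO facts
    location: str       # region where the disease is endemic
    risk_factors: str   # exposure history, occupation, etc.

# Hypothetical augmentation prompt; the study's real prompt is not public here.
AUGMENT_PROMPT = """You are generating synthetic patient personas for evaluation.
Rewrite the case below as a {style} description, varying the patient's
age, gender, and occupation while keeping the disease presentation accurate.

Disease: {disease}
Symptoms: {symptoms}
Location: {location}
Risk factors: {risk_factors}
"""

def augment(seed: SeedPersona, style: str, call_llm) -> str:
    """Produce one augmented persona; `style` might be 'clinical' or 'consumer'.
    `call_llm` is any callable that maps a prompt string to a completion."""
    prompt = AUGMENT_PROMPT.format(
        style=style,
        disease=seed.disease,
        symptoms=seed.symptoms,
        location=seed.location,
        risk_factors=seed.risk_factors,
    )
    return call_llm(prompt)
```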
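The finding that context drives accuracy suggests an ablation over prompt components. The sketch below shows one plausible way to compose symptoms-only, full-context, and Chain-of-Thought variants of the same query; the exact wording is an assumption, not the study's prompt.

```python
def build_prompt(symptoms: str, location: str | None = None,
                 risk_factors: str | None = None, cot: bool = False) -> str:
    """Compose a diagnostic prompt, optionally adding context and a CoT cue."""
    parts = [f"Patient presentation: {symptoms}"]
    if location:
        parts.append(f"Location: {location}")
    if risk_factors:
        parts.append(f"Risk factors: {risk_factors}")
    if cot:
        parts.append("Think step by step about the differential diagnosis "
                     "before giving your final answer.")
    parts.append("What is the most likely diagnosis?")
    return "\n".join(parts)

# Symptoms-only baseline vs. the full-context, Chain-of-Thought variant:
baseline = build_prompt("fever, joint pain, rash for 4 days")
full = build_prompt("fever, joint pain, rash for 4 days",
                    location="rural coastal East Africa",
                    risk_factors="recent mosquito exposure, no vaccination",
                    cot=True)
```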
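Finally, a sketch of the autorater idea: a second model judges whether a predicted diagnosis is meaningfully similar to the ground truth (e.g., "malaria" vs. "Plasmodium falciparum infection"). The rating prompt, the YES/NO rubric, and the `call_llm` stand-in are all assumptions about how such a rater could be wired up.

```python
RATER_PROMPT = """Ground-truth diagnosis: {truth}
Model's diagnosis: {prediction}

Do these refer to the same or a clinically equivalent condition?
Answer with exactly one word: YES or NO."""

def autorate(prediction: str, truth: str, call_llm) -> bool:
    """Ask the rater model whether prediction matches the ground truth."""
    answer = call_llm(RATER_PROMPT.format(truth=truth, prediction=prediction))
    return answer.strip().upper().startswith("YES")

def accuracy(pairs, call_llm) -> float:
    """Score a list of (prediction, ground_truth) pairs with the autorater."""
    hits = sum(autorate(p, t, call_llm) for p, t in pairs)
    return hits / len(pairs)
```

Delegating the match decision to an LLM sidesteps brittle exact-string comparison, which matters when diagnoses can be phrased many ways across clinical and consumer registers.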