benchmarking

6 posts

google

From Waveforms to Wisdom: The New Benchmark for Auditory Intelligence

Google Research has introduced the Massive Sound Embedding Benchmark (MSEB) to unify the fragmented landscape of machine sound intelligence. By standardizing the evaluation of eight core auditory capabilities across diverse datasets, the framework reveals that current sound representations are far from universal and have significant performance "headroom" for improvement. Ultimately, MSEB provides an open-source platform to drive the development of general-purpose sound embeddings for next-generation multimodal AI.

### Diverse Datasets for Real-World Scenarios

The benchmark utilizes a curated collection of high-quality, accessible datasets designed to reflect global diversity and complex acoustic environments.

* **Simple Voice Questions (SVQ):** A foundational dataset featuring 177,352 short spoken queries across 17 languages and 26 locales, recorded in varying conditions like traffic and media noise.
* **Speech-MASSIVE:** Used for multilingual spoken language understanding and intent classification.
* **FSD50K:** A large-scale dataset for environmental sound event recognition containing 200 classes based on the AudioSet Ontology.
* **BirdSet:** A massive-scale benchmark specifically for avian bioacoustics and complex soundscape recordings.

### Eight Core Auditory Capabilities

MSEB is structured around "super-tasks" that represent the essential functions an intelligent auditory system must perform within a multimodal context.

* **Retrieval and Reasoning:** These tasks simulate voice search and the ability of an assistant to find precise answers within documents based on spoken questions.
* **Classification and Transcription:** Standard perception tasks that categorize sounds by environment or intent and convert audio signals into verbatim text.
* **Segmentation and Clustering:** These involve identifying and localizing salient terms with precise timestamps and grouping sound samples by shared attributes without predefined labels.
* **Reranking and Reconstruction:** Advanced tasks that reorder ambiguous text hypotheses to match spoken queries and test embedding quality by regenerating original audio waveforms.

### Unified Evaluation and Performance Goals

The framework is designed to move beyond fragmented research by providing a consistent structure for evaluating different model architectures.

* **Model Agnostic:** The open framework allows for the evaluation of uni-modal, cascade, and end-to-end multimodal embedding models.
* **Objective Baselines:** By establishing clear performance goals, the benchmark highlights specific research opportunities where current state-of-the-art models fall short of their potential.
* **Multimodal Integration:** Every task assumes sound is the critical input but incorporates other modalities, such as text context, to better simulate real-world AI interactions.

By providing a comprehensive roadmap for auditory intelligence, MSEB encourages the community to move toward universal sound embeddings. Researchers can contribute to this evolving standard by accessing the open-source GitHub repository and utilizing the newly released datasets on Hugging Face to benchmark their own models.
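To make the retrieval super-task concrete, here is a minimal sketch of how an embedding-agnostic retrieval evaluation can be scored with recall@k over cosine similarities. The embeddings, corpus size, and metric choice are illustrative assumptions, not MSEB's actual API or data.

```python
import numpy as np

def recall_at_k(query_embs: np.ndarray, doc_embs: np.ndarray,
                relevant_doc: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose relevant document appears among the top-k
    cosine-similarity matches. Shapes: (Q, D), (N, D), (Q,)."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                              # (Q, N) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k best documents per query
    hits = (topk == relevant_doc[:, None]).any(axis=1)
    return float(hits.mean())

# Hypothetical embeddings: 100 spoken queries, 1,000 candidate passages, 256 dimensions.
rng = np.random.default_rng(0)
queries = rng.normal(size=(100, 256))
passages = rng.normal(size=(1000, 256))
gold = rng.integers(0, 1000, size=100)          # index of the correct passage per query
print(f"recall@5: {recall_at_k(queries, passages, gold, k=5):.3f}")
```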

google

AfriMed-QA: Benchmarking large language models for global health

AfriMed-QA is a comprehensive benchmarking suite designed to address the critical gap in medical LLM evaluation for African healthcare contexts. Developed through a partnership between Google Research and a pan-African consortium, the project demonstrates that current models often struggle with geographic distribution shifts in disease and localized linguistic nuances. The researchers conclude that diverse, region-specific datasets are essential for training equitable AI tools that can safely provide clinical decision support in low-resource settings.

## Limitations of Western-Centric Benchmarks

* Existing medical benchmarks like USMLE MedQA focus on Western clinical contexts, which may not generalize to other regions.
* Models trained on traditional datasets often fail to account for specific distribution shifts in disease types and cultural symptom descriptions.
* The lack of diverse data makes it difficult to assess how LLMs handle regional linguistic variation, even when the primary language is English.

## The AfriMed-QA Dataset Composition

* The dataset contains approximately 15,000 clinically diverse questions and answers sourced from 16 African countries.
* It covers 32 medical specialties, ranging from neurosurgery and internal medicine to infectious diseases and obstetrics.
* The content is divided into three distinct formats: 4,000+ expert multiple-choice questions (MCQs), 1,200 open-ended short-answer questions (SAQs), and 10,000 consumer-style queries.
* Data was crowdsourced from 621 contributors across 60 medical schools to ensure a broad representation of the continent's medical landscape.

## Data Collection and Curation Methodology

* Researchers adapted a specialized web-based platform, originally built by Intron Health, to facilitate large-scale crowdsourcing across different regions.
* To protect privacy, consumer queries were generated by prompting users with specific disease scenarios rather than asking for personal health information.
* The curation process included custom user interfaces for quality reviews and blinded human evaluations by clinical experts to ensure the accuracy of reference answers.

## LLM Performance and Evaluation Results

* The study benchmarked 30 general and biomedical LLMs, evaluating them for accuracy, semantic similarity, and human preference.
* A significant performance gap exists between model sizes; larger models consistently outperformed smaller models on the AfriMed-QA benchmark.
* This trend highlights a challenge for low-resource settings, where smaller, specialized models are often preferred for on-device or edge deployment due to infrastructure constraints.
* The dataset has already been utilized to improve Google's MedGemma, demonstrating its utility in training multimodal medical models.

The AfriMed-QA benchmark datasets and evaluation code have been open-sourced on Hugging Face and GitHub to support the global research community. Developers are encouraged to use these tools to build and refine medical AI that is more inclusive and effective for the Global South.
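As a concrete illustration of the MCQ portion of the evaluation, here is a minimal accuracy harness with a pluggable model callable. The record fields and the `answer_question` interface are assumptions for illustration and do not reflect the released dataset's actual schema or the paper's evaluation code.

```python
from typing import Callable, Dict, List

# Hypothetical MCQ records; the real AfriMed-QA schema on Hugging Face may differ.
MCQItem = Dict[str, object]

def mcq_accuracy(items: List[MCQItem],
                 answer_question: Callable[[str, List[str]], str]) -> float:
    """Scores a model callable that maps (question, options) to a chosen option letter."""
    correct = 0
    for item in items:
        prediction = answer_question(item["question"], item["options"])
        correct += int(prediction.strip().upper() == item["answer"])
    return correct / len(items)

# Toy usage with a trivial baseline that always picks option "A".
sample = [
    {"question": "Which vector transmits malaria?",
     "options": ["A. Anopheles mosquito", "B. Tsetse fly", "C. Sandfly", "D. Blackfly"],
     "answer": "A"},
]
print(mcq_accuracy(sample, lambda question, options: "A"))
```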

google

Benchmarking LLMs for global health

Google Research has introduced a benchmarking pipeline and a dataset of over 11,000 synthetic personas to evaluate how Large Language Models (LLMs) handle tropical and infectious diseases (TRINDs). While LLMs excel at standard medical exams like the USMLE, this study reveals significant performance gaps when models encounter the regional context shifts and localized health data common in low-resource settings. The research concludes that integrating specific environmental context and advanced reasoning techniques is essential for making LLMs reliable decision-support tools for global health.

## Development of the TRINDs Synthetic Dataset

* Researchers created a dataset of 11,000+ personas covering 50 tropical and infectious diseases to address the lack of rigorous evaluation data for out-of-distribution medical tasks.
* The process began with "seed" templates based on factual data from the WHO, CDC, and PAHO, which were then reviewed by clinicians for clinical relevance.
* The dataset was expanded using LLM prompting to include diverse demographic, clinical, and consumer-focused augmentations.
* To test linguistic distribution shifts, the seed set was manually translated into French to evaluate how language changes impact diagnostic accuracy.

## Identifying Critical Performance Drivers

* Evaluations of Gemini 1.5 models showed that accuracy on TRINDs is lower than reported performance on standard U.S. medical benchmarks, indicating a struggle with "out-of-distribution" disease types.
* Contextual information is the primary driver of accuracy; the highest performance was achieved only when specific symptoms were combined with location and risk factors.
* The study found that symptoms alone are often insufficient for an accurate diagnosis, emphasizing that LLMs require localized environmental data to differentiate between similar tropical conditions.
* Linguistic shifts pose a significant challenge, as model performance dropped by approximately 10% when processing the French version of the dataset compared to the English version.

## Optimization and Reasoning Strategies

* Implementing Chain-of-Thought (CoT) prompting, where the model is directed to explain its reasoning step by step, led to a significant 10% increase in diagnostic accuracy.
* Researchers utilized an LLM-based "autorater" to scale the evaluation process, scoring answers as correct if the predicted diagnosis was meaningfully similar to the ground truth.
* In tests regarding social biases, the study found no statistically significant difference in performance across race or gender identifiers within this specific TRINDs context.
* Performance remained stable even when clinical language was swapped for consumer-style descriptions, suggesting the models are robust to variations in how patients describe their symptoms.

To improve the utility of LLMs for global health, developers should prioritize the inclusion of regional risk factors and location-specific data in prompts. Utilizing reasoning-heavy strategies like Chain-of-Thought and expanding multilingual training sets are critical steps for bridging the performance gap in underserved regions.
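The sketch below illustrates the two techniques named above: a chain-of-thought diagnostic prompt and an LLM autorater that judges whether the predicted diagnosis matches the ground truth. The prompt wording and the `call_model` callable are placeholders, not the prompts or scoring pipeline used in the study.

```python
from typing import Callable

COT_PROMPT = """You are a clinician supporting a health worker.
Patient persona: {persona}
Location and risk factors: {context}
Think through the differential diagnosis step by step,
then state the single most likely diagnosis on the final line."""

AUTORATER_PROMPT = """Ground-truth diagnosis: {truth}
Model diagnosis: {prediction}
Answer YES if the two refer to the same condition (synonyms count), otherwise NO."""

def evaluate_case(persona: str, context: str, truth: str,
                  call_model: Callable[[str], str]) -> bool:
    """Runs a chain-of-thought diagnosis and scores it with an LLM autorater."""
    prediction = call_model(COT_PROMPT.format(persona=persona, context=context))
    verdict = call_model(AUTORATER_PROMPT.format(truth=truth, prediction=prediction))
    return verdict.strip().upper().startswith("YES")

# Toy usage with a stubbed model that diagnoses malaria and agrees with the autorater check.
stub = lambda prompt: "YES" if "Ground-truth" in prompt else "Most likely diagnosis: malaria"
print(evaluate_case("6-year-old with fever and chills",
                    "rural West Africa, rainy season", "Malaria", stub))
```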

google

Improving brain models with ZAPBench

Google Research, in collaboration with HHMI Janelia and Harvard, has introduced ZAPBench, a first-of-its-kind whole-brain activity dataset and benchmark designed to improve the accuracy of brain activity models. Using the larval zebrafish as a model organism, the project provides single-cell resolution recordings of approximately 70,000 neurons, capturing nearly the entire vertebrate brain in action. This resource allows researchers to bridge the gap between structural connectomics and dynamic functional activity to better understand how neural wiring generates complex behavior.

## Whole-Brain Activity in Larval Zebrafish

* The dataset focuses on the six-day-old larval zebrafish because it is small, transparent, and capable of complex behaviors like motor learning, hunting, and memory.
* Researchers used light-sheet microscopy to scan the brain in 3D slices, recording two hours of continuous activity.
* The fish were engineered with GCaMP, a genetically encoded calcium indicator that emits light when neurons fire, allowing for the visualization of real-time neural impulses.
* To correlate neural activity with behavior, the fish were placed in a virtual reality environment where stimuli, such as shifting water currents and light changes, were projected around them while tail muscle activity was recorded via electrodes.

## The ZAPBench Framework

* ZAPBench standardizes the evaluation of machine learning models in neuroscience, following the tradition of benchmarks in fields like computer vision and language modeling.
* The benchmark provides a high-quality dataset of 70,000 neurons, whereas previous efforts in other species often covered less than 0.1% of the brain.
* It challenges models to predict how neurons will respond to specific visual stimuli and behavioral patterns.
* Initial results presented at ICLR 2025 demonstrate that while simple linear models provide a baseline, advanced architectures like Transformers and Convolutional Neural Networks (CNNs) significantly improve prediction accuracy.

## Integrating Structure and Function

* While previous connectomics projects mapped physical neural connections, ZAPBench adds the "dynamic" layer of how those connections are used over time.
* The team is currently generating a comprehensive structural connectome for the exact same specimen used in the activity recordings.
* This dual approach will eventually allow scientists to investigate the direct relationship between precise physical wiring and the resulting patterns of neural activity across an entire vertebrate brain.

By providing an open-source dataset and standardized benchmark, ZAPBench enables the global research community to develop and compare more sophisticated models of neural dynamics, potentially leading to breakthroughs in how we simulate and understand vertebrate cognition.
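As a rough illustration of the forecasting task, the sketch below fits a simple linear baseline that predicts the next activity frame from a short context window and reports mean absolute error. The window length, train/test split, and synthetic traces are assumptions for illustration; ZAPBench defines its own splits and metrics.

```python
import numpy as np

def linear_forecast_mae(activity: np.ndarray, context: int = 4, horizon: int = 1) -> float:
    """Fits one least-squares linear map from the last `context` frames to the frame
    `horizon` steps ahead, shared across neurons, and reports mean absolute error.
    `activity` has shape (timesteps, neurons)."""
    X, y = [], []
    for t in range(context, activity.shape[0] - horizon + 1):
        X.append(activity[t - context:t].T)     # (neurons, context) window per timestep
        y.append(activity[t + horizon - 1])     # (neurons,) frame to predict
    X = np.concatenate(X, axis=0)               # (samples * neurons, context)
    y = np.concatenate(y, axis=0)
    split = int(0.8 * len(X))                   # naive 80/20 split for the sketch
    coef, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
    preds = X[split:] @ coef
    return float(np.mean(np.abs(preds - y[split:])))

# Toy calcium-like traces: 500 frames of smoothed noise for 50 neurons.
rng = np.random.default_rng(1)
traces = np.cumsum(rng.normal(scale=0.1, size=(500, 50)), axis=0)
print(f"linear baseline MAE: {linear_forecast_mae(traces):.3f}")
```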

google

Evaluating progress of LLMs on scientific problem-solving

Current scientific benchmarks for large language models (LLMs) often focus on simple knowledge recall and multiple-choice responses, which do not reflect the complex, context-rich reasoning required in real-world research. To bridge this gap, Google Research has introduced CURIE, alongside the SPIQA and FEABench datasets, to evaluate LLMs on their ability to understand long-form documents, analyze multimodal data, and solve multi-step problems. These benchmarks aim to move AI from merely surfacing facts to actively assisting scientists in workflows involving information extraction, algebraic manipulation, and tool use.

### The CURIE Multitask Benchmark

* CURIE spans six diverse scientific disciplines: materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins.
* The benchmark includes 10 challenging tasks, such as concept tracking, information aggregation, and cross-domain expertise, based on 429 full-length research documents.
* The complexity of the benchmark is reflected in its scale, with input queries averaging 15,000 words and ground truth responses averaging 954 words.
* Domain experts were involved in every phase of development, from sourcing papers to creating nuanced ground-truth answers in formats like JSON, LaTeX, and YAML.

### Multimodal Reasoning and Agentic Simulation

* The SPIQA (Scientific Paper Image Question Answering) dataset evaluates the ability of multimodal LLMs to ground their answers in complex figures and tables found in scientific literature.
* FEABench (Finite Element Analysis Benchmark) measures the ability of LLM agents to simulate and solve multiphysics, mathematics, and engineering problems.
* These tools specifically test whether models can choose the correct computational tools and reason through the physical constraints of a given problem.

### Programmatic and Model-Based Evaluation

* Because scientific answers are often descriptive or formatted heterogeneously, the evaluation uses programmatic metrics like ROUGE-L and Intersection-over-Union (IoU).
* For free-form and complex technical generation, the framework incorporates model-based evaluations to ensure AI responses align with expert assessments.
* Task difficulty is quantified by expert ratings, ensuring the benchmark measures high-level reasoning rather than just pattern matching.

These new benchmarks provide a rigorous framework for developing LLMs that can act as true collaborators in the scientific process. By focusing on long-context understanding and tool-integrated reasoning, researchers can better track the progress of AI in handling the actual complexities of modern scientific discovery.
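For reference, here are minimal implementations of the two programmatic metrics mentioned above: an LCS-based ROUGE-L F1 and a set-level Intersection-over-Union. They assume whitespace tokenization and exact string matching, which is a simplification of the benchmark's official scoring.

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based ROUGE-L F1 over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def set_iou(predicted: set, gold: set) -> float:
    """Intersection-over-union for sets of extracted items (e.g., material names)."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

print(rouge_l_f1("the lattice constant is 3.2 angstrom", "lattice constant of 3.2 angstrom"))
print(set_iou({"GaAs", "InP"}, {"GaAs", "Si"}))
```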

google

ECLeKTic: A novel benchmark for evaluating cross-lingual knowledge transfer in LLMs

ECLeKTic is a novel benchmark designed to evaluate how effectively large language models (LLMs) transfer knowledge between languages, addressing a common limitation where models possess information in a source language but fail to access it in others. By utilizing a closed-book question-answering format based on language-specific Wikipedia entries, the benchmark quantifies the gap between human-like cross-lingual understanding and current machine performance. Initial testing reveals that even state-of-the-art models have significant room for improvement, with the highest-performing model, Gemini 2.5 Pro, achieving only a 52.6% success rate.

## Methodology and Dataset Construction

The researchers built the ECLeKTic dataset by focusing on "information silos" within Wikipedia to ensure the models would need to perform internal transfer rather than simply recalling translated training data.

* The dataset targets 12 languages: English, French, German, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin Chinese, Portuguese, and Spanish.
* Researchers selected 100 articles per language from a July 2023 Wikipedia snapshot that existed exclusively in that specific language and had no equivalent articles in the other 11 targeted languages.
* This approach uses Wikipedia presence as a proxy to identify facts likely encountered by the model in only one language during its training phase.

## Human Refinement and Decontextualization

To ensure the quality and portability of the questions, the team employed native speakers to refine and verify the data generated by AI.

* Human annotators filtered Gemini-generated question-and-answer pairs to ensure they were answerable in a closed-book setting without referring to external context.
* Annotators performed "decontextualization" by adding specific details to ambiguous terms; for example, a reference to the "Supreme Court" was clarified as the "Israeli Supreme Court" to ensure the question remained accurate after translation.
* Questions were curated to focus on cultural and local salience rather than general global knowledge like science or universal current events.
* The final dataset consists of 384 unique questions, which were translated and verified across all 11 target languages, resulting in 4,224 total examples.

## Benchmarking Model Performance

The benchmark evaluates models using a specific metric called "overall success," which measures a model's ability to answer a question correctly in both the original source language and the target language.

* The benchmark was used to test eight leading open and proprietary LLMs.
* Gemini 2.0 Pro initially set a high bar with 41.6% success, which was later surpassed by Gemini 2.5 Pro at 52.6%.
* The results demonstrate that while models are improving, they still struggle to maintain consistent knowledge across different linguistic contexts, representing a major hurdle for equitable global information access.

The release of ECLeKTic as an open-source benchmark on Kaggle provides a vital tool for the AI community to bridge the "knowledge gap" between high-resource and low-resource languages. Developers and researchers should use this data to refine training methodologies, aiming for models that can express their internal knowledge regardless of the language used in the prompt.
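The sketch below shows one way to compute the "overall success" metric as described: a transfer attempt counts only if the question is answered correctly in both its source language and the target language. The record format and aggregation details are assumptions for illustration, not the benchmark's released evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def overall_success(results: Iterable[Dict]) -> float:
    """Each record marks whether one (question, answer language) pair was answered
    correctly. A transfer attempt counts as a success only if the question was also
    answered correctly in its source language."""
    transfer: Dict[Tuple[str, str], bool] = {}
    source_ok: Dict[str, bool] = defaultdict(bool)
    for r in results:
        if r["answer_lang"] == r["source_lang"]:
            source_ok[r["question_id"]] = r["correct"]
        else:
            transfer[(r["question_id"], r["answer_lang"])] = r["correct"]
    attempts = [(qid, ok) for (qid, _), ok in transfer.items()]
    return sum(ok and source_ok[qid] for qid, ok in attempts) / len(attempts)

# Toy example: one Hebrew-sourced question probed in English and French.
records = [
    {"question_id": "q1", "source_lang": "he", "answer_lang": "he", "correct": True},
    {"question_id": "q1", "source_lang": "he", "answer_lang": "en", "correct": True},
    {"question_id": "q1", "source_lang": "he", "answer_lang": "fr", "correct": False},
]
print(overall_success(records))  # 0.5: success in English, failure in French
```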