hugging-face (3 posts)

google

From Waveforms to Wisdom: The New Benchmark for Auditory Intelligence

Google Research has introduced the Massive Sound Embedding Benchmark (MSEB) to unify the fragmented landscape of machine sound intelligence. By standardizing the evaluation of eight core auditory capabilities across diverse datasets, the framework reveals that current sound representations are far from universal and have significant performance "headroom" for improvement. Ultimately, MSEB provides an open-source platform to drive the development of general-purpose sound embeddings for next-generation multimodal AI.

### Diverse Datasets for Real-World Scenarios

The benchmark utilizes a curated collection of high-quality, accessible datasets designed to reflect global diversity and complex acoustic environments.

* **Simple Voice Questions (SVQ):** A foundational dataset featuring 177,352 short spoken queries across 17 languages and 26 locales, recorded in varying conditions like traffic and media noise.
* **Speech-MASSIVE:** Used for multilingual spoken language understanding and intent classification.
* **FSD50K:** A large-scale dataset for environmental sound event recognition containing 200 classes based on the AudioSet Ontology.
* **BirdSet:** A massive-scale benchmark specifically for avian bioacoustics and complex soundscape recordings.

### Eight Core Auditory Capabilities

MSEB is structured around "super-tasks" that represent the essential functions an intelligent auditory system must perform within a multimodal context.

* **Retrieval and Reasoning:** These tasks simulate voice search and the ability of an assistant to find precise answers within documents based on spoken questions.
* **Classification and Transcription:** Standard perception tasks that categorize sounds by environment or intent and convert audio signals into verbatim text.
* **Segmentation and Clustering:** These involve identifying and localizing salient terms with precise timestamps, and grouping sound samples by shared attributes without predefined labels.
* **Reranking and Reconstruction:** Advanced tasks that reorder ambiguous text hypotheses to match spoken queries and test embedding quality by regenerating original audio waveforms.

### Unified Evaluation and Performance Goals

The framework is designed to move beyond fragmented research by providing a consistent structure for evaluating different model architectures.

* **Model Agnostic:** The open framework allows for the evaluation of uni-modal, cascade, and end-to-end multimodal embedding models.
* **Objective Baselines:** By establishing clear performance goals, the benchmark highlights specific research opportunities where current state-of-the-art models fall short of their potential.
* **Multimodal Integration:** Every task assumes sound is the critical input but incorporates other modalities, such as text context, to better simulate real-world AI interactions.

By providing a comprehensive roadmap for auditory intelligence, MSEB encourages the community to move toward universal sound embeddings. Researchers can contribute to this evolving standard by accessing the open-source GitHub repository and utilizing the newly released datasets on Hugging Face to benchmark their own models.
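To make the evaluation loop concrete, below is a minimal sketch of scoring an embedding model on a retrieval-style super-task. This is not the official MSEB harness: the `embed` function, the recall@k metric choice, and the paired query/document layout are assumptions for illustration; the GitHub repository defines the actual task APIs and metrics.

```python
import numpy as np

def recall_at_k(query_embs: np.ndarray, doc_embs: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose paired document lands in the top-k.

    Assumes query i's ground-truth document is doc i, a common layout
    for paired retrieval benchmarks.
    """
    # Normalize rows so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = q @ d.T                            # (num_queries, num_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]   # best k docs per query
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())

# Hypothetical usage with whichever embedding model you are benchmarking:
# query_embs = embed(spoken_questions)    # e.g., SVQ audio queries
# doc_embs   = embed(candidate_passages)  # text passages to search over
# print(f"recall@1 = {recall_at_k(query_embs, doc_embs, k=1):.3f}")
```

Because the scorer only consumes vectors, the same loop applies whether the model is uni-modal, a cascade, or end-to-end multimodal.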

line

Security threats and countermeasures in AI

Developing AI products introduces unique security vulnerabilities that extend beyond traditional software risks, ranging from package hallucinations to sophisticated indirect prompt injections. To mitigate these threats, organizations must move away from trusting LLM-generated content and instead implement rigorous validation, automated threat modeling, and input/output guardrails. The following summary details the specific risks and mitigation strategies identified by LY Corporation's security engineering team.

## Slopsquatting and Package Hallucinations

- AI models frequently hallucinate non-existent library or package names when providing coding instructions (e.g., suggesting `huggingface-cli` instead of the correct `huggingface_hub[cli]`).
- Attackers exploit this by registering these hallucinated names on public registries to distribute malware to unsuspecting developers.
- Mitigation requires developers to manually verify all AI-suggested commands and dependencies before execution in any environment (a registry-vetting sketch follows these sections).

## Prompt Injection and Arbitrary Code Execution

- As seen in CVE-2024-5565 (Vanna AI), attackers can inject malicious instructions into prompts to force the application to execute arbitrary code.
- This vulnerability arises when developers grant LLMs the autonomy to generate and run logic within the application context without sufficient isolation.
- Mitigation involves treating LLM outputs as untrusted data, sanitizing user inputs, and strictly limiting the LLM's ability to execute system-level commands (see the SQL-validation sketch below).

## Indirect Prompt Injection in Integrated AI

- AI assistants integrated into office environments (like Gemini for Workspace) are susceptible to indirect prompt injections hidden within emails or documents.
- A malicious email can contain "system-like" instructions that trick the AI into hiding content, redirecting users to phishing sites, or leaking data from other files.
- Mitigation requires robust guardrails that scan both the input data (the content being processed) and the generated output for instructional anomalies.

## Permission Risks in AI Agents and MCP

- The use of Model Context Protocol (MCP) and coding agents creates risks where an agent might overstep its intended scope.
- If an agent has broad access to a developer's environment, a malicious prompt in a public repository could trick the agent into accessing or leaking sensitive data (such as salary info or private keys) from a private repository.
- Mitigation centers on the principle of least privilege, ensuring AI agents are restricted to specific, scoped directories and repositories (see the path-scoping sketch below).

## Embedding Inversion and Vector Store Vulnerabilities

- Attacks targeting the retrieval phase of RAG (Retrieval-Augmented Generation) systems can lead to data leaks.
- Embedding inversion techniques may allow attackers to reconstruct original sensitive text from the vector embeddings stored in a database.
- Securing AI products requires protecting the integrity of the vector store and ensuring that retrieved context does not bypass security filters.

## Automated Security Assessment Tools

- To scale security, LY Corporation is developing internal tools like "ConA" for automated threat modeling and "LAVA" for automated vulnerability assessment.
- These tools aim to identify AI-specific risks during the design and development phases rather than relying solely on manual reviews.
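As a complement to manual review of AI-suggested dependencies, the following sketch vets a package name against PyPI's public JSON API before installation. The 90-day age threshold is an illustrative heuristic, not an official recommendation: a missing or very young package is a slopsquatting red flag, not proof of malice.

```python
import sys
from datetime import datetime, timezone

import requests  # pip install requests

def vet_pypi_package(name: str, min_age_days: int = 90) -> bool:
    """Heuristic pre-install check against slopsquatting on PyPI."""
    resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
    if resp.status_code != 200:
        print(f"'{name}' not found on PyPI: possibly a hallucinated name")
        return False
    releases = resp.json().get("releases", {})
    uploads = [
        datetime.fromisoformat(f["upload_time_iso_8601"].replace("Z", "+00:00"))
        for files in releases.values()
        for f in files
    ]
    if not uploads:
        print(f"'{name}' has no released files: treat with suspicion")
        return False
    age_days = (datetime.now(timezone.utc) - min(uploads)).days
    if age_days < min_age_days:
        print(f"'{name}' first published {age_days} days ago: verify manually")
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if vet_pypi_package(sys.argv[1]) else 1)
```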
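For prompt-injection-to-code-execution paths like the Vanna AI case, one concrete form of "treat LLM output as untrusted data" is to validate generated SQL before it ever reaches the database. This sketch is a deliberately conservative allowlist, not Vanna's actual fix; a production system should pair it with a real SQL parser and a read-only database role.

```python
import re

# Keywords that should never appear in a read-only analytics query.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|grant|attach|pragma|exec)\b",
    re.IGNORECASE,
)

def validate_llm_sql(sql: str) -> str:
    """Reject LLM-generated SQL unless it looks like a single SELECT."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # multiple statements smuggled into one string
        raise ValueError("multiple SQL statements are not allowed")
    if not statement.lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("statement contains a write/DDL keyword")
    return statement

# validate_llm_sql("SELECT name FROM users")  # returns the query
# validate_llm_sql("DROP TABLE users")        # raises ValueError
```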
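Least privilege for agents and MCP tools can be enforced mechanically rather than by trusting the model. In this sketch (the directory name and function are hypothetical), an agent's file reads are confined to a single scoped project, so a malicious prompt asking for a sibling repository fails even if the model complies.

```python
from pathlib import Path

# Hypothetical scope for an agent's file-system tool: one project directory.
ALLOWED_ROOT = (Path.home() / "workspaces" / "current-project").resolve()

def safe_read(requested: str) -> str:
    """Serve file reads only from inside the agent's scoped directory."""
    # resolve() collapses `..` segments and follows symlinks, so traversal
    # tricks are normalized away before the containment check.
    target = (ALLOWED_ROOT / requested).resolve()
    if not target.is_relative_to(ALLOWED_ROOT):  # Python 3.9+
        raise PermissionError(f"path escapes agent scope: {target}")
    return target.read_text()

# safe_read("src/app.py")                    # OK: inside the scope
# safe_read("../payroll-repo/salaries.csv")  # raises PermissionError
```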
Effective AI security requires a shift in mindset: treat every LLM response as a potential security risk. Developers should adopt automated threat modeling and implement strict input/output validation layers to protect both the application infrastructure and user data from evolving AI-based exploits.

google

AfriMed-QA: Benchmarking large language models for global health

AfriMed-QA is a comprehensive benchmarking suite designed to address the critical gap in medical LLM evaluation for African healthcare contexts. Developed through a partnership between Google Research and a pan-African consortium, the project demonstrates that current models often struggle with geographic distribution shifts in disease and localized linguistic nuances. The researchers conclude that diverse, region-specific datasets are essential for training equitable AI tools that can safely provide clinical decision support in low-resource settings.

## Limitations of Western-Centric Benchmarks

* Existing medical benchmarks like USMLE MedQA focus on Western clinical contexts, which may not generalize to other regions.
* Models trained on traditional datasets often fail to account for regional shifts in disease types and cultural differences in how symptoms are described.
* The lack of diverse data makes it difficult to assess how LLMs handle regional linguistic variation, even when the primary language is English.

## The AfriMed-QA Dataset Composition

* The dataset contains approximately 15,000 clinically diverse questions and answers sourced from 16 African countries.
* It covers 32 medical specialties, ranging from neurosurgery and internal medicine to infectious diseases and obstetrics.
* The content is divided into three formats: 4,000+ expert multiple-choice questions (MCQs), 1,200 open-ended short-answer questions (SAQs), and 10,000 consumer-style queries.
* Data was crowdsourced from 621 contributors across 60 medical schools to ensure broad representation of the continent's medical landscape.

## Data Collection and Curation Methodology

* Researchers adapted a specialized web-based platform, originally built by Intron Health, to facilitate large-scale crowdsourcing across different regions.
* To protect privacy, consumer queries were generated by prompting users with specific disease scenarios rather than asking for personal health information.
* The curation process included custom user interfaces for quality reviews and blinded human evaluations by clinical experts to ensure the accuracy of reference answers.

## LLM Performance and Evaluation Results

* The study benchmarked 30 general and biomedical LLMs, evaluating them for accuracy, semantic similarity, and human preference.
* A significant performance gap exists between model sizes; larger models consistently outperformed smaller ones on the AfriMed-QA benchmark.
* This trend highlights a challenge for low-resource settings, where smaller, specialized models are often preferred for on-device or edge deployment due to infrastructure constraints.
* The dataset has already been used to improve Google's MedGemma, demonstrating its utility in training multimodal medical models.

The AfriMed-QA benchmark datasets and evaluation code have been open-sourced on Hugging Face and GitHub to support the global research community. Developers are encouraged to use these tools to build and refine medical AI that is more inclusive and effective for the Global South.
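As a starting point for benchmarking your own model, here is a minimal sketch of loading the dataset and scoring the MCQ subset. The dataset identifier, split, and field names are assumptions for illustration; check the AfriMed-QA pages on Hugging Face and GitHub for the published schema and official evaluation code.

```python
from datasets import load_dataset  # pip install datasets

# Dataset id, split, and field names below are assumptions for
# illustration; see the AfriMed-QA Hugging Face page for the real schema.
ds = load_dataset("intronhealth/afrimedqa_v2", split="train")

def mcq_accuracy(examples, predict) -> float:
    """Exact-match accuracy over the multiple-choice subset.

    `predict` stands in for the model under test: given a question and
    its answer options, it returns the option label the model chooses.
    """
    mcqs = [ex for ex in examples if ex.get("question_type") == "mcq"]
    correct = sum(
        predict(ex["question"], ex["answer_options"]) == ex["correct_answer"]
        for ex in mcqs
    )
    return correct / len(mcqs)
```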