google

AfriMed-QA: Benchmarking large language models for global health

AfriMed-QA is a comprehensive benchmarking suite designed to address the critical gap in medical LLM evaluation for African healthcare contexts. Developed through a partnership between Google Research and a pan-African consortium, the project demonstrates that current models often struggle with geographic distribution shifts in disease and localized linguistic nuances. The researchers conclude that diverse, region-specific datasets are essential for training equitable AI tools that can safely provide clinical decision support in low-resource settings.

## Limitations of Western-Centric Benchmarks

* Existing medical benchmarks like USMLE MedQA focus on Western clinical contexts, which may not generalize to other regions.
* Models trained on traditional datasets often fail to account for specific distribution shifts in disease types and cultural symptom descriptions.
* The lack of diverse data makes it difficult to assess how LLMs handle variations in language and linguistics, even when the primary language is English.

## The AfriMed-QA Dataset Composition

* The dataset contains approximately 15,000 clinically diverse questions and answers sourced from 16 African countries.
* It covers 32 medical specialties, ranging from neurosurgery and internal medicine to infectious diseases and obstetrics.
* The content is divided into three distinct formats: 4,000+ expert multiple-choice questions (MCQs), 1,200 open-ended short-answer questions (SAQs), and 10,000 consumer-style queries.
* Data was crowdsourced from 621 contributors across 60 medical schools to ensure a broad representation of the continent's medical landscape.

## Data Collection and Curation Methodology

* Researchers adapted a specialized web-based platform, originally built by Intron Health, to facilitate large-scale crowdsourcing across different regions.
* To protect privacy, consumer queries were generated by prompting users with specific disease scenarios rather than asking for personal health information.
* The curation process included custom user interfaces for quality reviews and blinded human evaluations by clinical experts to ensure the accuracy of reference answers.

## LLM Performance and Evaluation Results

* The study benchmarked 30 general and biomedical LLMs, evaluating them for accuracy, semantic similarity, and human preference.
* A significant performance gap exists between model sizes; larger models consistently outperformed smaller models on the AfriMed-QA benchmark.
* This trend highlights a challenge for low-resource settings, where smaller, specialized models are often preferred for on-device or edge deployment due to infrastructure constraints.
* The dataset has already been utilized to improve Google’s MedGemma, demonstrating its utility in training multimodal medical models.

The AfriMed-QA benchmark datasets and evaluation code have been open-sourced on Hugging Face and GitHub to support the global research community. Developers are encouraged to use these tools to build and refine medical AI that is more inclusive and effective for the Global South.
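
As a rough illustration of the accuracy part of the evaluation, the sketch below scores multiple-choice predictions per specialty. The record schema (`specialty`, `reference`, `prediction`) and the sample rows are invented for the example and are not the AfriMed-QA data format or the project's evaluation code.

```python
from collections import defaultdict


def mcq_accuracy_by_specialty(records):
    """Return {specialty: accuracy} for multiple-choice records.

    Each record is a dict with hypothetical keys 'specialty',
    'reference' (correct option letter) and 'prediction' (model's letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        specialty = record["specialty"]
        total[specialty] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if record["prediction"].strip().upper() == record["reference"].strip().upper():
            correct[specialty] += 1
    return {specialty: correct[specialty] / total[specialty] for specialty in total}


# Made-up rows, only to show the call shape.
sample = [
    {"specialty": "Infectious diseases", "reference": "B", "prediction": "b"},
    {"specialty": "Infectious diseases", "reference": "C", "prediction": "A"},
    {"specialty": "Obstetrics", "reference": "D", "prediction": "D"},
]
print(mcq_accuracy_by_specialty(sample))
# → {'Infectious diseases': 0.5, 'Obstetrics': 1.0}
```

Breaking accuracy down by specialty rather than reporting one number makes it easier to see where a model's distribution shift problems concentrate.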

netflix

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale | by Netflix Technology Blog | Netflix TechBlog

Netflix’s Muse platform has evolved from a simple dashboard into a high-scale Online Analytical Processing (OLAP) system that processes trillions of rows to provide creative insights for promotional media. To meet growing demands for complex audience affinity analysis and advanced filtering, the engineering team modernized the data serving layer by moving beyond basic batch pipelines. By integrating HyperLogLog sketches for approximate counting and leveraging in-memory precomputed aggregates, the system now delivers low-latency performance and high data accuracy at an immense scale.

### Approximate Counting with HyperLogLog (HLL) Sketches

To track metrics like unique impressions and qualified plays without the massive overhead of comparing billions of profile IDs, Muse utilizes the Apache DataSketches library.

* The system trades a small margin of error (approximately 0.8% with a logK of 17) for significant gains in processing speed and memory efficiency.
* Sketches are built during Druid ingestion using the `HLLSketchBuild` aggregator with rollup enabled to reduce data volume.
* In the Spark ETL process, all-time aggregates are maintained by merging new daily HLL sketches into existing ones using the `hll_union` function.

### Utilizing Hollow for In-Memory Aggregates

To reduce the query load on the Druid cluster, Netflix uses Hollow, an open-source library developed at Netflix for high-density, near-cache data sets.

* Muse stores precomputed, all-time aggregates (such as lifetime impressions per asset) within Hollow’s in-memory data structures.
* When a user requests "all-time" data, the application retrieves the results from the Hollow cache instead of forcing Druid to scan months or years of historical segments.
* This approach significantly lowers latency for the most common queries and frees up Druid resources for more complex, dynamic filtering tasks.

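
To make the daily-into-all-time sketch merge from the approximate counting section more concrete, here is a textbook HyperLogLog in plain Python. It is a teaching sketch, not the Apache DataSketches implementation or Spark's `hll_union`; the class name `TinyHLL`, the SHA-256 hashing, and the small `lg_k` are all assumptions made for illustration.

```python
import hashlib
import math


class TinyHLL:
    """Minimal textbook HyperLogLog (illustrative only)."""

    def __init__(self, lg_k=12):
        self.lg_k = lg_k
        self.k = 1 << lg_k              # number of registers
        self.registers = [0] * self.k

    def add(self, item):
        # Derive a 64-bit hash from the item.
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        tail_bits = 64 - self.lg_k
        index = h >> tail_bits                  # top lg_k bits pick a register
        tail = h & ((1 << tail_bits) - 1)       # remaining bits
        rank = tail_bits - tail.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def union(self, other):
        # Merging two sketches is a register-wise maximum.
        merged = TinyHLL(self.lg_k)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.k)   # bias constant, valid for k >= 128
        raw = alpha * self.k ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.k and zeros:
            return self.k * math.log(self.k / zeros)  # small-range correction
        return raw


# Two "daily" sketches with overlapping profile IDs (synthetic data).
day1, day2 = TinyHLL(), TinyHLL()
for profile_id in range(0, 30_000):
    day1.add(profile_id)
for profile_id in range(20_000, 50_000):
    day2.add(profile_id)

all_time = day1.union(day2)
print(round(all_time.estimate()))  # close to 50,000 distinct IDs
```

Because the union is a register-wise maximum, merging daily sketches gives the same registers as building one sketch over all the raw IDs, which is what makes incremental all-time aggregates possible without rescanning history.
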
### Optimizing the Druid Data Layer

Efficient data retrieval from Druid is critical for supporting the application’s advanced grouping and filtering capabilities.

* The team transitioned from hash-based partitioning to range-based partitioning on frequently filtered dimensions like `video_id` to improve data locality and pruning.
* Background compaction tasks are utilized to merge small segments into larger ones, reducing metadata overhead and improving scan speeds across the cluster.
* Specific tuning was applied to the Druid broker and historical nodes, including adjusting processing threads and buffer sizes to handle the high-concurrency demands of the Muse UI.

### Validation and Data Accuracy

Because the move to HLL sketches introduces approximation, the team implemented rigorous validation processes to ensure the data remained actionable.

* Internal debugging tools were developed to compare results from the new architecture against the "ground truth" provided by legacy batch systems.
* Continuous monitoring ensures that HLL error rates remain within the expected 1–2% range and that data remains consistent across different time grains.

For organizations building large-scale OLAP applications, the Muse architecture demonstrates that performance bottlenecks can often be solved by combining approximate data structures with specialized in-memory caches to offload heavy computations from the primary database.
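
The comparison against ground truth described under validation can be sketched as a relative-error check. The function names, the asset keys, and the numbers below are hypothetical; only the 2% tolerance is taken from the upper end of the error range quoted above.

```python
def relative_error(approx, exact):
    """Absolute difference as a fraction of the exact value."""
    return abs(approx - exact) / exact


def find_outliers(approx_counts, exact_counts, tolerance=0.02):
    """Return {key: relative_error} for keys whose error exceeds the tolerance."""
    outliers = {}
    for key, exact in exact_counts.items():
        if exact <= 0:
            continue  # relative error is undefined for empty ground truth
        error = relative_error(approx_counts[key], exact)
        if error > tolerance:
            outliers[key] = error
    return outliers


# Synthetic counts: asset_a is 0.8% off, asset_b is 5% off.
exact = {"asset_a": 1_000_000, "asset_b": 250_000}
approx = {"asset_a": 1_008_000, "asset_b": 262_500}
print(find_outliers(approx, exact))  # only asset_b is flagged
```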

google

Time series foundation models can be few-shot learners

Researchers at Google have introduced TimesFM-ICF, a foundation model that enables time-series forecasting to transition from zero-shot to few-shot learning via in-context fine-tuning. By utilizing continued pre-training and specialized separator tokens, the model learns to adapt to a handful of related examples at inference time without requiring the complex supervised fine-tuning typically needed for task-specific optimization. This approach effectively matches or exceeds the performance of specialized models while maintaining the flexibility of a general-purpose foundation model.

### Overcoming the Limitations of Zero-Shot Models

* Traditional forecasting often requires building separate, specialized models for every unique task, which is resource-intensive and slow.
* While zero-shot models like the original TimesFM provide immediate forecasts without task-specific training, they cannot incorporate relevant context, such as data from nearby sensors or similar historical patterns.
* The In-Context Fine-tuning (ICF) approach allows the model to "learn" from a few examples provided at the time of prediction, similar to how Large Language Models (LLMs) use few-shot prompting.

### Architecture and the Common Separator Token

* TimesFM-ICF utilizes a patched decoder architecture that tokenizes 32 contiguous timepoints into a single input token.
* To prevent the model from conflating different data streams (such as separate store locations or distinct time periods), researchers introduced a "common separator token" as a digital boundary between examples.
* The model processes these tokens through a transformer stack using causal self-attention (CSA), ensuring it learns from historical context without accidentally "peeking" into the future.
* A shared multilayer perceptron (MLP) translates the processed output tokens back into a forecast spanning 128 timepoints.

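
The input layout described above can be sketched in plain Python: series are cut into 32-point patches, in-context examples are delimited by a separator, and a causal mask restricts each token to earlier positions. This is only a data-layout illustration under assumptions; the `"<SEP>"` string stands in for a learned separator embedding, and the exact placement of separators and handling of leftover points are guesses, not the TimesFM-ICF implementation.

```python
PATCH_LEN = 32   # input patch length stated in the summary
SEP = "<SEP>"    # stand-in for the learned common separator token


def to_patches(series, patch_len=PATCH_LEN):
    """Split a series into contiguous patches, dropping the oldest leftover points."""
    start = len(series) % patch_len
    return [tuple(series[i:i + patch_len]) for i in range(start, len(series), patch_len)]


def build_context(examples, target_history):
    """Concatenate patched in-context examples, each followed by SEP, then the target."""
    tokens = []
    for example in examples:
        tokens.extend(to_patches(example))
        tokens.append(SEP)
    tokens.extend(to_patches(target_history))
    return tokens


def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Two related series of 64 points each, plus 96 points of target history.
examples = [list(range(64)), list(range(100, 164))]
target = list(range(200, 296))
tokens = build_context(examples, target)
print(len(tokens))                     # 2 + 1 + 2 + 1 + 3 = 9 tokens
print(causal_mask(len(tokens))[0][:3]) # first token sees only itself
```
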
### Performance Benchmarking and Results

* The model was evaluated on 23 unseen datasets, using the Mean Absolute Scaled Error (MASE) metric to aggregate performance across diverse time-series tasks.
* TimesFM-ICF demonstrated a significant performance boost over the original zero-shot TimesFM and other state-of-the-art foundation models like Moirai and Lag-Llama.
* Test results showed that providing just a few in-context examples allowed the model to match the accuracy of supervised fine-tuning, which normally requires much more computational overhead and data curation.

TimesFM-ICF represents a practical shift for businesses managing diverse data streams, offering a way to achieve high-accuracy forecasts by simply providing a few relevant historical examples. For those looking to optimize inventory or energy demands, this method provides the precision of a custom-tuned model with the deployment speed of a pre-trained foundation model.
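
For readers unfamiliar with MASE, the metric divides the forecast's mean absolute error by the mean absolute error of a naive "repeat the previous value" forecast on the history, so values below 1 beat the naive baseline. The function below is a minimal sketch of that standard definition; how the study aggregates MASE across its 23 datasets is not shown here.

```python
def mase(actual, forecast, history, season=1):
    """Mean Absolute Scaled Error.

    actual, forecast: values over the forecast horizon.
    history: in-sample values used to scale the error.
    season: lag of the naive baseline (1 = previous value).
    """
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(history[i] - history[i - season]) for i in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale


history = [10, 12, 14, 16]   # naive one-step errors are all 2
actual = [18, 20]
forecast = [17, 21]          # each forecast is off by 1
print(mase(actual, forecast, history))  # → 0.5
```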

line

P-Canvas: An Engineering Technique for

The concept of "Managing Engineering" treats team leadership as a systematic process designed to reduce the "reproduction costs" of solving recurring human and organizational challenges. By implementing the P-Canvas framework, managers can move away from abstract, directionless 1-on-1 meetings and toward a data-driven approach that visualizes a member's growth and psychological state. This methodology concludes that management can be systemized just like software engineering, allowing leads to proactively identify and resolve team issues through trend analysis and visual indicators.

### The Concept of Managing Engineering

* Engineering is defined as the act of lowering reproduction costs; if a solution to a problem can be reused by others to save time and effort, it is considered engineering.
* Applying this logic to management involves creating reusable frameworks for handling complex interpersonal relationships, professional growth, and team care.
* The goal is to move beyond "neglect disguised as autonomy" by building a system that ensures team members are truly supported rather than just left to work independently.

### Structure and Design of P-Canvas

* P-Canvas is a visual management framework consolidated into a single page, updated monthly over a five-month cycle to track changes over time.
* The framework utilizes three 2D coordinate systems to map complex nuances: communication proactivity, the relationship between growth and performance, and the emotional state regarding stable versus challenging tasks.
* Scale-based indicators measure quantitative factors such as workload distribution, project participation, job satisfaction, motivation levels, and the degree of "radical candor" practiced by the member.
* A hexagonal skill chart tracks six dimensions of competency: communication, team-specific values (Platform 10 rules), job expertise, work completion, knowledge generalization, and cultural contribution.

### Data-Driven 1-on-1s and Problem Identification

* The framework shifts 1-on-1 conversations from vague questions like "How are you?" to specific inquiries based on data patterns, such as a sudden dip in satisfaction paired with a rise in candor.
* It functions as an early warning system, allowing leads to detect signs of burnout or interpersonal conflict before they escalate into long-term performance issues.
* By visualizing data, the lead and the member can engage in "joint problem-solving," identifying whether a decline in motivation is due to unclear roles, cultural clashes, or inefficient processes.
* The system emphasizes the "trajectory of change" rather than absolute scores, focusing on how a member recovers and grows following specific management interventions or project shifts.

### Benefits of Visualized Management

* Proactive Intervention: Leads can catch subtle signals of dissatisfaction early through shifting data points rather than waiting for a member to voice a complaint.
* Objective Communication: The presence of a visual chart provides a neutral ground for discussing sensitive topics, making it easier for members to express their feelings through data.
* Verification of Support: The framework allows leads to track the effectiveness of their own management actions by observing if a member’s indicators improve in subsequent months.

Implementing a tool like P-Canvas is highly recommended for leads who find traditional 1-on-1 meetings too abstract or difficult to facilitate. By treating management as an engineering discipline, leaders can create a more predictable and supportive environment where individual growth is measured not just by output, but by a holistic view of a member’s professional and emotional well-being.

google

Deep researcher with test-time diffusion

Google Cloud researchers have introduced Test-Time Diffusion Deep Researcher (TTD-DR), a framework that treats long-form research report writing as an iterative diffusion process. By mimicking human research patterns, the system treats initial drafts as "noisy" versions that are gradually polished through retrieval-augmented denoising and self-evolutionary algorithms. This approach achieves state-of-the-art results in generating comprehensive academic-style reports and solving complex multi-hop reasoning tasks.

### The Backbone DR Architecture

The system operates through a three-stage pipeline designed to transition from a broad query to a detailed final document:

* **Research Plan Generation:** Upon receiving a query, the agent produces a structured outline of key areas to guide the subsequent information-gathering process.
* **Iterative Search Agents:** Two sub-agents work in tandem; one formulates specific search questions based on the plan, while the other performs Retrieval-Augmented Generation (RAG) to synthesize precise answers from available sources.
* **Final Report Synthesis:** The agent combines the initial research plan with the accumulated question-answer pairs to produce a coherent, evidence-based final report.

### Component-wise Self-Evolution

To ensure high-quality inputs at every stage, the framework employs a self-evolutionary algorithm that optimizes the performance of individual agents:

* **Diverse Variant Generation:** The system explores multiple diverse answer variants to cover a larger search space and identify the most valuable information.
* **Environmental Feedback:** An "LLM-as-a-judge" assesses these variants using auto-raters for metrics like helpfulness and comprehensiveness, providing specific textual feedback for improvement.
* **Revision and Cross-over:** Variants undergo iterative revisions based on feedback before being merged into a single, high-quality output that consolidates the best information from all evolutionary paths.

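
The generate, judge, revise, and merge steps above can be pictured as a small control loop. The sketch below shows only that control flow: `generate`, `judge`, `revise`, and `merge` are hypothetical callables that would wrap LLM calls in a real system, and the toy stand-ins used here do no actual reasoning. It is not the TTD-DR implementation.

```python
def self_evolve(task, generate, judge, revise, merge, n_variants=3, n_rounds=2):
    """Evolve several answer variants with judge feedback, then merge them."""
    # Diverse variant generation: one candidate per seed.
    variants = [generate(task, seed) for seed in range(n_variants)]
    for _ in range(n_rounds):
        # Environmental feedback, then revision of each variant.
        variants = [revise(variant, judge(task, variant)) for variant in variants]
    # Cross-over: consolidate all evolutionary paths into one output.
    return merge(variants)


# Toy stand-ins (assumptions for illustration only).
def generate(task, seed):
    return f"answer-{seed}"

def judge(task, variant):
    return {"score": len(variant), "feedback": "add evidence"}

def revise(variant, review):
    return variant + "*"   # mark one round of revision

def merge(variants):
    return " | ".join(variants)


print(self_evolve("example query", generate, judge, revise, merge))
# → answer-0** | answer-1** | answer-2**
```
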
### Report-level Refinement via Diffusion

The core innovation of TTD-DR is modeling the writing process as a denoising diffusion mechanism:

* **Messy-to-Polished Transformation:** The framework treats the initial rough draft as a noisy input that requires cleaning through factual verification.
* **Denoising with Retrieval:** The agent identifies missing information or weak arguments in the draft and uses search tools as a "denoising step" to inject new facts and strengthen the content.
* **Continuous Improvement Loop:** This process repeats in cycles, where each iteration uses newly retrieved information to refine the draft into a more accurate and high-quality final version.

TTD-DR demonstrates that shifting AI development from linear generation to iterative, diffusion-based refinement significantly improves the depth and rigor of long-form content. This methodology serves as a powerful blueprint for building autonomous agents capable of handling complex, multi-step knowledge tasks.
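
The draft-refinement cycle can be sketched as a loop that looks for gaps, retrieves evidence for them, and revises the draft until nothing is missing or a step budget runs out. All names here (`refine_report`, `find_gaps`, `search`, `revise`, the `NOTES` lookup) are hypothetical placeholders for LLM and search-tool calls; the dictionary-based draft is a simplification chosen to keep the example runnable.

```python
def refine_report(draft, find_gaps, search, revise, max_steps=5):
    """Repeatedly 'denoise' a draft using retrieved evidence."""
    for _ in range(max_steps):
        gaps = find_gaps(draft)
        if not gaps:
            break  # nothing left to fix
        # Retrieval acts as the denoising signal for each gap.
        evidence = {gap: search(gap) for gap in gaps}
        draft = revise(draft, evidence)
    return draft


# Toy stand-ins: a draft is a dict of section -> text, None marks a gap.
NOTES = {"method": "retrieved note on method", "results": "retrieved note on results"}

def find_gaps(draft):
    return [section for section, text in draft.items() if text is None]

def search(gap):
    return NOTES.get(gap, "no source found")

def revise(draft, evidence):
    return {**draft, **evidence}


rough = {"intro": "overview", "method": None, "results": None}
print(refine_report(rough, find_gaps, search, revise))
```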

google

Sensible Agent: A framework for unobtrusive interaction with proactive AR agents

Sensible Agent is a research prototype designed to move AR agents beyond explicit voice commands toward proactive, context-aware assistance. By leveraging real-time multimodal sensing of a user's environment and physical state, the framework ensures digital help is delivered unobtrusively through the most appropriate interaction modalities. This approach fundamentally reshapes human-computer interaction by anticipating user needs while minimizing cognitive and social disruption.

## Contextual Understanding via Multimodal Parsing

The framework begins by analyzing the user's immediate surroundings to establish a baseline for assistance.

* A Vision-Language Model (VLM) processes egocentric camera feeds from the AR headset to identify high-level activities and locations.
* YAMNet, a pre-trained audio event classifier, monitors environmental noise levels to determine if audio feedback is appropriate.
* The system synthesizes these inputs into a parsed context that accounts for situational impairments, such as when a user’s hands are occupied.

## Reasoning with Proactive Query Generation

Once the context is established, the system determines the specific type of assistance required through a sophisticated reasoning process.

* The framework uses chain-of-thought (CoT) reasoning to decompose complex problems into intermediate logical steps.
* Few-shot learning, guided by examples from data collection studies, helps the model decide between actions like providing translations or displaying a grocery list.
* The generator outputs a structured suggestion that includes the specific action, the query format (e.g., binary choice or icons), and the presentation modality (visual, audio, or both).

## Dynamic Modality and Interaction Management

The final stage of the framework manages how the agent communicates with the user and how the user can respond without breaking their current flow.
* The prototype, built on Android XR and WebXR, utilizes a UI Manager to render visual panels or generate text-to-speech (TTS) prompts based on the agent's decision.
* An Input Modality Manager activates the most discreet response methods available, such as head gestures (nods), hand gestures (thumbs up), or gaze tracking.
* This adaptive selection ensures that if a user is in a noisy room or a social setting, the agent can switch from verbal interaction to subtle visual cues and gesture-based confirmations.

By prioritizing social awareness and context-sensitivity, Sensible Agent provides a blueprint for AR systems that feel like helpful companions rather than intrusive tools. Implementing such frameworks is essential for making proactive digital assistants practical and acceptable for long-term, everyday use in public and private spaces.
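
To illustrate the kind of decision the modality managers make, here is a small rule-based stand-in. The prototype described above relies on model-driven reasoning rather than fixed rules, so this is only a sketch of the input/output shape; the `Context` fields, the rule set, and the modality names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Parsed situation flags (hypothetical simplification of the parsed context)."""
    noisy: bool            # loud environment, audio would be hard to hear
    hands_busy: bool       # hand gestures are impractical
    in_conversation: bool  # speaking aloud would be socially disruptive


def choose_modalities(ctx):
    """Return (output_modality, allowed_input_modalities) for a context."""
    quiet_needed = ctx.noisy or ctx.in_conversation
    output = "visual" if quiet_needed else "visual+audio"
    inputs = ["head_nod", "gaze"]   # discreet inputs are always available
    if not ctx.hands_busy:
        inputs.append("thumbs_up")
    if not quiet_needed:
        inputs.append("speech")
    return output, inputs


# Cooking in a loud kitchen: hands occupied, audio unsuitable.
print(choose_modalities(Context(noisy=True, hands_busy=True, in_conversation=False)))
# → ('visual', ['head_nod', 'gaze'])
```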

google

Making LLMs more accurate by using all of their layers

Self Logits Evolution Decoding (SLED) is a novel decoding strategy designed to reduce hallucinations and improve the factual accuracy of large language models without requiring external data or fine-tuning. By leveraging the internal representations of all model layers rather than just the final output, SLED aligns generation with the model’s intrinsic knowledge more effectively. Research shows that this approach consistently enhances performance across diverse tasks, including complex reasoning, multiple-choice questions, and open-ended generation.

## Limitations of Standard Decoding

* Standard LLMs typically generate text by relying solely on the "logits" (prediction scores) of the final layer to determine the next token.
* This process often leads to hallucinations because the final layer may prioritize "popular" or common patterns from training data over factual accuracy.
* While techniques like Retrieval Augmented Generation (RAG) provide external context, they increase system complexity and do not address the model's internal tendency to ignore subtle contextual cues during the final projection.

## The Technical Mechanism of SLED

* SLED utilizes "early exit" logits from every intermediate layer of the Transformer architecture, rather than just the final one.
* The strategy reuses the model's final projection matrix on these intermediate layers to create multiple probability distributions across the same set of potential tokens.
* By calculating a weighted average of the distributions from all layers, SLED refines the prediction to better reflect the model's latent knowledge.
* This multi-layer approach allows the model to catch nuances (such as specific math constraints or geographic facts) that might be "smoothed over" by the final layer’s preference for high-probability sequences.

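
The mechanism as summarized above (project every layer's hidden state with the final projection matrix, then take a weighted average of the resulting distributions) can be sketched with tiny hand-made matrices. This follows the summary's simplified description and is not the published SLED update rule; the function names, the toy dimensions, and the uniform layer weights are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project(hidden, unembed):
    """Early-exit logits: hidden (dim) times unembed (dim x vocab)."""
    vocab = len(unembed[0])
    return [sum(h * unembed[d][t] for d, h in enumerate(hidden)) for t in range(vocab)]


def layer_mixture(hidden_states, unembed, weights):
    """Weighted average of per-layer next-token distributions."""
    dists = [softmax(project(h, unembed)) for h in hidden_states]
    total_weight = sum(weights)
    vocab = len(unembed[0])
    return [
        sum(w * dist[t] for w, dist in zip(weights, dists)) / total_weight
        for t in range(vocab)
    ]


# Toy model: hidden size 2, vocabulary of 3 tokens, two layers.
unembed = [[1.0, 0.0, -1.0],
           [0.0, 1.0, 0.0]]
hidden_states = [[0.5, 1.0],   # intermediate layer
                 [2.0, 0.0]]   # final layer
mixed = layer_mixture(hidden_states, unembed, weights=[1.0, 1.0])
print([round(p, 3) for p in mixed])
```

In this toy case the final layer alone favours token 0, while the intermediate layer favours token 1; the mixture shifts probability toward token 1 without discarding the final layer's preference.
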
## Practical Performance and Reasoning

* In chain-of-thought tasks, SLED helps the model maintain logic; for example, it can correctly identify when a discount should be applied in a math problem by favoring intermediate layers that recognize the "if/then" logic over a simple arithmetic pattern.
* The method is model-agnostic and has shown consistent accuracy gains across various LLM scales and configurations.
* SLED is highly flexible and can be integrated with existing factuality decoding methods or speculative decoding to further reduce hallucinations without the need for additional training data.

For developers and researchers seeking to boost the reliability of LLMs, SLED offers a computationally efficient alternative to fine-tuning. By simply adjusting the decoding strategy to incorporate the rich information available in intermediate layers, models can achieve higher factuality and more robust reasoning capabilities in real-world applications.

google

Learn Your Way: Reimagining textbooks with generative AI

Google Research has introduced Learn Your Way, an AI-driven educational experiment that reimagines traditional textbooks as personalized, multimodal learning journeys. By leveraging the LearnLM family of models integrated into Gemini 2.5 Pro, the system transforms static source material into tailored content based on a student’s specific grade level and interests. Early efficacy studies demonstrate that this approach significantly enhances retention, with students scoring 11 percentage points higher than those using standard digital readers.

### Pedagogical Foundations and Dual Coding

The research is built on the "dual coding theory," which suggests that forming mental connections between different representations of information strengthens conceptual understanding.

* The system moves away from a "one-size-fits-all" model toward a student-driven experience where learners can choose and intermix formats.
* Personalization is used as a tool to enhance situational interest and motivation by adapting content to specific student attributes.
* The framework incorporates active learning through real-time quizzing and feedback to address knowledge gaps as they arise.

### The Personalization Pipeline

The technical architecture begins with a layered pipeline that processes source material, such as a textbook PDF, to create a foundational text for all other formats.

* The original material is first "re-leveled" to match the learner’s reported grade level while maintaining the integrity and scope of the curriculum.
* Generic examples within the text are strategically replaced with personalized examples based on user interests, such as sports, music, or food.
* This personalized base text serves as the primary input for generating all subsequent multimodal representations, ensuring consistency across formats.

### Multimodal Content Generation

To produce a wide variety of educational assets, the system utilizes a combination of large language models and specialized AI agents.
* **Agentic Workflows:** While tools like mind maps and timelines are generated directly by Gemini, complex assets like narrated slides use multi-step agentic workflows to ensure pedagogical effectiveness.
* **Custom Visuals:** Because general-purpose image models often struggle with educational accuracy, the researchers fine-tuned a dedicated model specifically for generating educational illustrations.
* **Diverse Representations:** The interface provides "immersive text" with embedded questions, audio lessons for auditory learning, and interactive slides that mimic recorded classroom sessions.

### Research Outcomes and Future Application

The project’s effectiveness was validated through a study comparing the GenAI approach against standard digital reading materials.

* Students using the personalized AI tools showed a significant improvement in retention test scores.
* Beyond retention, the system aims to transform passive reading into an active, multimodal experience that follows established learning science principles.
* The "Learn Your Way" experiment is currently available on Google Labs, providing a practical look at how adaptive, learner-centric materials might replace static textbooks in future K-12 and higher education settings.