google

The anatomy of a personal health agent

Google researchers have developed the Personal Health Agent (PHA), an LLM-powered prototype designed to provide evidence-based, personalized health insights by analyzing multimodal data from wearables and blood biomarkers. By utilizing a specialized multi-agent architecture, the system deconstructs complex health queries into specific tasks to ensure statistical accuracy and clinical grounding. The study demonstrates that this modular approach significantly outperforms standard large language models in providing reliable, data-driven wellness support.

## Multi-Agent System Architecture

* The PHA framework adopts a "team-based" approach, utilizing three specialist sub-agents: a Data Science agent, a Domain Expert agent, and a Health Coach.
* The system was validated using a real-world dataset from 1,200 participants, featuring longitudinal Fitbit data, health questionnaires, and clinical blood test results.
* This architecture was designed after a user-centered study of 1,300 health queries, identifying four key needs: general knowledge, data interpretation, wellness advice, and symptom assessment.
* Evaluation involved over 1,100 hours of human expert effort across 10 benchmark tasks to ensure the system outperformed base models like Gemini.

## The Data Science Agent

* This agent specializes in "contextualized numerical insights," transforming ambiguous queries (e.g., "How is my fitness trending?") into formal statistical analysis plans.
* It operates through a two-stage process: first interpreting the user's intent and data sufficiency, then generating executable code to analyze time-series data.
* In benchmark testing, the agent achieved a 75.6% score in analysis planning, significantly higher than the 53.7% score achieved by the base model.
* The agent's code generation was validated against 173 rigorous unit tests written by human data scientists to ensure accuracy in handling wearable sensor data.

## The Domain Expert Agent

* Designed for high-stakes medical accuracy, this agent functions as a grounded source of health knowledge using a multi-step reasoning framework.
* It utilizes a "toolbox" approach, granting the LLM access to authoritative external databases such as the National Center for Biotechnology Information (NCBI) to provide verifiable facts.
* The agent is specifically tuned to tailor information to the user’s unique profile, including specific biomarkers and pre-existing medical conditions.
* Performance was measured through board certification and coaching exam questions, as well as its ability to provide accurate differential diagnoses compared to human clinicians.

While currently a research framework rather than a public product, the PHA demonstrates that a modular, specialist-driven AI architecture is essential for safe and effective personal health management. Developers of future health-tech tools should prioritize grounding LLMs in external clinical databases and implementing rigorous statistical validation stages to move beyond the limitations of general-purpose chatbots.
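The team-based decomposition described above can be sketched as a simple dispatcher. Everything below is an illustrative assumption, not the paper's implementation: the agent classes are stubs, and the keyword router stands in for the LLM-based orchestration the real PHA uses.

```python
# Illustrative multi-agent dispatch in the spirit of PHA. The routing
# keywords and agent classes are invented for demonstration; the real
# system decomposes queries with an LLM, not keyword matching.

class DataScienceAgent:
    def handle(self, query: str) -> str:
        return f"[analysis plan for: {query}]"

class DomainExpertAgent:
    def handle(self, query: str) -> str:
        return f"[grounded medical answer for: {query}]"

class HealthCoachAgent:
    def handle(self, query: str) -> str:
        return f"[coaching advice for: {query}]"

ROUTES = {
    "data": DataScienceAgent(),        # trends, metrics, wearable data
    "knowledge": DomainExpertAgent(),  # medical facts, biomarkers
    "coaching": HealthCoachAgent(),    # habits, wellness advice
}

def classify(query: str) -> str:
    """Toy intent classifier standing in for an LLM-based router."""
    q = query.lower()
    if any(w in q for w in ("trend", "average", "steps")):
        return "data"
    if any(w in q for w in ("cholesterol", "biomarker", "symptom")):
        return "knowledge"
    return "coaching"

def answer(query: str) -> str:
    return ROUTES[classify(query)].handle(query)
```

The point of the pattern is that each sub-agent can be benchmarked and improved in isolation, which is how the study attributes its gains over a monolithic base model.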

google

AI as a research partner: Advancing theoretical computer science with AlphaEvolve

AlphaEvolve, an LLM-powered coding agent developed by Google DeepMind, facilitates mathematical discovery by evolving code to find complex combinatorial structures that are difficult to design manually. By utilizing a "lifting" technique, the system discovers finite structures that can be plugged into existing proof frameworks to establish new universal theorems in complexity theory. This methodology has successfully produced state-of-the-art results for the MAX-4-CUT problem and tightened bounds on the hardness of certifying properties in random graphs.

## The Role of AlphaEvolve in Mathematical Research

* The system uses an iterative feedback loop to morph code snippets, evaluating the resulting mathematical structures and refining the code toward more optimal solutions.
* AlphaEvolve operates as a tool-based assistant that generates specific proof elements, which can then be automatically verified by computer programs to ensure absolute mathematical correctness.
* By focusing on verifiable finite structures, the agent overcomes the common "hallucination" issues of LLMs, as the final output is a computationally certified object rather than a speculative text-based proof.

## Bridging Finite Discovery and Universal Statements through Lifting

* Theoretical computer science often requires proofs that hold true for all problem sizes ($\forall n$), a scale that AI systems typically struggle to address directly.
* The "lifting" technique treats a proof as a modular structure where a specific finite component—such as a combinatorial gadget—can be replaced with a more efficient version while keeping the rest of the proof intact.
* When AlphaEvolve finds a superior finite structure, the improvement is "lifted" through the existing mathematical framework to yield a stronger universal theorem without requiring a human to redesign the entire logical architecture.

## Optimizing Gadget Reductions and MAX-k-CUT

* Researchers applied the agent to "gadget reductions," which are recipes used to map known intractable problems to new ones to prove computational hardness (NP-hardness).
* AlphaEvolve discovered complex gadgets that were previously unknown because they were too intricate for researchers to construct by hand.
* These discoveries led to a new state-of-the-art inapproximability result for the MAX-4-CUT problem, defining more precise limits on how accurately the problem can be solved by any efficient algorithm.

## Advancing Average-Case Hardness in Random Graphs

* The agent was tasked with uncovering structures related to the average-case hardness of certifying properties within random graphs.
* By evolving better combinatorial structures for these specific instances, the team was able to tighten existing mathematical bounds, providing a clearer picture of when certain graph properties become computationally intractable to verify.

This research demonstrates that LLM-based agents can serve as genuine research partners by focusing on the discovery of verifiable, finite components within broader theoretical frameworks. For researchers in mathematics and computer science, this "lifting" approach provides a practical roadmap for using AI to solve bottleneck problems that were previously restricted by the limits of manual construction.
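The "evolve a candidate, verify it exactly, keep improvements" loop can be illustrated on a toy problem. In this sketch, assumed for demonstration only, the evolved object is a vertex 2-coloring scored by an exact MAX-CUT evaluator on a small fixed graph; the real system evolves code via LLM edits, but the key property is the same: the score comes from a machine-verifiable check, not from the model's own claims.

```python
import random

# Toy "evolve and verify" loop in the spirit of AlphaEvolve, applied to
# MAX-CUT on a small fixed graph. The "mutation" is a single bit flip,
# standing in for an LLM code edit; the evaluator is exact, so every
# accepted candidate is a computationally certified object.

EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 4), (3, 4), (3, 5), (4, 5)]
N = 6

def cut_value(coloring):
    """Exact, machine-verifiable score: edges crossing the cut."""
    return sum(1 for u, v in EDGES if coloring[u] != coloring[v])

def mutate(coloring):
    """Flip one vertex's side of the cut."""
    child = list(coloring)
    child[random.randrange(N)] ^= 1
    return child

def evolve(generations=2000, seed=0):
    random.seed(seed)
    best = [random.randint(0, 1) for _ in range(N)]
    best_score = cut_value(best)
    for _ in range(generations):
        child = mutate(best)
        score = cut_value(child)
        if score >= best_score:  # keep improvements (and ties, to escape plateaus)
            best, best_score = child, score
    return best, best_score
```

For this graph the optimum is 6 (each of the two triangles contributes at most 2 of its 3 edges, plus the two cross edges), which the loop finds quickly; the interesting cases in the paper are gadgets far too intricate to search by hand.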

netflix

Building a Resilient Data Platform with Write-Ahead Log at Netflix | by Netflix Technology Blog | Netflix TechBlog

Netflix has developed a distributed Write-Ahead Log (WAL) abstraction to address critical data challenges such as accidental corruption, system entropy, and the complexities of cross-region replication. By decoupling data mutation from immediate persistence and providing a unified API, this system ensures strong durability and eventual consistency across diverse storage engines. The WAL acts as a resilient buffer that powers high-leverage features like secondary indexing and delayed retry queues while maintaining the massive scale required for global operations.

### The Role of the WAL Abstraction

* The system serves as a centralized mechanism to capture data changes and reliably deliver them to downstream consumers, mitigating the risk of data loss during administrative errors or database corruption.
* It provides a simplified `WriteToLog` gRPC endpoint that abstracts underlying infrastructure, allowing developers to focus on data logic rather than the specifics of the storage layer.
* By acting as a durable intermediary, it prevents permanent data loss during incidents where primary datastores fail or require schema changes that might otherwise lead to corruption.

### Flexible Personas and Namespaces

* The architecture utilizes "namespaces" to define logical separation, allowing different services to configure specific storage backends like Kafka or SQS based on their needs.
* The "Delayed Queues" persona leverages SQS to provide a scalable way to retry failed messages in real-time pipelines without sacrificing overall system throughput.
* The system can be configured for "Cross-Region Replication," enabling high availability and disaster recovery for storage engines that do not natively support multi-region data transfer.

### Solving System Entropy and Consistency

* The WAL addresses the "dual-write" problem, where updates to primary stores (such as Cassandra) and search indices (such as Elasticsearch) can diverge over time, leading to data inconsistency.
* It facilitates reliable secondary indexing for NoSQL databases by managing updates to multiple partitions as a coordinated sequence of events.
* The platform mitigates operational risks, such as Out-of-Memory (OOM) errors on Key-Value nodes caused by bulk deletes, by staging and throttling mutations through the log.

Organizations operating at scale should adopt a WAL-centric architecture to simplify the management of heterogeneous data stores and enhance system resilience. By centralizing the mutation log, teams can implement complex features like Change Data Capture (CDC) and cross-region failover through a single, consistent interface rather than building bespoke solutions for every service.
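The dual-write fix can be seen in a minimal in-memory sketch, assuming a drastically simplified API: writers append once to the log, and both the primary store and the secondary index are derived from it in order, so they can never silently diverge. The method and store names below are stand-ins, not Netflix's gRPC interface.

```python
from dataclasses import dataclass, field

# Minimal in-memory sketch of the WAL pattern: one append-only log fans
# mutations out to a primary store and a secondary index, avoiding the
# dual-write problem. Names are simplified stand-ins for the real API.

@dataclass
class WriteAheadLog:
    entries: list = field(default_factory=list)
    consumers: list = field(default_factory=list)

    def write_to_log(self, key, value):
        """Single durable append; downstream stores derive from this."""
        self.entries.append((key, value))

    def replay(self):
        """Deliver every entry to every consumer, in log order."""
        for key, value in self.entries:
            for apply_fn in self.consumers:
                apply_fn(key, value)

primary = {}   # plays the role of e.g. Cassandra
index = {}     # plays the role of e.g. an Elasticsearch index (value -> keys)

wal = WriteAheadLog()
wal.consumers = [
    lambda k, v: primary.__setitem__(k, v),
    lambda k, v: index.setdefault(v, set()).add(k),
]

wal.write_to_log("movie:1", "drama")
wal.write_to_log("movie:2", "drama")
wal.replay()
```

Because both consumers read the same ordered log, retrying or re-replaying after a consumer failure converges to the same state, which is the eventual-consistency guarantee the abstraction provides.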

line

Into the Passionate Energy of the

The PD1 AI Hackathon 2025 served as a strategic initiative by LY Corporation to embed innovative artificial intelligence directly into the LINE messaging ecosystem. Over 60 developers collaborated during an intensive 48-hour session to transition AI from a theoretical concept into practical features for messaging, content, and internal development workflows. The event successfully produced several high-utility prototypes that demonstrate how AI can enhance user safety, creative expression, and technical productivity.

## Transforming Voice Communication through NextVoIP

* The "NextVoIP" project utilized Speech-to-Text (STT) technology to convert 1:1 and group call audio into real-time data for AI analysis.
* The system was designed to provide life security features by detecting potential emergency situations or accidents through conversation monitoring.
* AI acted as a communication assistant by suggesting relevant content and conversation topics to help maintain a seamless flow during calls.
* Features were implemented to allow callers to enjoy shared digital content together, enriched by AI-driven recommendations.

## Creative Expression with MELODY LINE

* This project focused on the intersection of technology and art by converting chat conversations into unique musical compositions.
* The system analyzed the context and emotional sentiment of messages to automatically generate melodies that matched the atmosphere of the chat.
* The implementation showcased the potential for generative AI to provide a multi-sensory experience within a standard messaging interface.

## AI-Driven QA and Test Automation

* The grand prize-winning project, "IPD," addressed the bottleneck of repetitive manual testing by automating the entire Quality Assurance lifecycle.
* AI was utilized to automatically generate and manage complex test cases, significantly reducing the manual effort required for mobile app validation.
* The system included automated test execution and a diagnostic feature that identifies the root cause of failures when a test results in an error.
* The project was specifically lauded for its immediate "production-ready" status, offering a direct path to improving development speed and software reliability.

The results of this hackathon suggest that the most immediate value for AI in large-scale messaging platforms lies in two areas: enhancing user experience through contextual awareness and streamlining internal engineering via automated QA. Organizations should look toward integrating AI-driven testing tools to reduce technical debt while exploring real-time audio and text analysis to provide proactive security and engagement features for users.

google

Towards better health conversations: Research insights on a “wayfinding” AI agent based on Gemini

Google Research has developed "Wayfinding AI," a research prototype based on Gemini designed to transform health information seeking from a passive query-response model into a proactive, context-seeking dialogue. By prioritizing clarifying questions and iterative guidance, the agent addresses the common struggle users face when attempting to articulate complex or ambiguous medical concerns. User studies indicate that this proactive approach results in health information that participants find significantly more helpful, relevant, and tailored to their specific needs than traditional AI responses.

### Challenges in Digital Health Navigation

* Formative research involving 33 participants highlighted that users often struggle to articulate health concerns because they lack the clinical background to know which details are medically relevant.
* The study found that users typically "throw words" at a search engine and sift through generic, impersonal results that do not account for their unique context.
* Initial UX testing revealed a strong user preference for a "deferred-answer" approach, where the AI mimics a medical professional by asking clarifying questions before jumping to a conclusion.

### Core Design Principles of Wayfinding AI

* **Proactive Conversational Guidance:** At every turn, the agent asks up to three targeted questions to reduce ambiguity and help users systematically share their "health story."
* **Best-Effort Answers:** To ensure immediate utility, the AI provides the best possible information based on the data available at that moment, while noting that the answer will improve as the user provides more context.
* **Transparent Reasoning:** The system explicitly explains how the user’s most recent answers have helped refine the previous response, making the AI’s internal logic understandable.

### Split-Stream User Interface

* To prevent clarifying questions from being buried in long paragraphs, the prototype uses a two-column layout.
* The left column is dedicated to the interactive chat and specific follow-up questions to keep the user focused on the dialogue.
* The right column displays the "best information so far" and detailed explanations, allowing users to dive into the technical content only when they feel enough context has been established.

### Comparative Evaluation and Performance

* A randomized study with 130 participants compared the Wayfinding AI against a baseline Gemini 2.5 Flash model.
* Participants interacted with both models for at least three minutes regarding a personal health question and rated them across six dimensions: helpfulness, question relevance, tailoring, goal understanding, ease of use, and efficiency.
* The proactive agent outperformed the baseline significantly, with participants reporting that the context-seeking behavior felt more professional and increased their confidence in the AI's suggestions.

The research suggests that for sensitive and complex topics like health, AI should move beyond being a passive knowledge base. By adopting a "wayfinding" strategy that guides users through their own information needs, AI agents can provide more personalized and empowering experiences that better mirror expert human consultation.

google

AfriMed-QA: Benchmarking large language models for global health

AfriMed-QA is a comprehensive benchmarking suite designed to address the critical gap in medical LLM evaluation for African healthcare contexts. Developed through a partnership between Google Research and a pan-African consortium, the project demonstrates that current models often struggle with geographic distribution shifts in disease and localized linguistic nuances. The researchers conclude that diverse, region-specific datasets are essential for training equitable AI tools that can safely provide clinical decision support in low-resource settings.

## Limitations of Western-Centric Benchmarks

* Existing medical benchmarks like USMLE MedQA focus on Western clinical contexts, which may not generalize to other regions.
* Models trained on traditional datasets often fail to account for specific distribution shifts in disease types and cultural symptom descriptions.
* The lack of diverse data makes it difficult to assess how LLMs handle variations in language and linguistics, even when the primary language is English.

## The AfriMed-QA Dataset Composition

* The dataset contains approximately 15,000 clinically diverse questions and answers sourced from 16 African countries.
* It covers 32 medical specialties, ranging from neurosurgery and internal medicine to infectious diseases and obstetrics.
* The content is divided into three distinct formats: 4,000+ expert multiple-choice questions (MCQs), 1,200 open-ended short-answer questions (SAQs), and 10,000 consumer-style queries.
* Data was crowdsourced from 621 contributors across 60 medical schools to ensure a broad representation of the continent's medical landscape.

## Data Collection and Curation Methodology

* Researchers adapted a specialized web-based platform, originally built by Intron Health, to facilitate large-scale crowdsourcing across different regions.
* To protect privacy, consumer queries were generated by prompting users with specific disease scenarios rather than asking for personal health information.
* The curation process included custom user interfaces for quality reviews and blinded human evaluations by clinical experts to ensure the accuracy of reference answers.

## LLM Performance and Evaluation Results

* The study benchmarked 30 general and biomedical LLMs, evaluating them for accuracy, semantic similarity, and human preference.
* A significant performance gap exists between model sizes; larger models consistently outperformed smaller models on the AfriMed-QA benchmark.
* This trend highlights a challenge for low-resource settings, where smaller, specialized models are often preferred for on-device or edge deployment due to infrastructure constraints.
* The dataset has already been utilized to improve Google’s MedGemma, demonstrating its utility in training multimodal medical models.

The AfriMed-QA benchmark datasets and evaluation code have been open-sourced on Hugging Face and GitHub to support the global research community. Developers are encouraged to use these tools to build and refine medical AI that is more inclusive and effective for the Global South.
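A minimal accuracy harness for the MCQ portion of such a benchmark might look like the sketch below. The record layout, sample questions, and `predict` stub are invented for illustration; the actual benchmark also scores semantic similarity and human preference, which this does not attempt.

```python
# Hedged sketch of an MCQ accuracy harness for AfriMed-QA-style records.
# The schema (question/options/answer index) is an assumption, not the
# released dataset's format.

def score_mcq(records, predict):
    """Fraction of records where predict() returns the correct option index."""
    correct = 0
    for rec in records:
        if predict(rec["question"], rec["options"]) == rec["answer"]:
            correct += 1
    return correct / len(records)

sample = [
    {"question": "Which mosquito genus transmits malaria?",
     "options": ["Aedes", "Anopheles", "Culex"], "answer": 1},
    {"question": "First-line investigation for suspected typhoid fever?",
     "options": ["Blood culture", "Chest X-ray", "ECG"], "answer": 0},
]

# Trivial stand-in "model" that always picks the first option.
baseline = lambda question, options: 0
```

In practice `predict` would wrap an LLM call; keeping the harness this small makes it easy to swap in the 30 models the study compares.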

netflix

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale | by Netflix Technology Blog | Netflix TechBlog

Netflix’s Muse platform has evolved from a simple dashboard into a high-scale Online Analytical Processing (OLAP) system that processes trillions of rows to provide creative insights for promotional media. To meet growing demands for complex audience affinity analysis and advanced filtering, the engineering team modernized the data serving layer by moving beyond basic batch pipelines. By integrating HyperLogLog sketches for approximate counting and leveraging in-memory precomputed aggregates, the system now delivers low-latency performance and high data accuracy at an immense scale.

### Approximate Counting with HyperLogLog (HLL) Sketches

To track metrics like unique impressions and qualified plays without the massive overhead of comparing billions of profile IDs, Muse utilizes the Apache Datasketches library.

* The system trades a small margin of error (approximately 0.8% with a logK of 17) for significant gains in processing speed and memory efficiency.
* Sketches are built during Druid ingestion using the HLLSketchBuild aggregator with rollup enabled to reduce data volume.
* In the Spark ETL process, all-time aggregates are maintained by merging new daily HLL sketches into existing ones using the `hll_union` function.

### Utilizing Hollow for In-Memory Aggregates

To reduce the query load on the Druid cluster, Netflix uses Hollow, an internal open-source tool designed for high-density, near-cache data sets.

* Muse stores precomputed, all-time aggregates—such as lifetime impressions per asset—within Hollow’s in-memory data structures.
* When a user requests "all-time" data, the application retrieves the results from the Hollow cache instead of forcing Druid to scan months or years of historical segments.
* This approach significantly lowers latency for the most common queries and frees up Druid resources for more complex, dynamic filtering tasks.

### Optimizing the Druid Data Layer

Efficient data retrieval from Druid is critical for supporting the application’s advanced grouping and filtering capabilities.

* The team transitioned from hash-based partitioning to range-based partitioning on frequently filtered dimensions like `video_id` to improve data locality and pruning.
* Background compaction tasks are utilized to merge small segments into larger ones, reducing metadata overhead and improving scan speeds across the cluster.
* Specific tuning was applied to the Druid broker and historical nodes, including adjusting processing threads and buffer sizes to handle the high-concurrency demands of the Muse UI.

### Validation and Data Accuracy

Because the move to HLL sketches introduces approximation, the team implemented rigorous validation processes to ensure the data remained actionable.

* Internal debugging tools were developed to compare results from the new architecture against the "ground truth" provided by legacy batch systems.
* Continuous monitoring ensures that HLL error rates remain within the expected 1–2% range and that data remains consistent across different time grains.

For organizations building large-scale OLAP applications, the Muse architecture demonstrates that performance bottlenecks can often be solved by combining approximate data structures with specialized in-memory caches to offload heavy computations from the primary database.
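To make the sketch-and-merge workflow concrete, here is a small pure-Python HyperLogLog, a toy in the spirit of what Muse gets from the Apache Datasketches library, not that library's implementation. It uses lg_k=14 (the post cites logK=17, ~0.8% error), but the mechanics are identical: constant-size registers per sketch, register-wise max for unions (the role `hll_union` plays in the Spark ETL), and a cardinality estimate at read time.

```python
import hashlib
from math import log

# Minimal HyperLogLog: 2^P registers, each storing the max leading-zero
# rank seen for items hashed into it. Memory is fixed regardless of how
# many distinct items are counted.

P = 14
M = 1 << P                          # number of registers
ALPHA = 0.7213 / (1 + 1.079 / M)    # standard bias constant for large M

def _hash64(item: str) -> int:
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

class HLL:
    def __init__(self):
        self.reg = [0] * M

    def update(self, item: str):
        h = _hash64(item)
        idx = h >> (64 - P)                  # top P bits pick a register
        w = h & ((1 << (64 - P)) - 1)        # remaining 64-P bits
        rho = (64 - P) - w.bit_length() + 1  # leading-zero rank (1-based)
        if rho > self.reg[idx]:
            self.reg[idx] = rho

    def merge(self, other: "HLL"):
        """Union two sketches register-wise (cf. Spark's hll_union)."""
        self.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]

    def estimate(self) -> float:
        e = ALPHA * M * M / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if e <= 2.5 * M and zeros:           # small-range correction
            e = M * log(M / zeros)
        return e
```

Because unions are register-wise maxima, merging daily sketches into an all-time sketch is associative and idempotent, which is exactly what makes incremental ETL maintenance of lifetime aggregates cheap.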

google

Time series foundation models can be few-shot learners

Researchers at Google have introduced TimesFM-ICF, a foundation model that enables time-series forecasting to transition from zero-shot to few-shot learning via in-context fine-tuning. By utilizing continued pre-training and specialized separator tokens, the model learns to adapt to a handful of related examples at inference time without requiring the complex supervised fine-tuning typically needed for task-specific optimization. This approach effectively matches or exceeds the performance of specialized models while maintaining the flexibility of a general-purpose foundation model.

### Overcoming the Limitations of Zero-Shot Models

* Traditional forecasting often requires building separate, specialized models for every unique task, which is resource-intensive and slow.
* While zero-shot models like the original TimesFM provide immediate forecasts without task-specific training, they cannot incorporate relevant context, such as data from nearby sensors or similar historical patterns.
* The In-Context Fine-tuning (ICF) approach allows the model to "learn" from a few examples provided at the time of prediction, similar to how Large Language Models (LLMs) use few-shot prompting.

### Architecture and the Common Separator Token

* TimesFM-ICF utilizes a patched decoder architecture that tokenizes 32 contiguous timepoints into a single input token.
* To prevent the model from conflating different data streams—such as separate store locations or distinct time periods—researchers introduced a "common separator token" as a digital boundary between examples.
* The model processes these tokens through a transformer stack using causal self-attention (CSA), ensuring it learns from historical context without accidentally "peeking" into the future.
* A shared multilayer perceptron (MLP) translates the processed output tokens back into a forecast spanning 128 timepoints.

### Performance Benchmarking and Results

* The model was evaluated on 23 unseen datasets, using the Mean Absolute Scaled Error (MASE) metric to aggregate performance across diverse time-series tasks.
* TimesFM-ICF demonstrated a significant performance boost over the original zero-shot TimesFM and other state-of-the-art foundation models like Moirai and Lag-Llama.
* Test results showed that providing just a few in-context examples allowed the model to match the accuracy of supervised fine-tuning, which normally requires much more computational overhead and data curation.

TimesFM-ICF represents a practical shift for businesses managing diverse data streams, offering a way to achieve high-accuracy forecasts by simply providing a few relevant historical examples. For those looking to optimize inventory or energy demands, this method provides the precision of a custom-tuned model with the deployment speed of a pre-trained foundation model.
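The input layout described above, patches of 32 timepoints with a separator between in-context examples, can be sketched as follows. This is a schematic of the token arrangement only: the real model embeds each patch and uses a learned separator token, whereas here `None` stands in for the separator and patches stay as raw arrays.

```python
import numpy as np

# Schematic of TimesFM-ICF's input layout: each series is split into
# patches of 32 contiguous timepoints, and a separator "token" is placed
# between few-shot examples so the model never conflates data streams.
# None stands in for the learned common separator token.

PATCH = 32
SEP = None

def patchify(series: np.ndarray) -> list:
    """Split a 1-D series into full patches of PATCH timepoints."""
    n = (len(series) // PATCH) * PATCH   # drop any trailing partial patch
    return [series[i:i + PATCH] for i in range(0, n, PATCH)]

def build_context(examples: list, target_history: np.ndarray) -> list:
    """Interleave example patches with separators, then the target history."""
    tokens = []
    for ex in examples:
        tokens.extend(patchify(ex))
        tokens.append(SEP)
    tokens.extend(patchify(target_history))
    return tokens
```

With causal self-attention over this sequence, the final patches of the target history can attend to every earlier example but never to future timepoints, which is the property the summary highlights.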

line

P-Canvas, an engineering technique for

The concept of "Managing Engineering" treats team leadership as a systematic process designed to reduce the "reproduction costs" of solving recurring human and organizational challenges. By implementing the P-Canvas framework, managers can move away from abstract, directionless 1-on-1 meetings and toward a data-driven approach that visualizes a member's growth and psychological state. This methodology concludes that management can be systemized just like software engineering, allowing leads to proactively identify and resolve team issues through trend analysis and visual indicators.

### The Concept of Managing Engineering

* Engineering is defined as the act of lowering reproduction costs; if a solution to a problem can be reused by others to save time and effort, it is considered engineering.
* Applying this logic to management involves creating reusable frameworks for handling complex interpersonal relationships, professional growth, and team care.
* The goal is to move beyond "neglect disguised as autonomy" by building a system that ensures team members are truly supported rather than just left to work independently.

### Structure and Design of P-Canvas

* P-Canvas is a visual management framework consolidated into a single page, updated monthly over a five-month cycle to track changes over time.
* The framework utilizes three 2D coordinate systems to map complex nuances: communication proactivity, the relationship between growth and performance, and the emotional state regarding stable versus challenging tasks.
* Scale-based indicators measure quantitative factors such as workload distribution, project participation, job satisfaction, motivation levels, and the degree of "radical candor" practiced by the member.
* A hexagonal skill chart tracks six dimensions of competency: communication, team-specific values (Platform 10 rules), job expertise, work completion, knowledge generalization, and cultural contribution.

### Data-Driven 1-on-1s and Problem Identification

* The framework shifts 1-on-1 conversations from vague questions like "How are you?" to specific inquiries based on data patterns, such as a sudden dip in satisfaction paired with a rise in candor.
* It functions as an early warning system, allowing leads to detect signs of burnout or interpersonal conflict before they escalate into long-term performance issues.
* By visualizing data, the lead and the member can engage in "joint problem-solving," identifying whether a decline in motivation is due to unclear roles, cultural clashes, or inefficient processes.
* The system emphasizes the "trajectory of change" rather than absolute scores, focusing on how a member recovers and grows following specific management interventions or project shifts.

### Benefits of Visualized Management

* Proactive Intervention: Leads can catch subtle signals of dissatisfaction early through shifting data points rather than waiting for a member to voice a complaint.
* Objective Communication: The presence of a visual chart provides a neutral ground for discussing sensitive topics, making it easier for members to express their feelings through data.
* Verification of Support: The framework allows leads to track the effectiveness of their own management actions by observing if a member’s indicators improve in subsequent months.

Implementing a tool like P-Canvas is highly recommended for leads who find traditional 1-on-1 meetings too abstract or difficult to facilitate. By treating management as an engineering discipline, leaders can create a more predictable and supportive environment where individual growth is measured not just by output, but by a holistic view of a member’s professional and emotional well-being.
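The "trajectory of change" idea, flagging month-over-month patterns rather than absolute scores, can be sketched as a few lines of code. The field names, 1-10 scales, and thresholds below are assumptions for illustration; P-Canvas itself is a visual one-page framework, not a program.

```python
from dataclasses import dataclass

# Illustrative monthly indicator snapshot and a trend rule echoing the
# example from the framework: a sudden dip in satisfaction paired with a
# rise in candor. Scales and thresholds are invented assumptions.

@dataclass
class MonthlySnapshot:
    satisfaction: int  # 1-10
    motivation: int    # 1-10
    candor: int        # 1-10: degree of radical candor practiced

def warning_signals(history: list) -> list:
    """Compare consecutive months and return human-readable flags."""
    flags = []
    for prev, cur in zip(history, history[1:]):
        if cur.satisfaction <= prev.satisfaction - 3:
            flags.append("sudden dip in satisfaction")
        if cur.satisfaction < prev.satisfaction and cur.candor > prev.candor:
            flags.append("dip in satisfaction with rising candor")
    return flags
```

Even this toy version shows the framework's point: the signal lives in the month-to-month delta, which a lead can then raise as a specific, data-grounded question in the next 1-on-1.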

google

Deep researcher with test-time diffusion

Google Cloud researchers have introduced Test-Time Diffusion Deep Researcher (TTD-DR), a framework that treats long-form research report writing as an iterative diffusion process. By mimicking human research patterns, the system treats initial drafts as "noisy" versions that are gradually polished through retrieval-augmented denoising and self-evolutionary algorithms. This approach achieves state-of-the-art results in generating comprehensive academic-style reports and solving complex multi-hop reasoning tasks.

### The Backbone DR Architecture

The system operates through a three-stage pipeline designed to transition from a broad query to a detailed final document:

* **Research Plan Generation:** Upon receiving a query, the agent produces a structured outline of key areas to guide the subsequent information-gathering process.
* **Iterative Search Agents:** Two sub-agents work in tandem; one formulates specific search questions based on the plan, while the other performs Retrieval-Augmented Generation (RAG) to synthesize precise answers from available sources.
* **Final Report Synthesis:** The agent combines the initial research plan with the accumulated question-answer pairs to produce a coherent, evidence-based final report.

### Component-wise Self-Evolution

To ensure high-quality inputs at every stage, the framework employs a self-evolutionary algorithm that optimizes the performance of individual agents:

* **Diverse Variant Generation:** The system explores multiple diverse answer variants to cover a larger search space and identify the most valuable information.
* **Environmental Feedback:** An "LLM-as-a-judge" assesses these variants using auto-raters for metrics like helpfulness and comprehensiveness, providing specific textual feedback for improvement.
* **Revision and Cross-over:** Variants undergo iterative revisions based on feedback before being merged into a single, high-quality output that consolidates the best information from all evolutionary paths.

### Report-level Refinement via Diffusion

The core innovation of TTD-DR is modeling the writing process as a denoising diffusion mechanism:

* **Messy-to-Polished Transformation:** The framework treats the initial rough draft as a noisy input that requires cleaning through factual verification.
* **Denoising with Retrieval:** The agent identifies missing information or weak arguments in the draft and uses search tools as a "denoising step" to inject new facts and strengthen the content.
* **Continuous Improvement Loop:** This process repeats in cycles, where each iteration uses newly retrieved information to refine the draft into a more accurate and high-quality final version.

TTD-DR demonstrates that shifting AI development from linear generation to iterative, diffusion-based refinement significantly improves the depth and rigor of long-form content. This methodology serves as a powerful blueprint for building autonomous agents capable of handling complex, multi-step knowledge tasks.
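The draft-critique-retrieve-revise cycle can be reduced to a small control loop. Everything below is a stub: `retrieve` stands in for the search-plus-RAG step, gap detection is a trivial plan check rather than an LLM critique, and "revision" is plain string append, but the loop structure mirrors the denoising process described above.

```python
# Schematic of TTD-DR's report-level denoising loop: each iteration finds
# the draft's weakest point, retrieves facts for it, and revises. All
# functions are simplified stand-ins for LLM and search calls.

def next_gap(covered: set, plan: list):
    """Return the first plan item the draft does not yet cover."""
    for topic in plan:
        if topic not in covered:
            return topic
    return None

def denoise(plan: list, retrieve, max_steps: int = 10) -> str:
    draft, covered = "Initial noisy draft.", set()
    for _ in range(max_steps):
        gap = next_gap(covered, plan)
        if gap is None:          # draft now covers the whole plan
            break
        fact = retrieve(gap)     # the retrieval-augmented "denoising step"
        draft += f" {fact}"      # revise the draft with the new evidence
        covered.add(gap)
    return draft
```

The essential design choice this captures is that refinement is driven by an explicit plan and external retrieval at test time, rather than by a single linear generation pass.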

google

Sensible Agent: A framework for unobtrusive interaction with proactive AR agents

Sensible Agent is a research prototype designed to move AR agents beyond explicit voice commands toward proactive, context-aware assistance. By leveraging real-time multimodal sensing of a user's environment and physical state, the framework ensures digital help is delivered unobtrusively through the most appropriate interaction modalities. This approach fundamentally reshapes human-computer interaction by anticipating user needs while minimizing cognitive and social disruption.

## Contextual Understanding via Multimodal Parsing

The framework begins by analyzing the user's immediate surroundings to establish a baseline for assistance.

* A Vision-Language Model (VLM) processes egocentric camera feeds from the AR headset to identify high-level activities and locations.
* YAMNet, a pre-trained audio event classifier, monitors environmental noise levels to determine if audio feedback is appropriate.
* The system synthesizes these inputs into a parsed context that accounts for situational impairments, such as when a user’s hands are occupied.

## Reasoning with Proactive Query Generation

Once the context is established, the system determines the specific type of assistance required through a sophisticated reasoning process.

* The framework uses chain-of-thought (CoT) reasoning to decompose complex problems into intermediate logical steps.
* Few-shot learning, guided by examples from data collection studies, helps the model decide between actions like providing translations or displaying a grocery list.
* The generator outputs a structured suggestion that includes the specific action, the query format (e.g., binary choice or icons), and the presentation modality (visual, audio, or both).

## Dynamic Modality and Interaction Management

The final stage of the framework manages how the agent communicates with the user and how the user can respond without breaking their current flow.
* The prototype, built on Android XR and WebXR, utilizes a UI Manager to render visual panels or generate text-to-speech (TTS) prompts based on the agent's decision.
* An Input Modality Manager activates the most discreet response methods available, such as head gestures (nods), hand gestures (thumbs up), or gaze tracking.
* This adaptive selection ensures that if a user is in a noisy room or a social setting, the agent can switch from verbal interaction to subtle visual cues and gesture-based confirmations.

By prioritizing social awareness and context-sensitivity, Sensible Agent provides a blueprint for AR systems that feel like helpful companions rather than intrusive tools. Implementing such frameworks is essential for making proactive digital assistants practical and acceptable for long-term, everyday use in public and private spaces.
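
The adaptive modality selection described above can be sketched as a simple decision rule; the 70 dB threshold and the modality names here are illustrative assumptions, not values from the paper:

```python
# Hypothetical decision rule; the 70 dB threshold and modality names are
# illustrative assumptions, not values from the paper.

def choose_interaction(noise_db, in_social_setting, hands_occupied):
    # Noisy rooms and social settings rule out audible output (TTS).
    output = "visual" if noise_db > 70 or in_social_setting else "audio+visual"
    # Occupied hands rule out hand gestures; fall back to nods or gaze.
    if hands_occupied:
        inputs = ["head_gesture", "gaze"]
    elif in_social_setting:
        inputs = ["head_gesture", "hand_gesture", "gaze"]
    else:
        inputs = ["voice", "hand_gesture"]
    return output, inputs
```
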

google

Making LLMs more accurate by using all of their layers

Self Logits Evolution Decoding (SLED) is a novel decoding strategy designed to reduce hallucinations and improve the factual accuracy of large language models without requiring external data or fine-tuning. By leveraging the internal representations of all model layers rather than just the final output, SLED aligns generation with the model’s intrinsic knowledge more effectively. Research shows that this approach consistently enhances performance across diverse tasks, including complex reasoning, multiple-choice questions, and open-ended generation.

## Limitations of Standard Decoding

* Standard LLMs typically generate text by relying solely on the "logits" (prediction scores) of the final layer to determine the next token.
* This process often leads to hallucinations because the final layer may prioritize "popular" or common patterns from training data over factual accuracy.
* While techniques like Retrieval Augmented Generation (RAG) provide external context, they increase system complexity and do not address the model's internal tendency to ignore subtle contextual cues during the final projection.

## The Technical Mechanism of SLED

* SLED utilizes "early exit" logits from every intermediate layer of the Transformer architecture, rather than just the final one.
* The strategy reuses the model's final projection matrix on these intermediate layers to create multiple probability distributions across the same set of potential tokens.
* By calculating a weighted average of the distributions from all layers, SLED refines the prediction to better reflect the model's latent knowledge.
* This multi-layer approach allows the model to catch nuances—such as specific math constraints or geographic facts—that might be "smoothed over" by the final layer’s preference for high-probability sequences.
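
A minimal sketch of the multi-layer idea, assuming a shared projection matrix applied to each layer's hidden state; it illustrates only the weighted averaging described above, not SLED's full logits-evolution procedure:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sled_distribution(hidden_states, w_proj, layer_weights):
    # Reuse the final projection matrix on every layer's hidden state
    # ("early exit"), giving one distribution per layer over the same vocab,
    # then combine them with per-layer weights.
    vocab = len(w_proj[0])
    mixed = [0.0] * vocab
    for h, weight in zip(hidden_states, layer_weights):
        logits = [sum(hi * w_proj[i][j] for i, hi in enumerate(h))
                  for j in range(vocab)]
        for j, p in enumerate(softmax(logits)):
            mixed[j] += weight * p  # weighted average across layers
    total = sum(mixed)
    return [p / total for p in mixed]
```
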

## Practical Performance and Reasoning

* In chain-of-thought tasks, SLED helps the model maintain logic; for example, it can correctly identify when a discount should be applied in a math problem by favoring intermediate layers that recognize the "if/then" logic over a simple arithmetic pattern.
* The method is model-agnostic and has shown consistent accuracy gains across various LLM scales and configurations.
* SLED is highly flexible and can be integrated with existing factuality decoding methods or speculative decoding to further reduce hallucinations without the need for additional training data.

For developers and researchers seeking to boost the reliability of LLMs, SLED offers a computationally efficient alternative to fine-tuning. By simply adjusting the decoding strategy to incorporate the rich information available in intermediate layers, models can achieve higher factuality and more robust reasoning capabilities in real-world applications.

google

Learn Your Way: Reimagining textbooks with generative AI

Google Research has introduced Learn Your Way, an AI-driven educational experiment that reimagines traditional textbooks as personalized, multimodal learning journeys. By leveraging the LearnLM family of models integrated into Gemini 2.5 Pro, the system transforms static source material into tailored content based on a student’s specific grade level and interests. Early efficacy studies demonstrate that this approach significantly enhances retention, with students scoring 11 percentage points higher than those using standard digital readers.

### Pedagogical Foundations and Dual Coding

The research is built on the "dual coding theory," which suggests that forming mental connections between different representations of information strengthens conceptual understanding.

* The system moves away from a "one-size-fits-all" model toward a student-driven experience where learners can choose and intermix formats.
* Personalization is used as a tool to enhance situational interest and motivation by adapting content to specific student attributes.
* The framework incorporates active learning through real-time quizzing and feedback to address knowledge gaps as they arise.

### The Personalization Pipeline

The technical architecture begins with a layered pipeline that processes source material, such as a textbook PDF, to create a foundational text for all other formats.

* The original material is first "re-leveled" to match the learner’s reported grade level while maintaining the integrity and scope of the curriculum.
* Generic examples within the text are strategically replaced with personalized examples based on user interests, such as sports, music, or food.
* This personalized base text serves as the primary input for generating all subsequent multimodal representations, ensuring consistency across formats.

### Multimodal Content Generation

To produce a wide variety of educational assets, the system utilizes a combination of large language models and specialized AI agents.
* **Agentic Workflows:** While tools like mind maps and timelines are generated directly by Gemini, complex assets like narrated slides use multi-step agentic workflows to ensure pedagogical effectiveness.
* **Custom Visuals:** Because general-purpose image models often struggle with educational accuracy, the researchers fine-tuned a dedicated model specifically for generating educational illustrations.
* **Diverse Representations:** The interface provides "immersive text" with embedded questions, audio lessons for auditory learning, and interactive slides that mimic recorded classroom sessions.

### Research Outcomes and Future Application

The project’s effectiveness was validated through a study comparing the GenAI approach against standard digital reading materials.

* Students using the personalized AI tools showed a significant improvement in retention test scores.
* Beyond retention, the system aims to transform passive reading into an active, multimodal experience that follows established learning science principles.
* The "Learn Your Way" experiment is currently available on Google Labs, providing a practical look at how adaptive, learner-centric materials might replace static textbooks in future K-12 and higher education settings.
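
The layered personalization pipeline can be illustrated as staged LLM calls. This is a hypothetical sketch: the `llm` callable and the prompt wording stand in for the Gemini-based stages and are not taken from the paper.

```python
# Hypothetical sketch of the layered pipeline; `llm` stands in for the
# Gemini-based stages and the prompt wording is illustrative.

def personalize_text(source, grade, interest, llm):
    # Stage 1: re-level the source to the learner's grade while keeping
    # the scope of the curriculum intact.
    leveled = llm(f"Rewrite for grade {grade}, preserving the curriculum: {source}")
    # Stage 2: swap generic examples for ones drawn from the learner's interest.
    return llm(f"Replace generic examples with {interest} examples: {leveled}")

def generate_formats(base_text, llm):
    # The personalized base text is the single input for every format.
    return {
        "mind_map": llm(f"Make a mind map of: {base_text}"),
        "audio_lesson": llm(f"Script an audio lesson for: {base_text}"),
        "immersive_text": llm(f"Embed checkpoint questions in: {base_text}"),
    }
```
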

line

Code Quality Improvement Techniques Part

When implementing resource management patterns similar to Kotlin's `use` or Java's try-with-resources, developers often face the challenge of handling exceptions that occur during both primary execution and resource cleanup. Simply wrapping these multiple failures in a custom exception container can inadvertently break the calling code's error-handling logic by masking the original exception type. To maintain code quality, developers should prioritize the primary execution exception and utilize the `addSuppressed` mechanism to preserve secondary errors without disrupting the expected flow.

### The Risks of Custom Exception Wrapping

Creating a new exception class to consolidate multiple errors during resource management can lead to significant issues for the caller.

* Wrapping an expected exception, such as an `IOException`, inside a custom `DisposableException` prevents specific `catch` blocks from identifying and handling the original error.
* This pattern often results in unhandled exceptions or the loss of specific error context, especially when the wrapper is hidden inside utility functions.
* While this approach aims to be "neat" by capturing all possible failures, it forces the caller to understand the internal wrapping logic of the utility rather than the business logic errors.

### Prioritizing Primary Logic over Cleanup

When errors occur in both the main execution block and the cleanup (e.g., `dispose()` or `close()`), it is critical to determine which exception takes precedence.

* The exception from the main execution block is typically the "primary" failure that reflects a business logic or IO error, whereas a cleanup failure is often secondary.
* Throwing a cleanup exception while discarding the primary error makes debugging difficult, as the root cause of the initial failure is lost.
* In a typical `try-finally` block, if the `finally` block throws an exception, it silently replaces (and discards) any exception thrown in the `try` block unless handled manually.

### Implementing Better Suppression Logic

A more robust implementation mimics the behavior of Kotlin’s `Closeable.use` by ensuring the most relevant error is thrown while keeping others accessible for debugging.

* Instead of creating a wrapper class, use `Throwable.addSuppressed()` to attach the cleanup exception to the primary exception.
* If only the primary block fails, throw that exception directly to satisfy the caller's `catch` requirements.
* If both the primary block and the cleanup fail, throw the primary exception and add the cleanup exception as a suppressed error.
* If only the cleanup fails, it is then appropriate to throw the cleanup exception as the standalone failure.

### Considerations for Checked and Unchecked Exceptions

The impact of exception handling varies by language, particularly in Java where checked exceptions are enforced by the compiler.

* Converting a checked exception into an unchecked `RuntimeException` inside a wrapper can cause the compiler to miss necessary error-handling requirements.
* If exceptions have parent-child relationships, such as `IOException` and `Exception`, wrapping can cause a specific handler to be bypassed in favor of a more generic one.
* It is generally recommended to only wrap checked exceptions in `RuntimeException` when the error is truly unrecoverable and the caller is not expected to handle it.

When designing custom resource management utilities, always evaluate which exception is most critical for the caller to see. Prioritize the primary execution error and use suppression for auxiliary cleanup failures to ensure that your error-handling remains transparent and predictable for the rest of the application.
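
The recommended priority rules can be sketched in Python terms. Python has no `Throwable.addSuppressed()`, so this sketch attaches the cleanup error to an improvised `suppressed` attribute; it is an analog of the Kotlin/Java pattern, not a standard API.

```python
# Sketch of a Kotlin-`use`-style helper, transplanted to Python for
# illustration. Python lacks Throwable.addSuppressed(), so the cleanup
# error is attached to an improvised `suppressed` attribute instead.

def use(resource, block):
    try:
        result = block(resource)
    except BaseException as primary:
        try:
            resource.close()
        except Exception as cleanup:
            # Both failed: re-throw the primary error so the caller's
            # catch logic still sees the original type, and keep the
            # cleanup error reachable for debugging.
            primary.suppressed = getattr(primary, "suppressed", []) + [cleanup]
        raise  # a primary failure alone also lands here, unwrapped
    resource.close()  # if only cleanup fails, its exception propagates as-is
    return result
```

Because the primary exception is re-raised with its original type, the caller's existing `except IOError:`-style handlers keep working, while the cleanup failure stays available on the attached list.
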

google

VaultGemma: The world's most capable differentially private LLM

VaultGemma represents a significant milestone in privacy-preserving AI as the most capable large language model trained from scratch using differential privacy (DP). By establishing new scaling laws specifically for DP training, researchers have optimized the complex trade-offs between compute, privacy budgets, and model utility. The resulting 1-billion-parameter model demonstrates that high-performance generative AI can be achieved while maintaining rigorous mathematical guarantees against data memorization.

## Scaling Laws for Differentially Private Training

* Performance in DP-trained models is primarily governed by the "noise-batch ratio," which measures the amount of random privacy noise relative to the size of the training data groups.
* Research suggests that for any given compute and privacy budget, there exists an optimal training configuration that balances model size, iterations, and batch size to achieve the lowest possible training loss.
* A critical finding indicates that DP training requires a departure from standard scaling practices, favoring significantly larger batch sizes and smaller model architectures than traditional non-DP training.

## Synergies in Privacy, Compute, and Data

* Increasing the privacy budget (epsilon) in isolation leads to diminishing returns unless it is paired with a proportional increase in compute (FLOPs) or data (tokens).
* Visualizations of the scaling laws show that different model sizes can provide similar utility if the number of training iterations and batch sizes are correctly adjusted.
* The optimal configuration shifts between investing in larger models versus more iterations depending on the specific constraints of the data and privacy budgets.

## Training at Scale with Algorithmic Advancements

* VaultGemma is built on the Gemma 2 architecture and utilizes a 1B parameter setup optimized for the unique constraints of DP.
* To overcome hardware limitations when processing the massive batch sizes required for DP training, the team developed a "Virtual Batch" technique in JAX to aggregate gradients across multiple steps.
* Training from scratch allows the model to outperform traditional DP-finetuned models, which often struggle to balance utility with the noise introduced during the fine-tuning process.

## Performance and Evaluation

* VaultGemma achieves competitive results against standard 1B parameter models while providing formal privacy protections.
* The model demonstrates superior privacy-utility trade-offs, proving that carefully scaled DP models can retain high levels of reasoning and language capability.
* The release includes the model weights and a comprehensive technical report to assist the community in developing the next generation of private-by-design AI.

VaultGemma provides a practical blueprint for developers who need to balance the power of large language models with strict data confidentiality requirements. By leveraging the provided scaling insights, organizations can now train models that are mathematically resistant to data leakage without sacrificing significant performance.
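
The aggregation behind such large-batch DP training can be illustrated with a toy DP-SGD-style step, assuming standard per-example clipping and Gaussian noising; this is not the team's JAX implementation, which additionally accumulates micro-batch gradients across training steps.

```python
import math
import random

def dp_average_grad(per_example_grads, clip_norm, noise_mult, rng):
    # Clip each example's gradient to clip_norm (L2), sum across the
    # whole (virtual) batch, then add Gaussian noise calibrated to the
    # clipping bound before averaging.
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        for i, g in enumerate(grad):
            total[i] += g * scale
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, noise_mult * clip_norm)) / n for t in total]
```

Because the noise is calibrated to the clipping bound rather than the batch size, larger (virtual) batches shrink the per-example noise share, which is why DP training favors very large batches.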