line

Essential Element for App Success: Error Monitoring

Effective mobile app management requires proactive outage monitoring to prevent user churn caused by failures in critical flows like registration or payment. Relying on user reports is often too late, so developers must implement systematic event collection and real-time dashboards to identify issues the moment they arise. By integrating tools like Sentry or Firebase, teams can maintain high quality through immediate response and detailed performance analysis.

### Implementing Sentry in Flutter

* **Dependency and Initialization**: Integration begins by adding `sentry_flutter` and `sentry_dio` to the project. The initialization process involves setting the Data Source Name (DSN), environment tags (e.g., production vs. staging), and release versions to ensure logs are correctly categorized.
* **Performance and Privacy**: Developers should configure `tracesSampleRate` and `profilesSampleRate` to balance monitoring depth with costs. Additionally, the `beforeSend` callback allows for masking sensitive user data like authorization headers or IP addresses before they are transmitted.
* **Contextual Tracking**: To aid debugging, the system captures user IDs via `Sentry.configureScope` and tracks user movement using `SentryNavigatorObserver`. Utilizing `SentryInterceptor` with the Dio library allows for automatic tracking of HTTP request performance and API bottlenecks.

### Strategic Log Level Design

* **Debug and Info**: Debug logs remain local to the terminal to save resources. Info logs are reserved for significant user actions that change data, such as successful sign-ups or purchases, while high-frequency read actions like "viewing a product list" are excluded to reduce noise and costs.
* **Warning**: This level tracks external system failures, such as failed API calls or push notification losses. To prevent "alert fatigue," client-side network issues (e.g., timeouts or offline status) are ignored, and alerts are triggered only when specific thresholds are met, such as 100 failures within 10 minutes.
* **Error**: Error logs represent internal logic failures that bypass defensive coding, such as null object errors, parsing failures, or unreachable code branches. These require immediate notification to the development team to facilitate rapid hotfixes.
* **Fatal**: This level is dedicated to application crashes and unhandled exceptions. When configured at the app's entry point, the system automatically captures these critical failures to provide a comprehensive "crash-free users" metric.

### Creating Effective Dashboards

* **Naming Conventions**: Logs should follow a strict structure, using tags for modules and event names (e.g., `[API] [postLogin] success`). This consistency allows for granular querying and clearer visualization on monitoring dashboards.
* **Data Enrichment**: Using the `extra` field in log events provides vital context for troubleshooting, such as including the specific endpoint, request body, and response status code for a failed transaction.
* **Actionable Metrics**: Effective monitoring focuses on key performance indicators like API error rates and the failure percentage of core business events (login, registration, payment) rather than just raw crash counts.

A robust monitoring strategy shifts the focus from simple crash reporting to comprehensive service health. By standardizing log levels and automating event collection, development teams can distinguish between transient network blips and critical logic errors, ensuring they spend their time fixing high-impact issues.
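The Warning-level rule described above (ignore client-side network issues, alert only past a threshold such as 100 failures within 10 minutes) can be sketched as a sliding-window counter. This is a minimal illustration in Python rather than Dart, not the article's implementation: the class name, the `cause` labels, and the choice to evaluate the threshold on every recorded failure are all assumptions. In practice this kind of rule is usually configured in the monitoring tool's alerting settings rather than written by hand.

```python
from collections import deque

# Hypothetical labels for failures caused by the client's own connectivity.
CLIENT_SIDE_CAUSES = {"timeout", "offline"}


class FailureThresholdAlert:
    """Signals an alert once `threshold` failures fall inside a sliding time window."""

    def __init__(self, threshold=100, window_seconds=600.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()  # failure times in seconds, oldest first

    def record_failure(self, now, cause="server"):
        """Record one failure at time `now`; return True if the alert should fire."""
        if cause in CLIENT_SIDE_CAUSES:
            return False  # client-side network noise never counts toward the alert
        self._timestamps.append(now)
        # Drop failures that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

With the defaults, the hundredth server-side failure inside any ten-minute span returns `True`, while isolated failures or bursts of timeouts do not.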

google

Solving virtual machine puzzles: How AI is optimizing cloud computing

Google researchers have developed LAVA, a scheduling framework designed to optimize virtual machine (VM) allocation in large-scale data centers by accurately predicting and adapting to VM lifespans. By moving beyond static, one-time predictions toward a "continuous re-prediction" model based on survival analysis, the system significantly improves resource efficiency and reduces fragmentation. This approach allows cloud providers to solve the complex "bin packing" problem more effectively, leading to better capacity utilization and easier system maintenance.

### The Challenge of Long-Tailed VM Distributions

* Cloud workloads exhibit an extremely long-tailed distribution: while 88% of VMs live for less than an hour, these short-lived jobs consume only 2% of total resources.
* The rare VMs that run for 30 days or longer account for a massive fraction of compute resources, meaning their placement has a disproportionate impact on host availability.
* Poor allocation leads to "resource stranding," where a server's remaining capacity is too small or unbalanced to host new VMs, effectively wasting expensive hardware.
* Traditional machine learning models that provide only a single prediction at VM creation are often fragile, as a single misprediction can block a physical host from being cleared for maintenance or new tasks.

### Continuous Re-prediction via Survival Analysis

* Instead of predicting a single average lifetime, LAVA uses an ML model to generate a probability distribution of a VM's expected duration.
* The system employs "continuous re-prediction," asking how much longer a VM is expected to run given how long it has already survived (e.g., a VM that has run for five days is assigned a different remaining lifespan than a brand-new one).
* This adaptive approach allows the scheduling logic to automatically correct for initial mispredictions as more data about the VM's actual behavior becomes available over time.
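The re-prediction idea can be made concrete with a conditional expectation: given a distribution over total lifetimes, the expected remaining lifetime at a given age is computed over only the outcomes that are still possible. The sketch below uses a small, made-up discrete distribution (loosely echoing the 88% short-lived figure); it is not LAVA's model, and the function name and numbers are assumptions.

```python
from typing import Mapping


def expected_remaining_lifetime(lifetime_pmf: Mapping[float, float], age: float) -> float:
    """Return E[T - age | T > age] for a discrete distribution over total lifetime T.

    lifetime_pmf maps a possible total lifetime (in hours) to its probability.
    """
    # Keep only the lifetimes that are still possible once the VM has survived `age`.
    survivors = {t: p for t, p in lifetime_pmf.items() if t > age}
    survival_prob = sum(survivors.values())
    if survival_prob == 0:
        raise ValueError("distribution assigns no probability beyond this age")
    # Renormalize by the survival probability and average the remaining time.
    return sum((t - age) * p for t, p in survivors.items()) / survival_prob


# Hypothetical distribution: most VMs are short-lived, a few run for 30 days (720 h).
pmf = {0.5: 0.88, 24.0: 0.09, 720.0: 0.03}
at_creation = expected_remaining_lifetime(pmf, age=0.0)
after_one_hour = expected_remaining_lifetime(pmf, age=1.0)
after_five_days = expected_remaining_lifetime(pmf, age=120.0)
```

Under this toy distribution the estimate grows as the VM ages, because surviving the first hour rules out the short-lived majority. This is the correction a single prediction at creation time cannot make.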
### Novel Scheduling and Rescheduling Algorithms

* **Non-Invasive Lifetime Aware Scheduling (NILAS):** Currently deployed on Google’s Borg cluster manager, this algorithm ranks potential hosts by grouping VMs with similar expected exit times to increase the frequency of "empty hosts" available for maintenance.
* **Lifetime-Aware VM Allocation (LAVA):** This algorithm fills resource gaps on hosts containing long-lived VMs with jobs that are at least an order of magnitude shorter. This ensures the short-lived VMs exit quickly without extending the host's overall occupation time.
* **Lifetime-Aware Rescheduling (LARS):** To minimize disruptions during defragmentation, LARS identifies and migrates the longest-lived VMs first while allowing short-lived VMs to finish their tasks naturally on the original host.

By integrating survival-analysis-based predictions into the core logic of data center management, cloud providers can transition from reactive scheduling to a proactive model. This system not only maximizes resource density but also ensures that the physical infrastructure remains flexible enough to handle large, resource-intensive provisioning requests and essential system updates.
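As a rough illustration of lifetime-aware placement, the sketch below ranks hosts by how much a new VM would extend each host's expected drain time, breaking ties by how close the exit times are. This scoring rule is a simplification invented for illustration, not the published NILAS or LAVA algorithm; capacity checks are omitted, and the function name and data layout are assumptions.

```python
from typing import Dict, List


def pick_host(hosts: Dict[str, List[float]], new_vm_exit: float) -> str:
    """Choose a host for a VM expected to exit at time `new_vm_exit`.

    hosts maps a host name to the expected exit times of the VMs already on it.
    """

    def score(name: str):
        host_exit = max(hosts[name], default=0.0)  # when the host would drain today
        extension = max(0.0, new_vm_exit - host_exit)  # extra occupation the VM adds
        gap = abs(host_exit - new_vm_exit)  # how well the exit times line up
        return (extension, gap, name)  # name keeps the ordering deterministic

    return min(hosts, key=score)


cluster = {"host-a": [100.0, 90.0], "host-b": [10.0], "host-c": []}
short_vm_host = pick_host(cluster, new_vm_exit=12.0)
long_vm_host = pick_host(cluster, new_vm_exit=500.0)
```

Because an empty host has a drain time of zero, any placement on it counts as a full extension, so empty hosts are used last and stay available for maintenance or large requests.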

google

Using AI to identify genetic variants in tumors with DeepSomatic

DeepSomatic is an AI-powered tool developed by Google Research to identify cancer-related mutations by analyzing a tumor's genetic sequence with higher accuracy than current methods. By leveraging convolutional neural networks (CNNs), the model distinguishes between inherited genetic traits and acquired somatic variants that drive cancer progression. This flexible tool supports multiple sequencing platforms and sample types, offering a critical resource for clinicians and researchers aiming to personalize cancer treatment through precision medicine.

## Challenges in Somatic Variant Detection

* Somatic variants are genetic mutations acquired after birth through environmental exposure or DNA replication errors, making them distinct from the germline variants found in every cell of a person's body.
* Detecting these mutations is technically difficult because tumor samples are often heterogeneous, containing a diverse set of variants at varying frequencies.
* Sequencing technologies often introduce small errors that can be difficult to distinguish from actual somatic mutations, especially when the mutation is only present in a small fraction of the sampled cells.

## CNN-Based Variant Calling Architecture

* DeepSomatic employs a method pioneered by DeepVariant, which involves transforming raw genetic sequencing data into a set of multi-channel images.
* These images represent various data points, including alignment along the chromosome, the quality of the sequence output, and other technical variables.
* The convolutional neural network processes these images to differentiate between three categories: the human reference genome, non-cancerous germline variants, and the somatic mutations driving tumor growth.
* By analyzing tumor and non-cancerous cells side-by-side, the model effectively filters out sequencing artifacts that might otherwise be misidentified as mutations.
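To give a feel for what "transforming sequencing data into multi-channel images" means, the sketch below encodes a handful of aligned reads as three stacked channels: base identity, base quality, and mismatch against the reference. The channel set, the intensity scaling, and the requirement that every read spans the whole window are simplifying assumptions for illustration; they do not reproduce DeepSomatic's or DeepVariant's actual encoding.

```python
from typing import List, Sequence, Tuple

# Hypothetical mapping of bases to pixel intensities.
BASE_INTENSITY = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.00}
MAX_QUALITY = 60.0  # assumed cap for Phred-style base qualities

Read = Tuple[str, Sequence[int]]  # (bases, per-base quality scores)


def encode_pileup(reference: str, reads: Sequence[Read]) -> List[List[List[float]]]:
    """Return image[channel][read][position] with three channels:
    0 = base identity, 1 = normalized base quality, 2 = mismatch vs. reference."""
    base_ch, qual_ch, mismatch_ch = [], [], []
    for bases, quals in reads:
        if len(bases) != len(reference) or len(quals) != len(reference):
            raise ValueError("this sketch assumes every read spans the whole window")
        base_ch.append([BASE_INTENSITY.get(b, 0.0) for b in bases])
        qual_ch.append([min(q, MAX_QUALITY) / MAX_QUALITY for q in quals])
        mismatch_ch.append([0.0 if b == r else 1.0 for b, r in zip(bases, reference)])
    return [base_ch, qual_ch, mismatch_ch]


def alt_fraction(image: List[List[List[float]]], position: int) -> float:
    """Fraction of reads that disagree with the reference at one position."""
    column = [row[position] for row in image[2]]
    return sum(column) / len(column)


reads = [("ACGT", [30, 30, 30, 30]), ("ACTT", [30, 30, 60, 30]), ("ACGT", [15, 15, 15, 15])]
image = encode_pileup("ACGT", reads)
```

A position where only a small fraction of reads disagree with the reference is exactly the ambiguous case described above: it may be a low-frequency somatic variant or a sequencing error, which is why quality and a matched normal sample are supplied to the network as additional evidence.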
## System Versatility and Application

* The model is designed to function in multiple modes, including "tumor-normal" (comparing a biopsy to a healthy sample) and "tumor-only" mode, which is vital for blood cancers like leukemia where isolating healthy cells is difficult.
* DeepSomatic is platform-agnostic, meaning it can process data from all major sequencing technologies and adapt to different types of sample processing.
* The tool has demonstrated the ability to generalize its learning to various cancer types, even those not specifically included in its initial training sets.

## Open-Source Contributions to Precision Medicine

* Google has made the DeepSomatic tool and the CASTLE dataset, a high-quality training and evaluation set, openly available to the global research community.
* This initiative is part of a broader effort to use AI for early detection and advanced research in various cancers, including breast, lung, and gynecological cancers.
* The release aims to accelerate the development of personalized treatment plans by providing a more reliable way to identify the specific genetic drivers of an individual's disease.

By providing a more accurate and adaptable method for variant calling, DeepSomatic helps researchers pinpoint the specific drivers of a patient's cancer. This tool represents a significant advancement in deep learning for genomics, potentially shortening the path from biopsy to targeted therapeutic intervention.

datadog

Failure is inevitable: Learning from a large outage, and building for reliability in depth at Datadog

Following a major 2023 incident that caused a near-total platform outage despite partial infrastructure availability, Datadog shifted its engineering philosophy from "never-fail" architectures to a model of graceful degradation. The company identified that prioritizing absolute data correctness during systemic stress created "square-wave" failures, where the entire platform appeared down if even a portion of data was missing. By moving toward a "fail better" mindset, Datadog now focuses on maintaining core functionality and data persistence even when underlying infrastructure is compromised.

## Limitations of the Never-Fail Approach

* Classical root-cause analysis focused on a legacy, unsupervised global update mechanism that disconnected 50–60% of production Kubernetes nodes.
* While the "precipitating event" was easily identified and disabled, the engineering team realized that fixing the trigger did not address the systemic fragility that caused a binary (up/down) failure pattern.
* Prioritizing absolute accuracy meant that systems would wait for all data tags to process before displaying results; under stress, this caused the UI to show no data at all rather than "almost correct" data.
* Sequential queuing, aggressive retry logic, and node-specific processing requirements exacerbated the bottleneck, preventing real-time recovery.

## Prioritizing Graceful Degradation

* The incident prompted a shift away from relying solely on redundancy to prevent outages, acknowledging that some level of failure is eventually inevitable at scale.
* Engineering priorities were redefined to ensure that data is never lost (even if delayed) and that real-time data is processed before stale backlogs.
* The platform now aims to serve partial-but-accurate results to customers during an incident, providing visibility rather than a complete blackout.
* Implementation is handled as a company-wide program where individual product teams adapt these principles to their specific architectural needs.

## Strengthening Data Persistence at Intake

* Analysis revealed that data was lost during the outage because it was stored in memory or on local disks before being replicated to persistent stores.
* The original design favored low-latency responses by acknowledging receipt of data before it was fully replicated, making that data unrecoverable if the node failed.
* Downstream failures caused intake nodes to overflow their local buffers, leading to data loss even on nodes that remained online.
* New architectural changes focus on implementing disk-based persistence at the very beginning of the processing pipeline to ensure data survives node restarts and downstream congestion.

To build truly resilient systems, engineering teams must move beyond trying to prevent every possible failure trigger. Instead, focus on designing services that can survive partial infrastructure loss by prioritizing data persistence and allowing for degraded states that still provide value to the end user.
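The contrast between a "square-wave" failure and graceful degradation can be sketched as a query that returns whatever shards answered, along with an explicit list of what is missing, instead of failing the whole request. This is a generic illustration under assumed names (`QueryResult`, `query_with_degradation`, shard fetch callables); it does not describe Datadog's actual query path.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QueryResult:
    points: Dict[str, float] = field(default_factory=dict)
    missing_shards: List[str] = field(default_factory=list)

    @property
    def complete(self) -> bool:
        """True only when every shard contributed to the result."""
        return not self.missing_shards


def query_with_degradation(shards: Dict[str, Callable[[], Dict[str, float]]]) -> QueryResult:
    """Collect data from every shard that responds; record the ones that do not."""
    result = QueryResult()
    for name, fetch in shards.items():
        try:
            result.points.update(fetch())
        except Exception:
            # A failed shard degrades the answer instead of failing the whole query.
            result.missing_shards.append(name)
    return result


def healthy_shard() -> Dict[str, float]:
    return {"cpu.usage": 0.5}


def broken_shard() -> Dict[str, float]:
    raise TimeoutError("shard unavailable")


partial = query_with_degradation({"shard-1": healthy_shard, "shard-2": broken_shard})
```

Returning `missing_shards` alongside the data lets a UI show the partial result with a visible "incomplete" marker, which keeps the answer accurate about its own gaps.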

google

Coral NPU: A full-stack platform for Edge AI

Coral NPU is a new full-stack, open-source platform designed to bring advanced AI directly to power-constrained edge devices and wearables. By prioritizing a matrix-first hardware architecture and a unified software stack, Google aims to overcome traditional bottlenecks in performance, ecosystem fragmentation, and data privacy. The platform enables always-on, low-power ambient sensing while providing developers with a flexible, RISC-V-based environment for deploying modern machine learning models.

## Overcoming Edge AI Constraints

* The platform addresses the "performance gap" where complex ML models typically exceed the power, thermal, and memory budgets of battery-operated devices.
* It eliminates the "fragmentation tax" by providing a unified architecture, moving away from proprietary processors that require costly, device-specific optimizations.
* On-device processing ensures a high standard of privacy and security by keeping personal context and data off the cloud.

## AI-First Hardware Architecture

* Unlike traditional chips, this architecture prioritizes the ML matrix engine over scalar compute to optimize for efficient on-device inference.
* The design is built on RISC-V ISA compliant architectural IP blocks, offering an open and extensible reference for system-on-chip (SoC) designers.
* The base design delivers performance in the 512 giga operations per second (GOPS) range while consuming only a few milliwatts of power.
* The architecture is tailored for "always-on" use cases, making it ideal for hearables, AR glasses, and smartwatches.

## Core Architectural Components

* **Scalar Core:** A lightweight, C-programmable RISC-V frontend that manages data flow using an ultra-low-power "run-to-completion" model.
* **Vector Execution Unit:** A SIMD co-processor compliant with the RISC-V Vector instruction set (RVV) v1.0 for simultaneous operations on large datasets.
* **Matrix Execution Unit:** A specialized engine using quantized outer product multiply-accumulate (MAC) operations to accelerate fundamental neural network tasks.

## Unified Developer Ecosystem

* The platform is a C-programmable target that integrates with modern compilers such as IREE and TFLM (TensorFlow Lite Micro).
* It supports a wide range of popular ML frameworks, including TensorFlow, JAX, and PyTorch.
* The software toolchain utilizes MLIR and the StableHLO dialect to facilitate the transition from high-level models to hardware-executable code.
* Developers have access to a complete suite of tools, including a simulator, custom kernels, and a general-purpose MLIR compiler.

SoC designers and ML developers looking to build the next generation of wearables should leverage the Coral NPU reference architecture to balance high-performance AI with extreme power efficiency. By utilizing the open-source documentation and RISC-V-based tools, teams can significantly reduce the complexity of deploying private, always-on ambient sensing.
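The phrase "quantized outer product multiply-accumulate" refers to a standard way of computing a matrix product: rather than taking one dot product per output element, the result is built up as a sum of outer products, one per step along the shared dimension, over small integer operands. The sketch below shows that arithmetic in plain Python. It is a software illustration of the math only, says nothing about the Coral NPU's actual datapath, and the symmetric int8 quantization scheme is an assumption.

```python
from typing import List, Sequence


def quantize(values: Sequence[float], scale: float) -> List[int]:
    """Symmetric quantization to the int8 range: round(value / scale), clamped."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def matmul_outer_product(a: List[List[int]], b: List[List[int]]) -> List[List[int]]:
    """Compute C = A @ B as a sum of K outer products (A is M x K, B is K x N).

    Python integers do not overflow, standing in for a wide hardware accumulator.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]
    for k in range(k_dim):
        column = [a[i][k] for i in range(m)]  # k-th column of A
        row = b[k]  # k-th row of B
        for i in range(m):
            for j in range(n):
                acc[i][j] += column[i] * row[j]  # one multiply-accumulate
    return acc


product = matmul_outer_product([[1, 2], [3, 4]], [[5, 6], [7, 8]])
weights_q = quantize([0.5, -1.0, 100.0], scale=0.5)
```

Each step `k` updates every element of the accumulator at once, which is why the outer-product form maps naturally onto a two-dimensional grid of MAC units.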

discord

Staff Picks, September 2025: Welcome to Our Video Game Museum

This blog post celebrates National Video Games Day by reflecting on the cultural and historical significance of the gaming industry. By framing the discussion around a hypothetical museum of influential titles, the post seeks to identify the specific games that have left the most lasting impact on players and creators alike.

### Commemorating National Video Games Day

* The post acknowledges September 12th as a day to honor gaming culture and the legacy of titles released over the years.
* It encourages readers to use the occasion as an excuse to engage with their favorite games and spread appreciation for the medium.

### Identifying Historically Significant Titles

* The authors utilize a "museum" concept, referencing the character Blathers from the *Animal Crossing* series, to discuss game preservation and importance.
* The central inquiry focuses on identifying "prized games" that deserve to be showcased behind glass cases due to their industry-wide influence.

### Team Perspectives on Industry Impact

* The post features insights from four specific team members: Veronica, Scott, Tyler, and Anni.
* Each contributor provides a personal selection for the game they believe has had the most significant impact on their lives or the industry at large.

Whether through a museum exhibit or personal play, reflecting on the history of gaming helps highlight the titles that defined the medium. Readers are encouraged to consider which games they would personally archive as the most influential "prized pieces" of digital history.

discord

Discord Patch Notes: October 7, 2025

Discord’s "Patch Notes" series provides a transparent look into the engineering team's ongoing efforts to improve platform performance, reliability, and responsiveness. By focusing on bug-squishing and usability enhancements, the series outlines the specific technical changes implemented to maintain a high-quality user experience across all supported devices.

### Community-Driven Bug Discovery

* Discord utilizes the community-run r/DiscordApp subreddit as a primary channel for identifying technical issues.
* Users are encouraged to post in the Bimonthly Bug Megathread, which is actively monitored by the engineering team to track and resolve persistent user concerns.
* This direct feedback loop allows developers to prioritize fixes that have the most significant impact on the general user base.

### Early Access via iOS TestFlight

* For users interested in experimental features, Discord offers an early-access program through Apple’s TestFlight platform.
* This beta version allows iOS users to test new updates before they reach the general public, serving as a final stage for identifying "pesky bugs" in a live environment.
* Participation in this program provides the engineering team with critical data on feature stability and performance on mobile hardware.

### Commit and Deployment Status

* All listed fixes in the series have already been committed and merged into Discord's primary codebase.
* Because the deployment process is staged, these updates may roll out to individual platforms and regions at slightly different times even after the notes are published.

To ensure the most stable experience and gain access to the latest performance improvements, users should keep their applications updated and consider joining the TestFlight program to help refine upcoming features.

discord

From Single-Node to Multi-GPU Clusters: How Discord Made Distributed Compute Easy for ML Engineers

Discord’s machine learning infrastructure reached a critical scaling limit as models and datasets grew beyond the capacity of single-machine systems. To overcome these bottlenecks, the engineering team transitioned to a distributed compute architecture built on the Ray framework and a suite of custom orchestration tools. This evolution moved Discord from ad-hoc experimentation to a robust production platform, resulting in significant performance gains such as a 200% improvement in business metrics for Ads Ranking.

### Overcoming Hardware and Data Bottlenecks

* Initial ML systems relied on simple classifiers that eventually evolved into complex models serving hundreds of millions of users.
* Training requirements shifted from single-machine tasks to workloads requiring multiple GPUs.
* Datasets expanded to the point where they could no longer fit on individual machines, creating a need for distributed storage and processing.
* Infrastructure growth struggled to keep pace with the exponential increase in computational demands.

### Building a Ray-Based ML Platform

* The Ray framework was adopted as the foundation for distributed computing to simplify complex ML workflows.
* Discord integrated Dagster with KubeRay to manage the orchestration of production-grade machine learning pipelines.
* Custom CLI tooling was developed to lower the barrier to entry for engineers, focusing heavily on developer experience.
* A specialized observability layer called X-Ray was implemented to provide deep insights into distributed system performance.

By prioritizing developer experience and creating accessible abstractions over raw compute power, Discord successfully industrialized its ML operations. For organizations facing similar scaling hurdles, the focus should be on building a unified platform that turns the complexity of distributed systems into a seamless tool for modelers.

google

XR Blocks: Accelerating AI + XR innovation

XR Blocks is an open-source, cross-platform framework designed to bridge the technical gap between mature AI development ecosystems and high-friction extended reality (XR) prototyping. By providing a modular architecture and high-level abstractions, the toolkit enables creators to rapidly build and deploy intelligent, immersive web applications without managing low-level system integration. Ultimately, the framework empowers developers to move from concept to interactive prototype across both desktop simulators and mobile XR devices using a unified codebase.

### Core Design Principles

* **Simplicity and Readability:** Drawing inspiration from the "Zen of Python," the framework prioritizes human-readable abstractions where a developer’s script reflects a high-level description of the experience rather than complex boilerplate code.
* **Creator-Centric Workflow:** The architecture is designed to handle the "plumbing" of XR (such as sensor fusion, AI model integration, and cross-platform logic), allowing creators to focus entirely on user interaction and experience.
* **Pragmatic Modularity:** Rather than attempting to be a perfect, all-encompassing system, XR Blocks favors an adaptable and simple architecture that can evolve alongside the rapidly changing fields of AI and spatial computing.

### The Reality Model Abstractions

* **The Script Primitive:** Acts as the logical center of an application, separating the "what" of an interaction from the "how" of its underlying technical implementation.
* **User and World:** Provides built-in support for tracking hands, gaze, and avatars while allowing the system to query the physical environment for depth, estimated lighting conditions, and object recognition.
* **AI and Agents:** Facilitates the integration of intelligent assistants, such as the "Sensible Agent," which can provide proactive, context-aware suggestions within the XR environment.
* **Virtual Interfaces:** Offers tools to augment blended reality with virtual UI elements that respond to the user's physical context.

### Technical Implementation and Integration

* **Web-Based Foundation:** The framework is built upon accessible, standard technologies including WebXR, three.js, and LiteRT (formerly TFLite) to ensure a low barrier to entry for web developers.
* **Advanced AI Support:** It features native integration with Gemini for high-level reasoning and context-aware applications.
* **Cross-Platform Deployment:** Developers can prototype depth-aware, physics-based interactions in a desktop simulator and deploy the exact same code to Android XR devices.
* **Open-Source Resources:** The project includes a comprehensive suite of templates and live demos covering specific use cases like depth mapping, gesture modeling, and lighting estimation.

By lowering the barrier to entry for intelligent XR development, XR Blocks serves as a practical starting point for researchers and developers aiming to explore the next generation of human-centered computing. Interested creators can access the source code on GitHub to begin building immersive, AI-driven applications that function seamlessly across the web and specialized XR hardware.