observability

4 posts

meta

DrP: Meta's Root Cause Analysis Platform at Scale - Engineering at Meta

DrP is Meta's programmatic root cause analysis (RCA) platform, designed to automate incident investigations and reduce the burden of manual on-call work. By codifying investigation playbooks into executable "analyzers," the platform has reduced mean time to resolve (MTTR) by 20% to 80% for over 300 teams. This systematic approach replaces outdated manual scripts with a scalable backend that executes 50,000 automated analyses daily, providing immediate context when alerts fire.

## Architecture and Core Components

* **Expressive SDK:** Provides a framework for engineers to codify investigation workflows into "analyzers," utilizing a rich library of helper functions and machine learning algorithms.
* **Built-in Analysis Tools:** The platform includes native support for anomaly detection, event isolation, time-series correlation, and dimension analysis to identify specific problem areas.
* **Scalable Backend:** A multi-tenant execution environment manages a worker pool that handles thousands of requests securely and asynchronously.
* **Workflow Integration:** DrP is integrated directly into Meta's internal alerting and incident management systems, allowing analyses to be triggered automatically without human intervention.

## Authoring and Verification Workflow

* **Template Bootstrapping:** Engineers use the SDK to generate boilerplate code that captures required input parameters and context in a type-safe manner.
* **Analyzer Chaining:** Passing context between analyzers enables dependency analysis, letting a single investigation span multiple interconnected services.
* **Automated Backtesting:** Before deployment, analyzers undergo automated backtesting integrated into the code review process to ensure accuracy and performance.
* **Decision Tree Logic:** Investigation steps are modeled as decision trees within the code, allowing the analyzer to follow different paths based on the data it retrieves.

## Execution and Post-Processing

* **Trigger-based Analysis:** When an alert fires, the backend automatically queues the relevant analyzer, ensuring findings are available as soon as an engineer begins triaging.
* **Automated Mitigation:** A post-processing system can take direct action based on investigation results, such as creating tasks or submitting pull requests to resolve identified issues.
* **DrP Insights:** This system periodically reviews historical analysis outputs to identify and rank the top causes of alerts, helping teams prioritize long-term reliability fixes.
* **Alert Annotation:** Results are presented in both human-readable text and machine-readable formats, directly annotating the incident logs for the on-call responder.

## Practical Conclusion

Organizations managing large-scale distributed systems should transition from static markdown playbooks to executable investigation code (a hypothetical analyzer is sketched below). By implementing a programmatic RCA framework like DrP, teams can scale their troubleshooting expertise and significantly reduce "on-call fatigue" by automating the repetitive triage steps that typically consume the first hour of an incident.
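To make the analyzer concept concrete, here is a minimal sketch of what a DrP-style analyzer could look like. The actual SDK is internal to Meta and not shown in the post, so every name below (the context helpers `fetch_timeseries`, `detect_anomalies`, `correlate_dimensions`, and `run_analyzer`) is a hypothetical stand-in for the kinds of helpers, decision-tree branching, and analyzer chaining the article describes.

```python
# Hypothetical sketch of a DrP-style analyzer. The real SDK is internal to Meta;
# all helper names on the context object are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Finding:
    summary: str    # human-readable annotation for the on-call responder
    evidence: dict  # machine-readable payload for post-processing/automation


class LatencyRegressionAnalyzer:
    """Codifies one branch of an investigation playbook as executable code."""

    def run(self, ctx) -> Finding:
        # Step 1: pull the time series behind the firing alert.
        series = ctx.fetch_timeseries(metric="p99_latency", window="1h")

        # Step 2: built-in anomaly detection decides which branch to follow.
        anomalies = ctx.detect_anomalies(series)
        if not anomalies:
            return Finding("No latency anomaly found; likely a flaky alert.", {})

        # Step 3: dimension analysis narrows the regression to a region/release.
        culprit = ctx.correlate_dimensions(series, dimensions=["region", "release"])

        # Step 4: chain into a downstream analyzer, passing context along,
        # so the investigation can cross service boundaries.
        downstream = ctx.run_analyzer("DeploymentDiffAnalyzer", release=culprit["release"])

        return Finding(
            summary=f"p99 latency regression isolated to release {culprit['release']}",
            evidence={"anomalies": anomalies, "culprit": culprit, "downstream": downstream},
        )
```

The value of the pattern is that the branching an on-call engineer would otherwise follow by hand runs automatically the moment the alert fires, with the result attached to the alert before triage even starts.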

aws

Amazon Bedrock AgentCore adds quality evaluations and policy controls for deploying trusted AI agents | AWS News Blog

AWS has introduced several new capabilities in Amazon Bedrock AgentCore designed to remove the trust and quality barriers that often prevent AI agents from moving into production environments. These updates, which include granular policy controls and sophisticated evaluation tools, allow developers to implement strict operational boundaries and monitor real-world performance at scale. By balancing agent autonomy with centralized verification, AgentCore provides a secure framework for deploying highly capable agents across enterprise workflows.

**Governance through Policy in AgentCore**

* This feature establishes clear boundaries for agent actions by intercepting tool calls via the AgentCore Gateway before they are executed.
* By operating outside of the agent's internal reasoning loop, the policy layer acts as an independent verification system that treats the agent as an autonomous actor requiring permission.
* Developers can define fine-grained permissions to ensure agents do not access sensitive data inappropriately or take unauthorized actions within external systems.

**Quality Monitoring with AgentCore Evaluations**

* The new evaluation framework allows teams to monitor the quality of AI agents based on actual behavior rather than theoretical simulations.
* Built-in evaluators provide standardized metrics for critical dimensions such as helpfulness and correctness.
* Organizations can also implement custom evaluators to ensure agents meet specific business-logic requirements and industry-specific compliance standards.

**Enhanced Memory and Communication Features**

* New episodic functionality in AgentCore Memory introduces a long-term strategy that allows agents to learn from past experiences and apply successful solutions to similar future tasks.
* Bidirectional streaming in the AgentCore Runtime supports the deployment of advanced voice agents capable of handling natural, simultaneous conversation flows.
* These enhancements focus on improving consistency and user experience, enabling agents to handle complex, multi-turn interactions with higher reliability.

**Real-World Application and Performance**

* The AgentCore SDK has seen rapid adoption with over 2 million downloads, supporting diverse use cases from content generation at the PGA TOUR to financial data analysis at Workday.
* Case studies highlight significant operational gains, such as a 1,000 percent increase in content writing speed and a 50 percent reduction in problem resolution time through improved observability.
* The platform emphasizes 100 percent traceability of agent decisions, which is critical for organizations transitioning from reactive to proactive AI-driven operations.

To successfully scale AI agents, organizations should transition from simple prompt engineering to a robust agentic architecture. Leveraging these new policy and evaluation tools will allow development teams to maintain the control and visibility required for customer-facing and mission-critical deployments.
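The announcement does not reproduce the AgentCore policy API, so the following is only a conceptual sketch, in Python with entirely hypothetical names, of the pattern it describes: a gateway that sits outside the agent's reasoning loop, intercepts every tool call, and checks it against declarative rules before anything executes.

```python
# Conceptual sketch of gateway-style policy enforcement for agent tool calls.
# This is NOT the AgentCore API; all classes and names here are hypothetical.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolPolicy:
    allowed_tools: set[str]
    # Optional per-tool argument checks, e.g. restrict which records a call may touch.
    argument_checks: dict[str, Callable[[dict], bool]] = field(default_factory=dict)


class PolicyGateway:
    """Sits between the agent and its tools, outside the reasoning loop."""

    def __init__(self, policy: ToolPolicy, tools: dict[str, Callable[..., object]]):
        self.policy = policy
        self.tools = tools

    def invoke(self, tool_name: str, **kwargs) -> object:
        if tool_name not in self.policy.allowed_tools:
            raise PermissionError(f"Tool '{tool_name}' denied by policy")
        check = self.policy.argument_checks.get(tool_name)
        if check is not None and not check(kwargs):
            raise PermissionError(f"Arguments for '{tool_name}' violate policy")
        # Only reached when the call is explicitly permitted; the decision is made
        # independently of whatever the agent "believes" it should do.
        return self.tools[tool_name](**kwargs)


# Example: the agent may look up CRM records but never delete them.
gateway = PolicyGateway(
    policy=ToolPolicy(allowed_tools={"crm_lookup"}),
    tools={
        "crm_lookup": lambda customer_id: {"id": customer_id, "tier": "gold"},
        "crm_delete": lambda customer_id: None,
    },
)
print(gateway.invoke("crm_lookup", customer_id="c-42"))  # permitted
# gateway.invoke("crm_delete", customer_id="c-42")       # would raise PermissionError
```

The design point mirrors the post: the agent never grants its own permissions; an independent layer does, which is what makes agent behavior auditable.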

naver

Introduction to OpenTelemetry (feat. Collector)

NAVER is transitioning its internal search monitoring platform, SEER, to an architecture built on OpenTelemetry and open-source standards to achieve a more scalable and flexible observability environment. By adopting a vendor-agnostic approach, the engineering team aims to unify the collection of metrics, logs, and traces while contributing back to the global OpenTelemetry ecosystem. This shift underscores the importance of standardized telemetry protocols in managing complex, large-scale service infrastructures.

### Standardizing Observability with OTLP

* The transition focuses on the OpenTelemetry Protocol (OTLP) as the primary standard for transmitting telemetry data across the platform.
* Moving away from proprietary formats allows for a unified data model that encompasses metrics, traces, and logs, ensuring consistency across different services.
* A standardized protocol simplifies the integration of various open-source backends, reducing the engineering overhead associated with supporting multiple telemetry formats.

### The OpenTelemetry Collector Pipeline

* The Collector acts as a critical intermediary, decoupling the application layer from the storage backend to provide greater architectural flexibility.
* **Receivers** ingest data from diverse sources, supporting both OTLP-native applications and legacy systems.
* **Processors** enable data transformation, filtering, and metadata enrichment (such as adding resource attributes) before the data reaches its destination.
* **Exporters** manage the delivery of processed telemetry to specific backends like Prometheus for metrics or Jaeger for tracing, allowing for easy swaps of infrastructure components.

### Automated Management via OpenTelemetry Operator

* The OpenTelemetry Operator is utilized within Kubernetes environments to automate the deployment and lifecycle management of the Collector.
* It facilitates auto-instrumentation, allowing developers to collect telemetry from applications without manual code changes for every service.
* The Operator ensures that the observability stack scales dynamically alongside the production workloads it monitors.

### Open-Source Contribution and Community

* Beyond mere adoption, the NAVER engineering team actively participates in the OpenTelemetry community by sharing bug fixes and feature enhancements discovered during the SEER migration.
* This collaborative approach ensures that the specific requirements of high-traffic enterprise environments are reflected in the evolution of the OpenTelemetry project.

Adopting OpenTelemetry is a strategic move for organizations looking to avoid vendor lock-in and build a future-proof monitoring stack. For a successful implementation, teams should focus on mastering the Collector's pipeline configuration to balance data granularity with processing performance across distributed systems.
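As a minimal illustration of the application side of this pipeline, the snippet below uses the official OpenTelemetry Python SDK (packages `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-grpc`) to send spans over OTLP/gRPC to a Collector, which then applies its own processors and exporters as described above. The service name and endpoint are placeholder values; the post itself focuses on the Collector side rather than this exact instrumentation code.

```python
# Minimal OTLP instrumentation sketch with the official OpenTelemetry Python SDK.
# A Collector listening on the standard OTLP/gRPC port 4317 receives these spans
# through its OTLP receiver, runs its processors, and exports them to whatever
# backend is configured. Service name and endpoint are placeholders.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "search-frontend"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Any work wrapped in a span is exported via OTLP, regardless of which backend
# the Collector ultimately forwards it to (Prometheus, Jaeger, or something else).
with tracer.start_as_current_span("search.request") as span:
    span.set_attribute("search.keyword", "opentelemetry")
```

The receiver-processor-exporter pipeline itself lives in the Collector's own configuration, which is exactly the piece the OpenTelemetry Operator manages when the Collector runs in Kubernetes.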

toss

Frontend Code That Lasts

Toss Payments evolved its Payment SDK to solve the inherent complexities of integrating payment systems, where developers must navigate UI implementation, security flows, and exception handling. By transitioning from V1 to V2, the team moved beyond simply providing a library to building a robust, architecture-driven system that ensures stability and scalability across diverse merchant environments. The core conclusion is that a successful SDK must be treated as a critical infrastructure layer, relying on modular design and deep observability to handle the unpredictable nature of third-party runtimes.

## The Unique Challenges of SDK Development

* SDK code lives within the merchant's runtime environment, meaning it shares the same lifecycle and performance constraints as the merchant's own code.
* Internal logging can inadvertently create bottlenecks; for instance, adding network logs to a frequently called method can lead to "self-DDoS" scenarios that crash the merchant's payment page.
* Type safety is a major hurdle, as merchants may pass unexpected data types (e.g., a number instead of a string), causing fatal runtime errors like `startsWith is not a function`.
* The SDK acts as a bridge for technical communication, requiring it to function as both an API consumer for internal systems and an API provider for external developers.

## Ensuring Stability through Observability

* To manage the unpredictable ways merchants use the SDK, Toss implemented over 300 unit tests and 500 E2E integration tests based on real-world use cases.
* The team utilizes a "Global Trace ID" to track a single payment journey across both the frontend and backend, allowing for seamless debugging across the entire system.
* A custom Monitoring CLI was developed to compare payment success rates before and after deployments, categorized by merchant and runtime environment (e.g., PC Chrome vs. Android WebView).
* This observability infrastructure enables the team to quickly identify edge-case failures (such as a specific merchant's checkout failing only on mobile WebViews) that are often missed by standard QA processes.

## Scaling with Modular Architecture

* To avoid "if-statement hell" caused by merchant-specific requirements (e.g., fixing installment months or custom validation for a specific store), Toss moved to a "Lego-block" architecture.
* The SDK is organized into three distinct layers based on the "reason for change" principle, illustrated in the sketch at the end of this summary:
    * **Public Interface Layer:** Manages the contract with the merchant, validating inputs and translating them into internal domain models.
    * **Domain Layer:** Encapsulates core business logic and payment policies, keeping them isolated from external changes.
    * **External Service Layer:** Handles dependencies like Server APIs and Web APIs, ensuring technical shifts don't leak into the business logic.
* This separation allows the team to implement custom merchant logic by swapping specific blocks without modifying the core codebase, reducing the risk of regressions and lowering maintenance costs.

For developers building SDKs or integration tools, the shift from monolithic logic to a layered, observable architecture is essential. Prioritizing the separation of domain logic from public interfaces and investing in environment-specific monitoring allows for a highly flexible product that remains stable even as the client-side environment grows increasingly complex.
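To make the layer boundaries tangible, here is a deliberately small, language-agnostic sketch of the three-layer split. The actual SDK runs in JavaScript environments and its internals are not shown in the post, so every class and field name below is hypothetical: the public interface validates and normalizes whatever the merchant passes in, the domain layer owns the payment rules, and the external service layer is a swappable dependency.

```python
# Language-agnostic sketch of the three-layer split described above.
# The real Toss Payments SDK is JavaScript; all names here are hypothetical.
from dataclasses import dataclass
from typing import Protocol


# --- External Service Layer: technical dependencies hidden behind an interface ---
class PaymentGateway(Protocol):
    def request_payment(self, order_id: str, amount: int) -> str: ...


# --- Domain Layer: business rules, isolated from merchants and transports ---
@dataclass(frozen=True)
class PaymentRequest:
    order_id: str
    amount: int

    def __post_init__(self):
        if self.amount <= 0:
            raise ValueError("amount must be positive")


# --- Public Interface Layer: the contract with the merchant ---
class PaymentsSDK:
    def __init__(self, gateway: PaymentGateway):
        self._gateway = gateway

    def request_payment(self, order_id, amount) -> str:
        # Merchants may pass unexpected types (e.g. a numeric order id), so the
        # boundary validates and normalizes input before it reaches the domain.
        request = PaymentRequest(order_id=str(order_id), amount=int(amount))
        return self._gateway.request_payment(request.order_id, request.amount)
```

Merchant-specific behavior can then be handled by swapping the gateway or interface block for a given store, without touching the domain rules that every integration shares.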