sre

3 posts

meta

DrP: Meta's Root Cause Analysis Platform at Scale - Engineering at Meta

DrP is Meta's programmatic root cause analysis (RCA) platform, designed to automate incident investigations and reduce the burden of manual on-call work. By codifying investigation playbooks into executable "analyzers," the platform has cut mean time to resolution (MTTR) by 20% to 80% for more than 300 teams. This systematic approach replaces ad hoc manual scripts with a scalable backend that executes 50,000 automated analyses daily, providing immediate context when alerts fire.

## Architecture and Core Components

* **Expressive SDK:** Provides a framework for engineers to codify investigation workflows into "analyzers," with a rich library of helper functions and machine learning algorithms.
* **Built-in Analysis Tools:** The platform includes native support for anomaly detection, event isolation, time-series correlation, and dimension analysis to pinpoint specific problem areas.
* **Scalable Backend:** A multi-tenant execution environment manages a worker pool that handles thousands of requests securely and asynchronously.
* **Workflow Integration:** DrP is integrated directly into Meta's internal alerting and incident management systems, allowing analyses to be triggered automatically without human intervention.

## Authoring and Verification Workflow

* **Template Bootstrapping:** Engineers use the SDK to generate boilerplate code that captures required input parameters and context in a type-safe manner.
* **Analyzer Chaining:** Context can be passed between analyzers, enabling investigations that span multiple interconnected services.
* **Automated Backtesting:** Before deployment, analyzers undergo automated backtesting integrated into the code review process to verify accuracy and performance.
* **Decision Tree Logic:** Investigation steps are modeled as decision trees in code, allowing an analyzer to follow different paths depending on the data it retrieves (see the sketch after this summary).

## Execution and Post-Processing

* **Trigger-based Analysis:** When an alert fires, the backend automatically queues the relevant analyzer, so findings are available as soon as an engineer begins triage.
* **Automated Mitigation:** A post-processing system can take direct action based on investigation results, such as creating tasks or submitting pull requests to resolve identified issues.
* **DrP Insights:** This system periodically reviews historical analysis outputs to identify and rank the top causes of alerts, helping teams prioritize long-term reliability fixes.
* **Alert Annotation:** Results are presented in both human-readable text and machine-readable formats, directly annotating the incident for the on-call responder.

## Practical Conclusion

Organizations managing large-scale distributed systems should move from static markdown playbooks to executable investigation code. By implementing a programmatic RCA framework like DrP, teams can scale their troubleshooting expertise and reduce on-call fatigue by automating the repetitive triage steps that typically consume the first hour of an incident.
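DrP's SDK is internal to Meta and not publicly documented, so the sketch below is only a rough illustration of what codifying a playbook as a decision-tree analyzer might look like. The names (`AnalyzerContext`, `Finding`, `fetch_metric`, `recent_deploys`) and the branching logic are hypothetical stand-ins, not DrP APIs.

```python
# Hypothetical sketch of a playbook codified as an analyzer with decision-tree
# logic. All names and helpers here are illustrative, not part of DrP's SDK.
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Finding:
    summary: str      # human-readable conclusion shown to the on-call engineer
    evidence: dict    # machine-readable context for downstream automation


@dataclass
class AnalyzerContext:
    alert_id: str
    service: str
    findings: list = field(default_factory=list)


def fetch_metric(service: str, name: str, minutes: int) -> list[float]:
    """Stand-in for a metrics-query helper; returns recent samples."""
    # A real analyzer would query the monitoring backend here.
    return [0.01] * (minutes - 5) + [0.20] * 5


def recent_deploys(service: str, minutes: int) -> list[str]:
    """Stand-in for a deployment-history lookup."""
    return ["release-2024-06-01.3"]


def analyze_error_spike(ctx: AnalyzerContext) -> AnalyzerContext:
    """Decision tree: confirm the spike, then branch on likely causes."""
    errors = fetch_metric(ctx.service, "error_rate", minutes=30)
    baseline, current = mean(errors[:-5]), mean(errors[-5:])

    if current < baseline * 2:
        ctx.findings.append(Finding("Error rate within normal range.", {}))
        return ctx

    # Branch 1: correlate the spike with a recent code push.
    deploys = recent_deploys(ctx.service, minutes=60)
    if deploys:
        ctx.findings.append(Finding(
            f"Error rate rose {current / baseline:.1f}x shortly after a deploy.",
            {"suspect_deploys": deploys, "baseline": baseline, "current": current},
        ))
        return ctx

    # Branch 2: no deploy found; hand off to another analyzer (chaining).
    ctx.findings.append(Finding(
        "Error spike with no recent deploy; escalating to dependency analysis.",
        {"next_analyzer": "dependency_health"},
    ))
    return ctx


if __name__ == "__main__":
    result = analyze_error_spike(AnalyzerContext(alert_id="A123", service="feed"))
    for finding in result.findings:
        print(finding.summary)
```

The value of this style is that the "if a deploy happened, suspect it; otherwise check dependencies" reasoning lives in reviewable, backtestable code rather than in a wiki page the on-call engineer has to follow by hand.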

woowahan

How Woowa Brothers Detects Failures

Woowa Brothers addresses the inevitability of system failures by shifting from traditional resource-based monitoring to a dedicated service anomaly detection system. By focusing on high-level service metrics such as order volume and login counts rather than only CPU or memory usage, the team can identify incidents that directly affect the user experience. This approach provides near real-time detection and a structured response framework to minimize damage during peak service hours.

### The Shift to Service-Level Monitoring

* Traditional monitoring focuses on infrastructure metrics like CPU and memory, but it is impossible to monitor every system variable, which leaves "blind spots" in failure detection.
* Service metrics, such as real-time login counts and payment success rates, are finite in number and directly reflect the actual customer experience.
* By monitoring these core indicators, the SRE team can detect anomalies that system-level alerts might overlook, ensuring that no failure goes unnoticed.

### Requirements for Effective Anomaly Detection

* **Real-time performance:** Alerts must fire in near real time to allow intervention before the impact scales.
* **Explainability:** The system favors transparent logic over "black-box" AI models, so developers can quickly understand why an alert was triggered and how to improve the detection logic.
* **Integrated response:** Beyond detection, the system must provide a clear response process so that any engineer, regardless of experience, can follow a standardized path to resolution.

### Technical Implementation and Logic

* The system leverages the predictable, pattern-based nature of delivery-service traffic, which typically peaks at lunch and dinner.
* The team chose a median-based approach to generate "Prediction" values from historical data, as it is more robust to outliers and easier to reason about than methods like IQR or 2-sigma.
* Detection compares "Actual" values against "Warning" and "Critical" thresholds derived from the predicted median.
* To prevent false positives from temporary spikes, the system tracks "threshold reach counts," requiring a metric to stay in an abnormal state for a set number of consecutive cycles before firing a Slack alert (see the sketch after this summary).

### Optimization of Alert Accuracy

* Each service metric needs a tailored settling period to balance detection speed against accuracy.
* A high threshold reach count improves accuracy but slows detection, while a low count speeds up detection at the risk of more false positives.
* Alerts are delivered via Slack with full context, including current status and urgency, to support rapid decision-making.

For organizations running high-traffic services, prioritizing service-level indicators (SLIs) over infrastructure metrics can significantly reduce the time to detect critical failures. Simple, explainable statistical models like the median approach let teams maintain a monitoring system that evolves alongside the service without the complexity of uninterpretable AI models.
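A minimal sketch of the median-based detection loop described above. The warning/critical ratios, the history window, and the required number of consecutive abnormal cycles are assumptions chosen for illustration; Woowa Brothers' actual thresholds and settling periods are not given in the article summary.

```python
# Illustrative median-based anomaly detection with "threshold reach counts".
# Ratios, window sizes, and consecutive-cycle counts are example values only.
from statistics import median


class ServiceMetricDetector:
    def __init__(self, history, warning_ratio=0.8, critical_ratio=0.6,
                 required_consecutive=3):
        # history: past values for the same time slot (e.g. Mondays at 12:00).
        self.history = history
        self.warning_ratio = warning_ratio      # Actual < 80% of Prediction -> Warning
        self.critical_ratio = critical_ratio    # Actual < 60% of Prediction -> Critical
        self.required_consecutive = required_consecutive
        self.abnormal_streak = 0                # the "threshold reach count"

    def prediction(self) -> float:
        """Median of historical values; robust to one-off outliers."""
        return median(self.history)

    def check(self, actual: float):
        pred = self.prediction()
        if actual < pred * self.critical_ratio:
            level = "CRITICAL"
        elif actual < pred * self.warning_ratio:
            level = "WARNING"
        else:
            level = None

        # Require several consecutive abnormal cycles before alerting,
        # so a single temporary dip does not page anyone.
        if level is None:
            self.abnormal_streak = 0
            return None
        self.abnormal_streak += 1
        if self.abnormal_streak >= self.required_consecutive:
            return f"{level}: actual={actual:.0f}, predicted median={pred:.0f}"
        return None


if __name__ == "__main__":
    # Historical order counts for the same time slot on previous days.
    detector = ServiceMetricDetector(history=[1000, 980, 1020, 990, 1010])
    for observed in [950, 700, 650, 640]:       # simulated real-time samples
        alert = detector.check(observed)
        print(observed, "->", alert or "ok")
```

Tuning `required_consecutive` is exactly the settling-period trade-off the article describes: raising it suppresses false positives, lowering it shortens time to detection.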

naver

Introduction to OpenTelemetry (feat. Collector)

NAVER is transitioning its internal search monitoring platform, SEER, to an architecture built on OpenTelemetry and open-source standards to achieve a more scalable and flexible observability environment. By adopting a vendor-agnostic approach, the engineering team aims to unify the collection of metrics, logs, and traces while contributing back to the global OpenTelemetry ecosystem. The shift underscores the importance of standardized telemetry protocols in managing complex, large-scale service infrastructures.

### Standardizing Observability with OTLP

* The transition centers on the OpenTelemetry Protocol (OTLP) as the primary standard for transmitting telemetry data across the platform.
* Moving away from proprietary formats allows a unified data model that covers metrics, traces, and logs, ensuring consistency across services.
* A standardized protocol simplifies the integration of various open-source backends and reduces the engineering overhead of supporting multiple telemetry formats.

### The OpenTelemetry Collector Pipeline

* The Collector acts as an intermediary that decouples the application layer from the storage backend, providing greater architectural flexibility.
* **Receivers** ingest data from diverse sources, supporting both OTLP-native applications and legacy systems.
* **Processors** handle data transformation, filtering, and metadata enrichment (such as adding resource attributes) before the data reaches its destination.
* **Exporters** deliver processed telemetry to specific backends, such as Prometheus for metrics or Jaeger for tracing, making it easy to swap infrastructure components (see the sketch after this summary).

### Automated Management via the OpenTelemetry Operator

* The OpenTelemetry Operator is used in Kubernetes environments to automate deployment and lifecycle management of the Collector.
* It supports auto-instrumentation, allowing telemetry to be collected from applications without manual code changes for every service.
* The Operator ensures that the observability stack scales dynamically alongside the production workloads it monitors.

### Open-Source Contribution and Community

* Beyond adoption, the NAVER engineering team actively participates in the OpenTelemetry community, contributing bug fixes and feature enhancements discovered during the SEER migration.
* This collaboration helps ensure that the requirements of high-traffic enterprise environments are reflected in the evolution of the OpenTelemetry project.

Adopting OpenTelemetry is a strategic move for organizations that want to avoid vendor lock-in and build a future-proof monitoring stack. For a successful implementation, teams should focus on mastering the Collector's pipeline configuration to balance data granularity with processing performance across distributed systems.
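As a concrete reference point, the sketch below shows how an application can export traces over OTLP to a Collector using the open-source OpenTelemetry Python SDK (`opentelemetry-sdk` plus `opentelemetry-exporter-otlp-proto-grpc`, with a Collector listening on the default gRPC port). The service name, endpoint, and span attributes are placeholders; this is a generic example, not NAVER's internal SEER configuration.

```python
# Export traces from an application to an OpenTelemetry Collector via OTLP/gRPC.
# The Collector's receivers -> processors -> exporters pipeline then routes the
# data to whichever backend (e.g. Jaeger) is configured, so the application
# code never needs to know about the storage layer.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Resource attributes identify the emitting service; "search-frontend" is a placeholder.
resource = Resource.create({"service.name": "search-frontend"})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        # Collector's OTLP gRPC receiver; default port 4317, insecure for local testing.
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# A span representing one unit of work; attributes here are example values.
with tracer.start_as_current_span("handle-search-request") as span:
    span.set_attribute("search.query", "weather")
```

Because the application only speaks OTLP to the Collector, swapping the tracing backend (or adding a metrics exporter) becomes a Collector configuration change rather than an application redeploy, which is the decoupling the article emphasizes.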