DrP: Meta's Root Cause Analysis Platform at Scale - Engineering at Meta

DrP is Meta’s programmatic root cause analysis (RCA) platform, designed to automate incident investigations and reduce the burden of manual on-call work. By codifying investigation playbooks into executable "analyzers," the platform reduces mean time to resolve (MTTR) by 20% to 80% across more than 300 teams. This systematic approach replaces outdated manual scripts with a scalable backend that runs 50,000 automated analyses daily and provides immediate context when alerts fire.

Architecture and Core Components

  • Expressive SDK: Provides a framework for engineers to codify investigation workflows into "analyzers," utilizing a rich library of helper functions and machine learning algorithms.
  • Built-in Analysis Tools: The platform includes native support for anomaly detection, event isolation, time-series correlation, and dimension analysis to identify specific problem areas.
  • Scalable Backend: A multi-tenant execution environment manages a worker pool that handles thousands of requests securely and asynchronously.
  • Workflow Integration: DrP is integrated directly into Meta’s internal alerting and incident management systems, allowing for automatic triggering without human intervention.
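
To make the built-in analysis tools more concrete, here is a minimal Python sketch of two of the listed techniques: anomaly detection and dimension analysis. It is not DrP's SDK, which is internal to Meta. The function names (`find_anomalies`, `top_dimension_values`), the z-score method, and the `errors` row field are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def find_anomalies(series, baseline_len, threshold=3.0):
    """Return indices after the baseline whose z-score exceeds the threshold.

    Hypothetical helper: a plain z-score test against a baseline window,
    standing in for whatever detector a real platform would provide.
    """
    baseline = series[:baseline_len]
    mu = mean(baseline)
    sigma = stdev(baseline)
    later = range(baseline_len, len(series))
    if sigma == 0:
        # A flat baseline makes any deviation anomalous.
        return [i for i in later if series[i] != mu]
    return [i for i in later if abs(series[i] - mu) / sigma > threshold]


def top_dimension_values(before, after, dimension):
    """Rank values of one dimension by growth in error count between windows.

    Each row is a dict with the dimension key and an "errors" count
    (an assumed schema, not one taken from the source).
    """
    delta = defaultdict(int)
    for row in before:
        delta[row[dimension]] -= row["errors"]
    for row in after:
        delta[row[dimension]] += row["errors"]
    return sorted(delta.items(), key=lambda kv: kv[1], reverse=True)


error_counts = [10, 11, 9, 10, 10, 11, 9, 10, 50, 10]
spikes = find_anomalies(error_counts, baseline_len=8)

before = [{"region": "a", "errors": 5}, {"region": "b", "errors": 5}]
after = [{"region": "a", "errors": 6}, {"region": "b", "errors": 40}]
ranking = top_dimension_values(before, after, "region")
```

In this sketch the spike at index 8 is flagged, and region `b` ranks first because its error count grew the most, which is the kind of narrowing that dimension analysis is meant to provide.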

Authoring and Verification Workflow

  • Template Bootstrapping: Engineers use the SDK to generate boilerplate code that captures required input parameters and context in a type-safe manner.
  • Analyzer Chaining: The system allows for seamless dependency analysis by passing context between different analyzers, enabling investigations to span multiple interconnected services.
  • Automated Backtesting: Before deployment, analyzers undergo automated backtesting integrated into the code review process to ensure accuracy and performance.
  • Decision Tree Logic: Investigation steps are modeled as decision trees within the code, allowing the analyzer to follow different paths based on the data it retrieves.
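
The authoring ideas above (typed context, decision-tree logic, analyzer chaining, and backtesting) can be sketched in a few lines of Python. Everything here is hypothetical: the `Context` class, the metric names, the thresholds, and the `backtest` helper are assumptions for illustration and do not reflect the actual DrP SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Context:
    """Typed investigation context passed between analyzers (assumed shape)."""
    service: str
    metrics: Dict[str, float]
    findings: List[str] = field(default_factory=list)


Analyzer = Callable[[Context], Context]


def dependency_analyzer(ctx: Context) -> Context:
    # Leaf of the decision tree: inspect a downstream dependency.
    if ctx.metrics.get("db_latency_ms", 0.0) > 500:
        ctx.findings.append("database latency is elevated")
    else:
        ctx.findings.append("dependencies look healthy")
    return ctx


def service_analyzer(ctx: Context) -> Context:
    # Each branch follows a different path based on the retrieved data.
    if ctx.metrics.get("error_rate", 0.0) <= 0.01:
        ctx.findings.append("error rate within normal range")
        return ctx
    if ctx.metrics.get("recent_deploy", 0.0) > 0:
        ctx.findings.append("errors started after a recent deploy")
        return ctx
    ctx.findings.append("no recent deploy; checking dependencies")
    # Chaining: hand the same context to the next analyzer.
    return dependency_analyzer(ctx)


def backtest(analyzer: Analyzer, cases: List[Tuple[Context, str]]) -> float:
    """Fraction of historical cases where the expected finding was produced."""
    hits = sum(1 for ctx, expected in cases if expected in analyzer(ctx).findings)
    return hits / len(cases)


result = service_analyzer(Context("web", {"error_rate": 0.2, "db_latency_ms": 900.0}))

accuracy = backtest(service_analyzer, [
    (Context("web", {"error_rate": 0.2, "recent_deploy": 1.0}),
     "errors started after a recent deploy"),
    (Context("api", {"error_rate": 0.001}),
     "database latency is elevated"),
])
```

A check like `backtest` could run as part of code review, so a change to an analyzer that lowers its accuracy on known incidents is caught before deployment.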

Execution and Post-Processing

  • Trigger-based Analysis: When an alert is activated, the backend automatically queues the relevant analyzer, ensuring findings are available as soon as an engineer begins triaging.
  • Automated Mitigation: A post-processing system can take direct action based on investigation results, such as creating tasks or submitting pull requests to resolve identified issues.
  • DrP Insights: This system periodically reviews historical analysis outputs to identify and rank the top causes of alerts, helping teams prioritize long-term reliability fixes.
  • Alert Annotation: Results are presented in both human-readable text and machine-readable formats, directly annotating the incident logs for the on-call responder.
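
The execution flow described above can be approximated with a small in-memory Python sketch: an alert queues its registered analyzer, the result is rendered in both text and JSON for annotation, and historical results are ranked by cause. The registry, the alert fields, and the function names are assumptions for illustration. A production backend would use a durable queue and a worker pool rather than a `deque`.

```python
import json
from collections import Counter, deque
from typing import Callable, Deque, Dict, List, Tuple

Alert = Dict[str, str]
Result = Dict[str, str]


def check_error_rate(alert: Alert) -> Result:
    # Stand-in for a real investigation; it always blames a deploy.
    return {"alert": alert["name"], "service": alert["service"], "cause": "bad_deploy"}


# Hypothetical mapping from alert name to the analyzer that handles it.
REGISTRY: Dict[str, Callable[[Alert], Result]] = {"high_error_rate": check_error_rate}


def enqueue(queue: Deque[Alert], alert: Alert) -> None:
    """Queue an alert only if an analyzer is registered for it."""
    if alert["name"] in REGISTRY:
        queue.append(alert)


def drain(queue: Deque[Alert]) -> List[Result]:
    """Run the registered analyzer for every queued alert, in arrival order."""
    results = []
    while queue:
        alert = queue.popleft()
        results.append(REGISTRY[alert["name"]](alert))
    return results


def annotate(result: Result) -> Dict[str, str]:
    """Render one result as human-readable text and machine-readable JSON."""
    text = f"{result['service']}: likely cause is {result['cause']}"
    return {"text": text, "json": json.dumps(result, sort_keys=True)}


def rank_causes(history: List[Result]) -> List[Tuple[str, int]]:
    """Rank causes across historical results, most frequent first."""
    return Counter(r["cause"] for r in history).most_common()


queue: Deque[Alert] = deque()
enqueue(queue, {"name": "high_error_rate", "service": "web"})
enqueue(queue, {"name": "unregistered_alert", "service": "batch"})
results = drain(queue)
annotation = annotate(results[0])
top_causes = rank_causes([{"cause": "a"}, {"cause": "b"}, {"cause": "a"}])
```

Keeping the JSON form alongside the text is what would let a post-processing step act on the result, for example by filing a task, without parsing prose.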

Practical Conclusion

Organizations managing large-scale distributed systems should transition from static markdown playbooks to executable investigation code. By implementing a programmatic RCA framework like DrP, teams can scale their troubleshooting expertise and significantly reduce "on-call fatigue" by automating the repetitive triage steps that typically consume the first hour of an incident.