DrP: Meta's Root Cause Analysis Platform at Scale - Engineering at Meta (opens in new tab)
DrP is Meta’s programmatic root cause analysis (RCA) platform designed to automate incident investigations and reduce the burden of manual on-call tasks. By codifying investigation playbooks into executable "analyzers," the platform significantly lowers the mean time to resolve (MTTR) by 20% to 80% for over 300 teams. This systematic approach replaces outdated manual scripts with a scalable backend that executes 50,000 automated analyses daily, providing immediate context when alerts fire.
Architecture and Core Components
- Expressive SDK: Provides a framework for engineers to codify investigation workflows into "analyzers," utilizing a rich library of helper functions and machine learning algorithms.
- Built-in Analysis Tools: The platform includes native support for anomaly detection, event isolation, time-series correlation, and dimension analysis to identify specific problem areas.
- Scalable Backend: A multi-tenant execution environment manages a worker pool that handles thousands of requests securely and asynchronously.
- Workflow Integration: DrP is integrated directly into Meta’s internal alerting and incident management systems, allowing for automatic triggering without human intervention.
Authoring and Verification Workflow
- Template Bootstrapping: Engineers use the SDK to generate boilerplate code that captures required input parameters and context in a type-safe manner.
- Analyzer Chaining: The system allows for seamless dependency analysis by passing context between different analyzers, enabling investigations to span multiple interconnected services.
- Automated Backtesting: Before deployment, analyzers undergo automated backtesting integrated into the code review process to ensure accuracy and performance.
- Decision Tree Logic: Investigation steps are modeled as decision trees within the code, allowing the analyzer to follow different paths based on the data it retrieves.
Execution and Post-Processing
- Trigger-based Analysis: When an alert is activated, the backend automatically queues the relevant analyzer, ensuring findings are available as soon as an engineer begins triaging.
- Automated Mitigation: A post-processing system can take direct action based on investigation results, such as creating tasks or submitting pull requests to resolve identified issues.
- DrP Insights: This system periodically reviews historical analysis outputs to identify and rank the top causes of alerts, helping teams prioritize long-term reliability fixes.
- Alert Annotation: Results are presented in both human-readable text and machine-readable formats, directly annotating the incident logs for the on-call responder.
Practical Conclusion
Organizations managing large-scale distributed systems should transition from static markdown playbooks to executable investigation code. By implementing a programmatic RCA framework like DrP, teams can scale their troubleshooting expertise and significantly reduce "on-call fatigue" by automating the repetitive triage steps that typically consume the first hour of an incident.