meta Dec 19, 2025

DrP: Meta's Root Cause Analysis Platform at Scale - Engineering at Meta (opens in new tab)

ai machine-learning automation incident-management observability anomaly-detection sre root-cause-analysis

DrP is Meta’s programmatic root cause analysis (RCA) platform designed to automate incident investigations and reduce the burden of manual on-call tasks. By codifying investigation playbooks into executable "analyzers," the platform significantly lowers the mean time to resolve (MTTR) by 20% to 80% for over 300 teams. This systematic approach replaces outdated manual scripts with a scalable backend that executes 50,000 automated analyses daily, providing immediate context when alerts fire.

Architecture and Core Components

Expressive SDK: Provides a framework for engineers to codify investigation workflows into "analyzers," utilizing a rich library of helper functions and machine learning algorithms.
Built-in Analysis Tools: The platform includes native support for anomaly detection, event isolation, time-series correlation, and dimension analysis to identify specific problem areas.
Scalable Backend: A multi-tenant execution environment manages a worker pool that handles thousands of requests securely and asynchronously.
Workflow Integration: DrP is integrated directly into Meta’s internal alerting and incident management systems, allowing for automatic triggering without human intervention.

Authoring and Verification Workflow

Template Bootstrapping: Engineers use the SDK to generate boilerplate code that captures required input parameters and context in a type-safe manner.
Analyzer Chaining: The system allows for seamless dependency analysis by passing context between different analyzers, enabling investigations to span multiple interconnected services.
Automated Backtesting: Before deployment, analyzers undergo automated backtesting integrated into the code review process to ensure accuracy and performance.
Decision Tree Logic: Investigation steps are modeled as decision trees within the code, allowing the analyzer to follow different paths based on the data it retrieves.

Execution and Post-Processing

Trigger-based Analysis: When an alert is activated, the backend automatically queues the relevant analyzer, ensuring findings are available as soon as an engineer begins triaging.
Automated Mitigation: A post-processing system can take direct action based on investigation results, such as creating tasks or submitting pull requests to resolve identified issues.
DrP Insights: This system periodically reviews historical analysis outputs to identify and rank the top causes of alerts, helping teams prioritize long-term reliability fixes.
Alert Annotation: Results are presented in both human-readable text and machine-readable formats, directly annotating the incident logs for the on-call responder.

Practical Conclusion

Organizations managing large-scale distributed systems should transition from static markdown playbooks to executable investigation code. By implementing a programmatic RCA framework like DrP, teams can scale their troubleshooting expertise and significantly reduce "on-call fatigue" by automating the repetitive triage steps that typically consume the first hour of an incident.