우아한Tech에 문의하기 개발자 채용 장애 대응의 성패를 가르는 First Action: 우아한형제들의 장애 관리 라이프사이클 Backend Infra First Action에 따라 달라지는 장애 영향 우아한형제들의 2025년 장애를 돌아보면 인지는 비교적 빠른 편이었습니다. 그러나, 장애로 고객 경험의 악영향이 오래 이어진 사례들이 적지 않았습니다. 장애 대응 과정을 하나씩 다시 들여다보면 차이는 대부분 인지 이후 가장 먼저 어떤 조치를 취했는지, 즉 First Action(초동 조치)에서 시…
DrP is Meta’s programmatic root cause analysis (RCA) platform designed to automate incident investigations and reduce the burden of manual on-call tasks. By codifying investigation playbooks into executable "analyzers," the platform significantly lowers the mean time to resolve (MTTR) by 20% to 80% for over 300 teams. This systematic approach replaces outdated manual scripts with a scalable backend that executes 50,000 automated analyses daily, providing immediate context when alerts fire.
## Architecture and Core Components
* **Expressive SDK:** Provides a framework for engineers to codify investigation workflows into "analyzers," utilizing a rich library of helper functions and machine learning algorithms.
* **Built-in Analysis Tools:** The platform includes native support for anomaly detection, event isolation, time-series correlation, and dimension analysis to identify specific problem areas.
* **Scalable Backend:** A multi-tenant execution environment manages a worker pool that handles thousands of requests securely and asynchronously.
* **Workflow Integration:** DrP is integrated directly into Meta’s internal alerting and incident management systems, allowing for automatic triggering without human intervention.
## Authoring and Verification Workflow
* **Template Bootstrapping:** Engineers use the SDK to generate boilerplate code that captures required input parameters and context in a type-safe manner.
* **Analyzer Chaining:** The system allows for seamless dependency analysis by passing context between different analyzers, enabling investigations to span multiple interconnected services.
* **Automated Backtesting:** Before deployment, analyzers undergo automated backtesting integrated into the code review process to ensure accuracy and performance.
* **Decision Tree Logic:** Investigation steps are modeled as decision trees within the code, allowing the analyzer to follow different paths based on the data it retrieves.
## Execution and Post-Processing
* **Trigger-based Analysis:** When an alert is activated, the backend automatically queues the relevant analyzer, ensuring findings are available as soon as an engineer begins triaging.
* **Automated Mitigation:** A post-processing system can take direct action based on investigation results, such as creating tasks or submitting pull requests to resolve identified issues.
* **DrP Insights:** This system periodically reviews historical analysis outputs to identify and rank the top causes of alerts, helping teams prioritize long-term reliability fixes.
* **Alert Annotation:** Results are presented in both human-readable text and machine-readable formats, directly annotating the incident logs for the on-call responder.
## Practical Conclusion
Organizations managing large-scale distributed systems should transition from static markdown playbooks to executable investigation code. By implementing a programmatic RCA framework like DrP, teams can scale their troubleshooting expertise and significantly reduce "on-call fatigue" by automating the repetitive triage steps that typically consume the first hour of an incident.