우아한형제들 / incident-management

2 posts

woowahan

The First Action That Determines the Success of Incident Response: Woowa Brothers' Incident Management Lifecycle

Backend Infra

Incident impact varies with the First Action: looking back at Woowa Brothers' incidents in 2025, detection was relatively quick, but there were quite a few cases where the negative impact on customer experience persisted for a long time. Going back through each incident response step by step, the difference mostly came down to what action was taken first after detection, that is, the First Action (initial response)…

woowahan

How Woowa Brothers Detects Incidents Without Missing Them | Woowa Brothers Tech Blog

Woowa Brothers addresses the inevitability of system failures by shifting from traditional resource-based monitoring to a specialized Service Anomaly Detection system. By focusing on high-level service metrics such as order volume and login counts rather than just CPU or memory usage, they can identify incidents that directly impact the user experience. This approach ensures near real-time detection and provides a structured response framework to minimize damage during peak service hours.

### The Shift to Service-Level Monitoring

* Traditional monitoring focuses on infrastructure metrics like CPU and memory, but it is impossible to monitor every system variable, leading to "blind spots" in failure detection.
* Service metrics, such as real-time login counts and payment success rates, are finite and offer a direct reflection of the actual customer experience.
* By monitoring these core indicators, the SRE team can detect anomalies that system-level alerts might overlook, ensuring that no failure goes unnoticed.

### Requirements for Effective Anomaly Detection

* **Real-time performance:** Alerts must be triggered in near real time to allow for immediate intervention before the impact scales.
* **Explainability:** The system favors transparent logic over "black-box" AI models, allowing developers to quickly understand why an alert was triggered and how to improve the detection logic.
* **Integrated response:** Beyond detection, the system must provide a clear response process so that any engineer, regardless of experience, can follow a standardized path to resolution.

### Technical Implementation and Logic

* The system leverages the predictable, pattern-based nature of delivery service traffic, which typically peaks during lunch and dinner.
* The team chose a median-based approach to generate "Prediction" values from historical data, as it is more robust against outliers and easier to analyze than more complex methods such as IQR or 2-sigma.
* Detection is determined by comparing "Actual" values against "Warning" and "Critical" thresholds derived from the predicted median.
* To prevent false positives caused by temporary spikes, the system tracks "threshold reach counts," requiring a metric to stay in an abnormal state for a specific number of consecutive cycles before firing a Slack alert (see the sketch after this summary).

### Optimization of Alert Accuracy

* Each service metric requires a tailored "settling period" to find the optimal balance between detection speed and accuracy.
* Setting a high threshold reach count improves accuracy but slows detection, while a low count accelerates detection at the risk of more false positives.
* Alerts are delivered via Slack with comprehensive context, including current status and urgency, to facilitate rapid decision-making.

For organizations running high-traffic services, prioritizing service-level indicators (SLIs) over infrastructure metrics can significantly reduce the time to detect critical failures. Implementing simple, explainable statistical models like the median approach allows teams to maintain a reliable monitoring system that evolves alongside the service without the complexity of uninterpretable AI models.
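To make the detection logic described above concrete, here is a minimal Python sketch of a median-based check with warning/critical thresholds and a consecutive-cycle ("threshold reach count") gate before alerting. The ratio constants, the `AnomalyDetector` class, and the metric names are illustrative assumptions; the post does not publish Woowa Brothers' actual code or threshold values.

```python
import statistics
from collections import defaultdict

# Illustrative constants (assumptions, not values from the post):
WARNING_RATIO = 0.8        # warn when the actual value drops below 80% of the prediction
CRITICAL_RATIO = 0.6       # critical when it drops below 60% of the prediction
THRESHOLD_REACH_COUNT = 3  # consecutive abnormal cycles required before alerting


def predict(history: list[float]) -> float:
    """Prediction for the current time slot: the median of the same slot on past days."""
    return statistics.median(history)


class AnomalyDetector:
    def __init__(self) -> None:
        # metric name -> number of consecutive cycles the metric has been abnormal
        self._abnormal_streak: defaultdict[str, int] = defaultdict(int)

    def check(self, metric: str, actual: float, history: list[float]) -> str | None:
        """Compare an actual value against median-derived thresholds.

        Returns an alert message once the metric has stayed abnormal for
        THRESHOLD_REACH_COUNT consecutive cycles, otherwise None.
        """
        predicted = predict(history)
        if actual < predicted * CRITICAL_RATIO:
            level = "CRITICAL"
        elif actual < predicted * WARNING_RATIO:
            level = "WARNING"
        else:
            self._abnormal_streak[metric] = 0  # back to normal: reset the streak
            return None

        self._abnormal_streak[metric] += 1
        # A temporary dip increments the streak but does not alert; only a
        # sustained deviation crosses the threshold reach count and fires.
        if self._abnormal_streak[metric] >= THRESHOLD_REACH_COUNT:
            return f"[{level}] {metric}: actual={actual:.0f}, predicted={predicted:.0f}"
        return None


# Usage: feed one data point per collection cycle; a returned message would be
# posted to Slack by the caller.
detector = AnomalyDetector()
history = [9800, 10100, 9950, 10240, 9870]  # same time slot on previous days
for actual in (4200, 4100, 3900):           # three consecutive abnormal cycles
    alert = detector.check("order_volume", actual, history)
print(alert)  # "[CRITICAL] order_volume: actual=3900, predicted=9950"
```

The design choice mirrored here is the one the summary highlights: a single abnormal sample never alerts on its own, and only a deviation from the predicted median that persists for `THRESHOLD_REACH_COUNT` cycles produces a message, trading a few cycles of detection latency for fewer false positives.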