# How Woowa Brothers Detects Failures

Woowa Brothers addresses the inevitability of system failures by shifting from traditional resource-based monitoring to a dedicated Service Anomaly Detection system. By focusing on high-level service metrics such as order volume and login counts rather than just CPU or memory usage, the team can identify incidents that directly impact the user experience. This approach enables near-real-time detection and provides a structured response framework to minimize damage during peak service hours.

### The Shift to Service-Level Monitoring

* Traditional monitoring focuses on infrastructure metrics like CPU and memory, but it is impossible to monitor every system variable, leaving "blind spots" in failure detection.
* Service metrics, such as real-time login counts and payment success rates, are limited in number and directly reflect the actual customer experience.
* By monitoring these core indicators, the SRE team can detect anomalies that system-level alerts might overlook, greatly reducing the chance that a failure goes unnoticed.

### Requirements for Effective Anomaly Detection

* **Real-time performance:** Alerts must be triggered in near-real-time to allow intervention before the impact scales.
* **Explainability:** The system favors transparent logic over "black-box" AI models, so developers can quickly understand why an alert fired and how to improve the detection logic.
* **Integrated response:** Beyond detection, the system must provide a clear response process so that any engineer, regardless of experience, can follow a standardized path to resolution.

### Technical Implementation and Logic

* The system leverages the predictable, pattern-based nature of delivery-service traffic, which typically peaks during lunch and dinner.
* The team chose a median-based approach to generate "Prediction" values from historical data, as it is more robust against outliers and easier to analyze than alternatives such as IQR or 2-sigma.
* Detection compares "Actual" values against "Warning" and "Critical" thresholds derived from the predicted median.
* To prevent false positives caused by temporary spikes, the system tracks "threshold reach counts," requiring a metric to stay in an abnormal state for a set number of consecutive cycles before firing a Slack alert (a minimal sketch of this logic appears at the end of this summary).

### Optimization of Alert Accuracy

* Each service metric requires a tailored "settling period" to strike the right balance between detection speed and accuracy.
* A high threshold reach count improves accuracy but slows detection, while a low count speeds up detection at the risk of more false positives.
* Alerts are delivered via Slack with comprehensive context, including current status and urgency, to support rapid decision-making.

For organizations running high-traffic services, prioritizing service-level indicators (SLIs) over infrastructure metrics can significantly reduce the time to detect critical failures. Simple, explainable statistical models like the median approach let teams maintain a reliable monitoring system that evolves alongside the service, without the complexity of uninterpretable AI models.
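As a rough illustration of the median-based prediction and threshold-reach-count logic summarized above, here is a minimal Python sketch. The class name, the 80%/60% threshold ratios, the three-cycle reach count, and the sample figures are illustrative assumptions; the post does not disclose the actual values or implementation.

```python
# Minimal sketch of median-based anomaly detection with a "threshold reach count".
# All numbers (80%/60% ratios, 3-cycle limit, sample data) are illustrative
# assumptions; the post does not disclose the real values or code.
from dataclasses import dataclass
from statistics import median


@dataclass
class MetricDetector:
    """Tracks one service metric (e.g. order volume) against a median prediction."""
    warning_ratio: float = 0.8    # warn when actual drops below 80% of prediction (assumed)
    critical_ratio: float = 0.6   # critical below 60% of prediction (assumed)
    reach_count_limit: int = 3    # consecutive abnormal cycles before alerting (assumed)
    _reach_count: int = 0

    def predict(self, history: list[float]) -> float:
        """Prediction = median of the metric at the same time slot on previous days.
        The median stays stable even if one past day was itself an incident."""
        return median(history)

    def check(self, actual: float, history: list[float]) -> str | None:
        """Return 'WARNING'/'CRITICAL' only after the metric has stayed abnormal
        for reach_count_limit consecutive cycles; otherwise return None."""
        prediction = self.predict(history)
        if actual < prediction * self.critical_ratio:
            state = "CRITICAL"
        elif actual < prediction * self.warning_ratio:
            state = "WARNING"
        else:
            self._reach_count = 0   # back to normal: reset the counter
            return None

        self._reach_count += 1
        if self._reach_count < self.reach_count_limit:
            return None             # suppress one-off dips (false-positive guard)
        return state


# Example: order volume stays below the critical threshold for three cycles in a row.
detector = MetricDetector()
history = [1200, 1180, 1250, 1300, 1220]        # same time slot on previous days (assumed)
for actual in (640, 610, 590):
    alert = detector.check(actual, history)
print(alert)  # -> "CRITICAL" fires only on the third consecutive abnormal cycle
```

The `reach_count_limit` field is the knob described under "Optimization of Alert Accuracy": raising it filters out transient dips but delays the alert by that many collection cycles, which is why each metric needs its own settling period.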
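The post also notes that alerts arrive in Slack with enough context (current status and urgency) to support a quick decision. Below is a hedged sketch of what such a notification might look like, using Slack's standard incoming-webhook JSON; the webhook URL, message fields, and wording are assumptions, not taken from the article.

```python
# Hypothetical Slack notification for the detector above, via a standard
# incoming webhook. URL, fields, and wording are illustrative assumptions.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def send_anomaly_alert(metric: str, state: str, actual: float, prediction: float) -> None:
    """Post an alert carrying the context needed for a fast decision:
    which metric, how severe, and how far actual is from the prediction."""
    deviation = (actual - prediction) / prediction * 100
    message = (
        f":rotating_light: [{state}] {metric} anomaly detected\n"
        f"Actual: {actual:,.0f} / Predicted (median): {prediction:,.0f} ({deviation:+.1f}%)\n"
        f"Follow the standard incident-response runbook."
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)


# send_anomaly_alert("order volume", "CRITICAL", actual=590, prediction=1220)
```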