Woowa Brothers / monitoring

2 posts


We Did Everything Ourselves, from Planning to Development – Woowacourse 7th Cohort Crew Service Launch! | Woowa Brothers Tech Blog

The 7th Woowacourse crew has successfully launched three distinct services, demonstrating that modern software engineering requires a synergy of technical mastery and "soft skills" such as product planning and team communication. By owning the entire lifecycle from ideation to deployment, these developers moved beyond mere coding to solve real-world problems through agile iteration, user feedback, and robust infrastructure management. The program's focus on the full development cycle, including monitoring, two-week sprints, and collaborative design, highlights a shift toward producing well-rounded engineers capable of navigating professional environments.

### The Woowacourse Full-Cycle Philosophy

* The 10-month curriculum emphasizes soft skills, including speaking and writing, alongside traditional technical tracks such as Web Backend, Frontend, and Mobile Android.
* During Levels 3 and 4, crews move from fundamental programming to managing team projects, handling everything from initial architecture to UI/UX design.
* The process mirrors industry practice by running two-week development sprints, establishing monitoring environments, and managing automated deployment pipelines.
* The core goal is to shift the developer's mindset from simply writing code to understanding why certain features are planned and how architecture choices affect the final user value.

### Pickeat: Collaborative Dining Decisions

* This service addresses "decision fatigue" during group meals by providing a collaborative platform for filtering restaurants by dietary constraints and preferences.
* Technical challenges included frequent domain restructuring and UI overhauls as the team pivoted based on real user feedback during demo days.
* The platform uses location data for automatic restaurant lookups and supports real-time voting to keep group decisions democratic and efficient.
* Development focused on aligning the team's judgment standards and iterating quickly to validate product-market fit rather than adhering strictly to the initial specification.

### Bottari: Real-Time Synchronized Checklists

* Bottari is a checklist service for situations such as traveling or moving, built around the idea of "becoming a companion for the user's memory."
* The service offers template-based list generation and a "Team Bottari" function that lets multiple users collaborate on a single list with real-time synchronization.
* A major technical focus was the user experience flow, specifically tuning notification timing and sync states to give users "peace of mind."
* The project demonstrates the principle that technology is a tool for solving psychological pain points, such as the anxiety of forgetting essential items.

### Coffee Shout: Real-Time Betting and Mini-Games

* Designed to gamify office culture, this service replaces simple rock-paper-scissors with interactive mini-games and a weighted roulette for coffee bets.
* The technical stack involved challenging work with WebSockets and distributed environments to handle the concurrency required for real-time gaming.
* The team focused on balancing the weighted-roulette algorithm so the betting stays fair and exciting; a sketch of the general selection technique follows this list.
* Refinement of the service was driven by direct feedback from other Woowacourse crews, underscoring the value of community testing in the development lifecycle.
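The post does not show Coffee Shout's actual roulette code, so the following Kotlin sketch only illustrates the generic "roulette wheel" technique for weighted random selection that such a feature typically builds on. The `Participant` type, the `spinRoulette` function, and the example weights are illustrative assumptions, not the team's implementation.

```kotlin
import kotlin.random.Random

// Illustrative model only: the post does not describe Coffee Shout's actual data structures.
data class Participant(val name: String, val weight: Double)

// Picks one participant with probability proportional to its weight (classic "roulette wheel"
// selection); a generic sketch, not the team's actual balancing logic.
fun spinRoulette(participants: List<Participant>, random: Random = Random.Default): Participant {
    require(participants.isNotEmpty()) { "Roulette needs at least one participant" }
    val totalWeight = participants.sumOf { it.weight }
    var pointer = random.nextDouble() * totalWeight  // random landing point on the wheel
    for (participant in participants) {
        pointer -= participant.weight                // walk across each participant's slice
        if (pointer <= 0.0) return participant
    }
    return participants.last()                       // guard against floating-point rounding
}

fun main() {
    val crew = listOf(
        Participant("A", weight = 1.0),
        Participant("B", weight = 2.0),  // twice as likely to buy the coffee
        Participant("C", weight = 1.0),
    )
    println("Coffee is on: ${spinRoulette(crew).name}")
}
```

In this scheme, "balancing" reduces to choosing the weights: equal weights reproduce a fair draw, while skewed weights let the game favor or penalize specific players, which is presumably the kind of tuning the summary refers to as algorithm balancing.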
These projects underscore that the transition from a student to a professional developer is defined by the ability to manage shifting requirements and technical complexity while maintaining a focus on the end-user's experience.


How Woowa Brothers Detects Failures Without Missing Them | Woowa Brothers Tech Blog

Woowa Brothers addresses the inevitability of system failures by shifting from traditional resource-based monitoring to a dedicated service anomaly detection system. By focusing on high-level service metrics such as order volume and login counts rather than just CPU or memory usage, the team can identify incidents that directly affect the user experience. This approach delivers near-real-time detection and provides a structured response framework to minimize damage during peak service hours.

### The Shift to Service-Level Monitoring

* Traditional monitoring focuses on infrastructure metrics like CPU and memory, but it is impossible to monitor every system variable, leaving "blind spots" in failure detection.
* Service metrics, such as real-time login counts and payment success rates, are finite in number and directly reflect the actual customer experience.
* By monitoring these core indicators, the SRE team can detect anomalies that system-level alerts might overlook, ensuring that no failure goes unnoticed.

### Requirements for Effective Anomaly Detection

* **Real-time performance:** Alerts must be triggered in near real time to allow immediate intervention before the impact scales.
* **Explainability:** The system favors transparent logic over "black-box" AI models, so developers can quickly understand why an alert was triggered and how to improve the detection logic.
* **Integrated response:** Beyond detection, the system must provide a clear response process so that any engineer, regardless of experience, can follow a standardized path to resolution.

### Technical Implementation and Logic

* The system leverages the predictable, pattern-based nature of delivery-service traffic, which typically peaks at lunch and dinner.
* The team chose a median-based approach to generate prediction values from historical data, since the median is more robust to outliers and easier to analyze than alternatives such as IQR or 2-sigma rules.
* Detection compares actual values against "Warning" and "Critical" thresholds derived from the predicted median.
* To prevent false positives caused by temporary spikes, the system tracks "threshold reach counts," requiring a metric to stay in an abnormal state for a set number of consecutive cycles before firing a Slack alert; a sketch of this detection flow appears after this summary.

### Optimization of Alert Accuracy

* Each service metric requires a tailored "settling period" to balance detection speed against accuracy.
* A high threshold reach count improves accuracy but slows detection, while a low count speeds up detection at the risk of more false positives.
* Alerts are delivered via Slack with comprehensive context, including current status and urgency, to support rapid decision-making.

For organizations running high-traffic services, prioritizing service-level indicators (SLIs) over infrastructure metrics can significantly reduce the time to detect critical failures. Simple, explainable statistical models such as the median approach let teams maintain a reliable monitoring system that evolves alongside the service, without the complexity of uninterpretable AI models.
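The post describes the detection pipeline only at a high level. The Kotlin sketch below shows one way its pieces could fit together under stated assumptions: the prediction is the median of historical values for the same time slot, the Warning/Critical thresholds are fixed ratios of that prediction (the actual thresholds and metric names are not given in the post), an alert fires only after the metric stays abnormal for a configurable number of consecutive cycles, and the Slack call is a placeholder.

```kotlin
// A minimal sketch of median-based service anomaly detection with threshold reach counts.
// Class names, ratio thresholds, and the alert placeholder are assumptions for illustration,
// not Woowa Brothers' actual implementation.

enum class Severity { NORMAL, WARNING, CRITICAL }

class ServiceMetricDetector(
    private val warningRatio: Double = 0.8,   // assumed: Warning when actual < 80% of prediction
    private val criticalRatio: Double = 0.6,  // assumed: Critical when actual < 60% of prediction
    private val requiredReachCount: Int = 3,  // consecutive abnormal cycles before alerting
) {
    private var reachCount = 0

    // Prediction = median of the same time slot over recent history; the median resists
    // distortion from a single anomalous day, which is why the post prefers it to IQR or 2-sigma.
    fun predict(history: List<Double>): Double {
        require(history.isNotEmpty()) { "Need historical values to build a prediction" }
        val sorted = history.sorted()
        val mid = sorted.size / 2
        return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
    }

    // Compare the actual value with Warning/Critical thresholds derived from the prediction.
    // This sketch assumes the anomaly of interest is a drop (e.g., fewer orders than usual);
    // the same structure works for spikes by flipping the comparisons.
    fun evaluate(actual: Double, prediction: Double): Severity = when {
        actual < prediction * criticalRatio -> Severity.CRITICAL
        actual < prediction * warningRatio  -> Severity.WARNING
        else                                -> Severity.NORMAL
    }

    // Called once per monitoring cycle; alerts only after the metric stays abnormal
    // for requiredReachCount consecutive cycles, filtering out temporary spikes.
    fun onCycle(metric: String, actual: Double, history: List<Double>) {
        val prediction = predict(history)
        val severity = evaluate(actual, prediction)
        if (severity == Severity.NORMAL) {
            reachCount = 0
            return
        }
        if (++reachCount >= requiredReachCount) {
            sendAlert(metric, severity, actual, prediction)
            reachCount = 0
        }
    }

    private fun sendAlert(metric: String, severity: Severity, actual: Double, prediction: Double) {
        // Placeholder: in practice this would post to a Slack webhook with full context
        // (current status, urgency, and next steps for the on-call engineer).
        println("[$severity] $metric: actual=$actual, predicted=$prediction")
    }
}
```

The `requiredReachCount` parameter is where the accuracy-versus-speed trade-off described above lives: raising it suppresses false positives from momentary dips but delays the first alert by that many cycles, so each metric would tune it to its own "settling period."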