monitoring

5 posts

woowahan

We Did Everything from Planning to

The 7th Woowacourse crew has successfully launched three distinct services, demonstrating that modern software engineering requires a synergy of technical mastery and "soft skills" like product planning and team communication. By owning the entire lifecycle from ideation to deployment, these developers moved beyond mere coding to solve real-world problems through agile iterations, user feedback, and robust infrastructure management. The program's focus on the full development stack, including monitoring, 2-week sprints, and collaborative design, highlights a shift toward producing well-rounded engineers capable of navigating professional environments.

### The Woowacourse Full-Cycle Philosophy

* The 10-month curriculum emphasizes soft skills, including speaking and writing, alongside traditional technical tracks like Web Backend, Frontend, and Mobile Android.
* During Levels 3 and 4, crews transition from fundamental programming to managing team projects where they must handle everything from initial architecture to UI/UX design.
* The process mimics real-world industry standards by implementing 2-week development sprints, establishing monitoring environments, and managing automated deployment pipelines.
* The core goal is to shift the developer's mindset from simply writing code to understanding why certain features are planned and how architecture choices impact the final user value.

### Pickeat: Collaborative Dining Decisions

* This service addresses "decision fatigue" during group meals by providing a collaborative platform to filter restaurants based on dietary constraints and preferences.
* Technical challenges included frequent domain restructuring and UI overhauls as the team pivoted based on real-world user feedback during demo days.
* The platform uses location data for automatic restaurant lookups and supports real-time voting mechanisms to ensure democratic and efficient group decisions.
* Development focused on aligning the team's judgment standards and iterating quickly to validate product-market fit rather than adhering strictly to initial specifications.

### Bottari: Real-Time Synchronized Checklists

* Bottari is a checklist service designed for situations like traveling or moving, focused on "becoming a companion for the user's memory."
* The service features template-based list generation and a "Team Bottari" function that allows multiple users to collaborate on a single list with real-time synchronization.
* A major technical focus was placed on the user experience flow, specifically optimizing notification timing and sync states to provide "peace of mind" for users.
* The project demonstrates the principle that technology serves as a tool for solving psychological pain points, such as the anxiety of forgetting essential items.

### Coffee Shout: Real-Time Betting and Mini-Games

* Designed to gamify office culture, this service replaces simple "rock-paper-scissors" with interactive mini-games and weighted roulette for coffee bets.
* The technical stack involved challenging implementations of WebSockets and distributed environments to handle the concurrency required for real-time gaming.
* The team focused on balancing the weighted roulette algorithm to keep the betting process fair and exciting (a minimal weighted-selection sketch follows below).
* Refinement of the service was driven by direct feedback from other Woowacourse crews, emphasizing the importance of community testing in the development lifecycle.
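The post only names weighted roulette as Coffee Shout's selection mechanic and does not describe its implementation. As a rough illustration of weighted random selection, here is a minimal Python sketch; the participant names, weights, and weighting rule are invented for the example.

```python
import random

def spin_weighted_roulette(weights: dict[str, float]) -> str:
    """Pick one participant; a larger weight means a larger slice of the wheel."""
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Hypothetical weights: a recent "loser" could be given a smaller slice next round.
participants = {"crew-a": 1.0, "crew-b": 1.0, "crew-c": 0.5}
print(spin_weighted_roulette(participants), "buys the coffee!")
```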
These projects underscore that the transition from a student to a professional developer is defined by the ability to manage shifting requirements and technical complexity while maintaining a focus on the end-user's experience.

woowahan

How Woowa Brothers Detects Failures

Woowa Brothers addresses the inevitability of system failures by shifting from traditional resource-based monitoring to a specialized Service Anomaly Detection system. By focusing on high-level service metrics such as order volume and login counts rather than just CPU or memory usage, they can identify incidents that directly impact the user experience. This approach ensures near real-time detection and provides a structured response framework to minimize damage during peak service hours.

### The Shift to Service-Level Monitoring

* Traditional monitoring focuses on infrastructure metrics like CPU and memory, but it is impossible to monitor every system variable, which leaves "blind spots" in failure detection.
* Service metrics, such as real-time login counts and payment success rates, are finite in number and directly reflect the actual customer experience.
* By monitoring these core indicators, the SRE team can detect anomalies that system-level alerts would overlook, sharply reducing the chance that a user-facing failure goes unnoticed.

### Requirements for Effective Anomaly Detection

* **Real-time performance:** Alerts must be triggered in near real time to allow for immediate intervention before the impact scales.
* **Explainability:** The system favors transparent logic over "black-box" AI models, allowing developers to quickly understand why an alert was triggered and how to improve the detection logic.
* **Integrated response:** Beyond detection, the system must provide a clear response process so that any engineer, regardless of experience, can follow a standardized path to resolution.

### Technical Implementation and Logic

* The system leverages the predictable, pattern-based nature of delivery service traffic, which typically peaks during lunch and dinner.
* The team chose a median-based approach to generate "Prediction" values from historical data, as it is more robust against outliers and easier to reason about than alternatives such as IQR or 2-sigma rules.
* Detection is determined by comparing "Actual" values against "Warning" and "Critical" thresholds derived from the predicted median.
* To prevent false positives caused by temporary spikes, the system tracks "threshold reach counts," requiring a metric to stay in an abnormal state for a specific number of consecutive cycles before firing a Slack alert (a simplified sketch of this logic follows this summary).

### Optimization of Alert Accuracy

* Each service metric requires a tailored "settling period" to find the optimal balance between detection speed and accuracy.
* Setting a high threshold reach count improves accuracy but slows down detection, while a low count accelerates detection at the risk of increased false positives.
* Alerts are delivered via Slack with comprehensive context, including current status and urgency, to facilitate rapid decision-making.

For organizations running high-traffic services, prioritizing service-level indicators (SLIs) over infrastructure metrics can significantly reduce the time to detect critical failures. Implementing simple, explainable statistical models like the median approach lets teams maintain a reliable monitoring system that evolves alongside the service without the complexity of uninterpretable AI models.
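The detection logic described above maps naturally onto a few lines of code. The sketch below is a simplified illustration, not Woowa Brothers' actual implementation: the threshold ratios, the required consecutive-breach count, and the Slack call are assumptions made for the example.

```python
from statistics import median

# Illustrative values only; the real system tunes these per metric.
WARNING_RATIO = 0.7    # actual below 70% of prediction -> warning zone (assumption)
CRITICAL_RATIO = 0.5   # actual below 50% of prediction -> critical zone (assumption)
REQUIRED_BREACHES = 3  # consecutive abnormal cycles before alerting (assumption)

class MedianAnomalyDetector:
    """Median-based anomaly detection for one service metric (e.g. order count)."""

    def __init__(self) -> None:
        self.breach_count = 0

    def predict(self, history: list[float]) -> float:
        # Prediction = median of the same time slot on previous days,
        # which is robust against one-off outliers in the history.
        return median(history)

    def evaluate(self, actual: float, history: list[float]) -> str | None:
        prediction = self.predict(history)
        if actual < prediction * CRITICAL_RATIO:
            level = "critical"
        elif actual < prediction * WARNING_RATIO:
            level = "warning"
        else:
            self.breach_count = 0   # a normal cycle resets the counter
            return None

        # Require several consecutive abnormal cycles to filter out short spikes.
        self.breach_count += 1
        if self.breach_count >= REQUIRED_BREACHES:
            return f"[{level}] actual={actual:.0f}, predicted={prediction:.0f}"
        return None


detector = MedianAnomalyDetector()
history = [1200, 1180, 1250, 1230, 1210]   # order counts for this slot on past days
for actual in (1190, 560, 540, 530):       # a sustained drop over the current cycles
    alert = detector.evaluate(actual, history)
    if alert:
        print("send to Slack:", alert)     # in practice, a Slack webhook call
```

Tuning `REQUIRED_BREACHES` is exactly the accuracy-versus-speed trade-off the post describes: a higher count suppresses false positives from momentary dips, while a lower count shortens time to detection.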

naver

Collecting Custom Metrics with Te

This technical session from NAVER ENGINEERING DAY 2025 details the transition from traditional open-source exporters to a Telegraf-based architecture for collecting custom system metrics. By evaluating various monitoring tools through rigorous benchmarking, the developers demonstrate how Telegraf provides a more flexible and high-performance framework for infrastructure observability. The presentation concludes that adopting Telegraf streamlines the metric collection pipeline and offers superior scalability for complex, large-scale service environments.

### Context: Limitations of Existing Open-Source Exporters

* The project originated from the need to overcome the limitations of standard open-source exporters, which lacked support for specific internal business logic.
* Engineers sought a unified way to collect diverse data points without managing dozens of fragmented, single-purpose agents.
* The primary goal was to find a solution that could handle high-frequency data ingestion while maintaining low resource overhead on production servers.

### Benchmark Testing for Metric Collection

* A comparative analysis was conducted between several open-source monitoring agents to determine their efficiency under load.
* Testing focused on critical performance indicators, including CPU and memory footprint during peak metric throughput.
* The results highlighted Telegraf's stability and consistent performance compared to other exporter-based alternatives, leading to its selection as the primary collection tool.

### Telegraf Architecture and Customization

* Telegraf operates as a plugin-driven agent with four distinct plugin categories: Input, Processor, Aggregator, and Output.
* The development team shared their experience writing custom exporters by leveraging Telegraf's modular Go-based plugin framework (a lighter-weight, script-based alternative is sketched after this summary).
* This approach allowed raw data to be transformed into various formats (such as Prometheus or InfluxDB) using a single, unified configuration.

### Operational Gains and Technical Options

* Post-implementation, the system saw a significant reduction in operational complexity by consolidating various metric streams into a single agent.
* Specific Telegraf options were used to fine-tune the collection interval and batch size, balancing data granularity against network load.
* The migration improved the reliability of metric delivery through built-in retry mechanisms and internal buffers that prevent data loss during transient network failures.

For teams currently managing a sprawling array of open-source exporters, migrating to a Telegraf-based architecture is recommended to centralize metric collection. The plugin-based system not only reduces the maintenance burden but also provides the extensibility needed to support specialized custom metrics as service requirements evolve.
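The session covers writing custom plugins against Telegraf's Go framework; those plugins are not reproduced here. As a lighter-weight way to feed a custom business metric into Telegraf without writing a Go plugin, the sketch below assumes the standard `inputs.exec` plugin with `data_format = "influx"`; the metric name, script path, and interval are illustrative, not taken from the talk.

```python
#!/usr/bin/env python3
"""Emit a custom business metric in InfluxDB line protocol.

Intended to be run by Telegraf's built-in `inputs.exec` plugin, e.g.:

  [[inputs.exec]]
    commands = ["/usr/local/bin/queue_depth.py"]   # hypothetical script path
    data_format = "influx"
    interval = "30s"
"""
import socket
import time


def read_queue_depth() -> int:
    # Placeholder for real business logic (e.g. querying an internal API);
    # returns a dummy value so the script is runnable as-is.
    return 42


def main() -> None:
    host = socket.gethostname()
    depth = read_queue_depth()
    # Line protocol: measurement,tag_key=tag_value field_key=value timestamp(ns)
    timestamp_ns = time.time_ns()
    print(f"queue_depth,host={host} value={depth}i {timestamp_ns}")


if __name__ == "__main__":
    main()
```

Telegraf then routes the metric through whatever Output plugin (Prometheus, InfluxDB, and so on) the shared configuration defines, which is the single-agent consolidation the summary describes.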

naver

Introduction to OpenTelemetry (feat. Collector

NAVER is transitioning its internal search monitoring platform, SEER, to an architecture built on OpenTelemetry and open-source standards to achieve a more scalable and flexible observability environment. By adopting a vendor-agnostic approach, the engineering team aims to unify the collection of metrics, logs, and traces while contributing back to the global OpenTelemetry ecosystem. This shift underscores the importance of standardized telemetry protocols in managing complex, large-scale service infrastructures.

### Standardizing Observability with OTLP

* The transition centers on the OpenTelemetry Protocol (OTLP) as the primary standard for transmitting telemetry data across the platform (a minimal application-side example follows this summary).
* Moving away from proprietary formats allows for a unified data model that encompasses metrics, traces, and logs, ensuring consistency across different services.
* A standardized protocol simplifies the integration of various open-source backends, reducing the engineering overhead associated with supporting multiple telemetry formats.

### The OpenTelemetry Collector Pipeline

* The Collector acts as a critical intermediary, decoupling the application layer from the storage backend to provide greater architectural flexibility.
* **Receivers** ingest data from diverse sources, supporting both OTLP-native applications and legacy systems.
* **Processors** handle data transformation, filtering, and metadata enrichment (such as adding resource attributes) before the data reaches its destination.
* **Exporters** deliver the processed telemetry to specific backends, such as Prometheus for metrics or Jaeger for tracing, allowing infrastructure components to be swapped easily.

### Automated Management via OpenTelemetry Operator

* The OpenTelemetry Operator is used within Kubernetes environments to automate the deployment and lifecycle management of the Collector.
* It facilitates auto-instrumentation, allowing developers to collect telemetry from applications without manual code changes for every service.
* The Operator ensures that the observability stack scales dynamically alongside the production workloads it monitors.

### Open-Source Contribution and Community

* Beyond mere adoption, the NAVER engineering team actively participates in the OpenTelemetry community by sharing bug fixes and feature enhancements discovered during the SEER migration.
* This collaborative approach ensures that the specific requirements of high-traffic enterprise environments are reflected in the evolution of the OpenTelemetry project.

Adopting OpenTelemetry is a strategic move for organizations looking to avoid vendor lock-in and build a future-proof monitoring stack. For a successful implementation, teams should focus on mastering the Collector's pipeline configuration to balance data granularity with processing performance across distributed systems.
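The post focuses on the Collector side of the pipeline; to show what an OTLP-native application looks like from the other end, here is a minimal Python sketch using the OpenTelemetry SDK. The service name, Collector endpoint, and span attribute are assumptions for illustration; the source does not specify which SDK or language SEER's services use.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Describe the service; the Collector can enrich or override these resource attributes.
resource = Resource.create({"service.name": "search-frontend"})  # hypothetical name

# Export spans over OTLP/gRPC to a local Collector (default gRPC port 4317).
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle-search-request") as span:
    span.set_attribute("search.query_length", 12)  # illustrative attribute
```

Because the application only speaks OTLP, swapping Jaeger or Prometheus for a different backend becomes a Collector configuration change rather than an application change, which is precisely the decoupling the summary highlights.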

line

Essential Element for App Success:

Effective mobile app management requires proactive outage monitoring to prevent user churn caused by failures in critical flows like registration or payment. Relying on user reports is often too late, so developers must implement systematic event collection and real-time dashboards to identify issues the moment they arise. By integrating tools like Sentry or Firebase, teams can maintain high quality through immediate response and detailed performance analysis.

### Implementing Sentry in Flutter

* **Dependency and Initialization**: Integration begins by adding `sentry_flutter` and `sentry_dio` to the project. The initialization process involves setting the Data Source Name (DSN), environment tags (e.g., production vs. staging), and release versions so that logs are correctly categorized.
* **Performance and Privacy**: Developers should configure `tracesSampleRate` and `profilesSampleRate` to balance monitoring depth with cost. Additionally, the `beforeSend` callback allows sensitive user data, such as authorization headers or IP addresses, to be masked before it is transmitted (a minimal configuration sketch follows this summary).
* **Contextual Tracking**: To aid debugging, the system captures user IDs via `Sentry.configureScope` and tracks user movement using `SentryNavigatorObserver`. Utilizing `SentryInterceptor` with the Dio library allows for automatic tracking of HTTP request performance and API bottlenecks.

### Strategic Log Level Design

* **Debug and Info**: Debug logs remain local to the terminal to save resources. Info logs are reserved for significant user actions that change data, such as successful sign-ups or purchases, while high-frequency read actions like "viewing a product list" are excluded to reduce noise and cost.
* **Warning**: This level tracks external system failures, such as failed API calls or lost push notifications. To prevent "alert fatigue," client-side network issues (e.g., timeouts or offline status) are ignored, and alerts fire only when specific thresholds are met, such as 100 failures within 10 minutes.
* **Error**: Error logs represent internal logic failures that bypass defensive coding, such as null object errors, parsing failures, or unreachable code branches. These require immediate notification of the development team to enable rapid hotfixes.
* **Fatal**: This level is dedicated to application crashes and unhandled exceptions. When configured at the app's entry point, the system automatically captures these critical failures and feeds a comprehensive "crash-free users" metric.

### Creating Effective Dashboards

* **Naming Conventions**: Logs should follow a strict structure, using tags for modules and event names (e.g., `[API] [postLogin] success`). This consistency allows for granular querying and clearer visualization on monitoring dashboards.
* **Data Enrichment**: Using the `extra` field in log events provides vital context for troubleshooting, such as the specific endpoint, request body, and response status code of a failed transaction.
* **Actionable Metrics**: Effective monitoring focuses on key performance indicators like API error rates and the failure percentage of core business events (login, registration, payment) rather than raw crash counts.

A robust monitoring strategy shifts the focus from simple crash reporting to comprehensive service health. By standardizing log levels and automating event collection, development teams can distinguish between transient network blips and critical logic errors, ensuring they spend their time fixing high-impact issues.
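The post configures these options through `sentry_flutter`; since its Dart snippets are not reproduced here, the sketch below shows the equivalent ideas (DSN, environment and release tags, sampling rates, and a `beforeSend`-style scrubber) using the Sentry Python SDK as an analogue. The DSN, release string, sampling rates, and event name are placeholders.

```python
import sentry_sdk


def scrub_sensitive_data(event, hint):
    """Mask secrets before the event leaves the client (analogue of `beforeSend`)."""
    headers = event.get("request", {}).get("headers", {})
    if "Authorization" in headers:
        headers["Authorization"] = "[Filtered]"
    user = event.get("user")
    if user and "ip_address" in user:
        user["ip_address"] = None   # drop the client IP entirely
    return event                     # returning None would drop the event


sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    environment="production",        # separates production from staging events
    release="myapp@1.4.2",           # hypothetical release tag for grouping
    traces_sample_rate=0.2,          # sample 20% of transactions to control cost
    profiles_sample_rate=0.1,        # profile a share of sampled transactions
    before_send=scrub_sensitive_data,
)

# Attach the signed-in user and record a significant, data-changing action (Info level),
# following the post's "[module] [event] result" naming convention.
sentry_sdk.set_user({"id": "user-1234"})
sentry_sdk.capture_message("[API] [postLogin] success", level="info")
```

In both SDKs, returning `None` from the scrubber drops the event entirely, which is one way to keep noisy, purely client-side failures from counting against quota or triggering alerts.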