Naver / data-engineering

2 posts

naver

Introducing an intelligent log pipeline targeting cost, performance, and stability

Naver’s Logiss platform, responsible for processing tens of billions of logs per day, evolved its architecture to overcome systemic inefficiencies in resource utilization and deployment stability. By transitioning from a rigid, single-topology structure to an intelligent, multi-topology pipeline, the team achieved zero-downtime deployments and optimized infrastructure costs. These enhancements ensure that critical business data is prioritized during traffic surges while minimizing redundant storage for search-optimized indices.

### Limitations of the Legacy Pipeline

* **Deployment Disruptions:** The previous single-topology setup in Apache Storm lacked a "swap" feature, requiring a total shutdown for updates and causing 3–8 minute processing lags during every deployment.
* **Resource Inefficiency:** Infrastructure was provisioned for daytime peak loads, which are five times higher than nighttime traffic, resulting in significant underutilization during off-peak hours.
* **Indiscriminate Processing:** During traffic spikes or hardware failures, the system treated all logs equally, causing critical service logs to be delayed alongside low-priority telemetry.
* **Storage Redundancy:** Data was stored at 100% volume in both real-time search (OpenSearch) and long-term storage (Landing Zones), even when sampled data would have sufficed for search purposes.

### Transitioning to Multi-Topology and Subscribe Mode

* **Custom Storm Client:** The team modified `storm-kafka-client` 2.3.0 to revert from the default `assign` mode back to the `subscribe` mode for Kafka partition management (see the consumer sketch after this summary).
* **Partition Rebalancing:** While `assign` mode is standard in Storm 2.x, it prevents multiple topologies from sharing a consumer group without duplication; the custom `subscribe` implementation allows Kafka to manage rebalancing across multiple topologies.
* **Zero-Downtime Deployments:** This architectural shift enables rolling updates and canary deployments by allowing new topologies to join the consumer group and take over partitions without stopping the entire pipeline.

### Intelligent Traffic Steering and Sampling

* **Dynamic Throughput Control:** The "Traffic-Controller" Storm topology monitors downstream load and diverts excess non-critical traffic to a secondary "retry" path, protecting the stability of the main pipeline (see the routing sketch below).
* **Tiered Log Prioritization:** The system identifies critical business logs to ensure they bypass bottlenecks, while less urgent logs are queued for post-processing during traffic surges.
* **Storage Optimization via Sampling:** Logiss now supports per-destination sampling rates, allowing the system to send 100% of data to long-term Landing Zones while indexing only a representative sample in OpenSearch, significantly reducing indexing overhead and storage costs (see the sampling sketch below).

### Results and Recommendations

The implementation of an intelligent log pipeline demonstrates that modifying core open-source components, such as the Storm-Kafka client, can be a viable path to achieving specific architectural goals like zero-downtime deployment. For high-volume platforms, moving away from a "one-size-fits-all" processing model toward a priority-aware, sampling-capable pipeline is essential for balancing operational costs with system reliability. Organizations should evaluate whether their real-time search requirements truly necessitate 100% data ingestion, or whether sampling can provide the necessary insights at a fraction of the cost.
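To make the `assign` vs. `subscribe` distinction concrete, here is a minimal sketch using the confluent-kafka Python client. This is an illustration of the Kafka concept only, not the modified `storm-kafka-client` (that change lives in Java inside Storm); the broker, topic, and group names are hypothetical.

```python
# Illustrative contrast between Kafka's `assign` and `subscribe` partition modes,
# using the confluent-kafka Python client. Names here are hypothetical; the change
# described in the article was made inside storm-kafka-client 2.3.0 (Java).
from confluent_kafka import Consumer, TopicPartition

conf = {
    "bootstrap.servers": "kafka:9092",
    "group.id": "logiss-pipeline",   # shared by every topology instance
    "auto.offset.reset": "latest",
}

# `assign` mode (the Storm 2.x default): the client pins itself to fixed
# partitions, so a second topology reading the same topic in the same group
# would re-consume the same records instead of splitting the work.
static_consumer = Consumer(conf)
static_consumer.assign([TopicPartition("service-logs", p) for p in range(4)])

# `subscribe` mode (the behavior the custom client restores): the broker-side
# group coordinator spreads partitions across all live members, so a new
# topology can join the group and take over partitions while the old one drains
# and shuts down -- the basis for rolling and canary deployments.
def on_assign(consumer, partitions):
    print("partitions granted:", [p.partition for p in partitions])

def on_revoke(consumer, partitions):
    print("partitions rebalanced away:", [p.partition for p in partitions])

dynamic_consumer = Consumer(conf)
dynamic_consumer.subscribe(["service-logs"], on_assign=on_assign, on_revoke=on_revoke)

while True:
    msg = dynamic_consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print("processing log record:", msg.value()[:80])  # hand off to processing bolts
```

Because partition ownership is negotiated by the group coordinator rather than fixed in the client, a deployment becomes a membership change in the consumer group instead of a full pipeline restart.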
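The traffic-steering behaviour can be sketched as a routing decision per record. The thresholds, service names, and priority field below are assumptions for illustration, not Logiss internals.

```python
# A hedged sketch of priority-aware traffic steering: when the downstream sink is
# saturated, critical business logs stay on the main path and everything else is
# diverted to a secondary "retry" path for later processing.
from dataclasses import dataclass

@dataclass
class PipelineState:
    downstream_lag: int          # e.g. consumer lag or queue depth from monitoring
    lag_threshold: int = 50_000  # above this, shed non-critical traffic (hypothetical)

CRITICAL_SERVICES = {"payments", "search", "login"}  # hypothetical priority list

def choose_path(log_record: dict, state: PipelineState) -> str:
    """Return 'main' or 'retry' depending on log priority and downstream load."""
    is_critical = log_record.get("service") in CRITICAL_SERVICES
    if state.downstream_lag > state.lag_threshold and not is_critical:
        return "retry"   # queued for post-processing once the surge passes
    return "main"

# During a surge, only non-critical telemetry is diverted.
surge = PipelineState(downstream_lag=120_000)
print(choose_path({"service": "payments", "msg": "order placed"}, surge))      # main
print(choose_path({"service": "app-telemetry", "msg": "heartbeat"}, surge))    # retry
```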
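Per-destination sampling can likewise be illustrated with a small routing function. The destination names and the 10% OpenSearch rate below are placeholders; deterministic hashing is one way (an assumption here) to keep sampling decisions stable across retries.

```python
import hashlib

# Hypothetical per-destination sampling table: the landing zone keeps everything,
# while OpenSearch indexes only a representative 10% of records.
SAMPLING_RATES = {
    "landing_zone": 1.0,
    "opensearch": 0.10,
}

def should_emit(log_key: str, destination: str) -> bool:
    """Deterministically sample by hashing the log key, so retries of the same
    record always make the same keep/drop decision per destination."""
    rate = SAMPLING_RATES.get(destination, 1.0)
    if rate >= 1.0:
        return True
    bucket = int(hashlib.md5(log_key.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

def route(log_record: dict) -> list[str]:
    """Return the destinations this record should be written to."""
    key = log_record.get("trace_id", repr(log_record))
    return [dest for dest in SAMPLING_RATES if should_emit(key, dest)]

# Example: every record reaches the landing zone; roughly 1 in 10 is also indexed.
print(route({"trace_id": "a1b2c3", "service": "search", "level": "INFO"}))
```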

naver

Naver TV

Naver Webtoon developed "Flow.er," an on-demand data lineage pipeline service designed to overcome the operational inefficiencies and high maintenance costs of legacy data workflows. By integrating dbt for modular modeling and Airflow for scalable orchestration, the platform automates complex backfill and recovery processes while maintaining high data integrity. This shift to a lineage-centric architecture allows the engineering team to manage data as a high-quality product rather than a series of disconnected tasks.

### Challenges in Traditional Data Pipelines

* High operational burdens were caused by manual backfilling and recovery tasks, which became increasingly difficult as data volume and complexity grew.
* Legacy systems lacked transparency in data dependencies, making it hard to predict the downstream impact of code changes or upstream data failures.
* Fragmented development environments led to inconsistencies between local testing and production outputs, slowing down the deployment of new data products.

### Core Architecture and the Role of dbt and Airflow

* dbt serves as the central modeling layer, defining transformations and establishing clear data lineage that maps how information flows between tables.
* Airflow functions as the orchestration engine, utilizing the lineage defined in dbt to trigger tasks in the correct order and manage execution schedules (see the DAG sketch after this summary).
* Individual development instances provide engineers with isolated environments to test dbt models, ensuring that logic is validated before being merged into the main pipeline.
* The system includes a dedicated model management page and a robust CI/CD pipeline to streamline the transition from development to production.

### Expanding the Platform with Tower and Playground

* "Tower" and "Playground" were introduced as supplementary components to support a broader range of data organizations and facilitate easier experimentation.
* A specialized Partition Checker was developed to enhance data integrity by automatically verifying that all required data partitions are present before downstream processing begins (see the sketch below).
* Improvements to the Manager DAG system allow the platform to handle large-scale pipeline deployments across different teams while maintaining a unified view of the data lineage.

### Future Evolution with AI and MCP

* The team is exploring the integration of Model Context Protocol (MCP) servers to bridge the gap between data pipelines and AI applications.
* Future developments focus on using AI agents to further automate pipeline monitoring and troubleshooting, reducing the need for human intervention in routine maintenance.

To build a sustainable and scalable data infrastructure, organizations should transition from simple task scheduling to a lineage-aware architecture. Adopting a framework like Flow.er, which combines the modeling strengths of dbt with the orchestration power of Airflow, enables teams to automate the most labor-intensive parts of data engineering, such as backfills and dependency management, while ensuring the reliability of the final data product.
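A minimal sketch of the dbt-plus-Airflow idea, assuming Airflow 2.4+ and not Flow.er's actual code: each dbt model becomes an Airflow task, and the task ordering is wired directly from the model-to-model dependencies dbt already tracks. The model names, lineage dict, and project path are hypothetical.

```python
# Sketch: letting dbt lineage drive Airflow task ordering.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Dependencies as dbt would report them (model -> upstream parents). Hypothetical.
LINEAGE = {
    "stg_webtoon_events": [],
    "stg_webtoon_users": [],
    "fct_daily_views": ["stg_webtoon_events"],
    "dim_users": ["stg_webtoon_users"],
    "mart_view_summary": ["fct_daily_views", "dim_users"],
}

with DAG(
    dag_id="flower_dbt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # One task per dbt model, each running only its own model.
    tasks = {
        model: BashOperator(
            task_id=f"dbt_run_{model}",
            bash_command=f"dbt run --select {model} --project-dir /opt/dbt/webtoon",
        )
        for model in LINEAGE
    }
    # Wire Airflow dependencies straight from the dbt lineage graph.
    for model, parents in LINEAGE.items():
        for parent in parents:
            tasks[parent] >> tasks[model]
```

Because the dependency edges come from the modeling layer rather than being hand-maintained in the DAG, a backfill or recovery run can be scoped to exactly the affected models and their descendants.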
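The Partition Checker idea can be sketched as a pre-flight gate before downstream models run. The hourly `dt=/hour=` directory layout and paths below are assumptions, not Flow.er's actual storage convention.

```python
# A hedged sketch of a partition check: before a downstream model runs, confirm
# that every expected partition (here, one directory per hour) is present.
from datetime import datetime
from pathlib import Path

def missing_partitions(base_path: str, day: datetime, hours: int = 24) -> list[str]:
    """Return the expected partition paths for one day that are not yet present."""
    expected = [
        Path(base_path) / f"dt={day:%Y-%m-%d}" / f"hour={h:02d}"
        for h in range(hours)
    ]
    return [str(p) for p in expected if not p.exists()]

if __name__ == "__main__":
    gaps = missing_partitions("/data/warehouse/webtoon_events", datetime(2024, 1, 1))
    if gaps:
        # Block downstream processing (e.g., fail an Airflow sensor) until filled.
        raise RuntimeError(f"{len(gaps)} partitions missing, e.g. {gaps[0]}")
    print("all partitions present; downstream models may run")
```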