네이버 / apache-kafka

3 posts

naver

비용, 성능, 안정성을 목표로 한 지능형 로그 파이프라인 도입 (opens in new tab)

Naver’s Logiss platform, responsible for processing tens of billions of daily logs, evolved its architecture to overcome systemic inefficiencies in resource utilization and deployment stability. By transitioning from a rigid, single-topology structure to an intelligent, multi-topology pipeline, the team achieved zero-downtime deployments and optimized infrastructure costs. These enhancements ensure that critical business data is prioritized during traffic surges while minimizing redundant storage for search-optimized indices. ### Limitations of the Legacy Pipeline * **Deployment Disruptions:** The previous single-topology setup in Apache Storm lacked a "swap" feature, requiring a total shutdown for updates and causing 3–8 minute processing lags during every deployment. * **Resource Inefficiency:** Infrastructure was provisioned based on daytime peak loads, which are five times higher than nighttime traffic, resulting in significant underutilization during off-peak hours. * **Indiscriminate Processing:** During traffic spikes or hardware failures, the system treated all logs equally, causing critical service logs to be delayed alongside low-priority telemetry. * **Storage Redundancy:** Data was stored at 100% volume in both real-time search (OpenSearch) and long-term storage (Landing Zones), even when sampled data would have sufficed for search purposes. ### Transitioning to Multi-Topology and Subscribe Mode * **Custom Storm Client:** The team modified `storm-kafka-client` 2.3.0 to revert from the default `assign` mode back to the `subscribe` mode for Kafka partition management. * **Partition Rebalancing:** While `assign` mode is standard in Storm 2.x, it prevents multiple topologies from sharing a consumer group without duplication; the custom `subscribe` implementation allows Kafka to manage rebalancing across multiple topologies. * **Zero-Downtime Deployments:** This architectural shift enables rolling updates and canary deployments by allowing new topologies to join the consumer group and take over partitions without stopping the entire pipeline. ### Intelligent Traffic Steering and Sampling * **Dynamic Throughput Control:** The "Traffic-Controller" (Storm topology) monitors downstream load and diverts excess non-critical traffic to a secondary "retry" path, protecting the stability of the main pipeline. * **Tiered Log Prioritization:** The system identifies critical business logs to ensure they bypass bottlenecks, while less urgent logs are queued for post-processing during traffic surges. * **Storage Optimization via Sampling:** Logiss now supports per-destination sampling rates, allowing the system to send 100% of data to long-term Landing Zones while only indexing a representative sample in OpenSearch, significantly reducing indexing overhead and storage costs. ### Results and Recommendations The implementation of an intelligent log pipeline demonstrates that modifying core open-source components, such as the Storm-Kafka client, can be a viable path to achieving specific architectural goals like zero-downtime deployment. For high-volume platforms, moving away from a "one-size-fits-all" processing model toward a priority-aware and sampling-capable pipeline is essential for balancing operational costs with system reliability. Organizations should evaluate whether their real-time search requirements truly necessitate 100% data ingestion or if sampling can provide the necessary insights at a fraction of the cost.

naver

[DAN25] 기술세션 영상이 모두 공개되었습니다. (opens in new tab)

Naver recently released the full video archives from its DAN25 conference, highlighting the company’s strategic roadmap for AI agents, Sovereign AI, and digital transformation. The sessions showcase how Naver is moving beyond general AI applications to implement specialized, real-time systems that integrate large language models (LLMs) directly into core services like search, commerce, and content. By open-sourcing these technical insights, Naver demonstrates its progress in building a cohesive AI ecosystem capable of handling massive scale and complex user intent. ### Naver PersonA and LLM-Based User Memory * The "PersonA" project focuses on building a "user memory" by treating fragmented logs across various Naver services as indirect conversations with the user. * By leveraging LLM reasoning, the system transitions from simple data tracking to a sophisticated AI agent that offers context-aware, real-time suggestions. * Technical hurdles addressed include the stable implementation of real-time log reflection for a massive user base and the selection of optimal LLM architectures for personalized inference. ### Trend Analysis and Search-Optimized Models * The Place Trend Analysis system utilizes ranking algorithms to distinguish between temporary surges and sustained popularity, providing a balanced view of "hot places." * LLMs and text mining are employed to move beyond raw data, extracting specific keywords that explain the underlying reasons for a location's trending status. * To improve search quality, Naver developed search-specific LLMs that outperform general models by using specialized data "recipes" and integrating traditional information retrieval with features like "AI briefing" and "AuthGR" for higher reliability. ### Unified Recommendation and Real-Time CRM * Naver Webtoon and Series replaced fragmented recommendation and CRM (Customer Relationship Management) models with a single, unified framework to ensure data consistency. * The architecture shifted from batch-based processing to a real-time, API-based serving system to reduce management complexity and improve the immediacy of personalized user experiences. * This transition focuses on maintaining a seamless UX by synchronizing different ML models under a unified serving logic. ### Scalable Log Pipelines and Infrastructure Stability * The "Logiss" pipeline manages up to tens of billions of logs daily, utilizing a Storm and Kafka environment to ensure high availability and performance. * Engineers implemented a multi-topology approach to allow for seamless, non-disruptive deployments even under heavy loads. * Intelligent features such as "peak-shaving" (distributing peak traffic to off-peak hours), priority-based processing during failures, and efficient data sampling help balance cost, performance, and stability. These sessions provide a practical blueprint for organizations aiming to scale LLM-driven services while maintaining infrastructure integrity. For developers and system architects, Naver’s transition toward unified ML frameworks and specialized, real-time data pipelines offers a proven model for moving AI from experimental phases into high-traffic production environments.

naver

6개월 만에 연간 수십조를 처리하는 DB CDC 복제 도구 무중단/무장애 교체하기 (opens in new tab)

Naver Pay successfully transitioned its core database replication system from a legacy tool to "ergate," a high-performance CDC (Change Data Capture) solution built on Apache Flink and Spring. This strategic overhaul was designed to improve maintainability for backend developers while resolving rigid schema dependencies that previously caused operational bottlenecks. By leveraging a modern stream-processing architecture, the system now manages massive transaction volumes with sub-second latency and enhanced reliability. ### Limitations of the Legacy System * **Maintenance Barriers:** The previous tool, mig-data, was written in pure Java by database core specialists, making it difficult for standard backend developers to maintain or extend. * **Strict Schema Dependency:** Developers were forced to follow a rigid DDL execution order (Target DB before Source DB) to avoid replication halts, complicating database operations. * **Blocking Failures:** Because the legacy system prioritized bi-directional data integrity, a single failed record could stall the entire replication pipeline for a specific shard. * **Operational Risk:** Recovery procedures were manual and restricted to a small group of specialized personnel, increasing the time-to-recovery during outages. ### Technical Architecture and Stack * **Apache Flink (LTS 2.0.0):** Selected for its high-availability, low-latency, and native Kafka integration, allowing the team to focus on replication logic rather than infrastructure. * **Kubernetes Session Mode:** Used to manage 12 concurrent jobs (6 replication, 6 verification) through a single Job Manager endpoint for streamlined monitoring and deployment. * **Hybrid Framework Approach:** The team isolated high-speed replication logic within Flink while using Spring (Kotlin) for complex recovery modules to leverage developer familiarity. * **Data Pipeline:** The system captures MySQL binlogs via `nbase-cdc`, publishes them to Kafka, and uses Flink `jdbc-sink` jobs to apply changes to Target DBs (nBase-T and Oracle). ### Three-Tier Operational Model: Replication, Verification, and Recovery * **Real-time Replication:** Processes incoming Kafka records and appends custom metadata columns (`ergate_yn`, `rpc_time`) to track the replication source and original commit time. * **Delayed Verification:** A dedicated "verifier" Flink job consumes the same Kafka topic with a 2-minute delay to check Target DB consistency against the source record. * **Secondary Logic:** To prevent false positives from rapid updates, the verifier performs a live re-query of the Source DB if a mismatch is initially detected. * **Multi-Stage Recovery:** * **Automatic Short-term:** Retries transient failures after 5 minutes. * **Automatic Long-term:** Uses batch processes to resolve persistent discrepancies. * **Manual:** Provides an admin interface for developers to trigger targeted reconciliations via API. ### Improvements in Schema Management and Performance * **DDL Independence:** By implementing query and schema caching, ergate allows Source and Target tables to be updated in any order without halting the pipeline. * **Performance Scaling:** The new system is designed to handle 10x the current peak QPS, ensuring stability even during high-traffic events like major sales or promotions. * **Metadata Tracking:** The inclusion of specific replication identifiers allows for clear distinction between automated replication and manual force-sync actions during troubleshooting. The ergate project demonstrates that a hybrid architecture—combining the high-throughput processing of Apache Flink with the robust logic handling of Spring—is highly effective for mission-critical financial systems. Organizations managing large-scale data replication should consider decoupling complex recovery logic from the main processing stream to ensure both performance and developer productivity.