What we learned spending 2 trillion tokens on category classification -- Hello, we're Winter (winter.jung) and Jiwon (jiwon) from the Karrot Taxonomy team. Our team builds the category system we call the taxonomy and operates the pipeline that automatically classifies posts uploaded to Karrot, such as secondhand listings and community group posts, against that system and stores the results for live services to use. This post covers how we use LLMs for category classification in our production pipeline, and what we learned about performance, cost, and oper…
Naver’s Logiss platform, responsible for processing tens of billions of daily logs, evolved its architecture to overcome systemic inefficiencies in resource utilization and deployment stability. By transitioning from a rigid, single-topology structure to an intelligent, multi-topology pipeline, the team achieved zero-downtime deployments and optimized infrastructure costs. These enhancements ensure that critical business data is prioritized during traffic surges while minimizing redundant storage for search-optimized indices.
### Limitations of the Legacy Pipeline
* **Deployment Disruptions:** The previous single-topology setup in Apache Storm lacked a "swap" feature, requiring a total shutdown for updates and causing 3–8 minute processing lags during every deployment.
* **Resource Inefficiency:** Infrastructure was provisioned based on daytime peak loads, which are five times higher than nighttime traffic, resulting in significant underutilization during off-peak hours.
* **Indiscriminate Processing:** During traffic spikes or hardware failures, the system treated all logs equally, causing critical service logs to be delayed alongside low-priority telemetry.
* **Storage Redundancy:** Data was stored at 100% volume in both real-time search (OpenSearch) and long-term storage (Landing Zones), even when sampled data would have sufficed for search purposes.
### Transitioning to Multi-Topology and Subscribe Mode
* **Custom Storm Client:** The team modified `storm-kafka-client` 2.3.0 to revert from the default `assign` mode back to the `subscribe` mode for Kafka partition management.
* **Partition Rebalancing:** While `assign` mode is standard in Storm 2.x, it prevents multiple topologies from sharing a consumer group without duplication; the custom `subscribe` implementation allows Kafka to manage rebalancing across multiple topologies.
* **Zero-Downtime Deployments:** This architectural shift enables rolling updates and canary deployments by allowing new topologies to join the consumer group and take over partitions without stopping the entire pipeline.
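The rebalancing behavior the bullets above describe can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the modified `storm-kafka-client` code: it uses a simple round-robin assignor and hypothetical topology names to show why `subscribe` mode lets a new topology join the consumer group and take over partitions while the old one keeps consuming.

```python
# Toy simulation of Kafka consumer-group rebalancing with a round-robin
# assignor. NOT the actual storm-kafka-client patch; it only illustrates
# why `subscribe` mode allows a rolling/canary deployment: a new topology
# joins the group and the broker rebalances partitions without a full stop.

def round_robin_assign(members, num_partitions):
    """Spread partitions 0..num_partitions-1 across group members in order."""
    members = sorted(members)
    assignment = {m: [] for m in members}
    for p in range(num_partitions):
        assignment[members[p % len(members)]].append(p)
    return assignment

# Before the deploy: the old topology owns every partition.
before = round_robin_assign(["topology-v1"], 6)

# Canary phase: the new topology joins the same consumer group; each
# topology now consumes a disjoint subset -- no duplication, no shutdown.
during = round_robin_assign(["topology-v1", "topology-v2"], 6)

# After v1 leaves, v2 owns everything; consumption never fully paused.
after = round_robin_assign(["topology-v2"], 6)
```

In `assign` mode, by contrast, each topology would pin all partitions itself, so two topologies in the same group would consume every record twice.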
### Intelligent Traffic Steering and Sampling
* **Dynamic Throughput Control:** The "Traffic-Controller" (Storm topology) monitors downstream load and diverts excess non-critical traffic to a secondary "retry" path, protecting the stability of the main pipeline.
* **Tiered Log Prioritization:** The system identifies critical business logs to ensure they bypass bottlenecks, while less urgent logs are queued for post-processing during traffic surges.
* **Storage Optimization via Sampling:** Logiss now supports per-destination sampling rates, allowing the system to send 100% of data to long-term Landing Zones while only indexing a representative sample in OpenSearch, significantly reducing indexing overhead and storage costs.
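The routing rules above can be sketched in a few lines. This is an illustrative model, not Logiss code; the destination names, the `critical` flag, and the hash-bucket sampling scheme are assumptions chosen to show the combination of priority-aware diversion with per-destination sampling.

```python
import zlib

# Illustrative routing sketch (not Logiss internals): every log reaches
# long-term storage, only a deterministic sample is indexed for search,
# and non-critical logs are diverted to a retry path during surges.

def route(log, opensearch_rate=0.1, downstream_overloaded=False):
    """Return the destinations a single log record should be sent to."""
    # During surges, non-critical logs are queued on a retry path for
    # post-processing instead of delaying critical business logs.
    if downstream_overloaded and not log.get("critical", False):
        return ["retry-path"]

    destinations = ["landing-zone"]  # long-term storage always gets 100%

    # Deterministic hash-based sampling: the same log id always makes the
    # same decision, regardless of which worker handles it.
    bucket = zlib.crc32(log["id"].encode()) % 1000
    if bucket < opensearch_rate * 1000:
        destinations.append("opensearch")  # index only a representative sample
    return destinations
```

Hash-based sampling (rather than random sampling) keeps the decision reproducible, which matters when a record is retried or reprocessed.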
### Results and Recommendations
The implementation of an intelligent log pipeline demonstrates that modifying core open-source components, such as the Storm-Kafka client, can be a viable path to achieving specific architectural goals like zero-downtime deployment. For high-volume platforms, moving away from a "one-size-fits-all" processing model toward a priority-aware and sampling-capable pipeline is essential for balancing operational costs with system reliability. Organizations should evaluate whether their real-time search requirements truly necessitate 100% data ingestion or if sampling can provide the necessary insights at a fraction of the cost.
Toss Payments addressed the challenge of serving rapidly growing transaction data within a microservices architecture (MSA) by evolving their data platform from simple Elasticsearch indexing to a robust CQRS pattern. While Apache Druid initially provided high-performance time-series aggregation and significant cost savings, the team eventually integrated StarRocks to overcome limitations in data consistency and complex join operations. This architectural journey highlights the necessity of balancing real-time query performance with operational scalability and domain decoupling.
### Transitioning to MSA and Early Search Solutions
* The shift from a monolithic structure to MSA decoupled application logic but created "data silos" where joining ledgers across domains became difficult.
* The initial solution utilized Elasticsearch to index specific fields for merchant transaction lookups and basic refunds.
* As transaction volumes doubled between 2022 and 2024, the need for complex OLAP-style aggregations led to the adoption of a CQRS (Command Query Responsibility Segregation) architecture.
### Adopting Apache Druid for Time-Series Data
* Druid was selected for its optimization toward time-series data, offering low-latency aggregation for massive datasets.
* It provided a low learning curve by supporting Druid SQL and featured automatic bitmap indexing for all columns, including nested JSON keys.
* The system decoupled reads from writes, allowing the data team to serve billions of records without impacting the primary transaction databases' resources.
### Data Ingestion: Message Publishing over CDC
* The team chose a message publishing approach via Kafka rather than Change Data Capture (CDC) to minimize domain dependency.
* In this model, domain teams publish finalized data packets, reducing the data team's need to maintain complex internal business logic for over 20 different payment methods.
* This strategy simplified system dependencies and leveraged Druid’s ability to automatically index incoming JSON fields.
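The contrast with CDC can be made concrete with a hypothetical event. The field names below are illustrative, not Toss Payments' actual schema: the point is that the domain team publishes an already-final business state, while CDC would deliver raw row diffs the data team would have to re-derive meaning from for every payment method.

```python
import json

# Hypothetical "finalized" event a domain team might publish to Kafka
# (field names are assumptions for illustration). The data team can ingest
# this packet as-is; a CDC feed would instead expose row-level changes,
# pushing payment-method-specific business logic onto the data team.

def build_payment_event(payment_id, method, amount, status):
    return {
        "payment_id": payment_id,
        "method": method,    # e.g. "card", "transfer" -- one of 20+ methods
        "amount": amount,
        "status": status,    # final business state, not a row diff
    }

message = json.dumps(build_payment_event("p-123", "card", 50_000, "APPROVED"))
```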
### Infrastructure and Cost Optimization in AWS
* The architecture separates computing and storage, using AWS S3 for deep storage to keep costs low.
* Performance was optimized by using instances with high-performance local storage instead of network-attached EBS, resulting in up to 9x faster I/O.
* The team utilized Spot Instances for development and testing environments, contributing to a monthly cloud cost reduction of approximately 50 million KRW.
### Operational Challenges and Druid’s Limitations
* **Idempotency and Consistency:** Druid struggled with native idempotency, requiring complex "Merge on Read" logic to handle duplicate messages or state changes.
* **Data Fragmentation:** Transaction cancellations often targeted old partitions, causing fragmentation; the team implemented a 60-second detection process to trigger automatic compaction.
* **Join Constraints:** While Druid supports joins, its capabilities are limited, making it difficult to link complex lifecycles across payment, purchase, and settlement domains.
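The "Merge on Read" idea mentioned above can be sketched as follows. This is a minimal model of the pattern, not Druid's implementation: duplicates and out-of-order deliveries are all appended at ingest time, and the reader resolves the latest state per key, so re-delivered messages do no harm.

```python
# Sketch of the Merge-on-Read pattern (illustrative, not Druid internals):
# ingestion appends everything; the read path picks the highest version per
# key, making duplicate or replayed messages idempotent in effect.

def merge_on_read(rows):
    """rows: iterable of (key, version, value); return latest value per key."""
    latest = {}
    for key, version, value in rows:
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, value)
    return {k: v for k, (_, v) in latest.items()}

rows = [
    ("pay-1", 1, "APPROVED"),
    ("pay-1", 2, "CANCELED"),
    ("pay-1", 1, "APPROVED"),   # duplicate delivery -- ignored at read time
]
```

The operational cost the bullets describe comes from running this resolution on every query path instead of once at write time.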
### Hybrid Search and Rollup Performance
* To ensure high-speed lookups across 10 billion records, a hybrid architecture was built: Elasticsearch handles specific keyword searches to retrieve IDs, which are then used to fetch full details from Druid.
* Druid’s "Rollup" feature was utilized to pre-aggregate data at ingestion time.
* Implementing Rollup reduced average query response times from tens of seconds to under 1 second, representing a 99% performance improvement for aggregate views.
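Rollup's effect can be shown with a small simulation. This is a sketch of the general technique, not Druid's internals; the dimension names (`merchant`, `method`) and hourly granularity are assumptions. Raw events sharing a truncated timestamp and identical dimension values collapse into one pre-aggregated row, so aggregate queries scan far fewer rows.

```python
from collections import defaultdict

# Sketch of ingestion-time rollup (illustrative): events are bucketed by
# truncated timestamp + dimensions, and the metrics are pre-summed, trading
# per-event detail for dramatically cheaper aggregate queries.

def rollup(events, granularity_sec=3600):
    aggregated = defaultdict(lambda: {"count": 0, "amount": 0})
    for e in events:
        bucket = e["ts"] - e["ts"] % granularity_sec   # truncate to the hour
        key = (bucket, e["merchant"], e["method"])
        aggregated[key]["count"] += 1
        aggregated[key]["amount"] += e["amount"]
    return dict(aggregated)

events = [
    {"ts": 3600, "merchant": "m1", "method": "card", "amount": 100},
    {"ts": 3659, "merchant": "m1", "method": "card", "amount": 200},
    {"ts": 7300, "merchant": "m1", "method": "card", "amount": 50},
]
```

The first two events land in the same hourly bucket and become a single row; an aggregate query now reads two rows instead of three, and at billions of events the reduction is what turns tens of seconds into sub-second responses.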
### Moving Toward StarRocks
* To solve Druid's limitations regarding idempotency and multi-table joins, Toss Payments began transitioning to StarRocks.
* StarRocks provides a more stable environment for managing inconsistent events and simplifies the data flow by aligning with existing analytical infrastructure.
* This shift supports the need for a "Unified Ledger" that can track the entire lifecycle of a transaction—from payment to net profit—across disparate database sources.
Naver Pay successfully transitioned its core database replication system from a legacy tool to "ergate," a high-performance CDC (Change Data Capture) solution built on Apache Flink and Spring. This strategic overhaul was designed to improve maintainability for backend developers while resolving rigid schema dependencies that previously caused operational bottlenecks. By leveraging a modern stream-processing architecture, the system now manages massive transaction volumes with sub-second latency and enhanced reliability.
### Limitations of the Legacy System
* **Maintenance Barriers:** The previous tool, mig-data, was written in pure Java by database core specialists, making it difficult for standard backend developers to maintain or extend.
* **Strict Schema Dependency:** Developers were forced to follow a rigid DDL execution order (Target DB before Source DB) to avoid replication halts, complicating database operations.
* **Blocking Failures:** Because the legacy system prioritized bi-directional data integrity, a single failed record could stall the entire replication pipeline for a specific shard.
* **Operational Risk:** Recovery procedures were manual and restricted to a small group of specialized personnel, increasing the time-to-recovery during outages.
### Technical Architecture and Stack
* **Apache Flink (LTS 2.0.0):** Selected for its high availability, low latency, and native Kafka integration, allowing the team to focus on replication logic rather than infrastructure.
* **Kubernetes Session Mode:** Used to manage 12 concurrent jobs (6 replication, 6 verification) through a single Job Manager endpoint for streamlined monitoring and deployment.
* **Hybrid Framework Approach:** The team isolated high-speed replication logic within Flink while using Spring (Kotlin) for complex recovery modules to leverage developer familiarity.
* **Data Pipeline:** The system captures MySQL binlogs via `nbase-cdc`, publishes them to Kafka, and uses Flink `jdbc-sink` jobs to apply changes to Target DBs (nBase-T and Oracle).
### Three-Tier Operational Model: Replication, Verification, and Recovery
* **Real-time Replication:** Processes incoming Kafka records and appends custom metadata columns (`ergate_yn`, `rpc_time`) to track the replication source and original commit time.
* **Delayed Verification:** A dedicated "verifier" Flink job consumes the same Kafka topic with a 2-minute delay to check Target DB consistency against the source record.
* **Secondary Logic:** To prevent false positives from rapid updates, the verifier performs a live re-query of the Source DB if a mismatch is initially detected.
* **Multi-Stage Recovery:**
* **Automatic Short-term:** Retries transient failures after 5 minutes.
* **Automatic Long-term:** Uses batch processes to resolve persistent discrepancies.
* **Manual:** Provides an admin interface for developers to trigger targeted reconciliations via API.
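The verification flow above, including the secondary re-query, can be sketched as a small function. This is an illustrative model, not ergate's code; the databases are stood in by dicts, and the return values are hypothetical labels.

```python
# Sketch of the delayed verifier's mismatch check (illustrative, not ergate):
# compare the delayed Kafka record against the Target DB; on mismatch,
# re-query the live Source DB so a row that was simply updated again during
# the 2-minute delay window is not reported as a false positive.

def verify(record, target_db, source_db):
    key, expected = record["key"], record["value"]
    if target_db.get(key) == expected:
        return "ok"
    # Secondary check: the source may have moved on since this record
    # was produced; if the target matches the *current* source, it is fine.
    if source_db.get(key) == target_db.get(key):
        return "ok"
    return "mismatch"    # genuine divergence -> hand off to recovery stages
```

A `"mismatch"` result is what would feed the short-term retry, then the long-term batch reconciliation, and finally the manual admin API.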
### Improvements in Schema Management and Performance
* **DDL Independence:** By implementing query and schema caching, ergate allows Source and Target tables to be updated in any order without halting the pipeline.
* **Performance Scaling:** The new system is designed to handle 10x the current peak QPS, ensuring stability even during high-traffic events like major sales or promotions.
* **Metadata Tracking:** The inclusion of specific replication identifiers allows for clear distinction between automated replication and manual force-sync actions during troubleshooting.
The ergate project demonstrates that a hybrid architecture—combining the high-throughput processing of Apache Flink with the robust logic handling of Spring—is highly effective for mission-critical financial systems. Organizations managing large-scale data replication should consider decoupling complex recovery logic from the main processing stream to ensure both performance and developer productivity.
Netflix has developed a distributed Write-Ahead Log (WAL) abstraction to address critical data challenges such as accidental corruption, system entropy, and the complexities of cross-region replication. By decoupling data mutation from immediate persistence and providing a unified API, this system ensures strong durability and eventual consistency across diverse storage engines. The WAL acts as a resilient buffer that powers high-leverage features like secondary indexing and delayed retry queues while maintaining the massive scale required for global operations.
### The Role of the WAL Abstraction
* The system serves as a centralized mechanism to capture data changes and reliably deliver them to downstream consumers, mitigating the risk of data loss during administrative errors or database corruption.
* It provides a simplified `WriteToLog` gRPC endpoint that abstracts underlying infrastructure, allowing developers to focus on data logic rather than the specifics of the storage layer.
* By acting as a durable intermediary, it prevents permanent data loss during incidents where primary datastores fail or require schema changes that might otherwise lead to corruption.
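The durable-intermediary role can be sketched minimally. Netflix's interface is a gRPC `WriteToLog` endpoint, not this class; the in-memory list, namespace strings, and method names below are assumptions used only to show the append-then-replay contract that makes recovery from corruption possible.

```python
# Minimal WAL sketch (illustrative, not Netflix's implementation): every
# mutation is appended durably first, and downstream consumers can replay
# from any offset -- e.g. to rebuild a corrupted index or a failed replica.

class WriteAheadLog:
    def __init__(self):
        self.entries = []        # stand-in for durable, replicated storage

    def write_to_log(self, namespace, payload):
        """Append a mutation; return its offset in the log."""
        self.entries.append({"namespace": namespace, "payload": payload})
        return len(self.entries) - 1

    def replay(self, namespace, from_offset=0):
        """Re-deliver a namespace's mutations from a given offset."""
        return [e["payload"] for e in self.entries[from_offset:]
                if e["namespace"] == namespace]
```

Because consumers read from the log rather than from each other, adding a new downstream target (a search index, a cross-region replica) is a new replay, not a new write path.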
### Flexible Personas and Namespaces
* The architecture utilizes "namespaces" to define logical separation, allowing different services to configure specific storage backends like Kafka or SQS based on their needs.
* The "Delayed Queues" persona leverages SQS to provide a scalable way to retry failed messages in real-time pipelines without sacrificing overall system throughput.
* The system can be configured for "Cross-Region Replication," enabling high availability and disaster recovery for storage engines that do not natively support multi-region data transfer.
### Solving System Entropy and Consistency
* The WAL addresses the "dual-write" problem, where updates to primary stores (such as Cassandra) and search indices (such as Elasticsearch) can diverge over time, leading to data inconsistency.
* It facilitates reliable secondary indexing for NoSQL databases by managing updates to multiple partitions as a coordinated sequence of events.
* The platform mitigates operational risks, such as Out-of-Memory (OOM) errors on Key-Value nodes caused by bulk deletes, by staging and throttling mutations through the log.
Organizations operating at scale should adopt a WAL-centric architecture to simplify the management of heterogeneous data stores and enhance system resilience. By centralizing the mutation log, teams can implement complex features like Change Data Capture (CDC) and cross-region failover through a single, consistent interface rather than building bespoke solutions for every service.
Building a Next-Generation Key-Value Store at Airbnb -- How we completely rearchitected Mussel, our storage engine for derived data, and lessons learned from the migration from Mussel V1 to V2. By Shravan Gaonkar, Chandramouli Rangarajan, Yanhan Zhang
LY Corporation’s ABC Studio developed a specialized retail Merchant system by leveraging Domain-Driven Design (DDD) to overcome the functional limitations of a legacy food-delivery infrastructure. The project demonstrates that the primary value of DDD lies not just in technical implementation, but in aligning organizational structures and team responsibilities with domain boundaries. By focusing on the roles and responsibilities of the system rather than just the code, the team created a scalable platform capable of supporting diverse consumer interfaces.
### Redefining the Retail Domain
* The legacy system treated retail items like restaurant entries, creating friction for specialized retail services; the new system was built to be a standalone platform.
* The team narrowed the domain focus to five core areas: Shop, Item, Category, Inventory, and Order.
* Sales-specific logic, such as coupons and promotions, was delegated to external "Consumer Platforms," allowing the Merchant system to serve as a high-performance information provider.
### Clean Architecture and Modular Composition
* The system utilizes Clean Architecture to ensure domain entities remain independent of external frameworks, which also provided a manageable learning curve for new team members.
* Services are split into two distinct modules: "API" modules for receiving external requests and "Engine" modules for processing business logic.
* Communication between these modules is handled asynchronously via gRPC and Apache Kafka, using the Decaton library to increase throughput while maintaining a low partition count.
* The architecture prioritizes eventual consistency, allowing for high responsiveness and scalability across the platform.
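The throughput technique mentioned above can be illustrated with a toy model. This is not Decaton's actual API; it sketches the general idea of sub-partitioning under assumed names: records from one Kafka partition are fanned out to several worker "lanes" by key hash, raising concurrency without adding partitions while keeping per-key ordering intact.

```python
import zlib
from collections import defaultdict

# Sketch of Decaton-style sub-partitioning (illustrative, not the library's
# API): spreading one partition's records across worker lanes by key hash.
# Records with the same key always land in the same lane, in arrival order,
# so per-key ordering survives the added concurrency.

def assign_lanes(records, num_lanes=4):
    lanes = defaultdict(list)
    for record in records:
        lane = zlib.crc32(record["key"].encode()) % num_lanes
        lanes[lane].append(record)    # same key -> same lane, in order
    return lanes
```

This is why a low partition count suffices: parallelism comes from lanes inside the consumer rather than from the broker's partition layout.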
### Global Collaboration and Conway’s Law
* Development was split between teams in Korea (Core Domain) and Japan (System Integration and BFF), requiring a shared understanding of domain boundaries.
* Architectural Decision Records (ADR) were implemented to document critical decisions and prevent "knowledge drift" during long-term collaboration.
* The organizational structure was intentionally designed to mirror the system architecture, with specific teams (Core, Link, BFF, and Merchant Link) assigned to distinct domain layers.
* This alignment, reflecting Conway’s Law, ensures that changes to external consumer platforms have minimal impact on the stable core domain logic.
Successful DDD adoption requires moving beyond technical patterns like hexagonal architecture and focusing on establishing a shared understanding of roles across the organization. By structuring teams to match domain boundaries, companies can build resilient systems where the core business logic remains protected even as the external service ecosystem evolves.
From multi-day latency to near real-time insights: Figma’s data pipeline upgrade -- After exponential growth in users and data, daily synchronization tasks started taking hours or even days to complete. Here’s how rebuilding a data pipelin…