data-pipeline

8 posts

daangn

Daangn's User Behavior

Daangn transitioned its user behavior log management from a manual, code-based Git workflow to a centralized UI platform called Event Center to improve data consistency and operational efficiency. By automating schema creation and enforcing standardized naming conventions, the platform reduced the technical barriers for developers and analysts while ensuring high data quality for downstream analysis. This transition has streamlined the entire data lifecycle, from collection in the mobile app to structured storage in BigQuery.

### Challenges of Code-Based Schema Management

Prior to Event Center, Daangn managed its event schemas—definitions that describe the ownership, domain, and custom parameters of a log—using Git and manual JSON files. This approach created several bottlenecks for the engineering team:

* **High Entry Barrier**: Users were required to write complex Spark `StructType` JSON files, which involved managing nested structures and specific metadata fields like `nullable` and `type` (a minimal example is sketched at the end of this summary).
* **Inconsistent Naming**: Without a central enforcement mechanism, event names followed different patterns (e.g., `item_click` vs. `click_item`), making it difficult for analysts to discover relevant data.
* **Operational Friction**: Every schema change required a Pull Request (PR), manual review by the data team, and a series of CI checks, leading to slow iteration cycles and frequent communication overhead.

### The User Behavior Log Pipeline

To support data-driven decision-making, Daangn employs a robust pipeline that processes millions of events daily through several critical stages:

* **Collection and Validation**: Events are sent from the mobile SDK to an event server, which performs initial validation before passing data to GCP Pub/Sub.
* **Streaming Processing**: GCP Dataflow handles real-time deduplication, field validation, and data transformation (flattening) to prepare logs for storage.
* **Storage and Accessibility**: Data is stored in Google Cloud Storage and BigQuery, where custom parameters defined in the schema are automatically expanded into searchable columns, removing the need for complex JSON parsing in SQL.

### Standardizing Discovery via Event Center

The Event Center platform was designed to transform log management into a user-friendly, UI-driven experience while maintaining technical rigor.

* **Standardized Naming Conventions**: The platform enforces a strict "Action-Object-Service" naming rule, ensuring that all events are categorized logically across the entire organization.
* **Recursive Schema Builder**: To handle the complexity of nested JSON data, the team built a UI component that uses a recursive tree structure, allowing users to define deep data hierarchies without writing code.
* **Centralized Dictionary**: The platform serves as a "single source of truth" where any employee can search for events, view their descriptions, and identify the team responsible for specific data points.

### Technical Implementation and Integration

The system architecture was built to bridge the gap between a modern web UI and the existing Git-based infrastructure.

* **Tech Stack**: The backend is powered by Go (Gin framework) and PostgreSQL (GORM), while the frontend utilizes React, TypeScript, and TanStack Query for state management.
* **Automated Git Sync**: When a user saves a schema in Event Center, the system automatically triggers a GitHub Action that generates the necessary JSON files and pushes them to the repository, maintaining the codebase as the ultimate source of truth while abstracting the complexity.
* **Real-time Validation**: The UI provides immediate feedback on data types and naming errors, preventing invalid schemas from reaching the production pipeline.

Implementing a dedicated log management platform like Event Center is highly recommended for organizations scaling their data operations. Moving away from manual file management to a UI-based system not only reduces the risk of human error but also democratizes data access by allowing non-engineers to define and discover the logs they need for analysis.
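The "High Entry Barrier" point is easier to appreciate with a concrete schema in hand. The sketch below, written with PySpark, builds the kind of nested `StructType` that previously had to be hand-maintained as JSON in Git; the field names are hypothetical examples, not Daangn's actual event schema.

```python
# Hypothetical event schema: illustrates the nested StructType JSON users had to
# hand-write before Event Center. Field names are illustrative only.
from pyspark.sql.types import LongType, StringType, StructField, StructType

event_schema = StructType([
    StructField("event_name", StringType(), nullable=False),
    StructField("user_id", LongType(), nullable=False),
    # Even one level of nested custom parameters means tracking "type",
    # "nullable", and "metadata" for every field by hand.
    StructField("params", StructType([
        StructField("screen", StringType(), nullable=True),
        StructField("item_id", LongType(), nullable=True),
    ]), nullable=True),
])

# This JSON string is the artifact that previously had to be committed and reviewed.
print(event_schema.json())
```

Even for this toy schema, the generated JSON carries `type`, `nullable`, and `metadata` entries for every field, which is exactly the friction the Event Center UI removes.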

toss

Enhancing Data Literacy for

Toss’s Business Data Team addressed the lack of centralized insights into their business customer (BC) base by building a standardized Single Source of Truth (SSOT) data mart and an iterative Monthly BC Report. This initiative successfully unified fragmented data across business units like Shopping, Ads, and Pay, enabling consistent data-driven decision-making and significantly raising the organization's overall data literacy.

## Establishing a Single Source of Truth (SSOT)

- Addressed the inefficiency of fragmented data across various departments by integrating disparate datasets into a unified, enterprise-wide data mart.
- Standardized the definition of an "active" Business Customer through cross-functional communication and a deep understanding of how revenue and costs are generated in each service domain.
- Eliminated communication overhead by ensuring all stakeholders used a single, verified dataset rather than conflicting numbers from different business silos.

## Designing the Monthly BC Report for Actionable Insights

- Visualized monthly revenue trends by segmenting customers into specific tiers and categories, such as New, Churn, and Retained, to identify where growth or attrition was occurring.
- Implemented Cohort Retention metrics by business unit to measure platform stickiness and help teams understand which services were most effective at retaining business users.
- Provided granular Raw Data lists for high-revenue customers showing significant growth or churn, allowing operational teams to identify immediate action points.
- Refined reporting metrics through in-depth interviews with Product Owners (POs), Sales Leaders, and Domain Heads to ensure the data addressed real-world business questions.

## Technical Architecture and Validation

- Built the core SSOT data mart using Airflow for scalable data orchestration and workflow management.
- Leveraged Jenkins to handle the batch processing and deployment of the specific data layers required for the reporting environment.
- Integrated Tableau with SQL-based fact aggregations to automate the monthly refresh of charts and dashboards, ensuring the report remains a "living" document (a minimal orchestration sketch follows this summary).
- Conducted "collective intelligence" verification meetings to check metric definitions, units, and visual clarity, ensuring the final report was intuitive for all users.

## Driving Organizational Change and Data Literacy

- Sparked a surge in data demand, leading to follow-up projects such as daily real-time tracking, Cross-Domain Activation analysis, and deeper funnel analysis for BC registrations.
- Transitioned the organizational culture from passive data consumption to active utilization, with diverse roles—including Strategy Managers and Business Marketers—now using BC data to prove their business impact.
- Maintained an iterative approach where the report format evolves every month based on stakeholder feedback, ensuring the data remains relevant to the shifting needs of the business.

Establishing a centralized data culture requires more than just technical infrastructure; it requires a commitment to iterative feedback and clear communication. By moving from fragmented silos to a unified reporting standard, data analysts can transform from simple "number providers" into strategic partners who drive company-wide literacy and growth.
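As a rough illustration of how the Airflow-built SSOT layer and the monthly report refresh could hang together, here is a minimal DAG sketch. The task names, SQL files, and helper scripts (`run_sql.sh`, `refresh_tableau.sh`) are assumptions for illustration; the post does not describe Toss's actual DAG.

```python
# A minimal sketch of a monthly SSOT-and-report refresh in Airflow.
# All task names, scripts, and SQL file paths are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="monthly_bc_report",
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",  # assumes Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    # 1) Rebuild the unified BC activity fact table (the SSOT layer).
    build_ssot = BashOperator(
        task_id="build_bc_ssot",
        bash_command="run_sql.sh sql/bc_ssot_monthly.sql",  # hypothetical helper script
    )

    # 2) Aggregate New / Churn / Retained tiers and cohort retention on top of the SSOT.
    build_report_facts = BashOperator(
        task_id="build_bc_report_facts",
        bash_command="run_sql.sh sql/bc_report_facts.sql",
    )

    # 3) Trigger the dashboard extract refresh so charts pick up the new month.
    refresh_dashboards = BashOperator(
        task_id="refresh_dashboards",
        bash_command="refresh_tableau.sh bc_monthly_report",  # hypothetical
    )

    build_ssot >> build_report_facts >> refresh_dashboards
```

Keeping the SSOT rebuild, the report-fact aggregation, and the dashboard refresh as separate tasks mirrors the post's separation between the shared data mart and the monthly report built on top of it.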

daangn

You don't need to fetch it

As Daangn’s data volume grew, its traditional full-dump approach, which used the Spark connector to scan entire MongoDB collections, began causing significant CPU spikes and failing to meet the two-hour data delivery Service Level Objective (SLO). To resolve this, the team implemented a Change Data Capture (CDC) pipeline using Flink CDC to synchronize data efficiently without the need for resource-intensive full table scans. This transition successfully stabilized database performance and ensured timely data availability in BigQuery by focusing on incremental change logs rather than repeated bulk extracts.

### Limitations of Traditional Dump Methods

* The previous Spark Connector method required full table scans, creating a direct conflict between service stability and data freshness.
* Attempts to lower DB load resulted in missing the 2-hour SLO, while meeting the SLO pushed CPU usage to dangerous levels.
* Standard incremental loading was ruled out because it relied on `updated_at` fields, which were not consistently updated across all business logic or schemas.
* The team targeted the top five largest and most frequently updated collections for the initial CDC transition to maximize performance gains.

### Advantages of Flink CDC

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to use resume tokens and Flink checkpoints for seamless recovery after failures (a minimal source definition is sketched after this summary).
* It guarantees "Exactly-Once" processing by periodically saving the pipeline state to distributed storage, ensuring data integrity during restarts.
* Unlike tools like Debezium that require separate systems for data processing, Flink handles the entire "Extract-Transform-Load" (ETL) lifecycle within a single job.
* The architecture is horizontally scalable; increasing the number of TaskManagers allows the pipeline to handle surges in event volume with linear performance improvements.

### Pipeline Architecture and Implementation

* The system utilizes the MongoDB Oplog to capture real-time write operations (inserts, updates, and deletes), which are then processed by Flink.
* The backend pipeline operates on an hourly batch cycle to extract the latest change events, deduplicate them, and merge them into raw JSON tables in BigQuery.
* A "Schema Evolution" step automatically detects and adds missing fields to BigQuery tables, bridging the gap between NoSQL flexibility and SQL structure.
* While Flink captures data in real-time, the team opted for hourly materialization to maintain idempotency, simplify error recovery, and meet existing business requirements without unnecessary architectural complexity.

For organizations managing large-scale MongoDB instances, moving from bulk extracts to a CDC-based model is a critical step in balancing database health with analytical needs. Implementing a unified framework like Flink CDC not only reduces the load on operational databases but also simplifies the management of complex data transformations and schema changes.
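To make the Change Stream source more concrete, here is a minimal PyFlink Table API sketch of a `mongodb-cdc` source with checkpointing enabled. Host names, credentials, database and collection names, and the declared columns are placeholders; the transformation logic and BigQuery sink used in the actual pipeline are omitted.

```python
# A sketch of a Flink CDC source for MongoDB Change Streams (requires the
# flink-sql-connector-mongodb-cdc jar on the classpath). All names are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Periodic checkpoints back the exactly-once guarantee and, together with MongoDB
# resume tokens, let the job resume after a failure. Checkpoint storage (e.g.
# state.checkpoints.dir pointing at GCS/S3) is assumed to be set in flink-conf.yaml.
t_env.get_config().set("execution.checkpointing.interval", "60 s")

t_env.execute_sql("""
    CREATE TABLE chat_changes (
        _id STRING,
        user_id BIGINT,           -- illustrative document fields, not the real schema
        content STRING,
        PRIMARY KEY (_id) NOT ENFORCED
    ) WITH (
        'connector'  = 'mongodb-cdc',
        'hosts'      = 'mongo-0.example.internal:27017',
        'username'   = 'cdc_reader',
        'password'   = '******',
        'database'   = 'marketplace',
        'collection' = 'chats'
    )
""")
```

From here the real job transforms the change events and writes them toward BigQuery, where the hourly batch described above deduplicates and merges them.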

daangn

No Need to Fetch Everything Every Time

To optimize data synchronization and ensure production stability, Daangn’s data engineering team transitioned their MongoDB data pipeline from a resource-intensive full-dump method to a Change Data Capture (CDC) architecture. By leveraging Flink CDC, the team successfully reduced database CPU usage to under 60% while consistently meeting a two-hour data delivery Service Level Objective (SLO). This shift enables efficient, schema-agnostic data replication to BigQuery, facilitating high-scale analysis without compromising the performance of live services.

### Limitations of Traditional Dump Methods

* The previous Spark Connector-based approach required full table scans, leading to a direct trade-off between hitting delivery deadlines and maintaining database health.
* Increasing data volumes caused significant CPU spikes, threatening the stability of transaction processing in production environments.
* Standard incremental loads were unreliable because many collections lacked consistent `updated_at` fields or required tracking hard deletes, which `updated_at`-based incremental extracts cannot capture.

### Advantages of Flink CDC for MongoDB

* Flink CDC provides native support for MongoDB Change Streams, allowing the system to read the Oplog directly and use resume tokens to restart from specific failure points.
* The framework’s checkpointing mechanism ensures "Exactly-Once" processing by periodically saving the pipeline state to distributed storage like GCS or S3.
* Unlike standalone tools like Debezium, Flink allows for an integrated "Extract-Transform-Load" (ETL) flow within a single job, reducing operational complexity and the need for intermediate message queues.
* The architecture is horizontally scalable, meaning TaskManagers can be increased to handle sudden bursts in event volume without re-architecting the pipeline.

### Pipeline Architecture and Processing Logic

* The core engine monitors MongoDB write operations (Insert, Update, Delete) in real-time via Change Streams and transmits them to BigQuery.
* An hourly batch process is utilized rather than pure real-time streaming to prioritize operational stability, idempotency, and easier recovery from failures.
* The downstream pipeline includes a Schema Evolution step that automatically detects and adds new fields to BigQuery tables, ensuring the NoSQL-to-SQL transition is seamless.
* Data processing involves deduplicating recent change events and merging them into a raw JSON table before materializing them into a final structured table for end-users (a simplified version of this merge is sketched after this summary).

For organizations managing large-scale MongoDB clusters, implementing Flink CDC serves as a powerful solution to balance analytical requirements with database performance. Prioritizing a robust, batch-integrated CDC flow allows teams to meet strict delivery targets and maintain data integrity without the infrastructure overhead of a fully real-time streaming system.
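The hourly deduplicate-and-merge step might look roughly like the following BigQuery `MERGE`, issued from Python. Dataset, table, and column names are hypothetical; the actual SQL, schema-evolution handling, and final materialization are not shown in the post.

```python
# A sketch of the hourly "deduplicate and merge into the raw table" step.
# All dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

MERGE_SQL = """
MERGE `analytics.chats_raw` AS target
USING (
  -- Keep only the latest change event per document captured in the last hour.
  SELECT * EXCEPT (rn)
  FROM (
    SELECT
      doc_id,
      operation,          -- insert / update / delete from the change stream
      payload,            -- raw document as JSON
      changed_at,
      ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY changed_at DESC) AS rn
    FROM `analytics.chats_changelog`
    WHERE changed_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  )
  WHERE rn = 1
) AS source
ON target.doc_id = source.doc_id
WHEN MATCHED AND source.operation = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET payload = source.payload, changed_at = source.changed_at
WHEN NOT MATCHED AND source.operation != 'delete' THEN
  INSERT (doc_id, payload, changed_at) VALUES (source.doc_id, source.payload, source.changed_at)
"""

client.query(MERGE_SQL).result()
```

Because the statement keeps only the latest change event per document and deletes on `delete` operations, re-running the same hourly window produces the same table state, which is the idempotency the team cites as a reason for batch materialization.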

naver

[DAN25]

Naver recently released the full video archives from its DAN25 conference, highlighting the company’s strategic roadmap for AI agents, Sovereign AI, and digital transformation. The sessions showcase how Naver is moving beyond general AI applications to implement specialized, real-time systems that integrate large language models (LLMs) directly into core services like search, commerce, and content. By sharing these technical insights publicly, Naver demonstrates its progress in building a cohesive AI ecosystem capable of handling massive scale and complex user intent.

### Naver PersonA and LLM-Based User Memory

* The "PersonA" project focuses on building a "user memory" by treating fragmented logs across various Naver services as indirect conversations with the user.
* By leveraging LLM reasoning, the system transitions from simple data tracking to a sophisticated AI agent that offers context-aware, real-time suggestions.
* Technical hurdles addressed include the stable implementation of real-time log reflection for a massive user base and the selection of optimal LLM architectures for personalized inference.

### Trend Analysis and Search-Optimized Models

* The Place Trend Analysis system utilizes ranking algorithms to distinguish between temporary surges and sustained popularity, providing a balanced view of "hot places."
* LLMs and text mining are employed to move beyond raw data, extracting specific keywords that explain the underlying reasons for a location's trending status.
* To improve search quality, Naver developed search-specific LLMs that outperform general models by using specialized data "recipes" and integrating traditional information retrieval with features like "AI briefing" and "AuthGR" for higher reliability.

### Unified Recommendation and Real-Time CRM

* Naver Webtoon and Series replaced fragmented recommendation and CRM (Customer Relationship Management) models with a single, unified framework to ensure data consistency.
* The architecture shifted from batch-based processing to a real-time, API-based serving system to reduce management complexity and improve the immediacy of personalized user experiences.
* This transition focuses on maintaining a seamless UX by synchronizing different ML models under a unified serving logic.

### Scalable Log Pipelines and Infrastructure Stability

* The "Logiss" pipeline manages up to tens of billions of logs daily, utilizing a Storm and Kafka environment to ensure high availability and performance.
* Engineers implemented a multi-topology approach to allow for seamless, non-disruptive deployments even under heavy loads.
* Intelligent features such as "peak-shaving" (distributing peak traffic to off-peak hours), priority-based processing during failures, and efficient data sampling help balance cost, performance, and stability.

These sessions provide a practical blueprint for organizations aiming to scale LLM-driven services while maintaining infrastructure integrity. For developers and system architects, Naver’s transition toward unified ML frameworks and specialized, real-time data pipelines offers a proven model for moving AI from experimental phases into high-traffic production environments.

naver

Building Data Lineage-

Naver Webtoon developed "Flow.er," an on-demand data lineage pipeline service designed to overcome the operational inefficiencies and high maintenance costs of legacy data workflows. By integrating dbt for modular modeling and Airflow for scalable orchestration, the platform automates complex backfill and recovery processes while maintaining high data integrity. This shift to a lineage-centric architecture allows the engineering team to manage data as a high-quality product rather than a series of disconnected tasks.

### Challenges in Traditional Data Pipelines

* High operational burdens were caused by manual backfilling and recovery tasks, which became increasingly difficult as data volume and complexity grew.
* Legacy systems lacked transparency in data dependencies, making it hard to predict the downstream impact of code changes or upstream data failures.
* Fragmented development environments led to inconsistencies between local testing and production outputs, slowing down the deployment of new data products.

### Core Architecture and the Role of dbt and Airflow

* dbt serves as the central modeling layer, defining transformations and establishing clear data lineage that maps how information flows between tables.
* Airflow functions as the orchestration engine, utilizing the lineage defined in dbt to trigger tasks in the correct order and manage execution schedules.
* Individual development instances provide engineers with isolated environments to test dbt models, ensuring that logic is validated before being merged into the main pipeline.
* The system includes a dedicated model management page and a robust CI/CD pipeline to streamline the transition from development to production.

### Expanding the Platform with Tower and Playground

* "Tower" and "Playground" were introduced as supplementary components to support a broader range of data organizations and facilitate easier experimentation.
* A specialized Partition Checker was developed to enhance data integrity by automatically verifying that all required data partitions are present before downstream processing begins (a simplified check is sketched after this summary).
* Improvements to the Manager DAG system allow the platform to handle large-scale pipeline deployments across different teams while maintaining a unified view of the data lineage.

### Future Evolution with AI and MCP

* The team is exploring the integration of Model Context Protocol (MCP) servers to bridge the gap between data pipelines and AI applications.
* Future developments focus on utilizing AI agents to further automate pipeline monitoring and troubleshooting, reducing the need for human intervention in routine maintenance.

To build a sustainable and scalable data infrastructure, organizations should transition from simple task scheduling to a lineage-aware architecture. Adopting a framework like Flow.er, which combines the modeling strengths of dbt with the orchestration power of Airflow, enables teams to automate the most labor-intensive parts of data engineering—such as backfills and dependency management—while ensuring the reliability of the final data product.
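As an illustration of what the Partition Checker's core check amounts to, here is a small, self-contained Python sketch that verifies a daily-partitioned table has every expected partition before downstream models run. The daily-partition layout and the hard-coded partition list are assumptions; the post does not describe Flow.er's implementation in this much detail.

```python
# A simplified partition check: halt downstream processing if any expected
# daily partition (YYYYMMDD) is missing. Partition layout is an assumption.
from collections.abc import Iterable
from datetime import date, timedelta


def expected_partitions(start: date, end: date) -> set[str]:
    """All daily partition keys (YYYYMMDD) in the closed interval [start, end]."""
    days = (end - start).days + 1
    return {(start + timedelta(d)).strftime("%Y%m%d") for d in range(days)}


def check_partitions(existing: Iterable[str], start: date, end: date) -> None:
    """Raise if any required partition is missing, so orchestration can stop downstream tasks."""
    missing = expected_partitions(start, end) - set(existing)
    if missing:
        raise RuntimeError(f"Missing partitions: {sorted(missing)}")


if __name__ == "__main__":
    # In a real pipeline 'existing' would come from warehouse metadata;
    # here it is hard-coded for the sketch.
    existing = ["20240101", "20240102"]
    check_partitions(existing, date(2024, 1, 1), date(2024, 1, 3))  # raises: 20240103 missing
```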

line

Hosting the Tech Conference Tech-Verse

LY Corporation is hosting its global technology conference, Tech-Verse 2025, on June 30 and July 1 to showcase the engineering expertise of its international teams. The event features 127 sessions centered on core themes of AI and security, offering a deep dive into how the group's developers, designers, and product managers solve large-scale technical challenges. Interested participants can register for free on the official website to access the online live-streamed sessions, which include real-time interpretation in English, Korean, and Japanese.

### Conference Overview and Access

* The event runs for two days, from 10:00 AM to 6:00 PM (KST), and is primarily delivered via online streaming.
* Registration is open to the public at no cost through the Tech-Verse 2025 official website.
* The conference brings together technical talent from across the LY Corporation Group, including LINE Plus, LINE Taiwan, and LINE Vietnam.

### Multi-Disciplinary Technical Tracks

* The agenda is divided into 12 distinct categories to cover the full spectrum of software development and product lifecycle.
* Day 1 focuses on foundational technologies: AI, Security, Server-side development, Private Cloud, Infrastructure, and Data Platforms.
* Day 2 explores application and management layers: AI Use Cases, Frontend, Mobile Applications, Design, Product Management, and Engineering Management.

### Key Engineering Case Studies and Sessions

* **AI and Data Automation:** Sessions explore the evolution of development processes using AI, the shift from "Vibe Coding" to professional AI-assisted engineering, and the use of Generative AI to automate data pipelines.
* **Infrastructure and Scaling:** Presentations include how the "Central Dogma Control Plane" connects thousands of services within LY Corporation and methods for improving video playback quality for LINE Call.
* **Framework Migration:** A featured case study details the strategic transition of the "Demae-can" service from React Native to Flutter.
* **Product Insights:** Deep dives into user experience design and data-driven insights gathered from LINE Talk's global user base.

Tech-Verse 2025 provides a valuable opportunity for developers to learn from real-world deployments of AI and large-scale infrastructure. Given the breadth of the 127 sessions and the availability of real-time translation, tech professionals should review the timetable in advance to prioritize tracks relevant to their specific engineering interests.

coupang

Coupang SCM Workflow: Developing

Coupang has developed an internal SCM Workflow platform to address the complex data and operational needs of its Supply Chain Management team. By implementing low-code and no-code functionalities, the platform enables developers, data scientists, and business analysts to build data pipelines and launch services without the traditional bottlenecks of manual development.

### Addressing Inefficiencies in SCM Data Management

* The SCM team manages a massive network of suppliers and fulfillment centers (FCs) where demand forecasting and inventory distribution require constant data feedback.
* Traditionally, non-technical stakeholders like business analysts (BAs) relied heavily on developers to build or modify data pipelines, leading to high communication costs and slower response times to changing business requirements.
* The new platform aims to simplify the complexity found in traditional tools like Jenkins, Airflow, and Jupyter Notebooks, providing a unified interface for data creation and visualization.

### Democratizing Access with the No-code Data Builder

* The "Data Builder" allows users to perform data queries, extraction, and system integration through a visual interface rather than writing backend code.
* It provides seamless access to a wide array of data sources used across Coupang, including Redshift, Hive, Presto, Aurora, MySQL, Elasticsearch, and S3.
* Users can construct workflows by creating "nodes" for specific tasks—such as extracting inventory data from Hive or calculating transfer quantities—and linking them together to automate complex decisions like inter-center product transfers (a simplified sketch follows this summary).

### Expanding Capabilities through Low-code Service Building

* The platform functions as a "Service Builder," allowing users to expand domains and launch simple services without building entirely new infrastructure from scratch.
* This approach enables developers to focus on high-level algorithm development while allowing data scientists to apply and test new models directly within the production environment.
* By reducing the need for code changes to reflect new requirements, the platform significantly increases the agility of the SCM pipeline.

Organizations managing complex, data-driven ecosystems can significantly reduce operational friction by adopting low-code/no-code platforms. Empowering non-technical stakeholders to handle data processing and service integration not only accelerates innovation but also allows engineering resources to be redirected toward core architectural challenges.
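To make the node-and-link idea concrete, the sketch below shows one way a Data Builder style workflow could be represented and executed in plain Python: a list of nodes where each node consumes the outputs of its upstream nodes. The data structure, the fixed inventory rows, and the transfer rule are assumptions for illustration, not Coupang's internal design.

```python
# A toy representation of a linked-node workflow: extract inventory, then compute
# a transfer quantity between fulfillment centers. Everything here is illustrative.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Node:
    name: str
    run: Callable[[dict[str, Any]], Any]      # receives outputs of upstream nodes
    upstream: list[str] = field(default_factory=list)


def execute(nodes: list[Node]) -> dict[str, Any]:
    """Run nodes in declaration order, passing each node the outputs of its upstream nodes."""
    results: dict[str, Any] = {}
    for node in nodes:
        inputs = {name: results[name] for name in node.upstream}
        results[node.name] = node.run(inputs)
    return results


workflow = [
    # In the real platform this node would issue a Hive query; here it returns fixed rows.
    Node("extract_inventory", lambda _: [{"fc": "FC1", "sku": "A", "stock": 40},
                                         {"fc": "FC2", "sku": "A", "stock": 5}]),
    # Decide how many units to move toward the under-stocked fulfillment center.
    Node("calc_transfer",
         lambda inp: max(0, (inp["extract_inventory"][0]["stock"]
                             - inp["extract_inventory"][1]["stock"]) // 2),
         upstream=["extract_inventory"]),
]

print(execute(workflow))   # {'extract_inventory': [...], 'calc_transfer': 17}
```

A production version would resolve execution order from the links (for example, with a topological sort) and swap the lambdas for connectors to Hive, Redshift, or the other sources the platform exposes.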