dbt

2 posts

daangn

Why did Karrot make

Daangn transitioned from manually calculating user activation metrics to a centralized "Activation Layer" built on dbt to solve inconsistencies and high operational overhead. By standardizing the definitions of user states and transitions, the team provides a reliable foundation for analyzing why active user counts fluctuate rather than just reporting the final numbers. This common data layer improves data reliability and cost-efficiency while allowing various teams to reuse the same logic for different core user behaviors.

### The Role of User Activation Analysis

* While Active User counts show "what" happened, User Activation explains "why" by breaking users down into specific categories.
* The system tracks **Activation States**, classifying users as New, Retained, Reactivated, or Inactive at any given time.
* It monitors **State Transitions** to identify how users move between categories, such as "New to Retained" or "Reactivated to Inactive."
* The layer provides granular behavioral metadata, including continuous activity streaks, the interval between visits, and the duration of churned periods.

### Ensuring Reliability via Fact Models

* Raw event logs are often tied to specific UI elements and contain "noise" that makes them unreliable for direct activation analysis.
* To ensure consistency, the Activation Layer uses **Fact Models** as its primary input: refined datasets where business logic and core behaviors are already defined.
* A strict naming convention (`fact_name_activation_time_grain`) is enforced so that users can immediately identify which specific behavior is being analyzed.
* This structure ensures that "Active" status is interpreted identically across the entire organization, regardless of which team is performing the analysis.

### Incremental Processing for Cost Efficiency

* Calculating the entire history of user activity every day is computationally expensive and leads to high cloud infrastructure costs.
* The architecture uses a **FirstLast model** to store only the essential metadata for each user: the date of their very first activity and their most recent activity (a minimal sketch follows after this summary).
* By joining daily activity logs with this lightweight FirstLast table, the system can calculate new states and transitions incrementally.
* This approach maintains data idempotency and ensures high performance even as the volume of user interaction data grows.

### Scaling with dbt Macros

* To support various metrics, such as app visits, item sales, or community posts, the team encapsulated the complex transition logic into **dbt macros** (sketched after this summary).
* This abstraction allows data engineers to generate a new activation model by simply specifying the source Fact model and the desired time grain (daily, weekly, or monthly).
* Centralizing the logic in macros ensures that any bug fixes or improvements to the activation calculation are automatically reflected across all related data models.
* The standardized output format allows for the creation of universal dashboards and analysis templates that work for any tracked behavior.

Centralizing User Activation logic into a common data layer allows organizations to move beyond surface-level vanity metrics and gain deep, actionable behavioral insights. By combining dbt's macro capabilities with incremental modeling, teams can maintain high data quality and operational efficiency even as the variety of tracked user behaviors expands.
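The FirstLast idea can be sketched as a small incremental dbt model. This is only an illustration of the pattern, not Daangn's code: the model name, the source fact model (`fact_app_visit`), and its columns (`user_id`, `visited_at`) are all assumed here.

```sql
-- models/first_last_app_visit.sql  (hypothetical name and schema)
-- Keeps one row per user with only the first and most recent activity dates,
-- so daily runs never rescan the full event history.
{{ config(materialized='incremental', unique_key='user_id') }}

with new_activity as (
    -- Only the slice of the fact model newer than what is already stored.
    select
        user_id,
        min(cast(visited_at as date)) as first_active_date,
        max(cast(visited_at as date)) as last_active_date
    from {{ ref('fact_app_visit') }}
    {% if is_incremental() %}
    where cast(visited_at as date) > (select max(last_active_date) from {{ this }})
    {% endif %}
    group by user_id
),

merged as (
    select user_id, first_active_date, last_active_date from new_activity
    {% if is_incremental() %}
    union all
    -- Existing rows for the users about to be updated.
    select user_id, first_active_date, last_active_date
    from {{ this }}
    where user_id in (select user_id from new_activity)
    {% endif %}
)

select
    user_id,
    min(first_active_date) as first_active_date,  -- earliest activity ever seen
    max(last_active_date)  as last_active_date    -- most recent activity seen
from merged
group by user_id
```

Because only the new slice of the fact model is scanned and the output is keyed on `user_id`, reruns for the same period overwrite the same rows, which is how the idempotency and cost properties described above are usually achieved.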
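On top of such a table, the transition logic can be wrapped in a macro so that a new activation model only declares its fact model and time grain. The macro name, arguments, and state rules below are assumptions for illustration (the cross-database `dbt.datediff` helper is a standard dbt utility, but Daangn's actual macro is not public), and only the active-side states are shown; the Inactive state and the full transition matrix are omitted for brevity.

```sql
-- macros/generate_activation.sql  (hypothetical macro illustrating the pattern)
{% macro generate_activation(fact_model, first_last_model, time_grain='day') %}

with activity as (
    select user_id, cast(visited_at as date) as active_date
    from {{ ref(fact_model) }}
    {% if is_incremental() %}
    where cast(visited_at as date) > (select max(active_date) from {{ this }})
    {% endif %}
    group by 1, 2
),

labeled as (
    -- Assumes the FirstLast table reflects activity *before* the period being
    -- processed, so the comparison below classifies the new period correctly.
    select
        a.user_id,
        a.active_date,
        case
            when fl.user_id is null or fl.first_active_date = a.active_date
                then 'new'
            when {{ dbt.datediff('fl.last_active_date', 'a.active_date', time_grain) }} = 1
                then 'retained'
            else 'reactivated'
        end as activation_state,
        {{ dbt.datediff('fl.last_active_date', 'a.active_date', time_grain) }}
            as grains_since_last_activity
    from activity a
    left join {{ ref(first_last_model) }} fl
        on fl.user_id = a.user_id
)

select * from labeled

{% endmacro %}
```

A concrete model following the `fact_name_activation_time_grain` naming convention would then be little more than a configuration and a macro call:

```sql
-- models/app_visit_activation_daily.sql  (name follows the stated convention)
{{ config(materialized='incremental', unique_key=['user_id', 'active_date']) }}
{{ generate_activation('fact_app_visit', 'first_last_app_visit', time_grain='day') }}
```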

naver

Building Data Lineage-

Naver Webtoon developed "Flow.er," an on-demand data lineage pipeline service designed to overcome the operational inefficiencies and high maintenance costs of legacy data workflows. By integrating dbt for modular modeling and Airflow for scalable orchestration, the platform automates complex backfill and recovery processes while maintaining high data integrity. This shift to a lineage-centric architecture allows the engineering team to manage data as a high-quality product rather than a series of disconnected tasks.

### Challenges in Traditional Data Pipelines

* High operational burdens were caused by manual backfilling and recovery tasks, which became increasingly difficult as data volume and complexity grew.
* Legacy systems lacked transparency in data dependencies, making it hard to predict the downstream impact of code changes or upstream data failures.
* Fragmented development environments led to inconsistencies between local testing and production outputs, slowing down the deployment of new data products.

### Core Architecture and the Role of dbt and Airflow

* dbt serves as the central modeling layer, defining transformations and establishing clear data lineage that maps how information flows between tables (a minimal example follows after this summary).
* Airflow functions as the orchestration engine, using the lineage defined in dbt to trigger tasks in the correct order and manage execution schedules.
* Individual development instances give engineers isolated environments to test dbt models, ensuring that logic is validated before being merged into the main pipeline.
* The system includes a dedicated model management page and a robust CI/CD pipeline to streamline the transition from development to production.

### Expanding the Platform with Tower and Playground

* "Tower" and "Playground" were introduced as supplementary components to support a broader range of data organizations and facilitate easier experimentation.
* A specialized Partition Checker was developed to enhance data integrity by automatically verifying that all required data partitions are present before downstream processing begins.
* Improvements to the Manager DAG system allow the platform to handle large-scale pipeline deployments across different teams while maintaining a unified view of the data lineage.

### Future Evolution with AI and MCP

* The team is exploring the integration of Model Context Protocol (MCP) servers to bridge the gap between data pipelines and AI applications.
* Future developments focus on using AI agents to further automate pipeline monitoring and troubleshooting, reducing the need for human intervention in routine maintenance.

To build a sustainable and scalable data infrastructure, organizations should transition from simple task scheduling to a lineage-aware architecture. Adopting a framework like Flow.er, which combines the modeling strengths of dbt with the orchestration power of Airflow, enables teams to automate the most labor-intensive parts of data engineering, such as backfills and dependency management, while ensuring the reliability of the final data product.
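As a small illustration of where that lineage comes from (not code from Flow.er, and with made-up table names): every `ref()` call in a dbt model becomes an edge in the project's dependency graph, which dbt writes to `manifest.json` and which an orchestrator like Airflow can read to order and backfill tasks.

```sql
-- models/webtoon_daily_viewers.sql  (illustrative model, not Naver Webtoon's)
-- Each ref() below is recorded in dbt's manifest as an upstream dependency,
-- so the orchestrator knows this model must run after both staging models
-- and which downstream models to rebuild when one of them is backfilled.
select
    v.episode_id,
    e.title,
    cast(v.viewed_at as date)  as view_date,
    count(distinct v.user_id)  as daily_viewers
from {{ ref('stg_episode_views') }} as v
join {{ ref('stg_episodes') }} as e
    on e.episode_id = v.episode_id
group by 1, 2, 3
```

Because these dependencies live in the models themselves rather than in hand-written DAG code, a backfill of `stg_episode_views` can propagate to every affected downstream model automatically, which is the property Flow.er's backfill and recovery automation builds on.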