dbt

3 posts

daangn

Why Karrot made User (opens in new tab)

Daangn transitioned from manually calculating user activation metrics to a centralized "Activation Layer" built on DBT to solve inconsistencies and high operational overhead. By standardizing the definitions of user states and transitions, the team provides a reliable foundation for analyzing why active user counts fluctuate rather than just reporting the final numbers. This common data layer improves data reliability and cost-efficiency while allowing various teams to reuse the same logic for different core user behaviors. ### The Role of User Activation Analysis * While Active User counts show "what" happened, User Activation explains "why" by breaking users down into specific categories. * The system tracks **Activation States**, classifying users as New, Retained, Reactivated, or Inactive at any given time. * It monitors **State Transitions** to identify how users move between categories, such as "New to Retained" or "Reactivated to Inactive." * The layer provides granular behavioral metadata, including continuous activity streaks, the interval between visits, and the duration of churned periods. ### Ensuring Reliability via Fact Models * Raw event logs are often tied to specific UI elements and contain "noise" that makes them unreliable for direct activation analysis. * To ensure consistency, the Activation Layer uses **Fact Models** as its primary input, which are refined datasets where business logic and core behaviors are already defined. * A strict naming convention (`fact_name_activation_time_grain`) is enforced so that users can immediately identify which specific behavior is being analyzed. * This structure ensures that "Active" status is interpreted identically across the entire organization, regardless of which team is performing the analysis. ### Incremental Processing for Cost Efficiency * Calculating the entire history of user activity every day is computationally expensive and leads to high cloud infrastructure costs. * The architecture utilizes a **FirstLast model** to store only the essential metadata for each user: the date of their very first activity and their most recent activity. * By joining daily activity logs with this lightweight FirstLast table, the system can calculate new states and transitions incrementally. * This approach maintains data idempotency and ensures high performance even as the volume of user interaction data grows. ### Scaling with DBT Macros * To support various metrics—such as app visits, item sales, or community posts—the team encapsulated the complex transition logic into **DBT Macros**. * This abstraction allows data engineers to generate a new activation model by simply specifying the source Fact model and the desired time grain (daily, weekly, or monthly). * Centralizing the logic in macros ensures that any bug fixes or improvements to the activation calculation are automatically reflected across all related data models. * The standardized output format allows for the creation of universal dashboards and analysis templates that work for any tracked behavior. Centralizing User Activation logic into a common data layer allows organizations to move beyond surface-level vanity metrics and gain deep, actionable behavioral insights. By combining DBT’s macro capabilities with incremental modeling, teams can maintain high data quality and operational efficiency even as the variety of tracked user behaviors expands.

naver

Naver TV (opens in new tab)

Naver Webtoon developed "Flow.er," an on-demand data lineage pipeline service designed to overcome the operational inefficiencies and high maintenance costs of legacy data workflows. By integrating dbt for modular modeling and Airflow for scalable orchestration, the platform automates complex backfill and recovery processes while maintaining high data integrity. This shift to a lineage-centric architecture allows the engineering team to manage data as a high-quality product rather than a series of disconnected tasks. ### Challenges in Traditional Data Pipelines * High operational burdens were caused by manual backfilling and recovery tasks, which became increasingly difficult as data volume and complexity grew. * Legacy systems lacked transparency in data dependencies, making it hard to predict the downstream impact of code changes or upstream data failures. * Fragmented development environments led to inconsistencies between local testing and production outputs, slowing down the deployment of new data products. ### Core Architecture and the Role of dbt and Airflow * dbt serves as the central modeling layer, defining transformations and establishing clear data lineage that maps how information flows between tables. * Airflow functions as the orchestration engine, utilizing the lineage defined in dbt to trigger tasks in the correct order and manage execution schedules. * Individual development instances provide engineers with isolated environments to test dbt models, ensuring that logic is validated before being merged into the main pipeline. * The system includes a dedicated model management page and a robust CI/CD pipeline to streamline the transition from development to production. ### Expanding the Platform with Tower and Playground * "Tower" and "Playground" were introduced as supplementary components to support a broader range of data organizations and facilitate easier experimentation. * A specialized Partition Checker was developed to enhance data integrity by automatically verifying that all required data partitions are present before downstream processing begins. * Improvements to the Manager DAG system allow the platform to handle large-scale pipeline deployments across different teams while maintaining a unified view of the data lineage. ### Future Evolution with AI and MCP * The team is exploring the integration of Model Context Protocol (MCP) servers to bridge the gap between data pipelines and AI applications. * Future developments focus on utilizing AI agents to further automate pipeline monitoring and troubleshooting, reducing the need for human intervention in routine maintenance. To build a sustainable and scalable data infrastructure, organizations should transition from simple task scheduling to a lineage-aware architecture. Adopting a framework like Flow.er, which combines the modeling strengths of dbt with the orchestration power of Airflow, enables teams to automate the most labor-intensive parts of data engineering—such as backfills and dependency management—while ensuring the reliability of the final data product.

discord

Overclocking dbt: Discord's Custom Solution in Processing Petabytes of Data (opens in new tab)

Discord scaled its data infrastructure to manage petabytes of data and over 2,500 models by moving beyond a standard dbt implementation. While the tool initially provided a modular and developer-friendly framework, the sheer volume of data and a high headcount of over 100 concurrent developers led to critical performance bottlenecks. To resolve these issues, Discord developed custom extensions to dbt’s core functionality, successfully reducing compilation times and automating complex data transformations. ### Strategic Adoption of dbt * Discord integrated dbt into its stack to leverage software engineering principles like modular design and code reusability for SQL transformations. * The tool’s open-source nature allowed the team to align with Discord’s internal philosophy of community-driven engineering. * The framework offered seamless integration with other internal tools, such as the Dagster orchestrator, and provided a robust testing environment to ensure data quality. ### Scaling Bottlenecks and Performance Issues * The project grew to a size where recompiling the entire dbt project took upwards of 20 minutes, severely hindering developer velocity. * Standard incremental materialization strategies provided by dbt proved inefficient for the petabyte-scale data volumes generated by millions of concurrent users. * Developer workflows often collided, resulting in teams inadvertently overwriting each other’s test tables and creating data silos or inconsistencies. * The lack of specialized handling for complex backfills threatened the organization’s ability to deliver timely and accurate insights. ### Engineering Custom Extensions for Growth * The team built a provider-agnostic layer over Google BigQuery to streamline complex calculations and automate massive data backfills. * Custom optimizations were implemented to prevent breaking changes during the development cycle, ensuring that 100+ developers could work simultaneously without friction. * By extending dbt’s core, Discord transformed slow development cycles into a rapid, automated system capable of serving as the backbone for their global analytics infrastructure. For organizations operating at massive scale, standard open-source tools often require custom-built orchestration and optimization layers to remain viable. Prioritizing the automation of backfills and optimizing compilation logic is essential to maintaining developer productivity and data integrity when dealing with thousands of models and petabytes of information.