analytics-engineering

2 posts

daangn

Why did Karrot make

Daangn transitioned from manually calculating user activation metrics to a centralized "Activation Layer" built on DBT, solving inconsistent definitions and high operational overhead. By standardizing how user states and transitions are defined, the team provides a reliable foundation for analyzing why active user counts fluctuate rather than just reporting the final numbers. This common data layer improves data reliability and cost efficiency while allowing various teams to reuse the same logic for different core user behaviors.

### The Role of User Activation Analysis

* While Active User counts show "what" happened, User Activation explains "why" by breaking users down into specific categories.
* The system tracks **Activation States**, classifying users as New, Retained, Reactivated, or Inactive at any given time.
* It monitors **State Transitions** to identify how users move between categories, such as "New to Retained" or "Reactivated to Inactive."
* The layer provides granular behavioral metadata, including continuous activity streaks, the interval between visits, and the duration of churned periods.

### Ensuring Reliability via Fact Models

* Raw event logs are often tied to specific UI elements and contain "noise" that makes them unreliable for direct activation analysis.
* To ensure consistency, the Activation Layer uses **Fact Models** as its primary input: refined datasets where business logic and core behaviors are already defined.
* A strict naming convention (`fact_name_activation_time_grain`) is enforced so that users can immediately identify which specific behavior is being analyzed.
* This structure ensures that "Active" status is interpreted identically across the entire organization, regardless of which team is performing the analysis.

### Incremental Processing for Cost Efficiency

* Recalculating the entire history of user activity every day is computationally expensive and drives up cloud infrastructure costs.
* The architecture uses a **FirstLast model** to store only the essential metadata for each user: the date of their very first activity and their most recent activity.
* By joining daily activity logs with this lightweight FirstLast table, the system can calculate new states and transitions incrementally (see the first sketch after this summary).
* This approach keeps processing idempotent and performant even as the volume of user interaction data grows.

### Scaling with DBT Macros

* To support various metrics, such as app visits, item sales, or community posts, the team encapsulated the complex transition logic into **DBT Macros** (a second sketch follows below).
* This abstraction allows data engineers to generate a new activation model by simply specifying the source Fact model and the desired time grain (daily, weekly, or monthly).
* Centralizing the logic in macros ensures that any bug fixes or improvements to the activation calculation are automatically reflected across all related data models.
* The standardized output format allows for universal dashboards and analysis templates that work for any tracked behavior.

Centralizing User Activation logic into a common data layer allows organizations to move beyond surface-level vanity metrics and gain deep, actionable behavioral insights. By combining DBT's macro capabilities with incremental modeling, teams can maintain high data quality and operational efficiency even as the variety of tracked user behaviors expands.
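To make the FirstLast approach concrete, here is a minimal sketch of what such an incremental dbt model could look like. This is not Daangn's actual code: the model and column names (`fact_visit`, `user_first_last`, `activity_date`), the Postgres-style date arithmetic, and the one-day threshold separating "retained" from "reactivated" are all assumptions for illustration.

```sql
-- models/activation/fact_visit_activation_daily.sql
-- Hypothetical file name following the fact_name_activation_time_grain convention.
{{ config(materialized='incremental') }}

with daily_activity as (

    -- Refined Fact model: one row per user per day of the tracked behavior.
    select user_id, activity_date
    from {{ ref('fact_visit') }}
    {% if is_incremental() %}
    -- Only process days that are not yet in this model.
    where activity_date > (select max(activity_date) from {{ this }})
    {% endif %}

),

first_last as (

    -- Lightweight FirstLast metadata as of the previous run:
    -- first and most recent activity date per user.
    select user_id, first_activity_date, last_activity_date
    from {{ ref('user_first_last') }}

)

select
    a.user_id,
    a.activity_date,
    case
        -- No prior history (or today is the recorded first activity) -> new.
        when f.user_id is null
          or a.activity_date = f.first_activity_date then 'new'
        -- Active in the immediately preceding day -> retained.
        when a.activity_date - f.last_activity_date = 1 then 'retained'
        -- Returned after a gap (churned period) -> reactivated.
        else 'reactivated'
    end as activation_state,
    -- Postgres-style date subtraction yields the gap in days.
    a.activity_date - f.last_activity_date as days_since_last_activity
from daily_activity a
left join first_last f using (user_id)
-- Users with no activity in the period have no row here; an 'inactive' state
-- would be derived separately from the FirstLast table.
```

After each run, the `user_first_last` table would be refreshed from the same day's activity, so the next increment only compares new rows against this small metadata table instead of scanning the full history.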
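The macro-based reuse could be sketched roughly as follows. The macro name `user_activation`, its parameters, and the window-function simplification (classifying states from collapsed activity rather than joining the FirstLast table) are assumptions; the post does not show the real macro interface.

```sql
-- macros/user_activation.sql
-- Hypothetical macro: centralizes the state logic so any Fact model can reuse it.
{% macro user_activation(fact_model, time_grain) %}
-- time_grain is expected to be 'day', 'week', or 'month' in this sketch.

with activity as (

    -- Collapse raw activity to one row per user per period at the requested grain.
    select
        user_id,
        date_trunc('{{ time_grain }}', activity_date) as period_start
    from {{ ref(fact_model) }}
    group by 1, 2

),

with_previous as (

    -- Look at each user's previous active period to classify the current one.
    select
        user_id,
        period_start,
        lag(period_start) over (
            partition by user_id order by period_start
        ) as prev_period_start
    from with_previous_source ( -- placeholder removed below
)

select * from with_previous
{% endmacro %}
```

A cleaner rendering of the same idea, with the classification included:

```sql
{% macro user_activation(fact_model, time_grain) %}

with activity as (
    select
        user_id,
        date_trunc('{{ time_grain }}', activity_date) as period_start
    from {{ ref(fact_model) }}
    group by 1, 2
),

with_previous as (
    select
        user_id,
        period_start,
        lag(period_start) over (
            partition by user_id order by period_start
        ) as prev_period_start
    from activity
)

select
    user_id,
    period_start,
    case
        when prev_period_start is null then 'new'
        -- Active in the immediately preceding period -> retained.
        when period_start = prev_period_start + interval '1 {{ time_grain }}'
            then 'retained'
        else 'reactivated'
    end as activation_state
from with_previous

{% endmacro %}
```

A model file named per the convention, for example `fact_visit_activation_weekly.sql`, could then contain little more than `{{ user_activation('fact_visit', 'week') }}`, so any fix to the macro propagates to every activation model automatically.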

toss

Toss People: Designing a

Data architecture is evolving from a reactive "cleanup" task into a proactive, end-to-end design process that ensures high data quality from the moment of creation. In fast-paced platform environments, the role of a Data Architect is to bridge the gap between rapid product development and reliable data structures, ultimately creating a foundation that both humans and AI can interpret accurately. By shifting from mere post-processing to foundational governance, organizations can maintain technical agility without sacrificing the integrity of their data assets.

**From Post-Processing to End-to-End Governance**

* Traditional data management often involves "fixing" or "matching puzzles" at the end of the pipeline, after a service has already changed, which leads to perpetual technical debt.
* Effective data architecture requires a culture where data is treated as a primary design object from its inception, rather than a byproduct of application development.
* An end-to-end governance model ensures that data quality is maintained throughout the entire lifecycle, from initial generation in production systems to final analysis and consumption.

**Machine-Understandable Data and Ontologies**

* Modern data design must move beyond human-readable metadata to structures that AI can autonomously process and understand.
* Semantic standard dictionaries and ontologies reduce the need for "inference" or guessing by either humans or machines (a toy illustration follows at the end of this summary).
* By explicitly defining the relationships and conceptual meanings of columns and tables, organizations create a high-fidelity environment where AI can give accurate, context-aware answers without interpretive errors.

**Balancing Development Speed with Data Quality**

* In high-growth environments, insisting on "perfect" design can hurt competitive speed; architects must find a middle ground that allows for future extensibility.
* Practical strategies include designing for current needs while leaving "logical room" for anticipated changes, so that future cleanup is minimally disruptive.
* Instead of enforcing rigid rules, architects should design systems where following the standard is the "path of least resistance," making high-quality data entry easier for developers than the alternative.

**The Role of the Modern Data Architect**

* The role has shifted from a fixed corporate function to a dynamic problem-solver who uses structural design to remove business bottlenecks.
* A successful architect must act as a mediator, convincing stakeholders that investing in a 5% quality improvement (e.g., moving from 90 to 95 points) provides significant long-term ROI in decision-making and AI reliability.
* Aspiring architects should focus on incremental structural improvements; any data professional who cares about how data functions is already on the path to data architecture.
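As a purely illustrative reading of what a "semantic-based standard dictionary" might look like in practice, the sketch below stores agreed business terms with their definitions and maps physical columns to those terms, so that meaning can be resolved by lookup rather than inference. The table and column names are hypothetical, not Toss's actual schema.

```sql
-- Hypothetical standard-dictionary tables: one place where the conceptual
-- meaning of each term lives, plus a mapping from physical columns to terms.
create table standard_term (
    term_id      bigint primary key,
    term_name    varchar(100) not null,   -- e.g. 'transaction_amount'
    definition   text not null,           -- the agreed business meaning
    value_domain varchar(100)             -- e.g. 'KRW, integer, >= 0'
);

create table column_term_mapping (
    table_name   varchar(200) not null,
    column_name  varchar(200) not null,
    term_id      bigint not null references standard_term (term_id),
    relationship varchar(50) default 'is_a',  -- simple ontology-style relation
    primary key (table_name, column_name)
);
```

With this kind of explicit mapping in place, both a human analyst and an AI agent can answer "what does this column mean and how does it relate to other concepts?" by querying metadata instead of guessing from column names.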