sql

2 posts

toss

Improving Business Data Literacy:

Toss’s Business Data Team addressed the lack of centralized insights into their business customer (BC) base by building a standardized Single Source of Truth (SSOT) data mart and an iterative Monthly BC Report. This initiative unified fragmented data across business units like Shopping, Ads, and Pay, enabling consistent data-driven decision-making and significantly raising the organization's overall data literacy.

## Establishing a Single Source of Truth (SSOT)

- Addressed the inefficiency of fragmented data across various departments by integrating disparate datasets into a unified, enterprise-wide data mart.
- Standardized the definition of an "active" Business Customer through cross-functional communication and a deep understanding of how revenue and costs are generated in each service domain.
- Eliminated communication overhead by ensuring all stakeholders used a single, verified dataset rather than conflicting numbers from different business silos.

## Designing the Monthly BC Report for Actionable Insights

- Visualized monthly revenue trends by segmenting customers into specific tiers and categories, such as New, Churn, and Retained, to identify where growth or attrition was occurring.
- Implemented Cohort Retention metrics by business unit to measure platform stickiness and help teams understand which services were most effective at retaining business users.
- Provided granular Raw Data lists for high-revenue customers showing significant growth or churn, allowing operational teams to identify immediate action points.
- Refined reporting metrics through in-depth interviews with Product Owners (POs), Sales Leaders, and Domain Heads to ensure the data addressed real-world business questions.

## Technical Architecture and Validation

- Built the core SSOT data mart using Airflow for scalable data orchestration and workflow management.
- Leveraged Jenkins to handle the batch processing and deployment of the specific data layers required for the reporting environment.
- Integrated Tableau with SQL-based fact aggregations to automate the monthly refresh of charts and dashboards, ensuring the report remains a "living" document.
- Conducted "collective intelligence" verification meetings to check metric definitions, units, and visual clarity, ensuring the final report was intuitive for all users.

## Driving Organizational Change and Data Literacy

- Sparked a surge in data demand, leading to follow-up projects such as daily real-time tracking, Cross-Domain Activation analysis, and deeper funnel analysis for BC registrations.
- Transitioned the organizational culture from passive data consumption to active utilization, with diverse roles, including Strategy Managers and Business Marketers, now using BC data to prove their business impact.
- Maintained an iterative approach where the report format evolves every month based on stakeholder feedback, ensuring the data remains relevant to the shifting needs of the business.

Establishing a centralized data culture requires more than just technical infrastructure; it requires a commitment to iterative feedback and clear communication. By moving from fragmented silos to a unified reporting standard, data analysts can transform from simple "number providers" into strategic partners who drive company-wide literacy and growth.

discord

Overclocking dbt: Discord's Custom Solution in Processing Petabytes of Data

Discord scaled its data infrastructure to manage petabytes of data and over 2,500 models by moving beyond a standard dbt implementation. While the tool initially provided a modular and developer-friendly framework, the sheer volume of data and more than 100 concurrent developers led to critical performance bottlenecks. To resolve these issues, Discord developed custom extensions to dbt’s core functionality, reducing compilation times and automating complex data transformations.

### Strategic Adoption of dbt

* Discord integrated dbt into its stack to leverage software engineering principles like modular design and code reusability for SQL transformations.
* The tool’s open-source nature aligned with Discord’s internal philosophy of community-driven engineering.
* The framework offered seamless integration with other internal tools, such as the Dagster orchestrator, and provided a robust testing environment to ensure data quality.

### Scaling Bottlenecks and Performance Issues

* The project grew to a size where recompiling the entire dbt project took upwards of 20 minutes, severely hindering developer velocity.
* Standard incremental materialization strategies provided by dbt proved inefficient for the petabyte-scale data volumes generated by millions of concurrent users.
* Developer workflows often collided, resulting in teams inadvertently overwriting each other’s test tables and creating data silos or inconsistencies.
* The lack of specialized handling for complex backfills threatened the organization’s ability to deliver timely and accurate insights.

### Engineering Custom Extensions for Growth

* The team built a provider-agnostic layer over Google BigQuery to streamline complex calculations and automate massive data backfills.
* Custom optimizations were implemented to prevent breaking changes during the development cycle, ensuring that 100+ developers could work simultaneously without friction.
* By extending dbt’s core, Discord transformed slow development cycles into a rapid, automated system capable of serving as the backbone for their global analytics infrastructure.

For organizations operating at massive scale, standard open-source tools often require custom-built orchestration and optimization layers to remain viable. Prioritizing the automation of backfills and optimizing compilation logic is essential to maintaining developer productivity and data integrity when dealing with thousands of models and petabytes of information.
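For context on the bottleneck above, the standard dbt incremental pattern can be illustrated with a small sketch: on each run, only source rows newer than the target's current high-water mark are inserted, instead of rebuilding the whole table. This uses an in-memory SQLite database with illustrative table names (`events`, `events_model`); it shows the generic `is_incremental()`-style filter dbt emits, not Discord's custom extensions.

```python
# Sketch of dbt-style incremental materialization: each run inserts only
# source rows newer than the max timestamp already in the target table.
# Table names are illustrative, not from Discord's actual project.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id INTEGER, ts TEXT);
CREATE TABLE events_model (event_id INTEGER, ts TEXT);
INSERT INTO events VALUES (1, '2024-01-01'), (2, '2024-01-02');
""")

def run_incremental(conn):
    # Mirrors the filter dbt adds inside is_incremental(): only rows past
    # the target's current high-water mark are processed.
    conn.execute("""
        INSERT INTO events_model
        SELECT event_id, ts FROM events
        WHERE ts > COALESCE((SELECT MAX(ts) FROM events_model), '')
    """)

run_incremental(conn)                                  # first run: loads rows 1 and 2
conn.execute("INSERT INTO events VALUES (3, '2024-01-03')")
run_incremental(conn)                                  # second run: inserts only row 3
rows = conn.execute("SELECT event_id FROM events_model ORDER BY event_id").fetchall()
print(rows)  # → [(1,), (2,), (3,)]
```

At petabyte scale this simple max-timestamp scan against the target is exactly the kind of step that becomes expensive, which is why partition-aware backfill automation over BigQuery was worth building in-house.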