discord

Overclocking dbt: Discord's Custom Solution in Processing Petabytes of Data

scalability data-engineering dbt dagster sql google-bigquery data-transformation

Discord scaled its data infrastructure to manage petabytes of data and over 2,500 models by moving beyond a standard dbt implementation. The tool initially provided a modular, developer-friendly framework, but the sheer volume of data and more than 100 developers working in the project concurrently led to critical performance bottlenecks. To resolve these issues, Discord developed custom extensions to dbt’s core functionality, reducing compilation times and automating complex data transformations.

Strategic Adoption of dbt

Discord integrated dbt into its stack to leverage software engineering principles like modular design and code reusability for SQL transformations.
The tool’s open-source nature allowed the team to align with Discord’s internal philosophy of community-driven engineering.
The framework offered seamless integration with other internal tools, such as the Dagster orchestrator, and provided a robust testing environment to ensure data quality.

Scaling Bottlenecks and Performance Issues

The project grew to a size where recompiling the entire dbt project took upwards of 20 minutes, severely hindering developer velocity.
Standard incremental materialization strategies provided by dbt proved inefficient for the petabyte-scale data volumes generated by millions of concurrent users.
Developer workflows often collided, resulting in teams inadvertently overwriting each other’s test tables and creating data silos or inconsistencies.
The lack of specialized handling for complex backfills threatened the organization’s ability to deliver timely and accurate insights.
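One common way to keep developers from overwriting each other's test tables is to route every non-production build into a dataset namespaced by developer. The sketch below illustrates that general idea only; it is not Discord's implementation. The function name `dev_dataset`, the `dev_` prefix, and the `target` values are hypothetical, and limiting names to letters, digits, and underscores is a conservative assumption about dataset naming.

```python
import re


def dev_dataset(base_dataset: str, developer: str, target: str = "dev") -> str:
    """Return the dataset a model should build into for the given target.

    Hypothetical helper: production builds use the base dataset unchanged,
    while every other target gets a per-developer dataset so that two people
    testing the same model never write to the same table.
    """
    if target == "prod":
        return base_dataset
    # Assumption: keep only letters, digits, and underscores in the name.
    safe_user = re.sub(r"[^A-Za-z0-9_]", "_", developer).lower()
    return f"dev_{safe_user}_{base_dataset}"


if __name__ == "__main__":
    print(dev_dataset("analytics", "Ada.Lovelace"))
    print(dev_dataset("analytics", "bob", target="prod"))
```

In a dbt project, the same rule would typically live in a schema-naming macro rather than in Python; the Python form is used here only to make the logic easy to test in isolation.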

Engineering Custom Extensions for Growth

The team built a provider-agnostic layer over Google BigQuery to streamline complex calculations and automate massive data backfills.
The team implemented custom optimizations to prevent breaking changes during the development cycle, so that more than 100 developers could work simultaneously without friction.
By extending dbt’s core, Discord transformed slow development cycles into a rapid, automated system capable of serving as the backbone for their global analytics infrastructure.
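Automated backfills are often built on a simple primitive: splitting a long date range into bounded windows that can each be processed, retried, or parallelized on its own. The sketch below shows that primitive under stated assumptions; the source does not describe how Discord's backfill layer works internally, and the name `backfill_windows` and the `days_per_chunk` parameter are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def backfill_windows(
    start: date, end: date, days_per_chunk: int = 7
) -> Iterator[Tuple[date, date]]:
    """Yield inclusive (window_start, window_end) pairs covering start..end.

    Hypothetical helper: each window could be handed to a separate
    incremental run so that one failure does not restart the whole backfill.
    """
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    cursor = start
    while cursor <= end:
        # Clamp the last window so it never runs past the requested end date.
        window_end = min(cursor + timedelta(days=days_per_chunk - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)


if __name__ == "__main__":
    for window_start, window_end in backfill_windows(
        date(2024, 1, 1), date(2024, 1, 10), days_per_chunk=4
    ):
        print(window_start.isoformat(), "to", window_end.isoformat())
```

Smaller windows reduce the cost of a retry but increase the number of runs to schedule, so the chunk size is a tuning decision that depends on partition size and orchestrator overhead.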

For organizations operating at massive scale, standard open-source tools often require custom-built orchestration and optimization layers to remain viable. Automating backfills and optimizing compilation logic are essential to maintaining developer productivity and data integrity when dealing with thousands of models and petabytes of information.