dagster

2 posts

discord

From Single-Node to Multi-GPU Clusters: How Discord Made Distributed Compute Easy for ML Engineers

Discord’s machine learning infrastructure reached a critical scaling limit as models and datasets grew beyond the capacity of single-machine systems. To overcome these bottlenecks, the engineering team transitioned to a distributed compute architecture built on the Ray framework and a suite of custom orchestration tools. This evolution moved Discord from ad-hoc experimentation to a robust production platform, yielding significant performance gains such as a 200% improvement in business metrics for Ads Ranking.

### Overcoming Hardware and Data Bottlenecks

* Initial ML systems relied on simple classifiers that eventually evolved into complex models serving hundreds of millions of users.
* Training requirements shifted from single-machine tasks to workloads requiring multiple GPUs.
* Datasets expanded to the point where they could no longer fit on individual machines, creating a need for distributed storage and processing.
* Infrastructure growth struggled to keep pace with the exponential increase in computational demands.

### Building a Ray-Based ML Platform

* The Ray framework was adopted as the foundation for distributed computing to simplify complex ML workflows.
* Discord integrated Dagster with KubeRay to manage the orchestration of production-grade machine learning pipelines.
* Custom CLI tooling was developed to lower the barrier to entry for engineers, focusing heavily on developer experience.
* A specialized observability layer called X-Ray was implemented to provide deep insights into distributed system performance.

By prioritizing developer experience and creating accessible abstractions over raw compute power, Discord successfully industrialized its ML operations. For organizations facing similar scaling hurdles, the focus should be on building a unified platform that turns the complexity of distributed systems into a seamless tool for modelers.
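The orchestration pattern described above, an orchestrator such as Dagster submitting Ray workloads through KubeRay, can be sketched as a small helper that assembles a RayJob manifest. The field names follow KubeRay's public RayJob CRD, but the function name, parameters, and defaults here are illustrative assumptions, not Discord's actual internal tooling.

```python
# Illustrative sketch only: builds a KubeRay-style RayJob manifest as a dict.
# Top-level field names follow the public RayJob CRD; everything else
# (function name, defaults, images) is a hypothetical stand-in for the kind
# of abstraction a custom CLI could expose to ML engineers.

def build_rayjob_manifest(name: str, entrypoint: str,
                          workers: int = 2, gpus_per_worker: int = 1) -> dict:
    """Assemble a minimal RayJob spec an orchestrator could submit to Kubernetes."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            "entrypoint": entrypoint,
            "rayClusterSpec": {
                "headGroupSpec": {
                    "template": {"spec": {"containers": [
                        {"name": "ray-head", "image": "rayproject/ray:latest"},
                    ]}},
                },
                "workerGroupSpecs": [{
                    "groupName": "gpu-workers",
                    "replicas": workers,
                    "template": {"spec": {"containers": [{
                        "name": "ray-worker",
                        "image": "rayproject/ray:latest",
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
                    }]}},
                }],
            },
        },
    }

# An engineer asks for "4 workers, 1 GPU each" and never touches Kubernetes YAML.
manifest = build_rayjob_manifest("ads-ranking-train", "python train.py", workers=4)
```

The point of such a wrapper is exactly the developer-experience argument in the post: the multi-GPU cluster details live in one place, and modelers supply only a name, an entrypoint, and a worker count.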

discord

Overclocking dbt: Discord's Custom Solution in Processing Petabytes of Data

Discord scaled its data infrastructure to manage petabytes of data and over 2,500 models by moving beyond a standard dbt implementation. While the tool initially provided a modular and developer-friendly framework, the sheer volume of data and more than 100 concurrent developers led to critical performance bottlenecks. To resolve these issues, Discord built custom extensions to dbt’s core functionality, reducing compilation times and automating complex data transformations.

### Strategic Adoption of dbt

* Discord integrated dbt into its stack to leverage software engineering principles like modular design and code reusability for SQL transformations.
* The tool’s open-source nature aligned with Discord’s internal philosophy of community-driven engineering.
* The framework offered seamless integration with other internal tools, such as the Dagster orchestrator, and provided a robust testing environment to ensure data quality.

### Scaling Bottlenecks and Performance Issues

* The project grew to the point where recompiling the entire dbt project took upwards of 20 minutes, severely hindering developer velocity.
* dbt’s standard incremental materialization strategies proved inefficient for the petabyte-scale data volumes generated by millions of concurrent users.
* Developer workflows often collided, with teams inadvertently overwriting each other’s test tables and creating data silos or inconsistencies.
* The lack of specialized handling for complex backfills threatened the organization’s ability to deliver timely and accurate insights.

### Engineering Custom Extensions for Growth

* The team built a provider-agnostic layer over Google BigQuery to streamline complex calculations and automate massive data backfills.
* Custom optimizations were implemented to prevent breaking changes during development, ensuring that 100+ developers could work simultaneously without friction.

By extending dbt’s core, Discord transformed slow development cycles into a rapid, automated system that serves as the backbone of its global analytics infrastructure. For organizations operating at massive scale, standard open-source tools often require custom-built orchestration and optimization layers to remain viable. Prioritizing the automation of backfills and optimizing compilation logic is essential to maintaining developer productivity and data integrity when dealing with thousands of models and petabytes of information.
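The automated-backfill idea can be illustrated with a minimal sketch: split a backfill window over a day-partitioned table into fixed-size batches, each of which an orchestrator could dispatch as its own incremental run. The function name and batching scheme are assumptions for illustration, not Discord's actual dbt extension.

```python
from datetime import date, timedelta

# Minimal sketch, assuming a day-partitioned warehouse table: split a backfill
# window into fixed-size batches of dates so each batch can run as one job,
# keeping individual runs small enough to retry cheaply on failure.
def backfill_batches(start: date, end: date, batch_days: int = 7) -> list[list[date]]:
    days = []
    cur = start
    while cur <= end:
        days.append(cur)
        cur += timedelta(days=1)
    return [days[i:i + batch_days] for i in range(0, len(days), batch_days)]

# A 10-day window with 7-day batches yields one batch of 7 days and one of 3.
batches = backfill_batches(date(2024, 1, 1), date(2024, 1, 10))
```

Automating this partitioning is the part that matters at scale: with thousands of models, hand-writing the date windows for each backfill is exactly the kind of toil a custom layer over BigQuery can eliminate.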