We are open-sourcing the initial version of RCCLX – an enhanced version of RCCL that we developed and tested on Meta’s internal workloads. RCCLX is fully integrated with Torchcomms and aims to empower researchers and developers to accelerate innovation, regardless of their chose…
Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest. Felix Loesing | Software Engineer. In 2025, we set out to drastically reduce out-of-memory errors (OOMs) and cut resource usage in our Spark applications by automatically identifying tasks with…
Discord’s machine learning infrastructure hit a scaling limit as models and datasets outgrew single-machine systems. To remove these bottlenecks, the engineering team moved to a distributed compute architecture built on the Ray framework and a suite of custom orchestration tools. The change took Discord from ad-hoc experimentation to a production platform, with reported gains that include a 200% improvement in business metrics for Ads Ranking.
### Overcoming Hardware and Data Bottlenecks
* Initial ML systems relied on simple classifiers that eventually evolved into complex models serving hundreds of millions of users.
* Training requirements shifted from single-machine tasks to workloads requiring multiple GPUs.
* Datasets expanded to the point where they could no longer fit on individual machines, creating a need for distributed storage and processing.
* Infrastructure growth struggled to keep pace with the exponential increase in computational demands.
### Building a Ray-Based ML Platform
* The Ray framework was adopted as the foundation for distributed computing to simplify complex ML workflows.
* Discord integrated Dagster with KubeRay to manage the orchestration of production-grade machine learning pipelines.
* Custom CLI tooling was developed to lower the barrier to entry for engineers, focusing heavily on developer experience.
* A specialized observability layer called X-Ray was implemented to provide deep insights into distributed system performance.
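
To make the CLI bullet concrete, here is a minimal sketch of how a thin command-line wrapper can turn a handful of flags into a job description that a platform layer could later translate into a cluster resource. Everything in it is an assumption made for illustration: the `mlrun` program name, the flag names, and the shape of the job spec are hypothetical and do not come from Discord's tooling or from the Ray, KubeRay, or Dagster APIs.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical `mlrun` CLI: a few flags instead of a hand-written manifest."""
    parser = argparse.ArgumentParser(prog="mlrun")
    parser.add_argument("entrypoint", help="command to run, e.g. 'python train.py'")
    parser.add_argument("--name", required=True, help="job name")
    parser.add_argument("--workers", type=int, default=2, help="number of workers")
    parser.add_argument("--gpus-per-worker", type=int, default=1,
                        help="GPUs requested per worker")
    return parser


def build_job_spec(args: argparse.Namespace) -> dict:
    """Validate the flags and expand them into a plain job description.

    The dictionary layout is invented for this sketch; a real platform would
    map it onto whatever resource its scheduler expects.
    """
    if args.workers < 1:
        raise ValueError("--workers must be at least 1")
    if args.gpus_per_worker < 0:
        raise ValueError("--gpus-per-worker cannot be negative")
    return {
        "name": args.name,
        "entrypoint": args.entrypoint,
        "workers": args.workers,
        "gpus_per_worker": args.gpus_per_worker,
        # Derived value, so the modeler never has to compute it by hand.
        "total_gpus": args.workers * args.gpus_per_worker,
    }


def main(argv=None) -> str:
    """Parse `argv` (or sys.argv when None) and return the job spec as JSON."""
    args = build_parser().parse_args(argv)
    return json.dumps(build_job_spec(args), indent=2)


# Example invocation with an explicit argument list.
example = main(["python train.py", "--name", "ranker-v2", "--workers", "4"])
print(example)
```

The design point is that defaults and validation live in one place, so an engineer supplies only what differs from the common case and gets an error before anything is submitted to a cluster.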
By prioritizing developer experience and accessible abstractions rather than raw compute power alone, Discord successfully industrialized its ML operations. For organizations facing similar scaling hurdles, the focus should be on building a unified platform that hides the complexity of distributed systems behind tools modelers can use directly.