Discord's initial message search architecture was designed to handle billions of messages using a sharded Elasticsearch deployment spread across two clusters. By sharding data by guilds and direct messages, the system prioritized fast querying and operational manageability for a growing user base. While lazy indexing and bulk processing kept this approach cost-effective, the platform's rapid growth eventually exposed scalability limits in the design.
### Sharding and Cluster Management
* The system utilized Elasticsearch as the primary engine, with messages sharded across indices based on the logical namespace of the Discord server (guild) or direct message (DM).
* This sharding strategy ensured that all messages for a specific guild were stored together, allowing for localized, high-speed query performance.
* Infrastructure was split across two distinct Elasticsearch clusters to keep individual indices smaller and more manageable.
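The routing idea above can be sketched in a few lines. This is a hypothetical illustration, not Discord's actual implementation: the cluster count, shard count, and naming scheme are all assumptions made for the example. The key property is that every message belonging to one guild (or DM) deterministically maps to the same index, so a search within that guild touches a single, localized index.

```python
# Hypothetical guild-based index routing. NUM_CLUSTERS, SHARDS_PER_CLUSTER,
# and the naming scheme are illustrative assumptions, not Discord's values.
NUM_CLUSTERS = 2
SHARDS_PER_CLUSTER = 100

def route_message(guild_or_dm_id: int) -> tuple[str, str]:
    """Map a guild (or DM channel) ID to a (cluster, index) pair.

    Because the mapping depends only on the ID, all messages for a
    given guild land in the same index on the same cluster.
    """
    cluster = f"cluster-{guild_or_dm_id % NUM_CLUSTERS}"
    index = f"messages-{guild_or_dm_id % SHARDS_PER_CLUSTER}"
    return cluster, index

# Messages from the same guild always route to the same place:
assert route_message(12345) == route_message(12345)
```

Deterministic routing like this is what makes per-guild queries cheap: the query planner never has to fan out across every index, only the one that owns the guild.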
### Optimized Indexing via Bulk Queues
* To minimize resource overhead, Discord implemented lazy indexing, only processing messages for search when necessary rather than indexing every message in real-time.
* A custom message queue allowed background workers to aggregate messages into chunks, maximizing the efficiency of Elasticsearch’s bulk-indexing API.
* This architecture allowed the system to remain performant and cost-effective by focusing compute power on active guilds rather than idling on unused data.
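The queue-and-drain pattern described above can be sketched as follows. This is a minimal, assumed model of the idea, not Discord's code: messages accumulate in an in-memory queue, and a background worker periodically drains them in fixed-size chunks sized for Elasticsearch's `_bulk` endpoint. The chunk size and function names are illustrative.

```python
# Sketch of lazy bulk indexing: defer per-message work, then batch.
# BULK_CHUNK_SIZE and all names are illustrative assumptions.
from collections import deque

BULK_CHUNK_SIZE = 50

queue: deque[dict] = deque()

def enqueue(message: dict) -> None:
    """Lazy path: record the message instead of indexing it immediately."""
    queue.append(message)

def drain_chunks() -> list[list[dict]]:
    """Drain the queue into chunks a worker would send to the bulk API."""
    chunks = []
    while queue:
        n = min(BULK_CHUNK_SIZE, len(queue))
        chunks.append([queue.popleft() for _ in range(n)])
    return chunks

# 120 deferred messages become three bulk requests (50 + 50 + 20)
for i in range(120):
    enqueue({"id": i, "content": f"message {i}"})
batches = drain_chunks()
```

Batching this way amortizes per-request overhead: one bulk call with 50 documents costs far less than 50 individual index calls, which is why the aggregation step pays for itself on active guilds.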
For teams building large-scale search infrastructure, Discord's early experience suggests that sharding by logical ownership (like guilds) and utilizing bulk-processing queues can provide significant initial scalability. However, as data volume reaches the multi-billion message threshold, it is essential to monitor for architectural "cracks" where sharding imbalances or indexing delays may require a transition to more robust distributed systems.