sharding

2 posts

line

Slow Query Resolution: Optimizing Bit

Introduction: Hello, I'm Seo Yong-jun, and I develop the post server for the LINE VOOM service. In this article I'd like to share how our team resolved a slow query problem over roughly seven months, and the lessons we learned along the way. In our service, slow queries occurred intermittently when a heavy user's social profile was queried. They didn't occur often, but when one did, the query would run for more than 30 seconds before timing out. To give the conclusion first: MySQL 8.0.13's functional index (functional in…
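The excerpt cuts off at the functional-index discussion, but the general idea can be illustrated. The sketch below is not the post's actual schema; it uses SQLite's expression indexes (available from Python's standard library) as a stand-in for MySQL 8.0.13's functional indexes, which serve the same purpose: indexing a computed expression, such as a bit test, so that filters on that expression can use an index instead of scanning rows.

```python
import sqlite3

# Illustrative only: the table, column names, and the "flags & 1" bit test are
# assumptions, not taken from the LINE VOOM post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, flags INTEGER, body TEXT)")

# Index the expression itself rather than the raw column, so queries filtering
# on the same expression can be served by the index.
conn.execute("CREATE INDEX idx_posts_visible ON posts (flags & 1)")

conn.executemany(
    "INSERT INTO posts (flags, body) VALUES (?, ?)",
    [(i % 4, f"post {i}") for i in range(100)],
)

# The WHERE clause repeats the indexed expression; on recent SQLite versions
# the query plan can use idx_posts_visible here.
visible = conn.execute(
    "SELECT COUNT(*) FROM posts WHERE (flags & 1) = 1"
).fetchone()[0]
print(visible)
```

In MySQL the equivalent would be a functional key part, e.g. `CREATE INDEX idx ON posts ((flags & 1))`; the point in both systems is that the optimizer can only match the index when the query expression is written the same way as the indexed expression.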

discord

How Discord Indexes Trillions of Messages

Discord's initial message search architecture was designed to handle billions of messages using a sharded Elasticsearch configuration spread across two clusters. By sharding data by guilds and direct messages, the system prioritized fast querying and operational manageability for its growing user base. While this approach utilized lazy indexing and bulk processing to remain cost-effective, the rapid growth of the platform eventually revealed scalability limitations within the existing design.

### Sharding and Cluster Management

* The system utilized Elasticsearch as the primary engine, with messages sharded across indices based on the logical namespace of the Discord server (guild) or direct message (DM).
* This sharding strategy ensured that all messages for a specific guild were stored together, allowing for localized, high-speed query performance.
* Infrastructure was split across two distinct Elasticsearch clusters to keep individual indices smaller and more manageable.

### Optimized Indexing via Bulk Queues

* To minimize resource overhead, Discord implemented lazy indexing, only processing messages for search when necessary rather than indexing every message in real time.
* A custom message queue allowed background workers to aggregate messages into chunks, maximizing the efficiency of Elasticsearch's bulk-indexing API.
* This architecture kept the system performant and cost-effective by focusing compute power on active guilds rather than idling on unused data.

For teams building large-scale search infrastructure, Discord's early experience suggests that sharding by logical ownership (such as guilds) and using bulk-processing queues can provide significant initial scalability. However, as data volume reaches the multi-billion message threshold, it is essential to watch for architectural "cracks" where sharding imbalances or indexing delays may force a transition to more robust distributed systems.