scalability

11 posts

toss

Rethinking Design Systems

Toss Design System (TDS) argues that as organizations scale, design systems often become a source of friction rather than efficiency, leading teams to bypass them by "forking" or "detaching" components. To prevent this, TDS treats the design system as a product that must adapt to user demand rather than a set of rigid constraints to be enforced. By shifting from a philosophy of control to one of flexible expansion, the system remains a helpful tool rather than an obstacle.

### The Limits of Control and System Fragmentation

* When a design system is too rigid, product teams often fork packages to make minor adjustments, which breaks the link to central updates and creates UI inconsistencies.
* Treating "system bypasses" as user errors is ineffective; instead, they should be viewed as unmet needs in the system's "supply."
* The goal of a modern design system should be to reduce the reasons to bypass the system by providing natural extension points.

### Comparing Flat and Compound API Patterns

* **Flat Pattern:** These components hide internal structure and use props to manage variations (e.g., `title`, `description`). While easy to use, they suffer from "prop bloat" as more edge cases are added, making long-term maintenance difficult.
* **Compound Pattern:** This approach provides sub-components (e.g., `Card.Header`, `Card.Body`) for the user to assemble manually. It offers high flexibility for unexpected layouts but increases the learning curve and the amount of boilerplate required.

### The Hybrid API Strategy

* TDS employs a hybrid approach, offering Flat APIs for common, simple use cases and Compound APIs for complex, customized needs.
* Developers can choose a `FlatCard` for speed or a compound `Card` when they need to inject custom elements like badges or unique button placements.
* To avoid the burden of maintaining two separate codebases, TDS uses a "primitive" layer in which the Flat API is simply a pre-assembled version of the Compound components.

Design systems should function as guardrails that guide developers toward consistency, rather than fences that stop them from solving product-specific problems. By providing a flexible architecture that supports exceptions, a system can stay relevant and keep teams within the ecosystem even as their requirements evolve.
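The primitive-layer idea can be sketched outside React. In this minimal Python sketch, the flat API is literally a pre-assembled composition of the same compound primitives; the function names (`card_frame`, `flat_card`, etc.) are hypothetical illustrations, not TDS's actual API.

```python
# --- Primitive (compound) layer: small parts the caller assembles ---
def card_frame(*children: str) -> str:
    """Compound root: wraps arbitrary children."""
    return "<card>" + "".join(children) + "</card>"

def card_header(title: str) -> str:
    return f"<header>{title}</header>"

def card_body(description: str) -> str:
    return f"<body>{description}</body>"

# --- Flat layer: a pre-assembled version of the same primitives, so
# --- there is no second codebase to keep in sync.
def flat_card(title: str, description: str) -> str:
    """Flat API for the common case."""
    return card_frame(card_header(title), card_body(description))

# Common case: one call, no boilerplate.
simple = flat_card("Fees", "No hidden charges")

# Edge case: assemble primitives manually to inject a custom badge
# without forking the package.
custom = card_frame(
    card_header("Fees"),
    "<badge>NEW</badge>",   # product-specific element the flat API lacks
    card_body("No hidden charges"),
)
```

Because both layers bottom out in the same primitives, a fix to `card_header` reaches flat and compound users alike, which is the maintenance win the post describes.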

datadog

Hardening eBPF for runtime security: Lessons from Datadog Workload Protection | Datadog

Scaling real-time file monitoring across high-traffic environments requires a strategy for processing billions of kernel events without exhausting system resources. By leveraging eBPF, organizations can move filtering logic directly into the Linux kernel, drastically reducing the overhead associated with traditional userspace monitoring tools. This approach enables precise observability of file system activity while maintaining the performance necessary for large-scale production workloads.

### Limitations of Traditional Monitoring Tools

* Conventional tools like `auditd` often hit performance bottlenecks because they require every event to be copied from the kernel to userspace for evaluation.
* Standard APIs like `fanotify` and `inotify` lack the granularity needed for complex filtering, often resulting in "event storms" during high-I/O operations.
* The frequent context switching between kernel and userspace when processing billions of events per minute can lead to significant CPU spikes and system instability.

### Architecture of eBPF-Based File Monitoring

* The system hooks into the Virtual File System (VFS) layer using `kprobes` and `tracepoints` to capture actions such as `vfs_read`, `vfs_write`, and `vfs_open`.
* LSM (Linux Security Module) hooks are used for security-focused monitoring, providing a stable interface that is less prone to kernel version changes than raw kprobes.
* By executing C-like code within the kernel's sandboxed environment, the system can inspect file paths and process IDs (PIDs) the moment an event is created.

### In-Kernel Filtering and Data Management

* High-performance eBPF maps, specifically `BPF_MAP_TYPE_HASH` and `BPF_MAP_TYPE_LPM_TRIE`, store allowlists and denylists for specific directories and file extensions.
* The system implements prefix matching to ignore high-volume, low-value paths like `/proc`, `/sys`, or temporary build directories, discarding these events before they ever leave the kernel.
* To minimize memory contention, per-CPU maps let the eBPF programs aggregate data locally on each core without expensive global locks.

### Efficient Data Transmission with Ring Buffers

* The implementation uses `BPF_MAP_TYPE_RINGBUF` rather than the older `BPF_MAP_TYPE_PERF_EVENT_ARRAY` to transfer data to userspace.
* Ring buffers provide a shared memory region between the kernel and userspace, offering better memory efficiency and guaranteed event ordering.
* By pushing only "filtered" events—a tiny fraction of the billions of raw kernel events—the system prevents userspace consumers from becoming overwhelmed.

For organizations operating at massive scale, moving from reactive userspace logging to proactive kernel-level filtering is essential. An eBPF-based monitoring stack provides deep visibility into file system changes with minimal performance impact, making it the recommended standard for modern, high-throughput cloud environments.
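The production logic runs as eBPF C programs querying LPM-trie maps, but the filtering decision itself is simple. As a language-agnostic illustration (the prefixes below are made up for the example, not Datadog's actual lists), the in-kernel step amounts to:

```python
# Conceptual model of the in-kernel prefix filter. The real version is
# eBPF C matching against a BPF_MAP_TYPE_LPM_TRIE; these example
# prefixes are illustrative only.

DENY_PREFIXES = ("/proc/", "/sys/", "/tmp/build/")   # high-volume, low-value paths
ALLOW_PREFIXES = ("/etc/", "/usr/bin/")              # security-relevant paths

def should_emit(path: str) -> bool:
    """Return True only for events worth pushing to userspace."""
    if path.startswith(DENY_PREFIXES):
        return False          # discarded in-kernel; never crosses to userspace
    return path.startswith(ALLOW_PREFIXES)

events = ["/proc/1234/stat", "/etc/passwd", "/sys/kernel/debug", "/usr/bin/curl"]
emitted = [p for p in events if should_emit(p)]
```

The point of doing this comparison inside the kernel is that the denied events are dropped before any copy to userspace, which is where the CPU cost of tools like `auditd` comes from.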

netflix

Behind the Streams: Real-Time Recommendations for Live Events Part 3 | by Netflix Technology Blog | Netflix TechBlog

Netflix manages the massive surge of concurrent users during live events with a hybrid strategy of prefetching and real-time broadcasting to deliver synchronized recommendations. By decoupling data delivery from the live trigger, the system avoids the "thundering herd" effect that would otherwise overwhelm cloud infrastructure during record-breaking broadcasts. This architecture ensures that millions of global devices receive timely updates and visual cues without requiring linear, inefficient scaling of compute resources.

### The Constraint Optimization Problem

To maintain a seamless experience, Netflix engineers balance three primary technical constraints: time to update, request throughput, and compute cardinality.

* **Time:** The duration required to coordinate and push a recommendation update to the entire global fleet.
* **Throughput:** The maximum capacity of cloud services to handle incoming requests without service degradation.
* **Cardinality:** The variety and complexity of unique requests necessary to serve personalized updates to different user segments.

### Two-Phase Recommendation Delivery

The system splits delivery into two distinct stages to smooth out traffic spikes and ensure high availability.

* **Prefetching Phase:** While members browse the app normally before an event, the system downloads materialized recommendations, metadata, and artwork into the device's local cache.
* **Broadcasting Phase:** When the event begins, a low-cardinality "at least once" message is broadcast to all connected devices, triggering them to display the already-cached content instantaneously.
* **Traffic Smoothing:** This approach eliminates the need for massive, real-time data fetches at the moment of kickoff, distributing the heavy lifting of data transfer over a longer period.
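The two phases can be modeled in a few lines. This is a toy sketch; the class and field names are illustrative, since the post does not describe Netflix's client or messaging internals at this level.

```python
# Toy model of prefetch-and-trigger delivery: heavy payloads move early,
# and the kickoff broadcast carries only a tiny trigger.

class Device:
    def __init__(self):
        self.cache = {}        # event_id -> prefetched recommendation payload
        self.displayed = None

    def prefetch(self, event_id, payload):
        """Phase 1: bulk data transfer while the member browses normally."""
        self.cache[event_id] = payload

    def on_broadcast(self, event_id):
        """Phase 2: low-cardinality trigger; render from the local cache."""
        self.displayed = self.cache.get(event_id)

fleet = [Device() for _ in range(3)]

# Hours before kickoff: data transfer is spread out, smoothing traffic.
for d in fleet:
    d.prefetch("fight-night", "recommendation row + artwork")

# At kickoff: the broadcast carries no payload, so server load stays flat
# no matter how many devices are connected.
for d in fleet:
    d.on_broadcast("fight-night")
```

Because the trigger is identical for every device ("at least once", low cardinality), redelivering it is cheap and idempotent: replaying `on_broadcast` just re-reads the cache.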
### Live State Management and UI Synchronization

A dedicated Live State Management (LSM) system tracks event schedules in real time to keep the user interface in sync with the production.

* **Dynamic Adjustments:** If a live event is delayed or ends early, the LSM adjusts the broadcast triggers to preserve accuracy and prevent "spoilers" or dead links.
* **Visual Cues:** The UI uses "Live" badging and dynamic artwork transitions to signal urgency and guide users toward the stream.
* **Frictionless Playback:** For members already on a title's detail page, the system can trigger an automatic transition into the live player the moment the broadcast begins, reducing navigation latency.

To support global-scale live events, technical teams should prioritize edge-heavy strategies that pre-position assets on client devices. By shifting from a reactive request-response model to a proactive prefetch-and-trigger model, platforms can maintain high performance and reliability even during the most significant traffic peaks.

discord

Discord Update: September 25, 2025 Changelog

Discord's September 2025 update focuses on enhancing user expression and scaling server infrastructure to unprecedented levels. By introducing massive server capacity increases and highly customizable interface features, the platform aims to better support its largest communities and most active power users. Ultimately, these changes provide a more dynamic social experience through improved profile visibility, expanded pin limits, and flexible multitasking tools.

### Enhanced User Profiles and Multitasking

- Desktop profiles now feature a refreshed layout designed to showcase a user's current activities and history more clearly.
- Multiple concurrent activities, such as playing a game while listening to music in a voice channel, are now displayed as a "stack of cards" on the profile.
- Activities can be moved into a pop-out floating window, allowing users to participate in shared experiences like "Watch Together" while navigating other servers or DMs.
- A new audio cue plays whenever a user turns their camera on, providing immediate feedback that their video stream is live.

### Massive Scaling and Embed Improvements

- The default server member cap has been increased to 25 million, supported by engineering optimizations to member-list loading speeds for "super-super-large" communities.
- The channel pin limit has been expanded fivefold, from a 50-message cap to 250 messages per channel.
- Native support for AV1 video attachments and embeds was added to improve video quality and loading performance.
- Tumblr link embeds have been overhauled to include detailed descriptions and metadata for hashtags used in the original post.

### Custom Themes and Aesthetic Upgrades

- Nitro users can now create custom gradient themes using up to five colors, a feature that synchronizes across both desktop and mobile clients.
- Two new Server Tag badge packs—the Pet pack and the Flex pack—introduce new iconography for server roles, including animal icons and royalty-themed badges.
- Visual updates were made to Group DM icons, which the development team refers to as "facepiles," to better represent groups of friends in the chat list.

Users should explore the new custom gradient settings in their Nitro preferences to personalize their workspace, and take advantage of the expanded pin limits to better manage information in high-traffic channels.

google

Securing private data at scale with differentially private partition selection

Google Research has introduced a novel parallel algorithm called MaxAdaptiveDegree (MAD) to enhance differentially private (DP) partition selection, a critical process for identifying common data items in massive datasets without compromising individual privacy. By using an adaptive weighting mechanism, the algorithm optimizes the utility-privacy trade-off, allowing researchers to safely release significantly more data than previous non-adaptive methods. This enables privacy-preserving analysis on datasets containing hundreds of billions of items, scaling up to three orders of magnitude beyond existing sequential approaches.

## The Role of DP Partition Selection

* DP partition selection identifies a meaningful subset of unique items from large collections based on their frequency across multiple users.
* The process ensures that no single individual's data can be identified in the final list by adding controlled noise and filtering out items that are not sufficiently common.
* This technique is a foundational step for various machine learning tasks, including extracting n-gram vocabularies for language models, analyzing private data streams, and increasing efficiency in private model fine-tuning.

## The Weight, Noise, and Filter Paradigm

* The standard approach to private partition selection begins by computing a "weight" for each item, typically representing its frequency, while ensuring "low sensitivity" so that no single user has an outsized impact.
* Random Gaussian noise is added to these weights to obfuscate exact counts, preventing attackers from inferring the presence of specific individuals.
* A threshold determined by the DP parameters is then applied; only items whose noisy weights exceed this threshold are included in the final output.
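The weight/noise/filter steps can be sketched as follows. The sensitivity cap, noise scale, and threshold here are illustrative placeholders, not the DP-calibrated values the actual mechanism derives from its privacy parameters.

```python
import random

# Minimal sketch of the weight / noise / filter paradigm for private
# partition selection. Parameter values are illustrative only.

def select_partitions(user_items, sigma=2.0, threshold=6.0, max_per_user=3):
    weights = {}
    for items in user_items:
        # Bound sensitivity: each user contributes at most max_per_user
        # items, splitting a total weight of 1 evenly among them.
        chosen = items[:max_per_user]
        for item in chosen:
            weights[item] = weights.get(item, 0.0) + 1.0 / len(chosen)
    released = []
    for item, w in weights.items():
        noisy = w + random.gauss(0.0, sigma)   # Gaussian noise hides exact counts
        if noisy > threshold:                  # DP threshold drops rare items
            released.append(item)
    return released

# Ten users share "the"; a single user alone contributes "zyzzyva".
users = [["the"]] * 10 + [["zyzzyva"]]
random.seed(0)
common = select_partitions(users)
# With high probability only "the" survives: its weight (10) clears the
# threshold even after noise, while "zyzzyva" (weight 1) almost never does.
```

The bounded per-user contribution is what makes the noise scale work: no matter how many items one user holds, removing that user shifts every weight by at most 1 in total.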
## Improving Utility via Adaptive Weighting

* Traditional non-adaptive methods often produce "wastage," where highly popular items receive significantly more weight than necessary to cross the selection threshold.
* The MaxAdaptiveDegree (MAD) algorithm introduces adaptivity by identifying items with excess weight and rerouting that weight to "under-allocated" items sitting just below the threshold.
* This strategic reallocation allows a larger number of less-frequent items to be safely released, significantly increasing the utility of the dataset without compromising privacy or computational efficiency.

## Scalability and Parallelization

* Unlike sequential algorithms that process data one piece at a time, MAD is designed as a parallel algorithm to handle the scale of modern user-based datasets.
* The algorithm can process datasets with hundreds of billions of items by breaking the problem into smaller parts computed simultaneously across multiple processors.
* Google has open-sourced the implementation on GitHub to provide the research community with a tool that maintains robust privacy guarantees even at massive scale.

Researchers and data scientists working with large-scale sensitive datasets should consider the MaxAdaptiveDegree algorithm to maximize the amount of shareable data while strictly adhering to user-level differential privacy standards.
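A toy version of the rerouting idea, ignoring the noise, per-user sensitivity accounting, and parallel execution that the real algorithm handles:

```python
# Toy illustration of MAD-style adaptivity: weight above the selection
# threshold is "wastage", so it is clipped and rerouted to items sitting
# just below the threshold. Item names and numbers are made up.

def reroute_excess(weights, threshold):
    adjusted = dict(weights)
    # 1. Clip over-allocated items and pool the excess.
    excess = 0.0
    for item, w in adjusted.items():
        if w > threshold:
            excess += w - threshold
            adjusted[item] = threshold   # clipped to the threshold
    # 2. Spread the pooled excess across under-allocated items.
    under = [i for i, w in adjusted.items() if w < threshold]
    for item in under:
        adjusted[item] += excess / len(under)
    return adjusted

before = {"popular": 50.0, "borderline_a": 4.5, "borderline_b": 4.0}
after = reroute_excess(before, threshold=5.0)
# The 45.0 units of excess on "popular" lift both borderline items well
# above the threshold, so more items can survive the noisy filter.
```

In the real algorithm this reallocation must respect each user's sensitivity budget, which is why it is done adaptively per item rather than as a single global pass like this sketch.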

discord

How Discord Indexes Trillions of Messages

Discord's initial message search architecture was designed to handle billions of messages using a sharded Elasticsearch configuration spread across two clusters. By sharding data by guilds and direct messages, the system prioritized fast querying and operational manageability for a growing user base. While this approach used lazy indexing and bulk processing to remain cost-effective, the platform's rapid growth eventually exposed scalability limits in the design.

### Sharding and Cluster Management

* The system used Elasticsearch as the primary engine, with messages sharded across indices based on the logical namespace of the Discord server (guild) or direct message (DM).
* This sharding strategy kept all messages for a specific guild together, allowing for localized, high-speed query performance.
* Infrastructure was split across two distinct Elasticsearch clusters to keep individual indices smaller and more manageable.

### Optimized Indexing via Bulk Queues

* To minimize resource overhead, Discord implemented lazy indexing, processing messages for search only when necessary rather than indexing every message in real time.
* A custom message queue allowed background workers to aggregate messages into chunks, maximizing the efficiency of Elasticsearch's bulk-indexing API.
* This architecture kept the system performant and cost-effective by focusing compute power on active guilds rather than idling on unused data.

For teams building large-scale search infrastructure, Discord's early experience suggests that sharding by logical ownership (like guilds) and using bulk-processing queues can provide significant initial scalability. However, as data volume reaches the multi-billion-message threshold, it is essential to watch for architectural "cracks" where sharding imbalances or indexing delays may force a transition to more robust distributed systems.
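The bulk-queue pattern can be sketched as follows. The class name, chunk size, and callback are hypothetical, since the post does not show Discord's worker internals; `send_bulk` stands in for whatever wrapper sits over Elasticsearch's `_bulk` endpoint.

```python
from collections import defaultdict

# Sketch of chunked bulk indexing: queue messages per guild and flush
# them in batches instead of one document at a time.

CHUNK_SIZE = 50   # illustrative; real batch sizes are tuned empirically

class BulkIndexer:
    def __init__(self, send_bulk):
        self.queues = defaultdict(list)   # guild_id -> pending messages
        self.send_bulk = send_bulk        # callback that hits the bulk API

    def enqueue(self, guild_id, message):
        q = self.queues[guild_id]
        q.append(message)
        if len(q) >= CHUNK_SIZE:          # flush full chunks, not single docs
            self.flush(guild_id)

    def flush(self, guild_id):
        if self.queues[guild_id]:
            self.send_bulk(guild_id, self.queues[guild_id])
            self.queues[guild_id] = []

calls = []
indexer = BulkIndexer(lambda gid, batch: calls.append((gid, len(batch))))
for i in range(120):
    indexer.enqueue(42, f"msg {i}")
indexer.flush(42)                         # drain the 20-message remainder
# calls == [(42, 50), (42, 50), (42, 20)]
```

Batching this way amortizes the per-request overhead of the search cluster, which is the same reason Elasticsearch ships a bulk API in the first place.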

discord

Overclocking dbt: Discord's Custom Solution in Processing Petabytes of Data

Discord scaled its data infrastructure to manage petabytes of data and over 2,500 models by moving beyond a standard dbt implementation. While the tool initially provided a modular, developer-friendly framework, the sheer volume of data and more than 100 concurrent developers led to critical performance bottlenecks. To resolve these issues, Discord built custom extensions to dbt's core functionality, reducing compilation times and automating complex data transformations.

### Strategic Adoption of dbt

* Discord integrated dbt into its stack to bring software engineering principles like modular design and code reusability to SQL transformations.
* The tool's open-source nature aligned with Discord's internal philosophy of community-driven engineering.
* The framework offered seamless integration with other internal tools, such as the Dagster orchestrator, and provided a robust testing environment to ensure data quality.

### Scaling Bottlenecks and Performance Issues

* The project grew to a size where recompiling the entire dbt project took upwards of 20 minutes, severely hindering developer velocity.
* dbt's standard incremental materialization strategies proved inefficient for the petabyte-scale data volumes generated by millions of concurrent users.
* Developer workflows often collided, with teams inadvertently overwriting each other's test tables and creating data silos or inconsistencies.
* The lack of specialized handling for complex backfills threatened the organization's ability to deliver timely and accurate insights.

### Engineering Custom Extensions for Growth

* The team built a provider-agnostic layer over Google BigQuery to streamline complex calculations and automate massive data backfills.
* Custom optimizations were implemented to prevent breaking changes during the development cycle, ensuring that 100+ developers could work simultaneously without friction.
* By extending dbt's core, Discord turned slow development cycles into a rapid, automated system that now serves as the backbone of its global analytics infrastructure.

For organizations operating at massive scale, standard open-source tools often require custom-built orchestration and optimization layers to remain viable. Prioritizing the automation of backfills and optimizing compilation logic is essential to maintaining developer productivity and data integrity when dealing with thousands of models and petabytes of information.
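One way to picture an automated backfill layer is as a generator of date-bounded runs, so a multi-year rebuild becomes many small, retryable jobs. The SQL template and names below are hypothetical illustrations in the spirit of the post, not Discord's extension code.

```python
from datetime import date, timedelta

# Illustrative sketch: split a large backfill into fixed-size date
# windows that can be executed (and retried) independently.

def backfill_ranges(start, end, days_per_run):
    """Yield (lo, hi) windows covering [start, end) in fixed-size chunks."""
    lo = start
    while lo < end:
        hi = min(lo + timedelta(days=days_per_run), end)
        yield lo, hi
        lo = hi

# Hypothetical model/source names; a real layer would render this via
# the warehouse's partitioned-insert support rather than raw SQL.
TEMPLATE = (
    "INSERT INTO {model} "
    "SELECT * FROM {source} WHERE ts >= '{lo}' AND ts < '{hi}'"
)

runs = [
    TEMPLATE.format(model="fct_messages", source="raw_messages", lo=lo, hi=hi)
    for lo, hi in backfill_ranges(date(2023, 1, 1), date(2023, 1, 10), 4)
]
# Three runs: Jan 1-5, Jan 5-9, Jan 9-10.
```

Chunking like this keeps each warehouse job bounded in cost and lets an orchestrator such as Dagster fan the runs out or resume after a failure without redoing finished windows.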