Building a Resilient Data Platform with Write-Ahead Log at Netflix | by Netflix Technology Blog | Netflix TechBlog

Netflix has developed a distributed Write-Ahead Log (WAL) abstraction to address critical data challenges such as accidental corruption, system entropy, and the complexities of cross-region replication. By decoupling data mutation from immediate persistence and providing a unified API, this system ensures strong durability and eventual consistency across diverse storage engines. The WAL acts as a resilient buffer that powers high-leverage features like secondary indexing and delayed retry queues while maintaining the massive scale required for global operations.

### The Role of the WAL Abstraction

* The system serves as a centralized mechanism to capture data changes and reliably deliver them to downstream consumers, mitigating the risk of data loss during administrative errors or database corruption.
* It provides a simplified `WriteToLog` gRPC endpoint that abstracts underlying infrastructure, allowing developers to focus on data logic rather than the specifics of the storage layer.
* By acting as a durable intermediary, it prevents permanent data loss during incidents where primary datastores fail or require schema changes that might otherwise lead to corruption.

### Flexible Personas and Namespaces

* The architecture utilizes "namespaces" to define logical separation, allowing different services to configure specific storage backends like Kafka or SQS based on their needs.
* The "Delayed Queues" persona leverages SQS to provide a scalable way to retry failed messages in real-time pipelines without sacrificing overall system throughput.
* The system can be configured for "Cross-Region Replication," enabling high availability and disaster recovery for storage engines that do not natively support multi-region data transfer.

### Solving System Entropy and Consistency

* The WAL addresses the "dual-write" problem, where updates to primary stores (such as Cassandra) and search indices (such as Elasticsearch) can diverge over time, leading to data inconsistency.
* It facilitates reliable secondary indexing for NoSQL databases by managing updates to multiple partitions as a coordinated sequence of events.
* The platform mitigates operational risks, such as Out-of-Memory (OOM) errors on Key-Value nodes caused by bulk deletes, by staging and throttling mutations through the log.

Organizations operating at scale should adopt a WAL-centric architecture to simplify the management of heterogeneous data stores and enhance system resilience. By centralizing the mutation log, teams can implement complex features like Change Data Capture (CDC) and cross-region failover through a single, consistent interface rather than building bespoke solutions for every service.
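
To make the dual-write and delayed-retry ideas above concrete, here is a minimal in-memory sketch in Python. It is not Netflix's implementation: the real `WriteToLog` is a gRPC endpoint backed by systems such as Kafka or SQS, whereas the class `InMemoryWal`, its methods `register` and `drain`, the `retry_delay` and `max_attempts` parameters, and the explicit `now` logical clock are all hypothetical names chosen for illustration. The sketch only shows the shape of the technique: a mutation is appended once per target, and a target that fails is retried later without blocking the others.

```python
import heapq
import itertools
from typing import Callable, Dict, List

# A target applies one mutation and raises on failure (hypothetical signature).
Target = Callable[[str, str], None]


class InMemoryWal:
    """Toy stand-in for a WAL service; every name here is an assumption."""

    def __init__(self, retry_delay: int = 5, max_attempts: int = 3) -> None:
        self.retry_delay = retry_delay
        self.max_attempts = max_attempts
        self._targets: Dict[str, List[Target]] = {}
        self._queue: list = []            # min-heap ordered by due time
        self._seq = itertools.count()     # tie-breaker that preserves append order

    def register(self, namespace: str, *targets: Target) -> None:
        """Bind a namespace to the stores that must receive its mutations."""
        self._targets[namespace] = list(targets)

    def write_to_log(self, namespace: str, key: str, value: str, now: int = 0) -> None:
        """Append one entry per target so a failing target cannot block the rest."""
        for idx in range(len(self._targets[namespace])):
            entry = (now, next(self._seq), namespace, idx, key, value, 1)
            heapq.heappush(self._queue, entry)

    def drain(self, now: int) -> int:
        """Deliver every entry that is due at `now`; return the number delivered."""
        delivered = 0
        while self._queue and self._queue[0][0] <= now:
            _, _, ns, idx, key, value, attempt = heapq.heappop(self._queue)
            try:
                self._targets[ns][idx](key, value)
                delivered += 1
            except Exception:
                # Delayed retry: re-enqueue with a later due time, up to a cap.
                if attempt < self.max_attempts:
                    retry = (now + self.retry_delay, next(self._seq),
                             ns, idx, key, value, attempt + 1)
                    heapq.heappush(self._queue, retry)
        return delivered


# Usage: a primary store plus a search index that fails on its first write.
primary: Dict[str, str] = {}
index: Dict[str, str] = {}
failures = {"remaining": 1}


def write_primary(key: str, value: str) -> None:
    primary[key] = value


def write_index(key: str, value: str) -> None:
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise ConnectionError("index temporarily unavailable")
    index[key] = value


wal = InMemoryWal(retry_delay=5)
wal.register("titles", write_primary, write_index)
wal.write_to_log("titles", "movie:1", "Example Title", now=0)

first = wal.drain(now=0)    # primary write lands; index write fails and is rescheduled
second = wal.drain(now=5)   # the rescheduled index write is retried
```

After the second `drain`, the primary store and the index hold the same entry, which is the eventual-consistency outcome the log is meant to guarantee. A production system would persist the log durably and handle entries that exhaust their retries; both are omitted here to keep the sketch short.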