system-design

4 posts

datadog

Failure is inevitable: Learning from a large outage, and building for reliability in depth at Datadog | Datadog

Following a major 2023 incident that caused a near-total platform outage despite partial infrastructure availability, Datadog shifted its engineering philosophy from "never-fail" architectures to a model of graceful degradation. The company identified that prioritizing absolute data correctness during systemic stress created "square-wave" failures, where the entire platform appeared down if even a portion of data was missing. By moving toward a "fail better" mindset, Datadog now focuses on maintaining core functionality and data persistence even when underlying infrastructure is compromised.

## Limitations of the Never-Fail Approach

* Classical root-cause analysis focused on a legacy, unsupervised global update mechanism that disconnected 50–60% of production Kubernetes nodes.
* While the "precipitating event" was easily identified and disabled, the engineering team realized that fixing the trigger did not address the systemic fragility that caused a binary (up/down) failure pattern.
* Prioritizing absolute accuracy meant that systems would wait for all data tags to process before displaying results; under stress, this caused the UI to show no data at all rather than "almost correct" data.
* Sequential queuing, aggressive retry logic, and node-specific processing requirements exacerbated the bottleneck, preventing real-time recovery.

## Prioritizing Graceful Degradation

* The incident prompted a shift away from relying solely on redundancy to prevent outages, acknowledging that some level of failure is eventually inevitable at scale.
* Engineering priorities were redefined to ensure that data is never lost (even if delayed) and that real-time data is processed before stale backlogs.
* The platform now aims to serve partial-but-accurate results to customers during an incident, providing visibility rather than a complete blackout.
* Implementation is handled as a company-wide program where individual product teams adapt these principles to their specific architectural needs.

## Strengthening Data Persistence at Intake

* Analysis revealed that data was lost during the outage because it was stored in memory or on local disks before being replicated to persistent stores.
* The original design favored low-latency responses by acknowledging receipt of data before it was fully replicated, making that data unrecoverable if the node failed.
* Downstream failures caused intake nodes to overflow their local buffers, leading to data loss even on nodes that remained online.
* New architectural changes focus on implementing disk-based persistence at the very beginning of the processing pipeline to ensure data survives node restarts and downstream congestion.

To build truly resilient systems, engineering teams must move beyond trying to prevent every possible failure trigger. Instead, focus on designing services that can survive partial infrastructure loss by prioritizing data persistence and allowing for degraded states that still provide value to the end user.
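
The "partial-but-accurate results" idea from the graceful degradation section can be illustrated with a small scatter-gather query. This is a minimal sketch, not Datadog's implementation: `PartialResult`, `query_all`, and the notion of named shards are hypothetical names introduced only for illustration. The point it shows is that a failed shard is recorded and reported rather than allowed to blank the whole response.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PartialResult:
    """Query answer plus an explicit record of what is missing (hypothetical type)."""
    points: List[float] = field(default_factory=list)
    failed_shards: List[str] = field(default_factory=list)
    total_shards: int = 0

    @property
    def completeness(self) -> float:
        # Fraction of shards that answered; 1.0 means nothing is missing.
        if self.total_shards == 0:
            return 1.0
        return 1 - len(self.failed_shards) / self.total_shards


def query_all(shards: Dict[str, Callable[[], List[float]]]) -> PartialResult:
    """Collect data from every reachable shard instead of failing as a whole."""
    result = PartialResult(total_shards=len(shards))
    for name, fetch in shards.items():
        try:
            result.points.extend(fetch())
        except Exception:
            # A strict all-or-nothing design would re-raise here and show no data.
            result.failed_shards.append(name)
    return result
```

A caller can then render whatever points came back and use `completeness` to flag the view as degraded, which trades strict correctness for visibility during an incident.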
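
The intake persistence point (acknowledge only after data is durable) can be sketched as an append-only log that is flushed and synced before the caller gets a positive reply. This is an assumption-laden illustration, not the design described in the post: `DurableIntake`, `accept`, and `replay` are hypothetical names, and a real intake path would also need batching, log rotation, and replication.

```python
import json
import os
from typing import Iterator


class DurableIntake:
    """Append-only log: a payload is acknowledged only after it is on disk."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._log = open(path, "ab")

    def accept(self, payload: dict) -> bool:
        line = json.dumps(payload).encode("utf-8") + b"\n"
        self._log.write(line)
        self._log.flush()              # push Python's buffer to the OS
        os.fsync(self._log.fileno())   # ask the OS to commit it to the device
        return True                    # only now is it safe to acknowledge

    def close(self) -> None:
        self._log.close()


def replay(path: str) -> Iterator[dict]:
    """Yield every acknowledged payload, e.g. after a node restart."""
    with open(path, "rb") as log:
        for line in log:
            yield json.loads(line)
```

Syncing on every write costs latency, which is the trade-off the original low-latency design avoided; the sketch simply makes the ordering (persist first, acknowledge second) explicit.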

line

Hey, won't you become a

Hack Day 2025 serves as a cornerstone of LY Corporation’s engineering culture, bringing together diverse global teams to innovate beyond their daily operational scopes. By fostering a high-intensity environment focused on creative freedom, the event facilitates technical growth and strengthens interpersonal bonds across international branches. This 19th edition demonstrated how rapid prototyping and cross-functional collaboration can transform abstract ideas into functional AI-driven prototypes within a strict 24-hour window.

### Structure and Participation Dynamics

* The hackathon follows a "9 to 9" format, providing exactly 24 hours of development time followed by a day for presentations and awards.
* Participation is open to all roles, including developers, designers, planners, and HR staff, allowing for holistic product development.
* Teams can be "General Teams" from the same legal entity or "Global Mixed Teams" comprising members from different regions such as Korea, Japan, Taiwan, and Vietnam.
* The Developer Relations (DevRel) team facilitates team building for remote employees using digital collaboration tools like Zoom and Miro.

### AI-Powered Personality Analysis Project

* The author's team developed a "Scouter" program inspired by Dragon Ball, designed to measure professional "combat power" based on communication history.
* The system uses Slack bots and AI models to analyze message logs and map them to the Big 5 personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism).
* Professional metrics are visualized as game-like character statistics to make personality insights engaging and less intimidating.
* While the original plan involved using AI to generate and print physical character cards, hardware failures with photo printers forced a pivot to digital file downloads.

### High-Pressure Presentation and Networking

* Every team is allotted a strict 90-second window to pitch its product and give a live demo.
* The "90-second rule" includes a mandatory microphone cutoff to maintain momentum and keep the large-scale event engaging for all attendees.
* Dedicated booth sessions follow the presentations, allowing participants to provide hands-on experiences to colleagues and judges.
* The event emphasizes "Perfect the Details," a core company value, by encouraging teams to use all available resources, from whiteboards to AI image generators, within the time limit.

### Environmental Support and Culture

* The event occupies an entire office floor, providing a high-density yet comfortable environment designed to minimize distractions during the "Hack Time."
* Cultural exchange is encouraged through "humanity snacks," where participants from different global offices share local treats in dedicated rest areas.
* Strategic scheduling, such as "Travel Days" for international participants, ensures that teams can focus entirely on technical execution once the event begins.

Participating in internal hackathons provides a vital platform for testing new technologies, such as LLMs and personality modeling, that may not fit into immediate product roadmaps. For organizations with hybrid work models, these intensive in-person events are highly recommended to bridge the communication gap and build lasting trust between global teammates.