distributed-systems

6 posts

woowahan

We Did Everything from Planning to

The 7th Woowacourse crew has successfully launched three distinct services, demonstrating that modern software engineering requires a synergy of technical mastery and "soft skills" like product planning and team communication. By owning the entire lifecycle from ideation to deployment, these developers moved beyond mere coding to solve real-world problems through agile iterations, user feedback, and robust infrastructure management. The program's focus on the full cycle of development, including monitoring, 2-week sprints, and collaborative design, highlights a shift toward producing well-rounded engineers capable of navigating professional environments.

### The Woowacourse Full-Cycle Philosophy

* The 10-month curriculum emphasizes soft skills, including speaking and writing, alongside traditional technical tracks like Web Backend, Frontend, and Mobile Android.
* During Levels 3 and 4, crews transition from fundamental programming to managing team projects where they must handle everything from initial architecture to UI/UX design.
* The process mimics real-world industry standards by implementing 2-week development sprints, establishing monitoring environments, and managing automated deployment pipelines.
* The core goal is to shift the developer's mindset from simply writing code to understanding why certain features are planned and how architecture choices impact the final user value.

### Pickeat: Collaborative Dining Decisions

* This service addresses "decision fatigue" during group meals by providing a collaborative platform to filter restaurants based on dietary constraints and preferences.
* Technical challenges included frequent domain restructuring and UI overhauls as the team pivoted based on real-world user feedback during demo days.
* The platform uses location data for automatic restaurant lookups and supports real-time voting to ensure democratic and efficient group decisions.
* Development focused on aligning the team's judgment standards and iterating quickly to validate product-market fit rather than adhering strictly to initial specifications.

### Bottari: Real-Time Synchronized Checklists

* Bottari is a checklist service designed for situations like traveling or moving, focused on "becoming a companion for the user's memory."
* The service features template-based list generation and a "Team Bottari" function that allows multiple users to collaborate on a single list with real-time synchronization.
* A major technical focus was the user experience flow, specifically optimizing notification timing and sync states to provide "peace of mind" for users.
* The project demonstrates the principle that technology serves as a tool for solving psychological pain points, such as the anxiety of forgetting essential items.

### Coffee Shout: Real-Time Betting and Mini-Games

* Designed to gamify office culture, this service replaces simple "rock-paper-scissors" with interactive mini-games and a weighted roulette for coffee bets.
* The technical stack involved challenging implementations of WebSockets and distributed environments to handle the concurrency required for real-time gaming.
* The team focused on balancing the weighted roulette algorithm to keep the betting process fair and exciting (a minimal sketch of weighted selection follows after this list).
* Refinement of the service was driven by direct feedback from other Woowacourse crews, emphasizing the importance of community testing in the development lifecycle.
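The post does not include Coffee Shout's implementation, but a weighted roulette maps onto standard weighted random selection. Below is a minimal Java sketch with a hypothetical `Participant` record, not the crew's code: the chance of picking a participant is proportional to their weight.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Minimal weighted-roulette sketch (hypothetical; not the Coffee Shout implementation).
// Each participant carries a weight; the chance of being picked is weight / total weight.
public class WeightedRoulette {

    record Participant(String name, double weight) {}

    static Participant spin(List<Participant> participants) {
        double total = participants.stream().mapToDouble(Participant::weight).sum();
        double target = ThreadLocalRandom.current().nextDouble(total);
        double cumulative = 0.0;
        for (Participant p : participants) {
            cumulative += p.weight();
            if (target < cumulative) {
                return p;
            }
        }
        // Fallback for floating-point edge cases: return the last participant.
        return participants.get(participants.size() - 1);
    }

    public static void main(String[] args) {
        List<Participant> crew = List.of(
                new Participant("member-a", 1.0),
                new Participant("member-b", 2.0),   // twice as likely to buy the coffee
                new Participant("member-c", 1.0));
        System.out.println(spin(crew).name() + " buys the coffee");
    }
}
```

Balancing the roulette then becomes a question of how weights are assigned (for example, by past wins or losses), which is where the fairness tuning described above comes in.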
These projects underscore that the transition from a student to a professional developer is defined by the ability to manage shifting requirements and technical complexity while maintaining a focus on the end-user's experience.

naver

FE News - December 2025

The December 2025 FE News highlights a significant shift in front-end development where the dominance of React is being cemented by LLM training cycles, even as the browser platform begins to absorb core framework functionalities. It explores the evolution of WebAssembly beyond its name and Vercel's vision for managing distributed systems through language-level abstractions. Ultimately, the industry is moving toward a convergence of native web standards and AI-driven development paradigms that prioritize collective intelligence and simplified architectures.

### Clarifying the Identity of WebAssembly

* Wasm is frequently misunderstood as a web-only assembly language, but it functions more like a platform-agnostic bytecode, similar to the JVM or .NET.
* The name "WebAssembly" was originally a strategic choice for project funding rather than an accurate technical description of its capabilities or intended environment.

### The LLM Feedback Loop and React's Dominance

* The "dead framework theory" suggests that because LLM tools like Replit and Bolt hardcode React into their system prompts, the framework has reached a state of perpetual self-reinforcement.
* With over 13 million React sites deployed in the last year, new frameworks face a 12-18 month lag before they appear in LLM training data, making it nearly impossible for competitors to disrupt React's current platform status.

### Vercel and the Evolution of Programming Abstractions

* Vercel is integrating complex distributed system management directly into the development experience via directives like `Server Actions`, `use cache`, and `use workflow`.
* These features are built on serializable closures, algebraic effects, and incremental computation, moving complexity from external libraries into the native language structure.

### Native Browser APIs vs. Third-Party Frameworks

* Modern web standards, including Shadow DOM, ES Modules, and the Navigation and View Transitions APIs, are now capable of handling routing and state management natively.
* This transition allows for high-performance application development with reduced bundle sizes, as the browser platform takes over responsibilities previously exclusive to heavy frameworks.

### LLM Council: Collective AI Decision Making

* Andrej Karpathy's LLM Council is a local web application that uses a three-stage process (independent suggestion, peer review, and final synthesis) to overcome the limitations of single AI models; a language-agnostic sketch of this pattern follows at the end of this summary.
* The system uses the OpenRouter API to combine the strengths of various models, such as GPT-5.1 and Claude Sonnet 4.5, with a stack built on Python (FastAPI) and React with Vite.

Developers should focus on mastering native browser APIs as they become more capable, while recognizing that React's ecosystem remains the most robust choice for AI-integrated workflows. Additionally, exploring multi-model consensus systems like the LLM Council can provide more reliable results for complex technical decision-making than relying on a single AI provider.
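The LLM Council itself is built with Python (FastAPI) and React, so the sketch below is only a language-agnostic illustration of the three-stage flow, written in Java for consistency with the other examples on this page. The `askModel` call is a stub standing in for real provider requests (for example via OpenRouter); every name here is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the three-stage council pattern: independent drafts,
// cross review, then a final synthesis by a single "chairman" model.
public class CouncilSketch {

    static String askModel(String model, String prompt) {
        return "[" + model + "] answer to: " + prompt; // stub for a real API call
    }

    static String runCouncil(List<String> models, String question) {
        // Stage 1: each model answers the question independently.
        Map<String, String> drafts = new LinkedHashMap<>();
        for (String m : models) {
            drafts.put(m, askModel(m, question));
        }

        // Stage 2: each model reviews and ranks the other models' anonymized drafts.
        Map<String, String> reviews = new LinkedHashMap<>();
        for (String m : models) {
            reviews.put(m, askModel(m, "Review and rank these answers:\n" + drafts.values()));
        }

        // Stage 3: one model synthesizes drafts and reviews into the final answer.
        String synthesisPrompt = "Question: " + question
                + "\nDrafts: " + drafts.values()
                + "\nReviews: " + reviews.values()
                + "\nWrite the single best final answer.";
        return askModel(models.get(0), synthesisPrompt);
    }

    public static void main(String[] args) {
        List<String> council = List.of("gpt-5.1", "claude-sonnet-4.5");
        System.out.println(runCouncil(council, "Should we adopt the Navigation API for routing?"));
    }
}
```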

naver

Things to know when using Kafka in a

The Apache Kafka ecosystem is undergoing a significant architectural shift with the introduction of Consumer Group Protocol v2, as outlined in KIP-848. This update addresses long-standing performance bottlenecks and stability issues inherent in the original client-side rebalancing logic by moving the responsibility of partition assignment to the broker. This change effectively eliminates the "stop-the-world" effect during rebalances and significantly improves the scalability of large-scale consumer groups.

### Limitations of the Legacy Consumer Group Protocol (v1)

* **Heavy Client-Side Logic:** In v1, the "Group Leader" (a specific consumer instance) is responsible for calculating partition assignments, which creates a heavy burden on the client and leads to inconsistent behavior across different programming language implementations.
* **Stop-the-World Rebalancing:** Whenever a member joins or leaves the group, all consumers must stop processing data until the new assignment is synchronized, leading to significant latency spikes.
* **Sensitivity to Processing Delays:** Because heartbeats and data processing often share the same thread, a slow consumer can trigger a session timeout, causing an unnecessary and disruptive group rebalance.

### Architectural Improvements in Protocol v2

* **Server-Side Reconciliation:** The reconciliation logic is moved to the Group Coordinator on the broker, simplifying the client and ensuring that partition assignment is managed centrally and consistently.
* **Incremental Rebalancing:** Unlike the "eager" rebalancing of v1, the new protocol allows consumers to keep their existing partitions while negotiating new ones, ensuring continuous data processing.
* **Decoupled Heartbeats:** The heartbeat mechanism is separated from the main processing loop, preventing "zombie member" scenarios where a busy consumer is incorrectly marked as dead.

### Performance and Scalability Gains

* **Reduced Rebalance Latency:** By offloading the assignment logic to the broker, the time required to stabilize a group after a membership change is reduced from seconds to milliseconds.
* **Large-Scale Group Support:** The new protocol is designed to handle thousands of partitions and hundreds of consumers within a single group without the exponential performance degradation seen in v1.
* **Stable Deployments:** During rolling restarts or deployments, the group remains stable and avoids the "rebalance storms" that typically occur when multiple instances cycle at once.

### Migration and Practical Implementation

* **Configuration Requirements:** Users can opt in to the new protocol by setting the `group.protocol` configuration to `consumer` (introduced as early access in Kafka 3.7 and standard in 4.0); a minimal client configuration sketch follows after this list.
* **Compatibility:** While the new protocol requires updated brokers and clients, it is designed to support a transition phase so organizations can migrate their workloads gradually.
* **New Tooling:** Updated command-line tools and metrics are provided to monitor the server-side assignment process and track group state more granularly.

Organizations experiencing frequent rebalance issues or managing high-throughput Kafka clusters should plan for a migration to Consumer Group Protocol v2. Transitioning to this server-side assignment model is highly recommended for stabilizing production environments and reducing the operational overhead associated with consumer group management.
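As a rough illustration of the opt-in described above, here is a minimal Java consumer that sets `group.protocol=consumer`. The bootstrap address, group id, and topic are placeholders, and the sketch assumes a broker new enough to support KIP-848 (early access in 3.7, standard in 4.0); everything else is an ordinary consumer poll loop.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ProtocolV2Consumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder address
        props.put("group.id", "orders-processor");                 // placeholder group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Opt in to the KIP-848 consumer group protocol. Partition assignment now
        // happens on the group coordinator instead of a client-side group leader.
        props.put("group.protocol", "consumer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));                  // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```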

netflix

How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data Streams at Internet Scale

Netflix has developed a Real-Time Distributed Graph (RDG) to unify member interaction data across its expanding business verticals, including streaming, live events, and mobile gaming. By transitioning from siloed microservice data to a graph-based model, the company can perform low-latency, relationship-centric queries that were previously hindered by expensive manual joins and data fragmentation. The resulting system enables Netflix to track user journeys across various devices and platforms in real time, providing a foundation for deeper personalization and pattern detection.

### Challenges of Data Isolation in Microservices

* While Netflix's microservices architecture facilitates independent scaling and service decomposition, it inherently leads to data isolation where each service manages its own storage.
* Data scientists and engineers previously had to "stitch" together disparate data from various databases and the central data warehouse, which was a slow and manual process.
* The RDG moves away from table-based models to a relationship-centric model, allowing for efficient "hops" across nodes without the need for complex denormalization.
* This flexibility allows the system to adapt to new business entities (like live sports or games) without requiring massive schema re-architectures.

### Real-Time Ingestion and Normalization

* The ingestion layer is designed to capture events from diverse upstream sources, including Change Data Capture (CDC) from databases and request/response logs.
* Netflix uses its internal data pipeline, Keystone, to funnel these high-volume event streams into the processing framework.
* The system must handle "Internet scale" data, ensuring that events from millions of members are captured as they happen to maintain an up-to-date view of the graph.

### Stream Processing with Apache Flink

* Netflix uses Apache Flink as the core stream processing engine to handle the transformation of raw events into graph entities.
* Incoming data undergoes normalization to ensure a standardized format, regardless of which microservice or business vertical the data originated from.
* The pipeline performs data enrichment, joining incoming streams with auxiliary metadata to provide a comprehensive context for each interaction.
* The final step of the processing layer maps these enriched events into a graph structure of nodes (entities) and edges (relationships), which are then emitted to the system's storage layer (a simplified sketch of this mapping step follows at the end of this summary).

### Practical Conclusion

Organizations operating with a highly decoupled microservices architecture should consider a graph-based ingestion strategy to overcome the limitations of data silos. By leveraging stream processing tools like Apache Flink to build a real-time graph, engineering teams can provide stakeholders with the ability to discover hidden relationships and cross-domain insights that are often lost in traditional data warehouses.
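The post does not share Netflix's pipeline code, but the final mapping step can be pictured with a small Flink job. The sketch below is a simplified Java illustration with made-up event and edge classes; the real pipeline reads from Keystone streams and writes to the graph storage layer rather than printing to stdout.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventToGraphJob {

    // Hypothetical normalized interaction event (placeholder, not Netflix's schema).
    public static class InteractionEvent {
        public String memberId;
        public String entityId;
        public String action;
        public InteractionEvent() {}
        public InteractionEvent(String memberId, String entityId, String action) {
            this.memberId = memberId; this.entityId = entityId; this.action = action;
        }
    }

    // Hypothetical graph edge emitted to the storage layer.
    public static class Edge {
        public String srcNode;
        public String dstNode;
        public String relation;
        public Edge() {}
        public Edge(String srcNode, String dstNode, String relation) {
            this.srcNode = srcNode; this.dstNode = dstNode; this.relation = relation;
        }
        @Override public String toString() {
            return srcNode + " -[" + relation + "]-> " + dstNode;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In production the source would be a high-volume event stream; a static
        // collection stands in here so the sketch is self-contained.
        DataStream<InteractionEvent> events = env.fromElements(
                new InteractionEvent("member:42", "title:live-event-1", "PLAYED"),
                new InteractionEvent("member:42", "game:puzzle-2", "INSTALLED"));

        // Map each normalized, enriched interaction into a graph edge
        // (member node -> entity node, labeled with the action).
        DataStream<Edge> edges = events
                .map(e -> new Edge("member/" + e.memberId, "entity/" + e.entityId, e.action))
                .returns(Edge.class);

        edges.print(); // stand-in for the sink that writes nodes and edges to graph storage
        env.execute("interaction-events-to-graph-edges");
    }
}
```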

netflix

Behind the Streams: Real-Time Recommendations for Live Events Part 3

Netflix manages the massive surge of concurrent users during live events by utilizing a hybrid strategy of prefetching and real-time broadcasting to deliver synchronized recommendations. By decoupling data delivery from the live trigger, the system avoids the "thundering herd" effect that would otherwise overwhelm cloud infrastructure during record-breaking broadcasts. This architecture ensures that millions of global devices receive timely updates and visual cues without requiring linear, inefficient scaling of compute resources.

### The Constraint Optimization Problem

To maintain a seamless experience, Netflix engineers balance three primary technical constraints: time to update, request throughput, and compute cardinality.

* **Time:** The specific duration required to coordinate and push a recommendation update to the entire global fleet.
* **Throughput:** The maximum capacity of cloud services to handle incoming requests without service degradation.
* **Cardinality:** The variety and complexity of unique requests necessary to serve personalized updates to different user segments.

### Two-Phase Recommendation Delivery

The system splits the delivery process into two distinct stages to smooth out traffic spikes and ensure high availability; a sketch of the client-side flow follows at the end of this summary.

* **Prefetching Phase:** While members browse the app normally before an event, the system downloads materialized recommendations, metadata, and artwork into the device's local cache.
* **Broadcasting Phase:** When the event begins, a low-cardinality "at least once" message is broadcast to all connected devices, triggering them to display the already-cached content instantaneously.
* **Traffic Smoothing:** This approach eliminates the need for massive, real-time data fetches at the moment of kickoff, distributing the heavy lifting of data transfer over a longer period.

### Live State Management and UI Synchronization

A dedicated Live State Management (LSM) system tracks event schedules in real time to ensure the user interface stays perfectly in sync with the production.

* **Dynamic Adjustments:** If a live event is delayed or ends early, the LSM adjusts the broadcast triggers to preserve accuracy and prevent "spoilers" or dead links.
* **Visual Cues:** The UI uses "Live" badging and dynamic artwork transitions to signal urgency and guide users toward the stream.
* **Frictionless Playback:** For members already on a title's detail page, the system can trigger an automatic transition into the live player the moment the broadcast begins, reducing navigation latency.

To support global-scale live events, technical teams should prioritize edge-heavy strategies that pre-position assets on client devices. By shifting from a reactive request-response model to a proactive prefetch-and-trigger model, platforms can maintain high performance and reliability even during the most significant traffic peaks.
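The post describes the pattern at the architecture level; the sketch below is a hypothetical Java illustration of the client-side flow, with placeholder types and methods (`RecommendationRow`, `fetchRowsFor`, `render`) rather than Netflix APIs. The key point it shows is that the broadcast message carries only an identifier, and the heavy data has already been cached on the device.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client-side sketch of the prefetch-and-trigger pattern.
public class PrefetchAndTriggerClient {

    record RecommendationRow(String eventId, String title, String artworkUrl) {}

    // Local device cache populated while the member browses before the event.
    private final Map<String, RecommendationRow> cache = new ConcurrentHashMap<>();

    // Phase 1: prefetch materialized recommendations, metadata, and artwork ahead of time.
    void prefetch(String eventId) {
        cache.put(eventId, fetchRowsFor(eventId)); // ordinary request/response call
    }

    // Phase 2: the broadcast message carries only the event id (low cardinality),
    // so kickoff does not trigger a fleet-wide data fetch.
    void onBroadcast(String eventId) {
        Optional.ofNullable(cache.get(eventId))
                .ifPresentOrElse(
                        this::render,                        // instant: content is already on-device
                        () -> prefetchThenRender(eventId));  // fallback for cache misses
    }

    private RecommendationRow fetchRowsFor(String eventId) {
        return new RecommendationRow(eventId, "Live Event", "https://example.test/art.jpg");
    }

    private void prefetchThenRender(String eventId) {
        prefetch(eventId);
        render(cache.get(eventId));
    }

    private void render(RecommendationRow row) {
        System.out.println("Showing live row: " + row.title());
    }

    public static void main(String[] args) {
        PrefetchAndTriggerClient client = new PrefetchAndTriggerClient();
        client.prefetch("event-123");    // while browsing, before kickoff
        client.onBroadcast("event-123"); // at kickoff: renders from cache, no bulk fetch
    }
}
```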

netflix

100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine

Netflix has significantly optimized Maestro, its horizontally scalable workflow orchestrator, to meet the evolving demands of low-latency use cases like live events, advertising, and gaming. By redesigning the core engine to transition from a polling-based architecture to a high-performance event-driven model, the team achieved a 100x increase in speed. This evolution reduced workflow overhead from several seconds to mere milliseconds, drastically improving developer productivity and system efficiency.

### Limitations of the Legacy Architecture

The original Maestro architecture was built on a three-layer system that, while scalable, introduced significant latency during execution.

* **Polling Latency:** The internal flow engine relied on calling execution functions at set intervals, creating a "speedbump" where tasks waited seconds to be picked up by workers.
* **Execution Overhead:** The process of translating complex workflow graphs into parallel flows and sequentially chained tasks added internal processing time that hindered sub-hourly and ad-hoc workloads.
* **Concurrency Issues:** A lack of strong guarantees from the internal flow engine occasionally led to race conditions, where a single step might be executed by multiple workers simultaneously.

### Transitioning to an Event-Driven Engine

To support the highest level of user needs, Netflix replaced the traditional flow engine with a custom, high-performance execution model; a minimal sketch of the polling-versus-dispatch difference follows at the end of this summary.

* **Direct Dispatching:** The engine moved away from periodic polling in favor of an event-driven mechanism that triggers state transitions instantly.
* **State Machine Optimization:** The new design manages the lifecycle of workflows and steps through a more streamlined state machine, ensuring faster transitions between "start," "restart," "stop," and "pause" actions.
* **Reduced Data Latency:** The team optimized data access patterns for internal state storage, reducing the time required to write Maestro data to the database during high-volume executions.

### Scalability and Functional Improvements

The redesign not only improved speed but also strengthened the engine's ability to handle massive, complex data pipelines.

* **Isolation Layers:** The engine maintains strict isolation between the Maestro step runtime (integrated with Spark and Trino) and the underlying execution logic.
* **Support for Heterogeneous Workflows:** The supercharged engine continues to support massive workflows with hundreds of thousands of jobs while providing the low latency required for iterative development cycles.
* **Reliability Guarantees:** By moving to a more robust internal event bus, the system eliminated the race conditions found in the previous distributed job queue implementation.

For organizations managing large-scale data or ML workflows, moving toward an event-driven orchestration model is essential for supporting sub-hourly execution and low-latency ad-hoc queries. These performance improvements are now available in the Maestro open-source project for wider community adoption.
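Maestro's internal engine is far more involved, but the difference between polling and direct dispatch can be shown with a minimal sketch. The example below is hypothetical Java, not Maestro code: a worker blocks on an in-memory queue and dispatches downstream steps the moment a step finishes, instead of waiting for a polling interval to notice that work is ready.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of event-driven step dispatch (not Maestro's actual engine).
// Finishing a step immediately enqueues its downstream steps, so transitions are
// bounded by queue hand-off time rather than a polling interval.
public class EventDrivenDispatchSketch {

    // Tiny hard-coded workflow graph: step -> downstream steps.
    static final Map<String, List<String>> DOWNSTREAM = Map.of(
            "extract", List.of("transform"),
            "transform", List.of("load"),
            "load", List.of());

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> readySteps = new LinkedBlockingQueue<>();
        readySteps.put("extract"); // "workflow started" event

        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    // Blocks until an event arrives; no periodic scan, no idle waiting.
                    String step = readySteps.take();
                    System.out.println("running step: " + step);
                    // "Step finished" event: dispatch downstream steps right away.
                    for (String next : DOWNSTREAM.get(step)) {
                        readySteps.put(next);
                    }
                    if (step.equals("load")) {
                        return; // terminal step reached
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        worker.start();
        worker.join();
    }
}
```

A polling engine would instead wake up on a timer, scan for runnable steps, and sleep again, which is where the multi-second "speedbump" described above comes from.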