search-engine

2 posts

daangn

Easily Operating Karrot

This blog post by the Daangn (Karrot) search platform team details their journey in operating Elasticsearch on Kubernetes with ECK (Elastic Cloud on Kubernetes). While the initial migration to ECK reduced deployment times, the team faced critical latency spikes during rolling restarts due to cold caches and high traffic volumes. To achieve a "deploy anytime" environment, they developed a data-node warm-up system that ensures nodes are performance-ready before they begin handling live search requests.

## Scaling Challenges and Operational Constraints

- Over two years, Daangn's search infrastructure expanded from a single cluster to four specialized clusters, with peak traffic jumping from 1,000 to over 10,000 QPS.
- The initial strategy of "avoiding peak hours" for deployments became a bottleneck: the window for safe updates narrowed while total deployment time across all clusters exceeded six hours.
- Manual monitoring became a necessity rather than an option, as engineers had to verify traffic conditions and latency graphs before and during every ArgoCD sync.

## The Hazards of Rolling Restarts in Elasticsearch

- Standard Kubernetes rolling restarts are problematic for stateful systems because a "Ready" Pod is not the same as a performant Pod; Elasticsearch relies heavily on memory-resident caches (page cache, query cache, field data cache).
- A version update of the Elastic Operator once triggered an unintended rolling restart that caused a 60% error rate and 3-second latency spikes because new nodes had to fetch all data from disk.
- When a node restarts, the cluster enters a "Yellow" state in which the remaining replicas must handle 100% of the traffic, creating a single point of failure and increasing the load on the surviving nodes.

## Strategy for Reliable Node Warm-up

- The primary goal was to reach a state where p99 latency remains stable during restarts, regardless of whether the deployment occurs during peak traffic hours.
- The solution is a warm-up system that pre-loads frequently accessed data into the filesystem and Elasticsearch internal caches before the node is allowed to join the load balancer.
- By executing representative search queries against a newly started node, the system ensures that the necessary segments are already in the page cache, preventing the disk I/O thrashing that typically follows a cold start (a minimal sketch of this idea appears at the end of this summary).

## Implementation Goals

- Automate the validation of node readiness beyond simple health checks to include performance readiness.
- Eliminate the need for human "eyes-on-glass" monitoring during the 90-minute deployment cycles.
- Maintain high availability and a consistent user experience even while shards are being reallocated and replicas are temporarily unassigned.

To maintain a truly resilient search platform on Kubernetes, it is critical to recognize that for stateful applications, "available" is not the same as "ready." Implementing a customized warm-up controller or warm-up logic is a recommended practice for any high-traffic Elasticsearch environment, since it decouples deployment schedules from traffic patterns.
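The warm-up idea above can be illustrated with a short script. This is a minimal sketch, not Daangn's actual implementation: the node address, index name, representative queries, and latency threshold are all hypothetical placeholders, and it assumes the new data node can be queried directly over HTTP, using the `_local` search preference so that its own shard copies (and caches) are the ones exercised.

```python
"""Minimal cache warm-up sketch for a freshly restarted Elasticsearch data node.

All names below (NODE_URL, INDEX, WARMUP_QUERIES, the latency threshold) are
hypothetical placeholders, not values from the original post.
"""
import time

import requests

NODE_URL = "http://es-data-2.search.svc.cluster.local:9200"  # direct node address (assumed reachable)
INDEX = "articles"            # hypothetical index name
LATENCY_TARGET_MS = 50.0      # illustrative "performance-ready" threshold
MAX_PASSES = 20

# Representative queries chosen to touch the hot segments (illustrative examples).
WARMUP_QUERIES = [
    {"query": {"match": {"title": "bicycle"}}},
    {"query": {"bool": {"filter": [{"term": {"region_id": 1}}]}}},
]


def run_pass() -> float:
    """Run every warm-up query against the node; return the worst latency in ms."""
    worst = 0.0
    for body in WARMUP_QUERIES:
        start = time.monotonic()
        # preference=_local prefers shard copies on the node we query,
        # so the warm-up fills that node's page/query caches rather than a neighbour's.
        resp = requests.post(
            f"{NODE_URL}/{INDEX}/_search",
            params={"preference": "_local"},
            json=body,
            timeout=10,
        )
        resp.raise_for_status()
        worst = max(worst, (time.monotonic() - start) * 1000.0)
    return worst


def warm_up() -> bool:
    """Repeat warm-up passes until latency stabilises below the target, or give up."""
    for attempt in range(1, MAX_PASSES + 1):
        latency = run_pass()
        print(f"pass {attempt}: worst latency {latency:.1f} ms")
        if latency <= LATENCY_TARGET_MS:
            return True  # caches are hot enough; safe to admit live traffic
    return False


if __name__ == "__main__":
    # A non-zero exit keeps the node out of rotation when used as a readiness-style gate.
    raise SystemExit(0 if warm_up() else 1)
```

In a real setup, a check like this would run as part of the node's readiness logic (for example, before the Pod is added to the Service endpoints), so a node whose caches are still cold never receives live traffic.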

naver

Naver Integrated Search LLM DevOps

Naver's Integrated Search team is transitioning from manual fault response to an automated system that uses LLM agents to manage the increasing complexity of its search infrastructure. By integrating large language models into the DevOps pipeline, the system evolves through accumulated experience, moving beyond simple alert monitoring to intelligent diagnostic analysis and action recommendation.

### Limitations of Traditional Fault Response

* **Complex Search Flows:** Naver's search architecture involves multiple interdependent layers, which makes manual root cause analysis slow and prone to human error.
* **Fragmented Context:** Existing monitoring requires developers to manually synthesize logs and metrics from disparate telemetry sources, leading to high cognitive load during outages.
* **Delayed Intervention:** Human-led responses often suffer from a "detection-to-action" lag, especially during high-traffic periods or subtle service regressions.

### Architecture of DevOps Agent v1

* **Initial Design:** Focused on automating basic data gathering and providing preliminary textual reports to engineers.
* **Infrastructure Integration:** Built on a specialized software stack designed to bridge frontend (FE) and backend (BE) telemetry within the search infrastructure.
* **Standardized Logic:** The v1 agent operated on a fixed set of instructions to perform predefined diagnostic tasks when triggered by specific system alarms.

### Evolution to DevOps Agent v2

* **Overcoming v1 Limitations:** The first iteration struggled to maintain deep context and provide diverse actionable insights, necessitating a more robust agentic structure.
* **Enhanced Memory and Learning:** v2 incorporates a more sophisticated architecture that allows the agent to reference historical failure data and learn from past incident resolutions.
* **Advanced Tool Interaction:** The system was upgraded with more complex tool-calling capabilities, allowing the agent to interact more deeply with internal infrastructure APIs.

### System Operations and Evaluation

* **Trigger Queue Management:** Implements a queuing system to process and prioritize multiple concurrent system alerts without overwhelming the diagnostic pipeline (a minimal sketch of this loop follows the summary).
* **Anomaly Detection:** Uses detection methods that distinguish routine traffic fluctuations from genuine service anomalies requiring LLM intervention.
* **Rigorous Evaluation:** The agent's performance is measured with a dedicated evaluation framework that assesses the accuracy of its diagnoses against known ground-truth incidents.

### Scaling and Future Challenges

* **Context Expansion:** Efforts focus on integrating a wider range of metadata and environmental context to provide a holistic view of system health.
* **Action Recommendation:** The system is moving toward suggesting specific recovery actions, such as rollbacks or traffic rerouting, rather than only identifying the problem.
* **Sustainability:** Keeping the DevOps Agent maintainable and cost-effective as the underlying search infrastructure and LLM models continue to evolve.

Organizations managing high-scale search traffic should consider LLM-based agents as integrated infrastructure components rather than standalone tools. Moving from reactive monitoring to a proactive, experience-based agent system is essential for reducing mean time to recovery (MTTR) in complex distributed environments.
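As a rough illustration of the alert-triggered flow described above, the sketch below wires a priority queue of alerts to an anomaly filter and an LLM-backed diagnosis step. Everything in it is hypothetical: the `Alert` shape, the `is_real_anomaly` heuristic, the `gather_context` helper, and the `ask_llm` callable are illustrative stand-ins, not Naver's internal interfaces.

```python
"""Skeleton of an alert-driven diagnostic agent loop.

Every interface here (Alert, is_real_anomaly, gather_context, ask_llm) is an
illustrative stand-in, not the actual DevOps Agent implementation.
"""
from dataclasses import dataclass, field
from queue import PriorityQueue
from typing import Callable


@dataclass(order=True)
class Alert:
    priority: int                        # lower value = handled first
    service: str = field(compare=False)
    message: str = field(compare=False)


def is_real_anomaly(alert: Alert) -> bool:
    """Stand-in anomaly filter: separate genuine regressions from routine
    traffic fluctuations before spending an LLM call on them."""
    return "latency" in alert.message or "error rate" in alert.message


def gather_context(alert: Alert) -> str:
    """Stand-in for tool calls that pull logs and metrics for the affected
    service from frontend/backend telemetry sources."""
    return f"recent logs and metrics for {alert.service} (placeholder)"


def diagnose(alert: Alert, ask_llm: Callable[[str], str]) -> str:
    """Build a prompt from the gathered context and ask the LLM for a
    preliminary root-cause hypothesis and a recommended next action."""
    prompt = (
        f"Alert: {alert.message}\n"
        f"Context:\n{gather_context(alert)}\n"
        "Identify the likely root cause and suggest one recovery action."
    )
    return ask_llm(prompt)


def agent_loop(alerts: PriorityQueue, ask_llm: Callable[[str], str]) -> None:
    """Drain the trigger queue in priority order, filtering noise before
    invoking the LLM."""
    while not alerts.empty():
        alert = alerts.get()
        if not is_real_anomaly(alert):
            continue  # drop routine fluctuations; keep the pipeline and LLM budget free
        print(f"[{alert.service}] {diagnose(alert, ask_llm)}")


if __name__ == "__main__":
    alert_queue = PriorityQueue()
    alert_queue.put(Alert(priority=1, service="integrated-search", message="p99 latency spike"))
    alert_queue.put(Alert(priority=5, service="integrated-search", message="routine traffic dip"))
    # Trivial stub LLM so the sketch runs without external dependencies.
    agent_loop(alert_queue, ask_llm=lambda prompt: "diagnosis placeholder")
```

The filter-before-diagnose ordering reflects the point the post makes: routine fluctuations should be dropped before they reach the LLM, both to keep the diagnostic pipeline from being overwhelmed and to keep inference costs sustainable.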