DrP is Meta’s programmatic root cause analysis (RCA) platform designed to automate incident investigations and reduce the burden of manual on-call tasks. By codifying investigation playbooks into executable "analyzers," the platform reduces mean time to resolve (MTTR) by 20% to 80% for more than 300 teams. This systematic approach replaces outdated manual scripts with a scalable backend that executes 50,000 automated analyses daily, providing immediate context when alerts fire.
## Architecture and Core Components
* **Expressive SDK:** Provides a framework for engineers to codify investigation workflows into "analyzers," utilizing a rich library of helper functions and machine learning algorithms.
* **Built-in Analysis Tools:** The platform includes native support for anomaly detection, event isolation, time-series correlation, and dimension analysis to identify specific problem areas.
* **Scalable Backend:** A multi-tenant execution environment manages a worker pool that handles thousands of requests securely and asynchronously.
* **Workflow Integration:** DrP is integrated directly into Meta’s internal alerting and incident management systems, allowing for automatic triggering without human intervention.
## Authoring and Verification Workflow
* **Template Bootstrapping:** Engineers use the SDK to generate boilerplate code that captures required input parameters and context in a type-safe manner.
* **Analyzer Chaining:** The system allows for seamless dependency analysis by passing context between different analyzers, enabling investigations to span multiple interconnected services.
* **Automated Backtesting:** Before deployment, analyzers undergo automated backtesting integrated into the code review process to ensure accuracy and performance.
* **Decision Tree Logic:** Investigation steps are modeled as decision trees within the code, allowing the analyzer to follow different paths based on the data it retrieves.
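The decision-tree style of investigation described above can be sketched in a few lines. Everything below is hypothetical (the telemetry snapshot, thresholds, and `Finding` type are invented for illustration, not Meta's actual SDK); it only shows the shape of an analyzer whose branches encode playbook steps:

```python
from dataclasses import dataclass

# Illustrative telemetry snapshot; a real analyzer would query monitoring APIs.
METRICS = {
    "checkout": {"error_rate": 0.12, "recent_deploy": True},
    "payments": {"error_rate": 0.01, "recent_deploy": False},
}

@dataclass
class Finding:
    service: str
    cause: str

def analyze(service: str) -> Finding:
    """Decision-tree investigation: each branch is one codified playbook step."""
    m = METRICS[service]
    if m["error_rate"] < 0.05:
        return Finding(service, "healthy: error rate within SLO")
    if m["recent_deploy"]:
        return Finding(service, "elevated errors after recent deploy; suggest rollback")
    return Finding(service, "elevated errors with no deploy; escalate to dependency analysis")
```

Chaining, in this picture, would mean the escalation branch invoking another analyzer with the current context as input.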
## Execution and Post-Processing
* **Trigger-based Analysis:** When an alert is activated, the backend automatically queues the relevant analyzer, ensuring findings are available as soon as an engineer begins triaging.
* **Automated Mitigation:** A post-processing system can take direct action based on investigation results, such as creating tasks or submitting pull requests to resolve identified issues.
* **DrP Insights:** This system periodically reviews historical analysis outputs to identify and rank the top causes of alerts, helping teams prioritize long-term reliability fixes.
* **Alert Annotation:** Results are presented in both human-readable text and machine-readable formats, directly annotating the incident logs for the on-call responder.
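As a rough illustration of the post-processing step, a dispatcher can map an analyzer's finding to follow-up actions. The field names and the confidence threshold below are invented for the sketch, not DrP's actual schema:

```python
def post_process(finding: dict) -> list[str]:
    """Turn one analyzer finding into follow-up actions (illustrative rules).

    Every finding annotates the alert; only high-confidence findings with a
    known fix trigger an automated task or proposed change.
    """
    actions = [f"annotate_alert: {finding['summary']}"]
    if finding.get("confidence", 0.0) >= 0.9 and finding.get("fix_known"):
        actions.append("create_task")
    return actions
```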
## Practical Conclusion
Organizations managing large-scale distributed systems should transition from static markdown playbooks to executable investigation code. By implementing a programmatic RCA framework like DrP, teams can scale their troubleshooting expertise and significantly reduce "on-call fatigue" by automating the repetitive triage steps that typically consume the first hour of an incident.
AWS has introduced several new capabilities to Amazon Bedrock AgentCore designed to remove the trust and quality barriers that often prevent AI agents from moving into production environments. These updates, which include granular policy controls and sophisticated evaluation tools, allow developers to implement strict operational boundaries and monitor real-world performance at scale. By balancing agent autonomy with centralized verification, AgentCore provides a secure framework for deploying highly capable agents across enterprise workflows.
**Governance through Policy in AgentCore**
* This feature establishes clear boundaries for agent actions by intercepting tool calls via the AgentCore Gateway before they are executed.
* By operating outside of the agent’s internal reasoning loop, the policy layer acts as an independent verification system that treats the agent as an autonomous actor requiring permission.
* Developers can define fine-grained permissions to ensure agents do not access sensitive data inappropriately or take unauthorized actions within external systems.
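The interception pattern, a deny-by-default policy check sitting between the agent and its tools, can be sketched generically. This is not the AgentCore API; the policy table and function names are invented to show the shape of a gateway that treats the agent as an actor requiring permission:

```python
# Illustrative policy table: destructive or unknown tools are denied by default.
POLICY = {
    "search_docs": {"allowed": True},
    "delete_record": {"allowed": False},
}

class PolicyViolation(Exception):
    pass

def gateway_invoke(tool: str, args: dict, tools: dict):
    """Intercept a tool call outside the agent's reasoning loop.

    The agent never executes tools directly; the gateway checks policy first.
    """
    rule = POLICY.get(tool, {"allowed": False})  # deny by default
    if not rule["allowed"]:
        raise PolicyViolation(f"tool '{tool}' denied by policy")
    return tools[tool](**args)
```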
**Quality Monitoring with AgentCore Evaluations**
* The new evaluation framework allows teams to monitor the quality of AI agents based on actual behavior rather than theoretical simulations.
* Built-in evaluators provide standardized metrics for critical dimensions such as helpfulness and correctness.
* Organizations can also implement custom evaluators to ensure agents meet specific business-logic requirements and industry-specific compliance standards.
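A custom evaluator is, at its core, a scoring function over recorded agent behavior. A minimal sketch, with invented metric definitions (not AgentCore's built-in evaluators):

```python
def correctness_evaluator(response: str, expected_facts: list[str]) -> float:
    """Fraction of required facts present in the response (0.0 to 1.0)."""
    hits = sum(1 for fact in expected_facts if fact.lower() in response.lower())
    return hits / len(expected_facts)

def compliance_evaluator(response: str, banned_terms: list[str]) -> bool:
    """True if the response avoids every term forbidden by policy."""
    return not any(term.lower() in response.lower() for term in banned_terms)
```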
**Enhanced Memory and Communication Features**
* New episodic functionality in AgentCore Memory introduces a long-term memory strategy that allows agents to learn from past experiences and apply successful solutions to similar future tasks.
* Bidirectional streaming in the AgentCore Runtime supports the deployment of advanced voice agents capable of handling natural, simultaneous conversation flows.
* These enhancements focus on improving consistency and user experience, enabling agents to handle complex, multi-turn interactions with higher reliability.
**Real-World Application and Performance**
* The AgentCore SDK has seen rapid adoption with over 2 million downloads, supporting diverse use cases from content generation at the PGA TOUR to financial data analysis at Workday.
* Case studies highlight significant operational gains, such as a 1,000 percent increase in content writing speed and a 50 percent reduction in problem resolution time through improved observability.
* The platform emphasizes 100 percent traceability of agent decisions, which is critical for organizations transitioning from reactive to proactive AI-driven operations.
To successfully scale AI agents, organizations should transition from simple prompt engineering to a robust agentic architecture. Leveraging these new policy and evaluation tools will allow development teams to maintain the necessary control and visibility required for customer-facing and mission-critical deployments.
NAVER is transitioning its internal search monitoring platform, SEER, to an architecture built on OpenTelemetry and open-source standards to achieve a more scalable and flexible observability environment. By adopting a vendor-agnostic approach, the engineering team aims to unify the collection of metrics, logs, and traces while contributing back to the global OpenTelemetry ecosystem. This shift underscores the importance of standardized telemetry protocols in managing complex, large-scale service infrastructures.
### Standardizing Observability with OTLP
* The transition focuses on the OpenTelemetry Protocol (OTLP) as the primary standard for transmitting telemetry data across the platform.
* Moving away from proprietary formats allows for a unified data model that encompasses metrics, traces, and logs, ensuring consistency across different services.
* A standardized protocol simplifies the integration of various open-source backends, reducing the engineering overhead associated with supporting multiple telemetry formats.
### The OpenTelemetry Collector Pipeline
* The Collector acts as a critical intermediary, decoupling the application layer from the storage backend to provide greater architectural flexibility.
* **Receivers** are used to ingest data from diverse sources, supporting both OTLP-native applications and legacy systems.
* **Processors** enable data transformation, filtering, and metadata enrichment (such as adding resource attributes) before the data reaches its destination.
* **Exporters** manage the delivery of processed telemetry to specific backends like Prometheus for metrics or Jaeger for tracing, allowing for easy swaps of infrastructure components.
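A minimal Collector configuration wiring these three stages together might look like the following. The endpoints and backend choices are illustrative, not NAVER's actual setup; note that recent Jaeger versions ingest OTLP directly, so traces can be exported with the plain `otlp` exporter:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: insert

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlp/jaeger]
```

Swapping a backend then means changing only the exporter block, which is the architectural flexibility the decoupling is meant to buy.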
### Automated Management via OpenTelemetry Operator
* The OpenTelemetry Operator is utilized within Kubernetes environments to automate the deployment and lifecycle management of the Collector.
* It facilitates auto-instrumentation, allowing developers to collect telemetry from applications without manual code changes for every service.
* The Operator ensures that the observability stack scales dynamically alongside the production workloads it monitors.
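With the Operator installed, auto-instrumentation is typically driven by an `Instrumentation` resource plus a single pod annotation. The endpoint and names below are placeholders:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
---
# Opting a workload in is then a single pod annotation, e.g. for Python:
#   instrumentation.opentelemetry.io/inject-python: "true"
```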
### Open-Source Contribution and Community
* Beyond mere adoption, the NAVER engineering team actively participates in the OpenTelemetry community by sharing bug fixes and feature enhancements discovered during the SEER migration.
* This collaborative approach ensures that the specific requirements of high-traffic enterprise environments are reflected in the evolution of the OpenTelemetry project.
Adopting OpenTelemetry is a strategic move for organizations looking to avoid vendor lock-in and build a future-proof monitoring stack. For a successful implementation, teams should focus on mastering the Collector's pipeline configuration to balance data granularity with processing performance across distributed systems.
Toss Payments evolved its Payment SDK to solve the inherent complexities of integrating payment systems, where developers must navigate UI implementation, security flows, and exception handling. By transitioning from V1 to V2, the team moved beyond simply providing a library to building a robust, architecture-driven system that ensures stability and scalability across diverse merchant environments. The core conclusion is that a successful SDK must be treated as a critical infrastructure layer, relying on modular design and deep observability to handle the unpredictable nature of third-party runtimes.
## The Unique Challenges of SDK Development
* SDK code lives within the merchant's runtime environment, meaning it shares the same lifecycle and performance constraints as the merchant’s own code.
* Internal logging can inadvertently create bottlenecks; for instance, adding network logs to a frequently called method can lead to "self-DDoS" scenarios that crash the merchant's payment page.
* Type safety is a major hurdle, as merchants may pass unexpected data types (e.g., a number instead of a string), causing fatal runtime errors like `startsWith is not a function`.
* The SDK acts as a bridge for technical communication, requiring it to function as both an API consumer for internal systems and an API provider for external developers.
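The `startsWith is not a function` class of failure can be neutralized at the public boundary by normalizing merchant input before any core logic runs. A sketch of that guard, written in Python for illustration (the actual Toss SDK is JavaScript, and the function names here are invented):

```python
def normalize_order_id(value) -> str:
    """Coerce merchant-supplied input to a string, rejecting unusable types.

    Guards against the class of runtime errors the article describes,
    e.g. calling a string method on a number the merchant passed by mistake.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    raise TypeError(f"orderId must be a string, got {type(value).__name__}")

def is_test_order(order_id) -> bool:
    # Safe even when a merchant passes 12345 instead of "12345".
    return normalize_order_id(order_id).startswith("test_")
```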
## Ensuring Stability through Observability
* To manage the unpredictable ways merchants use the SDK, Toss implemented over 300 unit tests and 500 E2E integration tests based on real-world use cases.
* The team utilizes a "Global Trace ID" to track a single payment journey across both the frontend and backend, allowing for seamless debugging across the entire system.
* A custom Monitoring CLI was developed to compare payment success rates before and after deployments, categorized by merchant and runtime environment (e.g., PC Chrome vs. Android WebView).
* This observability infrastructure enables the team to quickly identify edge-case failures—such as a specific merchant's checkout failing only on mobile WebViews—which are often missed by standard QA processes.
## Scaling with Modular Architecture
* To avoid "if-statement hell" caused by merchant-specific requirements (e.g., fixing installment months or custom validation for a specific store), Toss moved to a "Lego-block" architecture.
* The SDK is organized into three distinct layers based on the "reason for change" principle:
* **Public Interface Layer:** Manages the contract with the merchant, validating inputs and translating them into internal domain models.
* **Domain Layer:** Encapsulates core business logic and payment policies, keeping them isolated from external changes.
* **External Service Layer:** Handles dependencies like Server APIs and Web APIs, ensuring technical shifts don't leak into the business logic.
* This separation allows the team to implement custom merchant logic by swapping specific blocks without modifying the core codebase, reducing the risk of regressions and lowering maintenance costs.
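The three-layer split can be sketched as follows. The class names and the fixed-installment policy are invented to show how a merchant-specific block swaps in without touching the core:

```python
# External service layer: infrastructure details behind a narrow interface.
class PaymentGateway:
    def request(self, amount: int, months: int) -> dict:
        return {"status": "APPROVED", "amount": amount, "months": months}

# Domain layer: payment policy, isolated from the merchant and the network.
class InstallmentPolicy:
    def __init__(self, fixed_months=None):
        self.fixed_months = fixed_months  # merchant-specific block, swappable

    def resolve(self, requested: int) -> int:
        return self.fixed_months if self.fixed_months is not None else requested

# Public interface layer: validates merchant input and delegates inward.
class PaymentSDK:
    def __init__(self, policy: InstallmentPolicy, gateway: PaymentGateway):
        self.policy, self.gateway = policy, gateway

    def pay(self, amount, months=0) -> dict:
        if not isinstance(amount, int) or amount <= 0:
            raise ValueError("amount must be a positive integer")
        return self.gateway.request(amount, self.policy.resolve(months))
```

A merchant that fixes installments to three months gets `InstallmentPolicy(fixed_months=3)` wired in at construction time; the public interface and gateway code never change.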
For developers building SDKs or integration tools, the shift from monolithic logic to a layered, observable architecture is essential. Prioritizing the separation of domain logic from public interfaces and investing in environment-specific monitoring allows for a highly flexible product that remains stable even as the client-side environment grows increasingly complex.
Datadog’s container observability team significantly improved the performance of kube-state-metrics (KSM) by contributing core architectural enhancements to the upstream open-source project. Faced with scalability bottlenecks where metrics collection for large clusters took tens of seconds and generated massive data payloads, they revamped the underlying library to achieve a 15x improvement in processing duration. These contributions allowed for high-granularity monitoring at scale, ensuring that the Datadog Agent can efficiently handle millions of metrics across thousands of Kubernetes nodes.
### Challenges with KSM Scalability
* KSM uses the informer pattern to expose cluster-level metadata in the OpenMetrics format, but the volume of data grows steeply with cluster size.
* In high-scale environments, a single node generates approximately nine metrics, while a single pod can generate up to 40 metrics.
* In clusters with thousands of nodes and tens of thousands of pods, the `/metrics` endpoint produced payloads weighing tens of megabytes.
* The time required to crawl these metrics often exceeded 15 seconds, forcing administrators to reduce check frequency and sacrifice real-time data granularity.
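A back-of-envelope calculation shows how those per-object figures add up to tens of megabytes per scrape. The per-node and per-pod counts are the article's; the cluster size and bytes-per-line figure are assumptions:

```python
# Assumed cluster: 1,000 nodes, 10,000 pods; ~100 bytes per exposition line.
nodes, pods = 1_000, 10_000
metrics_per_node, metrics_per_pod = 9, 40  # figures cited above

samples = nodes * metrics_per_node + pods * metrics_per_pod
bytes_per_line = 100  # a labeled OpenMetrics line is often ~100 bytes or more
payload_mb = samples * bytes_per_line / 1_000_000

print(samples, round(payload_mb, 1))  # prints: 409000 40.9
```

At tens of thousands of pods the same arithmetic lands well past 100 MB, which is why serialization efficiency, not just collection frequency, became the bottleneck.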
### Limitations of Legacy Implementations
* KSM v1 relied on a monolithic loop that instantiated a Builder to track resources via stores, but it lacked efficient hooks for metric generation.
* The original Python-based Datadog Agent check struggled with the "data dump" approach of KSM, where all metrics were processed at once during query time.
* To manage the load, Datadog was forced to split KSM into multiple deployments based on resource types (e.g., separate deployments for pods, nodes, and secondary resources like services or deployments).
* This fragmentation made the infrastructure more complex to manage and did not solve the fundamental issue of inefficient metric serialization.
### Architectural Improvements in KSM v2.0
* Datadog collaborated with the upstream community during the development of KSM v2.0 to introduce a more extensible design.
* The team focused on improving the Builder and metric generation hooks to prevent the system from dumping the entire dataset at query time.
* By moving away from the restrictive v1 library structure, they enabled more efficient reconciliation of metric names and metadata joins.
* The resulting 15x performance gain allows the Datadog Agent to reconcile labels and tags—such as joining deployment labels to specific metrics—without the significant latency overhead previously experienced.
Contributing back to the open-source community proved more effective than maintaining internal forks for scaling Kubernetes infrastructure. Organizations running high-density clusters should prioritize upgrading to KSM v2.0 and optimizing their agent configurations to leverage these architectural improvements for better observability performance.
Datadog has introduced Toto, a new open-weights foundation model specifically designed for time-series forecasting and anomaly detection within observability contexts. While general-purpose time-series models often struggle with the unique volatility and high-frequency patterns of IT telemetry, Toto is pre-trained on a massive dataset of 500 billion observations to provide superior zero-shot performance. This release, accompanied by the BOOM benchmark, addresses the critical need for specialized AI tools capable of handling the complexity of modern cloud infrastructure.
### Toto Model Architecture and Training
* Toto utilizes a decoder-only transformer architecture, adapting large language model (LLM) principles to the domain of continuous numerical data.
* The model employs a "patching" mechanism, which groups multiple time-series data points into single tokens to improve computational efficiency and allow the model to capture longer historical dependencies.
* It incorporates Rotary Positional Embeddings (RoPE) to better handle sequences of varying lengths and maintain temporal relationships across different frequencies.
* Training was conducted on a curated dataset of 500 billion anonymized data points from real-world observability metrics, including CPU usage, memory consumption, and network traffic.
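Patching itself is essentially a reshape: consecutive observations are grouped into fixed-size chunks, one chunk per token. A minimal sketch with an illustrative patch size (Toto's actual tokenization details are not reproduced here):

```python
def patch_series(series, patch_size):
    """Group consecutive observations into fixed-size patches (one patch per token).

    A 60-step window with patch_size=4 becomes 15 tokens, so attention cost
    drops and each token summarizes a longer span of history.
    """
    if len(series) % patch_size != 0:
        raise ValueError("series length must be a multiple of patch_size")
    return [series[i:i + patch_size] for i in range(0, len(series), patch_size)]
```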
### Specialized Observability Features
* Unlike existing models like TimesFM or Chronos, which are trained on diverse but general datasets like weather or retail trends, Toto is optimized for the specific "spikiness" and abrupt level shifts common in IT environments.
* The model supports zero-shot forecasting, allowing users to generate predictions for new metrics immediately without the need for expensive or time-consuming fine-tuning.
* Toto is designed to handle varying sampling rates, from one-second intervals to hourly aggregations, making it versatile across different infrastructure layers.
* The open-weights release on Hugging Face allows researchers and engineers to integrate the model into their own AIOps workflows or private cloud environments.
### The BOOM Evaluation Framework
* Datadog released the Benchmarking Observability Models (BOOM) framework to provide a standardized method for evaluating time-series models on infrastructure-specific tasks.
* BOOM focuses on metrics that represent real-world operational challenges, such as seasonal traffic patterns and sudden system failures.
* Comparative testing shows that Toto consistently outperforms general-purpose models in Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) when applied to observability datasets.
* The benchmark provides a transparent way for the industry to measure progress in time-series foundation models, moving beyond generic datasets that do not reflect the realities of microservices and distributed systems.
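For reference, the two error metrics the benchmark reports are straightforward to compute; lower is better for both, and RMSE penalizes large misses more heavily than MAE:

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the forecast errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```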
Organizations looking to automate capacity planning, optimize cloud spend, or implement intelligent alerting should consider adopting Toto for their time-series analysis. By utilizing the open-weights model alongside the BOOM benchmark, teams can achieve high-accuracy forecasting and objective performance validation without the overhead of building specialized models from scratch.