Getting started: Hello, I'm 어다희, working as a Site Reliability Engineer (SRE). Our team is responsible for Media Platform SRE as well as global traffic management. Through the "Adopting SLI/SLO for Improved Reliability" series, we have examined the concepts behind SLI/SLO and why they matter, and shared cases of applying them to platforms and services. Adopting SLI/SLO for Improved Reliability, Part 1 - Introduction and Necessity; Adopting SLI/SLO for Improved Reliability, Part 2 - Platform Application Case; Adopting SLI/SLO for Improved Re…
Amazon CloudWatch has evolved into a unified platform for managing operational, security, and compliance log data, significantly reducing the need for redundant data stores and complex ETL pipelines. By standardizing ingestion through industry-standard formats like OCSF and OpenTelemetry, the service enables seamless cross-source analytics while lowering operational overhead and storage costs. This update allows organizations to move away from fragmented data silos toward a centralized, Iceberg-compatible architecture for deeper technical and business insights.
**Data Ingestion and Schema Normalization**
* Automatically collects AWS-vended logs across accounts and regions via AWS Organizations, including CloudTrail, VPC Flow Logs, WAF access logs, and Route 53 resolver logs.
* Includes pre-built connectors for a wide range of third-party sources, such as endpoint security (CrowdStrike, SentinelOne), identity providers (Okta, Entra ID), and network security (Zscaler, Palo Alto Networks).
* Utilizes managed Open Cybersecurity Schema Framework (OCSF) and OpenTelemetry (OTel) conversion to ensure data consistency across disparate sources.
* Provides built-in processors, such as Grok for custom parsing and field-level operations, to transform and manipulate strings during the ingestion phase.
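The Grok-style parsing step described above can be sketched with plain regular expressions, since a Grok pattern is essentially a named-group regex. This is a minimal stdlib illustration of the idea, not the actual CloudWatch processor API; the pattern and the sample log line are invented for the example.

```python
import re

# Hypothetical pattern for a simplified access-log line; in Grok terms,
# each named group corresponds to a field extracted at ingestion time.
ACCESS_LOG = re.compile(
    r"(?P<client_ip>\d+\.\d+\.\d+\.\d+) "
    r"\[(?P<timestamp>[^\]]+)\] "
    r'"(?P<method>\w+) (?P<path>\S+)" '
    r"(?P<status>\d{3})"
)

def parse_line(line: str) -> dict:
    """Extract named fields, then apply a field-level operation
    (type casting) as an ingestion processor would."""
    match = ACCESS_LOG.match(line)
    if match is None:
        return {"raw": line}  # pass unparsed records through untouched
    fields = match.groupdict()
    fields["status"] = int(fields["status"])  # field-level type cast
    return fields

record = parse_line('10.0.0.1 [2024-05-01T12:00:00Z] "GET /health" 200')
```

The same shape generalizes to any line-oriented source: one pattern per log format, with field-level operations applied to the extracted dictionary before the record moves downstream.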
**Unified Architecture and Cost Optimization**
* Consolidates log management into a single service with built-in governance, eliminating the need to store and maintain duplicate copies of data across different tools.
* Introduces Apache Iceberg-compatible access via Amazon S3 Tables, allowing data to be queried in place by external tools.
* Removes the requirement for complex ETL pipelines by providing a unified data store that is accessible to Amazon Athena, Amazon SageMaker Unified Studio, and other Iceberg-compatible analytics engines.
**Advanced Analytics and Discovery Tools**
* Supports multiple query interfaces, allowing users to interact with logs using natural language, SQL, LogsQL, or PPL (Piped Processing Language).
* The new "Facets" interface enables intuitive filtering by application, account, region, and log type, featuring intelligent parameter inference for cross-account queries.
* Enables the correlation of operational logs with business data from third-party tools like ServiceNow CMDB or GitHub to provide a more comprehensive view of organizational health.
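Conceptually, the Facets interface narrows a result set by dimensions such as account, region, and log type. A rough stdlib sketch of that filtering model (the record fields and values here are invented for illustration, not CloudWatch's actual schema):

```python
from typing import Iterable

def facet_filter(records: Iterable[dict], **facets) -> list[dict]:
    """Keep only records whose fields match every requested facet,
    mimicking filter-by-dimension over a cross-account result set."""
    return [r for r in records
            if all(r.get(k) == v for k, v in facets.items())]

logs = [
    {"account": "111111111111", "region": "us-east-1", "log_type": "vpc-flow"},
    {"account": "222222222222", "region": "us-east-1", "log_type": "waf"},
    {"account": "111111111111", "region": "eu-west-1", "log_type": "vpc-flow"},
]

# Narrow by account and log type; region is left unconstrained.
matches = facet_filter(logs, account="111111111111", log_type="vpc-flow")
```

Each additional facet is a conjunction, which is why facet filtering composes cleanly across accounts and regions without the user writing explicit query predicates.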
Organizations should leverage these unified management features to consolidate their security and operational monitoring into a single source of truth. By adopting OCSF normalization and the new S3 Tables integration, teams can reduce the technical debt associated with managing multiple log silos while improving their ability to run cross-functional analytics.
NAVER is transitioning its internal search monitoring platform, SEER, to an architecture built on OpenTelemetry and open-source standards to achieve a more scalable and flexible observability environment. By adopting a vendor-agnostic approach, the engineering team aims to unify the collection of metrics, logs, and traces while contributing back to the global OpenTelemetry ecosystem. This shift underscores the importance of standardized telemetry protocols in managing complex, large-scale service infrastructures.
### Standardizing Observability with OTLP
* The transition focuses on the OpenTelemetry Protocol (OTLP) as the primary standard for transmitting telemetry data across the platform.
* Moving away from proprietary formats allows for a unified data model that encompasses metrics, traces, and logs, ensuring consistency across different services.
* A standardized protocol simplifies the integration of various open-source backends, reducing the engineering overhead associated with supporting multiple telemetry formats.
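The unified data model above can be pictured as the three signal types sharing one resource descriptor, which is roughly how OTLP groups telemetry by its producer. The classes below are a simplified stdlib sketch of that idea, not the actual OTLP protobuf schema; the service name is invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resource:
    """Identifies the entity producing telemetry; shared by all signals."""
    attributes: tuple  # e.g. (("service.name", "seer-gateway"),)

@dataclass
class Metric:
    resource: Resource
    name: str
    value: float

@dataclass
class LogRecord:
    resource: Resource
    body: str

@dataclass
class Span:
    resource: Resource
    trace_id: str
    name: str

# One resource, three signal types: a backend can correlate them
# because each record carries the same service identity.
svc = Resource(attributes=(("service.name", "seer-gateway"),))
metric = Metric(svc, "http.server.duration", 12.5)
log = LogRecord(svc, "request handled")
span = Span(svc, "4bf92f35", "GET /search")
```

Because the resource is shared rather than duplicated per format, a consumer can join metrics, logs, and traces on service identity without per-source adapters.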
### The OpenTelemetry Collector Pipeline
* The Collector acts as a critical intermediary, decoupling the application layer from the storage backend to provide greater architectural flexibility.
* **Receivers** are used to ingest data from diverse sources, supporting both OTLP-native applications and legacy systems.
* **Processors** enable data transformation, filtering, and metadata enrichment (such as adding resource attributes) before the data reaches its destination.
* **Exporters** manage the delivery of processed telemetry to specific backends like Prometheus for metrics or Jaeger for tracing, making it straightforward to swap infrastructure components without touching application code.
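The receiver → processor → exporter flow above can be mimicked in a few lines. This is a conceptual sketch of the pipeline stages, not the Collector's actual Go interfaces; the record shapes and the service name are invented for illustration.

```python
from typing import Callable

# A pipeline stage is just a function over a batch of telemetry records.
Stage = Callable[[list[dict]], list[dict]]

def drop_debug(batch: list[dict]) -> list[dict]:
    """Processor: filter out noisy records before export."""
    return [r for r in batch if r.get("severity") != "DEBUG"]

def enrich_with_resource(batch: list[dict]) -> list[dict]:
    """Processor: add resource attributes (metadata enrichment)."""
    return [{**record, "service.name": "seer-gateway"} for record in batch]

def run_pipeline(received: list[dict], processors: list[Stage],
                 exporter: Stage) -> list[dict]:
    """Receiver output flows through each processor in order,
    then to the exporter, decoupling sources from backends."""
    batch = received
    for process in processors:
        batch = process(batch)
    return exporter(batch)

received = [{"severity": "INFO", "body": "ok"},
            {"severity": "DEBUG", "body": "verbose"}]
exported = run_pipeline(received, [drop_debug, enrich_with_resource],
                        exporter=lambda b: b)  # stand-in for an OTLP exporter
```

Swapping the backend means replacing only the exporter stage; the receivers and processors are untouched, which is the decoupling the Collector's architecture provides.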
### Automated Management via OpenTelemetry Operator
* The OpenTelemetry Operator is utilized within Kubernetes environments to automate the deployment and lifecycle management of the Collector.
* It facilitates auto-instrumentation, allowing developers to collect telemetry from applications without manual code changes for every service.
* The Operator ensures that the observability stack scales dynamically alongside the production workloads it monitors.
### Open-Source Contribution and Community
* Beyond mere adoption, the NAVER engineering team actively participates in the OpenTelemetry community by sharing bug fixes and feature enhancements discovered during the SEER migration.
* This collaborative approach ensures that the specific requirements of high-traffic enterprise environments are reflected in the evolution of the OpenTelemetry project.
Adopting OpenTelemetry is a strategic move for organizations looking to avoid vendor lock-in and build a future-proof monitoring stack. For a successful implementation, teams should focus on mastering the Collector's pipeline configuration to balance data granularity with processing performance across distributed systems.