apache-iceberg

5 posts

aws

Announcing replication support and Intelligent-Tiering for Amazon S3 Tables

AWS has expanded the capabilities of Amazon S3 Tables by introducing Intelligent-Tiering for automated cost optimization and cross-region replication for enhanced data availability. These updates address the operational overhead of managing large-scale Apache Iceberg datasets by automating storage lifecycle management and simplifying the architecture required for global data distribution. By integrating these features, organizations can reduce storage costs without manual intervention while ensuring consistent data access across multiple AWS Regions and accounts.

### Cost Optimization with S3 Tables Intelligent-Tiering

This feature automatically shifts data between storage tiers based on access frequency to maximize cost efficiency without impacting application performance.

* The system utilizes three low-latency tiers: Frequent Access, Infrequent Access (offering 40% lower costs), and Archive Instant Access (offering 68% lower costs than Infrequent Access).
* Data transitions are automated, moving to Infrequent Access after 30 days of inactivity and to Archive Instant Access after 90 days.
* Automated table maintenance tasks, such as compaction and snapshot expiration, are optimized to skip colder files; for example, compaction only processes data in the Frequent Access tier to minimize unnecessary compute and storage costs.
* Users can configure Intelligent-Tiering as the default storage class at the table bucket level using the AWS CLI commands `put-table-bucket-storage-class` and `get-table-bucket-storage-class`.

### Cross-Region and Cross-Account Replication

New replication support allows users to maintain synchronized, read-only replicas of their S3 Tables across different geographic locations and ownership boundaries.

* Replication maintains chronological consistency and preserves parent-child snapshot relationships, ensuring that replicas remain identical to the source for query purposes.
* Replica tables are typically updated within minutes of changes to the source table and support independent encryption and retention policies to meet specific regional compliance requirements.
* The service eliminates the need for complex, custom-built architectures to track metadata transformations or manually sync objects between Iceberg tables.
* This functionality is primarily designed to reduce query latency for geographically distributed teams and provide robust data protection for disaster recovery scenarios.

### Practical Implementation

To maximize the benefits of these new features, organizations should consider setting Intelligent-Tiering as the default storage class at the bucket level for all new datasets to ensure immediate cost savings. For global operations, setting up read-only replicas in regions closest to end-users will significantly improve query performance for analytics tools like Amazon Athena and Amazon SageMaker.
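
The tiering rules described in this summary can be illustrated with a small model. The following is a minimal sketch, not the service's implementation: the tier labels, the `DataFile` type, and the function names are hypothetical, the 30- and 90-day thresholds are treated as inclusive, and the 40% discount is assumed to be relative to Frequent Access. Prices are relative (Frequent Access = 1.0), not actual AWS rates.

```python
from dataclasses import dataclass

# Thresholds from the summary: days without access before a file moves tiers.
# Assumption: the boundaries are inclusive (day 30 already counts).
INFREQUENT_AFTER_DAYS = 30
ARCHIVE_INSTANT_AFTER_DAYS = 90

# Relative storage price per tier, with Frequent Access normalized to 1.0.
# Assumption: "40% lower" is relative to Frequent Access; "68% lower" is
# relative to Infrequent Access, as the summary states.
RELATIVE_PRICE = {
    "Frequent Access": 1.0,
    "Infrequent Access": 1.0 * (1 - 0.40),
    "Archive Instant Access": 1.0 * (1 - 0.40) * (1 - 0.68),
}


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data file in a table."""
    path: str
    size_gb: float
    idle_days: int


def tier_for_idle_days(idle_days: int) -> str:
    """Return the tier a file would occupy after `idle_days` without access."""
    if idle_days < 0:
        raise ValueError("idle_days must be non-negative")
    if idle_days >= ARCHIVE_INSTANT_AFTER_DAYS:
        return "Archive Instant Access"
    if idle_days >= INFREQUENT_AFTER_DAYS:
        return "Infrequent Access"
    return "Frequent Access"


def relative_storage_cost(files: list[DataFile]) -> float:
    """Sum size * relative price over all files (unitless, for comparison)."""
    return sum(f.size_gb * RELATIVE_PRICE[tier_for_idle_days(f.idle_days)] for f in files)


def compaction_candidates(files: list[DataFile]) -> list[DataFile]:
    """Mirror the rule that compaction only touches Frequent Access files."""
    return [f for f in files if tier_for_idle_days(f.idle_days) == "Frequent Access"]


files = [
    DataFile("hot.parquet", 100.0, idle_days=5),
    DataFile("warm.parquet", 100.0, idle_days=45),
    DataFile("cold.parquet", 100.0, idle_days=200),
]
print(round(relative_storage_cost(files), 1))
print([f.path for f in compaction_candidates(files)])
```

Under these assumptions, three equally sized files spread across the tiers cost roughly 60% of what they would cost if all stayed in Frequent Access, and only the recently accessed file is eligible for compaction.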

aws

Amazon CloudWatch introduces unified data management and analytics for operations, security, and compliance

Amazon CloudWatch has evolved into a unified platform for managing operational, security, and compliance log data, significantly reducing the need for redundant data stores and complex ETL pipelines. By standardizing ingestion through industry-standard formats like OCSF and OpenTelemetry, the service enables seamless cross-source analytics while lowering operational overhead and storage costs. This update allows organizations to move away from fragmented data silos toward a centralized, Iceberg-compatible architecture for deeper technical and business insights.

**Data Ingestion and Schema Normalization**

* Automatically collects AWS-vended logs across accounts and regions via AWS Organizations, including CloudTrail, VPC Flow Logs, WAF access logs, and Route 53 resolver logs.
* Includes pre-built connectors for a wide range of third-party sources, such as endpoint security (CrowdStrike, SentinelOne), identity providers (Okta, Entra ID), and network security (Zscaler, Palo Alto Networks).
* Utilizes managed Open Cybersecurity Schema Framework (OCSF) and OpenTelemetry (OTel) conversion to ensure data consistency across disparate sources.
* Provides built-in processors, such as Grok for custom parsing and field-level operations, to transform and manipulate strings during the ingestion phase.

**Unified Architecture and Cost Optimization**

* Consolidates log management into a single service with built-in governance, eliminating the need to store and maintain duplicate copies of data across different tools.
* Introduces Apache Iceberg-compatible access via Amazon S3 Tables, allowing data to be queried in place by external tools.
* Removes the requirement for complex ETL pipelines by providing a unified data store that is accessible to Amazon Athena, Amazon SageMaker Unified Studio, and other Iceberg-compatible analytics engines.

**Advanced Analytics and Discovery Tools**

* Supports multiple query interfaces, allowing users to interact with logs using natural language, SQL, LogsQL, or PPL (Piped Processing Language).
* The new "Facets" interface enables intuitive filtering by application, account, region, and log type, featuring intelligent parameter inference for cross-account queries.
* Enables the correlation of operational logs with business data from third-party tools like ServiceNow CMDB or GitHub to provide a more comprehensive view of organizational health.

Organizations should leverage these unified management features to consolidate their security and operational monitoring into a single source of truth. By adopting OCSF normalization and the new S3 Tables integration, teams can reduce the technical debt associated with managing multiple log silos while improving their ability to run cross-functional analytics.

naver

Naver TV

This technical session from NAVER ENGINEERING DAY 2025 explores the architectural journey of building a low-latency query system for real-time transaction reports. The project focuses on resolving the tension between high data freshness, massive scalability, and rapid response times for complex, multi-dimensional filtering. By leveraging Apache Iceberg in conjunction with StarRocks’ materialized views, the team established a performant data pipeline that meets the demands of modern business intelligence.

### Challenges in Real-Time Transaction Reporting

* **Query Latency vs. Data Freshness:** Traditional architectures often struggle to provide immediate visibility into transaction data while maintaining sub-second query speeds across diverse filter conditions.
* **High-Dimensional Filtering:** Users require the ability to query reports based on numerous variables, necessitating an engine that can handle complex aggregations without pre-defining every possible index.
* **Scalability Requirements:** The system must handle increasing transaction volumes without degrading performance or requiring significant manual intervention in the underlying storage layer.

### Optimized Architecture with Iceberg and StarRocks

* **Apache Iceberg Integration:** Iceberg serves as the open table format, providing a reliable foundation for managing large-scale data snapshots and ensuring consistency during concurrent reads and writes.
* **StarRocks for Query Acceleration:** The team selected StarRocks as the primary OLAP engine to take advantage of its high-speed vectorized execution and native support for Iceberg tables.
* **Spark-Based Processing:** Apache Spark is utilized for the initial data ingestion and transformation phases, preparing the transaction data for efficient storage and downstream consumption.

### Enhancing Performance via Materialized Views

* **Pre-computed Aggregations:** By implementing Materialized Views, the system pre-calculates intensive transaction summaries, significantly reducing the computational load during active user queries.
* **Automatic Query Rewrite:** The architecture utilizes StarRocks' ability to automatically route queries to the most efficient materialized view, ensuring that even ad-hoc reports benefit from pre-computed results.
* **Balanced Refresh Strategies:** The research focused on optimizing the refresh intervals of these views to maintain high "freshness" while minimizing the overhead on the cluster resources.

The adoption of a modern lakehouse architecture combining Apache Iceberg with a high-performance OLAP engine like StarRocks is a recommended strategy for organizations dealing with high-volume, real-time reporting. This approach effectively decouples storage and compute while providing the low-latency response times necessary for interactive data analysis.
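
The interplay of pre-computed aggregations, query rewrite, and refresh timing can be seen in a toy model. This is a conceptual sketch in plain Python, not StarRocks' engine or the team's pipeline: `Transaction`, `MaterializedView`, `run_report`, and the sample columns are hypothetical, and only `SUM(amount)` is modeled, since sums can safely be rolled up from a finer-grained view.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """Hypothetical base-table row."""
    merchant: str
    day: str
    channel: str
    amount: int


def aggregate(rows, group_by):
    """SUM(amount) grouped by the given columns, via a full scan of `rows`."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(getattr(row, col) for col in group_by)] += row.amount
    return dict(totals)


class MaterializedView:
    """Toy view holding SUM(amount) pre-grouped by a fixed set of columns."""

    def __init__(self, columns):
        self.columns = tuple(columns)
        self.rows = {}

    def refresh(self, base_table):
        # Recompute from scratch; results stay frozen until the next refresh.
        self.rows = aggregate(base_table, self.columns)

    def covers(self, group_by):
        # The view can answer a query if it groups by a subset of its columns.
        return set(group_by) <= set(self.columns)

    def rollup(self, group_by):
        # Re-aggregate the (much smaller) view instead of the base table.
        idx = [self.columns.index(col) for col in group_by]
        totals = defaultdict(int)
        for key, amount in self.rows.items():
            totals[tuple(key[i] for i in idx)] += amount
        return dict(totals)


def run_report(group_by, base_table, view):
    """Stand-in for query rewrite: use the view when it covers the query."""
    if view.covers(group_by):
        return "view", view.rollup(group_by)
    return "base", aggregate(base_table, group_by)


base = [
    Transaction("a", "2025-01-01", "web", 100),
    Transaction("a", "2025-01-02", "app", 50),
    Transaction("b", "2025-01-01", "web", 70),
]
view = MaterializedView(("merchant", "day"))
view.refresh(base)

print(run_report(("merchant",), base, view))  # answered from the view
print(run_report(("channel",), base, view))   # falls back to a base scan

base.append(Transaction("b", "2025-01-02", "web", 30))
print(run_report(("merchant",), base, view))  # stale until the next refresh
view.refresh(base)
print(run_report(("merchant",), base, view))  # fresh again
```

The last two calls show the trade-off behind refresh intervals: between refreshes the view answers quickly but omits the newest transaction, so a shorter interval buys freshness at the cost of more recomputation.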