data-replication

2 posts

aws

Announcing replication support and Intelligent-Tiering for Amazon S3 Tables | AWS News Blog

AWS has expanded Amazon S3 Tables with two features: Intelligent-Tiering for automated cost optimization and cross-Region replication for enhanced data availability. These updates reduce the operational overhead of managing large-scale Apache Iceberg datasets by automating storage lifecycle management and simplifying the architecture required for global data distribution. Together, they let organizations cut storage costs without manual intervention while keeping data consistently accessible across multiple AWS Regions and accounts.

### Cost Optimization with S3 Tables Intelligent-Tiering

This feature automatically moves data between storage tiers based on access frequency, lowering cost without affecting application performance.

* The system uses three low-latency tiers: Frequent Access, Infrequent Access (offering 40% lower costs), and Archive Instant Access (offering 68% lower costs than Infrequent Access).
* Data transitions are automated: data moves to Infrequent Access after 30 days of inactivity and to Archive Instant Access after 90 days.
* Automated table maintenance tasks, such as compaction and snapshot expiration, are optimized to skip colder files. For example, compaction only processes data in the Frequent Access tier, which avoids unnecessary compute and storage costs.
* Users can configure Intelligent-Tiering as the default storage class at the table bucket level using the AWS CLI commands `put-table-bucket-storage-class` and `get-table-bucket-storage-class`.

### Cross-Region and Cross-Account Replication

New replication support lets users maintain synchronized, read-only replicas of their S3 Tables across AWS Regions and accounts.

* Replication maintains chronological consistency and preserves parent-child snapshot relationships, so replicas remain identical to the source for query purposes.
* Replica tables are typically updated within minutes of changes to the source table and support independent encryption and retention policies to meet specific regional compliance requirements.
* The service eliminates the need for complex, custom-built architectures to track metadata transformations or manually sync objects between Iceberg tables.
* This functionality is primarily designed to reduce query latency for geographically distributed teams and to provide data protection for disaster recovery scenarios.

### Practical Implementation

To benefit from these features, organizations should consider setting Intelligent-Tiering as the default storage class at the table bucket level so that new datasets are tiered automatically from the start. For global operations, placing read-only replicas in the Regions closest to end users can reduce query latency for analytics tools such as Amazon Athena and Amazon SageMaker.
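The tiering thresholds and discounts quoted above can be turned into a quick back-of-the-envelope estimate. The following is a minimal sketch, not AWS code: the function names are hypothetical, the tier names are plain strings rather than API identifiers, and it assumes that the 40% discount is measured against Frequent Access and that a transition happens once inactivity reaches the stated number of days.

```python
# Thresholds and discounts come from the summary above.
INFREQUENT_AFTER_DAYS = 30
ARCHIVE_AFTER_DAYS = 90

# Relative storage price, with Frequent Access normalized to 1.0.
# Assumption: the 40% discount is relative to Frequent Access.
FREQUENT_COST = 1.0
INFREQUENT_COST = FREQUENT_COST * (1 - 0.40)
# The summary states the 68% discount relative to Infrequent Access.
ARCHIVE_INSTANT_COST = INFREQUENT_COST * (1 - 0.68)

RELATIVE_COST = {
    "Frequent Access": FREQUENT_COST,
    "Infrequent Access": INFREQUENT_COST,
    "Archive Instant Access": ARCHIVE_INSTANT_COST,
}


def tier_for(days_since_last_access: int) -> str:
    """Return the tier a file would sit in after the given inactivity period."""
    if days_since_last_access >= ARCHIVE_AFTER_DAYS:
        return "Archive Instant Access"
    if days_since_last_access >= INFREQUENT_AFTER_DAYS:
        return "Infrequent Access"
    return "Frequent Access"


def blended_relative_cost(days_inactive_per_file: list[int]) -> float:
    """Average relative cost across files of equal size (a simplification)."""
    costs = [RELATIVE_COST[tier_for(d)] for d in days_inactive_per_file]
    return sum(costs) / len(costs)


if __name__ == "__main__":
    for days in (5, 45, 120):
        tier = tier_for(days)
        print(f"{days:>3} days inactive -> {tier} (x{RELATIVE_COST[tier]:.3f})")
    print(f"blended: x{blended_relative_cost([5, 45, 120]):.3f}")
```

Under these assumptions, Infrequent Access costs 0.6 of the Frequent Access price and Archive Instant Access costs 0.6 × 0.32 = 0.192 of it, roughly 81% below Frequent Access. Actual savings depend on file sizes, access patterns, and current pricing.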

naver

Replacing a DB CDC replication tool that processes

Naver Pay successfully transitioned its core database replication system from a legacy tool to "ergate," a high-performance CDC (Change Data Capture) solution built on Apache Flink and Spring. This strategic overhaul was designed to improve maintainability for backend developers while resolving rigid schema dependencies that previously caused operational bottlenecks. By leveraging a modern stream-processing architecture, the system now manages massive transaction volumes with sub-second latency and enhanced reliability.

### Limitations of the Legacy System

* **Maintenance Barriers:** The previous tool, mig-data, was written in pure Java by database core specialists, making it difficult for standard backend developers to maintain or extend.
* **Strict Schema Dependency:** Developers were forced to follow a rigid DDL execution order (Target DB before Source DB) to avoid replication halts, complicating database operations.
* **Blocking Failures:** Because the legacy system prioritized bi-directional data integrity, a single failed record could stall the entire replication pipeline for a specific shard.
* **Operational Risk:** Recovery procedures were manual and restricted to a small group of specialized personnel, increasing the time-to-recovery during outages.

### Technical Architecture and Stack

* **Apache Flink (LTS 2.0.0):** Selected for its high availability, low latency, and native Kafka integration, allowing the team to focus on replication logic rather than infrastructure.
* **Kubernetes Session Mode:** Used to manage 12 concurrent jobs (6 replication, 6 verification) through a single Job Manager endpoint for streamlined monitoring and deployment.
* **Hybrid Framework Approach:** The team isolated high-speed replication logic within Flink while using Spring (Kotlin) for complex recovery modules to leverage developer familiarity.
* **Data Pipeline:** The system captures MySQL binlogs via `nbase-cdc`, publishes them to Kafka, and uses Flink `jdbc-sink` jobs to apply changes to Target DBs (nBase-T and Oracle).

### Three-Tier Operational Model: Replication, Verification, and Recovery

* **Real-time Replication:** Processes incoming Kafka records and appends custom metadata columns (`ergate_yn`, `rpc_time`) to track the replication source and original commit time.
* **Delayed Verification:** A dedicated "verifier" Flink job consumes the same Kafka topic with a 2-minute delay to check Target DB consistency against the source record.
* **Secondary Logic:** To prevent false positives from rapid updates, the verifier performs a live re-query of the Source DB if a mismatch is initially detected.
* **Multi-Stage Recovery:**
  * **Automatic Short-term:** Retries transient failures after 5 minutes.
  * **Automatic Long-term:** Uses batch processes to resolve persistent discrepancies.
  * **Manual:** Provides an admin interface for developers to trigger targeted reconciliations via API.

### Improvements in Schema Management and Performance

* **DDL Independence:** By implementing query and schema caching, ergate allows Source and Target tables to be updated in any order without halting the pipeline.
* **Performance Scaling:** The new system is designed to handle 10x the current peak QPS, ensuring stability even during high-traffic events like major sales or promotions.
* **Metadata Tracking:** The inclusion of specific replication identifiers allows for clear distinction between automated replication and manual force-sync actions during troubleshooting.

The ergate project demonstrates that a hybrid architecture, combining the high-throughput processing of Apache Flink with the robust logic handling of Spring, is highly effective for mission-critical financial systems. Organizations managing large-scale data replication should consider decoupling complex recovery logic from the main processing stream to ensure both performance and developer productivity.
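The delayed-verification step with its source re-query can be illustrated with a small sketch. This is not ergate's code: the `ChangeEvent` type, the `verify` function, the result strings, and the in-memory dictionaries standing in for the Source and Target DBs are all hypothetical. Only the metadata column names and the overall flow (compare against the event, re-query the source on mismatch, escalate to recovery if the mismatch persists) follow the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Row = dict  # column name -> value

# Columns that replication appends to target rows (named in the summary above).
METADATA_COLUMNS = {"ergate_yn", "rpc_time"}


@dataclass
class ChangeEvent:
    """Hypothetical shape of a change record read from Kafka."""
    key: str
    row: Optional[Row]  # None represents a deleted row


def strip_metadata(row: Optional[Row]) -> Optional[Row]:
    """Drop replication-only columns so rows can be compared to the source."""
    if row is None:
        return None
    return {k: v for k, v in row.items() if k not in METADATA_COLUMNS}


def verify(
    event: ChangeEvent,
    read_target: Callable[[str], Optional[Row]],
    read_source: Callable[[str], Optional[Row]],
) -> str:
    """Return 'ok', 'ok_after_requery', or 'mismatch'."""
    target_row = strip_metadata(read_target(event.key))
    if target_row == event.row:
        return "ok"
    # The row may have been updated again after this event was emitted,
    # so compare against the live source before reporting a problem.
    if target_row == read_source(event.key):
        return "ok_after_requery"
    # A persistent mismatch would be handed to the recovery stages.
    return "mismatch"


if __name__ == "__main__":
    source = {"pay-1": {"amount": 200}}
    target = {"pay-1": {"amount": 200, "ergate_yn": "Y", "rpc_time": 1700000000}}
    stale_event = ChangeEvent("pay-1", {"amount": 100})
    print(verify(stale_event, target.get, source.get))
```

In this example the event carries an older value (`amount` of 100) than the target (200), but the live source also holds 200, so the check resolves to `ok_after_requery` instead of raising a false alarm. Keeping the database reads behind callables makes the comparison logic easy to test without a running cluster.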