apache-airflow

3 posts

aws

AWS Weekly Roundup: AWS Lambda for .NET 10, AWS Client VPN quickstart, Best of AWS re:Invent, and more (January 12, 2026) | AWS News Blog

The AWS Weekly Roundup for January 2026 highlights a significant push toward modernization, headlined by the introduction of .NET 10 support for AWS Lambda and Apache Airflow 2.11 for Amazon MWAA. To encourage exploration of these and other emerging technologies, AWS has revamped its Free Tier to offer new users up to $200 in credits and six months of risk-free experimentation. Together, these updates aim to streamline serverless development, enhance container storage efficiency, and provide more robust authentication options for messaging services.

### Modernized Runtimes and Orchestration

* AWS Lambda now supports .NET 10 as both a managed runtime and a container base image, with AWS applying updates to these environments automatically as they become available.
* Amazon Managed Workflows for Apache Airflow (MWAA) has added support for version 2.11, which serves as a critical stepping stone for users preparing to migrate to Apache Airflow 3 (an upgrade sketch follows this entry).

### Infrastructure and Resource Management

* Amazon ECS has extended support for `tmpfs` mounts to Linux tasks running on AWS Fargate and Managed Instances; this allows developers to use memory-backed file systems for containerized workloads so that sensitive or temporary data is never written to task storage (a task-definition sketch also follows this entry).
* AWS Config has expanded its monitoring capabilities to discover, assess, and audit new resource types across Amazon EC2, Amazon SageMaker, and Amazon S3 Tables.
* A new AWS Client VPN quickstart was released, providing a CloudFormation template and a step-by-step guide to automate the deployment of secure client-to-site VPN connections.

### Security and Messaging Enhancements

* Amazon MQ for RabbitMQ brokers now supports HTTP-based authentication, which can be enabled and managed through the broker’s configuration file.
* RabbitMQ brokers on Amazon MQ also now support certificate-based authentication using mutual TLS (mTLS), improving the security posture of messaging applications.

### Educational Initiatives and Community Events

* New AWS Free Tier accounts now include a six-month trial period featuring $200 in credits and access to over 30 always-free services, specifically targeting developers interested in AI/ML and compute experimentation.
* AWS published a curated "Best of re:Invent 2025" playlist, featuring high-impact sessions and keynotes for those who missed the live event.
* The 2026 AWS Summit season begins shortly, with upcoming events scheduled for Dubai on February 10 and Paris on March 10.

Developers should take advantage of the new .NET 10 Lambda runtime for serverless applications and review the updated ECS `tmpfs` documentation to optimize container performance. For those new to the platform, the expanded Free Tier credits provide an excellent opportunity to prototype AI/ML workloads with minimal financial risk.
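For teams planning the 2.11 stepping-stone upgrade, MWAA supports in-place minor version upgrades through its `UpdateEnvironment` API. Below is a minimal boto3 sketch; the environment name is hypothetical, and the exact Airflow version string MWAA expects (assumed here to be `"2.11.0"`) should be confirmed against the MWAA documentation for your region.

```python
import boto3

mwaa = boto3.client("mwaa")

# Hypothetical environment name; "2.11.0" is an assumed version
# string, so confirm the exact value MWAA lists before running.
response = mwaa.update_environment(
    Name="my-airflow-env",
    AirflowVersion="2.11.0",
)
print("Update started for:", response["Arn"])
```

The point of the 2.11 stop, per the roundup, is to surface deprecated patterns in existing DAGs while they are still warnings, before they become hard errors in Airflow 3.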

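To make the `tmpfs` item concrete, here is a minimal, hypothetical task-definition sketch using boto3; the family, image, mount path, and size are illustrative assumptions, not values from the announcement.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical Fargate task definition with a memory-backed scratch
# mount; names and sizes are illustrative.
ecs.register_task_definition(
    family="scratch-worker",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[
        {
            "name": "worker",
            "image": "public.ecr.aws/docker/library/python:3.12-slim",
            "essential": True,
            "linuxParameters": {
                # tmpfs is memory-backed, so temporary or sensitive
                # files written here never reach task storage.
                "tmpfs": [
                    {
                        "containerPath": "/scratch",
                        "size": 128,  # MiB
                        "mountOptions": ["rw", "noexec", "nosuid"],
                    }
                ]
            },
        }
    ],
)
```

Since the mount is backed by memory, whatever is written to it counts against the task's memory allocation, which is worth factoring into the `memory` setting.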
daangn

won Park": Author. * (opens in new tab)

Daangn’s data governance team addressed the lack of transparency in their data pipelines by building a column-level lineage system using SQL parsing. By analyzing BigQuery query logs with specialized parsing tools, they successfully mapped intricate data dependencies that standard table-level tracking could not capture. This system now enables precise impact analysis and significantly improves data reliability and troubleshooting speed across the organization.

**The Necessity of Column-Level Visibility**

* Table-level lineage, while easily accessible via BigQuery’s `JOBS` view, fails to identify how specific fields, such as PII or calculated metrics, propagate through downstream systems.
* Without granular lineage, the team faced "cascading failures" in which a single pipeline error triggered a chain of broken tables that were difficult to trace manually.
* Schema migrations, such as modifying a source MySQL column, were historically high-risk because the impact on derivative BigQuery tables and columns was unknown.

**Evaluating Extraction Strategies**

* BigQuery’s native `INFORMATION_SCHEMA` was found to be insufficient because it does not support column-level detail and often obscures original source tables when views are involved.
* Frameworks like OpenLineage were considered but rejected due to high operational costs: requiring every team to instrument its own Airflow jobs or notebooks was deemed impractical for a central governance team.
* The team chose a centralized SQL parsing approach, leveraging the fact that nearly all data transformations within the company are executed as SQL queries within BigQuery.

**Technical Implementation and Tech Stack**

* **sqlglot:** This library serves as the core engine, parsing SQL strings into abstract syntax trees (ASTs) to programmatically identify source and destination columns (a minimal parsing sketch follows this entry).
* **Data Collection:** The system pulls raw query text from `INFORMATION_SCHEMA.JOBS` across all Google Cloud projects to ensure comprehensive coverage (a collection sketch also follows).
* **Processing and Orchestration:** Spark handles the parallel processing of massive query logs, while Airflow schedules regular updates to the lineage data.
* **Storage:** The resulting mappings are stored in a centralized BigQuery table (`data_catalog.lineage`), making the dependency map easily accessible for impact analysis and data cataloging.

By centralizing lineage extraction through SQL parsing rather than per-job instrumentation, organizations can achieve comprehensive visibility without placing an integration burden on individual developers. This approach is highly effective for BigQuery-centric environments where SQL is the primary language for data movement and transformation.
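As an illustration of the sqlglot step, the sketch below traces one output column of a hypothetical BigQuery query back to its source columns. The table and column names are invented, and the `lineage` helper's exact signature and node naming may vary between sqlglot versions.

```python
from sqlglot.lineage import lineage

# Hypothetical transformation query of the kind pulled from the
# query logs; tables and columns are illustrative only.
sql = """
SELECT
  u.user_id,
  SUM(o.amount) AS total_spend
FROM `shop.users` AS u
JOIN `shop.orders` AS o ON o.user_id = u.user_id
GROUP BY u.user_id
"""

# Build the lineage graph for the output column `total_spend`.
root = lineage("total_spend", sql, dialect="bigquery")

# Walking the graph visits the output column first, then the
# upstream source columns it was derived from (e.g. o.amount).
for node in root.walk():
    print(node.name)
```

Each (source column, destination column) pair discovered this way becomes one row in the centralized lineage table, which is what makes column-level impact analysis a simple lookup.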

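Upstream of the parser, the raw SQL comes from BigQuery's job metadata. A minimal sketch of that collection step might look like the following, with a hypothetical project name and a one-day lookback; in practice the team runs this across all projects for full coverage.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # hypothetical

# JOBS_BY_PROJECT exposes the raw SQL of recent jobs; the region
# qualifier must match where the jobs actually ran.
sql = """
SELECT
  query,
  destination_table.dataset_id AS dest_dataset,
  destination_table.table_id AS dest_table
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND state = 'DONE'
  AND destination_table.table_id IS NOT NULL
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""

for row in client.query(sql).result():
    # Each row is one candidate query to feed into the SQL parser.
    print(row.dest_dataset, row.dest_table, row.query[:80])
```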
naver

Building Data Lineage-

Naver Webtoon developed "Flow.er," an on-demand data lineage pipeline service designed to overcome the operational inefficiencies and high maintenance costs of legacy data workflows. By integrating dbt for modular modeling and Airflow for scalable orchestration, the platform automates complex backfill and recovery processes while maintaining high data integrity. This shift to a lineage-centric architecture allows the engineering team to manage data as a high-quality product rather than a series of disconnected tasks.

### Challenges in Traditional Data Pipelines

* High operational burdens were caused by manual backfilling and recovery tasks, which became increasingly difficult as data volume and complexity grew.
* Legacy systems lacked transparency in data dependencies, making it hard to predict the downstream impact of code changes or upstream data failures.
* Fragmented development environments led to inconsistencies between local testing and production outputs, slowing down the deployment of new data products.

### Core Architecture and the Role of dbt and Airflow

* dbt serves as the central modeling layer, defining transformations and establishing clear data lineage that maps how information flows between tables.
* Airflow functions as the orchestration engine, utilizing the lineage defined in dbt to trigger tasks in the correct order and manage execution schedules (a minimal orchestration sketch follows this entry).
* Individual development instances provide engineers with isolated environments to test dbt models, ensuring that logic is validated before being merged into the main pipeline.
* The system includes a dedicated model management page and a robust CI/CD pipeline to streamline the transition from development to production.

### Expanding the Platform with Tower and Playground

* "Tower" and "Playground" were introduced as supplementary components to support a broader range of data organizations and facilitate easier experimentation.
* A specialized Partition Checker was developed to enhance data integrity by automatically verifying that all required data partitions are present before downstream processing begins.
* Improvements to the Manager DAG system allow the platform to handle large-scale pipeline deployments across different teams while maintaining a unified view of the data lineage.

### Future Evolution with AI and MCP

* The team is exploring the integration of Model Context Protocol (MCP) servers to bridge the gap between data pipelines and AI applications.
* Future developments focus on utilizing AI agents to further automate pipeline monitoring and troubleshooting, reducing the need for human intervention in routine maintenance.

To build a sustainable and scalable data infrastructure, organizations should transition from simple task scheduling to a lineage-aware architecture. Adopting a framework like Flow.er, which combines the modeling strengths of dbt with the orchestration power of Airflow, enables teams to automate the most labor-intensive parts of data engineering, such as backfills and dependency management, while ensuring the reliability of the final data product.
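Flow.er's internals are not published as code, so the sketch below is only a minimal illustration of the pattern the post describes: an Airflow DAG that gates a dbt run behind a partition-presence check. The DAG id, model name, and the `load_partition_list` stub are all hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def load_partition_list(table: str) -> set[str]:
    # Stub standing in for real partition metadata, e.g. a query
    # against the warehouse's partition information schema.
    return {"2026-01-11", "2026-01-12"}


def check_partitions(**context) -> None:
    # Simplified stand-in for Flow.er's Partition Checker: fail fast
    # if the logical date's partition has not landed yet.
    ds = context["ds"]  # logical date, e.g. "2026-01-12"
    if ds not in load_partition_list("raw.events"):
        raise ValueError(f"missing partition {ds} in raw.events")


with DAG(
    dag_id="flower_style_daily",  # hypothetical
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    gate = PythonOperator(
        task_id="partition_check",
        python_callable=check_partitions,
    )

    # dbt resolves model-to-model lineage itself; the `+` selector
    # rebuilds every upstream model that `daily_metrics` depends on,
    # which is what makes targeted backfills cheap to express.
    run_models = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --select +daily_metrics",
    )

    gate >> run_models
```

With this shape, a recovery run is a standard Airflow backfill over the affected date range plus the same dbt selector, rather than a hand-written recovery script; the lineage encoded in dbt decides what actually gets rebuilt.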