Daangn’s data governance team addressed the lack of transparency in their data pipelines by building a column-level lineage system based on SQL parsing. By analyzing BigQuery query logs with the sqlglot parser, they mapped intricate data dependencies that standard table-level tracking could not capture. The system now enables precise impact analysis and significantly improves data reliability and troubleshooting speed across the organization.

The Necessity of Column-Level Visibility

  • Table-level lineage is easily accessible via BigQuery’s JOBS view (see the sketch after this list), but it fails to show how specific fields, such as PII or calculated metrics, propagate through downstream systems.
  • Without granular lineage, the team faced "cascading failures" where a single pipeline error triggered a chain of broken tables that were difficult to trace manually.
  • Schema migrations, such as modifying a source MySQL column, were historically high-risk because the impact on derivative BigQuery tables and columns was unknown.
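
For contrast with the column-level approach described below, table-level lineage is roughly what the JOBS view already exposes out of the box. A minimal sketch using the google-cloud-bigquery client; the region qualifier, time window, and output format are illustrative assumptions, not details from the original post.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Table-level lineage only: which tables each query job read from and wrote to.
# INFORMATION_SCHEMA.JOBS exposes referenced_tables and destination_table,
# but says nothing about individual columns.
jobs_sql = """
SELECT
  destination_table.dataset_id AS dest_dataset,
  destination_table.table_id   AS dest_table,
  ref.dataset_id               AS src_dataset,
  ref.table_id                 AS src_table
FROM `region-us`.INFORMATION_SCHEMA.JOBS,
  UNNEST(referenced_tables) AS ref
WHERE job_type = 'QUERY'
  AND state = 'DONE'
  AND destination_table.table_id IS NOT NULL
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""

for row in client.query(jobs_sql).result():
    print(f"{row.src_dataset}.{row.src_table} -> {row.dest_dataset}.{row.dest_table}")
```

Every edge this produces is a whole table; a renamed or dropped column is invisible at this granularity, which is exactly the gap described above.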

Evaluating Extraction Strategies

  • BigQuery’s native INFORMATION_SCHEMA was found to be insufficient because it provides no column-level detail and often obscures the original source tables when views are involved.
  • Frameworks like OpenLineage were considered but rejected due to high operational cost: requiring every team to instrument its own Airflow jobs or notebooks was deemed impractical for a central governance team.
  • The team chose a centralized SQL parsing approach, leveraging the fact that nearly all data transformations within the company are executed as SQL queries within BigQuery.

Technical Implementation and Tech Stack

  • sqlglot: This library serves as the core engine, parsing SQL strings into abstract syntax trees (ASTs) to programmatically identify source and destination columns; see the parsing sketch after this list.
  • Data Collection: The system pulls raw query text from INFORMATION_SCHEMA.JOBS across all Google Cloud projects to ensure comprehensive coverage.
  • Processing and Orchestration: Spark handles the parallel processing of the massive query logs, while Airflow schedules regular updates to the lineage data.
  • Storage: The resulting mappings are stored in a centralized BigQuery table (data_catalog.lineage), making the dependency map easily accessible for impact analysis and data cataloging; an end-to-end sketch of the collection-to-storage loop follows this list.
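
A minimal illustration of the sqlglot step, under the assumption of a single flat SELECT; this is not Daangn’s actual parser, and the helper name column_lineage plus the tables in the example query are made up for the sketch. Real queries with CTEs, subqueries, and SELECT * need schema qualification on top of this (sqlglot also ships a dedicated lineage helper for that).

```python
import sqlglot
from sqlglot import exp

def column_lineage(sql: str) -> dict[str, list[str]]:
    """Map each output column of a query to the source columns it references.

    Simplified sketch: handles a single flat SELECT. Production parsing would
    also need to qualify columns against table schemas and resolve CTEs,
    subqueries, and SELECT *.
    """
    tree = sqlglot.parse_one(sql, read="bigquery")
    select = tree.find(exp.Select)
    lineage: dict[str, list[str]] = {}
    for projection in select.expressions:
        out_name = projection.alias_or_name
        sources = [col.sql(dialect="bigquery") for col in projection.find_all(exp.Column)]
        lineage[out_name] = sources
    return lineage

query = """
CREATE OR REPLACE TABLE mart.user_daily AS
SELECT
  u.user_id,
  DATE(e.event_time) AS event_date,
  COUNT(e.event_id) AS event_count
FROM raw.events AS e
JOIN raw.users AS u ON u.user_id = e.user_id
GROUP BY 1, 2
"""

print(column_lineage(query))
# {'user_id': ['u.user_id'], 'event_date': ['e.event_time'], 'event_count': ['e.event_id']}
```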
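Putting the pieces together, a single-process sketch of the collection-to-storage loop, reusing the column_lineage helper from the previous sketch. The post describes running this at scale with Spark for parallel parsing and Airflow for scheduling; only the data_catalog.lineage table name comes from the post, while the query filters and output schema here are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1. Pull raw query text for recent successful query jobs (per project in practice).
jobs_sql = """
SELECT query, destination_table
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_type = 'QUERY'
  AND state = 'DONE'
  AND destination_table.table_id IS NOT NULL
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""

edges = []
for job in client.query(jobs_sql).result():
    dest_table = f"{job.destination_table['dataset_id']}.{job.destination_table['table_id']}"
    # 2. Parse each query into column-level edges with the sqlglot helper above.
    try:
        col_map = column_lineage(job.query)
    except Exception:
        continue  # skip statements the simplified parser cannot handle
    for dest_column, src_columns in col_map.items():
        for src_column in src_columns:
            edges.append({
                "destination_table": dest_table,
                "destination_column": dest_column,
                "source_column": src_column,
            })

# 3. Store the edges in the central lineage table used for impact analysis.
client.insert_rows_json("data_catalog.lineage", edges)
```

In production, the per-query parsing loop is the part Spark would parallelize, while Airflow simply reruns the whole extraction on a schedule, as the post describes.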

By centralizing lineage extraction through SQL parsing rather than per-job instrumentation, organizations can achieve comprehensive visibility without placing an integration burden on individual developers. This approach is highly effective for BigQuery-centric environments where SQL is the primary language for data movement and transformation.