won Park": Author. * (opens in new tab)
Daangn’s data governance team addressed the lack of transparency in their data pipelines by building a column-level lineage system using SQL parsing. By analyzing BigQuery query logs with specialized parsing tools, they successfully mapped intricate data dependencies that standard table-level tracking could not capture. This system now enables precise impact analysis and significantly improves data reliability and troubleshooting speed across the organization.
The Necessity of Column-Level Visibility
- Table-level lineage, while easily accessible via BigQuery’s
JOBSview, fails to identify how specific fields—such as PII or calculated metrics—propagate through downstream systems. - Without granular lineage, the team faced "cascading failures" where a single pipeline error triggered a chain of broken tables that were difficult to trace manually.
- Schema migrations, such as modifying a source MySQL column, were historically high-risk because the impact on derivative BigQuery tables and columns was unknown.
Evaluating Extraction Strategies
- BigQuery’s native
INFORMATION_SCHEMAwas found to be insufficient because it does not support column-level detail and often obscures original source tables when Views are involved. - Frameworks like OpenLineage were considered but rejected due to high operational costs; requiring every team to instrument their own Airflow jobs or notebooks was deemed impractical for a central governance team.
- The team chose a centralized SQL parsing approach, leveraging the fact that nearly all data transformations within the company are executed as SQL queries within BigQuery.
Technical Implementation and Tech Stack
- sqlglot: This library serves as the core engine, parsing SQL strings into Abstract Syntax Trees (AST) to programmatically identify source and destination columns.
- Data Collection: The system pulls raw query text from
INFORMATION_SCHEMA.JOBSacross all Google Cloud projects to ensure comprehensive coverage. - Processing and Orchestration: Spark is utilized to handle the parallel processing of massive query logs, while Airflow schedules regular updates to the lineage data.
- Storage: The resulting mappings are stored in a centralized BigQuery table (
data_catalog.lineage), making the dependency map easily accessible for impact analysis and data cataloging.
By centralizing lineage extraction through SQL parsing rather than per-job instrumentation, organizations can achieve comprehensive visibility without placing an integration burden on individual developers. This approach is highly effective for BigQuery-centric environments where SQL is the primary language for data movement and transformation.