Daangn’s data governance team addressed the lack of transparency in their data pipelines by building a column-level lineage system based on SQL parsing. By analyzing BigQuery query logs with the sqlglot parser, they mapped intricate data dependencies that standard table-level tracking could not capture. The system now enables precise impact analysis and significantly improves data reliability and troubleshooting speed across the organization.

The Necessity of Column-Level Visibility

  • Table-level lineage is easily accessible via BigQuery’s JOBS view (see the sketch after this list), but it fails to show how specific fields, such as PII or calculated metrics, propagate through downstream systems.
  • Without granular lineage, the team faced "cascading failures" where a single pipeline error triggered a chain of broken tables that were difficult to trace manually.
  • Schema migrations, such as modifying a source MySQL column, were historically high-risk because the impact on derivative BigQuery tables and columns was unknown.
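
For contrast with the column-level approach described below, table-level lineage is roughly what the JOBS view already exposes out of the box. A minimal sketch using the google-cloud-bigquery client; the region qualifier, time window, and output format are illustrative assumptions, not details from the original post.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Table-level lineage only: which tables each query job read from and wrote to.
# INFORMATION_SCHEMA.JOBS exposes referenced_tables and destination_table,
# but says nothing about individual columns.
jobs_sql = """
SELECT
  destination_table.dataset_id AS dest_dataset,
  destination_table.table_id   AS dest_table,
  ref.dataset_id               AS src_dataset,
  ref.table_id                 AS src_table
FROM `region-us`.INFORMATION_SCHEMA.JOBS,
  UNNEST(referenced_tables) AS ref
WHERE job_type = 'QUERY'
  AND state = 'DONE'
  AND destination_table.table_id IS NOT NULL
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""

for row in client.query(jobs_sql).result():
    print(f"{row.src_dataset}.{row.src_table} -> {row.dest_dataset}.{row.dest_table}")
```

Every edge this produces is a whole table; a renamed or dropped column is invisible at this granularity, which is exactly the gap described above.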

Evaluating Extraction Strategies

  • BigQuery’s native INFORMATION_SCHEMA was found to be insufficient because it provides no column-level detail and often obscures the original source tables when views are involved.
  • Frameworks like OpenLineage were considered but rejected due to high operational cost: requiring every team to instrument its own Airflow jobs or notebooks was deemed impractical for a central governance team.
  • The team chose a centralized SQL parsing approach, leveraging the fact that nearly all data transformations within the company are executed as SQL queries within BigQuery.

Technical Implementation and Tech Stack

  • sqlglot: This library serves as the core engine, parsing SQL strings into abstract syntax trees (ASTs) to programmatically identify source and destination columns; see the parsing sketch after this list.
  • Data Collection: The system pulls raw query text from INFORMATION_SCHEMA.JOBS across all Google Cloud projects to ensure comprehensive coverage.
  • Processing and Orchestration: Spark handles the parallel processing of the massive query logs, while Airflow schedules regular updates to the lineage data.
  • Storage: The resulting mappings are stored in a centralized BigQuery table (data_catalog.lineage), making the dependency map easily accessible for impact analysis and data cataloging; an end-to-end sketch of the collection-to-storage loop follows this list.
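
A minimal illustration of the sqlglot step, under the assumption of a single flat SELECT; this is not Daangn’s actual parser, and the helper name column_lineage plus the tables in the example query are made up for the sketch. Real queries with CTEs, subqueries, and SELECT * need schema qualification on top of this (sqlglot also ships a dedicated lineage helper for that).

```python
import sqlglot
from sqlglot import exp

def column_lineage(sql: str) -> dict[str, list[str]]:
    """Map each output column of a query to the source columns it references.

    Simplified sketch: handles a single flat SELECT. Production parsing would
    also need to qualify columns against table schemas and resolve CTEs,
    subqueries, and SELECT *.
    """
    tree = sqlglot.parse_one(sql, read="bigquery")
    select = tree.find(exp.Select)
    lineage: dict[str, list[str]] = {}
    for projection in select.expressions:
        out_name = projection.alias_or_name
        sources = [col.sql(dialect="bigquery") for col in projection.find_all(exp.Column)]
        lineage[out_name] = sources
    return lineage

query = """
CREATE OR REPLACE TABLE mart.user_daily AS
SELECT
  u.user_id,
  DATE(e.event_time) AS event_date,
  COUNT(e.event_id) AS event_count
FROM raw.events AS e
JOIN raw.users AS u ON u.user_id = e.user_id
GROUP BY 1, 2
"""

print(column_lineage(query))
# {'user_id': ['u.user_id'], 'event_date': ['e.event_time'], 'event_count': ['e.event_id']}
```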
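Putting the pieces together, a single-process sketch of the collection-to-storage loop, reusing the column_lineage helper from the previous sketch. The post describes running this at scale with Spark for parallel parsing and Airflow for scheduling; only the data_catalog.lineage table name comes from the post, while the query filters and output schema here are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1. Pull raw query text for recent successful query jobs (per project in practice).
jobs_sql = """
SELECT query, destination_table
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_type = 'QUERY'
  AND state = 'DONE'
  AND destination_table.table_id IS NOT NULL
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""

edges = []
for job in client.query(jobs_sql).result():
    dest_table = f"{job.destination_table['dataset_id']}.{job.destination_table['table_id']}"
    # 2. Parse each query into column-level edges with the sqlglot helper above.
    try:
        col_map = column_lineage(job.query)
    except Exception:
        continue  # skip statements the simplified parser cannot handle
    for dest_column, src_columns in col_map.items():
        for src_column in src_columns:
            edges.append({
                "destination_table": dest_table,
                "destination_column": dest_column,
                "source_column": src_column,
            })

# 3. Store the edges in the central lineage table used for impact analysis.
client.insert_rows_json("data_catalog.lineage", edges)
```

In production, the per-query parsing loop is the part Spark would parallelize, while Airflow simply reruns the whole extraction on a schedule, as the post describes.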

By centralizing lineage extraction through SQL parsing rather than per-job instrumentation, organizations can achieve comprehensive visibility without placing an integration burden on individual developers. This approach is highly effective for BigQuery-centric environments where SQL is the primary language for data movement and transformation.