apache-spark

4 posts

daangn

won Park": Author. * (opens in new tab)

Daangn's data governance team addressed the lack of transparency in their data pipelines by building a column-level lineage system based on SQL parsing. By analyzing BigQuery query logs with specialized parsing tools, they mapped intricate data dependencies that standard table-level tracking could not capture. The system now enables precise impact analysis and significantly improves data reliability and troubleshooting speed across the organization.

**The Necessity of Column-Level Visibility**

* Table-level lineage, while easily accessible via BigQuery's `JOBS` view, cannot show how specific fields, such as PII or calculated metrics, propagate through downstream systems.
* Without granular lineage, the team faced "cascading failures," where a single pipeline error triggered a chain of broken tables that were difficult to trace manually.
* Schema migrations, such as modifying a source MySQL column, were historically high-risk because the impact on derivative BigQuery tables and columns was unknown.

**Evaluating Extraction Strategies**

* BigQuery's native `INFORMATION_SCHEMA` proved insufficient: it does not expose column-level detail, and it often obscures the original source tables when views are involved.
* Frameworks like OpenLineage were considered but rejected due to high operational cost; requiring every team to instrument its own Airflow jobs or notebooks was impractical for a central governance team.
* The team instead chose a centralized SQL parsing approach, leveraging the fact that nearly all data transformations in the company run as SQL queries in BigQuery.

**Technical Implementation and Tech Stack**

* **sqlglot:** This library serves as the core engine, parsing SQL strings into abstract syntax trees (ASTs) so that source and destination columns can be identified programmatically (see the sketch below).
* **Data Collection:** The system pulls raw query text from `INFORMATION_SCHEMA.JOBS` across all Google Cloud projects to ensure comprehensive coverage.
* **Processing and Orchestration:** Spark parallelizes the parsing of massive query logs, while Airflow schedules regular updates to the lineage data.
* **Storage:** The resulting mappings are stored in a centralized BigQuery table (`data_catalog.lineage`), making the dependency map easily accessible for impact analysis and data cataloging.

By centralizing lineage extraction through SQL parsing rather than per-job instrumentation, organizations can achieve comprehensive visibility without placing an integration burden on individual developers. The approach is especially effective for BigQuery-centric environments where SQL is the primary language for data movement and transformation.
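The post itself doesn't include code, but the core mechanic, walking each query's AST to map output columns back to physical source columns, can be sketched with sqlglot's lineage helpers. Everything below is illustrative: the query text, table names, and column names are invented, and in the real pipeline the input would be query strings pulled from `INFORMATION_SCHEMA.JOBS`.

```python
# A minimal sketch (not Daangn's actual code) of column-level lineage
# extraction with sqlglot. The query and table names are hypothetical.
import sqlglot
from sqlglot import exp
from sqlglot.lineage import lineage

query = """
SELECT
    u.user_id,
    COUNT(o.order_id) AS order_count
FROM raw.users AS u
JOIN raw.orders AS o
  ON u.user_id = o.user_id
GROUP BY u.user_id
"""

# Parse once to enumerate the query's output columns.
ast = sqlglot.parse_one(query, read="bigquery")
output_columns = [s.alias_or_name for s in ast.selects]

# For each output column, walk its lineage graph down to the leaf nodes
# that sit on physical tables; those are the upstream source columns.
for column in output_columns:
    node = lineage(column, query, dialect="bigquery")
    sources = sorted(
        n.name for n in node.walk() if isinstance(n.expression, exp.Table)
    )
    print(column, "<-", sources)
```

At scale, each (destination column, source column) pair emitted by a worker like this would be appended to the central `data_catalog.lineage` table, and impact analysis becomes a recursive lookup over that table.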

line

Introducing a New A/B Testing System

LY Corporation has developed an advanced A/B testing system that moves beyond simple random assignment to support dynamic user segmentation. By integrating a dedicated targeting system with a high-performance experiment assigner, the platform enables precise experiments tailored to specific user characteristics and behaviors. This architecture supports data-driven decisions that are relevant to localized or specialized user groups rather than relying on broad averages.

## Limitations of Traditional A/B Testing

* General A/B testing systems typically rely on random assignment, such as applying a hash function to a user ID (`hash(id) % 2`), which is simple and cost-effective.
* While random assignment reduces selection bias, it is insufficient for hypotheses that apply only to specific cohorts, such as "iOS users living in Osaka."
* Advanced systems solve this by shifting from general testing across the entire user base to personalized testing for specific segments.

## Architecture of the Targeting System

* The system processes massive datasets stored in HDFS, including user information, mobile device data, and application activity.
* Apache Spark executes complex set operations, such as unions, intersections, and subtractions, to refine user segments.
* Segment data is written to object storage and then cached in Redis under a `{user_id}-{segment_id}` key format to ensure low-latency lookups during live requests.

## A/B Test Management and Assignment

* The system uses Central Dogma as a configuration repository where operators and administrators define experiment parameters.
* A test group assigner orchestrates the process: when a client makes a request, the assigner retrieves the experiment configuration and checks the user's segment membership in Redis (see the sketch below).
* Once a user is assigned to a group (e.g., test group 1), the system serves the corresponding content and logs the event to a data store for dashboard visualization and analysis.

## Strategic Use Cases and Future Plans

* **Content Recommendation:** Testing different machine learning models to see which performs better for a specific user demographic.
* **Targeted Incentives:** Limiting shopping discount experiments to "light users," since coupons may not significantly change the behavior of "heavy users."
* **Onboarding Optimization:** Restricting UI tests to new users only, so that existing users' experiences remain uninterrupted.
* **Platform Expansion:** Future goals include building a unified admin interface for the entire experiment lifecycle and expanding the system to cover all services within LY Corporation.

For organizations looking to optimize user experience, transitioning from random assignment to dynamic segmentation is essential for high-precision product development. Caching segment data in a high-performance store like Redis is critical to maintaining low latency when serving experimental variations in real time.
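The assignment flow above can be condensed into a small sketch: a Redis membership check using the article's `{user_id}-{segment_id}` key scheme, followed by a deterministic hash split in the spirit of `hash(id) % 2`. Hostnames, key contents, and function names are assumptions, not LY Corporation's actual implementation.

```python
# A minimal sketch of segment-gated A/B assignment, assuming segment
# membership was precomputed by Spark and cached in Redis under
# "{user_id}-{segment_id}" keys (per the article). Other details invented.
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)

def in_segment(user_id: str, segment_id: str) -> bool:
    # O(1) membership lookup against the precomputed segment cache.
    return r.exists(f"{user_id}-{segment_id}") == 1

def assign_group(user_id: str, experiment_id: str, n_groups: int = 2) -> int:
    # Deterministic, uniform split. Salting with the experiment ID keeps
    # a user's group independent across experiments, which a bare
    # hash(id) % 2 would not guarantee.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_groups

user_id = "u12345"
if in_segment(user_id, "ios-osaka"):          # only target the cohort
    group = assign_group(user_id, "exp-42")   # then split within it
    print(f"{user_id} -> test group {group}")
```

Keeping the membership check in Redis means the assigner adds only a single key lookup to the request path; all of the heavy set algebra stays offline in Spark.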

naver

Iceberg Low-Latency Queries with Materialized Views

This technical session from NAVER ENGINEERING DAY 2025 explores the architectural journey of building a low-latency query system for real-time transaction reports. The project focuses on resolving the tension between high data freshness, massive scalability, and rapid response times for complex, multi-dimensional filtering. By pairing Apache Iceberg with StarRocks materialized views, the team established a performant data pipeline that meets the demands of modern business intelligence.

### Challenges in Real-Time Transaction Reporting

* **Query Latency vs. Data Freshness:** Traditional architectures struggle to provide immediate visibility into transaction data while maintaining sub-second query speeds across diverse filter conditions.
* **High-Dimensional Filtering:** Users need to query reports on numerous variables, which requires an engine that can handle complex aggregations without pre-defining every possible index.
* **Scalability Requirements:** The system must absorb growing transaction volumes without degrading performance or requiring significant manual intervention in the underlying storage layer.

### Optimized Architecture with Iceberg and StarRocks

* **Apache Iceberg Integration:** Iceberg serves as the open table format, providing a reliable foundation for managing large-scale data snapshots and ensuring consistency during concurrent reads and writes.
* **StarRocks for Query Acceleration:** The team selected StarRocks as the primary OLAP engine for its high-speed vectorized execution and native support for Iceberg tables.
* **Spark-Based Processing:** Apache Spark handles the initial data ingestion and transformation, preparing the transaction data for efficient storage and downstream consumption.

### Enhancing Performance via Materialized Views

* **Pre-computed Aggregations:** Materialized views pre-calculate expensive transaction summaries, significantly reducing the computational load during active user queries (a sketch of such a view follows below).
* **Automatic Query Rewrite:** StarRocks automatically routes queries to the most efficient materialized view, so even ad-hoc reports benefit from pre-computed results.
* **Balanced Refresh Strategies:** The team tuned the refresh intervals of these views to maintain high freshness while minimizing overhead on cluster resources.

The adoption of a modern lakehouse architecture combining Apache Iceberg with a high-performance OLAP engine like StarRocks is a recommended strategy for organizations dealing with high-volume, real-time reporting. This approach effectively decouples storage and compute while providing the low-latency response times necessary for interactive data analysis.
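The session doesn't include DDL, but an asynchronous StarRocks materialized view over an external Iceberg table, created here through StarRocks' MySQL-compatible frontend, might look roughly like the sketch below. The catalog, database, table, and column names, and the five-minute refresh interval, are all invented for illustration.

```python
# A minimal sketch: defining an async materialized view in StarRocks over
# an Iceberg table. All names and the refresh interval are hypothetical;
# StarRocks is reached via its MySQL protocol on the frontend (port 9030).
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030, user="analyst")

ddl = """
CREATE MATERIALIZED VIEW reports.txn_daily_mv
REFRESH ASYNC EVERY (INTERVAL 5 MINUTE)
AS
SELECT
    merchant_id,
    DATE_TRUNC('day', txn_ts) AS txn_date,
    SUM(amount)               AS total_amount,
    COUNT(*)                  AS txn_count
FROM iceberg_catalog.reports.transactions
GROUP BY merchant_id, DATE_TRUNC('day', txn_ts)
"""

with conn.cursor() as cur:
    cur.execute(ddl)
```

Because StarRocks supports transparent query rewrite, an ad-hoc report that groups the base Iceberg table by `merchant_id` and day can be served from this view automatically, which is the "automatic query rewrite" behavior described above; the refresh interval is the knob that trades freshness against cluster load.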

netflix

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale

Netflix's Muse platform has evolved from a simple dashboard into a high-scale Online Analytical Processing (OLAP) system that processes trillions of rows to provide creative insights for promotional media. To meet growing demands for complex audience affinity analysis and advanced filtering, the engineering team modernized the data serving layer, moving beyond basic batch pipelines. By integrating HyperLogLog sketches for approximate counting and leveraging in-memory precomputed aggregates, the system now delivers low-latency performance and high data accuracy at immense scale.

### Approximate Counting with HyperLogLog (HLL) Sketches

To track metrics like unique impressions and qualified plays without the massive overhead of comparing billions of profile IDs, Muse uses the Apache DataSketches library.

* The system trades a small margin of error (approximately 0.8% at an lgK of 17) for significant gains in processing speed and memory efficiency.
* Sketches are built during Druid ingestion using the HLLSketchBuild aggregator with rollup enabled to reduce data volume.
* In the Spark ETL process, all-time aggregates are maintained by merging each day's new HLL sketches into the existing ones with the `hll_union` function (see the sketch below).

### Utilizing Hollow for In-Memory Aggregates

To reduce query load on the Druid cluster, Netflix uses Hollow, its open-source library for high-density, near-cache datasets.

* Muse stores precomputed, all-time aggregates, such as lifetime impressions per asset, in Hollow's in-memory data structures.
* When a user requests "all-time" data, the application reads the results from the Hollow cache instead of forcing Druid to scan months or years of historical segments.
* This approach significantly lowers latency for the most common queries and frees Druid resources for more complex, dynamic filtering.

### Optimizing the Druid Data Layer

Efficient data retrieval from Druid is critical for supporting the application's advanced grouping and filtering capabilities.

* The team moved from hash-based partitioning to range-based partitioning on frequently filtered dimensions such as `video_id`, improving data locality and segment pruning.
* Background compaction tasks merge small segments into larger ones, reducing metadata overhead and improving scan speeds across the cluster.
* The Druid brokers and historical nodes were tuned, including processing thread counts and buffer sizes, to handle the high-concurrency demands of the Muse UI.

### Validation and Data Accuracy

Because the move to HLL sketches introduces approximation, the team implemented rigorous validation to ensure the data remained actionable.

* Internal debugging tools compare results from the new architecture against the "ground truth" produced by the legacy batch systems.
* Continuous monitoring ensures that HLL error rates stay within the expected 1–2% range and that data remains consistent across different time grains.

For organizations building large-scale OLAP applications, the Muse architecture demonstrates that performance bottlenecks can often be solved by combining approximate data structures with specialized in-memory caches to offload heavy computations from the primary database.
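The `hll_union` merge that maintains the all-time aggregates can be sketched with the Apache DataSketches-backed HLL functions that ship in Spark 3.5+; Netflix's actual ETL may differ, and the table and column names below are assumed.

```python
# A minimal PySpark sketch of the all-time HLL rollup described above.
# Assumes Spark >= 3.5 (built-in DataSketches HLL functions); the
# muse.* tables and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("muse-hll-rollup").getOrCreate()

# Build one sketch per asset from today's impressions. lgK = 17 matches
# the ~0.8% error bound quoted in the post.
daily = (
    spark.table("muse.daily_impressions")
         .groupBy("asset_id")
         .agg(F.hll_sketch_agg("profile_id", 17).alias("sketch"))
)

# Merge today's sketches into the running all-time sketches. Stacking the
# two sides and re-aggregating also covers assets first seen today, which
# a two-argument hll_union after an outer join would need null handling for.
alltime = spark.table("muse.alltime_sketches")  # columns: asset_id, sketch
merged = (
    alltime.unionByName(daily)
           .groupBy("asset_id")
           .agg(F.hll_union_agg("sketch").alias("sketch"))
)

# Unique counts are only materialized at read time, from the sketch.
merged.select("asset_id", F.hll_sketch_estimate("sketch").alias("uniques")).show()
```

Sketch merging is associative and cheap, so the rollup never rescans history; the approximation error is paid once, at estimate time, which is exactly what the validation tooling described above monitors.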