coupang

Coupang SCM Workflow: Developing

Coupang has developed an internal SCM Workflow platform to streamline the complex data and operational needs of its Supply Chain Management team. By implementing low-code and no-code functionalities, the platform enables developers, data scientists, and business analysts to build data pipelines and launch services without the traditional bottlenecks of manual development.

### Addressing Inefficiencies in SCM Data Management

* The SCM team manages a massive network of suppliers and fulfillment centers (FCs) where demand forecasting and inventory distribution require constant data feedback.
* Traditionally, non-technical stakeholders like business analysts (BAs) relied heavily on developers to build or modify data pipelines, leading to high communication costs and slower response times to changing business requirements.
* The new platform aims to simplify the complexity found in traditional tools like Jenkins, Airflow, and Jupyter Notebooks, providing a unified interface for data creation and visualization.

### Democratizing Access with the No-code Data Builder

* The "Data Builder" allows users to perform data queries, extraction, and system integration through a visual interface rather than writing backend code.
* It provides seamless access to a wide array of data sources used across Coupang, including Redshift, Hive, Presto, Aurora, MySQL, Elasticsearch, and S3.
* Users can construct workflows by creating "nodes" for specific tasks, such as extracting inventory data from Hive or calculating transfer quantities, and linking them together to automate complex decisions like inter-center product transfers.

### Expanding Capabilities through Low-code Service Building

* The platform functions as a "Service Builder," allowing users to expand domains and launch simple services without building entirely new infrastructure from scratch.
* This approach enables developers to focus on high-level algorithm development while allowing data scientists to apply and test new models directly within the production environment.
* By reducing the need for code changes to reflect new requirements, the platform significantly increases the agility of the SCM pipeline.

Organizations managing complex, data-driven ecosystems can significantly reduce operational friction by adopting low-code/no-code platforms. Empowering non-technical stakeholders to handle data processing and service integration not only accelerates innovation but also allows engineering resources to be redirected toward core architectural challenges.
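
The node-and-link model described for the Data Builder can be pictured as a small dependency graph in which each node runs once its upstream nodes have finished. The sketch below is only an illustration under that assumption, not Coupang's implementation: the `Node` class, the `run_workflow` function, and the inventory and demand figures are all hypothetical.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


@dataclass
class Node:
    """One task in a workflow; `deps` names the nodes whose output it consumes."""
    name: str
    task: Callable[[Dict[str, object]], object]
    deps: List[str] = field(default_factory=list)


def run_workflow(nodes: List[Node]) -> Dict[str, object]:
    """Run every node after its dependencies and return all outputs by node name."""
    by_name = {n.name: n for n in nodes}
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    order = TopologicalSorter({n.name: set(n.deps) for n in nodes}).static_order()
    results: Dict[str, object] = {}
    for name in order:
        results[name] = by_name[name].task(results)
    return results


# Hypothetical workflow: decide how many units to move from FC_A to FC_B.
nodes = [
    Node("extract_inventory", lambda r: {"FC_A": 120, "FC_B": 20}),
    Node("extract_demand", lambda r: {"FC_A": 50, "FC_B": 70}),
    Node(
        "transfer_qty",
        # The surplus at FC_A is capped by the shortfall at FC_B.
        lambda r: min(
            r["extract_inventory"]["FC_A"] - r["extract_demand"]["FC_A"],
            r["extract_demand"]["FC_B"] - r["extract_inventory"]["FC_B"],
        ),
        deps=["extract_inventory", "extract_demand"],
    ),
]

results = run_workflow(nodes)
print(results["transfer_qty"])
```

In a visual builder, the `deps` lists would correspond to the links a user draws between nodes, and the lambdas would be replaced by configured queries against sources such as Hive.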

coupang

Optimizing Logistics Receiving Processes Using Machine

Coupang has implemented a machine learning-based prediction system to optimize its logistics inbound process by accurately forecasting the number of trucks required for product deliveries. By analyzing historical logistics data and vendor characteristics, the system minimizes resource waste at fulfillment center docks and prevents operational delays caused by slot shortages. This data-driven approach ensures that limited dock slots are allocated efficiently, improving overall supply chain speed and reliability.

### Challenges in Inbound Logistics

* Fulfillment centers operate with a fixed number of "docks" for unloading and specific time "slots" assigned to each truck.
* Inaccurate predictions create a resource dilemma: under-estimating slots causes unloading delays and backlogs, while over-estimating leads to idle docks and wasted capacity.
* The goal was to move beyond manual estimation to an automated system that balances vendor requirements with actual facility throughput.

### Feature Engineering and Data Collection

* The team performed Exploratory Data Analysis (EDA) on approximately 800,000 instances of inbound data collected over two years.
* In-depth interviews with domain experts and logistics managers were conducted to identify hidden patterns and qualitative factors that influence truck requirements.
* Final feature sets were refined through feature engineering, focusing on vendor-specific behaviors and the physical characteristics of the products being delivered.

### LightGBM Implementation and Optimization

* The LightGBM algorithm was selected due to its high performance with large datasets and its efficiency in handling categorical features.
* The model utilizes a leaf-wise tree growth strategy, which allows for faster training speeds and lower loss compared to traditional level-wise growth algorithms.
* Hyperparameters were optimized using Bayesian Optimization, a method that finds the most effective model configurations more efficiently than traditional grid search methods.
* The trained model is integrated directly into the booking system, providing real-time truck quantity recommendations to vendors during the application process.

### Operational Trade-offs and Results

* The system must navigate the trade-off between under-prediction (which risks logistical bottlenecks) and over-prediction (which risks resource waste).
* By automating the prediction of necessary slots, Coupang has reduced the manual workload for vendors and improved the accuracy of fulfillment center scheduling.
* This optimization allows for more products to be processed in a shorter time frame, directly contributing to faster delivery times for the end customer.

By replacing manual estimates with a LightGBM-based predictive model, Coupang has successfully synchronized vendor deliveries with fulfillment center capacity. This technical shift not only maximizes dock utilization but also builds a more resilient and scalable inbound supply chain.
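
One way to reason about the under-prediction versus over-prediction trade-off is an asymmetric loss. The summary does not say which objective Coupang trained with, so the pinball (quantile) loss below is an assumption chosen purely for illustration, and the truck counts are made up.

```python
def pinball_loss(actual, predicted, quantile=0.8):
    """Mean pinball (quantile) loss.

    With quantile > 0.5, predicting too few trucks costs more than
    predicting too many by the same amount.
    """
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        diff = y - y_hat
        # diff >= 0 means under-prediction; diff < 0 means over-prediction.
        total += quantile * diff if diff >= 0 else (quantile - 1) * diff
    return total / len(actual)


actual_trucks = [3, 5, 2, 4]          # hypothetical observed truck counts
under = [2, 4, 1, 3]                  # one truck too few every time
over = [4, 6, 3, 5]                   # one truck too many every time

under_loss = pinball_loss(actual_trucks, under)
over_loss = pinball_loss(actual_trucks, over)
print(under_loss, over_loss)
```

With `quantile=0.8`, an under-prediction of one truck is penalized four times as heavily as an over-prediction of one truck. LightGBM exposes a comparable built-in objective (`objective="quantile"` with the `alpha` parameter), although whether the production model uses it is not stated.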

coupang

Accelerating ML development through Cou

Coupang’s internal Machine Learning (ML) platform serves as a standardized ecosystem designed to accelerate the transition from experimental research to stable production services. By centralizing core functions like automated pipelines, feature engineering, and scalable inference, the platform addresses the operational complexities of managing ML at an enterprise scale. This infrastructure allows engineers to focus on model innovation rather than manual resource management, ultimately driving efficiency across Coupang’s diverse service offerings.

### Addressing Scalability and Development Bottlenecks

* The platform aims to drastically reduce "Time to Market" by providing "ready-to-use" services that eliminate the need for engineers to build custom infrastructure for every model.
* Integrating Continuous Integration and Continuous Deployment (CI/CD) into the ML lifecycle ensures that updates to data, code, and models are handled with the same rigor as traditional software engineering.
* By optimizing ML computing resources, the platform allows for the efficient scaling of training and inference workloads, preventing infrastructure costs from spiraling as the number of models grows.

### Core Services of the ML Platform

* **Notebooks and Pipelines:** Integrated Jupyter environments allow for ad-hoc exploration, while workflow orchestration tools enable the construction of reproducible ML pipelines.
* **Feature Engineering:** A dedicated feature store facilitates the reuse of data components and ensures consistency between the features used during model training and those used in real-time inference.
* **Scalable Training and Inference:** The platform provides dedicated clusters for high-performance model training and robust hosting services for real-time and batch model predictions.
* **Monitoring and Observability:** Automated tools track model performance and data drift in production, alerting engineers when a model’s accuracy begins to degrade due to changing real-world data.

### Real-World Success in Search and Pricing

* **Search Query Understanding:** The platform enabled the training of Ko-BERT (Korean Bidirectional Encoder Representations from Transformers), significantly improving the accuracy of search results by better understanding customer intent.
* **Real-time Dynamic Pricing:** Using the platform’s low-latency inference services, Coupang can predict and adjust product prices in real-time based on fluctuating market conditions and inventory levels.

To maintain a competitive edge in e-commerce, organizations should transition away from fragmented, ad-hoc ML workflows toward a unified platform that treats ML as a first-class citizen of the software development lifecycle. Investing in such a platform not only speeds up deployment but also ensures the long-term reliability and observability of production models.
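
The data-drift monitoring mentioned above can be made concrete with a simple statistic. The summary does not name the metric the platform uses, so the Population Stability Index (PSI) below is an assumed stand-in, and the `psi` function and sample values are hypothetical.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate case: every reference value is identical

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(1000)]   # reference feature values
stable = list(training)                     # production matches training
shifted = [v + 3 for v in training]         # production values drifted upward

stable_psi = psi(training, stable)
shifted_psi = psi(training, shifted)
print(stable_psi, shifted_psi)
```

As a rule of thumb, PSI values above about 0.25 are often treated as a significant shift; an alerting job could compare each feature's PSI against a threshold of that kind.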

coupang

Optimizing the inbound process with a machine learning model

Coupang optimized its fulfillment center inbound process by implementing a machine learning model to predict the exact number of delivery trucks and dock slots required for vendor shipments. By moving away from manual estimates, the system minimizes resource waste from over-allocation while preventing processing delays caused by under-prediction. This automated approach ensures that the limited capacity of fulfillment center docks is utilized with maximum efficiency.

### The Challenges of Dock Slot Allocation

* Fulfillment centers operate with a fixed number of hourly "slots," representing the time and space a single truck occupies at a dock to unload goods.
* Inaccurate slot forecasting creates a binary risk: under-prediction leads to logistical bottlenecks and delivery delays, while over-prediction results in idle docks and wasted operational overhead.
* The diversity of vendor behaviors and product types makes manual estimation of truck requirements highly inconsistent across the supply chain.

### Predictive Modeling and Feature Engineering

* Coupang utilized years of historical logistics data to extract features influencing truck counts, including product dimensions, categories, and vendor-specific shipment patterns.
* The system employs the LightGBM algorithm, a gradient-boosting framework selected for its high performance and ability to handle large-scale tabular logistics data.
* Hyperparameter tuning is managed via Bayesian optimization, which efficiently searches the parameter space to minimize prediction error.
* The model accounts for the inherent trade-off between under-prediction and over-prediction, prioritizing a balance that maintains high throughput without straining labor resources.

### System Integration and Real-time Processing

* The trained ML model is integrated directly into the inbound reservation system, providing vendors with an immediate prediction of required slots during the request process.
* By automating the truck-count calculation, the system removes the burden of estimation from vendors and ensures consistency across different fulfillment centers.
* This integration allows Coupang to dynamically adjust its dock capacity planning based on real-time data rather than static, historical averages.

To maximize logistics efficiency, organizations should leverage granular product data and historical vendor behavior to automate capacity planning. Integrating predictive models directly into the reservation workflow ensures that data-driven insights are applied at the point of action, reducing human error and resource waste.
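
The integration step, where a prediction is applied at the moment a vendor requests a reservation, might look roughly like the sketch below. Everything here is hypothetical: `predict_trucks` is a fixed-capacity heuristic standing in for the trained model, and `reserve_slots`, the pallet counts, and the slot table are invented for illustration.

```python
import math
from typing import Dict, Optional


def predict_trucks(pallets: int, pallets_per_truck: int = 20) -> int:
    """Stand-in for the trained model: a fixed-capacity heuristic, for illustration only."""
    return max(1, math.ceil(pallets / pallets_per_truck))


def reserve_slots(free_slots: Dict[int, int], hour: int, pallets: int) -> Optional[int]:
    """Book one slot per predicted truck at `hour`.

    Returns the number of trucks booked, or None if the hour lacks capacity.
    """
    trucks = predict_trucks(pallets)
    if free_slots.get(hour, 0) < trucks:
        return None
    free_slots[hour] -= trucks
    return trucks


free_slots = {9: 4, 10: 2}                          # hour of day -> free dock slots
first = reserve_slots(free_slots, 9, pallets=45)    # needs 3 trucks, 4 slots free
second = reserve_slots(free_slots, 9, pallets=30)   # needs 2 trucks, 1 slot left
print(first, second, free_slots)
```

Because the vendor never enters a truck count, the same estimate is applied consistently at every fulfillment center, which is the consistency benefit the summary describes.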

coupang

Accelerating Coupang’s AI Journey with LLMs

Coupang is strategically evolving its machine learning infrastructure to integrate Large Language Models (LLMs) and foundation models across its e-commerce ecosystem. By transitioning from task-specific deep learning models to multi-modal transformers, the company aims to enhance customer experiences in search, recommendations, and logistics. This shift necessitates a robust ML platform capable of handling the massive compute, networking, and latency demands inherent in generative AI.

### Core Machine Learning Domains

Coupang’s existing ML ecosystem is built upon three primary pillars that drive business logic:

* **Recommendation Systems:** These models leverage vast datasets of user interactions, including clicks, purchases, and relevance judgments, to power home feeds, search results, and advertising.
* **Content Understanding:** Utilizing deep learning to process product catalogs, user reviews, and merchant data to create unified representations of customers and products.
* **Forecasting Models:** Predictive algorithms manage over 100 fulfillment centers, optimizing pricing and logistics for millions of products through a mix of statistical methods and deep learning.

### Enhancing Multimodal and Language Understanding

The adoption of Foundation Models (FM) has unified previously fragmented ML tasks, particularly in multilingual environments:

* **Joint Modeling:** Instead of separate embeddings, vision and language transformer models jointly model product images and metadata (titles/descriptions) to improve ad retrieval and similarity searches.
* **Cross-Border Localization:** LLMs facilitate the translation of product titles from Korean to Mandarin and improve the quality of shopping feeds for global sellers.
* **Weak Label Generation:** To overcome the high cost of human labeling in multiple languages, Coupang uses LLMs to generate high-quality "weak labels" for training downstream models, addressing label scarcity in under-resourced segments.

### Infrastructure for Large-Scale Training

Scaling LLM training requires a shift in hardware architecture and distributed computing strategies:

* **High-Performance Clusters:** The platform utilizes H100 and A100 GPU clusters interconnected with high-speed InfiniBand or RoCE (RDMA over Converged Ethernet) networking to minimize communication bottlenecks.
* **Distributed Frameworks:** To fit massive models into GPU memory, Coupang employs various parallelism techniques, including Fully Sharded Data Parallelism (FSDP), Tensor Parallelism (TP), and Pipeline Parallelism (PP).
* **Efficient Categorization:** Traditional architectures that required a separate model for every product category are being replaced by a single, massive multi-modal transformer capable of handling categorization and attribute extraction across the entire catalog.

### Optimizing LLM Serving and Inference

The transition to real-time generative AI features requires significant optimizations to manage the high computational cost of inference:

* **Quantization Strategies:** To reduce memory footprint and increase throughput, models are compressed using FP8, INT8, or INT4 precision without significant loss in accuracy.
* **Advanced Serving Techniques:** The platform implements Key-Value (KV) caching to avoid redundant computations during text generation and utilizes continuous batching (via engines like vLLM or TGI) to maximize GPU utilization.
* **Lifecycle Management:** A unified platform vision ensures that the entire end-to-end lifecycle, from data preparation and fine-tuning to deployment, is streamlined for ML engineers.

To stay competitive, Coupang is moving toward an integrated AI lifecycle where foundation models serve as the backbone for both content generation and predictive analytics. This infrastructure-first approach allows for the rapid deployment of generative features while maintaining the resource efficiency required for massive e-commerce scales.
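
To make the quantization idea concrete, the sketch below shows symmetric absmax INT8 quantization of a handful of weights. It is a toy illustration of the general technique, not the scheme Coupang's serving stack uses; the function names and weight values are hypothetical, and production systems typically quantize per channel or per block on the GPU.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    if scale == 0:
        scale = 1.0  # all-zero tensor: any scale works
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.5, 0.9]        # hypothetical FP32 weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, max_error)
```

Each weight now needs one byte instead of four, and the rounding error per weight is bounded by half the scale, which is why accuracy often degrades only slightly.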

coupang

Cloud expenditure optimization for cost efficiency

Coupang addressed rising cloud costs by establishing a cross-functional Central team to bridge the gap between engineering usage and financial accountability. Through a data-driven approach involving custom analytics and automated resource management, the company successfully reduced on-demand expenditure by millions of dollars. This initiative demonstrates that aligning technical infrastructure with financial governance is essential for maintaining growth without unnecessary waste.

**The Central Team and Data-Driven Governance**

* Coupang formed a specialized Central team consisting of infrastructure engineers and technical program managers to identify efficiency opportunities across the organization.
* The team developed custom BI dashboards utilizing Amazon CloudWatch, AWS Cost and Usage Reports (CUR), and Amazon Athena to provide domain teams with actionable insights into their spending.
* The finance department partnered with engineering to enforce strict budget compliance, ensuring that domain teams managed their resources within assigned monthly and quarterly limits.

**Strategies for Spending and Paying Less**

* The company implemented "Spending Less" strategies by automating the launch of resources in non-production environments only when needed, resulting in a 25% cost reduction for those areas.
* "Paying Less" initiatives focused on rightsizing, where the Central team worked with domain owners to manually identify and eliminate unutilized or underutilized EC2 resources.
* Workloads were migrated to more efficient hardware and pricing models, specifically leveraging ARM-based AWS Graviton processors and AWS Spot Instances for data processing and storage.

**Targeted Infrastructure Optimization**

* Engineering teams focused on instance generation alignment, ensuring that services were running on the most cost-effective hardware generations available.
* Storage costs were reduced by optimizing Amazon S3 structures at rest, improving how data is organized and stored.
* The team refined Amazon EMR (Elastic MapReduce) configurations to enhance processing efficiency, significantly lowering the cost of large-scale data analysis.

To achieve sustainable cloud efficiency, engineering organizations should move beyond viewing cloud costs as a purely financial concern and instead treat resource management as a core technical metric. By integrating financial accountability directly into the engineering workflow through shared analytics and automated resource controls, companies can foster a culture of efficiency that supports long-term scalability.
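
The rightsizing step, identifying underutilized EC2 resources, is described as a manual review. A first-pass filter over utilization metrics could look like the sketch below; the `underutilized` function, the 10% threshold, the instance IDs, and the CPU samples are all assumptions for illustration and do not come from the source.

```python
from statistics import mean


def underutilized(cpu_by_instance, threshold=10.0):
    """Return instance IDs whose average CPU utilization (%) is below `threshold`, sorted."""
    return sorted(
        instance
        for instance, samples in cpu_by_instance.items()
        if samples and mean(samples) < threshold
    )


# Hypothetical CPU utilization samples, e.g. exported from a metrics dashboard.
cpu_by_instance = {
    "i-aaa": [2.0, 3.5, 1.0],
    "i-bbb": [55.0, 61.0, 48.0],
    "i-ccc": [9.0, 12.0, 7.0],
}

candidates = underutilized(cpu_by_instance)
print(candidates)
```

A list like this would only be a starting point for the conversation with domain owners, since low average CPU can hide bursty or memory-bound workloads.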

coupang

Coupang Rocket Delivery’s spatial index-based delivery management system

Coupang’s Rocket Delivery system recently transitioned from a text-based postal code infrastructure to a sophisticated spatial index-based management system to handle increasing delivery density. By adopting Uber’s H3 hexagonal grid system, the engineering team enabled the visualization and precise segmentation of delivery areas that were previously too large for a single driver to manage. This move has transformed the delivery process into an intuitive, map-centric operation that allows for data-driven optimization and real-time area modifications.

### Limitations of Text-Based Postal Codes

* While postal codes provided a government-standardized starting point, they became inefficient as delivery volumes grew from double to triple digits per code.
* The lack of spatial data meant that segmenting a single postal code into smaller units, such as individual apartment complexes or buildings, required manual input from local experts familiar with the terrain.
* Relying on text strings prevented the system from providing intuitive visual feedback or automated metrics for optimizing delivery routes.

### Adopting H3 for Geospatial Indexing

* The team evaluated different spatial indexing systems, specifically comparing Google’s S2 (square-based) and Uber’s H3 (hexagon-based) frameworks.
* H3 was chosen because hexagons provide a constant distance between the center of a cell and all six of its neighbors, which simplifies the modeling of movement and coverage.
* The hexagonal structure minimizes "edge effect" distortions compared to squares or triangles, making it more accurate for calculating delivery radius and area density.

### Technical Redesign and Implementation

* The system utilizes H3’s hierarchical indexing, allowing the platform to store delivery data at various resolutions to balance granularity with computational performance.
* Delivery zones were converted from standard polygons into "hexagonized" groups, enabling the system to treat complex geographical shapes as sets of standardized cell IDs.
* This transition allowed for the creation of a visual interface where camp leaders can modify delivery boundaries directly on a map, with changes reflected instantly across the logistics chain.

By shifting to a spatial index, Coupang has decoupled its logistics logic from rigid administrative boundaries like postal codes. This technical foundation allows for more agile resource distribution and provides the scalability needed to handle the continued growth of high-density urban deliveries.
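
Two of the properties above, equidistant neighbors and zones stored as sets of cell IDs, can be demonstrated on a flat hexagonal grid. The sketch below uses plain axial coordinates rather than the H3 library, so the cell IDs are simple `(q, r)` tuples and not real H3 indexes; the zone names and the `reassign` helper are hypothetical.

```python
import math

# The six neighbor offsets of a hexagon in axial (q, r) coordinates.
DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def neighbors(cell):
    """Return the six cells adjacent to `cell`."""
    q, r = cell
    return [(q + dq, r + dr) for dq, dr in DIRECTIONS]


def center(cell, size=1.0):
    """Planar center of a pointy-top hexagon with the given edge length."""
    q, r = cell
    return (size * math.sqrt(3) * (q + r / 2), size * 1.5 * r)


def reassign(zones, cell, src, dst):
    """Move one cell from zone `src` to zone `dst`, as a map edit would."""
    zones[src].discard(cell)
    zones[dst].add(cell)


origin = (0, 0)
# Rounded so tiny floating-point differences do not count as distinct distances.
distances = {round(math.dist(center(origin), center(n)), 9) for n in neighbors(origin)}

zones = {"route_1": {(0, 0), (1, 0), (0, 1)}, "route_2": {(1, -1)}}
reassign(zones, (1, 0), "route_1", "route_2")
print(distances, zones)
```

The `distances` set collapses to a single value because every neighbor center is the same distance away, which is not true of a square grid where diagonal neighbors are farther than edge neighbors. Treating a zone as a set also makes boundary edits cheap: moving a block between drivers is a set removal and a set insertion.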

coupang

Meet Coupang’s Machine Learning Platform

Coupang’s internal Machine Learning Platform (MLP) is a comprehensive "batteries-included" ecosystem designed to streamline the end-to-end lifecycle of ML development across its diverse business units, including e-commerce, logistics, and streaming. By providing standardized tools for feature engineering, pipeline authoring, and model serving, the platform significantly reduces the time-to-production while enabling scalable, efficient compute management. Ultimately, this infrastructure allows Coupang to leverage advanced models like Ko-BERT for search and real-time forecasting to enhance the customer experience at scale.

**Motivation for a Centralized Platform**

* **Reduced Time to Production:** The platform aims to accelerate the transition from ad-hoc exploration to production-ready services by eliminating repetitive infrastructure setup.
* **CI/CD Integration:** By incorporating continuous integration and delivery into ML workflows, the platform ensures that experiments are reproducible and deployments are reliable.
* **Compute Efficiency:** Managed clusters allow for the optimization of expensive hardware resources, such as GPUs, across multiple teams and diverse workloads like NLP and Computer Vision.

**Notebooks and Pipeline Authoring**

* **Managed Jupyter Notebooks:** Provides data scientists with a standardized environment for initial data exploration and prototyping.
* **Pipeline SDK:** Developers can use a dedicated SDK to define complex ML workflows as code, facilitating the transition from research to automated pipelines.
* **Framework Agnostic:** The platform supports a wide range of ML frameworks and programming languages to accommodate different model architectures.

**Feature Engineering and Data Management**

* **Centralized Feature Store:** Enables teams to share and reuse features, reducing redundant data processing and ensuring consistency across the organization.
* **Consistent Data Pipelines:** Bridges the gap between offline training and online real-time inference by providing a unified interface for data transformations.
* **Large-scale Preparation:** Streamlines the creation of training datasets from Coupang’s massive logs, including product catalogs and user behavior data.

**Training and Inference Services**

* **Scalable Model Training:** Handles distributed training jobs and resource orchestration, allowing for the development of high-parameter models.
* **Robust Model Inference:** Supports low-latency model serving for real-time applications such as ad ranking, video recommendations in Coupang Play, and pricing.
* **Dedicated Infrastructure:** Training and inference clusters abstract the underlying hardware complexity, allowing engineers to focus on model logic rather than server maintenance.

**Monitoring and Observability**

* **Performance Tracking:** Integrated tools monitor model health and performance metrics in live production environments.
* **Drift Detection:** Provides visibility into data and model drift, ensuring that models remain accurate as consumer behavior and market conditions change.

For organizations looking to scale their AI capabilities, investing in an integrated platform that bridges the gap between experimentation and production is essential. By standardizing the "plumbing" of machine learning, such as feature stores and automated pipelines, companies can drastically increase the velocity of their data science teams and ensure the long-term reliability of their production models.
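
The offline/online consistency goal of the feature store comes down to defining each transformation once and calling it from both paths. The sketch below shows that idea in miniature; it is not the platform's SDK, and `price_features`, `build_training_set`, `online_features`, and the field names are hypothetical.

```python
def price_features(row):
    """Single transformation shared by the offline and online paths."""
    return {
        "discount_rate": round(1 - row["sale_price"] / row["list_price"], 4),
        "is_low_stock": int(row["stock"] < 10),
    }


def build_training_set(rows):
    """Offline path: transform a batch of historical records."""
    return [price_features(r) for r in rows]


def online_features(request):
    """Online path: transform one live request with the same code."""
    return price_features(request)


history = [{"list_price": 100.0, "sale_price": 80.0, "stock": 5}]
live_request = {"list_price": 100.0, "sale_price": 80.0, "stock": 5}

offline = build_training_set(history)[0]
online = online_features(live_request)
print(offline, online)
```

Because both paths call `price_features`, a record seen at training time and the same record seen at serving time produce identical features, which avoids training/serving skew.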

datadog

How we built a Ruby library that saves 50% in testing time | Datadog

Lengthy CI pipelines and flaky tests often hinder developer productivity by causing unnecessary wait times and costly infrastructure usage. To address this, Datadog developed a Ruby test impact analysis library that dynamically maps tests to specific source files, allowing the CI runner to skip tests unrelated to the latest code changes. By moving beyond standard coverage tools and utilizing low-level Ruby VM interpreter events, this solution significantly reduces testing time while maintaining high performance and correctness.

## The Strategy of Test Impact Analysis

* Lengthy CI pipelines (often exceeding 20 minutes) increase the likelihood of intermittent "flaky" failures that are unrelated to current code changes.
* While parallelization can reduce time, it increases cloud computing costs and does not mitigate the flakiness of irrelevant tests.
* Test impact analysis generates a dynamic map between each test and the source files executed during its run; if a commit doesn't touch those files, the test is safely skipped.
* Success depends on three pillars: correctness (never skipping a necessary test), performance (low overhead), and seamlessness (no required code changes for the user).

## Limitations of Standard Coverage Tools

* Ruby’s built-in `Coverage` module (enhanced in version 3.1 with `resume`/`suspend` methods) proved incompatible with existing total code coverage tools like `simplecov`.
* Initial prototypes using the `Coverage` module showed a performance overhead of 300%, making the test suite four times slower.
* The `TracePoint` API was also evaluated as an alternative to spy on code execution via the `line` event, but it still produced a significant median overhead of 200% to 400%.
* Benchmarks were conducted using the `rubocop` test suite (a "hard mode" scenario with 20,000+ tests) to ensure the tool could handle high-sensitivity environments.

## Implementing a Custom C Extension

* To bypass the limitations of high-level APIs, developers utilized Ruby’s C extension capabilities to hook directly into the Virtual Machine.
* The library uses `rb_add_event_hook2` and `rb_thread_add_event_hook` to subscribe to the `RUBY_EVENT_LINE` event at the interpreter level.
* The implementation involves a C-based `dd_cov_start` function that triggers when a test begins and a `dd_cov_stop` function to collect the results.
* During execution, the tool uses `rb_sourcefile()` to identify the current file and stores it in a Ruby hash only if the file is located within the project’s root directory.

For engineering teams struggling with bloated CI pipelines, adopting test impact analysis is a highly effective way to optimize resources. By utilizing tools like Datadog’s Intelligent Test Runner, which leverages low-level VM events for minimal overhead, teams can cut their testing time in half without sacrificing the reliability of their master branch.
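
Once the per-test file map exists, the skip decision itself is a set intersection. The sketch below shows only that decision step, written in Python for illustration rather than Ruby; it is not Datadog's code, and `select_tests`, the test names, and the file paths are hypothetical. Running tests that have no recorded coverage is an assumption made here to honor the correctness pillar.

```python
def select_tests(impact_map, changed_files):
    """Split tests into (to_run, to_skip) based on which files a commit touched.

    `impact_map` maps each test name to the set of source files it executed.
    Tests with no recorded files always run, since nothing proves them safe to skip.
    """
    changed = set(changed_files)
    to_run, to_skip = [], []
    for test, files in sorted(impact_map.items()):
        if not files or changed & set(files):
            to_run.append(test)
        else:
            to_skip.append(test)
    return to_run, to_skip


# Hypothetical impact map collected during a previous full run.
impact_map = {
    "test_cart": {"lib/cart.rb", "lib/price.rb"},
    "test_login": {"lib/auth.rb"},
    "test_new": set(),  # added since the last run, no coverage yet
}

to_run, to_skip = select_tests(impact_map, ["lib/price.rb"])
print(to_run, to_skip)
```

The expensive part in practice is building `impact_map` cheaply, which is what the C-level line-event hook described above is for; the selection logic stays this simple regardless of how the map is collected.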