feature-engineering

5 posts

toss

Toss Next ML Challenge - Ad

Toss recently hosted the "Toss Next ML Challenge," a large-scale competition focused on predicting advertisement click-through rates (CTR) using real-world, anonymized data from the Toss app. By tasking over 2,600 participants with developing high-performance models under real-time serving constraints, the event surfaced innovative technical approaches to feature engineering and model ensembling.

### Designing a Real-World CTR Prediction Task

* The competition required participants to predict the probability of a user clicking a display ad based on a dataset of 10.7 million training samples.
* Data included anonymized features such as age, gender, ad inventory IDs, and historical user behavior.
* A primary technical requirement was real-time serving capability: models had to be optimized for fast inference to function within a live service environment.

### Overcoming Anonymization with Sequence Engineering

* To maintain data privacy while allowing external access, Toss provided anonymized features in a single flattened table, which limited participants' ability to perform traditional data joins.
* A complex, raw "Sequence" feature was intentionally left unprocessed to serve as a differentiator for high-performing teams.
* Top-tier participants demonstrated extreme persistence, deriving up to 37 unique variables from this single sequence, including transition probabilities, unique token counts, and sequence lengths (a sketch follows this summary).

### Winning Strategies and Technical Trends

* All of the top 30 teams used boosting-tree models (such as XGBoost or LightGBM), while deep learning appeared only in a subset of solutions.
* One standout solution used a massive ensemble of 260 different models, offering a fresh perspective on the limits of ensemble learning for predictive accuracy.
* Performance was largely driven by the ability to extract meaningful signals from anonymized data through rigorous cross-validation and creative feature interactions.

The results of the Toss Next ML Challenge suggest that even when anonymization removes domain-specific context, meticulous feature engineering and robust tree-based architectures remain the gold standard for tabular data. For ML engineers, the competition underscores that the key to production-ready models lies in balancing complex ensembling with the strict latency requirements of real-time serving.
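The post does not publish the winners' code, but the sequence statistics it names (sequence lengths, unique token counts, transition probabilities) are straightforward to illustrate. Below is a minimal Python sketch under the assumption that the anonymized sequence arrives as a delimiter-separated token string; the `seq` column name, the separator, and the `sequence_features` helper are all hypothetical.

```python
import pandas as pd

def sequence_features(seq: str, sep: str = ",") -> dict:
    """Derive simple statistics from an anonymized token sequence."""
    tokens = [t for t in seq.split(sep) if t]
    n = len(tokens)
    unique = set(tokens)
    # Probability of repeating the same token in consecutive steps,
    # one simple example of a transition-based statistic.
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    return {
        "seq_len": n,
        "n_unique_tokens": len(unique),
        "unique_ratio": len(unique) / n if n else 0.0,
        "p_self_transition": repeats / (n - 1) if n > 1 else 0.0,
        "last_token": tokens[-1] if tokens else None,
    }

df = pd.DataFrame({"seq": ["a,b,b,c", "a,a,a", "c"]})
feats = df["seq"].apply(sequence_features).apply(pd.Series)
print(pd.concat([df, feats], axis=1))
```

Each derived column can then be fed to a boosting-tree model alongside the ordinary tabular features, which matches the tree-dominated leaderboard the post describes.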

google

Reducing EV range anxiety: How a simple AI model predicts port availability

Google Research has developed a lightweight AI model designed to predict the probability of EV charging port availability at specific future intervals, directly addressing the "range anxiety" experienced by electric vehicle drivers. By co-designing the model with deployment infrastructure, researchers found that a simple linear regression approach outperformed more complex architectures like neural networks and decision trees. The resulting system effectively predicts availability changes during high-turnover periods, providing more reliable navigation and planning data than traditional "no-change" assumptions.

### Model Architecture and Feature Selection

* The development team prioritized a minimal feature set to ensure low-latency deployment and high speed in real-world navigation applications.
* After testing various architectures, a straightforward linear regression model was selected for its robustness and superior performance on this specific predictive task.
* The model was trained using real-time availability data from diverse geographical regions, specifically California and Germany, with an emphasis on larger charging stations that reflect high-traffic usage patterns.

### Temporal Feature Weights and Occupancy Trends

* The model uses the hour of the day as a primary feature, treating each hour as an independent variable to capture specific daily cycles (a sketch follows this summary).
* Learned numerical "weights" dictate the predicted rate of occupancy change: positive weights indicate ports are becoming occupied (e.g., during the morning rush), while negative weights indicate ports are being freed up (e.g., during evening hours).
* The system is designed to deviate from the current occupancy state only when the change rate is statistically significant or when a station's large size amplifies the likelihood of a status change.

### Performance Benchmarking and Validation

* The model was evaluated against a "Keep Current State" baseline, which assumes future availability will be identical to the present status: a difficult baseline to beat, since port status remains unchanged roughly 90% of the time over 30-minute windows.
* Accuracy was measured using mean squared error (MSE) and mean absolute error (MAE) over 30-minute and 60-minute time horizons across 100 randomly selected stations.
* Testing confirmed that the linear regression model provides its greatest value during infrequent but critical moments of high turnover, successfully identifying when a station is likely to become full or available.

The success of this model demonstrates that sophisticated deep learning is not always the optimal solution for infrastructure challenges. By combining intuitive real-world logic (such as driver schedules and station capacity) with simple machine learning techniques, developers can create highly efficient tools that significantly improve the EV user experience without requiring massive computational overhead.
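To make the hour-of-day formulation concrete, here is a minimal sketch in the spirit of the post, not Google's actual model: synthetic occupancy-change data, one-hot hour features so each hour gets its own learned weight, and a comparison against the "Keep Current State" (zero-change) baseline. All data, rate values, and variable names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical observations: hour of day, and the change in the
# occupied-port fraction over the next 30 minutes.
hours = rng.integers(0, 24, size=5000)
true_rate = np.where(
    (hours >= 7) & (hours <= 10), 0.08,                      # morning rush: filling up
    np.where((hours >= 18) & (hours <= 22), -0.06, 0.0),      # evening: freeing up
)
delta_occ = true_rate + rng.normal(0.0, 0.05, size=hours.shape)

# Treat each hour as an independent variable via one-hot encoding.
X = np.eye(24)[hours]
model = LinearRegression(fit_intercept=False).fit(X, delta_occ)

# Each learned weight is the expected occupancy-change rate for that hour:
# positive -> ports becoming occupied, negative -> ports freeing up.
print({h: round(w, 3) for h, w in enumerate(model.coef_)})

# "Keep Current State" baseline predicts zero change; compare MSE.
mse_model = np.mean((model.predict(X) - delta_occ) ** 2)
mse_baseline = np.mean(delta_occ ** 2)
print(f"model MSE={mse_model:.4f}  keep-current MSE={mse_baseline:.4f}")
```

Because the baseline is right whenever nothing changes, the model's edge shows up only in the rush-hour rows, which mirrors the post's finding that the value concentrates in high-turnover moments.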

google

MLE-STAR: A state-of-the-art machine learning engineering agent

MLE-STAR is a state-of-the-art machine learning engineering agent designed to automate complex ML tasks by treating them as iterative code optimization challenges. Unlike previous agents that rely solely on an LLM’s internal knowledge, MLE-STAR integrates external web searches and targeted ablation studies to pinpoint and refine specific pipeline components. This approach allows the agent to achieve high-performance results, evidenced by its ability to win medals in 63% of Kaggle competitions within the MLE-Bench-Lite benchmark.

## External Knowledge and Targeted Ablation

The core of MLE-STAR’s effectiveness lies in its ability to move beyond generic machine learning libraries by incorporating external research and specific performance testing.

* The agent uses web search to retrieve task-specific, state-of-the-art models and approaches rather than defaulting to familiar libraries like scikit-learn.
* Instead of modifying an entire script at once, the system conducts an ablation study to evaluate the impact of individual pipeline components, such as feature engineering or model selection (a sketch follows this summary).
* By identifying which code blocks have the most significant impact on performance, the agent can focus its reasoning and optimization efforts where they are most needed.

## Iterative Refinement and Intelligent Ensembling

Once the critical components are identified, MLE-STAR employs a specialized refinement process to maximize the effectiveness of the generated solution.

* Targeted code blocks undergo iterative refinement based on LLM-suggested plans that incorporate feedback from prior experimental failures and successes.
* The agent features a unique ensembling strategy where it proposes multiple candidate solutions and then designs its own method to merge them.
* Rather than using simple validation-score voting, the agent iteratively improves the ensemble strategy itself, treating the combination of models as a distinct optimization task.

## Robustness and Safety Verification

To ensure the generated code is both functional and reliable for real-world deployment, MLE-STAR incorporates three specialized diagnostic modules.

* **Debugging Agent:** Automatically analyzes tracebacks and execution errors in Python scripts to provide iterative corrections.
* **Data Leakage Checker:** Reviews the solution script prior to execution to ensure the model does not improperly access test dataset information during the training phase.
* **Data Usage Checker:** Analyzes whether the script is utilizing all available data sources, preventing the agent from overlooking complex data formats in favor of simpler files like CSVs.

By combining external grounding with a granular, component-based optimization strategy, MLE-STAR represents a significant shift in automated machine learning. For organizations looking to scale their ML workflows, such an agent suggests a future where the role of the engineer shifts from manual coding to high-level supervision of autonomous agents that can navigate the vast landscape of research and data engineering.
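MLE-STAR performs its ablations through LLM-generated scripts, which the post does not show; the sketch below illustrates the underlying idea with a fixed scikit-learn pipeline instead. Dropping one component at a time and re-scoring reveals which block most affects validation performance, which is where the agent would focus its refinement. The dataset, the pipeline, and the choice of components are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Candidate pipeline components to ablate one at a time.
components = {
    "scaling": StandardScaler(),
    "feature_selection": SelectKBest(k=10),
}

def score(steps):
    """Cross-validated accuracy of the pipeline formed by `steps` + model."""
    pipe = Pipeline(steps + [("model", GradientBoostingClassifier(random_state=0))])
    return cross_val_score(pipe, X, y, cv=3).mean()

full = score(list(components.items()))
print(f"full pipeline: {full:.3f}")

# Ablate each component; the largest score drop marks the code block
# most worth iterating on.
for name in components:
    steps = [(n, c) for n, c in components.items() if n != name]
    s = score(steps)
    print(f"without {name}: {s:.3f} (delta {s - full:+.3f})")
```

The agent-driven version differs in that the components themselves are blocks of generated code and the ablation harness is written by the LLM, but the focus-where-it-hurts logic is the same.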

coupang

Optimizing Logistics Inbound Process Using

Coupang has implemented a machine learning-based prediction system to optimize its logistics inbound process by accurately forecasting the number of trucks required for product deliveries. By analyzing historical logistics data and vendor characteristics, the system minimizes resource waste at fulfillment center docks and prevents operational delays caused by slot shortages. This data-driven approach ensures that limited dock slots are allocated efficiently, improving overall supply chain speed and reliability.

### Challenges in Inbound Logistics

* Fulfillment centers operate with a fixed number of "docks" for unloading and specific time "slots" assigned to each truck.
* Inaccurate predictions create a resource dilemma: under-estimating slots causes unloading delays and backlogs, while over-estimating leads to idle docks and wasted capacity.
* The goal was to move beyond manual estimation to an automated system that balances vendor requirements with actual facility throughput.

### Feature Engineering and Data Collection

* The team performed Exploratory Data Analysis (EDA) on approximately 800,000 instances of inbound data collected over two years.
* In-depth interviews with domain experts and logistics managers were conducted to identify hidden patterns and qualitative factors that influence truck requirements.
* Final feature sets were refined through feature engineering, focusing on vendor-specific behaviors and the physical characteristics of the products being delivered.

### LightGBM Implementation and Optimization

* The LightGBM algorithm was selected due to its high performance with large datasets and its efficiency in handling categorical features.
* The model utilizes a leaf-wise tree growth strategy, which allows for faster training speeds and lower loss compared to traditional level-wise growth algorithms.
* Hyperparameters were optimized using Bayesian Optimization, a method that finds the most effective model configurations more efficiently than traditional grid search (a sketch follows this summary).
* The trained model is integrated directly into the booking system, providing real-time truck quantity recommendations to vendors during the application process.

### Operational Trade-offs and Results

* The system must navigate the trade-off between under-prediction (which risks logistical bottlenecks) and over-prediction (which risks resource waste).
* By automating the prediction of necessary slots, Coupang has reduced the manual workload for vendors and improved the accuracy of fulfillment center scheduling.
* This optimization allows for more products to be processed in a shorter time frame, directly contributing to faster delivery times for the end customer.

By replacing manual estimates with a LightGBM-based predictive model, Coupang has successfully synchronized vendor deliveries with fulfillment center capacity. This technical shift not only maximizes dock utilization but also builds a more resilient and scalable inbound supply chain.
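Coupang's code is not public, but the named ingredients, LightGBM with its leaf-wise growth and Bayesian hyperparameter optimization, can be sketched. The example below uses Optuna, whose default TPE sampler is a common Bayesian-style optimizer; the synthetic dataset, the search space, and the choice of Optuna itself are assumptions, not the post's implementation.

```python
import lightgbm as lgb
import optuna
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the inbound dataset: features describing a
# delivery booking, target = number of trucks required.
X, y = make_regression(n_samples=2000, n_features=15, noise=10.0, random_state=0)

def objective(trial):
    params = {
        # num_leaves is the key knob for LightGBM's leaf-wise growth.
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
    }
    model = lgb.LGBMRegressor(**params, random_state=0)
    # Minimize cross-validated mean absolute error.
    return -cross_val_score(
        model, X, y, cv=3, scoring="neg_mean_absolute_error"
    ).mean()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```

Unlike grid search, which exhausts a fixed lattice, the sampler uses the results of earlier trials to propose promising configurations, which is why far fewer evaluations are needed.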

coupang

Accelerating ML Development through Coupang

Coupang’s internal Machine Learning (ML) platform serves as a standardized ecosystem designed to accelerate the transition from experimental research to stable production services. By centralizing core functions like automated pipelines, feature engineering, and scalable inference, the platform addresses the operational complexities of managing ML at an enterprise scale. This infrastructure allows engineers to focus on model innovation rather than manual resource management, ultimately driving efficiency across Coupang’s diverse service offerings.

### Addressing Scalability and Development Bottlenecks

* The platform aims to drastically reduce "Time to Market" by providing "ready-to-use" services that eliminate the need for engineers to build custom infrastructure for every model.
* Integrating Continuous Integration and Continuous Deployment (CI/CD) into the ML lifecycle ensures that updates to data, code, and models are handled with the same rigor as traditional software engineering.
* By optimizing ML computing resources, the platform allows for the efficient scaling of training and inference workloads, preventing infrastructure costs from spiraling as the number of models grows.

### Core Services of the ML Platform

* **Notebooks and Pipelines:** Integrated Jupyter environments allow for ad-hoc exploration, while workflow orchestration tools enable the construction of reproducible ML pipelines.
* **Feature Engineering:** A dedicated feature store facilitates the reuse of data components and ensures consistency between the features used during model training and those used in real-time inference (a sketch of this idea follows this summary).
* **Scalable Training and Inference:** The platform provides dedicated clusters for high-performance model training and robust hosting services for real-time and batch model predictions.
* **Monitoring and Observability:** Automated tools track model performance and data drift in production, alerting engineers when a model’s accuracy begins to degrade due to changing real-world data.

### Real-World Success in Search and Pricing

* **Search Query Understanding:** The platform enabled the training of Ko-BERT (Korean Bidirectional Encoder Representations from Transformers), significantly improving the accuracy of search results by better understanding customer intent.
* **Real-time Dynamic Pricing:** Using the platform’s low-latency inference services, Coupang can predict and adjust product prices in real time based on fluctuating market conditions and inventory levels.

To maintain a competitive edge in e-commerce, organizations should transition away from fragmented, ad-hoc ML workflows toward a unified platform that treats ML as a first-class citizen of the software development lifecycle. Investing in such a platform not only speeds up deployment but also ensures the long-term reliability and observability of production models.
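The post describes the feature store only at a high level, so the following is a heavily hedged illustration of the training/serving-consistency idea rather than Coupang's system: feature definitions are registered once and resolved through a single code path for both offline training and online inference, so the two cannot drift apart. Every name here is hypothetical, as Coupang's internal APIs are not public.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class FeatureDef:
    """A named feature and the function that computes it from a raw record."""
    name: str
    compute: Callable[[dict], float]

REGISTRY: Dict[str, FeatureDef] = {}

def register(feature: FeatureDef) -> None:
    REGISTRY[feature.name] = feature

def build_vector(record: dict, names: List[str]) -> List[float]:
    # Single code path used by both the offline training job and the
    # online serving endpoint, guaranteeing identical feature logic.
    return [REGISTRY[n].compute(record) for n in names]

register(FeatureDef("price_krw", lambda r: float(r["price"])))
register(FeatureDef("discount_ratio", lambda r: r["discount"] / max(r["price"], 1)))

row = {"price": 12000, "discount": 1800}
print(build_vector(row, ["price_krw", "discount_ratio"]))  # [12000.0, 0.15]
```

Production feature stores add storage, versioning, and point-in-time correctness on top of this, but the registry-plus-shared-resolution pattern is the core of the consistency guarantee the summary describes.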