Toss Next ML Challenge - Ad
Toss recently hosted the "Toss Next ML Challenge," a large-scale competition focused on predicting advertisement Click-Through Rates (CTR) using real-world, anonymized data from the Toss app. By tasking over 2,600 participants with developing high-performance models under real-time serving constraints, the event successfully identified innovative technical approaches to feature engineering and model ensembling.
Designing a Real-World CTR Prediction Task
- The competition required participants to predict the probability of a user clicking a display ad based on a dataset of 10.7 million training samples.
- Data included anonymized features such as age, gender, ad inventory IDs, and historical user behavior.
- A primary technical requirement was real-time serving capability: models had to be optimized for fast inference to function within a live service environment.
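To make the task concrete, here is a minimal sketch of CTR scoring: a logistic model that maps a feature vector to a click probability. The feature names and weights are hypothetical illustrations, not the competition's actual schema or any participant's model.

```python
import math

def sigmoid(z: float) -> float:
    """Squash a raw score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_ctr(features: dict, weights: dict, bias: float) -> float:
    """Logistic model: estimated probability that the user clicks the ad."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)

# Hypothetical weights for illustration only.
weights = {"age_norm": 0.4, "seen_this_ad_before": -1.2, "hour_of_day_norm": 0.1}
p = predict_ctr({"age_norm": 0.3, "seen_this_ad_before": 1.0}, weights, bias=-2.0)
# p is a probability in (0, 1); here it comes out low, as most impressions go unclicked.
```

A scorer this simple runs in microseconds per impression, which is the kind of inference budget real-time ad serving imposes; the competition's tree ensembles had to meet a similar constraint at far greater model complexity.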
Overcoming Anonymization with Sequence Engineering
- To maintain data privacy while allowing external access, Toss provided anonymized features in a single flattened table, which limited the ability of participants to perform traditional data joins.
- A complex, raw "Sequence" feature was intentionally left unprocessed to serve as a differentiator for high-performing teams.
- Top-tier participants went to great lengths with this feature, deriving as many as 37 distinct variables from the single sequence, including transition probabilities, unique token counts, and sequence lengths.
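The kind of derivation described above can be sketched as follows. This is a hedged illustration of how length, uniqueness, and transition-probability features might be extracted from a raw token sequence; the token format and feature names are assumptions, not Toss's actual schema.

```python
from collections import Counter

def sequence_features(seq: list) -> dict:
    """Derive simple summary features from one raw token sequence."""
    feats = {
        "seq_len": len(seq),
        "unique_tokens": len(set(seq)),
        "unique_ratio": len(set(seq)) / len(seq) if seq else 0.0,
    }
    # Empirical transition probabilities: P(next == b | current == a)
    # estimated from the bigrams observed in this sequence.
    bigrams = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])
    for (a, b), count in bigrams.items():
        feats[f"p_{a}->{b}"] = count / firsts[a]
    return feats

feats = sequence_features(["A", "B", "A", "B", "C"])
# feats["seq_len"] == 5, feats["unique_tokens"] == 3
# feats["p_A->B"] == 1.0  (every observed "A" is followed by "B")
```

Stacking variants of these statistics (per token type, per window, per session) is how a single opaque sequence column can yield dozens of usable model inputs.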
Winning Strategies and Technical Trends
- All of the top 30 teams utilized Boosting Tree-based models (such as XGBoost or LightGBM), while Deep Learning was used only by a subset of participants.
- One standout solution blended 260 different models into a single ensemble, probing how far ensemble learning can be pushed for predictive accuracy.
- Performance was largely driven by the ability to extract meaningful signals from anonymized data through rigorous cross-validation and creative feature interactions.
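As a minimal sketch of the ensembling idea, the snippet below averages per-sample probabilities across several models. Simple mean blending is just one common scheme; the winning team's exact 260-model blend is not public, and the prediction values here are invented for illustration.

```python
def ensemble_mean(predictions: list) -> list:
    """Average each sample's predicted probability across all models.

    `predictions` is a list of per-model prediction lists, all the same length.
    """
    n_models = len(predictions)
    return [sum(per_sample) / n_models for per_sample in zip(*predictions)]

# Three hypothetical models scoring the same two ad impressions:
model_preds = [
    [0.10, 0.80],
    [0.12, 0.70],
    [0.08, 0.90],
]
blended = ensemble_mean(model_preds)
# blended[0] ≈ 0.10, blended[1] ≈ 0.80
```

Averaging reduces the variance of any single model's errors, which is why large blends tend to score well offline; the trade-off, as the competition's serving constraint highlights, is that every extra model adds inference latency.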
The results of the Toss Next ML Challenge suggest that even when anonymization strips away domain-specific context, meticulous feature engineering and robust tree-based architectures remain the gold standard for tabular data. For ML engineers, the competition underscores that production-ready models depend on balancing complex ensembling against the strict latency requirements of real-time serving.