Achieving 10,000x training data reduction with high-fidelity labels
Google Ads researchers have developed a scalable active learning curation process that reduces the volume of training data required for fine-tuning LLMs by up to four orders of magnitude. By iteratively identifying the most informative and diverse examples through clustering and expert review, the method achieves significantly higher human-model alignment than traditional large-scale crowdsourced datasets. This approach addresses the high cost and complexity of classifying ambiguous content, such as unsafe ads, where high-fidelity data is scarce and concept drift is frequent.

### The Iterative Curation Process

* **Initial Labeling:** The process begins with a zero- or few-shot model (LLM-0) that generates a large, typically imbalanced dataset of "positive" and "benign" labels.
* **Clustering and Confusion Identification:** Separate clusters are created for each label set; overlapping clusters indicate areas where the model is confused.
* **Expert Sampling:** Human experts review pairs of examples located near the decision boundary of these overlapping clusters, prioritizing those that cover a larger area of the search space to ensure diversity.
* **Recursive Refinement:** Expert labels are split into fine-tuning and evaluation sets; the model is retrained, and the process repeats until model-human alignment plateaus or matches internal expert agreement.

### Measuring Alignment via Cohen’s Kappa

* **Metric Selection:** Because ad safety is often subjective, the researchers use Cohen’s Kappa instead of precision and recall to measure how well two independent annotators align beyond chance.
* **Performance Benchmarks:** A Kappa value above 0.8 is considered exceptional, while 0.4 is the minimum for acceptability.
* **Goal Alignment:** The curation process aims to move model performance toward the "ceiling" of internal human agreement (which measured between 0.78 and 0.81 in these experiments).
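To make the metric concrete, Cohen’s Kappa compares observed agreement between two annotators against the agreement expected by chance from their label frequencies. The following is a minimal, self-contained sketch (not the researchers' code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators beyond chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    if p_e == 1.0:  # Both annotators used a single identical label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

In this setting, `labels_a` would be the model's predictions and `labels_b` the expert labels, with values interpreted against the 0.8 / 0.4 thresholds described above.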
### Experimental Results and Efficiency

* **Model Scaling:** Experiments involved fine-tuning Gemini Nano-1 (1.8B parameters) and Nano-2 (3.25B parameters) on tasks of varying complexity.
* **Drastic Data Reduction:** The curated method reached performance plateaus using fewer than 500 expert-labeled examples, compared to a baseline of 100,000 crowdsourced labels.
* **Quality Gains:** Despite using 10,000x less data, the curated models saw up to a 65% improvement in alignment with human experts over the crowdsourced baselines.
* **Class Balancing:** The process naturally corrected for production imbalances, moving from <1% positive examples in raw traffic to ~40% in the final curated sets.

This curation method is a highly effective strategy for organizations managing high-stakes classification tasks where "ground truth" is subjective or data curation is prohibitively expensive. By shifting focus from data quantity to the quality and diversity of examples at the decision boundary, developers can maintain high-performing models that adapt quickly to evolving safety policies.
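The iterative curation loop described earlier can be sketched in code. Note this is an illustrative skeleton only: it simplifies the cluster-overlap step to plain uncertainty sampling near the model's decision boundary, and the `model` and `expert` callables are hypothetical stand-ins, not part of the researchers' system.

```python
def confusion_score(example, model):
    """Distance of the model's score from the decision boundary (0.5).
    Smaller means more ambiguous; these examples go to experts first."""
    return abs(model(example) - 0.5)

def curation_round(pool, model, expert, budget):
    """One round: rank unlabeled examples by ambiguity, have experts
    label the most confusing ones, and return the new (example, label) pairs."""
    ranked = sorted(pool, key=lambda ex: confusion_score(ex, model))
    return [(ex, expert(ex)) for ex in ranked[:budget]]

def curate(pool, model, expert, rounds=4, budget=25):
    """Repeat rounds until the expert-label budget is spent. In the real
    process the model is fine-tuned on the accumulated labels between
    rounds, and the loop stops when model-human kappa plateaus."""
    labeled, seen = [], set()
    for _ in range(rounds):
        remaining = [ex for ex in pool if ex not in seen]
        new = curation_round(remaining, model, expert, budget)
        labeled.extend(new)
        seen.update(ex for ex, _ in new)
    return labeled
```

For example, with a pool of integers, a toy scoring model `lambda x: x / 100`, and an expert oracle `lambda x: 1 if x >= 50 else 0`, each round picks the examples nearest the 0.5 boundary, yielding a small, balanced expert-labeled set, which mirrors how the process reached its plateau with under 500 labels.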