clustering

3 posts

line

Extracting Trending Keywords from Open Chat Messages

To enhance user engagement on the LINE OpenChat main screen, LY Corporation developed a system to extract and surface "trending keywords" from real-time message data. By shifting focus from chat room recommendations to content-driven keyword clusters, the team addresses the lack of context in individual messages while providing a more dynamic discovery experience. The approach combines statistical Z-tests to identify frequency spikes with MinHash clustering to eliminate near-duplicate content, ensuring that the trending topics are both relevant and diverse.

**The Shift from Chat Rooms to Content-Driven Recommendations**

* Traditional recommendations focus on entire chat rooms, which often require significant user effort to investigate and evaluate.
* Inspired by micro-blogging services, the team aimed to surface messages as individual content pieces to increase the "main screen visit" KPI.
* Because individual chat messages are often fragmented or full of typos, the system groups them by keywords to create meaningful thematic content.

**Statistical Detection of Trending Keywords**

* Simple frequency counts are ineffective because they capture common social fillers such as greetings and expressions of gratitude rather than actual trends.
* Trends are defined as keywords showing a sharp increase in frequency compared to a baseline from seven days prior.
* The system uses a Z-test for two-sample proportions to assign a score to each word, filtering for terms with at least 30% frequency growth.
* A seven-day comparison window is specifically used to suppress weekly cyclical noise (e.g., mentions of "weekend") and to capture topics whose popularity peaks over several consecutive days.

**MinHash-based Message Deduplication**

* Redundant messages, such as copy-pasted text, are removed prior to frequency aggregation to prevent skewed results and repetitive user experiences.
* The system employs MinHash, a dimensionality reduction technique, to identify near-duplicate messages based on Jaccard similarity.
* The process involves "shingling" messages into sets of tokens (primarily nouns) and generating $k$-length signatures; messages with identical signatures are clustered together.
* To evaluate the quality of these clusters without high computational costs, the team developed a "SetDiv" (Set Diversity) metric that operates in linear time complexity.

By combining Z-test statistical modeling with MinHash deduplication, this methodology transforms fragmented chat data into a structured discovery layer. For developers working with high-volume social data, a rolling weekly baseline paired with signature-based clustering offers a scalable way to surface high-velocity trends while filtering out both routine social noise and repetitive content.
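The trend-scoring step can be sketched as a two-sample proportion Z-test against the seven-day baseline. This is a minimal illustration, not LINE's implementation: the pooled-variance formulation and the 1.96 significance cutoff are assumptions, since the post only specifies the 30% growth filter and the weekly comparison window.

```python
import math

def z_score(count_now, total_now, count_base, total_base):
    """Two-sample proportion Z-test: is a keyword's share of messages
    today significantly higher than seven days ago?"""
    p1 = count_now / total_now
    p2 = count_base / total_base
    p_pool = (count_now + count_base) / (total_now + total_base)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_now + 1 / total_base))
    return (p1 - p2) / se

def is_trending(count_now, total_now, count_base, total_base,
                min_growth=0.30, z_threshold=1.96):
    """Flag a keyword only if it grew at least 30% over the baseline
    AND the increase is statistically significant."""
    p1 = count_now / total_now
    p2 = count_base / total_base
    if p2 == 0 or (p1 - p2) / p2 < min_growth:
        return False
    return z_score(count_now, total_now, count_base, total_base) > z_threshold

# A keyword jumping from 50 to 400 mentions per 100k messages is flagged:
print(is_trending(400, 100_000, 50, 100_000))  # True
# A 10% bump fails the 30% growth filter:
print(is_trending(55, 100_000, 50, 100_000))   # False
```

Comparing proportions rather than raw counts keeps the score meaningful even when overall message volume fluctuates between the two windows.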
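The shingle-sign-cluster step might look like the sketch below. The hash family (seeded MD5) and the signature length `k=16` are illustrative assumptions; the post does not specify either, and a production system would use faster hashes and likely banding over the signatures.

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, k=16):
    """k-length MinHash signature over a message's token set. Messages
    with identical (or highly Jaccard-similar) token sets tend to
    collide on the same signature."""
    token_set = set(tokens)
    sig = []
    for seed in range(k):
        # Each seed defines one hash function; keep the minimum hash value.
        sig.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in token_set
        ))
    return tuple(sig)

def cluster_near_duplicates(messages, tokenize, k=16):
    """Group messages by identical signature; deduplication keeps one
    representative per cluster before frequency aggregation."""
    clusters = defaultdict(list)
    for msg in messages:
        clusters[minhash_signature(tokenize(msg), k)].append(msg)
    return list(clusters.values())

msgs = ["concert tickets on sale now",
        "concert tickets on sale now!",
        "totally different message"]
tokenize = lambda s: s.strip("!").split()
for group in cluster_near_duplicates(msgs, tokenize):
    print(group)
```

The copy-pasted variants collapse into one cluster because their token sets are identical after tokenization, so only one of them contributes to the keyword frequency counts.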

google

Beyond billion-parameter burdens: Unlocking data synthesis with a conditional generator

The CTCL (Data Synthesis with ConTrollability and CLustering) framework provides a lightweight alternative to the computationally expensive process of fine-tuning billion-parameter models for differentially private synthetic data generation. By utilizing a 140-million parameter generator and a universal topic model, the system achieves high-quality distribution matching while remaining accessible for resource-constrained applications. This approach allows for the generation of unlimited synthetic samples without incurring additional privacy costs, consistently outperforming existing API-based and large-scale baselines under strict privacy guarantees.

### Pre-training Universal Components

The framework relies on two core components developed using large-scale public corpora, which can be reused across different private domains:

* **CTCL-Topic:** A universal topic model derived from Wikipedia documents. It uses BERTopic to embed and cluster data into approximately 1,000 distinct topics, each represented by 10 descriptive keywords.
* **CTCL-Generator:** A conditional language model based on the 140M-parameter BART-base architecture. It was pre-trained on 430 million description–document pairs from the SlimPajama dataset, with descriptions generated by Gemma-2-2B to ensure the model can generate text based on specific input conditions.

### Learning the Private Domain

Once the universal components are established, the framework learns the specific characteristics of a private dataset through a two-step process:

* **Differentially Private (DP) Histograms:** The system captures high-level distributional information by creating a DP-protected histogram that represents the percentage of each topic present in the private corpus.
* **DP Fine-Tuning:** Each document in the private dataset is associated with its corresponding keywords from the CTCL-Topic model. The CTCL-Generator is then fine-tuned on these keyword-document pairs using differential privacy to ensure individual data points are protected.

### Controllable Data Generation

The final stage involves producing the synthetic dataset by sampling from the fine-tuned generator:

* **Proportional Sampling:** The system generates data by targeting the exact topic proportions found in the private domain histogram.
* **Keyword Conditioning:** For each topic, the model uses the associated 10 keywords as input to prompt the DP fine-tuned generator to produce relevant documents.
* **Post-Processing Efficiency:** Because the generator is already fine-tuned with DP, the framework can generate an unlimited number of synthetic samples without further privacy budget expenditure, a significant advantage over iterative selection algorithms.

CTCL offers a highly scalable and efficient solution for organizations needing to synthesize private text data without the infrastructure requirements of massive LLMs. Its ability to maintain topic-wise distribution through keyword conditioning makes it an ideal choice for specialized domains where maintaining the statistical utility of the data is as critical as protecting user privacy.
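The histogram-then-sample pipeline can be sketched as follows. This is a toy illustration under assumptions the post does not specify: the Laplace mechanism and `epsilon` value stand in for whatever DP histogram mechanism the paper uses, and `mock_generate` is a hypothetical stand-in for the DP fine-tuned CTCL-Generator.

```python
import numpy as np

def dp_topic_histogram(topic_ids, num_topics, epsilon=1.0):
    """DP-protected topic distribution. Each document contributes to
    exactly one topic bucket, so adding Laplace noise with scale
    1/epsilon to the counts satisfies epsilon-DP for the histogram."""
    counts = np.bincount(topic_ids, minlength=num_topics).astype(float)
    counts += np.random.laplace(scale=1.0 / epsilon, size=num_topics)
    counts = np.clip(counts, 0, None)        # drop negative noisy counts
    return counts / counts.sum()             # normalize to proportions

def generate_synthetic(histogram, topic_keywords, generate_fn, n_samples):
    """Sample topics in proportion to the DP histogram, then condition
    the generator on each sampled topic's keywords. No extra privacy
    cost is incurred here: the generator is already DP fine-tuned."""
    topics = np.random.choice(len(histogram), size=n_samples, p=histogram)
    return [generate_fn(topic_keywords[t]) for t in topics]

# Hypothetical stand-in for the DP fine-tuned CTCL-Generator:
mock_generate = lambda kws: f"document about {', '.join(kws[:3])}"

hist = dp_topic_histogram([0, 0, 1, 2, 2, 2], num_topics=3)
kws = {0: ["soccer", "match", "goal"],
       1: ["stocks", "market", "rally"],
       2: ["recipe", "baking", "bread"]}
print(generate_synthetic(hist, kws, mock_generate, n_samples=2))
```

Because all privacy expenditure happens in the two DP steps (histogram and fine-tuning), the sampling loop can run indefinitely, which is the source of the "unlimited samples" property described above.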

google

Achieving 10,000x training data reduction with high-fidelity labels

Google Ads researchers have developed a scalable active learning curation process that reduces the volume of training data required for fine-tuning LLMs by up to four orders of magnitude. By iteratively identifying the most informative and diverse examples through clustering and expert review, the method achieves significantly higher human-model alignment than traditional large-scale crowdsourced datasets. This approach effectively addresses the high costs and complexities of classifying ambiguous content, such as unsafe ads, where high-fidelity data is scarce and concept drift is frequent.

### The Iterative Curation Process

* **Initial Labeling:** The process begins with a zero- or few-shot model (LLM-0) that generates a large, typically imbalanced dataset of "positive" and "benign" labels.
* **Clustering and Confusion Identification:** Separate clusters are created for each label set; overlapping clusters indicate areas where the model is confused.
* **Expert Sampling:** Human experts review pairs of examples located near the decision boundary of these overlapping clusters, prioritizing those that cover a larger area of the search space to ensure diversity.
* **Recursive Refinement:** Expert labels are split into fine-tuning and evaluation sets; the model is retrained and the process repeats until model-human alignment plateaus or matches internal expert agreement.

### Measuring Alignment via Cohen's Kappa

* **Metric Selection:** Because ad safety is often subjective, the researchers use Cohen's Kappa instead of precision and recall to measure how well two independent annotators align beyond chance.
* **Performance Benchmarks:** A Kappa value above 0.8 is considered exceptional, while 0.4 is the minimum for acceptability.
* **Goal Alignment:** The curation process aims to move model performance toward the "ceiling" of internal human agreement (which measured between 0.78 and 0.81 in these experiments).

### Experimental Results and Efficiency

* **Model Scaling:** Experiments involved fine-tuning Gemini Nano-1 (1.8B parameters) and Nano-2 (3.25B parameters) on tasks of varying complexity.
* **Drastic Data Reduction:** The curated method reached performance plateaus using fewer than 500 expert-labeled examples, compared to a baseline of 100,000 crowdsourced labels.
* **Quality Gains:** Despite using 10,000x less data, the curated models saw up to a 65% improvement in alignment with human experts over the crowdsourced baselines.
* **Class Balancing:** The process naturally corrected for production imbalances, moving from <1% positive examples in raw traffic to ~40% in the final curated sets.

This curation method is a highly effective strategy for organizations managing high-stakes classification tasks where "ground truth" is subjective or data curation is prohibitively expensive. By shifting focus from data quantity to the quality and diversity of examples at the decision boundary, developers can maintain high-performing models that adapt quickly to evolving safety policies.
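Cohen's Kappa itself is simple to compute: observed agreement corrected for the agreement two annotators would reach by chance, $\kappa = (p_o - p_e)/(1 - p_e)$. A minimal sketch with toy model-vs-expert labels (the labels below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators beyond what chance predicts:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is expected agreement from each annotator's label marginals."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

model  = ["unsafe", "unsafe", "benign", "benign", "unsafe", "benign"]
expert = ["unsafe", "benign", "benign", "benign", "unsafe", "benign"]
print(round(cohens_kappa(model, expert), 3))  # 0.667
```

Note how the chance correction matters for imbalanced safety data: two annotators who both label almost everything "benign" will show high raw agreement but a much lower Kappa, which is exactly why raw accuracy is a poor alignment metric here.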