line Aug 29, 2025

Extracting Trending Keywords from OpenChat

ai machine-learning nlp clustering keyword-extraction minhash statistical-analysis text-mining jaccard-similarity

To enhance user engagement on the LINE OpenChat main screen, LY Corporation developed a system to extract and surface "trending keywords" from real-time message data. By shifting focus from chat room recommendations to content-driven keyword clusters, the team addresses the lack of context in individual messages while providing a more dynamic discovery experience. The approach combines a two-proportion Z-test, which identifies frequency spikes, with MinHash clustering, which removes near-duplicate content, so that the trending topics are both relevant and diverse.

The Shift from Chat Rooms to Content-Driven Recommendations

Traditional recommendations focus on entire chat rooms, which often require significant user effort to investigate and evaluate.
Inspired by micro-blogging services, the team aimed to surface messages as individual content pieces to increase the "main screen visit" KPI.
Because individual chat messages are often fragmented or full of typos, the system groups them by keywords to create meaningful thematic content.

Statistical Detection of Trending Keywords

Simple frequency counts are ineffective because they capture common social fillers like greetings or expressions of gratitude rather than actual trends.
Trends are defined as keywords showing a sharp increase in frequency compared to a baseline from seven days prior.
The system uses a two-proportion Z-test to assign a score to each word, keeping only terms whose frequency has grown by at least 30% over the baseline.
A seven-day comparison window is specifically used to suppress weekly cyclical noise (e.g., mentions of "weekend") and to capture topics whose popularity peaks over several consecutive days.
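
The scoring described above can be sketched in Python as follows. This is a minimal illustration, not the team's implementation: the function names, the pooled-variance form of the statistic, the reading of the 30% threshold as relative growth in a word's share of all tokens, and the handling of words absent from the baseline are all assumptions.

```python
import math


def two_proportion_z(count_now, total_now, count_base, total_base):
    """Pooled two-proportion Z statistic (current window vs. baseline)."""
    p_now = count_now / total_now
    p_base = count_base / total_base
    pooled = (count_now + count_base) / (total_now + total_base)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_now + 1 / total_base))
    if se == 0:
        return 0.0
    return (p_now - p_base) / se


def trending_keywords(now_counts, now_total, base_counts, base_total,
                      min_growth=0.30, top_n=10):
    """Rank words by Z-score, keeping those that grew by at least min_growth.

    now_counts / base_counts map word -> occurrences in the current window
    and in the window seven days earlier (hypothetical input layout).
    """
    scored = []
    for word, c_now in now_counts.items():
        c_base = base_counts.get(word, 0)
        p_now = c_now / now_total
        p_base = c_base / base_total
        # Assumption: a word unseen in the baseline passes the growth filter.
        if p_base > 0 and (p_now - p_base) / p_base < min_growth:
            continue
        z = two_proportion_z(c_now, now_total, c_base, base_total)
        if z > 0:
            scored.append((word, z))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

In this sketch a greeting that appears equally often in both windows is dropped by the growth filter, while a word whose share jumps sharply receives a large positive score.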

MinHash-based Message Deduplication

Redundant messages, such as copy-pasted text, are removed prior to frequency aggregation to prevent skewed results and repetitive user experiences.
The system employs MinHash, a dimensionality reduction technique that compresses each message's token set into a short signature, to identify near-duplicate messages by approximating their Jaccard similarity.
The process involves "shingling" messages into sets of tokens (primarily nouns) and generating $k$-length signatures; messages with identical signatures are clustered together.
To evaluate the efficiency of these clusters without high computational costs, the team developed a "SetDiv" (Set Diversity) metric that runs in linear time.
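
The signature-based clustering above can be illustrated with a short Python sketch. It assumes messages are already tokenized (for example, into nouns) and uses seeded SHA-1 hashes as a stand-in for the hash family; the function names, the default `k=4`, and the input layout are hypothetical and not taken from the original system. The SetDiv metric is not reproduced here because the text does not define it.

```python
import hashlib
from collections import defaultdict


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for reference."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(token_set, k=4):
    """k-length signature: the minimum hash of the set under each of k seeds."""
    return tuple(min(_seeded_hash(seed, t) for t in token_set)
                 for seed in range(k))


def cluster_by_signature(messages, k=4):
    """Group message ids whose token sets share an identical signature.

    messages maps message id -> list of tokens (hypothetical layout).
    """
    clusters = defaultdict(list)
    for msg_id, tokens in messages.items():
        token_set = set(tokens)
        if not token_set:
            continue  # nothing to hash for an empty message
        clusters[minhash_signature(token_set, k)].append(msg_id)
    return list(clusters.values())
```

For each seed, two sets share the same minimum hash with probability equal to their Jaccard similarity, so requiring all `k` positions to match becomes stricter as `k` grows. One representative per cluster can then be kept before keyword frequencies are aggregated.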

By combining Z-test statistical modeling with MinHash deduplication, this methodology successfully transforms fragmented chat data into a structured discovery layer. For developers working with high-volume social data, using a rolling weekly baseline and signature-based clustering offers a scalable way to surface high-velocity trends while filtering out both routine social noise and repetitive content.