Extracting Trending Keywords from Open Chat Messages
To enhance user engagement on the LINE OpenChat main screen, LY Corporation developed a system to extract and surface "trending keywords" from real-time message data. By shifting focus from chat room recommendations to content-driven keyword clusters, the team addresses the lack of context in individual messages while providing a more dynamic discovery experience. The approach combines a two-sample Z-test for proportions, which identifies frequency spikes, with MinHash clustering, which removes near-duplicate content, so that the trending topics are both relevant and diverse.
The Shift from Chat Rooms to Content-Driven Recommendations
- Traditional recommendations focus on entire chat rooms, which often require significant user effort to investigate and evaluate.
- Inspired by micro-blogging services, the team aimed to surface messages as individual content pieces to increase the "main screen visit" KPI.
- Because individual chat messages are often fragmented or full of typos, the system groups them by keywords to create meaningful thematic content.
Statistical Detection of Trending Keywords
- Simple frequency counts are ineffective because they capture common social fillers like greetings or expressions of gratitude rather than actual trends.
- Trends are defined as keywords showing a sharp increase in frequency compared to a baseline from seven days prior.
- The system scores each word with a two-sample Z-test for proportions and keeps only terms whose frequency has grown by at least 30%.
- A seven-day comparison window is specifically used to suppress weekly cyclical noise (e.g., mentions of "weekend") and to capture topics whose popularity peaks over several consecutive days.
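The scoring step described above can be sketched as follows. This is a minimal illustration, not the team's implementation: it assumes the standard pooled form of the two-proportion Z statistic, $z = (\hat{p}_1 - \hat{p}_2) / \sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}$, and it assumes a word's "frequency" is the share of messages in a window that contain it. The function names, the `min_growth` parameter, and the input dictionaries are hypothetical.

```python
import math


def trend_z_score(count_now, total_now, count_base, total_base):
    """Pooled two-proportion Z statistic (assumed form, not confirmed by the source)."""
    p_now = count_now / total_now
    p_base = count_base / total_base
    # Pooled proportion under the null hypothesis that both windows share one rate.
    pooled = (count_now + count_base) / (total_now + total_base)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_now + 1 / total_base))
    if se == 0:
        return 0.0  # word appears in no message or in every message
    return (p_now - p_base) / se


def trending_keywords(now_counts, base_counts, total_now, total_base,
                      min_growth=0.30, top_n=10):
    """Rank words by Z score, keeping those that grew by at least min_growth.

    now_counts / base_counts map word -> number of messages containing it,
    for the current window and for the window seven days earlier.
    """
    scored = []
    for word, c_now in now_counts.items():
        c_base = base_counts.get(word, 0)
        p_now = c_now / total_now
        p_base = c_base / total_base
        # Growth filter: the current share must exceed the baseline share by 30%.
        if p_now < (1 + min_growth) * p_base:
            continue
        scored.append((word, trend_z_score(c_now, total_now, c_base, total_base)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    now = {"hello": 500, "concert": 120}
    base = {"hello": 495, "concert": 10}
    print(trending_keywords(now, base, total_now=10_000, total_base=10_000))
```

In this toy input, "hello" is frequent in both windows, so it fails the growth filter, while "concert" rises sharply against its baseline and receives a high score. This mirrors why raw frequency counts alone are a poor trend signal.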
MinHash-based Message Deduplication
- Redundant messages, such as copy-pasted text, are removed prior to frequency aggregation to prevent skewed results and repetitive user experiences.
- The system employs MinHash, a hashing technique that compresses each message's token set into a short signature, to identify near-duplicate messages by approximating their Jaccard similarity.
- The process involves "shingling" messages into sets of tokens (primarily nouns) and generating $k$-length signatures; messages with identical signatures are clustered together.
- To evaluate the efficiency of these clusters without high computational costs, the team developed a "SetDiv" (Set Diversity) metric that runs in linear time.
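The signature-based clustering described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the production pipeline: whitespace tokenization stands in for the noun extraction mentioned above, the $k$ hash functions are simulated by seeding BLAKE2b with the integers 0 to $k-1$, and all function names are hypothetical. The SetDiv metric is not sketched because its definition is not given here.

```python
import hashlib
from collections import defaultdict


def _seeded_hash(seed, token):
    """64-bit hash of a token under a given seed (stand-in for one hash function)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(token_set, k=4):
    """Length-k signature: the minimum hash of the set under each of k seeds."""
    return tuple(min(_seeded_hash(seed, tok) for tok in token_set)
                 for seed in range(k))


def cluster_by_signature(messages, k=4):
    """Group messages whose MinHash signatures are identical."""
    clusters = defaultdict(list)
    for msg in messages:
        # Assumption: lowercased whitespace tokens replace real noun extraction.
        tokens = set(msg.lower().split())
        if not tokens:
            continue  # skip empty messages, which have no signature
        clusters[minhash_signature(tokens, k)].append(msg)
    return list(clusters.values())


if __name__ == "__main__":
    msgs = [
        "Buy tickets now for the show",
        "buy tickets NOW for the show",
        "what time is lunch today",
    ]
    for group in cluster_by_signature(msgs):
        print(group)
```

With idealized independent hash functions, each signature position matches with probability equal to the Jaccard similarity $J$ of the two token sets, so two messages share an identical signature with probability about $J^k$. A larger $k$ therefore makes clustering stricter, while a smaller $k$ merges looser near-duplicates.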
By combining Z-test statistical modeling with MinHash deduplication, this methodology successfully transforms fragmented chat data into a structured discovery layer. For developers working with high-volume social data, using a rolling weekly baseline and signature-based clustering offers a scalable way to surface high-velocity trends while filtering out both routine social noise and repetitive content.