Extracting Trending Keywords from Open Chat Messages

To enhance user engagement on the LINE OpenChat main screen, LY Corporation developed a system to extract and surface "trending keywords" from real-time message data. By shifting focus from chat room recommendations to content-driven keyword clusters, the team addresses the lack of context in individual messages while providing a more dynamic discovery experience. This approach utilizes a combination of statistical Z-tests to identify frequency spikes and MinHash clustering to eliminate near-duplicate content, ensuring that the trending topics are both relevant and diverse.

The Shift from Chat Rooms to Content-Driven Recommendations

  • Traditional recommendations focus on entire chat rooms, which often require significant user effort to investigate and evaluate.
  • Inspired by micro-blogging services, the team aimed to surface messages as individual content pieces to increase the "main screen visit" KPI.
  • Because individual chat messages are often fragmented or full of typos, the system groups them by keywords to create meaningful thematic content.

Statistical Detection of Trending Keywords

  • Simple frequency counts are ineffective because they capture common social fillers like greetings or expressions of gratitude rather than actual trends.
  • Trends are defined as keywords showing a sharp increase in frequency compared to a baseline from seven days prior.
  • The system uses a Z-test for two-sample proportions to assign a score to each word, and keeps only terms whose frequency grew by at least 30%.
  • A seven-day comparison window is specifically used to suppress weekly cyclical noise (e.g., mentions of "weekend") and to capture topics whose popularity peaks over several consecutive days.
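The scoring step above can be sketched as a pooled two-proportion Z-test. This is a minimal illustration, not the team's implementation: the function name `trend_score`, its parameters, and the choice to apply the 30% growth filter to proportions (rather than raw counts) are all assumptions. With $p_{now}$ and $p_{base}$ as the keyword's share of messages today and seven days earlier, and $\hat{p}$ as the pooled share, the score is $z = (p_{now} - p_{base}) / \sqrt{\hat{p}(1-\hat{p})(1/n_{now} + 1/n_{base})}$.

```python
import math


def trend_score(count_now, total_now, count_base, total_base, min_growth=0.30):
    """Score a keyword with a pooled two-proportion Z-test (hypothetical helper).

    count_now / total_now:   keyword messages and all messages in the current window
    count_base / total_base: the same counts from the baseline window seven days prior
    Returns the z-score, or None if the keyword is filtered out.
    """
    if total_now <= 0 or total_base <= 0:
        return None

    p_now = count_now / total_now
    p_base = count_base / total_base

    # Assumption: the 30% growth filter is applied to proportions, not raw counts.
    if p_now < p_base * (1 + min_growth):
        return None

    # Pooled proportion under the null hypothesis that both windows share one rate.
    pooled = (count_now + count_base) / (total_now + total_base)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_now + 1 / total_base))
    if std_err == 0:
        return None  # keyword absent from (or present in) every message in both windows

    return (p_now - p_base) / std_err


# Usage: a keyword that doubled its share scores high; one that grew 10% is dropped.
rising = trend_score(200, 10_000, 100, 10_000)
flat = trend_score(110, 10_000, 100, 10_000)
```

Ranking keywords by this score favors terms whose growth is both large and statistically unlikely to be noise, which is why everyday greetings with high but stable counts fall away.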

MinHash-based Message Deduplication

  • Redundant messages, such as copy-pasted text, are removed prior to frequency aggregation to prevent skewed results and repetitive user experiences.
  • The system employs MinHash, a dimensionality reduction technique that approximates the Jaccard similarity between sets, to identify near-duplicate messages.
  • The process involves "shingling" messages into sets of tokens (primarily nouns) and generating $k$-length signatures; messages with identical signatures are clustered together.
  • To evaluate the efficiency of these clusters without high computational costs, the team developed a "SetDiv" (Set Diversity) metric that runs in linear time.
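The signature-based clustering described above can be sketched as follows. This is an illustrative version under stated assumptions: messages are assumed to be already tokenized into nouns upstream, the seeded BLAKE2b hash family and the names `minhash_signature` and `cluster_by_signature` are hypothetical choices, and the SetDiv metric is not shown because its definition is not given here.

```python
import hashlib
from collections import defaultdict


def _seeded_hash(token, seed):
    """One member of a hash family: a 64-bit BLAKE2b digest of 'seed:token'."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, k=4):
    """Return a k-length MinHash signature for a message's token set.

    Each signature slot is the minimum hash value over the set under one seed.
    Returns None for an empty token set.
    """
    token_set = set(tokens)
    if not token_set:
        return None
    return tuple(min(_seeded_hash(t, seed) for t in token_set) for seed in range(k))


def cluster_by_signature(messages, k=4):
    """Group message indices whose signatures are identical."""
    clusters = defaultdict(list)
    for idx, tokens in enumerate(messages):
        signature = minhash_signature(tokens, k)
        if signature is not None:
            clusters[signature].append(idx)
    return list(clusters.values())


# Usage: the first two messages share a token set and land in one cluster.
messages = [
    ["concert", "ticket", "sale"],
    ["sale", "ticket", "concert"],
    ["weather", "rain"],
]
clusters = cluster_by_signature(messages)
```

For two token sets, each signature slot matches with probability equal to their Jaccard similarity, so all $k$ slots match with probability $J^k$. A smaller $k$ therefore merges looser near-duplicates, while a larger $k$ clusters only messages that are almost identical.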

By combining Z-test statistical modeling with MinHash deduplication, this methodology successfully transforms fragmented chat data into a structured discovery layer. For developers working with high-volume social data, using a rolling weekly baseline and signature-based clustering offers a scalable way to surface high-velocity trends while filtering out both routine social noise and repetitive content.