keyword-extraction

2 posts

google

A differentially private framework for gaining insights into AI chatbot use

Google Research has introduced Urania, a novel framework designed to extract high-level usage insights from AI chatbot conversations while maintaining rigorous differential privacy (DP) guarantees. Unlike previous heuristic methods that rely on simple redaction or LLM-based PII stripping, this pipeline ensures that no individual user's data can be reconstructed from the resulting summaries. By combining DP clustering and keyword extraction with LLM-based summarization, the system provides a formal, auditable approach to understanding platform trends without compromising sensitive information.

## Limitations of Heuristic Privacy

* Existing frameworks often rely on large language models to strip personally identifiable information (PII) from text before analysis.
* These heuristic protections are difficult to formalize or audit, and their effectiveness may diminish as models evolve or face sophisticated prompt injection attacks.
* The Urania framework addresses these weaknesses by using mathematical privacy budgets (the epsilon parameter) to measure and limit the influence of any single user's data on the final output.

## The Differentially Private Pipeline

* **DP Clustering**: The framework first converts conversation data into numerical embeddings. These are grouped using a DP clustering algorithm, ensuring that cluster centers reflect broad trends rather than specific individual inputs.
* **DP Keyword Extraction**: The system identifies keywords for each cluster and generates a histogram of their frequency. By adding mathematical noise to these counts, the framework masks individual contributions and ensures that only keywords common to many users are retained (see the sketch after this summary).
* **Keyword Generation Methods**: The researchers explored three methods for extraction: LLM-guided selection of relevant terms, a differentially private version of TF-IDF, and an LLM-guided approach that selects terms from a pre-defined list of public keywords.
* **LLM Summarization**: In the final stage, an LLM generates a high-level summary of the cluster using only the noisy, anonymized keywords. Because the LLM never sees the raw conversation text, the "post-processing" property of DP guarantees that the final summary remains private.

## Privacy and Utility Trade-offs

* The framework was tested against a non-private baseline (Simple-CLIO) to evaluate how privacy constraints affect the quality of the insights generated.
* Stronger privacy settings (lower epsilon values) inherently result in a utility trade-off, as the added noise can obscure some niche usage patterns.
* Despite these trade-offs, the framework provides a robust defense against data leakage, as the summarization model is structurally prevented from seeing sensitive original text, making it resilient to prompt injection.

This framework offers a scalable way for platform providers to analyze chatbot usage patterns and enforce safety policies while providing mathematical certainty regarding user privacy. For organizations handling sensitive conversation data, moving from heuristic redaction to formal DP pipelines like Urania provides a more robust and auditable path for service improvement.
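The core privacy mechanism in the keyword-extraction stage is a noisy histogram: each user's contribution is capped, Laplace noise calibrated to the privacy budget is added to the counts, and only keywords clearing a release threshold are passed to the summarization LLM. The sketch below is a minimal illustration of that idea under assumed parameters (the function name, contribution cap, epsilon value, and threshold are all hypothetical), not Urania's actual implementation.

```python
import math
import random
from collections import Counter


def laplace_noise(scale):
    # Inverse-CDF sampling of Laplace(0, scale) noise.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))


def dp_trending_keywords(user_keywords, epsilon=1.0, max_per_user=5, threshold=20.0):
    """Sketch of a differentially private keyword histogram.

    user_keywords: one list of keywords per user (or per conversation).
    epsilon: privacy budget spent on this release (assumed value).
    max_per_user: cap on each user's contribution, which bounds sensitivity.
    threshold: minimum noisy count required to release a keyword.
    """
    counts = Counter()
    for keywords in user_keywords:
        # Deduplicate and cap each user's keywords so no single user can
        # shift the histogram by more than max_per_user counts.
        for kw in list(dict.fromkeys(keywords))[:max_per_user]:
            counts[kw] += 1

    # Noise scaled to sensitivity / epsilon masks any individual contribution.
    scale = max_per_user / epsilon
    noisy = {kw: c + laplace_noise(scale) for kw, c in counts.items()}

    # Only broadly shared keywords survive thresholding; rare, user-specific
    # terms are suppressed before anything reaches the summarization LLM.
    return {kw: c for kw, c in noisy.items() if c >= threshold}


if __name__ == "__main__":
    conversations = [["travel", "visa"], ["travel", "budget"], ["visa"]] * 50
    print(dp_trending_keywords(conversations, epsilon=1.0))
```

Because the downstream LLM only ever sees this noisy, thresholded keyword set, any summary it produces inherits the same DP guarantee by post-processing.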

line

Extracting Trending Keywords from Open Chat Messages

To enhance user engagement on the LINE OpenChat main screen, LY Corporation developed a system to extract and surface "trending keywords" from real-time message data. By shifting focus from chat room recommendations to content-driven keyword clusters, the team addresses the lack of context in individual messages while providing a more dynamic discovery experience. This approach utilizes a combination of statistical Z-tests to identify frequency spikes and MinHash clustering to eliminate near-duplicate content, ensuring that the trending topics are both relevant and diverse.

**The Shift from Chat Rooms to Content-Driven Recommendations**

* Traditional recommendations focus on entire chat rooms, which often require significant user effort to investigate and evaluate.
* Inspired by micro-blogging services, the team aimed to surface messages as individual content pieces to increase the "main screen visit" KPI.
* Because individual chat messages are often fragmented or full of typos, the system groups them by keywords to create meaningful thematic content.

**Statistical Detection of Trending Keywords**

* Simple frequency counts are ineffective because they capture common social fillers like greetings or expressions of gratitude rather than actual trends.
* Trends are defined as keywords showing a sharp increase in frequency compared to a baseline from seven days prior.
* The system uses a Z-test for two-sample proportions to assign a score to each word, filtering for terms with at least a 30% frequency growth (a sketch of this scoring follows the summary).
* A seven-day comparison window is specifically used to suppress weekly cyclical noise (e.g., mentions of "weekend") and to capture topics whose popularity peaks over several consecutive days.

**MinHash-based Message Deduplication**

* Redundant messages, such as copy-pasted text, are removed prior to frequency aggregation to prevent skewed results and repetitive user experiences.
* The system employs MinHash, a dimensionality reduction technique, to identify near-duplicate messages based on Jaccard similarity.
* The process involves "shingling" messages into sets of tokens (primarily nouns) and generating $k$-length signatures; messages with identical signatures are clustered together (see the MinHash sketch below).
* To evaluate the efficiency of these clusters without high computational costs, the team developed a "SetDiv" (Set Diversity) metric that operates in linear time complexity.

By combining Z-test statistical modeling with MinHash deduplication, this methodology successfully transforms fragmented chat data into a structured discovery layer. For developers working with high-volume social data, using a rolling weekly baseline and signature-based clustering offers a scalable way to surface high-velocity trends while filtering out both routine social noise and repetitive content.
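As a rough illustration of the trend-scoring step, the sketch below computes a two-sample proportion Z-statistic for each word, comparing today's relative frequency against the same weekday one week earlier, and only scores words whose frequency grew by at least 30%. The function name, input format, and exact cutoff handling are illustrative assumptions, not LY Corporation's production code.

```python
import math
from collections import Counter


def trending_scores(today_tokens, week_ago_tokens, min_growth=0.30):
    """Score words with a two-sample proportion Z-test against a 7-day-old baseline.

    today_tokens / week_ago_tokens: flat lists of tokens observed in each window.
    min_growth: minimum relative frequency growth (30%, per the article).
    """
    n1, n2 = len(today_tokens), len(week_ago_tokens)
    c1, c2 = Counter(today_tokens), Counter(week_ago_tokens)
    scores = {}
    for word, x1 in c1.items():
        x2 = c2.get(word, 0)
        p1, p2 = x1 / n1, x2 / n2
        # Require a sharp increase over the weekly baseline before scoring;
        # words absent a week ago (p2 == 0) trivially pass the growth filter.
        if p2 > 0 and (p1 - p2) / p2 < min_growth:
            continue
        # Pooled proportion and standard error for the two-sample Z-test.
        p = (x1 + x2) / (n1 + n2)
        se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        if se == 0:
            continue
        scores[word] = (p1 - p2) / se
    return scores


# Words with the highest Z-scores become candidates for the trending-keyword feed.
```

Comparing against the same weekday a week earlier is what suppresses routine weekly cycles: a word like "weekend" spikes every Friday in both windows, so its proportions roughly cancel.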
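The deduplication step can be sketched similarly: each message is shingled into a set of tokens, a fixed-length MinHash signature is computed, and messages with identical signatures are grouped as near-duplicates. The whitespace tokenizer below stands in for the noun-focused tokenization described above, and the signature length `k=16` is an assumed value, so treat this as a sketch rather than the team's implementation.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, k=16):
    """Compute a k-length MinHash signature for a set of tokens.

    Each salt defines one hash function; taking the minimum hash per salt
    approximates a uniform sample from the token set, so matching signatures
    indicate high Jaccard similarity between messages.
    """
    signature = []
    for salt in range(k):
        min_hash = min(
            int.from_bytes(
                hashlib.md5(f"{salt}:{tok}".encode("utf-8")).digest()[:8], "big"
            )
            for tok in tokens
        )
        signature.append(min_hash)
    return tuple(signature)


def deduplicate(messages, k=16):
    """Cluster near-duplicate messages by identical MinHash signatures and
    keep one representative per cluster before frequency aggregation."""
    clusters = defaultdict(list)
    for msg in messages:
        tokens = set(msg.split())  # naive shingling; the article uses noun tokens
        if tokens:
            clusters[minhash_signature(tokens, k)].append(msg)
    return [group[0] for group in clusters.values()]
```

Running the frequency aggregation on the deduplicated output keeps copy-pasted spam from inflating a keyword's count while still letting genuinely popular topics trend.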