information-retrieval

4 posts

meta

Adapting the Facebook Reels RecSys AI Model Based on User Feedback - Engineering at Meta

Meta has enhanced the Facebook Reels recommendation engine by shifting focus from traditional engagement signals, like watch time and likes, to direct user feedback. By implementing the User True Interest Survey (UTIS) model, the system now prioritizes content that aligns with genuine user preferences rather than just short-term interactions. This shift has resulted in significant improvements in recommendation relevance, high-quality content delivery, and long-term user retention.

**Limitations of Engagement-Based Metrics**

* Traditional signals like "likes" and "watch time" are often noisy and may not reflect a user's actual long-term interests.
* Models optimized solely for engagement tend to favor short-term value over the long-term utility of the product.
* Internal research found that previous heuristic-based interest models achieved only 48.3% precision in identifying what users truly care about.
* Effective interest matching requires understanding nuanced factors such as production style, mood, audio, and motivation, which implicit signals often miss.

**The User True Interest Survey (UTIS) Model**

* Meta collects direct feedback via randomized, single-question surveys asking users to rate video interest on a 1–5 scale.
* The raw survey data is binarized to denoise responses and weighted to correct for sampling and nonresponse bias (a minimal sketch of this step follows the summary).
* The UTIS model functions as a lightweight "alignment model layer" built on top of the main multi-task ranking system.
* The architecture uses existing model predictions as input features, supplemented by engineered features that capture content attributes and user behavior.

**Integration into the Ranking Funnel**

* **Late Stage Ranking (LSR):** The UTIS score is used as an additional input feature in the final value formula, allowing the system to boost high-interest videos and demote low-interest ones (see the value-formula sketch below).
* **Early Stage Ranking (Retrieval):** The model aggregates survey data to reconstruct user interest profiles, helping the system source more relevant candidates during the initial retrieval phase.
* **Knowledge Distillation:** Large sequence-based retrieval models are aligned by using UTIS predictions as labels in distillation objectives (see the distillation sketch below).

**Performance and Impact**

* The deployment of UTIS has led to a measurable increase in the delivery of niche, high-quality content.
* Generic, popularity-based recommendations that often lack depth have been reduced.
* Meta observed robust improvements across core metrics, including higher follow rates, more shares, and increased user retention.
* The system now offers better interpretability, allowing engineers to understand which specific factors contribute to a user's sense of "interest match."

To continue improving the Reels ecosystem, Meta plans to double down on personalization, tackling challenges around sparse data and sampling bias while exploring more advanced AI architectures to further diversify recommendations.
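A minimal sketch of the survey post-processing described above, assuming the 1–5 ratings are binarized at a fixed threshold and reweighted with inverse-propensity weights; the threshold, the propensity estimates, and all names are illustrative, not Meta's actual pipeline.

```python
import numpy as np

def binarize_and_weight(ratings, sampling_probs, response_probs, threshold=4):
    """Denoise 1-5 survey ratings into binary interest labels and
    reweight each response to correct for sampling and nonresponse bias.

    ratings:        raw 1-5 survey answers
    sampling_probs: probability each impression was selected for a survey
    response_probs: estimated probability the user answered the survey
    """
    ratings = np.asarray(ratings, dtype=float)
    # Binarize: treat ratings at or above the threshold as "true interest".
    labels = (ratings >= threshold).astype(int)
    # Inverse-propensity weights: hard-to-sample or hard-to-survey segments
    # count more, so the training set better matches the full population.
    weights = 1.0 / (np.asarray(sampling_probs) * np.asarray(response_probs))
    # Normalize so the weights sum to the number of examples.
    weights *= len(weights) / weights.sum()
    return labels, weights

labels, weights = binarize_and_weight(
    ratings=[5, 2, 4, 1], sampling_probs=[0.01, 0.02, 0.01, 0.05],
    response_probs=[0.6, 0.3, 0.5, 0.4])
```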
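For the Late Stage Ranking integration, a toy value formula showing how a UTIS score can boost or demote a video on top of engagement predictions; the linear blend and every weight here are assumptions, since the post does not publish the real formula.

```python
def final_value(predictions, utis_score, utis_weight=0.3):
    """Toy LSR value formula: blend engagement predictions with a UTIS
    alignment score so high-interest videos are boosted and low-interest
    ones demoted. All weights are illustrative only.
    """
    engagement = (0.5 * predictions["p_watch"] +
                  0.3 * predictions["p_like"] +
                  0.2 * predictions["p_share"])
    # Center the UTIS probability at 0.5 so it can push in both directions.
    return engagement + utis_weight * (utis_score - 0.5)

score = final_value({"p_watch": 0.8, "p_like": 0.1, "p_share": 0.05},
                    utis_score=0.9)
```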
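And a hedged sketch of the distillation objective: the large retrieval model (student) is trained against UTIS predictions used as soft labels. The binary cross-entropy form is an assumption; the post only states that UTIS predictions serve as labels.

```python
import torch
import torch.nn.functional as F

def utis_distillation_loss(student_logits, utis_teacher_probs):
    """Align a large sequence-based retrieval model with UTIS by treating
    the (frozen) UTIS model's probabilities as soft labels."""
    return F.binary_cross_entropy_with_logits(
        student_logits, utis_teacher_probs)

# Example: per-candidate logits from the student vs. UTIS teacher scores.
loss = utis_distillation_loss(
    torch.tensor([1.2, -0.4, 0.3]),
    torch.tensor([0.9, 0.2, 0.6]))
```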

line

Building an Enterprise LLM Service 1

LY Corporation's engineering team developed an AI assistant for their private cloud platform, Flava, by prioritizing "context engineering" over traditional prompt engineering. To manage a complex environment of 260 APIs and hundreds of technical documents, they implemented a strategy of progressive disclosure to ensure the LLM receives only the most relevant information for any given query. This approach allows the assistant to move beyond simple RAG-based document summarization to perform active diagnostics and resource management based on real-time API data.

### Performance Limitations of Long Contexts

* Research indicates that LLM performance can drop by 13.9% to 85% as context length increases, even if the model technically supports a large token window.
* The phenomenon of "context rot" occurs when low-quality or irrelevant information is mixed into the input, causing the model to generate confident but incorrect answers.
* Because LLMs are stateless, maintaining conversation history and processing dense JSON responses from multiple APIs quickly exhausts context windows and degrades reasoning quality.

### Progressive Disclosure and Tool Selection

* The system avoids loading all 260+ API definitions at once; instead, it analyzes the user's intent to select only the necessary tools, such as loading only Redis-related APIs when a user asks about a cluster (see the selection sketch after this summary).
* Specific product usage hints, such as the distinction between private and CDN settings for Object Storage, are injected only when those specific services are invoked.
* This phased approach significantly reduces token consumption and prevents the model from being overwhelmed by irrelevant technical specifications.

### Response Guidelines and the "Mock Tool Message" Strategy

* The team distinguished between "System Prompts" (global rules) and "Response Guidelines" (situational instructions), such as directing users to a console UI before suggesting CLI commands.
* Injecting specific guidelines into the system prompt often caused "instruction conflict," where the LLM might hallucinate information to satisfy a guideline while ignoring core requirements like using search tools.
* To resolve these conflicts, the team injected guidelines via "ToolMessages": when instructions are formatted as if they were results from a tool execution, the LLM treats them as factual context rather than as a command that might override the system prompt (sketched below).

To build a robust enterprise LLM service, developers should focus on dynamic context management rather than static prompt optimization. Treating operational guidelines as external data via mock tool messages, rather than system instructions, provides a scalable way to reduce hallucinations and maintain high performance across hundreds of integrated services.
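A minimal sketch of progressive disclosure, assuming a keyword-based intent router over a small tool registry; the registry contents and matching logic are illustrative, not Flava's actual implementation.

```python
# Hypothetical tool registry: each entry maps a product area to the
# tool definitions the LLM may call for that area.
TOOL_REGISTRY = {
    "redis":          ["redis.list_clusters", "redis.get_metrics"],
    "object storage": ["storage.list_buckets", "storage.get_acl"],
    "kubernetes":     ["k8s.list_pods", "k8s.describe_node"],
}

def select_tools(user_query: str) -> list[str]:
    """Progressive disclosure: expose only the tool definitions whose
    product area matches the user's intent, instead of all 260+ APIs."""
    query = user_query.lower()
    selected = []
    for area, tools in TOOL_REGISTRY.items():
        if area in query:
            selected.extend(tools)
    return selected  # empty list -> ask a clarifying question instead

print(select_tools("Why is my Redis cluster slow?"))
# -> ['redis.list_clusters', 'redis.get_metrics']
```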
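The mock-tool-message strategy can be sketched with a plain chat-message list, assuming an OpenAI-style role/tool_calls schema; the guideline text and the get_response_guideline tool name are hypothetical.

```python
def inject_guideline_as_tool_message(messages, guideline: str):
    """Append a situational guideline as if it were the output of a tool
    call, so the LLM treats it as factual context rather than an
    instruction that can conflict with the system prompt."""
    messages.append({
        "role": "assistant",
        "content": None,
        "tool_calls": [{"id": "call_guideline", "type": "function",
                        "function": {"name": "get_response_guideline",
                                     "arguments": "{}"}}],
    })
    messages.append({
        "role": "tool",
        "tool_call_id": "call_guideline",
        "content": guideline,
    })
    return messages

messages = [{"role": "system", "content": "You are the Flava assistant."},
            {"role": "user", "content": "How do I resize my VM?"}]
inject_guideline_as_tool_message(
    messages, "Guide users to the console UI before suggesting CLI commands.")
```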

google

Speech-to-Retrieval (S2R): A new approach to voice search

Google Research has introduced Speech-to-Retrieval (S2R), a direct speech-to-intent engine designed to overcome the fundamental limitations of traditional cascade-based voice search. By bypassing the error-prone intermediate step of text transcription, S2R significantly reduces information loss and prevents minor phonetic errors from derailing search accuracy. This shift from identifying literal words to understanding underlying intent represents an architectural change that promises faster and more reliable search experiences globally.

## Limitations of Cascade Modeling

* Traditional systems rely on Automatic Speech Recognition (ASR) to convert audio into a text string before passing it to a search engine.
* This "cascade" approach suffers from error propagation, where a single phonetic mistake, such as transcribing "The Scream painting" as "The Screen painting", leads to entirely irrelevant search results.
* Textual transcription often results in information loss, as the system may strip away vocal nuances or contextual cues that could help disambiguate the user's actual intent.

## The S2R Architectural Shift

* S2R interprets and retrieves information directly from spoken queries, treating the audio as the primary source of intent rather than a precursor to text.
* The system shifts the technical focus from "What words were said?" to "What information is being sought?", allowing the model to bridge the quality gap between current voice search and human-level understanding.
* This approach is designed to be more robust across different languages and audio conditions by mapping speech features directly to a retrieval space (see the dual-encoder sketch after this summary).

## Evaluating Performance with the SVQ Dataset

* Researchers used Mean Reciprocal Rank (MRR) to evaluate search effectiveness (computed as in the snippet below), comparing real-world ASR systems against "Cascade Groundtruth" models that use perfect, human-verified text.
* The study found that Word Error Rate (WER) is often a poor predictor of search success; a lower WER does not always result in a higher MRR, as the nature of the error matters more than its frequency.
* To facilitate further research, Google has open-sourced the Simple Voice Questions (SVQ) dataset, which includes audio queries in 17 languages and 26 locales.
* The SVQ dataset is integrated into the new Massive Sound Embedding Benchmark (MSEB) to provide a standardized way to measure direct speech-to-intent performance.

The transition to Speech-to-Retrieval signifies a major evolution in how AI handles human voice. For developers and researchers, the release of the SVQ dataset and the focus on MRR over traditional transcription metrics provide a new roadmap for building voice interfaces that are resilient to the phonetic ambiguities of natural speech.
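The post describes mapping speech features directly into a retrieval space; the standard way to realize that is a dual-encoder, sketched below under that assumption. The encoder architectures and dimensions are stand-ins, not Google's production models.

```python
import torch
import torch.nn as nn

class SpeechToRetrieval(nn.Module):
    """Dual-encoder sketch: an audio encoder and a document encoder
    project into a shared space, and retrieval scores are inner
    products. No intermediate transcript is ever produced."""

    def __init__(self, audio_dim=80, doc_dim=768, embed_dim=256):
        super().__init__()
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        self.doc_encoder = nn.Linear(doc_dim, embed_dim)

    def forward(self, audio_features, doc_features):
        # Mean-pool audio frames into a single query embedding.
        q = self.audio_encoder(audio_features).mean(dim=1)
        d = self.doc_encoder(doc_features)
        return q @ d.T  # [queries x documents] relevance scores

model = SpeechToRetrieval()
scores = model(torch.randn(2, 100, 80),  # 2 queries, 100 audio frames each
               torch.randn(5, 768))      # 5 candidate documents
```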
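Mean Reciprocal Rank, the headline metric in the evaluation, is simple to compute; a minimal reference implementation, assuming one relevant result per query:

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR over a set of queries: for each query, take 1/rank of the
    first relevant document (0 if it never appears), then average."""
    total = 0.0
    for results, gold in zip(ranked_results, relevant):
        for rank, doc in enumerate(results, start=1):
            if doc == gold:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Query 1 ranks the right page 1st, query 2 ranks it 3rd:
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z"]], ["a", "z"]))
# -> (1.0 + 1/3) / 2 ≈ 0.667
```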

google

MUVERA: Making multi-vector retrieval as fast as single-vector search

MUVERA is a state-of-the-art retrieval algorithm that simplifies the computationally intensive process of multi-vector retrieval by converting it into a single-vector Maximum Inner Product Search (MIPS). By transforming complex multi-vector sets into Fixed Dimensional Encodings (FDEs), the system maintains the high accuracy of models like ColBERT while achieving the speed and scalability of traditional search infrastructures. This approach allows for efficient retrieval across massive datasets by leveraging highly optimized geometric search techniques that were previously incompatible with multi-vector similarity measures.

## The Limitations of Multi-Vector Retrieval

While traditional models use a single embedding for an entire document, multi-vector models generate an embedding for every token, providing superior semantic depth but creating significant overhead.

* Multi-vector representations lead to a massive increase in embedding volume, requiring more storage and processing power.
* Similarity is typically calculated using "Chamfer matching," a non-linear operation that measures the maximum similarity between query tokens and document tokens (defined precisely in the first snippet after this summary).
* Because Chamfer similarity is more complex than a standard dot product, it cannot directly use sublinear search algorithms, often necessitating expensive exhaustive comparisons.

## Fixed Dimensional Encodings (FDEs)

The core innovation of MUVERA is the reduction of multi-vector sets into a single, manageable vector representation that preserves mathematical relationships.

* FDEs are single vectors designed so that their inner product closely approximates the original multi-vector Chamfer similarity.
* The transformation process is "data-oblivious," meaning the mapping does not need to be trained on or adjusted for specific datasets or changes in data distribution (see the FDE sketch below).
* By compressing an entire vector set into a fixed-length format, MUVERA allows complex data points to be stored and queried using existing single-vector indexing structures.

## The MUVERA Retrieval Pipeline

The algorithm functions as a multi-stage process that prioritizes both speed and precision through a retrieve-and-rerank architecture (composed end to end in the final snippet below).

* **FDE Generation:** Query and document multi-vector sets are mapped into FDEs to capture essential similarity information.
* **MIPS-based Retrieval:** A standard MIPS solver indexes the document FDEs and rapidly identifies a set of likely candidates for a given query.
* **Re-ranking:** The initial candidates are refined using the original, exact Chamfer similarity score to ensure the highest possible accuracy in the final results.

MUVERA provides a practical framework for scaling high-accuracy multi-vector models to massive datasets without the traditional latency penalties. Its ability to bridge the gap between complex semantic modeling and optimized search infrastructure makes it a versatile tool for modern information retrieval systems.
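Chamfer matching has a compact definition: for each query token, take the maximum inner product against all document tokens, then sum over query tokens. A minimal NumPy version:

```python
import numpy as np

def chamfer_similarity(query_vecs, doc_vecs):
    """Chamfer similarity between two multi-vector sets: sum over query
    tokens of the max inner product with any document token.
    query_vecs: [num_query_tokens, dim]; doc_vecs: [num_doc_tokens, dim]
    """
    sims = query_vecs @ doc_vecs.T   # all pairwise dot products
    return sims.max(axis=1).sum()    # best doc token per query token

q = np.random.randn(32, 128)   # e.g. ColBERT-style query token embeddings
d = np.random.randn(200, 128)  # document token embeddings
print(chamfer_similarity(q, d))
```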
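A hedged sketch of FDE construction, following the paper's data-oblivious partitioning idea: random hyperplanes (SimHash) assign each token vector to a bucket, buckets are aggregated and concatenated into one vector whose inner product with another FDE roughly tracks Chamfer similarity. The repetitions and inner projections of the full algorithm are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_fde(vecs, hyperplanes, is_query: bool):
    """Map a multi-vector set to a single Fixed Dimensional Encoding.
    SimHash bucket id = sign pattern against random hyperplanes.
    Query buckets are summed, document buckets averaged, so that
    fde(query) . fde(doc) roughly tracks Chamfer similarity."""
    num_planes, dim = hyperplanes.shape
    num_buckets = 2 ** num_planes
    bits = (vecs @ hyperplanes.T > 0).astype(int)
    bucket_ids = bits @ (1 << np.arange(num_planes))
    fde = np.zeros((num_buckets, dim))
    counts = np.zeros(num_buckets)
    for vec, b in zip(vecs, bucket_ids):
        fde[b] += vec
        counts[b] += 1
    if not is_query:  # average document vectors within each bucket
        fde[counts > 0] /= counts[counts > 0, None]
    return fde.ravel()  # single vector of length num_buckets * dim

planes = rng.standard_normal((4, 128))  # 16 buckets, chosen data-obliviously
q_fde = make_fde(rng.standard_normal((32, 128)), planes, is_query=True)
d_fde = make_fde(rng.standard_normal((200, 128)), planes, is_query=False)
print(q_fde @ d_fde)  # approximates the exact Chamfer similarity
```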
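The retrieve-and-rerank pipeline then composes the two sketches above (chamfer_similarity and make_fde), with brute-force scoring standing in for an optimized MIPS solver:

```python
import numpy as np

def muvera_search(query_vecs, doc_vec_sets, doc_fdes, planes,
                  k=10, n_candidates=100):
    """MUVERA pipeline sketch: (1) encode the query as an FDE,
    (2) run MIPS over precomputed document FDEs to get candidates,
    (3) re-rank the shortlist with exact Chamfer similarity."""
    q_fde = make_fde(query_vecs, planes, is_query=True)
    # Stage 2: brute-force inner products stand in for a real MIPS index.
    scores = doc_fdes @ q_fde
    candidates = np.argsort(scores)[::-1][:n_candidates]
    # Stage 3: exact Chamfer re-ranking of the candidate shortlist.
    reranked = sorted(
        candidates,
        key=lambda i: chamfer_similarity(query_vecs, doc_vec_sets[i]),
        reverse=True)
    return reranked[:k]
```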