Meta

9 posts

engineering.fb.com


Adapting the Facebook Reels RecSys AI Model Based on User Feedback - Engineering at Meta

Meta has enhanced the Facebook Reels recommendation engine by shifting focus from traditional engagement signals, like watch time and likes, to direct user feedback. By implementing the User True Interest Survey (UTIS) model, the system now prioritizes content that aligns with genuine user preferences rather than just short-term interactions. This shift has resulted in significant improvements in recommendation relevance, high-quality content delivery, and long-term user retention.

**Limitations of Engagement-Based Metrics**

* Traditional signals like "likes" and "watch time" are often noisy and may not reflect a user’s actual long-term interests.
* Models optimized solely for engagement tend to favor short-term value over the long-term utility of the product.
* Internal research found that previous heuristic-based interest models achieved only 48.3% precision in identifying what users truly care about.
* Effective interest matching requires understanding nuanced factors such as production style, mood, audio, and motivation, which implicit signals often miss.

**The User True Interest Survey (UTIS) Model**

* Meta collects direct feedback via randomized, single-question surveys asking users to rate video interest on a 1–5 scale.
* The raw survey data is binarized to denoise responses and weighted to correct for sampling and nonresponse bias (a toy version of this step is sketched after the summary).
* The UTIS model functions as a lightweight "alignment model layer" built on top of the main multi-task ranking system.
* The architecture uses existing model predictions as input features, supplemented by engineered features that capture content attributes and user behavior.

**Integration into the Ranking Funnel**

* **Late Stage Ranking (LSR):** The UTIS score is used as an additional input feature in the final value formula, allowing the system to boost high-interest videos and demote low-interest ones.
* **Early Stage Ranking (Retrieval):** The model aggregates survey data to reconstruct user interest profiles, helping the system source more relevant candidates during the initial retrieval phase.
* **Knowledge Distillation:** Large sequence-based retrieval models are aligned using UTIS predictions as labels through distillation objectives.

**Performance and Impact**

* The deployment of UTIS has led to a measurable increase in the delivery of niche, high-quality content.
* Generic, popularity-based recommendations that often lack depth have been reduced.
* Meta observed robust improvements across core metrics, including higher follow rates, more shares, and increased user retention.
* The system now offers better interpretability, allowing engineers to understand which specific factors contribute to a user’s sense of "interest match."

To continue improving the Reels ecosystem, Meta plans to double down on personalization, tackling challenges related to sparse data and sampling bias while exploring more advanced AI architectures to further diversify recommendations.
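The post does not publish the exact labeling scheme, but the binarize-and-reweight step it describes can be sketched minimally. In the toy below, the rating threshold and the propensity fields (`sampling_prob`, `response_prob`) are illustrative assumptions, not UTIS internals:

```python
from dataclasses import dataclass

@dataclass
class SurveyResponse:
    rating: int            # 1-5 interest rating from the in-feed survey
    sampling_prob: float   # probability this impression was surveyed
    response_prob: float   # estimated probability this user responds

def binarize(rating: int, threshold: int = 4) -> int:
    """Collapse the noisy 1-5 scale into a binary interest label."""
    return 1 if rating >= threshold else 0

def weight(resp: SurveyResponse) -> float:
    """Inverse-propensity weight correcting sampling and nonresponse bias."""
    return 1.0 / (resp.sampling_prob * resp.response_prob)

def training_examples(responses: list[SurveyResponse]) -> list[tuple[int, float]]:
    return [(binarize(r.rating), weight(r)) for r in responses]

# A rarely surveyed, rarely answered impression gets a larger weight.
resp = SurveyResponse(rating=5, sampling_prob=0.01, response_prob=0.25)
print(training_examples([resp]))  # [(1, 400.0)]
```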


CSS at Scale With StyleX - Engineering at Meta

Scaling CSS within massive codebases presents unique challenges that traditional styling methods often struggle to solve effectively. Meta’s StyleX addresses these issues by offering a system that combines the intuitive ergonomics of CSS-in-JS with the runtime performance of static CSS. By prioritizing atomic styling and definition deduplication, StyleX minimizes bundle sizes and has become the primary styling standard across Meta's entire suite of applications.

### Performance-Driven Styling Architecture

* Combines a CSS-in-JS developer experience with a compiler that outputs static CSS to ensure high performance and zero runtime overhead.
* Utilizes atomic styling to break CSS down into small, reusable classes, which prevents style sheets from growing linearly with the size of the codebase (a toy model of this idea follows the summary).
* Automatically deduplicates style definitions during the build process, significantly reducing the final bundle size delivered to the client.
* Exposes a simple, consistent API that allows developers to manage complex styles and themes while maintaining type safety.

### Standardization and Industry Adoption

* Serves as the foundational styling system for Meta’s most prominent platforms, including Facebook, Instagram, WhatsApp, Messenger, and Threads.
* Gained significant industry traction beyond Meta, with large-scale organizations such as Figma and Snowflake adopting it for their own web applications.
* Acts as an open-source force multiplier, allowing Meta engineers and the broader community to collaborate on solving CSS-at-scale problems.
* Provides a mature ecosystem that bridges the gap between the flexibility of JavaScript-based styling and the efficiency of traditional CSS.

For engineering teams managing large-scale web applications where bundle size and styling maintainability are critical, StyleX offers a battle-tested framework. Developers can leverage this tool to achieve the performance of static CSS without losing the expressive power of modern JavaScript tooling.
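StyleX itself is a JavaScript compiler; purely to illustrate the atomic-and-deduplicate idea above, here is a language-agnostic toy in Python. The class-name scheme (`x0`, `x1`, ...) and the API are invented for the sketch:

```python
import itertools

class AtomicCompiler:
    """Toy model of atomic CSS: each unique declaration compiles to one
    reusable class, so total CSS grows with the number of distinct
    declarations, not with the number of components that use them."""

    def __init__(self) -> None:
        self._classes: dict[tuple[str, str], str] = {}
        self._ids = itertools.count()

    def compile(self, style: dict[str, str]) -> list[str]:
        names = []
        for prop, value in style.items():
            key = (prop, value)
            if key not in self._classes:        # deduplicate definitions
                self._classes[key] = f"x{next(self._ids)}"
            names.append(self._classes[key])
        return names

    def stylesheet(self) -> str:
        return "\n".join(f".{cls} {{ {p}: {v} }}"
                         for (p, v), cls in self._classes.items())

c = AtomicCompiler()
print(c.compile({"color": "red", "margin": "8px"}))   # ['x0', 'x1']
print(c.compile({"color": "red", "padding": "4px"}))  # ['x0', 'x2'] -- 'color: red' reused
print(c.stylesheet())
```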


Python Typing Survey 2025: Code Quality and Flexibility As Top Reasons for Typing Adoption - Engineering at Meta

The 2025 Typed Python Survey highlights that type hinting has transitioned from an optional feature to a core development standard, with 86% of respondents reporting frequent usage. While mid-career developers show the highest enthusiasm for typing, the ecosystem faces ongoing friction from tooling fragmentation and the complexity of advanced type logic. Overall, the community is pushing for a more robust system that mirrors the expressive power of TypeScript while maintaining Python’s hallmark flexibility.

## Respondent Demographics and Adoption Trends

* The survey analyzed responses from 1,241 developers, the majority of whom are highly experienced, with nearly half reporting over a decade of Python expertise.
* Adoption is highest among developers with 5–10 years of experience (93%), whereas junior developers (83%) and those with over 10 years of experience (80%) show slightly lower usage rates.
* The lower adoption among seniors is attributed to the management of legacy codebases and long-standing habits formed before type hints were introduced to the language.

## Primary Drivers for Typing Adoption

* **Incremental Integration:** Developers value the "gradual typing" approach, which allows them to add types to existing projects at their own pace without breaking the codebase.
* **Improved Tooling and IDE Support:** Typing significantly enhances developer experience by enabling more accurate autocomplete, jump-to-definition, and inline documentation in IDEs.
* **Bug Prevention and Readability:** Type hints act as living documentation that helps catch subtle bugs during refactoring and makes complex codebases easier for teams to reason about.
* **Library Compatibility:** Features like Protocols and Generics are highly appreciated, particularly for their synergy with modern libraries like Pydantic and FastAPI that utilize type annotations at runtime.

## Technical Pain Points and Ecosystem Friction

* **Third-Party Integration:** A major hurdle is the inconsistent quality or total absence of type stubs in massive libraries like NumPy, Pandas, and Django.
* **Tooling Fragmentation:** Developers expressed frustration over inconsistencies between major type checkers like Mypy and Pyright, as well as the slow performance of Mypy in large projects.
* **Conceptual Complexity:** Advanced features such as variance (co/contravariance), decorators, and complex nested Generics remain difficult for many developers to implement correctly.
* **Runtime Limitations:** Because Python does not enforce types at the interpreter level, some developers find it difficult to justify the verbosity of typing when it offers no native runtime guarantees.

## Most Requested Type System Enhancements

* **TypeScript Parity:** There is a strong demand for features found in TypeScript, specifically Intersection types (using the `&` operator), Mapped types, and Conditional types.
* **Utility Types:** Developers are looking for built-in utilities like `Pick`, `Omit`, and `keyof` to handle dictionary shapes more effectively.
* **Improved Structural Typing:** While `TypedDict` exists, respondents want more flexible, anonymous structural typing to handle complex data structures without excessive boilerplate (current `Protocol` and `TypedDict` usage is sketched below).
* **Performance and Enforcement:** There is a recurring request for an official, high-performance built-in type checker and optional runtime enforcement to bridge the gap between static analysis and execution.
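For context on the features the survey cites, this self-contained snippet shows the `Protocol` (structural typing) and `TypedDict` (dictionary shapes) tools that exist in Python today; the requested TypeScript-style `Pick`/`Omit` utilities would operate on shapes like `Movie` below:

```python
import io
from typing import Protocol, TypedDict

class SupportsClose(Protocol):
    """Structural typing via Protocol: any object with a close() method
    conforms, with no explicit inheritance required."""
    def close(self) -> None: ...

def shutdown(resource: SupportsClose) -> None:
    resource.close()

class Movie(TypedDict):
    """TypedDict gives dictionary 'shapes' static structure; respondents
    want this to grow toward TypeScript-style mapped/utility types."""
    title: str
    year: int

def describe(m: Movie) -> str:
    return f"{m['title']} ({m['year']})"

shutdown(io.StringIO("x"))  # StringIO has close(), so it type-checks
print(describe({"title": "Heat", "year": 1995}))
```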
As the Python type system continues to mature, developers should prioritize incremental adoption in shared libraries and internal APIs to maximize the benefits of static analysis. While waiting for more advanced features like intersection types, focusing on tooling consistency—such as aligning team standards around a specific type checker—can mitigate much of the friction identified in the 2025 survey.


DrP: Meta's Root Cause Analysis Platform at Scale - Engineering at Meta

DrP is Meta’s programmatic root cause analysis (RCA) platform designed to automate incident investigations and reduce the burden of manual on-call tasks. By codifying investigation playbooks into executable "analyzers," the platform lowers the mean time to resolve (MTTR) by 20% to 80% for over 300 teams. This systematic approach replaces outdated manual scripts with a scalable backend that executes 50,000 automated analyses daily, providing immediate context when alerts fire.

## Architecture and Core Components

* **Expressive SDK:** Provides a framework for engineers to codify investigation workflows into "analyzers," utilizing a rich library of helper functions and machine learning algorithms.
* **Built-in Analysis Tools:** The platform includes native support for anomaly detection, event isolation, time-series correlation, and dimension analysis to identify specific problem areas.
* **Scalable Backend:** A multi-tenant execution environment manages a worker pool that handles thousands of requests securely and asynchronously.
* **Workflow Integration:** DrP is integrated directly into Meta’s internal alerting and incident management systems, allowing for automatic triggering without human intervention.

## Authoring and Verification Workflow

* **Template Bootstrapping:** Engineers use the SDK to generate boilerplate code that captures required input parameters and context in a type-safe manner.
* **Analyzer Chaining:** The system allows for seamless dependency analysis by passing context between different analyzers, enabling investigations to span multiple interconnected services.
* **Automated Backtesting:** Before deployment, analyzers undergo automated backtesting integrated into the code review process to ensure accuracy and performance.
* **Decision Tree Logic:** Investigation steps are modeled as decision trees within the code, allowing the analyzer to follow different paths based on the data it retrieves (a hypothetical analyzer is sketched after the summary).

## Execution and Post-Processing

* **Trigger-based Analysis:** When an alert is activated, the backend automatically queues the relevant analyzer, ensuring findings are available as soon as an engineer begins triaging.
* **Automated Mitigation:** A post-processing system can take direct action based on investigation results, such as creating tasks or submitting pull requests to resolve identified issues.
* **DrP Insights:** This system periodically reviews historical analysis outputs to identify and rank the top causes of alerts, helping teams prioritize long-term reliability fixes.
* **Alert Annotation:** Results are presented in both human-readable text and machine-readable formats, directly annotating the incident logs for the on-call responder.

## Practical Conclusion

Organizations managing large-scale distributed systems should transition from static markdown playbooks to executable investigation code. By implementing a programmatic RCA framework like DrP, teams can scale their troubleshooting expertise and significantly reduce "on-call fatigue" by automating the repetitive triage steps that typically consume the first hour of an incident.
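DrP's SDK is internal, so everything in this sketch (the helper names, the `Finding` type, the toy detection rules) is hypothetical; it only illustrates how a playbook's decision tree might be codified as an executable analyzer:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    verdict: str
    evidence: dict

# Hypothetical stand-ins for internal SDK helpers, with toy rules so the
# example runs end to end.
def detect_anomaly(metric: str, window_min: int) -> bool:
    return metric.endswith(".cpu")

def correlate_events(metric: str) -> list[str]:
    return []

def analyze_latency_alert(service: str) -> Finding:
    """Decision-tree investigation: each branch codifies one step of the
    playbook an on-call engineer would otherwise run by hand."""
    if detect_anomaly(f"{service}.cpu", window_min=30):
        return Finding("cpu_saturation", {"metric": f"{service}.cpu"})
    deploys = correlate_events(f"{service}.latency_p99")
    if deploys:
        return Finding("recent_deploy", {"events": deploys})
    return Finding("inconclusive", {"next_step": "chain a downstream analyzer"})

print(analyze_latency_alert("feed_ranker"))
```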


How We Built Meta Ray-Ban Display: From Zero to Polish - Engineering at Meta

Meta's development of the Ray-Ban Display AI glasses focuses on bridging the gap between sophisticated hardware engineering and intuitive user interfaces. By pairing the glasses with a neural wristband, the team addresses the fundamental challenge of creating a high-performance wearable that remains comfortable and socially acceptable for daily use. The project underscores the necessity of iterative refinement and cross-disciplinary expertise to transition from a technical prototype to a polished consumer product.

### Hardware Engineering and Physics

* The design process draws parallels between hardware architecture and particle physics, emphasizing the high-precision requirements of miniaturizing components.
* Engineers must manage the strict physical constraints of the Ray-Ban form factor while integrating advanced AI processing and thermal management.
* The development culture prioritizes celebrating incremental technical wins to maintain momentum during the long cycle from "zero to polish."

### Display Technology and UI Evolution

* The glasses utilize a unique display system designed to provide visual overlays without obstructing the wearer’s natural field of vision.
* The team is developing emerging UI patterns specifically for head-mounted displays, moving away from traditional touch-screen paradigms toward more contextual interactions.
* Refining the user experience involves balancing the information density of the display with the need for a non-intrusive, "heads-up" interface.

### The Role of Neural Interfaces

* The Ray-Ban Display is packaged with the Meta Neural Band, an electromyography (EMG) wristband that translates motor nerve signals into digital commands.
* This wrist-based input mechanism provides a discreet, low-friction way to control the glasses' interface without the need for voice commands or physical buttons.
* Integrating EMG technology represents a shift toward human-computer interfaces that are intended to feel like an extension of the user's own body.

To successfully build the next generation of wearables, engineering teams should look toward multi-modal input systems, combining visual displays with neural interfaces, to solve the ergonomic and social challenges of hands-free computing.


How AI Is Transforming the Adoption of Secure-by-Default Mobile Frameworks - Engineering at Meta

Meta utilizes secure-by-default frameworks to wrap potentially unsafe operating system and third-party functions, ensuring security is integrated into the development process without sacrificing developer velocity. By leveraging generative AI and automation, the company scales the adoption of these frameworks across its massive codebase, effectively mitigating risks such as Android intent hijacking. This approach balances high-level security enforcement with the practical need for friction-free developer experiences.

## Design Principles for Secure-by-Default Frameworks

To ensure high adoption and long-term viability, Meta follows specific architectural guidelines when building security wrappers:

* **API Mirroring:** Secure framework APIs are designed to closely resemble the existing native APIs they replace (e.g., mirroring the Android Context API). This reduces the cognitive burden on developers and simplifies the use of automated tools for code conversion.
* **Reliance on Public Interfaces:** Frameworks are built exclusively on public and stable APIs. Avoiding private or undocumented OS interfaces prevents maintenance "fire drills" and ensures the frameworks remain functional across various OS updates.
* **Modularity and Reach:** Rather than creating a single monolithic tool, Meta develops small, modular libraries that target specific security issues while remaining usable across all apps and platform versions.
* **Friction Reduction:** Frameworks must avoid introducing excessive complexity or noticeable performance overhead in terms of CPU and RAM, as high friction often leads developers to bypass security measures entirely.

## SecureLinkLauncher: Preventing Android Intent Hijacking

SecureLinkLauncher (SLL) is a primary example of a secure-by-default framework designed to stop sensitive data from leaking via the Android intent system (the wrapper pattern is sketched after this summary).

* **Wrapped Execution:** SLL wraps native Android methods such as `startActivity()` and `startActivityForResult()`. Instead of calling `context.startActivity(intent)`, developers use `SecureLinkLauncher.launchInternalActivity(intent, context)`.
* **Scope Verification:** The framework enforces scope verification before delegating to the native API. This ensures that intents are directed to intended "family" apps rather than being intercepted by malicious third-party applications.
* **Mitigating Implicit Intents:** SLL addresses the risks of untargeted intents, which can be received by any app with a matching intent-filter. By enforcing a developer-specified scope, SLL ensures that data like `SECRET_INFO` is only accessible to authorized packages.

## Scaling Adoption through AI and Automation

The transition from legacy, insecure patterns to secure frameworks is managed through a combination of automated tooling and artificial intelligence.

* **Automated Migration:** Generative AI identifies insecure usage patterns across Meta’s vast codebase and suggests, or automatically applies, the appropriate secure framework replacements.
* **Continuous Monitoring:** Automation tools continuously scan the codebase to ensure compliance with secure-by-default standards, preventing the reintroduction of vulnerable code.
* **Scaling Consistency:** By reducing the manual effort required for refactoring, AI enables consistent security enforcement across different teams and applications without slowing down the shipping cycle.
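SLL itself wraps Android (Java/Kotlin) APIs; as a language-agnostic illustration of the mirror-verify-delegate pattern described above, here is a Python transposition in which the package allowlist and all function names are invented for the sketch:

```python
# Hypothetical "family" scope; the real scope is developer-specified per call site.
ALLOWED_PACKAGES = {"com.facebook.katana", "com.instagram.android"}

class ScopeError(Exception):
    pass

def launch_internal_activity(target_package: str, extras: dict, start_activity) -> None:
    """Mirror the unsafe launcher's shape, verify scope, then delegate
    to the underlying (unsafe) native call."""
    if target_package not in ALLOWED_PACKAGES:
        raise ScopeError(f"{target_package} is outside the verified app scope")
    start_activity(target_package, extras)

# The wrapper drops in where a raw start_activity(...) call used to be.
launch_internal_activity(
    "com.instagram.android",
    {"SECRET_INFO": "token"},
    start_activity=lambda pkg, extras: print("launched", pkg),
)
```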
For organizations managing large-scale mobile codebases, the recommended approach is to build thin, developer-friendly wrappers around risky platform APIs and utilize automated refactoring tools to drive adoption. This ensures that security becomes an invisible, default component of the development lifecycle rather than a manual checklist.


Zoomer: Powering AI Performance at Meta's Scale Through Intelligent Debugging and Optimization - Engineering at Meta

Zoomer is Meta’s centralized, automated platform designed to solve performance bottlenecks and GPU underutilization across its massive AI training and inference infrastructure. By integrating deep analytics with scalable data collection, the tool has become the internal standard for optimizing workloads ranging from Llama 3 training to large-scale ads recommendation engines. Ultimately, Zoomer enables significant energy savings and hardware efficiency gains, allowing Meta to accelerate model iteration and increase throughput across its global fleet of GPUs.

### The Three-Layered Architecture

* **Infrastructure and Platform Layer:** This foundation utilizes Meta’s Manifold blob storage for trace data and employs fault-tolerant processing pipelines to manage massive trace files across thousands of hosts.
* **Analytics and Insights Engine:** This layer performs deep analysis using specialized tools such as Kineto for GPU traces, NVIDIA DCGM for hardware metrics, and StrobeLight for CPU profiling. It automatically detects performance anti-patterns and provides actionable optimization recommendations.
* **Visualization and User Interface Layer:** The presentation layer transforms complex data into interactive timelines and heat maps. It integrates with Perfetto for kernel-level inspection and provides drill-down dashboards that highlight outliers across distributed GPU deployments.

### Automated Profiling and Data Capture

* **Trigger Mechanisms:** To ensure data accuracy, Zoomer automatically triggers profiling for training workloads during stable states (typically around iteration 550) to avoid startup noise, while inference workloads use on-demand or benchmark-integrated triggers (see the profiling sketch after the summary).
* **Comprehensive Metrics:** The platform simultaneously collects GPU SM utilization, Tensor Core usage, memory bandwidth, and power consumption via DCGM.
* **System-Level Telemetry:** Beyond the GPU, Zoomer captures host-level data including CPU utilization, storage access patterns, and network I/O through dyno telemetry.
* **Distributed Communication:** For large-scale training, the tool analyzes NCCL collective operations and inter-node communication patterns to identify stragglers and network bottlenecks.

### Inference and Training Optimization

* **Inference Performance:** Zoomer tracks request/response latency, GPU memory allocation patterns, and Thrift request-level profiling to identify bottlenecks in serving user requests at scale.
* **Workflow Acceleration:** By correlating application-level annotations, such as forward/backward passes and optimizer steps, with hardware performance, developers can pinpoint exactly which part of a model's execution is inefficient.
* **Operational Impact:** These insights have led to significant improvements in Queries Per Second (QPS) for recommendation models and reduced training times for generative AI features by eliminating resource waste.

For organizations managing large-scale AI clusters, the Zoomer model suggests that the key to efficiency is moving away from manual, reactive debugging toward an "always-on" automated profiling system. Correlating high-level software phases with low-level hardware telemetry is essential for maximizing the return on investment for expensive GPU resources and maintaining rapid iteration cycles.
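Zoomer's trigger machinery is internal, but the post names Kineto, which also backs PyTorch's open-source profiler. A rough approximation of a "stable-state" training capture with that public API, skipping the noisy startup iterations the post describes (the training step here is a stand-in):

```python
import torch
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

def train_one_step() -> None:
    # Stand-in for a real forward/backward/optimizer step.
    torch.ones(512, 512).matmul(torch.ones(512, 512))

# Skip the first 550 iterations, warm up, then capture a short window
# (mirroring the "stable state around iteration 550" trigger idea).
prof_schedule = schedule(skip_first=550, wait=0, warmup=5, active=10, repeat=1)
activities = [ProfilerActivity.CPU] + (
    [ProfilerActivity.CUDA] if torch.cuda.is_available() else [])

with profile(
    activities=activities,
    schedule=prof_schedule,
    on_trace_ready=tensorboard_trace_handler("./traces"),
) as prof:
    for step in range(600):
        train_one_step()
        prof.step()  # advances the profiler's schedule each iteration
```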


Key Transparency Comes to Messenger - Engineering at Meta

Messenger has enhanced the security of its end-to-end encrypted chats by launching key transparency, a system that provides an automated, verifiable record of public encryption keys. By moving beyond manual key comparisons, this feature ensures that users can verify their contacts' identities without technical friction, even when those contacts use multiple devices. This implementation allows Messenger to provide a higher level of assurance that no third party, including Meta, has tampered with or swapped the keys used to secure a conversation.

## The Role of Key Transparency in Encrypted Messaging

* Provides a verifiable and auditable record of public keys, ensuring that messages are always encrypted with the correct keys for the intended recipient.
* Prevents "man-in-the-middle" attacks by a compromised server by making any unauthorized key changes visible to the system.
* Simplifies the user experience by automating the verification process, which previously required users to manually compare long strings of characters across every device their contact owned.

## Architecture and Third-Party Auditing

* Built upon the open-source Auditable Key Directory (AKD) library, which was previously used to implement similar security properties for WhatsApp.
* Partners with Cloudflare to act as a third-party auditor, maintaining a public Key Transparency Dashboard that allows anyone to verify the integrity of the directory.
* Leverages an "epoch" system where the directory is updated and published frequently to ensure that the global log of keys remains current and immutable (a toy hash-chained log after the summary illustrates the idea).

## Scaling for Global Messenger Traffic

* Manages a massive database that has already grown to billions of entries, reflecting the high volume of users and the fact that Messenger indexes keys for every individual device a user logs into.
* Operates at a high frequency, publishing a new epoch approximately every two minutes, with each update containing hundreds of thousands of new key entries.
* Optimized the algorithmic efficiency of the AKD library to ensure that cryptographic proof sizes remain small and manageable, even as the number of updates for a single key grows over time.

## Infrastructure Resilience and Recovery

* Improved the system's ability to handle temporary outages and long delays in key sequencing, drawing on two years of operational data from the WhatsApp implementation.
* Replaced older proof methods that grew linearly with the height of the transparency tree with more efficient operations to maintain high availability and real-time verification speeds.
* Established a robust recovery process to ensure that the transparency log remains consistent even after infrastructure disruptions.

By automating the verification of encryption keys through a transparent, audited directory, Messenger has made sophisticated cryptographic security accessible to billions of users. This rollout represents a significant shift in how trust is managed in digital communications, replacing manual user checks with a seamless, background-level guarantee of privacy.
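The real system uses the AKD library's Merkle-style verifiable directory with cryptographic proofs; the toy below is not AKD's construction and captures only the append-only epoch idea, that each published root commits to everything before it:

```python
import hashlib
import json

class EpochLog:
    """Toy append-only log: each epoch's root hashes the previous root
    together with the new key updates, so retroactively swapping any
    published key would change every later root and be visible to auditors."""

    def __init__(self) -> None:
        self.roots = [b"\x00" * 32]  # genesis root

    def publish_epoch(self, key_updates: dict[str, str]) -> str:
        payload = json.dumps(key_updates, sort_keys=True).encode()
        root = hashlib.sha256(self.roots[-1] + payload).digest()
        self.roots.append(root)
        return root.hex()

log = EpochLog()
print(log.publish_epoch({"alice/device1": "pubkey-a1"}))
print(log.publish_epoch({"bob/device1": "pubkey-b1"}))  # commits to epoch 1 too
```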


Efficient Optimization With Ax, an Open Platform for Adaptive Experimentation - Engineering at Meta

Meta has released Ax 1.0, an open-source platform designed to automate and optimize complex, resource-intensive experimentation through machine learning. By utilizing Bayesian optimization, the platform helps researchers navigate vast configuration spaces to improve AI models, infrastructure, and hardware design efficiently. The release aims to bridge the gap between sophisticated mathematical theory and the practical requirements of production-scale engineering.

## Real-World Experimentation and Utility

* Ax is used extensively at Meta for diverse tasks, including tuning hyperparameter configurations, discovering optimal data mixtures for Generative AI, and optimizing compiler flags.
* The platform is built to handle the logistical "overhead" of experimentation, such as managing experiment states, automating orchestration, and providing diagnostic tools.
* It supports multi-objective optimization, allowing users to balance competing metrics and enforce "guardrail" constraints rather than just maximizing a single value.
* Applications extend beyond software to physical engineering, such as optimizing design parameters for AR/VR hardware.

## System Insight and Analysis

* Beyond finding optimal points, Ax serves as a diagnostic tool to help researchers understand the underlying behavior of their systems.
* It includes built-in visualizations for Pareto frontiers, which illustrate the trade-offs between different metrics.
* Sensitivity analysis tools identify which specific input parameters have the greatest impact on the final results.
* The platform provides automated plots and tables to track optimization progress and visualize the effect of parameters across the entire input space.

## Technical Methodology and Architecture

* Ax utilizes Bayesian optimization, an iterative approach that balances "exploration" (sampling new areas) with "exploitation" (refining known good areas).
* The platform relies on **BoTorch** for its underlying Bayesian components and typically employs **Gaussian processes (GP)** as surrogate models.
* GPs are preferred because they can make accurate predictions and quantify uncertainty even when provided with very few data points.
* The system uses an **Expected Improvement (EI)** acquisition function to calculate the potential value of new configurations compared to the current best-known result (the closed-form EI computation is sketched below).
* This surrogate-based approach is designed to scale to high-dimensional settings involving hundreds of tunable parameters where traditional search methods are too costly.

To begin implementing these methods, developers can install the platform via `pip install ax-platform`. Ax 1.0 provides a robust framework for moving cutting-edge optimization research directly into production environments.
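Ax and BoTorch implement this machinery on top of fitted GP posteriors; purely to make the acquisition step concrete, here is the standard closed-form EI for maximization, with stub posterior values standing in for a real surrogate model (this is the textbook formula, not Ax's API):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu: np.ndarray, sigma: np.ndarray, f_best: float) -> np.ndarray:
    """Closed-form EI for maximization, given a surrogate's posterior mean
    and standard deviation (sigma > 0) at candidate points:
        EI(x) = (mu - f*) * Phi(z) + sigma * phi(z),  z = (mu - f*) / sigma
    """
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([1.0, 1.0])      # posterior means from a GP surrogate (stubbed)
sigma = np.array([0.3, 0.01])  # posterior uncertainty at each candidate
print(expected_improvement(mu, sigma, f_best=1.0))
# With equal means, the more uncertain candidate scores far higher:
# the exploration/exploitation trade-off falls out of one formula.
```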