Meta / automation



DrP: Meta's Root Cause Analysis Platform at Scale

DrP is Meta’s programmatic root cause analysis (RCA) platform, designed to automate incident investigations and reduce the burden of manual on-call tasks. By codifying investigation playbooks into executable "analyzers," the platform lowers mean time to resolve (MTTR) by 20% to 80% for over 300 teams. This systematic approach replaces outdated manual scripts with a scalable backend that executes 50,000 automated analyses daily, providing immediate context when alerts fire.

## Architecture and Core Components

* **Expressive SDK:** Provides a framework for engineers to codify investigation workflows into "analyzers," using a rich library of helper functions and machine learning algorithms.
* **Built-in Analysis Tools:** The platform includes native support for anomaly detection, event isolation, time-series correlation, and dimension analysis to identify specific problem areas.
* **Scalable Backend:** A multi-tenant execution environment manages a worker pool that handles thousands of requests securely and asynchronously.
* **Workflow Integration:** DrP is integrated directly into Meta’s internal alerting and incident management systems, so analyses are triggered automatically without human intervention.

## Authoring and Verification Workflow

* **Template Bootstrapping:** Engineers use the SDK to generate boilerplate code that captures required input parameters and context in a type-safe manner.
* **Analyzer Chaining:** Analyzers pass context to one another, so a dependency investigation can span multiple interconnected services.
* **Automated Backtesting:** Before deployment, analyzers undergo automated backtesting integrated into the code review process to ensure accuracy and performance.
* **Decision Tree Logic:** Investigation steps are modeled as decision trees within the code, allowing the analyzer to follow different paths based on the data it retrieves.

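The authoring ideas above (typed context, decision-tree branching, analyzer chaining, and backtesting) can be illustrated with a minimal Python sketch. This is not DrP's SDK, which the summary does not show; `Context`, `error_rate_analyzer`, `dependency_analyzer`, `backtest`, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


# Hypothetical types for illustration only; not DrP's real SDK.
@dataclass
class Context:
    """Typed inputs for an analyzer, plus the findings gathered so far."""
    service: str
    metrics: Dict[str, float]
    findings: List[str] = field(default_factory=list)


Analyzer = Callable[[Context], Context]


def error_rate_analyzer(ctx: Context) -> Context:
    """Decision tree: each branch selects the next investigation step."""
    if ctx.metrics.get("error_rate", 0.0) < 0.01:  # assumed threshold
        ctx.findings.append(f"{ctx.service}: error rate is normal")
        return ctx
    if ctx.metrics.get("deploys_last_hour", 0) > 0:
        ctx.findings.append(f"{ctx.service}: errors follow a recent deploy")
        return ctx
    ctx.findings.append(
        f"{ctx.service}: errors without a recent deploy; checking dependencies"
    )
    # Chaining: hand the same context to the next analyzer.
    return dependency_analyzer(ctx)


def dependency_analyzer(ctx: Context) -> Context:
    """Second analyzer in the chain; reuses the context passed to it."""
    if ctx.metrics.get("upstream_latency_ms", 0.0) > 500:  # assumed threshold
        ctx.findings.append(f"{ctx.service}: upstream latency is elevated")
    else:
        ctx.findings.append(f"{ctx.service}: no dependency anomaly found")
    return ctx


def backtest(analyzer: Analyzer, cases: List[Tuple[Context, str]]) -> float:
    """Fraction of labeled past incidents where the expected finding appears."""
    hits = sum(1 for ctx, expected in cases if expected in analyzer(ctx).findings)
    return hits / len(cases)
```

A review-time check in this style would run `backtest` over labeled historical incidents and block the change if the score drops below a threshold the team chooses.
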
## Execution and Post-Processing

* **Trigger-based Analysis:** When an alert is activated, the backend automatically queues the relevant analyzer, ensuring findings are available as soon as an engineer begins triaging.
* **Automated Mitigation:** A post-processing system can take direct action based on investigation results, such as creating tasks or submitting pull requests to resolve identified issues.
* **DrP Insights:** This system periodically reviews historical analysis outputs to identify and rank the top causes of alerts, helping teams prioritize long-term reliability fixes.
* **Alert Annotation:** Results are presented in both human-readable text and machine-readable formats, directly annotating the incident logs for the on-call responder.

## Practical Conclusion

Organizations managing large-scale distributed systems should transition from static markdown playbooks to executable investigation code. By implementing a programmatic RCA framework like DrP, teams can scale their troubleshooting expertise and significantly reduce "on-call fatigue" by automating the repetitive triage steps that typically consume the first hour of an incident.
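
The execution flow described under Execution and Post-Processing (queue on alert, run the analyzer, annotate, post-process, then rank causes over history) can be modeled in a few lines of Python. This is a single-process sketch under stated assumptions, not Meta's backend: `Alert`, `Result`, `Backend`, and every method name are hypothetical, and a real system would use a distributed worker pool rather than an in-memory deque.

```python
from collections import Counter, deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Tuple


# All names below are hypothetical; the source does not show DrP's interfaces.
@dataclass
class Alert:
    alert_id: str
    name: str


@dataclass
class Result:
    alert_id: str
    root_cause: str  # machine-readable label
    summary: str     # human-readable text for the on-call responder


class Backend:
    def __init__(self, registry: Dict[str, Callable[[Alert], Result]]):
        self.registry = registry                 # alert name -> analyzer
        self.queue: Deque[Alert] = deque()
        self.history: List[Result] = []
        self.annotations: Dict[str, str] = {}
        self.post_processors: List[Callable[[Result], None]] = []

    def on_alert(self, alert: Alert) -> bool:
        """Trigger: queue the alert if an analyzer is registered for it."""
        if alert.name not in self.registry:
            return False
        self.queue.append(alert)
        return True

    def drain(self) -> None:
        """Worker loop: analyze, annotate the alert, then post-process."""
        while self.queue:
            alert = self.queue.popleft()
            result = self.registry[alert.name](alert)
            self.annotations[alert.alert_id] = result.summary
            self.history.append(result)
            for action in self.post_processors:  # e.g. open a task
                action(result)

    def top_causes(self, n: int = 3) -> List[Tuple[str, int]]:
        """Insights-style ranking of the most frequent root causes."""
        return Counter(r.root_cause for r in self.history).most_common(n)
```

Keeping both `root_cause` and `summary` on the result mirrors the split between machine-readable output, which ranking and post-processing consume, and the text shown to the responder.
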


How AI Is Transforming the Adoption of Secure-by-Default Mobile Frameworks

Meta uses secure-by-default frameworks to wrap potentially unsafe operating system and third-party functions, ensuring security is integrated into the development process without sacrificing developer velocity. By leveraging generative AI and automation, the company scales the adoption of these frameworks across its massive codebase, effectively mitigating risks such as Android intent hijacking. This approach balances high-level security enforcement with the practical need for friction-free developer experiences.

## Design Principles for Secure-by-Default Frameworks

To ensure high adoption and long-term viability, Meta follows specific architectural guidelines when building security wrappers:

* **API Mirroring:** Secure framework APIs are designed to closely resemble the existing native APIs they replace (e.g., mirroring the Android Context API). This reduces the cognitive burden on developers and simplifies the use of automated tools for code conversion.
* **Reliance on Public Interfaces:** Frameworks are built exclusively on public and stable APIs. Avoiding private or undocumented OS interfaces prevents maintenance "fire drills" and ensures the frameworks remain functional across various OS updates.
* **Modularity and Reach:** Rather than creating a single monolithic tool, Meta develops small, modular libraries that target specific security issues while remaining usable across all apps and platform versions.
* **Friction Reduction:** Frameworks must avoid introducing excessive complexity or noticeable CPU and RAM overhead, as high friction often leads developers to bypass security measures entirely.

## SecureLinkLauncher: Preventing Android Intent Hijacking

SecureLinkLauncher (SLL) is a primary example of a secure-by-default framework designed to stop sensitive data from leaking via the Android intent system.

* **Wrapped Execution:** SLL wraps native Android methods such as `startActivity()` and `startActivityForResult()`. Instead of calling `context.startActivity(intent)`, developers use `SecureLinkLauncher.launchInternalActivity(intent, context)`.
* **Scope Verification:** The framework enforces scope verification before delegating to the native API. This ensures that intents are directed to intended "family" apps rather than being intercepted by malicious third-party applications.
* **Mitigating Implicit Intents:** SLL addresses the risks of untargeted intents, which can be received by any app with a matching intent-filter. By enforcing a developer-specified scope, SLL ensures that data like `SECRET_INFO` is only accessible to authorized packages.

## Scaling Adoption through AI and Automation

The transition from legacy, insecure patterns to secure frameworks is managed through a combination of automated tooling and artificial intelligence.

* **Automated Migration:** Generative AI identifies insecure usage patterns across Meta’s vast codebase and suggests, or automatically applies, the appropriate secure framework replacements.
* **Continuous Monitoring:** Automation tools continuously scan the codebase to ensure compliance with secure-by-default standards, preventing the reintroduction of vulnerable code.
* **Scaling Consistency:** By reducing the manual effort required for refactoring, AI enables consistent security enforcement across different teams and applications without slowing down the shipping cycle.

For organizations managing large-scale mobile codebases, the recommended approach is to build thin, developer-friendly wrappers around risky platform APIs and use automated refactoring tools to drive adoption. This ensures that security becomes an invisible, default component of the development lifecycle rather than a manual checklist.
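
The scope verification described under SecureLinkLauncher can be modeled outside Android to show the control flow: check the intent's target against an allowed scope, and only then delegate to the native launch call. The Python below is a language-neutral sketch, not SLL's implementation or the Android API; `Intent`, `ScopeViolation`, `FAMILY_PACKAGES`, the package names, and `launch_internal_activity` are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Optional


# Hypothetical model of an Android intent; not the real android.content.Intent.
@dataclass
class Intent:
    action: str
    target_package: Optional[str] = None  # None models an implicit intent
    extras: Dict[str, str] = field(default_factory=dict)


class ScopeViolation(Exception):
    """Raised when an intent would leave the developer-specified scope."""


# Assumed allow-list of "family" apps; the package names are invented.
FAMILY_PACKAGES: FrozenSet[str] = frozenset(
    {"com.example.app", "com.example.messenger"}
)


def launch_internal_activity(
    intent: Intent,
    start_activity: Callable[[Intent], None],
    allowed: FrozenSet[str] = FAMILY_PACKAGES,
) -> None:
    """Verify scope first, then delegate to the wrapped native call."""
    if intent.target_package is None:
        # An untargeted intent could be received by any app with a
        # matching intent-filter, so it is rejected outright here.
        raise ScopeViolation("implicit intent has no target package")
    if intent.target_package not in allowed:
        raise ScopeViolation(f"{intent.target_package} is outside the allowed scope")
    start_activity(intent)
```

Because the wrapper takes the same intent the native call would, swapping call sites is close to mechanical, which is the property the API Mirroring principle relies on for automated migration.
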