automation

6 posts

meta

DrP: Meta's Root Cause Analysis Platform at Scale - Engineering at Meta

DrP is Meta’s programmatic root cause analysis (RCA) platform, designed to automate incident investigations and reduce the burden of manual on-call tasks. By codifying investigation playbooks into executable "analyzers," the platform has cut mean time to resolve (MTTR) by 20% to 80% for more than 300 teams. This systematic approach replaces outdated manual scripts with a scalable backend that executes 50,000 automated analyses daily, providing immediate context when alerts fire.

## Architecture and Core Components

* **Expressive SDK:** Provides a framework for engineers to codify investigation workflows into "analyzers," utilizing a rich library of helper functions and machine learning algorithms.
* **Built-in Analysis Tools:** The platform includes native support for anomaly detection, event isolation, time-series correlation, and dimension analysis to identify specific problem areas.
* **Scalable Backend:** A multi-tenant execution environment manages a worker pool that handles thousands of requests securely and asynchronously.
* **Workflow Integration:** DrP is integrated directly into Meta’s internal alerting and incident management systems, allowing analyses to be triggered automatically without human intervention.

## Authoring and Verification Workflow

* **Template Bootstrapping:** Engineers use the SDK to generate boilerplate code that captures required input parameters and context in a type-safe manner.
* **Analyzer Chaining:** The system passes context between different analyzers, enabling investigations to span multiple interconnected services for seamless dependency analysis.
* **Automated Backtesting:** Before deployment, analyzers undergo automated backtesting integrated into the code review process to ensure accuracy and performance.
* **Decision Tree Logic:** Investigation steps are modeled as decision trees within the code, allowing the analyzer to follow different paths based on the data it retrieves (a hypothetical sketch appears at the end of this summary).

## Execution and Post-Processing

* **Trigger-based Analysis:** When an alert fires, the backend automatically queues the relevant analyzer, ensuring findings are available as soon as an engineer begins triaging.
* **Automated Mitigation:** A post-processing system can take direct action based on investigation results, such as creating tasks or submitting pull requests to resolve identified issues.
* **DrP Insights:** This system periodically reviews historical analysis outputs to identify and rank the top causes of alerts, helping teams prioritize long-term reliability fixes.
* **Alert Annotation:** Results are presented in both human-readable text and machine-readable formats, directly annotating the incident logs for the on-call responder.

## Practical Conclusion

Organizations managing large-scale distributed systems should transition from static markdown playbooks to executable investigation code. By implementing a programmatic RCA framework like DrP, teams can scale their troubleshooting expertise and significantly reduce "on-call fatigue" by automating the repetitive triage steps that typically consume the first hour of an incident.
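DrP's SDK is internal and the post includes no code, so the following Kotlin sketch only illustrates the "investigation steps as a decision tree" idea. Every identifier here (`AlertContext`, `Finding`, `LatencyRegressionAnalyzer`, and the injected lookup functions) is hypothetical and does not reflect Meta's actual SDK.

```kotlin
// Hypothetical sketch of a DrP-style analyzer; none of these types are
// from Meta's actual SDK.
data class AlertContext(val service: String, val metric: String, val firedAtMs: Long)

sealed class Finding {
    data class RootCause(val summary: String) : Finding()
    data class Inconclusive(val nextSteps: List<String>) : Finding()
}

interface Analyzer {
    fun analyze(ctx: AlertContext): Finding
}

// Investigation steps modeled as a decision tree: each branch encodes a
// check an on-call engineer would otherwise perform by hand.
class LatencyRegressionAnalyzer(
    private val recentDeploys: (service: String, sinceMs: Long) -> List<String>,
    private val correlatedMetrics: (metric: String) -> List<String>,
) : Analyzer {
    override fun analyze(ctx: AlertContext): Finding {
        val deploys = recentDeploys(ctx.service, ctx.firedAtMs - 3_600_000)
        if (deploys.isNotEmpty()) {
            // Branch 1: a deploy landed within the hour before the alert.
            return Finding.RootCause("Likely caused by deploy ${deploys.first()}")
        }
        val suspects = correlatedMetrics(ctx.metric)
        return if (suspects.isNotEmpty()) {
            // Branch 2: hand off to analyzers for correlated metrics,
            // mirroring the "analyzer chaining" described above.
            Finding.Inconclusive(suspects.map { "Run analyzer for $it" })
        } else {
            // Branch 3: no automated lead; annotate the alert and escalate.
            Finding.Inconclusive(listOf("Escalate to service owner"))
        }
    }
}
```

Keeping each branch as plain code is what makes the post's backtesting and chaining workflows possible: the returned next steps can feed directly into other analyzers, and historical alerts can replay the same tree to verify its accuracy.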

meta

How AI Is Transforming the Adoption of Secure-by-Default Mobile Frameworks - Engineering at Meta

Meta utilizes secure-by-default frameworks to wrap potentially unsafe operating system and third-party functions, ensuring security is integrated into the development process without sacrificing developer velocity. By leveraging generative AI and automation, the company scales the adoption of these frameworks across its massive codebase, effectively mitigating risks such as Android intent hijacking. This approach balances high-level security enforcement with the practical need for friction-free developer experiences.

## Design Principles for Secure-by-Default Frameworks

To ensure high adoption and long-term viability, Meta follows specific architectural guidelines when building security wrappers:

* **API Mirroring:** Secure framework APIs are designed to closely resemble the existing native APIs they replace (e.g., mirroring the Android Context API). This reduces the cognitive burden on developers and simplifies the use of automated tools for code conversion.
* **Reliance on Public Interfaces:** Frameworks are built exclusively on public and stable APIs. Avoiding private or undocumented OS interfaces prevents maintenance "fire drills" and ensures the frameworks remain functional across OS updates.
* **Modularity and Reach:** Rather than creating a single monolithic tool, Meta develops small, modular libraries that target specific security issues while remaining usable across all apps and platform versions.
* **Friction Reduction:** Frameworks must avoid introducing excessive complexity or noticeable CPU and RAM overhead, as high friction often leads developers to bypass security measures entirely.

## SecureLinkLauncher: Preventing Android Intent Hijacking

SecureLinkLauncher (SLL) is a primary example of a secure-by-default framework designed to stop sensitive data from leaking via the Android intent system (an illustrative sketch of the pattern follows this summary).

* **Wrapped Execution:** SLL wraps native Android methods such as `startActivity()` and `startActivityForResult()`. Instead of calling `context.startActivity(intent)`, developers use `SecureLinkLauncher.launchInternalActivity(intent, context)`.
* **Scope Verification:** The framework enforces scope verification before delegating to the native API. This ensures that intents are directed to intended "family" apps rather than being intercepted by malicious third-party applications.
* **Mitigating Implicit Intents:** SLL addresses the risks of untargeted intents, which can be received by any app with a matching intent-filter. By enforcing a developer-specified scope, SLL ensures that data like `SECRET_INFO` is only accessible to authorized packages.

## Scaling Adoption through AI and Automation

The transition from legacy, insecure patterns to secure frameworks is managed through a combination of automated tooling and artificial intelligence.

* **Automated Migration:** Generative AI identifies insecure usage patterns across Meta’s vast codebase and suggests—or automatically applies—the appropriate secure framework replacements.
* **Continuous Monitoring:** Automation tools continuously scan the codebase to ensure compliance with secure-by-default standards, preventing the reintroduction of vulnerable code.
* **Scaling Consistency:** By reducing the manual effort required for refactoring, AI enables consistent security enforcement across different teams and applications without slowing down the shipping cycle.
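SLL itself is not open source, but its wrapped-execution and scope-verification behavior can be approximated with public Android APIs. Below is a minimal sketch: the `launchInternalActivity(intent, context)` signature comes from the post, while the hard-coded allowlist standing in for the developer-specified scope (and everything else) is an assumption.

```kotlin
import android.content.Context
import android.content.Intent

// Illustrative approximation of the wrapper pattern the post describes;
// this is not Meta's actual SecureLinkLauncher implementation.
object SecureLinkLauncher {
    // Hypothetical developer-specified scope: packages that count as
    // "family" apps and may receive internal intents.
    private val internalScope = setOf("com.example.app", "com.example.companion")

    fun launchInternalActivity(intent: Intent, context: Context) {
        // Resolve where the (possibly implicit) intent would actually land.
        val target = intent.resolveActivity(context.packageManager)
            ?: throw SecurityException("Intent resolves to no activity")
        if (target.packageName !in internalScope) {
            throw SecurityException(
                "Refusing to deliver intent to out-of-scope package ${target.packageName}"
            )
        }
        // Pin the intent to the verified package so an app with a matching
        // intent-filter cannot hijack it, then delegate to the native API.
        intent.setPackage(target.packageName)
        context.startActivity(intent)
    }
}
```

Because the call mirrors `context.startActivity(intent)` almost one-to-one, an automated codemod (or an LLM prompted with both signatures) can migrate call sites mechanically, which is exactly the property the API-mirroring principle is after.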
For organizations managing large-scale mobile codebases, the recommended approach is to build thin, developer-friendly wrappers around risky platform APIs and utilize automated refactoring tools to drive adoption. This ensures that security becomes an invisible, default component of the development lifecycle rather than a manual checklist.

woowahan

We Refactor Culture Just Like Code

The Commerce Web Frontend Development team at Woowa Brothers recently underwent a significant organizational "refactoring" to manage the increasing complexity of their expanding commerce platform. By moving away from rigid, siloed roles and adopting a flexible "boundary-less" part system, the team successfully synchronized disparate services like B Mart and Baemin Store. This cultural shift demonstrates that treating organizational structure with the same iterative mindset as code can eliminate operational bottlenecks and foster a more resilient engineering environment.

### Transitioning to Boundary-less Parts

* The team abandoned traditional division methods—such as project-based, funnel-based, or service-vs-backoffice splits—because they created resource imbalances and restricted developers' understanding of the overall service flow.
* Traditional project-based splits often left specific teams overwhelmed during peak periods while others remained underutilized, creating significant delivery bottlenecks.
* To solve these inefficiencies, the team introduced "boundary-less parts," where developers are not strictly tied to a single domain but are encouraged to work across the entire commerce ecosystem.
* This structure allows the organization to remain agile, moving resources fluidly to address high-priority business needs without being hindered by departmental "walls."

### From R&R to Responsibility and Expandability (R&E)

* The team replaced the traditional R&R (Role & Responsibility) model with "R&E" (Responsibility & Expandability), focusing on the core principle of "owning" a problem until it is fully resolved.
* This shift encourages developers to expand their expertise beyond their immediate tasks, fostering a culture where helping colleagues and understanding neighboring domains is the standard.
* Work is distributed through a strategic sync between team and part leaders, but team members retain the flexibility to jump into different domains as project requirements evolve.
* Regular "part shuffling" ensures that domain knowledge is spread across the entire 20-person frontend team, preventing the formation of information silos.

### Impact on Technical Integration and Team Resilience

* The flexible structure was instrumental in the "ONE COMMERCE" initiative, which required integrating the technical stacks and user experiences of B Mart and Baemin Store.
* Because developers had broad domain context, they were able to identify redundant logic across different services and abstract it into shared, common modules, ensuring architectural consistency.
* The organization significantly improved its "bus factor"—the number of people who could leave before a project stalls—by ensuring multiple engineers understand the context of any given system.
* Developers evolved into "domain-wide engineers" who understand the full lifecycle of a transaction, from the customer-facing UI to the backend administrative and logistics data flows.

To prevent today's organizational solutions from becoming tomorrow's cultural legacy debt, engineering teams should proactively refactor their workflows. Moving from rigid role definitions to a model based on shared responsibility and cross-domain mobility is essential for maintaining velocity and technical excellence in large-scale platform environments.

toss

Improving Work Efficiency: Revisiting

The Toss Research Platform team argues that operational efficiency is best achieved by decomposing complex workflows into granular, atomic actions rather than attempting massive systemic overhauls. By systematically questioning the necessity of each step and prioritizing improvements based on stakeholder impact rather than personal workload, teams can eliminate significant waste through incremental automation. This approach demonstrates that even minor reductions in repetitive manual tasks can lead to substantial gains in team-wide productivity as an organization scales.

### Granular Action Mapping

* Break down workflows into specific physical or digital actions—such as clicks, data entries, and channel switches—rather than high-level phases.
* Document the "Who, Where, What, and Why" for every individual step to identify exactly where friction occurs.
* Include exception cases and edge scenarios in the process map to uncover hidden gaps in the current operating model.

### Questioning Necessity and Identifying Automation Targets

* Apply a critical filter to every mapped action by asking, "Why is this necessary?" to eliminate redundant tasks like manual cross-platform notifications.
* Distinguish between essential human-centric tasks and mechanical actions, such as calendar entry creation, that are ripe for automation.
* Address "micro-inefficiencies" that appear insignificant in isolation but aggregate into major resource drains when repeated multiple times daily across a large team.

### Stakeholder-Centric Prioritization

* Shift the criteria for optimization from personal convenience to the impact on the broader organization.
* Rank improvements based on three specific metrics: the number of people affected, the downstream influence on other workflows, and the total cumulative time consumed (a scoring sketch follows this summary).
* Recognize that automating a "small" task for an operator can unlock significant time and clarity for dozens of participants and observers.

### Incremental Implementation and Risk Mitigation

* Avoid the "all-or-nothing" automation trap by deploying partial solutions that address solvable segments of a process immediately.
* Utilize designated test periods for process changes to monitor for risks, such as team members missing interviews due to altered notification schedules.
* Gather continuous feedback from stakeholders during small-scale experiments, allowing for iterative adjustments or quick reversals before a full rollout.

To scale operations effectively, start by breaking your current workload into its smallest possible components and identifying the most frequent manual repetitions. True efficiency often comes from these small, validated adjustments and consistent feedback loops rather than waiting to build a perfect, fully automated end-to-end system.
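The post stops short of giving a formula, but its three prioritization metrics map naturally onto a simple scoring heuristic. A minimal sketch, with entirely hypothetical field names (none of these identifiers come from the article):

```kotlin
// Hypothetical model of one mapped action from a "Who, Where, What, Why"
// workflow inventory.
data class MappedAction(
    val name: String,
    val peopleAffected: Int,       // stakeholders touched per occurrence
    val downstreamWorkflows: Int,  // other workflows that depend on this step
    val occurrencesPerWeek: Int,
    val minutesPerOccurrence: Int,
)

// Rank automation candidates by the three criteria the post names:
// reach, downstream influence, and total cumulative time consumed.
fun prioritize(actions: List<MappedAction>): List<MappedAction> =
    actions.sortedByDescending { a ->
        val weeklyMinutes = a.occurrencesPerWeek * a.minutesPerOccurrence
        a.peopleAffected * (1 + a.downstreamWorkflows) * weeklyMinutes
    }

fun main() {
    val ranked = prioritize(
        listOf(
            MappedAction("Manual interview calendar entries", 12, 3, 25, 4),
            MappedAction("Cross-platform status notifications", 30, 5, 40, 2),
        )
    )
    // Highest-impact candidate first.
    ranked.forEach { println(it.name) }
}
```

The exact weighting matters less than making the comparison explicit; even a rough score like this shifts the debate from "what annoys me most" to "what costs the team most", which is the post's central argument.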

naver

Project Automation with AI: Faster and Smarter

This session from NAVER Engineering Day 2025 explores how developers can turn AI from a simple assistant into a functional project collaborator through local automation. By leveraging local Large Language Models (LLMs) and the Model Context Protocol (MCP), development teams can automate high-friction tasks such as build failure diagnostics and crash log analysis. The presentation demonstrates that integrating these tools directly into the development pipeline significantly reduces the manual overhead required for routine troubleshooting and reporting.

### Integrating LLMs with Local Environments

* Utilizing **Ollama** allows teams to run LLMs locally, ensuring data privacy and reducing latency compared to cloud-based alternatives.
* **mcp-agent**, built on the Model Context Protocol (MCP), serves as the critical bridge, connecting the LLM to local file systems, tools, and project-specific data.
* This infrastructure enables the AI to act as an "agent" that can autonomously navigate the codebase rather than just processing static text prompts.

### Build Failure and Crash Monitoring Automation

* When a build fails, the AI agent automatically parses the logs to identify the root cause, providing a concise summary instead of requiring a developer to sift through thousands of lines of terminal output.
* For crash monitoring, the system goes beyond simple summarization by analyzing stack traces and identifying the specific developer or team responsible for the affected code segment.
* By automating the initial diagnostic phase, the time between an error occurring and a developer beginning the fix is dramatically shortened.

### Intelligent Reporting via Slack

* The system integrates with **Slack** to deliver automated, context-aware reports that categorize issues by severity and impact.
* These reports include actionable insights, such as suggested fixes or links to relevant documentation, directly within the communication channel used by the team.
* This ensures that project stakeholders remain informed of the system's health without requiring manual status updates from engineers.

### Considerations for LLM and MCP Implementation

* While powerful, the combination of LLMs and MCP agents is not a "silver bullet"; it requires careful prompt engineering and boundary setting to prevent hallucination in technical diagnostics.
* Effective automation depends on the quality of the local context provided to the agent; the more structured the logs and metadata, the more accurate the AI's conclusions.
* Organizations should evaluate the balance between the computational cost of running local models and the productivity gains achieved through automation.

To successfully implement AI-driven automation, developers should start by targeting specific, repetitive bottlenecks—such as triaging build errors—before expanding the agent's scope to more complex architectural tasks. Focusing on the integration between Ollama and mcp-agent provides a secure, extensible foundation for building a truly "smart" development workflow. A minimal end-to-end sketch follows.
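As a rough, self-contained illustration of the build-failure pipeline, the sketch below sends a failing build log to a locally running Ollama server through its documented `/api/generate` REST endpoint and forwards the result to a Slack incoming webhook. The model name and webhook URL are placeholders, and the session itself routes this through mcp-agent rather than the raw HTTP calls used here.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

val http: HttpClient = HttpClient.newHttpClient()

// Minimal JSON string escaping; a real implementation would use a JSON library.
fun escapeJson(s: String) =
    s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n")

fun postJson(url: String, body: String): String {
    val request = HttpRequest.newBuilder(URI.create(url))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
    return http.send(request, HttpResponse.BodyHandlers.ofString()).body()
}

fun main() {
    val buildLog = "e: Unresolved reference: userRepository\nFAILURE: Build failed..."
    // Ollama's generate endpoint; "llama3" is a placeholder model name.
    val summary = postJson(
        "http://localhost:11434/api/generate",
        """{"model":"llama3","stream":false,"prompt":
           "Summarize the root cause of this build failure in two sentences: ${escapeJson(buildLog)}"}"""
    )
    // The response is JSON whose "response" field holds the model output;
    // the raw body is forwarded here for brevity.
    postJson(
        "https://hooks.slack.com/services/T000/B000/XXXX", // placeholder webhook URL
        """{"text":"${escapeJson("Build failure summary: $summary")}"}"""
    )
}
```

Even this crude version captures the benefit the session describes: the on-call developer reads a short root-cause summary in Slack instead of scrolling through the full terminal output.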

kakao

[AI_TOP_100]

The AI TOP 100 contest was designed to shift the focus from evaluating AI model performance to measuring human proficiency in solving real-world problems through AI collaboration. By prioritizing the "problem-solving process" over mere final output, the organizers sought to identify individuals who can define clear goals and navigate the technical limitations of current AI tools. The conclusion of this initiative suggests that true AI literacy is defined by the ability to maintain a "human-in-the-loop" workflow where human intuition guides AI execution and verification.

### Core Philosophy of Human-AI Collaboration

* **Human-in-the-Loop:** The contest emphasizes a cycle of human analysis, AI problem-solving, and human verification. This ensures that the human remains the "pilot" who directs the AI engine and takes responsibility for the quality of the result.
* **Strategic Intervention:** Participants were encouraged to provide AI with structural context it might struggle to perceive (like complex table relationships) and to perform data pre-processing to improve AI accuracy.
* **Task Delegation:** For complex iterative tasks, such as generating images for a montage, solvers were expected to build automated pipelines using AI agents to handle repetitive feedback loops while focusing human effort on higher-level strategy.

### Designing Against "One-Shot" Solutions

* **Low Barrier, High Ceiling:** Problems were designed to be intuitive enough for anyone to understand but complex enough to prevent "one-shot" solutions (the "click-and-solve" trap).
* **Targeting Technical Weaknesses:** Organizers intentionally embedded technical hurdles that current LLMs struggle with, forcing participants to demonstrate how they bridge the gap between AI limitations and a correct answer.
* **The Difficulty Ladder:** To account for varying domain expertise (e.g., OCR experience), problems utilized a multi-part structure. This included "Easy" starting questions to build momentum and "Medium" hint questions that guided participants toward solving the more difficult "Killer" components.

### The 4-Pattern Problem Framework

* **P1 - Insight (Analysis & Definition):** Identifying meaningful opportunities or problems within complex, unstructured data.
* **P2 - Action (Implementation & Automation):** Developing functional code or workflows to execute a defined solution.
* **P3 - Persuasion (Strategy & Creativity):** Generating logical and creative content to communicate technical solutions to non-technical stakeholders.
* **P4 - Decision (Optimization):** Making optimal choices and simulations to maximize goals under specific constraints.

### Quality Assurance and Score Calibration

* **4-Stage Pipeline:** Problems moved from Ideation to Drafting (testing for one-shot immunity), then to Candidate (analyzing abuse vulnerabilities), and finally to a Final selection based on difficulty balance.
* **Cross-Model Validation:** Internal and alpha testers solved problems using various models, including Claude, GPT, and Gemini, to ensure that no single tool could bypass the intended human-led process.
* **Effort-Based Scoring:** Instead of uniform points, scores were calibrated based on the "effort cost" and human competency required to solve them. This resulted in varying total points per problem to better reflect the true difficulty of the task.

In the era of rapidly evolving AI, the ability to "use" a tool is becoming less valuable than the ability to "collaborate" with it. This shift requires a move toward building automated pipelines and utilizing a "difficulty ladder" approach to tackle complex, multi-stage problems that AI cannot yet solve in a single iteration.