docker


datadog

2023-03-08 incident: A deep dive into the platform-level impact | Datadog

The March 2023 Datadog outage was triggered by a simultaneous, global failure across multiple cloud providers and regions, caused by an unexpected interaction between a systemd security patch and Ubuntu 22.04’s default networking behavior. While Datadog typically employs rigorous, staged rollouts for infrastructure changes, the automated nature of OS-level security updates bypassed these controls. The incident highlights the hidden risks in system-level defaults and the potential for "unattended upgrades" to create synchronized failures across supposedly isolated environments.

## The systemd-networkd Routing Change

* In December 2020, systemd version 248 introduced a change where `systemd-networkd` flushes all IP routing rules it does not recognize upon startup.
* Version 249 introduced the `ManageForeignRoutingPolicyRules` setting, which defaults to "yes," confirming this management behavior for any rules not explicitly defined in systemd configuration files.
* These changes were backported to earlier versions (v247 and v248) but were notably absent from v245, the version used in Ubuntu 20.04.

## Dormant Risks in the Ubuntu 22.04 Migration

* Datadog began migrating its fleet from Ubuntu 20.04 to 22.04 in late 2022, eventually reaching 90% coverage across its infrastructure.
* Ubuntu 22.04 utilizes systemd v249, meaning the majority of the fleet was susceptible to the routing rule flushing behavior.
* The risk remained dormant during the initial rollout because `systemd-networkd` typically only starts during the initial boot sequence when no complex routing rules have been established yet.

## The Trigger: Unattended Upgrades and the CVE Patch

* On March 7, 2023, a security patch for a systemd CVE was released to the Ubuntu security repositories.
* Datadog’s fleet used the Ubuntu default configuration for `unattended-upgrades`, which automatically installs security-labeled patches once a day, typically between 06:00 and 07:00 UTC.
* The installation of the patch forced a restart of the `systemd-networkd` service on active, running nodes.
* Upon restarting, the service identified existing IP routing rules (crucial for container networking) as "foreign" and deleted them, effectively severing network connectivity for the nodes.

## Failure of Regional Isolation

* Because the security patch was released globally and the automated upgrade window was synchronized across regions, the failure occurred nearly simultaneously worldwide.
* This automation bypassed Datadog’s standard practice of "baking" changes in staging and experimental clusters for weeks before proceeding to production.
* Nodes on the older Ubuntu 20.04 (systemd v245) were unaffected by the patch, as that version of systemd does not flush IP rules upon a service restart.

To mitigate similar risks, infrastructure teams should consider explicitly disabling the management of foreign routing rules in systemd-networkd configuration when using third-party networking plugins. Furthermore, while automated security patching is a best practice, organizations must balance the speed of patching with the need for controlled, staged rollouts to prevent global configuration drift or synchronized failures.
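The flushing behavior described in this summary can be illustrated with a toy model. The sketch below is not systemd code: `PolicyRule`, `rules_after_networkd_start`, and the example rules are hypothetical names chosen for illustration, and the matching logic is deliberately simplified to "keep a rule only if it appears in the configured set."

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PolicyRule:
    """Hypothetical stand-in for one IP routing policy rule."""
    priority: int
    selector: str  # e.g. "from 10.0.1.5"
    table: str


def rules_after_networkd_start(
    kernel_rules: Iterable[PolicyRule],
    configured_rules: Iterable[PolicyRule],
    manage_foreign: bool = True,
) -> List[PolicyRule]:
    """Toy model of which rules survive a systemd-networkd (re)start.

    manage_foreign mirrors the idea behind ManageForeignRoutingPolicyRules:
    when True, any rule that networkd did not configure itself is dropped.
    This is a simplification, not the real matching logic.
    """
    if not manage_foreign:
        # Foreign rules are left alone.
        return list(kernel_rules)
    configured = set(configured_rules)
    return [rule for rule in kernel_rules if rule in configured]


if __name__ == "__main__":
    # Assumed example data: one rule known to networkd, one added by a
    # container networking plugin at runtime.
    host_rule = PolicyRule(100, "from 192.0.2.10", "main")
    pod_rule = PolicyRule(512, "from 10.0.1.5", "pod-routes")

    print(rules_after_networkd_start([host_rule, pod_rule], [host_rule]))
    print(rules_after_networkd_start([host_rule, pod_rule], [host_rule],
                                     manage_foreign=False))
```

On a fresh boot the plugin's rule does not exist yet, so nothing is lost; on a restart of a running node it does, which is why the risk stayed dormant. In real deployments the corresponding knob is `ManageForeignRoutingPolicyRules=no` in the `[Network]` section of `networkd.conf`.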

datadog

Using the Dirty Pipe vulnerability to break out from containers | Datadog

The Dirty Pipe vulnerability (CVE-2022-0847) is a critical Linux kernel flaw that allows unprivileged processes to write data to any file they can read, effectively bypassing standard write permissions. This primitive is particularly dangerous in containerized environments like Kubernetes, where it can be leveraged to overwrite the host’s container runtime binary. By exploiting how the kernel manages page caches, an attacker can achieve a full container breakout and gain administrative privileges on the underlying host.

## Container Runtimes and the OCI Specification

* Kubernetes utilizes the Container Runtime Interface (CRI) to manage containers via high-level runtimes like containerd or CRI-O.
* These high-level runtimes rely on low-level Open Container Initiative (OCI) runtimes, most commonly runC, to handle the heavy lifting of namespaces and control groups.
* Isolation is achieved by runC setting up a restricted environment before executing the user-supplied entrypoint via the `execve` system call.

## Evolution of runC Vulnerabilities

* A historical vulnerability, CVE-2019-5736, previously allowed escapes by overwriting the host’s runC binary through the `/proc/self/exe` file descriptor.
* To mitigate this, runC was updated to either clone the binary before execution or mount the host's runC binary as read-only inside the container.
* While the read-only mount improved performance through kernel cache page sharing, it created a target for the Dirty Pipe vulnerability, which specifically targets the kernel page cache.

## The Dirty Pipe Exploitation Primitive

* Dirty Pipe allows an attacker to overwrite any file they can read, including read-only files, by manipulating the kernel's internal pipe-buffer structures.
* The exploit targets the page cache, meaning the overwrite is non-persistent and resides only in memory; the original file on disk remains unchanged.
* In a container escape scenario, the attacker waits for a runC process to start (triggered by actions like `kubectl exec`) and targets the file descriptor at `/proc/<runC-pid>/exe`.

## Proof-of-Concept Escape Walkthrough

* The attack begins with a standard, unprivileged pod running a malicious script that monitors the system for new runC processes.
* Once a `kubectl exec` command is issued by an administrator, the script identifies the runC PID and applies the Dirty Pipe exploit to the associated executable.
* The exploit overwrites the runC binary in the kernel page cache with a malicious ELF binary.
* Because the host kernel is executing this hijacked binary with root privileges to manage the container, the attacker’s malicious code (e.g., a reverse shell or administrative command) runs with full host-level authority.

To protect against this attack vector, it is essential to patch the Linux kernel to a version that includes the fix for CVE-2022-0847 and ensure that container nodes are running updated distributions.
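As a rough aid for the patching advice above, the following sketch flags kernel release strings that fall in the upstream range affected by CVE-2022-0847. The version boundaries are not stated in this summary; they are taken from the public disclosure (introduced in 5.8, fixed in 5.16.11, 5.15.25, and 5.10.102) and should be treated as assumptions to verify. The function names are hypothetical, and the check is only a heuristic: distribution kernels often backport fixes without changing the upstream version number, and release candidates are not handled.

```python
import platform
import re

# Upstream stable releases carrying the fix, keyed by (major, minor).
# Assumption: values come from the public CVE-2022-0847 disclosure.
FIXED_IN = {(5, 10): 102, (5, 15): 25, (5, 16): 11}


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor, patch) from a string like '5.15.0-25-generic'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def possibly_vulnerable(release: str) -> bool:
    """Heuristic: True if the upstream version falls in the affected range."""
    major, minor, patch = parse_kernel_release(release)
    series = (major, minor)
    if series < (5, 8) or series >= (5, 17):
        # Older than the affected code, or newer than the mainline fix.
        return False
    fixed_patch = FIXED_IN.get(series)
    if fixed_patch is None:
        # Affected series with no fixed upstream stable release listed above.
        return True
    return patch < fixed_patch


if __name__ == "__main__":
    current = platform.release()
    print(current, "possibly vulnerable:", possibly_vulnerable(current))
```

A `True` result means "check the vendor's advisory for this kernel package," not "this host is exploitable"; the vendor's security tracker is the authoritative source.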

gitlab

Getting started with GitLab Duo Agentic Chat

GitLab Duo Agentic Chat marks a shift from traditional Q&A chatbots to autonomous AI collaboration partners integrated directly into the software development lifecycle. By leveraging specialized agents and context-aware large language models, the platform enables developers to automate complex tasks like code refactoring, security remediation, and issue triaging. This system serves as a centralized interface across both the GitLab Web UI and IDEs to streamline workflows from initial planning to production deployment.

## Capabilities of Agentic AI

* **Autonomous Actions:** The system can move beyond simple chat by creating files, modifying existing code, and opening merge requests on behalf of the user.
* **Deep Context Integration:** Agents have access to the full GitLab ecosystem, including issues, epics, Git commits, CI/CD pipelines, and security scans.
* **Extensibility:** Through the Model Context Protocol (MCP), the chat can integrate with external services to expand its functional scope.
* **Information Retrieval:** Users can query project architecture or use GitLab Query Language (GLQL) to pull specific project analytics and insights.

## Model and Agent Customization

* **Flexible Model Selection:** Users and administrators can choose from different LLMs based on task requirements, with configuration available at both the group and individual user levels.
* **Specialized Agents:** The platform features dedicated agents for specific roles, such as the **Planner Agent** for product management and the **Security Analyst Agent** for vulnerability management.
* **Contextual Switching:** In IDEs, users can switch between agents via a dropdown menu, while the Web UI allows for agent selection when starting new chat sessions.

## Specialized Workflow Use Cases

* **Project Planning:** The Planner Agent can break down epics into smaller tasks, list high-priority bugs, and generate technical requirements for new features.
* **Security Remediation:** Security-focused agents can explain vulnerabilities in simple terms, identify false positives in scans, and suggest specific code fixes for SQL injection or XSS risks.
* **Troubleshooting and Debugging:** The system can analyze CI/CD pipeline logs to identify why a build failed and suggest optimizations for job performance.
* **Legacy Modernization:** Specific prompts can guide the AI to refactor code to follow SOLID principles or create migration plans for modernizing legacy languages like COBOL to Java or Python.

## Access and Integration

* **Interface Options:** The chat is accessible via a collapsible sidebar in the Web UI and through dedicated plugins in popular IDEs.
* **Future Development:** While currently limited to UI and IDE interfaces, a GitLab Duo CLI is in development to bring agentic capabilities to the terminal.

To get the most out of GitLab Duo Agentic Chat, it is recommended to transition between specialized agents as you move through different project phases. Using the Security Analyst for code reviews and the Planner for backlog grooming ensures that the underlying models are optimized for the specific metadata and constraints of those tasks.