incident-response | Techlist.io

figma May 8, 2024

Craft and Beauty: The ROI of Marrying Form and Function | Figma Blog

How Linear made the most of a DDoS | Maker Stories, Profiles & interviews, Engineering, Case study, Security

incident-response security network-security sre+3

figma Jun 14, 2022

Postmortem: Service disruptions on June 6 & 7 2022 | Figma Blog

Postmortem: Service disruption on April 29th, 2020. The root cause of our recent service outage and next steps | Inside Figma, Engineering

incident-response infrastructure postgresql backend-engineering+3

discord

ROOST Announces “Coop” and “Osprey”: Free, Open-Source Trust and Safety Infrastructure for the AI Era

ROOST, a non-profit dedicated to digital safety, has launched two open-source tools, Coop and Osprey, to provide enterprise-grade content moderation and threat investigation capabilities to organizations of all sizes. By open-sourcing technology previously developed by industry leaders like Discord and Cove, ROOST aims to democratize access to the infrastructure required to detect, triage, and respond to online harms. This initiative shifts Trust and Safety from a proprietary competitive advantage to a shared public resource, enabling platforms to prioritize user protection without the burden of expensive enterprise software.

### Content Review and Compliance with Coop

Built on technology acquired from Cove and utilized by platforms like Notion, Coop focuses on the human-in-the-loop aspect of content moderation.

* The platform provides robust tools for content review, allowing teams to route specific cases to subject-matter experts for deeper analysis.
* It includes built-in integration with the National Center for Missing & Exploited Children’s (NCMEC) API, automating the mandatory reporting process for child sexual abuse material (CSAM).
* The interface is designed to surface relevant context and metadata, ensuring moderators can make informed decisions and take immediate action against policy violations.

### Incident Response and Investigation with Osprey

Osprey is a lightweight investigation tool originally developed by Discord to manage large-scale safety incidents and platform-wide threats.

* It serves as a foundation for incident response, helping safety teams understand platform trends and investigate coordinated threats like phishing or harassment campaigns.
* The tool is designed to be user-friendly and accessible for grassroots communities while remaining powerful enough for established platforms.
* Early adopters, including the decentralized social network Bluesky, are implementing Osprey to demonstrate that effective safety infrastructure can be scalable and resource-efficient.

### A Collaborative Model for Safety Infrastructure

The launch of these tools represents a strategic shift toward a collaborative "public-interest" model for digital defense.

* ROOST acquired the intellectual property of Cove and received the donation of Osprey from Discord to ensure these tools remain available as a public good.
* The initiative is backed by philanthropic funding and legal support from Perkins Coie, removing the financial barriers that often prevent smaller platforms from implementing high-level safety measures.
* Major industry players like Notion and Bluesky are championing the move, signaling an industry-wide push to share safety innovations rather than silo them.

Platforms and developers should prepare to integrate these tools into their safety stacks as they become publicly available in the coming months. By adopting open-source infrastructure for routine tasks like NCMEC reporting and incident triage, organizations can focus their internal resources on platform-specific innovations while maintaining a high standard of digital safety.
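
The idea of routing specific cases to subject-matter experts can be pictured with a small sketch. This is not Coop's API or data model; the category names, queue names, and the `Case`, `route`, and `triage` helpers are all hypothetical, chosen only to show how a category-to-queue lookup with a general fallback might work.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical mapping from policy category to an expert review queue.
# None of these names come from Coop or Osprey.
EXPERT_QUEUES = {
    "phishing": "fraud-team",
    "harassment": "harassment-team",
}
DEFAULT_QUEUE = "general-review"


@dataclass
class Case:
    case_id: str
    category: str
    # Context and metadata a moderator would want to see alongside the case.
    context: dict = field(default_factory=dict)


def route(case: Case) -> str:
    """Return the review queue for a case, falling back to the general queue."""
    return EXPERT_QUEUES.get(case.category, DEFAULT_QUEUE)


def triage(cases):
    """Group case IDs by destination queue, keeping arrival order within each queue."""
    queues = defaultdict(list)
    for case in cases:
        queues[route(case)].append(case.case_id)
    return dict(queues)
```

In this sketch, any category without a dedicated expert queue lands in `general-review`, so an unrecognized case type is still reviewed instead of being dropped.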

incident-response open-source api-integration trust-and-safety+3

datadog

2023-03-08 incident: A deep dive into our incident response | Datadog

Datadog’s first global outage on March 8, 2023, served as a rigorous stress test for their established incident response framework and "you build it, you own it" philosophy. While the outage was triggered by a systemic failure during a routine systemd upgrade, the company's commitment to blameless culture and decentralized engineering autonomy allowed hundreds of responders to coordinate a complex recovery across multiple regions. Ultimately, the event validated their investment in out-of-band monitoring and rigorous, bi-annual incident training as essential components for managing high-scale system disasters.

## Incident Response Structure and Philosophy

* Datadog employs a decentralized "you build it, you own it" model where individual engineering teams are responsible for the 24/7 health and monitoring of the services they build.
* For high-severity incidents, a specialized rotation is paged, consisting of an Incident Commander to lead the response, a communications lead, and a customer liaison to manage external messaging.
* The organization prioritizes "people over process," empowering engineers to use their judgment to find creative solutions rather than following rigid, pre-written playbooks that may not apply to unprecedented failures.
* A blameless culture is strictly maintained across all levels of the company, ensuring that post-incident investigations focus on systemic improvements rather than assigning fault to individuals.

## Multi-Layered Monitoring Strategy

* Standard telemetry provides internal visibility, but Datadog also maintains "out-of-band" monitoring that operates completely outside its own infrastructure.
* This out-of-band system interacts with Datadog APIs exactly like a customer would, ensuring that engineers are alerted even if the internal monitoring platform itself becomes unavailable.
* Communication is streamlined through a dedicated Slack incident app that automatically generates coordination channels, providing situational awareness to any engineer who joins the effort.

## Anatomy of the March 8 Outage

* The outage began at 06:00 UTC, triggered by a systemd upgrade that caused widespread Kubernetes failures and prevented pods from restarting correctly.
* The global nature of the outage was diagnosed within 32 minutes of the initial monitoring alerts, leading to the activation of executive on-calls and the customer support management team.
* Responders identified "unattended upgrades" as the incident trigger approximately five and a half hours after the initial failure.
* Recovery was executed in stages: compute capacity was restored first in the EU1 region, followed by the US1 region, with full infrastructure restoration completed by 19:00 UTC.

Organizations should treat incident response as a perishable skill that requires constant practice through a low threshold for declaring incidents and regular training. By combining out-of-band monitoring with a culture that empowers individual engineers to act autonomously during a crisis, teams can more effectively navigate the "not if, but when" reality of large-scale system failures.
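
The out-of-band monitoring described in this summary can be illustrated with a minimal probe sketch. This is not Datadog's implementation: the `http_status` and `run_probe` helpers, the retry count, and any URL passed in are assumptions for illustration. The essential property is that the script runs on infrastructure separate from the platform it checks and calls the public endpoint the way a customer would.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch a URL as an external client would and return the HTTP status code.

    Returns 0 when no HTTP response arrives at all (DNS failure, refused
    connection, timeout).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return 0  # covers URLError and socket timeouts


def run_probe(
    url: str,
    attempts: int = 3,
    fetch: Callable[[str], int] = http_status,
) -> bool:
    """Return True (raise an alert) only if every attempt fails.

    Requiring several failed attempts is an assumption made here to avoid
    paging on a single transient error.
    """
    for _ in range(attempts):
        status = fetch(url)
        if 200 <= status < 300:
            return False
    return True
```

The `fetch` parameter exists so the probe logic can be exercised without network access, for example `run_probe("https://status.example.invalid/health", fetch=lambda u: 503)`, where the URL is a placeholder. In a real deployment the alert itself would also need to travel over a channel that does not depend on the monitored platform.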

incident-response k8s automation datadog+4