Techlist.io - Korean Tech Blog Curator

figma Jul 12, 2023

How Thumbtack structures their design system | Figma Blog

How Pinterest’s design systems team measures adoption Maker Stories Design systems Case study

product-design design-systems ui-ux frontend-development+2

figma Jul 11, 2023

Explore the value of play through a new zine | Figma Blog

Paula Scher’s 10 rules for play Maker Stories Thought leadership Tips & inspiration Brainstorming Config

figma product-design design-systems design-thinking+3

datadog Jun 30, 2023

How we migrated our acceptance tests to use Synthetic Monitoring | Datadog

Datadog’s Frontend Developer Experience team migrated their massive codebase from a fragile, custom Puppeteer-based acceptance testing framework to Datadog Synthetic Monitoring to address persistent flakiness and high maintenance overhead. By leveraging a record-and-play approach and integrating it into their CI/CD pipelines via the `datadog-ci` tool, they successfully reduced developer friction and improved testing reliability for over 300 engineers. This transition demonstrates how replacing manual browser scripting with specialized monitoring tools can significantly streamline high-scale frontend workflows.

### Limitations of Puppeteer-Based Testing

* Custom runners built on Puppeteer suffered from inherent flakiness because they relied on a complex chain of virtual graphic engines, browser manipulation, and network stability that frequently failed unexpectedly.
* Writing tests was unintuitive, requiring engineers to manually script interaction details, such as verifying that a button is present and enabled before clicking it, which became exponentially more complex for custom elements like dropdowns.
* The testing infrastructure was slow and expensive, with CI jobs taking up to 35 minutes of machine time per commit to cover the application's 565 tests and 100,000 lines of test code.
* Maintenance was a constant burden; every product update required a corresponding manual update to the scripts, making the process as labor-intensive as writing new features.

### Adopting Synthetic Monitoring and Tooling

* The team moved to Synthetic Monitoring, which allows engineers to record browser interactions directly rather than writing code, significantly lowering the barrier to entry for creating tests.
* To integrate these tests into the development lifecycle, the team developed `datadog-ci`, a CLI tool designed to trigger tests and poll result statuses directly from the CI environment.
* The new system uses a specific file format (`.synthetics.json`) to identify tests within the codebase, allowing for configuration overrides and human-readable output in the build logs.
* This transition turned an internal need into a product improvement, as the `datadog-ci` tool was generalized to help all Datadog users execute commands from within their CI/CD scripts.

### Strategies for High-Scale Migration and Adoption

* The team utilized comprehensive documentation and internal "frontend gatherings" to educate 300 engineers on how to record tests and why the new system required less maintenance.
* To build developer trust, the team initially implemented the new tests as non-blocking CI jobs, surfacing failures as PR comments rather than breaking builds.
* Migration was treated as a distributed effort, with 565 individual tests tracked via Jira and assigned to their respective product teams to ensure ownership and a steady pace.
* By progressively sunsetting the old platform as tests were migrated, the team managed a year-long transition without disrupting the daily output of 160 authors pushing 90 new PRs every day.

To successfully migrate large-scale testing infrastructures, organizations should prioritize developer trust by introducing new tools through non-blocking pipelines and providing comprehensive documentation. Transitioning from manual browser scripting to automated recording tools not only reduces technical debt but also empowers engineers to maintain high-quality codebases without the burden of managing complex testing infrastructure.
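The trigger-and-poll flow and the non-blocking rollout described above can be sketched roughly as follows. This is an illustrative sketch only, not the `datadog-ci` implementation: `find_test_files`, `poll_until_done`, `exit_code`, and the `fetch_status` callable are hypothetical names, and the status strings (`passed`, `failed`, `timed_out`) are assumptions. Only the `.synthetics.json` suffix comes from the summary.

```python
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def find_test_files(root: str) -> List[Path]:
    """Collect every file whose name ends in .synthetics.json under root."""
    return sorted(Path(root).rglob("*.synthetics.json"))


def poll_until_done(
    run_ids: Iterable[str],
    fetch_status: Callable[[str], str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll each run until it reports 'passed' or 'failed', or the timeout hits.

    fetch_status is a hypothetical stand-in for a real API client.
    Runs still pending at the timeout are recorded as 'timed_out'.
    """
    pending = set(run_ids)
    results: Dict[str, str] = {}
    waited = 0.0
    while pending:
        for run_id in sorted(pending):
            status = fetch_status(run_id)
            if status in ("passed", "failed"):
                results[run_id] = status
        pending.difference_update(results)
        if not pending:
            break
        if waited >= timeout_s:
            for run_id in pending:
                results[run_id] = "timed_out"
            break
        sleep(interval_s)
        waited += interval_s
    return results


def exit_code(results: Dict[str, str], blocking: bool) -> int:
    """Return 1 only when the job is blocking and some run did not pass."""
    any_bad = any(status != "passed" for status in results.values())
    return 1 if (blocking and any_bad) else 0
```

Running the job with `blocking=False` mirrors the early rollout phase: failures are still collected (and could be posted as a PR comment), but the build itself stays green until the team decides to make the job blocking.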

datadog frontend-development node.js puppeteer+3

datadog Jun 30, 2023

How we migrated our acceptance tests to use Synthetic Monitoring

Yoann Moinet Cecilia Watt The Frontend Developer Experience team strives to improve the lives of 300 frontend engineers at Datadog. We cover build systems, tests, deployments, code health, internal tools, and more—we’re here to remove any friction and pain points from our engine…

datadog ci-cd node.js puppeteer+3

figma Jun 20, 2023

AI: The Next Chapter in Design | Figma Blog

Introducing AI to FigJam Inside Figma AI Product updates Productivity Diagramming Collaboration Meetings Profiles & interviews FigJam News

figma ai gen-ai figjam+2

figma Jun 20, 2023

Introducing Figma’s New Dev Mode | Figma Blog

Making Figma better for developers with Dev Mode Inside Figma Product updates Config Engineering Collaboration Dev Mode News How can a design tool work better for developers? It’s a question we’ve been asking ourselves and our community. Today, we’re excited to introduce Dev Mod…

figma design-systems dev-mode github+4

figma Jun 20, 2023

Figma & Chromebook expand to K12 and Japan | Figma Blog

Figma and Chromebook: Empowering the next generation of designers Inside Figma Career & education Design News

figma design design-education education-technology+2

figma Jun 20, 2023

Config 2023 in Review: The Complete Recap | Figma Blog

Config 2023: Reimagining where teams design and build together Inside Figma Product updates Config Design Engineering Collaboration News We’re launching Dev Mode, variables, advanced prototyping, and a series of quality of life updates to help you go from design to build. Today…

figma design-systems prototyping design-to-code+4

figma Jun 19, 2023

Why Roles Are Not Rules | Figma Blog

Welcome to the WIP Insights Design Product management Thought leadership Design

figma Jun 19, 2023

Introducing Shortcut letter from the editor | Figma Blog

How Figma’s multiplayer technology works A peek into the homegrown solution we built as the first design tool with live collaborative editing. Inside Figma Engineering Behind the scenes Infrastructure

infrastructure distributed-systems real-time-collaboration collaborative-editing+2

figma Jun 18, 2023

What’s Happening at Config 2023? | Figma Blog

Four years in, here’s what Config tells us about the state of design Insights Report Config Events Thought leadership Design systems Collaboration Research Social impact AI

figma ai product-design design-systems+1

datadog Jun 16, 2023

2023-03-08 incident: A deep dive into the platform-level recovery

Laurent Bernaille On March 8, 2023, Datadog experienced an outage that affected all services across multiple regions. In a previous post we described how we faced the unexpected. We left off with the realization that we had lost 60 percent of our compute capacity. Armed with thi…

k8s observability cloud-computing cloud-infrastructure+4

datadog Jun 16, 2023

2023-03-08 incident: A deep dive into the platform-level recovery | Datadog

Following a massive system-wide outage in March 2023, Datadog successfully restored its EU1 region by identifying that a simple node reboot could resolve network connectivity issues caused by a faulty system patch. While the team managed to restore 100 percent of compute capacity within hours, the recovery effort was subsequently hindered by cloud provider infrastructure limits and IP address exhaustion. This post-mortem highlights the complexities of scaling hierarchical Kubernetes environments under extreme pressure and the importance of accounting for "black swan" capacity requirements.

## Hierarchical Kubernetes Recovery

Datadog utilizes a strict hierarchy of Kubernetes clusters to manage its infrastructure, which necessitated a granular, three-tiered recovery approach. Because the outage affected network connectivity via `systemd-networkd`, the team had to restore components in a specific order to regain control of the environment.

* **Parent Control Planes:** Engineers first rebooted the virtual machines hosting the parent clusters, which manage the control planes for all other clusters.
* **Child Control Planes:** Once parent clusters were stable, the team restored the control planes for application clusters, which run as pods within the parent infrastructure.
* **Application Worker Nodes:** Thousands of worker nodes across dozens of clusters were restarted progressively to avoid overwhelming the control planes, reaching full capacity by 12:05 UTC.

## Scaling Bottlenecks and Cloud Quotas

Once the infrastructure was online, the team attempted to scale out rapidly to process a massive backlog of buffered data. This surge in demand triggered previously unencountered limitations within the Google Cloud environment.

* **VPC Peering Limits:** At 14:18 UTC, the platform hit a documented but overlooked limit of 15,500 VM instances within a single network peering group, blocking all further scaling.
* **Provider Intervention:** Datadog worked directly with Google Cloud support to manually raise the peering group limit, which allowed scaling to resume after a nearly four-hour delay.

## IP Address and Subnet Capacity

Even after cloud-level instance quotas were lifted, specific high-traffic clusters processing logs and traces hit a secondary bottleneck related to internal networking.

* **Subnet Exhaustion:** These clusters attempted to scale to more than twice their normal size, quickly exhausting all available IP addresses in their assigned subnets.
* **Capacity Planning Gaps:** While Datadog typically targets a 66% maximum IP usage to allow for a 50% scale-out, the extreme demands of the recovery backlog exceeded these safety margins.
* **Impact on Backlog:** For six hours, the lack of available IPs forced these clusters to process data significantly slower than the rest of the recovered infrastructure.

## Recovery Summary

The EU1 recovery demonstrates that even when hardware is functional, software-defined limits can create cascading delays. Organizations should not only monitor their own resource usage but also maintain visibility into cloud provider quotas and ensure that subnet allocations account for extreme recovery scenarios where workloads may need to double or triple in size momentarily.
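The subnet arithmetic behind the 66% target can be checked with a short calculation. The sketch below is illustrative: the `/20` subnet, the function names, and the `reserved` parameter (addresses a provider might hold back) are assumptions, not details from the post. It shows why a 66% ceiling leaves room for a 50% scale-out (0.66 × 1.5 = 0.99) but not for doubling (0.66 × 2 = 1.32).

```python
import ipaddress


def max_scale_factor(target_usage: float) -> float:
    """Largest multiple of steady-state size that still fits in the subnet."""
    if not 0 < target_usage <= 1:
        raise ValueError("target_usage must be in (0, 1]")
    return 1.0 / target_usage


def fits_after_scaling(cidr: str, ips_in_use: int, scale_factor: float,
                       reserved: int = 0) -> bool:
    """True if ips_in_use * scale_factor fits in the subnet's usable addresses.

    reserved is an assumed count of addresses held back by the provider.
    """
    usable = ipaddress.ip_network(cidr).num_addresses - reserved
    return ips_in_use * scale_factor <= usable


# Hypothetical /20 subnet (4096 addresses) running at the 66% target.
subnet = "10.0.0.0/20"
in_use = int(4096 * 0.66)  # 2703 addresses

print(round(max_scale_factor(0.66), 2))         # 1.52
print(fits_after_scaling(subnet, in_use, 1.5))  # True
print(fits_after_scaling(subnet, in_use, 2.0))  # False
```

Planning for a recovery scenario in which workloads double would, under the same arithmetic, require keeping steady-state usage at or below 50% of each subnet.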

database-design k8s datadog cloud-computing+4

datadog Jun 1, 2023

2023-03-08 incident: A deep dive into our incident response | Datadog

Datadog’s first global outage on March 8, 2023, served as a rigorous stress test for their established incident response framework and "you build it, you own it" philosophy. While the outage was triggered by a systemic failure during a routine systemd upgrade, the company's commitment to blameless culture and decentralized engineering autonomy allowed hundreds of responders to coordinate a complex recovery across multiple regions. Ultimately, the event validated their investment in out-of-band monitoring and rigorous, bi-annual incident training as essential components for managing high-scale system disasters.

## Incident Response Structure and Philosophy

* Datadog employs a decentralized "you build it, you own it" model where individual engineering teams are responsible for the 24/7 health and monitoring of the services they build.
* For high-severity incidents, a specialized rotation is paged, consisting of an Incident Commander to lead the response, a communications lead, and a customer liaison to manage external messaging.
* The organization prioritizes "people over process," empowering engineers to use their judgment to find creative solutions rather than following rigid, pre-written playbooks that may not apply to unprecedented failures.
* A blameless culture is strictly maintained across all levels of the company, ensuring that post-incident investigations focus on systemic improvements rather than assigning fault to individuals.

## Multi-Layered Monitoring Strategy

* Standard telemetry provides internal visibility, but Datadog also maintains "out-of-band" monitoring that operates completely outside its own infrastructure.
* This out-of-band system interacts with Datadog APIs exactly like a customer would, ensuring that engineers are alerted even if the internal monitoring platform itself becomes unavailable.
* Communication is streamlined through a dedicated Slack incident app that automatically generates coordination channels, providing situational awareness to any engineer who joins the effort.

## Anatomy of the March 8 Outage

* The outage began at 06:00 UTC, triggered by a systemd upgrade that caused widespread Kubernetes failures and prevented pods from restarting correctly.
* The global nature of the outage was diagnosed within 32 minutes of the initial monitoring alerts, leading to the activation of executive on-calls and the customer support management team.
* Responders identified "unattended upgrades" as the incident trigger approximately five and a half hours after the initial failure.
* Recovery was executed in stages: compute capacity was restored first in the EU1 region, followed by the US1 region, with full infrastructure restoration completed by 19:00 UTC.

Organizations should treat incident response as a perishable skill that requires constant practice through a low threshold for declaring incidents and regular training. By combining out-of-band monitoring with a culture that empowers individual engineers to act autonomously during a crisis, teams can more effectively navigate the "not if, but when" reality of large-scale system failures.
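The out-of-band monitoring idea described above can be illustrated with a small sketch. Everything here is hypothetical: the `OutOfBandProbe` class, the `check` and `alert` callables, and the consecutive-failure threshold are assumptions for illustration, not Datadog's actual setup. The key property is that both the check and the alert path live outside the platform being monitored.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OutOfBandProbe:
    """Exercises a public API the way a customer would, from outside the platform."""

    # Hypothetical: returns True when the customer-facing API responds correctly.
    check: Callable[[], bool]
    # Hypothetical: pages on-call via a channel independent of the monitored platform.
    alert: Callable[[str], None]
    # Assumed threshold so a single transient blip does not page anyone.
    failures_to_alert: int = 3
    consecutive_failures: int = 0

    def run_once(self) -> None:
        """Run one probe cycle and alert when the failure threshold is reached."""
        try:
            healthy = self.check()
        except Exception:
            # Treat any error (timeout, DNS failure, etc.) as an unhealthy result.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
            return

        self.consecutive_failures += 1
        # Alert exactly once, at the moment the threshold is crossed.
        if self.consecutive_failures == self.failures_to_alert:
            self.alert(
                f"API unreachable for {self.consecutive_failures} consecutive probes"
            )
```

In practice `run_once` would be invoked on a schedule from infrastructure that shares no dependencies with the monitored platform, so the probe keeps working when internal telemetry does not.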

k8s datadog automation cloud-infrastructure+4

datadog Jun 1, 2023

2023-03-08 incident: A deep dive into our incident response

Laura de Vesine In March, Datadog experienced a global outage. It was the first of its kind and called for a massive response that involved several hundred engineers working in shifts over the course of the outage, in addition to many concurrent video calls, chats, workstreams,…

k8s datadog automation monitoring+4