Effective habits of remote workers | Datadog

Cody Lee argues that remote workers in office-centric companies must intentionally project visibility and over-communicate to remain effective. By treating digital interactions with the same gravity as physical ones, remote employees can bridge the gap with their office-based colleagues and maintain strong professional relationships. The guide concludes that long-term success depends on the intentionality of one's "digital voice" and the strategic use of periodic in-person visits.

## Prioritizing Communication to Overcome Distance

* Over-communicate progress and blockers frequently to ensure teammates are aware of your status without needing to ask.
* Ask "dumb" questions openly in collaboration tools to stay engaged and leverage the collective knowledge of the team.
* Maintain high availability by keeping an updated calendar with visible blocks for lunch and focus time, mimicking the accessibility of an office setting.
* Prioritize face-to-face video communication over asynchronous messaging when active discussions are taking place.

## Embodying Digital Presence

* Always keep the camera on during virtual meetings to convey body language and build a higher level of trust.
* Maintain "eye contact" by looking at the camera rather than secondary screens, and speak up during meetings to remain an active participant.
* Be prepared to interrupt politely in hybrid meetings where office-based participants may dominate the conversation.
* Avoid the "blur" effect on backgrounds in favor of a clean, unblurred workspace or a virtual background that reflects your personality.

## Proactive Relationship Building

* Schedule recurring virtual coffee chats with colleagues across the organization to build rapport beyond immediate work tasks.
* Participate in non-work-related messaging threads and volunteer for team-building initiatives to become more than just a name on a screen.
* When visiting the office—ideally once per quarter—prioritize casual interactions, such as morning coffees and team dinners, over routine administrative tasks.
* Clear your schedule of non-critical remote meetings during in-person visits to leave room for spontaneous "hallway track" conversations.

## Balancing Visibility and Communication Fatigue

* Establish a consistent communication rhythm that provides meaningful updates without becoming overwhelming or irritating to colleagues.
* Tailor your outreach methods by consulting teammates’ "How to work with me" documents to respect their individual communication preferences.
* Focus on clarity and brevity in written text, as your writing effectively becomes your "voice" in a remote environment.

Succeeding as a remote worker requires a shift from passive participation to active, intentional engagement. By treating digital communication as a primary tool for presence and prioritizing periodic in-person connection, you can ensure your contributions remain visible and your professional relationships remain strong.

How we migrated our acceptance tests to use Synthetic Monitoring | Datadog

Datadog’s Frontend Developer Experience team migrated their massive codebase from a fragile, custom Puppeteer-based acceptance testing framework to Datadog Synthetic Monitoring to address persistent flakiness and high maintenance overhead. By leveraging a record-and-play approach and integrating it into their CI/CD pipelines via the `datadog-ci` tool, they successfully reduced developer friction and improved testing reliability for over 300 engineers. This transition demonstrates how replacing manual browser scripting with specialized monitoring tools can significantly streamline high-scale frontend workflows.

### Limitations of Puppeteer-Based Testing

* Custom runners built on Puppeteer suffered from inherent flakiness because they relied on a complex chain of virtual graphics engines, browser manipulation, and network stability that frequently failed unexpectedly.
* Writing tests was unintuitive, requiring engineers to manually script interaction details—such as verifying that a button is present and enabled before clicking—which became far more complex for custom elements like dropdowns.
* The testing infrastructure was slow and expensive, with CI jobs taking up to 35 minutes of machine time per commit to cover the application's 565 tests and 100,000 lines of test code.
* Maintenance was a constant burden; every product update required a corresponding manual update to the scripts, making the process as labor-intensive as writing new features.

### Adopting Synthetic Monitoring and Tooling

* The team moved to Synthetic Monitoring, which allows engineers to record browser interactions directly rather than writing code, significantly lowering the barrier to entry for creating tests.
* To integrate these tests into the development lifecycle, the team developed `datadog-ci`, a CLI tool designed to trigger tests and poll result statuses directly from the CI environment.
* The new system uses a specific file format (`.synthetics.json`) to identify tests within the codebase, allowing for configuration overrides and human-readable output in the build logs.
* This transition turned an internal need into a product improvement, as the `datadog-ci` tool was generalized to help all Datadog users execute commands from within their CI/CD scripts.

### Strategies for High-Scale Migration and Adoption

* The team utilized comprehensive documentation and internal "frontend gatherings" to educate 300 engineers on how to record tests and why the new system required less maintenance.
* To build developer trust, the team initially implemented the new tests as non-blocking CI jobs, surfacing failures as PR comments rather than breaking builds.
* Migration was treated as a distributed effort, with 565 individual tests tracked via Jira and assigned to their respective product teams to ensure ownership and a steady pace.
* By progressively sunsetting the old platform as tests were migrated, the team managed a year-long transition without disrupting the daily output of 160 authors pushing 90 new PRs every day.

To successfully migrate large-scale testing infrastructures, organizations should prioritize developer trust by introducing new tools through non-blocking pipelines and providing comprehensive documentation. Transitioning from manual browser scripting to automated recording tools not only reduces technical debt but also empowers engineers to maintain high-quality codebases without the burden of managing complex testing infrastructure.
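As a hedged illustration of the discovery format described above, a minimal `.synthetics.json` file might look like the following sketch (the public ID and start URL are placeholders, not values from the article):

```json
{
  "tests": [
    {
      "id": "abc-def-ghi",
      "config": {
        "startUrl": "https://app.example.com/login"
      }
    }
  ]
}
```

In CI, the `datadog-ci synthetics run-tests` command discovers files like this, triggers the referenced recorded tests, and polls for their results before reporting a human-readable summary in the build log.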

Performance improvements in the Datadog Agent metrics pipeline | Datadog

Datadog engineers recently optimized the Datadog Agent's metric processing pipeline to achieve higher throughput and lower CPU overhead. By identifying that metric context generation—the process of creating unique keys for metrics—was a primary bottleneck, they implemented a series of algorithmic changes and Go runtime optimizations. These improvements allow the Agent to process significantly more metrics using the same computational resources.

### Identifying Bottlenecks via CPU Profiling

* Developers utilized Go’s native profiling tools to capture CPU usage during high-volume metric ingestion via DogStatsD.
* Flamegraph analysis revealed that the `addSample` and `trackContext` functions were the most CPU-intensive components of the pipeline.
* The profiling data specifically pointed to tag sorting and deduplication as the underlying operations consuming the most processing time.

### The Challenges of Metric Context Generation

* The Agent must generate a unique hash (context) for every metric received to address it within a hash table in RAM.
* To ensure the same metric always generates the same key, the original algorithm required sorting all tags and ensuring their uniqueness.
* The computational cost of sorting lists repeatedly for every incoming message created a performance ceiling for the entire metrics pipeline.

### Specialization and Runtime Optimization

* **Algorithmic Specialization:** The team implemented specialized sorting logic that adjusts based on the number of tags, optimizing the "hot path" for the most common metric structures.
* **Hashing Efficiency:** Micro-benchmarks identified Murmur3 as the most efficient hash implementation for balancing speed and collision resistance in this use case.
* **Leveraging Go Runtime:** The team transitioned from 128-bit hashes to 64-bit metric contexts. This change allowed the Agent to utilize Go's internal `mapassign_fast64` and `mapaccess2_fast64` functions, which provide optimized map operations for 64-bit keys.

### Redesigning for Performance

* The original design followed a rigid "hash metric name -> sort tags -> deduplicate tags -> iterative hash" workflow.
* Recognizing that sorting was the primary architectural bottleneck, the team moved toward a new design intended to minimize or eliminate the overhead of traditional list sorting during context generation.

To achieve similar performance gains in high-throughput Go applications, developers should profile their applications under realistic load and look for opportunities to leverage runtime-specific optimizations, such as using 64-bit map keys to trigger specialized compiler paths.
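The context-generation workflow described above can be sketched in a few lines of Python (a hypothetical illustration, not the Agent's Go implementation; the FNV-1a hash stands in for Murmur3, and the function names are invented):

```python
def fnv1a_64(data: bytes) -> int:
    """Simple 64-bit FNV-1a hash, standing in here for the Agent's Murmur3."""
    h = 0xCBF29CE484222325
    for byte in data:
        h = ((h ^ byte) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


def context_key(name: str, tags: list[str]) -> int:
    """Derive a stable 64-bit context key for a metric.

    Tags are deduplicated and sorted so that the same metric always
    hashes to the same key, regardless of the order tags arrive in.
    """
    canonical = ",".join([name] + sorted(set(tags)))
    return fnv1a_64(canonical.encode("utf-8"))
```

Because the key is a plain 64-bit integer, a Go implementation storing contexts in a `map[uint64]` table benefits from the runtime's specialized fast-path map functions mentioned above.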

How Datadog uses Datadog to gain visibility into the Datadog user experience | Datadog

Datadog leverages its own monitoring tools to bridge the gap between qualitative user interviews and quantitative performance data. By "dogfooding" features like Real User Monitoring (RUM) and Logs, the product design team makes evidence-based UI/UX adjustments while gaining firsthand empathy for the user experience. This approach allows them to identify exactly how users interact with specific components and where current designs fail to meet user expectations.

**Optimizing Font Consistency via CSS API Tracking**

* To ensure visual precision in information-dense views like the Log Explorer, the team needed to transition from a generic system font stack to a standardized monospace font.
* Designers used the `Document.fonts` property of the CSS Font Loading API, together with Datadog RUM, to collect data on which specific fonts were actually being rendered on users' machines.
* By analyzing a dashboard of these results, the team selected Roboto Mono as the standard, ensuring the new font’s optical size matched what the plurality of users were already seeing to avoid breaking embedded tables.

**Simplifying Components through Interaction Logging**

* The `DraggablePane` component, used for resizing adjacent panels, was suffering from UI clutter due to physical buttons for minimizing and maximizing content.
* The team implemented custom loggers within Datadog Logs to track how frequently users clicked these specific controls versus interacting with the draggable handle.
* The data revealed that the buttons were almost never used; consequently, the team removed them and replaced the functionality with a double-click event, significantly streamlining the interface.

**Refining Syntax Support through Error Analysis**

* When introducing the `DateRangePicker` for custom time frames, the team needed to expand the component's logic to support natural language strings.
* By aggregating "invalid inputs" in Datadog Logs, the team could see the exact strings users were typing—such as "last 2 weeks"—that the system failed to parse.
* Analyzing these common patterns allowed the team to update the parsing logic for high-demand keywords, which resulted in the component’s error rate dropping from 10 percent to approximately 5 percent.

Leveraging internal monitoring tools allows design teams to move beyond guesswork and create highly functional interfaces. For organizations managing complex technical products, tracking specific component failures and interaction frequencies is an essential strategy for prioritizing the design roadmap and improving user retention.
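A parser for the kind of natural-language strings surfaced by that error analysis might look like the following Python sketch (hypothetical; the real `DateRangePicker` logic lives in Datadog's frontend and is not shown in the article):

```python
import re
from datetime import timedelta
from typing import Optional

# Map the matched unit word to the corresponding timedelta keyword argument.
_UNIT_TO_KWARG = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}


def parse_relative_range(text: str) -> Optional[timedelta]:
    """Parse inputs like 'last 2 weeks' into a timedelta.

    Returns None for unrecognized strings, so callers can log them
    as "invalid inputs" and mine the logs for high-demand patterns.
    """
    match = re.fullmatch(r"last\s+(\d+)\s+(minute|hour|day|week)s?", text.strip().lower())
    if match is None:
        return None  # candidate for an "invalid input" log entry
    count, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNIT_TO_KWARG[unit]: count})
```

Aggregating the `None` cases reveals exactly which phrasings users expect the component to understand, which is the feedback loop the team used to cut the error rate roughly in half.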

2023-03-08 incident: A deep dive into our incident response | Datadog

Datadog’s first global outage on March 8, 2023, served as a rigorous stress test for their established incident response framework and "you build it, you own it" philosophy. While the outage was triggered by a systemic failure during a routine systemd upgrade, the company's commitment to blameless culture and decentralized engineering autonomy allowed hundreds of responders to coordinate a complex recovery across multiple regions. Ultimately, the event validated their investment in out-of-band monitoring and rigorous, bi-annual incident training as essential components for managing high-scale system disasters.

## Incident Response Structure and Philosophy

* Datadog employs a decentralized "you build it, you own it" model where individual engineering teams are responsible for the 24/7 health and monitoring of the services they build.
* For high-severity incidents, a specialized rotation is paged, consisting of an Incident Commander to lead the response, a communications lead, and a customer liaison to manage external messaging.
* The organization prioritizes "people over process," empowering engineers to use their judgment to find creative solutions rather than following rigid, pre-written playbooks that may not apply to unprecedented failures.
* A blameless culture is strictly maintained across all levels of the company, ensuring that post-incident investigations focus on systemic improvements rather than assigning fault to individuals.

## Multi-Layered Monitoring Strategy

* Standard telemetry provides internal visibility, but Datadog also maintains "out-of-band" monitoring that operates completely outside its own infrastructure.
* This out-of-band system interacts with Datadog APIs exactly like a customer would, ensuring that engineers are alerted even if the internal monitoring platform itself becomes unavailable.
* Communication is streamlined through a dedicated Slack incident app that automatically generates coordination channels, providing situational awareness to any engineer who joins the effort.

## Anatomy of the March 8 Outage

* The outage began at 06:00 UTC, triggered by a systemd upgrade that caused widespread Kubernetes failures and prevented pods from restarting correctly.
* The global nature of the outage was diagnosed within 32 minutes of the initial monitoring alerts, leading to the activation of executive on-calls and the customer support management team.
* Responders identified "unattended upgrades" as the incident trigger approximately five and a half hours after the initial failure.
* Recovery was executed in stages: compute capacity was restored first in the EU1 region, followed by the US1 region, with full infrastructure restoration completed by 19:00 UTC.

Organizations should treat incident response as a perishable skill that requires constant practice through a low threshold for declaring incidents and regular training. By combining out-of-band monitoring with a culture that empowers individual engineers to act autonomously during a crisis, teams can more effectively navigate the "not if, but when" reality of large-scale system failures.

How we optimized our Akka application using Datadog’s Continuous Profiler | Datadog

Datadog engineers discovered a significant 20–30% CPU overhead in their Akka-based Java applications caused by inefficient thread management within the `ForkJoinPool`. Through continuous profiling, the team found that irregular task flows were forcing the runtime to waste cycles constantly parking and unparking threads. By migrating bursty actors to a dispatcher with a more stable workload, they achieved a major performance gain, illustrating how high-level framework abstractions can mask low-level resource bottlenecks.

### Identifying the Performance Bottleneck

* While running A/B tests on a new log-parsing algorithm, the team noticed that expected CPU reductions did not materialize; in some cases, performance actually degraded.
* Flame graphs revealed that the application was spending a disproportionate amount of CPU time inside the `ForkJoinPool.scan()` and `Unsafe.park()` methods.
* A summary table of CPU usage by thread showed that the "work" pool was only using 1% of the CPU, while the default Akka dispatcher was the primary consumer of resources.
* The investigation narrowed the cause down to the `LatencyReportActor`, which handled latency metrics for log events.

### Analyzing the Root Cause of Thread Fluctuations

* The `ForkJoinPool` manages worker threads dynamically, calling `Unsafe.park()` to suspend idle threads and `Unsafe.unpark()` to resume them when tasks increase.
* The `LatencyReportActor` exhibited an irregular task flow, processing several hundred events in milliseconds and then remaining idle until the next second.
* Because the default dispatcher was configured to use a thread pool equal to the number of processor cores (32), the system was waking up 32 threads every second for a tiny burst of work.
* This constant cycle of waking and suspending threads created massive CPU overhead through expensive native calls to the operating system's thread scheduler.

### Implementing a Configuration-Based Fix

* The solution involved moving the `LatencyReportActor` from the default Akka dispatcher to the main "work" dispatcher.
* Because the "work" dispatcher already maintained a consistent flow of log processing tasks, the threads remained active and did not trigger the frequent park/unpark logic.
* A single-line configuration change was used to route the actor to the stable dispatcher.
* Following the change, the default dispatcher’s thread pool shrank from 32 to 2 threads, and overall service CPU usage dropped by an average of 30%.

To maintain optimal performance in applications using `ForkJoinPool` or Akka, developers should monitor the `ForkJoinPool.scan()` method; if it accounts for more than 10–15% of CPU usage, the thread pool is likely unstable. Recommendations for remediation include limiting the number of actor instances, capping the maximum threads in a pool, and utilizing task queues to buffer short spikes. The ultimate goal is to ensure a stable count of active threads and avoid the performance tax of frequent thread state transitions.
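The single-line routing change described above can be expressed in Akka's deployment configuration, roughly as in this sketch (the actor path is invented; the article does not show the actual configuration):

```
# application.conf (sketch): route the bursty actor onto the existing,
# steadily loaded "work" dispatcher instead of the default dispatcher
akka.actor.deployment {
  "/latency-report-actor" {
    dispatcher = work-dispatcher
  }
}
```

Because the "work" dispatcher's threads already process a continuous stream of log events, the rerouted actor's one-second bursts no longer trigger the expensive park/unpark cycle on a separate idle pool.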

How we use Vale to improve our documentation editing process | Datadog

To manage a high volume of technical content across dozens of products, Datadog’s documentation team has automated its editorial process using the open-source linting tool Vale. By integrating these checks directly into their CI/CD pipeline via GitHub Actions, the team ensures prose consistency and clarity while significantly reducing the manual burden on technical writers. This "shift-left" approach empowers both internal and external contributors to identify and fix style issues independently before a formal human review begins.

### Scaling Documentation Workflows

* The Datadog documentation team operates at a 200:1 developer-to-writer ratio, managing over 1,400 contributors and 35 distinct products.
* In 2023 alone, the team merged over 20,000 pull requests covering 650 integrations, 400 security rules, and 65 API endpoints.
* On-call writers review an average of 40 pull requests per day, necessitating automation to handle triaging and style enforcement efficiently.

### Automated Prose Review with Vale

* Vale is implemented as a command-line tool and a GitHub Action that scans Markdown and HTML files for style violations.
* When a contributor opens a pull request, the linter provides automated comments in the "Files Changed" tab, flagging long sentences, wordy phrasing, or legacy formatting habits.
* This automation reduces the "mental toll" on writers by filtering out repetitive errors before they reach the human review stage.

### Codifying Style Guides into Rules

* The team transitioned from static editorial guidelines stored in Confluence and wikis to a codified repository called `datadog-vale`.
* Style rules are defined using Vale’s YAML specification, allowing the team to update global standards in a single location that is immediately active in the CI pipeline.
* Custom regular expressions are used to exclude specific content from validation, such as Hugo shortcodes or technical snippets that do not follow standard prose rules.

### Implementation of Specific Linting Rules

* **Jargon and Filler Words:** A `words.yml` file flags "cruft" such as "easily" or "simply" to maintain a professional, objective tone.
* **Oxford Comma Enforcement:** The `oxfordcomma.yml` rule uses regex to identify lists missing a serial comma and provides a suggestion to the author.
* **Latin Abbreviations:** The `abbreviations.yml` rule identifies terms like "e.g." or "i.e." and suggests plain English alternatives like "for example" or "that is."
* **Timelessness:** Rules flag words like "currently" or "now" to ensure documentation remains relevant without frequent updates.

By open-sourcing their Vale configurations, Datadog provides a framework for other organizations to automate their style guides and foster a more efficient, collaborative documentation culture. Teams looking to improve prose quality should consider adopting a similar "docs-as-code" approach to shift editorial effort toward the beginning of the contribution lifecycle.
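A filler-word check like the `words.yml` rule described above can be written against Vale's YAML rule specification roughly as follows (a sketch; the exact wording and token list in `datadog-vale` may differ):

```yaml
# words.yml (sketch): flag filler words that add no information
extends: existence
message: "Avoid filler words like '%s'."
level: warning
ignorecase: true
tokens:
  - easily
  - simply
```

The `existence` extension point matches any of the listed tokens in prose, and the `%s` placeholder in `message` is replaced with the specific word that triggered the warning in the PR comment.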