performance-monitoring

2 posts

toss

Working as a QA in a

Toss Place implements a dual-role QA structure where QA managers are embedded directly within product Silos from initial planning through final deployment. This shift moves QA from a final-stage bottleneck to a proactive partner that improves delivery speed and stability through deep historical context and early risk mitigation. Consequently, the organization has moved to a culture where quality is a shared team responsibility rather than a siloed functional task.

### Integrating QA into Product Silos

* QA managers belong to both a central functional team and specific product units (Silos), ensuring involvement across the entire product lifecycle.
* Participation begins at the OKR design phase, allowing QA to align testing strategies with product intentions and business goals.
* Early involvement enables accurate risk assessment and scope estimation, preventing the "shallow testing" that often occurs when QA only sees the final product.

### Optimizing Spec Reviews and Sanity Testing

* The team introduced a structured flow of Spec Reviews followed by Q&A sessions to reduce repetitive discussions and information gaps.
* All specification changes are centralized in shared design tools (such as Deus) or messenger threads to ensure transparency across roles.
* "Sanity Test" criteria were established: developers and QA agree on "Happy Case" validations and minimum spec requirements before development begins, so everyone starts from the same baseline.

### Collaborative Live Monitoring

* Post-release checklists involve the entire Silo in live monitoring, overcoming the limits of having a single QA manager per unit.
* This collaborative approach encourages non-technical roles to interact with the live product, reinforcing the culture that quality is a collective team responsibility.
### Streamlining Issue Tracking and Communication

* The team implemented a "Send to Notion" workflow to instantly capture messenger-based feedback and ideas into a structured, prioritized backlog.
* To reduce communication fragmentation, they moved from Jira to integrated Messenger Lists and Canvases, enabling centralized discussions and faster issue resolution.
* Backlogs are prioritized by user-experience impact and release urgency, so critical bugs are addressed first while minor improvements are tracked for future cycles.

The success of these initiatives shows that QA effectiveness is driven by integration and autonomy rather than rigid adherence to specific tools. To achieve both high velocity and high quality, organizations should empower QA professionals to act as product peers who can flexibly adapt their processes to the unique needs and data-driven goals of their product teams.
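The prioritization rule described above (user-experience impact first, release urgency second) can be sketched as a simple sort. All field names, issue titles, and scores here are hypothetical illustrations, not Toss Place's actual backlog schema:

```python
# Hypothetical backlog items; "impact" and "urgency" are illustrative 1-5 scores.
backlog = [
    {"title": "Typo on receipt screen",     "impact": 1, "urgency": 1},
    {"title": "Payment fails on retry",     "impact": 5, "urgency": 5},
    {"title": "Slow load on order history", "impact": 3, "urgency": 2},
]

# Sort highest impact first, breaking ties by urgency: critical bugs
# surface at the top, minor improvements trail for future cycles.
ranked = sorted(backlog, key=lambda i: (i["impact"], i["urgency"]), reverse=True)
for issue in ranked:
    print(issue["title"])
```

A compound sort key like this keeps the rule explicit and auditable, which matters when non-technical roles review the same backlog.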

datadog

.NET Continuous Profiler: Under the hood | Datadog

Datadog’s Continuous Profiler timeline view addresses the limitations of traditional aggregate profiling by adding temporal context to resource consumption. It lets developers visualize how CPU usage, memory allocation, and thread activity evolve over time, making it easier to pinpoint transient performance regressions that averages often mask. By correlating execution patterns with specific time windows, teams can move beyond static flame graphs to understand the root causes of latency spikes and resource contention in live environments.

### Moving Beyond Aggregate Profiling

* Traditional flame graphs aggregate data over a period, which can hide short-lived performance issues or intermittent stalls that barely move the overall average.
* The timeline view adds a chronological dimension, mapping stack traces to specific timestamps to show exactly when resource-intensive operations occurred.
* This temporal granularity is essential for identifying "noisy neighbors" or periodic background tasks, such as scheduled jobs or cache invalidations, that disrupt request processing.

### Visualizing Thread Activity and Runtime Contention

* The tool visualizes individual thread states, distinguishing active CPU execution from waiting on locks and I/O operations.
* Developers can spot "Stop-the-World" garbage collection events or thread starvation by observing gaps in execution or excessive synchronization overhead in the timeline.
* Specific metrics, including lock wait time and file/socket I/O, are overlaid on the timeline to give a comprehensive view of how code interacts with the underlying runtime and hardware.

### Correlating Profiles with Distributed Traces

* Integration between profiling and tracing lets users pivot from a slow span in a distributed trace directly to the corresponding timeline view of the executing thread.
* This correlation helps explain "unaccounted for" time in traces, such as time spent waiting for a CPU core or blocked on a mutex, that traditional tracing cannot capture.
* Filtering by service, version, or environment isolates performance regressions, speeding root-cause analysis during post-mortems.

To optimize production performance effectively, teams should make timeline analysis a standard part of debugging latency spikes rather than relying solely on aggregate metrics. By combining chronological thread analysis with distributed tracing, developers can resolve complex concurrency issues and tail-latency problems that aggregate profiling often overlooks.
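The core argument, that averages mask the transient spikes a timeline view reveals, can be demonstrated with a few lines of arithmetic. This is a minimal sketch with fabricated latency numbers, assuming a service where a periodic background task (e.g. a cache invalidation) blocks one request in fifty:

```python
# 49 fast requests (~10 ms) followed by 1 slow one (500 ms), repeated:
# the kind of periodic stall a timeline view makes visible at a glance.
latencies_ms = ([10.0] * 49 + [500.0]) * 20  # 1000 samples

def p99(samples):
    """99th-percentile latency via a simple sorted-index lookup."""
    ordered = sorted(samples)
    return ordered[int(len(ordered) * 0.99)]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean:.1f} ms")              # 19.8 ms, looks healthy
print(f"p99:  {p99(latencies_ms):.1f} ms") # 500.0 ms, the hidden stall
```

The mean suggests a healthy service while the p99 exposes the periodic 500 ms stall; an aggregate flame graph averages over both, whereas the timeline pins the stall to the exact moments the background task runs.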