
Measuring Product Impact Without A/B Testing: How Discord Used the Synthetic Control Method for Voice Messages

When Discord launched Voice Messages in 2023, the engineering and data teams faced a significant hurdle in measuring the feature's impact through traditional A/B testing. Because the feature is inherently social, requiring both a sender and a receiver, standard user-level randomization would fail to capture the true causal effect due to heavy network interference. The team had to navigate the limitations of their testing infrastructure, weighing imperfect user-level tests against geographically biased alternatives before ultimately turning to the synthetic control method.

The Conflict Between Social Features and SUTVA

  • Traditional A/B testing relies on the Stable Unit Treatment Value Assumption (SUTVA), which posits that the behavior of one user is independent of the treatment assignment of others.
  • Voice Messages break this assumption because the feature’s value is realized through interactions; if a sender is in the treatment group but the receiver is in control, the experimental boundaries blur.
  • Network effects occur when treatment behavior in one group influences the control group, potentially skewing metrics and leading to an inaccurate understanding of the feature's success.
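The interference described above can be made concrete with a toy simulation (hypothetical numbers, not Discord data). Treated senders spill part of the feature's value onto their receivers regardless of the receiver's own assignment, so a naive user-level comparison understates the total impact:

```python
# Toy simulation of network interference in a user-level A/B test.
# All parameters (direct=1.0, spillover=0.5) are illustrative assumptions.
import random

random.seed(0)

def simulate(n_pairs=50_000, direct=1.0, spillover=0.5):
    """Each pair is a sender and a receiver. The feature adds `direct`
    engagement for a treated receiver, but a treated sender also lifts
    their receiver's engagement by `spillover`, regardless of which arm
    the receiver is in."""
    treat, control = [], []
    for _ in range(n_pairs):
        sender_treated = random.random() < 0.5
        receiver_treated = random.random() < 0.5
        base = random.gauss(10, 1)
        y = base + direct * receiver_treated + spillover * sender_treated
        (treat if receiver_treated else control).append(y)
    return sum(treat) / len(treat) - sum(control) / len(control)

# The spillover lands equally on both arms and cancels out, so the naive
# difference recovers only the direct effect (~1.0), not the feature's
# full per-user impact (direct + spillover = 1.5).
print(f"naive estimate: {simulate():.2f}  (true total effect: 1.50)")
```

This is the sense in which user-level randomization "blurs the experimental boundaries": the control arm is contaminated by treated senders, and the measured lift no longer equals the true causal effect.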

Infrastructure Constraints and Randomization Strategies

  • The ideal solution for social platforms is cluster randomization, which assigns entire networks or communities to a single experimental arm to contain interactions.
  • Discord’s internal testing platform did not support cluster randomization at the time of the Voice Message launch, forcing the team to consider less-than-ideal methodologies.
  • User-level randomization was considered unsuitable for this specific use case because it could not account for the interconnected nature of Discord’s user base.
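A minimal sketch of what cluster randomization looks like in practice, assuming a simple undirected friendship graph (the `connected_components` and `assign_arm` helpers are illustrative, not Discord's platform API):

```python
# Illustrative cluster randomization: assign whole connected components
# of the social graph to one arm, so sender and receiver always share
# a treatment assignment and interactions stay inside the cluster.
import hashlib

def connected_components(edges):
    """Union-find over an undirected friendship edge list; returns a
    mapping from user to the id of their component's root."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    return {x: find(x) for x in parent}

def assign_arm(component_id, salt="voice-messages-exp"):
    """Deterministically hash a component id into an experimental arm."""
    h = hashlib.sha256(f"{salt}:{component_id}".encode()).digest()
    return "treatment" if h[0] % 2 == 0 else "control"

edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
comp = connected_components(edges)
for user in sorted(comp):
    print(user, assign_arm(comp[user]))
# alice, bob, and carol land in the same arm; dave and erin in theirs.
```

Hashing the component id (rather than individual user ids) is what keeps every member of a network in the same arm; the trade-off is fewer, larger randomization units and therefore less statistical power.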

The Trade-offs of Geo-Testing

  • One proposed alternative was randomizing by country, based on the assumption that most social networks are language or country-specific.
  • By treating an entire geographic region while keeping another as a control, the team hoped to mitigate cross-group network interference.
  • However, geo-testing introduces significant bias, as it conflates the treatment effect with existing cultural, economic, and behavioral differences between countries.
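The synthetic control method named in the title is a way to correct for this bias: fit weights on untreated countries so that their combined metric series tracks the treated country before launch, then read the post-launch gap between the treated country and its synthetic twin as the estimated effect. A toy sketch with fabricated numbers (a production implementation would typically constrain the weights to be nonnegative and sum to one; plain least squares keeps the sketch short):

```python
# Minimal synthetic-control sketch on fabricated data, not Discord's.
import numpy as np

rng = np.random.default_rng(0)

T_pre, T_post, n_controls = 30, 10, 5
# Metric time series for 5 untreated "donor" countries.
controls = rng.normal(100, 5, size=(T_pre + T_post, n_controls)).cumsum(axis=0) / 10

# Treated country: a fixed mix of the donors, plus a +3.0 lift after launch.
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
treated = controls @ true_w
treated[T_pre:] += 3.0

# Fit donor weights on the pre-launch period only.
w, *_ = np.linalg.lstsq(controls[:T_pre], treated[:T_pre], rcond=None)

# The synthetic control is the weighted donor combination; the post-launch
# gap between treated and synthetic is the estimated treatment effect.
synthetic = controls @ w
effect = (treated[T_pre:] - synthetic[T_pre:]).mean()
print(f"estimated post-launch lift: {effect:.2f}")
```

Because the counterfactual is built from the treated country's own pre-launch relationship to the donors, persistent cultural and economic differences between countries are absorbed into the weights instead of contaminating the estimate.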

To accurately measure the impact of features built on social connectivity, organizations must account for network interference that violates standard statistical assumptions. When cluster randomization infrastructure is unavailable, data teams must carefully weigh the bias introduced by geographic testing against the interference inherent in user-level randomization, or, as Discord did, construct a synthetic counterfactual from untreated regions to correct for that geographic bias.