Making group conversations more accessible with sound localization

Google Research has introduced SpeechCompass, a system designed to improve mobile captioning for group conversations by integrating multi-microphone sound localization. By shifting away from complex voice-recognition models toward geometric signal processing, the system provides real-time speaker diarization and directional guidance through a color-coded visual interface. This approach significantly reduces the cognitive load for users who previously had to manually associate a wall of scrolling text with different speakers in a room.

Limitations of Standard Mobile Transcription

  • Traditional automatic speech recognition (ASR) apps concatenate all speech into a single block of text, making it difficult to distinguish between different participants in a group setting.
  • Existing high-end solutions often require audio-visual separation, which needs a clear line of sight from a camera, or speaker embedding, which requires pre-registering unique voiceprints.
  • These methods can be computationally expensive and often fail in spontaneous mobile settings, where privacy and quick setup are priorities.

Hardware and Signal Localization

  • The system was prototyped in two forms: a specialized phone case with four microphones wired to an STM32 microcontroller, and a software-only implementation for standard dual-microphone smartphones.
  • While dual-microphone setups are limited to 180-degree localization due to "front-back confusion," the four-microphone array enables full 360-degree sound tracking.
  • The system estimates the angle of arrival of sound waves from the time difference of arrival (TDOA) between microphones, computed with generalized cross-correlation with phase transform (GCC-PHAT).
  • To handle indoor reverberation and noise, the team applied statistical methods such as kernel density estimation to improve the precision of the localizer; a sketch of both steps follows this list.
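The post names these building blocks without publishing code. Below is a minimal Python sketch of the pipeline under stated assumptions: one microphone pair with known spacing, frame-by-frame processing, and illustrative names and parameters throughout (the 7 cm spacing and the KDE window are assumptions, not Google's values).

```python
import numpy as np
from scipy.stats import gaussian_kde

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def gcc_phat(sig, ref, fs, max_tau, interp=16):
    """TDOA between two mic signals via GCC-PHAT: cross-correlate in the
    frequency domain, keeping only the phase so the correlation peak
    stays sharp under reverberation."""
    n = sig.shape[0] + ref.shape[0]
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12                    # PHAT weighting
    cc = np.fft.irfft(cross, n=interp * n)            # upsampled correlation
    max_shift = int(interp * fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / float(interp * fs)

def angle_from_tdoa(tau, mic_distance):
    """Delay -> angle of arrival. With a single mic pair the arcsine is
    symmetric about the mic axis (the 180-degree front-back confusion
    noted above); a perpendicular second pair resolves the full 360."""
    arg = np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(arg))

def smoothed_angle(recent_angles):
    """Indoor reverberation makes per-frame estimates jumpy; report the
    mode of a kernel density estimate over recent frames instead."""
    kde = gaussian_kde(recent_angles)
    grid = np.linspace(-90.0, 90.0, 361)
    return grid[np.argmax(kde(grid))]

# Demo: white noise reaching the second microphone three samples later.
rng = np.random.default_rng(0)
fs, mic_d = 16000, 0.07                               # 7 cm spacing (assumed)
src = rng.standard_normal(4096)
tau = gcc_phat(src, np.roll(src, 3), fs, max_tau=mic_d / SPEED_OF_SOUND)
print(angle_from_tdoa(tau, mic_d))                    # about -67 degrees
print(smoothed_angle([-66.8, -65.2, -10.0, -67.5, -66.1]))  # outlier-resistant
```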

Advantages of Waveform-Based Diarization

  • Low Latency and Compute: Because the algorithm avoids heavy machine-learning models and their weights, it can run on low-power microcontrollers with minimal memory.
  • Privacy Preservation: Unlike speaker embedding techniques, SpeechCompass does not identify unique voiceprints or require video, instead relying purely on the physical location of the sound source.
  • Language Independence: Because the system analyzes the differences between audio waveforms rather than the speech content itself, it is entirely language-agnostic and can even localize non-speech sounds (the sketch after this list shows diarization from angles alone).
  • Dynamic Reconfiguration: The system adjusts instantly to the movement of the device, allowing users to reposition their phones without recalibrating the diarization logic.
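To make the privacy and language-independence points concrete: once each caption frame carries only an angle, diarization reduces to grouping angles, and no voiceprint or speech content ever enters the loop. A hypothetical sketch (the class name, 20-degree merge threshold, and smoothing factor are all assumptions, not from the post):

```python
class DirectionDiarizer:
    """Assign stable speaker labels from angles alone: pure geometry,
    so it is language-agnostic and never inspects audio content."""

    def __init__(self, merge_threshold_deg=20.0):
        self.merge_threshold = merge_threshold_deg
        self.speaker_angles = []                # running angle per label

    @staticmethod
    def _diff(a, b):
        """Signed shortest angular difference a - b, in [-180, 180)."""
        return (a - b + 180.0) % 360.0 - 180.0

    def label(self, angle_deg):
        for idx, known in enumerate(self.speaker_angles):
            delta = self._diff(angle_deg, known)
            if abs(delta) < self.merge_threshold:
                # Nudge the stored angle toward the new observation so a
                # speaker can drift slowly without spawning a new label.
                self.speaker_angles[idx] = (known + 0.1 * delta) % 360.0
                return idx
        self.speaker_angles.append(angle_deg % 360.0)
        return len(self.speaker_angles) - 1

diarizer = DirectionDiarizer()
print([diarizer.label(a) for a in (12.0, 15.0, 170.0, 320.0, 14.0)])
# -> [0, 0, 1, 2, 0]: readings near 12 degrees share one label.
```

Because nothing here touches the waveform's content, repositioning the phone merely shifts the stored angles; no retraining or recalibration is involved, which matches the dynamic-reconfiguration point above.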

User Interface and Accessibility

  • The prototype Android application augments standard speech-to-text with directional data received via USB from the microphone array.
  • Transcripts are visually separated by color and accompanied by directional arrows, letting users quickly identify where each speaker is located in the room; a toy sketch of this mapping follows this list.
  • This visual feedback loop transforms a traditional transcript into a spatial map of the conversation, making group interactions more accessible for individuals who are deaf or hard of hearing.
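The post describes the interface rather than its code; here is a toy rendering of the idea, mapping each speaker to a hue and each angle to an arrow glyph (every name below is made up for illustration):

```python
import colorsys

ARROWS = ["↑", "↗", "→", "↘", "↓", "↙", "←", "↖"]   # eight compass directions

def arrow_for(angle_deg):
    """Arrow glyph for an angle (0 = straight ahead, increasing clockwise)."""
    return ARROWS[int(((angle_deg % 360) + 22.5) // 45) % 8]

def color_for(speaker_id, palette_size=6):
    """Spread speaker colors evenly around the hue wheel."""
    h = (speaker_id % palette_size) / palette_size
    r, g, b = colorsys.hsv_to_rgb(h, 0.7, 0.9)
    return "#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255))

def render_caption(text, speaker_id, angle_deg):
    return f"{arrow_for(angle_deg)} [{color_for(speaker_id)}] {text}"

print(render_caption("See you at noon?", speaker_id=0, angle_deg=305.0))
# -> ↖ [#e54444] See you at noon?
```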