# Making group conversations more accessible with sound localization

Google Research has introduced SpeechCompass, a system designed to improve mobile captioning for group conversations by integrating multi-microphone sound localization. By shifting away from complex voice-recognition models toward geometric signal processing, the system provides real-time speaker diarization and directional guidance through a color-coded visual interface. This approach significantly reduces the cognitive load for users who previously had to manually match a wall of scrolling text to the different speakers in a room.

## Limitations of Standard Mobile Transcription

* Traditional automatic speech recognition (ASR) apps concatenate all speech into a single block of text, making it difficult to distinguish between participants in a group setting.
* Existing high-end solutions often require audio-visual separation, which needs a clear camera line of sight, or speaker embedding, which requires pre-registering unique voiceprints.
* These methods can be computationally expensive and often fail in spontaneous, mobile environments where privacy and setup speed are priorities.

## Hardware and Signal Localization

* The system was prototyped in two forms: a specialized phone case featuring four microphones connected to an STM32 microcontroller, and a software-only implementation for standard dual-microphone smartphones.
* While dual-microphone setups are limited to 180-degree localization due to "front-back confusion," the four-microphone array enables full 360-degree sound tracking (see the perpendicular-pair sketch at the end of this post).
* The system uses Time-Difference of Arrival (TDOA) and Generalized Cross-Correlation with Phase Transform (GCC-PHAT) to estimate the angle of arrival of sound waves (a minimal sketch follows below).
* To handle indoor reverberation and noise, the team applied statistical methods such as kernel density estimation to improve the precision of the localizer (see the smoothing sketch below).

## Advantages of Waveform-Based Diarization

* **Low Latency and Compute:** By avoiding heavy machine-learning models and weights, the algorithm can run on low-power microcontrollers with minimal memory requirements.
* **Privacy Preservation:** Unlike speaker-embedding techniques, SpeechCompass does not identify unique voiceprints or require video; it relies purely on the physical location of the sound source.
* **Language Independence:** Because the system analyzes differences between audio waveforms rather than the speech content itself, it is entirely language-agnostic and can localize non-speech sounds.
* **Dynamic Reconfiguration:** The system adjusts instantly to movement of the device, allowing users to reposition their phones without recalibrating the diarization logic.

## User Interface and Accessibility

* The prototype Android application augments standard speech-to-text with directional data received via USB from the microphone array.
* Transcripts are visually separated by color and accompanied by directional arrows, allowing users to quickly identify where a speaker is located in physical space (a toy color-assignment sketch closes the post).
* This visual feedback loop turns a traditional transcript into a spatial map of the conversation, making group interactions more accessible for people who are deaf or hard of hearing.
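
The sketches below illustrate the geometric core described above; they are minimal Python approximations under stated assumptions, not Google's implementation. First, GCC-PHAT delay estimation for a single microphone pair, plus the standard far-field conversion from delay to angle (the function names, the 343 m/s speed of sound, and the mic-spacing parameter are illustrative assumptions):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def gcc_phat(sig, ref, fs, max_tau):
    """Estimate the delay of `sig` relative to `ref`, in seconds."""
    # Zero-pad so the circular cross-correlation behaves like a linear one.
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    # PHAT weighting: discard magnitude, keep only phase. This sharpens the
    # correlation peak despite the spectral coloring of speech and reverb.
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    # Only consider delays that are physically possible for the mic spacing.
    max_shift = min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

def tdoa_to_angle(tau, mic_distance):
    """Far-field model: tau = mic_distance * cos(theta) / c."""
    cos_theta = np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0)
    # arccos spans only 0-180 degrees: this is the front-back ambiguity
    # of a single microphone pair mentioned above.
    return np.degrees(np.arccos(cos_theta))
```

With microphones roughly 15 cm apart and 48 kHz audio, the physically possible delay is under half a millisecond, i.e. a search over only a couple dozen samples per frame, which is consistent with the post's claim that the algorithm fits on a low-power microcontroller.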
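
A single pair can only say "somewhere on a 180-degree arc," which is the front-back confusion noted earlier. A second, perpendicular pair resolves it: the two delays act like the cosine and sine projections of the arrival direction, so `atan2` recovers the full circle. Again a hedged sketch under the same far-field assumption, with both pairs assumed to share the same spacing:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def angle_360(tau_x, tau_y, mic_distance):
    """Fuse delays from two perpendicular mic pairs into a 0-360 degree angle.

    tau_x and tau_y are GCC-PHAT delays for the pairs aligned with the
    device's x and y axes.
    """
    ux = np.clip(SPEED_OF_SOUND * tau_x / mic_distance, -1.0, 1.0)
    uy = np.clip(SPEED_OF_SOUND * tau_y / mic_distance, -1.0, 1.0)
    # atan2 keeps the quadrant, so front and back are no longer confused.
    return np.degrees(np.arctan2(uy, ux)) % 360.0
```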
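
The post credits kernel density estimation with stabilizing the localizer against indoor reverberation and noise. One plausible formulation (an assumption, not the published method) is to pool recent per-frame angle estimates and report the mode of a KDE over them, tiling samples at ±360° so the density is not artificially split at the wraparound:

```python
import numpy as np
from scipy.stats import gaussian_kde

def smoothed_direction(frame_angles_deg, bandwidth=0.15):
    """Return the dominant direction (degrees) from noisy per-frame angles."""
    a = np.asarray(frame_angles_deg, dtype=float) % 360.0
    # Tile samples at +/-360 so density near 0/360 degrees stays contiguous.
    tiled = np.concatenate([a - 360.0, a, a + 360.0])
    kde = gaussian_kde(tiled, bw_method=bandwidth)
    grid = np.arange(0.0, 360.0, 1.0)
    # The KDE mode is far more stable than any single frame's estimate.
    return float(grid[np.argmax(kde(grid))])
```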
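
Finally, a toy sketch of how localized angles could drive the color-coded transcript. The source states only that captions are colored and arrowed by direction; the clustering rule, the 30-degree merge threshold, and the palette here are all illustrative assumptions:

```python
# Toy direction-to-color assigner; palette and threshold are assumptions,
# not SpeechCompass values.
PALETTE = ["#4285F4", "#EA4335", "#FBBC04", "#34A853"]

class DirectionDiarizer:
    """Maps each caption's angle of arrival to a stable color.

    A "speaker" here is just an angular region: no voiceprints and no video,
    which is what keeps the approach private and language-agnostic.
    """

    def __init__(self, merge_threshold_deg=30.0):
        self.centers = []  # one running angle per direction seen so far
        self.threshold = merge_threshold_deg

    def color_for(self, angle_deg):
        for i, center in enumerate(self.centers):
            # Signed circular distance in (-180, 180] degrees.
            diff = (angle_deg - center + 180.0) % 360.0 - 180.0
            if abs(diff) < self.threshold:
                # Nudge the cluster center toward the new observation.
                self.centers[i] = (center + 0.1 * diff) % 360.0
                return PALETTE[i % len(PALETTE)]
        self.centers.append(angle_deg % 360.0)
        return PALETTE[(len(self.centers) - 1) % len(PALETTE)]

diarizer = DirectionDiarizer()
diarizer.color_for(80.0)   # first direction -> first color
diarizer.color_for(85.0)   # nearby angle -> same region, same color
diarizer.color_for(250.0)  # new direction -> next color
```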