speech-to-text

4 posts

google

Next generation medical image interpretation with MedGemma 1.5 and medical speech to text with MedASR

Google Research has introduced MedGemma 1.5 4B and MedASR, expanding its suite of open medical AI models to support more complex clinical workflows. These updates significantly enhance the interpretation of high-dimensional imaging and medical speech-to-text, providing a compute-efficient foundation for healthcare developers to build upon. By keeping the models openly available on Hugging Face and Vertex AI, Google aims to accelerate the integration of multimodal AI into real-world medical applications.

## Multimodal Advancements in MedGemma 1.5

The latest update to the MedGemma 4B model focuses on high-dimensional and longitudinal data, moving beyond simple 2D image interpretation.

* **3D Medical Imaging:** The model now supports volumetric representations from CT scans and MRIs, as well as whole-slide histopathology imaging.
* **Longitudinal Review:** New capabilities allow for the review of chest X-ray time series, helping clinicians track disease progression over time.
* **Anatomical Localization:** Developers can use the model to identify and localize specific anatomical features within chest X-rays.
* **Document Understanding:** Enhanced support for extracting structured data from complex medical lab reports and documents.
* **Edge Capability:** The 4B parameter size is specifically designed to be small enough to run offline while remaining accurate enough for core medical reasoning tasks.

## Medical Speech-to-Text with MedASR

MedASR is a specialized automatic speech recognition (ASR) model designed to bridge the gap between clinical dialogue and digital documentation.

* **Clinical Dictation:** The model is specifically fine-tuned for medical terminology and the unique nuances of clinical dictation.
* **Integrated Reasoning:** MedASR is designed to pair seamlessly with MedGemma, allowing transcribed text to be immediately processed for advanced medical reasoning or summarization (see the sketch after this summary).
* **Accessibility:** Like other HAI-DEF models, it is free for research and commercial use and hosted on both Hugging Face and Google Cloud’s Vertex AI.

## Performance Benchmarks and Community Impact

Google is highlighting measurable performance gains and incentivizing innovation through community-driven challenges.

* **Accuracy Gains:** Internal benchmarks show MedGemma 1.5 improved disease-related CT classification by 3% and MRI classification by 14% compared to the previous version.
* **MedGemma Impact Challenge:** A Kaggle-hosted hackathon with $100,000 in prizes has been launched to encourage developers to find creative applications for these multimodal tools.
* **Model Collection:** The update complements existing tools like the MedSigLIP image encoder and the larger MedGemma 27B model, which remains the preferred choice for complex, text-heavy medical applications.

Developers and researchers are encouraged to use MedGemma 1.5 for tasks requiring efficient, offline multimodal processing, while leveraging MedASR to automate clinical documentation. By participating in the MedGemma Impact Challenge, the community can help define the next generation of AI-assisted medical diagnostics and workflows.
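For developers who want to try the pairing described above, the sketch below outlines one way to chain the two models with Hugging Face `transformers` pipelines: transcribe a dictation with MedASR, then ask MedGemma to summarize the transcript. The repository IDs, audio file name, and prompt are illustrative assumptions, not official usage instructions; check each model card for the actual identifiers, licenses, and recommended inputs.

```python
# Minimal sketch of chaining MedASR transcription into MedGemma summarization.
# The repo IDs below are assumptions for illustration; consult the HAI-DEF
# collection on Hugging Face for the real model names and usage terms.
from transformers import pipeline

# 1) Transcribe a clinical dictation with the ASR model.
asr = pipeline("automatic-speech-recognition", model="google/medasr")  # assumed ID
transcript = asr("dictation.wav")["text"]

# 2) Feed the transcript to MedGemma for structured summarization.
llm = pipeline("text-generation", model="google/medgemma-1.5-4b-it")  # assumed ID
prompt = (
    "Summarize the following clinical dictation as a structured note "
    "with sections for history, findings, and plan:\n\n" + transcript
)
note = llm(prompt, max_new_tokens=512)[0]["generated_text"]
print(note)
```

For the imaging capabilities described earlier, the model card's multimodal (image-plus-text) interface would be the entry point rather than plain text generation.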

line

Into the Passionate Energy of the PD1 AI Hackathon 2025

The PD1 AI Hackathon 2025 served as a strategic initiative by LY Corporation to embed innovative artificial intelligence directly into the LINE messaging ecosystem. Over 60 developers collaborated during an intensive 48-hour session to transition AI from a theoretical concept into practical features for messaging, content, and internal development workflows. The event successfully produced several high-utility prototypes that demonstrate how AI can enhance user safety, creative expression, and technical productivity.

## Transforming Voice Communication through NextVoIP

* The "NextVoIP" project utilized Speech-to-Text (STT) technology to convert 1:1 and group call audio into real-time data for AI analysis (a hypothetical sketch of this pattern appears after this summary).
* The system was designed to provide safety features by detecting potential emergency situations or accidents through conversation monitoring.
* AI acted as a communication assistant by suggesting relevant content and conversation topics to help maintain a seamless flow during calls.
* Features were implemented to allow callers to enjoy shared digital content together, enriched by AI-driven recommendations.

## Creative Expression with MELODY LINE

* This project focused on the intersection of technology and art by converting chat conversations into unique musical compositions.
* The system analyzed the context and emotional sentiment of messages to automatically generate melodies that matched the atmosphere of the chat.
* The implementation showcased the potential for generative AI to provide a multi-sensory experience within a standard messaging interface.

## AI-Driven QA and Test Automation

* The grand prize-winning project, "IPD," addressed the bottleneck of repetitive manual testing by automating the entire quality assurance lifecycle.
* AI was utilized to automatically generate and manage complex test cases, significantly reducing the manual effort required for mobile app validation.
* The system included automated test execution and a diagnostic feature that identifies the root cause of failures when a test results in an error.
* The project was specifically lauded for its immediate "production-ready" status, offering a direct path to improving development speed and software reliability.

The results of this hackathon suggest that the most immediate value for AI in large-scale messaging platforms lies in two areas: enhancing user experience through contextual awareness and streamlining internal engineering via automated QA. Organizations should look toward integrating AI-driven testing tools to reduce technical debt while exploring real-time audio and text analysis to provide proactive safety and engagement features for users.
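As a rough illustration of the NextVoIP pattern (speech-to-text feeding an AI safety check), the hypothetical sketch below transcribes a call segment and screens the transcript with a zero-shot classifier. The models, labels, and threshold are placeholders chosen for illustration; they are not what the hackathon team built.

```python
# Hypothetical sketch of the NextVoIP idea: transcribe call audio, then screen
# each utterance for possible emergencies. Model choices are illustrative only.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
screener = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

LABELS = ["emergency or accident", "ordinary conversation"]

def screen_call_segment(wav_path: str) -> dict:
    """Transcribe one audio segment and estimate whether it signals an emergency."""
    text = asr(wav_path)["text"]
    scores = screener(text, candidate_labels=LABELS)
    # The pipeline returns labels sorted by descending score.
    return {"text": text, "label": scores["labels"][0], "score": scores["scores"][0]}

result = screen_call_segment("call_segment.wav")
if result["label"] == "emergency or accident" and result["score"] > 0.8:
    print("Possible emergency detected:", result["text"])
```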

google

Making group conversations more accessible with sound localization

Google Research has introduced SpeechCompass, a system designed to improve mobile captioning for group conversations by integrating multi-microphone sound localization. By shifting away from complex voice-recognition models toward geometric signal processing, the system provides real-time speaker diarization and directional guidance through a color-coded visual interface. This approach significantly reduces the cognitive load for users who previously had to manually associate a wall of scrolling text with different speakers in a room.

## Limitations of Standard Mobile Transcription

* Traditional automatic speech recognition (ASR) apps concatenate all speech into a single block of text, making it difficult to distinguish between different participants in a group setting.
* Existing high-end solutions often require audio-visual separation, which needs a clear line of sight from a camera, or speaker embedding, which requires pre-registering unique voiceprints.
* These methods can be computationally expensive and often fail in spontaneous, mobile environments where privacy and setup speed are priorities.

## Hardware and Signal Localization

* The system was prototyped in two forms: a specialized phone case featuring four microphones connected to an STM32 microcontroller, and a software-only implementation for standard dual-microphone smartphones.
* While dual-microphone setups are limited to 180-degree localization due to "front-back confusion," the four-microphone array enables full 360-degree sound tracking.
* The system uses Time-Difference of Arrival (TDOA) and Generalized Cross Correlation with Phase Transform (GCC-PHAT) to estimate the angle of arrival of sound waves (a generic sketch of this estimator appears after this summary).
* To handle indoor reverberation and noise, the team applied statistical methods like kernel density estimation to improve the precision of the localizer.

## Advantages of Waveform-Based Diarization

* **Low Latency and Compute:** By avoiding heavy machine learning models and weights, the algorithm can run on low-power microcontrollers with minimal memory requirements.
* **Privacy Preservation:** Unlike speaker embedding techniques, SpeechCompass does not identify unique voiceprints or require video, instead relying purely on the physical location of the sound source.
* **Language Independence:** Because the system analyzes the differences between audio waveforms rather than the speech content itself, it is entirely language-agnostic and can localize non-speech sounds.
* **Dynamic Reconfiguration:** The system adjusts instantly to movement of the device, allowing users to reposition their phones without recalibrating the diarization logic.

## User Interface and Accessibility

* The prototype Android application augments standard speech-to-text with directional data received via USB from the microphone array.
* Transcripts are visually separated by color and accompanied by directional arrows, allowing users to quickly identify where a speaker is located in the physical space.
* This visual feedback loop transforms a traditional transcript into a spatial map of the conversation, making group interactions more accessible for individuals who are deaf or hard of hearing.
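The localization pipeline above rests on a standard estimator, and the sketch below shows its textbook form for a single microphone pair: GCC-PHAT cross-correlation to estimate a time difference of arrival, then a far-field conversion to a bearing. This is a generic illustration with placeholder sampling rate and microphone spacing, not the SpeechCompass implementation.

```python
# Textbook GCC-PHAT time-difference-of-arrival estimate for one microphone pair,
# converted to an angle of arrival. Illustrative only; not the SpeechCompass code.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int, interp: int = 16) -> float:
    """Return the estimated delay (seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12              # PHAT weighting: keep only phase information
    cc = np.fft.irfft(R, n=interp * n)  # upsampled cross-correlation
    max_shift = interp * n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)

def angle_of_arrival(tau: float, mic_spacing_m: float) -> float:
    """Map a delay to a bearing (degrees) for a two-microphone pair (far field)."""
    sin_theta = np.clip(tau * SPEED_OF_SOUND / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Example with placeholder values: 16 kHz audio, microphones 8 cm apart.
fs, spacing = 16000, 0.08
mic_a = np.random.randn(fs)                        # stand-in for one captured frame
mic_b = np.concatenate([np.zeros(3), mic_a[:-3]])  # same signal arriving 3 samples later
tau = gcc_phat(mic_b, mic_a, fs)                   # delay of mic_b relative to mic_a
print(f"delay {tau * 1e6:.1f} us -> bearing {angle_of_arrival(tau, spacing):.1f} deg")
```

With a four-microphone array, the same estimate is repeated over multiple pairs and the per-pair bearings are fused, which is what resolves the front-back ambiguity mentioned above.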

google

Deciphering language processing in the human brain through LLM representations

Recent research by Google Research and collaborating universities indicates that Large Language Models (LLMs) process natural language through internal representations that closely mirror neural activity in the human brain. By comparing intracranial recordings from spontaneous conversations with the internal embeddings of the Whisper speech-to-text model, the study found a high degree of linear alignment between artificial and biological language processing. These findings suggest that the statistical structures learned by LLMs via next-word prediction provide a viable computational framework for understanding how humans comprehend and produce speech.

## Mapping LLM Embeddings to Brain Activity

* Researchers utilized intracranial electrodes to record neural signals during real-world, free-flowing conversations.
* The study compared neural activity against two distinct types of embeddings from the Transformer-based Whisper model: "speech embeddings" from the model’s encoder and "language embeddings" from the decoder.
* A linear transformation was used to predict brain signals from these embeddings, revealing that LLMs and the human brain share similar multidimensional spaces for coding linguistic information (a generic sketch of this encoding-model recipe appears after this summary).
* The alignment suggests that human language processing may rely more on statistical structure and contextual embeddings than on traditional symbolic rules or syntactic parts of speech.

## Neural Sequences in Speech Comprehension

* When a subject listens to speech, the brain follows a specific chronological sequence that aligns with model representations.
* Initially, speech embeddings predict cortical activity in the superior temporal gyrus (STG), which is responsible for processing auditory speech sounds.
* A few hundred milliseconds later, language embeddings predict activity in Broca’s area (located in the inferior frontal gyrus), marking the transition from sound perception to decoding meaning.

## Reversed Dynamics in Speech Production

* During speech production, the neural sequence is reversed, beginning approximately 500 milliseconds before a word is articulated.
* Processing starts in Broca’s area, where language embeddings predict activity as the brain plans the semantic content of the utterance.
* This is followed by activity in the motor cortex (MC), aligned with speech embeddings, as the brain prepares the physical articulatory movements.
* Finally, after articulation, speech embeddings predict activity back in the STG, suggesting the brain is monitoring the sound of the speaker's own voice.

This research validates the use of LLMs as powerful predictive tools for neuroscience, offering a new lens through which to study the temporal and spatial dynamics of human communication. By bridging the gap between artificial intelligence and cognitive biology, researchers can better model how the brain integrates sound and meaning in real time.
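The "linear transformation" in the study is an encoding model: a regularized linear regression from word-aligned embeddings to electrode activity, evaluated on held-out data. The sketch below shows that general recipe with scikit-learn on placeholder arrays; the shapes, regularization grid, and train/test split are assumptions rather than the paper's exact protocol.

```python
# Generic linear encoding-model recipe: predict per-electrode neural activity
# from per-word embeddings. Shapes and hyperparameters are placeholders, not
# the study's exact configuration.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

n_words, emb_dim, n_electrodes = 5000, 512, 64
X = np.random.randn(n_words, emb_dim)        # stand-in for word-aligned Whisper embeddings
Y = np.random.randn(n_words, n_electrodes)   # stand-in for word-aligned neural signals

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# Regularized linear map fit jointly across electrodes, scored per electrode.
model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, Y_tr)
Y_hat = model.predict(X_te)

# Score each electrode by the correlation between predicted and recorded activity.
corr = [np.corrcoef(Y_hat[:, e], Y_te[:, e])[0, 1] for e in range(n_electrodes)]
print(f"mean encoding correlation across electrodes: {np.mean(corr):.3f}")
```

Comparing how well encoder ("speech") versus decoder ("language") embeddings predict each electrode over time is what yields the region-by-region sequences described in the comprehension and production sections above.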