Google Research / transformer


Titans + MIRAS: Helping AI have long-term memory

Google Research has introduced Titans, a new architecture, and MIRAS, a theoretical framework, designed to overcome the computational limitations of Transformers while maintaining high-fidelity long-term memory. These innovations rely on "test-time memorization," allowing a model to update the parameters of a dedicated memory module in real time as it processes data, without requiring offline retraining. By combining the speed of linear recurrent neural networks (RNNs) with the accuracy of attention mechanisms, the approach enables AI to handle massive contexts such as genomic analysis or full-document understanding.

## Titans and Neural Long-Term Memory

* Unlike traditional RNNs that compress context into fixed-size vectors or matrices, Titans uses a multi-layer perceptron (MLP) as a dedicated long-term memory module.
* This deep neural memory provides significantly higher expressive power, allowing the model to synthesize and understand entire narratives rather than just storing passive snapshots.
* The architecture separates memory into two distinct modules: an attention mechanism for precise short-term context and the MLP for summarizing long-term information.

## The Gradient-Based Surprise Metric

* Titans employs a "surprise metric" to decide which information is important enough to store, mirroring the human brain's tendency to remember unexpected events.
* The model calculates an internal error signal (gradient); a high gradient indicates that the new input is anomalous or context-breaking, signaling that it should be prioritized for long-term storage.
* The system incorporates "momentum" to track the flow of context over time, ensuring that subsequent relevant information is captured even if individual tokens are not surprising.
* To manage memory capacity during extremely long sequences, an adaptive weight decay mechanism acts as a forgetting gate to discard information that is no longer useful. (A toy sketch of this update rule appears after this summary.)

## MIRAS: A Unified Framework for Sequence Modeling

* MIRAS provides a theoretical blueprint that views all major sequence models—including Transformers and linear RNNs—as different forms of associative memory modules.
* The framework defines sequence models through four key design choices: memory architecture (e.g., MLP vs. vector), attentional bias, a retention (forgetting) gate, and the internal learning objective used to combine new and old data.
* This approach shifts AI modeling toward real-time adaptation, where the model actively learns and incorporates specific new details into its core knowledge as data streams in.

These advancements suggest a shift away from static context windows toward dynamic systems capable of lifelong learning. For developers working with large-scale data, the Titans architecture provides a practical tool for scaling performance, while the MIRAS framework offers a roadmap for designing next-generation models that adapt instantly to new information.
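
Below is a minimal numpy sketch of the test-time memorization loop described above: a tiny MLP memory is updated with the gradient of a reconstruction loss (the "surprise"), folded into a momentum term, and decayed by a forgetting gate. The network size, learning rate, momentum, and decay values are illustrative assumptions, not the published Titans configuration.

```python
import numpy as np

# Toy "neural memory": a single-hidden-layer MLP mapping keys to values.
# All shapes and hyperparameters below are illustrative assumptions.
rng = np.random.default_rng(0)
d, h = 16, 32
W1 = rng.normal(scale=0.1, size=(h, d))
W2 = rng.normal(scale=0.1, size=(d, h))
S1, S2 = np.zeros_like(W1), np.zeros_like(W2)   # momentum ("past surprise")

def update_memory(k, v, lr=0.1, eta=0.9, alpha=0.01):
    """One test-time memorization step for a (key, value) token pair."""
    global W1, W2, S1, S2
    hact = np.tanh(W1 @ k)
    err = W2 @ hact - v                          # reconstruction error
    # Gradients of 0.5 * ||M(k) - v||^2 w.r.t. the memory weights:
    gW2 = np.outer(err, hact)
    gW1 = np.outer((W2.T @ err) * (1 - hact**2), k)
    # "Momentary surprise" (gradient) folded into "past surprise" (momentum).
    S1 = eta * S1 - lr * gW1
    S2 = eta * S2 - lr * gW2
    # Adaptive weight decay acts as a forgetting gate on stale memories.
    W1 = (1 - alpha) * W1 + S1
    W2 = (1 - alpha) * W2 + S2
    return np.linalg.norm(err)                   # larger = more surprising

for _ in range(100):                             # stream of incoming tokens
    k, v = rng.normal(size=d), rng.normal(size=d)
    surprise = update_memory(k, v)
```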


Real-time speech-to-speech translation

Google DeepMind and Google Core ML have developed an end-to-end speech-to-speech translation (S2ST) model that enables real-time, voice-preserved communication with only a two-second delay. By replacing traditional cascaded pipelines with a streaming architecture trained on time-synchronized data, the system overcomes long-standing issues of high latency and accumulated errors. This advancement represents a significant shift toward natural, fluid cross-language dialogue that retains the original speaker's personality.

## Limitations of Cascaded S2ST

Traditional real-time translation systems typically rely on a cascaded chain of three distinct AI models: Automatic Speech Recognition (ASR), Automatic Speech Translation (AST), and Text-to-Speech (TTS). This approach suffers from several critical drawbacks:

* **High Latency:** Processing through three separate stages results in a 4–5 second delay, forcing users into unnatural, turn-based interactions.
* **Error Propagation:** Inaccuracies in the initial transcription or translation phase accumulate, often leading to garbled or incorrect final audio output.
* **Loss of Identity:** General-purpose TTS engines generate generic voices, stripping the communication of the original speaker's unique vocal characteristics.

## Time-Synced Data Acquisition Pipeline

To train an end-to-end model capable of low-latency output, researchers created a scalable pipeline that transforms raw audio into a specialized time-synchronized dataset.

* **Alignment Multi-mapping:** The process uses forced alignment algorithms to map source audio to source text, source text to translated text, and finally, translated text to generated speech.
* **Voice Preservation:** A custom TTS engine generates the target-language audio while intentionally preserving the vocal characteristics of the original speaker.
* **Strict Validation:** Automated filters discard any segments where alignments fail or where the translated audio cannot meet specific real-time delay requirements.
* **Data Augmentation:** The training set is further refined using techniques such as sample-rate reduction, denoising, and reverberation to ensure the model performs well in real-world environments.

## End-to-End Streaming Architecture

The model's architecture is designed for continuous audio streams, leveraging the AudioLM framework and standard transformer blocks to make real-time decisions (a schematic streaming loop appears after this summary).

* **Streaming Encoder:** This component summarizes source audio by focusing on the preceding 10-second window of input.
* **Streaming Decoder:** This module predicts translated audio autoregressively, using compressed encoder states and previous predictions to maintain flow.
* **RVQ Audio Tokens:** The system represents audio as a 2D grid of Residual Vector Quantization (RVQ) tokens, where the X-axis represents time and the Y-axis represents audio quality/fidelity.
* **SpectroStream Integration:** By using the SpectroStream codec, the model manages hierarchical audio representations, allowing it to prioritize the sequential output of audio segments for immediate playback.

This technology effectively bridges the gap between high-quality translation and real-time responsiveness. For developers and researchers in the field, the transition from modular cascaded systems to end-to-end streaming architectures—supported by rigorous time-aligned datasets—is the recommended path for achieving truly seamless human-to-human cross-language communication.
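
As a rough illustration of the streaming design (not the released model), the sketch below keeps a rolling ~10-second window of source frames, summarizes it with a placeholder encoder, and autoregressively emits one column of RVQ tokens per step for immediate playback. The frame duration, RVQ depth, codebook size, and the `encode_frame` / `predict_rvq_frame` functions are assumptions standing in for the trained components.

```python
import numpy as np
from collections import deque

FRAME_MS = 80                          # assumed duration of one model frame
WINDOW_FRAMES = 10_000 // FRAME_MS     # ~10 s of trailing source context
RVQ_DEPTH = 8                          # assumed number of residual quantizer levels

rng = np.random.default_rng(0)

def encode_frame(audio_frame, context):
    """Placeholder streaming encoder: summarizes the trailing window."""
    return np.mean(np.stack(list(context) + [audio_frame]), axis=0)

def predict_rvq_frame(encoder_state, prev_tokens):
    """Placeholder autoregressive decoder: one column of RVQ tokens
    (coarse-to-fine along the quality axis) per time step."""
    return rng.integers(0, 1024, size=RVQ_DEPTH)

context = deque(maxlen=WINDOW_FRAMES)          # rolling 10-second source window
prev_tokens = np.zeros(RVQ_DEPTH, dtype=int)

def step(audio_frame):
    """Consume one source frame, emit one frame of translated audio tokens."""
    global prev_tokens
    state = encode_frame(audio_frame, context)
    context.append(audio_frame)
    prev_tokens = predict_rvq_frame(state, prev_tokens)
    return prev_tokens                 # handed to the codec for immediate playback

for _ in range(50):                    # simulate a few seconds of input
    tokens = step(rng.normal(size=128))
```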


Time series foundation models can be few-shot learners

Researchers at Google have introduced TimesFM-ICF, a foundation model that enables time-series forecasting to transition from zero-shot to few-shot learning via in-context fine-tuning. By utilizing continued pre-training and specialized separator tokens, the model learns to adapt to a handful of related examples at inference time without requiring the complex supervised fine-tuning typically needed for task-specific optimization. This approach effectively matches or exceeds the performance of specialized models while maintaining the flexibility of a general-purpose foundation model.

### Overcoming the Limitations of Zero-Shot Models

* Traditional forecasting often requires building separate, specialized models for every unique task, which is resource-intensive and slow.
* While zero-shot models like the original TimesFM provide immediate forecasts without task-specific training, they cannot incorporate relevant context, such as data from nearby sensors or similar historical patterns.
* The In-Context Fine-tuning (ICF) approach allows the model to "learn" from a few examples provided at the time of prediction, similar to how Large Language Models (LLMs) use few-shot prompting.

### Architecture and the Common Separator Token

* TimesFM-ICF utilizes a patched decoder architecture that tokenizes 32 contiguous timepoints into a single input token.
* To prevent the model from conflating different data streams—such as separate store locations or distinct time periods—researchers introduced a "common separator token" as a digital boundary between examples (a prompt-construction sketch follows this summary).
* The model processes these tokens through a transformer stack using causal self-attention (CSA), ensuring it learns from historical context without accidentally "peeking" into the future.
* A shared multilayer perceptron (MLP) translates the processed output tokens back into a forecast spanning 128 timepoints.

### Performance Benchmarking and Results

* The model was evaluated on 23 unseen datasets, using the Mean Absolute Scaled Error (MASE) metric to aggregate performance across diverse time-series tasks.
* TimesFM-ICF demonstrated a significant performance boost over the original zero-shot TimesFM and other state-of-the-art foundation models like Moirai and Lag-Llama.
* Test results showed that providing just a few in-context examples allowed the model to match the accuracy of supervised fine-tuning, which normally requires much more computational overhead and data curation.

TimesFM-ICF represents a practical shift for businesses managing diverse data streams, offering a way to achieve high-accuracy forecasts by simply providing a few relevant historical examples. For those looking to optimize inventory or energy demands, this method provides the precision of a custom-tuned model with the deployment speed of a pre-trained foundation model.
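
The sketch below shows one plausible way to assemble a few-shot prompt in the style described above: each series is cut into 32-point patches, and related examples are joined with a common separator token ahead of the target history. The `patchify` and `build_prompt` helpers and the string separator are hypothetical stand-ins for illustration; the released TimesFM code exposes its own interfaces.

```python
import numpy as np

PATCH_LEN = 32      # timepoints per input token (from the post)
HORIZON = 128       # timepoints produced per output token (from the post)
SEP = "<sep>"       # common separator token between in-context examples

def patchify(series):
    """Split a 1-D series into contiguous 32-point patches (drop remainder)."""
    n = len(series) // PATCH_LEN * PATCH_LEN
    return [series[i:i + PATCH_LEN] for i in range(0, n, PATCH_LEN)]

def build_prompt(context_examples, target_history):
    """Interleave few-shot examples and the target history with separators."""
    prompt = []
    for example in context_examples:
        prompt.extend(patchify(example))
        prompt.append(SEP)                 # boundary: don't blend data streams
    prompt.extend(patchify(target_history))
    return prompt

rng = np.random.default_rng(0)
nearby_stores = [rng.normal(size=256) for _ in range(3)]   # related series
this_store = rng.normal(size=256)                          # series to forecast
prompt = build_prompt(nearby_stores, this_store)
# A trained model would run `prompt` through its transformer stack and emit
# the next HORIZON timepoints for `this_store`; here we only inspect shapes.
print(len(prompt), HORIZON)
```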


Highly accurate genome polishing with DeepPolisher: Enhancing the foundation of genomic research

DeepPolisher is a deep learning-based genome assembly tool designed to correct base-level errors with high precision, significantly enhancing the accuracy of genomic research. By leveraging a Transformer architecture to analyze sequencing data, the tool reduces total assembly errors by 50% and insertion or deletion (indel) errors by 70%. This advancement is critical for creating near-perfect reference genomes, such as the Human Pangenome Reference, which are essential for identifying disease-causing variants and understanding human evolution.

## Limitations of Current Sequencing Technologies

* Genome assembly relies on reading nucleotides (A, T, G, and C), but the microscopic scale of these base pairs makes accurate, large-scale sequencing difficult.
* Short-read sequencing methods provide high signal strength but are limited to a few hundred nucleotides because identical DNA clusters eventually desynchronize, blending signals together.
* Long-read technologies can sequence tens of thousands of nucleotides but initially suffered from high error rates (~10%); while tools like DeepConsensus have reduced this to 0.1%, further refinement is necessary for high-fidelity reference genomes.
* Even a 0.1% error rate results in millions of inaccuracies across the 3-billion-nucleotide human genome, which can cause researchers to miss critical genetic markers or misidentify proteins.

## DeepPolisher Architecture and Training

* DeepPolisher is an open-source pipeline adapted from the DeepConsensus model, utilizing a Transformer-based neural network.
* The model was trained using a human cell line from the Personal Genomes Project that is estimated to be 99.99999% accurate, providing a "ground truth" for identifying and correcting errors.
* The system takes sequenced bases, their associated quality scores, and the orientation of the DNA strands to learn complex error patterns that traditional methods might miss (a toy encoding of these inputs is sketched after this summary).
* By combining sequence reads from multiple DNA molecules of the same individual, the tool iteratively "polishes" the assembly to reach the accuracy required for reference-grade data.

## Impact on Genomic Accuracy and Gene Discovery

* The tool's ability to reduce indel errors by 70% is particularly significant, as these specific errors often interfere with the identification of protein-coding genes.
* DeepPolisher has already been integrated into major research efforts, including the enhancement of the Human Pangenome Reference, providing a more robust foundation for clinical diagnostics.
* Improved assembly accuracy allows for better mapping of regions where the genome is highly repetitive, which were previously difficult to sequence and assemble confidently.

For researchers and bioinformaticians, DeepPolisher represents a vital step in moving from "draft" genomes to high-fidelity references. Adopting this tool in assembly pipelines can drastically improve the reliability of variant calling and gene annotation, especially in complex clinical and evolutionary studies.
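
To make the input description concrete, here is a hedged sketch of encoding one assembly position's read pileup (bases, quality scores, strand orientation) into arrays a transformer could consume. The encoding scheme, window size, and normalization are illustrative assumptions, not DeepPolisher's actual featurization.

```python
import numpy as np

BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3, "-": 4}   # '-' marks a gap

def encode_pileup(reads, quals, strands, window=20):
    """Encode up to `window` reads covering one assembly position.

    reads   : list of base characters from each aligned read
    quals   : list of Phred-scaled base-quality scores
    strands : list of +1 (forward) / -1 (reverse) orientations
    """
    bases = np.full(window, BASE_TO_INT["-"], dtype=np.int64)
    quality = np.zeros(window, dtype=np.float32)
    strand = np.zeros(window, dtype=np.float32)
    for i, (b, q, s) in enumerate(zip(reads, quals, strands)):
        if i >= window:
            break
        bases[i] = BASE_TO_INT[b]
        quality[i] = q / 60.0          # rough normalization of Phred scores
        strand[i] = s
    return bases, quality, strand

# Example: eight reads covering one position, one of them disagreeing.
bases, quality, strand = encode_pileup(
    reads=list("AAAAAGAA"),
    quals=[30, 35, 40, 28, 33, 12, 37, 31],
    strands=[1, -1, 1, 1, -1, 1, -1, 1],
)
# A transformer over windows of such positions would predict the corrected
# base (here, 'A') wherever the draft assembly disagrees with the evidence.
```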


LSM-2: Learning from incomplete wearable sensor data

LSM-2 introduces a paradigm shift in processing wearable sensor data by treating naturally occurring data gaps as inherent features rather than errors to be corrected. By utilizing the Adaptive and Inherited Masking (AIM) framework, the model learns directly from fragmented, real-world data streams without the need for biased imputation or data-discarding filters. This approach allows LSM-2 to achieve state-of-the-art performance in health-related classification and regression tasks, maintaining robustness even when sensors fail or data is highly interrupted.

## The Challenge of Pervasive Missingness

* Real-world wearable data is almost never continuous; factors such as device charging, motion artifacts, and battery-saving modes create frequent "missingness."
* Traditional self-supervised learning models require complete data, forcing researchers to use imputation—which can introduce artificial bias—or aggressive filtering that discards over 90% of potentially useful samples.
* In a dataset of 1.6 million day-long windows, research found that not a single sample had 0% missingness, highlighting the impracticality of training only on complete datasets.

## Adaptive and Inherited Masking (AIM)

* AIM extends the Masked Autoencoder (MAE) framework by treating "inherited" masks (naturally occurring gaps) and "artificial" masks (training objectives) as equivalent (see the sketch after this summary).
* The framework utilizes a dual masking strategy: it employs token dropout on a fixed ratio of tokens to ensure computational efficiency during encoding.
* To handle the unpredictable and variable nature of real-world gaps, AIM uses attention masking within the transformer blocks for any remaining masked tokens.
* During evaluation and fine-tuning, the model relies solely on attention masking to navigate naturally occurring gaps, allowing for accurate physiological modeling without filling in missing values.

## Scale and Training Architecture

* LSM-2 was trained on a massive dataset comprising 40 million hours of de-identified wearable data from more than 60,000 participants using Fitbit and Google Pixel devices.
* The model learns to understand underlying physiological structures by reconstructing masked segments across multimodal inputs, including heart signals, sleep patterns, and activity levels.
* Because it is trained on fragmented data, the resulting foundation model is significantly more resilient to sensor dropouts in downstream tasks like hypertension prediction or stress monitoring.

LSM-2 demonstrates that foundation models for health should be built to embrace the messiness of real-world environments. By integrating missingness directly into the self-supervised learning objective, developers can bypass the computational and statistical overhead of imputation while building more reliable diagnostic and monitoring tools.
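
The numpy sketch below illustrates the dual-masking idea under stated assumptions: inherited (sensor-gap) and artificial (training-objective) masks are merged, a fixed share of the masked tokens is dropped outright for encoder efficiency, and the remainder is hidden via an attention mask. The ratios, shapes, and helper names are illustrative, not the LSM-2 implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens = 64
tokens = rng.normal(size=(num_tokens, 8))          # patched sensor tokens
inherited = rng.random(num_tokens) < 0.3            # naturally missing patches

def aim_masks(inherited, artificial_ratio=0.5, drop_ratio=0.8):
    """Return (kept_indices, attention_mask) for one training window."""
    artificial = rng.random(len(inherited)) < artificial_ratio
    masked = inherited | artificial                  # treated as equivalent
    masked_idx = np.flatnonzero(masked)
    # Token dropout: remove a fixed share of masked tokens before encoding.
    n_drop = int(drop_ratio * len(masked_idx))
    dropped = rng.choice(masked_idx, size=n_drop, replace=False)
    kept = np.setdiff1d(np.arange(len(inherited)), dropped)
    # Attention masking: remaining masked tokens stay in the sequence but
    # are excluded as keys/values inside the transformer blocks.
    attend = ~masked[kept]
    return kept, attend

kept, attend = aim_masks(inherited)
encoder_input = tokens[kept]          # shorter sequence -> cheaper encoding
# At evaluation time only the inherited mask exists, so the model relies
# solely on attention masking and never has to impute the gaps.
```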


Graph foundation models for relational data

Google researchers have introduced Graph Foundation Models (GFMs) as a solution to the limitations of traditional tabular machine learning, which often ignores the rich connectivity of relational databases. By representing tables as interconnected graphs where rows are nodes and foreign keys are edges, this approach enables a single model to generalize across entirely different schemas and feature sets. This shift allows for transferable graph representations that can perform inference on unseen tasks without the costly need for domain-specific retraining.

### Transforming Relational Schemas into Graphs

The core methodology involves a scalable data preparation step that converts standard relational database structures into a single heterogeneous graph (a toy conversion is sketched after this summary). This process preserves the underlying logic of the data while making it compatible with graph-based learning:

* **Node Mapping:** Each unique table is treated as a node type, and every individual row within that table is converted into a specific node.
* **Edge Creation:** Foreign key relationships are transformed into typed edges that connect nodes across different tables.
* **Feature Integration:** Standard columns containing numerical or categorical data are converted into node features, while temporal data can be preserved as features on either nodes or edges.

### Overcoming the Generalization Gap

A primary hurdle in developing GFMs is the lack of a universal tokenization method, unlike the word pieces used in language models or patches used in vision models. Traditional Graph Neural Networks (GNNs) are typically locked to the specific graph they were trained on, but GFMs solve this through several technical innovations:

* **Schema Agnosticism:** The model avoids hard-coded embedding tables for specific node types, allowing it to interpret database schemas it has never encountered during training.
* **Feature Interaction Learning:** Instead of training on "absolute" features (like specific price distributions), the model captures how different features interact with one another across diverse tasks.
* **Generalizable Encoders:** The architecture uses transferable methods to derive fixed-size representations for nodes, whether they contain three continuous float features or dozens of categorical values.

### Scaling and Real-World Application

To handle the requirements of enterprise-level data, the GFM framework is built to operate on a massive scale using Google's specialized infrastructure:

* **Massive Throughput:** The system utilizes JAX and TPU infrastructure to process graphs containing billions of nodes and edges.
* **Internal Validation:** The model has been tested on complex internal Google tasks, such as spam detection in advertisements, which requires analyzing dozens of interconnected relational tables simultaneously.
* **Performance Benefits:** By considering the connections between rows—a factor traditional tabular baselines like decision trees often ignore—the GFM provides superior downstream performance in high-stakes prediction services.

Transitioning from domain-specific models to Graph Foundation Models allows organizations to leverage relational data more holistically. By focusing on the connectivity of data rather than just isolated table features, GFMs provide a path toward a single, generalist model capable of handling diverse enterprise tasks.
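
Here is a small, self-contained sketch of the relational-to-graph conversion, assuming a toy two-table schema (`users`, `orders`) and a simple dictionary-based graph container; the real pipeline operates on billion-edge graphs with JAX/TPU infrastructure and richer feature encoders.

```python
from dataclasses import dataclass, field

@dataclass
class HeteroGraph:
    nodes: dict = field(default_factory=dict)   # node_type -> list of feature dicts
    edges: dict = field(default_factory=dict)   # edge_type -> list of (src, dst)

def tables_to_graph(tables, foreign_keys):
    """Rows become nodes, foreign keys become typed edges, columns become features."""
    g = HeteroGraph()
    row_index = {}                              # (table, primary key) -> node id
    for table, rows in tables.items():
        g.nodes[table] = []
        for i, row in enumerate(rows):
            row_index[(table, row["id"])] = i
            # Non-key columns become node features (numerical / categorical).
            g.nodes[table].append({k: v for k, v in row.items() if k != "id"})
    for src_table, fk_col, dst_table in foreign_keys:
        etype = f"{src_table}->{dst_table}"
        g.edges[etype] = []
        for i, row in enumerate(tables[src_table]):
            dst = row_index[(dst_table, row[fk_col])]
            g.edges[etype].append((i, dst))
    return g

tables = {
    "users":  [{"id": 1, "country": "US"}, {"id": 2, "country": "DE"}],
    "orders": [{"id": 10, "user_id": 1, "amount": 42.0},
               {"id": 11, "user_id": 2, "amount": 7.5}],
}
graph = tables_to_graph(tables, foreign_keys=[("orders", "user_id", "users")])
# graph.edges["orders->users"] == [(0, 0), (1, 1)]
```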


Deciphering language processing in the human brain through LLM representations

Recent research by Google Research and collaborating universities indicates that Large Language Models (LLMs) process natural language through internal representations that closely mirror neural activity in the human brain. By comparing intracranial recordings from spontaneous conversations with the internal embeddings of the Whisper speech-to-text model, the study found a high degree of linear alignment between artificial and biological language processing. These findings suggest that the statistical structures learned by LLMs via next-word prediction provide a viable computational framework for understanding how humans comprehend and produce speech.

## Mapping LLM Embeddings to Brain Activity

* Researchers utilized intracranial electrodes to record neural signals during real-world, free-flowing conversations.
* The study compared neural activity against two distinct types of embeddings from the Transformer-based Whisper model: "speech embeddings" from the model's encoder and "language embeddings" from the decoder.
* A linear transformation was used to predict brain signals from these embeddings, revealing that LLMs and the human brain share similar multidimensional spaces for coding linguistic information (a synthetic-data sketch of this regression follows this summary).
* The alignment suggests that human language processing may rely more on statistical structures and contextual embeddings than on traditional symbolic rules or syntactic parts of speech.

## Neural Sequences in Speech Comprehension

* When a subject listens to speech, the brain follows a specific chronological sequence that aligns with model representations.
* Initially, speech embeddings predict cortical activity in the superior temporal gyrus (STG), which is responsible for processing auditory speech sounds.
* A few hundred milliseconds later, language embeddings predict activity in Broca's area (located in the inferior frontal gyrus), marking the transition from sound perception to decoding meaning.

## Reversed Dynamics in Speech Production

* During speech production, the neural sequence is reversed, beginning approximately 500 milliseconds before a word is articulated.
* Processing starts in Broca's area, where language embeddings predict activity as the brain plans the semantic content of the utterance.
* This is followed by activity in the motor cortex (MC), aligned with speech embeddings, as the brain prepares the physical articulatory movements.
* Finally, after articulation, speech embeddings predict activity back in the STG, suggesting the brain is monitoring the sound of the speaker's own voice.

This research validates the use of LLMs as powerful predictive tools for neuroscience, offering a new lens through which to study the temporal and spatial dynamics of human communication. By bridging the gap between artificial intelligence and cognitive biology, researchers can better model how the brain integrates sound and meaning in real time.
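
As a concrete, synthetic-data analogue of the analysis, the sketch below fits a ridge regression from per-word embedding vectors to a single electrode's signal around word onset and scores held-out correlation. The array shapes, regularization strength, and the choice of ridge regression as the linear transformation are illustrative placeholders rather than the study's actual settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, emb_dim = 2000, 384
embeddings = rng.normal(size=(n_words, emb_dim))       # one vector per word
# Synthetic stand-in for an electrode's response: a linear readout plus noise.
true_map = rng.normal(size=emb_dim) / np.sqrt(emb_dim)
neural = embeddings @ true_map + 0.5 * rng.normal(size=n_words)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, neural, test_size=0.2, random_state=0)

encoder = Ridge(alpha=10.0).fit(X_train, y_train)      # the linear transformation
pred = encoder.predict(X_test)
# Encoding performance is typically summarized as the correlation between
# predicted and measured signals on held-out words.
r = np.corrcoef(pred, y_test)[0, 1]
print(f"held-out correlation: {r:.2f}")
```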