genomics

4 posts

google

Accelerating the magic cycle of research breakthroughs and real-world applications

Google Research is accelerating a "magic cycle" in which breakthrough scientific discoveries and real-world applications continuously reinforce one another through advanced AI models and open platforms. By leveraging agentic tools and large-scale foundation models, the company is transforming complex data into actionable insights across geospatial analysis, genomics, and quantum computing. This iterative process aims to solve critical global challenges while simultaneously uncovering new frontiers for future innovation.

### Earth AI and Geospatial Reasoning

* Google has integrated various geospatial models—including those for flood forecasting, wildfire tracking, and air quality—into a unified Earth AI program.
* The newly introduced Geospatial Reasoning agent uses large language models (LLMs) to let non-experts ask complex questions and receive plain-language answers derived from diverse datasets.
* Riverine flood models have been significantly expanded and now provide forecasts for over 2 billion people across 150 countries.
* New Remote Sensing and Population Dynamics foundation models have been released to help researchers understand nuanced correlations in planetary data and supply chain management.

### DeepSomatic and Genomic Research

* Building on ten years of genomics work, DeepSomatic is an AI tool designed to identify somatic mutations (genetic variants in tumors) to assist cancer research.
* The tool follows previous foundational models such as DeepVariant and DeepConsensus, which helped map human and non-human genomes.
* These advancements aim to move the medical field closer to precision medicine by providing health practitioners with higher-resolution data on genetic variation.

### The Magic Cycle of Research and Development

* Google highlights "Quantum Echoes" as a key breakthrough in quantum computing, contributing to the broader goal of solving fundamental scientific problems through large-scale computation.
* The acceleration of discovery is largely attributed to "agentic tools" that assist scientists in navigating massive datasets and uncovering new research opportunities.
* The company emphasizes a collaborative approach, making foundation models available to trusted testers and partners such as the WHO and various international research institutes.

To maximize the impact of these breakthroughs, organizations should look toward integrating multimodal AI agents that can bridge the gap between specialized scientific data and practical decision-making. By using open platforms and foundation models, the broader scientific community can translate high-level research into scalable solutions for climate resilience, healthcare, and global policy.

google

Using AI to identify genetic variants in tumors with DeepSomatic

DeepSomatic is an AI-powered tool developed by Google Research to identify cancer-related mutations by analyzing a tumor's genetic sequence with higher accuracy than current methods. By leveraging convolutional neural networks (CNNs), the model distinguishes between inherited genetic traits and acquired somatic variants that drive cancer progression. This flexible tool supports multiple sequencing platforms and sample types, offering a critical resource for clinicians and researchers aiming to personalize cancer treatment through precision medicine.

## Challenges in Somatic Variant Detection

* Somatic variants are genetic mutations acquired after birth through environmental exposure or DNA replication errors, making them distinct from the germline variants found in every cell of a person's body.
* Detecting these mutations is technically difficult because tumor samples are often heterogeneous, containing a diverse set of variants at varying frequencies.
* Sequencing technologies often introduce small errors that can be difficult to distinguish from actual somatic mutations, especially when a mutation is present in only a small fraction of the sampled cells.

## CNN-Based Variant Calling Architecture

* DeepSomatic employs a method pioneered by DeepVariant, in which raw genetic sequencing data is transformed into a set of multi-channel images.
* These images represent various data points, including alignment along the chromosome, the quality of the sequence output, and other technical variables.
* The convolutional neural network processes these images to differentiate between three categories: the human reference genome, non-cancerous germline variants, and the somatic mutations driving tumor growth.
* By analyzing tumor and non-cancerous cells side by side, the model effectively filters out sequencing artifacts that might otherwise be misidentified as mutations.
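The multi-channel image encoding described above can be illustrated with a minimal sketch. The channel layout below (base identity, base quality, strand) is hypothetical and much simpler than what DeepVariant-style callers actually use; it only shows the general idea of turning aligned reads into a tensor a CNN can consume.

```python
# Toy sketch of a multi-channel "pileup image". The channels here (base
# identity, base quality, strand) are illustrative assumptions, not the
# real tools' channel layout.

BASES = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.00}

def encode_pileup(reads, window):
    """Encode aligned reads into a channels-first tensor of shape
    (3, n_reads, window): base identity, base quality, strand."""
    tensor = [[[0.0] * window for _ in reads] for _ in range(3)]
    for row, read in enumerate(reads):
        for col in range(min(window, len(read["seq"]))):
            tensor[0][row][col] = BASES[read["seq"][col]]            # base channel
            tensor[1][row][col] = read["quals"][col] / 60.0          # quality channel
            tensor[2][row][col] = 1.0 if read["strand"] == "+" else 0.5  # strand channel
    return tensor

reads = [
    {"seq": "ACGT", "quals": [30, 40, 20, 50], "strand": "+"},
    {"seq": "ACTT", "quals": [35, 38, 12, 45], "strand": "-"},  # low-quality mismatch
]
img = encode_pileup(reads, window=4)
assert len(img) == 3 and len(img[0]) == 2 and len(img[0][0]) == 4
```

Stacking reads as rows lets the CNN see a candidate variant in context: a mismatch supported only by low-quality bases on one strand looks very different, channel by channel, from a true variant supported across reads.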
## System Versatility and Application

* The model is designed to function in multiple modes, including "tumor-normal" (comparing a biopsy against a healthy sample) and "tumor-only" mode, which is vital for blood cancers such as leukemia, where isolating healthy cells is difficult.
* DeepSomatic is platform-agnostic: it can process data from all major sequencing technologies and adapt to different types of sample processing.
* The tool has demonstrated the ability to generalize to cancer types not specifically included in its initial training sets.

## Open-Source Contributions to Precision Medicine

* Google has made the DeepSomatic tool and the CASTLE dataset—a high-quality training and evaluation set—openly available to the global research community.
* This initiative is part of a broader effort to use AI for early detection and advanced research in various cancers, including breast, lung, and gynecological cancers.
* The release aims to accelerate the development of personalized treatment plans by providing a more reliable way to identify the specific genetic drivers of an individual's disease.

By providing a more accurate and adaptable method for variant calling, DeepSomatic helps researchers pinpoint the specific drivers of a patient's cancer. The tool represents a significant advance in deep learning for genomics, potentially shortening the path from biopsy to targeted therapeutic intervention.
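The tumor-normal idea can be reduced to a toy set-based sketch: variants seen in the tumor but absent from the matched normal are somatic candidates, while shared variants are treated as germline. This is a deliberate simplification; real callers score candidates probabilistically against allele frequencies and error models rather than doing exact subtraction, and the coordinates below are made up.

```python
# Toy illustration of "tumor-normal" mode. Variants are (chrom, pos, ref, alt)
# tuples; the example calls are hypothetical, not real loci.

def split_variants(tumor_calls, normal_calls):
    tumor, normal = set(tumor_calls), set(normal_calls)
    somatic = tumor - normal        # seen only in the tumor: acquired
    germline = tumor & normal       # shared with normal tissue: inherited
    return somatic, germline

tumor_calls = [("chr1", 100, "C", "T"), ("chr2", 200, "G", "A")]
normal_calls = [("chr2", 200, "G", "A")]

somatic, germline = split_variants(tumor_calls, normal_calls)
assert somatic == {("chr1", 100, "C", "T")}
assert germline == {("chr2", 200, "G", "A")}
```

In "tumor-only" mode there is no `normal_calls` set to subtract, which is exactly why that mode is harder and why a learned model that recognizes germline patterns and artifacts directly is valuable.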

google

Highly accurate genome polishing with DeepPolisher: Enhancing the foundation of genomic research

DeepPolisher is a deep learning-based genome assembly tool designed to correct base-level errors with high precision, significantly enhancing the accuracy of genomic research. By leveraging a Transformer architecture to analyze sequencing data, the tool reduces total assembly errors by 50% and insertion or deletion (indel) errors by 70%. This advancement is critical for creating near-perfect reference genomes, such as the Human Pangenome Reference, which are essential for identifying disease-causing variants and understanding human evolution.

## Limitations of Current Sequencing Technologies

* Genome assembly relies on reading nucleotides (A, T, G, and C), but the microscopic scale of these base pairs makes accurate, large-scale sequencing difficult.
* Short-read sequencing methods provide high signal strength but are limited to a few hundred nucleotides because identical DNA clusters eventually desynchronize, blending their signals together.
* Long-read technologies can sequence tens of thousands of nucleotides but initially suffered from high error rates (~10%); while tools like DeepConsensus have reduced this to 0.1%, further refinement is necessary for high-fidelity reference genomes.
* Even a 0.1% error rate results in millions of inaccuracies across the 3-billion-nucleotide human genome, which can cause researchers to miss critical genetic markers or misidentify proteins.

## DeepPolisher Architecture and Training

* DeepPolisher is an open-source pipeline adapted from the DeepConsensus model, utilizing a Transformer-based neural network.
* The model was trained on a human cell line from the Personal Genomes Project that is estimated to be 99.99999% accurate, providing a "ground truth" for identifying and correcting errors.
* The system takes sequenced bases, their associated quality scores, and the orientation of the DNA strands as input, learning complex error patterns that traditional methods might miss.
* By combining sequence reads from multiple DNA molecules of the same individual, the tool iteratively "polishes" the assembly to reach the accuracy required for reference-grade data.

## Impact on Genomic Accuracy and Gene Discovery

* The tool's ability to reduce indel errors by 70% is particularly significant, as these specific errors often interfere with the identification of protein-coding genes.
* DeepPolisher has already been integrated into major research efforts, including the enhancement of the Human Pangenome Reference, providing a more robust foundation for clinical diagnostics.
* Improved assembly accuracy allows for better mapping of highly repetitive regions of the genome, which were previously difficult to sequence and assemble confidently.

For researchers and bioinformaticians, DeepPolisher represents a vital step in moving from "draft" genomes to high-fidelity references. Adopting this tool in assembly pipelines can drastically improve the reliability of variant calling and gene annotation, especially in complex clinical and evolutionary studies.
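The headline error figures above can be sanity-checked with back-of-the-envelope arithmetic. The genome size and rates come from the text; the baseline error counts in the second half are purely illustrative placeholders, not benchmark data.

```python
# Why a "mere" 0.1% error rate still matters at genome scale.
genome_size = 3_000_000_000            # ~3 billion nucleotides (from the text)

errors_at_0_1_pct = genome_size * 0.001
assert errors_at_0_1_pct == 3_000_000  # 0.1% error still leaves ~3 million mistakes

# Applying DeepPolisher's reported reductions (50% total, 70% indel) to a
# hypothetical assembly's error counts. Integer math keeps the example exact.
baseline_total, baseline_indel = 1000, 400
polished_total = baseline_total // 2        # 50% fewer total errors
polished_indel = baseline_indel * 3 // 10   # 70% fewer indel errors
assert (polished_total, polished_indel) == (500, 120)
```

The asymmetry is the point: indels shift the reading frame of protein-coding genes, so cutting them by 70% disproportionately improves gene identification even though they are a subset of total errors.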

google

Unlocking rich genetic insights through multimodal AI with M-REGLE

Google Research has introduced M-REGLE, a multimodal AI framework designed to analyze diverse health data streams simultaneously to uncover the genetic underpinnings of complex diseases. By jointly modeling complementary signals—such as electrocardiograms (ECG) and photoplethysmograms (PPG)—the method captures shared biological information and reduces noise more effectively than unimodal approaches. This integrated analysis significantly enhances the discovery of genetic associations and improves the prediction of cardiovascular conditions like atrial fibrillation.

## Technical Architecture and Workflow

M-REGLE utilizes a multi-step process to transform raw physiological waveforms into actionable genetic insights:

* **Multimodal Integration:** Instead of processing data types in isolation, the model combines multiple inputs, such as the 12 leads of an ECG or paired ECG and PPG data, to capture overlapping signals.
* **Latent Representation Learning:** The system employs a convolutional variational autoencoder (CVAE) to compress these high-dimensional waveforms into a low-dimensional "signature" of latent factors.
* **Statistical Refinement:** Principal component analysis (PCA) is applied to the CVAE-generated signatures to ensure the learned factors are independent and uncorrelated.
* **Genetic Mapping:** These independent factors are analyzed via genome-wide association studies (GWAS) to identify significant correlations between physiological signatures and specific genetic variations.

## Improved Data Reconstruction and Genetic Sensitivity

The transition from unimodal (U-REGLE) to multimodal modeling has led to substantial gains in both data accuracy and biological discovery:

* **Error Reduction:** M-REGLE achieved a 72.5% reduction in reconstruction error for 12-lead ECGs compared to analyzing each lead separately, indicating much higher fidelity in capturing essential waveform characteristics.
* **Increased Discovery Power:** In a study involving over 40,000 participants from the UK Biobank, the multimodal approach identified 3,251 significant genetic loci associated with 12-lead ECGs, a notable increase over the 2,215 loci found by unimodal methods.
* **Novel Findings:** The model identified specific genetic links, such as the *RBM20* locus, which were previously missed by standard clinical measurements but are known to be critical for heart muscle function.

## Interpretability and Disease Prediction

Beyond identifying associations, M-REGLE offers generative capabilities that help clinicians understand the relationship between latent factors and physical health:

* **Waveform Synthesis:** By altering specific coordinates within the learned embeddings, researchers can observe how individual latent factors correspond to physical changes in a patient's ECG T-wave or PPG peaks.
* **Clinical Utility:** The model identified specific embedding positions (4, 6, and 10) that distinguish patients with atrial fibrillation (AFib) from those without.
* **Predictive Performance:** M-REGLE's embeddings outperformed traditional clinical polygenic risk scores (PRS) in predicting AFib, demonstrating the value of incorporating raw waveform data into risk assessments.

## Practical Applications

Researchers and clinicians can leverage M-REGLE to extract richer insights from existing biobank data and wearable device outputs. By integrating multiple modalities into a single analytical pipeline, the framework provides a more comprehensive view of organ-system health, facilitating the identification of therapeutic targets and more accurate disease screening protocols.
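The "statistical refinement" step in the workflow above can be sketched in miniature: rotate correlated latent factors onto their principal axes so their covariance vanishes, which is what makes them suitable as independent GWAS phenotypes. The mock 2-D latent factors below stand in for CVAE outputs; this closed-form 2-D rotation is an illustrative assumption, not M-REGLE's actual PCA implementation.

```python
import math
import random

# Mock correlated 2-D latent factors, standing in for CVAE embeddings.
random.seed(0)
z = [(x, 0.8 * x + 0.2 * random.gauss(0, 1))
     for x in (random.gauss(0, 1) for _ in range(500))]

def covariance(pts):
    """Return (var_x, var_y, cov_xy) for a list of 2-D points."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / n
    syy = sum((p[1] - my) ** 2 for p in pts) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    return sxx, syy, sxy

sxx, syy, sxy = covariance(z)
# Closed-form 2-D PCA: rotate onto the principal axes.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
c, s = math.cos(theta), math.sin(theta)
rotated = [(c * x + s * y, -s * x + c * y) for x, y in z]

_, _, sxy_rot = covariance(rotated)
assert abs(sxy_rot) < 1e-9   # off-diagonal covariance vanishes after rotation
```

Decorrelating the factors before GWAS matters because each coordinate is tested as its own phenotype: if two coordinates carried overlapping signal, the same genetic association would be counted against both, muddying both discovery and interpretation.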