dna-sequencing

1 posts

google

Highly accurate genome polishing with DeepPolisher: Enhancing the foundation of genomic research (opens in new tab)

DeepPolisher is a deep learning-based genome assembly tool designed to correct base-level errors with high precision, significantly enhancing the accuracy of genomic research. By leveraging a Transformer architecture to analyze sequencing data, the tool reduces total assembly errors by 50% and insertion or deletion (indel) errors by 70%. This advancement is critical for creating near-perfect reference genomes, such as the Human Pangenome Reference, which are essential for identifying disease-causing variants and understanding human evolution. ## Limitations of Current Sequencing Technologies * Genome assembly relies on reading nucleotides (A, T, G, and C), but the microscopic scale of these base pairs makes accurate, large-scale sequencing difficult. * Short-read sequencing methods provide high signal strength but are limited to a few hundred nucleotides because identical DNA clusters eventually desynchronize, blending signals together. * Long-read technologies can sequence tens of thousands of nucleotides but initially suffered from high error rates (~10%); while tools like DeepConsensus have reduced this to 0.1%, further refinement is necessary for high-fidelity reference genomes. * Even a 0.1% error rate results in millions of inaccuracies across the 3-billion-nucleotide human genome, which can cause researchers to miss critical genetic markers or misidentify proteins. ## DeepPolisher Architecture and Training * DeepPolisher is an open-source pipeline adapted from the DeepConsensus model, utilizing a Transformer-based neural network. * The model was trained using a human cell line from the Personal Genomes Project that is estimated to be 99.99999% accurate, providing a "ground truth" for identifying and correcting errors. * The system takes sequenced bases, their associated quality scores, and the orientation of the DNA strands to learn complex error patterns that traditional methods might miss. * By combining sequence reads from multiple DNA molecules of the same individual, the tool iteratively "polishes" the assembly to reach the accuracy required for reference-grade data. ## Impact on Genomic Accuracy and Gene Discovery * The tool’s ability to reduce indel errors by 70% is particularly significant, as these specific errors often interfere with the identification of protein-coding genes. * DeepPolisher has already been integrated into major research efforts, including the enhancement of the Human Pangenome Reference, providing a more robust foundation for clinical diagnostics. * Improved assembly accuracy allows for better mapping of regions where the genome is highly repetitive, which were previously difficult to sequence and assemble confidently. For researchers and bioinformaticians, DeepPolisher represents a vital step in moving from "draft" genomes to high-fidelity references. Adopting this tool in assembly pipelines can drastically improve the reliability of variant calling and gene annotation, especially in complex clinical and evolutionary studies.