Sumedha B S, Bangalore University
Long-read sequencing (LRS), also called third-generation sequencing, offers a number of advantages over short-read sequencing such as that performed on Illumina’s NovaSeq, NextSeq, HiSeq and MiSeq instruments. Long-read technologies permit more complete genome assemblies, which could revolutionize genomics. They have the potential to reveal the full spectrum of human genetic variation, which would help to resolve some of the missing heritability and lead to the discovery of new disease mechanisms and pathogenesis.
Long-read sequencing technologies have now reached a level of accuracy that enables variant detection in tens to thousands of samples. Advances in sequencing and bioinformatics have made it possible to achieve population-scale long-read sequencing. Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the two major competitors driving innovation in this field. Statistical approaches in genetics help to identify variants that correlate with a certain phenotype, such as a disease. Population-scale sequencing is defined here as the sequencing of more than 5 genomes, although where genomic diversity is limited a lower number of genomes may be sufficient.
Previous population studies, including genome-wide association studies (GWAS), have not been able to exhaustively characterize the genetic factors underlying human traits and diseases. However, a significant proportion of hidden variants could be discovered with long-read sequencing. Recent LRS projects involving Icelandic and Chinese populations have identified hidden variants related to anaemia, height and cholesterol levels. LRS also improves the range, continuity and accuracy of variant phasing, as well as the assessment of small variants, and this has been used to find disease-associated alleles.
The largest human-focused population-scale long-read sequencing study examined the genomic diversity of 3,622 Icelandic genomes. As part of the Human Pangenome project, LRS of a globally diverse cohort is being carried out. Aside from human studies, long-read sequencing has been applied at population scale to discover structural variation associated with phenotypes in fruit flies, songbirds and crops.
Structural variants (SVs) are genomic alterations of 50 bp or larger and include deletions, duplications, insertions, inversions and translocations.
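The 50 bp size threshold can be illustrated with a minimal Python sketch. The helper `classify_variant` is hypothetical and only handles sequence-resolved insertions and deletions given as reference and alternate alleles; inversions, duplications and translocations need breakpoint information that this toy example does not model.

```python
SV_MIN_LENGTH = 50  # bp threshold for structural variants, as stated in the text


def classify_variant(ref: str, alt: str) -> str:
    """Classify a ref/alt allele pair by size (hypothetical helper, illustration only)."""
    size_change = abs(len(alt) - len(ref))
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if size_change == 0:
        # Same length, more than one base: a multi-base substitution
        return "substitution"
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    if size_change >= SV_MIN_LENGTH:
        return f"SV {kind}"
    return f"small {kind}"


if __name__ == "__main__":
    print(classify_variant("A", "G"))              # single-base change
    print(classify_variant("A", "A" + "T" * 60))   # 60 bp gained
    print(classify_variant("A" + "C" * 49, "A"))   # 49 bp lost
```

The threshold is kept in a single constant because the cut-off between small indels and SVs is a convention, and a project may wish to change it in one place.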
The project strategy needs to be considered first: the total number of sequenced individuals or chromosomes should be as high as possible.
Three approaches are available:
A full coverage approach: This aims to sequence every sample in the population at medium to high coverage, so all samples receive similar coverage and are equally well studied. It is the most expensive of the three approaches but gives the highest level of resolution. Its advantages are the simplicity of the study design, comprehensiveness, easy detection of rare variants and a relatively straightforward workflow.
A mixed coverage approach: Here, a subset of samples representative of the subgroups in the population is sequenced at high coverage and the rest at low coverage. This approach is less expensive than the full coverage approach, and it achieves high detection sensitivity. It is suitable for studies with a limited budget or a large number of individuals. However, it is biased towards common alleles, as many rare alleles may be missed.
A mixed sequencing approach: This involves LRS of just a few samples (10-20%) and short-read sequencing of the remainder. The long-read samples are selected in much the same way as the high-coverage samples in the mixed coverage strategy.
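The trade-off between the three strategies can be made concrete with a rough cost model. The sketch below is only an illustration: the per-coverage prices, cohort size and coverage depths are made-up numbers in arbitrary units, and all function names are hypothetical. The last function estimates why subset-based designs favour common alleles, assuming diploid samples chosen at random.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    long_read_per_x: float   # assumed cost of 1x long-read coverage for one sample
    short_read_per_x: float  # assumed cost of 1x short-read coverage for one sample


def full_coverage_cost(n: int, coverage: float, c: Costs) -> float:
    """Every sample is long-read sequenced at the same coverage."""
    return n * coverage * c.long_read_per_x


def mixed_coverage_cost(n: int, frac_high: float, high_cov: float,
                        low_cov: float, c: Costs) -> float:
    """A subset gets high long-read coverage, the rest low long-read coverage."""
    n_high = round(n * frac_high)
    return (n_high * high_cov + (n - n_high) * low_cov) * c.long_read_per_x


def mixed_sequencing_cost(n: int, frac_long: float, long_cov: float,
                          short_cov: float, c: Costs) -> float:
    """A subset gets long reads, the rest short reads."""
    n_long = round(n * frac_long)
    return (n_long * long_cov * c.long_read_per_x
            + (n - n_long) * short_cov * c.short_read_per_x)


def prob_allele_in_subset(allele_freq: float, n_samples: int) -> float:
    """P(at least one of 2n sampled chromosomes carries the allele)."""
    return 1.0 - (1.0 - allele_freq) ** (2 * n_samples)


if __name__ == "__main__":
    costs = Costs(long_read_per_x=30.0, short_read_per_x=5.0)  # arbitrary units
    print(full_coverage_cost(100, 30, costs))
    print(mixed_coverage_cost(100, 0.2, 30, 10, costs))
    print(mixed_sequencing_cost(100, 0.2, 30, 30, costs))
    print(prob_allele_in_subset(0.01, 20))
```

With these made-up inputs, an allele at 1% frequency has only about a one in three chance of being carried by any of the 20 deeply sequenced samples, which is the bias towards common alleles described above.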
Other considerations:
Sequencing logistics. This involves operating long-read sequencers efficiently, from logistics and sample preparation to loading optimization and run monitoring. ONT and PacBio have different advantages, and each platform has its own challenges at almost every step because of the different designs of flow cells and sequencing instruments. An adequate amount of highly pure, high-molecular-weight input DNA is required.
Analytical considerations. The main challenge in population-level studies is a scalable and streamlined analysis. There are two main strategies for downstream analysis: aligning reads from individual samples to a single reference genome, and comparing de novo assemblies. These methods differ significantly in their computational and coverage requirements, which also depend on the complexity and size of the genome.
Read alignment-based analysis. This is the most common method of choice for population-scale studies: more than half of population studies use it, because it enables the comparison of all samples against a single reference genome. These methods are also less computationally demanding than de novo assembly.
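Because every sample is aligned to the same reference, variant calls share one coordinate system and can be compared directly across the cohort. The following is a minimal sketch of that idea, assuming per-sample SV calls already exist. The `SVCall` record, the greedy clustering rule and the 500 bp window are illustrative assumptions, not the behaviour of any particular SV merging tool.

```python
from collections import defaultdict
from typing import List, NamedTuple


class SVCall(NamedTuple):
    sample: str
    chrom: str
    pos: int      # start position on the shared reference
    svtype: str   # e.g. "DEL" or "INS"
    length: int


def merge_calls(calls: List[SVCall], max_dist: int = 500) -> List[List[SVCall]]:
    """Greedily cluster calls of the same type on the same chromosome.

    A call joins the current cluster if its start lies within max_dist
    of the previous call in that cluster; otherwise a new cluster starts.
    """
    groups = defaultdict(list)
    for call in calls:
        groups[(call.chrom, call.svtype)].append(call)

    clusters: List[List[SVCall]] = []
    for key in sorted(groups):
        current: List[SVCall] = []
        for call in sorted(groups[key], key=lambda c: c.pos):
            if current and call.pos - current[-1].pos > max_dist:
                clusters.append(current)
                current = []
            current.append(call)
        if current:
            clusters.append(current)
    return clusters


def carrier_frequency(cluster: List[SVCall], n_samples: int) -> float:
    """Fraction of samples in the cohort that carry the merged variant."""
    return len({call.sample for call in cluster}) / n_samples


if __name__ == "__main__":
    calls = [
        SVCall("s1", "chr1", 1000, "DEL", 300),
        SVCall("s2", "chr1", 1020, "DEL", 310),
        SVCall("s3", "chr1", 5000, "DEL", 100),
        SVCall("s1", "chr1", 1010, "INS", 80),
    ]
    for cluster in merge_calls(calls):
        print(cluster[0].chrom, cluster[0].pos, cluster[0].svtype,
              carrier_frequency(cluster, n_samples=3))
```

Real pipelines also compare SV length and breakpoint confidence before merging; the position-only rule here is kept deliberately simple.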
Population-scale de novo assemblies. These approaches are very sensitive and are used to reconstruct diverse regions of the genome. However, assembly can collapse highly similar segmental duplications; to counter this, algorithms leverage single-nucleotide variants (SNVs) that differentiate multiple copies of a repeat. The main remaining challenge is the correct representation of ploidy.
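The idea of using SNVs to tell repeat copies apart can be sketched in a few lines. This is a toy illustration, not the algorithm of any specific assembler: it assumes the copy-specific SNVs are already known, and the function name, positions and bases are all hypothetical.

```python
from typing import Dict, Optional


def assign_read_to_copy(read_alleles: Dict[int, str],
                        copy_signatures: Dict[str, Dict[int, str]]) -> Optional[str]:
    """Assign a read to the repeat copy whose diagnostic SNVs it matches best.

    read_alleles maps position -> base observed in the read.
    copy_signatures maps copy name -> {position: base specific to that copy}.
    Returns None when no SNV matches or when the best score is tied.
    """
    scores = {
        name: sum(1 for pos, base in signature.items()
                  if read_alleles.get(pos) == base)
        for name, signature in copy_signatures.items()
    }
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


if __name__ == "__main__":
    signatures = {
        "copyA": {100: "A", 250: "G"},
        "copyB": {100: "C", 250: "T"},
    }
    print(assign_read_to_copy({100: "A", 250: "G"}, signatures))
    print(assign_read_to_copy({100: "A", 250: "T"}, signatures))
```

Long reads help here because a single read is more likely to span several diagnostic positions at once, which makes ties and unassigned reads less frequent.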
Graph genome methods. These allow the study of variants that go undetected by current state-of-the-art short-read SV discovery methods. Tools such as GraphTyper2, Paragraph and those from the vg package have been developed to work with graph genome structures.
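A graph genome stores alternative alleles as alternative paths through shared sequence nodes. The toy sketch below shows the data structure with one 60 bp insertion and checks which path a read supports by exact substring matching. It is an assumption-laden illustration (tiny acyclic graph, error-free reads, hypothetical function names) and does not reflect the API or algorithms of GraphTyper2, Paragraph or vg.

```python
from typing import Dict, List, Tuple

# Node 2 is an insertion present only on the alternate path 1 -> 2 -> 3.
NODES: Dict[int, str] = {1: "ACGT", 2: "T" * 60, 3: "GGCA"}
EDGES: Dict[int, List[int]] = {1: [2, 3], 2: [3], 3: []}


def spell_paths(nodes: Dict[int, str], edges: Dict[int, List[int]],
                start: int, end: int) -> Dict[Tuple[int, ...], str]:
    """Enumerate all start-to-end paths in an acyclic graph and spell their sequences."""
    haplotypes: Dict[Tuple[int, ...], str] = {}
    stack = [(start, (start,))]
    while stack:
        node, path = stack.pop()
        if node == end:
            haplotypes[path] = "".join(nodes[n] for n in path)
            continue
        for nxt in edges[node]:
            stack.append((nxt, path + (nxt,)))
    return haplotypes


def supporting_paths(read: str,
                     haplotypes: Dict[Tuple[int, ...], str]) -> List[Tuple[int, ...]]:
    """Return the paths whose spelled sequence contains the read exactly."""
    return sorted(path for path, seq in haplotypes.items() if read in seq)


if __name__ == "__main__":
    haps = spell_paths(NODES, EDGES, start=1, end=3)
    print(supporting_paths("CGTGG", haps))   # read spanning the reference junction
    print(supporting_paths("TTTTGG", haps))  # read spanning the insertion junction
```

Reads that span a junction are the informative ones: a read lying entirely inside node 1 or node 3 would match both paths and say nothing about the genotype.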
Variant validation and genotyping. In this approach, any variants showing polymorphic genotypes are excluded. It neglects the fact that some types of SV have higher mutation rates, which can produce recurrent mutations, but it provides a first step towards more reliable SV genotyping. This method has recently been used successfully in corvids (crows and jackdaws).
The development of these approaches will have a profound impact on variant representation and on our understanding of the complexity of the underlying biology. However, this will require a shift from a linear to a more complex form of the reference genome. PacBio and ONT are currently leading the development of LRS for multiple applications. Other companies (such as Base4, Quantapore and Omniome) are developing novel long-read approaches, whose performance needs to be assessed in the coming years. This area of genomics is developing very rapidly, and established tools quickly become outdated and are replaced by new ones.
References:
- De Coster, W., Weissensteiner, M.H. & Sedlazeck, F.J. Towards population-scale long-read sequencing. Nat Rev Genet (2021). https://doi.org/10.1038/s41576-021-00367-3