Oct 2024

Navigating the Limitations of Whole Exome Sequencing in Complex Genomic Regions

Author: Rocío Acuña Hidalgo, MD, PhD

Advances in genomic sequencing have transformed precision medicine, enabling the identification of the genetic basis of diseases. However, accurately interpreting genomic variants requires distinguishing true variants from sequencing artifacts.

Whole Genome Sequencing (WGS), which covers both coding and non-coding regions of the genome, yields more variants per sample than Whole Exome Sequencing (WES), which targets only exons. WGS also detects a broader range of genetic variation, including structural variants and variants in regulatory elements, and its more uniform coverage gives higher data quality. Studies have shown that, compared to WES, WGS has a lower rate of false positives for both small variants and copy number variants (CNVs). However, the higher cost, more complex analysis and greater storage needs of WGS must be weighed against these advantages when choosing between the two methods.¹⁻³

This article explores a case where variant calls differed between Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) data from the same sample. By comparing the data generated with both approaches, we highlight potential pitfalls in variant interpretation and offer insights into best practices for geneticists in clinical and research laboratories working with sequencing technologies. We also note how new bioinformatics analysis and interpretation technologies, such as Nostos Genomics’ AION, reduce the cost gap between WES and WGS.

Let’s do a deep dive into one interesting example in a homopolymer region!

Case study: a candidate pathogenic variant in MSH2

In a recent analysis, a candidate pathogenic splice variant in the MSH2 gene was identified in the Whole Exome Sequencing (WES) data of an individual with familial cancer. MSH2 is involved in mismatch repair of DNA replication errors, and mutations in this gene are linked to hereditary cancer syndromes such as hereditary nonpolyposis colorectal cancer (Lynch syndrome). A large proportion of known disease-causing variants in MSH2 disrupt splicing, making this variant an interesting candidate for further study. However, when the same patient sample was also sequenced using Whole Genome Sequencing (WGS), this variant was absent from the VCF file.

Whole Exome Sequencing (WES) analysis on AION (CE-IVD)

A 5 bp intronic deletion was found near the donor splice site of exon 5 of MSH2.

VCF file: chr2-47641558-GTAAAA-G (GRCh37)

HGVSc: NM_000251:c.942+2_942+6del

Whole Genome Sequencing (WGS) analysis on AION (CE-IVD)

The variant at chr2-47641558 was not found. However, a variant at an adjacent position was identified, showing a 2 bp deletion in the same intronic region.

VCF file: chr2-47641559-TAA-T (GRCh37)

HGVSc: NM_000251:c.942+28_942+29del
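
To make the two calls above easier to compare, the following minimal Python sketch decodes each `chrom-pos-ref-alt` string into the bases that are actually deleted and the genomic positions they occupy. The function name `describe_deletion` and the returned dictionary layout are illustrative choices, not part of AION or of any VCF library; the sketch only handles simple deletions in which ALT is the shared anchor prefix of REF.

```python
def describe_deletion(variant_id: str) -> dict:
    """Decode a 'chrom-pos-ref-alt' string into the deleted bases and their span."""
    chrom, pos, ref, alt = variant_id.split("-")
    pos = int(pos)
    # In VCF, a deletion keeps a shared anchor at the start of REF and ALT.
    if not (len(ref) > len(alt) and ref.startswith(alt)):
        raise ValueError("not a simple anchored deletion")
    deleted = ref[len(alt):]
    first = pos + len(alt)            # first deleted position (1-based)
    last = first + len(deleted) - 1   # last deleted position (1-based)
    return {"chrom": chrom, "deleted": deleted, "first": first, "last": last}


wes = describe_deletion("chr2-47641558-GTAAAA-G")
wgs = describe_deletion("chr2-47641559-TAA-T")
print(wes)  # deleted 'TAAAA', positions 47641559-47641563
print(wgs)  # deleted 'AA', positions 47641560-47641561
```

Seen this way, the WES call removes the thymine at 47641559 plus four adenines, while the WGS call removes only two adenines immediately after that thymine.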

Understanding the discrepancy

Upon inspecting the sequencing reads in IGV, both variants were found within a homopolymer region, a 27 bp stretch of adenines (A), in an intron of the MSH2 gene. Homopolymers are notoriously difficult to sequence accurately: polymerase slippage can change the number of repeated nucleotides in the reads, making the original repeat length hard to determine and leading to errors in variant calling. As a result, homopolymers are known to be enriched for false positive indels that can be misinterpreted as loss-of-function mutations.
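
As a rough illustration of how such regions can be flagged before interpretation, the sketch below scans a sequence for single-base runs above a minimum length. The function name `homopolymer_runs`, the `min_len` default of 8 and the flanking bases in the toy sequence are assumptions made for this example; only the 27-adenine run reflects the case described here.

```python
from itertools import groupby


def homopolymer_runs(seq: str, min_len: int = 8):
    """Return (start, length, base) for each single-base run of at least min_len.

    start is a 0-based index into seq.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, length, base))
        pos += length
    return runs


# Toy sequence: hypothetical flanks around a 27 bp poly-A stretch.
toy = "CAGGT" + "A" * 27 + "GGCT"
print(homopolymer_runs(toy))  # [(5, 27, 'A')]
```

Any indel call that overlaps a reported run is a natural candidate for closer inspection of the underlying reads.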

Whole Exome Sequencing data: screenshot from WES data in IGV

All reads covering the full length of the homopolymer in the WES data were reverse reads, likely due to the exome library preparation process, coupled with the genomic region's location at the 3' end of the captured fragment. These reverse reads, however, degraded in quality as they approached the splice site, making it difficult for the algorithm to accurately call individual nucleotides in this region.

Whole Genome Sequencing data: screenshot from WGS data in IGV

In contrast, WGS showed balanced coverage, with both forward and reverse reads starting at different points across the region. This gave higher quality and read depth across the entire homopolymer and, in turn, more accurate read mapping and variant calling.
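
The contrast between reverse-only reads in WES and balanced reads in WGS can be expressed as a simple strand-balance metric. The sketch below is one possible way to do this; the function names, the `min_fraction` cutoff of 0.1 and the read counts in the usage lines are all invented for illustration and are not taken from this case or from any particular variant caller.

```python
def strand_balance(forward: int, reverse: int) -> float:
    """Fraction of reads on the minority strand (0.0 = one strand only, 0.5 = balanced)."""
    total = forward + reverse
    if total == 0:
        raise ValueError("no reads cover this site")
    return min(forward, reverse) / total


def is_strand_biased(forward: int, reverse: int, min_fraction: float = 0.1) -> bool:
    """Flag a site when the minority strand supplies less than min_fraction of reads."""
    return strand_balance(forward, reverse) < min_fraction


# Made-up counts: a reverse-only site versus a roughly balanced one.
print(is_strand_biased(forward=0, reverse=40))   # True
print(is_strand_biased(forward=18, reverse=22))  # False
```

A site supported by a single strand is not necessarily wrong, but it is a reasonable trigger for the manual read review described in this article.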

Both datasets were processed using the same secondary analysis pipeline, yet the variant calls differed significantly. This discrepancy underscores the need for careful review of sequencing data during variant interpretation, particularly in difficult-to-sequence regions that can lead to inaccurate variant calls impacting clinical decisions.

Impact of Sequencing Technology on Variant Calls

In gnomAD v4, allele frequencies for variants in this region vary between the cohort of individuals sequenced with exome sequencing versus those sequenced with genome sequencing. Comparing coverage plots from gnomAD v4 (lifted to GRCh38 coordinates: chr2:47,414,369-47,414,468) for both WES and WGS reveals that this genomic region is challenging to sequence, often resulting in sequencing artifacts that may be incorrectly called and potentially misinterpreted as disease-causing variants.

Figure legend: Screenshot from gnomAD v4 (URL: https://gnomad.broadinstitute.org/region/2-47414400-47414437?dataset=gnomad_r4) showing the dip in per-base median coverage in exomes (blue) versus genomes (green) at the exon/intron boundary. Several variants in this region are classified in ClinVar, with 53 of them labeled as benign (blue), 33 as variants of uncertain significance (yellow) and 32 as pathogenic (red). Notably, there are SNVs in this region leading to loss-of-function mutations due to splice site alterations, such as NM_000251.3(MSH2):c.942+2T>A (ClinVar: https://www.ncbi.nlm.nih.gov/clinvar/variation/428474/).

Annotation Discrepancies and HGVS Guidelines

Variant interpretation becomes more complex when discrepancies in variant calling lead to discrepancies in variant annotation. For instance, in the WES data, the variant in MSH2 was annotated as c.942+2_942+6del (closer to the exon/intron boundary), while in the WGS data, it was annotated as c.942+28_942+29del (farther from the exon/intron boundary).

This difference stems from the Human Genome Variation Society (HGVS) guidelines, which recommend describing a variant at its most 3’ position with respect to the reference sequence. In the WES data, due to sequencing artifacts, the called deletion includes the thymine (T) immediately before the homopolymer, which anchors the deletion at the 5’ end. In the WGS data, however, the sequencing reads cover the entirety of the homopolymer and the called deletion contains only adenines, allowing it to be shifted and annotated further downstream. While confusing, both annotations are correct: in the VCF file, each variant is left-aligned and parsimonious at the genomic DNA level, while HGVS nomenclature follows the 3' rule for standardized annotation.
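
The effect of the 3' rule on the two calls can be reproduced with a short sketch. It models the start of the intron as the G at c.942+1, the T at c.942+2 and the 27 adenines at c.942+3 to c.942+29, as implied by the coordinates in this article; the base after the homopolymer is not given here, so it is written as the placeholder `N`. The function names are hypothetical, and this is a simplified illustration of 3' shifting, not the implementation used by AION or any HGVS tool.

```python
def shift_deletion_3prime(seq: str, start: int, end: int):
    """Shift the deleted interval seq[start:end] (0-based, end-exclusive) as far
    3' as possible without changing the sequence that remains after deletion."""
    while end < len(seq) and seq[start] == seq[end]:
        start += 1
        end += 1
    return start, end


def intron_hgvs(start: int, end: int, exon_end: int = 942) -> str:
    """Format the interval as an intronic deletion; index i maps to c.942+(i+1)."""
    return f"c.{exon_end}+{start + 1}_{exon_end}+{end}del"


# Index 0 is c.942+1. 'N' stands in for the unknown base after the poly-A run.
intron = "G" + "T" + "A" * 27 + "N"

# WES call: TAAAA deleted at c.942+2..+6 -> indices 1..5.
wes = shift_deletion_3prime(intron, 1, 6)
# WGS call, left-aligned as in the VCF: AA deleted at c.942+3..+4 -> indices 2..3.
wgs = shift_deletion_3prime(intron, 2, 4)

print(intron_hgvs(*wes))  # c.942+2_942+6del
print(intron_hgvs(*wgs))  # c.942+28_942+29del
```

Because the WES deletion starts with a T, it cannot slide along the run of adenines, whereas the two-adenine WGS deletion slides all the way to the end of the homopolymer.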

Clinical Interpretation and Reporting

For clinicians and researchers, the paramount question is whether to report this variant and how to interpret its clinical significance. Several considerations inform this decision:

Variant Frequency: The variant identified in the WGS data (a deletion within the homopolymer) is observed in the general population with an allele frequency of approximately 10%, as evidenced by databases like gnomAD.
Classification as Benign: According to the American College of Medical Genetics and Genomics (ACMG) guidelines, such a variant meets the stand-alone criterion for classification as benign (BA1), given its high frequency in the population and lack of association with disease.
Potential Artifacts in WES Data: The variant call in the WES data appears to be an artifact resulting from technical limitations in sequencing homopolymers and biases introduced during library preparation.
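
The frequency argument above amounts to a single comparison. The sketch below assumes the BA1 cutoff of an allele frequency above 5% from the 2015 ACMG/AMP guidelines; the constant and function names are illustrative, and laboratories may apply gene- or disease-specific thresholds instead.

```python
# Assumed cutoff: ACMG/AMP 2015 BA1 applies when allele frequency exceeds 5%.
BA1_THRESHOLD = 0.05


def meets_ba1(allele_frequency: float, threshold: float = BA1_THRESHOLD) -> bool:
    """Return True when the population allele frequency exceeds the BA1 cutoff."""
    return allele_frequency > threshold


# The WGS variant is reported here at approximately 10% allele frequency.
print(meets_ba1(0.10))  # True
```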

Recommendations for Analyzing Genetic Variants

Given the complexities highlighted for these specific variants, clinical scientists and researchers should consider the following:

Careful Examination of Sequencing Read Data: Utilize tools like IGV to visually inspect read alignments and assess the quality of sequencing reads covering the variant.
Cross-Validation with Orthogonal Techniques: When encountering variants in challenging regions, validate findings using an orthogonal method.
Awareness of Technical Limitations: Recognize the inherent limitations of sequencing technologies, particularly in homopolymer regions, and interpret variant calls within this context.
Consistent Annotation Practices: Use variant annotation pipelines that adhere strictly to HGVS guidelines to ensure consistent variant annotation, facilitating clearer communication among clinicians and researchers.

Reassessing the Cost-Effectiveness of WES vs. WGS

One of the most striking aspects of this case is how a discrepancy in variant calls can prompt a reassessment of the cost-effectiveness of WES and WGS. Traditionally, WGS is seen as more expensive than WES, which has made WES the preferred choice in many clinical laboratories. However, in cases like this—where WGS provides more reliable data—labs may need to reconsider the overall value of WGS despite its higher cost.

The discrepancies between WES and WGS in this case are partly due to differences in coverage. WGS, with its uniform coverage across the genome, is better suited to detect variants in complex regions, such as homopolymers and exon boundaries, where WES coverage may falter. It is therefore worth considering whether the sensitivity and clinical utility of WGS justify its cost in certain scenarios, particularly for patients with complex or unclear genetic conditions.

This is an important clinical consideration, especially when state-of-the-art specialized AI platforms, such as Nostos Genomics’ AION (CE-IVD), are used for interpretation and data storage. With these tools, the interpretation effort for WGS and WES is similar despite the larger number of candidate variants in WGS, because variants are prioritized using molecular predictors and the clinical context. Data storage costs remain higher for WGS because of its larger data sizes, but infrastructure designed for this purpose within these tools keeps the additional cost to a minimum.

Thanks to improvements in sequencing technologies and the development of powerful interpretation and data management tools, WGS is becoming more accessible, particularly for diagnosing rare diseases and advancing precision medicine. The ability to detect more disease-causing variants with fewer artifacts significantly improves diagnostic yield.⁴ As WGS costs continue to decline, it’s important to weigh the remaining cost difference against its substantial impact on patient care. Is the difference in cost still a barrier when WGS offers greater diagnostic power and potential for more tailored and impactful medical interventions?

Conclusion

Ultimately, this case underscores the challenges of interpreting genomic variants in homopolymer regions. Discrepancies between WES and WGS data can arise from technical limitations, coverage differences, and biases introduced during library preparation. Notably, such discrepancies also arise for larger, more complex variants such as CNVs and structural variants (SVs).

For genomic researchers, understanding these factors is crucial for accurate variant interpretation and clinical decision-making. By adopting meticulous validation practices and remaining cognizant of sequencing constraints, researchers can enhance the reliability of genomic analyses and contribute to more effective patient care.

The cost-effectiveness of these approaches may have to be revisited in light of observations such as the ones described here, all the more so given the effectiveness of AI prioritization tools such as AION (CE-IVD).

References

1. Belkadi, A., et al. (2015). Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proceedings of the National Academy of Sciences, 112(17), 5473–5478. https://doi.org/10.1073/pnas.1418631112
2. Lelieveld, S. H., et al. (2015). Comparison of exome and genome sequencing technologies for the complete capture of protein-coding regions. Human Mutation, 36(8), 815–822. https://doi.org/10.1002/humu.22813
3. Meienberg, J., Bruggmann, R., Oexle, K., & Matyas, G. (2016). Clinical sequencing: is WGS the better WES? Human Genetics, 135(3), 359–362. https://doi.org/10.1007/s00439-015-1631-9
4. Van Der Sanden, B. P. G. H., et al. (2022). The performance of genome sequencing as a first-tier test for neurodevelopmental disorders. European Journal of Human Genetics, 31(1), 81–88. https://doi.org/10.1038/s41431-022-01185-9
5. Karczewski, K. J., et al. (2023). Advanced variant classification framework reduces the false positive rate of predicted loss-of-function variants in population sequencing data. Nature Communications, 14, 1234.
6. Human Genome Variation Society (HGVS). (2020). Nomenclature Guidelines.

Note: This article is based on a detailed analysis of variant interpretation challenges in genomic data and aims to provide insights into best practices for researchers in the field.
