Author: Rocío Acuña Hidalgo, MD, PhD
Advances in genomic sequencing have transformed precision medicine, enabling the identification of the genetic basis of diseases. However, accurately interpreting genomic variants requires distinguishing true variants from sequencing artifacts.
Whole Genome Sequencing (WGS), which covers both coding and non-coding regions of the genome, yields more variant calls than Whole Exome Sequencing (WES), which targets only exons. WGS also detects a broader range of genetic variation, including structural variants and variants in regulatory elements, and delivers higher data quality thanks to more uniform coverage. Studies have shown that, compared to WES, WGS has a lower rate of false positives for both small variants and copy number variants (CNVs).[ref] However, while WGS offers more comprehensive insights and better quality, its higher cost, more complex analysis, and greater storage needs must be weighed when choosing between the two methods.1-3
This article explores a case where discrepancies in variant calls were found between Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) data. By comparing the data generated with both approaches, we highlight potential pitfalls in variant interpretation and offer insights into best practices for geneticists in clinical and research laboratories working with sequencing technologies. We also note how new bioinformatics analysis and interpretation technologies, such as Nostos Genomics’ AION, reduce the cost gap between WES and WGS.
Let’s do a deep dive into one interesting example in a homopolymer region!
In a recent analysis, a candidate pathogenic splice variant in the MSH2 gene was identified in Whole Exome Sequencing (WES) of an individual with familial cancer. MSH2 is involved in mismatch repair of DNA replication errors and mutations in this gene are linked to hereditary cancer syndromes, such as hereditary nonpolyposis colon cancer (Lynch syndrome). A large proportion of known disease-causing variants in MSH2 disrupt splicing, making this variant an interesting candidate for further study. However, the same patient sample was also sequenced using Whole Genome Sequencing (WGS), where this variant was absent from the VCF file.
Whole Exome Sequencing (WES) analysis on AION (CE-IVD)
A 5bp intronic deletion was found near the donor splice site of exon 5 of MSH2.
VCF file: chr2-47641558-GTAAAA-G (GRCh37)
HGVSc: NM_000251:c.942+2_942+6del
Whole Genome Sequencing (WGS) analysis on AION (CE-IVD)
The variant at chr2-47641558 was not found. However, a variant at an adjacent position was identified, showing a 2bp deletion in the same intronic region.
VCF file: chr2-47641559-TAA-T (GRCh37)
HGVSc: NM_000251:c.942+28_942+29del
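To make the two calls easier to compare, the VCF-style keys above can be unpacked into the bases that are actually deleted and the genomic interval they occupy. The following is a minimal sketch, not part of the AION pipeline; the `Deletion` type and `parse_deletion` helper are hypothetical names introduced only for illustration, and the sketch assumes the key format chrom-pos-ref-alt shown in this article.

```python
from typing import NamedTuple


class Deletion(NamedTuple):
    chrom: str
    anchor_pos: int     # 1-based VCF POS (the retained anchor base)
    deleted: str        # reference bases removed by the variant
    first_deleted: int  # 1-based position of the first deleted base
    last_deleted: int   # 1-based position of the last deleted base


def parse_deletion(key: str) -> Deletion:
    """Parse a 'chrom-pos-ref-alt' key describing a simple deletion."""
    chrom, pos_text, ref, alt = key.split("-")
    pos = int(pos_text)
    # A simple deletion keeps ALT as a prefix of REF (the anchor base).
    if not (len(ref) > len(alt) and ref.startswith(alt)):
        raise ValueError(f"not a simple deletion: {key}")
    deleted = ref[len(alt):]
    first = pos + len(alt)
    last = first + len(deleted) - 1
    return Deletion(chrom, pos, deleted, first, last)


wes = parse_deletion("chr2-47641558-GTAAAA-G")
wgs = parse_deletion("chr2-47641559-TAA-T")

for label, call in (("WES", wes), ("WGS", wgs)):
    print(label, call.deleted, call.first_deleted, call.last_deleted)
```

Under these assumptions, the WES call removes TAAAA (a thymine plus four adenines) while the WGS call removes only two adenines, and both deletions start within two bases of each other.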
Upon inspecting the sequencing reads in IGV, both variants were found within a homopolymer region, a 27 bp-long stretch of adenines (A), in an intronic region of the MSH2 gene. Homopolymers are notoriously difficult to sequence accurately: polymerase slippage makes it hard to determine the original number of repeated nucleotides, which leads to errors in variant calling. As a result, homopolymers are known to be enriched for false positive indels that can be erroneously interpreted as loss-of-function mutations.
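Flagging homopolymer runs before interpreting indels can be automated. The sketch below scans a sequence for single-base runs of a minimum length; the `find_homopolymers` helper, the `min_len` threshold of 8, and the toy sequence are all assumptions made for illustration (only the "GT" followed by 27 adenines mirrors the region described in this article; the flanking bases are made up).

```python
import re
from typing import List, Tuple


def find_homopolymers(seq: str, min_len: int = 8) -> List[Tuple[int, int, str]]:
    """Return (start, end, base) for each single-base run of at least min_len.

    Coordinates are 0-based and half-open, relative to the input sequence.
    """
    # One base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


# Toy sequence: hypothetical flanks around "GT" + 27 adenines.
toy_seq = "CAG" + "GT" + "A" * 27 + "GGCT"
runs = find_homopolymers(toy_seq)
print(runs)
```

Any indel call whose position falls inside or next to one of the reported runs would be a candidate for manual review of the reads.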
Whole Exome Sequencing data: screenshot from WES data in IGV
All reads covering the full length of the homopolymer in the WES data were reverse reads, likely due to the exome library preparation process, coupled with the genomic region's location at the 3' end of the captured fragment. These reverse reads, however, degraded in quality as they approached the splice site, making it difficult for the algorithm to accurately call individual nucleotides in this region.
Whole Genome Sequencing data: screenshot from WGS data in IGV
In contrast, WGS showed balanced coverage, with forward and reverse reads starting at different points across the region. This resulted in higher quality and read depth across the entire homopolymer and, in turn, more accurate read mapping and variant calling.
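The strand imbalance seen in IGV can also be checked programmatically. The sketch below counts forward and reverse reads that span an entire region; it works on plain (start, end, is_reverse) tuples, which in practice would have to be extracted from a BAM file. The helper names, the region coordinates, and the toy reads are all hypothetical and do not reproduce the actual data from this case.

```python
from typing import Iterable, Tuple

# Hypothetical read summary: (start, end, is_reverse), 0-based half-open.
Read = Tuple[int, int, bool]


def spanning_strand_counts(
    reads: Iterable[Read], region_start: int, region_end: int
) -> Tuple[int, int]:
    """Count forward and reverse reads that cover the whole region."""
    forward = reverse = 0
    for start, end, is_reverse in reads:
        if start <= region_start and end >= region_end:
            if is_reverse:
                reverse += 1
            else:
                forward += 1
    return forward, reverse


def is_single_strand(forward: int, reverse: int) -> bool:
    """True when spanning reads exist but all come from one strand."""
    return (forward + reverse) > 0 and (forward == 0 or reverse == 0)


# Toy region and reads, made up to mimic the two patterns described above.
region = (100, 127)
wes_like = [(60, 130, True), (70, 135, True), (95, 120, False), (20, 110, False)]
wgs_like = [(50, 150, False), (80, 180, True), (10, 140, False), (90, 190, True)]

wes_counts = spanning_strand_counts(wes_like, *region)
wgs_counts = spanning_strand_counts(wgs_like, *region)
print(wes_counts, is_single_strand(*wes_counts))
print(wgs_counts, is_single_strand(*wgs_counts))
```

A region where every spanning read comes from a single strand, as in the WES-like toy data, is a reasonable trigger for closer inspection of any indel called there.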
Both datasets were processed using the same secondary analysis pipeline, yet the variant calls differed significantly. This discrepancy underscores the need for careful review of sequencing data during variant interpretation, particularly in difficult-to-sequence regions that can lead to inaccurate variant calls impacting clinical decisions.
In gnomAD v4, allele frequencies for variants in this region differ between the cohort of individuals sequenced with exome sequencing and the cohort sequenced with genome sequencing. Comparing coverage plots from gnomAD v4 (lifted to GRCh38 coordinates: chr2:47,414,369-47,414,468) for both WES and WGS reveals that this genomic region is challenging to sequence, often resulting in sequencing artifacts that may be incorrectly called and potentially misinterpreted as disease-causing variants.
Variant interpretation becomes more complex when discrepancies in variant annotation arise as a consequence of discrepancies in variant calling. For instance, in the WES data, the variant in MSH2 was annotated as c.942+2_942+6del (closer to the exon/intron boundary), while in the WGS data, it was annotated as c.942+28_942+29del (further from the exon/intron boundary).
This difference stems from the Human Genome Variation Society (HGVS) guidelines, which recommend describing a variant at its most 3’ position with respect to the reference sequence. In the WES data, due to sequencing artifacts, the called deletion includes the thymine (T) immediately before the homopolymer, which anchors the deletion at the 5’ end and prevents it from being shifted. In WGS, however, the sequencing reads cover the entirety of the homopolymer and the deletion involves only adenines, so it can be shifted and annotated further downstream. While confusing, the VCF and HGVS representations of each variant are both correct: in the VCF file, the variant is left-aligned and parsimonious at the genomic DNA level, while HGVS dictates following the 3' rule for standardized annotation.
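The interplay between left-alignment and the 3' rule can be reproduced on a toy sequence. The sketch below is an illustration under stated assumptions rather than the normalization code of any particular tool: the exon tail "CAG" and the downstream flank "GGC" are made up, and only the intronic "GT" followed by 27 adenines mirrors the region described here. The helpers `shift_deletion`, `apply_deletion`, and `intronic_hgvs` are hypothetical names.

```python
def apply_deletion(ref: str, start: int, length: int) -> str:
    """Return ref with ref[start:start+length] removed (0-based start)."""
    return ref[:start] + ref[start + length:]


def shift_deletion(ref: str, start: int, length: int, direction: str) -> int:
    """Shift a deletion as far as possible without changing the result."""
    if direction == "left":    # VCF-style left-alignment
        while start > 0 and ref[start - 1] == ref[start + length - 1]:
            start -= 1
    elif direction == "right":  # HGVS-style 3' rule
        while start + length < len(ref) and ref[start] == ref[start + length]:
            start += 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return start


def intronic_hgvs(start: int, length: int, exon_len: int, last_exon_base: int = 942) -> str:
    """Render a deletion as c.<exon end>+<offset>_<exon end>+<offset>del."""
    first = start - exon_len + 1
    last = first + length - 1
    return f"c.{last_exon_base}+{first}_{last_exon_base}+{last}del"


EXON_TAIL = "CAG"                       # hypothetical bases ending at c.942
REF = EXON_TAIL + "GT" + "A" * 27 + "GGC"  # "GGC" is a made-up flank
E = len(EXON_TAIL)

# WGS-like call: two adenines deleted right after the T (index 5, length 2).
wgs_left = shift_deletion(REF, 5, 2, "left")
wgs_right = shift_deletion(REF, 5, 2, "right")
# WES-like call: T plus four adenines deleted (index 4, length 5).
wes_left = shift_deletion(REF, 4, 5, "left")
wes_right = shift_deletion(REF, 4, 5, "right")

print(intronic_hgvs(wgs_right, 2, E))
print(intronic_hgvs(wes_right, 5, E))
```

In this toy setting the two-adenine deletion yields the same sequence wherever it is placed in the homopolymer, so it slides from the left-aligned position to +28_+29 under the 3' rule, while the deletion that includes the T cannot move in either direction and stays at +2_+6.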
For clinicians and researchers, the paramount question is whether to report this variant and how to interpret its clinical significance. Several considerations inform this decision:
Given the complexities highlighted for these specific variants, clinical scientists and researchers should consider the following:
One of the most striking aspects of this case is how a discrepancy in variant calls can prompt a reassessment of the cost-effectiveness of WES and WGS. Traditionally, WGS is seen as more expensive than WES, which has made WES the preferred choice in many clinical laboratories. However, in cases like this—where WGS provides more reliable data—labs may need to reconsider the overall value of WGS despite its higher cost.
The discrepancies between WES and WGS in this case are partly due to differences in coverage. WGS, with its uniform coverage across the genome, is better suited to detect variants in complex regions, such as homopolymers and exon boundaries, where WES coverage may falter. It is therefore worth considering whether the sensitivity and clinical utility of WGS justify its cost in certain scenarios, particularly for patients with complex or unclear genetic conditions.
This is an important clinical consideration, especially when state-of-the-art specialized AI platforms, such as Nostos Genomics’ AION (CE-IVD), are used for interpretation and data storage. With these tools, the interpretation effort for WGS and WES is similar despite the larger number of potential variants to analyze with WGS, because of the impact of different molecular predictors and the clinical context. Data storage costs are also higher for WGS due to larger data sizes, but infrastructure designed and developed for this purpose within these tools minimizes the additional cost.
Thanks to improvements in sequencing technologies and the development of powerful interpretation and data management tools, WGS is becoming more accessible, particularly for diagnosing rare diseases and advancing precision medicine. The ability to detect more disease-causing variants with fewer artifacts significantly improves diagnostic yield.4 As WGS costs continue to decline, it’s important to weigh the remaining cost difference against its substantial impact on patient care. Is the difference in cost still a barrier when WGS offers greater diagnostic power and potential for more tailored and impactful medical interventions?
Ultimately, this case underscores the challenges faced when interpreting genomic variants in homopolymer regions. Discrepancies between WES and WGS data can arise due to technical limitations, coverage differences, and biases introduced during library preparation. Notably, these differences also arise for complex variants such as CNVs and structural variants (SVs).
For genomic researchers, understanding these factors is crucial for accurate variant interpretation and clinical decision-making. By adopting meticulous validation practices and remaining cognizant of sequencing constraints, researchers can enhance the reliability of genomic analyses and contribute to more effective patient care.
The cost-effectiveness of these approaches may have to be revisited in light of observations such as those described here, all the more so given the effectiveness of AI prioritization tools such as AION (CE-IVD).
References
Note: This article is based on a detailed analysis of variant interpretation challenges in genomic data and aims to provide insights into best practices for researchers in the field.