Bioinformatic investigation of discordant sequence data for SARS-CoV-2: insights for robust genomic analysis during pandemic surveillance

Microb Genom. 2023 Nov;9(11):001146. doi: 10.1099/mgen.0.001146.


The COVID-19 pandemic has necessitated the rapid development and implementation of whole-genome sequencing (WGS) and bioinformatic methods for managing the pandemic. However, variability in methods and capabilities between laboratories has posed challenges in ensuring data accuracy. A national working group comprising 18 laboratory scientists and bioinformaticians from Australia and New Zealand was formed to improve data concordance across public health laboratories (PHLs). One effort, presented in this study, sought to understand the impact of the methodology on consensus genome concordance and interpretation. SARS-CoV-2 WGS proficiency testing programme (PTP) data were retrospectively obtained from the 2021 Royal College of Pathologists of Australasia Quality Assurance Programmes (RCPAQAP), which included 11 participating Australian laboratories. The submitted consensus genomes and reads from eight contrived specimens were investigated, focusing on discordant sequence data, and the findings were presented to the working group to inform best practices. Despite using a variety of laboratory and bioinformatic methods for SARS-CoV-2 WGS, participants largely produced concordant genomes. Two participants returned five discordant sites in a high-Ct replicate, which could be resolved with reasonable bioinformatic quality thresholds. We noted ten discrepancies in genome assessment that arose from nucleotide heterogeneity at three different sites in three cell-culture-derived control specimens. While these sites were ultimately accurate after considering the participants' bioinformatic parameters, they presented an interesting challenge for developing standards to account for intrahost single nucleotide variation (iSNV). Observed differences had little to no impact on the key surveillance metrics (lineage assignment and phylogenetic clustering), while genome coverage <90 % affected both.
We recommend PHLs bioinformatically generate two consensus genomes with and without ambiguity thresholds for quality control and downstream analysis, respectively, and adhere to a minimum 90 % genome coverage threshold for inclusion in surveillance interpretations. We also suggest additional PTP assessment criteria, including primer efficiency, detection of iSNVs and minimum genome coverage of 90 %. This study underscores the importance of multidisciplinary national working groups in informing guidelines in real time for bioinformatic quality acceptance criteria. It demonstrates the potential for enhancing public health responses through improved data concordance and quality control in SARS-CoV-2 genomic analysis during pandemic surveillance.
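The two recommendations above (paired consensus genomes and a 90 % coverage cut-off) can be illustrated with a minimal Python sketch. It is not the pipeline used by any participant: the function names, the minimum depth of 10 and the 25 % minor-allele frequency are illustrative assumptions, and coverage is assumed here to mean the fraction of reference positions with a called (non-N, non-gap) base.

```python
from collections import Counter

# Two-base IUPAC ambiguity codes; three-base codes are omitted for brevity.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def call_base(counts, min_depth=10, min_minor_freq=None):
    """Call one consensus base from per-base read counts at a site.

    min_depth and min_minor_freq are hypothetical parameters.
    With min_minor_freq=None the majority base is always reported
    (consensus without an ambiguity threshold). Otherwise, a minor base
    at or above that frequency yields an IUPAC ambiguity code.
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too little evidence to call the site
    major = counts.most_common(1)[0][0]
    if min_minor_freq is None:
        return major
    supported = {major} | {
        base for base, n in counts.items()
        if base != major and n / depth >= min_minor_freq
    }
    if len(supported) == 1:
        return major
    # Sites with more than two supported bases are masked in this sketch.
    return IUPAC.get(frozenset(supported), "N")

def genome_coverage(consensus, reference_length):
    """Fraction of reference positions carrying a called base."""
    called = sum(1 for base in consensus.upper() if base not in "N-")
    return called / reference_length

def passes_coverage(consensus, reference_length, min_coverage=0.90):
    """Apply the minimum genome coverage threshold."""
    return genome_coverage(consensus, reference_length) >= min_coverage

# A mixed site (70 % A, 30 % G) is reported differently by the two consensus modes.
site = Counter({"A": 70, "G": 30})
plain = call_base(site)                          # majority base only
flagged = call_base(site, min_minor_freq=0.25)   # ambiguity code for the mixed site
```

In this sketch the ambiguity-aware call exposes mixed sites such as iSNVs, while the majority-only call yields an unambiguous sequence; which of the two feeds quality control and which feeds downstream analysis follows the recommendation in the text.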

Keywords: SARS-CoV-2; bioinformatics; public health genomics; quality assurance.

MeSH terms

  • Australia / epidemiology
  • COVID-19* / epidemiology
  • Computational Biology
  • Genomics
  • Humans
  • Nucleotides
  • Pandemics
  • Phylogeny
  • Retrospective Studies
  • SARS-CoV-2* / genetics

