BMC Genomics. 2021 May 26;22(1):389.
doi: 10.1186/s12864-021-07702-2.

Evaluating the accuracy of Listeria monocytogenes assemblies from quasimetagenomic samples using long and short reads

Free PMC article

Seth Commichaux et al. BMC Genomics.

Abstract

Background: Whole genome sequencing of cultured pathogens is the state-of-the-art public health response for the bioinformatic source tracking of illness outbreaks. Quasimetagenomics can substantially reduce the amount of culturing needed before a high-quality genome can be recovered. Highly accurate short read data are analyzed for single nucleotide polymorphisms and multi-locus sequence types to differentiate strains, but short reads cannot span many genomic repeats, resulting in highly fragmented assemblies. Long reads can span repeats, resulting in much more contiguous assemblies, but have lower accuracy than short reads.

Results: We evaluated the accuracy of Listeria monocytogenes assemblies from enrichments (quasimetagenomes) of naturally contaminated ice cream using long read (Oxford Nanopore) and short read (Illumina) sequencing data. Accuracy of ten assembly approaches, over a range of sequencing depths, was evaluated by comparing sequence similarity of genes in assemblies to a complete reference genome. Long read assemblies reconstructed a circularized genome as well as a 71 kbp plasmid after 24 h of enrichment; however, high error rates prevented high-fidelity gene assembly, even at 150X depth of coverage. Short read assemblies accurately reconstructed the core genes after 28 h of enrichment but produced highly fragmented genomes. Hybrid approaches demonstrated promising results but had biases based upon the initial assembly strategy. Short read assemblies scaffolded with long reads accurately assembled the core genes after just 24 h of enrichment but were highly fragmented. Long read assemblies polished with short reads reconstructed a circularized genome and plasmid and assembled all the genes after 24 h of enrichment, but with less fidelity for the core genes than the short read assemblies.

Conclusion: The integration of long and short read sequencing of quasimetagenomes expedited the reconstruction of a high-quality pathogen genome compared to either platform alone. A new and more complete level of information about genome structure, gene order, and mobile elements can be added to the public health response by incorporating long read analyses with the standard short read WGS outbreak response.

Keywords: Assembly; Listeria; Metagenomics; Nanopore; Quasimetagenomics; Source tracking.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
The effective time required to sequence and analyze the quasimetagenomic samples. The blue circles marked as 24H, 28H, 32H, 36H, and 40H denote the five enrichment time points where the quasimetagenomic samples were collected and sequenced with the Illumina MiSeq (short read) and the Oxford Nanopore GridIon (long read). Diamonds represent the 30 batches (B1 to B30) of 4000 GridIon reads, each generated 45 min apart. For our analysis, reads from each batch were merged with previously obtained batches to form cumulative batches (Ci). The time taken to assemble the reads is shown with boxes labeled ‘A’. C18 at 24H marks the earliest time point where a complete Listeria monocytogenes genome was reconstructed (with metaFlye). The green circle corresponds to the time required to culture and sequence a pure colony isolate of Listeria monocytogenes, i.e., 144 h. Note: bioinformatic analysis can be performed in “real-time” on the GridIon batches as they are output, whereas an Illumina MiSeq sequencing run must finish before the bioinformatics can begin. However, for our analysis we partitioned the reads from each MiSeq run into 30 batches, each containing the same number of sequenced bases as the GridIon batches.
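The cumulative batching described in this caption (Ci is the union of batches B1 through Bi) can be sketched as follows. This is only a minimal illustration, assuming each batch is a list of read identifiers; the function name `cumulative_batches` and the toy read names are hypothetical and are not taken from the authors' pipeline.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_batches(batches: Sequence[Sequence[str]]) -> List[List[str]]:
    """Return [C1, ..., Cn], where Ci contains every read from B1 through Bi.

    Hypothetical helper: the input is assumed to be batches of read IDs
    in the order they came off the sequencer.
    """
    merged = accumulate(batches, lambda acc, batch: list(acc) + list(batch))
    return [list(c) for c in merged]


# Toy example with three small batches (real batches held 4000 reads each).
toy_batches = [["r1", "r2"], ["r3"], ["r4", "r5"]]
toy_cumulative = cumulative_batches(toy_batches)
```

Under this reading, each Ci can be assembled as soon as Bi is available, which is what allows analysis to proceed while sequencing is still running.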
Fig. 2
Taxonomic classification of cumulative batch 30 from each enrichment time point. For clarity, only the short read MegaHit and long read metaFlye assemblies were plotted (short read assembly results mirrored short read hybrid assemblies and long read assemblies mirrored long read hybrid assemblies). (a) The total bp of contigs per species (minimum of 5000 bp), as classified by Kraken. (b) Species in the sample, excluding L. monocytogenes, R. mucilaginosa, and unclassified sequences, which highlights how the short read assemblies capture more species than the long read assemblies.
Fig. 3
The NG50 versus the total number of base pairs sequenced per cumulative batch for the assembled L. monocytogenes contigs at each of the enrichment time points for each assembly approach. (Abbreviations: SR = short read, LR = long read, HY = hybrid)
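For readers unfamiliar with the metric, NG50 is commonly defined as the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the reference genome size. The sketch below follows that common definition; the caption does not state which tool computed the values, so the function name `ng50` and the toy numbers are illustrative assumptions only.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Length of the contig at which the cumulative length of the sorted
    contigs (longest first) reaches 50% of the reference genome size.

    Returns None when the assembly covers less than half of the genome.
    """
    half_genome = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half_genome:
            return length
    return None


# Toy example: a 100 bp "genome" assembled into four contigs.
toy_ng50 = ng50([40, 30, 20, 10], genome_size=100)
```

Unlike N50, which is relative to the total assembly length, NG50 is relative to the reference genome, so a fragmented or incomplete assembly cannot score well simply by being small.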
Fig. 4
The quality of assembled contigs annotated as L. monocytogenes, with respect to the reference genome, using Quast for cumulative batch 30 at each of the enrichment time points. The number of mismatches, insertion/deletion (indels), and misassemblies per 100 kbp for each assembly approach. (Abbreviations: SR = short read, LR = long read, HY = hybrid)
Fig. 5
Core gene BLAST distances. BLAST distance between the core genes of the reference genome and the assemblies versus the total number of base pairs sequenced per cumulative batch. (Abbreviations: SR = short read, LR = long read, HY = hybrid)
Fig. 6
Complete gene set BLAST distances. BLAST distance between the complete gene set of the reference genome and the assemblies versus the total number of base pairs sequenced per cumulative batch. (Abbreviations: SR = short read, LR = long read, HY = hybrid)
Fig. 7
Consistency of assembly approaches between successive cumulative batches. Median successive cumulative batch difference in BLAST distances, across enrichment time points, for A) the core genes and B) the complete gene set.
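One plausible reading of this consistency measure is sketched below: for each enrichment time point, take the absolute change in BLAST distance between each cumulative batch and the next, pool those changes across time points, and report the median. The caption does not spell out the exact computation, so the pooling step, the use of absolute differences, the function name `median_successive_difference`, and the toy values are all assumptions.

```python
from statistics import median
from typing import Dict, Sequence


def median_successive_difference(
    distances_by_timepoint: Dict[str, Sequence[float]]
) -> float:
    """Median absolute change in BLAST distance between successive
    cumulative batches, pooled over all enrichment time points.

    Each value in the mapping is assumed to be the series of BLAST
    distances for C1, C2, ... at one time point.
    """
    diffs = []
    for series in distances_by_timepoint.values():
        diffs.extend(abs(b - a) for a, b in zip(series, series[1:]))
    if not diffs:
        raise ValueError("need at least two cumulative batches at some time point")
    return median(diffs)


# Toy distances for one assembly approach at two time points.
toy_distances = {"24H": [10, 6, 5], "28H": [4, 4, 3]}
toy_median = median_successive_difference(toy_distances)
```

A small value would indicate that adding one more batch of reads changes the assembled genes very little, i.e., that the approach has stabilized.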
