BMC Genomics. 2017 Jan 19;18(1):100.
doi: 10.1186/s12864-017-3504-1.

Parasite infection of public databases: a data mining approach to identify apicomplexan contaminations in animal genome and transcriptome assemblies

Janus Borner et al. BMC Genomics. 2017.

Abstract

Background: Contamination from various exogenous sources is a common problem in next-generation sequencing. Endogenous parasites are another possible source of contaminating DNA. On the one hand, undiscovered contaminations of animal sequence assemblies may lead to erroneous interpretation of data; on the other hand, when identified, parasite-derived sequences may provide a valuable source of information.

Results: Here we show that sequences deriving from apicomplexan parasites can be found in many animal genome and transcriptome projects, which in most cases derived from an infection of the sequenced host specimen. The apicomplexan sequences were extracted from the sequence assemblies using a newly developed bioinformatic pipeline (ContamFinder) and tentatively assigned to distinct taxa employing phylogenetic methods. We analysed 920 assemblies and found 20,907 contigs of apicomplexan origin in 51 of the datasets. The contaminating species were identified as members of the apicomplexan taxa Gregarinasina, Coccidia, Piroplasmida, and Haemosporida. For example, in the platypus genome assembly, we found a high number of contigs derived from a piroplasmid parasite (presumably Theileria ornithorhynchi). For most of the infecting parasite species, no molecular data had been available previously, and some of the datasets contain sequences representing large amounts of the parasite's gene repertoire.

Conclusion: Our study suggests that parasite-derived contaminations represent a valuable source of information that can help to discover and identify new parasites, and provide information on previously unknown host-parasite interactions. We, therefore, argue that uncurated assembly data should routinely be made available in addition to the final assemblies.

Keywords: Apicomplexa; Coccidia; Contamination; Database analysis; Gregarinasina; Haemosporida; Malaria; Parasites; Phylogeny; Piroplasmida.

Figures

Fig. 1
Schematic overview of the ContamFinder pipeline. (a) All contigs from an assembly were searched against apicomplexan proteomes from the Eukaryotic Pathogen Database (EuPathDB [19, 20]). Sequences without a significant hit were discarded. (b) Amino acid sequences were predicted using the best-hitting apicomplexan protein. Low-complexity regions and repeats in the sequence were masked. (c) The predicted amino acid sequences were searched against the EuPathDB and UniProt databases. Sequences with their best hit outside of Apicomplexa were discarded. (d) Unprocessed contigs corresponding to the hits from the previous step were searched against the EuPathDB and UniProt databases. Sequences that had their best hit outside of Apicomplexa were discarded. Contigs and sequence regions that were kept and used in the next step are shown in green; sequences that were discarded are shown in red. Parasite-derived proteins in the search database are shown in blue, others in yellow
Fig. 2
Venn diagrams showing shared and unique hits from analyses using different search strategies on the assemblies of Capra hircus (a) and Odocoileus virginianus (b)
Fig. 3
Maximum likelihood tree based on a RAxML analysis of dataset 1 (1,420 genes, 67 taxa). The tree was rooted with Chromerida
Fig. 4
Majority-rule consensus tree based on a PhyloBayes analysis of dataset 2 (301 genes, 49 taxa). Bootstrap support values from a RAxML analysis were mapped onto the tree topology. Bayesian posterior probabilities < 1.00 and bootstrap support values < 100% are given at the nodes, respectively; n.s.: split was not supported in the ML analysis; splits that have 1.00 posterior probability and 100% bootstrap support are denoted by a dark circle. The tree was rooted with Chromerida
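
The best-hit filter described in panels (c) and (d) of Fig. 1 can be illustrated with a minimal Python sketch. This is not the ContamFinder implementation: the `Hit` record, the taxon labels, the sequence identifiers, and the use of a bit score to rank hits are all assumptions made for illustration. The sketch keeps a query only if its highest-scoring hit across the combined search database belongs to Apicomplexa.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search hit (hypothetical record layout, not ContamFinder's)."""
    query: str       # contig or predicted protein identifier
    subject: str     # identifier of the database sequence that was hit
    taxon: str       # assumed high-level taxon label of the subject
    bitscore: float  # assumed ranking score; higher is better


def best_hits(hits):
    """Return the highest-scoring hit per query (first seen wins ties)."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


def keep_apicomplexan(hits, target="Apicomplexa"):
    """Keep queries whose best hit falls inside the target taxon."""
    return sorted(q for q, h in best_hits(hits).items() if h.taxon == target)


# Made-up example data: contig_2 hits an apicomplexan protein, but its best
# hit lies outside Apicomplexa, so it is discarded.
hits = [
    Hit("contig_1", "api_prot_A", "Apicomplexa", 250.0),
    Hit("contig_1", "other_prot_X", "Metazoa", 90.0),
    Hit("contig_2", "api_prot_B", "Apicomplexa", 120.0),
    Hit("contig_2", "other_prot_Y", "Metazoa", 300.0),
]

print(keep_apicomplexan(hits))  # ['contig_1']
```

Comparing each query against both parasite and non-parasite sequences in one ranking is what removes conserved host genes that merely resemble an apicomplexan protein.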

References

    1. Naccache SN, Greninger AL, Lee D, Coffey LL, Phan T, Rein-Weston A, Aronsohn A, Hackett JJ, Delwart EL, Chiu CY. The perils of pathogen discovery: origin of a novel parvovirus-like hybrid genome traced to nucleic acid extraction spin columns. J Virol. 2013;87:11966–11977. doi: 10.1128/JVI.02323-13.
    2. Laurence M, Hatzis C, Brash DE. Common contaminants in next-generation sequencing that hinder discovery of low-abundance microbes. PLoS One. 2014;9:e97876. doi: 10.1371/journal.pone.0097876.
    3. Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, Turner P, Parkhill J, Loman NJ, Walker AW. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 2014;12:87. doi: 10.1186/s12915-014-0087-z.
    4. Merchant S, Wood DE, Salzberg SL. Unexpected cross-species contamination in genome sequencing projects. PeerJ. 2014;2:e675. doi: 10.7717/peerj.675.
    5. Tao Z, Sui X, Jun C, Culleton R, Fang Q, Xia H, Gao Q. Vector sequence contamination of the Plasmodium vivax sequence database in PlasmoDB and in silico correction of 26 parasite sequences. Parasit Vectors. 2015;8:318. doi: 10.1186/s13071-015-0927-x.