PLoS One, 9 (11), e112963
eCollection

Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement


Bruce J Walker et al. PLoS One.

Abstract

Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated, all-in-one tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data, but is particularly strong when supplied with paired-end data from two Illumina libraries with small (e.g., 180 bp) and large (e.g., 3-5 Kb) inserts. Pilon significantly improves draft genome assemblies by correcting bases, fixing mis-assemblies and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling identification of more biologically relevant genes. Furthermore, Pilon identifies small variants with high accuracy as compared to state-of-the-art tools and is unique in its ability to accurately identify large sequence variants, including duplications, and to resolve large insertions. Pilon is being used to improve the assemblies of thousands of new genomes and to identify variants from thousands of clinically relevant bacterial strains. Pilon is freely available as open source software.
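
As an illustration of the two-library setup described above, the following minimal Python sketch assembles a Pilon command line from a draft genome, a small-insert (fragment) BAM and an optional large-insert (jump) BAM. The function name, the file paths and the location of `pilon.jar` are assumptions made for this example; the flags used (`--genome`, `--frags`, `--jumps`, `--output`, `--changes`, `--vcf`) are Pilon command-line options, but the tool's own usage message should be consulted for the version in hand.

```python
from typing import List, Optional


def build_pilon_command(
    genome: str,
    frags_bam: str,
    jumps_bam: Optional[str] = None,
    output_prefix: str = "pilon",
    jar_path: str = "pilon.jar",  # assumed location of the Pilon jar
    call_variants: bool = False,
) -> List[str]:
    """Return an argument list for a Pilon run (all paths are hypothetical)."""
    cmd = ["java", "-jar", jar_path, "--genome", genome, "--frags", frags_bam]
    if jumps_bam is not None:
        # Large-insert (e.g., 3-5 Kb) library, if one is available.
        cmd += ["--jumps", jumps_bam]
    # --changes asks Pilon to also write a list of the edits it made.
    cmd += ["--output", output_prefix, "--changes"]
    if call_variants:
        # --vcf asks Pilon to write its calls to a VCF file.
        cmd.append("--vcf")
    return cmd


if __name__ == "__main__":
    example = build_pilon_command(
        "draft.fasta", "frags.bam", "jumps.bam", call_variants=True
    )
    print(" ".join(example))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. The input BAM files are expected to be sorted and indexed alignments of the reads against the same draft genome.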

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Simplified overview of the Pilon workflow for assembly improvement and variant detection.
The left column depicts the conceptual steps of the Pilon process, and the center and right columns describe what Pilon does at each step while in assembly improvement and variant detection modes, respectively. During the first step (top row), Pilon scans the read alignments for evidence where the sequencing data disagree with the input genome and makes corrections to small errors and detects small variants. During the second step (second row), Pilon looks for coverage and alignment discrepancies to identify potential mis-assemblies and larger variants. Finally (bottom row), Pilon uses reads and mate pairs which are anchored to the flanks of discrepant regions and gaps in the input genome to reassemble the area, attempting to fill in the true sequence including large insertions. The resulting output is an improved assembly and/or a VCF file of variants.
Figure 2. Example Pilon generated genome browser tracks.
This region was flagged by Pilon as containing a possible local mis-assembly, but Pilon was unable to determine a fix due to a tandem repeat sequence. The tracks shown here are: the Pilon Features track, indicating the extent of the region flagged by Pilon as containing a potential mis-assembly; the Valid Coverage track, indicating the sequence coverage of valid read pair alignments, excluding the clipped portions of the alignments; the Clipped Alignments track, indicating the number of reads soft-clipped at each location; and the Pct Bad Alignments track, indicating the percentage of the total reads aligned to each location which are not part of Valid Coverage. These tracks are created with the '--tracks' command-line option. Together, these tracks reveal the true bounds of the mis-assembly and indicate that there are likely missing copies of the tandem repeat in the draft assembly. In this case, manual analysis revealed that the draft assembly was missing two of three full copies of a 57-base tandem repeat.
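
The Pct Bad Alignments quantity defined in the caption above can be sketched in a few lines of Python. This is only an illustration of the definition (reads at a position that are not part of Valid Coverage, as a percentage of all reads aligned there). It is not Pilon's internal implementation, and the function name and per-position input lists are assumptions made for the example.

```python
from typing import List


def pct_bad_alignments(total_reads: List[int], valid_reads: List[int]) -> List[float]:
    """Per-position percentage of aligned reads that are not valid coverage.

    Assumes both lists cover the same positions and that
    valid_reads[i] <= total_reads[i] for every position i.
    """
    if len(total_reads) != len(valid_reads):
        raise ValueError("inputs must cover the same positions")
    result = []
    for total, valid in zip(total_reads, valid_reads):
        if total == 0:
            # No reads aligned here, so report 0 rather than dividing by zero.
            result.append(0.0)
        else:
            result.append(100.0 * (total - valid) / total)
    return result
```

A run of positions where this percentage is high, alongside a drop in valid coverage, is the kind of signal the tracks in this figure make visible.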
Figure 3. Comparative view of a transposase-rich region of the M. tuberculosis F11 genome (coordinates 1,991,000 to 2,006,300) obtained from the draft (A) and Pilon-improved (B) assemblies.
In the draft assembly, three regions containing transposases (shown in blue) remained unassembled, resulting in gaps. In the Pilon-improved assembly, all three sets of transposases were successfully assembled. The Pilon-improved assembly also contained a hypothetical gene, TBFG_11790 (shown in red), that was missing from the draft assembly. Though the gap at TBFG_11790 was not fully closed in the Pilon-improved version, closer inspection revealed a 42 bp overlap in the assembled sequence at this site. By default, to minimize spurious joins, Pilon will not close a gap unless there is at least 95 bp of overlapping sequence.
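
The overlap rule described in this caption can be illustrated with a minimal Python sketch. It finds the longest exact suffix-prefix overlap between the sequence extended from the left flank of a gap and the sequence extended from the right flank, and closes the gap only if the overlap reaches a minimum length. The function names are hypothetical, the default of 95 bp is taken from the caption, and exact string matching is a simplification; this is not Pilon's actual reassembly code.

```python
def suffix_prefix_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, 0, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def can_close_gap(left: str, right: str, min_overlap: int = 95) -> bool:
    """Accept a join only when the two extensions overlap by at least min_overlap bases."""
    return suffix_prefix_overlap(left, right) >= min_overlap
```

Under such a rule, a 42 bp overlap like the one at TBFG_11790 falls below the default threshold, so the gap stays open even though the two sides do meet.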
Figure 4. Venn diagram of the overlap in false negative (A) and false positive (B) calls by the three variant detection tools, Pilon, GATK UnifiedGenotyper and SAMtools.
False negative calls are the unique events from the curation set that were missed by each tool. Overlaps in the Venn diagram show the number of variants that were missed by multiple tools. False positive calls are the predictions from M. tuberculosis F11 that were not supported by the curation set. Overlaps indicate predictions that were shared among tools.
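
The region counts of such a Venn diagram can be computed from per-tool sets of calls, as in the Python sketch below. The function name is an assumption, and the variant identifiers in the usage note are invented placeholders, not data from the study.

```python
from typing import Dict, FrozenSet, Set


def venn_regions(calls: Dict[str, Set[str]]) -> Dict[FrozenSet[str], int]:
    """Count each call once, in the region named by exactly the tools that made it."""
    regions: Dict[FrozenSet[str], int] = {}
    all_items = set().union(*calls.values())
    for item in all_items:
        # The set of tools that share this call identifies one Venn region.
        owners = frozenset(tool for tool, items in calls.items() if item in items)
        regions[owners] = regions.get(owners, 0) + 1
    return regions
```

For example, with placeholder calls `{"Pilon": {"v1", "v2"}, "GATK": {"v2", "v3"}, "SAMtools": {"v2", "v3", "v4"}}`, the call `v2` is counted in the three-way overlap, `v3` in the GATK and SAMtools overlap, and `v1` and `v4` in the Pilon-only and SAMtools-only regions.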


