BMC Genomics. 2014 Nov 29;15(1):1039.
doi: 10.1186/1471-2164-15-1039.

Identifying Structural Variation in Haploid Microbial Genomes From Short-Read Resequencing Data Using Breseq

Free PMC article

Jeffrey E Barrick et al. BMC Genomics. 2014.

Abstract

Background: Mutations that alter chromosomal structure play critical roles in evolution and disease, including in the origin of new lifestyles and pathogenic traits in microbes. Large-scale rearrangements in genomes are often mediated by recombination events involving new or existing copies of mobile genetic elements, recently duplicated genes, or other repetitive sequences. Most current software programs for predicting structural variation from short-read DNA resequencing data are intended primarily for use on human genomes. They typically disregard information in reads mapping to repeat sequences, and significant post-processing and manual examination of their output is often required to rule out false-positive predictions and precisely describe mutational events.

Results: We have implemented an algorithm for identifying structural variation from DNA resequencing data as part of the breseq computational pipeline for predicting mutations in haploid microbial genomes. Our method evaluates the support for new sequence junctions present in a clonal sample from split-read alignments to a reference genome, including matches to repeat sequences. Then, it uses a statistical model of read coverage evenness to accept or reject these predictions. Finally, breseq combines predictions of new junctions and deleted chromosomal regions to output biologically relevant descriptions of mutations and their effects on genes. We demonstrate the performance of breseq on simulated Escherichia coli genomes with deletions generating unique breakpoint sequences, new insertions of mobile genetic elements, and deletions mediated by mobile elements. Then, we reanalyze data from an E. coli K-12 mutation accumulation evolution experiment in which structural variation was not previously identified. Transposon insertions and large-scale chromosomal changes detected by breseq account for ~25% of spontaneous mutations in this strain. In all cases, we find that breseq is able to reliably predict structural variation with modest read-depth coverage of the reference genome (>40-fold).

Conclusions: Using breseq to predict structural variation should be useful for studies of microbial epidemiology, experimental evolution, synthetic biology, and genetics when a reference genome for a closely related strain is available. In these cases, breseq can discover mutations that may be responsible for important or unintended changes in genomes that might otherwise go undetected.

Figures

Figure 1
Overview of the steps used by breseq to identify and annotate mutations in a haploid microbial genome from short-read resequencing data.
Figure 2
Junction candidate creation from split-read alignments that overlap. a) If two alignments of a read to the reference genome overlap, the overlapping bases at the center of the read could potentially be assigned to two separate locations in the reference sequence. b) If the read alignments in (a) had the imperfect alignments pictured here, the coordinates of each match and their overlap would be corrected as pictured by removing overlap until the remainder is a perfect match with no indels or mismatched bases. c) This type of junction candidate can be fully described by the reference coordinates defining each side of the junction breakpoint, the directions in the reference sequence each junction side continues to match from those breakpoint positions, and the number of overlapping bases in the read alignments.
Figure 3
Junction candidate creation from split-read alignments that do not overlap. a) If two alignments of a read to the reference genome do not meet or overlap in the middle of the read, then there are unique “read-only” bases present between the two matches to the reference sequence that do not match either side. b) This type of junction candidate can be fully described by the reference coordinates on each side of the junction breakpoint, the directions in the reference sequence each junction side continues to match from those positions, and the identity of the read-only bases inserted at the junction breakpoint.
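The two junction cases in Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not breseq's implementation: the `JunctionCandidate` record, the `describe_split_read` function, and the half-open read coordinates are all assumptions made for illustration, and the trimming of imperfect alignments described in Figure 2b is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JunctionCandidate:
    """Hypothetical record holding the fields named in the captions."""
    side1_pos: int   # reference coordinate of the breakpoint on side 1
    side1_dir: int   # -1 or +1: direction the match continues in the reference
    side2_pos: int   # reference coordinate of the breakpoint on side 2
    side2_dir: int   # -1 or +1
    overlap: int     # read bases assignable to either side (0 if none)
    read_only: str   # bases present only in the read ("" if none)


def describe_split_read(read, a_end, b_start, side1, side2):
    """Classify a split read whose first alignment covers read[0:a_end]
    and whose second alignment covers read[b_start:] (0-based, half-open).

    side1 and side2 are (reference_position, direction) tuples.
    """
    if b_start < a_end:
        # The alignments overlap in the middle of the read (Figure 2).
        overlap, read_only = a_end - b_start, ""
    else:
        # The alignments meet or leave a gap of read-only bases (Figure 3).
        overlap, read_only = 0, read[a_end:b_start]
    return JunctionCandidate(side1[0], side1[1], side2[0], side2[1],
                             overlap, read_only)


overlapping = describe_split_read("ACGTACGTAC", 6, 4, (1000, -1), (5000, 1))
gapped = describe_split_read("ACGTACGTAC", 4, 6, (1000, -1), (5000, 1))
print(overlapping.overlap, repr(gapped.read_only))
```

Storing the overlap count and the read-only bases in one record keeps both junction types in a single representation, which is a convenience of this sketch rather than a claim about how breseq stores them.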
Figure 4
Example of assigning coverage evenness scores to candidate junctions. Reads that align to a candidate new junction sequence may start at many different positions relative to the breakpoint. Reads that do not unambiguously support the new junction (gray arrows) because they do not extend across the breakpoint and any overlap or read-only bases (yellow highlighting) are not counted toward the evenness score. Although the two examples have the same number of reads that support the new junction because they align across the breakpoint and match the junction better than the reference genome (black arrows), the example in (a) is well-supported because these reads start in many different registers with respect to the breakpoint as would be expected for a normal reference genome location, whereas the example in (b) has reads beginning at a small number of biased positions with respect to the junction. This coverage evenness score is used to calculate a skew p-value to accept or reject a candidate junction, after also accounting for differences in the maximum number of read start positions that can support each candidate junction. In cases of tandem duplications much shorter than the read length, reads must also extend several “continuation” bases past any unique-only or overlap sequence to count as supporting a junction, as illustrated in Figure 5.
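The contrast between panels (a) and (b) can be sketched in Python under a simplifying assumption: that evenness is measured by counting the distinct start positions of reads that span the required region. The caption does not give the exact formula, so the function `evenness_score`, its coordinate convention, and the example reads are hypothetical, and the conversion to a skew p-value is not modeled.

```python
def evenness_score(reads, required_start, required_end):
    """Count distinct start positions among reads that span the region
    [required_start, required_end) of the candidate junction sequence.

    reads: iterable of (start, end) pairs, 0-based and half-open.
    The required region stands for the breakpoint plus any overlap,
    read-only, or continuation bases that a supporting read must cover.
    """
    starts = {
        start
        for start, end in reads
        if start <= required_start and end >= required_end
    }
    return len(starts)


# Five supporting reads in each case, plus one read (0, 18) that stops
# short of the breakpoint and is therefore ignored.
even_reads = [(0, 30), (3, 33), (7, 37), (12, 42), (15, 45), (0, 18)]
biased_reads = [(5, 35), (5, 35), (5, 35), (6, 36), (6, 36), (0, 18)]

print(evenness_score(even_reads, 20, 24))    # many distinct registers
print(evenness_score(biased_reads, 20, 24))  # few distinct registers
```

Both examples have the same number of supporting reads, so a plain read count would not separate them; the count of distinct start registers does.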
Figure 5
Case where additional read continuation across a breakpoint is required to support a junction candidate. In certain cases a read alignment must extend further across a junction breakpoint than just through the alignment overlap or read-only sequences to support the junction versus aligning equally well to the original reference sequence. One such case, in which four bases are deleted from a short tandem repeat region, is shown. In this example, read alignments to the junction candidate sequence must extend across the four overlapping junction bases and the three bases shown on their left side to support the junction.
Figure 6
Missing coverage evidence. a) The censored fit of read depth at sites with unique-only coverage across the reference genome to a negative binomial distribution is shown for one of the E. coli samples from the mutation accumulation evolution experiment. The threshold for extending putative deleted regions of the genome is determined by taking the coverage value that produces a left-tail probability from the fit distribution as described in the text (arrow). b) A missing coverage evidence item is shown for the same E. coli sample to illustrate how its boundaries are determined by extending outward from a seed region with zero coverage of uniquely aligned reads through regions with multiply-mapped reads that match genomic repeat sequences until the coverage of uniquely aligned reads exceeds the calculated propagation threshold. Note that the left and right boundaries both correspond to a range of positions because they fall within repeat regions. In some cases, this type of ambiguity in the extent of the deletion can be resolved by examining new junction evidence matching the endpoints.
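The boundary extension in panel (b) can be sketched as follows. This is a simplified illustration rather than breseq's code: the function `extend_missing_coverage` is hypothetical, the propagation threshold is passed in as a plain number instead of being derived from the negative binomial fit, and the ambiguity of boundaries inside repeat regions is not tracked.

```python
def extend_missing_coverage(unique_cov, seed_start, seed_end, threshold):
    """Extend a zero-coverage seed outward until the coverage of uniquely
    aligned reads exceeds the propagation threshold.

    unique_cov: per-position depth of uniquely aligned reads.
    seed_start, seed_end: inclusive 0-based bounds of the seed region.
    Returns the inclusive bounds of the extended region.
    """
    start, end = seed_start, seed_end
    # Walk left while the neighboring position is at or below the threshold.
    while start > 0 and unique_cov[start - 1] <= threshold:
        start -= 1
    # Walk right under the same rule.
    while end < len(unique_cov) - 1 and unique_cov[end + 1] <= threshold:
        end += 1
    return start, end


coverage = [30, 28, 2, 0, 0, 0, 1, 3, 25, 31]
print(extend_missing_coverage(coverage, 3, 5, threshold=3))
```

Positions covered only by multiply-mapped reads have zero unique coverage, so in this sketch the extension passes straight through them, which matches the behavior the caption describes for repeat regions.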
Figure 7
Predicting structural variation from new junction and missing coverage evidence. a) Types of structural variation for which breseq can predict precise mutational events from new junction sequences (JC) and missing read coverage (MC) evidence are shown in the context of the reference and mutant genomes. For JC evidence, the matched sequence on each side is shown as a solid arrow with a dashed line connecting the two sides. Orange JC arrows indicate that this side of a new sequence junction maps equally well to multiple locations in the reference genome (i.e., the location is ambiguous). Details for the procedure used in each case are described in the text. b) Mobile element insertions may require additional fields to describe the precise sequence change caused by insertion of a new copy. These may include a target site duplication and deleted or inserted bases on the margins of the new element copy, as shown.
Figure 8
Performance of structural variant prediction on simulated Illumina data sets. Data sets with different read lengths and coverage depths were generated according to an Illumina error model from simulated E. coli reference sequences with many examples of a single type of mutation causing structural variation randomly introduced. Results in terms of the sensitivity (or recall) for recovering true-positives (top panels) and the precision, equal to the number of true-positive predictions over the total number of predictions (bottom panels), are graphed as a function of junction skew scores accepted for making predictions. Results are shown for simulated genomes containing only a) deletions with breakpoints in non-repetitive reference genome sequences, b) new insertions of bacterial transposable sequences (IS elements), and c) deletions with one boundary ending on a repetitive IS element. The default junction skew score cutoff used by breseq is 3.0.
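The two metrics plotted in these panels follow directly from the definitions in the caption. The short Python sketch below computes them for one set of predictions; the function name and the event labels are hypothetical, and treating events as exact-match identifiers is a simplification.

```python
def sensitivity_and_precision(true_events, predicted_events):
    """Return (sensitivity, precision) for a set of predicted mutations.

    sensitivity = true positives / number of true events (recall)
    precision   = true positives / total number of predictions
    """
    true_set, predicted_set = set(true_events), set(predicted_events)
    true_positives = len(true_set & predicted_set)
    sensitivity = true_positives / len(true_set) if true_set else 0.0
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    return sensitivity, precision


truth = {"del_1", "del_2", "is_1", "is_2"}
predicted = {"del_1", "is_1", "false_call"}
sens, prec = sensitivity_and_precision(truth, predicted)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

Repeating this calculation while varying which junctions are accepted by skew score would trace out curves of the kind shown in the figure.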
Figure 9
Performance of structural variant prediction on simulated 454 data sets. Read data sets with different average read lengths and coverage depths were generated according to a 454 error model with a 10% standard deviation in read length from simulated E. coli reference sequences with many examples of a single type of mutation causing structural variation randomly introduced. Results in terms of the sensitivity (or recall) for recovering true-positives (top panels) and the precision, equal to the number of true-positive predictions over the total number of predictions (bottom panels), are graphed as a function of junction skew scores accepted for making predictions. Results are shown for simulated genomes containing only a) deletions with breakpoints in non-repetitive reference genome sequences, b) new insertions of bacterial transposable sequences (IS elements), and c) deletions with one boundary ending on a repetitive IS element. The default junction skew score cutoff used by breseq is 3.0.
Figure 10
Reanalysis of evolved E. coli samples for structural variation. a) Summary of mutations predicted by breseq in 21 clones sequenced after 6,000 generations of growth in a mutation accumulation experiment [19]. These samples were previously analyzed for single-base substitutions and small indels. The line extending across the bars separates single-base substitutions and indels from mutations affecting more bases that were classified as structural variants. Full details for all mutations predicted in the ancestor of this experiment and each evolved lineage are provided in Additional file 2. b) Overall representation of the different types of structural variants predicted from combinations of new junction (JC) and missing coverage (MC) evidence across all 21 genomes. One structural variant was predicted from spurious read alignment (RA) evidence, as described in the text.

References

    1. Barrick JE, Lenski RE. Genome dynamics during experimental evolution. Nat Rev Genet. 2013;14:827–839. doi: 10.1038/nrg3564.
    2. Greaves M, Maley CC. Clonal evolution in cancer. Nature. 2012;481:306–313. doi: 10.1038/nature10762.
    3. Lieberman TD, Michel J-B, Aingaran M, Potter-Bynoe G, Roux D, Davis MR, Skurnik D, Leiby N, Lipuma JJ, Goldberg JB, McAdam AJ, Priebe GP, Kishony R. Parallel bacterial evolution within multiple patients identifies candidate pathogenicity genes. Nat Genet. 2011;43:1275–1280. doi: 10.1038/ng.997.
    4. Damkiær S, Yang L, Molin S, Jelsbak L. Evolutionary remodeling of global regulatory networks during long-term bacterial adaptation to human hosts. Proc Natl Acad Sci U S A. 2013;110:7766–7771. doi: 10.1073/pnas.1221466110.
    5. Blount ZD, Barrick JE, Davidson CJ, Lenski RE. Genomic analysis of a key innovation in an experimental Escherichia coli population. Nature. 2012;489:513–518. doi: 10.1038/nature11514.
