J Comput Biol. 2015 May;22(5):387-401.
doi: 10.1089/cmb.2014.0146. Epub 2015 Jan 7.

Building a Pan-Genome Reference for a Population

Free PMC article

Ngan Nguyen et al. J Comput Biol. 2015.

Abstract

A reference genome is a high-quality individual genome that is used as a coordinate system for the genomes of a population or of closely related subspecies. Given a set of genomes partitioned by homology into alignment blocks, we formalize the problem of ordering and orienting the blocks such that the resulting ordering maximally agrees with the underlying genomes' ordering and orientation, creating a pan-genome reference ordering. We show that this problem is NP-hard, but also demonstrate, empirically and in simulations, the performance of heuristic algorithms based upon a cactus graph decomposition to find locally maximal solutions. We describe an extension of our Cactus software to create a pan-genome reference for whole genome alignments, and demonstrate how it can be used to create novel genome browser visualizations using human variation data as a test. In addition, we test the use of a pan-genome for describing variations and as a reference for read mapping.

Keywords: algorithms; computational molecular biology; genomics; molecular evolution; sequence analysis.

Figures

FIG. 1.
An illustration of a pan-genome reference on a sequence graph. (A) A bidirected graph representing the four ways two blocks can be connected. The arrowheads on the edges indicate their endpoints: the sides of the vertices. (B) An example pan-genome reference on a sequence graph. There are two sequences, indicated by the color of the edges. The red sequence, represented by the thread A, B, C, D, F, G, and the black sequence, represented by the thread A, −F, −E, −D, −B, G. The red thread visits the edges {−A, B}, {−B, C}, {−C, D}, {−D, F}, and {−F, G}, and the black thread visits the edges {−A, −F}, {F, −E}, {E, −D}, {D, −B}, and {B, G}. Neither thread includes all the blocks. A pan-genome reference, indicated by the dotted edges, is A, −F, −E, −D, −C, −B, G. The dotted edges and the edges {−B, D} and {−D, F} are the edges consistent with the given pan-genome reference.
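The edge lists in this caption can be reproduced mechanically. The sketch below is illustrative only and is not taken from the Cactus software: the helper names `flip`, `thread_edges`, and `is_consistent` are hypothetical, ASCII `-` stands in for the minus sign, and the consistency test assumes that an edge is consistent with the reference whenever some pair of blocks in the reference, abutting or not, implies that edge.

```python
def flip(side):
    """Return the opposite side label: 'A' <-> '-A'."""
    return side[1:] if side.startswith("-") else "-" + side


def thread_edges(thread):
    """Edges visited by a thread of oriented blocks.

    Following the caption's notation, consecutive blocks x, y
    contribute the edge {-x, y}.
    """
    return [frozenset((flip(x), y)) for x, y in zip(thread, thread[1:])]


def is_consistent(edge, reference):
    """ASSUMPTION: an edge is consistent if any ordered pair of blocks
    in the reference (abutting or nonabutting) implies it."""
    return any(
        frozenset((flip(x), y)) == edge
        for i, x in enumerate(reference)
        for y in reference[i + 1:]
    )


# Threads and reference exactly as listed in the caption.
red = ["A", "B", "C", "D", "F", "G"]
black = ["A", "-F", "-E", "-D", "-B", "G"]
reference = ["A", "-F", "-E", "-D", "-C", "-B", "G"]

for name, thread in (("red", red), ("black", black)):
    for edge in thread_edges(thread):
        print(name, sorted(edge), is_consistent(edge, reference))
```

Under this reading, every black edge is consistent with the reference, while the red edges {−A, B} and {−F, G} are not, which agrees with the set of consistent edges named in the caption.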
FIG. 2.
(Top) An illustration of why it is not always sufficient to consider only abutting adjacencies. (A) There are five blocks, A, B, C, d, and e, reprising their roles from the example given in the introduction. The input contains n copies of the sequence A, d, B, e, C, and n copies of the sequence A, e, B, −d, C. (B) The bidirected graph representation of this problem, with the number of adjacencies supporting each edge labeled, the abutting adjacencies shown as solid lines, and the nonabutting adjacencies shown as dotted lines. If only solutions that start with A and end with C are of interest, there are four maximal solutions, shown in (C, D, E, F). Solutions (C) and (D) each have 4n abutting adjacencies and 10n nonabutting adjacencies. Solutions (E) and (F) each also have 4n abutting adjacencies but only 6n nonabutting ones. For θ < 1 the (C) and (D) solutions are optimal. As θ approaches 1, the weight of nonabutting adjacencies approaches 0 and all four solutions become equally weighted, despite (E) and (F) having B in the reverse orientation. (Bottom) An illustration of why θ should be greater than 0. (G) There are m + 2 blocks; the input contains n − 1 copies of the sequence Am, B, C and 1 copy of the sequence [formula image]. (H) The bidirected graph representation of the problem, where the sequence of [formula image] blocks has been reduced to just a single vertex for convenience. The two maximal solutions are shown in (I, J), corresponding to the two distinct input sequences. If m > n and θ is 0, then the solution with B in the reverse orientation (I) is optimal, despite this orientation being observed only once. By increasing θ, the alternative solution with B in the forward orientation becomes optimal.
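The arithmetic behind the top panel can be checked with a small sketch. The caption does not give the weighting function, so the code below uses a deliberately simplified stand-in: every nonabutting adjacency receives the same flat weight (1 − θ), and every abutting adjacency receives weight 1. The function name `score` and this flat weight are assumptions made for illustration; a weight that decays with the distance between blocks would behave the same way at the two extremes discussed in the caption.

```python
def score(abutting, nonabutting, theta):
    """Total weight of a candidate ordering.

    ASSUMPTION: abutting adjacencies weigh 1 and nonabutting
    adjacencies share a flat weight of (1 - theta). This is a
    simplified stand-in, not the weighting used in the paper.
    """
    return abutting + (1.0 - theta) * nonabutting


n = 1  # number of copies of each input sequence; any positive n scales both scores
for theta in (0.0, 0.5, 0.9, 1.0):
    cd = score(4 * n, 10 * n, theta)  # solutions (C) and (D)
    ef = score(4 * n, 6 * n, theta)   # solutions (E) and (F)
    print(f"theta={theta}: C/D={cd:.1f}  E/F={ef:.1f}")
```

For any θ below 1 the (C) and (D) solutions score higher, and at θ = 1 only the 4n abutting adjacencies count, so all four solutions tie.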
FIG. 3.
(A) A bidirected graph with three vertices A, B, and C. (B) A subgraph of (A) containing no M, 0-cycles or odd M, N-cycles. (C) A side bicoloring of (B). (D) A digraph for (C).
FIG. 4.
(A) The bidirected graph from Figure 1B rewritten to show the nets as colored side subgraphs. (B) The cactus graph representation of the blocks and nets in (A), with the white net containing the highest level chains. The edges represent the blocks; the vertices represent the nets. The arrowheads on the edges indicate endpoints that are links.
FIG. 5.
(Top) Simulation results using arbitrary inversion and translocation operations. Each plot shows the total number of operations (a mixture of 50% inversions and 50% translocations) versus either the DCJ distance (top two plots) or symmetric difference distance (bottom two plots). The left plots give the average distance from the leaf genomes and the right plots give the distance from the original “true” median genome. Series shown include the original median genome (left plots only), the inferred median genome from the AsMedian program (Xu, 2009) using three leaves, and the inferred median genomes using our combined reference algorithms, using, separately, 3, 5, and 10 leaf genomes as input. Simulations used 10 replicates for each fixed number of edits; points give the median result, and lines show the max and min quartiles. (Bottom) Simulation results using short inversion and translocation operations, laid out as in the top panel.
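The caption does not define the symmetric difference distance. One common reading, assumed here, is the number of block adjacencies that appear in exactly one of the two genomes, with no normalization. The sketch below follows that reading; the names `flip`, `adjacencies`, and `symmetric_difference_distance` are hypothetical, and the DCJ distance is not covered.

```python
def flip(side):
    """Return the opposite side label: 'A' <-> '-A'."""
    return side[1:] if side.startswith("-") else "-" + side


def adjacencies(genome):
    """Set of unordered side pairs joined by consecutive oriented blocks."""
    return {frozenset((flip(x), y)) for x, y in zip(genome, genome[1:])}


def symmetric_difference_distance(g1, g2):
    """ASSUMPTION: unnormalized count of adjacencies found in
    exactly one of the two genomes."""
    return len(adjacencies(g1) ^ adjacencies(g2))


original = ["A", "B", "C", "D"]
inverted = ["A", "-C", "-B", "D"]        # blocks B..C inverted
reverse_strand = ["-D", "-C", "-B", "-A"]  # same genome read from the other strand

print(symmetric_difference_distance(original, inverted))
print(symmetric_difference_distance(original, reverse_strand))
```

A single inversion breaks the two adjacencies at its ends and creates two new ones, while the adjacency inside the inverted segment is preserved. Reading a genome from the opposite strand leaves its adjacency set unchanged.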
FIG. 6.
Prototype UCSC pangenome reference browser screenshots. (Top) Indels. (Middle) A segregating inversion. (Bottom) An apparently fixed tandem duplication. For reasons of space some samples are omitted from the screenshots. The human reference genome is PGF, the chimpanzee genome is panTro3, and details of the other samples are in the Supplementary Material.
FIG. 7.
A comparison of indel and SNV rates between P. Ref. (red dots) and PGF (blue dots). (A–C) The number of insertions/deletions/SNVs per site (position) of each sample, as predicted by the Cactus MSA with respect to PGF and P. Ref.
