Genome Res. 2008 Aug;18(8):1362-8.
doi: 10.1101/gr.078477.108. Epub 2008 May 23.

DupMasker: A Tool for Annotating Primate Segmental Duplications

Free PMC article

Zhaoshi Jiang et al. Genome Res.

Abstract

Segmental duplications (SDs) play an important role in genome rearrangement, evolution, and the copy-number variation (CNV) of primate genomes. Such sequences are difficult to detect, a priori, because they share no defining sequence features that distinguish them from unique portions of the genome. Current sequence annotation of segmental duplications requires computationally intensive, genome-wide self-comparisons that cannot be easily implemented on new data sets. Based on the successful implementation of RepeatMasker, we developed a new genome annotation tool, DupMasker. The program uses a library of nonredundant consensus sequences of human segmental duplications, wherein a majority of the ancestral origins have been determined based on comparisons to mammalian outgroup genomes. Using DupMasker, new human and nonhuman primate (NHP) sequences may be readily queried to provide details on the origin and degree of sequence identity of each duplicon. This program can be applied to delineate the order and orientation of duplicons within complex duplication blocks and used to characterize structural variation differences between sequenced human haplotypes. We predict this tool will be valuable in the annotation of large-insert sequence clones, allowing putative unique and duplicated regions of the genomes to be annotated prior to whole genome assembly comparisons.

Figures

Figure 1.
DupMasker defines the substructure of human segmental duplication blocks. Human segmental duplications are organized into complex duplication blocks in which individual duplicons originate from different regions of the genome. We assessed the ability of DupMasker to accurately define these ancestral duplicons in this context by comparing its results for two regions studied previously in detail (Horvath et al. 2000; Jiang et al. 2007). We display schematically (using PARASIGHT: http://eichlerlab.gs.washington.edu/jeff/parasight/index.html) the duplicons detected using the A-Bruijn graph approach (top) versus DupMasker (bottom) for (A) an ∼600-kbp region on chromosome 2p11 and (B) an ∼700-kbp region on chromosome 5q13.2. The different duplicons are illustrated as color-coded blocks; different colors correspond to different cytogenetic band locations of the ancestral loci. We found that 33 of 36 nonredundant duplicon blocks are consistent between the two results. The three mismatched blocks are short (<1.5 kb each, highlighted in red).
Figure 2.
The size and sequence identity distribution of “novel” duplications. (A) The length distribution of DupMasker duplications not detected by WGAC (termed “novel” SDs) reveals that the majority (99% by number of intervals, 91% by base pair) of these intervals are small fragments (size <1 kb). (B) We found that 52.3% (21.9 Mb) of these small intervals consist of common repeats, owing to imprecise boundary definition within repeat-rich regions. We performed a modified WGAC analysis on these “novel” SDs using a relaxed threshold (requiring a nonrepeat alignment ≥100 bp and sequence identity ≥75%). The analysis revealed alignments for 31% (13.1/41.96 Mb) of these “novel” SDs. Of the 13.1 Mb of alignments, 97.7% (12.8/13.1 Mb) represent either small (size <1 kbp) or relatively ancient (sequence identity <90%) duplications.
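The relaxed threshold and the small/ancient classification described in this legend can be illustrated with a minimal Python sketch. It is not the WGAC pipeline or any part of DupMasker; the `Alignment` record, its field names, and the helper functions are hypothetical, and only the numeric cutoffs (≥100 bp nonrepeat alignment, ≥75% identity, <1 kbp, <90% identity) come from the legend.

```python
from dataclasses import dataclass

# Cutoffs taken from the Figure 2 legend; everything else is illustrative.
MIN_NONREPEAT_BP = 100      # relaxed threshold: nonrepeat alignment >= 100 bp
MIN_IDENTITY = 0.75         # relaxed threshold: sequence identity >= 75%
SMALL_BP = 1000             # "small" means size < 1 kbp
ANCIENT_IDENTITY = 0.90     # "relatively ancient" means identity < 90%


@dataclass
class Alignment:
    """Hypothetical pairwise alignment record (not a real WGAC output format)."""
    length_bp: int          # total aligned length
    nonrepeat_bp: int       # aligned bases outside common repeats
    identity: float         # fractional sequence identity, 0.0 to 1.0


def passes_relaxed_threshold(aln: Alignment) -> bool:
    """True if the alignment meets both relaxed cutoffs."""
    return aln.nonrepeat_bp >= MIN_NONREPEAT_BP and aln.identity >= MIN_IDENTITY


def is_small_or_ancient(aln: Alignment) -> bool:
    """True if the alignment is short or has low sequence identity."""
    return aln.length_bp < SMALL_BP or aln.identity < ANCIENT_IDENTITY


def small_or_ancient_fraction(alignments: list[Alignment]) -> float:
    """Fraction of retained aligned bases that are small or ancient."""
    kept = [a for a in alignments if passes_relaxed_threshold(a)]
    total = sum(a.length_bp for a in kept)
    if total == 0:
        return 0.0
    flagged = sum(a.length_bp for a in kept if is_small_or_ancient(a))
    return flagged / total
```

The fraction is weighted by aligned bases rather than by number of alignments, mirroring the legend's Mb-based ratio (12.8/13.1 Mb).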
Figure 3.
Duplication architecture flanking genomic disorders. This figure shows the duplication architecture defined by DupMasker for one of the most unstable regions of the human genome (15q11–15q13). (A) Blue lines delineate intrachromosomal duplications of high sequence identity (size ≥10 kb and sequence identity ≥95%) within this region (WGAC) and identify four breakpoint regions associated with Prader-Willi/Angelman syndrome and the 15q13.3 deletion syndrome. (B) The duplication substructures defined by DupMasker are depicted as color-coded boxes, with different colors representing different cytogenetic band locations of duplicons. (C) ArrayCGH data from one patient with Prader-Willi syndrome (bottom) and two patients with the chr15q13.3 deletion (Sharp et al. 2008) indicate that the patients’ deletion breakpoints overlap with the duplicons defined by DupMasker. The locations of the breakpoint intervals are highlighted by red dashed lines.
Figure 4.
Genomic comparisons by DupMasker. DupMasker facilitates the characterization of duplication-mediated genomic rearrangements. (A) A Miropeats (Parsons 1995) comparison of the human reference genome (build35, top) with a fosmid clone (bottom) from a Japanese individual (ABC9) identifies an ∼40-kbp deletion. DupMasker analysis of this region identified a pair of tandem duplications (dark green) flanking the internal duplicon (light green), which was likely deleted by NAHR in this individual. The deletion removes part of an intron of the LATS1 gene. (B) A similar comparison between the sequence of a chimpanzee BAC clone (AC097264.4) and its orthologous locus on human chromosome 17 predicts a large (∼80 kbp) chimpanzee-specific insertion. DupMasker analysis suggests that the insertion is the result of a duplicative transposition event composed of segmental duplications that originated from human–chimpanzee ancestral chromosome 16.
Figure 5.
Assigning lineage-specific and shared duplications in primates. We applied DupMasker (standard default settings) to the macaque genome (RheMac2) and readily identified shared and lineage-specific duplications by comparing the results with duplication maps of the rhesus macaque genome (Gibbs et al. 2007). We found that 84% (121.0/143.3 Mb) of duplications in the human genome are human-lineage specific. A total of 22.3 Mb of duplications are shared between human and macaque, and 52.4% (24.3/46.4 Mb) of the duplications defined in the macaque genome are macaque-lineage specific.
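Splitting duplicated sequence into shared and lineage-specific portions amounts to intersecting two sets of genomic intervals. The Python sketch below shows one generic way to do that and is not the authors' procedure: it assumes both duplication maps are already expressed in one coordinate system on a single chromosome, as half-open `(start, end)` tuples, and all function names are hypothetical.

```python
def merge(intervals):
    """Sort half-open (start, end) intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def total_bp(intervals):
    """Total number of bases covered by the intervals, counting overlaps once."""
    return sum(end - start for start, end in merge(intervals))


def shared_bp(a, b):
    """Bases covered by both interval sets, found with a two-pointer sweep."""
    a, b = merge(a), merge(b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def lineage_specific_bp(own, other):
    """Bases in `own` that are not covered by `other`."""
    return total_bp(own) - shared_bp(own, other)
```

For example, with toy intervals `human = [(0, 100), (200, 300)]` and `macaque = [(50, 150), (250, 260)]`, `shared_bp(human, macaque)` is 60 and `lineage_specific_bp(human, macaque)` is 140. A real comparison would also need per-chromosome bookkeeping and a mapping between the two assemblies, both of which this sketch leaves out.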
