Nucleic Acids Res. 2015 Nov 16;43(20):e133.
doi: 10.1093/nar/gkv671. Epub 2015 Jul 10.

Utilizing Mapping Targets of Sequences Underrepresented in the Reference Assembly to Reduce False Positive Alignments

Free PMC article

Karen H Miga et al. Nucleic Acids Res. 2015.

Abstract

The human reference assembly remains incomplete due to the underrepresentation of repeat-rich sequences that are found within centromeric regions and acrocentric short arms. Although these sequences are marginally represented in the assembly, they are often fully represented in whole-genome short-read datasets and contribute to inappropriate alignments and high read-depth signals that localize to a small number of assembled homologous regions. As a consequence, these regions often provide artifactual peak calls that confound hypothesis testing and large-scale genomic studies. To address this problem, we have constructed mapping targets that represent roughly 8% of the human genome generally omitted from the human reference assembly. By integrating these data into standard mapping and peak-calling pipelines, we demonstrate a 10-fold reduction in signal within blacklisted regions and identify a comprehensive set of regions whose mapping is sensitive to the presence of the repeat-rich targets.
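
The effect described in the abstract can be illustrated with a toy model. The sketch below is not the authors' pipeline: it assumes a hypothetical best-hit rule (fewest mismatches over ungapped placements, with a made-up `max_mismatches` cutoff) and invented sequences named `chr_toy` and `sponge`. It only shows why reads from an unassembled repeat pile onto a diverged assembled copy, and how adding the missing sequence as an extra mapping target absorbs them.

```python
def best_hit(read, targets):
    """Return (target_name, mismatches) for the placement with the fewest
    mismatches across all ungapped placements of `read`; None if no target
    is long enough. Ties keep the first target encountered."""
    best = None
    for name, seq in targets.items():
        for i in range(len(seq) - len(read) + 1):
            window = seq[i:i + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if best is None or mismatches < best[1]:
                best = (name, mismatches)
    return best


def depth_by_target(reads, targets, max_mismatches=2):
    """Count reads assigned to each target under the toy best-hit rule.
    `max_mismatches` is an illustrative cutoff, not a value from the paper."""
    counts = {name: 0 for name in targets}
    for read in reads:
        hit = best_hit(read, targets)
        if hit is not None and hit[1] <= max_mismatches:
            counts[hit[0]] += 1
    return counts


# Invented sequences: a unique region followed by one diverged repeat copy
# that made it into the assembly.
reference = {"chr_toy": "TTCCGGAACCGG" + "GATTACAGATTACA"}
# Invented sponge target: the repeat variant that is missing from the assembly.
sponge = {"sponge": "GATTGCAGATTGCA"}

# Ten reads from the unassembled repeat, two reads from the unique region.
reads = ["GATTGCAG"] * 5 + ["ATTGCAGA"] * 5 + ["CCGGAACC"] * 2

without_sponge = depth_by_target(reads, reference)
with_sponge = depth_by_target(reads, {**reference, **sponge})

print("without sponge:", without_sponge)
print("with sponge:   ", with_sponge)
```

In this toy case the ten repeat-derived reads align to `chr_toy` with one mismatch when no sponge is present, and move to `sponge` (an exact match) once it is added, while the two reads from the unique region stay on `chr_toy`.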

Figures

Figure 1.
Large, multi-megabase regions of the human genome remain incomplete because they are highly repetitive; these map to gaps assigned to centromeres and heterochromatin and include sequences that remain missing from subtelomeric regions in the acrocentric short arms. As shown in (A), these regions are marked in the genome by gaps or placeholders that indicate regions enriched for long arrays of tandemly repeated DNA. Often the edges of these gaps provide some representation of the sequences found across the entire array (shown in red if included in the assembly and shaded red if inferred to be present in the gap region). Sequence reads from the entire region are expected to be present in high-throughput, whole-genome datasets. When mapping to a partial reference, these reads find their best alignments on the regions represented in the assembly. As a result, a large number of reads (representing the multi-megabase arrays) align with high read depth, resulting in false positive sites in the genome. To account for these mapping errors we have designed mapping targets, collectively called a ‘sponge database’; (B) shows the distribution of DNA families in the collection of 1.5 million remaining unassembled reads from the HuRef genome.
Figure 2.
(A) A reduction in artifactual read alignments was observed in the presence of the sponge mapping targets when surveyed across blacklisted regions from four previously characterized datasets that provide lists of annotated sites in hg19. When evaluating read mapping results with and without the sponge across low-coverage whole-genome datasets from two individuals (HuRef, Western European, and GM19239, Yoruba), we observe a 10-fold decrease. (B) We observe a similar 10-fold or greater reduction in peaks called within blacklisted regions (shown here for the Anshul hg19 blacklisted data), including nine additional ENCODE functional datasets. Further, increasing the abundance of the sponge database from 1x to 8x yields little additional improvement. (C) Results for CTCF mapping in the region hg19 chr1:121,179,675–121,374,269 are shown with and without the sponge database. MACS peak calls are indicated in red, and locations of CTCF binding are shown in the track highlighted in light brown. In the presence of the sponge mapping targets, read alignment depth is decreased in regions that span a previously characterized blacklisted region (shown in green) and are labeled as false positives. Alignments are also reduced in regions that are not indicated as blacklisted and appear to be novel (shown in orange), offering new sites of false positive alignments. Regions that benefit from multiple lines of biological support (shown in blue) still provide peak calls in the presence of the sponge mapping targets.
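
A fold reduction of the kind reported in this caption can be computed by counting alignments that overlap blacklisted intervals with and without the sponge and taking the ratio. The sketch below is only an illustration: the coordinates and read counts are invented, the function names are hypothetical, and intervals are assumed to be 0-based and half-open.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def reads_in_regions(alignments, regions):
    """Count alignments (chrom, start, end) that overlap at least one region."""
    count = 0
    for chrom, start, end in alignments:
        if any(chrom == r_chrom and overlaps(start, end, r_start, r_end)
               for r_chrom, r_start, r_end in regions):
            count += 1
    return count


def fold_reduction(count_without, count_with):
    """Ratio of counts without vs. with the sponge; infinite if none remain."""
    if count_with == 0:
        return float("inf")
    return count_without / count_with


# Invented blacklisted interval and invented 36-bp alignments.
blacklist = [("chr1", 900, 1300)]
outside = ("chr1", 9000, 9036)  # an alignment outside the blacklist
alignments_without = [("chr1", 1000 + 10 * i, 1036 + 10 * i) for i in range(20)]
alignments_without.append(outside)
# Assume most blacklist-overlapping reads were absorbed by the sponge.
alignments_with = alignments_without[:2] + [outside]

n_without = reads_in_regions(alignments_without, blacklist)
n_with = reads_in_regions(alignments_with, blacklist)
fold = fold_reduction(n_without, n_with)
print(n_without, n_with, fold)
```

For real data, an interval tree or a sorted sweep would replace the nested scan; the quadratic loop is kept here only for readability.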
Figure 3.
Sites of enrichment that benefit from multiple lines of biological evidence are not lost in the presence of the sponge mapping targets, as shown by monitoring changes in read depth for (A) CTCF with increasing abundance of the sponge database (1x–8x coverage), (B) long RNA datasets that overlap with characterized RefSeq gene locations and (C) promoter regions, defined here as 1 kb upstream of a RefSeq gene and evaluated using DNase, GABP and H3K4me3 datasets.
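
The promoter definition used in panel (C), 1 kb upstream of a RefSeq gene, can be written as a small interval calculation. The sketch below makes two assumptions not stated in the caption: that "upstream" is taken relative to the gene's strand, and that coordinates are 0-based and half-open. The function name `promoter_window` is hypothetical.

```python
def promoter_window(tx_start, tx_end, strand, upstream=1000):
    """Return (start, end) of the window `upstream` bp before the gene start.

    Assumes 0-based, half-open coordinates with tx_start < tx_end, and that
    'upstream' follows the gene's strand (an assumption, not from the source).
    """
    if strand == "+":
        # Gene starts at tx_start; the window lies to its left, clipped at 0.
        return max(0, tx_start - upstream), tx_start
    if strand == "-":
        # Gene starts at tx_end; the window lies to its right.
        return tx_end, tx_end + upstream
    raise ValueError(f"unknown strand: {strand!r}")


# Invented gene coordinates for illustration.
plus_window = promoter_window(5000, 9000, "+")
minus_window = promoter_window(5000, 9000, "-")
clipped_window = promoter_window(300, 800, "+")
print(plus_window, minus_window, clipped_window)
```

A window on the minus strand is not clipped at the chromosome end here, since that would require chromosome lengths that this sketch does not take as input.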


