Nucleic Acids Res. 2014 Jul;42(12):e99.
doi: 10.1093/nar/gku356. Epub 2014 May 6.

Realistic Artificial DNA Sequences as Negative Controls for Computational Genomics

Free PMC article

Juan Caballero et al. Nucleic Acids Res. 2014.

Abstract

A common practice in computational genomic analysis is to use a set of 'background' sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such 'background' sequences are generally taken from regions of the genome presumed to be intergenic, or generated synthetically by 'shuffling' real sequences. This last method can lead to underestimation of false-positive rates. We developed a new method for generating artificial sequences that are modeled after real intergenic sequences in terms of composition, complexity and interspersed repeat content. These artificial sequences can serve as an inexhaustible source of high-quality negative controls. We used artificial sequences to evaluate the false-positive rates of a set of programs for detecting interspersed repeats, ab initio prediction of coding genes, transcribed regions and non-coding genes. We found that RepeatMasker is more accurate than PClouds, Augustus has the lowest false-positive rate of the coding gene prediction programs tested, and Infernal has a low false-positive rate for non-coding gene detection. A web service, source code and the models for human and many other species are freely available at http://repeatmasker.org/garlic/.
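
As a rough illustration of how a negative control yields a false-positive rate: an artificial sequence contains no real features, so every base a prediction tool annotates in it can be counted as a false positive. The sketch below assumes a per-base definition of the rate and half-open (start, end) prediction intervals; both are assumptions made for illustration, and the function names are hypothetical rather than part of any tool named in the abstract.

```python
def covered_bases(intervals):
    """Total number of bases covered by half-open (start, end) intervals.

    Overlapping or touching intervals are merged so no base is counted twice.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # The previous merged interval is finished; add its length.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # This interval overlaps or touches the current one; extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def false_positive_rate(predictions, sequence_length):
    """Per-base false-positive rate on a negative-control sequence.

    Assumption: the control holds no true features, so every predicted
    base is a false positive.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return covered_bases(predictions) / sequence_length
```

For example, predictions at (0, 10), (5, 20) and (30, 40) on a 100-base control cover 30 bases once the overlap is merged.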

Figures

Figure 1.
Total number of known and predicted coding genes in Ensembl 64 for human and other species. The gene size is determined by the total length in bases of each CDS.
Figure 2.
Overview of the algorithms. Our development consists of two stages: (1) training (left side): the fraction of the genome remaining after masking of functional and repetitive regions is analyzed for k-mer and GC content. Repeats are also modeled, separated into interspersed (IR) and simple sequence repeats (SR); (2) generation (right side): a base sequence is generated using the k-mer and GC profiles. Artificially evolved repeats are then inserted into the base sequence to create a new artificial sequence.
Figure 3.
Multi-species PCA. We compared the GC-binned tetramer composition of orthologous sequences in human (hs), chimpanzee (pt), cow (bt), dog (cf), elephant (la), guinea pig (cp), marmoset (cj), horse (ec), mouse (mm), orangutan (pa), panda (am), pig (ss), rabbit (os), rat (rn) and rhesus (rm) for intergenic (blue) and intronic (red) regions.
Figure 4.
GESTALT comparison of artificial, real and shuffled sequences. (a) Artificial sequence, (b) Intergenic region chr4:104640972–104740972, (c) Intergenic region chr4:104640972–104740972 after dimer permutation. The elements in RepeatMasker are: LINEs in green, SINEs in red/pink, LTR, DNA transposable elements and others in brown.
Figure 5.
Principal component analysis of composition and complexity measures. We compare 100 sequences of 100 kb each for artificial sequences (blue), selected intergenic regions (green) and the dimer permutations of those intergenic sequences (red).
Figure 6.
Repeat identification benchmarks. (a) False positives expected in dimer-permuted sequences and synthetic sequences without repeats, (b) Alu tests, (c) MIR tests. For (b) and (c), each point represents the average sensitivity (Sn) and specificity (Sp). Colors represent the program (PC = PClouds, RM = RepeatMasker) and word size used for PClouds (8, 10, 12, 14, 16). The transparency of the color denotes the amount of repetitive sequence in each set (10%, 20%, … 90%) and the size of the point the average FPR. Each sequence set includes 10 artificial sequences of 100 kb each generated with k-mer size of 8 and window size of 1000 with a G+C content of 40–60%.
Figure 7.
Coding and transcribed size distribution of predicted genes. Predictions in artificial (blue), intergenic (green) and dimer permuted intergenic sequences (red) are compared with known genes (gray) for (a) Augustus, (b) Genscan, (c) Twinscan and (d) FEAST.
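
The two stages summarized in the Figure 2 caption (training a k-mer profile, then generating a base sequence and inserting evolved repeats) can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it omits the GC-binned windows and the trained repeat models, uses a plain Markov chain over (k-1)-mer contexts, and "evolves" a repeat by random substitution only. All function names and parameters here are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_kmer_model(sequence, k=4):
    """Training stage: count which base follows each (k-1)-mer context."""
    if k < 2:
        raise ValueError("k must be at least 2")
    model = defaultdict(Counter)
    for i in range(len(sequence) - k + 1):
        context = sequence[i:i + k - 1]
        next_base = sequence[i + k - 1]
        model[context][next_base] += 1
    return model


def generate_sequence(model, length, k=4, rng=None):
    """Generation stage: sample a base sequence from the k-mer profile."""
    rng = rng or random.Random()
    contexts = sorted(model)
    seq = list(rng.choice(contexts))  # seed with a context seen in training
    while len(seq) < length:
        context = "".join(seq[-(k - 1):])
        counts = model.get(context)
        if not counts:
            # Context never seen in training: borrow the distribution of a
            # random known context (a simplification for this sketch).
            counts = model[rng.choice(contexts)]
        bases = sorted(counts)
        weights = [counts[b] for b in bases]
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq[:length])


def mutate(repeat, rate, rng):
    """Crudely 'evolve' a repeat: each base is redrawn with probability rate.

    The redrawn base may equal the original, so the effective substitution
    rate is somewhat lower than `rate`.
    """
    return "".join(
        rng.choice("ACGT") if rng.random() < rate else base for base in repeat
    )


def insert_repeat(base_sequence, repeat, position):
    """Insert a repeat copy into the base sequence at the given position."""
    return base_sequence[:position] + repeat + base_sequence[position:]
```

A typical use would train on a masked intergenic sequence, generate a base sequence of the desired length, and then call insert_repeat with mutated copies of a repeat consensus at randomly chosen positions.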
