Algorithms Mol Biol. 2012 Apr 5;7(1):5.
doi: 10.1186/1748-7188-7-5.

A normalization strategy for comparing tag count data

Koji Kadota et al. Algorithms Mol Biol. 2012.

Abstract

Background: High-throughput sequencing, such as ribonucleic acid sequencing (RNA-seq) and chromatin immunoprecipitation sequencing (ChIP-seq) analyses, enables various features of organisms to be compared through tag counts. Recent studies have demonstrated that the normalization step for RNA-seq data is critical for a more accurate subsequent analysis of differential gene expression. Development of a more robust normalization method is desirable for identifying the true difference in tag count data.

Results: We describe a strategy for normalizing tag count data, focusing on RNA-seq. The key concept is to remove data assigned as potential differentially expressed genes (DEGs) before calculating the normalization factor. Several R packages for identifying DEGs are currently available, and each package uses its own normalization method and gene ranking algorithm. We compared a total of eight package combinations: four R packages (edgeR, DESeq, baySeq, and NBPSeq), each with its default normalization settings and with our normalization strategy. Many synthetic datasets under various scenarios were evaluated on the basis of the area under the curve (AUC) as a measure of both sensitivity and specificity. We found that packages using our strategy in the data normalization step performed well overall. This result was also observed for a real experimental dataset.

Conclusion: Our results showed that the elimination of potential DEGs is essential for more accurate normalization of RNA-seq data. The concept of this normalization strategy can be widely applied to other types of tag count data and to microarray data.
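
The key concept described in the abstract, removing potential DEGs before computing the normalization factor, can be illustrated with a minimal Python sketch. This is not the TbT pipeline from the paper: the function names `log_ratio` and `deg_elimination_sizes`, the `flag_fraction` parameter, the pseudocount, and the use of the absolute log-ratio as a crude stand-in for a package's DEG ranking are all assumptions made for illustration.

```python
import math


def log_ratio(a, b, size_a, size_b, pseudo=0.5):
    """log2 ratio of size-scaled counts; the pseudocount (an assumption) avoids log(0)."""
    return math.log2((a + pseudo) / size_a) - math.log2((b + pseudo) / size_b)


def deg_elimination_sizes(counts_a, counts_b, flag_fraction=0.2):
    """Return (size_a, size_b, kept): effective library sizes computed only
    from genes that were NOT flagged as potential DEGs."""
    n = len(counts_a)

    # Step 1: provisional normalization using the total library sizes.
    total_a, total_b = sum(counts_a), sum(counts_b)

    # Step 2: flag the genes with the largest absolute log-ratio as potential
    # DEGs (hypothetical stand-in for a real statistical DEG test).
    scores = [abs(log_ratio(a, b, total_a, total_b))
              for a, b in zip(counts_a, counts_b)]
    n_flag = int(round(flag_fraction * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_flag])
    kept = [i for i in range(n) if i not in flagged]

    # Step 3: recompute the library sizes from the remaining genes only.
    size_a = sum(counts_a[i] for i in kept)
    size_b = sum(counts_b[i] for i in kept)
    return size_a, size_b, kept


# Toy data: genes 0 and 1 are four-fold higher in sample A; the rest are unchanged.
counts_a = [400, 400] + [100] * 8
counts_b = [100] * 10

size_a, size_b, kept = deg_elimination_sizes(counts_a, counts_b)
print(size_a, size_b)  # 800 800
print(kept)            # [2, 3, 4, 5, 6, 7, 8, 9]
```

In this toy case the raw totals (1600 versus 1000) would make every unchanged gene look lower in sample A, whereas the sizes computed after eliminating the flagged genes are equal, so unchanged genes receive a log-ratio of zero.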

Figures

Figure 1
Outline of TbT normalization strategy. Left panel: M-A plot for negative binomially distributed simulation data from Ref. [21], after scaling to reads per million (RPM) mapped reads in each sample. Magenta and black dots indicate DEGs (20% of all genes; PDEG = 20%) and non-DEGs (80%), respectively. Of all DEGs, 90% are four-fold higher in Sample A than in Sample B (PA = 90%). Each dot represents a gene. Right panel: same plot but colored differently. TbT estimates PDEG to be 16.8% for these data. Gray dots indicate genes estimated as non-DEGs by step 2 in TbT. Note that the median log-ratio for true non-DEGs when data normalization is performed using the TbT normalization factors (0.045) is closer to zero than that obtained using the TMM normalization factors (0.170).
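
The quantities plotted in an M-A plot, and the median log-ratio of non-DEGs used above as a quality check, can be sketched as follows. This is a minimal illustration assuming the standard definitions M = log2(x_A) - log2(x_B) and A = (log2(x_A) + log2(x_B)) / 2 on RPM-scaled counts; the helper names `rpm`, `ma_values`, and `median_log_ratio` are hypothetical, and the toy counts are not data from the paper.

```python
import math
from statistics import median


def rpm(counts):
    """Scale raw counts to reads per million mapped reads."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


def ma_values(counts_a, counts_b):
    """Return {gene_index: (M, A)} for genes with nonzero counts in both samples."""
    values = {}
    for i, (xa, xb) in enumerate(zip(rpm(counts_a), rpm(counts_b))):
        if xa > 0 and xb > 0:  # log2 is undefined for zero counts
            la, lb = math.log2(xa), math.log2(xb)
            values[i] = (la - lb, (la + lb) / 2)
    return values


def median_log_ratio(counts_a, counts_b, non_deg_indices):
    """Median M over the given non-DEG genes; values near zero suggest good normalization."""
    values = ma_values(counts_a, counts_b)
    return median(values[i][0] for i in non_deg_indices if i in values)


# Toy data: gene 0 is four-fold higher in sample A, genes 1-4 are unchanged.
counts_a = [400, 100, 100, 100, 100]
counts_b = [100, 100, 100, 100, 100]
print(median_log_ratio(counts_a, counts_b, [1, 2, 3, 4]))
```

With plain RPM scaling the unchanged genes in this toy example get a median log-ratio of log2(0.625), which is below zero because the single up-regulated gene inflates the library size of sample A. This is the kind of bias that the median log-ratio of true non-DEGs makes visible.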
Figure 2
Distributions of AUC values for two edgeR-related combinations. Simulation results for 100 trials under PA = (a) 50%, (b) 70%, and (c) 90%, with PDEG = 20%. Left panel: box plots for AUC values. Right panel: scatter plots for AUC values. When the two combinations perform identically, all the points lie on the black (y = x) line. A point below (or above) the black line indicates that the AUC value from the edgeR/TbT combination is higher (or lower) than that from the edgeR/default combination.
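
For readers unfamiliar with the AUC values compared here, the following is a minimal sketch of how an AUC can be computed from a gene ranking and known DEG labels, using the rank-based (Mann-Whitney) formulation. The function name `auc` and the toy scores are assumptions for illustration; this is not the evaluation code used in the paper.

```python
def auc(scores, is_deg):
    """Area under the ROC curve: the probability that a randomly chosen true DEG
    receives a higher score than a randomly chosen non-DEG (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, is_deg) if y]
    neg = [s for s, y in zip(scores, is_deg) if not y]
    if not pos or not neg:
        raise ValueError("need at least one DEG and one non-DEG")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Higher score = stronger evidence of differential expression.
scores = [0.9, 0.8, 0.3, 0.1]
is_deg = [True, False, True, False]
print(auc(scores, is_deg))  # 0.75
```

An AUC of 1.0 means every true DEG is ranked above every non-DEG, while 0.5 corresponds to a random ranking.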
Figure 3
Results of iterative TbT approach. (a) Procedure for the iterative TbT approach up to the third iteration, and simulation results under PA = (b) 50%, (c) 70%, and (d) 90%, with PDEG = 20%. Left panel: accuracy of DEG identification when step 2 in our DEG elimination strategy is performed using the following normalization factors: TMM (Default), TbT (First), TbT1 (Second), and TbT2 (Third). Right panel: AUC values when the following normalization factors are combined with the edgeR package: TbT (Default), TbT1 (First), TbT2 (Second), and TbT3 (Third).
Figure 4
Results for real data. (a) Number of tasRNA-associated sRNAs (i.e., provisional true discoveries) for given numbers of top-ranked sRNAs obtained from individual combinations. Combinations of individual R packages with TbT and default normalization methods are indicated by dashed and solid lines, respectively. For easy comparison with the previous study, results of DEGseq with the same parameter settings as in the previous study are also shown (solid yellow line). (b) Full ROC plots. The plots on the left side (roughly the [0.00, 0.05] region on the x-axis) are essentially the same as those shown in Figure 4a. The R code for producing Figure 4 is available in Additional file 5.

References

    1. Weber AP, Weber KL, Carr K, Wilkerson C, Ohlrogge JB. Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol. 2007;144(1):32–42. doi: 10.1104/pp.107.096677.
    2. Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24(3):133–141. doi: 10.1016/j.tig.2007.12.007.
    3. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270(5235):467–470. doi: 10.1126/science.270.5235.467.
    4. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996;14(13):1675–1680. doi: 10.1038/nbt1296-1675.
    5. Asmann YW, Klee EW, Thompson EA, Perez EA, Middha S, Oberg AL, Therneau TM, Smith DI, Poland GA, Wieben ED, Kocher JP. 3' tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics. 2009;10:531. doi: 10.1186/1471-2164-10-531.