Nucleic Acids Res. 2018 May 18;46(9):e54.
doi: 10.1093/nar/gky102.

Improving the Value of Public RNA-seq Expression Data by Phenotype Prediction

Free PMC article

Shannon E Ellis et al. Nucleic Acids Res.

Abstract

Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements, using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package, and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data are available for use on a scale that was not previously feasible.
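The train-then-predict workflow described in the abstract (and summarized in the Figure 2 caption) can be illustrated with a minimal sketch. This is not the phenopredict implementation: the nearest-centroid classifier is a simple stand-in chosen for brevity, the function names are hypothetical, and the toy expression values are made up. Only the overall shape (random split of labelled data, an accuracy check against the 85% threshold, evaluation on the held-out half, then prediction for unlabelled samples) follows the text.

```python
import random
from collections import defaultdict

ACCURACY_THRESHOLD = 0.85  # the >=85% cut-off quoted in the Figure 2 caption


def split_in_half(samples, seed=0):
    """Randomly divide labelled samples into a build half and a held-out half."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def build_predictor(samples):
    """Stand-in classifier (assumption): one mean expression vector per label."""
    sums, counts = {}, defaultdict(int)
    for expr, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(expr)
        sums[label] = [s + x for s, x in zip(sums[label], expr)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def predict(centroids, expr):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, expr))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the reported label."""
    hits = sum(predict(centroids, expr) == label for expr, label in samples)
    return hits / len(samples)


def phenotype_workflow(labelled, unlabelled, seed=0):
    """Build on one half, check accuracy, test on the other half, then predict."""
    build, held_out = split_in_half(labelled, seed)
    centroids = build_predictor(build)
    if accuracy(centroids, build) < ACCURACY_THRESHOLD:
        raise ValueError("predictor not accurate enough on the build half")
    held_out_acc = accuracy(centroids, held_out)
    predictions = [predict(centroids, expr) for expr in unlabelled]
    return held_out_acc, predictions


# Toy data (made up): two hypothetical "genes" that separate the labels cleanly.
labelled = [([10.0 + i % 3, 0.0], "male") for i in range(20)] + \
           [([0.0, 10.0 + i % 3], "female") for i in range(20)]
unlabelled = [[11.0, 0.5], [0.5, 9.0]]

held_out_acc, predictions = phenotype_workflow(labelled, unlabelled)
print(held_out_acc, predictions)
```

Keeping the held-out half untouched until the predictor has passed the first accuracy check means the second accuracy estimate is not biased by the samples used to build the predictor.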

Figures

Figure 1.
Missing phenotype information. (A) Phenotype information is critical to answer questions about biology using expression data. (B) This critical information is missing for many samples within the SRA (red boxes). Note that sample phenotype information begins with the 6,620th row, as this is the first row in the dataset for which sex and tissue are available for the same sample. (C) Missingness is limited within the GTEx data. Expression data from samples with accompanying phenotype information are used to build the predictors. ERs = expressed regions.
Figure 2.
General approach to phenotype prediction. To predict phenotype information, the training data are first randomly divided in half and the predictor is built on one half. Accuracy is first tested in the half used to build the predictor. Upon achieving sufficient accuracy (≥85%), the predictor is tested in the remaining half of the training data set. Phenotypes can then be predicted across all samples in recount2.
Figure 3.
Prediction accuracy. Predictors for critical phenotype information were built from expression data available in recount2 for (A) sex, (B) tissue, (C) sequencing strategy and (D) sample source. Samples for which reported phenotype information is available were used to determine prediction accuracy. GTEx data are in purple, TCGA in pink, and SRA in teal.
Figure 4.
Predicted sex across the SRA. Plots summarize predicted sex across the SRA, showing (A) the distribution of predicted sex across SRA samples, and (B) the distribution of project type, broken down by the predicted sex of samples in each project.
Figure 5.
Differential gene expression analysis (DGEA). (A) Number of genes reported significant in Kim et al. (23) and in the analyses carried out here using their data obtained from recount2. (B, C) Concordance at top (CAT) plots (29) comparing DGEA. The number of genes concordant between analyses is plotted, where perfect agreement between the analyses' results would fall along the 45-degree line (gray). DGEA where no covariates were included for analysis (x-axis) was compared to (B) DGEA with sex included as a covariate and (C) DGEA with both sex and tissue included as covariates. NC = normal colonic tissue; PC = primary colorectal cancer; MC = metastatic cancer (liver).
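The concordance-at-top comparison named in the Figure 5 caption can be sketched as follows. This is a minimal illustration, assuming each analysis yields a list of genes ranked from most to least significant; the function name `cat_curve` and the gene identifiers are hypothetical, and the published plots may be computed differently (for example, as proportions rather than counts).

```python
def cat_curve(ranking_a, ranking_b, max_k=None):
    """For each k = 1..max_k, count the genes shared by the top k of both rankings."""
    if max_k is None:
        max_k = min(len(ranking_a), len(ranking_b))
    return [
        len(set(ranking_a[:k]) & set(ranking_b[:k]))
        for k in range(1, max_k + 1)
    ]


# Hypothetical gene rankings from two analyses of the same data.
no_covariates = ["G1", "G2", "G3", "G4", "G5"]
with_sex = ["G1", "G3", "G2", "G5", "G4"]

curve = cat_curve(no_covariates, with_sex)
print(curve)  # [1, 1, 3, 3, 5]

# Identical rankings give counts equal to k, i.e. the 45-degree line.
print(cat_curve(no_covariates, no_covariates))  # [1, 2, 3, 4, 5]
```

Plotting the counts against k shows how far two analyses diverge among their top-ranked genes: the further the curve falls below the 45-degree line, the less the rankings agree.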

References

    1. Lister R., O’Malley R.C., Tonti-Filippini J., Gregory B.D., Berry C.C., Millar A.H., Ecker J.R. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell. 2008; 133:523–536.
    2. Nagalakshmi U., Wang Z., Waern K., Shou C., Raha D., Gerstein M., Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008; 320:1344–1349.
    3. Mortazavi A., Williams B.A., McCue K., Schaeffer L., Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods. 2008; 5:621–628.
    4. Leinonen R., Sugawara H., Shumway M. The sequence read archive. Nucleic Acids Res. 2011; 39:D19–D21.
    5. Eswaran J., Horvath A., Godbole S., Reddy S.D., Mudvari P., Ohshiro K., Cyanam D., Nair S., Fuqua S.A.W., Polyak K. et al. RNA sequencing of cancer reveals novel splicing alterations. Scientific Rep. 2013; 3:1689.
