BMC Bioinformatics. 2015;16 Suppl 5(Suppl 5):S10.
doi: 10.1186/1471-2105-16-S5-S10. Epub 2015 Mar 18.

Using Epigenomics Data to Predict Gene Expression in Lung Cancer

Free PMC article

Jeffery Li et al. BMC Bioinformatics.

Abstract

Background: Epigenetic alterations are known to correlate with changes in gene expression in various diseases, including cancers. However, quantitative models that accurately predict the up- or down-regulation of gene expression are currently lacking.

Methods: A new machine learning-based method for predicting gene expression is developed in the context of lung cancer. The method uses Illumina Infinium HumanMethylation450K BeadChip CpG methylation array data from paired lung cancer and adjacent normal tissues in The Cancer Genome Atlas (TCGA), together with histone modification marker ChIP-Seq data from the ENCODE project, to predict differential expression in RNA-Seq data from TCGA lung cancers. It considers a comprehensive list of 1424 features spanning four categories: CpG methylation, histone H3 methylation modification, nucleotide composition, and conservation. Various feature selection and classification methods are compared, and the best model is selected by 10-fold cross-validation on the training data set.
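The model-selection step described in the Methods (score each candidate by AUC under 10-fold cross-validation on the training data) can be sketched as follows. This is a minimal, standard-library illustration and not the authors' pipeline: the helper names `auc`, `k_fold_indices`, `cross_validated_auc`, and the toy classifier `fit_centroid` are all hypothetical, and the fold assignment is a plain shuffle rather than whatever scheme the paper used.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels are 0/1; a higher score means "more likely class 1".
    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_auc(X, y, fit, k=10, seed=0):
    """Pool held-out scores over k folds and compute a single AUC.

    fit(X_train, y_train) must return a function mapping one row to a score.
    """
    held_out = [None] * len(y)
    for fold in k_fold_indices(len(y), k, seed):
        test = set(fold)
        X_tr = [x for i, x in enumerate(X) if i not in test]
        y_tr = [t for i, t in enumerate(y) if i not in test]
        score = fit(X_tr, y_tr)
        for i in fold:
            held_out[i] = score(X[i])
    return auc(y, held_out)


def fit_centroid(X_train, y_train):
    """Toy stand-in classifier (assumption): project onto the class-mean difference."""
    dims = len(X_train[0])

    def mean(cls):
        rows = [x for x, t in zip(X_train, y_train) if t == cls]
        return [sum(r[j] for r in rows) / len(rows) for j in range(dims)]

    w = [a - b for a, b in zip(mean(1), mean(0))]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))
```

To compare candidate models in this style, one would call `cross_validated_auc` once per feature-selection and classifier combination with the same `seed`, so that every candidate is scored on identical folds, and keep the combination with the highest AUC.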

Results: The best model, comprising 67 features, is obtained with ReliefF-based feature selection and random forest classification, with AUC = 0.864 from 10-fold cross-validation on the training set and AUC = 0.836 on the testing set. The selected features cover all four data types, with histone H3 methylation modification (32 features) and CpG methylation (15 features) being the most abundant. In the drop-off tests of individual data-type feature sets, removing the CpG methylation features leads to the largest reduction in model performance. In the best model, 19 selected features come from the promoter regions (TSS200 and TSS1500), the most among all locations relative to transcripts. Sequential drop-off of CpG methylation features by region on the protein-coding transcripts shows that promoter regions contribute most to the accurate prediction of gene expression.
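To give a feel for how Relief-style feature selection ranks features, here is a minimal sketch of the basic Relief algorithm for two classes and numeric features. It is deliberately simpler than ReliefF, which averages over k nearest hits and misses and handles multiple classes and missing values; the abstract does not state which implementation or parameters were used, so the function names `relief_weights` and `top_features` and all defaults here are assumptions for illustration only.

```python
import random


def relief_weights(X, y, n_iter=None, seed=0):
    """Basic Relief (not ReliefF) for two classes with numeric features.

    For each sampled row, find its nearest same-class neighbour (hit) and
    nearest other-class neighbour (miss) under Manhattan distance on
    range-scaled features. A feature gains weight when it differs on the
    miss and loses weight when it differs on the hit.
    """
    n, d = len(X), len(X[0])
    lo = [min(r[j] for r in X) for j in range(d)]
    hi = [max(r[j] for r in X) for j in range(d)]
    span = [(h - l) or 1.0 for l, h in zip(lo, hi)]  # constant features: avoid /0

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def nearest(i, same_class):
        best, best_dist = None, float("inf")
        for k in range(n):
            if k == i or (y[k] == y[i]) != same_class:
                continue
            dist = sum(diff(X[i], X[k], j) for j in range(d))
            if dist < best_dist:
                best, best_dist = k, dist
        return best

    rng = random.Random(seed)
    m = n_iter or n  # number of sampled rows; defaults to the data set size
    w = [0.0] * d
    for _ in range(m):
        i = rng.randrange(n)
        hit, miss = nearest(i, True), nearest(i, False)
        if hit is None or miss is None:
            continue  # a class with a single member has no hit
        for j in range(d):
            w[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / m
    return w


def top_features(weights, k):
    """Indices of the k highest-weighted features, best first."""
    return sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)[:k]
```

In a selection step of the kind the abstract describes, the weights would be computed on the training data only, and the classifier would then be trained on the top-ranked columns (67 in the reported best model).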

Conclusions: By considering a comprehensive list of epigenomic and genomic features, we have constructed an accurate model to predict transcriptomic differential expression, exemplified in lung cancer.

Figures

Figure 1
Segments associated with protein-coding genes. Features considered for predicting differential gene expression are depicted segment by segment. Segments are defined by the annotations of the Illumina Infinium HumanMethylation450K BeadChip array, with additional segmentation within gene bodies. From the 5' to the 3' end of a protein-coding gene, the segments are: the regions up to 1500 bp (TSS1500) and up to 200 bp (TSS200) upstream of the transcription start site (TSS), the first exon (which may include the 5' UTR), the first intron, the exon body, the last intron, and the last exon (which may include the 3' UTR). The full transcript region comprises the UTRs and the coding region together.
Figure 2
Performance comparison of models with various feature selection and classification methods. The area under the ROC curve (AUC) is used as the metric to compare models built from different combinations of feature selection (CFS, Gain Ratio, and ReliefF) and classification (Gaussian SVM, linear SVM, logistic regression, naïve Bayes, and random forest) methods, on the training data with 10-fold cross-validation. The model with ReliefF-based feature selection and random forest classification is selected as the best model.
Figure 3
Top fifteen features from the best model. (a) Clustering of the absolute values of Pearson's correlation coefficients for the 67 features selected by the best model. Feature names are colored by feature type. Note: the length of a segment is listed separately. (b) The top fifteen features selected by ReliefF feature selection, sorted by their correlation with the classification of differential gene expression.
Figure 4
Evaluation of features generated from various data types. (a-b) Effects of feature set drop-off on ROC curves from 10-fold cross-validation on the training set (a) and from the testing set (b). (c) Effects of feature set drop-off on four metrics (AUC, accuracy, F-measure, and MCC) in the training set and testing set.
Figure 5
Evaluation of methylation features by segment. (a-b) Effects of sequential drop-off of segment-based methylation feature sets on ROC curves from 10-fold cross-validation on the training set (a) and from the testing set (b). (c) Effects of sequential drop-off of segment-based methylation feature sets on four metrics (AUC, accuracy, F-measure, and MCC) in the training set and testing set.
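
Figures 4 and 5 report accuracy, F-measure, and MCC alongside AUC. As a reminder of how the three threshold-based metrics follow from a binary confusion matrix, here is a short sketch using their standard definitions. The function names are hypothetical, and the convention of returning 0.0 when a denominator is zero is an assumption of this sketch, not something stated in the source.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_measure(tp, tn, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike AUC, these three metrics depend on the decision threshold used to turn classifier scores into 0/1 predictions, which is one reason to report them together with the ROC-based measure.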
