BMC Bioinformatics. 2014 Jun 19;15:199.
doi: 10.1186/1471-2105-15-199.

Non-specific Filtering of Beta-Distributed Data

Xinhui Wang et al. BMC Bioinformatics. 2014.
Free PMC article

Abstract

Background: Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most varying are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation biases the selection of probes to those with mean values near 0.5. We explore the effect this has on clustering, and develop alternate filter methods that utilize a variance stabilizing transformation for Beta distributed data and do not share this bias.
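The mean-variance dependence described above can be illustrated with a small simulation. This is a minimal sketch, not the authors' method: it assumes a Beta distribution parameterized by mean `mu` and precision `phi` (so the variance is mu(1 - mu)/(phi + 1)), and it uses the arcsine square-root transform, a textbook variance-stabilizing transform for proportions, as a stand-in for the transformation the paper develops. The names `beta_sd`, `simulate_sds`, and the value of `PHI` are hypothetical.

```python
import math
import random
import statistics


def beta_sd(mu, phi):
    """Theoretical SD of a Beta variable with mean mu and precision phi."""
    return math.sqrt(mu * (1.0 - mu) / (phi + 1.0))


def simulate_sds(mu, phi, n=5000, seed=0):
    """Empirical SD of raw and arcsine-square-root-transformed Beta draws."""
    rng = random.Random(seed)
    a, b = mu * phi, (1.0 - mu) * phi  # shape parameters of the Beta
    draws = [rng.betavariate(a, b) for _ in range(n)]
    transformed = [math.asin(math.sqrt(x)) for x in draws]
    return statistics.stdev(draws), statistics.stdev(transformed)


PHI = 20.0  # hypothetical precision, identical for every simulated feature

for mu in (0.1, 0.3, 0.5, 0.7, 0.9):
    raw_sd, vst_sd = simulate_sds(mu, PHI)
    print(f"mean={mu:.1f}  theoretical SD={beta_sd(mu, PHI):.3f}  "
          f"raw SD={raw_sd:.3f}  transformed SD={vst_sd:.3f}")
```

With the precision held fixed, the raw SD peaks at a mean of 0.5 even though no feature is more "variable" than another in any biological sense, while the transformed SD stays roughly constant across means. That is the bias a standard deviation filter inherits.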

Results: We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found that for data sets having a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer, a novel filter statistic that utilized a variance-stabilizing transformation for Beta distributed data outperformed the common filter of using standard deviation of the DNA methylation proportion, or its log-transformed M-value, in its ability to detect the cancer subtype in a cluster analysis. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and standard deviation filter tended to favour features in different genome contexts; for the same data set, the novel filter always selected more features from CpG island promoters and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find sample subsets that overlapped for some real data sets.

Conclusions: We found two filter statistics that tended to prioritize features with different characteristics. Each performed well for identifying clusters of cancer and non-cancer tissue and for identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we suggest trying both filters on any new data set and evaluating the overlap of the features selected and the clusters discovered.
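The suggestion to try both filters and compare the selected features could be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's implementation: `sd_arcsine` uses the arcsine square-root transform as a generic variance-stabilized statistic and is not the authors' TM-GOF statistic, and the feature names, toy Beta values, and helper functions are all hypothetical.

```python
import math
import statistics


def sd_beta(values):
    """SD of the methylation proportions (Beta values)."""
    return statistics.stdev(values)


def sd_mvalue(values, eps=1e-6):
    """SD of M-values, log2(beta / (1 - beta)), clipped away from 0 and 1."""
    clipped = [min(max(v, eps), 1.0 - eps) for v in values]
    return statistics.stdev([math.log2(v / (1.0 - v)) for v in clipped])


def sd_arcsine(values):
    """SD after an arcsine square-root transform (assumed stand-in filter)."""
    return statistics.stdev([math.asin(math.sqrt(v)) for v in values])


def top_features(data, statistic, k):
    """Return the k feature names with the largest filter statistic."""
    ranked = sorted(data, key=lambda name: statistic(data[name]), reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Overlap of two feature sets: |A and B| / |A or B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy data: feature name -> Beta values across six samples (made up).
data = {
    "cg_mid": [0.40, 0.45, 0.50, 0.55, 0.60, 0.50],        # varies around 0.5
    "cg_island": [0.02, 0.03, 0.02, 0.03, 0.02, 0.15],     # one sample gains methylation
    "cg_flat_low": [0.02, 0.03, 0.02, 0.03, 0.02, 0.03],
    "cg_flat_high": [0.95, 0.96, 0.95, 0.96, 0.95, 0.96],
}

by_sd = top_features(data, sd_beta, k=1)
by_vst = top_features(data, sd_arcsine, k=1)
print("SD filter selects:", by_sd)
print("Transformed filter selects:", by_vst)
print("Jaccard overlap:", jaccard(by_sd, by_vst))
```

In this toy example the two filters select different features: the SD filter favours the feature varying around 0.5, and the transformed filter favours the normally unmethylated feature with one abnormal sample. On real data the same `top_features` and `jaccard` helpers would be applied with a larger `k`, and the resulting clusterings compared as well.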

Figures

Figure 1
Smoothed scatter plots showing six filter statistics vs. the mean DNA methylation (Beta) value (22198 features, 26 colon cancer samples). A. SD-b: standard deviation of Beta values; B. SD-m: standard deviation of M-values; C. 1/Precision: inverse of precision parameter; D. BQ-GOF: Beta Quantile Goodness-Of-Fit; E. TM-GOF: Transformed Moment Goodness-Of-Fit; F. TQ-GOF: Transformed Quantile Goodness-Of-Fit. Red line in each figure indicates the median statistic values.
Figure 2
ROC curves for 7 filtering methods (2 groups, 200 informative features out of 2000, 200 samples, 100 simulated data sets). For each data set the sensitivity and specificity of selecting informative features using the top ranked list (1–2000 features) are averaged over 100 replications. Panels A-C show ROC curves for the 7 filtering methods: SD-b, Precision, SD-m, BQ-GOF, TM-GOF, TQ-GOF, and BR (best rank) under different sample ratio scenarios: A. Sample size ratio 9:1 (non-CIMP/CIMP); B. Sample size ratio 1:1; C. Sample size ratio 1:9. The bottom three panels D-F are partial ROC curves obtained from panels A-C by restricting the axis ranges to the region relevant to the diagonal line. The solid black diagonal line in panels D-F indicates the estimated sensitivity and specificity levels for a list of 100 genes.
Figure 3
Misclassification rates of RPMM cluster analysis using top filtered features for 7 filtering methods (2 groups, 200 informative features out of 2000, 200 samples, 100 simulated data sets). Misclassification rates, averaged over 100 simulations, from a cluster analysis performed using RPMM on the top 100, 200, or 400 features of seven filtering methods under different sample size ratios. A*. Sample size ratio 9:1 (non-CIMP/CIMP); B*. Sample size ratio 1:9; C* and D**. Sample size ratio 1:1. For A*-C*, informative features have effective size smaller than 1; for D**, informative features have effective size smaller than 0.5.
Figure 4
Heatmaps of RPMM cluster analysis using top 1000 filtered features by A) TM-GOF or B) SD-b methods using 26 colon cancer samples (data set #1). Rows represent features and columns represent samples; yellow represents high DNA methylation and blue represents low. The color bars at the top of the columns indicate sample tissue types (row 1) and clusters (row 2). In row 1 dark and light green indicate CIMP and non-CIMP tumors, respectively. In row 2 red, yellow, blue and green bars indicate the sample clusters found after two divisions of clustering using RPMM. In Figure A, the red and yellow clusters are identified at the second division, and no subdivision of the blue cluster is found. In Figure B, the red and yellow clusters separate in the second division, as do the blue and green clusters.
Figure 5
Heatmaps of RPMM cluster analysis using top 1000 filtered features by TM-GOF (A,C) or SD-b (B,D) methods using 95 kidney cancer and non-cancer samples (data set #4). Rows represent features and columns represent samples; yellow represents high DNA methylation and blue represents low. The color bars at the top of the columns indicate sample tissue types (row 1) and clusters (row 2). In row 1 dark and light green indicate cancer and non-cancer samples, respectively. In row 2 red, yellow, blue and green bars indicate the sample clusters found after two divisions of clustering using RPMM. Figures A and B show heatmaps of all 95 kidney samples using the top 1000 features filtered by the TM-GOF or SD-b method, respectively. Figures C and D show heatmaps of 50 kidney tumors using the top 1000 features filtered by the TM-GOF or SD-b method, respectively. In C, the blue and green bar clusters are found at the second separation.
