BioData Min. 2017 Dec 19;10:39.
doi: 10.1186/s13040-017-0159-z. eCollection 2017.

Sparse generalized linear model with L0 approximation for feature selection and prediction with big omics data

Zhenqiu Liu et al. BioData Min. 2017.

Abstract

Background: Feature selection and prediction are the most important tasks for big data mining. Common feature selection strategies in big data mining rely on the L1, SCAD, and MC+ penalties. However, none of the existing algorithms optimizes the L0 penalty, which penalizes the number of nonzero features directly.
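For orientation, the contrast between the L0 and L1 penalties can be written out as below. This is a sketch in assumed notation: the symbols \ell (the GLM log-likelihood), \beta (the coefficient vector), and \lambda (the tuning parameter) are not defined in the abstract and are introduced here only for illustration.

```latex
% L0-penalized fit: the penalty counts the nonzero coefficients
\min_{\beta}\; -\ell(\beta) + \lambda \,\|\beta\|_0,
\qquad \|\beta\|_0 = \#\{\, j : \beta_j \neq 0 \,\}

% L1-penalized fit (lasso): the penalty sums absolute values instead
\min_{\beta}\; -\ell(\beta) + \lambda \,\|\beta\|_1,
\qquad \|\beta\|_1 = \sum_{j} |\beta_j|
```

The L0 count is discontinuous in \beta, which makes the first problem non-convex, whereas the L1 norm is a convex surrogate for it.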

Results: In this paper, we develop a novel sparse generalized linear model (GLM) with L0 approximation for feature selection and prediction with big omics data. The proposed approach approximates the L0 optimization directly. Even though the original L0 problem is non-convex, the proposed algorithm approximates it by a sequence of convex optimizations. The method is easy to implement, requiring only several lines of code. Novel adaptive ridge algorithms (L0ADRIDGE) are developed for L0-penalized GLMs with ultra-high-dimensional big data. In simulations, the proposed approach outperforms other cutting-edge regularization methods, including SCAD and MC+. When it is applied to an integrated analysis of mRNA, microRNA, and methylation data from TCGA ovarian cancer, multilevel gene signatures associated with suboptimal debulking are identified simultaneously. The biological significance and potential clinical importance of those genes are further explored.
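To make the idea of approximating the L0 problem by sequential convex optimizations concrete, here is a minimal sketch of a generic adaptive ridge (iteratively reweighted ridge) scheme for the Gaussian linear model, the simplest GLM. It is not the authors' L0ADRIDGE MATLAB code: the function names `solve` and `adaptive_ridge_l0`, the parameters `lam`, `eps`, `n_iter`, and `tol`, and the simulated data are all assumptions made for illustration. Each pass solves a convex ridge problem whose per-coefficient weights 1 / (beta_j^2 + eps) make the quadratic penalty behave roughly like a count of nonzero coefficients.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def adaptive_ridge_l0(X, y, lam=1.0, eps=1e-8, n_iter=50, tol=1e-4):
    """Hypothetical helper: approximate an L0-penalized least-squares fit
    by repeatedly solving weighted ridge problems (each one is convex)."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    w = [1.0] * p  # first pass is an ordinary ridge fit
    beta = [0.0] * p
    for _ in range(n_iter):
        # Normal equations of: ||y - X b||^2 + lam * sum_j w_j * b_j^2
        A = [[XtX[a][b] + (lam * w[a] if a == b else 0.0) for b in range(p)]
             for a in range(p)]
        beta = solve(A, Xty)
        # Reweight so that w_j * b_j^2 is close to 1 when b_j is nonzero
        w = [1.0 / (bj * bj + eps) for bj in beta]
    # Coefficients driven below tol are reported as exact zeros
    return [bj if abs(bj) > tol else 0.0 for bj in beta]


# Simulated example (assumed data): 5 features, only 2 carry signal
random.seed(0)
true_beta = [2.0, 0.0, 0.0, -3.0, 0.0]
X = [[random.gauss(0, 1) for _ in true_beta] for _ in range(40)]
y = [sum(x * b for x, b in zip(row, true_beta)) + random.gauss(0, 0.1)
     for row in X]
beta_hat = adaptive_ridge_l0(X, y, lam=1.0)
print(beta_hat)
```

In this toy setting the weights on the three noise features grow at every pass until their coefficients fall below `tol`, while the two signal coefficients are only mildly shrunk. A practical implementation for high-dimensional omics data would use a numerical linear algebra library and a GLM likelihood rather than the pure-Python least-squares solver used here.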

Conclusions: The L0ADRIDGE software, developed in MATLAB, is available at https://github.com/liuzqx/L0adridge.

Keywords: Big data mining; Classification; GLM; L0 penalty; Multi-omics data; Sparse modeling; Suboptimal debulking.

Conflict of interest statement

Not applicable. The authors declare that they have no competing interests. Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Gene signatures associated with suboptimal debulking. Nodes: red, mRNA signatures; green, microRNA signatures; pink, methylation signatures. Edges: red, positive partial correlation; blue, negative partial correlation.
Fig. 2
Predictive AUCs for integrated data, mRNA expression only, microRNA expression only, and methylation only.
