Multilevel heterogeneous omics data integration with kernel fusion

Haitao Yang; Hongyan Cao; Tao He; Tong Wang; Yuehua Cui

doi:10.1093/bib/bby115

Brief Bioinform. 2020 Jan 17;21(1):156-170. doi: 10.1093/bib/bby115.

Authors

Haitao Yang¹, Hongyan Cao², Tao He³, Tong Wang², Yuehua Cui²,⁴

Affiliations

¹ Department of Epidemiology and Health Statistics, School of Public Health, and Hebei Province Key Laboratory of Environment and Human Health, Hebei Medical University, Shijiazhuang, PR China.
² Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China.
³ Department of Mathematics, San Francisco State University, San Francisco, CA, USA.
⁴ Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA.

PMID: 30496340

Abstract

High-throughput omics data are now generated at an almost unlimited rate. It is increasingly important to integrate different omics data types to disentangle the molecular machinery of complex diseases, in the hope of improving disease prevention and treatment. Because the relationships among different omics data features are typically unknown, a supervised learning model that assumes a particular distribution with a specific structure will not capture the underlying complex relationship between multiple features and a disease phenotype. In this work, we briefly reviewed methods for kernel fusion (KF) based on support vector machine and kernel partial least squares (KPLS) algorithms. We then proposed a fused KPLS (fKPLS) model for disease classification and prediction with multilevel omics data. The fused kernel can handle effect heterogeneity, in which different omics data types contribute differently to the trait of interest, with the aim of improving prediction performance. We proposed optimizing the kernel parameters and kernel weights with a genetic algorithm (GA). Extensive simulations and real data analysis demonstrated that the proposed GA-fKPLS model can substantially improve disease classification performance by integrating multiple omics data types. With properly defined fitness functions during GA optimization, the proposed KF method can be extended to other kernel-based analyses, such as kernel association analysis with common or rare variants.
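
To make the idea of a fused kernel concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each omics data type gets its own Gaussian (RBF) kernel and that the fused kernel is a convex combination of those kernels; the abstract does not specify either choice. The function names `rbf_kernel` and `fused_kernel`, the simulated data blocks, and the bandwidth and weight values are all hypothetical. In the proposed method the kernel parameters and weights would be tuned by a GA, which is not shown here.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix between the rows (samples) of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def fused_kernel(blocks, gammas, weights):
    """Weighted sum of per-block RBF kernels (assumed fusion rule).

    Weights must be nonnegative and are rescaled to sum to 1, so the
    result is a convex combination of positive semidefinite kernels.
    """
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be nonnegative with a positive sum")
    w = w / w.sum()
    return sum(wi * rbf_kernel(X, g) for wi, X, g in zip(w, blocks, gammas))


# Simulated stand-ins for two omics data types measured on the same samples.
rng = np.random.default_rng(0)
n_samples = 20
expression = rng.normal(size=(n_samples, 50))
methylation = rng.normal(size=(n_samples, 30))

# Hypothetical bandwidths and weights; a GA would search over these values.
K = fused_kernel([expression, methylation], gammas=[0.01, 0.02], weights=[0.7, 0.3])
```

The resulting matrix `K` can be passed to any kernel method that accepts a precomputed kernel. Unequal weights are what allow one data type to contribute more than another to the prediction.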

Keywords: data fusion; genetic algorithm; kernel partial least squares; nonlinear classification; omics data integration.