Review
iScience. 2022 Jan 22;25(2):103798.
doi: 10.1016/j.isci.2022.103798. eCollection 2022 Feb 18.

Machine learning for multi-omics data integration in cancer

Free PMC article
Zhaoxiang Cai et al. iScience.

Abstract

Multi-omics data analysis is an important aspect of cancer molecular biology studies and has led to ground-breaking discoveries. Many efforts have been made to develop machine learning methods that automatically integrate omics data. Here, we review machine learning tools categorized as either general-purpose or task-specific, covering both supervised and unsupervised learning for integrative analysis of multi-omics data. We benchmark the performance of five machine learning approaches using data from the Cancer Cell Line Encyclopedia, reporting accuracy on cancer type classification and mean absolute error on drug response prediction, and evaluating runtime efficiency. This review provides recommendations to researchers regarding suitable machine learning method selection for their specific applications. It should also promote the development of novel machine learning methodologies for data integration, which will be essential for drug discovery, clinical trial design, and personalized treatments.

Keywords: machine learning; omics; systems biology.

Conflict of interest statement

JL has received grant funding from AstraZeneca for research unrelated to the current work.

Figures

Graphical abstract
Figure 1
Growth of publications in omics. Line charts showing the number of articles published in PubMed in each year from 1995 to 2020, colored by omics type. The y axis is plotted on a log scale. The search terms used are “genomics,” “epigenomics,” “transcriptomics,” “proteomics,” and “multi-omics.”
Figure 2
Illustration of early, middle, and late integration for merging data matrices generated by different omics. In early integration, features from different data matrices are concatenated. Middle integration uses machine learning models to consolidate data without concatenating features or merging results. In late integration, each omics layer is analyzed independently, and results are combined at the end.
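As a rough illustration of the distinction drawn in Figure 2, the Python sketch below contrasts early integration (feature concatenation) with late integration (one model per omics layer, with results combined at the end). The toy matrices, the nearest-centroid classifier, the majority vote, and all function names are assumptions made for illustration; none of them is taken from the review or from the tools it benchmarks. Middle integration is not shown because it depends on the specific joint model used.

```python
from statistics import mean

# Hypothetical toy data: rows are samples, columns are features.
transcriptomics = [[1.0, 2.0], [1.5, 1.8], [8.0, 9.0], [7.5, 9.5]]
proteomics = [[0.2], [0.1], [0.9], [0.8]]
labels = ["A", "A", "B", "B"]


def early_integration(*matrices):
    """Concatenate each sample's feature vectors across omics layers."""
    return [sum((list(row) for row in rows), []) for rows in zip(*matrices)]


def fit_centroids(matrix, labels):
    """Return the per-class mean feature vector (a nearest-centroid model)."""
    centroids = {}
    for cls in sorted(set(labels)):
        rows = [r for r, y in zip(matrix, labels) if y == cls]
        centroids[cls] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, sample):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    dist = {
        cls: sum((a - b) ** 2 for a, b in zip(sample, c))
        for cls, c in centroids.items()
    }
    return min(dist, key=dist.get)


def late_integration_predict(models, samples):
    """Predict per omics layer, then take a majority vote (ties go to the earliest layer)."""
    votes = [predict(m, s) for m, s in zip(models, samples)]
    return max(votes, key=votes.count)


# Early integration: one model trained on the concatenated matrix.
early = early_integration(transcriptomics, proteomics)
early_model = fit_centroids(early, labels)

# Late integration: one model per layer, combined only at prediction time.
late_models = [fit_centroids(transcriptomics, labels), fit_centroids(proteomics, labels)]

new_rna, new_protein = [7.8, 9.1], [0.85]  # hypothetical unseen sample
early_pred = predict(early_model, new_rna + new_protein)
late_pred = late_integration_predict(late_models, [new_rna, new_protein])
print(early_pred, late_pred)
```

In this toy case both strategies agree. In practice the choice matters because early integration lets large omics layers dominate the concatenated feature space, whereas late integration cannot exploit interactions between layers.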
Figure 3
Unique contribution of this review. First, we balance biological and technical content, covering topics from genomics to proteomics and from machine learning to multi-omics integration tools. Second, we propose a new classification that divides the reviewed tools into two categories, general-purpose and task-specific, and then review these tools for four types of applications in biomedical sciences. Third, we provide an independent benchmarking analysis to compare integration methods for cancer type classification and drug response prediction.
Figure 4
Details of the benchmarking analysis. (A) The process of determining the scope of the benchmarking analysis. (B) An overview of the steps included in the benchmarking analysis.
Figure 5
Benchmarking of machine learning-based integration tools using the CCLE multi-omics data. (A) Accuracy of each method for cancer type prediction, showing standard errors of the mean derived from 100 runs of five-fold cross-validation, totalling 500 experiments (∗ signifies p value < 0.05 and ∗∗∗ signifies p value < 0.001 by an unpaired two-tailed Student’s t test). (B) MAE comparison for drug response prediction across 1,448 drugs, with error bars representing standard errors of the mean (∗∗∗ signifies p value < 0.001 and n.s. stands for not significant by an unpaired two-tailed Student’s t test). (C) Runtime comparison. PCA is omitted because its runtime was negligible compared with the five multi-omics integration methods. (D) A summary of the benchmarking study, derived from the results of cancer type prediction, drug response prediction (MAE between the measured AUC and predicted AUC), runtime comparison, and the number of citations since publication. The number of citations for PCA was set to the maximum for better visualization and because of its widespread use. The inverses of the runtime and drug response prediction MAE values are plotted so that higher values indicate better performance in all dimensions, and all values are plotted in the range of 0 to 1 in the radar plot.
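The evaluation protocol described in this caption can be sketched as follows: 100 repeats of five-fold cross-validation (500 train/test splits), a standard error of the mean over the resulting scores, and a radar-plot rescaling in which lower-is-better metrics are inverted first. This is a minimal sketch under stated assumptions. The caption does not say how values were mapped to the range 0 to 1, so dividing by the best value is assumed here, and the function names, seed, and example numbers are hypothetical rather than taken from the authors' code.

```python
import random
from statistics import stdev


def repeated_kfold(n_samples, n_splits=5, n_repeats=100, seed=0):
    """Yield (train_idx, test_idx) pairs: n_repeats reshuffles, n_splits folds each."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


def standard_error(values):
    """Standard error of the mean: sample standard deviation divided by sqrt(n)."""
    return stdev(values) / len(values) ** 0.5


def radar_scale(scores, lower_is_better=False):
    """Rescale positive scores so the best method gets 1.0 (assumed scaling).

    Lower-is-better metrics (runtime, MAE) are inverted first, so that a
    higher plotted value always means better performance.
    """
    values = {k: 1.0 / v for k, v in scores.items()} if lower_is_better else dict(scores)
    best = max(values.values())
    return {k: v / best for k, v in values.items()}


splits = list(repeated_kfold(n_samples=50))
print(len(splits))  # 100 repeats x 5 folds

# Hypothetical runtimes in minutes, for illustration only.
runtime_minutes = {"method_a": 2.0, "method_b": 8.0}
scaled_runtime = radar_scale(runtime_minutes, lower_is_better=True)
print(scaled_runtime)
```

In a full benchmark each split would produce one accuracy or MAE value, and `standard_error` would be applied to the 500 values per method to obtain the error bars.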
