Analysis of gene expression profiles of lung cancer subtypes with machine learning algorithms

Biochim Biophys Acta Mol Basis Dis. 2020 Aug 1;1866(8):165822. doi: 10.1016/j.bbadis.2020.165822. Epub 2020 Apr 28.

Abstract

Lung cancer is one of the most common cancers worldwide and causes more than one million deaths annually. Lung adenocarcinoma (AC) and lung squamous cell cancer (SCC) are two major lung cancer subtypes that differ in several respects. Identifying the genes that are differentially expressed between them, and the expression patterns that distinguish them, can deepen our understanding of the two subtypes at the transcriptomic level. In this work, we used several machine learning algorithms to investigate the gene expression profiles of lung AC and lung SCC samples retrieved from the Gene Expression Omnibus. First, the profiles were analyzed with a powerful feature selection method, Monte Carlo feature selection, which yielded a feature list ranking all features by importance, together with a set of informative features. The feature list was then passed to the incremental feature selection method to extract the optimal features, that is, those with which a support vector machine (SVM) achieved its best performance in classifying lung AC and lung SCC samples. Several top genes (CSTA, TP63, SERPINB13, CLCA2, BICD2, PERP, FAT2, BNC1, ATP11B, FAM83B, KRT5, PARD6G, PKP1) were analyzed extensively to support their status as differentially expressed genes between lung AC and lung SCC. In addition, a rule learning procedure was applied to the informative features to construct classification rules. These rules provide a transparent classification procedure and reveal distinct gene expression patterns in lung AC and lung SCC.
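
The incremental feature selection step described in the abstract can be illustrated with a minimal sketch: walk down a ranked feature list, evaluate a classifier on the top k features for increasing k, and keep the k with the best score. Everything specific here is an assumption made for illustration and is not taken from the paper: the data are synthetic, the ranking `ranked` is a hypothetical stand-in for the output of Monte Carlo feature selection, a nearest-centroid classifier replaces the SVM so that the example needs only the standard library, and leave-one-out accuracy is used as the performance measure.

```python
import random
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x.

    Stand-in classifier (assumption): the abstract uses an SVM instead.
    """
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    # sorted() makes tie-breaking between labels deterministic
    return min(sorted(centroids), key=lambda lab: sq_dist(centroids[lab]))


def loo_accuracy(X, y, feature_idx):
    """Leave-one-out accuracy using only the columns in feature_idx."""
    Xs = [[row[j] for j in feature_idx] for row in X]
    hits = 0
    for i in range(len(Xs)):
        train_X = Xs[:i] + Xs[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, Xs[i]) == y[i]:
            hits += 1
    return hits / len(Xs)


def incremental_feature_selection(X, y, ranked_features, step=1):
    """Evaluate the top-k ranked features for growing k; return the best subset.

    Ties are resolved in favour of the smaller subset (a design choice of
    this sketch, not something stated in the abstract).
    """
    best_k, best_acc = 0, -1.0
    for k in range(step, len(ranked_features) + 1, step):
        acc = loo_accuracy(X, y, ranked_features[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ranked_features[:best_k], best_acc


# Synthetic expression matrix (assumption): 10 "AC" and 10 "SCC" samples,
# 4 features. Feature 0 separates the classes; features 1-3 are noise.
rng = random.Random(0)
X, y = [], []
for label, low in (("AC", 0.0), ("SCC", 4.0)):
    for _ in range(10):
        informative = rng.uniform(low, low + 1.0)
        noise = [rng.uniform(0.0, 10.0) for _ in range(3)]
        X.append([informative] + noise)
        y.append(label)

# Hypothetical ranking, standing in for a Monte Carlo feature selection result.
ranked = [0, 2, 1, 3]

selected, best_acc = incremental_feature_selection(X, y, ranked)
print("selected features:", selected, "LOO accuracy:", best_acc)
```

On real expression data the ranked list would typically contain thousands of genes, so a larger `step` keeps the number of classifier evaluations manageable; the classifier and the performance measure can be swapped without changing the selection loop.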

Keywords: Feature selection method; Gene expression profile; Lung adenocarcinoma; Lung squamous cell cancer; Nonsmall cell lung cancer; Rule learning algorithm.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adenocarcinoma of Lung / diagnosis
  • Adenocarcinoma of Lung / genetics*
  • Adenocarcinoma of Lung / metabolism
  • Adenocarcinoma of Lung / pathology
  • Adenosine Triphosphatases / genetics
  • Adenosine Triphosphatases / metabolism
  • Cadherins / genetics
  • Cadherins / metabolism
  • Carcinoma, Squamous Cell / diagnosis
  • Carcinoma, Squamous Cell / genetics*
  • Carcinoma, Squamous Cell / metabolism
  • Carcinoma, Squamous Cell / pathology
  • Computational Biology / methods*
  • Cystatin A / genetics
  • Cystatin A / metabolism
  • Datasets as Topic
  • Diagnosis, Differential
  • Gene Expression Profiling
  • Gene Expression Regulation, Neoplastic*
  • Humans
  • Lung Neoplasms / diagnosis
  • Lung Neoplasms / genetics*
  • Lung Neoplasms / metabolism
  • Lung Neoplasms / pathology
  • Machine Learning / statistics & numerical data*
  • Membrane Transport Proteins / genetics
  • Membrane Transport Proteins / metabolism
  • Monte Carlo Method
  • Serpins / genetics
  • Serpins / metabolism
  • Terminology as Topic
  • Transcription Factors / genetics
  • Transcription Factors / metabolism
  • Transcriptome
  • Tumor Suppressor Proteins / genetics
  • Tumor Suppressor Proteins / metabolism

Substances

  • Cadherins
  • Cystatin A
  • FAT2 protein, human
  • Membrane Transport Proteins
  • SERPINB13 protein, human
  • Serpins
  • TP63 protein, human
  • Transcription Factors
  • Tumor Suppressor Proteins
  • CSTA protein, human
  • ATP11B protein, human
  • Adenosine Triphosphatases