Identifying a small set of marker genes using minimum expected cost of misclassification

Artif Intell Med. 2012 May;55(1):51-9. doi: 10.1016/j.artmed.2012.01.004. Epub 2012 Mar 3.

Abstract

Objectives: This paper presents a model-independent feature selection approach for identifying a small subset of marker genes.

Methods and material: An evaluation measure, minimum expected cost of misclassification (MECM), is used to estimate the discriminative power of a feature subset without building a model. The MECM measure is combined with sequential forward search for feature selection. This approach was applied to a breast cancer profiling problem, with the goal of identifying a small number of marker genes whose expression can be used to predict cancer molecular subtype (p53 gene status). The method was also applied to find a small set of single-nucleotide polymorphisms (SNPs) that can be used to predict a molecular phenotype of a different type, namely alleles (genetic variants) of human leukocyte antigen genes, which play an important role in autoimmunity.
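The sequential forward search named above can be sketched as a greedy loop around any subset-level cost. The abstract does not give the MECM formula, so the sketch below takes the cost as a parameter. The names `sequential_forward_search` and `toy_cost`, the stopping rule, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Sequence


def sequential_forward_search(
    n_features: int,
    cost: Callable[[Sequence[int]], float],
    max_size: int,
) -> List[int]:
    """Greedily grow a feature subset.

    At each step, add the single feature that gives the lowest cost.
    Stop when max_size is reached or no addition lowers the cost
    (an assumed stopping rule).
    """
    selected: List[int] = []
    best_cost = float("inf")
    while len(selected) < max_size:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Score every one-feature extension of the current subset.
        step_cost, step_feature = min((cost(selected + [f]), f) for f in candidates)
        if step_cost >= best_cost:
            break  # no candidate improves on the current subset
        selected.append(step_feature)
        best_cost = step_cost
    return selected


def toy_cost(subset: Sequence[int]) -> float:
    """Hypothetical stand-in for an MECM-style cost (lower is better).

    Features 2 and 5 are treated as informative; every selected feature
    adds a small penalty so that uninformative features are rejected.
    """
    informative = {2: 0.3, 5: 0.2}
    return 0.5 - sum(informative.get(f, 0.0) for f in subset) + 0.01 * len(subset)


selected = sequential_forward_search(n_features=8, cost=toy_cost, max_size=4)
print(selected)
```

Because the cost is evaluated on the subset directly, no classifier is trained inside the loop, which is the sense in which a search of this kind is model-independent.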

Results: Two marker genes were identified based on p53 status; they achieved a p-value of 7.53×10⁻⁵ in survival analysis (vs. 6×10⁻⁴ with 32 genes identified by previous research). Six SNP loci were identified that achieved a leave-one-out cross-validation accuracy of 92.8% (vs. 90.6% and 89.5% with 18 SNPs selected using χ² statistics and information gain, respectively).
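For readers unfamiliar with the leave-one-out cross-validation accuracy reported above, a minimal sketch follows. The abstract does not say which classifier was used, so a nearest-centroid rule stands in as an assumption. The function names and the toy data are hypothetical and unrelated to the study's data or results.

```python
from typing import Callable, Dict, List


def nearest_centroid_predict(
    train_x: List[List[float]], train_y: List[int], query: List[float]
) -> int:
    """Predict the class whose training centroid is closest to the query."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1

    def squared_distance(label: int) -> float:
        return sum((s / counts[label] - q) ** 2 for s, q in zip(sums[label], query))

    return min(sorted(sums), key=squared_distance)


def loocv_accuracy(
    xs: List[List[float]],
    ys: List[int],
    predict: Callable[[List[List[float]], List[int], List[float]], int],
) -> float:
    """Hold out each sample once, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy one-feature data set with two well-separated classes (illustrative only).
xs = [[0.0], [0.2], [0.1], [1.0], [0.9], [1.1]]
ys = [0, 0, 0, 1, 1, 1]
accuracy = loocv_accuracy(xs, ys, nearest_centroid_predict)
print(accuracy)
```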

Conclusion: The MECM-based feature selection approach is capable of identifying a smaller subset of marker genes with comparable or even better performance than that obtained using conventional filter methods.

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Artificial Intelligence*
  • Breast Neoplasms / classification*
  • Breast Neoplasms / genetics*
  • Computer Simulation
  • Female
  • Gene Expression Profiling / methods
  • Genetic Markers
  • HLA Antigens
  • Humans
  • Models, Statistical*
  • Pattern Recognition, Automated / methods
  • Phenotype
  • Polymorphism, Single Nucleotide / genetics*
  • Software

Substances

  • Genetic Markers
  • HLA Antigens