An Interpretable Deep Embedding Model for Few and Imbalanced Biomedical Data

IEEE J Biomed Health Inform. 2022 Nov 21;PP. doi: 10.1109/JBHI.2022.3223798. Online ahead of print.


In healthcare, training examples are often hard to obtain (e.g., cases of a rare disease), or the cost of labelling data is high. With a large number of features (p) measured in a relatively small number of samples (N), the "big p, small N" problem is an important subject in healthcare studies, especially for genomic data. Another major challenge in effectively analyzing medical data is the skewed class distribution caused by the imbalance between class labels. In addition, feature importance and interpretability play a crucial role in solving medical problems. Therefore, in this paper, we present an interpretable deep embedding model (IDEM) that classifies new data after seeing only a few training examples with a highly skewed class distribution. The IDEM model consists of a feature attention layer to learn the informative features, a feature embedding layer to directly handle both numerical and categorical features, and a siamese network with contrastive loss to compare the similarity between the learned embeddings of two input samples. Experiments on both synthetic data and real-world medical data demonstrate that our IDEM model generalizes better than conventional approaches when trained on few and imbalanced medical samples, and that it can identify which features contribute to the classifier's ability to distinguish cases from controls.
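
To make the siamese comparison step concrete, the following is a minimal sketch of a contrastive loss over pairs of embeddings. It uses the commonly cited margin-based form (squared distance for same-class pairs, squared hinge on the margin for different-class pairs); the abstract does not state the exact formulation used in IDEM, so the function name `contrastive_loss`, the `margin` parameter, the Euclidean distance, and the label convention (1 = same class, 0 = different class) are all assumptions made for illustration.

```python
import numpy as np


def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Mean margin-based contrastive loss over a batch of embedding pairs.

    Assumed convention (not taken from the paper):
      same_class[i] == 1 -> pair i shares a label, so its embeddings are pulled together
      same_class[i] == 0 -> pair i has different labels, so its embeddings are pushed
                            apart until they are at least `margin` away
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    same_class = np.asarray(same_class, dtype=float)

    # Euclidean distance between the two embeddings of each pair.
    dist = np.linalg.norm(emb_a - emb_b, axis=1)

    # Same-class pairs are penalized by their squared distance.
    pull = same_class * dist ** 2
    # Different-class pairs are penalized only when closer than the margin.
    push = (1.0 - same_class) * np.maximum(0.0, margin - dist) ** 2

    return float(np.mean(pull + push))


# Hypothetical toy embeddings for two pairs of samples.
pairs_a = [[0.0, 0.0], [0.0, 0.0]]
pairs_b = [[0.0, 0.0], [3.0, 4.0]]
print(contrastive_loss(pairs_a, pairs_b, same_class=[1, 0]))
```

Because the loss is defined on pairs rather than on individual samples, a small training set yields many training pairs, which is one reason pairwise objectives are often considered for few-sample settings.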