The analysis of whole genomes in pan-cancer data sets poses a challenge for researchers, and we contribute to the literature on identifying robust subgroups with a clear biological interpretation. Specifically, we tackle this unsupervised problem with a novel rank-based Bayesian clustering method. Our method has three advantages: it integrates and quantifies all uncertainties related to both the input data and the model; it gives the final results a probabilistic interpretation, so that the stability of clusters can be assessed directly and conclusions are reliable; and it makes the biological interpretation of the identified clusters transparent, since each cluster is characterized by its top-ranked genomic features. We applied our method to RNA-seq data from cancer samples of 12 tumor types from The Cancer Genome Atlas. We identified a robust clustering that mostly reflects tissue of origin but also includes pan-cancer clusters. Importantly, we identified three pan-squamous clusters composed of a mix of lung squamous cell carcinoma, head and neck squamous cell carcinoma, and bladder cancer, with different biological functions over-represented among the top genes that characterize each of the three clusters. We also found two novel subtypes of kidney cancer that differ in prognosis, and we reproduced known subtypes of breast cancer. Taken together, these results show that our method allows the identification of robust and biologically meaningful clusters of pan-cancer samples.
Keywords: Bayes Mallows model; cluster analysis; pan-cancer; robust statistics; subgroup analysis; transcriptomics.
© 2022 The Authors. Molecular Oncology published by John Wiley & Sons Ltd on behalf of Federation of European Biochemical Societies.