PLoS Comput Biol. 2013;9(5):e1003047.
doi: 10.1371/journal.pcbi.1003047. Epub 2013 May 9.

Improving Breast Cancer Survival Analysis Through Competition-Based Multidimensional Modeling

Free PMC article

Erhan Bilal et al. PLoS Comput Biol.


Breast cancer is the most common malignancy in women and is responsible for hundreds of thousands of deaths annually. As with most cancers, it is a heterogeneous disease and different breast cancer subtypes are treated differently. Understanding the difference in prognosis for breast cancer based on its molecular and phenotypic features is one avenue for improving treatment by matching the proper treatment with molecular subtypes of the disease. In this work, we employed a competition-based approach to modeling breast cancer prognosis using large datasets containing genomic and clinical information and an online real-time leaderboard program used to speed feedback to the modeling team and to encourage each modeler to work towards achieving a higher ranked submission. We find that machine learning methods combined with molecular features selected based on expert prior knowledge can improve survival predictions compared to current best-in-class methodologies and that ensemble models trained across multiple user submissions systematically outperform individual models within the ensemble. We also find that model scores are highly consistent across multiple independent evaluations. This study serves as the pilot phase of a much larger competition open to the whole research community, with the goal of understanding general strategies for model optimization using clinical and molecular profiling data and providing an objective, transparent system for assessing prognostic models.

Conflict of interest statement

The authors have declared that no competing interests exist.


Figure 1. Gene expression subclass analysis.
(A) Comparison of hierarchical clustering of METABRIC data (left panel) and Perou data (right panel). Hierarchical clustering on the gene expression data of the PAM50 genes in both datasets reveals a similar gene expression pattern that separates into several subclasses. Although several classes are apparent, they are consistent with sample assignment into basal-like, Her2-enriched and luminal subclasses in the Perou data. Similarly, in the METABRIC data the subclasses are consistent with the available clinical data for triple-negative, ER and PR status, and HER2 positive. (B) Kaplan-Meier plot for subclasses. The METABRIC test dataset was separated into 3 major subclasses according to clinical features: triple negative (red); ER or PR positive status (blue); and HER2 positive with ER and PR negative status (green). The survival curves were estimated using the standard Kaplan-Meier method, and show the expected differences in overall survival between the subclasses. (C,D) Kaplan-Meier curves by grade and histology. The test dataset was separated by tumor grade (subplot C; grade 1 – red, grade 2 – green, grade 3 – blue), or by histology (subplot D; Infiltrating Lobular – red, Infiltrating Ductal – yellow, Medullary – green, Mixed Histology – blue, Mucinous – purple). The survival curves were estimated using the standard Kaplan-Meier method, and show the expected differences in overall survival for the clinical features.
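For readers unfamiliar with the Kaplan-Meier estimator used in panels B to D, the following is a minimal sketch of the product-limit calculation. It is an illustration only, not the authors' code; the function name `kaplan_meier` and the toy inputs are assumptions introduced here.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each observed death time.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # deaths plus censorings leaving the risk set
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy cohort: five patients, two of them censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Computing one such curve per subclass (for example per tumor grade) and plotting them together gives the kind of comparison the caption describes.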
Figure 2. Distribution of concordance index scores of models submitted in the pilot competition.
(A) Models are categorized by the type of features they use. Boxes indicate the 25th (lower end), 50th (middle red line) and 75th (upper end) percentiles of the scores in each category, while the whiskers indicate the 10th and 90th percentiles. The scores for the baseline and best performer are highlighted. (B) Model performance by submission date. In the initial phase of the competition, slight improvements over the baseline model were achieved by applying machine learning approaches to only the clinical data (red circles), whereas initial attempts to incorporate molecular data significantly decreased performance (green, purple, and black circles). In the intermediate phase of the competition, models combining molecular and clinical data (green circles) predominated and achieved slightly improved performance over clinical-only models. Towards the end of the competition, models combining clinical information with molecular features selected based on prior information (purple circles) predominated.
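The concordance index used to score models can be sketched as follows. This is a minimal illustration of a Harrell-style concordance index for right-censored data; the exact variant and tie handling used in the competition are not stated on this page, so the choices below (and the name `concordance_index`) are assumptions.

```python
def concordance_index(times, events, risk):
    """Fraction of comparable patient pairs ordered correctly by the model.

    times  -- follow-up time for each patient
    events -- 1 if death was observed, 0 if censored
    risk   -- predicted risk score (higher means shorter expected survival)
    A pair (i, j) is comparable when patient i has an observed death
    strictly before patient j's follow-up time. Tied risks count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: a perfectly ordered risk score.
score = concordance_index([1, 2, 3, 4], [1, 1, 0, 1], [4, 3, 2, 1])
```

A score of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why small differences between model categories in the box plot can still be meaningful.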
Figure 3. Model performance by feature set and learning algorithm.
(A) The concordance index is displayed for each model from the controlled experiment (Table S4). The methods and feature sets are arranged according to the mean concordance index score. The ensemble method (cyan curve) infers survival predictions based on the average rank of samples from each of the four other learning algorithms, and the ensemble feature set uses the average rank of samples based on models trained using all of the other feature sets. Results for the METABRIC2 and MicMa datasets are shown in Figure S1. (B) The concordance index of models from the controlled phase by type. The ensemble method again utilizes the average rank for models in each category.
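The rank-averaging ensemble described in this caption can be sketched as below. This is a minimal illustration under the assumption that each model outputs one risk score per sample and that tied scores share their mean rank; the helper names `ranks` and `rank_average_ensemble` are hypothetical, not taken from the paper.

```python
def ranks(scores):
    """Return 1-based ranks of the scores; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of samples tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def rank_average_ensemble(predictions):
    """Average each sample's rank across models.

    predictions -- list of per-model score lists, all of the same length
    Returns one ensemble score per sample (higher means higher predicted risk).
    """
    per_model = [ranks(p) for p in predictions]
    n_samples = len(per_model[0])
    return [
        sum(model[s] for model in per_model) / len(per_model)
        for s in range(n_samples)
    ]


# Hypothetical scores from three models on three samples.
ensemble = rank_average_ensemble([[0.1, 0.9, 0.5], [10, 30, 20], [2, 1, 3]])
```

Working with ranks rather than raw scores lets models with very different output scales be combined, and the resulting ensemble scores can be evaluated with the same concordance index as the individual models.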
Figure 4. Consistency of results in 2 additional datasets.
(A,C) Concordance index scores for all models evaluated in the controlled experiment. Scores from the original evaluation are compared against METABRIC2 (A) and MicMa (C). The 4 machine learning algorithms are displayed in different colors. (B,D) Individual plots for each machine learning algorithm.
Figure 5. Model evaluation pipeline schematic.
Green regions: Public areas, untrusted. Blue regions: Trusted areas where no competitor's code is to be run. Yellow region: Sandboxed area, where untrusted code is run on a trusted system. Red region: Permissions managed by Synapse.


    1. Cancer - NPCR - USCS - View Data Online (n.d.).
    2. Perou CM, Sørlie T, Eisen MB, Van De Rijn M, Jeffrey SS, et al. (2000) Molecular portraits of human breast tumours. Nature 406: 747–752.
    3. Stephens PJ, Tarpey PS, Davies H, Van Loo P, Greenman C, et al. (2012) The landscape of cancer genes and mutational processes in breast cancer. Nature advance on 400–404.
    4. Kristensen VN, Vaske CJ, Ursini-Siegel J, Van Loo P, Nordgard SH, et al. (2012) Integrated molecular profiles of invasive breast tumors and ductal carcinoma in situ (DCIS) reveal differential vascular and interleukin signaling. Proceedings of the National Academy of Sciences of the United States of America 109: 2802–2807. Accessed 11 March 2013.
    5. Van De Vijver MJ, He YD, Van't Veer LJ, Dai H, Hart AAM, et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine 347: 1999–2009.
