The transcriptome of breast cancers have been extensively screened with microarrays and large sets of genes associated with clinical features have been established. The aim of this study was to validate original gene sets on a large cohort of raw breast cancer microarray data with known clinical follow-up. We recovered 20 publications and matched them to Affymetrix HGU133A annotations. Raw Affymetrix HGU133A microarray data were extracted from GEO and MAS5 normalized. For classifying patients using the selected gene sets, we applied prediction analysis of microarrays and constructed Kaplan-Meier plots. A new classification including all patients was generated using supervised principal components analysis. Seven studies including 1,470 patients were downloaded from GEO. Notably, we uncovered 641 microarrays representing 251 individual tumor specimens among them, which were repeatedly described under independent GEO identifiers. We excluded all redundant data and used the remaining 1,079 samples. Eight of the 20 gene sets were able to predict response at a significance of P < 0.05. The discrimination of good and poor prognosis groups exclusively relying on gene expression data resulted in high significance (P = 1.8E-12). A model including genes fitted by both gene expression and clinical covariates (lymph node status and grade) contains 44 genes and can predict response at P = 9.5E-7. The outcome provides a ranking of the gene lists regarding applicability on an independent dataset. We established a consensus predictor combining the available clinical and gene expression data. The database comprising expression profiles of 1,079 breast cancers can be used to classify individual patients.