Background: The use of somatic mutations for predicting clinical outcome is difficult because a mutation can indirectly influence the function of many genes, and also because clinical follow-up is sparse in the relatively young next generation sequencing (NGS) databanks. Here we approach this problem by linking sequence databanks to well annotated gene-chip datasets, using a multigene transcriptomic fingerprint as a link between gene mutations and gene expression in breast cancer patients.
Methods: The database consists of 763 NGS samples containing mutational status for 22,938 genes and RNA-seq data for 10,987 genes. The gene chip database contains 5,934 patients with 10,987 genes plus clinical characteristics. For the prediction, mutations present in a sample are first translated into a 'transcriptomic fingerprint' by running ROC analysis on mutation and RNA-seq data. Then correlation to survival is assessed by computing Cox regression for both up- and downregulated signatures.
Results: According to this approach, the top driver oncogenes having a mutation prevalence over 5 % included AKT1, TRANK1, TRAPPC10, RPGR, COL6A2, RAPGEF4, ATG2B, CNTRL, NAA38, OSBPL10, POTEF, SCLT1, SUN1, VWDE, MTUS2, and PIK3CA, and the top tumor suppressor genes included PHEX, TP53, GGA3, RGS22, PXDNL, ARFGEF1, BRCA2, CHD8, GCC2, and ARMC4. The system was validated by computing correlation between RNA-seq and microarray data (r(2) = 0.73, P < 1E-16). Cross-validation using 20 genes with a prevalence of approximately 5 % confirmed analysis reproducibility.
Conclusions: We established a pipeline enabling rapid clinical validation of a discovered mutation in a large breast cancer cohort. An online interface is available for evaluating any human gene mutation or combinations of maximum three such genes ( http://www.g-2-o.com ).