Nature. 2016 Jun 2;534(7605):55-62.
doi: 10.1038/nature18003. Epub 2016 May 25.

Proteogenomics Connects Somatic Mutations to Signalling in Breast Cancer

Free PMC article

Philipp Mertins et al. Nature.

Somatic mutations have been extensively characterized in breast cancer, but the effects of these genetic alterations on the proteomic landscape remain poorly understood. Here we describe quantitative mass-spectrometry-based proteomic and phosphoproteomic analyses of 105 genomically annotated breast cancers, of which 77 provided high-quality data. Integrated analyses provided insights into the somatic cancer genome including the consequences of chromosomal loss, such as the 5q deletion characteristic of basal-like breast cancer. Interrogation of the 5q trans-effects against the Library of Integrated Network-based Cellular Signatures, connected loss of CETN3 and SKP1 to elevated expression of epidermal growth factor receptor (EGFR), and SKP1 loss also to increased SRC tyrosine kinase. Global proteomic data confirmed a stromal-enriched group of proteins in addition to basal and luminal clusters, and pathway analysis of the phosphoproteome identified a G-protein-coupled receptor cluster that was not readily identified at the mRNA level. In addition to ERBB2, other amplicon-associated highly phosphorylated kinases were identified, including CDK12, PAK1, PTK2, RIPK2 and TLK2. We demonstrate that proteogenomic analysis of breast cancer elucidates the functional consequences of somatic mutations, narrows candidate nominations for driver genes within large deletions and amplified regions, and identifies therapeutic targets.


Extended Data Figure 1. Experimental and data analysis workflows and longitudinal data generation quality control
a, iTRAQ 4-plex global proteome and phosphoproteome analysis workflow. 105 TCGA breast tumors were analyzed in 35 iTRAQ 4-plex experiments (plus 1 replicate and 1 normal sample experiment), with three tumors of different subtypes compared to a fourth common internal reference sample in each experiment. The reference sample comprised 10 individual tumors of each of the 4 major breast cancer intrinsic subtypes and served as an internal standard for all proteins and phosphoproteins quantified in this study. Each iTRAQ MS/MS spectrum measures a peptide from 4 samples (3 individual patients and the reference sample mix of 40 patients). More than 400,000 distinct peptides were identified and quantified in ~14 million MS/MS spectra. Personalized tumor-specific protein databases were generated in the QUILTS software package using whole exome sequencing-derived variant calls and RNAseq-derived transcript information. All mass spectrometry data was analyzed using the Spectrum Mill software package. b, Overview of proteome and phosphoproteome datasets. The table provides a summary of the datasets used in specific analyses, including the filters applied to derive the proteins and phosphosites/phosphoproteins that constitute each dataset; the protein, phosphosite or phosphoprotein count; and the methods that employ the respective datasets. c, Distribution of sequence coverage of the identified proteins with tryptic peptides detected by MS/MS, whiskers show the 5–95 percentiles. d and e, Robust and accurate proteome/phosphoproteome platform. Longitudinal performance was tested by repeated proteome and phosphoproteome analysis of patient-derived xenograft tumors. Scatterplots, histograms and Pearson correlations comparing individual replicate measurements are shown.
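The ratio-to-reference design described in panel a can be illustrated with a minimal sketch. It assumes that reporter channel 117 carries the common internal reference (the legend does not state which channel was used) and that ratios are expressed as log2 values; the intensities and the function name `log2_ratios` are invented for illustration and are not taken from Spectrum Mill.

```python
import math

# Reporter-ion channels of an iTRAQ 4-plex experiment.
CHANNELS = ("114", "115", "116", "117")


def log2_ratios(intensities, reference_channel="117"):
    """Return log2(sample / reference) for every non-reference channel.

    `reference_channel` is an assumption: the legend only says that one of
    the four channels holds the common internal reference sample.
    """
    ref = intensities[reference_channel]
    if ref <= 0:
        raise ValueError("reference intensity must be positive")
    return {
        ch: math.log2(intensities[ch] / ref)
        for ch in CHANNELS
        if ch != reference_channel and intensities[ch] > 0
    }


# Hypothetical reporter-ion intensities for one MS/MS spectrum.
spectrum = {"114": 2000.0, "115": 500.0, "116": 1000.0, "117": 1000.0}
ratios = log2_ratios(spectrum)
```

Because every experiment shares the same reference mix, ratios computed this way are comparable across the 4-plex experiments.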
Extended Data Figure 2. Tumor sample quality control (I)
a, REMARK diagram showing sample processing and partitioning. Initial quality review encompassed histopathological examination of H&E stained tissue slices. *For 3 samples no tumor cells were seen on histopathology (BH-A0E9, BH-A0C1, A2-A0SW). These samples were nevertheless included in the proteome analysis since other quality control standards were met (see below) and samples with 0% tumor cellularity on top or bottom sections were included in TCGA analyses. b, Correlation of TCGA (top or bottom sections) and CPTAC histological assessment of neoplastic cellularity for samples (n = 105). The average and range of neoplastic cellularities were identical for CPTAC and TCGA histological assessments. Averages (standard deviations) for neoplastic cellularity were 76% (+/− 17) for CPTAC, 76% (+/− 15) for TCGA_Top, and 75% (+/− 18) for TCGA_Bottom histopathology slides (Supplementary Table 2). Note that in three CPTAC cases where no tumor cells were identified by histopathological assessment, numbers of protein-level somatic variants were similar to all other tumors. The identified mutated proteins were TP53_R273C, NOP58_Q23E, TAGLN2_G154R, TUBA1B_D116H, and MRPL48_I173K (Supplementary Table 5), indicating presence of tumor cells in these samples. c, Proteome iTRAQ tumor/internal reference ratio heatmap for all CPTAC samples (8,028 proteins without missing values) including passed and failed proteomic quality control (QC) samples. d, Global tumor/reference proteome ratio distributions for samples that passed and failed proteomic quality control analysis. e, Degradation-related gene sets were enriched in tumors that failed proteomic quality control analysis. f, Variant allele frequency (VAF) analysis of re-sequenced CPTAC tumors and comparison to original TCGA data. Overall VAFs for failed QC samples were lower compared to passed samples, suggesting lower purity.
Extended Data Figure 3. Tumor sample quality control (II)
a, There was high concordance (94.6%) between DNA variants reported by TCGA and CPTAC re-sequenced tumors. Most point mutations reported by TCGA could be identified across the 8 re-sequenced samples used in the study. b, A high overall correlation (mean=0.77) was observed for the CPTAC Variant Allele Fraction (VAF) (X-axis) and TCGA VAF (Y-axis) across the 8 samples used in the study. c, Agglomerative hierarchical clustering (Supplementary Methods Section 3.8) used to co-cluster protein and RNA tumor expression data after filtering to retain 4,291 proteins and genes with moderate to high protein-RNA correlation (Pearson correlation > 0.4) with results displayed as a circular dendrogram (fanplot). The proteome (.P) and RNA (.R) components of each sample are labeled using the same color. The outer ring shows proteome samples in light grey and RNA samples in dark grey. High concordance between RNA and protein expression is evident from the color adjacency in the inner ring and alternating color in the outer ring showing that RNA and protein components co-cluster for a large proportion of samples (62/80). d, Co-clustering of MS and RPPA tumor data. 126 RPPA readouts were mapped to gene names. These genes were intersected with the genes observed in the MS proteome, filtered to 48 proteins with moderate or higher RPPA-MS protein correlation, and analyzed for co-clustering as in c. 47 of 80 RPPA-MS protein pairs co-cluster. While this is a smaller proportion than for RNA-protein analysis, the number of genes used in the clustering is significantly smaller for RPPA (48 vs. 4,291 for RNA). e, ESTIMATE tumor purity comparison between mRNA, RNAseq, and proteome data. ANOVA is used to assess the difference in distribution (−log10(p-value)) of ESTIMATE, stromal, immune, and tumor purity scores across mRNA (microarray), RNA-seq and proteome data. 
The only significant p-value (=0.02) is for the Cluster 3 stromal score, and higher stromal scores for the proteome drive that difference. f, Ischemia score analysis. Comparison of ischemia scores of 77 CPTAC tumors, 3 normal samples, and patient-derived xenografts. CPTAC tumors had generally lower ischemia scores than PDX samples subjected to 30 minutes of cold ischemia. Median ischemia scores are less than 30 minutes for each subtype and no significant differences were observed across subtypes. Effects due to cold ischemia therefore appear to be negligible in this CPTAC sample collection.
Extended Data Figure 4. Protein-to-Protein, -CNA, and -mRNA correlation analyses
a, Identification of UBE3A as an E3 ubiquitin ligase that negatively correlates to p53 on the protein level. Pearson correlation and Benjamini-Hochberg corrected p-value are shown. b, Analysis of counter-regulated genes with negative correlation of CNA-to-RNA as well as CNA-to-protein levels. Negative Pearson correlations are shown with Benjamini-Hochberg corrected p-values for CNA-to-protein correlations. Depicted genes have significant negative correlations at FDR<0.05 in the CNA-to-RNA and CNA-to-protein analyses. c, Global mRNA-to-protein correlation and gene set enrichment analysis.
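The Benjamini-Hochberg correction referred to in this legend can be sketched as follows. This is a generic, minimal implementation of the standard step-up procedure, not the authors' code; the function name `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four correlation tests.
pvals = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(adj) if q < 0.05]
```

With these toy inputs only the first test stays below the FDR<0.05 cut-off used throughout the legends.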
Extended Data Figure 5. Global CNA effects and comparison of CNA TRANS effects to knockdown signatures in the LINCS database
a, CNA landscape in the CPTAC tumor collection. The segment-based CNAs of 77 samples were downloaded from TCGA Firehose, including 18 Basal, 12 Her2, 23 Luminal A and 24 Luminal B subtypes. Copy number amplifications were marked in red and deletions in blue. The bottom color key represents the log2 transformed copy number value, with CNA=2 centered at 0. Specific CNA events are seen for chromosome 5q and 10p regions in basal-like tumors. b, Correlations of copy number alterations (x-axis) to phosphoprotein levels (y-axis) highlight new CNA cis and trans effects. Significant (FDR<0.05) positive (red) and negative (green) correlations between CNA and phosphoproteins are indicated. Histograms show the fraction [%] of significant CNA trans effects for each CNA gene. c, LINCS CMap analysis facilitates identification of novel functional candidates for CNA trans effects. Knockdown profiles were compared with CNA/protein trans effects for 502 genes. Genes with an absolute connectivity score >90 were considered connected and significant cis effects were annotated at an FDR<0.05. d, Basal-like tumor-specific CNAs are candidate regulatory events for EGFR and SRC expression levels. Oncogenic kinases with significant CNA/protein trans effects (left panel) that were regulated in LINCS shRNA experiments (right panel; 4 cell lines) and directly measured as LINCS landmark genes are shown alongside candidate regulatory genes CETN3 and SKP1. Clinical ER, PR, and HER2 annotation and PAM50 classification are shown in the header rows of each column.
Extended Data Figure 6. Proteome cluster heatmap and stability analysis
a, K-means consensus clustering of proteome and phosphoproteome data identifies three subgroups: basal-enriched, luminal-enriched, and stromal-enriched. The heatmap represents all 1,521 proteins used for clustering (Dataset G8). b, Identification of optimal proteome clusters for QC-passed CPTAC breast cancer tumors. Proteome clusters were derived using consensus clustering based on 1000 resampled datasets, exploring the range of 2 to 6 k-means clusters. Visualization of consensus matrices from k-means consensus clustering for k=3, 4, 5 and 6 target clusters. Consensus clustering was performed on 1,521 proteins with no missing values and SD>1.5. c, Silhouette plots were generated to evaluate the coherence of the clustering. Silhouette plots for k=3 and k=4 clusters showing a cleaner separation of clusters for k=3. d, Based on both visual inspection of the consensus matrix and the delta plot assessing change in consensus cumulative distribution function (CDF) area, three robustly segregated groups were observed. Consensus cumulative distribution function (CDF) and delta area (change in CDF area) plots for 2–6 clusters.
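The idea behind the consensus matrix in panel b can be sketched in a few lines. This is a toy illustration under several assumptions that are not in the legend: an 80% subsampling fraction, 100 resamples instead of 1000, a hand-written k-means, and six invented two-dimensional points standing in for tumors. The names `kmeans` and `consensus_matrix` are hypothetical.

```python
import random


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means on a list of equal-length tuples; returns labels."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def consensus_matrix(points, k, n_resamples=100, fraction=0.8, seed=0):
    """Fraction of resamples in which two co-sampled items share a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(fraction * n))  # subsample size (assumed 80% of items)
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Six invented "tumors" forming two well-separated groups.
toy = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
cm = consensus_matrix(toy, k=2)
```

Entries close to 1 or 0 indicate stable assignments; repeating the procedure for several values of k and comparing the resulting matrices is what the CDF and delta-area plots in panel d summarize.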
Extended Data Figure 7. Proteome cluster markers and enriched pathways
a, Markers (based on SAM analysis; FDR<0.01) discriminate between proteome clusters 1, 2 and 3 (compare to heatmap of proteins used to derive clusters depicted in Extended Data Fig. 6a). b, Applying a Fisher exact test-based enrichment analysis to the proteome, phosphoproteome and mRNA data, gene sets from MSigDB were identified that were unique for each proteome cluster. Heat map showing specific pathways comprising dominant biological themes that are significantly differential by enrichment analysis between basal-enriched and luminal-enriched tumors (Fisher Exact Test Benjamini-Hochberg corrected p-values are shown; enrichment test performed on marker sets identified using SAM analysis; see Methods; compare to Figure 3c). c, Heatmap showing a selection of gene sets significant in basal-enriched or luminal-enriched tumors exclusively by mRNA, protein or phosphoprotein expression. Cytokine signatures, for example, were strongly captured at the mRNA level, but were seen to only a limited degree at the global protein level, likely because of their typically low protein abundance. By contrast, the vast majority of significant gene sets annotated as "signaling" were enriched only at the phosphoprotein level. d, Global heat map representing all gene sets significantly enriched in at least one of the proteomic breast cancer subtypes. The stromal-enriched group was characterized by breast cancer normal-like, adipocyte differentiation, smooth muscle, toll-like receptor signaling and endothelin gene sets, supporting the clustering-based annotation of high stromal and/or adipose content in these tumors (see Supplemental Table 13).
Extended Data Figure 8. Phosphoproteome pathway clustering, kinase-phosphosite multivariate regression, and protein co-expression networks
a, Phosphoproteome pathway clustering. Using phosphorylation state as a proxy for activity, deep phosphoproteome profiling allows development of a breast cancer molecular taxonomy based on signaling pathways. K-means consensus clustering was performed on pathways derived from single sample GSEA analysis of phosphopeptide data (908 pathways shown). Of four robustly segregated groups, subgroups 2 and 3 substantially recapitulated the stromal- and luminal-enriched proteomic subgroups, respectively. Subgroup 4 included a significant majority of tumors from the basal-enriched proteomic subgroup, but was admixed particularly with luminal-enriched samples. This subgroup was defined by high levels of cell cycle and checkpoint activity. All basal and a majority of non-basal samples in this subgroup had TP53 mutations. Subgroup 1 was a novel subgroup defined exclusively in the phosphopeptide / pathway activity domain, with no enrichment for either proteomic or PAM50 subtypes. It was defined by G-protein, G-protein coupled receptor, and inositol phosphate metabolism signatures, as well as ionotropic glutamate signaling. b, Analysis of the regulatory relationship between outlier kinases (see Supplementary Table 19) and phosphopeptides by regulatory multivariate regression analysis (see Methods) identified CDK1 as the most highly connected of the outlier Cyclin-Dependent Kinases, with highest centrality (based on node-degree; see Methods) among the outlier CDKs and seventh highest centrality among all the outlier kinases considered in the remMap analysis. Each line represents a phosphosite-kinase relationship. c–f, Analysis of differences in the co-expression patterns among genes/proteins across different subgroups. A Joint Random Forest (JRF) method was applied to simultaneously build gene co-expression and protein co-expression networks (Supplementary Table 17, and Methods). 
Modules in these networks revealed different interaction patterns between basal-enriched and luminal-enriched subgroups. c, Network module P1 of the protein co-expression network, defined chiefly in the proteome space. This module contained 12 genes connected by 39 edges, among which 34 were protein-specific and 5 were shared by both the protein and mRNA co-expression networks. Many edges were supported by published information and were contained in the STRING database. Edges in red are specific to the protein co-expression network; edges in green are shared by both protein and gene co-expression networks; edges indicated by double lines are contained in the STRING database with confidence score greater than 0.15. MMP9, one of the central proteins in this module, contributes to metastatic progression and is a potential target for anti-metastatic therapies for basal-like / triple negative breast cancer. d, Heatmaps of the absolute correlation across each pair of genes in module P1 (shown in Panel c), based on either protein or gene expression data for samples in the basal-enriched and luminal-enriched subgroups, respectively. The MMP9 protein was strongly co-expressed with the other members of the module only in the basal-enriched subgroup. Notably, this observation is dependent on protein data; the correlation at the mRNA level for this module was consistently low in both the basal-enriched and luminal enriched subgroups indicating that these events coherently occur at the proteomic level. e, Co-expression network based on proteomics data. The network contains 693 proteomic network-specific edges (grey) and 792 edges shared with the RNAseq network (green). For each module, the most enriched category and corresponding Benjamini-Hochberg adjusted p-value is reported. Pie charts adjacent to each module show the proportion of proteomics-specific edges (grey area) and edges shared between proteomics and RNAseq data (green area). f, RNAseq network.
Extended Data Figure 9. Phosphoproteome signatures of PIK3CA (a,b) and TP53 (c,d) mutated tumors highlight activated key regulators and indicate frequency of activation
a and c, Phosphosites upregulated in mutated tumors (SAM FDR<0.05 across all tumors and independently also across luminal tumors; average phosphosite signal for all markers shown as bar graph). To avoid confounding by intrinsic subtype-specific distinctions, only markers that were significantly identified both in analyses covering all tumors and analyses restricted to luminal tumors were selected (FDR<0.05). Color bars in the margins indicate FDRs for grouped analysis of different mutation classes and indicate kinase substrates of known kinases in the respective pathways. Significantly regulated kinase phosphosites are annotated. The average phosphorylation signal of the marker phosphosites provides a read-out for PI3K and TP53 pathway activity in mutated tumors (histogram below heatmap). A 95% prediction confidence interval (indicated by dashed lines) across the average signal in non-mutated tumors was chosen in order to discriminate active from non-active tumors. The most strongly activated PIK3CA kinase domain mutant tumor differed from the other 9 kinase domain mutant tumors, as it contained an amino acid side chain charge neutral H1047L instead of the more common positively charged H1047R mutation. Among the 62 phosphosites identified that were significantly upregulated in PIK3CA mutated tumors, 13 phosphosites were found on phosphoproteins that are known substrates of well-annotated kinases in the PIK3CA pathway (panel a, right column). In the mutant TP53 analysis a total of 20 phosphosites were found on phosphoproteins that are known substrates of well-annotated kinases in the p53 pathway (panel c, right column). b and d, Upregulated phosphosite sets were derived from isogenic PIK3CA and TP53 mutant versus wild-type cell line pairs and tested for enrichment within mutant versus wild-type CPTAC tumors using single sample GSEA. Significantly enriched phosphosite sets are shown (p<0.05).
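One way to read the 95% prediction interval used to separate active from non-active tumors is sketched below. The legend does not give the formula, so this sketch assumes a normal-approximation interval, mean ± z·s·sqrt(1 + 1/n), computed from the non-mutated tumors; all scores, tumor labels, and the function name `prediction_upper_bound` are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def prediction_upper_bound(reference_scores, level=0.95):
    """Upper limit of a two-sided normal-approximation prediction interval.

    Assumption: the interval is mean +/- z * s * sqrt(1 + 1/n); the legend
    does not specify how the authors computed theirs.
    """
    n = len(reference_scores)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    spread = stdev(reference_scores) * math.sqrt(1 + 1 / n)
    return mean(reference_scores) + z * spread


# Invented average marker-phosphosite signals.
wild_type = [-0.2, 0.1, 0.0, -0.1, 0.2, 0.0]
mutant = {"T1": 1.4, "T2": 0.1, "T3": 0.9}

upper = prediction_upper_bound(wild_type)
# A mutated tumor is called "active" when its signal exceeds the upper limit.
active = sorted(t for t, score in mutant.items() if score > upper)
```

For small reference groups a Student t quantile would give a wider interval than the normal quantile used here.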
Extended Data Figure 10. PIRCOS plots, kinase outliers and outliers in the ERBB2 pathway
a, PIRCOS (Proteogenomics CIRCOS) plots for 8q and 17q showing median CNA, RNA, protein, and phosphosite expression for 20 tumors with amplification in 8q based on RIPK2 CNA>1; 23 tumors with amplification in 8q based on PTK2 CNA>1; 15 tumors with amplification in 17q based on CDK12 CNA>1; and 10 tumors with amplification in 17q based on TLK2 CNA>1. Red indicates expression >1, blue < −1, and grey between −1 and 1. Genes with both copy number amplification (CNA>1) and increased phosphosite expression (p-site>1) are labeled. b, Phosphosite outliers in known ERBB2 signaling genes. To better understand the downstream effects of ERBB2 amplification, phosphosite outliers in known ERBB2 signaling genes (MSigDB pathway set, KEGG_ERBB_SIGNALING_PATHWAY) were identified for the 15 samples that had ERBB2 phosphosite outlier status. Forty-one genes were identified as having a phosphosite outlier in at least one of the ERBB2 amplified samples. PAK4 and ARAF phosphosite outlier status was found in 7 of the 15 ERBB2 kinase outlier samples; GSK3B outliers were found in 6 samples; and EIF4EBP1, MAP2K2, ABL1 and AKT1 outlier status was found in 5 of the 15 samples. c, Proteogenomic outlier expression analysis for TLK2 and RIPK2. Samples with outlier phosphosite (red), protein (yellow), RNA (green) and copy number (purple) expression are shown. Phosphosite squares indicate per-sample outlier phosphosites.
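The labeling rule in panel a (median CNA>1 and median phosphosite>1 across the amplified tumors) can be written as a short filter. The sketch below assumes per-gene dictionaries of per-tumor values; the data, the placeholder gene `GENE_X`, and the function name `labeled_genes` are invented for illustration.

```python
from statistics import median


def labeled_genes(cna, psite, anchor_gene, threshold=1.0):
    """Select tumors amplified for `anchor_gene`, then label co-amplified genes.

    cna / psite map gene -> {tumor: value}. A gene is labeled when both its
    median CNA and its median phosphosite value across the selected tumors
    exceed `threshold`.
    """
    tumors = [t for t, value in cna[anchor_gene].items() if value > threshold]
    labeled = []
    if not tumors:
        return tumors, labeled
    for gene in cna:
        if gene not in psite:
            continue
        med_cna = median(cna[gene][t] for t in tumors)
        med_psite = median(psite[gene][t] for t in tumors)
        if med_cna > threshold and med_psite > threshold:
            labeled.append(gene)
    return tumors, labeled


# Invented values for four tumors; GENE_X is a placeholder, not a real gene.
cna = {
    "CDK12": {"T1": 1.5, "T2": 1.2, "T3": 0.1, "T4": 1.8},
    "GENE_X": {"T1": 1.3, "T2": 1.1, "T3": 0.0, "T4": 1.4},
}
psite = {
    "CDK12": {"T1": 2.0, "T2": 1.1, "T3": 0.2, "T4": 1.6},
    "GENE_X": {"T1": 0.3, "T2": 0.2, "T3": 0.1, "T4": 0.5},
}
amplified, labels = labeled_genes(cna, psite, anchor_gene="CDK12")
```

In this toy input the placeholder gene is co-amplified but not hyperphosphorylated, so only the anchor gene is labeled.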
Figure 1. Proteogenomic analysis of human breast cancer. Direct effects of genomic alterations on protein level
Overlap of a, protein coding single amino acid variants (SAAVs) and b, RNA splice junctions not present in RefSeq v60 detected by DNA exome sequencing, RNA-seq, and LC-MS/MS. Proportions of novel variants are noted. c, Heatmap of mutations/CNA and their effects on RNA and protein expression of breast cancer-relevant genes across tumor and normal samples. ER, PR, HER2 and PAM50 status are annotated. Median iTRAQ protein abundance ratio and the most frequently detected and differential phosphosite ratio are shown for each gene. Pearson correlations between MS protein vs RNA-seq and MS protein vs RPPA are indicated.
Figure 2. Effects of copy number alterations (CNA) on mRNA, protein, and phosphoprotein abundance
a, Correlations of CNA (x-axes) to RNA and protein expression levels (y-axes) highlight new CNA cis and trans effects. Significant (FDR<0.05) positive (red) and negative (green) correlations between CNA and mRNAs or proteins are indicated. CNA cis effects appear as a red diagonal line, CNA trans effects as vertical stripes. Histograms show the fraction [%] of significant CNA trans effects for each CNA gene. b, Overlap of cis effects observed at RNA, protein, and phosphoprotein levels (FDR<0.05). c, Trans-effect regulatory candidates identified among those with significant protein cis-effects using LINCS CMap. Bars indicate total numbers of significant CNA/protein trans effects (gray; FDR<0.05) and overlap with regulated genes in LINCS knock-down profiles (red; 4 cell lines; moderated T-test FDR<0.1).
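The distinction between cis and trans effects in panel a reduces to whether the CNA gene and the correlated mRNA or protein are the same gene. The sketch below tallies both from a list of significant pairs. It assumes the histogram percentage is taken relative to the number of measured genes, which the legend does not state; the gene names are placeholders and `trans_fraction` is a hypothetical helper.

```python
from collections import defaultdict


def trans_fraction(significant_pairs, n_measured_genes):
    """Split significant (cna_gene, target_gene) pairs into cis and trans.

    Returns the set of genes with a cis effect and, per CNA gene, the
    percentage of measured genes affected in trans (assumed denominator).
    """
    cis = set()
    trans_counts = defaultdict(int)
    for cna_gene, target in significant_pairs:
        if cna_gene == target:
            cis.add(cna_gene)          # point on the diagonal
        else:
            trans_counts[cna_gene] += 1  # point in a vertical stripe
    fractions = {
        gene: 100.0 * count / n_measured_genes
        for gene, count in trans_counts.items()
    }
    return cis, fractions


# Placeholder pairs that passed an FDR<0.05 cut-off in a toy data set.
pairs = [
    ("GENE_A", "GENE_A"),
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_C"),
    ("GENE_B", "GENE_C"),
    ("GENE_D", "GENE_D"),
]
cis_genes, trans_pct = trans_fraction(pairs, n_measured_genes=4)
```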
Figure 3. Proteomic and phosphoproteomic subtypes of breast cancer and subtype-specific pathway enrichment
a, Unsupervised clustering of RNA-seq and proteomics data restricted to PAM50 genes and a subset of 35 detected proteins reveals high similarity to PAM50 (TCGA) sample annotation. b, K-means consensus clustering of proteome and phosphoproteome data identifies basal-enriched, luminal-enriched, and stromal-enriched subgroups. c, Gene set enrichment analysis highlights sets of pathways significantly differential between basal-enriched and luminal-enriched tumors (detailed in Extended Data Fig. 7b). d, K-means consensus clustering performed on pathways derived from single sample GSEA analysis of phosphopeptide data identifies four distinct clusters.
Figure 4. Example analyses of aberrantly regulated kinases in human breast cancer
a and b, PIRCOS (Proteogenomics CIRCOS) plots showing CNA, RNA, protein and phosphosite expression for 17 tumors with amplification in 17q (ERBB2 CNA>1) and 8 tumors with amplification in 11q (PAK1 CNA>1). Labeled genes have CNA>1 and phosphosite>1. c, Proteogenomic outlier expression analysis for ERBB2, CDK12, and PAK1. Samples with outlier phosphosite (red), protein (yellow), RNA (green) and copy number (purple) expression are shown. Phosphosite squares indicate per-sample outlier phosphosites. d, Outlier kinase events by PAM50 subtype (>35% of subtype samples contain a phosphosite outlier; <10% FDR using Benjamini-Hochberg adjusted p-values).
