Genome Res., 17 (9), 1362-77

Whole Proteome Analysis of Post-Translational Modifications: Applications of Mass-Spectrometry for Proteogenomic Annotation

Nitin Gupta et al. Genome Res.

Abstract

While bacterial genome annotations have significantly improved in recent years, techniques for bacterial proteome annotation (including post-translational chemical modifications, signal peptides, proteolytic events, etc.) are still in their infancy. At the same time, the number of sequenced bacterial genomes is rising sharply, far outpacing our ability to validate the predicted genes, let alone annotate bacterial proteomes. In this study, we use tandem mass spectrometry (MS/MS) to annotate the proteome of Shewanella oneidensis MR-1, an important microbe for bioremediation. In particular, we provide the first comprehensive map of post-translational modifications in a bacterial genome, including a large number of chemical modifications, signal peptide cleavages, and cleavages of N-terminal methionine residues. We also detect multiple genes that were missed or assigned incorrect start positions by gene prediction programs, and suggest corrections to improve the gene annotation. This study demonstrates that complementing every genome sequencing project with an MS/MS project would significantly improve both genome and proteome annotations at a reasonable cost.

Figures

Figure 1.
The ribosomal protein L31 (SO_4120) is entirely covered by identified peptides (all peptides longer than six amino acids are shown). The protein sequence is shown at the top in red, and the identified peptides are shown below in blue. Tryptic peptides are shown in bold.
Figure 2.
(A) Distribution of the number of identified peptides observed for TIGR genes. Protein counts are plotted on a logarithmic scale. (B) Distribution of the residue coverage of TIGR genes by identified peptides. A total of 102 genes had coverage of ≥90%. Genes were grouped into percentage bins of size 10 percentage points based on their coverage.
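The residue coverage plotted in panel B can be illustrated with a minimal sketch. It assumes that coverage is the percentage of a protein's residues lying inside at least one identified peptide, and that genes are grouped into 10-point bins labeled by their lower edge. The function names, the exact-substring matching, and the bin boundaries are illustrative assumptions, not the authors' pipeline.

```python
def residue_coverage(protein: str, peptides: list[str]) -> float:
    """Percentage of residues in `protein` covered by at least one peptide.

    Assumption: a peptide covers every position at which it occurs as an
    exact substring of the protein sequence.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


def coverage_bin(percent: float) -> int:
    """Lower edge of the 10-point bin (0, 10, ..., 90).

    Assumption: 100% coverage is placed in the top (90) bin.
    """
    return min(int(percent // 10), 9) * 10


# Hypothetical toy sequence and peptides, for illustration only.
toy_protein = "MKTAYIAKQR"
toy_peptides = ["MKT", "AKQR"]
pct = residue_coverage(toy_protein, toy_peptides)
print(pct, coverage_bin(pct))
```

Overlapping peptides are counted once per residue, so heavily sampled regions do not inflate the coverage value.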
Figure 3.
Correlations between coverage of individual proteins by MS-peptides and their biological features deduced from comparative genomics. (A) Conservation and essentiality. The conservation index for every protein was computed as the number of putative orthologs within a set of ∼100 diverse bacterial genomes (list provided in Supplemental Table S1B). A bar diagram in the left panel shows the corresponding values averaged for each coverage group. The right panel shows the fraction of E. coli orthologs (456, 419, 363, 322, and 738 in groups A–E, respectively) that were deemed essential in the published study (Baba et al. 2006), plotted for each coverage group. (B) Functional categories. The upper panel shows a distribution of proteins within each coverage group by main functional categories according to TIGR annotations (to avoid redundancy, only one category was chosen for each protein). In the lower panel, a similar distribution reflects inclusion of proteins in a collection of categorized subsystems (pathways) in The SEED database (restricted to one subsystem per protein). (C) Examples of individual subsystems (pathways). A distribution of proteins between coverage groups is illustrated for eight subsystems selected from the six major functional categories (protein metabolism; carbohydrates; nucleosides and nucleotides; amino acids and derivatives; fatty acids and lipids; cofactors and vitamins).
Figure 4.
(A) Alignment of identified peptides (colored blue) and the hypothetical TIGR protein SO_1175 (colored red). Numbers on the right show the number of times the peptide was identified in MS/MS spectra (spectral count). The start codon, at position 1,218,574 on the forward strand of the chromosome, is normally read as valine. These peptide identifications demonstrate that translation begins upstream of the annotated start site. (B) Alignment of identified peptides (blue) with the nucleotide sequence of the N-terminal region of SO_1175, relative to the proposed new start codon (single arrow) and to the originally proposed start codon (double arrow). A Shine-Dalgarno-like site (underlined) is found upstream only of the proposed new start site. (C) Multiple sequence alignment of SO_1175 of S. oneidensis (red) with orthologs in other Shewanella strains.
Figure 5.
(A) Alignment of identified peptides (blue) with the intergenic region between TIGR proteins SO_2299 and SO_2300. The starred positions indicate the last codon of SO_2299 (left, chromosome position 2,412,384) and the first codon of SO_2300 (right, chromosome position 2,412,499). The arrow points to the newly postulated translational start site for SO_2300 (infC gene). (B) Nucleotide sequence of the chromosome region between SO_2299 and SO_2300. The first red segment is the C-terminal end of SO_2299, and the second red segment is the N-terminal region of SO_2300. The region covered by the three identified peptides is underlined, and the arrow indicates our suggested start position for SO_2300. (C) A TBLASTN comparison of the proposed new S. oneidensis MR-1 infC N terminus to genome sequences for S. baltica OS155, Shewanella sp. MR-4, S. putrefaciens strain CN32, S. loihica PV-4, Sodalis glossinidius (gi 84778498), and Photobacterium profundum (gi 46913734). The original start position is indicated by the arrow.
Figure 6.
(A) The positioning of two identified peptides (boxed) relative to the three-frame translation of the SO_0590 pseudogene and to the homologous protein from Shewanella W3–18–1. The nucleotide sequence is shown on top (in red), and the three translated frames are shown below (in black), with stop codon translations shown as asterisks (*). The locus containing the extra A, identified by re-evaluation of sequence trace TI|202865473 available in the NCBI trace archives, is indicated by the arrow. (B) Alignment of identified peptides (blue) with the gene SO_4538, annotated as a degenerate M16 protease. Alignment is shown to homologous sequences in S. baltica OS155 and Shewanella sp. ANA-3.
Figure 7.
Peptides from the N-terminal portion of conserved hypothetical protein SO_3842. The peptide breakage before the starred residue is produced when the signal peptide is cleaved and degraded. The other nontryptic peptides are properly contained in observed tryptic peptides and are most likely generated by post-digestion breakup. The N-terminal “ladder” observed for the tryptic peptide QMSIGTDTLLQIK is a likely result of aminopeptidase-driven trimming or in-source fragmentation.
Figure 8.
Most nontryptic peptides are contained in an identified tryptic peptide. The plot gives the distribution of residue distances from a nontryptic endpoint to the endpoint in the containing tryptic peptide. The plot describes 2178 peptides with a nontryptic C terminus and 3508 peptides with a nontryptic N terminus. Surprisingly, elimination of two residues (as opposed to a single residue) from the C terminus is particularly common. Peaks at even positions (2, 4, and 6) from the C terminus may reflect peptidyl dipeptidase activity that digests two amino acids at a time; dipeptidases acting at the N terminus are not known in Shewanella.
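The endpoint-distance distribution described in this caption can be sketched as follows. This is a simplified illustration, not the authors' method: peptides are given as hypothetical half-open `(start, end)` coordinates on the protein, each nontryptic peptide is matched to its shortest containing tryptic peptide, and any endpoint that differs from the container's endpoint is treated as the nontryptic one.

```python
from collections import Counter


def endpoint_distances(nontryptic, tryptic):
    """Tally residues trimmed from each end of the containing tryptic peptide.

    Peptides are (start, end) half-open coordinates on the same protein.
    Assumption: an endpoint that differs from the container's endpoint is
    the nontryptic one; peptides with no container are skipped.
    Returns (n_term_counts, c_term_counts) keyed by distance in residues.
    """
    n_term, c_term = Counter(), Counter()
    for s, e in nontryptic:
        containers = [(ts, te) for ts, te in tryptic if ts <= s and e <= te]
        if not containers:
            continue
        # Use the shortest containing tryptic peptide as the reference.
        ts, te = min(containers, key=lambda p: p[1] - p[0])
        if s > ts:
            n_term[s - ts] += 1
        if e < te:
            c_term[te - e] += 1
    return n_term, c_term


# Hypothetical coordinates, for illustration only.
toy_tryptic = [(10, 23)]
toy_nontryptic = [(10, 21), (12, 23), (11, 21)]
n_counts, c_counts = endpoint_distances(toy_nontryptic, toy_tryptic)
print(dict(n_counts), dict(c_counts))
```

A peak at distance 2 in the C-terminal tally would correspond to the two-residue loss that the caption highlights.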
Figure 9.
Distribution of the N termini of all noncovered peptides, and of those which also have no upstream coverage. Two peaks are observed at two and ∼20 amino acids. These correspond to N-terminal methionine cleavage and to cleavage of signal peptides.
Figure 10.
Fraction of peptides undergoing cleavage of N-terminal methionine, for a given second-position residue. Amino acids are arranged in increasing order by size of side chain. The in vitro data come from measurements of E. coli MAP enzyme efficiency (Hirel et al. 1989). The rates in vivo were estimated by counting the number of peptides. If X is the number of peptides that begin at residue 1 of a protein (indicating no cleavage) and Y is the number of peptides beginning at residue 2 (indicating a cleavage), the cleavage efficiency for that amino acid is defined as Y/(X + Y). Some amino acids are rarely used as the second residue of any of the TIGR genes (or GeneMark predictions). For instance, 308 protein sequences have serine at the second residue, while only nine have tryptophan. Because of this, our identifications contain relatively few N-terminal peptides for some amino acids. For the starred residues, 10 or fewer N-terminal peptides were observed with that residue at the second position.
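The in vivo cleavage efficiency Y/(X + Y) defined in this caption can be computed per second-position residue with a short sketch. The input format (pairs of second residue and peptide start position) and the function name are assumptions made for illustration, and the observations shown are invented toy data, not results from the study.

```python
from collections import defaultdict


def nme_cleavage_efficiency(observations):
    """Per-residue N-terminal methionine cleavage efficiency, Y / (X + Y).

    `observations` is an iterable of (second_residue, start_position) pairs,
    one per N-terminal peptide, where start_position is 1 (methionine
    retained, counted in X) or 2 (methionine cleaved, counted in Y).
    Other start positions are ignored.
    """
    counts = defaultdict(lambda: [0, 0])  # residue -> [X, Y]
    for residue, start in observations:
        if start == 1:
            counts[residue][0] += 1
        elif start == 2:
            counts[residue][1] += 1
    return {r: y / (x + y) for r, (x, y) in counts.items() if x + y > 0}


# Hypothetical observations, for illustration only.
toy_observations = [("S", 2), ("S", 2), ("S", 2), ("S", 1), ("W", 1)]
print(nme_cleavage_efficiency(toy_observations))
```

Residues with very few N-terminal peptides, such as the starred ones in the figure, give unstable ratios, so it may help to report X + Y alongside each efficiency.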
Figure 11.
(Top) Sequence logo for the amino acid sequence motif of all signal peptides identified by MS/MS analysis. Position −1 corresponds to the last residue of the signal peptide. (Middle) Sequence logo for Gram-negative bacteria employed by PrediSi (Hiller et al. 2004). (Bottom) Sequence logo for Gram-negative bacteria employed by SignalP (Nielsen et al. 1997).
Figure 12.
(A) Venn diagram of all signal peptide predictions on confirmed proteins. A total of 94 signal peptide cleavage sites are validated by mass spectrometry (23 of them missed by both SignalP and PrediSi). (B) Number of signal peptide predictions by SignalP (89) and PrediSi (38) rejected due to the observation of peptides upstream of the predicted cleavage site. Eight of these sites were predicted by both tools.
Figure 13.
Sequence of phosphoribosylformylglycinamidine cyclo-ligase (SO_2760) in Shewanella oneidensis MR-1 according to TIGR annotation (red), observed nontryptic peptide (blue) and alignment to the orthologs in other Shewanella strains (green).
