Mol Cell Proteomics, 9 (6), 1260-70

Template Proteogenomics: Sequencing Whole Proteins Using an Imperfect Database

Natalie E Castellana et al. Mol Cell Proteomics.

Abstract

Database search algorithms are the primary workhorses for the identification of tandem mass spectra. However, these methods are limited to the identification of spectra for which peptides are present in the database, preventing the identification of peptides from mutated or alternatively spliced sequences. A variety of methods has been developed to search a spectrum against a sequence allowing for variations. Some tools determine the sequence of the homologous protein in the related species but do not report the peptide in the target organism. Other tools consider variations, including modifications and mutations, in reconstructing the target sequence. However, these tools will not work if the template (homologous peptide) is missing in the database, and they do not attempt to reconstruct the entire protein target sequence. De novo identification of peptide sequences is another possibility, because it does not require a protein database. However, the lack of a database reduces the accuracy. We present a novel proteogenomic approach, GenoMS, that draws on the strengths of database and de novo peptide identification methods. Protein sequence templates (i.e., proteins or genomic sequences that are similar to the target protein) are identified using the database search tool InsPecT. The templates are then used to recruit, align, and de novo sequence regions of the target protein that have diverged from the database or are missing. We used GenoMS to reconstruct the full sequence of an antibody by using spectra acquired from multiple digests using different proteases. Antibodies are a prime example of proteins that confound standard database identification techniques. The mature antibody genes result from large-scale genome rearrangements with flexible fusion boundaries and somatic hypermutation. Using GenoMS we automatically reconstruct the complete sequences of two immunoglobulin chains with accuracy greater than 98% using a diverged protein database. Using the genome as the template, we achieve accuracy exceeding 97%.

Figures

Fig. 1.
An overview of the production of a mature immunoglobulin. Bottom, the mature immunoglobulin protein structure contains two identical light chains and two identical heavy chains. The germline heavy-chain and light-chain loci (top) contain many different gene segments. During heavy-chain gene rearrangement, in B-cell differentiation, one V, one D, and one J gene segment are combined. For light-chain gene formation, a V and a J gene segment are combined. The combined VDJ or VJ segments are joined by a splice junction to a constant region.
Fig. 2.
The template proteogenomic method reconstructs a target protein sequence using tandem mass spectra and a template database in three steps: template-chain selection, anchor extension, and sequence construction. The template database specifies ordering and mutual exclusion constraints between templates. A set of templates is selected that obeys these constraints based on peptides identified on them. Anchors are peptides identified by searching spectra against the template database. Anchors are extended by aligning spectra that overlap the anchor. Finally, the sequence is reconstructed by merging the extended anchor sequences.
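The template-chain selection step can be illustrated with a minimal sketch. It assumes, purely for illustration, that templates in the same class (for example, all V segments) are mutually exclusive, that classes must appear in a fixed order, and that the best template in a class is the one with the most identified peptides. The function name, the input dictionaries, and the scoring rule are hypothetical and are not taken from GenoMS.

```python
def select_template_chain(peptide_hits, template_class, class_order):
    """Pick at most one template per class, returned in the required class order.

    peptide_hits:   template id -> number of peptides identified on it
    template_class: template id -> class label (same class = mutually exclusive)
    class_order:    class labels in the order they must appear in the chain
    All three inputs are hypothetical structures used only for this sketch.
    """
    best = {}
    for template, hits in peptide_hits.items():
        cls = template_class[template]
        # Ignore templates with no peptide evidence; keep the best-supported
        # template per class (the first one seen wins ties).
        if hits > 0 and (cls not in best or hits > peptide_hits[best[cls]]):
            best[cls] = template
    return [best[cls] for cls in class_order if cls in best]


hits = {"V1": 5, "V2": 9, "J1": 2, "J2": 0, "C1": 7}
classes = {"V1": "V", "V2": "V", "J1": "J", "J2": "J", "C1": "C"}
chain = select_template_chain(hits, classes, ["V", "J", "C"])
```

Here `chain` holds one V, one J, and one C template in that order. A class with no identified peptides is simply left out of the chain in this sketch.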
Fig. 3.
The partial alignment of a spectrum to an anchor sequence and consequent extension of the anchor sequence. Top, a PRM spectrum is shown with a partial alignment to the theoretical PRM spectrum of an anchor. The C terminus of the spectrum is not aligned. Bottom, the overhanging peaks enable the extension of the anchor sequence by two amino acids (QT).
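The extension idea in this caption can be sketched as follows: PRM peaks that lie beyond the end of the anchor are read off one at a time, and each mass gap is matched against amino acid residue masses. This is a simplified greedy sketch, not the GenoMS algorithm. It assumes the spectrum PRMs are already expressed in the anchor's mass coordinates, that there are no noise peaks, and that the mass tolerance (`tol`, a hypothetical parameter) is tight enough to separate Q from K. Residue masses are standard monoisotopic values; I is omitted because it is isobaric with L.

```python
# Monoisotopic residue masses in Da (I omitted: isobaric with L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def prefix_masses(peptide):
    """Cumulative residue masses of a peptide (its theoretical PRM ladder)."""
    total, ladder = 0.0, []
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def extend_anchor(anchor, spectrum_prms, tol=0.01):
    """Greedily extend `anchor` using PRM peaks that overhang its C terminus."""
    current = prefix_masses(anchor)[-1]
    overhang = sorted(m for m in spectrum_prms if m > current + tol)
    extension = []
    for peak in overhang:
        gap = peak - current
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tol]
        if len(matches) != 1:
            break  # ambiguous or unexplained gap: stop extending
        extension.append(matches[0])
        current = peak
    return anchor + "".join(extension)


# A spectrum whose PRMs cover the anchor "VCAK" and overhang it by two residues.
extended = extend_anchor("VCAK", prefix_masses("VCAKQT"))
```

With a wider tolerance, the Q and K gaps (about 0.036 Da apart) would both match and the sketch would stop rather than guess.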
Fig. 4.
The profile HMM used to align spectra and produce a consensus spectrum. A, the spectrum profile HMM derived from the anchor “VCAK” after aligning four spectra, all of which are aligned to the state I5. The peaks aligned to that state are shown and suggest a candidate Match state at mass 515.33 Da. B, the same HMM after we performed model surgery to add the new Match state.
Fig. 5.
A set of spectra is shown overlapping a region of the predicted sequence. A spectrum supports a mass interval in the predicted sequence if both adjacent PRMs to the interval are matched in the spectrum. The confidence of each mass interval is the fraction of overlapping spectra that support the interval (with pseudocounts). The PRMs of the overlapping spectra that are necessary to support the mass interval corresponding to “C” are circled.
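The confidence measure described in this caption can be written out as a short sketch. The caption does not give the exact pseudocount scheme, so the form used here, (support + p) / (overlap + 2p), is an assumption, as are the function name, the tolerance, and the representation of each spectrum as a `(start, end, peaks)` tuple in the mass coordinates of the predicted sequence.

```python
def interval_confidence(lo, hi, spectra, tol=0.01, pseudo=1.0):
    """Fraction of overlapping spectra that support the mass interval [lo, hi].

    spectra: iterable of (start, end, peaks) tuples, where start/end bound the
             region of the predicted sequence the spectrum covers and peaks is
             its list of PRM masses (hypothetical layout for this sketch).
    A spectrum supports the interval only if it matches both bounding PRMs.
    """
    overlapping = 0
    supporting = 0
    for start, end, peaks in spectra:
        if start - tol <= lo and hi <= end + tol:
            overlapping += 1
            has_lo = any(abs(p - lo) <= tol for p in peaks)
            has_hi = any(abs(p - hi) <= tol for p in peaks)
            if has_lo and has_hi:
                supporting += 1
    # Assumed pseudocount form: one pseudo-observation for and one against.
    return (supporting + pseudo) / (overlapping + 2 * pseudo)


spectra = [
    (0.0, 500.0, [99.07, 202.08, 273.11]),   # matches both PRMs
    (0.0, 500.0, [99.07, 202.08]),           # matches both PRMs
    (0.0, 500.0, [99.07, 273.11]),           # misses the PRM at 202.08
    (300.0, 600.0, [401.21]),                # does not overlap the interval
]
conf = interval_confidence(99.07, 202.08, spectra)
```

Under this assumed scheme an interval with no overlapping spectra gets a neutral confidence of 0.5 rather than an undefined value.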
Fig. 6.
The average accuracy of each position in the extension. The accuracy of the extension degrades for positions close to the end of the extension, whereas the number of predictions increases. Each data point is annotated with the total number of anchors extended to that position or further.
Fig. 7.
The accuracy of extension as a function of the position from the end of the extended sequence. A, the aBTLA heavy and light chains reconstructed from protein template databases. The gray rectangles are anchors; the arrows, annotated with sequence, are the extended and merged sequences. Text above the anchors indicates the GI number of the template used, and coordinates within or below the anchors indicate their position within the template. Red amino acids were incorrectly predicted. B, the aBTLA heavy chain identified using a genomic template database. The anchors were identified using templates from the locus reverse strand. Anchor ordering and genomic position are annotated with reference to the forward strand. The coordinates of each anchor on the chromosome are shown. Red portions of the anchors are incorrectly incorporated anchor sequence. C, the heavy chain sequence produced by using increasingly divergent templates. The reconstructions at 85, 75, and 65% similarity to the aBTLA heavy chain sequence are shown.
Fig. 8.
The annotation of the BSA gene using a genomic template database. Twelve exons for the gene are shown, with corresponding extensions. Each anchor is annotated with its genomic coordinates.
