Background

Post-Translational Modifications

Post-translational modifications (PTMs) play a critical role in the function of proteins. The nature and complexity of the molecules attached to a protein vary greatly. Classes of molecules used to modify proteins include lipids, glycans, phosphates, and other proteins, among many others.

Mass Spectrometry

Mass spectrometry (MS) has emerged as a powerful tool to screen complex biological samples for many types of PTMs. In most cases, proteins to be studied by MS are digested with trypsin, which cleaves peptide bonds on the C-terminal side of K and R residues. The resulting peptides may be sequenced using liquid chromatography-electrospray ionization-tandem mass spectrometry (LC-ESI-MS/MS), in which (a) the peptide mix is separated by reversed-phase liquid chromatography (RPLC); (b) peptides eluting from the RPLC column are subjected to electrospray ionization (ESI), a process which ionizes the peptides and transfers them to the gas phase, and the gas-phase peptide ions are introduced directly into the mass spectrometer; (c) an initial mass measurement (the survey or parent ion scan) is conducted on all peptides eluting at a given moment in time; and (d) individual peptides (usually the several most abundant species within a given parent ion scan) are isolated and subjected to collision-induced dissociation (CID). Peptide ion selection and CID are repeated thousands of times in a typical MS analysis.

Database Search Software

During CID, various types of fragment ions differing in structure and charge state are generated. The b- and y-ion series represent fragmentation at consecutive peptide bonds (Fig. 1A, box i), and are used in typical proteomics applications to reconstruct the peptide sequence. Standard database searching algorithms compare acquired CID spectra to theoretical spectra generated in silico from a relevant protein database, and identify any resulting matches. Many types of PTMs do not undergo fragmentation in the CID process, and therefore alter one or more of the fragment ions of a modified peptide by a predictable, indivisible mass. For example, a phosphate group conjugated to a tyrosine residue may be successfully identified by current peptide sequencing software, using the molecular mass of the phosphate group (~80 Da) as a variable modification of Tyr. However, this type of analysis is not generally possible with PTMs that also fragment during CID, because the additional PTM-derived fragment ions obscure the fragmentation pattern of the modified target peptide. For instance, tryptic cleavage of a SUMO-1 conjugate results in a target protein-derived peptide covalently linked to a 19 aa C-terminal SUMO-1 fragment. Thus, when a SUMO-1 conjugated tryptic peptide is subjected to CID, both the target peptide and the conjugated C-terminal SUMO tryptic peptide are fragmented, resulting in multiple overlapping b- and y-fragment ion series. This composite spectrum is uninterpretable using standard peptide sequencing software. All human SUMO conjugates generate similar composite fragment ion patterns; tryptic digestion of SUMO-3 yields a C-terminal fragment of 16 aa, while the conjugated C-terminal tails of SUMOs 2 and 4 are 32 aa in length.
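The relationship between a peptide sequence, its b- and y-ion series, and a non-fragmenting variable modification can be illustrated with a minimal sketch. The residue masses are standard monoisotopic values rather than values taken from this text, and the function name fragment_ions and its mods argument are hypothetical, introduced only for illustration.

```python
# Standard monoisotopic residue masses in Da (not taken from the text above).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # Da
PROTON = 1.00728   # Da


def fragment_ions(peptide, mods=None):
    """Return singly charged b- and y-ion m/z lists for a peptide.

    mods (hypothetical argument) maps a 0-based residue index to the extra
    mass of a modification that stays intact during fragmentation.
    """
    mods = mods or {}
    masses = [RESIDUE_MASS[aa] + mods.get(i, 0.0) for i, aa in enumerate(peptide)]

    b_ions, running = [], 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), read from the N-terminus
        running += m
        b_ions.append(running + PROTON)

    y_ions, running = [], 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), read from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)

    return b_ions, y_ions
```

Calling fragment_ions("KEGEYIK", {4: 79.96633}) shifts only those fragments that contain the tyrosine by the phosphate mass, which is why such a modification remains tractable for standard search software. A modification that itself fragments does not behave this way.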

Ubiquitin-Like Modifiers

SUMOylation has been demonstrated to play critical roles in transcription, cell cycle control, chromatin integrity and dynamics, and nuclear-cytoplasmic transport. Several large-scale proteomics projects have identified hundreds of putative SUMO target proteins, providing invaluable information concerning cellular processes that are likely to be regulated by SUMOylation. However, in order to understand how SUMO conjugation affects the function of a particular target protein, the number and location of the modification site(s) must be determined. Very few SUMO modification sites have been identified using MS, and no automated method has been available to scan large MS datasets for SUMO modification sites. For example, Denison et al. identified 251 putative S. cerevisiae SUMO targets, but only six modification sites. Using a different approach, Wykoff and O'Shea identified 82 putative yeast SUMO conjugates, but only five modification sites. A consensus site for SUMO-1 conjugation was defined as ΨKXE, where Ψ is a large hydrophobic residue (primarily I, L or V), K is the lysine residue to which SUMO is conjugated, and X is any amino acid. While the majority of known SUMO conjugation sites conform to this consensus sequence, SUMOylation also clearly takes place on lysine residues located within non-consensus regions: e.g. the TEL, PML, and Smad4 proteins are SUMOylated at TKED, AKCP, and VKYC, respectively. Since no large-scale study of SUMOylation sites has been conducted, the proportion of modification sites conforming to the consensus sequence in vivo is unknown. To date, identification of the majority of SUMOylation sites has been accomplished using labor-intensive mutational analyses. Determination of modification sites in larger proteins, especially if the protein is multiply SUMOylated and/or SUMO conjugation occurs in non-consensus regions, can therefore become a burdensome task. In addition, identifications based solely on mutational analyses can yield spurious results due to disruption of protein secondary structure.

Here, we present a novel approach to the identification of PTM sites using MS and a new software tool, SUMmOn. Notably, this approach is generally applicable to any type of modification that generates a diagnostic fragment ion series.

SUMmOn

A. At the core of SUMmOn is a routine to extract intensity readings for any user-specified b- and y-ion series (plus or minus a user-definable tolerance level) from every MS/MS scan in a given MS analysis. Depending on how the particular modification is conjugated to a target peptide, either its b- or y-ion series will be dependent on the mass of the modified peptide: e.g. if the modification is attached via its C-terminus, as in the case of SUMO, the y1+-ion series (dependent) must be recalculated for each MS/MS scan by subtracting the masses of the b1+-fragments (independent) from the singly charged precursor ion mass, and adding the mass of one hydrogen atom. Given the non-isotopic mass resolution of typical ion trap instruments, this step is repeated four times for each scan, assuming all possible charge states for the precursor ion, from +1 to +4. The bq+- and yq+-ion series (where 1 ≤ q ≤ precursor ion charge) are then calculated by adding q-1 hydrogen atoms, and dividing the resulting mass by q. Importantly, these ion series only account for those instances of fragmentation in which the target peptide remains intact; nevertheless (as demonstrated below) the unique SUMO fragmentation signature has proven to be remarkably effective for identifying SUMOylated peptides.
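The arithmetic described in this step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the SUMmOn source code: all function names are hypothetical, the proton mass stands in for the "hydrogen atom" of the text (the difference is far below ion trap tolerance), and taking the strongest peak within the tolerance window is one possible extraction rule among several.

```python
PROTON = 1.00728  # Da; used here for the "hydrogen atom" mentioned in the text


def singly_charged_mass(precursor_mz, charge):
    """Convert an observed precursor m/z at an assumed charge to its 1+ mass."""
    return precursor_mz * charge - (charge - 1) * PROTON


def dependent_series(b_series_1plus, precursor_mz, charge):
    """y(1+) = precursor(1+) - b(1+) + H for each independent b fragment."""
    mh = singly_charged_mass(precursor_mz, charge)
    return [mh - b + PROTON for b in b_series_1plus]


def at_charge(series_1plus, q):
    """Convert 1+ fragment masses to m/z at charge q: (m + (q - 1) * H) / q."""
    return [(m + (q - 1) * PROTON) / q for m in series_1plus]


def candidate_series(b_series_1plus, precursor_mz, max_charge=4):
    """Build b and y m/z lists for every assumed precursor charge 1..max_charge.

    Returns {assumed_charge: {q: (b_mz_list, y_mz_list)}} with 1 <= q <= charge.
    """
    result = {}
    for z in range(1, max_charge + 1):
        y1 = dependent_series(b_series_1plus, precursor_mz, z)
        result[z] = {q: (at_charge(b_series_1plus, q), at_charge(y1, q))
                     for q in range(1, z + 1)}
    return result


def extract_intensities(peaks, targets, tol=0.5):
    """For each target m/z, return the strongest intensity within +/- tol.

    peaks is a list of (m/z, intensity) pairs; 0.0 means no peak matched.
    """
    out = []
    for t in targets:
        hits = [inten for mz, inten in peaks if abs(mz - t) <= tol]
        out.append(max(hits) if hits else 0.0)
    return out
```

Because the precursor charge cannot be read from the isotope pattern on such instruments, candidate_series simply enumerates all four assumptions and leaves it to the scoring step to decide which one explains the spectrum best.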

B. Modification score. Next, each selected MS/MS scan is assigned a score. If we consider the matching of each m/z intensity peak as an independent Bernoulli event with a probability p (where p is obtained by dividing the average number of matches per scan for a given precursor charge state q by the average number of peaks per scan), then the probability p(k) of matching k peaks in a scan with n total peaks follows a binomial distribution.
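Under this model, the probability of matching exactly k of n peaks is C(n, k) p^k (1 - p)^(n - k). The sketch below evaluates that expression. The text does not state how the probability is converted into a score, so the score function shown here (the negative log10 of the chance of k or more random matches) is an assumption made for illustration, and all function names are hypothetical.

```python
from math import comb, inf, log10


def match_probability(avg_matches_per_scan, avg_peaks_per_scan):
    """Per-peak probability p of a random match, as defined in the text."""
    return avg_matches_per_scan / avg_peaks_per_scan


def binomial_pmf(k, n, p):
    """Probability of matching exactly k of n peaks."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def binomial_tail(k, n, p):
    """Probability of matching k or more of n peaks by chance."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


def score(k, n, p):
    """Assumed score (not from the text): -log10 of the chance of >= k matches."""
    tail = binomial_tail(k, n, p)
    return inf if tail <= 0.0 else -log10(tail)
```

With this convention, a scan whose matches are unlikely to arise by chance receives a high score, and scans with different peak counts n remain comparable because n enters the probability directly.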

C. Following the scoring process, a list of potentially modified peptides is generated by comparing the measured masses of the precursor ions (minus the contribution of the modification) with theoretical peptide masses generated by an in silico tryptic digest of a relevant protein database. We first search the MS/MS data in each run against the human IPI database using SEQUEST, then filter and validate the SEQUEST output using Interact and ProteinProphet. A database containing only the remaining high-confidence hits is then generated with a Perl script, and used to identify putative target peptides. Because the database used to search for target peptides is relatively small, even a relatively low mass accuracy instrument may be used to identify SUMO modification sites in a simple protein mixture.
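The candidate search in this step can be sketched as an in silico digest followed by a mass comparison. The sketch makes several assumptions that are not stated in the text: cleavage uses only the "after K or R" rule given earlier, modification_mass denotes the net mass added to the target peptide by the conjugated modification, a candidate must contain a lysine that is not at its C-terminus (so that one missed cleavage is allowed), and the tolerance of 2.0 Da is arbitrary. All names are hypothetical.

```python
# Standard monoisotopic residue masses in Da (not taken from the text above).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # Da


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def tryptic_digest(protein, missed_cleavages=1):
    """Cleave after every K or R and join up to missed_cleavages extra pieces."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])

    peptides = set()
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def candidate_targets(precursor_mass, modification_mass, proteins,
                      tol=2.0, missed_cleavages=1):
    """List peptides whose mass matches the precursor minus the modification."""
    target_mass = precursor_mass - modification_mass
    hits = set()
    for protein in proteins:
        for pep in tryptic_digest(protein, missed_cleavages):
            has_internal_lysine = "K" in pep[:-1]   # assumed requirement
            if has_internal_lysine and abs(peptide_mass(pep) - target_mass) <= tol:
                hits.add(pep)
    return sorted(hits)
```

Restricting proteins to the high-confidence identifications from the same run keeps the candidate list short, which is what makes a wide mass tolerance workable.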

D. Target peptide score. Each potential target peptide sequence is assigned a score using the same algorithm described for the modification score. In this case, however, the probability of randomly matching one or more peaks is recalculated for each scan and target peptide candidate group. To this end, the sequences of all candidate target peptides (e.g. KEGEYIK) are reversed (KIYEGEK) and rotated (IYEGEKK, YEGEKKI, etc.) to generate a pool of "negative" target peptides. The number of peaks matched in these negative peptides is then divided by the total number of peaks in the scan to estimate the probability p of randomly matching a peak.
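The construction of the negative peptide pool can be sketched as follows. The function names are hypothetical. Averaging the match counts over the negative peptides before dividing by the number of peaks is an assumption made here so that the estimate stays between 0 and 1; the text does not specify how the counts are combined.

```python
def negative_peptides(peptide):
    """Reverse the sequence, then collect every rotation of the reversed form."""
    reversed_seq = peptide[::-1]
    pool = {reversed_seq[i:] + reversed_seq[:i] for i in range(len(reversed_seq))}
    pool.discard(peptide)  # a palindromic rotation would not be a true negative
    return sorted(pool)


def random_match_probability(matches_per_negative, n_peaks):
    """Estimate p from the peaks matched by the negative peptides.

    matches_per_negative holds one match count per negative peptide; the
    counts are averaged (an assumption) and divided by the scan's peak count.
    """
    if not matches_per_negative or n_peaks == 0:
        return 0.0
    mean_matches = sum(matches_per_negative) / len(matches_per_negative)
    return mean_matches / n_peaks
```

For KEGEYIK this yields seven negative sequences, beginning with KIYEGEK, IYEGEKK and YEGEKKI. Each has the same amino acid composition and mass as the candidate, so any peaks it matches reflect chance alone.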

E. SUMmOn output. The results of the SUMmOn analysis are stored in an XML instance document. A dynamically generated XSLT style sheet is then used to create an HTML file that is formatted via Cascading Style Sheets (CSS) and viewed in a web browser. The introductory page displays the modification score distribution. The user may then view all scored MS/MS spectra, or choose to view particular MS/MS scans of interest by specifying any of several descriptors (e.g. scan number, charge state, precursor mass, etc.) in the SUMmOn control center window. If one or more putative target peptides are predicted by SUMmOn, they are indicated with the corresponding scan (see tutorial).