Proteomics 2009, 9, 3741–3751
DOI 10.1002/pmic.200800701
RESEARCH ARTICLE
LymPHOS: Design of a phosphosite database of primary human T cells

David Ovelleiro, Montserrat Carrascal, Vanessa Casas and Joaquin Abian

CSIC/UAB Proteomics Laboratory, Instituto de Investigaciones Biomédicas de Barcelona-Consejo Superior de Investigaciones Científicas, IDIBAPS, Barcelona Autonomous University, Bellaterra, Spain
Current proteomic technology is capable of producing huge amounts of analytical information, which is often difficult to manage in a comprehensive form. Curation, further annotation and public communication of proteomic data require the development of standard data formats and efficient, multimedia database structures. We have implemented a workflow for the annotation of a phosphopeptide database (LymPHOS) that includes tools for MS data filtering and phosphosite assignation, mass spectrum visualization, experimental description and accurate phosphorylation site assignation. Experimental annotations were fitted to current minimum information about a proteomics experiment guidelines. A new guideline for phosphoprotein sample preparation is also proposed. Currently, the database describes 342 phosphorylation sites mapping to more than 200 gene sequences, and it can be accessed through the net (http://www.lymphos.org).
Received: September 1, 2008 Revised: March 17, 2009 Accepted: April 20, 2009
Keywords: LymPHOS database / Phosphoproteome / T-lymphocyte
Correspondence: Dr. Joaquin Abian, LP CSIC/UAB, Facultat de Medicina, Edifici M, Universitat Autònoma de Barcelona, Campus UAB, 08193 Bellaterra, Spain
E-mail: [email protected]
Fax: +34-93-581-49-13

Abbreviations: MIAPE, minimum information about a proteomics experiment; p-site, phosphorylation site; XDK, Xcalibur Development Kit

These authors contributed equally to the work.

© 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
www.proteomics-journal.com

1 Introduction

Large-scale characterization of PTMs is often carried out by shotgun approaches using PTM-specific isolation and detection methods [1–3]. Current proteomic technology is capable of producing huge amounts of analytical information that is frequently difficult to handle in a comprehensive form. An LC-MS/MS study consisting of a few 90 min chromatograms can produce several thousand MS/MS spectra that then need to be evaluated. Consequently, curation, further annotation and public communication of proteomic data require the development of standard data formats and efficient, multimedia database structures. Moreover, re-evaluation of the quality and significance of the identification data in a database by an external user should be facilitated by easy access to more specific information, such as experimental conditions and individual spectrum raw data. This information is especially relevant for the analysis of phosphorylation sites (p-sites), where p-site assignation is often derived from the combination of MS2 and MS3 spectral data, and several analytical tools are often required to resolve ambiguous p-site assignations [4, 5].

Several open access, dedicated phosphoprotein databases have been built by a number of groups [6–12]. Most of these contain predicted as well as experimental data extracted from the literature, and recently some of them [6, 11, 12] have included tools for on-line visualization of spectral data. Probably the largest repository of p-sites is Phosphosite, which collects more than 50 000 experimental or predicted p-sites from nearly 180 cell lines and 16 distinct tissues from human, mouse, rat and other species (www.phosphosite.org). Phosphosite records can be searched in several ways (peptide, species, disease, etc.) and a viewer is provided to visualize the spectra. Another database of wide coverage, PhosphoELM, contains nearly 18 300 p-sites extracted from experimental data from the literature, of which about 13 000 are from mouse or human [7]. PhosphoELM is focused on the study of kinases and kinase substrates and provides search methods for this purpose; however, it does not include mass spectral information, for which a search of the source literature is required.

Besides these major databases, smaller repositories such as the rat Collecting Duct Phosphoprotein Database (http://dir.nhlbi.nih.gov/papers/lkem/cdpd/, about 750 p-sites) or PhosphAT [9] (4070 non-redundant phosphopeptides) hold data from only one specific cell, tissue or model organism addressed in a specific area of research. Often, these dedicated databases later extend their scope, as was the case for Phosida [8] and PhosphoPep [6]. Initially built to hold data from phosphoproteomics experiments in HeLa cells, Phosida is one of the most carefully curated databases available for phosphopeptides. At present, Phosida holds phosphopeptide data from B. subtilis, Mus musculus, Lactococcus lactis and E. coli, and includes p-sites, phosphoprotein information and some useful tools for p-site analysis. Recently, the PhosphoPep database, initially containing experimental phosphopeptide data on D. melanogaster, has also been extended to include data on humans, S. cerevisiae and C. elegans. Although this database stores data from several laboratories, a common curation process ensures homogeneous identification and annotation.

Here, we describe the design of a phosphopeptide database (LymPHOS) focused on the phosphoproteome of human primary T-lymphocytes, a model for the study of signal transduction and altered immune response for which no phosphopeptide data are available in other databases. LymPHOS currently holds all the experimental data on p-site identification by LC-MS/MS carried out in our laboratory [13], and its design attempts to fulfill the above-mentioned requirements for efficient data access and analysis.
For its annotation, we implemented a workflow that includes tools for filtering MS data and p-site identification, mass spectra visualization, experimental description and accurate p-site assignation. All spectra supporting a p-site assignation were stored in the database and presented graphically with the corresponding experimental information and identification parameters and scores. Experimental annotations were fitted to current minimum information about a proteomics experiment (MIAPE) guidelines [14]. A new guideline for phosphoprotein sample preparation is also proposed.
2 Materials and methods
2.1 Sample preparation

Human primary T cells were isolated from buffy coats obtained from the Blood Bank of the "Hospital Clinic" and the "Hospital Vall d'Hebron" (Barcelona, Spain) by Ficoll-Paque (GE, Uppsala, Sweden) gradient centrifugation following standard procedures. Each purification was started from 100 mL buffy coat. Purified lymphocytes were lysed and protein extracts were prepared as described [13]. Protein extracts were separated in a hand-poured 10% acrylamide SDS-PAGE gel (MiniProtean, Bio-Rad) and stained with Coomassie Blue. The gel lane was then divided into 11 slices, which were in-gel digested with trypsin (Promega, Madison, WI, USA) using a Digest MSPro robot (Intavis, Koeln, Germany) and following standard procedures. Extracts were evaporated to dryness and redissolved in 200 µL 250 mM AcOH/30% acetonitrile.
2.2 Phosphopeptide enrichment

Phosphopeptide enrichment was performed using sequential IMAC and TiO2 as previously described [13]. Briefly, 20 µL IMAC resin (Phos-Select iron affinity gel, Sigma, St. Louis, MO, USA) was added to each peptide extract and the mixtures were incubated for 90 min at 20°C. Phosphopeptides were eluted with 0.5% NH4OH. The non-retained fraction was concentrated in the SpeedVac to 10 µL, diluted five times with 1 M glycolic acid, 5% TFA and 80% acetonitrile, and loaded into a TiO2 minicolumn. Phosphopeptides were eluted from the tip with 0.5% NH4OH followed by 1 µL 30% ACN.
2.3 LC-MSn analysis

IMAC and TiO2 eluates were analyzed by LC-MSn using an LTQ linear ion trap equipped with a microESI ion source and controlled by Xcalibur 2.0 SR2 (ThermoFisher, San Jose, CA, USA). Each extract was concentrated to about 5 µL and diluted to 40 µL with 1% formic acid. Separation was carried out using a C18 preconcentration cartridge (Agilent Technologies, Barcelona, Spain) connected to a 10 cm long, 150 µm id Vydac C18 column (Vydac, IL, USA). Separation was done at 1 µL/min using a linear ACN gradient from 0 to 40% in 60 min (solvent A: 0.1% formic acid; solvent B: ACN, 0.1% formic acid). The LTQ instrument was operated in positive ion mode with a spray voltage of 2 kV. The spectrometric analysis was performed in an automatic dependent mode. Each acquisition cycle comprised a full scan (m/z 400–2000) followed by eight product spectra on the corresponding most abundant precursor ions in the full scan. When a signal derived from a neutral loss of 98, 49 or 32.7 (loss of H3PO4 for +1, +2 and +3 charged ions, respectively) from the precursor ion was detected among the ten most intense signals in a given MS/MS spectrum, a subsequent MS3 scan was performed on that ion. Dynamic exclusion was set to one with a time window of 5 min, in order to minimize the redundant selection of precursor ions.
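The neutral-loss trigger described above can be illustrated with a minimal Python sketch. It is not the instrument's firmware logic: the function names, the 0.5 m/z matching tolerance and the example peak list are assumptions made for illustration, while the nominal 98 Da loss, the charge states 1 to 3 and the "ten most intense signals" rule come from the text.

```python
H3PO4_LOSS = 98.0  # nominal mass lost as H3PO4, in Da (value quoted in the text)


def neutral_loss_offsets(max_charge=3):
    """Return the m/z shift caused by loss of H3PO4 for each charge state."""
    return {z: H3PO4_LOSS / z for z in range(1, max_charge + 1)}


def find_neutral_loss_peak(precursor_mz, peaks, top_n=10, tol=0.5):
    """Look for a phosphate neutral-loss signal among the most intense peaks.

    peaks is a list of (mz, intensity) tuples from one MS/MS spectrum.
    tol is a hypothetical matching tolerance in m/z units.
    Returns (charge, mz) for the first charge state that matches, else None.
    """
    most_intense = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    for z, offset in neutral_loss_offsets().items():
        target = precursor_mz - offset
        for mz, _intensity in most_intense:
            if abs(mz - target) <= tol:
                return z, mz
    return None


if __name__ == "__main__":
    # Hypothetical MS/MS peak list for a doubly charged precursor at m/z 869.0.
    demo_peaks = [(820.0, 100.0), (500.0, 50.0), (662.4, 20.0)]
    print(neutral_loss_offsets())
    print(find_neutral_loss_peak(869.0, demo_peaks))
```

A spectrum for which `find_neutral_loss_peak` returns a match would, under this sketch, be the kind of scan that triggers an MS3 acquisition on the matched ion.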
2.4 Spectrometric data interpretation and filtering

LC/MS files in the Xcalibur raw format were analyzed using Bioworks v3.3, a ThermoFisher implementation of the Sequest [14] algorithm. Search parameters were: peptide mass tolerance, 2 Da; fragment tolerance, 0.8 Da; digestion rule, 'KR no P' (trypsin), allowing up to two missed cleavages; static modification, carbamidomethylated cysteine (+57 Da); dynamic modifications, methionine oxidation (+16 Da), phosphorylation on Ser, Thr and Tyr (+80 Da) and loss of water from Ser and Thr (β-elimination of phosphoric acid from the corresponding phospho-amino acid). The database search was performed by limiting the tentative charge of precursor ions to a maximum of +3. Each mass spectrum was searched using the composite target-reverse database strategy, which allows efficient data filtering with a selected ratio of false-positive hits [15, 16]. The target database was prepared from the Uniprot-Swissprot and Uniprot-Trembl [17] human databases (UniProt Knowledgebase Release 14.0, 53 550 and 19 045 protein sequences, respectively) by converting the original dat format to fasta and combining the two archives. Bioworks search results were exported to Microsoft Excel format, from which scan number, precursor mass and charge, and search scores were extracted.

In addition, the original raw files were processed using the Xcalibur Development Kit (XDK). The XDK is a suite of programmable COM objects provided with Xcalibur, which allow display and manipulation of data and access to Xcalibur files. Using the Perl Win32::OLE module to link to the XDK library, the information on the MS stage (MS2 or MS3) and the mass–intensity arrays of each spectrum were extracted and stored in an automated way. This information was merged automatically with that provided by Bioworks on the basis of the scan number of each spectrum available in the Bioworks Excel report.
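The composite target-reverse database mentioned above can be sketched in a few lines of Python. This is only an illustration of the general strategy, not the authors' conversion script: the function names, the `REV_` header prefix and the toy sequences are assumptions.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    entries, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def add_reversed_decoys(entries, prefix="REV_"):
    """Return the target entries followed by one reversed decoy per target.

    The prefix used to mark decoy entries is a hypothetical naming choice.
    """
    decoys = [(prefix + header, seq[::-1]) for header, seq in entries]
    return entries + decoys


def to_fasta(entries, width=60):
    """Serialize (header, sequence) pairs back to FASTA text."""
    lines = []
    for header, seq in entries:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy sequences, not real UniProt entries.
    demo = ">P1 demo\nMKT\nAS\n>P2\nGG\n"
    print(to_fasta(add_reversed_decoys(read_fasta(demo))))
```

Hits to entries carrying the decoy prefix can then be counted after the search to estimate the proportion of false positives at a given score cutoff.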
To efficiently reduce false-positive identifications, two scores were used in combination: the Xcorr (the main Sequest score, representing the cross-correlation between experimental and theoretical spectra) and an in-house D-value (the discriminant score used by Peptide Prophet [18]) calculated from the scores provided by Bioworks. For each data set, the cutoff values of Xcorr and D-value that gave a 1% false discovery rate were used [13]. Only putatively phosphorylated peptides were selected for further analysis. Peptides with scores lower than the cutoff values, or those from decoy proteins, were rejected. However, as previously described [13], in order to improve the efficiency of p-site location, MS2-MS3 spectra pairs pointing to the same peptide sequence were stored even when one of them was under the cutoff value.
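As a rough illustration of how a pair of cutoffs can be checked against the decoy hits, the following Python sketch estimates the false discovery rate for given Xcorr and D-value thresholds. The estimator used here (twice the number of decoy hits divided by all accepted hits) is one common convention for composite target-reverse searches and is an assumption, as are the function name, the dictionary keys and the example values; the exact procedure used for LymPHOS is the one described in [13].

```python
def estimate_fdr(hits, xcorr_cut, d_cut):
    """Estimate the FDR among hits passing both score cutoffs.

    hits is a list of dicts with the hypothetical keys 'xcorr', 'd_value'
    and 'is_decoy'. Returns (estimated FDR, number of accepted target hits).
    The 2 * decoys / accepted estimator is an assumed convention.
    """
    passing = [h for h in hits
               if h["xcorr"] >= xcorr_cut and h["d_value"] >= d_cut]
    if not passing:
        return 0.0, 0
    n_decoy = sum(1 for h in passing if h["is_decoy"])
    fdr = 2.0 * n_decoy / len(passing)
    return fdr, len(passing) - n_decoy


if __name__ == "__main__":
    # Made-up scores for illustration only.
    demo_hits = [
        {"xcorr": 3.5, "d_value": 2.0, "is_decoy": False},
        {"xcorr": 3.0, "d_value": 1.5, "is_decoy": False},
        {"xcorr": 2.8, "d_value": 1.2, "is_decoy": False},
        {"xcorr": 2.9, "d_value": 1.1, "is_decoy": True},
        {"xcorr": 1.0, "d_value": -0.5, "is_decoy": True},
        {"xcorr": 1.2, "d_value": 0.1, "is_decoy": False},
    ]
    print(estimate_fdr(demo_hits, xcorr_cut=2.5, d_cut=1.0))
```

Scanning a grid of cutoff pairs with such a function and keeping the pair that accepts the most target hits at an estimated FDR of 1% or less would reproduce the spirit of the filtering step.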
2.5 Phosphorylation site assignment: Ascore

After filtering, a set of high-confidence phosphorylated peptides is obtained. However, when multiple p-sites are possible in a given peptide, assignation of the phosphorylation site is often ambiguous. If two or more phosphorylations take place in a given peptide with three or more possible p-sites, the difficulty increases. To confirm the correct location of the p-site (or p-sites), the Ascore, a probability-based score described by Gygi's group [4], was used. Ascore provides a probabilistic estimation of the confidence of phosphorylation site assignment relative to the second-best alternative. The different p-site alternatives are ranked from best to worst depending on how many theoretical fragments (y and b series) are found in the experimental spectrum. When the best choice is not sufficiently better than the second one, the phosphorylation site is poorly scored. The first step in the Ascore calculation is to determine all the possible phosphopeptides that would be generated by alternative phosphate locations on the Ser, Thr and Tyr residues of the molecule. The Ascore calculation procedure assigns a score to each of these forms in order to elucidate the correct pattern of p-sites in the peptide. As described [4], an Ascore > 19 means that the probability of a correct assignation is 99% for the selected site, while an Ascore in the range 15–19 means a probability of > 90%. The number of possible phosphorylated sequences can be calculated with the binomial coefficient formula:
$$C_n^k = \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \qquad (1)$$
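Eq. (1) only counts the candidates. A minimal Python sketch of how they could be enumerated is given here, using combinations without repetition for the phosphate positions and, for MS3 data, for the water-loss positions among the remaining Ser and Thr residues. It is not the authors' Perl script; the function name is hypothetical, while the one-letter modification codes (s, t, y for phosphorylated and u, v for dehydrated residues) and the two example peptides are the ones used in this section and in Fig. 1.

```python
from itertools import combinations
from math import comb

PHOSPHO = {"S": "s", "T": "t", "Y": "y"}  # lowercase = phosphorylated residue
DEHYDRO = {"S": "u", "T": "v"}            # u, v = dehydrated Ser, Thr


def candidate_peptides(sequence, k_phospho, k_dehydro=0):
    """List every placement of k_phospho phosphates and k_dehydro water losses.

    sequence is the unmodified peptide in one-letter code.
    """
    p_sites = [i for i, aa in enumerate(sequence) if aa in PHOSPHO]
    candidates = []
    for phos in combinations(p_sites, k_phospho):
        # Water loss is only placed on Ser/Thr not already phosphorylated.
        w_sites = [i for i, aa in enumerate(sequence)
                   if aa in DEHYDRO and i not in phos]
        for dehyd in combinations(w_sites, k_dehydro):
            residues = list(sequence)
            for i in phos:
                residues[i] = PHOSPHO[sequence[i]]
            for i in dehyd:
                residues[i] = DEHYDRO[sequence[i]]
            candidates.append("".join(residues))
    return candidates


if __name__ == "__main__":
    short = candidate_peptides("KLEKEEEEGISQESSEEEQ", 1, 1)
    print(len(short), comb(3, 1) * comb(2, 1))
    long_ = candidate_peptides("TPSPLVLEGTIEQSSPPLSPTTK", 1, 1)
    print(len(long_), comb(8, 1) * comb(7, 1))
```

For peptides without Tyr, as in both examples, the number of generated candidates equals the product of the two binomial coefficients, which provides a simple consistency check.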
To automatically obtain all the phosphorylation patterns of a given phosphopeptide, we parameterized the problem by assigning a numerical index to each amino acid. Then, using a Perl script, we obtained the sequences resulting from the combinations without repetition of n possible p-sites taken in groups of k predicted phosphorylated sites (nP and kP, respectively, in Fig. 1A). In the case of MS3 spectra, the fragmentation of the phosphate neutral loss ion from a p-Ser or p-Thr leads to the assignation of the residue involved as the corresponding dehydrated form (dehydroalanine or dehydrobutyric acid, respectively). Thus, in cases where MS3 spectra were assigned to sequences containing only unmodified Thr, Ser and Tyr residues together with dehydrated forms of Thr or Ser, combinations were calculated as described before for MS2 spectra. However, when an MS3 spectrum indicated the presence of phosphorylated amino acids in addition to one or more dehydrated sites, another combination without repetition was performed for each combination of phosphorylated amino acids, using nW possible sites for water loss (total number of dehydrated and non-phosphorylated Ser and Thr residues) taken in groups of kW dehydrated sites, as derived from the assigned sequence (Fig. 1A). The number of possible combinations can be quite large: the observed peptide TPSPLVLEGtIEQSuPPLSPTTK, with one phosphorylation and one dehydrated site (t and u, respectively) in a sequence with eight possible sites to hold the phosphorylated or dehydrated forms, generates 56 different patterns for Ascore evaluation. The rest of the Ascore calculations were performed following Gygi's group (Fig. 1B).

[Figure 1 panels. (A) Obtaining peptide candidates: the number of candidates is $C_{n_P}^{k_P} \times C_{n_W}^{k_W}$. (a1) Phosphorylation pattern proposed by the search engine: KLEKEEEEGISQEusEEEQ, for which $C_3^1 \times C_2^1 = 6$. (a2) Six phosphopeptide alternatives: KLEKEEEEGIsQEuSEEEQ, KLEKEEEEGIuQESsEEEQ, KLEKEEEEGIuQEsSEEEQ, KLEKEEEEGISQEsuEEEQ, KLEKEEEEGISQEusEEEQ, KLEKEEEEGIsQESuEEEQ. (B) Ascore calculation: (b1) peptide candidates; (b2) theoretical-experimental correlation, plotted as score versus level (1–10); (b3) levels comparison; (b4) Ascore calculation, with $P(x) = \sum_{k=n}^{N} \binom{N}{k} p^{k} (1-p)^{N-k}$, Score = −10 × log(P(x)) and Ascore = Score_best − Score_second; (b5) best candidate and Ascore: KLEKEEEEGISQEusEEEQ, Ascore = 21.1.]

Figure 1. Precise phosphosite assignation. (A) Obtaining alternative peptide candidates (amino acid modifications are coded as s, t and y for phosphorylated Ser, Thr and Tyr, respectively, and as u and v for dehydrated/dephosphorylated Ser and Thr, respectively). (a1) In the example, the assigned peptide contains three serines, each in a different form (unmodified, phosphorylated and dehydrated). (a2) A list of all possible combinations is generated. The number of combinations obtained is given by the product of the combination of nP possible p-sites in the unmodified sequence taken in groups of kP predicted phosphorylated sites by the combination of nW water loss sites taken in groups of kW observed water losses. (B) Ascore calculation. The two best sequences are selected from the list and the Ascore for the first candidate is calculated. (b1) Peptide candidates. The theoretical spectrum for each peptide candidate is calculated (in the example, six in silico spectra are produced depicting the corresponding y and b ion series and considering charges +1 and +2). (b2) Theoretical-experimental correlation. For each candidate, coincidences with the experimental spectrum are scored using the cumulative binomial probability for the corresponding number of coincident ions. For this, the experimental spectrum is divided into sectors of 100 m/z units and coincidences between the theoretical ions and the most intense signals in each sector are determined. The evaluation is made at ten levels. In the first level only the most intense peak in each sector of the experimental spectrum is considered; the two most intense are considered in the second level, and so on. (b3) Level comparisons. The two candidates with the highest scores are selected and the level with the greatest difference between them is determined (the eighth level in the example). (b4) Score calculations. The scores are recalculated taking into account only p-site specific ions and using a theoretical probability given by the level of maximum difference determined in the previous step (p = 0.08 for level eight and sectors of 100 m/z units). The Ascore is the difference between the scores of the best and second best candidates. (b5) The candidate with the highest Ascore is reported.
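The scoring step summarized in Fig. 1B (cumulative binomial probability, score as −10 × log of that probability, and Ascore as the difference between the two best candidates) can be sketched in Python as follows. This is a simplified illustration, not a full Ascore implementation: peak matching is omitted, the function names are hypothetical, a base-10 logarithm is assumed, and the matched/total ion counts in the example are made up. The relation p = level / 100 follows the figure legend (p = 0.08 for level eight and sectors of 100 m/z units).

```python
from math import comb, log10


def cumulative_binomial(n_matched, n_trials, p):
    """P(X >= n_matched) for X ~ Binomial(n_trials, p)."""
    return sum(comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k)
               for k in range(n_matched, n_trials + 1))


def peptide_score(n_matched, n_trials, p):
    """Score = -10 * log10(P(x)); a higher score means a less likely chance match."""
    return -10.0 * log10(cumulative_binomial(n_matched, n_trials, p))


def ascore(best, second, level, window=100.0):
    """Difference between the scores of the two best candidates.

    best and second are (matched, total) counts of site-determining ions.
    level is the peak depth (peaks kept per window) of maximum difference,
    so the chance of a random match is p = level / window.
    """
    p = level / window
    return peptide_score(best[0], best[1], p) - peptide_score(second[0], second[1], p)


if __name__ == "__main__":
    # Made-up counts: 4 of 6 site-determining ions matched for the best
    # candidate, 1 of 6 for the second best, evaluated at level 8.
    print(round(ascore((4, 6), (1, 6), level=8), 1))
```

Restricting the second calculation to site-determining ions, as the legend describes, is what makes the difference sensitive to the phosphate position rather than to the overall quality of the peptide match.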
2.6 Web application structure

The LymPHOS web application consists of a relational database and a web interface that allows data submission, querying and visualization. The database uses a MySQL (www.mysql.com) relational database management system. The web interface is based on server-side scripts that allow uploading of the required data (peptide- and protein-related information and experimental, MIAPE-compliant information). Scripts are written in Perl (www.perl.org) and PHP (www.php.net). Graphical visualization of spectra and fragmentation patterns is performed using the GD graphics library (www.boutell.com/gd).
3 Results and discussion
The LymPHOS database was designed to provide efficient storage of the different types of information related to a shotgun experiment in phosphoproteomics and, at the same time, to offer a simple and intuitive interface for further curation, analysis and consulting of the data. All the information provided by the LymPHOS database has direct experimental support and is restricted to phosphate-bearing sequences. Experimental information supporting each p-site assignment is shown at the peptide level. All the available MS2 and MS3 spectra justifying an assignation are presented graphically together with the corresponding experimental information and identification parameters and scores. Thus, the information for each p-site assignation always includes at least one annotated spectrum image. In some cases, such as for the peptides KAsGPPVSELITK and QAsIELPSMAVASTK, up to 52 different MS2 and MS3 spectra from different files and experiments are provided side-by-side with the analytical and spectrometric data. The user can see and download each mass spectrum used to define a p-site, and can access the experimental conditions used to generate each spectrum.

The full set of available data is stored in a MySQL relational database (Fig. 2) and includes experimental conditions, annotated mass spectra, peptide and protein sequences, proteomics search engine scores and p-site assignment. Some of these parameters were retrieved from the commercial search engine Bioworks, but others required the development of customized software in order to automatically extract the information from the original Thermo raw files. For database uploading, most of the information (peptide scores, peptides with p-site assignments, mass spectra and protein sequences) was packed in a text file designated the phos-file.

The web application that presents the LymPHOS database carries a minimal workload, as calculations and CPU-intensive routines are executed externally, prior to the uploading process. Moreover, a rapid-access, compact sequence database including only the proteins identified was used. Overall, these conditions provide optimal performance with minimum hardware requirements.

The workflow used in the LymPHOS database is summarized in Fig. 3. Experimental information is uploaded in a MIAPE [19] compliant structure. MIAPEs are a set of reporting guidelines for providing experimental information related to a proteomic analysis. MIAPEs are being actively developed by the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) in order to establish common standards for comparison and evaluation of experimental data. Available MIAPEs are associated with the main proteomics tasks such as 2-DE or LC-MS/MS analysis. In the development of our work, we lacked a standard for sample preparation in phosphoproteomics, and consequently we implemented a proposal form that included the most common tasks and parameters involved in this field. For this purpose, experimental metadata was compiled in the form of a structured list of fields. This information included experimental reagents and conditions for sample preparation, methods for phosphoprotein/phosphopeptide enrichment, as well as analytical parameters for chromatography and MS analysis (see Supporting Information). This content was stored in the form of a text file, which was uploaded into the database server through a special web form.

[Figure 2 tables and fields. (A) data: data id, phos-file name, raw file name, scan, MS stage, parent mass, delta mass, charge, original peptide, validated peptide, peptide probability, Xcorr, DeltaCn, Sp, RSp, Ions, Proteins count, D value, Ascore, condition id. (B) conditions: condition id, name, description. (C) spectra: spectrum id, mass array, intensity array. (D) phosphosites: phosphosite id, protein id, phosphorylation position inside peptide, phosphorylation position inside protein. (E) prots: protein id, Accession Number, SwissProt/TrEMBL ID, sequence. (F) prot_spec: protein id, data id.]

Figure 2. Scheme of LymPHOS database structure. (A) The data table stores information related to each data entry: scan number, phos-file name, raw file name, scores and a condition identifier (condition id). (B) The conditions table reflects each set of MIAPE documents related to each data entry. For example, condition id = 1 is associated with an "SDS-PAGE Titanium Resting Lymphocytes" experiment, with three associated MIAPE documents: Sample preparation, MS and MS informatics. (C) Each data entry is associated with a spectrum in the spectra table. (D) The phosphosites table associates each data entry with the phosphosites characterized in the peptide, the proteins mapped by this peptide and the position of the phosphate in the protein sequence. (E) The proteins table lists the full set of proteins mapped by the phosphopeptides in the LymPHOS database. (F) The prot_spec table relates each data entry (every peptide) with one or more proteins.
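The table layout of Fig. 2 can be sketched as a small relational schema. The Python example below uses the standard-library sqlite3 module purely as a stand-in for MySQL so that it runs without a server; the column names and types, the subset of columns kept, the helper names and all inserted values (file names, scores, accessions such as DEMO_A) are assumptions made for illustration. Only the table names, the peptide KAsGPPVSELITK and the condition name come from the text.

```python
import sqlite3

# Reduced, hypothetical version of the Fig. 2 tables (SQLite stand-in for MySQL).
SCHEMA = """
CREATE TABLE conditions (
    condition_id INTEGER PRIMARY KEY,
    name TEXT,
    description TEXT
);
CREATE TABLE data (
    data_id INTEGER PRIMARY KEY,
    phos_file TEXT, raw_file TEXT, scan INTEGER, ms_stage INTEGER,
    charge INTEGER, validated_peptide TEXT,
    xcorr REAL, d_value REAL, ascore REAL,
    condition_id INTEGER REFERENCES conditions(condition_id)
);
CREATE TABLE spectra (
    spectrum_id INTEGER PRIMARY KEY REFERENCES data(data_id),
    mass_array TEXT, intensity_array TEXT
);
CREATE TABLE prots (
    protein_id INTEGER PRIMARY KEY,
    accession TEXT, entry_name TEXT, sequence TEXT
);
CREATE TABLE prot_spec (
    protein_id INTEGER REFERENCES prots(protein_id),
    data_id INTEGER REFERENCES data(data_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up data entry."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO conditions VALUES (?, ?, ?)",
                 (1, "SDS-PAGE Titanium Resting Lymphocytes", "demo"))
    conn.execute("INSERT INTO data VALUES (?,?,?,?,?,?,?,?,?,?,?)",
                 (1, "demo.phos", "DEMO.RAW", 100, 2, 2,
                  "KAsGPPVSELITK", 3.5, 2.0, 25.0, 1))
    conn.executemany("INSERT INTO prots VALUES (?,?,?,?)",
                     [(1, "DEMO_A", "DEMOA_HUMAN", "MAAKASGPPVSELITKQQ"),
                      (2, "DEMO_B", "DEMOB_HUMAN", "GGKASGPPVSELITK")])
    conn.executemany("INSERT INTO prot_spec VALUES (?,?)", [(1, 1), (2, 1)])
    return conn


def proteins_for_peptide(conn, peptide):
    """Return the accessions of every protein linked to a validated peptide."""
    rows = conn.execute(
        """SELECT DISTINCT p.accession
           FROM data d
           JOIN prot_spec ps ON ps.data_id = d.data_id
           JOIN prots p ON p.protein_id = ps.protein_id
           WHERE d.validated_peptide = ?
           ORDER BY p.accession""",
        (peptide,))
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(proteins_for_peptide(build_demo_db(), "KAsGPPVSELITK"))
```

Keeping the peptide-protein links in a separate prot_spec table, as in Fig. 2, lets one data entry point to several proteins without duplicating the spectrum or score columns.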
Figure 3. LymPHOS workflow overview. (A) Phos-file construction: Xcalibur raw files are analyzed using Bioworks, which searches for the best possible match for each spectrum in a local Uniprot database in Fasta format. An Excel report containing all peptide assignations and associated scores is generated. Xcalibur raw files are also used to extract mass spectra as well as complementary information not provided by Bioworks (MS2/MS3 state, retention time). Subsequently, the D-value and Ascore are calculated for each spectrum to evaluate sequence assignation and p-site location, respectively. Only the best spectra-peptide pairs are preserved. Finally, all the proteins mapped by each peptide are included and all the information is stored in a phos text file. (B) MIAPE info: All the information related to the proteomics experiment is arranged in a MIAPE-based structure, and stored in a text file that can be uploaded to the web application. The three relevant experimental steps are summarized in three MIAPE documents that describe sample preparation, MS and data analysis conditions, respectively. (C) Web interface: The LymPHOS application provides a web form that enables the uploading of phos-files and the associated MIAPE documents.

3.1 Data filtering

A major issue in large-scale shotgun MS analysis is the selection of correct identifications from a high number of spectra. Spectra included in LymPHOS were filtered using both the Xcorr and D-value. Although the D-value includes Xcorr in its calculation, we found that Xcorr/D-value distributions were more effective in rescuing assignations of low D-value but high Xcorr. These assignations frequently correspond to phosphopeptides with alternative, ambiguous assignations of similar Xcorr. In these cases, the parameter ΔCn (the difference between the Xcorr of the best and second-best candidates), which is also considered in the D-value calculation, is low and hence produces a low score. These assignations are discarded when only the D-value parameter is used. We tested several procedures to establish the boundaries in the Xcorr/D-value space that separate true from false positives with a given FDR. The use of a linear boundary of the type Xcorr = a·D-value + b was a convenient method for this purpose. Values of a and b were calculated with an iterative procedure based on the generation of random values of these parameters in a given range. Values providing the highest number of true positive spectra for a given FDR were used iteratively to restrict the search range.

Confident assignation of the p-site was performed on the basis of the Ascore values. After Ascore evaluation, about 10% of the initial p-sites assigned by Bioworks were relocated. In order to improve the efficiency of p-site location, MS2-MS3 spectra pairs pointing to the same peptide sequence were collected and used for Ascore calculation even when one of them was below the cutoff value in the Xcorr/D-value filtering step. Low score assignations in such pairs may be useful for supporting the location of the phosphate in the sequence derived from its MSn partner data. In this respect, the dominance of the phosphate neutral loss signal at m/z = (Mr + n − 98)/n (for n = 1–3) in MS2 spectra is often considered to result in reduced sequence information, producing hard-to-interpret fragmentation patterns [20] and consequently assignations with poor scores. Surprisingly, from the set of MS2-MS3 pairs selected, 64% of the assignations from the MS2 spectra passed through the Xcorr/D-value filter, while for MS3 spectra this value was only 46%.

Of the 188 pairs of MS2-MS3 spectra, only 26 produced distinct p-site assignations. In most cases, these conflicting assignations were derived from MS2-MS3 pairs in which one spectrum produced a high Ascore while the other had a low Ascore, below the cutoff value of 19. In three cases we found two alternative assignations, each supported by MS2 and MS3 spectra of high Ascore. Manual analysis of these spectra indicated that they were produced by the presence of both alternative phosphopeptide forms (Fig. 4). Ascore thus appears to have some limitations for the correct evaluation of these spectra. As peptide isomers that differ in the phosphate position are likely to elute closely in reversed-phase chromatography, production of mixed spectra during a shotgun analysis of phosphopeptide extracts may be quite frequent. Thus, although we found Ascore to be a powerful tool for p-site analysis, methods with higher resolving power are still needed to deconvolute these special spectra. These tools are of importance in cases such as the study of multisite phosphorylation, a phenomenon of great relevance for the understanding of the role of phosphorylation in protein function, especially in the study of signal multiplexing during signal transduction [21, 22].

Figure 4. MS2 and MS3 spectra tentatively corresponding to a mixture of two positional phosphopeptide isomers. Ascore assigns the MS2 spectrum to IEDVGsDEEDDSGKDKK while MS3 is assigned to IEDVGSDEEDDsGKDKK, in both cases with a high score (54 and 28, respectively).
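The iterative random search for the boundary parameters a and b described in Section 3.1 could be sketched in Python as follows. This is a guess at the general shape of such a procedure, not the authors' code: the function names, the acceptance rule (hits on or above the line are kept), the FDR estimator (twice the decoy hits over all accepted hits), the number of rounds and samples, and the way the range is narrowed are all assumptions.

```python
import random


def evaluate(hits, a, b):
    """Count target and decoy hits on or above the line Xcorr = a * D + b.

    hits is a list of (xcorr, d_value, is_decoy) tuples.
    """
    targets = decoys = 0
    for xcorr, d_value, is_decoy in hits:
        if xcorr >= a * d_value + b:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    return targets, decoys


def search_boundary(hits, a_range, b_range, max_fdr=0.01,
                    rounds=5, samples=200, shrink=0.5, seed=0):
    """Random search for (a, b) maximizing accepted targets at FDR <= max_fdr.

    After each round both ranges are narrowed around the best pair so far.
    Returns (accepted targets, a, b), or None if no pair met the FDR limit.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(rounds):
        for _ in range(samples):
            a = rng.uniform(*a_range)
            b = rng.uniform(*b_range)
            targets, decoys = evaluate(hits, a, b)
            total = targets + decoys
            if total == 0:
                continue
            fdr = 2.0 * decoys / total  # assumed target-decoy estimator
            if fdr <= max_fdr and (best is None or targets > best[0]):
                best = (targets, a, b)
        if best is None:
            break
        _, a0, b0 = best
        half_a = (a_range[1] - a_range[0]) * shrink / 2
        half_b = (b_range[1] - b_range[0]) * shrink / 2
        a_range = (a0 - half_a, a0 + half_a)
        b_range = (b0 - half_b, b0 + half_b)
    return best


if __name__ == "__main__":
    # Synthetic, well-separated scores for illustration only.
    demo = [(4.0, 2.0, False)] * 20 + [(1.0, -1.0, True)] * 20
    print(search_boundary(demo, a_range=(-2.0, 0.0), b_range=(0.0, 5.0)))
```

A negative slope lets spectra with a low D-value but high Xcorr stay above the line, which is the rescue effect that a single D-value cutoff cannot provide.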
3.2 Peptide–protein association
Shotgun proteomics has a strong peptide-centric nature. Enzymatic digestion generates a mixture of peptides, and a peptide can no longer be traced to its protein of origin if other proteins share the same sequence. This has been described as the protein inference problem [23]. Tentative assignment of a peptide to one protein in preference to other candidates is sometimes supported by the presence in the sample of other peptides that also point to that protein sequence. However, this premise is not applicable when PTM-selective purification is involved, as in phosphopeptide enrichment and detection. Peptide–protein association in the LymPHOS database takes this characteristic into account: each peptide was associated with every possible protein form annotated in the combined UniProt SwissProt–TrEMBL human protein databases. Unfortunately, the Bioworks reports already prepared as part of the LymPHOS workflow could not be used directly for this purpose, as Bioworks lists only one protein candidate for each peptide (although the Bioworks viewer shows a number of proteins sharing the target sequence). Thus, a parallel database search was performed using a Perl script to provide an associative list correlating each phosphopeptide sequence with all the proteins containing that sequence. About 28% of the peptides in the database mapped to a single protein sequence, while another 24% were common to more than five proteins. On average, each sequence was associated with three proteins.
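The associative list described above can be illustrated with a minimal Python sketch. It is not the Perl script used for LymPHOS: the function names, the toy accessions and the toy sequences are all hypothetical, and it assumes that modified residues are written in lowercase, so that uppercasing a peptide gives the unmodified sequence to search for.

```python
def map_peptides_to_proteins(peptides, proteins):
    """Return {peptide: sorted accessions of every protein containing it}.

    peptides: iterable of peptide strings; lowercase letters are assumed to
              mark modified residues (e.g. the 's' in 'AEQsLHDLQER').
    proteins: dict mapping accession -> full protein sequence (uppercase).
    """
    mapping = {}
    for peptide in peptides:
        bare = peptide.upper()  # unmodified sequence used for matching
        mapping[peptide] = sorted(
            acc for acc, seq in proteins.items() if bare in seq
        )
    return mapping


def summarize_mapping(mapping):
    """Summary statistics of the kind quoted in the text."""
    counts = [len(accessions) for accessions in mapping.values()]
    if not counts:
        raise ValueError("empty mapping")
    n = len(counts)
    return {
        "single_protein_fraction": sum(c == 1 for c in counts) / n,
        "more_than_five_fraction": sum(c > 5 for c in counts) / n,
        "mean_proteins_per_peptide": sum(counts) / n,
    }


# Toy data (hypothetical accessions and sequences, for illustration only).
toy_proteins = {
    "TOY_A": "MKAEQSLHDLQERLL",
    "TOY_B": "GGAEQSLHDLQERK",
    "TOY_C": "MSTVPKPEPSPEK",
}
toy_mapping = map_peptides_to_proteins(["AEQsLHDLQER", "VPKPEPsPEK"], toy_proteins)
print(toy_mapping)
print(summarize_mapping(toy_mapping))
```

A plain substring scan is quadratic in the size of the inputs. It is adequate for a few hundred peptides, but a full SwissProt–TrEMBL search would probably call for an indexed approach.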
3.3 Phos-file documents
All the information obtained up to this point (Bioworks scores, peptides and physical parameters, in-house calculated D-value and Ascore, spectra, and protein names and sequences) was packed into a single phos-file. Phos-files are generated by a custom-made Perl application that unifies the different scripts mentioned in the preceding sections. For each spectrum, this application performs all the described operations: Bioworks report parsing, extraction of the MS stage and spectrum via XDK, D-value calculation, peptide filtering at FDR < 1% based on Xcorr and D-value, Ascore determination and p-site validation, and finally the search for the set of proteins mapped by the validated peptide. Although we developed this scheme for our own convenience (currently, the raw format is the native output of all our MS instruments), it can easily be extended to more general cases. The only requirements for the generation of phos-file documents are the availability of a plain text or XML file with the MS data and a spreadsheet document containing the corresponding peptide identification data.

[Figure 5 image: example phos-file. Column labels visible in part A: Scan, MS, ions, n count, Filename, BW Protein, [M+H]+, deltaMass, z, D, log(Prob), XCorr, δCn, Sp, RSp, BW Peptide, Spectra -> Mass, Spectra -> Intensity, Rt(s), Ascore, Final peptide, Set of proteins containing peptide. Part B follows the "### PROTEIN SEQUENCES" flag.]
Figure 5. Phos-file example. The phos-file comprises two parts separated by a text flag: (A) Each data entry is listed with all the collected information: parameters and scores derived from the search engine report, the list of proteins in the database used containing the peptide, the mass and intensity arrays, calculated scores (D-value and Ascore) and the validated peptide (peptide with p-site assignation revised using Ascore). (B) List of the proteins containing the previous list of peptides. Data for each protein are arranged in three columns: the Accession Number and SwissProt/TrEMBL ID, the SwissProt/TrEMBL description line and the full sequence.

The phos-file is a text file divided into two consecutive blocks (Figs. 3 and 5). The first block contains all the information relative to each spectrum, arranged as a tab-delimited list. This includes all the peptide information extracted from Xcalibur and Bioworks (MS stage, D-value, Ascore, validated peptide sequence, and spectrum, in the form of a mass-intensity array) as well as the list of accession numbers of the proteins mapped by each peptide. The second block contains a non-redundant list of the proteins mapped by the full data set in three tab-delimited columns: the accession number and name, the description and the sequence of each protein.

The phos-file format minimizes the work to be done when queries are submitted to the LymPHOS database, as all calculations are already performed and the peptide–protein relationships are directly supplied to the database. Only useful, non-redundant information is stored in this file, which gives a very compact file format of about 11 kB per spectrum. Furthermore, data handling is greatly simplified because several raw files obtained under the same experimental conditions can be processed together to create a single phos-file. Phos-files can also be used as backup files of the information stored in the LymPHOS database.

[Figure 6 image: Protein view (accession number, protein name, sequence, phosphosites, list of peptides supporting each phosphosite, Expasy–UniProt info and links) and Peptide view (phosphopeptide, list of proteins containing the phosphopeptide, spectral parameters, scores and experimental conditions, mass spectrum with annotated fragmentation pattern).]
Figure 6. Peptide and protein views. Screenshot of the two main views in the LymPHOS web application.

Phos-files are a very convenient and efficient reservoir for storing analytical information, although they also have some limitations. Because the phos-files are built from pre-analyzed, static data, the information on peptides and peptide–protein relationships is bound to the content of the UniProt database at the moment of file creation. This makes it necessary to periodically regenerate the phos-files from all the information contained in the LymPHOS database, using the updated protein databases.
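The two-block layout of the phos-file described in Section 3.3 could be read back with a few lines of Python. The sketch below is only an illustration of the idea, not the LymPHOS loader. It takes the "### PROTEIN SEQUENCES" flag and the pipe-delimited arrays from the example in Fig. 5, but the header row and the column names (proteins, mass, intensity and the others used in the sample) are assumptions, since the exact column order of the real files is not specified here.

```python
FLAG = "### PROTEIN SEQUENCES"  # text flag separating the two blocks (Fig. 5)


def _floats(field):
    """'|857.9|858.8|' -> [857.9, 858.8] (pipe-delimited numeric array)."""
    return [float(x) for x in field.split("|") if x.strip()]


def _protein_ids(field):
    """'||ACC1|ID1||ACC2|ID2' -> ['ACC1|ID1', 'ACC2|ID2']."""
    return [p.strip() for p in field.split("||") if p.strip()]


def parse_phos_file(text):
    """Parse a phos-file-like text into (spectra, proteins).

    Assumption: the first block starts with a tab-delimited header row
    that names at least the columns 'proteins', 'mass' and 'intensity'.
    """
    spectra_block, _, protein_block = text.partition(FLAG)

    rows = [line for line in spectra_block.splitlines() if line.strip()]
    header = rows[0].split("\t")
    spectra = []
    for line in rows[1:]:
        record = dict(zip(header, line.split("\t")))
        record["proteins"] = _protein_ids(record["proteins"])
        record["mass"] = _floats(record["mass"])
        record["intensity"] = _floats(record["intensity"])
        spectra.append(record)

    # Second block: accession|ID, description, sequence (three columns).
    proteins = {}
    for line in protein_block.splitlines():
        if not line.strip():
            continue
        accession, description, sequence = line.split("\t")
        proteins[accession.strip()] = {
            "description": description.strip(),
            "sequence": sequence.strip(),
        }
    return spectra, proteins
```

Keeping the protein block non-redundant means that each sequence is stored once, however many spectra point to it. The spectra block then only needs the accession list to recover the peptide–protein relationships.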
3.4 Overview of LymPHOS
LymPHOS is an open access database (www.lymphos.org) that presents the experimental p-site information in two modes. In the Peptide view mode, the information is accessible as a list of peptides with an indication of the corresponding p-sites. This mode reflects the peptide-centric nature of shotgun proteomics, making experimental information directly accessible. Tentative proteins to which the peptide could belong appear as a list of references for the corresponding entry. In the Protein view mode, protein sequences are depicted with an indication of the p-sites and the complete list of phosphopeptides supporting the p-site locations. These two views are connected, and it is possible to browse from a given protein to the information on any of its related peptides and vice versa. In addition, a detailed view of the experimental conditions for each spectrum can be accessed.

3.5 Peptide and protein search and visualization
From the "Search" page, peptide information can be accessed by querying specific sequences or by selecting a peptide from the list of database sequences. In either case, the "Peptide view" page is displayed (see Fig. 6). This page shows a heading with the peptide sequence and p-site, followed by a list of the proteins mapped by the peptide. At the bottom of the page, all the mass spectra supporting the peptide assignation are displayed, with their corresponding experimental information and scores. The definition of each parameter is provided in a mouse-sensitive environment. Spectra can be downloaded in Mascot generic format (mgf); other formats, such as mzML [24], will be implemented in the future. High confidence p-site assignations are highlighted using a color code that distinguishes sequences with an Ascore greater than 19 (99% probability of correct assignation) or with only one possible phosphorylation site (green) from those with an Ascore lower than 19 (orange). Proteins listed on this page are hyperlinked to the corresponding "Protein view".

Proteins can be searched from the "Search" page by their UniProt accession number or description line, or selected from the list of database sequences. Protein information can also be obtained through the links available in the "Peptide view" page. At the top of the "Protein view" page, the SwissProt/TrEMBL Accession Number and ID, protein description and sequence are shown. All p-sites annotated in the LymPHOS database for the protein are indicated, together with a list of all phosphopeptide sequences supporting them. These sequences are hyperlinked to the corresponding "Peptide view". At the bottom of the page, some useful information parsed from the Expasy web page is available, including other experimentally described p-sites, annotated protein function and cellular location, and a link to the protein entry in the UniProt database.
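The color code used for p-site assignations in the Peptide view can be expressed as a small rule. The Python sketch below is a hypothetical illustration, not the code of the LymPHOS web application. It assumes singly phosphorylated peptides, treats S, T and Y as the only candidate residues, and assigns orange to an Ascore of exactly 19, a boundary case that the text does not specify.

```python
ASCORE_THRESHOLD = 19.0  # Ascore above this value: 99% probability of correct assignation


def count_candidate_sites(peptide):
    """Number of Ser/Thr/Tyr residues, ignoring case (assumed candidate p-sites)."""
    return sum(1 for residue in peptide.upper() if residue in "STY")


def psite_color(peptide, ascore):
    """Return 'green' for high confidence p-site assignations, else 'orange'.

    Green: only one possible phosphorylation site, or Ascore > 19.
    Orange: several candidate sites and Ascore not above 19.
    """
    if count_candidate_sites(peptide) == 1:
        return "green"  # unambiguous: a single candidate residue
    if ascore > ASCORE_THRESHOLD:
        return "green"
    return "orange"


# Illustrative calls (peptides written with a lowercase phosphorylated residue).
print(psite_color("AEQsLHDLQER", 0.0))
print(psite_color("ASLGsLEGEAEAEASSPK", 17.4))
print(psite_color("ASLGsLEGEAEAEASSPK", 42.3))
```

Checking the single-site case first means that a peptide with an unambiguous p-site is shown in green whatever Ascore value is stored for it.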
4
Concluding remarks
We have implemented a workflow for handling the experimental, mass spectrometric and data analysis information derived from a shotgun phosphoproteome analysis, together with a relational, open access database (LymPHOS) to store and consult this information. Although we built these tools specifically to help in the analysis of the human T-lymphocyte phosphoproteome, the workflow described could also be useful for the storage and management of proteomics experimental data from other studies related to PTM characterization.

This work was supported by grants BIO2004-01788 and BIO2008-03365 from the Ministerio de Ciencia y Tecnología. The LP-CSIC/UAB is a member of ProteoRed, funded by Genoma Spain, and follows the quality criteria set up by ProteoRed standards.

The authors have declared no conflict of interest.
5
References
[1] Bodenmiller, B., Malmstrom, J., Gerrits, B., Campbell, D. et al., PhosphoPep – a phosphoproteome resource for systems biology research in Drosophila Kc167 cells. Mol. Syst. Biol. 2007, 3, 1–11.
[2] Wilson-Grady, J. T., Villen, J., Gygi, S. P., Phosphoproteome analysis of fission yeast. J. Proteome Res. 2008, 7, 1088–1097.
[3] Zahedi, R. P., Lewandrowski, U., Wiesner, J., Wortelkamp, S. et al., Phosphoproteome of resting human platelets. J. Proteome Res. 2008, 7, 526–534.
[4] Beausoleil, S. A., Villen, J., Gerber, S. A., Rush, J., Gygi, S. P., A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24, 1285–1292.
[5] Lu, B., Ruse, C., Xu, T., Park, S.-K., Yates, J., III, Automatic validation of phosphopeptide identifications from tandem mass spectra. Anal. Chem. 2007, 79, 1301–1310.
[6] Bodenmiller, B., Campbell, D., Gerrits, B., Lam, H. et al., PhosphoPep – a database of protein phosphorylation sites in model organisms. Nat. Biotechnol. 2008, 26, 1339–1340.
[7] Diella, F., Gould, C. M., Chica, C., Via, A., Gibson, T. J., Phospho.ELM: a database of phosphorylation sites – update 2008. Nucleic Acids Res. 2007, 36, D240–D244.
[8] Gnad, F., Ren, S., Cox, J., Olsen, J. et al., PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, prediction of phosphosites. Genome Biol. 2007, 8, R250.
[9] Heazlewood, J., Durek, P., Hummel, J., Selbig, J. et al., PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor. Nucleic Acids Res. 2008, 36, D1015–D1021.
[10] Hornbeck, P. V., Chabra, I., Kornhauser, J. M., Skrzypek, E., Zhang, B., PhosphoSite: a bioinformatics resource dedicated to physiological protein phosphorylation. Proteomics 2004, 4, 1551–1561.
[11] Hummel, J. N. M., Wienkoop, S., Schulze, W., Steinhauser, D. et al., ProMEX: a mass spectral reference database for proteins and protein phosphorylation sites. BMC Bioinformatics 2007, 23, 216.
[12] Prince, J. T., Carlson, M. W., Wang, R., Lu, P., Marcotte, E. M., The need for a public proteomics repository. Nat. Biotechnol. 2004, 22, 471–472.
[13] Carrascal, M., Ovelleiro, D., Casas, V., Gay, M., Abian, J., Phosphorylation analysis of primary human T lymphocytes using sequential IMAC and titanium oxide enrichment. J. Proteome Res. 2008, 7, 5167–5176.
[14] Eng, J. K., McCormack, A. L., Yates, J. R., An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
[15] Elias, J. E., Gygi, S. P., Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207–214.
[16] Higdon, R., Hogan, J. M., Belle, G. v., Kolker, E., Randomized sequence databases for tandem mass spectrometry peptide and protein identification. OMICS A J. Integr. Biol. 2005, 9, 364–379.
[17] Consortium, U., The Universal protein resource (UniProt). Nucleic Acids Res. 2008, 36, D190–D195.
[18] Keller, A., Nesvizhskii, A. I., Kolker, E., Aebersold, R., Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–5392.
[19] Taylor, C. F., Paton, N. W., Lilley, K. S., Binz, P.-A. et al., The minimum information about a proteomics experiment (MIAPE). Nat. Biotechnol. 2007, 25, 887–893.
[20] DeGnore, J. P., Qin, J., Fragmentation of phosphopeptides in an ion trap mass spectrometer. J. Am. Soc. Mass Spectrom. 1998, 9, 1175–1188.
[21] Cohen, P., The regulation of protein function by multisite phosphorylation – a 25 year update. TIBS 2000, 25, 596–601.
[22] Yang, F., Stenoien, D. L., Strittmatter, E. F., Waan, J. et al., Phosphoproteome profiling of human skin fibroblast cells in response to low- and high-dose irradiation. J. Proteome Res. 2006, 5, 1252–1260.
[23] Nesvizhskii, A. I., Aebersold, R., Interpretation of shotgun proteomic data. The protein inference problem. Mol. Cell. Proteomics 2005, 4, 1419–1440.
[24] Deutsch, E., mzML: a single, unifying data format for mass spectrometer output. Proteomics 2008, 8, 2776–2777.