LymPHOS: Design of a phosphosite database of primary human T cells

June 14, 2017 | Author: Vanessa Casas | Category: Proteomics, Biological Sciences, Humans, T lymphocytes, Phosphopeptides


Proteomics 2009, 9, 3741–3751


DOI 10.1002/pmic.200800701

RESEARCH ARTICLE

David Ovelleiro, Montserrat Carrascal, Vanessa Casas and Joaquin Abian

CSIC/UAB Proteomics Laboratory, Instituto de Investigaciones Biomédicas de Barcelona-Consejo Superior de Investigaciones Científicas, IDIBAPS, Barcelona Autonomous University, Bellaterra, Spain

Current proteomic technology is capable of producing huge amounts of analytical information, which is often difficult to manage in a comprehensive form. Curation, further annotation and public communication of proteomic data require the development of standard data formats and efficient, multimedia database structures. We have implemented a workflow for the annotation of a phosphopeptide database (LymPHOS) that includes tools for MS data filtering and phosphosite assignation, mass spectrum visualization, experimental description and accurate phosphorylation site assignation. Experimental annotations were fitted to current minimum information about a proteomics experiment guidelines. A new guideline for phosphoprotein sample preparation is also proposed. Currently, the database describes 342 phosphorylation sites mapping to more than 200 gene sequences, and it can be accessed through the net (http://www.lymphos.org).

Received: September 1, 2008 Revised: March 17, 2009 Accepted: April 20, 2009

Keywords: LymPHOS database / Phosphoproteome / T-lymphocyte

1 Introduction

Large-scale characterization of PTMs is often carried out by shotgun approaches using PTM-specific isolation and detection methods [1–3]. Current proteomic technology is capable of producing huge amounts of analytical information that is frequently difficult to handle in a comprehensive form. An LC-MS/MS study consisting of a few 90 min chromatograms can produce several thousand MS/MS spectra that then need to be evaluated. Consequently, curation, further annotation and public communication of proteomic data require the development of standard data formats and efficient, multimedia database structures. Moreover, re-evaluation of the quality and significance of the identification data in a database by an external user should be facilitated by easy access to more specific information, such as experimental conditions and individual spectrum raw data. This information is especially relevant for the analysis of phosphorylation sites (p-sites), where p-site assignation is often derived from the combination of MS2 and MS3 spectral data, and several analytical tools are often required to resolve ambiguous p-site assignations [4, 5].

Several open access, dedicated phosphoprotein databases have been built by a number of groups [6–12]. Most of these contain predicted as well as experimental data extracted from the literature, and recently some of them [6, 11, 12] have included tools for on-line visualization of spectral data. Probably the largest repository of p-sites is Phosphosite, which collects more than 50 000 experimental or predicted p-sites from nearly 180 cell lines and 16 distinct tissues from human, mouse, rat and other species (www.phosphosite.org). Phosphosite records can be searched in several ways (peptide, species, disease, etc.) and a viewer is provided to visualize the spectra. Another database of wide coverage, PhosphoELM, contains nearly 18 300 p-sites extracted from experimental data from the literature, of which about 13 000 are from mouse or human [7]. PhosphoELM is focused on the study of kinases and kinase substrates and provides search methods for this purpose; however, it does not include mass spectral information, for which a search of the source literature is required.

Besides these major databases, smaller repositories such as the rat Collecting Duct Phosphoprotein Database (http://dir.nhlbi.nih.gov/papers/lkem/cdpd/, about 750 p-sites) or PhosphAT [9] (4070 non-redundant phosphopeptides) hold data from only one specific cell, tissue or model organism being addressed in a specific area of research. Often, these dedicated databases later extend their scope, as was the case for Phosida [8] and PhosphoPep [6]. Initially built to hold data from phosphoproteomics experiments in HeLa cells, Phosida is one of the most carefully curated databases available for phosphopeptides. At present, Phosida holds phosphopeptide data from B. subtilis, Mus musculus, Lactococcus lactis and E. coli, and includes p-sites, phosphoprotein information and some useful tools for p-site analysis. Recently, the PhosphoPep database, initially containing experimental phosphopeptide data on D. melanogaster, has also been extended to include data on humans, S. cerevisiae and C. elegans. Although this database stores data from several laboratories, a common curation process ensures homogeneous identification and annotation.

Here, we describe the design of a phosphopeptide database (LymPHOS) focused on the phosphoproteome of human primary T-lymphocytes, a model for the study of signal transduction and altered immune response for which no phosphopeptide data are available in other databases. LymPHOS currently holds all the experimental data on p-site identification by LC-MS/MS carried out in our laboratory [13], and its design attempts to fulfill the above-mentioned requirements for efficient data access and analysis. For its annotation, we implemented a workflow that includes tools for filtering MS data and p-site identification, mass spectra visualization, experimental description and accurate p-site assignation. All spectra supporting a p-site assignation were stored in the database and presented graphically with the corresponding experimental information and identification parameters and scores. Experimental annotations were fitted to current minimum information about a proteomics experiment (MIAPE) guidelines [14]. A new guideline for phosphoprotein sample preparation is also proposed.

Correspondence: Dr. Joaquin Abian, LP CSIC/UAB, Facultat de Medicina, Edifici M, Universitat Autònoma de Barcelona, Campus UAB, 08193 Bellaterra, Spain
E-mail: [email protected]
Fax: +34-93-581-49-13

Abbreviations: MIAPE, minimum information about a proteomics experiment; p-site, phosphorylation site; XDK, Xcalibur Development Kit

These authors contributed equally to the work.

© 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
www.proteomics-journal.com

2 Materials and methods

2.1 Sample preparation

Human primary T cells were isolated from buffy coats obtained from the Blood Bank of the "Hospital Clinic" and the "Hospital Vall d'Hebron" (Barcelona, Spain) using Ficoll-Paque (GE, Uppsala, Sweden) gradient centrifugation and following standard procedures. Each purification was started from 100 mL buffy coat. Purified lymphocytes were lysed and protein extracts were prepared as described [13]. Protein extracts were separated in a hand-poured 10% acrylamide SDS-PAGE gel (MiniProtean, Bio-Rad) and stained with Coomassie Blue. The gel lane was then divided into 11 slices, which were in-gel digested with trypsin (Promega, Madison, WI, USA) using a Digest MSPro robot (Intavis, Koeln, Germany) and following standard procedures. Extracts were evaporated to dryness and redissolved in 200 µL 250 mM AcOH/30% acetonitrile.

2.2 Phosphopeptide enrichment

Phosphopeptide enrichment was performed using sequential IMAC and TiO2 as previously described [13]. Briefly, 20 µL IMAC resin (Phos-Select iron affinity gel, Sigma, St. Louis, MO, USA) was added to each peptide extract and the mixtures were incubated for 90 min at 20 °C. Phosphopeptides were eluted with 0.5% NH4OH. The non-retained fraction was concentrated in the SpeedVac to 10 µL, diluted five times with 1 M glycolic acid, 5% TFA and 80% acetonitrile, and loaded into a TiO2 minicolumn. Phosphopeptides were eluted from the tip with 0.5% NH4OH followed by 1 µL 30% ACN.

2.3 LC-MSn analysis

IMAC and TiO2 eluates were analyzed by LC-MSn using an LTQ linear ion trap equipped with a microESI ion source and controlled by Xcalibur 2.0 SR2 (ThermoFisher, San Jose, CA, USA). Each extract was concentrated to about 5 µL and diluted to 40 µL with 1% formic acid. Separation was carried out using a C18 preconcentration cartridge (Agilent Technologies, Barcelona, Spain) connected to a 10 cm long, 150 µm id Vydac C18 column (Vydac, IL, USA), at 1 µL/min with a linear ACN gradient from 0 to 40% in 60 min (solvent A: 0.1% formic acid, solvent B: ACN 0.1% formic acid). The LTQ instrument was operated in positive ion mode with a spray voltage of 2 kV. The spectrometric analysis was performed in an automatic dependent mode. Each acquisition cycle comprised a full scan (m/z 400–2000) followed by eight product spectra on the corresponding most abundant precursor ions in the full scan. When a signal derived from a neutral loss of 98, 49 or 32.7 (loss of H3PO4 for +1, +2 and +3 charged ions, respectively) from the precursor ion was detected among the ten most intense signals in a given MS/MS spectrum, a subsequent MS3 scan was performed on that ion. Dynamic exclusion was set to one with a time window of 5 min, in order to minimize the redundant selection of precursor ions.
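The neutral-loss rule that triggers an MS3 scan can be illustrated with a short sketch. This is not the instrument firmware or any Xcalibur API; the function names, the 0.5 m/z tolerance and the peak-list layout are assumptions made for illustration only.

```python
# Illustrative sketch only: names, tolerance and data layout are assumptions,
# not part of Xcalibur or of the authors' software.
H3PO4_NOMINAL_MASS = 98.0  # neutral loss mass quoted in the text


def neutral_loss_offsets(max_charge=3):
    """m/z shift caused by losing H3PO4 from a precursor of charge 1..max_charge."""
    return {z: H3PO4_NOMINAL_MASS / z for z in range(1, max_charge + 1)}


def triggers_ms3(precursor_mz, peaks, tolerance=0.5, top_n=10, max_charge=3):
    """Return the m/z of a neutral-loss ion among the top_n peaks, else None.

    peaks is a list of (mz, intensity) tuples from one MS/MS spectrum.
    """
    most_intense = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:top_n]
    for _charge, offset in neutral_loss_offsets(max_charge).items():
        target = precursor_mz - offset
        for mz, _intensity in most_intense:
            if abs(mz - target) <= tolerance:
                return mz
    return None
```

For a doubly charged precursor at m/z 869.0, the expected neutral-loss ion lies at 869.0 − 49 = 820.0, so a spectrum containing an intense peak at that value would be selected for MS3.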

2.4 Spectrometric data interpretation and filtering

LC/MS files in the Xcalibur raw format were analyzed using Bioworks v3.3, a ThermoFisher implementation of the Sequest [14] algorithm. Search parameters were: peptide mass tolerance, 2 Da; fragment tolerance, 0.8 Da; digestion rule, 'KR no P' (trypsin), allowing up to two missed cleavages; static modification, carbamidomethylated cysteine (+57 Da); dynamic modifications, methionine oxidation (+16 Da), phosphorylation on Ser, Thr and Tyr (+80 Da) and loss of water from Ser and Thr (β-elimination of phosphoric acid from the corresponding phospho-amino acid). The database search was performed by limiting the tentative charge of precursor ions to a maximum of +3. Each mass spectrum was searched using the composite target-reverse database strategy, which allows efficient data filtering with a selected ratio of false-positive hits [15, 16]. The target database was prepared from the Uniprot-Swissprot and Uniprot-Trembl [17] human databases (UniProt Knowledgebase Release 14.0, 53 550 and 19 045 protein sequences, respectively) by converting the original dat format to fasta and combining the two archives.

Bioworks search results were exported to a Microsoft Excel format, from which scan number, precursor mass and charge, and search scores were extracted. In addition, the original raw files were processed using the Xcalibur Development Kit (XDK). The XDK is a suite of programmable COM objects provided with Xcalibur, which allow display and manipulation of data and access to Xcalibur files. Using the Perl Win32::OLE module to link to the XDK library, the information on the MS stage (MS2 or MS3) and the mass–intensity arrays of each spectrum were extracted and stored in an automated way. This information was merged automatically with that provided by Bioworks on the basis of the scan number of each spectrum available in the Bioworks Excel report.

To efficiently reduce false-positive identifications, two scores were used in combination: the Xcorr (the main Sequest score, representing the cross-correlation between experimental and theoretical spectra) and an in-house D-value (the discriminant score used by Peptide Prophet [18]) calculated from the scores provided by Bioworks. For each data set, the cutoff values of Xcorr and D-value that gave a 1% false discovery rate were used [13]. Only tentative phosphorylated peptides were selected for further analysis. Peptides with scores lower than the cutoff values, or those from decoy proteins, were rejected. However, as previously described [13], in order to improve the efficiency of p-site location, MS2-MS3 spectra pairs pointing to the same peptide sequence were stored even when one of them was under the cutoff value.
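As a rough illustration of how a score cutoff can be chosen from a composite target-reverse search, the sketch below uses a single score (Xcorr) rather than the two-score combination described above. The estimator 2 × decoys / accepted hits is an assumption commonly used with concatenated databases, not a formula stated in this article, and the function names are hypothetical.

```python
# Simplified, single-score sketch; the FDR estimator is an assumption.
def estimate_fdr(hits, xcorr_cutoff):
    """hits: list of (xcorr, is_decoy). Returns 2 * decoys / accepted hits."""
    accepted = [is_decoy for xcorr, is_decoy in hits if xcorr >= xcorr_cutoff]
    if not accepted:
        return 0.0
    return 2.0 * sum(accepted) / len(accepted)


def lowest_cutoff_for_fdr(hits, max_fdr=0.01):
    """Most permissive observed Xcorr whose estimated FDR stays within max_fdr."""
    for cutoff in sorted({xcorr for xcorr, _ in hits}):
        if estimate_fdr(hits, cutoff) <= max_fdr:
            return cutoff
    return None
```

Hits scoring below the returned cutoff, and any remaining decoy hits, would then be discarded.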

2.5 Phosphorylation site assignment: Ascore

After filtering, a set of high-confidence phosphorylated peptides is obtained. However, when multiple p-sites are possible in a given peptide, assignation of the phosphorylation site is often ambiguous. If two or more phosphorylations take place in a given peptide with three or more possible p-sites, the difficulty increases. To confirm the correct location of the p-site (or p-sites), the Ascore, a probability-based score described by Gygi's group [4], was used. Ascore provides a probabilistic estimation of the confidence of phosphorylation site assignment relative to the second-best alternative. The different p-site alternatives are ranked from best to worst depending on how many theoretical fragments (y and b series) are found in the experimental spectrum. When the best choice is not sufficiently better than the second one, the phosphorylation point is poorly scored. The first step in the Ascore calculation is to determine all the possible phosphopeptides that would be generated by alternative phosphate locations on the Ser, Thr and Tyr amino acids in the molecule. The Ascore calculation procedure assigns a score to each of these forms in order to elucidate the correct pattern of p-sites in the peptide. As described [4], an Ascore >19 means that the probability of a correct assignation is 99% for the selected site, while an Ascore in the range 15–19 means a probability of >90%. The number of possible phosphorylated sequences can be calculated with the binomial coefficient formula:

$$C^{n}_{k} = \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \qquad (1)$$

To automatically obtain all the phosphorylation patterns of a given phosphopeptide, we parameterized the problem by assigning a numerical index to each amino acid. Then, using a Perl script, we obtained the sequences resulting from the combinations without repetition of n possible p-sites taken in groups of k predicted phosphorylated sites (nP and kP, respectively, in Fig. 1A). In the case of MS3 spectra, the fragmentation of the phosphate neutral loss ion from a p-Ser or p-Thr leads to the assignation of the residue involved as the corresponding dehydrated form (dehydroalanine or dehydrobutyric acid, respectively). Thus, in cases where MS3 spectra were assigned to sequences containing only unmodified Thr, Ser and Tyr residues together with dehydrated forms of Thr or Ser, combinations were calculated as described before for MS2 spectra. However, when an MS3 spectrum indicated the presence of phosphorylated amino acids in addition to one or more dehydrated sites, another combination without repetition was performed for each combination of phosphorylated amino acids, using nW possible sites for water loss (total number of dehydrated and non-phosphorylated Ser and Thr residues) taken in groups of kW dehydrated sites, as derived from the assigned sequence (Fig. 1A). The number of possible combinations can be quite large: the observed peptide TPSPLVLEGtIEQSuPPLSPTTK, with one phosphorylation and one dehydrated site (t and u, respectively) in a sequence with eight possible sites to hold the phosphorylated or dehydrated forms, generates 56 different patterns for Ascore evaluation. The rest of the Ascore calculations were performed following Gygi's group (Fig. 1B).

[Figure 1 panels: (A) candidate generation for the example peptide KLEKEEEEGISQEusEEEQ, giving $C^{n_P}_{k_P} \times C^{n_W}_{k_W} = C^{3}_{1} \times C^{2}_{1} = 6$ alternatives; (B) Ascore calculation, with $P(x) = \sum_{k=n}^{N} \binom{N}{k} p^{k} (1-p)^{N-k}$, $\mathrm{Score} = -10 \times \log(P(x))$ and $\mathrm{Ascore} = \mathrm{Score}_{best} - \mathrm{Score}_{second}$; the best candidate in the example, KLEKEEEEGISQEusEEEQ, has Ascore = 21.1.]

Figure 1. Precise phosphosite assignation. (A) Obtaining alternative peptide candidates (amino acid modifications are coded as s, t and y for phosphorylated Ser, Thr and Tyr, respectively, and as u and v for dehydrated/dephosphorylated Ser and Thr, respectively). (a1) In the example, the assigned peptide contains three serines, each in a different form (unmodified, phosphorylated and dehydrated). (a2) A list of all possible combinations is generated. The number of combinations obtained is given by the product of the combination of nP possible p-sites in the unmodified sequence taken in groups of kP predicted phosphorylated sites by the combination of nW water loss sites taken in groups of kW observed water losses. (B) Ascore calculation. The two best sequences are selected from the list and the Ascore for the first candidate is calculated. (b1) Peptide candidates. The theoretical spectrum for each peptide candidate is calculated (in the example, six in silico spectra are produced depicting the corresponding y and b ion series and considering charges +1 and +2). (b2) Theoretical–experimental correlation. For each candidate, coincidences with the experimental spectrum are scored using the cumulative binomial probability for the corresponding number of coincident ions. For this, the experimental spectrum is divided into sectors of 100 m/z units and coincidences between the theoretical ions and the most intense signals in each sector are determined. The evaluation is made at ten levels. In the first level only the most intense peak in each sector from the experimental spectrum is considered; the two most intense are considered in the second level, and so on. (b3) Level comparisons. The two candidates with the highest scores are selected and the level with the greatest difference between them is determined (the eighth level in the example). (b4) Score calculations. The scores are recalculated taking into account only p-site specific ions and using a theoretical probability given by the level of maximum difference determined in the previous step (p = 0.08 for level eight and sectors of 100 m/z units). The Ascore is the difference between the scores of the best and second-best candidates. (b5) The candidate with the highest Ascore is reported.
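The two steps summarized in Fig. 1 can be sketched in a few lines. The authors used a Perl script; the Python below is only an illustrative stand-in, with hypothetical function names, and it assumes a base-10 logarithm in the score. It does not generate theoretical spectra: the number of matched site-determining ions is taken as an input.

```python
from itertools import combinations
from math import comb, log10

PHOS = {"S": "s", "T": "t", "Y": "y"}  # phosphorylated forms, coded as in Fig. 1
DEHYD = {"S": "u", "T": "v"}           # dehydrated forms, coded as in Fig. 1


def enumerate_candidates(sequence, k_phos, k_dehyd=0):
    """Place k_phos phosphates and k_dehyd water losses on an unmodified sequence."""
    phos_sites = [i for i, aa in enumerate(sequence) if aa in PHOS]
    candidates = []
    for phos in combinations(phos_sites, k_phos):
        # Water loss is only possible on non-phosphorylated Ser and Thr.
        water_sites = [i for i in phos_sites
                       if i not in phos and sequence[i] in DEHYD]
        for dehyd in combinations(water_sites, k_dehyd):
            residues = list(sequence)
            for i in phos:
                residues[i] = PHOS[sequence[i]]
            for i in dehyd:
                residues[i] = DEHYD[sequence[i]]
            candidates.append("".join(residues))
    return candidates


def binomial_score(n_matched, n_total, p):
    """-10 * log10 of the chance of matching at least n_matched of n_total ions."""
    prob = sum(comb(n_total, k) * p**k * (1 - p) ** (n_total - k)
               for k in range(n_matched, n_total + 1))
    return -10 * log10(prob)
```

With these helpers, an Ascore-like value for two competing candidates would be `binomial_score(n_best, N, p) - binomial_score(n_second, N, p)`, where the match counts come from comparing each candidate's site-determining ions with the experimental spectrum.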

2.6 Web application structure

The LymPHOS web application consists of a relational database and a web interface that allows data submission, querying and visualization. The database uses the MySQL (www.mysql.com) relational database management system. The web interface is based on server-side scripts that allow uploading of the required data (peptide- and protein-related information and experimental, MIAPE-compliant information). Scripts are written in Perl (www.perl.org) and PHP (www.php.net). Graphical visualization of spectra and fragmentation patterns is performed using the GD graphics library (www.boutell.com/gd).
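A relational layout along the lines of the tables named in Fig. 2 might look like the sketch below. It uses an in-memory SQLite database from Python as a stand-in for MySQL; the column subset, the column types and the foreign-key links are assumptions, and the query helper is hypothetical.

```python
import sqlite3

# Table names follow Fig. 2; column types and key relationships are assumed.
SCHEMA = """
CREATE TABLE conditions (
    condition_id INTEGER PRIMARY KEY, name TEXT, description TEXT
);
CREATE TABLE data (
    data_id INTEGER PRIMARY KEY,
    phos_file_name TEXT, raw_file_name TEXT, scan INTEGER, ms_stage INTEGER,
    validated_peptide TEXT, xcorr REAL, d_value REAL, ascore REAL,
    condition_id INTEGER REFERENCES conditions(condition_id)
);
CREATE TABLE spectra (
    spectrum_id INTEGER PRIMARY KEY, mass_array TEXT, intensity_array TEXT
);
CREATE TABLE prots (
    protein_id INTEGER PRIMARY KEY, accession TEXT, sequence TEXT
);
CREATE TABLE prot_spec (
    protein_id INTEGER REFERENCES prots(protein_id),
    data_id INTEGER REFERENCES data(data_id)
);
"""


def build_demo_db():
    """Create an empty in-memory database with the sketched tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def proteins_for_peptide(conn, peptide):
    """Accessions of all proteins linked to entries with this validated peptide."""
    rows = conn.execute(
        "SELECT DISTINCT p.accession FROM data d "
        "JOIN prot_spec ps ON ps.data_id = d.data_id "
        "JOIN prots p ON p.protein_id = ps.protein_id "
        "WHERE d.validated_peptide = ? ORDER BY p.accession",
        (peptide,),
    )
    return [row[0] for row in rows]
```

Keeping the peptide-to-protein link in a separate table (prot_spec) lets one data entry point to several proteins without duplicating scores or spectra.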


3 Results and discussion

The LymPHOS database was designed to provide efficient storage of the different types of information related to a shotgun experiment in phosphoproteomics and, at the same time, to offer a simple and intuitive interface for further curation, analysis and consultation of the data. All the information provided by the LymPHOS database has direct experimental support and is restricted to phosphate-bearing sequences. Experimental information supporting each p-site assignment is shown at the peptide level. All the available MS2 and MS3 spectra justifying an assignation are presented graphically together with the corresponding experimental information and identification parameters and scores. Thus, the information for each p-site assignation always includes at least one annotated spectrum image. In some cases, such as for the peptides KAsGPPVSELITK and QAsIELPSMAVASTK, up to 52 different MS2 and MS3 spectra from different files and experiments are provided side-by-side with the analytical and spectrometric data. The user can see and download each mass spectrum used to define a p-site, and can access the experimental conditions used to generate each spectrum.

The full set of available data is stored in a MySQL relational database (Fig. 2) and includes experimental conditions, annotated mass spectra, peptide and protein sequences, proteomics search engine scores and p-site assignment. Some of these parameters were retrieved from the commercial search engine Bioworks, but others required the development of customized software in order to automatically extract the information from the original Thermo raw files. For database uploading, most of the information (peptide scores, peptides with p-site assignments, mass spectra and protein sequences) was packed in a text file designated the phos-file. The web application that presents the LymPHOS database carries a minimal workload, as calculations and CPU-intensive routines are executed externally, prior to the uploading process. Moreover, a rapid-access, compact sequence database including only the proteins identified was used. Overall, these conditions provide optimal performance with minimum hardware requirements.

The workflow used in the LymPHOS database is summarized in Fig. 3. Experimental information is uploaded in a MIAPE [19] compliant structure. MIAPEs are a set of reporting guidelines for providing experimental information related to a proteomic analysis. MIAPEs are being actively developed by the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) in order to establish common standards for comparison and evaluation of experimental data. Available MIAPEs are associated with the main proteomics tasks such as 2-DE or LC-MS/MS analysis. In the development of our work, we lacked a standard for sample preparation in phosphoproteomics and consequently we implemented a proposal form that included the most common tasks and parameters involved in this field. For this purpose, experimental metadata was compiled in the form of a structured list of fields. This information included experimental reagents and conditions for sample preparation, methods for phosphoprotein/phosphopeptide enrichment, as well as analytical parameters for chromatography and MS analysis (see Supporting Information). This content was stored in the form of a text file, which was uploaded into the database server through a special web form.

[Figure 2 tables: (A) data: data id, phos-file name, raw file name, scan, MS stage, parent mass, delta mass, charge, original peptide, validated peptide, peptide probability, Xcorr, DeltaCn, Sp, RSp, Ions, Proteins count, D value, Ascore, condition id; (B) conditions: condition id, name, description; (C) spectra: spectrum id, mass array, intensity array; (D) phosphosites: phosphosite id, protein id, phosphorylation position inside peptide, phosphorylation position inside protein; (E) prots: protein id, Accession Number, SwissProt/TrEMBL ID, sequence; (F) prot_spec: protein id, data id.]

Figure 2. Scheme of LymPHOS database structure. (A) The data table stores information related to each data entry: scan number, phos-file name, raw file name, scores and a condition identifier (condition id). (B) The conditions table reflects each set of MIAPE documents related to each data entry. For example, condition id = 1 is associated with an "SDS-PAGE Titanium Resting Lymphocytes" experiment, with three associated MIAPE documents: Sample preparation, MS and MS informatics. (C) Each data entry is associated with a spectrum in the spectra table. (D) The phosphosites table associates each data entry with the phosphosites characterized in the peptide, the proteins mapped by this peptide and the position of the phosphate in the protein sequence. (E) The proteins table lists the full set of proteins mapped by the phosphopeptides in the LymPHOS database. (F) The prot_spec table relates each data entry (every peptide) with one or more proteins.

Figure 3. LymPHOS workflow overview. (A) Phos-file construction: Xcalibur raw files are analyzed using Bioworks, which searches for the best possible match for each spectrum in a local Uniprot database in Fasta format. An Excel report containing all peptide assignations and associated scores is generated. Xcalibur raw files are also used to extract mass spectra as well as complementary information not provided by Bioworks (MS2/MS3 state, retention time). Subsequently, the D-value and Ascore are calculated for each spectrum to evaluate sequence assignation and p-site location, respectively. Only the best spectra-peptide pairs are preserved. Finally, all the proteins mapped by each peptide are included and all the information is stored in a phos text file. (B) MIAPE info: All the information related to the proteomics experiment is arranged in a MIAPE-based structure and stored in a text file that can be uploaded to the web application. The three relevant experimental steps are summarized in three MIAPE documents that describe sample preparation, MS and data analysis conditions, respectively. (C) Web interface: The LymPHOS application provides a web form that enables the uploading of phos-files and the associated MIAPE documents.

3.1 Data filtering

A major issue in large-scale shotgun MS analysis is the selection of correct identifications from a high number of spectra. Spectra included in LymPHOS were filtered using both the Xcorr and D-value. Although the D-value includes Xcorr in its calculation, we found that Xcorr/D-value distributions were more effective in rescuing assignations of low D-value but high Xcorr. These assignations frequently correspond to phosphopeptides with alternative, ambiguous assignations of similar Xcorr. In these cases, the parameter ΔCn (the difference between the Xcorr of the best and second-best candidates), which is also considered in the D-value calculation, is low and hence produces a low score. These assignations are discarded when only the D-value parameter is used. We tested several procedures to establish the boundaries in the Xcorr/D-value space that separate true from false positives with a given FDR. The use of a linear boundary of the type Xcorr = a × D-value + b was a convenient method for this purpose. Values of a and b were calculated with an iterative procedure based on the generation of random values of these parameters in a given range. Values providing the highest number of true positive spectra for a given FDR were used iteratively to restrict the search range.

Confident assignation of the p-site was performed on the basis of the Ascore values. After Ascore evaluation, about 10% of the initial p-sites assigned by Bioworks were relocated. In order to improve the efficiency of p-site location, MS2-MS3 spectra pairs pointing to the same peptide sequence were collected and used for Ascore calculation even when one of them was below the cutoff value in the Xcorr/D-value filtering step. Low-score assignations in such pairs may be useful for supporting the location of the phosphate in the sequence derived from its MSn partner data. In this respect, the dominance of the phosphate neutral loss signal at m/z = (Mr + n − 98)/n (for n = 1–3) in MS2 spectra is often considered to result in reduced sequence information, producing hard-to-interpret fragmentation patterns [20] and consequently assignations with poor scores. Surprisingly, from the set of MS2-MS3 pairs selected, 64% of the assignations from the MS2 spectra passed the Xcorr/D-value filter, while for MS3 spectra this value was only 46%. Of the 188 pairs of MS2-MS3 spectra, only 26 produced distinct p-site assignations. In most cases, these conflicting assignations were derived from MS2-MS3 pairs in which one spectrum produced a high Ascore while the other had a low Ascore, below the cutoff value of 19. In three cases we found two alternative assignations, each supported by MS2 and MS3 spectra of high Ascore. Manual analysis of these spectra indicated that they were produced by the presence of both alternative phosphopeptide forms (Fig. 4). Ascore thus appears to have some limitations for the correct evaluation of these spectra.

[Figure 4 panels: MS/MS spectrum of the precursor at m/z 869.0 (+2) and the corresponding MS3 spectrum (869.0 → 820), annotated with b and y fragment ions; axes: m/z and relative abundance.]

Figure 4. MS2 and MS3 spectra tentatively corresponding to a mixture of two positional phosphopeptide isomers. Ascore assigns the MS2 spectrum to IEDVGsDEEDDSGKDKK while MS3 is assigned to IEDVGSDEEDDsGKDKK, in both cases with a high score (54 and 28, respectively).

As peptide isomers that differ in the phosphate position are likely to elute closely in reversed phase chromatography, production of mixed spectra during a shotgun analysis of phosphopeptide extracts may be quite frequent. Thus, although we found Ascore to be a powerful tool for p-site analysis, methods with greater resolving power are still needed to deconvolute these special spectra. These tools are of importance in cases such as the study of multisite phosphorylation, a phenomenon of great relevance for the understanding of the role of phosphorylation in protein function, especially in the study of signal multiplexing during signal transduction [21, 22].
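The iterative random search for the linear boundary Xcorr = a × D-value + b described at the start of this section could be organized as in the following sketch. The acceptance rule (hits on or above the line pass), the FDR estimator (2 × decoys / accepted hits), the range-halving schedule and all function names are assumptions; the article does not specify them.

```python
import random


def evaluate(hits, a, b):
    """hits: list of (xcorr, d_value, is_decoy). Returns (targets accepted, FDR)."""
    accepted = [h for h in hits if h[0] >= a * h[1] + b]
    decoys = sum(1 for h in accepted if h[2])
    targets = len(accepted) - decoys
    fdr = 2.0 * decoys / len(accepted) if accepted else 0.0  # assumed estimator
    return targets, fdr


def fit_boundary(hits, a_range, b_range, max_fdr=0.01,
                 rounds=5, samples=200, seed=0):
    """Random search for (targets, a, b) maximizing accepted targets under max_fdr."""
    rng = random.Random(seed)
    best = None
    for _ in range(rounds):
        for _ in range(samples):
            a = rng.uniform(*a_range)
            b = rng.uniform(*b_range)
            targets, fdr = evaluate(hits, a, b)
            if fdr <= max_fdr and (best is None or targets > best[0]):
                best = (targets, a, b)
        if best is None:
            break
        # Restrict the search: halve each range and centre it on the best point.
        _, a_best, b_best = best
        a_half = (a_range[1] - a_range[0]) / 4
        b_half = (b_range[1] - b_range[0]) / 4
        a_range = (a_best - a_half, a_best + a_half)
        b_range = (b_best - b_half, b_best + b_half)
    return best
```

A negative slope a lets a hit with a low D-value pass if its Xcorr is high enough, which is the rescue effect discussed above.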

3.2 Peptide–protein association

Shotgun proteomics has a strong peptide-centric nature. Digestion with an enzyme generates a mixture of peptides in which the possibility of associating a peptide with its original protein is lost if other proteins sharing this specific sequence exist. This has been described as the protein inference problem [23]. Tentative assignation of a peptide to a given protein in preference to other candidates is sometimes supported by the presence in the sample of other peptides that also point to that specific protein sequence. However, this premise is not applicable when PTM-selective purification is involved, such as in phosphopeptide concentration and detection. Peptide–protein association in the LymPHOS database takes this characteristic into account, so that each peptide is associated with every possible protein form annotated in the combined UniProt SwissProt/TrEMBL human protein databases. Unfortunately, the Bioworks reports already prepared as part of the LymPHOS workflow could not be used directly for this purpose, as Bioworks lists only one protein candidate for each peptide (although the Bioworks viewer shows a number of proteins sharing the target sequence). Thus, a parallel database search was performed using a Perl script to provide an associative list correlating each phosphopeptide sequence with all the proteins containing that sequence. About 28% of the peptides in the database mapped to a single protein sequence, while another 24% were common to more than five proteins. On average, each sequence was associated with three proteins.
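The associative list described in Section 3.2 can be illustrated with a short sketch. The original tool was a Perl script that is not reproduced in the article; the Python below only shows the general idea. The function name `map_peptides_to_proteins`, the toy accessions and sequences, and the use of a plain substring search are all assumptions made for illustration, not the LymPHOS implementation.

```python
from collections import defaultdict


def map_peptides_to_proteins(peptides, proteins):
    """Return {peptide: sorted list of accessions whose sequence contains it}.

    peptides: iterable of peptide strings; a lowercase letter (e.g. 's')
              marks a phosphorylated residue and is ignored when searching.
    proteins: dict mapping accession -> protein sequence (uppercase).
    """
    mapping = defaultdict(list)
    for peptide in peptides:
        bare = peptide.upper()  # drop the p-site marking before searching
        for accession, sequence in proteins.items():
            if bare in sequence:
                mapping[peptide].append(accession)
        mapping[peptide].sort()  # also creates an empty list when nothing matched
    return dict(mapping)


# Hypothetical toy database: accessions and sequences are invented.
toy_proteins = {
    "P1": "MKAEQSLHDLQERLLK",
    "P2": "GGAEQSLHDLQERAA",
    "P3": "MSTNPKPQR",
}
toy_mapping = map_peptides_to_proteins(["AEQsLHDLQER", "TNPKPQR"], toy_proteins)

# Share of peptides that map to exactly one protein, and mean proteins per peptide.
counts = [len(accessions) for accessions in toy_mapping.values()]
fraction_unique = sum(1 for c in counts if c == 1) / len(counts)
mean_proteins = sum(counts) / len(counts)
```

Keeping every matching accession, rather than a single best candidate, is what allows summary figures such as the share of peptides mapping to one protein to be computed directly from the mapping.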

3.3 Phos-file documents

All the information obtained up to this point (Bioworks scores, peptides and physical parameters, in-house calculated D-value and Ascore, spectra, and protein names and sequences) was packed into a single phos-file. Phos-files are generated by a custom-made Perl application that unifies the different scripts mentioned in the preceding sections. For each spectrum, this application performs all the described operations: Bioworks report parsing, extraction of the MS stage and spectrum via XDK, D-value calculation, peptide filtering at FDR < 1% based on Xcorr and D-value, Ascore determination and p-site validation, and finally the search for the set of proteins mapped by the validated peptide. Although we developed this scheme for our own convenience (currently, the raw format is the original format of all our MS instruments), it can be easily extended to more general cases. The only requirements for the generation of phos-file documents are the availability of a plain text or XML file with the MS data and a spreadsheet document containing the corresponding peptide identification data.

The phos-file is a text file divided into two consecutive blocks (Figs. 3 and 5). The first block contains all the information relative to each spectrum, arranged as a tab-delimited list. This includes all the peptide information extracted from Xcalibur and Bioworks (MS stage, D-value, Ascore, validated peptide sequence, and spectrum, in the form of a mass-intensity array) as well as the list of accession numbers of the proteins mapped by each peptide. The second block contains a non-redundant list of the proteins mapped by the full data set in three tab-delimited columns: the accession number and name, the description and the sequence of each protein.

Figure 5. Phos-file example. The phos-file comprises two parts separated by a text flag: (A) Each data entry is listed with all the collected information: parameters and scores derived from the search engine report, the list of proteins in the database used containing the peptide, the mass and intensity arrays, calculated scores (D-value and Ascore) and the validated peptide (peptide with p-site assignation revised using Ascore). (B) List of the proteins containing the previous list of peptides. Data for each protein are arranged in three columns: the Accession Number and SwissProt/TrEMBL ID, the SwissProt/TrEMBL description line and the full sequence.

[Figure 5 content: panel A columns include Filename, Scan, MS, [M+H]+, deltaMass, z, log(Prob), XCorr, δCn, Sp, RSp, ions, count, BW Protein, BW Peptide, Set of proteins containing peptide, Spectra -> Mass, Spectra -> Intensity, Rt(s), D, Ascore and Final peptide. Panel B begins at the flag line "### PROTEIN SEQUENCES".]

The phos-file format minimizes the work to be done when queries are submitted to the LymPHOS database, as all calculations are already performed and the peptide–protein relationships are directly supplied to the database. Only useful, non-redundant information is stored in this file, thereby providing a very compact file format of about 11 kB per spectrum. Furthermore, data handling can be greatly simplified, as several raw files obtained in the same experimental conditions can be processed together, thus creating a single phos-file. Phos-files can also be used as backup files of the information stored in the LymPHOS database.

Phos-files are a very convenient and efficient reservoir for storing analytical information, although they also have some limitations. In particular, the use of pre-analyzed, static data to build the phos-files implies that the information on peptides and peptide–protein relationships is bound to the content of the UniProt database at the moment of file creation. This makes it necessary to periodically generate new phos-files from the entire information contained in the LymPHOS database using the updated protein databases.

Figure 6. Peptide and protein views. Screenshot of the two main views in the LymPHOS web application.

[Figure 6 content: the Protein view shows the accession number and protein name, the sequence with its phosphosites, the list of peptides supporting each phosphosite, and Expasy/UniProt information and links. The Peptide view shows the phosphopeptide, the list of proteins containing it, the spectral parameters, scores and experimental conditions, and the mass spectrum with annotated fragmentation pattern.]
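The two-block phos-file layout described in Section 3.3 could be read back with a sketch like the one below. Only the flag text "### PROTEIN SEQUENCES" and the pipe-delimited look of the arrays are taken from Fig. 5. The presence of a header row in the first block, the column names used in the toy record, and the function names `parse_phos_file`, `parse_array` and `parse_protein_set` are assumptions for illustration. The application that actually generates phos-files is a Perl program not shown in the article.

```python
PROTEIN_FLAG = "### PROTEIN SEQUENCES"  # flag text as it appears in Fig. 5


def parse_phos_file(text):
    """Split phos-file text into spectrum records and a protein table.

    Assumes (hypothetically) that block A starts with a header row and that
    both blocks are tab-delimited.
    """
    head, _, tail = text.partition(PROTEIN_FLAG)
    lines = [ln for ln in head.splitlines() if ln.strip()]
    spectra = []
    if lines:
        header = lines[0].split("\t")
        spectra = [dict(zip(header, ln.split("\t"))) for ln in lines[1:]]
    proteins = {}
    for line in tail.splitlines():
        if not line.strip():
            continue
        # Three columns: accession and ID, description, full sequence.
        accession, description, sequence = line.split("\t", 2)
        proteins[accession] = {"description": description, "sequence": sequence}
    return spectra, proteins


def parse_array(field):
    """Turn a pipe-delimited array such as '|628.710|777.079|' into floats."""
    return [float(x) for x in field.split("|") if x.strip()]


def parse_protein_set(field):
    """Split '||ACC|ID||ACC|ID' into a list of 'ACC|ID' entries."""
    return [entry for entry in field.split("||") if entry]


# Hypothetical toy file: column names and values are invented for the example.
toy_text = (
    "Filename\tScan\tMS\tProteins\tMass\tFinal peptide\n"
    "RUN_1.RAW\t422\tMS2\t||ACC1|TOY1_HUMAN||ACC2|TOY2_HUMAN\t|654.3|655.2|\tAEQsLHDLQER\n"
    "### PROTEIN SEQUENCES\n"
    "ACC1|TOY1_HUMAN\tToy protein 1\tMKAEQSLHDLQERLLK\n"
    "ACC2|TOY2_HUMAN\tToy protein 2\tGGAEQSLHDLQERAA\n"
)
toy_spectra, toy_proteins_table = parse_phos_file(toy_text)
```

Because the protein block is non-redundant, each spectrum record only needs to carry accession numbers; sequences and descriptions are looked up once in the second block.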

3.4 Overview of LymPHOS

LymPHOS is an open access database (www.lymphos.org) that presents the experimental p-site information in two modes. In the Peptide view mode, the information is accessible as a list of peptides with indication of the corresponding p-sites. This mode reflects the peptide-centric nature of shotgun proteomics, making experimental information directly accessible. Tentative proteins to which the peptide could belong appear as a list of references for the corresponding entry. In the Protein view mode, protein sequences are depicted with an indication of the p-sites and the complete list of phosphopeptides supporting the p-site locations. These two views are connected, and it is possible to browse from a given protein to the information on any of its related peptides and vice versa. In addition, a detailed view of the experimental conditions for each spectrum can be accessed.

3.5 Peptide and protein search and visualization

From the “Search” page, peptide information can be accessed by querying specific sequences or by selecting a peptide from the list of database sequences. In either case, the “Peptide view” page is displayed (see Fig. 6). This page shows a heading with the peptide sequence and p-site, followed by a list of the proteins mapped by the peptide. At the bottom of the page, all the mass spectra supporting the peptide assignation are displayed, with their corresponding experimental information and scores. The definition of each parameter is provided in a mouse-scrolling-sensitive environment. Spectra can be downloaded in Mascot generic format (mgf); other formats, such as mzML [24], will be implemented in the future. High confidence p-site assignations are highlighted using a color code that distinguishes sequences with an Ascore greater than 19 (99% certainty of correct assignation) or with only one possible phosphorylation site (green) from those with an Ascore lower than 19 (orange). Proteins listed on this page are hyperlinked to the corresponding “Protein view”.

Proteins can be searched from the “Search” page by their UniProt accession number or description line, or selected from the list of database sequences. Protein information can also be obtained through the links available in the “Peptide view” page. At the top of the “Protein view” page, the SwissProt/TrEMBL Accession Number and ID, protein description and sequence are shown. All p-sites annotated in the LymPHOS database for the protein are indicated, together with a list of all phosphopeptide sequences supporting them. These sequences are hyperlinked to the corresponding “Peptide view”. At the bottom of the page, some useful information parsed from the Expasy web page is available, including other experimentally described p-sites, annotated protein function and cellular location, and a link to the protein entry in the UniProt database.
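The color code and the mgf download mentioned in Section 3.5 can be sketched as follows. This is an illustration under stated assumptions, not the code behind the LymPHOS web application: the function names are hypothetical, an Ascore of exactly 19 is treated as orange because the text does not specify that case, and the conversion from [M+H]+ to precursor m/z is added here because mgf expects an m/z value in the PEPMASS field.

```python
ASCORE_THRESHOLD = 19.0  # threshold quoted in Section 3.5
PROTON_MASS = 1.007276   # mass of a proton in Da


def confidence_color(ascore, n_candidate_sites):
    """Green if Ascore > 19 or only one possible p-site exists; orange otherwise.

    An Ascore of exactly 19 falls in 'orange' here (an assumption).
    """
    if n_candidate_sites == 1 or ascore > ASCORE_THRESHOLD:
        return "green"
    return "orange"


def mh_to_mz(mh, charge):
    """Convert a singly protonated mass [M+H]+ to m/z at the given charge."""
    return (mh + (charge - 1) * PROTON_MASS) / charge


def to_mgf(title, precursor_mz, charge, masses, intensities):
    """Format one spectrum as a Mascot generic format (mgf) ion block."""
    lines = [
        "BEGIN IONS",
        f"TITLE={title}",
        f"PEPMASS={precursor_mz:.4f}",
        f"CHARGE={charge}+",
    ]
    # One 'm/z intensity' pair per line, taken from the mass-intensity array.
    lines += [f"{m:.3f} {i:.3f}" for m, i in zip(masses, intensities)]
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


# Hypothetical spectrum used only to exercise the functions.
toy_block = to_mgf(
    title="toy_scan_422",
    precursor_mz=mh_to_mz(1405.61, 2),
    charge=2,
    masses=[654.295, 655.157],
    intensities=[39184.613, 4696.431],
)
```

Checking the single-candidate case first means that a peptide with only one serine, threonine or tyrosine is shown in green regardless of its Ascore, as the text describes.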

4

Concluding remarks

We have implemented a workflow for handling the experimental, mass spectrometric and data analysis information derived from a shotgun phosphoproteome analysis, as well as a relational, open access database (LymPHOS) in which to store and consult this information. Although we specifically built these tools to help in the analysis of the human T-lymphocyte phosphoproteome, the workflow described could also be useful for the storage and management of proteomics experimental data from other studies related to PTM characterization.

This work was supported by grants BIO2004-01788 and BIO2008-03365 from the Ministerio de Ciencia y Tecnología. The LP-CSIC/UAB is a member of ProteoRed, funded by Genoma Spain, and follows the quality criteria set up by ProteoRed standards.

The authors have declared no conflict of interest.

5

References

[1] Bodenmiller, B., Malmstrom, J., Gerrits, B., Campbell, D. et al., PhosphoPep – a phosphoproteome resource for systems biology research in Drosophila Kc167 cells. Mol. Syst. Biol. 2007, 3, 1–11.
[2] Wilson-Grady, J. T., Villen, J., Gygi, S. P., Phosphoproteome analysis of fission yeast. J. Proteome Res. 2008, 7, 1088–1097.
[3] Zahedi, R. P., Lewandrowski, U., Wiesner, J., Wortelkamp, S. et al., Phosphoproteome of resting human platelets. J. Proteome Res. 2008, 7, 526–534.
[4] Beausoleil, S. A., Villen, J., Gerber, S. A., Rush, J., Gygi, S. P., A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24, 1285–1292.
[5] Lu, B., Ruse, C., Xu, T., Park, S.-K., Yates, J., III, Automatic validation of phosphopeptide identifications from tandem mass spectra. Anal. Chem. 2007, 79, 1301–1310.
[6] Bodenmiller, B., Campbell, D., Gerrits, B., Lam, H. et al., PhosphoPep – a database of protein phosphorylation sites in model organisms. Nat. Biotechnol. 2008, 26, 1339–1340.
[7] Diella, F., Gould, C. M., Chica, C., Via, A., Gibson, T. J., Phospho.ELM: a database of phosphorylation sites – update 2008. Nucleic Acids Res. 2007, 36, D240–D244.
[8] Gnad, F., Ren, S., Cox, J., Olsen, J. et al., PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, prediction of phosphosites. Genome Biol. 2007, 8, R250.
[9] Heazlewood, J., Durek, P., Hummel, J., Selbig, J. et al., PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor. Nucleic Acids Res. 2008, 36, D1015–D1021.
[10] Hornbeck, P. V., Chabra, I., Kornhauser, J. M., Skrzypek, E., Zhang, B., PhosphoSite: a bioinformatics resource dedicated to physiological protein phosphorylation. Proteomics 2004, 4, 1551–1561.
[11] Hummel, J. N. M., Wienkoop, S., Schulze, W., Steinhauser, D. et al., ProMEX: a mass spectral reference database for proteins and protein phosphorylation sites. BMC Bioinformatics 2007, 23, 216.
[12] Prince, J. T., Carlson, M. W., Wang, R., Lu, P., Marcotte, E. M., The need for a public proteomics repository. Nat. Biotechnol. 2004, 22, 471–472.
[13] Carrascal, M., Ovelleiro, D., Casas, V., Gay, M., Abian, J., Phosphorylation analysis of primary human T lymphocytes using sequential IMAC and titanium oxide enrichment. J. Proteome Res. 2008, 7, 5167–5176.
[14] Eng, J. K., McCormack, A. L., Yates, J. R., An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
[15] Elias, J. E., Gygi, S. P., Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207–214.
[16] Higdon, R., Hogan, J. M., Belle, G. v., Kolker, E., Randomized sequence databases for tandem mass spectrometry peptide and protein identification. OMICS A J. Integr. Biol. 2005, 9, 364–379.
[17] Consortium, U., The Universal protein resource (UniProt). Nucleic Acids Res. 2008, 36, D190–D195.
[18] Keller, A., Nesvizhskii, A. I., Kolker, E., Aebersold, R., Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–5392.
[19] Taylor, C. F., Paton, N. W., Lilley, K. S., Binz, P.-A. et al., The minimum information about a proteomics experiment (MIAPE). Nat. Biotechnol. 2007, 25, 887–893.
[20] DeGnore, J. P., Qin, J., Fragmentation of phosphopeptides in an ion trap mass spectrometer. J. Am. Soc. Mass Spectrom. 1998, 9, 1175–1188.
[21] Cohen, P., The regulation of protein function by multisite phosphorylation – a 25 year update. TIBS 2000, 25, 596–601.
[22] Yang, F., Stenoien, D. L., Strittmatter, E. F., Waan, J. et al., Phosphoproteome profiling of human skin fibroblast cells in response to low- and high-dose irradiation. J. Proteome Res. 2006, 5, 1252–1260.
[23] Nesvizhskii, A. I., Aebersold, R., Interpretation of shotgun proteomic data. The protein inference problem. Mol. Cell. Proteomics 2005, 4, 1419–1440.
[24] Deutsch, E., mzML: a single, unifying data format for mass spectrometer output. Proteomics 2008, 8, 2776–2777.
