Data standards for proteomics: mitochondrial two-dimensional polyacrylamide gel electrophoresis data as a model system


Mitochondrion 3 (2004) 327–336 www.elsevier.com/locate/mito

Veerasamy Ravichandran a,*, Gregory B. Vasquez a, Sudhir Srivastava c, Mukesh Verma c, Emanuel Petricoin d, Joshua Lubell b, Ram D. Sriram b, Peter E. Barker a, Gary L. Gilliland a

a Biotechnology Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD 20899, USA
b Manufacturing Systems Integration Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD 20899, USA
c Division of Cancer Prevention, National Cancer Institute, Rockville, MD 20852, USA
d Division of Therapeutic Products, Office of Therapeutics Research and Review, Center for Biologics Evaluation and Research, FDA, Bethesda, MD 20892, USA

Received 2 October 2003; received in revised form 20 January 2004; accepted 5 February 2004

Abstract

Proteomics has emerged as a major discipline, which has led to a re-examination of the need for consensus and a nationally sanctioned set of proteomics technology standards. Such standards for databases and data reporting may be applied to two-dimensional polyacrylamide gel electrophoresis (2D PAGE) technology as a pilot project for assessing global and national needs in proteomics, and the role of the National Institute of Standards and Technology (NIST) and other similar standards and measurement organizations. The experience of harmonizing the heterogeneous data included in the Protein Data Bank (PDB) provides a paradigm for technology in an area where significant heterogeneity in technical detail and data storage has evolved. Here we propose an approach toward standardizing mitochondrial 2D PAGE data in support of a globally relevant proteomics consensus.

© 2004 Elsevier B.V. and Mitochondria Research Society. All rights reserved.

Keywords: 2D gel electrophoresis; Data standards; Interoperability; Proteomics; Data uniformity

* Corresponding author. Address: Center for Advanced Research in Biotechnology, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850, USA. Tel.: +1-301-738-6215; fax: +1-301-738-6255. E-mail address: [email protected] (V. Ravichandran).

1567-7249/$20.00 © 2004 Elsevier B.V. and Mitochondria Research Society. All rights reserved. doi:10.1016/j.mito.2004.02.006

1. Introduction

Advances in genome sequencing have created an immense opportunity to understand, describe, and model whole living organisms (Service, 2001). With the completion of the Human Genome Project, the post-genomic era has begun. Both industry and the US government are investing heavily in proteomics, one of the major focus areas of this new era. Investigators are racing to identify the complete set of proteins encoded by the human genome and to determine their function and how they work together (Broder and Venter, 2000). These efforts are fuelling the growth of the biotechnology industry and will lead to many new products, including more and better drugs for improved healthcare (Fields, 2001).

Understanding cellular function and physiology requires insight into the complexity of the system-wide protein content of cells. Researchers want to understand the differences between healthy and diseased tissues and cells, and how differences in protein expression levels can be correlated with disease. The tools being employed are two-dimensional polyacrylamide gel electrophoresis (2D PAGE), mass spectrometry, and associated bioinformatic tools. Two-dimensional gel electrophoresis, which combines isoelectric focusing (IEF) in the first dimension and SDS-PAGE in the second dimension, has been the most widely used method for protein separation (Pandey and Mann, 2000; Anderson and Anderson, 2002). Mass spectrometry techniques include Isotope-Coded Affinity Tags (ICAT) for differentially labelling normal and diseased proteins, which can then be distinguished rapidly and with high throughput (Le Bihan et al., 2001). Bioinformatics involves the management and analysis of biological data using computing technologies.

Both public and private sectors are gearing up to use these techniques for the molecular analysis of healthy and diseased cells, for example by 2D PAGE separation of proteins. Many technical advances, including the automation of these methods, have led to the accumulation of a considerable body of 2D PAGE data. Most of the electrophoresis data are published in scientific journals, though for some studies only the analyzed results are published. Few Web resources provide 2D PAGE data. Comprehensive, structured information generated from these 2D PAGE data will aid in better understanding any particular protein of interest.

Currently, the necessary information for a particular protein or family of proteins is difficult to obtain from heterogeneous resources because of the lack of data standards, which leads to data compatibility problems. Consequently, automated information exchange between these resources is very limited. A new report from the National Institute of Standards and Technology (NIST) states that US industries could be saving $900 million annually simply by using a suite of international standards that reduce interoperability problems encountered in the exchange of digital product information (http://www.nist.gov/director/prog-orc/report02-5.pdf). Considering the enormous volume of heterogeneous proteomics data already existing, a system of standardization will provide more useful data and enhance the ability to query across these valuable data resources.

2. Benefits of data standards

Most 2D PAGE experimental details are described in natural language and are context-dependent. One of the principal barriers to interaction with 2D PAGE data lies in the complex and varied vocabulary used by biologists to define the attributes of each molecule. Thus, to retrieve the relevant information for each protein of interest, it is necessary to read the context and extract the data from each individual study. If a given experimental result must be compared with previously published results, the problem grows rapidly because of the context-based nature of the documents. Without a universally acceptable common protocol, it is very difficult to understand either the data or their representations from different sources. Defining data standards for each data item is thus the first step towards the generation of a uniform format for 2D data and the eventual achievement of 2D data interoperability. Data standards are also useful for efficient data annotation.
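As one illustration of how a shared vocabulary helps, the sketch below maps free-text descriptions of a staining method to a single controlled term. The term list, the synonyms, and the function name are hypothetical and chosen only for this example; a real vocabulary would come out of a community consensus process.

```python
# Hypothetical controlled vocabulary for the gel staining method.
# The canonical terms and synonyms are illustrative assumptions.
STAIN_SYNONYMS = {
    "coomassie": "Coomassie Brilliant Blue",
    "coomassie blue": "Coomassie Brilliant Blue",
    "cbb": "Coomassie Brilliant Blue",
    "silver": "Silver stain",
    "silver stain": "Silver stain",
    "silver nitrate": "Silver stain",
    "sypro ruby": "SYPRO Ruby",
}


def normalize_stain(free_text):
    """Return the controlled term for a free-text stain name, or None."""
    key = " ".join(free_text.lower().split())  # fold case, collapse spaces
    return STAIN_SYNONYMS.get(key)


reports = ["Coomassie  Blue", "CBB", "silver nitrate", "amido black"]
normalized = [normalize_stain(r) for r in reports]
```

Entries that come back as None (here, the last one) are the cases an annotator would have to resolve by hand, which is the cost a common vocabulary is meant to reduce.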

3. Informatics issues for 3D structural data

Hundreds of complex and heterogeneous data items exist for each individual 3D macromolecular experiment. As of July 2003 there were 21,838 macromolecular 3D structures deposited in the Protein Data Bank (PDB: http://nist.rcsb.org/pdb/), the single worldwide repository for the processing and distribution of 3D biological macromolecular structure data. The PDB is managed by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the National Institute of Standards and Technology (http://www.rcsb.org/pdb/).

Anticipating data growth and interoperability problems, in 1991 the International Union of Crystallography (IUCr) appointed a working group to develop data standards for macromolecular data. The Crystallographic Information File (CIF) was developed with a dictionary that defines the structure of CIF data files. A Dictionary Description Language (DDL) was developed to define the structure of the dictionaries themselves. The CIF data files, dictionaries and DDLs are expressed in a common syntax. Later, IUCr extended CIF to mmCIF (macromolecular Crystallographic Information File). In 1998 IUCr recommended 140 new definitions to accommodate newly emerging NMR data. mmCIF is now the standard representation for the PDB (Westbrook and Bourne, 2000).

mmCIF has a well-defined syntax and provides precise definitions and examples. It also defines data relationships, data types, range restrictions, allowed values, interdependencies, exclusivity, units, and methods. mmCIF has a data dictionary organized in table-like structures called 'categories' and is easily integrated with relational or object database management systems. By the early 1990s, the majority of journals required a PDB accession code, and at least one funding agency (National Institute of General Medical Sciences) adopted the guidelines published by the IUCr requiring data deposition for all structures. Through this community-based effort, the PDB now handles complex macromolecular data more efficiently.
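The idea of a dictionary that carries data types, range restrictions, allowed values, and units can be sketched in a few lines. The example below is not mmCIF and does not use any real dictionary; the category name `gel_run` and its items are hypothetical, picked to show what a dictionary-driven check of a 2D PAGE record might look like.

```python
# A toy data dictionary in the spirit of table-like 'categories'.
# Category and item names are hypothetical, not from any real dictionary.
DICTIONARY = {
    "gel_run": {
        "pH_min": {"type": float, "range": (0.0, 14.0), "units": "pH"},
        "pH_max": {"type": float, "range": (0.0, 14.0), "units": "pH"},
        "acrylamide_percent": {"type": float, "range": (0.0, 100.0), "units": "%"},
        "stain": {"type": str, "allowed": {"silver", "coomassie", "fluorescent"}},
    }
}


def validate(category, record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    items = DICTIONARY[category]
    for name, rule in items.items():
        if name not in record:
            problems.append(f"{category}.{name}: missing")
            continue
        value = record[name]
        # Strict type check: an int where a float is expected is rejected.
        if not isinstance(value, rule["type"]):
            problems.append(f"{category}.{name}: expected {rule['type'].__name__}")
            continue
        if "range" in rule:
            low, high = rule["range"]
            if not low <= value <= high:
                problems.append(f"{category}.{name}: {value} outside [{low}, {high}]")
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{category}.{name}: {value!r} not an allowed value")
    for name in record:
        if name not in items:
            problems.append(f"{category}.{name}: not defined in dictionary")
    return problems


good = {"pH_min": 3.0, "pH_max": 10.0, "acrylamide_percent": 12.0, "stain": "silver"}
bad = {"pH_min": 3.0, "pH_max": 15.5, "stain": "ink"}
```

Calling `validate("gel_run", bad)` reports the out-of-range pH, the missing acrylamide percentage, and the unrecognized stain. Because the rules live in data rather than in code, the same checker would work for any category a working group agrees on.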


4. Array issues

Recently, the microarray community responded to the high complexity and rapid growth of microarray data by creating consensus guidelines, Minimum Information About a Microarray Experiment (MIAME), to make these data much more useful and accessible (Brazma et al., 2003). The MIAME guidelines require that the minimum information about a published microarray-based experiment include a description containing: the experimental design, the array design, the sample information, and the hybridization procedures and parameters. Furthermore, the hybridization images, their specifications and quantified data, and the normalization controls must also be included in this description.

5. A 2D repository: A queryable 2D database

One of the main objectives of the 2D experimental data repository is to provide the community with detailed information about a given protein of interest, including qualitative and quantitative properties. This requires more comprehensive data for each of the data items and data groups. The current focus is on establishing standards for 2D experimental data annotation and data exchange/interoperability, which will subsequently facilitate the creation of a 2D experimental data repository. Eventually, this will be enhanced by adopting the Distributed Annotation System (DAS). DAS allows annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software (Stein, 2003).

If the 2D repository has sufficient data in enough categories, it will be possible to query the data. For example, the user may select from within the following categories: the protein name, the sample source condition and method of preparation, the electrophoretic conditions, and the staining method. Once the relevant categories are selected, the user can visualize the virtual 2D gel generated from the available 2D data, with the available related information presented through internal and external links.

6. A model system

The National Institute of Standards and Technology (NIST) has an established record for the development and certification of validated and standardized databases. NIST is currently developing a program to meet some of the needs of the proteomics communities. This effort will initially focus on mitochondrial proteomics, since the community has relatively focused issues. The human mitochondrion is a tangible system of about 1000 proteins. In addition, mitochondria are centrally involved in a large number of human disorders. Aside from their bioenergetic function, mitochondria regulate cell death, modulate ionic homeostasis, oxidise carbohydrates and fatty acids, and participate in numerous other catabolic and anabolic pathways. Consequently, defects in mitochondria, the cell's power plants, contribute to disorders ranging from heart degeneration to Alzheimer's disease.

The need for NIST's involvement in proteomics and the suitability of the mitochondrion as a model system were emphasized at the NIST workshop, 'Systems Biology Approaches to Health Care: Mitochondrial Proteomics', on September 17–18, 2002 (http://www.cstl.nist.gov/biotech/mito/mitoproteomics.html). The need for 2D PAGE reference and data standards for mitochondrial proteomics was particularly emphasized.

7. Theoretical model

As an initial attempt at standardization, we implemented a 2D reference based on the theoretical isoelectric point (pI) and molecular weight calculated from the sequence of each mitochondrial protein (both mitochondrially and nuclear encoded). A customizable interface was developed to permit complex queries that include the name of the protein, tissue, mitochondrial compartment, chromosome number where relevant, molecular weight range, pI range, and keywords (Fig. 1). The query results are presented along with the protein name, pI and molecular weight, with the protein name linked to the theoretical virtual 2D PAGE of that protein (Fig. 2). The virtual 2D PAGE shows the query protein's mobility on a virtual two-dimensional gel, based on the isoelectric point and molecular weight calculated from the protein sequence (Fig. 3). Each 2D spot's protein name is linked to detailed information about the protein (Fig. 4).

The information presented on the detail page includes: SwissProt information, description, cellular location, keywords, tissue, cellular function, similarity with other proteins, gene name, synonyms, chromosomal location, and protein sequence information such as the amino acid length, theoretical pI, and theoretical molecular weight. External database links are also presented on the detail page, including:

• Online Mendelian Inheritance in Man (OMIM; http://www.ncbi.nlm.nih.gov/omim)
• RefSeq (http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html)
• LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink)
• Genome Database (GDB; http://gdbwww.gdb.org)
• PubMed (http://www.ncbi.nlm.nih.gov/entrez)
• Enzyme Commission Number (http://www.expasy.org/cgi-bin/get-enzyme-entry)
• SwissProt 2D (http://us.expasy.org/ch2d)

The protein sequence can be highlighted using a mouse-over option that provides annotation such as the mitochondrial localization signal, variant information, and other protein sequence details. The protein sequence of interest can also be used to search the journal references, the SwissProt site for related proteins, or the Protein Data Bank sequences for related 3D structures. All the available details for any given protein can be viewed or downloaded as an Extensible Markup Language (XML) file, along with the Document Type Definition (DTD), for easy data interaction. Detailed documentation of each data element of the existing data is also presented (Fig. 6).
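The paper does not say which algorithm or constants were used for the theoretical pI and molecular weight, so the following is only a minimal sketch of one common approach: sum average residue masses for the molecular weight, and find by bisection the pH at which the Henderson-Hasselbalch net charge is zero. The pKa set is an assumption (published sets differ and give slightly different pI values), and the function names are hypothetical.

```python
# Commonly tabulated average residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # added once per chain for the free termini

# ASSUMPTION: one frequently used pKa set; the source does not specify its own.
PKA_POSITIVE = {"Nterm": 7.5, "K": 10.0, "R": 12.0, "H": 5.98}
PKA_NEGATIVE = {"Cterm": 3.55, "D": 4.05, "E": 4.45, "C": 9.0, "Y": 10.0}


def molecular_weight(seq):
    """Average molecular weight (Da) of an unmodified peptide chain."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def net_charge(seq, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    pos = 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE["Nterm"]))
    neg = 1.0 / (1.0 + 10 ** (PKA_NEGATIVE["Cterm"] - ph))
    for aa in seq:
        if aa in PKA_POSITIVE:
            pos += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            neg += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return pos - neg


def isoelectric_point(seq, tol=1e-4):
    """Bisect on pH in [0, 14]; net charge falls monotonically with pH."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(seq, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0
```

This ignores post-translational modifications and cleaved targeting sequences, both of which shift real spots away from their theoretical positions; that gap is one reason the experimental model in the next section is needed.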

Fig. 1. A customizable interface.


Fig. 2. Query result.

8. Experimental model

We plan to extend the 2D PAGE option of our theoretical model into an experimental model. To search a 2D PAGE database with selectable query options for an individual protein, it is necessary to have comprehensive data for each of the individual categories. Despite having insufficient data to build a queryable interface, we have begun our pilot study with mitochondrial 2D PAGE experimental data obtained from three research groups (Fountoulakis et al., 2002; Rabilloud et al., 1998; Celis et al., 1995), which have been sufficient to begin the construction of some static experimental models (Fig. 5). The protein 2D PAGE spots are assigned numbers based on the experimental run conditions. The protein information corresponding to each spot number is linked to the detail page as explained above (see 'Theoretical model').


Fig. 3. Theoretical 2D gel model.
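A virtual gel of this kind needs a rule for turning a (pI, molecular weight) pair into a position. The sketch below assumes a linear pI axis and a logarithmic molecular weight axis, a common approximation for SDS-PAGE migration; the axis ranges, image size, and function name are hypothetical and are not taken from the authors' implementation.

```python
import math


def gel_position(pi, mw_kda, pi_range=(3.0, 10.0), mw_range=(10.0, 200.0),
                 width=600, height=400):
    """Map (pI, MW in kDa) to pixel coordinates; None if off the gel."""
    pi_min, pi_max = pi_range
    mw_min, mw_max = mw_range
    if not (pi_min <= pi <= pi_max and mw_min <= mw_kda <= mw_max):
        return None
    # First dimension: position is taken as linear in pI.
    x = (pi - pi_min) / (pi_max - pi_min) * width
    # Second dimension: large proteins migrate less, so they stay near
    # the top of the image (y = 0); spacing is linear in log10(MW).
    span = math.log10(mw_max) - math.log10(mw_min)
    y = (math.log10(mw_max) - math.log10(mw_kda)) / span * height
    return round(x), round(y)
```

With these defaults a 100 kDa protein lands above a 20 kDa protein of the same pI, and a protein with pI 12 falls outside the displayed range and is not drawn.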

9. What needs to be done?

1. Data deposition: One of the principal barriers to constructing a database repository lies in the complex and varied vocabulary used by researchers to define the data elements of a molecule. The resource can be useful only if the information is described by a structured vocabulary along with well-defined relationships between the data items. Authors can enter the minimum required 2D PAGE data through a common web-based deposition system. In parallel, an effort must be established to gather 2D PAGE data from the published literature and integrate these data into the repository.

2. Data annotation: Each data point and its value need to be examined and validated for correctness and completeness by experienced annotators, both centrally and through a distributed annotation system. Where necessary, a literature search must be made to ascertain values for missing, incomplete or inaccurate data. The annotation of data elements also requires all of the related data records within a file to be consistent and properly integrated across each file group.

3. Data storage: The volume of mitochondrial proteomics data is growing at a nearly exponential rate, which poses a problem in terms of data management, scalability, and performance. Building the database structure is the first step towards the structured recording of electrophoresis data in a relational database. This structure consists of precisely defined data fields and precisely defined relationships between them, represented by links between the tables. Proteomics data, and 2D PAGE data in particular, are complex to model: there are many different types of data with numerous relationships. Data models are the logical structures used to represent a collection of entries and their underlying one-to-one, one-to-many, and many-to-many relationships. The main motivation for creating mitochondrial proteomics data models is to be able to implement them within database management systems, usually a relational database management system (e.g., ORACLE, SQL Server, SYBASE, MySQL, etc.).


Fig. 4. Detailed information about a selected protein.

This database is being modeled to handle heterogeneous data from various external data sources. Data analysis generates new data that must also be modeled and integrated.

4. Data distribution: Users should be able to retrieve their data of interest through a queryable web interface. Individual or grouped data should be downloadable in commonly used formats.
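The one-to-many and many-to-many relationships described under data storage can be made concrete with a small relational sketch. The authors used an ORACLE database; the example below instead uses SQLite from the Python standard library so that it runs anywhere, and every table name, column name, and row is a hypothetical placeholder rather than the authors' schema or data.

```python
import sqlite3

SCHEMA = """
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    gene_name  TEXT
);
CREATE TABLE gel (
    gel_id INTEGER PRIMARY KEY,
    tissue TEXT,
    stain  TEXT
);
-- One gel has many spots (one-to-many).
CREATE TABLE spot (
    spot_id INTEGER PRIMARY KEY,
    gel_id  INTEGER NOT NULL REFERENCES gel(gel_id),
    pi      REAL,
    mw_kda  REAL
);
-- A spot may hold several proteins and a protein may appear in several
-- spots (many-to-many), so the link gets its own table.
CREATE TABLE spot_protein (
    spot_id    INTEGER NOT NULL REFERENCES spot(spot_id),
    protein_id INTEGER NOT NULL REFERENCES protein(protein_id),
    PRIMARY KEY (spot_id, protein_id)
);
"""


def build_demo_db():
    """Create an in-memory database with invented placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO protein VALUES (1, 'protein A', 'GENEA')")
    conn.execute("INSERT INTO protein VALUES (2, 'protein B', 'GENEB')")
    conn.execute("INSERT INTO gel VALUES (1, 'liver', 'silver')")
    conn.executemany("INSERT INTO spot VALUES (?, ?, ?, ?)",
                     [(1, 1, 5.2, 60.0), (2, 1, 8.9, 15.0)])
    conn.executemany("INSERT INTO spot_protein VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 2)])
    return conn


def proteins_in_window(conn, pi_min, pi_max, mw_min, mw_max):
    """Names of proteins found in spots inside a pI / MW window."""
    rows = conn.execute(
        """SELECT DISTINCT p.name FROM protein p
           JOIN spot_protein sp ON sp.protein_id = p.protein_id
           JOIN spot s ON s.spot_id = sp.spot_id
           WHERE s.pi BETWEEN ? AND ? AND s.mw_kda BETWEEN ? AND ?
           ORDER BY p.name""",
        (pi_min, pi_max, mw_min, mw_max))
    return [row[0] for row in rows]
```

A query such as `proteins_in_window(build_demo_db(), 5, 6, 50, 70)` is the relational counterpart of selecting a pI range and a molecular weight range in a query interface.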

Fig. 5. Experimental 2D analysis of mitochondrial proteins.

10. Required information for 2D PAGE

Our broader aim is to aid in the establishment of a public repository for experimental 2D PAGE data. The minimum information necessary from any 2D PAGE experiment is that associated with the experimental details, in order to ensure firstly the reproducibility of the experiment, and secondly the interoperability of the results. The following data elements should be collected in association with their required data categories: source, experimental detail, sample preparation, gel running conditions, data analysis and author information. Each of these major categories may have predefined data items, recommended by data experts specialized in these areas.

For example, the source information should contain the source record, which specifies the biological and/or chemical source of each molecule in the entry. Sources should be described by both their common and scientific names. The cell line and strain should be given for immortalized cells when they help to uniquely identify the biological entity studied. Sources will be grouped into two types: natural sources and genetically modified sources.

Data items in the genetically modified category record details of the source from which the sample was obtained. Associated data for this category include: the gene modified in the source material for the experiment, the genetic variation (transgenic, knockout), the system used to express the recombinant protein, and the specific cell line used as the expression system (name, vendor, genotype, and phenotype).

Data items in the natural source category will record details of the sample source. Associated data for this category will include: the common name of the organism and its scientific name, the source condition (normal, disease), any genetic variation, sex, age, organ, tissue, cell, organelle, secretion, and cell line information.

The items above are suggestions given here as illustration. In practice, data elements are derived through a consensus process with input from all likely user communities. As such, detailed data elements and associations can be defined by the working group convened to arbitrate the standard data element definitions.
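To show how such a source category might be checked for completeness at deposition time, here is a minimal sketch using Python dataclasses. The field names follow the illustrative items listed above, but the choice of which fields are mandatory is an assumption made for this example; the paper leaves that decision to the working group.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NaturalSource:
    common_name: Optional[str] = None
    scientific_name: Optional[str] = None
    condition: Optional[str] = None          # e.g. "normal" or "disease"
    organ: Optional[str] = None
    tissue: Optional[str] = None
    organelle: Optional[str] = None


@dataclass
class GeneticallyModifiedSource:
    modified_gene: Optional[str] = None
    genetic_variation: Optional[str] = None  # e.g. "transgenic" or "knockout"
    expression_system: Optional[str] = None
    cell_line: Optional[str] = None


# ASSUMPTION: the mandatory items below are hypothetical placeholders.
REQUIRED = {
    NaturalSource: ("common_name", "scientific_name", "condition"),
    GeneticallyModifiedSource: ("modified_gene", "genetic_variation"),
}


def missing_items(source) -> List[str]:
    """List the required items that are empty for this source record."""
    return [name for name in REQUIRED[type(source)] if not getattr(source, name)]


sample = NaturalSource(common_name="human", scientific_name="Homo sapiens",
                       organelle="mitochondrion")
```

Here `missing_items(sample)` flags the source condition as absent, the kind of feedback a web-based deposition form could give an author before accepting an entry.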

11. Materials and methods

The mitochondrial protein data used to generate the theoretical model (selected on the basis of mitochondrial localization and function, as established by literature search) were consolidated from the following sources:

• SwissProt (http://www.ebi.ac.uk/swissprot/)
• LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink)
• Protein Data Bank (http://nist.rcsb.org/pdb)
• GenBank (http://www.ncbi.nlm.nih.gov)
• Genome Database (http://gdbwww.gdb.org/)
• Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/omim)
• Human Mitochondrial Genome Database (http://www.genpat.uu.se/mtDB)
• MITOMAP (http://www.mitomap.org)
• Neuromuscular Disease Center (http://www.neuro.wustl.edu/neuromuscular/mitosyn.html)
• Mendelian Inheritance and the Mitochondrion (http://srdata.nist.gov/mitdb/)
• Human 2D PAGE Database (http://proteomics.cancer.dk/jecelis/human_data_select.html)

Protein information related to the mitochondrion was gathered using a program written in Java and stored in an Oracle relational database. New data are periodically annotated and added to the database. For the experimental model, data related to mitochondrial 2D experimental results, including the experimental conditions, sources, protein name, gene name, pI, and molecular weight, were obtained from researchers and loaded into Oracle tables. A customizable, user-friendly interface was developed in HyperText Markup Language (HTML) to permit complex queries. The query results, along with the protein sequence and journal reference, are presented as an HTML page. The web-based virtual 2D images are generated dynamically using Java applets, servlets and Kavacharts (http://www.ve.com/index.html). The data documentation was developed using a tool from XMLSPY (http://www.xmlspy.com).

12. Disclaimer

The contents of this article do not necessarily reflect the views or policies of the National Institute of Standards and Technology (NIST). Any mention of commercial products within this article is for information only and does not imply recommendation or endorsement by NIST. The World Wide Web pages are provided as a public service by NIST. With the exception of material marked as copyright, information presented on these pages is considered public information and may be distributed or copied. Use of appropriate byline/photo/image credits is requested.

Fig. 6. A part of the data element documentation.
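The XML download with an accompanying DTD, described in the theoretical model section, might look roughly like the output of the sketch below. The element names, the DTD, and the function are all hypothetical, since the article does not reproduce the authors' actual document type; the example uses only the Python standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD; element names are illustrative, not the authors' schema.
DTD = """<!DOCTYPE protein [
  <!ELEMENT protein (name, gene, pI, molecularWeight)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT gene (#PCDATA)>
  <!ELEMENT pI (#PCDATA)>
  <!ELEMENT molecularWeight (#PCDATA)>
  <!ATTLIST molecularWeight units CDATA #REQUIRED>
]>"""


def protein_to_xml(name, gene, pi, mw_da):
    """Serialize one protein record as XML with an inline DTD."""
    root = ET.Element("protein")
    ET.SubElement(root, "name").text = name
    ET.SubElement(root, "gene").text = gene
    ET.SubElement(root, "pI").text = f"{pi:.2f}"
    mw = ET.SubElement(root, "molecularWeight", units="Da")
    mw.text = f"{mw_da:.1f}"
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0"?>\n' + DTD + "\n" + body


# Placeholder values, invented for the example.
document = protein_to_xml("example protein", "GENE1", 6.5, 25000.0)
```

Note that `xml.etree.ElementTree` writes and parses XML but does not validate against a DTD, so checking a file against its document type would need a validating parser.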

Acknowledgements

This work was supported in part by an Exploratory Research Grant (VR) from the Chemical Sciences and Technology Laboratory (CSTL), National Institute of Standards and Technology (NIST). We are grateful to Sundari Ravi for her technical assistance. We thank Drs. Michael Fountoulakis (Genomics Technologies, F. Hoffmann-La Roche Ltd., Pharmaceutical Research, Basel, Switzerland) and Julio Celis (Department of Medical Biochemistry and Danish Centre for Human Genome Research, Denmark) for providing us with the mitochondrial 2D data.

References

Anderson, N.L., Anderson, N.G., 2002. The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell. Proteomics 1, 845–867.
Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., et al., 2003. ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 31, 68–71.
Broder, S., Venter, J.C., 2000. Whole genomes: the foundation of new biology and medicine. Curr. Opin. Biotechnol. 11, 581–585.
Celis, J.E., Rasmussen, H.H., Gromov, P., Olsen, E., Madsen, P., Leffers, H., et al., 1995. The human keratinocyte two-dimensional gel protein database (update 1995): mapping components of signal transduction pathways. Electrophoresis 16, 2177–2240.
Fields, S., 2001. Proteomics in genomeland. Science 291, 1221–1224.
Fountoulakis, M., Berndt, P., Langen, H., Suter, L., 2002. The rat liver mitochondrial proteins. Electrophoresis 23, 311–328.
Le Bihan, T., Pinto, D., Figeys, D., 2001. Nanoflow gradient generator coupled with mu-LC-ESI-MS/MS for protein identification. Anal. Chem. 73, 1307–1315.
Pandey, A., Mann, M., 2000. Proteomics to study genes and genomes. Nature 405, 837–846.
Rabilloud, T., Kieffer, S., Procaccio, V., Louwagie, M., Courchesne, P.L., Patterson, S.D., et al., 1998. Two-dimensional electrophoresis of human placental mitochondria and protein identification by mass spectrometry: toward a human mitochondrial proteome. Electrophoresis 19, 1006–1014.
Service, R.F., 2001. Materials Research Society meeting. Assembling the supersmall and ultrasensitive. Science 294, 2074–2077.
Stein, L.D., 2003. Integrating biological databases. Nat. Rev. Genet. 4, 337–345.
Westbrook, J.D., Bourne, P.E., 2000. STAR/mmCIF: an ontology for macromolecular structure. Bioinformatics 16, 159–168.
