PSI-2: structural genomics to cover protein domain family space

July 11, 2017 | Author: Burkhard Rost | Categories: Genomics, Structure, Computational Biology, Proteomics, Structural Genomics, Biological Sciences, Humans, Animals, Proteins, Chemical Sciences, Protein Sequence Analysis, Protein Domains, Protein Conformation


NIH Public Access Author Manuscript Structure. Author manuscript; available in PMC 2010 August 12.

NIH-PA Author Manuscript

Published in final edited form as: Structure. 2009 June 10; 17(6): 869–881. doi:10.1016/j.str.2009.03.015.

PSI-2: Structural Genomics to Cover Protein Domain Family Space

Benoît H. Dessailly1,*, Rajesh Nair2, Lukasz Jaroszewski3, J. Eduardo Fajardo4, Andrei Kouranov5, David Lee1, Andras Fiser4, Adam Godzik3, Burkhard Rost6, and Christine Orengo1

1 Dept of Structural and Molecular Biology, University College of London (UCL), London WC1E 6BT, UK

2 1350 Piccard Dr, Center for Devices and Radiological Health, Food and Drug Administration, USA

3 The Burnham Institute, La Jolla, CA 92037, USA

4 Dept of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA

5 Dept of Chemistry and Chemical Biology, Rutgers, The State University of NJ, Piscataway, NJ 08854, USA

6 Dept of Biochemistry and Molecular Biophysics, Center for Computational Biology and Bioinformatics (C2B2), and Northeast Structural Genomics Consortium (NESG), Columbia University, 1130 St. Nicholas Ave. NY 10032, USA

Summary


One major objective of structural genomics efforts, including the NIH-funded Protein Structure Initiative (PSI), has been to increase the structural coverage of protein sequence space. Here, we present the target selection strategy used during the second phase of PSI (PSI-2). This strategy, jointly devised by the bioinformatics groups associated with the PSI-2 large-scale production centres, targets representatives from large, structurally uncharacterised protein domain families, and from structurally uncharacterised subfamilies in very large and diverse families with incomplete structural coverage. These very large families are extremely diverse both structurally and functionally, and are highly over-represented in known proteomes. On the basis of several metrics, we then discuss to what extent PSI-2, during its first three years, has increased the structural coverage of genomes, and contributed structural and functional novelty. Together, the results presented here suggest that PSI-2 is successfully meeting its objectives and provides useful insights into structural and functional space.

© 2009 Elsevier Inc. All rights reserved. * Corresponding author: [email protected]. Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Background

The multiple international genomics and metagenomics initiatives are providing us with sequences of hundreds of genomes and millions of genes. Analysis of this windfall is greatly aided by the fact that these millions of genes can be grouped into a much smaller number of gene families that, being related by evolution, share similarities in function and three-dimensional structure (Todd et al., 2001; Pegg et al., 2006; Reeves et al., 2006; Finn et al., 2008; Redfern et al., 2008). Gene family sizes seem to follow a power law with most families containing small numbers of members, and relatively few families being very large and diverse (Todd et al., 2001; Gerlt et al., 2001; Reeves et al., 2006; Marsden et al., 2007) (Figure 1).


Several explanations that rely on historical, functional or thermodynamic arguments have been proposed to rationalise the existence of these very large families (Goldstein, 2008). For example, it was suggested that certain types of functions in ancestral proteins made these more amenable to duplication and diversification (Ranea et al., 2006; Goldstein, 2008). Results from other analyses imply that particular structural folds are more likely to accommodate insertions and deletions, which in turn allow functional diversification during evolution (Reeves et al., 2006). Interestingly, whilst the total number of families keeps growing at an almost linear pace with new sequencing data, the number of such very large families remains essentially constant (Goldstein, 2008; Redfern et al., 2008). This phenomenon not only reflects the laws of statistics, but also seems to hint at the history of life on Earth, since several of these families contain very ancient genes that are present in organisms from all domains of life, often in multiple paralogues (Aravind et al., 2002; Goldstein, 2008). During their long evolution, genes from these families have had ample chances to diversify, both in structure and function. For example, analysis of bacterial genomes has shown that some of these ancient families have expanded linearly with genome size, and that the occurrence of multiple paralogues has allowed diversification of functions, increasing the functional repertoire of the organisms (Ranea et al., 2006). It is also worth noting that the largest domain families are often involved in essential functions (Shakhnovich et al., 2006), making them potentially interesting targets for better understanding disease-related processes, for instance.


Because of the observed modularity of proteins and the fact that many proteins, especially in eukaryotes, consist of multiple domains that can combine differently in other proteins, it is generally convenient to consider domains as the fundamental units of protein evolution (Ponting et al., 2002; Moore et al., 2008). Analyses of completed genomes have characterized the extent to which domains are duplicated and fused in different domain contexts. Whilst fewer than ten percent of the protein families in an organism are common to all kingdoms of life, over half the domain sequences in an organism are likely to belong to fewer than 200 families universal to all kingdoms of life (Lee et al., 2005; Ranea et al., 2006), appearing in diverse multi-domain contexts. A number of domain family resources, e.g. Pfam (Finn et al., 2008), CATH (Greene et al., 2007) and SCOP (Murzin et al., 1995), have emerged to capture evolutionary relationships between domains, enabling studies on the evolution of different functional roles in diverse relatives. Various large-scale efforts, among them structural genomics, are directed at attaining some level of description of all the known domain families. Structural genomics initiatives that have been set up world-wide have undoubtedly started modifying the way protein three-dimensional structures are used to address issues in several disciplines, among which are enzymology (Gerlt, 2007), protein folding (Fersht, 2008) and protein function prediction (Watson et al., 2007; Allali-Hassani et al., 2007). One major historical focus of many structural genomics efforts has been to increase structural coverage of known protein space, by selecting targets from novel, structurally uncharacterised protein families (Sali, 1998; Chandonia et al., 2006; Liu et al., 2007).
More recently, it has been argued that structural coverage of protein space could only be completed by concomitantly selecting targets from very large and diverse superfamilies, which often display extreme structural and functional diversity (Todd et al., 2001; Pegg et al., 2006; Reeves et al., 2006; Marsden et al., 2007). In addition, it can be expected that a more comprehensive sampling of structures from such very large superfamilies would help us better understand the determinants and the extent of their functional diversity (Reeves et al., 2006; Redfern et al., 2008). Accordingly, as part of the Protein Structure Initiative (PSI) funded by the NIH, structural genomics production centres have committed a significant part of their resources to solving structures of proteins from such diverse families (Norvell et al., 2007). Structural data will not only help rationalise the mechanism of functional divergence in these extremely large families but may also explain why some families are more highly recurrent in particular organisms or environmental contexts. Recently, the structural genomics target selection strategy was extended to target protein families that were shown to be over-represented in uncultured bacteria present in specific environments, such as the human distal gut, as identified by metagenomics studies.


In this article, we present the target selection strategy that is being followed by the four large-scale production centres of the NIH Protein Structure Initiative (i.e. JCSG (www.jcsg.org), MCSG (www.mcsg.anl.gov), NESG (www.nesg.org) and NYSGXRC (www.nysgrc.org)). This strategy has been aimed at two major objectives, namely (a) to provide structures from protein families representing significant proportions of the genome sequences, and (b) to study the structural basis of functional diversity in the most diverse and highly populated families. We specifically address the benefits of increasing our sampling of structure space in these very large families.

Historical considerations on PSI target selection strategy


The first phase of the Protein Structure Initiative (PSI-1), which started in September 2000 and ended in June 2005, did not specify the exact meaning of structural coverage of protein space and generally targeted ‘novel’ proteins showing no close relationship to any proteins of known structure. A simplistic general threshold of 30% sequence identity was widely adopted as a definition of novel targets (Vitkup et al., 2001). This threshold was selected based on evidence from CASP quality assessments (Moult, 2005), which suggested that 30% sequence identity was a reasonable cut-off for building homology models. The underlying idea was that each novel structure solved could in turn be exploited to provide approximate models of all close homologues (Sali, 1998). Centres participating in the PSI focused on targets from specific organisms, metabolic pathways or other medically relevant topics. For example, the JCSG focused on targets from Thermotoga maritima, the NESG on targets from human and other eukaryotes, and the MCSG on various pathogenic organisms, while the NYSGXRC solved structures of proteins involved in metabolic pathways and cancer. These four large-scale centres have continued to participate in PSI-2 since July 2005. PSI-1 succeeded in establishing a new model of structure determination whereby very large numbers of structures are solved by an automated high-throughput experimental pipeline, providing unparalleled productivity and cost savings. The four large-scale centres now involved in PSI-2 solved over 800 protein structures during the first phase of PSI, far more than conventional structural biology labs could have solved alone with a comparable amount of funding. PSI-1 thereby achieved one of its goals, namely to reduce significantly the cost of solving protein structures. Several reports in the literature have detailed the success of PSI-1 according to different criteria (Todd et al., 2005; Chandonia et al., 2006; Watson et al., 2007).
All these publications clearly suggest that PSI-1 was successful in significantly increasing the proportion of novel distinct protein structures deposited in the PDB (Berman et al., 2000), as well as the proportion of novel structural superfamilies and novel fold groups. The analysis by Todd et al. (2005) showed encouraging increases in numbers of structures solved in these different categories over the first 3 years. More recent analyses by Chandonia and Brenner (2006) confirmed these observations over 5 years (see Table 1). However, it was clear from both analyses that there were still considerable levels of redundancy between the PSI large-scale centres and also between the centres and the general structural biology community, despite the adoption of a centralized target tracking system (TargetDB, http://targetdb.pdb.org/) for publicizing information on selected targets and progress with these targets. In many cases this was due to similar targets having advanced too far in different pipelines by the time conflicts surfaced. In some situations, targets were not stopped because they involved a relative from a different species or with a different ligand bound that could potentially provide useful biological insights.

In order to reduce the redundancy in structures targeted and solved by the four PSI large-scale centres, the target selection strategy from PSI-1 was reviewed at the start of PSI-2, and a new joint initiative was started involving four BioInformatics Groups (jointly referred to as the BIG4), each being associated with one of the four large-scale centres. The aim was to improve the productivity of PSI by reducing the overlap among centres as far as possible and to coordinate efforts of all the centres towards the main goal of PSI.


All four large-scale PSI centres split their efforts among three major target lists, spending about 70% of their efforts on a centralized list targeting structural novelty, 15% on community-nominated targets and 15% on biomedically important targets. Here we present the strategies developed by the BIG4 in PSI-2 for assembling the centralized list targeting structural novelty, by focusing on uncharacterised domain families, as well as diverse relatives in very highly populated domain families of known structure which are predicted to be structurally and functionally dissimilar to previously determined structures. We also present our initial analyses of the structures deposited in the PDB by PSI-2 large-scale centres during the first three years of PSI-2, and examine the degree to which PSI-2 has been successful in increasing the proportion of distinct (less than 98% sequence identity to any structure pre-existing in the PDB – see Methods) and structurally novel structures solved since the beginning of the initiative. Even though it was not an explicit goal of PSI, we also assess the success of PSI-2 in contributing structural information for functional families, and thereby the degree to which PSI-2 has illuminated both structure and function space, by identifying the functional categories within the Gene Ontology classification (The Gene Ontology Consortium, 2000) for which PSI-2 has solved the first structure.

Target Selection Strategy and Domain Families Targeted in PSI-2

A primary aim in PSI-2 has been to increase the proportion of domain families for which one or more structures have been characterized, by a coarse-grained sampling of sequence space. One major challenge for the target selection strategy was therefore to construct a list of domain families with no representative structure. Domain families with at least 10 relatives were targeted more specifically so as to maximize the potential impact of PSI-2 structures via homology modelling. Here, these domain families are referred to as structurally uncharacterised large families, and were also referred to internally as BIG families. Even though a family with 10 relatives may arguably not qualify as large, this threshold was chosen as a compromise between selecting families of significant size and not restricting the final list to too few families given the other constraints we had in the selection procedure (e.g. no features that might affect structure determination – see below).

Some domain families are very large and very diverse both in terms of structure and function. The largest 200 CATH (Greene et al., 2007) families in Gene3D (Yeats et al., 2008) account for at least 50% of domain sequences in the genomes and yet Figure 2 shows that, for all but two of these very large CATH superfamilies, less than 10% of the sequence subfamilies (so-called modelling subfamilies – see Methods) within them have structural representatives in the PDB (Berman et al., 2000). Previous analyses of some of these very large families reveal that some proteins can be up to five times larger than other members of the same family, sometimes to the point of actually adopting a different fold (Reeves et al., 2006). Such structural divergence is often clearly correlated with divergence in function (see Figure 3). Our structural sampling of most of these very large and diverse families is very incomplete. Here, these domain families are referred to as very large and diverse families with incomplete structural coverage, and were also referred to internally as MEGA families. Another aim of PSI-2 has therefore been to target additional relatives in these MEGA families, with the expectation that this will give us deeper insights into the nature of structural divergence within a family, and into how structural changes between related domains bring about changes in function. This, in turn, should trigger improvements in algorithms that attempt to predict functions from structures. Finally, such a fine-grained sampling of subfamilies within diverse families is required to fully characterize the structural repertoire in nature.


In recent years, metagenomics experiments have revealed the extent of previously uncovered parts of the protein universe, which are found in complex communities of uncultured microbes from various environments (e.g. ocean, soil, human skin or gastrointestinal tract). On that account, PSI-2 centres also started a pilot project in which the above-mentioned target selection strategies were applied to include domain families that are over-represented in one of the most studied environments, namely the human distal gut microbiome. Sequence information from metagenomics can illuminate important functional roles being carried out by the bacterial communities found in specific habitats (Riesenfeld et al., 2004). For example, many bacterial proteins in the human gut are essential for breaking down complex food substrates and synthesizing vital nutrients such as vitamins. Understanding how these communities function and what populations are most beneficial to the human host is likely to be important for understanding and promoting human health and diagnosing conditions likely to lead to disease. In practice, as will be shown hereafter, these Gut Metagenome Families constitute a subset of the targeted large families (BIG and MEGA) mentioned in the above paragraphs.

Targeting structurally uncharacterised large families for coarse-grained sampling of sequence space


Bioinformatics groups (BIG4) from all 4 PSI Centres collaborated in developing a consensus strategy for target selection. Defining domain families is a complex issue, and a number of curated domain family resources such as Pfam (Finn et al., 2008) and TIGRFAMs (Haft et al., 2003) are now publicly available, which can facilitate research in this field. In order to benefit from these existing domain family resources but also from more optimal strategies for target selection, we applied a mixed protocol to identify suitable sequence families for coarse-grained targets. A primary list of large structurally uncharacterised families was constructed using Pfam, which is one of the most comprehensive manually curated resources. Exclusion of families with fewer than 10 relatives (see Methods) or with features that might affect success in structure determination (e.g. trans-membrane regions etc.) resulted in a total of 1369 target Pfam families, corresponding to approximately 20% of sequences in Pfam families without structural representatives. However, several problematic features of Pfam were identified, which originate from the fact that the aims of PSI efforts and the rules guiding Pfam classifications are similar but not identical. For example, the sequences clustered into a Pfam family sometimes represent a multi-domain family rather than a single domain family. The reverse problem arises in proteins that have been chopped into partial domains that are never found separately, may not constitute a proper domain and therefore cannot be solved experimentally. Since consensus approaches have historically been shown to be highly successful in bioinformatics, we attempted to solve these problems by using a collaborative approach involving several orthogonal methodologies for domain family definition, which would allow us to look for consensus families, i.e. families that were found by more than a single source. Therefore, the target list of Pfam families was supplemented by families identified using various automated protocols developed in the BIG4 groups, described below:

a. The Rost group identified domain families using the CLUP method (Liu et al., 2004), which applies an iterative domain chopping and comparison protocol to merge related sequences into families.

b. In the Orengo group, the Gene3D database (Yeats et al., 2008) was used to identify NewFam domain families, which are clusters of domain sequences built from regions of genome sequences that cannot be assigned to CATH or Pfam domain families (Marsden et al., 2006).

c. The Godzik group used an iterative protocol for building families from a broad range of protein sequence databases.

d. The Fiser group analysed the Pfam-B database (automatically generated domain clusters obtained from ProDom (Bru et al., 2005)) for structurally uncharacterised sequence families.

A combined target list of families identified by Pfam and by the BIG4 protocols was generated, and families found by more than one source were labelled (see Methods). Each centre used its own criteria to prioritise those families that it wished to target for structure determination, for example depending on its reagent genomes, and the families were then divided amongst the four large centres using a random pick procedure. This random pick assignment was iterated four times and, in total, 2357 families were distributed to the centres (see Table 2).

Targeting subfamilies in very large and diverse families with incomplete structural coverage for fine-grained sampling of sequence space


The Gene3D database (Yeats et al., 2008) was exploited to identify the most highly populated domain families with known structures in the genomes. This resource comprises more than five million protein sequences, including sequences from 520 completed genomes and the UniProt (UniProt Consortium, 2008) and RefSeq (Pruitt et al., 2007) databases. Putative domains are identified by scanning sequences against Hidden Markov Models (HMMs) derived from the CATH and Pfam domain databases, using conservative thresholds that have been carefully benchmarked with structural data. As of August 2008, approximately 37% of residues in protein sequences from Gene3D can be assigned to families of known structure in CATH, with a further 48% that can be assigned to Pfam families. Furthermore, approximately 55% of protein sequences in Gene3D contain at least one domain that can be assigned to a family in CATH. Figure 4 shows that the largest 200 domain families contain more than 290,000 modelling subfamilies. PSI-2 is unlikely to solve this number of structures over the next few years and rational approaches are clearly needed to select representatives that are structurally and functionally diverse. Therefore, a large part of the first year of PSI-2 (June 2005–June 2006) was dedicated to designing a robust target selection strategy and to developing the clustering and analysis tools needed to improve the rational selection of targets within the very large and diverse families selected. For example, in the NESG and NYSGXRC, research into improved methods of aligning sequences and deriving homology models led to revised thresholds for clustering sequences into modelling subfamilies on the basis of predicted structural similarity.
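To make the idea of threshold-based modelling subfamilies concrete, the following is a minimal sketch, assuming a simple greedy scheme in which each sequence joins the first cluster whose representative it matches at or above a sequence identity cut-off. The function names, the toy identity measure over pre-aligned sequences, and the default cut-off of 30% (the general homology-modelling threshold quoted earlier) are illustrative assumptions; the centres' actual clustering tools and revised thresholds are not described in this section.

```python
from typing import Callable, List


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned, equal-length sequences.

    Toy measure (an assumption for illustration): columns where both
    sequences carry a gap ('-') are ignored; a gap never counts as a match.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


def greedy_subfamilies(
    sequences: List[str],
    threshold: float = 30.0,
    identity: Callable[[str, str], float] = percent_identity,
) -> List[List[str]]:
    """Greedy clustering: the first member of each cluster is its representative."""
    clusters: List[List[str]] = []
    for seq in sequences:
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing representative is similar enough: start a new subfamily.
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    toy = ["AAAAAAAAAA", "AAAACCCCCC", "GGGGGGGGGG", "GGGGGGGGTT"]
    for i, members in enumerate(greedy_subfamilies(toy), start=1):
        print(f"subfamily {i}: {members}")
```

In such a scheme, one solved structure per cluster would in principle allow the remaining members to be modelled by homology, which is why the number of clusters matters for target selection.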


Similarly, the MCSG consortium developed the GEMMA approach (Lee et al., submitted), which exploits HMM-HMM strategies to progressively merge subfamilies of functionally related domains to enable selection of functionally diverse representatives. For some superfamilies this approach can reduce the number of predicted functionally diverse subfamilies to target by more than ten-fold, making it more feasible to achieve structural coverage of these diverse subfamilies. An additional constraint that operates when selecting representatives is the set of reagent genomes available to the centre for cloning, which restricts the choice of homologues for structure determination. A measure of success of this target selection strategy will be the degree of structural and functional novelty observed in the structures that are deposited in the PDB by the four centres during this second phase of PSI. This is reviewed below for the first three years of PSI-2.


For each of the most highly populated families in CATH that have been allocated to the PSI-2 large-scale centres, Table 3 gives the number of relatives identified in Gene3D, the number of different functional terms from the Gene Ontology (GO) (The Gene Ontology Consortium, 2000), the number of different modelling subfamilies it contains (where sequences are clustered into modelling subfamilies using a 30% sequence identity threshold), and the percentage of modelling subfamilies for which there is a solved structure. Table 3 also shows to which of the large-scale centres each family was allocated, as well as the date of allocation. The largest four of these very large families, also called SUPERMEGA superfamilies, were not allocated to any individual centre; instead, each centre prioritised individual modelling subfamilies within these superfamilies, largely on the basis of features which suited their experimental pipelines (e.g. presence of homologues in the reagent genomes used by the centre) and functional assignments (e.g. biologically interesting GO terms for which no structures were currently known). Modelling subfamilies from these four largest families were then assigned to each centre using the draft pick protocol. It is worth noting the disproportionately larger size of the superfamily of P-loop containing nucleotide triphosphate hydrolases (CATH code 3.40.50.300), as compared with the other very large superfamilies.

Targeting families that are over-represented in the gut microbiome


Two rounds of identification of protein families over-represented in the human gut microbiome were performed. For both rounds, protein sequences found in the gut microbiome were first grouped into homologous clusters (see Methods for further details). Comparing numbers of homologues from these clusters found in the gut microbiome and in other bacterial genomes allowed the identification of clusters that are significantly over-represented in the gut. The largest and most over-represented clusters were considered as potential targets. A subset of 1092 clusters from the first round and 136 clusters from the second round (defined by HMMs) were then selected as targets and equally divided amongst the four centres using the draft pick protocol. Many of these Gut Metagenome Families constitute a subset of the targeted large families (BIG and MEGA) mentioned in the above paragraphs; however, some represent novel families specific to the human gut environment.
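The draft pick protocol is named but not specified in this section. The following is a minimal sketch of one plausible reading, assuming that each centre ranks the candidate families by its own criteria, that the picking order is randomised in every round, and that each centre in turn takes its highest-ranked family not yet taken by another centre. The function name `draft_pick`, the `rounds` and `seed` parameters, and the family labels are hypothetical and are not taken from the PSI-2 pipelines.

```python
import random
from typing import Dict, List


def draft_pick(
    preferences: Dict[str, List[str]],
    rounds: int,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Assign families to centres so that no family is assigned twice.

    preferences maps a centre name to its ranked list of families
    (most wanted first). In each round, centres pick in a random order.
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    assigned: Dict[str, List[str]] = {centre: [] for centre in preferences}
    taken = set()
    for _ in range(rounds):
        order = list(preferences)
        rng.shuffle(order)  # assumption: pick order is re-drawn every round
        for centre in order:
            # Highest-ranked family of this centre that is still available.
            choice = next(
                (fam for fam in preferences[centre] if fam not in taken), None
            )
            if choice is not None:
                taken.add(choice)
                assigned[centre].append(choice)
    return assigned


if __name__ == "__main__":
    families = [f"FAM{i}" for i in range(1, 9)]  # hypothetical family labels
    prefs = {
        "JCSG": families,
        "MCSG": list(reversed(families)),
        "NESG": families[4:] + families[:4],
        "NYSGXRC": families[2:] + families[:2],
    }
    for centre, picks in draft_pick(prefs, rounds=2).items():
        print(centre, picks)
```

With equal numbers of picks per centre, a scheme of this kind yields an even split while still letting each centre favour families that suit its reagent genomes, which matches the stated intent of dividing targets equally and without overlap.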

Analysis of the Coverage of Genome Sequences by PSI-2 Structures and their Structural and Functional Novelty

It is possible to gauge the success of the structural genomics initiatives, in particular that of PSI-2, using a number of different measures. The total number of structures solved is an obvious preliminary indicator, but it must be considered with caution since it is not necessarily indicative of the actual impact and leverage of PSI-2, or of its success at meeting its objectives (Liu et al., 2007). One major aim of PSI-2 is to determine “novel” protein structures and, in that context, not all newly solved structures have the same value. For example, alternative structures of a given protein with different ligands can be crucial for better understanding the mechanism of a particular protein, but do not help in terms of structural novelty. We consider two measures to evaluate the success of PSI-2 at determining novel protein structures. First, we measure the extent to which these structures are affecting the structural coverage of known proteomes. Ultimately, this issue relies on the definition of modelling subfamilies and on how the newly solved structures can be used to provide valuable structural information on related sequences. Secondly, we directly measure the structural novelty of PSI-2 structures by comparing them with previously released structures using a normalised RMSD score.


Another means of assessing the success of the structural genomics initiatives and the potential value of this data to biologists is to consider the number of diverse functions which have been characterized experimentally and captured in public resources such as GO but for which there are no structural relatives. Solving representative structures for proteins possessing these functions will help in revealing the molecular mechanisms by which these proteins function and expand our understanding of functional space as well as structural space. For this reason, we also consider the number of functional groups that were previously uncharacterised structurally and for which PSI-2 has provided a first structural representative.

Total Number of Structures solved


Analyses were performed on all structures deposited in the PDB (Berman et al., 2000) by the four PSI-2 large-scale production centres from July 1st 2005 to July 1st 2008. Some of these analyses were conducted in collaboration with the PSI Structural Genomics Knowledgebase established at Rutgers University (http://kb.psi-structuralgenomics.org/KB/) (Berman et al., 2009). A total of 1600 structures were solved by the 4 centres in the first three years of PSI-2, amounting to 1502 distinct chains (~94%). This compares with a ratio of 61% of distinct chains to PDB entries (9597/15629) for the entire PDB (excluding PSI structures) over the same period of time. Of the 1502 distinct structures solved, 460 (~30%) were from BIG families, including 355 (~24% of the distinct structures) from Pfam families. During the first three years of PSI-2, 288 Pfam families had their first structure solved by PSI-2 large-scale centres, which is about 38% of all Pfam families (total of 748) for which a first structure was deposited in the PDB during the same period of time.

Genome Coverage

Previous analyses of structural coverage of known proteomes suggest that up to 30–40% of protein residues, and ~50% of domain sequences, can currently be assigned a structure by modelling (Liu et al., 2007; Marsden et al., 2007). This proportion varies with the sequence database used, and the prediction methods used to assign sequences to structural families (e.g. PSI-BLAST (Altschul et al., 1997), HMMs, profile-profile comparisons, threading etc). A non-negligible proportion of the remaining domain sequences in these proteomes belong to families that are problematic for high-throughput structure determination, because they are membrane associated, intrinsically disordered or have regions of low complexity, and are thus more appropriate targets for the specialized centres of PSI (Norvell et al., 2007).
A significant proportion of the remaining structurally uncharacterised and non-problematic sequences were targeted by the expanded BIG list (2298 families). The remaining targets chosen by the four centres came from 48 MEGA families and 136 META families. In total, 193249 targets have been selected over the three years since the start of PSI-2.


Figure 5 shows the increase in structural coverage of fractions of proteins and residues from UniProt, obtained by solving structures since the start of PSI-2, and compares the contributions of structures from the entire PDB, PSI-2 only and PSI-2 large-scale centres only (see Methods). Altogether, the fraction of UniProt proteins (residues) that can be structurally modelled is now reaching 48% (42%). This represents an increase of about 10% (6%) over the past three years, with a contribution of more than 2% (1.3%) from PSI-2 structures. In terms of increase in structural coverage, the contribution of PSI-2 is practically entirely due to structures solved by the four large-scale centres. About 23% of the increase in structural coverage of proteins in UniProt (UniProt Consortium, 2008) is due to structures from large-scale centres. The contribution of these structures is about 19% when defining structural coverage at the residue level. These contributions are rather encouraging given that structures from PSI-2 large-scale centres only account for around 13% of the distinct structures released in the PDB since July 1st 2005. This result is somewhat expected, particularly because targets have been specifically selected for the coarse-grained sampling of sequence space (i.e. BIG families) rather than to optimally increase modelling coverage. Since the data presented here only considers structures released within the first three years of PSI-2, the proportion of novel structural coverage due to PSI-2 may increase in the final 2 years as PSI-2 large-scale centres reach optimal productivity.

When considering specific proteomes, the contribution of the PSI-2 large-scale centres to the increase in structural coverage greatly depends on the type of organism. For example, there was a total of 7049 novel human proteins (~10% of the total number of human sequences in UniProt 12.8, i.e. 72034 protein sequences) for which a structure could be modelled thanks to structures deposited in the PDB between July 1st 2005 and July 1st 2008, but only 231 of these (i.e. 3.3% of the structural coverage increase, and 0.3% of the human proteome) were due to structures solved by the four large-scale centres (for residues, the fraction of the total increase in structural coverage that is due to large-scale centres is also 3.3%). In contrast, the contribution of the large-scale centres to novel structural coverage amounts to 37% for Escherichia coli over the same period of time, i.e. 206 out of 560 proteins for which a structure can now be modelled (respectively 5% and 13% of the total number of Escherichia coli sequences, i.e. 4381 protein sequences). For residues, the fraction of the total increase in structural coverage of Escherichia coli that is due to large-scale centres is 28.2%. These discrepancies between human and E. coli are somewhat expected given that large-scale centres have preferentially targeted prokaryotic proteins.
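The proteome-specific percentages quoted above follow directly from the raw counts. The short Python sketch below recomputes them from the numbers given in the text; the dictionary layout and the helper names `percentage` and `summarise` are illustrative choices and are not part of the original analysis.

```python
def percentage(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Counts quoted in the text for structures deposited between
# July 1st 2005 and July 1st 2008 (proteome sizes from UniProt 12.8).
COUNTS = {
    "human": {"proteome": 72034, "newly_modelled": 7049, "large_scale": 231},
    "E. coli": {"proteome": 4381, "newly_modelled": 560, "large_scale": 206},
}


def summarise(counts: dict) -> dict:
    """Derive the three percentages discussed in the text for one proteome."""
    return {
        # share of the proteome that became modellable during the period
        "increase_vs_proteome": percentage(counts["newly_modelled"], counts["proteome"]),
        # share of that increase attributable to the large-scale centres
        "large_scale_vs_increase": percentage(counts["large_scale"], counts["newly_modelled"]),
        # share of the whole proteome newly covered by the large-scale centres
        "large_scale_vs_proteome": percentage(counts["large_scale"], counts["proteome"]),
    }


if __name__ == "__main__":
    for organism, counts in COUNTS.items():
        print(organism, {k: round(v, 1) for k, v in summarise(counts).items()})
```

Only the protein-level figures can be recomputed this way; the residue-level fractions (3.3% and 28.2%) depend on residue counts that are not given in the text.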


Structural novelty of PSI-2 structures

Whether a new structure is deemed structurally novel depends on the criteria used to recognize structural similarity. Recent analyses of homologous domains in the CATH database revealed a mean value of 5Å for the normalised RMSD following superposition of homologous domains to be an appropriate cut-off for defining structural similarity (see Methods for definition of the normalised RMSD) (Cuff et al., submitted). Relatives superposing with higher normalised RMSD values have been observed to be structurally divergent, often due to significant structural embellishments to the cores of the structural domains (Reeves et al., 2006). Therefore a normalised RMSD cut-off of 5Å was applied to determine whether structures solved by PSI-2 and traditional structural biology were significantly structurally different from those previously deposited in the PDB (Berman et al., 2000). Since improved structural alignments can sometimes be obtained by aligning the constituent domains rather than complete multi-domain structures, all the structures were scanned against the CATH non-redundant domain library (CATH version 3.2).


Figure 6 shows that 28% of the domain structures solved by PSI-2 large-scale centres are structurally novel when using these criteria. This compares with 3% of domains solved by non-Structural Genomics structural biology worldwide which are structurally novel. These results cover domain structures solved by the PSI-2 large-scale centres, whether or not the targets were selected as part of BIG families. Of the 365 distinct domain structures (less than 98% sequence identity) from BIG families that have been solved and classified in CATH, 155 (42%) were found to be structurally novel according to the normalised RMSD cut-off of 5Å. Encouragingly, a significant proportion of structures from MEGA families were also found to be structurally novel (15%), as computed over the total number of 282 distinct domains from MEGA families solved by PSI-2 large-scale centres. This suggests that the strategies described above for selecting structurally diverse representatives from these families are performing well.


We also evaluated structural novelty by counting the number of structures that were the first representative of their superfamily or fold in CATH. Of the 859 distinct domain structures solved by PSI-2 large-scale centres which are classified in CATH, 102 structures comprise novel CATH superfamilies, and 28 comprise novel CATH folds. Unfortunately, equivalent numbers for non-structural genomics structural biology since June 2005 cannot be readily computed for comparison, because a specific effort to classify PSI-2 structures was made by curators for the most recent release of the CATH database (CATH v3.2). In addition, of the 365 distinct domain structures from BIG families, 75 (21%) were found to represent novel CATH superfamilies (including 21 that represented novel folds), whereas 290 (79%) were found to belong to previously existing CATH domain families, among which 116 (32%) were assigned to MEGA superfamilies. These BIG families are therefore clearly diverse subfamilies of the CATH families that were no longer recognizable by sequence-based protocols but that showed clear structural similarity to relatives from previously known CATH superfamilies.

Number of Structurally Uncharacterised Functional Groups for which structures were solved


In order to assess how well structural genomics was contributing structures towards the aim of increasing the number of functional groups with a representative structure, the number of functional categories in the Gene Ontology (GO) for which PSI-2 or structural biology solved the first representative structure was assessed. Of the 1502 distinct structures solved by PSI-2 large-scale centres, 51% could be mapped to a functional category in the GO database (molecular function ontology). This contrasted with 81% of structures solved by non-structural genomics structural biology worldwide. Similar ratios were obtained when considering the GO biological process ontology. Thus a significant proportion of PSI-2 structures have been functionally annotated, suggesting a non-negligible leverage of structural information from PSI-2 in terms of functional data. More importantly, 2.2% of distinct structures (i.e. 33 structures) solved by PSI-2 large-scale centres represented the first structure solved for one of their GO terms, including 27 structures for molecular function terms and 12 for biological process terms, with 7 structures being the first representative for one term of both categories. These GO terms, which consist mostly of enzymatic functions, are listed together with their representative PSI-2 structure in Tables 4 and 5 for molecular function and biological process, respectively. For comparison, 6% of distinct structures (i.e. 374 out of 6080 structures) solved by non-Structural Genomics projects and released in the PDB between July 1st 2005 and July 1st 2008 represented the first structure solved for one of their GO terms. Thus, the proportion of structures being first representatives of a given function is of the same order of magnitude for structures from PSI-2 large-scale centres and those from standard structural biology. This is encouraging given that targeting novel functions was not an explicit aim for PSI-2.


Functional Insights Gained From the META Structures Solved


META families coverage was initiated in year 3 of PSI-2 and insufficient data is available at this point to fully evaluate this target selection strategy. PSI centres solved a significant number of novel proteins from human gut microbes, including over 25 proteins involved in carbohydrate metabolism and first representatives of over 10 novel protein families first found in the human gut. These preliminary results highlight two dominant mechanisms of adaptation of microbes to the specific challenges of the gut environment, namely expansion and functional diversification of known protein families, and evolution of new specialized families (Ellrott et al., submitted).

Conclusion


The Protein Structure Initiative (PSI) is now more than halfway through its second phase. An important stated aim of this effort has been to make structural information available for a large proportion of genome sequences. In order to achieve this, a strategy has been set up to select structural genomics targets in protein domain families of substantial size for which no structural information was yet available. These families have been referred to as BIG families. This target selection strategy, which is extensively presented here, has been made possible by the joint efforts of several bioinformatics groups associated with PSI-2. Early in the second phase of PSI, analyses made it clear that a large fraction of the BIG families that were targeted turned out to be remote homologues of previously known structural families. Genomic analyses also suggest that a significant proportion of genome sequences belong to a few universal families that are highly structurally and functionally divergent. It is clear that structural genomics can make a major contribution to biology by elucidating the manner in which these families diverge structurally and how this mediates changes in molecular function, biological role and interaction partners. Therefore another important aim of PSI-2 has been to increase the number of representative structures from these families (referred to as MEGA families) in a way that reveals more comprehensively their considerable diversity and that contributes new structural information for the relatives within the superfamilies that clearly have different functional roles.


The results presented here suggest that during its first three years, PSI-2 has been successful at meeting several of its stated aims, by contributing significant numbers of structural representatives of novel structures and functions, and by contributing substantially to a general increase in the number of genome sequences that can be modelled structurally. We hope that this analysis, together with previous reports on the success of structural genomics (Todd et al., 2005; Chandonia et al., 2006; Watson et al., 2007) and more specific analyses (Todd et al., 2005; Watson et al., 2007; Allali-Hassani et al., 2007), will shed light on the capacity of the Protein Structure Initiative and other similar efforts world-wide to contribute valuable data for facing the new challenges in understanding biology at the molecular and cellular levels (Gerlt, 2007; Blundell, 2007).

Methods and Definitions

Definitions for target selection strategy

At the start of PSI-2, the PSI committee issued a statement publicizing the fact that PSI-2 would aim to ‘increase the number of large families for which a structure was known’. This can be described as coarse-grained coverage of protein structural space. However, it was also recognized that for some large and highly diverged families a single representative would not provide sufficient structural insight for the entire family and that in such cases, structures should be solved for several representatives. This process would be described as fine-grained coverage. Although these definitions appear intuitively obvious, practical use of the guidelines was initially hampered by the lack of universally accepted definitions. For instance, the term “protein family” is used by different authors to designate groups of proteins that share differing levels of similarity, so that coarse-grained coverage according to one author could correspond to fine-grained coverage according to another. Various groups working on protein families and domain definitions (e.g. Pfam (Finn et al., 2008), TIGRFAMs (Haft et al., 2003)) have used different strategies and protocols to construct databases of domains and families. However, the BIG4 felt that none of the existing resources fully incorporated structural information into the families and domain definitions. Furthermore, in determining a sensible strategy for target selection for structural genomics, there are various experimental issues that have an important bearing on choice of a suitable approach. For example, whilst it may seem tempting to opt for a particular organism of biological significance such as yeast or human, there may be significant experimental difficulties with expressing proteins from this organism or restricting the selection strategy to a few organisms. In order to coordinate target selection, the BIG4 came up with the following working definitions of families:


Modelling Subfamily (MS): This describes a group of closely related sequences in which any two targets share a “minimal similarity”. Modelling subfamilies were constructed by multi-linkage clustering, using a clustering threshold of 30% pair-wise sequence identity between any two members of the subfamily. This threshold was chosen as it ensures that once a single structure has been solved for the MS, there is a reasonable probability that homology models can be built for all other relatives with good accuracy. We anticipate that the precise definition of a modelling subfamily will probably change as modelling algorithms evolve and improve.

BIG family (large families): We refer to BIG families to describe groups of related proteins, with many relatives, identified using profile-based sequence similarity search strategies. Currently a minimum of 10 relatives is being employed to define a BIG family, though this may change in the future. We hypothesize that a BIG family could consist of multiple modelling subfamilies and that members of a BIG family may display non-negligible structural diversity. Since the primary focus of PSI-2 is to solve representatives of large families with unknown structures, BIG families were validated as targets by ensuring that they contained no relative with a known structure in the PDB. Standard bioinformatics approaches were used to eliminate families that could be problematic (as in PSI-1) (Marsden et al., 2008).


MEGA family (very large families): Some domain families are extremely large (some are ten-fold or more larger than the average BIG family) and we can anticipate extreme structural divergence within them (Marsden et al., 2007). Multiple targets from these families would be needed to get even approximate models for all structural variants. We refer to such families as MEGA families. In practice, MEGA families were defined as the 200 most populated homologous superfamilies in CATH (H-level). Taken together, these 200 MEGA families cover at least 50% of domain sequences in genomes. Most MEGA families already have representatives of known structure, but an important goal of PSI-2 is to fully characterize structural (and functional) variability in these families.

META family: We use the term META family to refer to clusters of homologous sequences that are over-represented in metagenomic samples from a particular environment (microbiome). This term falls into a slightly different category than MEGA, BIG or modelling subfamily since it does not refer to the size of a family nor to the presence of already determined structures. PSI targets selected from META-families were usually chosen from fully sequenced microbial genomes, but metagenomic sequences were used to calculate their over-representation ratios and, thus, to identify META-families (see below).
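The modelling-subfamily criterion defined above (at least 30% pair-wise sequence identity between any two members) can be illustrated with a small Python sketch. This is not the protocol used by the BIG4: the greedy, order-dependent assignment, the identity function operating on pre-aligned equal-length sequences, and the names `pairwise_identity` and `modelling_subfamilies` are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap.

    Assumes both sequences come from the same alignment (equal length).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def modelling_subfamilies(
    sequences: Sequence[str],
    threshold: float = 0.30,
    identity: Callable[[str, str], float] = pairwise_identity,
) -> List[List[str]]:
    """Greedy grouping in which every pair inside a group meets the threshold.

    A sequence joins the first existing group whose members ALL share at
    least `threshold` identity with it; otherwise it starts a new group.
    """
    clusters: List[List[str]] = []
    for seq in sequences:
        for cluster in clusters:
            if all(identity(seq, member) >= threshold for member in cluster):
                cluster.append(seq)
                break
        else:  # no existing cluster accepted the sequence
            clusters.append([seq])
    return clusters
```

Requiring the threshold against every member, rather than against a single representative, is what guarantees that any two members of a subfamily satisfy the 30% criterion; the price is that the result can depend on input order.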


Final selection of BIG families by mapping families from Pfam and from the different BIG protocols

The final target list of BIG families was defined by looking for a consensus between the families defined from Pfam and different protocols defined by the BIG4. Consensus mapping between the different BIG family resources was achieved as follows: Each family was defined by a multiple alignment of the seed sequences. Relatives were then identified by profile-based scans of a non-redundant version of the UniProt database (UniProt Consortium, 2008). Two families were deemed to be equivalent if at least 70% of the sequences in the larger family could be matched to sequences in the smaller family, where sequences are identified as matching if they have the same UniProt ID and at least 60% of the residues in the larger sequence are equivalent to residues in the smaller sequence. Some manual inspection was undertaken to check the quality of these family assignments. Families identified by several approaches were eventually considered for assignment to PSI-2 large-scale centres.

Selection of META-families


The Godzik group performed two rounds of identification of protein families over-represented in the human gut microbiome, with the underlying aim of identifying protein families that are important for the human gut flora, unique to this environment, or significantly over-represented there. Modelling subfamilies were identified in the first round, and BIG families were identified in the second round.

1) Identification of modelling subfamilies over-represented in the human gut microbiome: Modelling META-subfamilies were defined as sequence clusters seeded with proteins from four bacteria isolated from human gut flora: Eubacterium rectale, Bacteroides vulgatus, Bacteroides thetaiotaomicron, and Bacteroides fragilis (made available by the Jeff Gordon laboratory, http://gordonlab.wustl.edu/). For each seed sequence, BLASTP (Altschul et al., 1990) hits were collected from two sets of sequences:

• “GUT”: US human gut metagenome samples (Human Gut Community Subjects 7 and 8 (Gill et al., 2006))

• “ALL”: all microbial genomes available from NCBI (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi).


The ratio between the number of hits of a seed sequence found in “GUT” and in “ALL” was used as a measure of over-representation of the modelling META-subfamily defined by that seed sequence. The top 20% most over-represented modelling subfamilies were distributed between the four large-scale centres using a random pick mechanism which ensured that all close homologues of any given seed sequence were assigned to the same centre.

2) Identification of novel BIG-families from the human gut microbiome: The aim in this round was to identify BIG-families with no functional annotation that were over-represented in the human gut microbiome. BIG families were defined by Hidden Markov Models (HMMs) (Eddy, 1996). Available sequences of human gut metagenomic samples were first collected (from datasets published by the Hattori lab (Kurokawa et al., 2007), and from the above-mentioned US metagenomic samples). Functionally annotated sequences were filtered out from these sets by removing all sequences with significant BLASTP hits (e-value lower than 0.001) to annotated sequences in KEGG (Kanehisa et al., 2000). The remaining sequences were clustered using PDB-Blast (Li et al., 2002) and an e-value equal to 0.001 as the clustering cut-off. The resulting clusters were expanded by collecting non-redundant homologues of all cluster members. Homologues were obtained using PSI-BLAST searches against a database that consists of the NR database and metagenomic datasets clustered at 85% sequence identity using CD-HIT (Li et al., 2006a). Multiple sequence alignments of these homologues were then constructed with CLUSTALW (Thompson et al., 1994), and were used to build HMMs (Eddy, 1996) using HMMBUILD. These HMMs, which represented BIG-families, were used to collect hits from two sets of sequences using HMMPFAM (both programs available from http://hmmer.janelia.org/):

• “GUT”: 30 microbial genomes from human gut microbiome (available from Washington University Sequencing Center, St. Louis: http://genome.wustl.edu/)

• “NHR” (Not Human Related): 440 microbial genomes without apparent relationship to human (downloaded from the NCBI database: http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi).

The ratio between the number of hits in “GUT” and in “NHR” was used to define BIG families that were over-represented in the human gut microbiome. The most over-represented families were then distributed between the large-scale centres.
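Both rounds of META-family selection rank candidates by a ratio of hit counts in a gut-derived set versus a reference set. The Python sketch below illustrates that ranking under stated assumptions: the family names and counts are hypothetical, a reference count of zero is treated as an infinite ratio, and the 20% cut-off is the one quoted for the first round only (the text gives no explicit cut-off for the second round).

```python
from typing import Dict, List, Tuple


def over_representation(gut_hits: int, reference_hits: int) -> float:
    """Ratio of hits in the gut set to hits in the reference set.

    Assumption: a family with no reference hits is treated as maximally
    over-represented (ratio = infinity).
    """
    if reference_hits == 0:
        return float("inf")
    return gut_hits / reference_hits


def top_fraction(
    hit_counts: Dict[str, Tuple[int, int]], fraction: float = 0.20
) -> List[str]:
    """Return the most over-represented families, best first.

    `hit_counts` maps a family name to (gut_hits, reference_hits).
    At least one family is returned when the input is non-empty.
    """
    ranked = sorted(
        hit_counts,
        key=lambda family: over_representation(*hit_counts[family]),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical example: five families with (gut, reference) hit counts.
example = {
    "fam1": (50, 10),
    "fam2": (5, 100),
    "fam3": (20, 20),
    "fam4": (8, 2),
    "fam5": (1, 30),
}
selected = top_fraction(example)  # top 20% of five families is one family
```

A raw ratio favours families with very few reference hits; whether the original analysis corrected for database size or small counts is not stated in the text.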


Structural and functional novelty, and Genome Coverage calculations

Lists of PSI-2 structures used for computing structural coverage of genomes were obtained directly from the large-scale centres, considering only structures deposited in the PDB (Berman et al., 2000) between July 1st 2005 and July 1st 2008. Corresponding lists of non-PSI structures were downloaded from the PDB website using identical date restrictions. Distinct structures have been defined as lists of structures sharing less than 98% pair-wise sequence identity, and were obtained by running CD-HIT with that cut-off and considering single representatives from all resulting CD-HIT clusters (Li et al., 2006b).

Increase in genome coverage in terms of structural modelling was computed by running PSI-BLAST against UniProt (release 12.8) for each PDB structure in turn, and by considering modelling subfamilies around each structure to decide on sequences for which the structure could be modelled.


Structural novelty was computed by considering domains from novel structures that have been classified in CATH release 3.2 (Greene et al., 2007). All domains in CATH v3.2 were structurally aligned against one another using SSAP (Orengo et al., 1996; Greene et al., 2007), and were clustered into structurally similar groups by complete clustering with a normalised RMSD cut-off of 5.0Å. The normalised RMSD score (normRMSD) is computed as follows:

\[ \mathrm{normRMSD} = \frac{\max(L_1, L_2) \times \mathrm{RMSD}}{N_{mat}} \]

where RMSD is the root mean square deviation of the superposition, $\max(L_1, L_2)$ is the length in amino acids of the larger domain in the superposition, and $N_{mat}$ is the number of aligned residue pairs (Kolodny et al., 2005). Domains that are assigned to the same cluster are considered structurally similar, and the structure from each cluster that was first deposited in the PDB is considered to be structurally novel. Fold and superfamily novelty was evaluated by considering the first structure in each CATH fold and CATH superfamily to have been deposited in the PDB.
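The novelty test described above can be sketched in a few lines of Python. The sketch assumes that the normalised RMSD takes the form max(L1, L2) × RMSD / Nmat, which is the combination suggested by the three quantities defined in the text; treating the 5Å cut-off as inclusive, and the function and domain names used here, are likewise illustrative assumptions rather than the authors' implementation.

```python
from datetime import date
from typing import Dict, List


def norm_rmsd(rmsd: float, len1: int, len2: int, n_aligned: int) -> float:
    """Normalised RMSD, assumed to be max(L1, L2) * RMSD / Nmat."""
    if n_aligned <= 0:
        raise ValueError("number of aligned residue pairs must be positive")
    return max(len1, len2) * rmsd / n_aligned


def structurally_similar(
    rmsd: float, len1: int, len2: int, n_aligned: int, cutoff: float = 5.0
) -> bool:
    """True when the normalised RMSD does not exceed the cut-off (in Angstrom)."""
    return norm_rmsd(rmsd, len1, len2, n_aligned) <= cutoff


def novel_domains(
    clusters: List[List[str]], deposited: Dict[str, date]
) -> List[str]:
    """For each structural cluster, return the earliest-deposited domain."""
    return [min(cluster, key=lambda dom: deposited[dom]) for cluster in clusters]


# Hypothetical example: two domains of 100 and 150 residues superposed
# over 100 aligned pairs with a raw RMSD of 2.0 Angstrom.
score = norm_rmsd(2.0, 100, 150, 100)
```

The normalisation penalises superpositions that cover only part of the larger domain, so a low raw RMSD over a small aligned core does not by itself make two domains count as similar.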


Functional novelty was evaluated by mapping PDB structures to GO terms using the PDB to GO mapping provided by the MSD at the EBI (Velankar et al., 2005). Results and statistics generated by the BIG4 groups, and presented in this article, are also available from the BIG4 website (http://psi-big4.org/).
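Once structures have been mapped to GO terms, identifying first representatives reduces to finding, for each term, the earliest-deposited structure annotated with it. The Python sketch below shows one way to do this; the input format, the tie-break on identifier, and the PDB identifiers and GO labels in the example are hypothetical, and the actual MSD mapping files are not parsed here.

```python
from datetime import date
from typing import Dict, Iterable, Set, Tuple

Structure = Tuple[str, date, Set[str]]  # (PDB id, deposition date, GO terms)


def first_representatives(structures: Iterable[Structure]) -> Dict[str, Set[str]]:
    """Map each PDB id to the GO terms for which it is the earliest structure.

    Ties on deposition date are broken by PDB id (an arbitrary choice).
    Structures that are not first for any term are absent from the result.
    """
    earliest: Dict[str, Tuple[date, str]] = {}  # GO term -> (date, PDB id)
    for pdb_id, deposited, terms in structures:
        for term in terms:
            candidate = (deposited, pdb_id)
            if term not in earliest or candidate < earliest[term]:
                earliest[term] = candidate

    firsts: Dict[str, Set[str]] = {}
    for term, (_, pdb_id) in earliest.items():
        firsts.setdefault(pdb_id, set()).add(term)
    return firsts


# Hypothetical example with made-up identifiers and term labels.
example = [
    ("1aaa", date(2004, 1, 1), {"GO:term_A"}),
    ("2bbb", date(2006, 3, 1), {"GO:term_A", "GO:term_B"}),
    ("3ccc", date(2007, 5, 1), {"GO:term_B"}),
]
result = first_representatives(example)
```

Counting the keys of the result restricted to a given set of structures (for example those from the large-scale centres) gives the number of structures that are first representatives for at least one term.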

Abbreviations

JCSG: Joint Center for Structural Genomics
MCSG: Midwest Center for Structural Genomics
NESG: Northeast Structural Genomics Consortium
NYSGXRC: New York SGX Research Center for Structural Genomics
PSI: Protein Structure Initiative

Acknowledgments

This work was supported by a grant from the Protein Structure Initiative (PSI) of the National Institute of General Medical Sciences at the National Institutes of Health.

References


Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol 1990;215:403–410. [PubMed: 2231712]
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–3402. [PubMed: 9254694]
Aravind L, Anantharaman V, Koonin EV. Monophyly of class I aminoacyl tRNA synthetase, USPA, ETFP, photolyase, and PP-ATPase nucleotide-binding domains: implications for protein evolution in the RNA. Proteins 2002;48:1–14. [PubMed: 12012333]
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000;28:235–242. [PubMed: 10592235]
Berman HM, Westbrook JD, Gabanyi MJ, Tao W, Shah R, Kouranov A, Schwede T, Arnold K, Kiefer F, Bordoli L, Kopp J, Podvinec M, Adams PD, Carter LG, Minor W, Nair R, La BJ. The protein structure initiative structural genomics knowledgebase. Nucleic Acids Res 2009;37:D365–D368. [PubMed: 19010965]
Blundell T. New dimensions of structural proteomics: exploring chemical and biological space. Structure 2007;15:1342–1343. [PubMed: 17997956]
Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res 2005;33:D212–D215. [PubMed: 15608179]
Chandonia JM, Brenner SE. The impact of structural genomics: expectations and outcomes. Science 2006;311:347–351. [PubMed: 16424331]
Eddy SR. Hidden Markov models. Curr. Opin. Struct. Biol 1996;6:361–365. [PubMed: 8804822]
Fersht AR. From the first protein structures to our current knowledge of protein folding: delights and scepticisms. Nat. Rev. Mol. Cell Biol 2008;9:650–654. [PubMed: 18578032]
Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A. The Pfam protein families database. Nucleic Acids Res 2008;36:D281–D288. [PubMed: 18039703]


Gerlt JA. A Protein Structure (or Function?) Initiative. Structure 2007;15:1353–1356. [PubMed: 17997960]
Gerlt JA, Babbitt PC. Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu. Rev. Biochem 2001;70:209–246. [PubMed: 11395407]
Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE. Metagenomic analysis of the human distal gut microbiome. Science 2006;312:1355–1359. [PubMed: 16741115]
Goldstein RA. The structure of protein evolution and the evolution of protein structure. Curr. Opin. Struct. Biol 2008;18:170–177. [PubMed: 18328690]
Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, Dibley M, Redfern O, Pearl F, Nambudiry R, Reid A, Sillitoe I, Yeats C, Thornton JM, Orengo CA. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res 2007;35:D291–D297. [PubMed: 17135200]
Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res 2003;31:371–373. [PubMed: 12520025]
Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000;28:27–30. [PubMed: 10592173]
Kolodny R, Koehl P, Levitt M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol 2005;346:1173–1188. [PubMed: 15701525]
Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, Taylor TD, Noguchi H, Mori H, Ogura Y, Ehrlich DS, Itoh K, Takagi T, Sakaki Y, Hayashi T, Hattori M. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res 2007;14:169–181. [PubMed: 17916580]
Allali-Hassani A, Pan PW, Dombrovski L, Najmanovich R, Tempel W, Dong A, Loppnau P, Martin F, Thornton J, Edwards AM, Bochkarev A, Plotnikov AN, Vedadi M, Arrowsmith CH. Structural and chemical profiling of the human cytosolic sulfotransferases. PLoS Biol 2007;5:e97. [PubMed: 17425406]
Lee D, Grant A, Marsden RL, Orengo C. Identification and distribution of protein families in 120 completed genomes using Gene3D. Proteins 2005;59:603–615. [PubMed: 15768405]
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006b;22:1658–1659. [PubMed: 16731699]
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006a;22:1658–1659. [PubMed: 16731699]
Li W, Jaroszewski L, Godzik A. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Eng 2002;15:643–649. [PubMed: 12364578]
Liu J, Montelione GT, Rost B. Novel leverage of structural genomics. Nat. Biotechnol 2007;25:849–851. [PubMed: 17687356]
Liu J, Rost B. CHOP proteins into structural domain-like fragments. Proteins 2004;55:678–688. [PubMed: 15103630]
Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA. Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res 2006;34:1066–1080. [PubMed: 16481312]
Marsden RL, Lewis TA, Orengo CA. Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics 2007;8:86. [PubMed: 17349043]
Marsden RL, Orengo CA. Target selection for structural genomics: an overview. Methods Mol. Biol 2008;426:3–25. [PubMed: 18542854]
Moore AD, Bjorklund AK, Ekman D, Bornberg-Bauer E, Elofsson A. Arrangements in the modular evolution of proteins. Trends Biochem. Sci 2008;33:444–451. [PubMed: 18656364]
Moult J. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol 2005;15:285–289. [PubMed: 15939584]
Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol 1995;247:536–540. [PubMed: 7723011]


Norvell JC, Berg JM. Update on the protein structure initiative. Structure 2007;15:1519–1522. [PubMed: 18073099]
Orengo CA, Taylor WR. SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol 1996;266:617–635. [PubMed: 8743709]
Pegg SC, Brown SD, Ojha S, Seffernick J, Meng EC, Morris JH, Chang PJ, Huang CC, Ferrin TE, Babbitt PC. Leveraging enzyme structure-function relationships for functional inference and experimental design: the structure-function linkage database. Biochemistry 2006;45:2545–2555. [PubMed: 16489747]
Ponting CP, Russell RR. The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct 2002;31:45–71. [PubMed: 11988462]
Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007;35:D61–D65. [PubMed: 17130148]
Ranea JA, Sillero A, Thornton JM, Orengo CA. Protein superfamily evolution and the last universal common ancestor (LUCA). J. Mol. Evol 2006;63:513–525. [PubMed: 17021929]
Redfern OC, Dessailly B, Orengo CA. Exploring the structure and function paradigm. Curr. Opin. Struct. Biol 2008;18:394–402. [PubMed: 18554899]
Reeves GA, Dallman TJ, Redfern OC, Akpor A, Orengo CA. Structural diversity of domain superfamilies in the CATH database. J. Mol. Biol 2006;360:725–741. [PubMed: 16780872]
Riesenfeld CS, Schloss PD, Handelsman J. Metagenomics: genomic analysis of microbial communities. Annu. Rev. Genet 2004;38:525–552. [PubMed: 15568985]
Sali A. 100,000 protein structures for the biologist. Nat. Struct. Biol 1998;5:1029–1032. [PubMed: 9846869]
Shakhnovich BE, Koonin EV. Origins and impact of constraints in evolution of gene families. Genome Res 2006;16:1529–1536. [PubMed: 17053091]
The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat. Genet 2000;25:25–29. [PubMed: 10802651]
Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994;22:4673–4680. [PubMed: 7984417]
Todd AE, Marsden RL, Thornton JM, Orengo CA. Progress of structural genomics initiatives: an analysis of solved target structures. J. Mol. Biol 2005;348:1235–1260. [PubMed: 15854658]
Todd AE, Orengo CA, Thornton JM. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol 2001;307:1113–1143. [PubMed: 11286560]
UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–D195. [PubMed: 18045787]
Velankar S, McNeil P, Mittard-Runte V, Suarez A, Barrell D, Apweiler R, Henrick K. E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res 2005;33:D262–D265. [PubMed: 15608192]
Vitkup D, Melamud E, Moult J, Sander C. Completeness in structural genomics. Nat. Struct. Biol 2001;8:559–566. [PubMed: 11373627]
Watson JD, Sanderson S, Ezersky A, Savchenko A, Edwards A, Orengo C, Joachimiak A, Laskowski RA, Thornton JM. Towards fully automated structure-based function prediction in structural genomics: a case study. J. Mol. Biol 2007;367:1511–1522. [PubMed: 17316683]
Yeats C, Lees J, Reid A, Kellam P, Martin N, Liu X, Orengo C. Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res 2008;36:D414–D418. [PubMed: 18032434]

Figure 1.

Distribution of numbers of sequences from Gene3D v6.0 (Yeats et al., 2008) for all CATH superfamilies (Greene et al., 2007).

Figure 2.

Proportion of structurally characterized modelling sub-families in very large and diverse families (referred to as MEGA families). MEGA families are the 200 largest superfamilies in CATH, and taken together they represent more than 50% of domains in genome sequences. These families are typically very diverse in terms of structure and function.


Figure 3.

Correlation between structural and functional diversity in CATH superfamilies. For each superfamily, the x-axis gives the number of molecular function GO terms identified for members of that superfamily in Gene3D. The y-axis gives the number of structurally similar sub-groups (see Methods) obtained by clustering domains from the superfamily with a normalised RMSD cut-off of 5 Å.
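As a rough illustration of how a count of structurally similar sub-groups could be derived from pairwise comparisons, the sketch below merges any two domains whose normalised RMSD falls below the cut-off and counts the resulting groups. The single-linkage rule, the function name `count_structural_subgroups` and the toy RMSD values are assumptions made for illustration only; the clustering procedure actually used is the one described in Methods.

```python
from itertools import combinations


def count_structural_subgroups(n_domains, pairwise_rmsd, cutoff=5.0):
    """Count groups of domains linked by a normalised RMSD below `cutoff`.

    Assumption: single-linkage merging via union-find. `pairwise_rmsd`
    maps (i, j) with i < j to a normalised RMSD in angstroms.
    """
    parent = list(range(n_domains))

    def find(i):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n_domains), 2):
        if pairwise_rmsd[(i, j)] < cutoff:
            parent[find(i)] = find(j)

    return len({find(i) for i in range(n_domains)})


# Hypothetical normalised RMSD values for four domains of one superfamily.
toy_rmsd = {
    (0, 1): 2.1, (0, 2): 7.4, (0, 3): 8.0,
    (1, 2): 6.9, (1, 3): 7.7, (2, 3): 3.2,
}
n_subgroups = count_structural_subgroups(4, toy_rmsd)
```

With these toy values, domains 0 and 1 fall in one sub-group and domains 2 and 3 in another, so a superfamily like this would contribute a y-value of 2 to the plot.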


Figure 4.

Number of modelling families in 200 very large and diverse CATH superfamilies.


Figure 5.

Increase in the fraction of proteins (a) and residues (b) from UniProt (release 12.8) that can be structurally modelled using structures released in the PDB since the start of PSI-2. The black line shows the increase in structural coverage resulting from all structures released in the PDB, the green line shows the increase resulting from PSI-2 structures only, and the blue line shows the increase resulting exclusively from structures solved by the PSI-2 large-scale centres.
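The difference between the two panels can be made concrete with a small sketch that computes protein-level and residue-level coverage from the same input. It assumes, purely for illustration, that a protein counts as covered as soon as at least one of its residues can be modelled; the function name `structural_coverage` and the toy data are hypothetical and are not taken from the analysis behind the figure.

```python
def structural_coverage(proteins):
    """Return (fraction of proteins covered, fraction of residues covered).

    `proteins` is a list of (sequence_length, modelled_residues) pairs.
    Assumption: a protein is covered if at least one residue is modelled.
    """
    n_proteins = len(proteins)
    total_residues = sum(length for length, _ in proteins)
    covered_proteins = sum(1 for _, modelled in proteins if modelled > 0)
    covered_residues = sum(modelled for _, modelled in proteins)
    return covered_proteins / n_proteins, covered_residues / total_residues


# Hypothetical toy proteome: (sequence length, residues that can be modelled).
toy_proteome = [(300, 120), (150, 0), (450, 450), (100, 0)]
protein_fraction, residue_fraction = structural_coverage(toy_proteome)
```

In this toy case half of the proteins are covered but 57% of the residues are, which shows why the two measures can diverge when long, multi-domain proteins are only partly modelled.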

Figure 6.

Structural novelty of structural domains solved by PSI-2 large-scale centres (‘LSC’) and traditional structural biology worldwide (excluding Structural Genomics structures) between June 2005 and June 2008. Only domains classified in CATH are considered in this plot.


Table 1

This table summarises results from two previous studies regarding the increase in the number of structures for novel distinct proteins, proteins from novel superfamilies in existing folds, and proteins with novel folds, due to PSI, and structural biology worldwide (excluding structural genomics initiatives) over the same period of time. These results were obtained from (a) Todd et al. 2005 and (b) Chandonia and Brenner 2006. Superfamily and fold definitions were obtained from SCOP (Murzin et al., 1995) in both studies.

| Structures | Novel protein | Novel superfamily | Novel fold |
|---|---|---|---|
| Non-SG (a) | 42% | 2% | 2% |
| PSI (a) | 92% | 7% | 11% |
| Non-SG (b) | 24% | 1% | 2% |
| PSI (b) | 91% | 6% | 10% |

Table 2

Numbers of structurally uncharacterised large families (i.e. BIG families) allocated to PSI-2 large-scale centres.

| Centre | 10/2005 | 04/2006 | 08/2006 | 11/2006 | Total |
|---|---|---|---|---|---|
| JCSG | 271 | 129 | 75 | 100 | 575 |
| MCSG | 332 | 68 | 75 | 129 | 604 |
| NESG | 337 | 63 | 75 | 129 | 604 |
| NYSGXRC | 329 | 71 | 75 | 99 | 574 |
| Total | 1269 | 331 | 300 | 457 | 2357 |

Table 3

Very large and diverse (MEGA) families with incomplete structural coverage allocated to PSI-2 large-scale centres. For each MEGA superfamily, the table shows the name of the superfamily, the number of distinct sequences assigned to that superfamily in Gene3D v6.0, the number of distinct GO terms (biological process ontology), the number of sequence clusters at 30% sequence identity, the percentage of these sequence clusters for which a structure has already been solved, the PSI-2 centre to which the superfamily was allocated, and the allocation date. For SUPERMEGA superfamilies, PSI-2 centre and allocation date are not shown due to the particular allocation protocol used for these superfamilies (see text).

| Superfamily name | CATH code | Sequences | GO terms | s30 clusters | % s30 with structure | PSI-2 centre | Allocation date |
|---|---|---|---|---|---|---|---|
| P-loop containing nucleotide triphosphate hydrolases | 3.40.50.300 | 184999 | 1711 | 18682 | 1.493 | SUPERMEGA | SUPERMEGA |
| NAD(P)-binding Rossmann-like domains | 3.40.50.720 | 70263 | 644 | 6463 | 3.249 | SUPERMEGA | SUPERMEGA |
| Aldolase class I | 3.20.20.70 | 38574 | 341 | 2597 | 4.582 | SUPERMEGA | SUPERMEGA |
| SAM-dependent methyltransferases | 3.40.50.150 | 35362 | 299 | 4104 | 1.389 | SUPERMEGA | SUPERMEGA |
| Alpha/Beta-hydrolases | 3.40.50.1820 | 31099 | 394 | 4182 | 1.793 | MCSG | 2007-06 |
| CheY-like | 3.40.50.2300 | 27818 | 253 | 3599 | 1.334 | NYSGXRC | 2007-03 |
| ATPase domain of HSP90 chaperone-like | 3.30.565.10 | 26943 | 266 | 3128 | 0.607 | NYSGXRC | 2007-03 |
| FAD/NAD(P)-binding domain | 3.50.50.60 | 26841 | 328 | 3376 | 1.777 | NESG | 2007-03 |
| TPR-like | 1.25.40.10 | 24833 | 503 | 6931 | 0.245 | NESG | 2007-12 |
| PLP-dependent transferases | 3.40.640.10 | 23500 | 287 | 1273 | 4.399 | JCSG | 2007-12 |
| HUP domains | 3.40.50.620 | 23142 | 225 | 2032 | 3.346 | MCSG | 2007-06 |
| Glutaredoxin-like | 3.40.30.10 | 22258 | 335 | 2481 | 4.434 | MCSG | 2007-12 |
| Ribonuclease H-like | 3.30.420.10 | 20593 | 158 | 2101 | 1.475 | NYSGXRC | 2007-06 |
| YVTN repeat-like/Quinoprotein amine dehydrogenase | 2.130.10.10 | 19132 | 760 | 3740 | 0.535 | NESG | 2007-12 |
| Nucleotide-diphospho-sugar transferases | 3.90.550.10 | 17894 | 272 | 2195 | 1.185 | NESG | 2007-06 |
| Nucleic acid-binding proteins | 2.40.50.140 | 17863 | 244 | 1735 | 4.438 | NESG | 2007-03 |
| Acyl-CoA N-acyltransferases (Nat) | 3.40.630.30 | 14690 | 195 | 2472 | 1.214 | MCSG | 2007-06 |
| Glycosidases | 3.20.20.80 | 14373 | 177 | 1278 | 7.042 | NYSGXRC | 2007-06 |
| Jelly Rolls | 2.60.120.10 | 13512 | 239 | 2197 | 2.003 | JCSG | 2007-12 |
| PYP-like sensor domain (PAS domain) | 3.30.450.20 | 13173 | 220 | 4020 | 0.473 | MCSG | 2007-03 |
| HAD-like | 3.40.50.1000 | 12691 | 239 | 1251 | 1.918 | MCSG | 2007-06 |
| ClpP/crotonase | 3.90.226.10 | 11852 | 110 | 862 | 2.784 | MCSG | 2007-12 |
| Class II aaRS and biotin synthetases | 3.30.930.10 | 9840 | 81 | 553 | 3.797 | MCSG | 2007-12 |
| Metal-dependent hydrolases | 3.20.20.140 | 9684 | 126 | 799 | 3.379 | MCSG | 2007-12 |
| Class I glutamine amidotransferase-like | 3.40.50.880 | 9291 | 104 | 625 | 3.68 | NYSGXRC | 2007-12 |
| Zn peptidases | 3.40.630.10 | 8338 | 106 | 595 | 3.866 | JCSG | 2007-03 |
| Metallo-hydrolase/oxidoreductase | 3.60.15.10 | 7353 | 65 | 863 | 1.622 | NESG | 2007-12 |
| GAF domain-like | 3.30.450.40 | 6915 | 90 | 1508 | 0.729 | MCSG | 2007-03 |
| Thioesterase/thiol ester dehydrase-isomerase | 3.10.129.10 | 6641 | 41 | 785 | 2.548 | NESG | 2007-12 |
| 6-phosphogluconate dehydrogenase C-terminal domain-like | 1.10.1040.10 | 6348 | 61 | 427 | 2.81 | NYSGXRC | 2007-12 |
| Dihydroxybiphenyl Dioxygenase-like | 3.10.180.10 | 5960 | 43 | 863 | 2.665 | NYSGXRC | 2007-03 |
| Acyl-CoA dehydrogenase C-terminal domain-like | 1.10.540.10 | 5634 | 60 | 457 | 2.407 | NYSGXRC | 2007-06 |
| Nudix hydrolases | 3.90.79.10 | 5569 | 80 | 739 | 2.165 | NYSGXRC | 2007-12 |
| Ubiquitin-like | 3.10.20.90 | 5327 | 376 | 1217 | 3.615 | NESG | 2007-06 |
| TolB, C-terminal domain-like | 2.120.10.30 | 5247 | 217 | 1159 | 0.777 | NESG | 2007-03 |
| Ribokinase-like | 3.40.1190.20 | 5156 | 52 | 461 | 3.905 | JCSG | 2007-03 |
| Glucose-6-phosphate isomerase-like | 3.40.50.10490 | 4437 | 67 | 474 | 4.008 | JCSG | 2007-12 |
| NTF2-like | 3.10.450.50 | 3609 | 92 | 729 | 2.195 | JCSG | 2007-06 |
| MurD-like peptide ligases, catalytic domain | 3.40.1190.10 | 3596 | 35 | 235 | 2.979 | JCSG | 2007-06 |
| Protein kinase-like | 3.90.1200.10 | 3174 | 62 | 504 | 0.595 | MCSG | 2007-03 |
| PLP-binding barrel | 3.20.20.10 | 2961 | 41 | 177 | 4.52 | NYSGXRC | 2007-06 |
| Ferritin-like | 1.20.1260.10 | 2889 | 37 | 259 | 8.88 | JCSG | 2007-06 |
| Enolase superfamily | 3.20.20.120 | 2653 | 35 | 192 | 8.854 | NYSGXRC | 2007-12 |
| Ubiquitin Conjugating Enzyme-like | 3.10.110.10 | 2576 | 185 | 230 | 7.391 | NESG | 2007-06 |
| FMN-binding split barrel | 2.30.110.10 | 2433 | 25 | 275 | 5.818 | JCSG | 2007-03 |
| NADH Oxidase-like | 3.40.109.10 | 2094 | 14 | 248 | 2.823 | JCSG | 2007-12 |
| Dimeric alpha+beta barrel | 3.30.70.900 | 1450 | 14 | 276 | 2.536 | JCSG | 2007-06 |
| Bh1534 unknown conserved protein | 3.30.530.40 | 1011 | 8 | 231 | 0.433 | NESG | 2007-06 |
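The percentage column of Table 3 is easier to interpret once it is converted back into a number of clusters. The sketch below does this for a few (s30 clusters, % s30 with structure) value pairs taken from the table; the function names and the rounding to the nearest integer are assumptions made for illustration, not part of the published analysis.

```python
def clusters_with_structure(s30_clusters, pct_with_structure):
    """Convert a percentage of s30 clusters into an approximate cluster count."""
    return round(s30_clusters * pct_with_structure / 100)


def clusters_without_structure(s30_clusters, pct_with_structure):
    """Number of s30 clusters that still lack a solved structure."""
    return s30_clusters - clusters_with_structure(s30_clusters, pct_with_structure)


# (s30 clusters, % s30 with structure) pairs as listed in Table 3.
table3_pairs = [(18682, 1.493), (6931, 0.245), (231, 0.433)]

solved = [clusters_with_structure(n, pct) for n, pct in table3_pairs]
unsolved = [clusters_without_structure(n, pct) for n, pct in table3_pairs]
```

Under these assumptions the three pairs correspond to roughly 279, 17 and 1 structurally characterised clusters, which leaves the large majority of s30 clusters in each family without a structure.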


Table 4

Molecular Function Gene Ontology terms for which PSI-2 large-scale centres produced the first structural representative between 2005-07-01 and 2008-07-01. The table shows the PDB ID of the structure, the GO term identifier and the corresponding GO term name.

| PDB ID | GO ID | GO Name |
|---|---|---|
| 2aa4 | GO:0009384 | N-acylmannosamine kinase |
| 2ajt | GO:0008733 | L-arabinose isomerase |
| 2ako | GO:0004349 | glutamate 5-kinase |
| 2ap9 | GO:0003991 | acetylglutamate kinase |
| 2awd | GO:0009024 | tagatose-6-phosphate kinase |
| 2fpo | GO:0008990 | rRNA (guanine-N2-)-methyltransferase |
| 2gfh | GO:0050124 | N-acylneuraminate-9-phosphatase |
| 2ghr | GO:0008899 | homoserine O-succinyltransferase |
| 2gok | GO:0050480 | imidazolonepropionase |
| 2i09 | GO:0008973 | phosphopentomutase |
| 2idb | GO:0008694 | 3-octaprenyl-4-hydroxybenzoate carboxy-lyase |
| 2jo6 | GO:0008942 | nitrite reductase [NAD(P)H] |
| 2jzc | GO:0004577 | N-acetylglucosaminyldiphosphodolichol N-acetylglucosaminyltransferase |
| 2ols | GO:0008986 | pyruvate - water dikinase |
| 2p35 | GO:0030798 | trans-aconitate 2-methyltransferase |
| 2ph5 | GO:0047296 | homospermidine synthase |
| 2qez | GO:0008851 | ethanolamine ammonia-lyase |
| 2qgn | GO:0004811 | tRNA isopentenyltransferase |
| 2qiw | GO:0008807 | carboxyvinyl-carboxyphosphonate phosphorylmutase |
| 2qrr | GO:0015424 | amino acid-transporting ATPase |
| 2qrr | GO:0048474 | D-methionine transmembrane transporter |
| 2qt3 | GO:0018764 | N-isopropylammelide isopropylaminohydrolase |
| 2qyv | GO:0008769 | X-His dipeptidase |
| 2r6h | GO:0016655 | oxidoreductase acting on NADH/NADPH, quinone (or similar) as acceptor |
| 2raa | GO:0019164 | pyruvate synthase |
| 3bp1 | GO:0033739 | queuine synthase |
| 3cbw | GO:0004567 | beta-mannosidase |
| 3cea | GO:0050112 | inositol 2-dehydrogenase |

Table 5

Biological Process Gene Ontology terms for which PSI-2 large-scale centres produced the first structural representative between 2005-07-01 and 2008-07-01. The table shows the PDB ID of the structure, the GO term identifier and the corresponding GO term name.

| PDB ID | GO ID | GO Name |
|---|---|---|
| 2awd | GO:0005988 | lactose metabolic process |
| 2fa1 | GO:0015716 | phosphonate transport |
| 2g7u | GO:0046278 | protocatechuate metabolic process |
| 2g9i | GO:0009108 | coenzyme biosynthetic process |
| 2gfh | GO:0046380 | N-acetylneuraminate biosynthetic process |
| 2ghr | GO:0019281 | methionine biosynthetic process from homoserine via O-succinyl-L-homoserine and cystathionine |
| 2gok | GO:0019556 | histidine catabolic process to glutamate and formamide |
| 2i09 | GO:0043094 | metabolic compound salvage |
| 2js7 | GO:0045351 | type I interferon biosynthetic process |
| 2jzc | GO:0006488 | dolichol-linked oligosaccharide biosynthetic process |
| 2oyn | GO:0009398 | FMN biosynthetic process |
| 2qrr | GO:0048473 | D-methionine transport |
