The annotation of protein function has not kept pace with the exponential growth of raw sequence and structure data. An emerging solution to this problem is to identify 3D motifs or templates in protein structures that are necessary and sufficient determinants of function. Here, we demonstrate the recurrent use of evolutionary trace information to construct such 3D templates for enzymes, search for them in other structures, and distinguish true from spurious matches. Serine protease templates built from evolutionarily important residues distinguish between proteases and other proteins nearly as well as the classic Ser-His-Asp catalytic triad. In 53 enzymes spanning 33 distinct functions, an automated pipeline identifies functionally related proteins with an average positive predictive power of 62%, including correct matches to proteins with the same function but with low sequence identity (the average identity for some templates is only 17%). Although these template building, searching, and match classification strategies are not yet optimized, their sequential implementation demonstrates a functional annotation pipeline which does not require experimental information, but only local molecular mimicry among a small number of evolutionarily important residues.

By August 2005, the NCBI Entrez Genome Project contained 273 fully sequenced genomes yielding almost 1.7 million putative protein sequences in NCBI's RefSeq database. However, up to 40% of these genes still lacked any annotation of biological function (Pruitt et al. 2005), thus illustrating the importance of reliable methods to identify protein function. To address this problem, broad categories of computational methods for functional annotation have emerged that rely on either sequence or structure, considered whole or through motifs. Whole sequence methods can fail when homologs develop unrelated functions, distinct chemistries, or different functional sites as sequence identity falls below 40% (Olmea and Valencia 1997; Russell et al.). Local sequence motifs, however, cannot adequately capture functions distributed over nonadjacent stretches of primary structure. These limitations motivated the extension of the concept of functional motifs from sequence to structure (Wallace et al. 2002; Barker and Thornton 2003; Stark et al.). The rationale for structural motifs (“3D templates”) is that, typically, just a few key residues directly mediate catalysis or binding. These same residues in the same conformation should therefore be reasonably expected to carry out the same function even in a different fold, unless long-range effects impact their biophysical behavior. Many methods aim to derive 3D templates and match them to protein structures. Some map sequence motifs onto structures (Kasuya and Thornton 1999; Liang et al. 2003); others compare enzymes with known functional sites against a structural database (Fischer et al. 1994) or against each other (Wallace et al.).

However, fundamental difficulties remain. First, 3D templates that rely on experimental data are limited by the availability of such information. Second, while the search for 3D matches to small templates (3–4 residues) is not computationally expensive, this quickly changes for larger motifs that include amino acid substitutions when searched against the full Protein Data Bank (PDB) (Berman et al.). Third, although sequence methods such as BLAST (Altschul et al. 1997) can confidently claim to find sequence homologs and suggest, but do not prove, functional similarity between proteins, it is not yet clear what degree of functional similarity can be inferred from a structural match. With these issues in mind, we present an evolution-directed series of algorithms, which, in the absence of experimental data, aim to identify relevant 3D templates, to guide an efficient search for molecular mimicry in other protein structures, and finally to isolate from among all matches a subset that is highly enriched in proteins that perform the same function.
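
To make the idea of matching a 3D template against a protein structure concrete, the following is a minimal sketch and not the search algorithm used in this work, which the text above does not specify. It assumes one representative coordinate per residue, a brute-force enumeration of candidate residues, and a distance-based RMSD with an arbitrary 1.0 Å cutoff. The names `distance_rmsd`, `find_matches`, `TEMPLATE`, and `STRUCTURE`, and all coordinates, are hypothetical.

```python
from itertools import product
from math import dist, sqrt


def distance_rmsd(coords_a, coords_b):
    """RMS difference between corresponding inter-residue distances.

    Independent of rotation and translation, but (an accepted limitation
    of this sketch) it cannot tell a site from its mirror image.
    Requires at least two points in each list.
    """
    n = len(coords_a)
    squared = [
        (dist(coords_a[i], coords_a[j]) - dist(coords_b[i], coords_b[j])) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sqrt(sum(squared) / len(squared))


def find_matches(template, structure, cutoff=1.0):
    """Return (residue_ids, score) pairs scoring at or below the cutoff.

    template:  list of (allowed_residue_types, xyz), one per position
    structure: list of (residue_id, residue_type, xyz)
    """
    template_xyz = [xyz for _, xyz in template]
    # Candidate residues per template position, filtered by allowed type.
    candidates = [
        [res for res in structure if res[1] in allowed]
        for allowed, _ in template
    ]
    matches = []
    # Brute force: one candidate per position, every combination.
    for combo in product(*candidates):
        ids = [res[0] for res in combo]
        if len(set(ids)) < len(ids):
            continue  # the same residue cannot fill two positions
        score = distance_rmsd(template_xyz, [res[2] for res in combo])
        if score <= cutoff:
            matches.append((ids, score))
    return sorted(matches, key=lambda m: m[1])


# Hypothetical three-residue template; the third position tolerates Asp or Glu.
TEMPLATE = [
    ({"SER"}, (0.0, 0.0, 0.0)),
    ({"HIS"}, (3.0, 0.0, 0.0)),
    ({"ASP", "GLU"}, (3.0, 4.0, 0.0)),
]

# Hypothetical structure: a translated copy of the site plus two decoys.
STRUCTURE = [
    ("S195", "SER", (10.0, 10.0, 10.0)),
    ("H57", "HIS", (13.0, 10.0, 10.0)),
    ("D102", "ASP", (13.0, 14.0, 10.0)),
    ("S50", "SER", (30.0, 30.0, 30.0)),
    ("E70", "GLU", (40.0, 0.0, 0.0)),
]

matches = find_matches(TEMPLATE, STRUCTURE)
print(matches)
```

The number of combinations examined is the product of the candidate counts at each template position, so each added residue or permitted substitution multiplies the work. This is one way to see why larger motifs with amino acid substitutions become expensive to search against a full structural database.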