A database of functional sites for proteins with
known structures, SITE, is constructed and used in conjunction
with a simple pattern matching program, SiteMatch, to evaluate
possible function conservation in a recently constructed
database of fold predictions for Escherichia coli
proteins (Rychlewski L et al., 1999, Protein Sci 8:614–624).
In this and other prediction databases, fold predictions
are based on algorithms that can recognize weak sequence
similarities and putatively assign new proteins into already
characterized protein families. It is not clear whether
such sequence similarities arise from distant homologies
or from a general similarity of physicochemical features along
the sequence. Leaving aside the important question of the nature
of relations within fold superfamilies, it is possible
to assess function conservation by examining
the pattern of conservation of crucial functional residues.
SITE consists of a multilevel function description based
on structure annotations and structure analyses. In particular,
active site residues, ligand binding residues, and patterns
of hydrophobic residues on the protein surface are used
to describe different functional features. SiteMatch, a
simple pattern matching program, is designed to check the
conservation of residues involved in protein activity in
alignments generated by any alignment method. Here, this
procedure is used to study conservation of functional features
in alignments between protein sequences from the E.
coli genome and their optimal structural templates.
The optimal templates were identified, and the alignments taken,
from the database of genomic structural predictions
described in a previous publication (Rychlewski L et al.,
1999, Protein Sci 8:614–624). An automated
assessment of function conservation is used to analyze
the relation between fold and function similarity for a
large number of fold predictions. For instance, it is shown
that identifying low significance predictions with a high
level of functional residue conservation can be used to
extend the sensitivity of fold prediction methods.
Over 100 new fold/function predictions in this class were
obtained in the E. coli genome. At the same time,
about 30% of our previous fold predictions are not confirmed
as function predictions, further highlighting the problem
of function divergence in fold superfamilies.
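
The conservation check described above can be illustrated with a minimal sketch. It is not the SiteMatch implementation: the function name, the input format (two gapped strings of equal length plus 0-based positions in the ungapped template), and the identity-only matching rule are all assumptions made for illustration. The abstract does not specify how SiteMatch scores conservation.

```python
def site_conservation(template_aln, query_aln, site_positions):
    """Return the fraction of template functional-site residues that are
    identical in the aligned query sequence.

    Hypothetical helper for illustration only.
    template_aln, query_aln: gapped alignment rows ('-' marks a gap).
    site_positions: 0-based indices into the UNGAPPED template sequence.
    """
    if len(template_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")

    sites = set(site_positions)
    conserved = 0
    t_index = -1  # index of the current residue in the ungapped template

    for t_res, q_res in zip(template_aln, query_aln):
        if t_res == "-":
            continue  # insertion in the query; no template residue here
        t_index += 1
        # Assumption: a site counts as conserved only on exact identity.
        # A query gap ('-') at a site position never matches.
        if t_index in sites and q_res == t_res:
            conserved += 1

    return conserved / len(sites) if sites else 0.0


# Toy alignment: template sites at ungapped positions 1 (C), 2 (D), 4 (F), 6 (H).
score = site_conservation("ACDE-FGH", "ACNEKFG-", [1, 2, 4, 6])
print(score)  # 0.5: C and F are conserved; D is substituted, H is aligned to a gap
```

Under these assumptions, a low significance fold prediction whose alignment yields a high score would be a candidate for the class of predictions discussed above, while a confident fold prediction with a low score would suggest possible function divergence.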