[proteamdavis] Fwd: predicting secondary structure using possible tertiary structures

  • From: Paul Limb <paulimb@xxxxxxxxx>
  • To: proteamdavis@xxxxxxxxxxxxx
  • Date: Sat, 3 Jul 2004 11:24:40 -0700

---------- Forwarded message ----------
From: Paul Limb <paulimb@xxxxxxxxx>
Date: Sat, 3 Jul 2004 11:24:25 -0700
Subject: Fwd: predicting secondary structure using possible tertiary structures
To: proteamdavis@xxxxxxxxxxxxx

---------- Forwarded message ----------
From: Paul Limb <paulimb@xxxxxxxxx>
Date: Mon, 28 Jun 2004 17:27:57 -0700
Subject: Fwd: predicting secondary structure using possible tertiary structures
To: jtmorgan@xxxxxxxxxxx

---------- Forwarded message ----------
From: Paul Limb <paulimb@xxxxxxxxx>
Date: Mon, 28 Jun 2004 09:55:05 -0700
Subject: predicting secondary structure using possible tertiary structures
To: paulimb@xxxxxxxxx

The strong coupling between secondary and tertiary structure formation
in protein folding is neglected in most structure prediction methods.
In this work we investigate the extent to which nonlocal interactions
in predicted tertiary structures can be used to improve secondary
structure prediction. The architecture of a neural network for
secondary structure prediction that utilizes multiple sequence
alignments was extended to accept low-resolution nonlocal tertiary
structure information as an additional input. By using this modified
network, together with tertiary structure information from native
structures, the Q3-prediction accuracy is increased by 7–10% on
average and by up to 35% in individual cases for independent test
data. By using tertiary structure information from models generated
with the ROSETTA de novo tertiary structure prediction method, the
Q3-prediction accuracy is improved by 4–5% on average for small and
medium-sized single-domain proteins. Analysis of proteins with
particularly large improvements in secondary structure prediction
using tertiary structure information provides insight into the
feedback from tertiary to secondary structure.

artificial neural networks | protein folding | ROSETTA | fragment
replacement | CASP

Introduction

Many approaches for predicting secondary structure from sequence have
been developed (1–13). The PHD program published by Rost and Sander
(14, 15) used multiple sequence–sequence alignments for the first
time. The state-of-the-art PSIPRED program by Jones (16) uses
position-specific scoring matrices obtained in PSIBLAST searches (17).
The most accurate of these methods achieve a Q3 score between 75% and
80%, where Q3 is the percentage of amino acids correctly predicted as
helix, sheet, or coil if all amino acids are classified in one of the
three groups. Not only secondary structure but also supersecondary
structural elements such as U-turns or β-hairpins can be predicted
from sequence (18–22). In essentially all previous work, the
prediction of the secondary structure at a given position i is based
entirely on a local sequence window of 5–27 aa centered on the
position; sequence information distant from position i is ignored,
although, during folding, interactions with residues distant along the
linear sequence but close in space are likely to influence the
structure at position i.
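
As a minimal sketch of the Q3 measure defined above (not code from the
paper), the following Python function compares a predicted and an
observed three-state string. The one-letter codes H, E, and C for
helix, sheet, and coil, and the function name q3_score, are
assumptions chosen for illustration.

```python
def q3_score(predicted, observed):
    """Percentage of residues whose three-state label is predicted correctly.

    Both arguments are strings over the (assumed) alphabet H, E, C,
    one character per residue.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example: one of eleven residues is mispredicted.
example_q3 = q3_score("CHHHHCCEEEC", "CHHHHHCEEEC")
print(round(example_q3, 1))
```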

During the folding process of a protein, a certain fragment first
might adopt a secondary structure preferred by the local sequence
(e.g., an α-helix) and later be transformed to another secondary
structure (e.g., a β-strand) because of nonlocal interactions with a
segment distant along the sequence (Fig. 1). The structures of
peptides corresponding to portions of complete native sequences have
been investigated to identify parts of the sequence that adopt the
native conformation early, as well as parts that undergo transitions
(23, 24). Whereas some peptide fragments adopt stable conformations
similar to those seen in the complete protein (25, 26), other peptides
adopt different secondary structure in different contexts (27–33). As
shown by Minor and Kim (34), the same local 11-aa sequence can adopt
β-strand or α-helix structure, if inserted at two different positions
in protein G. Also, for prion proteins, it appears that the same
sequence can adopt different tertiary folds with different secondary
structure (35–48). These results support the idea that the secondary
structure in some portions of a protein sequence depends critically on
tertiary interactions (49).

Because of the indeterminacy of local sequence–structure
relationships, the prediction of secondary structure from a local
sequence window must fail in some cases. Secondary structure
prediction is excellent (with Q3 ≈ 90%) for many proteins but is as
low as Q3 = 50% for some sequences. Usually the mistakes in secondary
structure prediction occur in regions with local sequences that do not
clearly prefer the formation of α-helix, β-strand, or coil, where the
choice may ultimately be dictated by quite nonlocal interactions.
These nonlocal interactions, which result from the complex folding
process, cannot be reproduced by a simple neural network, even if the
complete sequence is provided as input. However, given a set of
possible tertiary structure models, a neural network potentially could
extract nonlocal information that in turn could help to predict the
secondary structure of such regions more accurately and with a higher
confidence level.

Given the amino acid sequence of a protein, possible tertiary
structure models can be generated by de novo protein structure
prediction methods. The ROSETTA de novo protein structure prediction
method (50) has proven to be one of the most successful approaches. It
can make good predictions for a large number of different folds, as
demonstrated during CASP4 and CASP5 [Critical Assessment of Techniques
for Protein Structure Prediction (51–53)]. In ROSETTA, the Protein
Data Bank is screened, for each three- and nine-residue fragment of
the query sequence, for fragments that have a high primary-sequence
homology and a secondary structure that matches the predicted
secondary structure. These
fragments sample possible conformations for each local segment of the
chain and are combined by using a Monte Carlo algorithm to generate
possible tertiary structures.

In this article, low-resolution 3D information obtained from ROSETTA
models is incorporated into a neural network secondary structure
prediction method and found to decrease the number of critical
mistakes. Going one step further, the improved secondary structure
prediction is shown also to improve the structural models generated by
ROSETTA when used for fragment selection. The procedure can be viewed
as a mimic of the actual folding process: the secondary structure is
formed based on local sequence preferences and later reevaluated based
on the long-range interactions in frequently sampled tertiary
structures.

Methods

Scoring Matrix-Based Secondary Structure Prediction. A previously
described neural network approach for predicting secondary structure
from a single sequence profile of seven amino acid properties over a
window of 39 aa (13) was extended to process position-specific scoring
matrices as additional input parameters. These matrices can be
obtained from PSIBLAST searches (17) and previously have been shown to
be useful for secondary structure prediction (16). For this purpose,
20 additional input units were added per position. Thus, the number of
input units was 1,053 [(7 + 20) × 39]. The number of hidden neurons in
the standard three-layer feed-forward network was optimized to be 39,
and three output neurons predicted three-state probabilities for an
amino acid's being helix, sheet, or coil. The network was trained with
≈1,000 structures from the Protein Data Bank (54) selected to have a
resolution better than 2.5 Å and a sequence identity of <50% as
obtained from the Culled PDB Page (R. L. Dunbrack and G. L. Wang,
Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia)
[now the Protein Sequence Culling Server (PISCES)

The training was performed by using the SMART program
(www.jens-meiler.de/index_soft.html), which performs back-propagation
of errors. The learning rate was decreased from 10^-2 to 10^-4 during
the training process, and the momentum was kept constant at 0.5. A
monitoring set of 100 sequences was used to interrupt the training
process as soon as its standard deviation was minimized. A second
independent set of 100 sequences was used to evaluate the quality of
the prediction. The training took 13,425 cycles (≈250 h on a 1.0-GHz
Pentium III processor equipped with 2 gigabytes of memory). The
prediction from sequence alone is accessible for academic users via
the JUFO server (www.jens-meiler.de/jufo.html).
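
The layer sizes and training settings quoted above can be checked with
a few lines of Python. This is only an illustrative sketch, not the
SMART program: the geometric shape of the learning-rate decay is an
assumption (the text gives only the start and end values), and the
names learning_rate and momentum_step are hypothetical.

```python
WINDOW = 39          # sequence window in residues (from the text)
PROPERTIES = 7       # amino acid property values per position (from the text)
PSSM_COLUMNS = 20    # scoring-matrix values per position (from the text)
HIDDEN = 39          # hidden neurons (from the text)
OUTPUTS = 3          # helix, sheet, coil probabilities (from the text)

n_inputs = (PROPERTIES + PSSM_COLUMNS) * WINDOW
# Connection weights only; bias terms are left out of this count.
n_weights = n_inputs * HIDDEN + HIDDEN * OUTPUTS


def learning_rate(cycle, total_cycles, start=1e-2, end=1e-4):
    """Learning rate at a given cycle, decaying from start to end.

    ASSUMPTION: geometric interpolation; the source does not state
    how the rate was lowered between the two values.
    """
    if total_cycles <= 1:
        return start
    fraction = cycle / (total_cycles - 1)
    return start * (end / start) ** fraction


def momentum_step(weight, velocity, gradient, rate, momentum=0.5):
    """One back-propagation weight update with a constant momentum term."""
    velocity = momentum * velocity - rate * gradient
    return weight + velocity, velocity


print(n_inputs, n_weights)
```

With the numbers from the text the input layer comes out at 1,053
units, matching the figure quoted above.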

Incorporation of Tertiary Structure Information. To use ROSETTA models
for secondary structure prediction, it is necessary to incorporate 3D
structural information into the neural network input. Because many
structural models (typically a few thousand) with different and
partially wrong secondary structure are built by ROSETTA, an algorithm
is desired that extracts information relevant for secondary structure
prediction from a set of 3D models and combines it with sequence
profile information. Because the local secondary structure at any
sequence position might be wrong in the majority of the models, it is
not used as input. Also, local sequence effects should be reflected in
the primary-sequence information and should therefore not add new
information to the input. The description of the 3D structure has to
focus on the incorporation of interactions between parts of the
molecule that are more distant in sequence and be robust in dealing
with incorrect secondary structure in some of the models.

For incorporating low-resolution structural information, 90 input
neurons were added to the neural network. The tertiary structure
information fed to the network for a particular amino acid i was
derived from all other amino acids j with Cαi–Cαj distances <8, 12,
and 16 Å and a sequence separation of at least five amino acids
[|i – j| > 5]. For these amino acids, the number of helix,
sheet, and coil residues (3 parameters), their average property
profiles (7 parameters; compare ref. 13), and their averaged
position-specific scoring matrices (20 parameters) in each of the
distance bins were captured with 30 (3 + 7 + 20) input units. Thus, a
total of 90 (3 distance bins × 30 input units) additional input
neurons for the low-resolution structural information was added to the
original network architecture. The network was trained in the manner
described above for the sequence-alone network, by using the native
structure of the proteins in the training, monitor, and independent
data sets. The training took 14,775 cycles until the weights were
optimized. The prediction from sequence in combination with a given
tertiary structure is accessible for academic users via the JUFO3D
server (www.jens-meiler.de/jufo3D.html).
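
The 90 structural inputs described above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the
JUFO3D implementation: the three distance bins are treated as
cumulative cutoffs (everything closer than 8, 12, or 16 Å), raw
residue counts are used without normalization, an empty bin yields
zeros, and the function and variable names are hypothetical. The
separation filter follows the bracketed formula, |i – j| > 5.

```python
CUTOFFS = (8.0, 12.0, 16.0)  # C-alpha distance cutoffs in angstroms (from the text)
MIN_SEPARATION = 5           # keep residues j with |i - j| > 5 (from the text)
STATES = "HEC"               # helix, sheet, coil (letter codes are an assumption)


def _distance(a, b):
    """Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def _column_means(rows, width):
    """Column-wise mean of equal-length rows; zeros if there are no rows."""
    if not rows:
        return [0.0] * width
    return [sum(row[k] for row in rows) / len(rows) for k in range(width)]


def tertiary_features(i, ca_coords, sec_struct, properties, pssm):
    """Return 3 x (3 + 7 + 20) = 90 low-resolution features for residue i."""
    features = []
    for cutoff in CUTOFFS:
        # ASSUMPTION: cumulative bins, i.e. all residues closer than the cutoff.
        near = [j for j in range(len(ca_coords))
                if abs(i - j) > MIN_SEPARATION
                and _distance(ca_coords[i], ca_coords[j]) < cutoff]
        features += [float(sum(sec_struct[j] == s for j in near)) for s in STATES]
        features += _column_means([properties[j] for j in near], 7)
        features += _column_means([pssm[j] for j in near], 20)
    return features


# Toy chain: residues 10 and 11 fold back close to residue 0.
n = 12
coords = [(100.0 * j, 0.0, 0.0) for j in range(n)]
coords[10] = (5.0, 0.0, 0.0)
coords[11] = (10.0, 0.0, 0.0)
ss = "CCCCCCCCCCEH"
props = [[float(j)] * 7 for j in range(n)]
scores = [[1.0] * 20 for _ in range(n)]
feats = tertiary_features(0, coords, ss, props, scores)
print(len(feats))
```

Only the Cα coordinates and per-residue annotations are needed, which
is why the description tolerates models whose overall fold is wrong
but whose contact distances are roughly right.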

Tertiary structure information was provided to the network from either
the native structure or 1,000 ROSETTA models (50). We chose to use the
network trained on native structures rather than retraining it with
ROSETTA models. This choice allowed the use of as many native
structures as possible for training, not only of proteins with <180 aa
as foldable with ROSETTA. A "moving target" effect is avoided in which
improvements in ROSETTA would require retraining. Also, the method is
more general in the sense that it potentially can be applied to models
generated with other protein structure prediction methods without
prior retraining.

To obtain a single secondary structure prediction from a set of
structural models, the three-state probabilities predicted by the
neural network were averaged over all models. Before averaging, each
model was weighted according to its score (a better score suggested a
more probable 3D structure) and the internal consistency between the
actual secondary structure of the model and the secondary structure
predicted by the neural network using the model.
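
The averaging step can be sketched as a weighted mean of per-model
probabilities. The text does not give the formula that turns a model's
score and internal consistency into a weight, so the weights are
simply passed in here; the function names and the H/E/C ordering are
assumptions made for illustration.

```python
def combine_predictions(model_probs, weights):
    """Weighted average of three-state probabilities over all models.

    model_probs[m][i] is a (p_helix, p_sheet, p_coil) tuple for model m
    at residue i; weights[m] is a non-negative weight for model m.
    ASSUMPTION: how weights are derived from score and consistency is
    not specified in the text, so the caller supplies them.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_residues = len(model_probs[0])
    averaged = []
    for i in range(n_residues):
        averaged.append(tuple(
            sum(w * probs[i][s] for w, probs in zip(weights, model_probs)) / total
            for s in range(3)
        ))
    return averaged


def to_states(averaged):
    """Pick the most probable state per residue (ties resolved as H, E, C)."""
    return "".join("HEC"[max(range(3), key=lambda s: p[s])] for p in averaged)


# Two toy models over two residues; the first model is trusted three times more.
probs = [
    [(0.6, 0.3, 0.1), (0.2, 0.7, 0.1)],
    [(0.2, 0.2, 0.6), (0.1, 0.8, 0.1)],
]
avg = combine_predictions(probs, [3.0, 1.0])
print(to_states(avg))
```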

Results and Discussion

Analysis of the Artificial Neural Networks. The input-sensitivity
profiles (defined as the first derivative of an output value with
respect to a changing input vector) of the two neural networks
(sequence-only versus sequence-plus-model) are similar over the
sequence window (Fig. 2). Not surprisingly, the actual amino acid of
interest and its direct neighbors had the largest influence on the
prediction. The network that utilized tertiary structure information
obtained ≈20% less information from the sequence than did the
sequence-only network, as can be seen from the reduced sensitivities
in the sequence profile. This part of the information was replaced by
the low-resolution structural data. The most useful structural
information was taken from the secondary structure of the spatially
close amino acids, but position-based scoring matrices and the
property profiles also contributed.
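
An input-sensitivity profile of the kind defined above can be
approximated numerically. The sketch below uses a central finite
difference on an arbitrary callable; this is an assumed stand-in for
whatever derivative computation was actually used, and the toy network
is purely illustrative.

```python
def input_sensitivity(network, inputs, output_index, step=1e-4):
    """Approximate d(output)/d(input_k) for every input k.

    network is any callable mapping a list of floats to a list of
    floats. ASSUMPTION: central finite differences; the source only
    defines the profile as a first derivative.
    """
    sensitivities = []
    for k in range(len(inputs)):
        up = list(inputs)
        down = list(inputs)
        up[k] += step
        down[k] -= step
        delta = network(up)[output_index] - network(down)[output_index]
        sensitivities.append(delta / (2.0 * step))
    return sensitivities


def toy_network(x):
    """Hypothetical two-input, two-output function used only for the demo."""
    return [2.0 * x[0] - 3.0 * x[1], x[0] * x[1]]


profile = input_sensitivity(toy_network, [1.0, 2.0], output_index=0)
print([round(value, 3) for value in profile])
```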

Secondary Structure Prediction from Sequence-Only Network. The
sequence-only neural network was tested on a set of 137 sequences with
<150 aa that were not used for training. The trained neural network
yielded prediction accuracy (Q3) of 75% (SS1, Table 1), in agreement
with the method of Jones (16) for this set of data (Q3 = 75%).

Secondary Structure Prediction Using Low-Resolution Information from
Tertiary Structure. As expected, the Q3 value improved (to 82%) for
the independent set when using the correct 3D structures as input
(SS3, Table 1). This value could be increased further by including
higher-resolution 3D information. However, the low-resolution
representation was chosen because it seemed most appropriate for the
low-resolution structural models obtained from ROSETTA.

It is encouraging that including the low-resolution structural
information from the true structures corrected serious mistakes in
secondary structure prediction, where sheet, helix, and coil are
interchanged. The gain of information naturally varies from sequence
to sequence. Whereas, for many sequences, the nonmodified
sequence-only setup already yields high Q3 values of ≈90% and not much
improvement is possible, some sequences perform rather poorly with
only local information (Q3 < 70%) and allow for a significant
improvement. This possibility is particularly notable for β-strand
prediction, which improved from 58% to 76% (Table 1). In contrast to
α-helices, β-sheets are defined by nonlocal contacts and are therefore
harder to predict from a local sequence window alone. Most of this
lack of information already can be overcome by using a low-resolution
description of tertiary structure as introduced here. Whereas the
accuracy of helix and coil prediction increased by only 5% and 2%,
respectively, the accuracy of sheet prediction increased by 18%.

Secondary Structure Prediction Using Predicted Tertiary Structure.
Although the above results are encouraging, they require knowledge of
the native structure and hence cannot be used for a protein of unknown
structure. How much can be obtained from low-resolution and, often,
low-accuracy de novo structural models?

The results naturally will suffer when predicted structural models are
used instead of the correct 3D fold. Nonetheless, on average, over the
set of 137 proteins, an increase of 5% in the Q3 value was obtained
(SS2, Table 1). The Q3 value increased from 75% to 80% when models
were used and to 82% when the native structure was used as input for
the neural network over a total of 10,127 aa. More important was the
improvement of 13% in the prediction of β-sheets. A histogram of the
changes in the Q3 values for this set of proteins is given in Fig. 3a.
Although many of the models have incorrect topologies and even coil or
helix in place of a β-strand, if a second β-strand can come close in
the majority of the models, the judgment of the network changes.
Conversely, if no partner for a wrongly predicted β-strand can be
found because of spatial restrictions, it can turn into a coil or
helix.

The improvement that can be gained from the incorporation of
low-resolution 3D models varies from case to case, depending on the
quality of the sequence-only prediction and the variety and quality of
the structural models. Of the set of 137 structures, a subset of 14
structures with differences in the sequence-only and
sequence-plus-model prediction of >15% of the positions was selected.
Table 2 gives an overview of this subset of proteins. The average
sequence-only secondary structure prediction accuracy was 62%,
significantly lower than the 75% seen for the complete set of data.
The prediction accuracy achieved by including the structural models
increased to 75%, which is only 3% lower than that achieved by using
correct structure.

The improvement obtained for the sequence-plus-model prediction raises
the question of how well the tertiary structure of the ROSETTA models
alone reflects the true secondary structure of the protein. A
secondary structure prediction from the tertiary structure of 1,000
models alone was obtained by computing the ratio of helix, strand, or
coil conformation for every amino acid in the 137 proteins of the
benchmark set. The Q3 value achieved with this prediction method (71%)
is significantly lower than the prediction from sequence alone (75%).
Hence, the combination of sequence and tertiary information is
critical to obtain an improvement in the predicted secondary
structure.
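
The models-only baseline described in this paragraph amounts to a
per-residue vote over the secondary structure seen in the models. A
minimal sketch, assuming each model's secondary structure is already
available as a string of H, E, and C (how it is assigned from the
coordinates is outside this example) and using a hypothetical function
name:

```python
from collections import Counter


def consensus_from_models(model_states):
    """Per-residue state fractions and the majority state over all models.

    model_states is a list of equal-length strings over H, E, C, one
    string per model. Ties are resolved in the order H, E, C, which is
    an arbitrary choice made for this sketch.
    """
    n_models = len(model_states)
    n_residues = len(model_states[0])
    fractions = []
    for i in range(n_residues):
        counts = Counter(states[i] for states in model_states)
        fractions.append({s: counts[s] / n_models for s in "HEC"})
    prediction = "".join(max("HEC", key=lambda s: f[s]) for f in fractions)
    return fractions, prediction


fractions, prediction = consensus_from_models(["HHC", "HEC", "EEC"])
print(prediction)
```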

CAFASP3 and LIVEBENCH6. The neural network was used to predict the
secondary structure from models generated by the ROSETTA server during
the CAFASP3 and LIVEBENCH6 (55) experiments. The results obtained for
the 31 proteins modeled by using the ROSETTA de novo protocol are
consistent with the numbers reported in Table 1 for the independent
set of 137 proteins. The Q3 value increased from 72% to 76% when using
models and to 82% when using the native structure as input for the
neural network over a total of 4,423 aa. The distribution over the
protein sets is plotted in Fig. 3b. The average confidence level of
the neural network decision increased from 45% (sequence-only) to 49%
(sequence-plus-model) to 54% (sequence-plus-native structure) as the
network came to a more definite decision by using the tertiary
structure in regions where only an ambiguous prediction was made
from sequence alone.

Fig. 4 illustrates ways in which tertiary structure can feed back to
improve secondary structure prediction in four examples from
LIVEBENCH6 and CAFASP3. T148 is a domain-swapped (the first strand
lies in the second domain) ferredoxin fold that consists of two
β-sheets, each of them packed with two helices on one side. The
sequence-only prediction missed the first and the last strand
completely, as indicated in green (which represents coil in Fig. 4).
Also, the prediction of the helical and strand regions was rather
ambiguous at some places. Virtually all of these mistakes were
corrected when the native fold was used in the modified neural
network. The Q3 value increased from 74.1% to 87.7%. The spatial
closeness of weakly predicted β-strands to a different β-strand in the
3D structure helped the neural network to draw the correct conclusion.

In this case as well as in the other three examples, ROSETTA was
unable to predict the complete protein structure correctly. The two
subdomains were built correctly in some of the models; however, their
relative orientation was wrong, and the domain swap was very rarely
suggested by ROSETTA. Still, those partial predictions allowed the
neural network to improve the secondary structure significantly to
achieve a Q3 of 85.8%. The sampling of possible 3D structures and the
analysis of the consistency of predicted and modeled secondary
structure suggest that β-strands are more likely than coil or helix in
the ambiguously predicted regions.

Domain A of the arterivirus nsp4 (1mbmA) (56) folds in three
subdomains. The first two contain only β-sheet, whereas the last one
contains two α-helical regions. The single-state prediction was at
66.2% accuracy, mainly because some small secondary structure elements
were missed and the length of the individual β-strands was wrongly
predicted. When using the native structure as input to the neural net,
many of the ambiguous regions were clearly predicted, which resulted
in an increased Q3 of 75.3% and an improved confidence level. Besides a
better prediction of beginnings and endings of β-strands, two
β-bridges in the third domain, as well as one additional strand in the
second domain, were found correctly. The ROSETTA models did not
capture the complex nonlocal topology of the two β-domains. Typical
models contained three separate domains, two of them with a local
β-sheet, one α-helical. However, even these models were sufficient to
improve the prediction to a Q3 of 75.8%, although they contained
mainly β-hairpins instead of the less local strand contacts in the
native structure.

The third example, domain A of HI0073/HI0074 protein pair from
Haemophilus influenzae (1jogA) (57), is an all-helical protein.
However, the sequence-only prediction gave the end regions of the
first two helices a high strand probability. In addition, one short
helix was predicted as strand, and the prediction for the last helix
had significant coil probability. Still, the sequence-only prediction
was at a high level of 71.3%. When the correct 3D structure was used,
those mispredicted regions were mostly corrected. The strand signal
vanished almost completely, and only at few places was a significant
coil signal obtained, which was, however, still lower than the helix
probability in those regions. The Q3 value increased to 86.0%. ROSETTA
was (correctly) unable to bring the regions of the molecule with an
increased strand probability spatially close, and, hence, the strand
regions were converted to helices, leading to an improved Q3 of 87.6%.

In domain A of the homologous pairing domain from the human Rad52
recombinase (1kn0A) (58), only the middle strand of the three-stranded
β-sheet was predicted from sequence alone with a high probability. The
two neighboring strands that lie on the edge of the sheet were
predicted as coil with a very low confidence level. Also, one strand
of the small β-hairpin was missing, as well as one of the short
α-helices. When using the 3D structure as additional input for the
modified neural network, most of the strand amino acids were correctly
predicted, and only the small helix was still predicted as coil. The
Q3 value increased from 69.0% to 81.0%. The prediction from models (Q3
value of 79.9%) was not significantly worse. Interestingly, the best
models built for this protein adopt a fold that appears to be the
spatial inverse of the native structure. In the model shown in Fig. 4,
the three-stranded sheet is properly formed, but the helix is packed
on the opposite side compared with the native structure. In
consequence, the small helix bundle sits on the opposite side. This
model still has a relatively high rms deviation (rmsd) from the native
structure (>10 Å); however, all Cα–Cα distances are close to the
distances measured in the native structure, and these are the data
used by the neural network.

Tertiary Fold Prediction. We investigated whether the tertiary
structure–secondary structure feedback could be extended to generate
improved 3D models using the improved secondary structure prediction
as input to ROSETTA. One thousand structures were built with the
sequence-only prediction (TS1, Table 2) and with the model-assisted
prediction (TS2, Table 2) where the models were taken from the
previous run.

The quality of the models produced was assessed by comparing the
model–native Cα–Cα rmsd of the tenth-most accurate model to avoid
statistical artifacts that might be caused by looking at the best rmsd
only. When using the model-assisted secondary structure prediction,
the average rmsd decreased from 7.4 to 6.6 Å (TS2, Table 2). Whereas
in many cases the rmsd did not change significantly (Δrmsd < 0.5 Å),
it did improve for the majority of the problematic proteins (1c8cA,
1dtdB, 1fwp_, 1hz6A, 1isuA, 1sap_, 1vqh_, and 1wapA) by between 0.6 and
4.3 Å. ROSETTA on average generates poorer models for these proteins
than for the complete set of 137 structures, which might partially be
caused by the ambiguous secondary structure prediction. However, the
improvement of the secondary structure prediction is certainly more
significant than the change in the quality of the predicted models. A
second iteration of secondary structure prediction and generating
ROSETTA models further improved neither the secondary structure
prediction nor the 3D models.
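
The tenth-most-accurate-model criterion used above is easy to state in
code. A minimal sketch, assuming the rmsd of every model to the native
structure has already been computed (the superposition itself is not
shown) and using a hypothetical function name:

```python
def nth_best_rmsd(rmsds, n=10):
    """Return the n-th lowest rmsd in a list of per-model rmsd values.

    Reporting the tenth-lowest value instead of the minimum makes the
    quality measure less sensitive to a single lucky model.
    """
    if n < 1 or len(rmsds) < n:
        raise ValueError("need at least n rmsd values")
    return sorted(rmsds)[n - 1]


# Toy values in angstroms for twenty models.
toy_rmsds = [float(value) for value in range(20, 0, -1)]
print(nth_best_rmsd(toy_rmsds))
```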

The most significant improvement in the quality of the tertiary fold,
which was accompanied by an improvement of the Q3 prediction accuracy
of 15%, was for 1vqh, an 86-residue, all-β protein. Only three of the
eight β-strands were recognized, whereas a fourth was predicted to be
a helix when using sequence-only prediction. After incorporating a set
of 1,000 models (obtained by using this ambiguous secondary structure
prediction) into the secondary structure prediction, the Q3 value
became 71%, and seven of the eight strands were recognized. When using
the correct 3D fold as input, all eight strands were recognized and Q3
was found to be 76%. In this particular case, the predicted structure
improved drastically. The rmsd value of the tenth-best model by rmsd
dropped from 11.1 to 6.8 Å.

Conclusion

Although very accurate for many proteins, secondary structure
prediction from sequence alone can fail if the formation of secondary
structure is strongly coupled to the formation of tertiary
interactions. This is especially true for β-strands, where nonlocal
partners are frequently necessary. Here we show that even very
low-resolution tertiary structure information can improve the
prediction of secondary structure.

A drawback of the new method is its dependence on ROSETTA models,
which limits its application to single-domain proteins. Incorporation
of very long-range interactions between domains and within single
large domains will require improvements in de novo protein structure
prediction methodology. Despite this currently limited applicability,
the method does illuminate the ways in which tertiary structure can
feed back on secondary structure. The characterization (Fig. 4) of the
proteins for which the largest changes in secondary structure
prediction were brought about by using tertiary structure models
suggests that the most important influences are on regions with some
β-strand propensity. Such regions are predicted to be β-strands if and
only if there are nearby β-strands in plausible tertiary structures.
This resolution of ambiguous β-strand propensity by the presence (or
absence) of tertiary β-sheet interactions is likely to mirror the fate
of segments of the polypeptide chain with weak β-strand propensity
during the actual folding process.
