Endonuclease PvuII (1PVI) DNA - GATTACAGATTACA
CAP - Catabolite gene Activating Protein (1BER)
DNA - GATTACAGATTACAGATTACA Endonuclease PvuII bound to palindromic DNA recognition site CAGCTG (1PVI) DNA - GATTACAGATTACAGATTACA TBP - TATA box Binding Protein (1C9B)
CAP - Catabolite gene Activating Protein (1BER)
GCN4 - leucine zipper transcription factor bound to palindromic DNA recognition site ATGAC(G)TCAT (1YSA)
GCN4 - leucine zipper transcription factor bound to palindromic DNA recognition site ATGAC(G)TCAT (1YSA)
GCN4 - leucine zipper transcription factor bound to palindromic DNA recognition site ATGAC(G)TCAT (1YSA)
GCN4 - leucine zipper transcription factor bound to palindromic DNA recognition site ATGAC(G)TCAT (1YSA)
GCN4 - leucine zipper transcription factor bound to palindromic DNA recognition site ATGAC(G)TCAT (1YSA)
GCN4 - leucine zipper transcription factor bound to palindromic DNA recognition site ATGAC(G)TCAT (1YSA)
TBP - TATA box Binding Protein (1C9B)
 

Modeling Proteins with a Transgenic Algorithm

Elmar Krieger and Gert Vriend
Center for Molecular and Biomolecular Informatics, Toernooiveld 1, NL - 6525 ED Nijmegen, the Netherlands

Genetic algorithms are known for many years, and after the publication by Unger and Moult in 1993 [1], they became a standard tool for protein fold prediction. These algorithms are so called because they utilize the same optimization procedures as natural evolution: mutation, crossover and selection [2]. We added a new idea to these original concepts: genetic engineering. The flow of the procedure is: 1) collect alignments from a series of programs; 2) build models using WHAT IF [3]; 3) determine their quality with WHAT_CHECK [4]; 4) use the newly developed program YASARA [5] to combine the good parts of the models and run molecular dynamics simulations; 5) iterate the steps 3 and 4 as part of a transgenic algorithm. The program suite does not distinguish between ab initio protein folding, threading or homology modeling. Here we concentrate on the latter topic.

We have implemented a genetic algorithm (GA) based protein modeling protocol. In this GA, the genotypes are sets of inter-atomic distances, and the phenotypes are the corresponding three-dimensional coordinates. Pairs of models are combined (mating) to determine a set of distances (genotype or genome) that describes the child. YASARA uses the inter-atomic distances, generates three-dimensional coordinates (phenotype) and performs a molecular dynamics run that mimics the "life" of the molecule. The child is then added to the ensemble of models (gene pool). Normally, a genetic algorithm uses the amount of offspring (often encoded in an energy term) as the only quality-measure of a genome. Our GA, however, does not follow the normal evolutionary principles because the inter-atomic distances (genome) of the child protein are selected based on the 'quality' of the corresponding residues in the parent molecules. In real evolution it is, of course, not known which trades of the parents are good and which are bad, but the WHAT_CHECK structure validation software gives us this knowledge, which we use in a genetic engineering-like manner to derive a new genotype. Therefore, a better name for our GA would be TA like "transgenic algorithm".

The whole protocol is a three-step procedure: 1) initiation; 2) optimization using the transgenic algorithm (TA); 3) termination.

Initiation: A series of publicly accessible threading and sequence alignment programs is used to generate possible alignments between multiple templates and the model sequence: GenThreader [6], 3DPSSM [7], BIOINBGU [8] and Smith&Waterman running on a Compugen Bioccelerator. In the latter case, the newly developed program SecMatch filters out those alignments in which the PSI-PRED secondary structure prediction [9] for the model sequence disagrees too much with the observed secondary structure in the template. The program WHAT IF then builds one model for each alignment. A short energy minimization is performed by the interactive real-time molecular dynamics program YASARA.

The transgenic optimization cycle: The program WHAT_CHECK is used to determine quality parameters for each model and for each residue, which are then converted to a residue-specific fitness matrix (see fig 1) by the newly written program WHAT_MODELBASE. A per-model score is calculated, and pairs of models are selected for mating. The higher the per-model score, the higher the chance that the model mates. YASARA uses the fitness matrix to judge the parent models. In a process called impression or reverse expression, it then derives a genome for each parent, that - when expressed - reproduces the good aspects of the phenotype (i.e. the parent model) but omits the bad ones.

At this point, features acquired during "lifetime" (the MD simulations) can be propagated back to the genome, in a sense reviving Lamarck's ideas. Mating is then also done in a non-standard way, as the probabilities of crossovers at certain genes depend on their quality (i.e. the quality of the associated residues). Finally, YASARA converts the atomic distances stored in the child's genome back to a three-dimensional phenotype (gene expression), performs a molecular dynamics run of a few picoseconds, and adds the structure with the lowest energy to gene pool. The distance restraints stored in the genome remain active during the entire MD simulation, residues that get a high score in the fitness matrix are restrained more tightly. A mutation in TA terms is implemented as the release of the restraints during the MD run. Operations like shortening or extending secondary structure elements or shifting b-strands along each other form special classes of mutations.

Termination: The TA cycle is run for a few days on about 30 PCs. All software used in this project is part of a screen-saver, so that the PCs at the desks of our colleagues can be used in a non-obtrusive way. The TA run is terminated after the score of the best molecule has not improved for two days, or when the CASP deadline is near, whatever comes first.

We submitted five models to CASP4 and tried several minor variations in the protocol. For example, in one target we kept most of the backbone constrained, in another model the same was also true for the side chains of the conserved residues, and in three targets no constraints were used. Besides the sequence alignment software, the full protocol consists of seven programs. A full description of these programs and the parameters used will be published after the CASP4 meeting.

References:

1. R.Unger, J.Moult (1993) J.Mol.Biol. 231(1): 75-81
2. J.H.Holland (1975) Adaption in Natural and Artificial Systems, The University of Michigan Press
3. G.Vriend (1990) J. Mol. Graph . 8: 52-56.
4.
R.W.W.Hooft, G.Vriend, C.Sander, E.E.Abola (1996) Nature 381:272-272.
5. E.Krieger, prepared for submission
6. D.T.Jones (1999) J. Mol. Biol. 287: 797-815
7. L.A.Kelley, R.M.MacCallum and M.J.E Sternberg (2000) J. Mol. Biol. 299(2): 501-522
8. D.Fischer, Pacific Symp. Biocomputing, Hawaii, 119-130, January 2000.

// 1crn.cdb - WHAT_MODELBASE Quality Check
Sequence :   TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN|
Access :     4100347612540561477626355017624034295783476263| 0.4285
Quality :    0.7425
Bonds :      9899897999989989999998698897979999892979999999| 0.9246
Angles :     2699996999908089899798899898679996696737996929| 0.7963
Torsions :   ?64644647678687546352666375655764644423755543?| 0.5475
Phi/psi :    ?45545655666665434455646453453654644433754335?| 0.4937
Planarity :  9999999999999999999999999999999999999999999999| 1.0000
Chirality :  9899999999999999999999999999999999999999999979| 0.9901
Backbone :   ??999999999999999919799999999996999986999999??| 0.9476
Peptide-Pl : ??9999999999999999?9?9999999999?8999??999999??| 0.9944
Rotamer :    ??559959954767969699?9699899799?9849999799899?| 0.8573
Chi-1/chi-2: 4444955494574644459949693494549546599944994495| 0.6341
Bumps :      9999909099999999904990999999099999999999999999| 0.7159
Packing 1 :  6489397666677654013238656653014663314314631213| 0.4651
Packing 2 :  9466653456674655444547434744336843142536653532| 0.4934
In/out :     9999999999999999999999999999999999999999999999| 0.9998
H-Bonds :    9999979799999999999999999999999999999999999979| 0.9837
Flips :      9999999999999999999999999999999999999999999999| 1.0000

Fig. 1: Residue specific fitness matrix for crambin (1CRN). 16 different quality measures (from "Bonds" to "Flips") are considered by the transgenic folding algorithm. Quality ranges from 0 (bad) to 5 (average/normal) and 9 (perfect). A detailed description of the various quality indicators can be found at http://swift.cmbi.ru.nl/gv/pdbreport/