We have implemented a genetic algorithm (GA) based protein modeling protocol. In this GA, the genotypes are sets of inter-atomic distances, and the phenotypes are the corresponding three-dimensional coordinates. Pairs of models are combined (mating) to determine a set of distances (genotype or genome) that describes the child. YASARA uses the inter-atomic distances, generates three-dimensional coordinates (phenotype) and performs a molecular dynamics run that mimics the "life" of the molecule. The child is then added to the ensemble of models (gene pool). Normally, a genetic algorithm uses the amount of offspring (often encoded in an energy term) as the only quality-measure of a genome. Our GA, however, does not follow the normal evolutionary principles because the inter-atomic distances (genome) of the child protein are selected based on the 'quality' of the corresponding residues in the parent molecules. In real evolution it is, of course, not known which trades of the parents are good and which are bad, but the WHAT_CHECK structure validation software gives us this knowledge, which we use in a genetic engineering-like manner to derive a new genotype. Therefore, a better name for our GA would be TA like "transgenic algorithm".
The whole protocol is a three-step procedure: 1) initiation; 2) optimization using the transgenic algorithm (TA); 3) termination.
Initiation: A series of publicly accessible threading and sequence alignment programs is used to generate possible alignments between multiple templates and the model sequence: GenThreader , 3DPSSM , BIOINBGU  and Smith&Waterman running on a Compugen Bioccelerator. In the latter case, the newly developed program SecMatch filters out those alignments in which the PSI-PRED secondary structure prediction  for the model sequence disagrees too much with the observed secondary structure in the template. The program WHAT IF then builds one model for each alignment. A short energy minimization is performed by the interactive real-time molecular dynamics program YASARA.
The transgenic optimization cycle: The program WHAT_CHECK is used to determine quality parameters for each model and for each residue, which are then converted to a residue-specific fitness matrix (see fig 1) by the newly written program WHAT_MODELBASE. A per-model score is calculated, and pairs of models are selected for mating. The higher the per-model score, the higher the chance that the model mates. YASARA uses the fitness matrix to judge the parent models. In a process called impression or reverse expression, it then derives a genome for each parent, that - when expressed - reproduces the good aspects of the phenotype (i.e. the parent model) but omits the bad ones.
At this point, features acquired during "lifetime" (the MD simulations) can be propagated back to the genome, in a sense reviving Lamarck's ideas. Mating is then also done in a non-standard way, as the probabilities of crossovers at certain genes depend on their quality (i.e. the quality of the associated residues). Finally, YASARA converts the atomic distances stored in the child's genome back to a three-dimensional phenotype (gene expression), performs a molecular dynamics run of a few picoseconds, and adds the structure with the lowest energy to gene pool. The distance restraints stored in the genome remain active during the entire MD simulation, residues that get a high score in the fitness matrix are restrained more tightly. A mutation in TA terms is implemented as the release of the restraints during the MD run. Operations like shortening or extending secondary structure elements or shifting b-strands along each other form special classes of mutations.
Termination: The TA cycle is run for a few days on about 30 PCs. All software used in this project is part of a screen-saver, so that the PCs at the desks of our colleagues can be used in a non-obtrusive way. The TA run is terminated after the score of the best molecule has not improved for two days, or when the CASP deadline is near, whatever comes first.
We submitted five models to CASP4 and tried several minor variations in the protocol. For example, in one target we kept most of the backbone constrained, in another model the same was also true for the side chains of the conserved residues, and in three targets no constraints were used. Besides the sequence alignment software, the full protocol consists of seven programs. A full description of these programs and the parameters used will be published after the CASP4 meeting.
1. R.Unger, J.Moult (1993) J.Mol.Biol. 231(1): 75-81
2. J.H.Holland (1975) Adaption in Natural and Artificial Systems, The University of Michigan Press
3. G.Vriend (1990) J. Mol. Graph . 8: 52-56.
4. R.W.W.Hooft, G.Vriend, C.Sander, E.E.Abola (1996) Nature 381:272-272.
5. E.Krieger, prepared for submission
6. D.T.Jones (1999) J. Mol. Biol. 287: 797-815
7. L.A.Kelley, R.M.MacCallum and M.J.E Sternberg (2000) J. Mol. Biol. 299(2): 501-522
8. D.Fischer, Pacific Symp. Biocomputing, Hawaii, 119-130, January 2000.