Computer-Aided Molecular Design Using
Neural Networks and Genetic Algorithms
Venkat Venkatasubramanian*
Anantha Sundaram
King Chan
James M. Caruthers
Laboratory for Intelligent Process Systems
School of Chemical Engineering
Purdue University
West Lafayette, IN 47907
U.S.A
ABSTRACT
The design of materials possessing desired physical, chemical and biological properties is a challenging problem in the chemical, petrochemical and pharmaceutical industry. This involves modeling important interactions between basic structural units for property prediction as well as efficiently locating viable structures that can yield desired performance on synthesis. This paper describes a neural network (NN) based approach for solving the forward problem of property prediction based on the structural characteristics of the molecular sub-units, and a genetic algorithm (GA) based approach for the inverse problem of constructing a molecular structure given a set of desired macroscopic properties. The neural network approach was found to develop nonlinear structure-property correlations more accurately and easily in comparison with other conventional approaches. Similarly, the genetic algorithm was found to be very effective in locating globally optimal targets for the design of fairly complex polymer molecules.
To gain a better understanding of the nature of the search space and
the mechanics of GA's search, we explored the search space in detail in
an attempt to analyze and characterize its features. A pseudo-Hamming distance
analysis using 'super-groups' suggested that the average fitness and the
maximum fitness of the molecules increased as they approached the target,
a feature which the genetic algorithm exploits through its 'survival of
the fittest' principle to carry out an effective search. Since real-life
CAMD problems are so difficult and complex, a successful approach would
require a combination of several techniques integrated into a unified,
interactive, framework. We present one such framework with the GA-based
design engine integrated with expert systems and math programming methodologies
to improve the effectiveness of the overall design process. We also envision
such a system as an interactive one, where the molecular designer actively
participates in the design process.
Keywords: Computer aided molecular design, Neural Networks, Genetic Algorithms, QSAR, Polymer molecules
INTRODUCTION
Computer Aided Molecular Design: Background
The process of designing new molecules possessing desired physical, chemical and biological properties is an important endeavor in chemical, material and pharmaceutical industries. Industrial applications include designing composites and blends, drugs, agricultural chemicals such as pesticides or herbicides, refrigerants, solvents, paints and varnishes etc. Recent developments such as stricter penalties on environmentally unfriendly products and emphasis on value-added products and designer molecules, the search for novel materials has become an essential part of research in the above fields. The traditional approach for formulating new molecules requires the designer to hypothesize a compound, synthesize the material, evaluate to see if it meets the desired design targets, and to reformulate the design if the desired properties are not achieved. Figure 1 shows a schematic of this iterative design process. This process is usually lengthy and expensive involving the preliminary screening of dozens or hundreds of candidates. Computer tools such as molecular graphics, molecular modeling, reaction pathway synthesis, and structure-property prediction systems have somewhat alleviated the time and expense involved in this process. However, there are still major problems remain to be addressed as discussed below.
Figure 1. The molccular design problem
Forward Problem in CAMD
The explosion in computational power of modern computers as well as their inexpensive availability has prompted the development of computer-assisted procedures for designing new materials to ease the protracted design, synthesis and evaluation cycle. Computational molecular design systems require the solution of two problems: the "forward" problem which predicts physical, chemical and biological properties from the molecular structure, while the "inverse" problem requires the identification of the appropriate molecular structure given the desired macroscopic properties (Figure 2). In the last decade, significant advances have been made in a variety of methods for structure-property predictions including molecular and empirical models, group contribution methods, QSARs and other correlation methods, etc. (Reid et al., 1977, 1987; Joback and Reid, 1987). However, most of these approaches usually assume that physicochemical properties depend on chemical structure in a relatively simple manner. Hence, they are generally less successful for complex molecules with nonlinear structure-property relationships.
Figure 2. Iterative molecular design process
Inverse Problem in CAMD
The inverse problem has similarly been approached by a variety of methods. The earliest work of Gani and Brignole (1983) and Brignole et al. (1986) presented an enumeration approach for solvent design for liquid-liquid extraction. The enumeration approach was also used by Derringer and Markham (1985) for polymer design. Following this, a heuristic-guided enumeration technique to combat combinatorial explosion was proposed by Joback and Stephanopoulos (1989) for polymer, refrigerant, and solvent design. Subsequently, Macchietto et al. (1990) applied a mixed integer non-linear programming (MINLP) formulation to the design of solvents for liquid-liquid extraction and gas absorption. Knowledge-based approaches for drug design (Bolis et al., 1991), polymer design as well as refrigerant and solvent design have been reported (Gani et al., 1991). Zefirov and coworkers (Skvortsova et al., 1993) and Kier et al. (1993) developed graph reconstruction methods in which molecular structures can be derived from a quantitative structure-activity relationship (QSAR) based on topological indices.
While all these methods have some appeal, they suffer from drawbacks
due to combinatorial complexity of the search space, design knowledge acquisition
difficulties, nonlinear structure-property correlations and problems in
incorporating higher-level chemical and biological knowledge. The combinatorial
complexity associated with CAMD makes random, exhaustive or heuristic enumeration
procedures less effective for large-scale molecular design problems. Mathematical
programming methods consider molecular design as an optimization problem
where the objective is to minimize the error between the desired target
values and the values attained by the current design. The solutions to
mixed integer nonlinear programming (MINLP) formulations are susceptible
to local minima traps for problems with nonlinear constraints, as is the
case with most structure-property relations. The solutions to these problems
are also computationally very expensive, especially for highly nonlinear
systems. The knowledge-based systems approach assumes that expert rules
exist for manipulating chemical structures to achieve the desired physical
properties. However, many nonlinear structure-property relationships can
not be easily expressed as rules, especially when designing for multiple
design objectives. Furthermore, extraction of such design expertise from
experts on molecular design is not easy, making the knowledge acquisition
problem a very difficult one. Lastly, graph reconstruction methods require
expressing all structure-property relations in terms of topological indices.
This may not be appropriate or feasible for all properties. Furthermore,
topological indices are not unique and there currently does not exist a
general graph reconstruction method for all molecular indices. In addition,
since one often deals with a number of design criteria to be satisfied
and not just one or two properties, this approach may not be feasible in
general.
Using Genetic Algorithms and Neural Networks
Recently, Venkatasubramanian et. al [1994,1995] proposed an evolutionary molecular design approach using genetic algorithms (GA) to address the drawbacks of the other methods for the inverse problem. Genetic algorithms are general purpose, stochastic, evolutionary search and optimization strategies based on the Darwinian model of natural selection and evolution. Their study showed that the genetic design (GD) approach was able to locate globally optimal designs for many target molecules with multiple specifications.
In this paper, we extend this approach further by describing a CAMD
framework that uses both neural networks and genetic algorithms to address
the difficulties in both the forward and inverse problems. The neural network
based property prediction methodology addresses the forward problem, while
the genetic algorithmic component tackles the inverse problem. This paper
also discusses some recent results on the characterization and analysis
of complex search spaces for molecular design to gain further insight into
GA's operation. The paper also describes an interactive CAMD system that
is under development.
THE FORWARD PROBLEM USING NEURAL NETWORKS
Figure 3 Schematic diagram of model-based
approaches for structure property/activity studies
Approaches to the Forward Problem
The forward problem refers to the prediction of macroscopic properties of a product given an abstraction of its basic structure. Depending on the kind of the final product desired, the basic structure can be as detailed as a three-dimensional atomic/molecular structure or a formulation of the basic components of the final product. Since this paper is focused on molecular design, the basic structure of interest to us is the molecule. Forward methods or property estimation methods can be broadly classified into three categories: (1) group contribution, (2) equation oriented, and (3) model-based techniques. The model-based approaches are further subdivided into: (a) topological, (b) pattern recognition, (c) molecular modeling methods.
The group contribution or additivity approach assumes that a physical property is equal to the sum of the individual contributions of the molecule fragments or groups to that property. It is developed by utilizing experimental data for a large number of compounds in a regression scheme to estimate the contributions of each structural fragment. Additivity schemes for estimating thermodynamic physicochemical properties of small molecules are well developed and widely used (Reid et al., 1977, 1987). Van Krevelen and Hoftyzer (1972, 1976, 1992), presented group methods as well as semi-empirical relationships utilizing group additive parameters for thermal, mechanical, optical, electrical, and rheological properties of polymers. Equation oriented approaches assume the dependence of a final physico-chemical property on a basic set of other physicochemical properties. This approach is essentially empirical in nature and is of limited utility.
Model-based approaches represent the molecular structure by a set of
descriptors in order to develop a mathematical model which relates descriptors
to physicochemical properties or biological activity. These approaches
are in general, known as structure-property or structure-activity methods
(QSARs) as shown in Figure 3. These are further classified based on the
nature of descriptors as discussed below.
Topological Techniques
Topological techniques are based on graph theory. They consider molecular
structure as planar graphs where atoms are represented by vertices and
chemical bonds by edges. The matrix representation of the chemical constitutional
graph is transformed to an invariant which is normally an index. These
non empirical descriptors are sensitive to constitutional features such
as size, shape, symmetry, and atom heterogeneity. Structure activity models
have been reviewed in detail by Kier and Hall (1976, 1986).
Pattern Recognition
The basis of pattern recognition techniques is that a collection of
structural descriptors representing points in high-dimensional space will
assemble in localized regions of common physicochemical property. Thus,
the assumption is similar chemical structures exhibit similar physicochemical
properties. The objective is to find discriminant functions or decision
regions which separate the data into subsets (i.e. classes) with similar
properties. There has been a considerable amount of work in applying pattern
recognition techniques to biological, chemical, environmental, and pharmaceutical
problems (Jurs and Isenhour, 1975; Strouf, 1986).
Molecular Modeling Techniques
Molecular modeling techniques are separated into three branches: (1)
ab initio or Gaussian methods, (2) semi-empirical molecular orbital
methods, and (3) molecular mechanics methods. Quantum mechanical methods
are based on computing (ab initio methods) or approximating analytically
(semi-empirical molecular orbital methods) the wave function from first
principles, namely from the Schrodinger equation, without resorting to
any kind of regression with experimental data. On the other hand, molecular
mechanics is a computational realization of the ball-and-stick model (Burkert
and Allinger, 1982). A molecule is considered to be a series of masses
(atoms) attached by springs (bonds), where the interactions between the
nuclei are treated according to the laws of classical mechanics. These
force laws and its constants constitute the internal energy of a molecule
or the force-field, V which is then approximated by a sum of different
types of energy contributions such as bond stretching, angle bending, bond
rotations, non-bonded interactions, hydrogen bonding, etc.. The molecular
model is then parameterized by assigning the force constants and natural
values for angles and lengths which allow for the reproduction of experimental
data.
Comparison of the Approaches to the Forward Problem
The property prediction methods may be evaluated based on their classification
as empirical, semi-empirical, theoretical, and hybrid approaches. The empirical
methods usually require extensive data collection and result in linear
or simple nonlinear structure-property relations. Computations are very
rapid at the expense of prediction accuracy. In addition, these methods
require a specific functional form which may not always be available and
the parameters determined by regression from the data. At the opposite
end of the spectrum are theoretical approaches which require an extensive
understanding of the first-principles to develop the models. They are also
computationally expensive, but provide excellent property estimations.
Most approaches settle for the middle ground by utilizing simplifying assumptions
as those found in semi-empirical methods and hybrid approaches. These methods
provide the best compromise between model development effort, computational
time and property prediction accuracy. In this regard, neural network based
methods offer advantages of ease of development and implementation, and
execution speed, while maintaining a high degree of accuracy of predictions.
Neural network based models are relatively model free, in the sense that
the underlying functional form is not as rigorous as in the traditional
model based methods. This adds to the generality of these methods.
Neural Networks for the Forward Problem
Neural network applications to chemistry have been explored in recent years. These include systems for secondary protein structure prediction (Qian and Sejnowski, 1988), QSAR parameters to determine biological activities (Aoyama et al., 1990), functional group identification from mass and infrared spectra (Robb and Munk, 1990), connection tables for predicting chemical composition of aromatic substitution reactions (Elrod et al., 1990) and a host of other applications which are summarized by Zupan and Gasteiger (1991). Sumpter et. al. (1994) give an excellent review on the recent advances in neural network applications in chemistry ranging from molecular spectroscopy (Wythoff et. al. 1990) to applications to QSAR and QSPR for polymers (Sumpter and Noid 1994) and pharmaceuticals (Rose et. al 1991).
Neural Networks in general can be viewed as nonlinear regression models, where the functional structure is some combination of activation functions that operate on linear (the weighted sum of) inputs. Different types of activation functions (sigmoidal, radial etc.), also known as basis functions, can be used and this adds to the generality of the neural network structure. Specifying a basis function fixes the most basic functional unit of the neural network. The complex interactions between the inputs are captured by the neural network by adaptive alteration of its connection weights while matching the training output patterns. However, neural networks are not completely parameter free since one still needs to specify the number of layers and the number of nodes in the hidden layers in the architecture and these drastically affect the performance of the neural network. Cybenko (1988) has proved the capability of two layer neural networks as universal function approximators, which helps with the number of layers issue. Yet, the choice of the number of nodes to use for a particular problem is not an easy one and is often determined by trial and error.
This neural-network-based methodology used in this paper treats polymer property prediction as one of function approximation. The function to be approximated is the one correlating a basic set of structural descriptors to a desired set of properties. In other words, we wish to approximate, using a neural network, a function F such that:
Property = F(structural descriptors)
The key element in the above approach is the selection of an appropriate
set of structural descriptors. The chosen set of descriptors should adequately
capture the phenomena underlying the macroscopic behavior or property of
the material. Depending upon the property sought after and the nature of
the material, the structural descriptors could be of any form. It is also
important that these descriptors are obtained without a lot of computational
effort since they have to be computed for every molecule whose property
needs to be predicted, not to mention the ones used for training the neural
network. Once these structural descriptors are identified the neural network
is trained to accurately correlate the property values to the values of
the descriptors. In this regard, traditional neural network based function
approximators use the back-propagation algorithm to arrive at the optimal
set of connection weights. However, these weights may retain some redundancy
due to the particular choice of descriptors and hence lead to poor generalization
performance of the neural network.
Neural Network Architecture for Property Prediction: Polymers Case Studies
Figure 4 shows the schematic of the neural network architecture for
structure-property predictions. The polymer sub-structural sequence (at
the bottom of the figure) represents a chemical structure to structural
descriptor encoding. Molecular mechanics (MM) techniques were chosen over
the less rigorous topological and physicochemical methods used to quantify
chemical structure. The computational speed coupled with the accuracy with
which molecular mechanics methods calculate the ground state energies and
geometries make them more suitable than the much slower ab initio
and semi-empirical methods.
Figure 4 Neural Network architecture for
structure property prediction
Chemical structural descriptors are obtained from molecular mechanics conformational analysis. Two monomer units were used in present MM simulations since it has been demonstrated that the conformational energy rapidly converges with increasing monomer units beyond three (Orchard et. al., 1987). However, for polymers with many flexible units, the high dimensionality of the dihedral angles or residuals to the contribution of the total energy prevent ready conceptualization of the features of the conformational energy surface. Further simplification is effected by representing the total energy, V, as a sum of contributions Vi(fi) from each of the several residuals of the polymer. This approximation of independent, first-neighbor interactions allows conceptualization of the conformational energy of a polymer of any size to a consideration of dimeric polymer segments. Thus, a judicious selection of dihedral angles representing the major relaxation processes is required.
Polymer configurations are another important consideration in the MM conformational analysis. Consequently, head-to-head, head-to-tail, and, tail-to-tail as well as syndiotactic, isotactic, and atactic forms must be considered. This leads to multiple MM simulations for each polymer. Average structural descriptors were then obtained from Boltzmann statistics. The configurational complexities encountered with copolymers, however, are not considered in this study.
The neural network architecture consists of a single hidden layer and the number of nodes used in this layer is varied to estimate the improvement in the performance obtained. However, increasing the number of nodes in this layer can eventually lead to memorization of the training sets and poor generalization characteristics. The number of input nodes is however, fixed by the number of structural descriptors used and there is only one output node corresponding to the property being predicted. Sigmoidal activation functions were used as the basis functions.
The properties considered in the case studies were density and glass
transition temperature. The objective of the case studies was to implement
a neural network based polymer property predictor and evaluate its performance
in comparison to existing techniques. The case study also explores alternative
architectures for the neural network in an effort to study its influence
on performance. The structural parameters for density and Tg were taken
from van Krevelen and from Hopfinger and coworkers. The polymer data for
network training were taken from Hopfinger and is listed in Table I. Hopfinger
et al. (1988) and Koehler and Hopfinger (1989) combined molecular mechanics,
group additive methods, and multi-variate regression analysis for developing
molecular models for estimating the glass transition (Tg) and crystal-melt
transition temperatures (Tm) of linear polymers. The intermolecular energetic
contribution was later added to the previous intramolecular flexibility
parameters. The final set of descriptors that are assumed to reflect the
dominant type of polymer intermolecular interactions were, <ED> the
non-bonded/dispersion interactions , <E+> and <E-> which are
the electrostatic interactions. The trained network was then tested on
a different polymer set to evaluate its generalization characteristics.
Table II lists the polymer set used for the testing/generalization phase.
Case Study 1: Density
Density can be directly calculated as the ratio of monomer mass to the molar volume of the unit. The neural network input descriptors employed in this case study consist of the monomer mass, molar volume, as well as Hopfinger's intermolecular parameters (<E+> and <E->). The error measure used is the root-mean-square (RMS) error, E, for all exemplars, P, defined as:
(1)
and
(2)
where Ep is the root-mean-square error for all output nodes, N, and
di , oi are the desired and actual outputs of node i, respectively. The
input and output data are discretized between 0.2 and 0.8. Several network
architectures were considered starting with the (2-6-1) and (4-10-1) networks.
The network architecture specification within the parenthesis should be
interpreted, from left to right, as the number of input units, hidden units,
and output units, respectively. Initially, twice the number of hidden neurons
as the number of inputs were considered as in the case of the (2-6-1) and
(4-10-1) networks. The selection of these architectures was based on prior
experience with this set of structural descriptors and a general rule of
thumb based on the performance of different architectures. These networks
were evaluated as starting points only and both smaller and larger architectures
(based on the number of hidden neurons) were analyzed to evaluate the best
performance.

Figure 5 Error during the recall and generalization
phases for density case study
As seen from Figure 5, for each architecture of hidden units, the recall error decreases monotonically with an increase in the learning iterations. Furthermore, hidden units greater than eight in the (4-n-1) networks weakly influenced the recall error. Generalization results, also shown in Figure 5, however, suggest a degradation of property prediction accuracy for hidden units less than 8 and greater than 12. Similar results are seen in the (2-n-1) networks. Thus, the property estimation accuracy is strongly dependent on the neural network architectures and is an important research issue. A graphical summary of the van Krevelen and the neural network recall results for the two networks is presented in Figure 6. Table III lists the neural network prediction and van Krevelen group additive results for density. Consequently, the neural network recall and generalization results demonstrate a significant improvement over the van Krevelen results.
Case Study 2: Glass Transition Temperature
The glass transition temperature is dependent on a number of factors, including the bulkiness and flexibility of movable units, intermolecular interactions, and the free volume fraction. The Tg case study employed Hopfinger's descriptors, such as conformational entropy, mass moments, and the intermolecular descriptors, <ED>, <E+>, and <E->, for neural network structure-property correlations. The neural network architectures considered were (5-10-1), (5-12-1), and (5-14-1) networks. Figure 7 illustrates the recall and generalization errors for each architecture of hidden units as a function of neural network iterations, respectively. A graphical summary of the network's recall and generalization results is illustrated in Figure 8. Table IV lists the neural network prediction, Hopfinger's results, and van Krevelen group additive results for Tg. Once again, the neural network recall and generalization results are a considerable improvement over both van Krevelen and Hopfinger's results. Moreover, the limitation of linear regression analysis is evident as Hopfinger's MM-based (RMS error = 21.07) descriptors yield only a slight improvement over van Krevelen's group methods (RMS error = 21.76) while the neural network correlation methodology using the same Hopfinger parameters result in an improvement in both recall (RMS error = 6.07) and generalization results (RMS error = 17.93).

Figure 6 Comparison of NN recall results
to van Krevelen's group contribution: density
As demonstrated, neural networks were able to acquire structure-property relations from examples of structural descriptors and properties for complex polymer molecules. The neural network recall accuracy to trained properties was near perfect, with the best RMS error of 0.115 and 6.0731 for the density case study and Tg case study, respectively. The best RMS values for density and Tg prediction (i.e. generalization) was 0.0063 and 17.93, respectively. These results are graphically illustrated in Figure 9 and represent a significant improvement over traditional methods. These investigations strongly encourage the use of neural networks for the forward problem for complex nonlinear molecules which can not be adequately modeled using traditional methods.

Figure 7 Error during the recall and generalization
phases for Tg case study

Figure 8 Comparison of NN recall results
to van Krevelen's group contribution Tg
While neural networks have proven to be quite effective, it is important to note some weaknesses of this approach. First, it is a 'black-box' model, which means that no insights or explanations can be provided for the nonlinear correlation that has been developed. Second, the neural net model will only be as good as the data that was used to train the network. Hence, it is important to have a substantial amount of good data for developing the network model. Third, the neural network models are generally more reliable for interpolation than extrapolation of data. That is, if one tries to use the network to predict the properties of molecules that differ substantially from the molecules used in the training set, the prediction are likely to be suspect. For instance, in the case studies performed, the emphasis was on improving the ability of the neural network for recall without compromising the generalization performance. Hence, the neural network trained as above would probably perform very well in predicting properties for molecules that are structurally similar to those used in training. A large connectivity for the neural network as used in the case studies is motivated towards developing very good interpolation characteristics. However, any reasonable prediction as to the performance of the network for molecules with structural characteristics outside that of the training set can be made only with extensive testing. Finally, as noted earlier, the effectiveness of the neural net model depends critically on the choice of the input descriptors which characterize the molecule's structure. Some of these shortcomings can be circumvented by using hybrid neural network models. In the standard 'black-box' neural net approach, one does not use a priori information about well-established functional relationships between the property and the structural descriptors. In a hybrid approach, one incorporates such a priori knowledge and use these network to approximate only unmodeled phenomena . Such hybrid neural networks have been shown to perform better than the 'black-box' models, particularly for generalization problems (Psichogios and Ungar 1992; Aoyama and Venkatasubramanian, 1995).

Figure 9 Neural Network error during recall
and generalization for density and Tg case studies
GENETIC ALGORITHMS FOR THE INVERSE PROBLEM
Overview
Genetic algorithms (GA), originally developed by Holland (1975), are stochastic optimization procedures based on natural selection and Darwin's theory of survival of the fittest. They have been demonstrated to provide globally optimal solutions for a variety of problems and their effectiveness has been explored in different fields including engineering (Goldberg 1989) and chemistry (Hibbert 1993). Under a GA framework, a population of solutions competes for survival based on their closeness to a predefined measure of the optimum (or a characteristic of the optimum). The closeness to the optimum is captured by a normalized value between 0 and 1 called the "fitness" of the solution, where a fitness of 1 implies the achievement of the optimum. Typically, the solution candidates are represented in the form of strings which are manipulated in a manner similar to that of genes in nature. The surviving members from a population are then allowed to propagate their genetic materials, which in this case, are components of their underlying structure (components of the string) to the next generation. However, the nature of this propagation is dependent on the use of genetic operators such as crossover, mutation etc. which are random in nature. This evolutionary process continues until convergence is obtained i.e. no significant improvement is obtained in the solution over generations. The real advantage of using a GA, is its ability to provide several near nearly-optimal solutions even when it fails to converge. This characteristic of the GA is highly desirable especially with regards to the problem of CAMD as the designer is finally left with a set of candidates which can then be synthesized and evaluated instead of a single sub-optimal solution.
Figure 10 Flowchart for genetic algorithm
based CAMD
GAs for CAMD
To circumvent the difficulties faced by most CAMD approaches as discussed earlier, Venkatasubramanian et. al (1994,1995) proposed a GA based framework, a schematic of which has been shown in Figure 10. The basic steps of this procedure are: initialization of a population of solutions, evaluation of the population, survival of the fittest, probabilistic application of genetic operators and elitist propagation of solutions to create the next generation and finally application of a test for convergence. The GA-based CAMD makes use of several genetic operators such as: crossover (main-chain and side-chain), mutation, hopping, blending, insertion and deletion. The representation of the solutions as well as the details of the operators and the evaluation procedures are discussed at length elsewhere (Venkatasubramanian et al., 1994, 1995). Venkatasubramanian et. al. demonstrated that this approach was quite effective in locating globally optimal designs for significantly complex polymer molecules. Since the details can be found elsewhere, only a short summary of the results will be provided here. The focus then shifts to the problem of characterizing and analyzing the search space for molecular design to gain insights into the mechanics of GA search.
Table VI. Four target polymers and their properties
|
|
Density r [=] |
Glass Transition Temp. Tg [=] K |
Thermal Expansion Coeff. a [=] K-1 |
Heat Capacity Cp [=] |
Bulk Modulus K [=] |
|
TP1: |
|
|
|
|
|
|
TP2: |
|
|
|
|
|
|
TP3: |
|
|
|
|
|
|
TP4: |
|
|
|
|
|
Design Case Studies
The effectiveness of GA as an optimization procedure lies in its handling of different forms of the objective function while including constraints on its variables. This was explored in the first case study performed by Venkatasubramanian et. al (1994), where open-ended constraints were used on the target properties and the fitness function was modified to reflect this. The GA-based CAMD approach was applied to the semi-conductor encapsulation case-study performed by Joback (1989) and it was very successful in locating several feasible solutions which satisfied the property constraints. This case study was performed with base-groups that occupy positions along the main-chain only and no independent manipulations of the side-chains was allowed.
The second case study was performed to evaluate the performance of the GA in locating an exact target. For this purpose, three target polymers were chosen (Polyethylene terephthalate, Polyvinylidene copolymer, Poly carbonate) of varying degrees of complexity in structure. A base-group set of four main-chain and four side-chain groups were chosen. This set is reproduced from Venkatasubramanian et. al. (1994) in Table V. The target properties were now exact values with a tolerance specification, and the constraints were closed-ended and hence tighter. The GA-based CAMD was allowed to perform manipulations along the main-chain as well as the side-chain. The fitness function was modified to reflect the closed nature of the target constraints. A statistically significant number of runs were performed and the GA-based CAMD was found to be 100% successful in locating the target molecules when the initial population was a random combination of the base groups. An extension to this case study included a larger set of target polymers and a comparison with the performance of random search on the same problems. The GA-based CAMD outperformed the random search with more than twice the success rate and consistently located some target molecules that the random search failed to identify in any of the runs. The results of this and the previous case studies are given in Venkatasubramanian et. al. (1994). These case studies have demonstrated the ability of the GA to perform efficient search in a combinatorially explosive space of solutions with several local optima.
The advantage of using a GA-based approach, especially for CAMD, is
its ability to provide several near-optimal solutions even when it fails
to converge. From a design point of view, this implies that one has a set
of alternative structures to synthesize and evaluate which are expected
to perform close to their specifications. In fact, some of these alternative
structures could be more attractive for synthesis from an economic perspective.
In addition to this advantage, any GA-based method has the flexibility
of including complex criteria in its objective function and hence use additional
knowledge about the desired designs. These two aspects, and the effectiveness
of GAs for the design of significantly complex molecules, were demonstrated
on a large case study involving 17 main-chain groups and 15 side-chain
groups in Venkatasubramanian et. al (1995). The case study presented considers
nine target polymers and a set of five target properties. Additional chemical
knowledge in the form of restrictions on the bulkiness of the molecules
as well as their stability was incorporated into the objective function.
Venkatasubramanian et. al. clearly demonstrate the ability of the GA-based
CAMD to locate several near optimal solutions of fitness greater than 0.99
in some cases. The results also established an improvement in the performance
of the approach when including additional knowledge. A summary of some
of the results of the case study in Venkatasubramanian et. al.(1995) is
provided in Tables VI, VII and VIII. Table VI gives the target properties
to be designed for by the GA, corresponding to four target polymers. Table
VII summarizes the results of the GA design procedure performed repeatedly
from a random initial population over 25 runs. Table VII also compares
the effect of knowledge augmentation to include complexity and stability
criteria on the success rate with that of the standard GA (each of these
results based on 25 different runs). The results establish an improvement
in performance of the approach when including additional knowledge. It
is also clear that the heuristic procedure offers no guarantee of convergence
to the target as is evident from the results for target polymer TP4 in
Table VII. However, Table VIII shows some of the very high fitness alternatives
to the same set of target properties, as identified by the GA based CAMD
during the different runs. In fact there are around 500 such alternatives
that are usually located by the GA for this target property set. The number
of such high fitness alternatives (fitness
0.99)
is shown against the (c) entry in Table VII for the different polymers.
This feature of locating several near optimal solutions, even when the
GA fails to converge, is one of the important advantages of the GA based
CAMD.
Table VII. Summary of GA runs for four target polymers
|
|
part |
Standard GA Random MC,SC |
Feasible MC Random MC, SC |
Feasible MC Penalize Complex Random MC, SC |
|
TP1:
|
a)
b)
c) |
12% (3)
184
282 |
28% (7)
233
281 |
64% (16)
428
166 |
|
TP2: |
a)
b)
c) |
32% (8)
400
175 |
48% (12)
317
197 |
92% (23)
420
99 |
|
TP3:
|
a)
b)
c) |
68% (17)
210
162 |
76% (19)
147
158 |
96% (24)
81
125 |
|
TP4: |
a)
b)
c) |
0% (0)
-
861 |
0% (0)
-
910 |
0% (0)
-
570 |
Legend : a) Success in locating target percentage (No. of times out of 25 runs)
b) Average generation number when target was located
c) Total number of distinct polymers with fitness greater
than 0.99
Table VIII. Near optimal designs for target polymer #4
| Polymer Design | Overall Error | Fitness |
| Target Polymer: #4
|
0% |
1.0 |
| Case 1: Standard GA
|
0.74% |
0.995 |
![]() |
1.18% |
0.991 |
| Case 2: Modified GA, Stability
|
0.25% |
0.999 |
![]() |
1.10% |
0.991 |
| Case 3: Modified GA, Stability & Complexity
|
0.21% |
0.999 |
![]() |
0.83% | 0.995 |
The genetic algorithm approach suffers from two main drawbacks. One is the heuristic nature of the search and that there is no guarantee of finding the target solution. But then, this criticism would be applicable to the other heuristic approaches as well. The other drawback is that the selection of the genetic design parameter values would require some experimentation.
While the GA-based design framework performed quite well under difficult
circumstances, it failed to locate the target in a few instances. To gain
a better understanding of the nature of the search space and the mechanics
of GA's search which would shed some light on the success and failure of
genetic algorithms, we recently carried out certain investigations to characterize
and analyze the search space for molecular design. The rest of this paper
discusses these results.
CHARACTERIZATION OF THE SEARCH SPACE
The key to understanding the characteristics of the search space is
to understand the correlation between the structural similarity of molecules
and the closeness in their fitness values. Towards this end, a target molecule
(Polyvinylidene Copolymer) was chosen from the case study in Venkatasubramanian
et. al (1994) along with the set of four main-chain and side chain groups
defined in the study. The GA-based CAMD located this target in all the
25 runs as reported in Venkatasubramanian et. al. (1994). The properties
of this target are invariant to the ordering of the base groups. Every
possible combination (feasible) of the base groups (both main-chain and
side-chain) within a certain maximum length (7 in this case) of the monomer
unit was generated and its fitness evaluated. Figure 11 shows the distribution
of fitness over a portion of the search space. It is clear from this figure
that the search space is hostile with several local optima. Figure 12 and
Table IX show the histogram and the breakdown of the distribution in fitness
respectively.
Figure 11 Distribution of fitness in combinatorial search space
Figure 12 Histogram of fitness distribution
Super-Groups Representation and Pseudo-Hamming Distance
To structurally characterize the search space, each molecule was identified by a coordinate in the space spanned by the base groups and its distance from the target molecule was determined. The coordinate structure of a molecule is generated based on a set of "super-groups". These super-groups are constructed from the set of base groups chosen for the design. The idea behind the generation of these base groups is to map the essentially two dimensional structure of a molecule i.e. along the main-chain and along the side-chain, into a one-dimensional structure that contains the super-groups along the main-chain only. This simplifies the analysis and interpretation of the search space characteristics. The super-grouping is achieved by generating all possible combinations of every main-chain group that can take side-chains, with the side-chain group set to yield a set of super-groups that only occur along the main-chain. Combinations of these super-groups create the different molecules. Table V shows the set of base groups used for the case study, reproduced from Venkatasubramanian et. al. (1994). It is clear, from the valency of the main-chain groups that only >C< can accommodate side-chains. Hence, the set of super-groups includes the three main-chain groups C6, COO, O and all combinations of >C< with the four side-chain groups. The set of super-groups is listed in Table X with a number assigned to each. The molecular coordinate vector has 13 dimensions corresponding to the 13 super-groups and the value of each coordinate of the vector is the number of occurrences of that particular super-group in a molecule. Figure 13 gives an illustrative example for the construction of the coordinate. Once, these coordinates are constructed the distance (dM) of a molecule M from the target molecule T is calculated as :
(3)
where, DiM is the ith coordinate of a molecule M and DiT is the ith coordinate of the target molecule.
Molecule: 
Super-group Notation : -C6-CH2-CH2-C6-CHCH3-CHF-
Super-group (Super-group #, No. of Occurences): C6(11,2);CH2(1,2);CHCH3(2,1);CHF (3,1)
Coordinate Vector : (2 1 1 0 0 0 0 0 0 0 2 0 0)
Figure 13 Illustrative example for coordinate vector construction
In this manner the entire search space is characterized in terms of
these "pseudo-Hamming" distances from the target molecule. The
distance (d) of a particular molecule M from the target is the number of
flips of the super-groups including additions and deletions, required to
get to the target molecule from the structure of molecule M. Hence, molecules
that are structurally similar will have a small value of d and those that
are structurally dissimilar will have larger values. Once the distances
of the molecules from the target were obtained, the average and maximum
fitness of a group of molecules at a certain distance (d) away from the
target was plotted as a function of d. Figure 14 shows the average and
maximum fitness of a cluster d units away from the target , the distance
d plotted on the X-axis. In this particular case, the property values of
the target molecule (and hence the overall fitness) are independent of
the ordering of the groups in the molecule and hence the coordinate vector
constructed in the above manner uniquely identifies every distinct solution
generated. Thus, for order-invariant molecules, for any given fitness value
there might exist several molecular configurations that have the same groups
but with different group orderings. This means that the number of configurations
investigated by the GA will be much more than the number of different fitness
points seen in the search space by the GA. For instance in this case study,
there are about 77,000 different fitness points whereas the search space
contains about 10 million different molecular configurations.
Figure 14 Fitness as a function of distance from target molecule
It is clear from Figure 12 and Table IX that the search space is hostile
with only a small fraction of the molecules having fitness greater than
0.90 (21 molecules out of about 78000 and these are randomly dispersed
within the search space. However, Figure 14 clearly shows that the gradation
in the fitness follows a very clear pattern w.r.t. the distance from the
target. Cast in the coordinate space generated by the super-groups, the
average fitness follows a concave path to the global optimum. Similarly,
the maximum fitness values also show a decreasing trend with respect to
fitness though there are small local deviations from this trend. This implies,
that molecules that are structurally similar to the target in the coordinate
space are also of very high average fitness. Thus, the identification of
highly fit regions of the search space is very likely to lead to molecules
that are structurally similar to the target. Hence, the natural process
of exploitation of the solution characteristics by the GA combined with
the survival of the fittest paradigm may be hypothesized as leading to
the convergence of the solution to the target. It is clear from this exercise
that there is a need to characterize the structure of the search space
in order to gain a deeper understanding of the convergence properties of
the GA. In addition, knowledge of the search landscape would help to enhance
the performance of the GA-based CAMD by a judicious selection and use of
various operators, their probabilities etc. Even though this investigation
has shed some light on the nature of exploitation of the search space by
the GA, it is still far from clear and further explorations are needed
to gain a deeper understanding.
AN INTERACTIVE FRAMEWORK FOR EVOLUTIONARY DESIGN
Designing complex molecules with desired properties is a very challenging problem for industrial-scale molecular design applications. Most of the past approaches for the inverse problem are not satisfactory as they get bogged down by the reasons discussed earlier. Similarly, the forward problem also poses challenges for complex molecules with nonlinear structure-activity relationships. In our past studies on using genetic algorithms summarized above, we had used group contribution methods for the forward problem as they were found to be adequate for our needs. For more complex design situations, we advocate the use of a neural network approach for the forward problem, while using the genetic algorithm for the inverse problem. This proposed model of CAMD is shown in Figure 15.
Figure 15 Proposed
framework for CAMD
While our investigations have shown that both the neural networks and genetic algorithms show great promise in tackling the difficult molecular design problems, we realize that the real-life CAMD problems are so difficult and complex that a successful approach would require a combination of several techniques integrated into a unified, interactive, framework. For instance, the basic genetic design approach can be augmented with additional knowledge. A wealth of information may be available for the CAMD process as such, in terms of the groups to be changed and the combinations to be avoided or favored while constructing solutions for a particular set of target properties. This knowledge is acquired from prior experience with similar problems, and such knowledge may also reside with the expert user. Better moves could be constructed into more promising regions of the search space based on prior experience about the local structure of the search space. Experiential knowledge-based focus of the search procedure could be obtained by incorporating a knowledge-base of rules for the manipulation of structures and even suggestions for operator probabilities. On the other hand, incorporation of user knowledge could be accomplished by an interactive design process which communicates with the user and receives suggestions and directions regarding, the structures to delete or add to the current population, as well as the kind of changes desired in terms of the operators. The input from the user is used to guide the search to the more optimal solutions. The speed of the GA-based CAMD could be enhanced by augmenting it with other methodologies such as math programming methods. As mentioned earlier, the GAs by virtue of their global search characteristics can identify regions in the solution space which may contain globally optimal designs. However, lack of use of local information might prolong the convergence. Math programming methods on the other hand, by virtue of their greater focus during local search could converge in on the exact solution based on the local gradient information and using the GA generated designs as their starting points.
Figure 16 shows an interactive design framework that is under development.
In this object-oriented framework, the GA-based design engine would be
coupled with expert systems as well as nonlinear programming methodologies
to improve the effectiveness of the overall design process. We also envision
such a system as an interactive one, where the molecular designer actively
participates in the design process.
CONCLUSIONS
Computer-aided design of molecules with desired properties is a challenging problem due to a number features such as combinatorial complexity and nonlinearity of the search space, difficulties in knowledge acquisition and incorporation of high-level chemistry-related constraints and so on. To circumvent these difficulties, we propose a CAMD framework that uses neural networks for the properties prediction problem and genetic algorithms for the inverse problem. We discuss the merits and demerits of these approaches along with results on some case studies.
Figure
16 Interactive Framework for evolutionary molecular design
While the GA-based design framework performed quite well under difficult circumstances, it failed to locate the target in a few instances. To gain a better understanding of the nature of the search space and the mechanics of GA's search which would shed some light on the success and failure of genetic algorithms, we explored the search space in detail for one of the molecular case studies in an attempt to analyze and characterize its features. This analysis showed that the search space is indeed quite unfriendly with numerous local optima which will trap many of the standard methods such as gradient-ascent. It was found to exhibit sharp discontinuities, and that only a very small fraction of the search space contained molecules of high fitness. A pseudo-Hamming distance analysis using 'super-groups' suggested that the average fitness and the maximum fitness of the molecules increased as they approached the target, a feature which the genetic algorithm exploits through its 'survival of the fittest' principle to carry out an effective search. Even though these investigations have demonstrated the promise of genetic algorithms for CAMD and have shed some light on the nature of its exploitation of the search space, many aspects still remain less clear and further explorations are needed to gain a deeper understanding.
We realize that the real-life CAMD problems are so difficult and complex
that a successful approach would require a combination of several techniques
integrated into a unified, interactive, framework. We present one such
framework with the GA-based design engine integrated with expert systems
and math programming methodologies to improve the effectiveness of the
overall design process. We also envision such a system as an interactive
one, where the molecular designer actively participates in the design process.
REFERENCES
Aoyama, A. and Venkatasubramanian, V. (1995). Internal Model Control
Framework Using Neural Networks for the Modeling and Control of a Bio-reactor.
Engng. Applic. Artif. Intell., In press.
Aoyama, T., Suzuki, Y., and Ichikawa, H. (1990). Neural Networks Applied
to Quantitative Structure-Activity Relationship. J. Med. Chem.,
33(3), 905-908.
Bolis, G., Pace, L.D., and Fabrocini, F. (1991). A Machine Learning
Approach to Computer-Aided Molecular Design. Journal of Computer-Aided
Molecular Design, 5, 617-628 .
Brignole, E.A., Bottlini, S., and Gani R. (1986). A Strategy for Design
and Selection of Solvents for Separation Processes. Fluid Phase Equilibria,
29, 125.
Burkert, U., and Allinger, N.L. (1982). Molecular Mechanics.
ACS Monograph 177, American Chemical Society, Washington, DC .
Cybenko, G. (1988). Approximation by superpositions of sigmoidal functions.
Math. of Cont. Syst. Sig., 2, 303-314.
Derringer, G.C., and Markham, R.L. (1985). A Computer-Based Methodology
for Matching Polymer Structure with Required Properties. Journal of
Applied Polymer Science, 30, 4609-4617 .
Elrod, D.W., Maggiora, G.M., and Trenary, R.G. (1990). Applications
of Neural Networks in Chemistry. 1. Prediction of Electrophilic Aromatic
Substitution Reactions. J. Chem. Inf. Comput. Sci., 30(4),
477-484.
Gani, R., and Brignole, E.A. (1983). Molecular Design of Solvents for
Liquid Extraction Based on UNIFAC. Fluid Phase Equilibria, 13,
331-340 .
Gani, R., Nielsen, B., and Fredenslund, Aa. (1991). A Group Contribution
Approach to Computer-Aided Molecular Design. AIChE Journal, 37(9),
1318-1332 .
Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization,
and Machine Learning. Addison-Wesley.
Hibbert, D.B. (1993). Genetic Algorithm in Chemistry. Chemometrics
and Intelligent Laboratory Systems, 19, 277-293 .
Holland, J.H. (1975). Adaptation in Natural and Artificial Systems.
The University of Michigan Press, Ann Arbor.
Hopfinger, A.J., Koehler, M.G., and Pearlstein, R.A. (1988). Molecular
Modeling of Polymers. IV. Estimation of Glass Transition Temperatures.
J. Polym. Sci.: Part B: Polym. Phys., 26, 2007-2028 .
Joback, K.G., and Reid, R.C. (1987). Estimation of Pure-Component Properties
from Group Contributions. Chem. Eng. Comm., 57, 233-243 .
Joback, K.G. (1989). Designing Molecules Possessing Desired Physical
Property Values, Volume 1 & 2. Ph.D. Thesis in Chemical Engineering,
MIT .
Joback, K.G., and Stephanopoulos, G. (1989). Designing Molecules Possessing Desired
Physical Property Values. In, Proceedings of FOCAPD '89. Snowmass,
CO. 363-387.
Jurs, P.C., and Isenhour, T.L. (1975). Chemical Applications of Pattern
Recognition. Wiley & Sons, New York, NY .
Kier, L.B., and Hall, L.H. (1976). Molecular Connectivity in Chemistry
and Drug Research. Academic Press, New York .
Kier, L.B. and Hall, L.H. (1986). Molecular Connectivity in Structure-Activity
Analysis, Research Studies Press, Letchworth, Hertfordshire, U.K.
Kier, L.B., Lowell, H.H. and Frazer, J.F. (1993). Design of Molecules
from Quantitative Structure-Activity Relationship Models. 1. Information
Transfer between Path and Vertex Degree Counts. J. Chem. Inf. Comput.
Sci., 33, 142-147 .
Koehler, M.G., and Hopfinger, A.J. (1989). Molecular Modeling of Polymers:
5. Inclusion of Intermolecular Energetics in Estimating Glass and Crystal-Melt
Transition Temperatures. Polymer, 30, 116-126 .
Macchietto, S., Odele, O. and Omatsone, O. (1990). Design of Optimal
Solvents for Liquid-Liquid Extraction and Gas Absorption Processes. Trans.
Ind. ChemE., 68, Part A, 429-433.
Orchard, B.J., Tripathy, S.K., Pearlstein, R.A., and Hopfinger, A.J.
(1987). Molecular Modeling of Polymer: I. Corrected and Efficient Enumeration
of the Intrachain Conformational Energetics. J. Comput. Chem., 8(1),
28-38 .
Psichogios, D.C. and Ungar, L.H. (1992). A Hybrid Neural Network - First
Principles Approach to Process Modeling. AIChE Journal., 38,
1499-1512.
Qian, N., and Sejnowski, T.J. (1988). Predicting the Secondary Structure
of Globular Proteins using Neural Network Models. J. Mol. Biol.,
202, 865-884 .
Reid, R.C., Prausnitz, J.M., and Sherwood, T.K. (1977). The Properties
of Gases and Liquids. McGraw-Hill, New York .
Reid, R.C., Prausnitz, J.M. and Poling, B.E. (1987). The Properties
of Gases and Liquids. McGraw-Hill, New York .
Rose, V.S., Macfie, H.J.H., Croall, I.F. (1991). Pharmacochem. Libr.
16, 213-216 (1991).
Robb, E.W., and Munk, M.E., (1990). A Neural Network Approach to Infrared
Spectrum Interpretation. Mikrochim Acta, 31-155.
Skvortsova, M.I., Baskin, I.I., Slovokhotova, O.L., Palyulin, V.A., and Zefirov, N.S. (1993), Inverse Problem in QSAR/QSPR Studies for the Case of Topological Indices Characterizing
Molecular Shape (Kier Indices). J. Chem. Inf. Comput. Sci., 33(4),
630-634 .
Sumpter, B.G., Gettino, C., Noid, D.W. (1994). Theory and Applications
of Neural Computing in Chemical Science. Annu. Rev. Phys. Chem.,
45, 439-481
Sumpter, B.G., Noid, D.W. (1994). Makromol Chem. Theory Simul..
3, 363-378
Strouf, O. (1986). Chemical Pattern Recognition. Research Studies
Press, Letchworth .
Van Krevelen, D.W., and Hoftyzer, P.J. (1972). Properties of Polymers,
their Estimation and Correlation with Chemical Structure. First Edition,
Elsevier Scientific.
Van Krevelen, D.W., and Hoftyzer, P.J. (1976). Properties of Polymers,
their Estimation and Correlation with Chemical Structure, Second Edition,
Elsevier Scientific
Van Krevelen, D.W., and Hoftyzer, P.J. (1990). Properties of Polymers,
their Estimation and Correlation with Chemical Structure. Third Edition,
Elsevier Scientific.
Venkatasubramanian, V., Chan, K., and Caruthers, J.M. (1994). Computer-Aided
Molecular Design Using Genetic Algorithms. Computers Chem. Engng
, 18, No. 9, 833-844.
Venkatasubramanian, V., Chan, K., and Caruthers, J.M. (1995). Evolutionary
Design of Molecules with Desired Properties using the Genetic Algorithm.
J. Chem. Inf. Comput. Sci., 35, 188-195
Wythoff, B.J., Levine, S.P., and Tomellini, S.A. (1990). Anal. Chem.,
62, 2702-2709.
Zupan, J., and Gasteiger, J. (1991). Neural Networks: A New Method for Solving Chemical Problems or Just a Passing Phase ?, Technical Report, TU Munchen, D-8046, Garching, Organisch-Chemisches Institute Germany