Practical Guide to

Chemometrics Research Group

Naval Research Laboratory

Chemistry Division, Code 6110

Washington, DC 20375

I have several years of experience implementing and using genetic algorithms to solve optimization problems in analytical chemistry. I have found them to be quite powerful and very interesting to study. To spur further interest in genetic algorithms, I have written this tutorial titled "Practical Guide to Genetic Algorithms." This guide covers what types of analytical chemistry problems genetic algorithms are well-suited for, what their known weaknesses are, how they work, and why they work. I have also written a genetic algorithm demo in MATLAB which can be downloaded. Near the bottom of this page I have listed some interesting links related to genetic algorithms as well as some references to applications of genetic algorithms in chemistry. If you have any questions about the material on the page, suggestions for improvements, corrections, or additional references or links please feel free to send me email: shaffer@ccf.nrl.navy.mil

Introduction

Genetic algorithms (GAs) are optimization techniques based on the concepts of natural selection and genetics. In this approach, the variables are represented as genes on a chromosome. GAs feature a group of candidate solutions (population) on the response surface. Through natural selection and the genetic operators, mutation and recombination, chromosomes with better fitness are found. Natural selection guarantees that chromosomes with the best fitness will propagate in future populations. Using the recombination operator, the GA combines genes from two parent chromosomes to form two new chromosomes (children) that have a high probability of having better fitness than their parents. Mutation allows new areas of the response surface to be explored. GAs offer a generational improvement in the fitness of the chromosomes and after many generations will create chromosomes containing the optimized variable settings.

Advantages to using genetic algorithms

They require no knowledge or gradient information about the response surface
Discontinuities present on the response surface have little effect on overall optimization performance
They are resistant to becoming trapped in local optima
They perform very well for large-scale optimization problems
Can be employed for a wide variety of optimization problems

Disadvantages to using genetic algorithms

Have trouble finding the exact global optimum
Require large number of response (fitness) function evaluations
Configuration is not straightforward

How do genetic algorithms work?

The basic or simple GA comprises four important steps (GA flowchart):

The initial population of chromosomes is created either randomly or by perturbing an input chromosome. In my opinion, how the initialization is done is not critical as long as the initial population spans a wide range of variable settings (i.e., has a diverse population). Thus, if you have explicit knowledge about the system being optimized that information can be included in the initial population. For example, in the wavelength selection problem in infrared spectroscopy, you can initially turn on the wavelengths which correspond to known absorption bands of the analyte of interest on the first chromosome in the population. Then, all ensuing chromosomes in the initial population can be generated by randomly perturbing the input chromosome. Thus, your initial population is still diverse, yet your optimization will not waste time exploring regions of the response surface that will not yield good calibration models (i.e., choose poor wavelengths).
In the second step, evaluation, the fitness is computed. In the GA demo, this consists of either the simple function (summation of genes) or the more complicated Bohachevsky function. The goal of the fitness function is to numerically encode the performance of the chromosome. For real-world applications of optimization methods such as GAs the choice of the fitness function is the most critical step.
The third step is the exploitation or natural selection step. In this step, the chromosomes with the largest fitness scores are placed one or more times into a mating subset in a semi-random fashion. Chromosomes with low fitness scores are removed from the population. There are several methods for performing exploitation. One of the most common methods and the one employed in the GA demo is the binary tournament mating subset selection method. In this mating selection method, each chromosome in the population competes for a position in the mating subset. Two chromosomes are drawn at random from the population, the chromosome with the highest fitness score is placed in the mating subset. Both chromosomes are returned to the population and another tournament begins. This procedure continues until the mating subset is full. A characteristic of this scheme is that the worst chromosome in the population will never be selected for inclusion in the mating subset.
The fourth step, exploration, consists of the recombination and mutation operators. Two chromosomes (parents) from the mating subset are randomly selected to be mated. The probability that these chromosomes are recombined (mated) is a user-controlled option and is usually set to a high value (e.g., 0.95). If the parents are allowed to mate, a recombination operator is employed to exchange genes between the two parents to produce two children. If they are not allowed to mate, the parents are placed into the next generation unchanged. The two most common recombination operators are the one-point and two-point crossover methods. In the one-point method, a crossover point is selected along the chromosome and the genes up to that point are swapped between the two parents. In the two-point method, two crossover points are selected and the genes between the two points are swapped. The children then replace the parents in the next generation. A third recombination operator, which has become quite popular recently, is the uniform crossover method. In this crossover method, recombination is applied to the individual genes in the chromosome. If crossover is performed, the genes between the parents are swapped and if no crossover is performed the genes are left intact. This crossover method has a higher probability of producing children which are much different than their parents so the probability of recombination is usually set to a low value (i.e. 0.1). The probability that a mutation will occur is another user-controlled option and is usually set to a low value (e.g., 0.01) so that good chromosomes are not destroyed. A mutation simply changes the value for a particular gene.

After the exploration step, the population is full of newly created chromosomes (children) and steps two through four are repeated. This process continues for a fixed number of generations.

How do I choose what genetic representation to use in my application?

The representation or coding of the variables being optimized has a large impact on search performance, as the optimization is performed on this representation of the variables. Lucasius and Kateman (Chemometrics and Intelligent Laboratory Systems, 25, 1994, 99-145) have defined three classes of optimization problems. Two of these, SUB and NUM, are commonly found in analytical chemistry applications. Optimization problems classified as SUB involve subset selection or feature selection problems such as wavelength selection in infrared spectroscopy. The more commonly found classification of optimization problems, termed NUM, involve real number variables and include such applications as nonlinear model building, determining rate constants, and the positioning of weight vectors for nonparametric linear discriminant analysis. The most suitable choice of representation is based upon the type of application. The two most common representations, binary and real number codings, differ mainly in how the recombination and mutation operators are performed and will be discussed below. Click here to see a figure showing several types of GA representations.

Most of the developmental work of GA theory was performed using a binary-coded GA and historically is the most widely used representation. In a binary coding each chromosome is vector comprised of zeroes and ones with each bit representing a gene. For SUB-type optimization problems each gene codes for a particular feature. Thus, in the wavelength selection problem, each gene corresponds to one wavelength or a group of wavelengths. When that gene has a value of one that wavelength or group of wavelengths is included into the calibration model. For optimization problems with real number variables (NUM), binary coding is implemented by converting the floating point numerical value for each variable to bit string of fixed size and then converting the bit string back to a floating point number for use in computing the fitness. This has the effect of mapping the floating point number onto a finite set of discrete values determined by the number of bits used in the binary representation. To formulate the chromosome for optimization, the bit string is concatenated with the bit strings for the other variables to form one long binary string. Within this binary string, the genes of the chromosome can be defined in one of two ways. The significance of this definition is observed in how the recombination operators are applied.

Using the terminology offered by Lucasius and Kateman (Chemometrics and Intelligent Laboratory Systems, 25, 1994, 99-145), in B_MX (binary multi-point [general term encompassing both the one and two-point methods] crossover) crossover the genes are simply the ones and zeroes and a crossover point may occur in the middle of the representation for a real number variable. B_MX' crossover only allows crossover points to occur at the boundaries (breakpoints) of the variable representations. While both crossover methods can be employed for NUM-type problems, only B_MX crossover is employed for SUB-type problems. During a recent survey of the GA literature, I found that B_MX crossover is employed more frequently than B_MX' crossover (see for example, Janikow and Michalewicz, Proceeding of the 4th International Conference on Genetic Algorithms, pages 31-36). Uniform crossover can be performed by either method (B_UX or B_UX'). For further discussion, the interested reader is referred to the review paper by Lucasius and Kateman cited above.

The second representation and the more conceptually simple of the two is the real number representation. In this approach each variable being optimized is represented as a conventional floating point number. Thus, a one gene/one variable relationship is present. Crossover (recombination) is performed as discussed above by simply swapping a contiguous set of floating point values for the genes between the crossover points. Mutation is performed by adding a Gaussian-distributed random deviate scaled by a step size to the floating point value for each variable. A third representation, but related to real number representation, is the integer-coding. In an integer coding, the floating point value in the real number coding is replaced with an integer. The only practical difference between real number codings and integer coding is in the mutation operator. Mutation for integer codings is performed analogous to real number coding except that after mutation the value for that gene is rounded to the nearest integer.

Decisions regarding the suitability of binary coding versus real number codings are complicated by the fact that many researchers who specialize in GAs are divided on this issue. The theoretical aspects of binary codings have been well developed, while the theoretic aspects of real number codings are still being argued. As discussed above, some applications dictate which representation should be used, however NUM-type problems represent a difficult decision. The generally accepted advantages of binary codings are that the theoretical aspects are well developed and that the binary coding reduces the size of the optimization space. While a good theoretical understanding of the representation is not a prerequisite to its use and does not guarantee good search performance, it can be of help in rationalizing the performance of the GA in an effort toward improving it. Using a fixed number of bits to represent a real number forces the real number to have a fixed resolution, thus reducing the possible number of values that it can take. A disadvantage of binary coding is that there is overhead associated with converting a floating point number to a binary string and back for each fitness evaluation. The primary advantages of real number codings are its conceptual simplicity and that it no longer requires the conversion between the binary strings and the floating point values for the variables being optimized, thus resulting in no loss of resolution and the elimination of some computation overhead. My experience (see for example, Shaffer and Small, Chemometrics and Intelligent Laboratory Systems, 35, 1996,87-104) as well as some other GA researchers (Hibbert, Chemometrics and Intelligent Laboratory Systems, 19, 1993, 319-329) has been that in many cases a real number representation will perform better for NUM-type problems but many other scientists report other conclusions.

David Goldberg offers this advice in his 1989 textbook on genetic algorithms:

"The user should select a coding so that short, low-order schemata are relevant to the underlying problem and relatively unrelated to schemata over other fixed positions"

and

"The user should select the smallest alphabet that permits a natural expression of the problem"

The primary meaning behind these statements is that the proper choice of genetic representation is problem-dependent and must obey the schema theorem (see next section). If your application has a natural binary representation then use a binary representation and if your application consists of integer variables then an integer representation may be appropriate. Above all, be flexible about how you choose the representation. Just because a certain representation worked well for one application does not mean it will work well for another application.

Figure depicting binary and real number GA codings

Figure depicting binary crossover methods for NUM-type problems

Why do genetic algorithms work?

The search heuristics of a GA are based upon Holland's schema theorem. The mathematics of this theorem were developed using the binary representation, although recent work (see for example, Eshelmann and Schaffer, Foundations of Genetic Algorithms 2, pages 5-17) has now extended it to include integer and real number representations. In the following section, a brief non-mathematical introduction of the schema theorem will be given assuming a binary representation.

A schema (H) is defined as a template for describing a subset of chromosomes with similar sections. The template consists of multiple 0's, 1's, and meta-characters or "don't care" symbols (#). The meta-character is simply a notational device used to signify that either a 1 or 0 will match that pattern. For example, consider a schema such as, #0000. This schema matches two chromosomes, 10000 and 00000. The template is a powerful way of describing similarities among patterns in the chromosomes. The total number of schemata present in a chromosome of length L is equal to 3^L. According to Holland, the order of a schema (o(H)) is equal to the number of fixed positions (i.e., non-meta-characters) and the defining length of a schema (L(H)) is the total number of characters. Thus, the schema #00#0 is an order 3 schema (o(H) = 3) and has a length of 5 (L(H) = 5). Holland derived an expression that predicts the number of copies a particular schema, H, would have in the next generation after undergoing exploitation, recombination and mutation. This expression is shown below

m(H,t+1) >= m(H,t) - f(H)/f [1 - (P_r(L(H)/(t-1))-o(H)P_m]

where H is a particular schema, t is the generation, m(H,t+1) is the number of times a particular schema is expected in the next generation, m(H,t) is the number of times the schema is in the current generation, f(H) is the average fitness of all chromosomes that contain schema H, f is the average fitness for all chromosomes, P_r is the probability of recombination occurring, and P_m is the mutation probability. The primary conclusion that can be drawn from inspection of this equation is that as the ratio of f(H) to f becomes larger, the number of the times H is expected in the next generation increases. Thus, particularly good schemata will propagate in future generations.

Two more points need to be made concerning Holland's schema theorem. Although both mutation and recombination destroy existing schemata, they are necessary for building better ones. The degree to which they are destroyed is dependent upon the order (o(H)) and the length (L(H)) of the schemata. Thus, schemata that are low-order, well-defined, and have above average fitness are preferred and are termed "building blocks". This definition leads to the building block principle of GAs which states that there is a high probability that low-order, well-defined, average fitness schemata will combine through recombination to form higher order, above average fitness schemata. Recombination is critical because it is the only procedure by which building blocks located on different sections of a chromosome can be combined onto the same chromosome. By employing the concept of building blocks, the complexity of the problem is reduced. Instead of trying to find a few large-order schemata by chance, small pieces of the chromosome that are important (i.e., building blocks) are combined with other important small pieces to produce over many generations an optimized chromosome. Because the GA has the ability to process many schemata in a given generation, GAs are said to have the property of "implicit parallelism", thereby making them an efficient optimization algorithm.

Should I use genetic algorithms for my optimization problem?

The answer to this question is quite difficult. Most scientists feel that there is no 'optimal' optimization method. There was a discussion recently on the Usenet newsgroup comp.ai.genetic discussing the so-called "No Free Lunch (NFL) theorem". [More information about this discussion can be found here] In a nutshell, Macready and Wolpert have shown that there are no "free lunches" when it comes to optimization. Used blindly, there is an equal chance that any optimization technique (simulated annealing, genetic algorithms, gradient ascent, Simplex optimization, Monte Carlo, etc.) will perform the same. The key is to use your knowledge of the system being optimized to choose the best approach.

Many scientists have written about the advantages of GAs (see section above "Why genetic algorithms") compared to other optimization methods. However, when you read articles from practitioners of these other optimization methods the advantages do not seem so strong. Thus, I have made an honest attempt at characterizing what types of applications I have found GAs to be a good choice and some where I have found them to be a poor choice. It is important to note that even for the types of applications where I suggest that other optimization methods will probably perform better, with enough experimenting with the configuration, good search performance may be obtained using a GA. Of course, getting any optimization method to work in an optimal manner requires some experimenting with the configuration or the optimization parameters for that technique and poor choices may result in poor search performance.

When to use genetic algorithms

When not much is known about the response surface and computing the gradient is either computationally intensive or numerically unstable many scientist prefer to use optimization methods such as genetic algorithms, simulated annealing, and Simplex optimization which do not require gradient information. One of the reasons I prefer to use genetic algorithms is their versatility. Using my knowledge about the system I can tailor the algorithm for a particular application. If the application calls for an optimization method with hill-climbing characteristics the algorithm can be modified by using an elitist strategy. If becoming trapped in local optima is a problem, mutation can be increased. Thus, while there is no guarantee that GAs will perform the best for a particular application I can usually change some aspect of the genetic configuration or use different genetic operator to achieve adequate search performance.

Another feature about GAs that I take advantage of is that GAs do not optimize directly on the variables but on their representations. For SUB-type applications such as wavelength selection having the chromosome coded such that each gene represents a particular wavelength or group of wavelengths is particularly useful. My own experience has been that for SUB-type problems GAs are an excellent choice.

What about applications where the variables being optimized are very different from each other (i.e. a mixture of integers, binary values, and floating points numbers)? I have found that GAs are an excellent choice for these types of applications. The GA configuration can be modified to include different mutation operators for different sections of the chromosome. For example, in recent work (Bangalore, Shaffer, and Small, manuscript in progress), the chromosome included sections that represented binary values and another section for an integer variable. We were attempting to perform both wavelength selection as well as optimize the number of latent variables employed in a multivariate calibration model using an infrared spectroscopy measurement. We hypothesized that including both types of variables on the chromosome would allow the GA to find combinations of wavelengths and numbers of latent variables which would yield better prediction performance than optimizing the wavelengths separately. The results from the mixed-representation were very promising.

When not to use genetic algorithms

For applications where the calculation of the gradient vector is numerically precise and fast, I recommend using some form of gradient descent or Powell's method (See Press, W.H.; Flannery, B.P.; Teukolsly, S.A.;Vetterling, W.T. Numerical Recipes; Cambridge University Press: Cambridge, 1986 for more details). GAs will work for these types of applications but will reach the optimal region much slower than hill-climbing methods.

Applications which require that the exact global optimum be found may be a challenge for a GA. GAs are best at reaching the global optimum region but sometimes have trouble reaching the exact optimum location. Many researchers use GAs to get close to the optimal region then switch to another method for final exploration.

One of the most commonly cited difficulties with GAs is that compared to hill-climbing techniques they generally require more response (fitness) function evaluations. If the response surface is quite smooth then a hill-climbing method such as Simplex optimization will outperform a GA for a given number of evaluations.

Unfortunately, there are certain optimization problems, termed "GA-hard", which present a difficult challenge to GAs. One of the main areas of research in GAs is the study of these types of applications and to develop methods of determining beforehand whether an optimization problem is GA-hard. Only recently have GA theoreticians been able to understand some of the more common causes of GA-hardness.

One cause of GA-hardness occurs when the genes on the chromosome are not ordered properly and the defining length of the schemata, L(H), are too large for the GA to process them properly. According to the schema theorem, schemata of longer length have a higher probability of being destroyed by mutation and recombination. A reordering scheme where the 1's and 0's on the schema are closer together on the chromosome would prevent this from occurring by reducing L(H). David Goldberg has developed an automated method of re-ordering chromosomes called a messy GA ("Messy Genetic Algorithms Revisited: Studies in Mixed Size and Scale", Complex Systems, 4, 1990, 415-444, and IlliGAl publications).

As discussed under the section title "How Genetic Algorithms work?", for a GA to optimize effectively the genetic algorithm must follow the schema theorem. My experience has been that for a GA to work you must be able to state something about the whole only by knowing its parts. In the context of a GA, the parts are the building blocks and the whole is the fitness function score for the chromosome. A really nice paper by Yuval Davidor (Foundations of Genetic Algorithms (FOGA) 1991) discusses the concept of epistasis. Davidor defines epistasis as interdependencies among the variables. He hypothesized that epistasis was important in determining whether a particular optimization space would follow the building block principle. If the epistasis was small, no variable (gene) is affected by the values of the other variables. When an optimization space is highly epistatic, too many variables are dependent on each other and the building blocks (or schema) become of very high order. Davidor hypothesized that GAs perform most optimally between the two extremes.

In my own research I came across an optimization problem which I would classify as GA-hard (Shaffer and Small, Analytica Chimica Acta, 331,1996,157-175). The results from that study showed that for optimization problems where a good starting point was available, the variables are highly correlated (i.e., the order of the building blocks was large), and a limited number of response (fitness) evaluations was available Simplex optimization appears to outperform both simulated annealing and GAs. However, this further justifies the point that to choose the best optimization method for your application requires some knowledge about the system.

Configuring a genetic algorithm

One of the more challenging aspects of using genetic algorithms is to choose the configuration parameter settings. Discussion of GA theory provides little guidance for proper selection of the settings. Several research papers have been published to fill that void and are listed in the reference section of the webpage. My experience has been that the population size, the mutation rate, and the type of recombination have the largest effect on search performance. Some guidelines that I use in selecting these parameter settings are given in the discussion below.

Population Size

The population size dictates the number of chromosomes in the population. Larger population sizes increase the amount of variation present in the initial population at the expense of requiring more fitness evaluations. I have found that the best population size is both application dependent and related to the length of the chromosome. A good population of chromosomes contains a diverse selection of potential building blocks resulting in better exploration. If the population loses diversity the population is said to have premature convergence and little exploration is being done. For longer chromosomes and challenging optimization problems, larger population sizes are needed to maintain diversity (higher diversity can also be achieved through higher mutation rates and uniform crossover) and hence better exploration. Many researchers suggest population sizes between 25 and 100. For a more detailed discussion of the optimal population size see IlliGAl publications.

Mutation rate

Mutation rate determines the probability that a mutation will occur. Mutation is employed to give new information to the population (uncover new building blocks) and also prevents the population from becoming saturated with similar chromosomes (premature convergence). Large mutation rates increase the probability that good schemata will be destroyed, but increase population diversity. The best mutation rate is application dependent but for most applications is between 0.001 and 0.1.

Some researchers have published "rules of thumb" for choosing the best mutation rate based on the length of the chromosome and the population size. DeJong suggested that the mutation rate be inversely proportional to the population size. Hessner and Manner suggest that the optimal mutation rate is approximately

(M * L^1/2)^-1

where M is the population size and L is the length of the chromosome. For these references and other related ones see the reference list at the end of this webpage.

Type of recombination

Although this guide only covers the basic crossover types, there are several other choices available to the GA user. The textbook by Lawrence Davis "Handbook of Genetic Algorithms" has a practical treatment of the basic crossover methods as well as some of the more application-specific crossover methods. My personal recommendation is to try each method out and choose the one that performs the best in conjunction with your other configuration parameter settings.

How do I improve search performance?

Although GAs are conceptually simple, to the newcomer the number of configuration choices can be overwhelming. Once the scientist using a GA for the first time receives less than satisfactory results there are several steps which can be taken to improve search performance. In the next several paragraphs I will outline some of the common ways in which experienced researchers modify the algorithm to improve search performance.

The first approach to improving search performance is to simply use different values for mutation rates, population size, etc.. Many times, search performance can be improved by making the optimization more stochastic or becoming more hill-climbing in nature. This trial and error approach, although time-consuming, will usually result in improved search performance. For practical suggestions on GA configuration parameter settings see the section on "Configuring a genetic algorithm". If changing the configuration parameters has no effect on search performance a more fundamental problem may be the cause.

A simple method of improving GA performance for NUM-type problems is to simply change the genetic representation. Many researchers publish results showing that binary codings worked better for their application, while other researchers report different results. The coding should be selected so that short, low-order schemata are relevant to the underlying optimization problem. If the configuration does not obey the building block principles then good search performance will be difficult to achieve. There are certain advantages to each representation scheme that can be exploited for certain applications. For more details see the discussion above titled "How do I choose a genetic representation for my application". Another simple modification which has been known to increase search performance for binary coded NUM-type problems is changing to a gray coding. Gray coding attempts to circumvent the difficulties associated with Hamming cliffs on the chromosome. For more information on Gray coding see Lucasius and Kateman (Chemometrics and Intelligent Laboratory Systems, 25, 1994, 99-145). If changing representations or codings does not work or if the application involves a form of feature selection, the problem may be more complicated.

Linkage problems occur when the schemata are located far apart from each other on the chromosome. To overcome linkage problems the length of the schemata L(H) must be reduced so that the schemata are less likely to be destroyed by recombination. This can be performed by changing the position of the genes in the chromosome. In the wavelength selection application, the order of the wavelengths can be changed such that higher signal to noise wavelengths are grouped together on the chromosome. The hypothesis being that these wavelength are more likely to be important to the optimization than noisy wavelengths. For feature selection-pattern recognition applications, the genes can be ordered so that adjacent genes code for similar features. Chromosome re-ordering is one area where an in-depth knowledge of the application is necessary. For an automated method of re-ordering the chromosome use a Messy Genetic Algorithm. Another potential solution to linkage problems is to employ a uniform crossover operator. Uniform crossover does not swap contiguous sections of the chromosome between parents thus eliminating any problems with linkage.

When the optimization space contains only higher-order building blocks (i.e., GA-hard problems), the primary means of processing them is no longer recombination but mutation. Thus, increasing the stochastic properties of the algorithm through uniform crossover or higher mutation rates may partially overcome this problem.

Final Remarks

I should make mention of the fact that GAs are more than just another optimization method. Many scientists study GAs to understand how evolution occurs and some more esoteric concepts such as genetic music and genetic art. Some scientists even argue that GAs were not intended to be numerical optimization methods and that comparisons with other methods are not exactly fair.

I hope this practical guide to GAs has been informative and understandable for you. I have tried to be impartial in discussing the pro's and con's of using GAs and to give you some feel as to what types of analytical chemistry optimizations might benefit from the application of GAs.

Reference List

GA References

List of Optimization-related Web Sites

Optimization in Analytical Chemistry

Return to ChemoWeb

Send comments to shaffer@ccf.nrl.navy.mil

< "hmsbeagle.com">

Chemometrics Research Group

Naval Research Laboratory