Abstract: The distributed genetic algorithm can be combined with the adaptive genetic algorithm to mine interesting and comprehensible classification rules. This paper gives the encoding method for the rules and the fitness function, and at the same time designs the selection, crossover, mutation and migration operators for the DAGA.
Keywords: classification rules; adaptive genetic algorithm; information gain; distributed genetic algorithm
CLC Number: TP183    Document Code: B
Article ID: 1004-373X(2008)10-132-04
1 Introduction
Data classification is the process of partitioning data objects according to the features of a group of data objects, and it has been widely researched in machine learning, statistics, neural networks, expert systems, and so on. Mining classification rules is also an important task in the course of Knowledge Discovery in Databases (KDD).
The Genetic Algorithm [1-7] is a global stochastic search method based on the theory of evolution and molecular genetics. In this paper, we study the key technologies for implementing a genetic algorithm and put forward a concrete scheme for determining the parameters and operations for classification, including individual encoding, fitness function design, GA operator design, and so on. However, classification rule mining methods based on the traditional genetic algorithm have the following drawbacks: they produce too many uninteresting classification rules, there are too many redundant rules in the optimized population, and the classification accuracy is not high. This paper proposes a classification rule mining method based on the Distributed and Adaptive Genetic Algorithm (DAGA), which can overcome the above drawbacks.
Furthermore, in order to ensure that the acquired classification rules are accurate as well as comprehensible, we put forward a method to compute the comprehensibility of the classification rules using attribute information gain, unlike other methods that evaluate the comprehensibility of a rule only by its simplicity. Thus the classification rules that are mined are more understandable and more informative.
2 Use Attribute Information Gain to Compute the Comprehensibility of the Classification Rules
2.1 Concept of Information Gain
Given a random variable x taking the values x_i (i=1,2,…,n), where the probability of x_i is p_i and ∑_{i=1}^{n} p_i = 1, the entropy of x is defined as:
H(x) = -∑_{i=1}^{n} p_i log p_i    (1)
The quantity defined by the above formula is called the Shannon entropy. When the logarithm is taken to base 2, the unit of the entropy is the bit.
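As a small, hedged illustration (our own Python sketch, not part of the paper), the Shannon entropy of formula (1) can be computed for a discrete distribution as follows, using base-2 logarithms so that the result is in bits:
import math

def shannon_entropy(probs):
    # H(x) = -sum(p_i * log2(p_i)); terms with p_i = 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # a fair coin carries 1 bit of entropy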
2.2 Attribute Information Gain and Its Computation Method
The information gain of an attribute measures how much classification information its values carry. The higher an attribute's information gain, the less additional information is needed to classify the samples. Therefore, the attributes appearing in a classification rule should have information gain that is as high as possible. Information gain is calculated as follows:
Let S be a set of data samples. Suppose the class label attribute has n distinct values, defining n distinct classes C_i (i=1,2,…,n), and let s_i be the number of samples of S in class C_i. The expected information needed to classify a given sample is given by the following formula:
I(s_1,s_2,…,s_n) = -∑_{i=1}^{n} p_i log p_i    (2)
where p_i = s_i/s is the probability that an arbitrary sample belongs to class C_i and s is the total number of samples.
Suppose attribute A has v distinct values {a_1,a_2,…,a_v}. A can partition S into v subsets {S_1,S_2,…,S_v}, where S_j contains those samples of S that have value a_j of A. Let s_{ij} be the number of samples of class C_i in subset S_j. The entropy, or expected information, of the partition of S into subsets by A is given by the following formula:
E(A) = ∑_{j=1}^{v} [(s_{1j}+…+s_{nj})/s] I(s_{1j},…,s_{nj})    (3)
The information gain of attribute A is:
Gain(A) = I(s_1,s_2,…,s_n) - E(A)    (4)
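To make formulas (2)-(4) concrete, the following Python sketch (our own illustration with hypothetical attribute names, not code from the paper) computes Gain(A) for one attribute of a small sample set:
import math
from collections import Counter

def expected_info(class_labels):
    # I(s_1,...,s_n): expected information needed to classify a sample (formula (2)).
    total = len(class_labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(class_labels).values())

def info_gain(records, attribute, class_attr):
    # Gain(A) = I(s_1,...,s_n) - E(A), combining formulas (3) and (4).
    total = len(records)
    base = expected_info([r[class_attr] for r in records])
    e_a = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[class_attr] for r in records if r[attribute] == value]
        e_a += (len(subset) / total) * expected_info(subset)  # weighted entropy E(A)
    return base - e_a

# Hypothetical toy data: "outlook" perfectly separates the class "play", so the gain is 1.0.
data = [{"outlook": "sunny", "play": "no"}, {"outlook": "sunny", "play": "no"},
        {"outlook": "rain", "play": "yes"}, {"outlook": "rain", "play": "yes"}]
print(info_gain(data, "outlook", "play"))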
2.3 Comprehensibility of the Classification Rules Based on Information Gain
In this paper, we define the comprehensibility of the classification rules. The comprehensibility measures how much relevant domain knowledge and information a rule offers to users; it is not determined merely by the length of the rule. According to this view, we introduce the attribute information gain into the fitness function to measure how understandable the constructed rules are. We define the comprehensibility of a classification rule based on information gain as follows:
comprehensibility = ∑_{i=1}^{l} Gain(A_i) / ∑_{j=1}^{n} Gain(A_j)    (5)
where l is the length of the rule, i.e. the number of feature attributes in its antecedent, ∑_{i=1}^{l} Gain(A_i) is the sum of the information gains of the attributes appearing in the classification rule, and ∑_{j=1}^{n} Gain(A_j) is the sum of the information gains of all feature attributes used to classify the future data objects.
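Under the same definitions, formula (5) amounts to a ratio of summed information gains; in the sketch below (our own illustration), rule_attrs and all_attrs are hypothetical names for the attributes used in a rule's antecedent and for all predicting attributes, and gains is a dictionary of precomputed attribute information gains:
def comprehensibility(rule_attrs, all_attrs, gains):
    # Formula (5): gains of the rule's attributes over the gains of all feature attributes.
    return sum(gains[a] for a in rule_attrs) / sum(gains[a] for a in all_attrs)

gains = {"outlook": 0.25, "humidity": 0.15, "wind": 0.05, "temperature": 0.03}
print(comprehensibility(["outlook", "humidity"], list(gains), gains))  # about 0.83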
3 The Distributed and Adaptive Genetic Algorithm
3.1 Genetic Algorithm
The genetic algorithm [1,2] is a stochastic search method that mimics the process of natural biological evolution. It operates on a population of potential solutions, applying the principle of survival of the fittest to produce better and better approximations to a solution. At each generation, a new set of approximations is created by selecting individuals according to their level of fitness in the problem domain and breeding them together using operators borrowed from natural genetics. This process leads to the evolution of populations of individuals that are better suited to their environment than the individuals they were created from, just as in natural adaptation.
The genetic algorithm models natural processes such as selection, crossover, mutation, and migration. It works on populations of individuals instead of single solutions; in this way the search is performed in a parallel manner.
At the beginning of the computation a number of individuals are randomly initialized.The fitness function is then evaluated for these individuals.The initial generation is produced.
If the optimization criteria are not met the creation of a new generation starts.Individuals are selected according to their fitness for the production of offspring.Parents are recombined to produce offspring.All offspring will be mutated with a certain probability.The fitness of the offspring is then computed.The offspring are inserted into the population replacing the parents,producing a new generation.This cycle is performed until the optimization criteria are reached.
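The workflow just described can be summarized as a generic single-population loop. The sketch below is our own minimal illustration under simple assumptions (fitness-biased truncation selection, user-supplied operators), not the paper's implementation:
import random

def genetic_algorithm(fitness, random_individual, crossover, mutate,
                      pop_size=50, generations=100, pm=0.01):
    # Initialize and evaluate the first generation.
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        # Select the fitter half of the population as parents.
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(parents, 2)
            child = crossover(a, b)          # recombine two parents
            if random.random() < pm:
                child = mutate(child)        # mutate with a small probability
            offspring.append(child)
        population = offspring               # offspring replace the parents
    return max(population, key=fitness)

# Hypothetical usage: maximize the number of ones in a 20-bit string.
best = genetic_algorithm(
    fitness=sum,
    random_individual=lambda: [random.randint(0, 1) for _ in range(20)],
    crossover=lambda a, b: a[:10] + b[10:],
    mutate=lambda c: [1 - g if random.random() < 0.1 else g for g in c])
print(sum(best))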
3.2 Distributed and Adaptive Genetic Algorithm
In the design of a genetic algorithm, the selection of the mutation probability and the crossover probability is of great importance. If the mutation probability is too small, newly generated individuals bring little novelty, the diversity of the population decreases, and premature convergence becomes likely; if it is too large, it destroys the good schemata of the current generation and makes the genetic algorithm behave like random search. If the crossover probability is too small, the search is inefficient and may even stagnate; if it is too large, it destroys the structure of good chromosomes in the current generation. Srinivas [5] proposed the adaptive genetic algorithm. This paper adopts an adaptive tuning scheme for the crossover and mutation probabilities based on individual fitness: an individual with high fitness, which contains many good schemata, should have its probabilities reduced, while an individual with low fitness should have its mutation probability increased. Generally, in the adaptive genetic algorithm, the crossover probability (Pc) and mutation probability (Pm) are adapted by the following formulas:
Pc = k1(fmax - f′)/(fmax - favg),  if f′ ≥ favg;  Pc = k2,  if f′ < favg    (6)
Pm = k3(fmax - f)/(fmax - favg),  if f ≥ favg;  Pm = k4,  if f < favg    (7)
where fmax is the highest fitness in the population, favg is the average fitness of each generation, f′ is the fitness of the two chromosomes which are to be crossed over, and f is the fitness of the chromosome which is to be mutated.
Reiko Tanese [3] proposed the distributed genetic algorithm as a way of efficiently parallelizing the canonical genetic algorithm. The global population is divided into several subpopulations, one per processor. Each processor runs the canonical genetic algorithm independently on its subpopulation. The only inter-processor communication occurs during the migration phase, which takes place at regular intervals: a fixed proportion of each subpopulation is selected and sent to the neighboring processor. In return, the same number of migrants is received from some other subpopulation and replaces individuals selected according to some criterion. This migration can occur either asynchronously or after all of the processors have been synchronized. Each processor then resumes running the canonical genetic algorithm as before, until the next migration phase.
A single-population genetic algorithm is powerful and performs well on a wide variety of problems; the distributed genetic algorithm can improve efficiency, and the adaptive genetic algorithm can maintain the diversity of the population while also ensuring convergence. Better results can therefore be obtained by a distributed and adaptive genetic algorithm. This paper combines their advantages: based on the distributed genetic algorithm, the global population is separated into several subpopulations, each of which executes the adaptive genetic algorithm independently; promising individuals are chosen periodically and migrated to other subpopulations. The distributed and adaptive genetic algorithm is given below.
procedure: Distributed and Adaptive Genetic Algorithm
input: GA parameters
output: best solution
begin
  t ← 0;  // t: generation number
  initialize P(t) by encoding routine;  // P(t): population of chromosomes
  fitness eval(P) by decoding routine;
  while (not termination condition) do
    crossover P(t) to yield C(t);  // C(t): offspring
    mutation P(t) to yield C(t);
    fitness eval(C) by decoding routine;
    if f′ ≥ favg  // f′: fitness of the two chromosomes which are to be crossed over
      then adapt Pc for the next generation by formula (6);  // Pc: crossover probability
    if f ≥ favg  // f: fitness of the chromosome which is to be mutated
      then adapt Pm for the next generation by formula (7);  // Pm: mutation probability
    select P(t+1) from P(t) and C(t);
    migrate P(t+1) to the neighboring processor;
    t ← t + 1;
  end
  output best solution;
end
4 Use the DAGA to Mine the Interesting Classification Rules
At present, classification rules are usually mined with methods based on traditional greedy algorithms. The genetic algorithm is an adaptive, global, probabilistic search and optimization algorithm. Unlike optimization methods such as the least squares method or hill climbing, the genetic algorithm does not need gradient information about the objective when solving non-linear optimization problems, and it can also avoid being trapped in local optima; therefore it has a wide range of applicability and strong robustness. At present the genetic algorithm has already been widely applied in many areas because of its simplicity, generality, robustness, and so on.
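As a hedged sketch of the adaptive probabilities of formulas (6) and (7) used by the DAGA procedure above (the constants k1-k4 are example values of our own choosing, not prescribed by the paper):
def adaptive_probabilities(f_prime, f, f_max, f_avg, k1=1.0, k2=1.0, k3=0.5, k4=0.5):
    # f_prime: fitness of the chromosomes selected for crossover; f: fitness of the
    # chromosome selected for mutation. High-fitness individuals get smaller Pc and Pm.
    spread = f_max - f_avg
    if spread <= 0:                      # degenerate case: all fitnesses are equal
        return k2, k4
    pc = k1 * (f_max - f_prime) / spread if f_prime >= f_avg else k2
    pm = k3 * (f_max - f) / spread if f >= f_avg else k4
    return pc, pm

# An above-average individual close to the current optimum is protected from disruption.
print(adaptive_probabilities(f_prime=0.9, f=0.9, f_max=1.0, f_avg=0.6))  # (0.25, 0.125)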
4.1 Encoding
We use the Michigan encoding method, in which each individual encodes a single classification rule and corresponds to a database record. In addition, the initial population is evolved class by class: since the individuals of a group belong to the same class, we only encode the antecedent (premise) of a classification rule and not its consequent; the corresponding complete classification rule is obtained each time the program is run.
As Table 1 shows, the records of the database are handled as follows: a database record is the basic carrier of the genetic encoding and is encoded as a chromosome, and each feature attribute Ai becomes a gene. Each gene consists of four fields: weight (Wi), operator (Oi), value (Vi), and information gain (Gi). Wi is the weight, i.e. the percentage of the total number of individuals whose attribute takes the current value. Oi is the operator associated with the attribute: = and ≠ for discrete attributes, ≤ and > for continuous attributes. Vi is the attribute value, encoded in binary for discrete attributes: each bit corresponds to one value of the attribute, where 1 means the corresponding value is taken and 0 means it is not. Gi is the information gain of the attribute; the information gain of every attribute is calculated before the genetic algorithm runs and is stored in the individual.
4.2 Design of the Fitness Function
As is known, with respect to a rule r there are four kinds of records in the database, as shown in Table 2. In order to design the fitness function, we define the following formulas:
S(r) = n_{12}/(n_{12} + n_{1\bar{2}})    (8)
Cov(r) = n_{12}/(n_{12} + n_{\bar{1}2})    (9)
accuracy = S(r) × Cov(r)    (10)
where n_{12} is the number of records that satisfy both the antecedent and the consequent of r, n_{1\bar{2}} the number that satisfy the antecedent but not the consequent, and n_{\bar{1}2} the number that satisfy the consequent but not the antecedent.
In order to mine the interesting classification rules, we design the fitness function as follows:
f(w1, w2) = w1 × accuracy + w2 × comprehensibility    (11)
where w1 and w2 satisfy w1 ∈ [0,1], w2 ∈ [0,1], and w1 + w2 = 1.
4.3 Selection Operator
The simplest selection scheme is roulette-wheel selection, also called stochastic sampling with replacement. It is a stochastic algorithm that uses the following technique: the individuals are mapped to contiguous segments of a line such that each individual's segment is equal in size to its fitness; a random number is generated, and the individual whose segment spans the random number is selected. The process is repeated until the desired number of individuals, the mating population, is obtained. The technique is analogous to a roulette wheel with each slice proportional in size to the fitness. We choose roulette-wheel selection as the selection operator of the DAGA.
4.4 Crossover Operator
During the recombination of binary variables, only parts of the individuals are exchanged between them. Depending on the number of parts, the individuals are divided before the exchange of variables; the number of crossover points distinguishes the methods. This paper uses single-point crossover: a crossover position k ∈ {1, 2, …, N-1}, where N is the number of variables of an individual, is selected uniformly at random, and the variables are exchanged between the two individuals after this point, producing two new offspring.
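As a hedged illustration of the two operators just described (our own sketch on binary chromosomes, not the paper's code), roulette-wheel selection and single-point crossover can be written as follows:
import random

def roulette_select(population, fitnesses, k):
    # Stochastic sampling with replacement: segment sizes are proportional to fitness.
    total = sum(fitnesses)
    mating_pool = []
    for _ in range(k):
        spin = random.uniform(0, total)
        cumulative = 0.0
        for individual, fit in zip(population, fitnesses):
            cumulative += fit
            if cumulative >= spin:
                mating_pool.append(individual)
                break
    return mating_pool

def single_point_crossover(parent_a, parent_b):
    # Choose k in [1, N-1] uniformly and exchange the parts after position k.
    k = random.randint(1, len(parent_a) - 1)
    return parent_a[:k] + parent_b[k:], parent_b[:k] + parent_a[k:]

# Hypothetical usage with 8-bit chromosomes and fitness = number of ones.
pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(6)]
parents = roulette_select(pop, [sum(ind) for ind in pop], 2)
child1, child2 = single_point_crossover(parents[0], parents[1])
print(child1, child2)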
4.5 Mutation Operator
For binary-valued individuals, mutation means flipping variable values, because every variable has only two states; thus the size of the mutation step is always one. For every individual, the variable value to change is chosen, mostly uniformly at random.
4.6 Migration Operator
The migration model divides the population into multiple subpopulations. These subpopulations evolve independently of each other for a certain number of generations (the isolation time). After the isolation time, a number of individuals is exchanged between the subpopulations. The number of exchanged individuals (the migration rate), the method used to select individuals for migration, and the migration scheme determine how much genetic diversity can occur in the subpopulations and how much information is exchanged between them.
In the neighborhood migration topology, migration takes place only between nearest neighbors; however, migration may occur in either direction between subpopulations. For each subpopulation, the possible immigrants are determined, according to the desired selection method, from the adjacent subpopulations, and a final selection is made from this pool of individuals.
5 Conclusions
The comprehensibility, accuracy and interestingness of classification rules are of great importance in the course of Knowledge Discovery in Databases. This paper presents a formula to calculate the comprehensibility based on the information gain of the attributes, and also gives a method to measure the accuracy of the classification rules, so that interesting classification rules can be mined easily. A new approach for mining interesting classification rules based on the Distributed and Adaptive Genetic Algorithm (DAGA) is put forward, and we give the encoding method for the rules, design the fitness function, and design the selection, crossover, mutation and migration operators for the algorithm as well.
References
[1] Murata T, Ishibuchi H, Gen M. Cellular Genetic Local Search for Multi-objective Optimization [C]. Proc. of the Genetic and Evolutionary Computation Conference, 2000: 307-314.
[2] Ishibuchi H, Yoshida T, Murata T. Selection of Initial Solutions for Local Search in Multi-objective Genetic Local Search [C]. Proc. of the 2002 Congress on Evolutionary Computation, 2002.
[3] Tanese R. Distributed Genetic Algorithms for Function Optimization [D]. University of Michigan, Ann Arbor, MI, 1989.
[4] Suzuki J. A Markov Chain Analysis on Simple Genetic Algorithms [J]. IEEE Trans. on Systems, Man, and Cybernetics, 1995, 25(4): 655-659.
[5] Srinivas M. Adaptive Probability of Crossover and Mutation in Genetic Algorithms [J]. IEEE Trans. on Systems, Man, and Cybernetics, 1994, 24(4): 655-667.
[6] Fogarty T C. Evolutionary Computing: Proceedings of the AISB Workshop on Evolutionary Computing 1996 [M]. Lecture Notes in Computer Science. Berlin, Heidelberg, New York: Springer-Verlag, 1996.
[7] Yang Liuyu. An Entropy-based Adaptive Genetic Algorithm for Learning Classification Rules. 2001: 790-796.
About the Author
YI Yunfei, born in 1981 in Ziyuan, Guangxi Province, is a teaching assistant at Hechi College and a postgraduate at the College of Computer Science, South Central University for Nationalities. His research interests include data mining, artificial intelligence and information security.