Minggang DONG, Ming LIU, Chao JING
1 School of Information Science and Engineering, Guilin University of Technology, Guilin 541004, China
2 Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin 541004, China
3 Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China
E-mail: jingchao@glut.edu.cn
Received Aug. 17, 2020; Revision accepted Feb. 14, 2021; Crosschecked Nov. 9, 2021
Abstract: The skewed distribution of multiclass data poses a major challenge to machine learning algorithms, since traditional methods are sensitive to class imbalance and do not account for the characteristics of multiclass imbalance problems. To tackle such issues, we propose a new splitting criterion for the decision tree based on the one-against-all-based Hellinger distance (OAHD). Two crucial elements are included in OAHD. First, the one-against-all scheme is integrated into the process of computing the Hellinger distance, thereby extending the Hellinger distance decision tree to cope with the multiclass imbalance problem. Second, a modified Gini index is designed for the multiclass imbalance problem that takes into account both the class distribution and the number of distinct classes. Moreover, we give theoretical proofs of the properties of OAHD, including skew insensitivity and the ability to seek purer nodes in the decision tree. Finally, we collect 20 public real-world imbalanced data sets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository and the University of California, Irvine (UCI) repository. Experimental and statistical results show that OAHD significantly outperforms five other well-known decision trees in terms of Precision, F-measure, and the multiclass area under the receiver operating characteristic curve (MAUC). Moreover, the Friedman and Nemenyi tests are used to confirm the advantage of OAHD over the five other decision trees.
Key words: Decision trees; Multiclass imbalanced learning; Node splitting criterion; Hellinger distance; One-against-all scheme
Numerous lines of evidence have shown that the decision tree is one of the most popular classifiers (Wu et al., 2008; Sharmin et al., 2019), characterized by high efficiency, simplicity, and interpretability (Cieslak et al., 2012). A decision tree is a tree-structured model consisting of one root node, multiple internal nodes, and some leaf nodes. Given a training data set, the decision tree uses a splitting criterion to partition the data set into child nodes (internal and leaf nodes) recursively until a stopping condition is met.
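The recursive partitioning described above can be sketched as follows. This is an illustrative skeleton, not the paper's implementation; the `Node` structure and the `choose_split` callback signature are our own assumptions, and any splitting criterion discussed below can be plugged in as `choose_split`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Node:
    feature: Optional[int] = None     # splitting feature index (internal nodes)
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[int] = None  # class label (leaf nodes)

def build_tree(X: List[List[float]], y: List[int],
               choose_split: Callable[[List[List[float]], List[int]], Tuple[int, float]],
               min_samples: int = 2) -> Node:
    """Recursively partition (X, y) until a node is pure or a stopping condition is met."""
    if len(set(y)) == 1 or len(y) < min_samples:
        return Node(prediction=max(set(y), key=y.count))  # majority-class leaf
    f, t = choose_split(X, y)  # the splitting criterion picks the feature and cut point
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i in range(len(y)) if i not in set(left)]
    if not left or not right:  # degenerate split: stop and emit a leaf
        return Node(prediction=max(set(y), key=y.count))
    return Node(feature=f, threshold=t,
                left=build_tree([X[i] for i in left], [y[i] for i in left],
                                choose_split, min_samples),
                right=build_tree([X[i] for i in right], [y[i] for i in right],
                                 choose_split, min_samples))

def predict(node: Node, row: List[float]) -> int:
    while node.prediction is None:
        node = node.left if row[node.feature] <= node.threshold else node.right
    return node.prediction
```

With a trivial fixed split, `build_tree([[0.0], [0.0], [1.0], [1.0]], [0, 0, 1, 1], lambda X, y: (0, 0.5))` produces one internal node and two pure leaves.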
Originally, the decision tree was developed for balanced classification. Two representative decision trees, CART and C4.5 (Breiman et al., 1984; Quinlan, 1986), have shown pronounced capability in dealing with balanced classification problems. Given the importance of the splitting criterion, each of these decision trees proposes an impurity-based splitting criterion, and both have benefited from it when dealing with balanced classification. However, these methods face a problem when processing data sets with many distinct classes. Chandra et al. (2010) proposed the distinct class based splitting measure (DCSM), which relies on the number of distinct classes. DCSM takes into account not only the distribution of each class but also the number of distinct classes in a partition, which yields better classification performance. Meanwhile, strict proofs have been provided to show the advantages of DCSM in terms of properties such as convexity and well-behavedness (Safavian and Landgrebe, 1991; Kotsiantis, 2013; Osei-Bryson, 2014).
However, these methods have limitations in solving the multiclass imbalance problem. Using the prior probability of classes to compute the splitting criteria leads to poor performance on the minority classes (Flach, 2003; Akash et al., 2019). In multiclass imbalanced classification, the minority classes occupy a small portion of the data but are more important than the majority classes; thus, data sets containing minority classes pose a challenge to traditional decision trees. It is therefore crucial to improve the performance on the minority classes in the multiclass imbalanced classification problem.
To cope with the imbalanced classification problem, several decision trees have been proposed (Cieslak and Chawla, 2008; Liu et al., 2010; Boonchuay et al., 2017; Akash et al., 2019; Su and Cao, 2019). Liu et al. (2010) designed a class confidence proportion decision tree, introducing a new measure (class confidence proportion) that is not biased toward the majority classes; however, this method can solve only the two-class imbalance problem. To solve the multiclass imbalance problem, Akash et al. (2019) presented a weighted inter-node Hellinger distance (iHDw) based decision tree. iHDw adopts the weighted squared Hellinger distance to measure the difference in class distribution between the parent node and the child nodes rather than using the prior probability of classes; in this way, the splitting criterion avoids sensitivity to class imbalance. Moreover, Su and Cao (2019) integrated the Hellinger distance (Kailath, 1967; Cichocki and Amari, 2010) and KL divergence (Feng et al., 2019; Wan et al., 2020) into the lazy decision tree to solve the imbalanced classification problem. Cieslak and Chawla (2008) proposed the Hellinger distance decision tree (HDDT), which uses the Hellinger distance (Kailath, 1967; Cichocki and Amari, 2010) as its splitting criterion and shows better performance on imbalanced classification. However, the Hellinger distance has trouble distinguishing different splits in the multiclass imbalance problem, and the splitting criterion considers neither the class distribution nor the number of distinct classes.
We hereby consider a situation that illustrates the defect of the Hellinger distance. Suppose that a certain node in the decision tree contains samples of five classes, denoted A, B, C, D, and E, with 100, 40, 20, 10, and 10 samples, respectively, and consider two candidate splits. The first split sends all the samples of classes A and B (totaling 140) to the left child node, and the remaining 40 samples of the other classes to the right child node. The second split sends the 100 samples of class A plus half of the samples of each other class to the left child node (again 140 samples), while the leftover 40 samples from classes B to E go to the right child node. Because HDDT can compute the distance only between two classes, it treats the multiclass problem as a two-class problem: if class A is taken as the positive class, the other classes are merged into the negative class. The two splits then induce identical class distributions, so HDDT assigns them the same score. Clearly, HDDT is inappropriate for the multiclass imbalanced classification problem. The main reason is that the Hellinger distance can handle only two-class classification and cannot identify certain differences in the multiclass imbalance problem. Furthermore, it considers neither the multiclass distribution nor the number of distinct classes, both of which are critical in the multiclass imbalance problem.
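The two splits above can be checked numerically. The helper below is an illustrative sketch of the two-class Hellinger distance over the partitions of a split, with class A as the positive class and classes B–E (80 samples) merged into the negative class; the function name and input layout are our own.

```python
from math import sqrt

def hellinger_two_class(pos_counts, neg_counts):
    """Two-class Hellinger distance over the V partitions of a split."""
    pos_total, neg_total = sum(pos_counts), sum(neg_counts)
    return sqrt(sum((sqrt(p / pos_total) - sqrt(n / neg_total)) ** 2
                    for p, n in zip(pos_counts, neg_counts)))

# Split 1: left = {A, B}, right = {C, D, E}   -> A: 100/0, merged negatives: 40/40
# Split 2: left = {A, half of B-E}, right = rest -> A: 100/0, merged negatives: 40/40
split1 = hellinger_two_class(pos_counts=[100, 0], neg_counts=[40, 40])
split2 = hellinger_two_class(pos_counts=[100, 0], neg_counts=[40, 40])
assert split1 == split2  # HDDT cannot distinguish the two splits
```

Both splits collapse to the same positive/negative partition counts, so HDDT scores them identically even though the second split scatters classes B–E across both children.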
To address the insufficiencies of the methods mentioned above, in this study we propose a one-against-all-based Hellinger distance (OAHD) decision tree to solve the multiclass imbalanced classification problem. Because the Hellinger distance cannot calculate the distance between multiple classes, we introduce the one-against-all scheme (Anand et al., 1995) into the computation of the splitting criterion of OAHD; through this decomposition, OAHD is extended to the multiclass problem. We also consider the purity of nodes in the decision tree: we take into account the number of distinct classes and the class distribution of the multiclass imbalance problem, without using the prior probability of the classes, and we modify the Gini index to fit the multiclass imbalance problem. Moreover, we strictly prove that OAHD has the skew-insensitivity property and that the splitting criterion seeks purer nodes. Finally, we collect 20 public real-world imbalanced data sets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository (Alcala-Fdez et al., 2011) and the University of California, Irvine (UCI) repository (Asuncion, 2007). Experimental and statistical results demonstrate that the proposed OAHD outperforms five unpruned decision trees on various metrics.
Here, we discuss several splitting criteria for decision trees related to our proposed criterion.
The CART decision tree (Breiman et al., 1984) and the C4.5 decision tree (Quinlan, 1986) are the classical algorithms for balanced data sets. The intuition of CART is to split on the attribute that reduces impurity the most. Consider a parent node u with V child nodes, and suppose there are C distinct classes in node u; then the splitting criterion (the Gini index) is calculated as

Gini(j) = Σ_{v=1}^{V} (N_v/N_u) [1 − Σ_{k=1}^{C} (N_{vω_k}/N_v)²],    (1)

where N_u denotes the number of samples in node u, N_{vω_k} denotes the number of samples of class ω_k in partition v, and N_v denotes the total number of samples in partition v.
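A minimal sketch of the Gini computation of Eq. (1) from per-child class counts; the function name and the list-of-lists input layout are illustrative choices.

```python
def gini_index(children):
    """Weighted Gini impurity of a split.

    children: one list of per-class sample counts for each child partition,
    e.g. [[N_1w1, N_1w2], [N_2w1, N_2w2]] for a binary split of two classes.
    """
    total = sum(sum(counts) for counts in children)  # N_u
    score = 0.0
    for counts in children:
        n_v = sum(counts)
        impurity = 1.0 - sum((n / n_v) ** 2 for n in counts)  # Gini of one child
        score += (n_v / total) * impurity                     # weight by N_v / N_u
    return score

# A perfectly pure split has zero impurity; a 50/50 mix in each child gives 0.5.
assert gini_index([[10, 0], [0, 10]]) == 0.0
assert abs(gini_index([[5, 5], [5, 5]]) - 0.5) < 1e-12
```

CART would evaluate this score for every candidate split and keep the one with the smallest value.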
Similar to CART, C4.5 chooses the partitioning with the largest information gain ratio. The information gain based on an attribute x_j is defined as

Gain(x_j) = Ent(u) − Σ_{v=1}^{V} (N_v/N_u) Ent(v),    (2)

where N_u denotes the number of samples in node u and Ent(·) denotes the entropy of the class distribution in a node.
Since Eq. (2) favors attributes with a larger number of values, Quinlan (1986) normalized the information gain to obtain a new splitting criterion, the information gain ratio, which is defined as

GainRatio(x_j) = Gain(x_j) / IV(x_j),  with  IV(x_j) = −Σ_{v=1}^{V} (N_v/N_u) log₂(N_v/N_u).    (3)
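Eqs. (2) and (3) can be sketched directly from class counts; as above, the function names and count-based inputs are our own conventions.

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class-count vector (0 log 0 taken as 0)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def gain_ratio(parent_counts, children):
    """Information gain ratio (Eq. 3): gain (Eq. 2) normalized by split information."""
    n_u = sum(parent_counts)
    gain = entropy(parent_counts) - sum(
        (sum(counts) / n_u) * entropy(counts) for counts in children)
    split_info = entropy([sum(counts) for counts in children])  # IV(x_j)
    return gain / split_info if split_info else 0.0

# A clean two-way split of a balanced two-class node: gain = 1 bit, IV = 1 bit.
assert gain_ratio([10, 10], [[10, 0], [0, 10]]) == 1.0
```

The normalization by IV(x_j) is what penalizes attributes that split the node into many small partitions.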
These decision trees show better performance while dealing with the balanced classification problem. However, for a skewed class distribution, the above-mentioned methods are inadequate for improving the performance.
Chandra et al. (2010) proposed a new splitting criterion called DCSM, which not only emphasizes the proportion of each distinct class but also pays attention to the number of distinct classes in a partition. Considering a splitting node u (parent node) with V partitions (child nodes), for a given attribute x_j, the splitting criterion M(j) is calculated as

M(j) = Σ_{v=1}^{V} (N_v/N_u) D(v) exp(D(v)) Σ_{k=1}^{C} δ_v exp(δ_v) a_{vω_k} [1 − (a_{vω_k})²],    (4)

where D(v) represents the number of distinct classes in partition v, δ_v is equal to D(v)/D(u), and a_{vω_k} is the probability of class ω_k in partition v, i.e., a_{vω_k} = N_{vω_k}/N_v.
There are two critical parts in Eq. (4). One is D(v)exp(D(v)), which accounts for the number of distinct classes in each child node; as the number of distinct classes increases, this part increases sharply, so purer partitions are preferred. The other is the summation of δ_v exp(δ_v) a_{vω_k} [1 − (a_{vω_k})²] over v = 1, 2, ..., V. As δ_v decreases, the impurity of the partition decreases, and 1 − (a_{vω_k})² decreases when a class accounts for a larger share of the samples in the partition. Therefore, this measure favors purer partitions. Compared with other split measures, DCSM takes into account the number of distinct classes. Its limitation, however, is similar to those of the above-mentioned decision trees: it prefers the majority classes, resulting in poor performance on the minority classes.
Cieslak and Chawla (2008) introduced a new splitting criterion, the Hellinger distance, to solve the imbalanced classification problem. The main difference between HDDT and other splitting measures is that HDDT is insensitive to skewed class distributions. In HDDT, assuming a two-class problem (class X+ and class X−) and dividing a continuous feature into V partitions, a feature is selected as the splitting attribute when it achieves the largest Hellinger distance between classes X+ and X−. The Hellinger distance is defined as

d_H(X+, X−) = √( Σ_{j=1}^{V} ( √(|X+_j| / |X+|) − √(|X−_j| / |X−|) )² ),    (5)

where |X+_j| and |X−_j| represent the numbers of samples of classes X+ and X− in partition j, respectively, and |X+| and |X−| denote the numbers of samples of classes X+ and X− over all partitions, respectively.
HDDT is considered strongly skew-insensitive because it does not use the prior probability of a class in the distance calculation (Abdi and Hashemi, 2016; Akash et al., 2019). Nevertheless, the splitting criterion captures differences in feature values only between two classes, without considering the multiclass imbalanced classification problem. Therefore, it is necessary to design a splitting criterion that addresses the multiclass imbalanced classification problem.
Akash et al. (2019) used the weighted squared Hellinger distance to measure the difference between the parent node and each child node. By combining a selection weight with the squared Hellinger distance, iHDw can obtain purer child nodes in a split partition. iHDw can be defined as

iHDw(j) = q_L ω_L D²_H(P_L ‖ P) + q_R ω_R D²_H(P_R ‖ P),    (6)

where

D²_H(P_t ‖ P) = Σ_{j=1}^{C} ( √p_tj − √p_j )²,  t ∈ {L, R}.
Here q_L and q_R denote the ratios of the numbers of samples in the left and right child nodes to the number of samples in the parent node, respectively. The role of D²_H(P_t ‖ P) is to maximize the distance between the distributions of the two child nodes so as to obtain a mutually exclusive and skew-insensitive partition. P_t and P denote the child nodes and the parent node, respectively, C is the number of distinct classes, and p_tj and p_j represent the proportion of class j in the child node and the parent node, respectively. The weight ω_t does not depend on the prior probability of the classes, so it is not biased toward the majority classes, and ω_t increases linearly with the dissimilarity between the parent node and the child node, favoring purer partitions.
Due to the defects in current research, we propose a splitting criterion, OAHD, that extends the Hellinger distance to solve the multiclass imbalanced classification problem. Moreover, for each partition, the impurity of the training patterns and the distribution of the multiple classes are considered in order to seek purer child nodes.
The Hellinger distance is a divergence measure and a member of the α-divergence family (Kailath, 1967; Cichocki and Amari, 2010). For two discrete probability distributions P and Q, the Hellinger distance is defined as in Eq. (5). However, the Hellinger distance does not work for the multiclass imbalance problem. To address this, we propose a new splitting criterion, OAHD, based on the Hellinger distance. OAHD is used to calculate a better split threshold between two child nodes; its purpose is that the samples in the two child nodes are divided into mutually exclusive regions, and the node is sufficiently pure when OAHD is maximized. Considering a parent node u and the corresponding child nodes v, v ∈ {L, R}, for each feature j, the proposed splitting criterion is defined as follows:

OAHD_j = Σ_{v∈{L,R}} [ ( √(N_vθ / N_uθ) − √(N_vθ̄ / N_uθ̄) )² + (β_u − β_v)/β_u ],    (7)
where
Here, N denotes the number of samples in a node; for example, N_vi denotes the number of samples of class i in node v. β_u denotes the number of distinct classes in the parent node, and β_v represents the number of distinct classes, with W_k ≥ 1/2, in the child node. θ and θ̄ denote the selected minority class and the remaining classes, respectively.
The splitting criterion is composed of two parts. The first part is the sum of the one-against-all-based Hellinger distances, which is skew-insensitive. Previous works have usually decomposed a C-class imbalance problem into C two-class imbalance problems. In contrast, we apply the one-against-all scheme to calculate the multiclass Hellinger distance directly on the original multiclass imbalanced data set. For each class, OAHD regards the selected minority class as the positive class (θ) and the remaining classes as the negative class (θ̄); in this way, a one-against-all-based Hellinger distance of the selected minority class is obtained. The threshold that maximizes this one-against-all-based Hellinger distance is taken as the optimal split threshold. The terms N_vθ/N_uθ and N_vθ̄/N_uθ̄ represent the ratios of the number of samples in the child node to the number of samples in the parent node for classes θ and θ̄, respectively. In previous splitting criteria, the term ω_v/ω_u denotes the ratio of the number of samples in the child node to the number of samples in the parent node; however, due to the enormous difference in the number of samples between the majority and minority classes, this index does not work. In our method, the per-class ratio instead represents the proportion of weight between the parent node and the child node without using the prior probability of the classes. As the difference in the distribution probability between classes θ and θ̄ increases, the partition becomes purer; therefore, purer partitions are preferred by this part.
The main process of calculating the one-against-all-based Hellinger distance is described in Algorithm 1.
Several properties are used to characterize a node splitting criterion and are expected of a good one. OAHD has the following basic properties:
1. OAHD is insensitive to the skewness of the class distribution; therefore, this splitting criterion will not be biased toward the majority classes.
Algorithm 1 OAHD
Input: f (a feature in the imbalanced data set T) and j (indicating that f is the jth feature of T)
Output: HD (the maximum value of OAHD calculated using Eq. (7)) and TH (the corresponding split threshold of HD)
1: HD ← −1
2: Sort the values of feature f, and then divide them into N equal parts
3: for k ← 1 to C do
4:   for i ← 1 to N do
5:     Cur_TH ← Threshold(i)
6:     As mentioned above, calculate the corresponding parameters
7:     Cur_HD ← OAHD value from Eq. (7)
8:     if Cur_HD > HD then
9:       HD ← Cur_HD
10:      TH ← Cur_TH
11:     end if
12:   end for
13: end for
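Algorithm 1 can be sketched in Python as follows. This is a hedged reconstruction, not the authors' code: the per-child score (the squared gap between the class-ratio roots of θ and θ̄ plus the distinct-class term (β_u − β_v)/β_u), the quantile-based threshold grid, and all parameter names are our own assumptions.

```python
import numpy as np

def oahd_split(feature, labels, n_thresholds=10):
    """Scan candidate thresholds for each one-vs-rest class (theta) and keep the
    threshold maximizing the reconstructed OAHD score."""
    classes = np.unique(labels)
    beta_u = len(classes)  # distinct classes in the parent node
    thresholds = np.quantile(feature, np.linspace(0.05, 0.95, n_thresholds))
    best_hd, best_th = -1.0, None
    for theta in classes:                      # one-against-all: theta vs the rest
        pos, neg = labels == theta, labels != theta
        for th in thresholds:
            hd = 0.0
            for side in (feature <= th, feature > th):
                w_pos = side[pos].mean()       # N_v_theta / N_u_theta
                w_neg = side[neg].mean()       # N_v_thetabar / N_u_thetabar
                beta_v = len(np.unique(labels[side])) if side.any() else 0
                hd += (np.sqrt(w_pos) - np.sqrt(w_neg)) ** 2 \
                      + (beta_u - beta_v) / beta_u
            if hd > best_hd:
                best_hd, best_th = hd, th
    return best_hd, best_th
```

On a feature that cleanly separates one class from the rest, the chosen threshold falls between the two value clusters and the score is maximal for the separated class.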
2. HD is nonnegative, and it attains its lowest value, HD = 0, when most samples of each class are split into the same child node, which makes that node more impure. Thus, a purer node can be sought by maximizing OAHD.
For the first property, we adopt the formulation of Flach (2003). This method transforms the splitting criterion into a formula containing the true positive rate (the ratio of the number of positive samples in the child node to the number of positive samples in the parent node) and the false positive rate (the ratio of the number of negative samples in the child node to the number of negative samples in the parent node), which can be used to measure the degree of sensitivity to skewness; Vilalta and Oblinger (2000) performed a similar analysis. As mentioned above, without the prior probability of the classes, the Hellinger distance can also be expressed in terms of the true positive rate and the false positive rate, which have been proved insensitive to skewed class distributions (Vilalta and Oblinger, 2000; Flach, 2003). Thus, Eq. (5) can be written as follows:

d_H = √( (√tpr − √fpr)² + (√(1 − tpr) − √(1 − fpr))² ),    (8)
where tpr denotes the true positive rate and fpr denotes the false positive rate.
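The skew insensitivity of this tpr/fpr form can be checked numerically: scaling the negative class by any factor leaves tpr and fpr, and hence the distance, unchanged. The helper names below are our own.

```python
from math import sqrt

def hellinger_tpr_fpr(tpr, fpr):
    """Eq. (8): the Hellinger distance written purely in terms of tpr and fpr."""
    return sqrt((sqrt(tpr) - sqrt(fpr)) ** 2
                + (sqrt(1 - tpr) - sqrt(1 - fpr)) ** 2)

def hellinger_counts(pos_left, pos_total, neg_left, neg_total):
    """Same distance computed from raw counts of a binary split."""
    tpr, fpr = pos_left / pos_total, neg_left / neg_total
    return hellinger_tpr_fpr(tpr, fpr)

# Multiplying the negative class by 100 (same fpr) leaves the distance unchanged:
balanced = hellinger_counts(90, 100, 20, 100)
skewed = hellinger_counts(90, 100, 2000, 10000)
assert abs(balanced - skewed) < 1e-12
```

A prior-weighted criterion such as the Gini index would change under the same rescaling, which is exactly the bias toward majority classes discussed above.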
Proposition 1  OAHD has the skew-insensitive property.
Proof  To prove Proposition 1, consider a data set with C classes (k = 1, 2, ..., C). Let tp_Lk denote the number of samples of class k in the left child node and tp_Rk the number of samples of class k in the right child node; the true positive rate is then tpr_k = tp_vk / (tp_Lk + tp_Rk), v ∈ {L, R}. Suppose that class θ denotes the selected minority class, and that the remaining classes are recorded as class θ̄. From Eq. (7), for feature j, the splitting criterion of OAHD can be formulated as follows:
Since (β_u − β_v)/β_u (v ∈ {L, R}) denotes the impact of the number of distinct classes on the process of splitting a node, where β_u denotes the number of distinct classes in the parent node, it does not involve the prior probability of the classes. Therefore, to clarify Eq. (9), we denote (β_u − β_v)/β_u by D_v (v ∈ {L, R}). The remaining term is organized into two parts, a constant value and a true positive rate without the prior probability of the classes, where the constant value equals S_v. It is obvious that S_L + S_R equals C, so we can set S_R = C − S_L. Similarly, D_R can be calculated as 1 − D_L, where β_v ≠ β_u. Thus, Eq. (9) is eventually formulated as follows:
From Eq. (10), we can see that OAHD can be transformed into a form that contains only tpr and constant values (D_L, S_L, and D_R), without the prior probability of the classes (Akash et al., 2019). Thus, OAHD is insensitive to the skewness of the class distribution of the data set (Vilalta and Oblinger, 2000; Flach, 2003).
Proposition 2  OAHD is nonnegative. It attains its lowest value, HD = 0, when most of the samples of all classes are divided into the same child node, and it seeks a purer node.
Proof  By the calculation in Eq. (7), both critical parts are nonnegative, so HD is nonnegative and its lowest value is zero. Assume a data set with three classes (k = 1, 2, 3). For a feature j, suppose that the split threshold sends 60% of the samples of each class to the left node and 40% to the right node. Eq. (11) shows the specific calculation:
As we can see from Eq. (11), when most samples of all classes are divided into the same child node, (β_u − β_v)/β_u equals zero, so HD achieves its lowest value of zero. In this situation, the node is disordered and impure.
In this subsection, the performance of OAHD was evaluated on 20 public real-world multiclass imbalanced data sets, collected from two well-known public sources, the KEEL and UCI repositories. Details of the data sets are shown in Table 1. The imbalance ratio (IR) denotes the ratio of the number of samples of the largest majority class to the number of samples of the smallest minority class. The OAHD decision tree was compared with five unpruned decision tree classifiers: CART, C4.5, DCSM, inter-node Hellinger distance (iHD), and iHDw. Furthermore, to guarantee a fair comparison, all experiments were conducted using five-fold cross-validation with 20 independent runs.
Table 1 Description of the imbalanced data sets, including the name, size, number of attributes, number of classes, class distribution, and imbalance ratio (IR)
Precision and F-measure are widely used single-class metrics for imbalance problems (He and Garcia, 2009; Nekooeimehr and Lai-Yuen, 2016), and they are suitable for evaluating the performance of the algorithm. In the following experiments, Precision and F-measure were used to evaluate only the smallest minority class; they are defined in Eqs. (12) and (14), respectively. Another popular evaluation metric, the multiclass area under the receiver operating characteristic (ROC) curve (MAUC), extends the area under the ROC curve (AUC) (Hanley and McNeil, 1982; Bradley, 1997; Ali et al., 2019) to the multiclass imbalanced classification problem and is calculated in Eq. (15):

Precision = TP / (TP + FP),    (12)
Recall = TP / (TP + FN),    (13)
F-measure = (1 + β²) · Precision · Recall / (β² · Precision + Recall),    (14)
MAUC = 2 / (|C|(|C| − 1)) · Σ_{i<j} [A(C_i|C_j) + A(C_j|C_i)] / 2,    (15)
where TP is the number of true positive samples (actually positive and classified as positive), FP is the number of false positive samples (actually negative but classified as positive), FN is the number of false negative samples (actually positive but classified as negative), and the parameter β is usually set to one. Here, |C| indicates the number of classes, and A(C_i|C_j) and A(C_j|C_i) are the AUC values between classes C_i and C_j; in the two-class imbalance problem they are equal, but in the multiclass imbalance problem A(C_i|C_j) may not equal A(C_j|C_i).
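Eqs. (12)–(15) can be sketched as follows; the `pairwise_auc` mapping for MAUC is an illustrative input format (in practice each A(C_i|C_j) would be estimated from classifier scores, e.g. in the Hand-Till manner).

```python
def precision(tp, fp):
    """Eq. (12)."""
    return tp / (tp + fp)

def f_measure(tp, fp, fn, beta=1.0):
    """Eq. (14) with beta = 1: the harmonic mean of precision and recall."""
    p, r = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

def mauc(pairwise_auc):
    """Eq. (15): average pairwise AUC over all unordered class pairs.

    pairwise_auc maps (i, j) -> A(Ci|Cj); A(Ci|Cj) may differ from A(Cj|Ci).
    """
    classes = sorted({c for pair in pairwise_auc for c in pair})
    n = len(classes)
    total = sum((pairwise_auc[(i, j)] + pairwise_auc[(j, i)]) / 2
                for idx, i in enumerate(classes) for j in classes[idx + 1:])
    return 2 * total / (n * (n - 1))
```

For example, `precision(8, 2)` is 0.8, and a classifier with all pairwise AUC values equal to 1.0 has MAUC 1.0.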
To further evaluate the statistical differences between OAHD and the comparative methods on multiple imbalanced data sets, we conducted the Friedman test (Friedman, 1937, 1940) with the corresponding post-hoc test, i.e., the Nemenyi test (Nemenyi, 1963). The Friedman statistic χ²_F is calculated as follows:

χ²_F = 12N / (k(k + 1)) · [ Σ_{i=1}^{k} R_i² − k(k + 1)² / 4 ],    (16)

where k is the number of compared classifiers and R_i is the mean rank of the ith classifier over the N data sets.
To compare the different classifiers on multiple data sets, the Iman F-statistic (Iman and Davenport, 1980) is calculated from χ²_F as

F_F = (N − 1)χ²_F / (N(k − 1) − χ²_F).    (17)
If the null hypothesis (that all the classifiers perform similarly) is rejected, the post-hoc Nemenyi test is adopted to find the classifiers whose performance differs significantly. In the Nemenyi test, the critical difference (CD) defined in Eq. (18) is the key value: when the difference between the mean rankings of two classifiers is larger than CD, the performance of one classifier is significantly better than that of the other. CD is calculated as

CD = q_δ √( k(k + 1) / (6N) ),    (18)
where q_δ is the critical value for the two-tailed Nemenyi test for k classifiers at the δ significance level. In this study, by consulting the table, q_δ equals 2.850 at the 0.05 significance level.
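Eqs. (16)–(18) can be reproduced directly. In the example below, the Precision mean ranks are taken from Table 5 (OAHD, 1.775) together with the rank differences reported later in the text; the function names are our own.

```python
from math import sqrt

def friedman_stats(mean_ranks, n_datasets):
    """Eqs. (16)-(17): Friedman chi-square and the Iman-Davenport F statistic."""
    k, n = len(mean_ranks), n_datasets
    chi2 = 12 * n / (k * (k + 1)) * (sum(r ** 2 for r in mean_ranks)
                                     - k * (k + 1) ** 2 / 4)
    f_f = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_f

def nemenyi_cd(q_delta, k, n_datasets):
    """Eq. (18): critical difference for the Nemenyi post-hoc test."""
    return q_delta * sqrt(k * (k + 1) / (6 * n_datasets))

# Precision mean ranks of the six trees (OAHD, CART, C4.5, DCSM, iHD, iHDw):
ranks = [1.775, 3.875, 4.175, 3.925, 3.525, 3.725]
chi2, f_f = friedman_stats(ranks, n_datasets=20)
assert abs(chi2 - 21.7357) < 1e-3 and abs(f_f - 5.2767) < 1e-3
assert abs(nemenyi_cd(2.850, 6, 20) - 1.6861) < 1e-3
```

These asserts recover the paper's reported values χ²_F(P) = 21.7357, F_F(P) = 5.2767, and CD = 1.6861 for six classifiers on 20 data sets.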
To show the effectiveness of OAHD, we report the experimental results for the metrics Precision, F-measure, and MAUC. Tables 2–4 show the mean values of the three metrics for OAHD and the five other unpruned decision trees (CART, C4.5, DCSM, iHD, and iHDw); the best mean value on each data set is in bold. From Tables 2–4, it can be observed that the proposed OAHD outperformed the five other unpruned decision trees on these imbalanced data sets. For each of Precision, F-measure, and MAUC, OAHD obtained the best results on 11 of the 20 multiclass imbalanced data sets.
Table 2 Precision of the proposed OAHD and the five decision trees for the 20 imbalanced data sets
The results are further summarized in Table 5, which shows the mean rankings of the six decision trees in terms of Precision, F-measure, and MAUC on the 20 multiclass imbalanced data sets. For each data set, the algorithm with the best performance receives the ranking "1," and the ranking "6" indicates the worst performance. When several algorithms perform at the same level, the rankings are shared equally; for instance, when two algorithms are tied for the ranking "1," each is assigned the ranking "1.5." As listed in Table 5, OAHD obtained mean rankings of 1.775, 1.85, and 1.6 in terms of Precision, F-measure, and MAUC, respectively, all of which were the best among the six decision trees.
To further evaluate the statistical significance of OAHD compared with the five other unpruned decision trees, the Friedman test and the post-hoc Nemenyi test were applied; the results are shown in Table 6. The Friedman test compares the performances of OAHD and the five unpruned decision trees over all the data sets. Its null hypothesis is that the observed performance differences among these decision trees occur only by chance. If F_F calculated using Eq. (17) is not less than the critical value, the differences are statistically significant and the null hypothesis is rejected.
Table 3 F-measure of the proposed OAHD and the five decision trees for the 20 imbalanced data sets
Table 4 MAUC of the proposed OAHD and the five decision trees for the 20 imbalanced data sets
As listed in Table 6, the null hypothesis of the Friedman test was rejected. We detail the calculation below. For the metric Precision, using Eq. (16) with the mean rankings, the Friedman statistic is χ²_F(P) = 21.7357, and the Iman F-statistic calculated using Eq. (17) is F_F(P) = 5.2767. The critical value at the 0.05 significance level is 2.3102, and F_F(P) > 2.3102; thus, the null hypothesis was rejected. Similarly, using Eq. (16), the Friedman statistics for F-measure and MAUC are χ²_F(F) = 22.3143 and χ²_F(M) = 35.7429, and the Iman F-statistics obtained using Eq. (17) are F_F(F) = 5.4575 and F_F(M) = 10.5687. Evidently, the F_F values for both F-measure and MAUC were also greater than the critical value (2.3102); therefore, those null hypotheses were rejected as well.
Table 5 Mean ranking of the proposed OAHD and the five decision trees for the three metrics on the 20 imbalanced data sets
Table 6 Friedman test and Nemenyi test for the six methods on the 20 data sets, with OAHD as the base classifier
After the null hypotheses of the Friedman test for Precision, F-measure, and MAUC were rejected, we conducted the post-hoc Nemenyi test to find which decision trees performed better. The results are listed in Table 6; the symbol "√" indicates that OAHD outperforms the compared algorithm. With six classifiers and 20 imbalanced data sets, at the 0.05 significance level, q_δ equals 2.850; thus, using Eq. (18), CD = 2.850 × √(6 × 7 / (6 × 20)) = 1.6861. For the metric Precision, the difference in mean rankings between OAHD and CART was 2.1; since this is greater than CD, the performance of OAHD was significantly better than that of the CART decision tree. Likewise, the differences in mean rankings between OAHD and C4.5, DCSM, iHD, and iHDw were 2.4, 2.15, 1.75, and 1.95, respectively, all greater than CD, so OAHD performed significantly better than all the compared algorithms on Precision. For F-measure, the differences in mean rankings between OAHD and the five unpruned decision trees were 1.95, 2.55, 2.15, 1.5, and 1.75, respectively; for MAUC, they were 1.75, 3.3, 2.7, 1.9, and 1.75, respectively. These differences were greater than CD, which means that OAHD was better than the other algorithms at the 0.05 significance level. The only exception was the metric F-measure for iHD, where the difference in mean rankings (1.5) was less than CD.
From the above analysis, we find that OAHD is significantly better than the compared algorithms because the one-against-all decomposition scheme introduced into the computation of the splitting criterion makes OAHD skew-insensitive. Furthermore, to guarantee the purity of the nodes in the decision tree, OAHD accounts for the number of distinct classes and the class distribution of the multiclass imbalance problem; meanwhile, OAHD uses a modified Gini index that fits the multiclass imbalance problem. Therefore, the performance of OAHD is greatly improved on the multiclass imbalance problem.
The major goal of this work was to construct a decision tree built upon the one-against-all-based Hellinger distance (OAHD) for the multiclass imbalanced classification problem. To enhance the performance of the decision tree on multiclass imbalance problems, we designed a new splitting criterion (OAHD) that extends the Hellinger distance with the one-against-all scheme. In OAHD, the number of distinct classes and the class distribution are considered without the prior probability of the classes; meanwhile, we modified the Gini index to fit the multiclass imbalance problem, which ensures the purity of the nodes in the decision tree. Furthermore, we theoretically proved that the proposed splitting criterion endows the decision tree with skew-insensitivity and the ability to seek purer nodes. Finally, OAHD was compared with five different unpruned decision trees on 20 data sets. The experimental results show that the proposed splitting criterion is better than the five other splitting criteria, and the Friedman and Nemenyi tests demonstrate that the improvement is statistically significant.
In future work, to improve the performance of OAHD, we intend to explore the effect of pruning on OAHD. Furthermore, we intend to extend OAHD to tree-based ensemble classifiers.
Contributors
Minggang DONG guided the research. Ming LIU designed the research and drafted the paper. Chao JING helped organize the paper. Minggang DONG and Chao JING revised and finalized the paper.
Compliance with ethics guidelines
Minggang DONG, Ming LIU, and Chao JING declare that they have no conflict of interest.
Frontiers of Information Technology & Electronic Engineering, 2022, Issue 2