Kulanthaivel BALAKRISHNAN,Ramasamy DHANALAKSHMI
Department of Computer Science and Engineering,Indian Institute of Information Technology,Tiruchirappalli 620012,India
Abstract:For optimal results,retrieving a relevant feature from a microarray dataset has become a hot topic for researchers involved in the study of feature selection(FS)techniques.The aim of this review is to provide a thorough description of various,recent FS techniques.This review also focuses on the techniques proposed for microarray datasets to work on multiclass classification problems and on different ways to enhance the performance of learning algorithms.We attempt to understand and resolve the imbalance problem of datasets to substantiate the work of researchers working on microarray datasets.An analysis of the literature paves the way for comprehending and highlighting the multitude of challenges and issues in finding the optimal feature subset using various FS techniques.A case study is provided to demonstrate the process of implementation,in which three microarray cancer datasets are used to evaluate the classification accuracy and convergence ability of several wrappers and hybrid algorithms to identify the optimal feature subset.
Key words:Feature selection;High dimensionality;Learning techniques;Microarray dataset
In the last two decades, the development of DNA microarray (MA) datasets has stimulated a new wave of research in bioinformatics and machine learning (ML). The gene expression patterns of malignant and normal cells in MA datasets are used for cancer diagnosis in clinical research. An MA dataset typically contains very few samples and a very large number of features (Hambali et al., 2020). The number of features in the raw data ranges from 6000 to 60000 because gene expression is analyzed as a whole, whereas the numbers of training and testing samples are generally quite small (frequently fewer than 100) (Alonso-Betanzos et al., 2019; Bolón-Canedo and Remeseiro, 2020).
Several studies have demonstrated that the majority of genes detected in MA research are not important in accurately identifying an ailment. Feature selection (FS) is a pivotal preprocessing step in ML tasks (including classification, clustering, association, and regression) that helps overcome the problem of high dimensionality (Chen RC et al., 2020). Selecting relevant features from the MA dataset provides high accuracy and lowers the computational complexity. The optimality of the chosen feature subset is assessed using a predetermined criterion. To illustrate the concept, Fig. 1 gives a non-technical view of FS.
Fig.1 Non-technical perspective of feature selection
Various researchers have focused on elaborating and evaluating FS on supervised learning tasks due to the prevalence of FS in varied disciplines such as medicine (Remeseiro and Bolon-Canedo, 2019), engineering (Shadravan et al., 2019), and healthcare (Tadist et al., 2019). Fig. 2 illustrates the process of a conventional FS technique. In the first step, a candidate subset is generated from the original MA dataset with a valid search procedure. In the second step, the candidate subset is evaluated and compared with the best subset found so far. If the new subset scores better than the previous one, it replaces it; otherwise, the previous subset is kept. The procedure continues until the stopping condition is satisfied. The best subset is then chosen and given as the input to the classification procedure for validation.
Fig.2 A general feature selection process
FS techniques can be grouped in two ways: by label status and by search strategy (Saw and Myint, 2019). By label status, methods are classified according to whether the samples are labeled, giving supervised and unsupervised methods. The majority of well-known supervised FS algorithms rely on the creation of a similarity matrix to choose features based on the graph structure. Prominent FS algorithms include the Fisher score and linear discriminant analysis (LDA). Unsupervised FS is considered a challenging issue because there is no label information to assist in searching for relevant features in the data.
Search strategy based FS approaches are classified into five different categories:
1. Filter method (FM). Filters choose the optimal feature subset based on the general characteristics of the input data rather than on a learning algorithm. In most situations, filters use a set of assessment metrics to calculate a feature’s score (Saeys et al., 2007). Pearson correlation (Mangal and Holm, 2018), Chi-squared test (Ranjani and Ramyachitran, 2018), entropy, Fisher score (Prasad et al., 2018), ANOVA (Sahu et al., 2017), relief (Ebrahimpour et al., 2018), information gain (IG) (Arunkumar and Ramakrishnan, 2018), and minimum redundancy maximum relevance (mRMR) (Dong et al., 2018) are well-known examples of conventional FMs. Fig. 3 illustrates the taxonomy of FS strategies.
Fig.3 Taxonomy of feature selection
2. Wrapper method. The wrapper technique chooses the optimal feature subset using learning algorithms. Subsets are generated using regular search techniques, while the quality of the chosen features is assessed through learning algorithms. The drawback of this strategy is that it requires more computation time than the filter technique. Sequential forward selection (SFS), sequential backward selection (SBS) (Maldonado et al., 2014), beam search (Urbanowicz et al., 2017), genetic algorithm (GA) (McCall, 2005), particle swarm optimization (PSO) (Shao et al., 2012), advanced binary ant colony optimization (ACO) (Kashef and Nezamabadi-Pour, 2013), harmony search (HS) (Diao and Shen, 2012), differential evolution (DE) (Storn and Price, 1997), whale optimization algorithm (WOA) (Mirjalili S and Lewis, 2016), and artificial bee colony (ABC) (Gao et al., 2012) are some notable wrapper methods.
3. Embedded method. This method integrates the filter and wrapper methods and addresses the concerns of both techniques. Embedded methods keep track of relevant features while learning at the initial stage. Support vector machine (SVM) (Bouazza et al., 2018), decision tree (Shukla et al., 2019), weighted naive Bayes (Rani and Devaraj, 2019), kernel-penalized SVM (KP-SVM) (Maldonado and López, 2018), and LASSO (Rahimipour and Usefi, 2019) are prominent examples of embedded methods.
4. Hybrid method. The hybrid method hybridizes wrapper and filter methods in one of two ways (Aziz et al., 2017). In the first, the filter approach is employed as a preprocessing step, followed by the wrapper approach, giving a two-stage procedure. In the second, local search algorithms are integrated into either a filter or a wrapper method. Recent examples of the hybrid method are the hybrid salp swarm algorithm (SSA) (Balakrishnan et al., 2021), hybrid WOA (Liu et al., 2020), and hybrid Harris hawk optimization (HHO) (Houssein et al., 2020).
5.Ensemble method.This method uses a set of feature subsets from various base classifiers(Seijo-Pardo et al.,2017).Functionally,two categories of ensemble technique are defined by being heterogeneous or homogeneous.The first method employs several selection algorithms on the same dataset.In contrast,the second method employs the same selection strategy across multiple,dispersed sets of the same dataset,such as the multicriterion fusion-based recursive feature elimination(MCF-RFE)(Yang and Mao,2011),multilayer perceptron(MLP)(Bramer,2007),and random forest(RF).
Researchers have used the above-mentioned five approaches to address problems in high-dimensional datasets.This review strives to identify research on this subject and highlight some of the unresolved problems.The primary objectives of this research are as follows:
1.to present an intensive overview of FS methodologies based on search strategies and an insight into contemporary FS techniques used on existing MA datasets;
2.to give a comprehensive literature review on these five types of FS techniques—filter,wrapper,embedded,hybrid,and ensemble;
3.to trace the difficulties and research concerns associated with developing an FS algorithm;
4.to thoroughly show the procedure for computing performance evaluation metrics;
5.to present a case study to understand the performances of different FS approaches;and
6. to evaluate the efficiency of a few well-known wrapper and hybrid FS algorithms on three MA cancer datasets.
This section comprises a meticulous overview of MA datasets. Biologists employ DNA MA technologies to track gene expression in afflicted cells and identify responsible biomarkers (van Hal et al., 2000). MA gene expression data also has a significant impact on cancer research. Thousands of genes and a limited number of samples make it a high-dimensional dataset. MA experimentation data is organized and stored as a matrix of dimensions m×n.
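One common way to write such a matrix is sketched below. The orientation is an assumption, since the text does not state it: here the m rows index samples, the n columns index genes, and x_{ij} denotes the expression level of gene j in sample i.

```latex
X =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{pmatrix},
\qquad m \ll n .
```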
MA data’s high dimensionality combines superfluous information alongside vital characteristics,making it difficult to identify the best outcomes(Cooper,2001).MA datasets are often found with binary categories and multiple categories in the target variable.The challenges associated with MA datasets are as follows:
1.The computation time is high due to the problems of high dimensionality and irrelevant and insignificant information.
2.The imbalanced MA dataset negatively impacts the learning process,which leads to lower accuracy.
Table 1 lists several datasets that have been used by academics in recent years and are publicly available for experimentation.
Table 1 Microarray datasets
In general,the FS approach involves a five-step process:search direction,search strategy,FS techniques,stopping criteria,and evaluation measures.Fig.4 shows the steps involved in FS.The following subsections discuss these steps in detail.
Fig.4 Steps involved in feature selection
FS selects a search direction at the initial stage.There are four typical search directions:
1.Forward search:The forward search process begins with an empty set of characteristics.For each succeeding phase,it adds a random feature or a unique feature.
2.Backward search:The entire list of features is used to begin a backward search.It removes a random feature or a feature that optimizes some objectives in each successive phase.
3.Bi-directional:The benefits of forward search and backward search are combined in bi-directional search.For each step of the search,a feature is either added or removed.
4. Random search: Random search can also give good results for FS. It selects features at random, tests the model’s performance, and iterates as required. Random search creates a random integer N between 1 and the number of attributes. It produces a random series of N integer values ranging from 0 to N–1 with no repetitions. Then, the model is run on the corresponding features and validated, and the average value of a chosen performance metric is recorded. Finally, the feature array that gives the best performance on the chosen metric is retained.
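The random search procedure can be illustrated with a minimal Python sketch. Several choices here are assumptions made for illustration and are not taken from the text: the helper names `loo_accuracy` and `random_search`, the use of a 1-nearest-neighbor classifier with leave-one-out accuracy as the performance metric, the toy data, and the interpretation that the N indices are drawn from the full range of feature positions.

```python
import random

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier
    restricted to the given feature indices (illustrative metric)."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue
            d = sum((xi[f] - xj[f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)

def random_search(X, y, n_trials=50, seed=0):
    """Draw random feature subsets and keep the best-scoring one."""
    rng = random.Random(seed)
    n_features = len(X[0])
    best_features, best_score = None, -1.0
    for _ in range(n_trials):
        n = rng.randint(1, n_features)                       # random subset size N
        features = sorted(rng.sample(range(n_features), n))  # N distinct indices
        score = loo_accuracy(X, y, features)
        if score > best_score:                               # keep the best subset
            best_features, best_score = features, score
    return best_features, best_score

# Toy data (assumed): feature 0 separates the classes, features 1 and 2 are noise.
X = [[0.10, 5.0, 2.0], [0.20, 1.0, 9.0], [0.15, 8.0, 4.0],
     [0.90, 2.0, 3.0], [1.00, 7.0, 8.0], [0.95, 4.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]
features, score = random_search(X, y)
print(features, score)
```

The number of trials acts as the stopping condition; on real MA data, each evaluation would be a full cross-validated model fit, which is where most of the cost lies.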
An efficient search strategy must achieve rapid convergence and deliver an optimum solution with low computing costs and strong global search capabilities.The following are the three most common search strategies:
1. Sequential search: Sequential search, with sequential forward search as an example, follows a fixed order in selecting the optimal feature subset. This technique is vulnerable to feature interaction and risks becoming trapped in local minima.
2. Exponential search: This is an exhaustive search that guarantees an optimal solution, but it is costly. This method evaluates all potential feature subsets before selecting the best one, which is computationally demanding, especially for large datasets.
3. Heuristic search: Heuristic search is carried out using a cost measure or a heuristic function that improves the result iteratively. It often does not find the best solution, but it finds an acceptable one within reasonable time and memory.
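The trade-off between sequential and exponential search can be made concrete with a small Python sketch. The criterion below is a made-up toy function (an assumption for illustration) in which features 1 and 2 are useful only together, so a greedy forward search that starts with feature 0 never finds the best pair; the function names are hypothetical.

```python
from itertools import combinations

def exhaustive_search(n_features, criterion):
    """Evaluate every non-empty subset: 2**n - 1 evaluations."""
    best, best_score, evaluated = (), float("-inf"), 0
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            evaluated += 1
            score = criterion(subset)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score, evaluated

def sequential_forward_search(n_features, criterion):
    """Greedily add the single best feature until no addition helps."""
    selected, best_score, evaluated = [], float("-inf"), 0
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [(criterion(selected + [f]), f) for f in candidates]
        evaluated += len(scored)
        score, feature = max(scored)
        if score <= best_score:          # no improvement: stop
            break
        selected.append(feature)
        best_score = score
    return tuple(selected), best_score, evaluated

def criterion(subset):
    """Toy score: feature 0 helps alone, features 1 and 2 help only jointly."""
    s = set(subset)
    value = 0.0
    if 0 in s:
        value = 0.6
    if {1, 2} <= s:
        value = 0.9
    return value - 0.01 * len(s)         # small penalty per selected feature

exhaustive = exhaustive_search(4, criterion)
greedy = sequential_forward_search(4, criterion)
print(exhaustive)  # best subset (1, 2) after 15 evaluations
print(greedy)      # stuck at subset (0,) after 7 evaluations
```

With four features the exhaustive search needs 15 evaluations against 7 for the greedy search; the gap grows exponentially with the number of features, which is why exhaustive search is impractical for MA data.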
This subsection includes a comprehensive summary of five main FS approaches and a detailed analysis.
3.3.1 Filter methods
FM checks whether a particular feature is relevant in knowledge discovery regardless of the error rate. A feature/relevance score is first calculated for each feature and compared with those of the remaining features in the dataset. Low-scoring features are then discarded from the MA dataset. The recommended optimal subset of features is then passed to the classification approach for validation. The two categories of FM are univariate and multivariate. The univariate category examines each feature independently, whereas the multivariate category evaluates features with respect to the relationships among them. Fig. 5 shows the basic steps involved in the filter approach.
Fig.5 Filter approach
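A univariate filter of this kind can be sketched in a few lines of Python. The sketch assumes the absolute Pearson correlation between each feature and the class label as the relevance score and keeps the top k features; the function names, the choice of score, and the toy data are illustrative assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def filter_top_k(X, y, k):
    """Score each feature independently, rank, and keep the k best."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores

# Toy data (assumed): feature 0 tracks the label, feature 1 is constant,
# feature 2 is unrelated to the label.
X = [[1.0, 5.0, 0.3], [1.2, 5.0, 0.9], [3.1, 5.0, 0.4], [2.9, 5.0, 0.8]]
y = [0, 0, 1, 1]
selected, scores = filter_top_k(X, y, k=1)
print(selected, scores)
```

No classifier is involved in the scoring, which is what makes filters fast; the selected columns would then be handed to a classifier for validation.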
1.Traditional filter methods
Some of the traditional FMs that are documented in the literature include the following types:
(1)Correlation is a mathematical tool that uses the correlation coefficient to select features or estimate the linear relationship between two variables or features.It is a relationship that aids in determining the correctness of the features in the dataset pertinent to hypothesis testing.The correlation value lies in the range of[–1,1].It includes the Pearson correlation,Kendall correlation,Spearman correlation(SC),concordance correlation,intraclass correlation,Moran’s I,Phi coefficient,point biserial correlation,polychoric correlation,and zero-order correlation(Sakae et al.,2019).These are examples of some widely used correlations.IG selects the features in an ordered ranking based on the threshold value(Hira and Gillies,2015).The threshold value helps select the features with positive IG for the next process.
(2)Entropy of a subset plays a vital role in calculating IG.Mutual information(MI)is a measure for calculating the dependency of two random characteristics on one another.This approach determines how much information one variable has over another(Vergara and Estévez,2014).
(3)mRMR focuses on MA between two classes in a high-dimensional dataset.This approach identifies the minimum redundancy and the maximum relevance of both discrete and continuous variables.It is simple to implement,and it produces more accurate results than other approaches(Peng et al.,2005).Symmetrical uncertainty(SU)is a suitable metric for assessing the quality of features.SU is a condensed version of MI that ranges from 0 to 1.As a result,the total number of contrasts is limited(Hall,1999).
(4)Relief evaluates the relevance score of a certain attribute by comparing neighboring distances in the same class and between classes.The modified relevance score,which ranges from–1(irrelevant)to 1(relevant),chooses relevant features.The main flaw of the algorithm is its rigid binary categorization.As a result,a new relief algorithm has been created to effectively handle missing data in the dataset(Jain et al.,2018).
(5) ANOVA aims to find the difference between the means of two or more groups. It also assesses the differences between groups as well as within groups (Arowolo et al., 2016). ANOVA produces a p-value and an F-statistic value. The p-value is used to rank relevant features, which decreases the computational complexity. The Laplacian score uses an unsupervised feature-filter system in which similar features are potentially connected to the same class (He et al., 2016). Independent component analysis (ICA) is an FS approach for extracting independent components from non-Gaussian data by finding a linear representation. ICA decomposes a feature when it is statistically independent of others (Zheng et al., 2006). The instance-based learning (IBL) filtering technique ranks the features by monitoring the instances. In IBL, various instances assign different ratings to various features (Aha et al., 1991). With the Bhattacharyya distance, the most important genes are chosen by minimizing an upper bound on the overall error probability. The covariance and the mean vector of each class are used to measure the Bhattacharyya distance (Xuan et al., 2006).
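The information-theoretic scores in the list above (entropy, MI, and SU) are closely related, as the following minimal Python sketch for discrete features shows. It uses the standard definitions MI(X;Y) = H(X) + H(Y) − H(X,Y) and SU(X,Y) = 2·MI(X;Y)/(H(X) + H(Y)); the function names are illustrative, and continuous expression values would first have to be discretized, which is not shown.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """MI(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def symmetrical_uncertainty(x, y):
    """MI normalized to the range [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2 * mutual_information(x, y) / (hx + hy)

label = [0, 0, 1, 1]
su_informative = symmetrical_uncertainty([0, 0, 1, 1], label)  # feature copies the label
su_irrelevant = symmetrical_uncertainty([0, 1, 0, 1], label)   # feature independent of label
print(su_informative, su_irrelevant)  # 1.0 0.0
```

Because SU is bounded, scores of features with different numbers of distinct values can be compared directly, which raw MI does not allow.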
2.Recent filter methods
Mazumder and Veilumuthu (2019) proposed a novel FM using Joe’s normalized MI to improve filter-based FS and applied it to select optimal features. Multi-class MA cancer datasets were assessed using five different classifiers. For all five classifiers on all MA datasets studied, the suggested technique exhibited strong improvements in terms of classification accuracy (improvement of 5.1%) and area under the curve (AUC) values while reducing classification time (a median of 2.86 s was saved during training). By integrating SC and distributed filter FS approaches, Shukla and Tripathi (2019) developed a two-stage FS methodology that is able to choose effective feature genes for discriminating samples from MA datasets. The suggested model’s aim was to enhance classification accuracy on MA datasets. It was also employed to estimate the gene–gene and gene–class relationships, as well as to find groups of important genes at the same time. On six datasets, the experimental findings showed that the suggested technique achieves a large reduction of features with the SVM classifier and high prediction performance in terms of accuracy, sensitivity, precision, and F-measure. Kavitha et al. (2020) proposed filter-based FS to reduce computation time and improve classification and prediction accuracy. An SVM classifier was used to evaluate the Leukemia dataset with different conventional filter-based FS methods. The use of a score-based strategy for FS can yield better results than employing separate methods. Ke WJ et al. (2018) suggested a score-based criteria fusion (SCF) for FS, which uses two composite score filters to choose factors based on an estimate of feature–class significance. The authors tested the suggested technique using SVM and K-nearest neighbor (KNN) on five MA datasets. Compared with previous approaches, the suggested method was highly efficient in picking highly relevant features.
Yuan et al.(2019)introduced a novel FS for MA data categorization called partial maximum correlation information(PMCI).This study introduced a novel,general,class-encoding approach for multi-class situations.To optimize classification accuracy,SVM and KNN classifiers were used with 10 benchmark MA datasets and compared to traditional techniques.Zare et al.(2019)suggested a novel FS for MA datasets.They suggested using matrix factorization and singular value decomposition.The approach suggested was used to evaluate the criteria for relevance and redundancy.The KNN classifier was used to evaluate the proposed model’s accuracy on nine benchmark datasets.
3.Inferences from filter-based feature selection
Much research has successfully employed state-of-the-art filters as a pre-processing step in the hybrid FS approach to minimize the number of features that must be evaluated in the wrapper stage. It is a promising approach for researchers because it yields more accurate findings than filter-only FS while taking less time than wrapper FS methods.
Following this, we look at the classification accuracy of features chosen by the FM as well as the runtime required for FS, and develop a suitable classification model based on the chosen features. We observe that no single FM outperforms the others across all datasets.
4.Specific ideas to handle filter-based feature selection problems
One of the obvious issues in FM is the discrepancy of results that occurs when various methods are applied to the same dataset and provide conflicting outcomes.This discrepancy in FM shows that this issue might contribute to the improper feature being selected,affecting the quality of classification models.In this situation,the process of normalizing feature scores and then combining them into a single unified score is effective in lowering the volatility of FS results.
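The normalize-then-combine idea can be sketched as follows. The sketch assumes min-max normalization and an unweighted mean as the fusion rule; both choices, the function names, and the example scores are assumptions for illustration, and other normalizations or weighted combinations are equally possible.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse_scores(score_lists):
    """Average the normalized scores of several filters, feature by feature."""
    normalized = [min_max(scores) for scores in score_lists]
    n_features = len(score_lists[0])
    return [sum(norm[j] for norm in normalized) / len(normalized)
            for j in range(n_features)]

# Hypothetical scores for three features from two filters on different scales.
correlation_scores = [0.9, 0.1, 0.5]
f_statistic_scores = [30.0, 2.0, 44.0]
fused = fuse_scores([correlation_scores, f_statistic_scores])
ranking = sorted(range(len(fused)), key=lambda j: fused[j], reverse=True)
print(fused, ranking)
```

Here the two filters disagree on the top feature (feature 0 versus feature 2), and the fused score resolves the conflict into a single ranking.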
All FM FS approaches employ a “ranker” to assess the attribute scores generated using statistics, information theory, or some function of the classifier’s output. Domain experts employ feature ranking as a basic approach in determining the best feature subset. However, ranker search methods do not specify how many features should be chosen; instead, they leave this up to the domain experts. We conclude that no ranker approach is sophisticated enough to distinguish significant features from redundant ones without the help of domain experts. Furthermore, no study has proposed an intelligent solution for ranking inside filter techniques. Additional research and analysis are required to produce more sophisticated rankers that can be employed successfully with any FS approach.
Only a few studies have emphasized the significance of detecting feature-to-feature correlation to improve the performance of the FS process. Using relevance and redundancy analysis, fast correlation based FS (FCBF), mRMR, and F-statistic are used to choose relevant features and find the optimal features from the specified set to improve the selection process.
To provide fair and reliable findings, FS requires properly balanced data. However, having completely balanced data is not always realistic. We therefore emphasize the need for a viable technique to balance imbalanced data prior to FS to achieve better outcomes. Automated sampling strategies should be incorporated into FMs to detect and correct imbalance without altering the original samples.
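As one simple example of such a sampling strategy, the sketch below performs random oversampling: minority-class samples are duplicated at random until every class matches the size of the largest class, and the original samples are left unchanged. This particular technique and the function name `random_oversample` are assumptions chosen for illustration; the text does not prescribe a specific method.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)          # original samples are kept as-is
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):      # top up the minority classes
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling should be applied only to the training partition, otherwise duplicated samples can leak into the test set and inflate the reported accuracy.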
3.3.2 Wrapper methods
In wrapper methods, each additional feature exponentially increases the number of candidate subsets. The model hypothesis is coupled with the classifier in the search space to provide a more reliable classification result. The wrapper method often employs evolutionary or bio-inspired algorithms to aid the search process. The feature subset is tested with a fitness function based on a learning technique. The wrapper technique usually has higher computing costs and is more likely to overfit, but it generally outperforms the filter approach in terms of classification performance. Fig. 6 shows the basic steps involved in the wrapper approach.
Fig.6 Flow diagram of the wrapper approach
1.Traditional wrapper methods
GA is inspired by the biological evolution process(Siedlecki and Sklansky,1989).GA is undoubtedly the most commonly used technique for FS problems.The goal of GA is to spontaneously generate a population with the same hereditary traits.The algorithm consists of three operations:selection,crossover,and mutation.By the selection mechanism,the fittest chromosomes are selected as the algorithm progresses to the following generation.
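The three GA operations can be sketched for FS by encoding each candidate subset as a binary mask. The sketch below is a minimal illustration under several assumptions that are not taken from the text: binary tournament selection, one-point crossover, bit-flip mutation, the parameter values, and a toy fitness function standing in for the accuracy of a classifier trained on the selected features.

```python
import random

def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Evolve binary feature masks; returns the best mask found and its fitness."""
    rng = random.Random(seed)

    def tournament(population):
        # Selection: the fitter of two random chromosomes becomes a parent.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)   # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        generation_best = max(population, key=fitness)
        if fitness(generation_best) > fitness(best):
            best = generation_best
    return best, fitness(best)

def toy_fitness(mask):
    """Assumed toy criterion: features 0 and 3 are useful, each selected feature costs 0.1."""
    return mask[0] + mask[3] - 0.1 * sum(mask)

mask, score = genetic_feature_selection(6, toy_fitness)
print(mask, score)
```

In a real wrapper, `toy_fitness` would be replaced by a cross-validated classifier score, which is evaluated once per chromosome per generation and dominates the running time.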
PSO is inspired by the foraging technique of a flock of birds(Eberhart and Kennedy,1995).The goal of PSO is to locate the most optimal positions of all particles.Each search agent in the design matrix updates its position concerning the personal and global best values gathered to find the optimal solution.
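For FS, particle positions must be binary. One common adaptation, assumed here and not described in the text, keeps real-valued velocities updated from the personal and global best positions and then passes each velocity through a sigmoid to obtain the probability that the corresponding bit is set. The parameter values (w, c1, c2) and the toy fitness function are likewise illustrative assumptions.

```python
import math
import random

def binary_pso(n_features, fitness, n_particles=15, iterations=40,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO sketch: returns the global best mask and its fitness."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                       # personal best positions
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]      # global best position
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Velocity is pulled toward the personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Sigmoid transfer: velocity becomes the probability of a 1 bit.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

def toy_fitness(mask):
    """Assumed toy criterion: features 0 and 3 are useful, each selected feature costs 0.1."""
    return mask[0] + mask[3] - 0.1 * sum(mask)

mask, score = binary_pso(6, toy_fitness)
print(mask, score)
```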
ACO was employed in the process of FS for the first time in Ke LJ et al.(2008).ACO mimics the behavior of ants by determining the shortest route between the ants’nest and the food supply.As a result,ACO seeks to find the best way inside the weighted graph.Ants can make their way back to the nest using the pheromone they have left on the trail.
Heidari et al. (2019) developed the HHO, a swarm-based optimization method that mimics the Harris hawk’s attacking methodology. The main objective of HHO is to accomplish single- and multi-objective assignment by imitating the hawk’s hunting methodology in nature. HHO is a theoretically efficient optimizer that solves complicated nonlinear problems quickly and effectively.
Mirjalili S et al. (2017) developed the SSA, a population-based optimization method inspired by the foraging technique of sea salps. SSA mimics the formation of the salp chain in the search for food sources. Individuals are classified as leaders or followers in SSA based on their place in the salp chain. The position of the leader determines the movement of the salp chain.
The grasshopper optimization algorithm(GOA)is inspired by swarms of grasshoppers(Mirjalili SZ et al.,2018).Grasshopper positions in this algorithm represent candidate solutions.The location of the grasshopper is determined by its social interaction,gravity,and wind advection.WOA is a swarm-based metaheuristic(MH)inspired by the hunting methodology of humpback whales(Tubishat et al.,2019).Humpback whales use a unique hunting technique known as bubble-net feeding to hunt schools of tiny krill fish.The hunting technique of the humpback whales is classified into three typical phases:encircling the prey,bubble-net attacking,and searching for prey.
2.Recent wrapper methods
Almugren and Alshamlan(2019)suggested the FF-SVM,a wrapper FS approach,for categorizing cancer MA gene expression that employs the firefly algorithm(FFA)and an SVM classifier.Five classical microarray datasets(Leukemia1,SRBCT,Lung,Colon,and Leukemia2)were used to assess the suggested model’s accuracy.Over the Leukemia1 dataset with three genes and the Lung dataset with two genes,FF-SVM achieved a classification accuracy of 100%.
Ragunthar and Selvakumar(2019)suggested a wrapper-based FS approach called ABC-SDS,which combines hybridized ABC and stochastic diffusion search(SDS)algorithms.The superiority of the suggested approach is assessed with two MA datasets(GDS 531 and GDS 2643),evaluated using the SVM classifier,and compared with the classical FS methods in terms of accuracy,sensitivity,specificity,and the F1-score(Ragunthar and Selvakumar,2019).
Tawhid and Ibrahim(2020)proposed a wrapper method based FS model using a binary variant of WOA.The suggested model uses three different classifiers(LR,C4.5,and NB)on a breast cancer dataset.The suggested algorithm was assessed with 32 benchmark University of California Irvine(UCI)datasets.The suggested approach achieved the highest mean accuracy for the MA dataset using LR(97%),C4.5(98%),and NB(97%).
Recently,Balakrishnan et al.(2021)suggested a wrapper-based FS approach based on enhanced SSA using the Lévy flight approach.Six different high-dimensional MA datasets were used(Breast Cancer,CNS,Ovarian,OSCC,Colon,and Leukemia)in this study.The authors used an SVM classifier to assess the selected features in terms of precision,recall,F1-score,and accuracy.The suggested model outperformed SSA by offering a 0.1033% higher confidence in the specified characteristics.
Ghosh et al.(2019)employed a wrapper-based FS method using the recursive memetic algorithm(RMA).To assess the superiority of the suggested approach,seven different MA datasets were employed in the three well-known classifiers.RMA surpassed both the GA and standard memetic algorithm in terms of performance.It achieved 100% accuracy in all circumstances while using a relatively minimum number of genes.
Based on a unique fitness function and the binary bat algorithm, Nakamura et al. (2012) suggested a novel wrapper-based FS approach. This study aimed to reduce intra-class distances while increasing inter-class distances. To test the performance of the suggested strategy, they employed an extreme learning machine as a classifier as well as eight MA datasets. According to the findings, the new fitness function surpassed conventional fitness functions in terms of classification accuracy, precision, recall, specificity, and F1-score.
Table 2 highlights several other studies that used this wrapper-based method.
3.Inferences from wrapper-based feature selection
Researchers are paying some attention to this area. Some of this research combines two different swarm intelligence (SI) algorithms in a wrapper to combine their benefits, whereas other research simply uses a single SI algorithm. The issues with this technique are that the feature space is large and examining every potential combination takes a large amount of processing time. Some algorithms employ local search algorithms (such as simulated annealing, local beam search, and hill-climbing search) to enhance the proposed approach. A few algorithms (such as the binary bat algorithm, WOA, and SSA) use the fitness function to enhance the suggested approach. In the last decade, wrapper-based algorithms such as PSO and GA have been widely used. High-dimensional MA datasets are used in the majority of studies. SVM and KNN are the most frequently used classifiers in this research area to examine classification tasks.
4.Specific ideas to handle wrapper-based feature selection problems
Features are chosen using the fundamental learning process in the wrapper approach,although it is computationally expensive owing to the iterative picking of the optimal subset of features.The search overhead associated with sequential search techniques is also a disadvantage.To get around this,heuristic search and optimum search techniques based on a bio-inspired algorithm are used to find the best characteristics with the least amount of overhead.As a result,there is much to improve in search algorithms for improved feature subset selection.
The stability of wrapper-based FS approaches is a critical issue that must be addressed. In this context, stability is described as the capability to pick the same set of features regardless of variance in training sample partitioning. Aggregating the results of several trials to form a stronger approximation of the key features, essentially wrapping the wrapper, is one strategy to mitigate the instability of sequential feature selection strategies. Parallel search techniques and evolutionary algorithms are two alternative search strategies that may provide a solution to the problem of stability.
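Both ideas, measuring stability and aggregating several trials, can be sketched briefly. The sketch assumes the mean pairwise Jaccard similarity as the stability measure and a selection-frequency threshold as the aggregation rule; these specific choices, the function names, and the example subsets are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def stability(subsets):
    """Mean pairwise Jaccard similarity across runs (1.0 = identical subsets)."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def aggregate(subsets, min_fraction=0.5):
    """Keep the features selected in at least min_fraction of the runs."""
    counts = Counter(f for s in subsets for f in set(s))
    return sorted(f for f, c in counts.items() if c / len(subsets) >= min_fraction)

# Hypothetical subsets returned by one wrapper on three training partitions.
runs = [{1, 4, 7}, {1, 4, 9}, {1, 5, 7}]
stability_score = stability(runs)
consensus = aggregate(runs)
print(stability_score, consensus)  # about 0.4 and [1, 4, 7]
```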
Another key challenge is determining which classifiers are acceptable for the dataset.This may lead to a reduction of classification performance in the predictive model.SVM and KNN are often used classifiers in this research area to examine classification tasks.
3.3.3 Embedded methods
Despite the deployment of numerous embedded FS techniques in recent years,a unified conceptual model is yet to be developed.Embedded techniques are computationally less expensive than wrapper approaches.They use the core of the learning algorithm for rating the features.The purpose of the embedded methodology is to minimize the computation time required to reclassify distinct feature groups.Fig.7 shows the basic steps involved in the embedded approach.
1.Embedded methods from 2002 to 2012
Guyon et al.(2002)suggested an SVM based on recursive feature elimination(SVM-RFE)as one of the most prominent embedded techniques.It was designed with the sole purpose of identifying genes that would be used to discover types of cancers.
Maldonado and Weber(2011)proposed a unique embedded methodology that simultaneously picks key features throughout classifier modeling by penalizing individual features employed in the dual form of SVM.Kernel penalized SVM(KP-SVM)outperformed competing techniques with increasingly fewer relevant characteristics.
Anaissi et al.(2011)developed a novel embedded technique based on the RF algorithm to address the issue of data imbalance predominantly in MA datasets.The methodology used several tactics and algorithms to tackle the problems of complicated gene expression in the Leukemia dataset.A strategy was used to find the optimal training error cost for a specific class and to deal with data imbalance.Finally,the RF chose the most important characteristics and prevented the learning model from overfitting.
Canul-Reich et al.(2012)proposed an iterative feature perturbation(IFP)approach as an embedded gene selector.They used four distinct MA datasets to test the proposed approach.If adding noise to a feature substantially changes classifier performance,the feature is considered significant.When compared to the SVM-RFE approach,the IFP technique achieved similar or greater average class accuracy on three of the four datasets tested.
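The perturbation idea can be illustrated with a simplified Python sketch. It is not the IFP algorithm of Canul-Reich et al. (2012): a nearest-centroid classifier replaces the SVM, Gaussian noise is assumed, features are scored in a single pass instead of being eliminated iteratively, and all names and parameter values are hypothetical.

```python
import random

def centroids(X, y):
    """Per-class mean vectors of the training data."""
    result = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        result[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return result

def accuracy(X, y, cents):
    """Accuracy of nearest-centroid prediction."""
    correct = 0
    for x, label in zip(X, y):
        pred = min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))
        correct += pred == label
    return correct / len(X)

def perturbation_scores(X, y, noise_std=2.0, repeats=20, seed=0):
    """Mean accuracy drop when noise is added to one feature at a time."""
    rng = random.Random(seed)
    cents = centroids(X, y)
    baseline = accuracy(X, y, cents)
    scores = []
    for j in range(len(X[0])):
        drop = 0.0
        for _ in range(repeats):
            noisy = [row[:j] + [row[j] + rng.gauss(0.0, noise_std)] + row[j + 1:]
                     for row in X]
            drop += baseline - accuracy(noisy, y, cents)
        scores.append(drop / repeats)
    return scores

# Toy data (assumed): feature 0 separates the classes, feature 1 is constant.
X = [[0.0, 5.0], [0.2, 5.0], [0.1, 5.0], [1.0, 5.0], [1.2, 5.0], [1.1, 5.0]]
y = [0, 0, 0, 1, 1, 1]
scores = perturbation_scores(X, y)
print(scores)
```

A feature whose perturbation leaves the accuracy unchanged, such as the constant feature here, receives a score of zero and is a candidate for removal.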
2.Recent embedded methods
Zhang G et al. (2020) introduced a rule-based FS technique based on the first-order inductive learner (FOIL). The suggested technique generated a classification rule using a modified propositional version of the FOIL algorithm. The subset features obtained in previous rules were then combined to create a candidate feature subset that eliminates duplicate features while maintaining interactive and relevant features.
El Kafrawy et al.(2021)proposed a new embedded technique based on SVM-mRMRe.The model was tested using eight of the most widely employed MA datasets for diverse forms of cancer.Four types of classifiers,i.e.,RF,MLP,KNN,and SVM,were used to assess the chosen subset feature.The suggested model minimized the time consumption and complexity while improving the distinction of cancerous and benign tissues,according to the findings.
Albashish et al.(2021)suggested an embedded FS model using binary biogeography optimization(BBO)and SVM-RFE.The basic purpose of FS approaches was to maximize the classification performance while reducing the number of features used.SVM-RFE was incorporated in the BBO to increase the quality of the produced results in the mutation operator,thus improving exploitation capabilities and establishing an appropriate balance between exploitation and exploration of the existing BBO.In terms of accuracy and quantity of chosen features,the BBO-SVM-RFE technique surpassed the BBO method as well as other current wrapper and filter methods.
Dabba et al. (2021a) suggested a new embedded technique to deal with gene selection in MA datasets. MI maximization (MIM) was used to determine the genes’ relevance and redundancy, whereas mMFA was used to develop gene sets, which were then assessed using the fitness function. The findings of this study, obtained on 16 binary-class and multi-class benchmark datasets, showed that the MIM-mMFA technique delivers better classification accuracy. Kang et al. (2019) suggested a new embedded approach termed the relaxed LASSO-GenSVM (rLGenSVM) for classification tasks. To test the proposed FS strategy, they employed four two-class MA datasets and four multi-class datasets with GenSVM as a classifier. rLGenSVM achieved 100% accuracy in six datasets.
By optimizing parameters in kernel functions, Zhu et al. (2018) suggested a unique embedded technique (KPD-SVM) for enhancing accuracy and choosing the most suitable characteristics. An SVM classifier was used to assess the predictive model. For the Wisconsin breast cancer (WBC) dataset, KPD-SVM had an accuracy of 95% using five features, and roughly 88% for the Sonar dataset using around 15 genes. According to the findings, the KPD-SVM approach surpassed the F1-score, filter-based, RFE-SVM, and wrapper-based methods. Kernel-penalized SVDD and KP-CSSVM are two embedded techniques for FS and SVM classification developed by Zhu et al. (2018). The proposed method extended the concept of KP-SVM to two SVM models for imbalanced class distributions (SVDD and CS-SVM) to address a skewed class distribution problem by combining classification and variable penalization. Maldonado and López (2018) employed 12 MA datasets using SVM as a classifier to assess the suggested approach.
Zhang L and Huang(2015)suggested a new embedded approach for multiples of SVM-RFE for multi-class FS and classifications.Three different MA datasets,including CNS tumors,Leukemia,and Lung cancer,were evaluated using the SVM classifier in terms of classification accuracy.Because it is effective for CNS tumors and Leukemia datasets,the suggested strategy can increase the performance of each class.
3.Inferences from embedded-based feature selection
According to this review, this is an area to which researchers are paying moderate attention. In many works, the embedded technique was combined with the SVM classifier. RF, MLP, and KNN classifiers have been used in only a limited amount of research and have not outperformed the SVM classifier in terms of accuracy. Little research has been performed on multi-class classification issues. The use of high-dimensional MA datasets has been noted in most of the investigations. Most embedded FS studies were evaluated based on accuracy, the number of selected features (NSF), and convergence ability.
4.Specific ideas to handle embedded feature selection problems
Identifying the appropriate combination is challenging since it involves the embedding of two different FS approaches (filter and wrapper). We suggest employing the classical filter approach rather than experimenting with new filter approaches.
Another key challenge in the embedded approach is classifier orientation. The kernel-based SVM classifier is often used in this research area to examine classification tasks. KPD-SVM provides superior performance compared to conventional embedded approaches. We suggest that this strategy be modified based on FS considerations.
3.3.4 Hybrid methods
Rather than relying exclusively on classic FS approaches such as univariate and multivariate statistical tools, a contemporary approach combines traditional FS methods and new hybrid and ensemble FS techniques. A hybrid methodology combines an independent test with the performance estimation method for the feature subset. Hybrid methods are ideal for selecting important features from a high-dimensional dataset because they reduce the runtime. The goal of the hybrid strategy is to achieve a trade-off between time complexity and feature space size by first using the filter technique to eliminate unnecessary data from the original dataset. The wrapper approach is then used to extract the best feature subset from the feature pool that has been chosen. This technique accelerates FS since the filtering process quickly removes superfluous characteristics from the dataset. The hybrid method’s core workflow is shown in Fig. 8.
Fig.8 Flow diagram of the hybrid approach
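The filter-then-wrapper workflow can be sketched as follows. This is only an illustration under stated assumptions: the filter is an absolute Pearson correlation ranking, the wrapper is a greedy forward search scored by the leave-one-out accuracy of a nearest-centroid classifier, and all function names and the toy data are hypothetical. The studies reviewed here use other filters, metaheuristic searches, and classifiers.

```python
import numpy as np


def filter_top_k(X, y, k):
    """Filter stage: keep the k features most correlated (in absolute value)
    with the class label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    score = np.abs(Xc.T @ yc) / denom
    return [int(j) for j in np.argsort(-score)[:k]]


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    n = len(y)
    hits = 0
    for i in range(n):
        train = np.arange(n) != i
        classes = np.unique(y[train])
        centroids = np.array([X[train & (y == c)].mean(axis=0) for c in classes])
        dist = ((centroids - X[i]) ** 2).sum(axis=1)
        hits += int(classes[np.argmin(dist)] == y[i])
    return hits / n


def wrapper_forward(X, y, candidates):
    """Wrapper stage: greedy forward selection over the filtered candidates,
    stopping when no candidate improves the leave-one-out accuracy."""
    chosen, best = [], 0.0
    while True:
        gains = [(loo_accuracy(X[:, chosen + [j]], y), j)
                 for j in candidates if j not in chosen]
        if not gains:
            break
        acc, j = max(gains, key=lambda t: t[0])
        if acc <= best:
            break
        chosen.append(j)
        best = acc
    return chosen, best


# Toy data (hypothetical): 20 samples, 30 features, feature 3 is informative.
rng = np.random.default_rng(1)
y = np.arange(20) % 2
X = rng.normal(size=(20, 30))
X[:, 3] = 5.0 * y + 0.5 * X[:, 3]
candidates = filter_top_k(X, y, k=5)      # cheap filter shrinks the pool
chosen, acc = wrapper_forward(X, y, candidates)
print("filtered pool:", candidates, "wrapper choice:", chosen, "LOO accuracy:", acc)
```

The wrapper only ever evaluates subsets of the five filtered candidates, which is where the runtime saving of the hybrid scheme comes from.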
1.Traditional hybrid methods
The following are a few contemporary, well-known hybrid FS methods:
(1)Hybrid GA.GA variations are divided into five categories:real and binary coded,multi-objective,parallel,chaotic,and hybrid GA(Katoch et al.,2021).Based on chromosome expression,GAs are classified into two types:binary GA(BGA)and real coded GA(RGA)(Payne and Glen,1993).BGA is created to detect molecule similarities,positions,and conformations.The depiction of chromosomes in an RGA is directly linked to real-life issues.Most RGAs are produced by experimenting with the crossover,mutation,and selection operators(Chuang et al.,2016).
(2) Hybrid FFA. This algorithm uses light flashes to simulate the mating and information sharing of fireflies. Emary et al. (2015) presented the first binary variant of the FFA, which uses a threshold value to solve feature selection problems. The recommended method was subjected to extensive testing, which resulted in a straightforward solution to the problem. Kanimozhi and Latha (2015) proposed an image-retrieval strategy based on SVM classifiers and FFA to improve the algorithm’s accuracy using optimal features.
(3)Hybrid WOA.WOA is a swarm intelligence algorithm designed to solve problems involving continuous optimization.Ling et al.(2017)proposed the Lévy flight trajectory based WOA(LWOA)to accelerate and strengthen WOA movement,thus preventing premature convergence.The Lévy flight trajectory can increase population diversity and potentially hop out of local optima concerning premature convergence.To solve large-scale global optimization(LSGO)problems,an updated WOA(MWOA)has been proposed.A cosine function is used to update the control parameter in a nonlinear dynamic approach to establish equilibrium between exploration and exploitation skills(Sun et al.,2018).
(4) Hybrid HHO. HHO was modified by Gupta et al. (2020) to solve general engineering design problems. The proposed model maintained an equilibrium between the exploration and exploitation phases while optimizing HHO’s population diversity and convergence efficiency. Thirty-three benchmark problems were used to validate the efficiency of the proposed algorithm. An improved HHO based on SSA, built on the assumption that SSA’s powerful explorative capacity will facilitate the exploration of the original HHO, was proposed by Zhang G et al. (2020). The proposed method contained two stages: initialization and updating. A hybrid version of HHO based on cuckoo search and chaotic maps to boost the efficiency of the original HHO was proposed in Sihwail et al. (2020).
(5)Hybrid SSA.S-and V-shaped transition functions created an efficient binary SSA with a crossover operator(Faris et al.,2018).The proposed method was combined with a KNN classifier and applied to 22 well-known UCI machine learning repository datasets,yielding the best results.The salps’location was modified using Singer’s chaotic map and local search algorithm to avoid being trapped into local optima and improve the SSA’s discovery and exploitation(Tubishat et al.,2021).Twenty benchmark datasets were used to evaluate the performance of the proposed approach.
2.Recent hybrid methods
ENSVM is a hybrid FS model for cancer classification suggested by Qaraad et al. (2021). The suggested model employs the elastic net technique for FS in high-dimensional MA data, which controls and chooses variables. The authors used the SSD-SVM model and SVM with an RBF kernel without any particular FS techniques in the proposed model. The suggested model was evaluated using seven high-dimensional MA datasets in terms of specificity, sensitivity, and classification accuracy.
Xie et al. (2021) suggested a two-stage hybrid FS approach to enhance the accuracy of classifiers. The suggested model combines mRMR with an improved binary differential evolution (BDE). Four cancer datasets were employed to assess the superiority of the suggested approach in terms of accuracy. The suggested technique can regulate the number of chosen characteristics while balancing global search efficiency and local optimization capability.
Al-Rajab et al.(2021)suggested a framework for exclusive colon-cancer categorization that includes a two-stage,multifilter hybrid approach of FS.The suggested model employs a combination of IG and GA to choose features.To rank the genes,the standard filter approach mRMR was used.The suggested approach was evaluated using DT,KNN,NB,and SVM classifiers.The model outperformed all the classifiers.
To enhance classification accuracy, Wang et al. (2022) introduced the MMPSO technique, which combines the feature-ranking method with the heuristic search method. The suggested approach was evaluated in terms of accuracy on 10 benchmark datasets and the LIHC MA dataset (Xie et al., 2021).
Zhang G et al.(2020)suggested a novel hybrid approach using IG and the improved binary krill herd algorithm for MA data classification.To help search for a feature using a hyperbolic tangent function,an adaptive transfer factor and a chaotic memory weight factor were proposed.The KNN classifier was used to assess the predictive model using nine MA datasets.
Albaldawi and Almuttairi (2021) proposed a hybrid FS model using the ANOVA and LASSO methods. Five different cancer datasets were evaluated in terms of accuracy using linear support vector, MLP, and RF classifiers. The findings showed that the models in the Spark environment are particularly successful at processing high-dimensional data that cannot be handled using traditional implementations of some techniques.
3.Inferences from hybrid-based feature selection
Based on our review, we find that a hybrid approach is an effective and competitive strategy because it incorporates the benefits of both filter and wrapper or embedded approaches. Furthermore, most of the well-known SI algorithms are used to hybridize novel approaches to solve FS problems. A high-dimensional MA dataset has been used in the majority of research. SVM, NB, and KNN have been commonly used in the literature for classification. The accuracy and NSF are two important evaluation measures that are used frequently.
4.Specific ideas to handle the hybrid-based feature selection problems
The major issue with a hybrid approach is finding the appropriate SI algorithm.Hybridization should be used with two SI algorithms along with any of the suggested algorithm improvements,such as Lévy flight,Brownian motion,opposition-based learning,novel control factor,and parameter tuning.
Another key challenge is determining which classifiers are acceptable for the dataset.This may lead to a reduction of the classification performance in the predictive model.SVM and KNN are used often as classifiers to examine classification tasks.
Stability is also a key challenge in the hybrid approach. The trade-off between bias and variance of the classification error rate is aided by stability. The stability of the FS algorithm is determined by factors such as the dataset’s dimensionality, NSF, sample size, and variability of the data.
3.3.5 Ensemble methods
To solve a specific problem, ensemble learning relies on integrating several models rather than a single model. In recent years, ensemble learning models have demonstrated their usefulness in tackling FS issues. Bagging and boosting are the most common methods for ensemble learning. These methods alter the training set so that the learning algorithm is run several times on diverse training sets. Fig. 9 depicts the basic flow of the homogeneous and heterogeneous approaches of ensemble models. The homogeneous solution employs the same FS approach but with different training data subsets. The heterogeneous strategy employs a variety of FS algorithms, but they are all applied to the same training data. RF is a popular homogeneous ensemble technique that combines several DT models, with the added feature that the trees are built from diverse, random subsets of data (Del Río et al., 2014).
Fig.9 Homogeneous(a)and heterogeneous(b)ensemble methods
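A homogeneous ensemble can be sketched by running one ranker on several bootstrap samples and averaging the resulting ranks. The base ranker (absolute Pearson correlation), the mean-rank aggregation, and every name below are assumptions chosen for brevity, not a method taken from a specific study.

```python
import numpy as np


def corr_scores(X, y):
    """Base ranker (assumed stand-in): absolute Pearson correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(Xc.T @ yc) / denom


def homogeneous_ensemble(X, y, k, n_bags=25, seed=0):
    """Apply the same ranker to n_bags bootstrap samples and return the k
    features with the lowest (best) summed rank."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    rank_sum = np.zeros(d)
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)          # bootstrap sample of rows
        order = np.argsort(-corr_scores(X[idx], y[idx]))
        ranks = np.empty(d)
        ranks[order] = np.arange(d)               # rank 0 = most relevant
        rank_sum += ranks
    return [int(j) for j in np.argsort(rank_sum)[:k]]


# Toy data (hypothetical): features 1 and 4 are informative.
rng = np.random.default_rng(2)
y = np.arange(30) % 2
X = rng.normal(size=(30, 12))
X[:, 1] = 5.0 * y + 0.5 * X[:, 1]
X[:, 4] = -5.0 * y + 0.5 * X[:, 4]
top = homogeneous_ensemble(X, y, k=2)
print("ensemble selection:", sorted(top))
```

A heterogeneous ensemble would instead keep the data fixed and replace the loop over bootstrap samples with a loop over different rankers.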
1.Traditional ensemble methods
Five separate filters were used in Bolón-Canedo et al. (2012), each of which chose a different subset of features to train and evaluate five classifiers; the results were then merged using basic voting. The MCF-RFE algorithm improved the accuracy and stability of FS (Bonilla-Huerta et al., 2016). Further research suggested a feature-ranking scheme for MLP ensembles using an out-of-bootstrap (OOB) estimate as the stopping criterion (Windeatt et al., 2011). Three widely used, filter-based attribute-ranking strategies for text classification problems, with the lowest, top, and average rank mixing strategies, were applied in Olsson and Oard (2006). In some other research, the findings of the basic selectors were merged using several combination techniques, also known as aggregators (Olsson and Oard, 2006). Fig. 10 illustrates the benefits and drawbacks of various FS methods.
Fig.10 Advantages and disadvantages of feature selection techniques
2.Recent ensemble methods
Hengpraprohm and Jungjit (2020) suggested an ensemble approach, named EnSNR, for breast cancer classification using MA data. The fundamental concept of the EnSNR approach is to merge relevant features derived from two independent sets of feature assessment. The suggested approach was evaluated in terms of accuracy on three cancer datasets (Ovarian, Lung, and Prostate) with the SVM classifier. The EnSNR method considerably decreased the number of irrelevant features (genes) that must be evaluated for cancer classification.
Wang AG et al. (2022) suggested another ensemble model for breast cancer classification. The suggested model was evaluated on four different cancer datasets using two classifiers, SVM and KNN. The effectiveness of the suggested approach was assessed with accuracy and stability measures.
Hashemi et al. (2022) suggested a multi-criterion decision-making (MCDM) procedure to examine ensemble FS for the first time. Initially, the authors suggested the EFS-MCDM approach to create a decision matrix using the ranking of each feature according to various rankers. The simple decision matrix and the VIKOR technique were then used to allocate a score to each characteristic. Hashemi et al. (2021) proposed a novel ensemble approach for bi-objective optimization problems involving the relevance and degree of redundancy of features. First, a simple decision matrix was generated by various FS techniques, and the modeled bi-objective optimization problem was used to locate non-dominated features. Next, these features were sorted using the crowding distance.
Tsai and Sung (2020) suggested a novel ensemble approach based on a parallel serial combination approach for high- and low-dimensional MA datasets. Ensemble FS performed better than single FS in terms of classification accuracy when nine parallel and nine serial combinations were compared with three single-baseline FS approaches. The serial combination strategy yielded the highest rate of feature reduction. Kalaimani and Umagandhi (2020) proposed an ensemble FS approach for MA data classification. They used SCF, FEHO, and SVM-t. This worked by combining the FS results obtained from the single feature selectors into a final WMV output. The suggested approach’s fitness was determined using KNN, SVM, and RNN.
3.Inferences from ensemble feature selection
This is one of the approaches receiving the least amount of research attention in the literature.It is worth noting that the research in this area has ranged from proposing a novel thresholding technique to developing and comparing alternative approaches to aggregate data,developing cost-sensitive ensemble FS algorithms,and developing various ensemble designs.
4.Specific ideas to handle ensemble feature selection problems
Stability is a major challenge in the ensemble approach. The trade-off between bias and variance of the classification error rate is aided by stability. The stability of the FS algorithm is determined by factors such as the dataset’s dimensionality, NSF, sample size, and variability of the data.
Complex computational search problems are another issue in the ensemble approach.To solve this issue,the term hyper-heuristics has expanded rapidly to denote a learning process or search strategy for creating or choosing heuristics.Hyperparameters for MH algorithms are a type of parameters that may be used to evaluate different control model parameters during the assessment phase of determining the algorithm’s feasibility.
Stopping criteria determine when the search should stop selecting features. Adequate stopping conditions prevent a model from overfitting and yield superior results at a lower computational cost. The following are two of the most common stopping criteria:
1. The search reaches a predefined limit, which may be a maximum number of iterations or a maximum number of features.
2. The results no longer improve when further features are added or removed.
Various assessment metrics have been used in the literature to evaluate the performance of the FS approach.The following are the metrics that are commonly used to monitor the effectiveness of algorithms:
where TP,TN,FP,and FN represent the true positive,true negative,false positive,and false negative,respectively.
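For reference, the usual confusion-matrix metrics can be written in terms of these four counts as follows. The choice of these five metrics is an assumption; they are the standard definitions rather than a list confirmed by the studies reviewed here.

```latex
\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN},\\
\text{Sensitivity (recall)} &= \frac{TP}{TP + FN},\\
\text{Specificity} &= \frac{TN}{TN + FP},\\
\text{Precision} &= \frac{TP}{TP + FP},\\
F_1 &= \frac{2\,TP}{2\,TP + FP + FN}.
\end{align}
```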
The average fitness value (with the standard deviation of fitness values), convergence ability, and average number of selected features from the source datasets are used to assess the performance. The mean fitness value is the average of the fitness values obtained over all complete runs of the algorithm.
The average number of selected features is computed from the ratio between the number of selected features and the number of available features in a high-dimensional dataset, averaged over the runs.
The average computation time is calculated by dividing the total computation time by the number of runs.
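These three averages can be written compactly as follows. The notation is an assumption introduced here for illustration: $M$ is the number of runs, $f_i$ the best fitness value found in run $i$, $S_i$ the feature subset selected in run $i$, $D$ the number of available features, and $t_i$ the computation time of run $i$.

```latex
\begin{align}
\bar{f} &= \frac{1}{M}\sum_{i=1}^{M} f_i, &
\sigma_f &= \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(f_i - \bar{f}\right)^2},\\
\overline{\text{NSF}} &= \frac{1}{M}\sum_{i=1}^{M} \frac{|S_i|}{D}, &
\bar{t} &= \frac{1}{M}\sum_{i=1}^{M} t_i.
\end{align}
```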
Large volumes of high-dimensional MA data present opportunities as well as obstacles for FS.The importance of appropriate computational paradigms(such as distributed and parallel,multi-label learning,and fusion data mining)for new issues is growing in FS.As a result,we offer these latest ongoing issues and solutions in the following section.
FS is frequently carried out in a centralized fashion,which means that just one learning model is used to address issues.However,if the data is spread out,FS may be able to analyze numerous subsets sequentially or simultaneously.Fig.11 shows the vertical partitioning of the dataset.There are two reasons for using distributed feature selection(DFS)(Bolón-Canedo et al.,2015):First,with the advancement of network technology,data is sometimes dispersed over numerous sites,so it may be skewed;Second,most contemporary FS algorithms may not scale well,and their efficiency suffers drastically when handling a massive amount of data.By distributing smaller datasets across multiple cores and learning and integrating the results simultaneously,learning can be parallelized to improve the speed.The two basic strategies for splitting and distributing data are either vertical(i.e.,by characteristics)or horizontal(i.e.,by samples).Datasets that are too large for batch learning in terms of the number of samples have been scaled up using DFS.
Fig.11 Vertical partitioning of a dataset
It is worth emphasizing the work of Das et al. (2010), in which a method was provided that executes FS in an asynchronous mode with low communication cost using a horizontal split (by samples) and allows each peer to set its own privacy requirements. Banerjee and Chakravarty (2011) suggested a DFS approach derived from virtual dimension reduction, in which data was partitioned vertically or horizontally. Bolón-Canedo et al. (2013) suggested a distributed filter strategy to increase the accuracy over MA data while reducing the execution time. The model was formed using three steps: (1) datasets were partitioned, (2) filtering was applied to the subsets, and (3) the findings were combined. The findings on eight MA datasets demonstrated that the execution time is significantly reduced while the performance is maintained or even increased when compared to non-partitioned datasets using traditional approaches.
Potharaju and Sreedevi(2018)suggested a DFS approach for complex,high-dimensional datasets.Using the suggested method,the features were spread evenly throughout multiple clusters without duplication.After applying the suggested technique to seven high-dimensional datasets and one low-dimensional dataset,it achieved a 57% success rate and an 18%competitive rate against established methods.
In Morán-Fernández et al. (2017), a distribution strategy was presented that includes splitting data horizontally and vertically and then combining the partial outputs. They used four classifiers to assess the proposed technique on 11 datasets (five of which were MA datasets). They gave users some suggestions for splitting high-dimensional datasets that were appropriate for their aims. The horizontal split was advised for users for whom reducing storage requirements and processing time was more important than increasing classification accuracy.
Ye et al.(2019)suggested a DFS model based on intermediate representation.Each party in the suggested methodology found intermediate representations from the original data and distributed them for collaborative FS.The original data from many parties was translated to the same low-dimensional space via shared intermediate representations.The suggested strategy can increase the performance of FS in the local party according to the experimental data.
A novel distributed FS strategy for identifying gene expression data was proposed by Ayyad et al. (2019). The objective was to find the most likely cancer-related genes in a dispersed fashion, which assists in categorizing the data more efficiently. First, a massive quantity of characteristics to be evaluated were split and spread across numerous processors. Next, for each subset of the dataset, a new filter selection approach based on a fuzzy inference system was implemented. Last, all the features that have been generated were ranked, and a wrapper-based selection approach was used.
Jung(2021)proposed a DFS for multi-class classification using the alternating direction method of multipliers(ADMM).Convex optimization and ADMM were used to create a distributed FS algorithm.By employing parallel calculations,the distributed approach scaled effectively with an increasing number of classes.The suggested model was evaluated using two case studies(defect classification and the MNIST dataset),which illustrated a large,multi-class classification challenge.
Relatively little research has been done on DFS. We note that vertical partitioning is employed more often than horizontal partitioning.
Problem 1:How to perform DFS when the data is stored in a central database?
Solution: MPI and Google’s MapReduce provide sophisticated distributed programming models (Chu et al., 2007). On high-performance computer grids or clusters, these models are helpful for the implementation of DFS. The following four stages can be used to parallelize FS: (1) Across training instances, we break the FS procedure into summation forms; (2) We split data into segments and store them on cluster nodes; (3) On the cluster nodes, we calculate local FS results in parallel; (4) We integrate the local results to acquire the ultimate FS results.
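The four stages can be illustrated with a small sketch in which a correlation filter is rewritten in summation form. Each node only returns sums over its own samples, and adding these sums reproduces the centralized score exactly. The correlation filter, the function names, and the three simulated "nodes" are assumptions for illustration; no MPI or MapReduce API is used here.

```python
import numpy as np


def local_sums(X, y):
    """Stage 3 (map): per-node sufficient statistics, all plain sums."""
    return {
        "n": X.shape[0],
        "sx": X.sum(axis=0),
        "sxx": (X ** 2).sum(axis=0),
        "sxy": X.T @ y,
        "sy": y.sum(),
        "syy": (y ** 2).sum(),
    }


def merge(stats):
    """Stage 4 (reduce): add the per-node sums key by key."""
    return {k: sum(s[k] for s in stats) for k in stats[0]}


def correlation_from_sums(t):
    """Global |Pearson correlation| of every feature with the label."""
    n = t["n"]
    cov = t["sxy"] - t["sx"] * t["sy"] / n
    var_x = t["sxx"] - t["sx"] ** 2 / n
    var_y = t["syy"] - t["sy"] ** 2 / n
    return np.abs(cov) / np.sqrt(var_x * var_y + 1e-12)


# Toy data (hypothetical): the label depends on feature 0 only.
rng = np.random.default_rng(3)
X = rng.normal(size=(90, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=90) > 0).astype(float)

# Stage 2: horizontal split into three segments, one per simulated node.
parts = zip(np.array_split(X, 3), np.array_split(y, 3))
stats = [local_sums(Xp, yp) for Xp, yp in parts]   # would run in parallel
scores = correlation_from_sums(merge(stats))
print("distributed scores:", np.round(scores, 3))
```

Because only six small sums per node are communicated, the traffic is independent of the number of samples held by each node.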
Problem 2:Data is disseminated over a vast number of nodes in a network rather than being stored in a single repository.In these situations,standard FS methods cannot be used directly.As a result,data analysis in such networks will necessitate the creation of a new type of DFS which is capable of operating in such large-scale,distributed contexts.
Solution:These measures are initially examined in a peer-to-peer(P2P)network without data centralization.The program then operates depending on local interactions among participants.In contrast to centralization,the procedure is demonstrably valid because it leads to the appropriate results.
A DFS technique must provide an information sharing and fusion process when data is dispersed over a large number of nodes in a network. This ensures that all distributed sites can operate together to reach a global optimization goal. In current DFS, the data in each node has an identical set of features. However, since the data on various nodes may have distinct feature representations, more research in DFS is needed.
Problem:How can vital information for FS be extracted and represented from several sources in parallel FS?
Solution:Venkataramana et al.(2019)introduced a parallel FS framework,termed HFS,which uses a correlation-based feature subset.Furthermore,in parallel,they employed ranking-based FS algorithms to rank features and choose the best ones with a KNN classifier.
Using the Hadoop MapReduce technology, Kečo et al. (2018) built a parallel GA. They employed 11 GEMS datasets in their research for assessing the proposed technique, as well as two classifiers. For fewer than 25 genes, the suggested technique obtained a 100% accuracy rate.
Ray et al.(2016a)implemented an MI FS technique on the Spark framework and applied it to several MA datasets using multiple classifiers,which is an example of parallel-based methodology.
To choose a resilient collection of genes, Boucheham and Batouche (2014) devised a massively parallel meta-ensemble FS technique. It combines the results of each filter within each ensemble and all ensemble results in parallel. They tested the suggested approach on five MA datasets with three classifiers. The proposed approach was adaptable and produced acceptable results.
Ray et al. (2016b) proposed an sf-ANOVA statistical test using the Spark framework for selecting the most relevant features. Two different classifiers with three different, high-dimensional datasets were used to assess the suggested model. When compared to logistic regression, the suggested technique using naive Bayes obtained greater accuracy.
Some other research offers a parallel Chi-squared FS approach using Spark for FS.On the Childhood Tumor Gene dataset with a binary class,parallel logistic regression and SVM were used in Lokeswari and Jacob(2017).With 25 genes,the suggested system achieved 63% accuracy using parallel logistic regression and 75% accuracy using a parallel SVM classifier.
In the process of FS,feature fusion is a procedure that fuses different types of features.Feature fusion’s main goal is feature reduction,which can help eliminate noisy features.Feature fusion techniques in particular deal with the selection and combination of characteristics to eliminate duplicate and unnecessary features.They may also be combined with two or more distinct types of characteristics,and thus the dimensionality is reduced.The FS procedure is the most important part of feature fusion.In summary,feature fusion is a step forward in information fusion,and its related strategies may be broadly classified as linear weighted fusion,maximum entropy fusion,neural networks,and Bayesian inference.
A fusion-based FS framework was suggested by Almutiri et al.(2021)who attempted to use different FS approaches and integrate them using ensemble methods.Out of three layers,the first layer runs separately to rank genes and award a score to each gene.A threshold is used in the second layer to filter each gene based on its estimated score.The ultimate choice regarding which genes are significant is determined in the last layer using one of two decision-voting techniques:plurality or agreement.In contrast to other earlier techniques,the suggested framework shows an improvement in accuracy and dimensionality reduction.
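The three layers described above can be sketched in a few lines. The score range, the threshold value, and the reading of "plurality" as a strict majority of rankers and "agreement" as unanimity are assumptions made for this illustration; the gene names and scores are invented.

```python
def fuse(scores_per_ranker, threshold, rule="plurality"):
    """scores_per_ranker: list of dicts mapping gene -> score (layer 1 output)."""
    # Layer 2: each ranker keeps the genes whose score reaches the threshold.
    kept = [{g for g, s in scores.items() if s >= threshold}
            for scores in scores_per_ranker]
    # Layer 3: count votes per gene and apply the decision rule.
    votes = {}
    for genes in kept:
        for g in genes:
            votes[g] = votes.get(g, 0) + 1
    m = len(scores_per_ranker)
    need = m if rule == "agreement" else m // 2 + 1
    return sorted(g for g, v in votes.items() if v >= need)


# Layer 1 (hypothetical output of three independent rankers).
rankers = [
    {"g1": 0.90, "g2": 0.40, "g3": 0.70},
    {"g1": 0.80, "g2": 0.60, "g3": 0.20},
    {"g1": 0.95, "g2": 0.70, "g3": 0.10},
]
print(fuse(rankers, threshold=0.5, rule="plurality"))  # ['g1', 'g2']
print(fuse(rankers, threshold=0.5, rule="agreement"))  # ['g1']
```

The agreement rule yields a smaller and more conservative gene set, while the plurality rule tolerates one dissenting ranker in this three-ranker example.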
Ke WJ et al.(2018)presented a score-based criterion fusion FS model for cancer prediction with the purpose of improving the prediction performance of the classification model.The suggested model was evaluated using different dimensions of datasets using two classifiers.The results showed that the suggested model can uncover extra discriminative features when compared to alternative methods,and that it may be used as a pretreatment algorithm to be successfully integrated with classical models.
Shalabi(2022)developed a new FS technique based on feature stability and correlation to choose the most appropriate minimal subset of features.The suggested technique was compared to various conventional DR algorithms using benchmark datasets to determine its efficiency.The findings showed that the suggested method is the first to reduce a dataset with excellent predicted accuracy.
In several studies,the linear fusion approach has been used to conduct various analytical tasks.In comparison to classical methods,this method is less computationally costly.A fusion system must determine and alter the weights for optimal task completion.Maximum entropy fusion is a statistical model that uses a data theoretic method to determine the likelihood of an occurrence belonging to a specific class based on its information content.Neural network(NN)is another method for combining characteristics extracted from raw data.It is a black box that may be trained to tackle problems that are ill-defined and computationally costly.Although the NN approach appears to be suited for dealing with high-dimensional problems and performing high-order,nonlinear mapping,choosing the right network topology for a specific application may be problematic.Furthermore,the NN approach is prone to sluggish training.Because of these drawbacks,the NN technique has not been used as widely as fusion methods(Zhang R et al.,2019).
Problem:The training sample has a distinct label in classic FS.In many real-world applications,however,the same instance can be assigned to many class labels.Thus,there is FS for multi-label data,which attracts much research attention(Chen WZ et al.,2007).
Solution:The most straightforward technique for implementing FS on a multi-label dataset is to turn it into a single-label dataset and then to use a typical FS method.There are several ways for converting a multi-label dataset into a single-label dataset,including:(1)simple transformation;(2)copy transformation;(3)label powerset transformation;(4)binary relevance transformation.
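Two of the transformations listed above, label powerset and binary relevance, can be sketched as follows. The function names and the toy label sets are hypothetical; after either transformation, any single-label FS method can be applied to the resulting targets.

```python
def label_powerset(Y):
    """Map every distinct label set to one class id, giving a single
    multi-class target. Returns the targets and the id mapping."""
    ids = {}
    targets = [ids.setdefault(frozenset(labels), len(ids)) for labels in Y]
    return targets, ids


def binary_relevance(Y, all_labels):
    """Create one binary target vector per label."""
    return {lab: [int(lab in labels) for labels in Y] for lab in all_labels}


# Toy multi-label annotations (hypothetical): one label set per sample.
Y = [{"a", "b"}, {"b"}, {"a", "b"}, set()]
lp_targets, lp_ids = label_powerset(Y)
br_targets = binary_relevance(Y, ["a", "b"])
print(lp_targets)   # [0, 1, 0, 2]
print(br_targets)   # {'a': [1, 0, 1, 0], 'b': [1, 1, 1, 0]}
```

Label powerset preserves label co-occurrence but can create many sparse classes, whereas binary relevance treats each label independently.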
The connection between class labels is typically not taken into consideration when multi-label FS methods based on transformation are used.Sparse training samples and an uneven class distribution present difficulty in every imbalanced single-label dataset.As a result,FS algorithms that deal directly with multi-label data are preferable.Most of today’s multi-label FS approaches are simple extensions of single-label FS procedures.A future study will focus on developing multi-label FS software that will be able to deal with multi-label datasets,will not require any transformation,and will take into account the relationships among labels.
Challenges in the FS process and additional issues related to MA datasets are discussed in this section.
Stability is a critical consideration when constructing an FS method for high-dimensional datasets. An FS algorithm is considered robust if it produces identical output regardless of perturbations in the input data. Neglecting the stability of an FS algorithm can lead to incorrect inference and unreliable outcomes. One of the most prominent sources of instability is the discarding of features that are correlated with the selected features and relevant to the dependent variable. Stability also helps balance the tradeoff between bias and variance of the classification error rate. The stability of an FS algorithm is determined by factors such as dataset dimensionality, NSF, sample size, and variability of the data.
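One common way to quantify stability is the average pairwise Jaccard similarity between the feature subsets selected on perturbed versions of the data (for example, bootstrap samples). The sketch below assumes this measure for illustration; it is not necessarily the measure used in the studies reviewed here, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two feature-index sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(subsets: List[Set[int]]) -> float:
    """Mean pairwise Jaccard similarity over all pairs of selected subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("at least two feature subsets are required")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical subsets selected on three perturbed copies of a dataset.
runs = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3}]
print(selection_stability(runs))
```

A value of 1 means that every run selected the same features; values near 0 indicate that the selected subsets barely overlap.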
Wrapper FS algorithms optimize a given objective function to find the optimum feature subset. The design of the FS objective function differs depending on the classification problem. In the first phase of a metaheuristic optimization approach, an objective function is created that either maximizes classification accuracy or minimizes the number of features used. In a single-objective method, the feature count and the classification accuracy are integrated into a single fitness function.
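A widely used form of such a single-objective fitness function is a weighted sum of the classification error and the fraction of features retained. The sketch below assumes this form; the function name, the weight `alpha`, and its default value of 0.99 are illustrative choices, not values reported in this review.

```python
from typing import Sequence


def fitness(mask: Sequence[int], accuracy: float, alpha: float = 0.99) -> float:
    """Weighted-sum fitness for a candidate feature subset (lower is better).

    mask     -- binary vector, 1 if the feature is selected (assumed encoding)
    accuracy -- classifier accuracy obtained with the selected features
    alpha    -- weight of the error term; (1 - alpha) weights the subset size
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    n_selected = sum(1 for bit in mask if bit)
    if n_selected == 0:
        return float("inf")  # an empty subset cannot be used for classification
    error = 1.0 - accuracy
    return alpha * error + (1.0 - alpha) * n_selected / len(mask)


# Two hypothetical candidates with equal accuracy: the smaller subset wins.
print(fitness([1, 0, 1, 0], accuracy=0.9))
print(fitness([1, 1, 1, 1], accuracy=0.9))
```

In a wrapper method, `accuracy` would come from training and validating a classifier on the features marked in `mask`, and the optimizer would search for the mask with the lowest fitness.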
Different classifiers, such as KNN, naive Bayes, SVM, RF, and artificial NN (ANN), have been employed to solve FS problems. The choice of classifier is the most important aspect in achieving the best results on a high-dimensional dataset. According to the literature, KNN is the most frequently used classifier, and SVM also plays a critical role in the classification process.
Many researchers have attempted to deal with the best-known and most frequently targeted challenge, namely the small sample size of MA datasets. The fundamental concern is that a small sample size has a significant impact on the performance of the learning technique. To address this issue, it is important to employ an appropriate validation approach to estimate the misclassification rate.
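Leave-one-out cross-validation (LOOCV) is one validation scheme often applied when samples are scarce. The text above does not prescribe a specific scheme, so the following sketch is an assumption; it uses a simple 1-nearest-neighbor classifier written in plain Python to keep the example self-contained, and the function names are hypothetical.

```python
import math
from typing import List


def nearest_label(train_X: List[List[float]], train_y: List[int],
                  x: List[float]) -> int:
    """Return the label of the training sample closest to x (1-NN)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


def loocv_error(X: List[List[float]], y: List[int]) -> float:
    """Misclassification rate estimated by leave-one-out cross-validation."""
    errors = 0
    for i in range(len(X)):
        # Hold out sample i and train on all remaining samples.
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_label(train_X, train_y, X[i]) != y[i]:
            errors += 1
    return errors / len(X)


# Toy data (hypothetical): one feature, two classes.
X_toy = [[0.0], [0.1], [0.2], [1.0]]
y_toy = [0, 0, 0, 1]
print(loocv_error(X_toy, y_toy))
```

When FS is part of the pipeline, the selection step should be repeated inside each fold on the training portion only; selecting features on the full dataset first tends to give an optimistically biased error estimate.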
Class imbalance occurs when one class has many more instances in a dataset than the other classes, which hinders the learning process. Multi-class MA datasets are well-known examples of imbalanced datasets. The problem becomes more severe when the imbalance of the test set is more pronounced than that of the training set. Preprocessing methods such as under-sampling and over-sampling are commonly employed to overcome this problem. Recently, ensemble classifiers have been proposed as a possible solution to the problem of class imbalance.
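As a minimal sketch of the over-sampling idea, the following snippet duplicates randomly chosen minority-class samples until every class matches the size of the largest class. This is plain random over-sampling, only one of several possible variants; the function name and the fixed seed are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Tuple


def random_oversample(
    X: List[List[float]], y: List[Hashable], seed: int = 0
) -> Tuple[List[List[float]], List[Hashable]]:
    """Duplicate minority-class samples until all classes are equally large."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        # Draw with replacement from the original samples of this class.
        for _ in range(target - len(rows)):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): three samples of class "a", one of class "b".
X_bal, y_bal = random_oversample([[1.0], [2.0], [3.0], [4.0]], ["a", "a", "a", "b"])
print(Counter(y_bal))
```

Resampling of this kind is normally applied to the training set only, so that the validation or test samples remain untouched.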
Outlier detection in MA data is a key topic that has received little attention in the literature. Outliers are samples in a dataset that have been contaminated owing to measurement errors or malfunctioning devices. Outliers are harmful to the learning process because they obstruct the selection of informative genes.
This section demonstrates the application of the wrapper and hybrid FS approaches on diverse MA datasets.
A few real-life MA datasets with a large number of features are discussed in this subsection.The datasets used in this study are obtained from publicly accessible archives.Table 3 lists three high-dimensional MA datasets used to assess the effectiveness of different FS techniques.
Table 3 Overview of datasets
We compare the performances of different wrapper FS methods, such as GA, PSO, WOA, SSA, and HHO, along with a few hybrid approaches, such as ISSA, IWOA, and IHHO. A detailed discussion of all these methods is given in Section 4. The models are evaluated based on their accuracy and convergence capability. The cross-entropy objective function evaluates each model’s efficiency by computing the error rate at each iteration. The drop in error rate as the model progresses through the iterations demonstrates its capacity to converge to the global minimum. Fig. 12 compares the convergence ability of the wrapper approaches GA, PSO, WOA, SSA, and HHO on three cancer MA datasets. The convergence graph in Fig. 12 shows that WOA experiences premature convergence: its convergence curve remains unchanged after 5–10 epochs. GA and PSO are effective on both Colon and CNS, but HHO is exceptionally effective. On Leukemia, none of the optimizers except SSA produces an optimum solution.
Fig. 12 Converging ability of GA, PSO, WOA, SSA, and HHO based on the Colon (a), CNS (b), and Leukemia (c) datasets
Fig. 13 compares the classification performances of GA, WOA, SSA, and HHO on the three MA datasets. All optimizers achieve their best accuracy on Leukemia. Overall, the HHO model produces the highest accuracy on all datasets compared with the other traditional MH optimization approaches.
Fig. 13 Classification performances of GA, WOA, PSO, HHO, and SSA based on three MA datasets
Fig. 14 depicts the convergence ability of three hybrid models, ISSA, IWOA, and IHHO. On the Colon dataset, IWOA outperforms the other two models in terms of convergence ability. On Leukemia, ISSA gradually moves toward the optimal solution, whereas both IWOA and IHHO experience premature convergence. On CNS, however, both IWOA and IHHO perform more effectively than ISSA.
Fig. 14 Converging ability of ISSA, IWOA, and IHHO based on the Colon (a), CNS (b), and Leukemia (c) datasets
The comparative analysis of the classification performances of IWOA, ISSA, and IHHO is shown in Fig. 15. All three hybrid models achieve strong results on Leukemia. IWOA has the highest accuracy on CNS, Colon, and Leukemia compared with ISSA and IHHO.
Fig. 15 Classification performance of ISSA, IWOA, and IHHO based on three MA datasets
MA data analysis offers valuable insight that helps resolve problems related to disease discovery. The classification task is challenging owing to the high complexity of gene expression data and the limited sample size. As a result, the most practical solution to these problems is to use an FS approach. This study consolidates the methodologies, methods, datasets, and prospects of recent research on MA datasets. Based on this critical review, we have analyzed several research areas, such as multi-class classification, improving the performance of learning algorithms by different approaches, fixing the dataset imbalance problem, and validating researchers’ efforts on MA datasets. To summarize, the contributions of this study are as follows:
1. This study gives a thorough overview of FS methodologies based on search strategies and insights into contemporary FS techniques used in existing MA datasets.
2. It provides a comprehensive literature review for five types of FS techniques: filter, wrapper, embedded, hybrid, and ensemble.
3. The difficulties and research concerns associated with developing an FS algorithm are discussed.
4. A procedure for computing performance evaluation metrics is presented.
5. To further understand the performance of alternative FS techniques, a case study is presented.
6. Three MA cancer datasets are used to assess the performances of several well-known wrapper and hybrid FS algorithms.
Contributors
Kulanthaivel BALAKRISHNAN designed the research. Kulanthaivel BALAKRISHNAN and Ramasamy DHANALAKSHMI processed the data. Kulanthaivel BALAKRISHNAN drafted the paper. Ramasamy DHANALAKSHMI helped organize the paper. Kulanthaivel BALAKRISHNAN revised and finalized the paper.
Compliance with ethics guidelines
Kulanthaivel BALAKRISHNAN and Ramasamy DHANALAKSHMI declare that they have no conflict of interest.
Frontiers of Information Technology & Electronic Engineering, 2022, Issue 10