YELGHI Aref, KÖSE Cemal, YELGHI Asef, and SHAHKAR Amir
1. Department of Computer Engineering, Avrasya University, Trabzon 61250, Turkey; 2. Department of Computer Engineering, Karadeniz Technical University, Trabzon 61080, Turkey; 3. Department of Business Administration, Gazi University, Ankara 06560, Turkey; 4. Civil Engineering Department, Karadeniz Technical University, Trabzon 61080, Turkey
Abstract: Clustering is an unsupervised learning problem: a procedure that partitions data objects into groups. Few algorithms can cope with morphology, overlapping, and a large number of clusters at the same time. Many scientific communities approach clustering from the perspective of density, which is one of the most effective strategies. This study proposes a density-based spatial clustering of applications with noise (DBSCAN) algorithm that operates on automatically selected high-density areas, called automatic fuzzy-DBSCAN (AFD), which requires the initialization of only two parameters. Combining fuzzy and DBSCAN features, AFD models the selection of high-density areas and automatically generates two further parameters for merging and separating sub-clusters. These generated parameters define a set of sub-cluster rules over the dataset in the Cartesian coordinate system. The model simultaneously overcomes the clustering problems of morphology, overlapping, and the number of clusters in a dataset. In the experiments, all algorithms are run 30 times on eight datasets: three are real datasets with overlapping clusters, and the rest are morphological, synthetic datasets. The results demonstrate that the AFD algorithm outperforms other recently developed clustering algorithms.
Keywords: clustering, density-based spatial clustering of applications with noise (DBSCAN), fuzzy, overlapping, data mining.
Clustering is one of the core techniques in data mining. Its main aim is to divide given data (points) into groups such that similar data fall into the same group and dissimilar data into different groups. Clustering is useful in data mining, document retrieval, image segmentation, pattern classification, and statistics [1,2]. In the field of knowledge discovery in databases, clustering is defined as an unsupervised learning technique, because no prior knowledge about the dataset is available for the analysis. The scientific community focuses on the popular techniques and develops them with its own methods for effectiveness and efficiency in data mining. Using fuzzy c-means clustering and a neural network, a new multiple-model adaptive control method was proposed, which guarantees the stability of the switching control system [3]. Pulse description words of detected signals are processed by the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which reveals the identification time of key target signals; it can adapt to complex signal environments with noise intervention and overlapping signals, and is not susceptible to the loss of local pulse parameters [4]. Local-DBSCAN (LDBSCAN) was proposed to distinguish false targets (FTs) from physical targets (PTs) after compensating the FTs' time delays, since PTs have a small distribution [5]. Unlike other work that focused on estimating the density of each sample with different kinds of density estimators, a clustering algorithm named adaptive DBSCAN was developed based on inherent properties of the nearest neighbor graph [6]. A new algorithm, NRDD-DBSCAN, based on the DBSCAN algorithm, was presented using resilient distributed datasets (RDDs) to explore the outliers that degrade the data quality of IoT [7].
Many clustering algorithms have been presented from different points of view and from subject intersections in related fields. Each algorithm has its own advantages and disadvantages, and none of them can handle complex, overlapping, heterogeneous datasets with outliers simultaneously. In [8], a multi-stage model for anomaly detection was proposed to remove a shortcoming of DBSCAN. Another work [9] proposed a non-triangle inequality (non-TI) clustering method in the context of social networks, which used the distance function of a quantum logic-based query language.
Generally, clustering algorithms are divided into eight categories: partition, hierarchy, fuzzy theory, distribution, density, graph theory, grid, and fractal theory models [10]. The K-means algorithm and its framework, built on the concept of partitioning and used for developing other algorithms, is one of the most popular algorithms and is frequently used in the data mining field. K-means fails when faced with arbitrary shapes in spatial data, and it can only minimize a class of cost functions that lead to convergence to a local minimum [11]. K-medoids [12], partitioning around medoids (PAM) [13], and clustering large applications (CLARA) [14] are also based on partitioning and reduce the sensitivity to noise, but their strategies fail to address arbitrarily shaped clusters.
Some algorithms based on hierarchical clustering, such as the balanced iterative reducing and clustering using hierarchies (BIRCH) algorithm [15], clustering using representatives (CURE) [16], robust clustering using links (ROCK) [17], and Chameleon [18], handle arbitrarily shaped clusters from a tree perspective and show better results than others. Nonetheless, they are not able to overcome the time complexity.
To overcome the complexity of the structure, several algorithms based on hierarchical clustering were proposed. One of them presented an automatic version of hierarchical clustering [19]. The leaders-subleaders algorithm [20] demonstrates the potential and significance of hierarchical clustering for time series in data streams. Relying on the Euclidean distance alone exposes the density problem in large datasets. To tackle it, shared nearest neighbor (SNN) clustering was proposed in [21], based on a graph model. The algorithm finds the nearest neighbors of each data point and then redefines the similarity between pairs of points in terms of the number of nearest neighbors they share. DBSCAN [22] is based on data density and takes two parameters, the minimum number of neighbors and the radius of a node; it is discussed in detail in the next section.
OPTICS [23] was proposed to overcome a problem of DBSCAN. To detect meaningful clusters across varying data densities, the algorithm works with two concepts: the maximum distance for each core point and the density of each core point. Spatial-temporal DBSCAN (ST-DBSCAN) [24] modifies the DBSCAN algorithm with a density factor, trying to find noise data when clusters have dissimilar densities in a spatial-temporal dataset. Enhanced DBSCAN [25] is also based on DBSCAN and proposes a local epsilon instead of the global epsilon. The Gaussian mixture model (GMM) [26], based on kernel density, estimates the density region with a small number of components and allows for arbitrary clustering. Another algorithm, Gaussian density distance (GDD) [27], presents a new clustering method that needs no prior information or parameters by combining the Gaussian kernel and the Euclidean distance. Using rough set theory, rough-DBSCAN [28] reduces the runtime of scanning the dataset from the perspective of density.
To obtain density-based arbitrarily shaped clusters and overcome the time-consumption problem, that algorithm uses core leaders to select the high-density data. Another kind of clustering is fuzzy c-means (FCM) clustering [29], which is the basis for developing fuzzy clustering. Fuzzy neighborhood DBSCAN (FN-DBSCAN) [30] was introduced by using a fuzzy neighborhood relation instead of the crisp neighborhood relation of DBSCAN. The FN-DBSCAN algorithm gives more robust results than DBSCAN.
The spectral clustering (SC) algorithm is one of the density-based clustering algorithms and clusters data with a graph method; it has demonstrated clustering on unstructured datasets [31]. Identifying density-based local outliers with the local outlier factor (LOF) converts the binary outlier concept into a degree (a degree formula) for each data point in the dataset, and then uses this degree to distinguish near data (normal data in a cluster) from far data (outlier data) [32]. Breunig et al. [33] used fuzzy proximity relations between data points to obtain clusters of dissimilar density without any a priori knowledge of the dataset. The aim of this study is to weigh the advantages and disadvantages of the clustering algorithms mentioned above and then propose an algorithm that overcomes their limits. This paper addresses the gaps in clustering by considering clustering accuracy on three problems: overlapping, morphology, and the number of clusters. The proposed algorithm, automatic fuzzy-DBSCAN (AFD), tries to overcome morphology, overlapping, and the number of clusters at the same time. It is initialized with two parameters and, during the run, generates two further parameters, Eps2 (the second epsilon) and Pos (the epsilon position for merging and separating), which are defined by the states of the sub-cluster rules. Although many parameter-free algorithms have been proposed, they ignore the problems mentioned above.
In the experiments, we show and compare the accuracy of our algorithm using external indices [34,35] such as the Rand index (RI), the adjusted Rand index (ARI), and the F-measure (F) [36]. The experiments are performed on three real and five synthetic datasets. Section 2 describes the DBSCAN and FN-DBSCAN algorithms, which inspired our algorithm. Section 3 presents the proposed algorithm, with the states of the sub-cluster rules generated during the run written in set notation. Section 4 demonstrates the accuracy and representation of AFD by comparison with well-known algorithms, and Section 5 concludes this research.
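For concreteness, the sketch below shows how these external indices can be computed. RI and ARI are taken from scikit-learn; since the paper does not specify its exact F-measure variant, the common pair-counting F-measure is assumed here, and the helper name pairwise_f_measure is this sketch's own.

```python
# External indices: Rand index (RI), adjusted Rand index (ARI), and a
# pair-counting F-measure (assumed variant; the paper does not specify).
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def pairwise_f_measure(labels_true, labels_pred):
    """A pair of points is 'positive' when both points share a cluster;
    precision/recall compare predicted pairs against true pairs."""
    c = contingency_matrix(labels_true, labels_pred)
    tp = (c * (c - 1) // 2).sum()                         # pairs together in both
    pred_pairs = sum(n * (n - 1) // 2 for n in c.sum(axis=0))
    true_pairs = sum(n * (n - 1) // 2 for n in c.sum(axis=1))
    precision = tp / pred_pairs if pred_pairs else 0.0
    recall = tp / true_pairs if true_pairs else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

labels_true = np.array([0, 0, 1, 1, 2, 2])
labels_pred = np.array([0, 0, 1, 2, 2, 2])
print(rand_score(labels_true, labels_pred))
print(adjusted_rand_score(labels_true, labels_pred))
print(pairwise_f_measure(labels_true, labels_pred))
```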
DBSCAN is able to distinguish noise data and to cluster arbitrarily shaped datasets. It has two parameters, the epsilon (Eps1) and the minimum number of points (MinPoints), which define a user-specified neighborhood radius and the number of points required within that radius. DBSCAN can be conceptually described as follows. The neighborhood is specified by a distance function. For two points p and q with distance dist(p, q), the Eps-neighborhood of a point p is defined by {q ∈ D | dist(p, q) ≤ Eps1}, which indicates the radius around p. A core object is a point whose Eps-neighborhood contains at least a minimum number of points (MinPoints) (see Fig. 1).
Fig. 1 Basic notion in density clustering
Let C = {C1, ..., Ck} be the clusters with respect to Eps1 and MinPoints in the dataset D. We define the outlier objects as the set of objects far from any core object in the dataset D and not belonging to any cluster Ci, i.e., outlier = {p ∈ D | ∀i: p ∉ Ci}. A point that is neither a core object nor an outlier is a border object, which is density-reachable from a core object (see Fig. 1). An object p is directly density-reachable from an object q when p is within the Eps-neighborhood of q and q is a core object: p ∈ NEps(q), where NEps(q) is the subset of the dataset containing the Eps-neighborhood of q, and |NEps(q)| ≥ MinPoints (core object condition).
A point p is density-reachable from a point q with respect to Eps1 and MinPoints if there is a sequence of objects P = {p1, p2, ..., pn}, p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to Eps1 and MinPoints, for 1 ≤ i < n, pi ∈ D (see Fig. 1(a)). An object p is density-connected to an object q with respect to Eps1 and MinPoints if there is an object o ∈ D such that both p and q are density-reachable from o (see Fig. 1(b)). Density connection is a symmetric relation. A cluster C is a non-empty subset of D satisfying the following requirements [22]:
(i) ∀p, q: if q ∈ C and p is density-reachable from q with respect to Eps1 and MinPoints, then p ∈ C (maximality);
(ii) ∀p, q ∈ C: p is density-connected to q with respect to Eps1 and MinPoints (connectivity) [22].
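The definitions above translate directly into code. The following minimal sketch illustrates the Eps-neighborhood and the core-object condition with the Euclidean distance; the helper names are this sketch's own.

```python
# Eps-neighborhood and core-object condition from the DBSCAN definitions.
import numpy as np

def eps_neighborhood(D, p_idx, eps1):
    """Indices q with dist(p, q) <= Eps1, i.e., N_Eps(p); includes p itself."""
    dists = np.linalg.norm(D - D[p_idx], axis=1)
    return np.where(dists <= eps1)[0]

def is_core_object(D, p_idx, eps1, min_points):
    """Core-object condition: |N_Eps(p)| >= MinPoints."""
    return len(eps_neighborhood(D, p_idx, eps1)) >= min_points

D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 2.0]])
print(is_core_object(D, 0, eps1=0.2, min_points=3))   # True: three close points
print(is_core_object(D, 3, eps1=0.2, min_points=3))   # False: isolated point
```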
In FN-DBSCAN, four formulas extend DBSCAN. Here, the neighborhood set of a point x is defined through the membership degrees Nx(y) as
N(x, ε1) = {y ∈ D | Nx(y) ≥ ε1}.    (1)
Nx: D → [0, 1] is any membership function that determines the neighborhood relation between data points (in the Cartesian plane), given by (2). ε1 differs from the radius definition in DBSCAN: instead of the distance-based set of DBSCAN, the level-based set built on the fuzzy neighborhood relation, shown in (1), is used [30]. Here, k is a coefficient, and k > 0 affects the neighborhood radius. Let Y = {y1, ..., yn}; the exponential neighborhood relation and its formula are shown in Fig. 2 and (2), respectively.
Fig. 2 Exponential neighborhood relation [30]
Each core point x is defined with the neighborhood membership function shown in the following equation, where d(x, y) is the Euclidean distance between two points x and y, and dmax is the maximum distance between any two points in the coordinate system [30]. For simplicity, in (3) a point is characterized by its neighborhood degree to all points in the dataset, and (4) gives the concept of fuzzy cardinality [30].
Using ε1, ε2, and MinPoints, the algorithm converts distance-based DBSCAN into level-based DBSCAN while keeping its density-based clustering character.
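As an illustration, the sketch below computes the exponential neighborhood relation and the fuzzy cardinality described above, assuming the usual FN-DBSCAN form Nx(y) = exp(-k · d(x, y) / dmax); the exact expressions of (2)-(4) are those of [30], and the helper names here are this sketch's own.

```python
# Exponential neighborhood relation (2) and fuzzy cardinality (4), assuming
# the common FN-DBSCAN form N_x(y) = exp(-k * d(x, y) / d_max).
import numpy as np

def neighborhood_membership(D, k):
    """Pairwise fuzzy neighborhood degrees N_x(y) in [0, 1]."""
    diff = D[:, None, :] - D[None, :, :]
    dists = np.linalg.norm(diff, axis=2)
    d_max = dists.max()                      # maximum pairwise distance
    return np.exp(-k * dists / d_max)

def fuzzy_cardinality(memberships, x_idx, eps1):
    """Sum of membership degrees of points whose degree is at least eps1;
    x is a fuzzy core point when this sum is large enough."""
    row = memberships[x_idx]
    return row[row >= eps1].sum()

D = np.random.rand(50, 2)
M = neighborhood_membership(D, k=1.0)
print(fuzzy_cardinality(M, x_idx=0, eps1=0.6))
```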
In this paper, we propose a clustering algorithm that uses the features of DBSCAN and fuzzy sets to solve the number-of-clusters, overlapping, and morphological problems, as shown in Fig. 3. The upper subfigures present three clusters and the lower ones present four clusters.
Fig. 3 Three problems
Firstly, our algorithm selects the high-density region with (2) (the fuzzy value), and then all data are scanned from left to right in the coordinate system by DBSCAN with the predefined parameters. Scanning obtains the core nodes of the dataset using Eps1 and MinPoints, which leads to a categorization into sub-clusters. Secondly, the sub-clusters are sent to the Rule_A function, which decides on merging and separating them. The function automatically generates two parameters, the second epsilon (Eps2) and the epsilon position (Pos), and passes them to the Rule_B function. The parameters are generated from the arrangement of the sub-cluster locations in the dataset space (the Cartesian coordinate system).
All the state locations are defined from the centers of the sub-clusters in the coordinate system. The states are Minimum, Maximum, Dmean, and Diff (see Definition 1). Merging and separating are performed based on these four state locations. Using the concepts of fuzzy cardinality and fuzzy neighborhood from (1) and (2), respectively, together with the concept of the DBSCAN algorithm, our algorithm offers a new point of view on the clustering problems mentioned above. To remove the effect of scaling, the data are normalized [30] by (5).
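Since (5) is not reproduced here, a minimal sketch is given below, assuming it is the per-attribute min-max scaling of [30] that maps each attribute into [0, 1]; the helper name normalize is this sketch's own.

```python
# Normalization step (5): assumed per-attribute min-max scaling, as in [30].
import numpy as np

def normalize(D):
    d_min, d_max = D.min(axis=0), D.max(axis=0)
    span = np.where(d_max > d_min, d_max - d_min, 1.0)  # avoid division by zero
    return (D - d_min) / span
```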
The Iris dataset has 150 instances with four attributes, sepal length, sepal width, petal length, and petal width, labeled with three classes.
Definition 1  State locations are defined as follows, and Fig. 4 presents an example on the Iris dataset.
Fig. 4 An example of Iris dataset
Dmean: The matrix of distances between the centers of the sub-clusters.
Maximum: The maximum value of each column in Dmean.
Minimum: The minimum value of each column in Dmean.
Diff: The differences between consecutive values of Minimum.
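A small sketch of Definition 1 follows, computing Dmean, Maximum, Minimum, and Diff from sub-cluster centers; reading Diff as the consecutive differences of the Minimum vector is this sketch's assumption.

```python
# State locations of Definition 1, computed from sub-cluster centers.
import numpy as np
from scipy.spatial.distance import cdist

def state_locations(centers):
    dmean = cdist(centers, centers)        # Dmean: distances between centers
    maximum = dmean.max(axis=0)            # Maximum of each column
    # Minimum of each column, ignoring the zero self-distance on the diagonal
    masked = dmean + np.diag(np.full(len(centers), np.inf))
    minimum = masked.min(axis=0)
    diff = np.diff(minimum)                # Diff: assumed consecutive differences
    return dmean, maximum, minimum, diff
```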
Definition 2  Find peaks (Pos): returns a vector of local maximum and minimum values as peaks. A local peak is a sample that is either higher than its two neighboring samples (here, a sub-cluster compared with its two neighboring sub-clusters) or lower than both. See Fig. 5(e) (findpeaks = 3) and Fig. 5(f) (findpeaks = 2).
Definition 3  Binary separation: returns a binary code from the states of the sub-clusters relative to each other, such as Minimum, Maximum, and the average of both. The code is built from the left sub-cluster to the right one (scanning the dataset from left to right): if the left sub-cluster's value is less than the right one's, the position takes zero, otherwise one. See Fig. 5(d), Fig. 5(e), and Fig. 5(f); their results, with a code sketch after this list, are as follows:
(i) Fig. 5(d), by taking the maximum peak: BMA = Minimum = 101;
(ii) Fig. 5(e), by taking the maximum peak: BME = Means = 101;
(iii) Fig. 5(f), by taking the minimum peak: BMI = Maximum = 111.
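The sketch below illustrates Definitions 2 and 3; the helper names are this sketch's own, and scipy's find_peaks only detects local maxima, so minima are found on the negated vector.

```python
# Local peaks (Definition 2) and left-to-right binary code (Definition 3).
import numpy as np
from scipy.signal import find_peaks

def peaks(values):
    """Positions of local maxima and minima (Pos in the paper)."""
    maxima, _ = find_peaks(values)
    minima, _ = find_peaks(-values)
    return maxima, minima

def binary_code(values):
    """0 if the left sub-cluster value is less than the right one, else 1."""
    return ''.join('0' if values[i] < values[i + 1] else '1'
                   for i in range(len(values) - 1))

minimum = np.array([0.3, 0.7, 0.2, 0.9])
print(peaks(minimum))          # maxima at index 1, minima at index 2
print(binary_code(minimum))    # '010'
```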
The metric measures are F = 0.9600, RI = 0.9495, and ARI = 0.8857 on the Iris dataset. The symbol * is used to better present the peaks.
Fig. 5 Dataset performed by the proposed algorithm
The proposed algorithm is presented below in nine lines, which can be divided into three steps (see Algorithm 1). Lines 1-6 form the first step, as shown in Fig. 5(a). Lines 7 and 8 form the second step, as shown in Fig. 5(b); for a better understanding of them, two rules are described (see Algorithm 2 and Algorithm 3). In these lines, based on the obtained Eps2 values, the algorithm tries to merge and separate the sub-clusters, and line 9 assigns the data that do not belong to any cluster, which is the third step, shown in Fig. 5(c). Rule A is divided into five parts: the first part defines the sets, and the rest are sub-rules applied during the run. Sub-rule 0 (line 1) checks for completely separable clusters; the other sub-rules (lines 3-5) check different conditions and obtain Eps2. Rule B, with eight lines based on Eps and Diff, carries out the separation or merging of sub-clusters.
Algorithm 1  AFD
BEGIN
1. Require: d-dimensional dataset (D), number of data points (N), and the Eps1 and MinPoints parameters.
2. Normalize the data points with (5).
3. Obtain the neighborhood degrees of the normalized data points with (2).
4. Produce sub-clusters with DBSCAN (using the Eps1 and MinPoints parameters).
5. Compute the mean of each sub-cluster.
6. Calculate the distances between the sub-cluster means (called Dmean).
7. Find Eps2 and Pos from Dmean (call the Rule_A function).
8. Merge or separate sub-clusters with Eps2 (call the Rule_B function).
9. Obtain the means of the sub-clusters and assign the remaining data points (those not belonging to any cluster).
END
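A hedged end-to-end sketch of Algorithm 1 follows, reusing the normalize helper sketched earlier. sklearn's DBSCAN stands in for the first pass (step 4), step 3 (fuzzy degrees) is omitted, and the closing merge is a deliberately simplified stand-in for Rule_A and Rule_B (Algorithms 2 and 3), not the paper's exact rules.

```python
# Simplified AFD pipeline following the steps of Algorithm 1.
import numpy as np
from scipy.signal import find_peaks
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN

def afd(D, eps1, min_points):
    X = normalize(np.asarray(D, dtype=float))             # step 2, eq. (5)
    labels = DBSCAN(eps=eps1, min_samples=min_points).fit_predict(X)  # step 4
    ids = [c for c in np.unique(labels) if c != -1]       # -1 marks noise
    if len(ids) < 2:
        return labels
    centers = np.array([X[labels == c].mean(axis=0) for c in ids])  # step 5
    dmean = cdist(centers, centers)                       # step 6: Dmean
    masked = dmean + np.diag(np.full(len(ids), np.inf))   # ignore self-distance
    minimum = masked.min(axis=0)                          # Minimum set
    diff = np.abs(np.diff(minimum))                       # Diff set
    pos, _ = find_peaks(minimum)                          # Pos (Definition 2)
    # Steps 7-8, simplified: Eps2 from Diff at a Minimum-set peak, then
    # merge any two sub-clusters whose centers lie closer than Eps2.
    eps2 = diff[pos[0] - 1] if len(pos) else minimum.mean()
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if dmean[i, j] < eps2:
                labels[labels == ids[j]] = ids[i]
    return labels
```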
Algorithm 2  Rule_A function
BEGIN
1. Use Definition 1 and Definition 2 to find the peaks of the state sets, then create a set for each of them.
2. Apply the binary check to the sets and calculate their intersections.
3. If the peak of the Minimum set intersects with the peak of the Maximum set:
Eps2 = Diff(position = peak of the Minimum set)
4. If the Minimum set has a peak and the Maximum set has no peak:
Eps2 = Diff(position = peak of the Minimum set)
5. If the peak of the Minimum set intersects with the Minimum set itself:
Eps2 = Diff(position = intersection of the peak of the Minimum set with the Minimum set)
Return Eps2, Pos
END
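One possible reading of Rule_A's sub-rules is sketched below: find the peaks of the Minimum and Maximum sets (Definition 2), intersect them, and take Eps2 from the Diff vector at the peak position of the Minimum set. The branch structure and variable names are this sketch's assumptions, not the paper's exact rule.

```python
# A hedged reading of Rule_A (Algorithm 2).
import numpy as np
from scipy.signal import find_peaks

def rule_a(minimum, maximum, diff):
    min_peaks, _ = find_peaks(minimum)
    max_peaks, _ = find_peaks(maximum)
    shared = np.intersect1d(min_peaks, max_peaks)
    if len(shared):                       # sub-rule 3: peaks intersect
        pos = shared[0]
    elif len(min_peaks):                  # sub-rule 4: no peak in Maximum
        pos = min_peaks[0]
    else:                                 # no usable peak found
        return None, None
    return diff[pos - 1], pos             # Eps2 = Diff at the peak position
```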
Algorithm 3  Rule_B function
To better represent the dataset, we show three plots (the three classes in two dimensions based on two attributes, as shown in Fig. 5(a), Fig. 5(b), and Fig. 5(c)). Running with the initialization Eps1 = 0.53 and MinPts = 5, we obtain the states of the sub-clusters, which guide the decision on merging and separating them. Fig. 5(d), Fig. 5(e), Fig. 5(f), and Fig. 5(g) are generated by Rule_A with Eps2 = 0.6649 and Pos = 2; see Fig. 5(g) and Fig. 4. Here, Eps2 = 0.6649 plays the same role as Eps2 = 0.6651 in Fig. 4. For separation, we need an amount slightly below the critical value (e.g., Eps2 = Eps2 − 0.0002). Rule_B is then used for separating and merging the sub-clusters: since Eps2 = 0.6649 < 0.6651 (the critical point), two new sub-clusters are generated. See Figs. 6-8 for the clustering steps on the Iris dataset.
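For reference, the afd sketch above could be invoked on Iris with the text's settings as shown below; because the sketch simplifies Rule_A and Rule_B, its output is only indicative of the pipeline, not a reproduction of Fig. 5.

```python
# Hypothetical invocation of the afd sketch with the paper's settings.
from sklearn.datasets import load_iris

X = load_iris().data
labels = afd(X, eps1=0.53, min_points=5)
print(len(set(labels)) - (1 if -1 in labels else 0))  # number of clusters found
```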
Fig. 6 Original Iris dataset
Fig. 7 Spectral clustering on Iris dataset with sigma = 2
Fig. 8 LOF clustering on Iris dataset with Minpts = 10 and Eps = 1
The experiments are performed on three real and five synthetic datasets (see Fig. 9). The real datasets, obtained from the University of California, Irvine (UCI) Machine Learning Repository, are the Wine, Glass, and Iris datasets, which are suitable for testing overlapping clusters; the synthetic datasets are used for testing morphology. To compare the algorithms, features based on partitional, fuzzy, and density clustering are considered.
Fig. 9 Synthetic datasets
These algorithms are popular in the clustering field, but each of them fails to overcome at least one of the mentioned problems. In this paper, our algorithm is compared with the well-known DBSCAN, fuzzy c-means, and K-means, averaged over 30 runs. To measure the accuracy of the algorithms on the datasets, the well-known Rand index, adjusted Rand index, and F-measure are used. In Table 1, the maximum values of the indices are in bold, and Table 2 shows the parameter settings.
Table 1 Rand index, adjusted Rand index, and F-measure of all algorithms
We note that the AFD algorithm gives more robust results than the others, except on the Glass dataset with the adjusted Rand index and F-measure, and on the Wine dataset with the F-measure. Another advantage of the AFD algorithm is that it finds the correct number of clusters on all datasets except Glass. The proposed algorithm and some recent density-based algorithms, such as LOF and spectral clustering, applied to the Iris dataset are shown in Figs. 6-8. It can be seen that spectral clustering and LOF are unable to cluster the Iris dataset correctly. Each color indicates a cluster, except pink.
Table 2 Parameters settings
The proposed algorithm, which focuses on variously shaped datasets, has been tested for effectiveness on real and synthetic datasets. It generates two parameters, Eps2 and Pos, from within the dataset, which are used to decide between separating and merging data. The results of our experiments demonstrate that the AFD algorithm outperforms the other algorithms on the real UCI benchmark datasets (Wine, Glass, and Iris) and on the synthetic datasets (the rest). AFD runs well in low dimensions but is more time consuming than the others. The challenge is the Eps1 and MinPoints parameters, which should be initialized carefully: different initializations generate different clusters through Eps2 and Pos, which remains a challenge for this study. Generally, the initialization parameters are set by observing the high-density regions, and their values must be estimated at initialization. In the future, we plan to extend this algorithm by addressing this challenge with metaheuristic methods, whose search power over the functional space can lead to the best initialization parameters.