TANG Lin1, SUN Leilei2, GUO Chonghui1,*, and ZHANG Zhen1
1. Institute of Systems Engineering, Dalian University of Technology, Dalian 116024, China; 2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Abstract: Affinity propagation (AP) is a classic clustering algorithm. To improve the classical AP algorithm, we propose a clustering algorithm named adaptive spectral affinity propagation (AdaSAP). In particular, we discuss why AP is not suitable for non-spherical clusters and present a unifying view of nine famous arbitrary-shaped clustering algorithms. We propose a strategy for extending AP to non-spherical clustering by constructing the category similarity of objects. Leveraging the monotonicity that the number of clusters increases with the self-similarity in AP, we propose a model selection procedure that can determine the number of clusters adaptively. For the parameters introduced by extending AP to non-spherical clustering, we provide a grid-evolving strategy to optimize them automatically. The effectiveness of AdaSAP is evaluated by experiments on both synthetic datasets and real-world clustering tasks. Experimental results validate the superiority of AdaSAP over benchmark algorithms such as the classical AP and spectral clustering algorithms.
Keywords: affinity propagation (AP), Laplacian eigenmap (LE), arbitrary-shaped cluster, model selection.
A large amount of data has been collected with the rapid development of sensing and storage technologies in recent years. How to extract knowledge from these data is an essential topic. To understand these data, we usually organize them into meaningful groups. Clustering is a technique that groups objects so that objects within a cluster are more similar to each other than to objects in other clusters [1]. As an efficient unsupervised learning method, clustering has been used in numerous fields, e.g., image segmentation, face recognition, video summary, text mining, web analysis, genome analysis, social community detection, and marketing segmentation [2,3].
Different kinds of clustering algorithms [4,5] have been proposed. These algorithms can be divided into two categories, spherical and non-spherical clustering algorithms, according to the shape of the distribution of data within a cluster. Spherical clustering means that the data within a cluster are ball-shaped in space; non-spherical clustering means the shape of data within a cluster is uncertain. If a clustering algorithm can handle both spherically and non-spherically distributed data, it is called an arbitrary-shaped clustering algorithm. Affinity propagation (AP) clustering is an extremely successful spherical clustering algorithm [6]. AP is also suitable for both structured and unstructured clustering data. It takes a collection of pairwise similarities as input. Unlike traditional exemplar-based algorithms, for example the k-centers algorithm [7], the AP algorithm treats all objects as potential exemplars and usually produces a better clustering result [6,8]. Considering the problems caused by the parameter preference in the AP clustering algorithm, Li et al. [9] proposed the adjustable preference AP (APAP) algorithm, which can automatically adjust the value of each element's preference during the iteration process. To solve the non-spherical cluster problem, Fan et al. [10] used the density distributions and structures of data to define an adaptive AP clustering algorithm, which adopts a nearest neighbor searching strategy. AP clustering has now been widely applied in text mining [11,12], gene expression analysis [13,14], social network analysis [15,16], etc. Meanwhile, many improved AP algorithms have been developed, such as the semi-supervised AP [17,18], the hierarchical AP [19], the incremental AP [20], and the fast AP [21,22].
Though AP has achieved remarkable success, it also suffers from some bottleneck problems in practical applications:
(i) AP cannot be used to discover clusters with non-spherical shapes, which limits the applications of AP in many real-world clustering tasks, such as image segmentation, location-based search, graph partition, and community detection. The shapes of underlying clusters in these tasks are often much more complicated than what AP can deal with.
(ii) It is difficult for users to control the number of clusters. A parameter called preference p needs to be specified by users, which controls the number of clusters. The only prior knowledge about the preference p is that the number of clusters increases with the value of p. However, users usually need to tune p many times to get the desired number of clusters, and p cannot be determined automatically. In our opinion, p should be able to vary its value automatically and stop at a suitable value, so that a better clustering result can be obtained.
(iii) Our extension of AP from the above two perspectives may introduce other parameters. The AP clustering method proposed in this paper should be able to find appropriate parameter values adaptively.
This paper proposes a clustering algorithm named adaptive spectral AP (AdaSAP) to improve the classical AP clustering algorithms from the following aspects:
(i) We study how to extend AP clustering to the non-spherical clustering problem.
(ii) We leverage the monotonicity of the preference p to determine the number of clusters automatically. The number of clusters (or the proper value of p) found by our procedure can well reflect the structure of the studied dataset.
(iii) An evolutionary process is proposed, which can tune the new parameters of AdaSAP adaptively.
This paper also makes two main theoretical contributions. First, it provides an overview of existing famous arbitrary-shaped clustering algorithms and points out the essence of arbitrary-shaped clustering; second, it provides a general strategy for determining the number of clusters, which is a long-standing problem in the clustering field. From the practical perspective, this paper improves the state-of-the-art AP clustering algorithms, and many practical clustering tasks can benefit from the proposed AdaSAP algorithm.
The rest of the paper is organized as follows: Section 2 briefly describes the classical AP clustering algorithm and Laplacian eigenmaps (LE), and provides a unifying view of the existing famous arbitrary-shaped clustering algorithms. In Section 3, we propose the AdaSAP algorithm. In Section 4, we discuss how to determine the number of clusters automatically in AdaSAP. In Section 5, experimental results are presented. In Section 6, we conclude our work and discuss future work.
This section presents foundations and related work for further discussions. We also present a unifying view of the existing arbitrary-shaped clustering algorithms.
AP is an exemplar-based clustering algorithm. The algorithm's objective is to maximize the sum of similarities between the objects inside each cluster and the cluster's exemplar. The exemplar set is a microcosm of the entire dataset [23]. Finding good exemplars is a formidable combinatorial optimization problem.
The AP algorithm initially considers all data points as potential exemplars. It treats each data point as a network node. Two kinds of messages, responsibilities and availabilities, are passed along each edge of the graph.
There are N = {x1, x2, ···, xn} sample points to be clustered, and points xi and xj are two of the sample points. The responsibility r(i, j) indicates how strongly point xi wants to choose the candidate exemplar xj as its exemplar. According to [6], r(i, j) is computed as

$$ r(i,j) = s(i,j) - \max_{j' \neq j}\left\{\alpha(i,j') + s(i,j')\right\}. \tag{1} $$
The availability α(i, j) measures how appropriate it would be for object xi to choose object xj as its exemplar. It is computed as

$$ \alpha(i,j) = \begin{cases} \min\Bigl\{0,\; r(j,j) + \sum\limits_{i' \notin \{i,j\}} \max\{0,\, r(i',j)\}\Bigr\}, & i \neq j \\ \sum\limits_{i' \neq j} \max\{0,\, r(i',j)\}, & i = j \end{cases} \tag{2} $$
According to (1) and (2), the responsibility and availability values are updated iteratively until convergence. The clustering result is then obtained by assigning each point xi to the exemplar

$$ c_i = \arg\max_{j}\,\bigl\{\alpha(i,j) + r(i,j)\bigr\}. \tag{3} $$
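Since the message-passing updates (1)−(3) are the core of AP, a minimal NumPy sketch of the loop is given below. The damping factor and iteration count are illustrative choices rather than the paper's settings, and convergence checking is omitted for brevity.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200):
    """Minimal sketch of AP message passing on a similarity matrix S.

    S[i, j] is the similarity of point i to candidate exemplar j;
    the diagonal S[i, i] holds the preference p.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, j)
    A = np.zeros((n, n))  # availabilities  alpha(i, j)

    for _ in range(max_iter):
        # Responsibility update (1): r(i,j) = s(i,j) - max_{j'!=j} [alpha(i,j') + s(i,j')]
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        max_other = np.tile(first[:, None], (1, n))
        max_other[np.arange(n), idx] = second
        R = damping * R + (1 - damping) * (S - max_other)

        # Availability update (2), vectorized over columns (candidate exemplars).
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        col_sum = Rp.sum(axis=0)
        A_new = np.minimum(0.0, col_sum[None, :] - Rp)
        np.fill_diagonal(A_new, col_sum - np.diag(R))
        A = damping * A + (1 - damping) * A_new

    # Decision (3): each point chooses the candidate maximizing alpha(i,j) + r(i,j);
    # points that choose themselves are the exemplars.
    return np.argmax(A + R, axis=1)
```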
The generic problem of manifold learning is as follows: for a set {x1, x2, ···, xn} of n points in X^l of dimension l, the goal of manifold learning is to find the intrinsic dimension m of the manifold M. Then a set of points {y1, y2, ···, yn} in X^m (m ≪ l) can represent them well in terms of the local relationships.
In order to preserve the local relationships between points, LE first constructs a weighted adjacency graph G = (V, E, W). G represents the completely connected graph; V corresponds to the objects, i.e., the set of points X; E is the set of edges, whose element e_ij is the edge between objects i and j; and W is the set of weights on the edges, where the weight of edge e_ij is w_ij, which is equivalent to (or increases with) s_ij. The graph G is then mapped into a low-dimensional manifold [24]. Many methods have been proposed to construct the graph G. They guarantee that the closer two points are, the greater their edge weight is.
When mapping G into a low-dimensional space, we need to make connected points as close as possible. Y = [y1, y2, ···, yn]^T is the low-dimensional graph representation. In this process, we need to minimize the objective function with appropriate constraints.
By derivation and transformation [24], we need to minimize

$$ \sum_{i,j} \|y_i - y_j\|^2 w_{ij} = 2\,\mathrm{tr}\bigl(Y^{\mathrm{T}} L Y\bigr), \quad \text{s.t. } Y^{\mathrm{T}} D Y = I, $$

where D is the diagonal degree matrix with d_ii = Σ_j w_ij and L = D − W is the graph Laplacian.
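As a rough illustration of this embedding step (not the exact implementation of [24]), the sketch below builds a KNN graph, forms the Laplacian L = D − W, and solves the generalized eigenproblem Ly = λDy, keeping the eigenvectors of the smallest non-zero eigenvalues as the coordinates. The neighborhood size and the 0/1 weighting scheme are assumptions.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmap(X, n_neighbors=10, m=2):
    """Rough LE sketch: minimize sum_ij w_ij ||y_i - y_j||^2 subject to Y^T D Y = I."""
    # Symmetrized 0/1 KNN adjacency; heat-kernel weights are another common choice.
    W = kneighbors_graph(X, n_neighbors, mode="connectivity", include_self=False)
    W = np.asarray(0.5 * (W + W.T).todense())
    D = np.diag(W.sum(axis=1))
    L = D - W                      # unnormalized graph Laplacian
    # Generalized eigenproblem L y = lambda D y; skip the trivial constant eigenvector.
    eigvals, eigvecs = eigh(L, D)
    return eigvecs[:, 1:m + 1]     # m smallest non-trivial eigenvectors as coordinates
```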
Particularly for unstructured data, Belkin et al. pointed out that arbitrary-shaped clusters cannot be separated directly [24]. They discussed the intrinsic consistency of spectral embedding methods and kernel principal component analysis in 2004 [25]. Their work bridges three main fields: spectral clustering, kernel methods, and nonlinear dimension reduction. Fig. 1 presents the relationships between the five studied algorithms.
Fig.1 Relationships among the five studied algorithms
Another important category of arbitrary-shaped clustering algorithms is the density-based one. We discuss the mathematical similarity of three typical density-based clustering algorithms as follows.
First, density-based spatial clustering of applications with noise (DBSCAN) [27] describes density based on the data point neighborhood. To find arbitrary-shaped clusters, DBSCAN clusters iteratively. Mean-shift was proposed in 1975 [28] for non-parametric density gradient estimation. Cheng then applied mean-shift to clustering [29] in 1995. Comaniciu et al. [30] applied mean-shift as a clustering method, and their work made the method well known.
Density peaks clustering (DPC) is a recently proposed density-based clustering algorithm published in Science in 2014 [31]. It first finds some particular objects called density peaks, i.e., objects with the highest density within a relatively large scope. It then assigns each remaining object to a density peak along a density gradient descent direction. The two indicators used are the local density ρ and the minimum distance δ.
We know that ρ in DPC equals |N_ε(x_i)| in DBSCAN. In addition, when a kernel function is used to estimate ρ of an object, the indicator ρ is in fact the density f(x) in mean-shift [32]. That is, the three definitions can be used interchangeably in the three density-based arbitrary-shaped clustering algorithms. Based on the estimated density of an object, DBSCAN defines a relationship between two objects called density-connected, which can be explained from our unifying view: if two objects are density-connected, then the category similarity s′(i,j) = 1; otherwise, s′(i,j) = 0. In mean-shift and DPC, an object i is assigned to its nearest neighbor with even higher density; if two points have the same ending point (or local density maximum point), s′(i,j) = 1; otherwise, s′(i,j) = 0. Therefore, the main difference between DBSCAN on the one hand and mean-shift and DPC on the other lies in the label propagation manner. From the perspective of density estimation and category similarity construction, they are essentially very similar. Table 1 compares the three density-based arbitrary-shaped clustering algorithms, where ε is a similarity threshold (or cutoff distance) and η(·) can be an activation function or a kernel function.
Table 1 Comparisons of three density-based arbitrary clustering algorithms
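To make the unifying view concrete, the simplified sketch below derives a binary category similarity from DBSCAN-style density-connectivity: s′(i, j) = 1 exactly when i and j fall in the same density-connected component. The parameters ε and min_pts and the component-based construction are illustrative assumptions, not the exact definitions of Table 1.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.metrics import pairwise_distances

def density_category_similarity(X, eps=0.5, min_pts=5):
    """Simplified sketch: s'(i, j) = 1 iff i and j lie in the same density-connected component."""
    D = pairwise_distances(X)
    in_neighborhood = D <= eps                 # eps-neighborhood membership
    rho = in_neighborhood.sum(axis=1)          # rho_i = |N_eps(x_i)|, the DBSCAN-style density
    core = rho >= min_pts
    # Keep only edges touching at least one core point, then take connected components.
    adjacency = in_neighborhood & (core[:, None] | core[None, :])
    _, component = connected_components(adjacency, directed=False)
    return (component[:, None] == component[None, :]).astype(int)
```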
The last arbitrary-shaped clustering algorithm studied in this section is the Chameleon algorithm, an agglomerative hierarchical clustering algorithm using dynamic modeling [33]. It consists of three phases: (i) using K-nearest neighbors to generate a sparse adjacency graph G′ according to the similarity matrix S; (ii) dividing the sparse adjacency graph into many sub-graphs by a graph partitioning algorithm, where each sub-graph can be seen as an initial sub-cluster in agglomerative clustering; (iii) agglomeratively recombining the sub-clusters to generate the dendrogram. Chameleon is a heuristic clustering algorithm, so it is difficult to establish the mathematical equivalence of Chameleon with other arbitrary-shaped clustering algorithms. Even so, the mechanism behind Chameleon is still consistent with our unifying view. Assume that Ci, Cj, and Ck are three sub-graphs at the leaf level. In the procedure of agglomeration, Ci is first combined with Cj, and then the three sub-clusters are merged together. From the viewpoint of category similarity, the dynamic modeling process in Chameleon is in fact a procedure of category similarity construction, where the category similarity of two objects is also determined by the local (or key) feature similarity.
Fig. 2 summarizes all the above discussions. Kernel principal component analysis (PCA), LE, locally linear embedding (LLE), kernel K-means [34], and spectral clustering [35] are equivalent from the mathematical perspective, as presented in [24−26]. DBSCAN, mean-shift, and DPC are consistent in both density estimation and label propagation; the only difference lies in the label propagation manner. Such consistency has not been discussed before, and it places the three algorithms on the same ground. At an even higher level, some algorithms are proposed as heuristic clustering algorithms, which makes it difficult to establish their consistency with the others. Nevertheless, we can still find two key principles: (i) in every algorithm, a category similarity matrix Sc is constructed either explicitly or implicitly; (ii) larger feature similarities s_ij play a significant role during the construction of Sc, which means that the distant similarity information is not necessary for arbitrary-shaped clustering.
Fig.2 Hierarchy of nine studied algorithms
Based on the unifying view of arbitrary-shaped clustering algorithms, a general framework of arbitrary-shaped clustering is proposed in Fig. 3. All nine studied arbitrary-shaped clustering algorithms can be well explained by the framework. Spectral clustering for unstructured data starts with the feature similarity matrix S, also called the affinity matrix, where s_ij is the similarity between x_i and x_j. A popular similarity measure is computed from the negative distances of objects as follows: s_ij = (max(D) − d_ij)/(max(D) − min(D)). Kernel K-means maps the collected data X to X′ by kernels; if proper kernels are selected, the mapping in kernel K-means is equivalent to that in the spectral method. Density-based clustering algorithms first compute the density of each object from its neighborhood, which is essentially an operation on the sparse graph G′, as G′ reflects the relationships of neighboring objects. We usually obtain G′ from the complete graph G by keeping either the K-nearest neighbors (KNN) or the edges whose similarity is greater than ε (ε-NN), where ε is the minimum value of similarity. The three density-based arbitrary-shaped clustering algorithms then implement label propagation, where two objects with larger category similarity share the same cluster label. Chameleon constructs G′ according to the feature similarity S and constructs the category similarity matrix Sc using a dynamic process. Recently, Wang et al. [36] proposed a spectral clustering algorithm based on message passing (MPSC) with a density-sensitive similarity measurement. After that, Wang et al. [37] proposed an improved density-based adaptive p-spectral clustering algorithm. To the best of our knowledge, this is the first effort that adopts label propagation by spectral clustering to extend AP to arbitrary-shaped clustering.
Fig.3 A general framework of arbitrary-shaped clustering algorithms
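A small sketch of the first stage of this framework is given below: the distance-to-similarity normalization quoted above, plus the two ways of sparsifying the complete graph G into G′ (KNN and ε-NN). Parameter values are left to the caller, and nothing here is specific to any one of the nine algorithms.

```python
import numpy as np

def feature_similarity(D):
    """s_ij = (max(D) - d_ij) / (max(D) - min(D)): distances mapped into [0, 1] similarities."""
    return (D.max() - D) / (D.max() - D.min())

def knn_graph(S, k):
    """Sparse graph G': keep, for each object, the edges to its k most similar neighbors."""
    n = S.shape[0]
    G = np.zeros_like(S)
    for i in range(n):
        neighbors = [j for j in np.argsort(S[i])[::-1] if j != i][:k]
        G[i, neighbors] = S[i, neighbors]
    return np.maximum(G, G.T)       # symmetrize the adjacency

def eps_graph(S, eps):
    """Sparse graph G': keep the edges whose similarity exceeds the threshold eps (eps-NN)."""
    G = np.where(S > eps, S, 0.0)
    np.fill_diagonal(G, 0.0)
    return G
```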
As a centroid-based clustering algorithm, the classical AP clustering algorithm can only find spherical clusters. This section discusses how to extend AP to the discovery of clusters with complex shapes. Firstly, a toy example illustrates this paper's core idea; then we point out the essentials of arbitrary-shaped clustering.
After that, we propose an algorithm named arbitrary-shaped AP, which combines spectral clustering with the AP clustering algorithm based on message passing of the category label. Furthermore, we also solve the problem of how to set the parameters in AP clustering. First, we introduce a method for determining the parameter p of AP to obtain a better number of clusters. Second, we use a parameter grid to optimize both parameters k (the number of nearest neighbors) and d (the dimensionality of the lower-dimensional space) while constructing the sparse graph G′.
Fig. 4 is a toy example that illustrates the motivation of this paper. In Fig. 4(a), the 15 points in a two-dimensional space can be divided into two clusters. Points A and B are the exemplars of the two clusters. Point C should be assigned to exemplar B since point C is closer to point B than to point A, denoted by s(B,C) > s(A,C) if similarities are computed by spatial distance. However, such an assignment does not take the propagation of the cluster label into account. Intuitively, we often infer that if (i) points A and D have the same cluster label and (ii) points D and E also belong to the same cluster, then points A and E should have the same cluster label. That is, whether points A and E should be divided into the same cluster is determined by min(s(A,D), s(D,E)) rather than by s(A,E). According to such inference, point C should finally be assigned to exemplar A, which accords with our intuition.
Fig.4 Motivation of nearby propagation
In most of the existing literature, the clustering algorithm is directly implemented on a computed or provided similarity matrix, where the similarities mainly reflect the distances of objects in the original feature space. However, what is desired in clustering is how likely two objects are to be divided into the same cluster. This measure is called category similarity in this paper. For the convenience of distinguishing the two, the original similarity is called feature similarity. Take Fig. 4(b) for example: the feature similarity satisfies s(A,C) < s(B,C), while the category similarity satisfies s_c(A,C) > s_c(B,C) if the propagation of the cluster label is taken into account. It can also be observed that local feature similarity plays an important part in computing category similarity; for example, s_c(A,C) is mainly determined by min(s(A,D), s(D,E)) rather than by s(A,E). Therefore, the key to arbitrary-shaped clustering is to use category similarity instead of feature similarity, where the category similarity of two objects is constructed according to local feature similarities of the two objects, which reflects not only the closeness of the two objects but also the connectivity between them. Based on the idea of category similarity, we extend the classical AP algorithm with the spectral algorithm.
The extension of the classical AP algorithm to arbitrary-shaped clustering is guided by the framework shown in Fig. 3. In the proposed arbitrary-shaped AP clustering algorithm, we first construct a sparse adjacency graph G′ according to the feature similarity matrix S; then we map the sparse graph into a low-dimensional space by LE; third, we construct the similarity matrix of objects in the new space, which is the category similarity Sc that reflects both the closeness and the connectivity of two objects; last, classical AP is implemented on Sc to get a non-spherical clustering result c. Algorithm 1 presents the proposed spectral affinity propagation (SAP) clustering algorithm.
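A minimal sketch of this pipeline built on scikit-learn components is given below; it follows the four steps described above, but the default parameter values and the use of library implementations (SpectralEmbedding for LE, AffinityPropagation for AP) are assumptions rather than the paper's exact code for Algorithm 1.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics import pairwise_distances

def sap(X, k=10, d=5, preference=None):
    """Sketch of SAP: sparse KNN graph -> LE embedding -> AP on category similarity."""
    # Steps 1-2: sparse KNN adjacency graph and Laplacian-eigenmap embedding into d dimensions.
    Y = SpectralEmbedding(n_components=d,
                          affinity="nearest_neighbors",
                          n_neighbors=k).fit_transform(X)

    # Step 3: category similarity S_c as negative Euclidean distances in the new space,
    # so it reflects both closeness and connectivity in the original space.
    Sc = -pairwise_distances(Y)

    # Step 4: classical AP on S_c (preference defaults to the median input similarity).
    ap = AffinityPropagation(affinity="precomputed",
                             preference=preference,
                             random_state=0)
    return ap.fit_predict(Sc)
```

For example, `labels = sap(X, k=10, d=5)` assigns each row of a feature matrix X to a cluster that may have a non-spherical shape.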
On the one hand, the proposed SAP is an extension of classical AP to non-spherical clustering. On the other hand, it can also be seen as an improvement of existing spectral clustering algorithms that replaces the K-means partition with an AP partition. A methodology-level significance of SAP is that category similarity, which is constructed from larger feature similarities, is essential to arbitrary-shaped clustering, since it reflects not only the closeness but also the connectivity of two objects. Based on the constructed category similarity matrix, most spherical clustering algorithms can be extended to handle clusters with complex shapes.
The proposed SAP extends the classical AP to arbitrary-shaped clustering problems, making AP able to discover clusters with complex shapes. This section discusses how to determine the number of clusters and optimize the parameters in SAP.
3.3.1 Determining number of clusters
Determining the number of clusters is a long-standing problem. Essentially, it is a balance between model accuracy and model complexity. The best balance may differ across datasets, which means there is no generic solution for this kind of problem. In supervised learning, the balance can be adjusted by model validation. However, such a validation procedure cannot be implemented in unsupervised problems like clustering due to the lack of teaching signals. Because of the difficulty analyzed above, there is still no convincing method for determining the number of clusters in the current literature. In most cases, the number of clusters k is specified by users, as in K-means and K-medoids.
In AP clustering, the number of clusters k is no longer required. Instead, a parameter called preference p needs to be specified in advance, which is used as a shared self-similarity s(i,i) = p. An observation is that the number of clusters k increases with the value of p. In general, the preference p is set as the median of the input similarities [6]. In fact, the influence of p on k is also related to the number of objects N. Therefore, we use pc (the preference coefficient) as an input to avoid the influence of the number of objects.
where S is the normalized similarity matrix, which ranges from 0 to 1.
The automatic selection of the parameter pc in this paper is motivated by the following common sense: imagine we are observing a number of collected data objects N = {x1, x2, ···, xn}. As we move the objects from near to far, our observation goes from clear to increasingly blurred; we call this going from a fine granularity to a coarse granularity. Clustering results also vary with the observation distance: the number of clusters varies from n to 1, where n means each object is treated as a cluster and 1 means all the objects are viewed as one cluster. Though different users may prefer different observation granularities, the most stable number of clusters reflects the most probably observed structure of the studied dataset. Algorithm 2 presents the proposed adaptive affinity propagation (AdaAP) algorithm, which can determine the value of pc adaptively.
We improve the classical AP clustering to make it able to capture non-spherical clusters and to determine the number of clusters adaptively. In the transformed space, message passing is implemented to identify the exemplars. Similar to the classical AP algorithm, the computational complexity is O(N²), where N is the number of data objects. Our method also supports sparse message passing: if we reduce the number of similarity pairs from N² to M and retain only the key similarity pairs, the complexity of the proposed method is O(M).
In Algorithm 2, α is the step length of pc. We first decrease pc from a moderate value to a minimum value, which leads to the maximum number of clusters, equal to N; then we increase pc from the moderate value to a maximum value, which leads to a clustering result with only one cluster. During this process, the most stable value k̂ is obtained by the mode(·) operation on K, so k̂ is viewed as the true number of clusters. The mean of Pc(I) is used as the most proper pc value for this dataset, where I stands for the serial numbers of the iterations whose number of clusters equals k̂.
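The adaptive search can be sketched as follows: sweep pc over a grid, record the resulting number of clusters, take the mode as the most stable k̂, and average the pc values that produce it. The grid, step length, and in particular the mapping from pc to the AP preference used below are assumptions, since the paper defines its own relation between pc, p, N, and S; sklearn's AffinityPropagation stands in for the classical AP of Algorithm 2.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def ada_ap(S, pcs=np.linspace(0.05, 2.0, 40)):
    """Sketch of AdaAP: sweep pc and keep the most stable cluster number.

    S is a normalized similarity matrix with entries in [0, 1]. The mapping
    p = pc * median(S) below is an assumed stand-in for the paper's own equation.
    """
    ks, used_pcs = [], []
    for pc in pcs:
        ap = AffinityPropagation(affinity="precomputed",
                                 preference=pc * np.median(S),
                                 random_state=0).fit(S)
        k = len(ap.cluster_centers_indices_)
        if k > 0:                       # skip runs that failed to converge
            ks.append(k)
            used_pcs.append(pc)
    ks = np.asarray(ks)
    k_hat = np.bincount(ks).argmax()    # mode(K): the most stable number of clusters
    pc_hat = float(np.mean([pc for pc, k in zip(used_pcs, ks) if k == k_hat]))
    return k_hat, pc_hat
```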
Fig. 5 is a toy example that illustrates how AdaAP determines the number of clusters automatically. The synthetic dataset, shown in Fig. 5(a), consists of 200 points from four clusters. According to AdaAP, we change the value of pc from minimum to maximum; the number of clusters decreases with pc, as shown in Fig. 5(b). Fig. 5(c) is a part of Fig. 5(b). It can be seen that the most stable number of clusters is four and the second most stable number is two, which means the most suitable number of clusters for this dataset is four, with two as the second choice. Thus AdaAP can find the correct number automatically, and the result is very intuitive.
Fig.5 AdaAP determining number of clusters
Algorithm 2 shows how to run AP clustering without predefined parameters and can find the most stable number of clusters automatically. In fact, Algorithm 2 varies the number of clusters from 1 to N by changing pc. If a user specifies the number of clusters, Algorithm 2 can also find the corresponding pc value by a bisection procedure.
3.3.2 Optimization of parameters
In Algorithm 1, there are two other parameters that need to be specified by users. Step 2 uses two types of adjacency graphs, constructed by the ε-neighborhood and KNN respectively, so either ε or K must be specified to construct the adjacency graph. In Step 4, we embed the constructed adjacency graph into a low-dimensional space, and the dimension d of the new space needs to be determined by users. Moreover, these two parameters have a great impact on the final clustering result. Therefore, we propose a grid-evolving strategy to select the most suitable parameters. Assume that the KNN graph is employed; combining Algorithms 1 and 2, we get the final AdaSAP shown in Algorithm 3. Algorithm 3 first uses a parameter grid to generate many parameter pairs and finds the best parameter pair (k*, d*) by evaluating the clustering result; then it shrinks the parameter grid toward (k*, d*) and finds a new optimal parameter pair around (k*, d*). It repeats the above procedure until the margin of the nearby grid is equal to or smaller than 1. Initially, the ranges of k and d are set to [1, N], and g can be 2, 5, or 10; the search can be viewed as a bisection grid when g = 2. η controls the extent of shrinking.
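A minimal sketch of this grid-evolving search is given below. It evaluates a coarse grid of (k, d) pairs with some clustering-quality score, then repeatedly shrinks the grid around the best pair; the scoring function is a placeholder, the stopping test is a simplification of the "margin ≤ 1" rule, and η is assumed to lie in (0, 1).

```python
import numpy as np

def grid_evolving_search(score, k_range=(1, 200), d_range=(1, 200), g=5, eta=0.5):
    """Sketch of the grid-evolving strategy for choosing k (KNN size) and d (embedding dim).

    `score(k, d)` is any clustering-quality function of the AdaSAP result to be maximized;
    g is the number of grid points per axis and eta (0 < eta < 1) controls the shrinking.
    """
    (k_lo, k_hi), (d_lo, d_hi) = k_range, d_range
    best = (k_lo, d_lo, -np.inf)
    while (k_hi - k_lo) > 2 or (d_hi - d_lo) > 2:
        ks = np.unique(np.linspace(k_lo, k_hi, g).astype(int))
        ds = np.unique(np.linspace(d_lo, d_hi, g).astype(int))
        for k in ks:
            for d in ds:
                s = score(k, d)
                if s > best[2]:
                    best = (k, d, s)
        # Shrink the grid toward the current best pair (k*, d*).
        k_star, d_star, _ = best
        k_half = max(1, int(eta * (k_hi - k_lo) / 2))
        d_half = max(1, int(eta * (d_hi - d_lo) / 2))
        k_lo, k_hi = max(k_range[0], k_star - k_half), min(k_range[1], k_star + k_half)
        d_lo, d_hi = max(d_range[0], d_star - d_half), min(d_range[1], d_star + d_half)
    return best[0], best[1]
```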
We first evaluate the performance of the adaptive spectral AP clustering algorithm by computational experiments. Compared with classical AP clustering, the AdaSAP method has at least two advantages: (i) it can deal with clusters with complex shapes, while the original AP method can only discover spherical clusters; (ii) the number of clusters can be determined automatically by AdaSAP or specified by users. Thus, the experimental part is organized as follows: (i) testing AdaSAP on the discovery of non-spherical clusters; (ii) testing AdaSAP on the automatic determination of the number of clusters; (iii) comparing AdaSAP with baseline algorithms on real-world clustering tasks.
The performance of a clustering algorithm is usually evaluated by the internal dispersity, i.e., the sum of similarities between objects and their exemplars. We use it as an evaluation criterion in this paper.
However, this paper aims to discover complex-shaped clusters. In such a clustering problem, one should not only take the closeness of objects into account but also consider the connectivity between objects, which cannot be reflected by the internal dispersity. We therefore also use category-label based measures to evaluate the algorithms' effectiveness. By comparing the clustering result with the actual category labels, these measures can evaluate the performance of different clustering algorithms scientifically and reasonably.
The first category-label based evaluation criterion is normalized mutual information (NMI) [38] defined as
where c_t stands for the real cluster tags and ĉ stands for the tags of the clustering result, MI(c_t, ĉ) denotes the mutual information between c_t and ĉ, and H(·) represents information entropy.
Accuracy is a more direct category-label based criterion, defined as follows:
where, for object x_i, the real label and the clustering result label are compared; α(i,j) = 1 if i = j, and α(i,j) = 0 otherwise. The function f(·) maps each resulting cluster tag to the best-matched actual label.
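In practice, both criteria can be computed as sketched below: NMI comes directly from scikit-learn, and accuracy uses the Hungarian algorithm to find the best matching between cluster tags and true labels, playing the role of the mapping f(·). This is one standard implementation, not necessarily the exact formulation used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy: the Hungarian algorithm plays the role of the mapping f(.)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    true_ids, pred_ids = np.unique(y_true), np.unique(y_pred)
    # Contingency table: counts[i, j] = objects with predicted tag i and true label j.
    counts = np.zeros((pred_ids.size, true_ids.size), dtype=int)
    for i, p in enumerate(pred_ids):
        for j, t in enumerate(true_ids):
            counts[i, j] = np.sum((y_pred == p) & (y_true == t))
    rows, cols = linear_sum_assignment(-counts)   # maximize the matched counts
    return counts[rows, cols].sum() / y_true.size

# NMI can be computed directly with scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```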
The third evaluation criterion is covering, which is mainly used to evaluate clustering algorithms in image segmentation tasks. According to [39], it is defined as
where the standard segmentation is S′_result and the segmentation result is S_result. O(R, R′) is the overlap between two regions R and R′, |R| indicates the number of pixels in region R, and N stands for the number of pixels in the whole image.
We validate the effectiveness of the AdaSAP algorithm by experiments on synthetic datasets. We first test the ability of SAP to discover non-spherical clusters on four synthetic datasets, and then illustrate how AdaAP determines the number of clusters automatically.
4.2.1 Testing SAP in discovering non-spherical clusters
A lot of synthetic datasets have been generated to test the ability of clustering algorithms in discovering clusters with complex shapes. Among these datasets, the concentric circles in [2,40], two moons in [41,42], and spirals in [2,40] are the most frequently used.
We combine the synthetic datasets mentioned above and generate four new datasets with even more complex cluster shapes: (i) a dataset of 900 data points distributed around three concentric circles (3C); (ii) a dataset of data points distributed on a circle and two rectangles (1C2R); (iii) a dataset made up of a circle and two moons (1C2M); (iv) a dataset made up of a circle and two spirals (1C2S) knotted in a three-dimensional space.
The clustering results of the four studied algorithms are shown in Fig. 6. AP is only suitable for discovering spherical clusters, and the sub-figures in the left column show that the original AP cannot cluster these datasets correctly. In AdaSAP, the data points are first mapped into a new space by LE, and then AP clusters the data points in the new space. The distances between two objects in the new space reflect both the closeness and the connectivity of the objects in the original space. Therefore, AdaSAP can discover clusters with complex shapes, as shown in the sub-figures in the right column. A similar method is spectral clustering [34]. However, in spectral clustering the objects in the new space are clustered by K-means, so spectral clustering fails to get the correct clustering result if the initial centers are not correctly set. The sub-figures named spectral K-means (SKM) in the second column illustrate the incorrect clustering results produced by spectral clustering. Different from spectral clustering, AdaSAP uses AP to cluster the data points in the new space; since the initial exemplar set is not required, AdaSAP always gets the correct clustering result. Although the DPC algorithm can find cluster centers automatically and cluster arbitrary-shaped data efficiently, its results are also not good on these datasets. The sub-figures named DPC in the third column illustrate its clustering results.
Fig.6 Computational experiments on four synthetic datasets
To test the robustness of the algorithm, we add random noise to the dataset 1C2S. In the following experiment, we add 1%, 3%, and 5% random noise to the data, respectively, and then use the AdaSAP algorithm to cluster the noisy datasets. The experimental results, shown in Fig. 7, indicate that the clustering results are unaffected even when the data contain different levels of noise, which demonstrates that the algorithm is strongly robust.
Fig.7 Clustering results on the noisy datasets
4.2.2 Testing AdaAP in determining number of clusters
Fig. 5 presents a toy example to illustrate how AdaAP works. This section validates the effectiveness of AdaAP on more synthetic datasets. Fig. 8(a) shows the original dataset. In Fig. 8(b) and Fig. 8(c), we gradually move the left two clusters leftward. Fig. 8(d)−Fig. 8(f) show how the number of clusters varies with pc in each case. Fig. 8(d) demonstrates that the most appropriate number of clusters for the original dataset is four. According to Fig. 8(e), the probability of K = 4 decreases while the probability of K = 2 increases, but the most proper number of clusters is still four. Fig. 8(c) moves the two left clusters further apart from the two right clusters, so it is very likely that the left two clusters are viewed as one cluster and the right two clusters as the other. Fig. 8(f) suggests that K = 2 is the first choice of the number of clusters, while K = 4 becomes the second choice. It is unlikely to choose another value (other than two and four) as the true number of clusters.
Fig.8 Computational experiments on four synthetic datasets
In Fig. 9, we generate synthetic datasets in another way. The upper right cluster is moved far apart from the other three clusters in Fig. 9(b), while the lower right cluster is moved further away in Fig. 9(c). Therefore, the most proper number of clusters for the dataset in Fig. 9(b) is two, with K = 4 as the second choice, while the most suitable number of clusters for the dataset in Fig. 9(c) is three. According to Fig. 9(e) and Fig. 9(f), AdaAP selects K = 2 and K = 3 as the numbers of clusters for the two datasets, which accords with our observations.
Fig.9 AdaAP determining number of clusters automatically in another way to modify datasets
The performance of AdaSAP is also evaluated by experiments on publicly available datasets. The website in [43] provides several kinds of clustering datasets, from which we choose the four most popularly used shape sets. Fig. 10 presents the four studied datasets, named Aggregation, Compound, Flame, and Spiral.
Fig.10 Computational experiments on four public datasets
Aggregation consists of seven clusters, two of which are connected. Compound has five clusters in total, one of which is surrounded by another. The Flame dataset has two clusters. The Spiral dataset consists of data points scattered along three separate spirals. Most of these clusters have complex shapes, so arbitrary-shaped clustering algorithms usually test their performance on these four public datasets. Additionally, the real class labels of the data points are provided, so different clustering algorithms can be compared by Accuracy and NMI.
On the one hand, the AdaSAP algorithm can be seen as an extension of AP to arbitrary-shaped clustering, so we select classical AP as a benchmark algorithm. On the other hand, AdaSAP can be seen as an improvement of spectral clustering since both algorithms construct a new similarity matrix leveraging LE, so we choose the spectral clustering algorithm SKM as the other benchmark method. On each dataset, the three algorithms are compared first by Accuracy and then by NMI. Because the clustering result of SKM is affected by the initial exemplar set, we repeat SKM clustering on each dataset 100 times. Fig. 10 presents the box-plots of Accuracy and NMI achieved by each clustering algorithm. AdaSAP achieves the highest Accuracy and NMI on the four datasets. The performance of AdaAP is uniformly inferior to that of AdaSAP on all the datasets, which suggests that AdaSAP can achieve better clustering results than AdaAP when non-spherical clusters exist. The highest clustering performance achieved by SKM is equal to that achieved by AdaSAP, but in most cases the clustering performance of SKM is lower than that of AdaSAP. Therefore, replacing K-means with AP to cluster objects in the new space is a feasible way to improve the performance of existing spectral clustering algorithms.
Image segmentation plays a vital role in computer vision. It can be formulated as a data clustering problem [44,45]. In this kind of method, every pixel is treated as an object. The clustering algorithm is employed to divide the pixels into a certain number of clusters. Pixels with the same category label are segmented into the same region. In image segmentation, the shapes of clusters are usually very complex.
Some images from the Berkeley segmentation dataset (BSDS) [46] are used in this paper. The dataset contains many color images, including humans, sceneries, animals, buildings, flowers, etc. All the images are of size 321×481 (or 481×321). Besides the images, five ground-truth segmentations of each image are provided.
In this paper, the segmentation of an image is accomplished in three main steps: (i) convert an image into many objects, e.g., an image of size 321×481 can be converted into 1536 (32×48) objects, each of which is a 10×10 patch; (ii) cluster these patches into several groups, using each of the three clustering algorithms mentioned above; (iii) refine the clustering result from the patch level to the pixel level, which is realized by assigning each pixel the category label of its most similar patch.
In the second step, we have to compute the similarities between patches. The similarity is computed based on both color and spatial features: the feature of the ith patch is a quintuple combining its average color under the RGB space and its location on the image, where λ is a coefficient that balances the importance of the two kinds of features. For the sake of simplicity, we use a relatively simple color feature scheme in this paper, namely the average color of each patch. Finally, the similarity between a patch and a pixel is computed by the same mechanism, as a pixel can also be represented by a quintuple vector.
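One plausible realization of this quintuple feature and the resulting patch similarity is sketched below; the patch size and λ are illustrative values, not the settings used in the paper.

```python
import numpy as np

def patch_features(image, patch=10, lam=0.5):
    """Quintuple feature per non-overlapping patch: [mean R, mean G, mean B, lam*x, lam*y]."""
    h, w, _ = image.shape
    feats = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = image[y:y + patch, x:x + patch].reshape(-1, 3)
            color = block.mean(axis=0)                   # average color under the RGB space
            feats.append(np.concatenate([color, [lam * x, lam * y]]))
    return np.asarray(feats)

def patch_similarity(feats):
    """Similarity between patches as negative Euclidean distances of the quintuple features."""
    diff = feats[:, None, :] - feats[None, :, :]
    return -np.linalg.norm(diff, axis=-1)
```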
Finally, the segmentation of an image can be represented by a two-dimensional label matrix S. Five ground-truth segmentations of each image are provided in BSDS, and the best-matched ground-truth segmentation S′ is used as the standard segmentation to evaluate the quality of segmentation S. Two indicators, NMI and covering, are used to measure the concordance between a segmentation S and the standard segmentation S′. The experiments are implemented on 20 randomly selected images from BSDS, and the computational results are presented in Fig. 11 and Fig. 12.
Fig.11 Comparisons of three clustering algorithms in image segmentation by NMI
Fig.12 Comparisons of three clustering algorithms in image segmentation by Covering
Fig. 11 demonstrates that AdaSAP achieves 17 highest NMI values and three median NMI values in the experiments on 20 images. The NMI achieved by AdaSAP is always higher than that achieved by AdaAP. From the perspective of Covering, as shown in Fig. 12, the superiority of AdaSAP over the other algorithms is even more significant: it wins with 19 highest Covering values and only one median Covering value. The runner-up algorithm SKM has only three highest values in NMI and one highest value in Covering. These experimental results indicate that AdaSAP is well suited to the image segmentation task. Though the segmentation performance of an algorithm varies across images, which makes it difficult for any algorithm to achieve the consistently highest performance on all images, AdaSAP has the best segmentation performance on the largest proportion of the images. Moreover, it rarely gets the worst segmentation result.
Three of the most popular datasets, the Olivetti Research Laboratory (ORL) face dataset [47], the University of Manchester Institute of Science and Technology (UMIST) database [48], and the Yale database [49], are considered in this subsection. Table 2 describes the three datasets.
Table 2 Description of the three datasets
The ORL dataset. Four hundred images were taken of 40 subjects at different times, varying the lighting, facial expressions (eyes open or closed, smiling or not), and facial details (glasses or no glasses). Ten images per person were taken against a dark homogeneous background with the subjects in an upright, frontal position.
The UMIST dataset. There are 575 images of 20 people. Each person is shown in a range of poses, from profile to frontal views.
The Yale dataset. It contains 11 images for each of 15 individuals. The images vary in lighting conditions (left-light, center-light, and right-light) and facial expressions (normal, happy, sad, sleepy, surprised, and winking), with or without glasses.
The appearance-based method [50], a popular method in face recognition, is used to extract the features of the face images. In this method, a two-dimensional face image of size w by h pixels is expressed as a vector in a w×h-dimensional space. According to [51], grey level normalization is also implemented, and then the negative Euclidean distance is used to compute the similarities between objects.
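A minimal sketch of this feature extraction and similarity computation is given below; the zero-mean, unit-variance scaling is only one plausible form of the grey level normalization referred to in [51].

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def face_similarity(images):
    """Appearance-based features: flatten each w x h face image into a w*h vector,
    apply a (assumed) grey-level normalization, and use negative Euclidean similarities."""
    X = np.asarray([np.asarray(img, dtype=float).ravel() for img in images])
    X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-12)
    return -pairwise_distances(X, metric="euclidean")
```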
AP is implemented on the obtained similarity matrix to provide benchmark performance. In AdaSAP and spectral clustering, we first construct an adjacency graph according to the similarity matrix and then map the adjacency graph onto a low-dimensional manifold. In this paper, the dimension of the newly constructed low-dimensional space is set to five, which means that we reduce the dimension of the feature space from 112×92 (or 243×320) to five before implementing the clustering algorithms.
The experimental results of the three clustering algorithms on the different face datasets are presented in Table 3 and Table 4. AdaSAP achieves the highest performance in both NMI and accuracy. Compared with AdaAP, AdaSAP can find clusters with complex shapes; compared with SKM, AdaSAP can avoid the impact of the initial exemplar set on clustering performance. This experiment validates the superiority of AdaSAP over the two classical algorithms, AP and spectral clustering.
Table 3 Comparisons of three clustering algorithms in terms of NMI
Table 4 Comparisons of three clustering algorithms in terms of accuracy
Mechanical ventilation is the most critical life support method used by intensive care unit (ICU) physicians, and retrospective studies of patient care are of essential significance to them. The AdaSAP algorithm is used to subdivide the patients into different subgroups. The real dataset in our experiment comes from the Multiparameter Intelligent Monitoring in Intensive Care III (MIMIC-III) database [52], a publicly available critical care medicine database developed by the MIT Laboratory for Computational Physiology.
The experimental procedure is shown in Fig. 13. It consists of three stages, namely data filtration, feature matrix construction, and clustering. First, during the data filtration stage, we select adult neoplastic patients who are not in the perinatal period from the MIMIC-III database. Second, we construct the feature matrix in the feature matrix construction stage. Finally, at the clustering stage, the AdaSAP algorithm is used to cluster the patients based on the constructed feature matrix.
Fig.13 Experimental procedure on real medical data
4.6.1 Data filtration
MIMIC-III is a large, freely available database comprising de-identified health-related data associated with patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The dataset includes "subject_id", which identifies a unique ID for each patient; "hadm_id", which identifies a unique ID for each hospital admission; and "icustay_id", which identifies a unique ID for each ICU admission. The subject_id field shows that the database contains 46520 patients, the hadm_id field shows 58976 hospital admissions, and the icustay_id field shows 61532 ICU stays. The experimental dataset was selected from adult patients (age ≥ 18, non-perinatal) within 24 h of their first ICU admission (subsequent ICU admissions are disregarded). The total number of cases is 26761. Referring to the data provided by the MIMIC-III database, the experts finally selected the disease diagnosis and 29 laboratory indicators from seven aspects [53]. These laboratory indicators are shown in Table 5.
Table 5 29 laboratory indicators from seven aspects
Due to missing important attributes in some patient data, the cases are filtered again; the filtering condition is that the number of missing attributes is no more than two, leaving 10381 patient cases. According to the ICD codes of the patients' first disease diagnoses, the patients are divided into subgroups as shown in Fig. 14.
Fig.14 Subgroups of the patients according to the first disease diagnosis
However, each subgroup of patients can be further subdivided by the related laboratory indicators. This is a typical non-spherical clustering task that requires fine clustering of patients. The AdaSAP algorithm not only clusters arbitrary shapes but also helps to find the optimal parameters. The following experiment is carried out with the neoplastic patients (ICD code range 140−239).
4.6.2 Feature matrix construction
In the feature matrix of our experiment, each row represents the data of one patient and each column represents an indicator. Therefore, the dimension of the feature matrix used in this experiment is 700×29. Since all the indicators in this experiment are equally weighted, we normalize each indicator.
4.6.3 Clustering by AdaSAP
First of all, we calculate the similarity matrix S from the feature matrix by the negative Euclidean distance. Second, we construct the adjacency graph G′ using the KNN algorithm; in the experiment, the number of nearest neighbors is k = 11. Third, Laplacian eigenmaps are applied for dimensionality reduction, and we obtain the category similarity matrix S′; the dimension d is finally set to 10. The parameters (k, d) are selected with the help of Algorithm 3. Because the dataset is not labeled, we use the silhouette coefficient to evaluate the clustering result; a larger silhouette coefficient indicates better clustering. At last, the classical AP algorithm is used to cluster the category similarity matrix S′, and the pc value is set to 0.05 according to Algorithm 2. The patients are divided into three clusters, as shown in Fig. 15.
Fig.15 Distribution of the patients among the clusters
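The internal evaluation used to guide the parameter selection above can be computed directly with scikit-learn, as in the short sketch below; Y and labels stand for the embedded patient features and the AdaSAP cluster labels, respectively, and are assumptions of this illustration.

```python
from sklearn.metrics import silhouette_score

def evaluate_partition(Y, labels):
    """Silhouette coefficient of a partition; larger values indicate better clustering.

    Y: the embedded patient features (10-dimensional here); labels: AdaSAP cluster labels.
    """
    return silhouette_score(Y, labels)
```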
Furthermore, we perform a one-way ANOVA between groups for each indicator. The results show that four of the indicators are significant (P < 0.05): SAPSII, incoming (input volume), temperature, and VT. We use boxplots to visualize these indicators under different clusters, as shown in Fig. 16. The experts analyzed the results: since the first diagnosis was neoplasm, the patients were mechanically ventilated purely for respiratory support. Patients in cluster 0 have the highest SAPSII scores among the three clusters but no obvious symptoms of fever; during the treatment, their input volumes are large but the tidal volumes of mechanical ventilation are not high. Patients in cluster 1 generally have high body temperatures, but their SAPSII scores are relatively low; their treatment protocol is characterized by a lower input volume and a higher tidal volume. Patients in cluster 2 have neither high SAPSII scores nor significant abnormalities in body temperature, and they receive the usual treatment strategy.
Fig.16 Distribution of the indicators among the clusters
In this paper, we propose AdaSAP, which improves the classical AP clustering from two perspectives. On the one hand, we design an arbitrary-shaped AP clustering algorithm by introducing spectral embedding with category similarity. On the other hand, we propose a procedure to solve a long-standing model selection problem in the clustering field: leveraging the monotonicity of the number of exemplars with respect to the preference, the proposed adaptive AP can determine the number of clusters automatically. Both synthetic datasets and practical clustering tasks are used to test the performance of AdaSAP, and the experimental results validate its effectiveness.
In the future, we will explore the following directions:
(i) The AdaSAP algorithm has good support for both structured and unstructured data, and we will explore how our model should be applied to specific domains, such as text clustering.
(ii) We will focus on deep clustering algorithms and combine our work with deep learning to further improve the clustering performance.