WANG Jun, ZHOU Kai, CHENG Yong
CLC number: TP301.6
Document code: A
Abstract: As a new density-based clustering algorithm, the clustering by fast search and find of Density Peaks (DP) algorithm regards each density peak as a potential cluster center when dealing with a single cluster containing multiple density peaks, making it difficult to determine the correct number of clusters in the dataset. To solve this problem, a mixed density peak clustering algorithm named C-DP was proposed. Firstly, the density peak points were taken as the initial cluster centers and the dataset was divided into sub-clusters. Then, drawing on the Clustering Using REpresentatives (CURE) algorithm, well-scattered representative points were selected from the sub-clusters, the pair of clusters with the smallest distance between representative points was merged, and a contraction factor parameter was introduced to control the shape of the clusters. The experimental results show that the C-DP algorithm has a better clustering effect than the DP algorithm on four synthetic datasets. The comparison of the Rand Index on real datasets shows that on the datasets S1 and 4k2_far, the performance of C-DP is 2.32% and 1.13% higher than that of DP, respectively. It can be seen that the C-DP algorithm improves the clustering accuracy on datasets in which a single cluster contains multiple density peaks.
Key words: density peak; hierarchical clustering; class merging; representative point; contraction factor
0 Introduction
聚類是將數(shù)據(jù)組織到不同的類中以尋找數(shù)據(jù)的固有隱藏模式的無監(jiān)督學(xué)習(xí)方法[1]。聚類用途廣泛,可以應(yīng)用于圖像處理[2]、模式識別[3]、生物信息學(xué)[4]、微陣列分析[5]和社交網(wǎng)絡(luò)[6]等各個領(lǐng)域,聚類算法可以將更多類似的數(shù)據(jù)聚集到同一個類中,而不同的數(shù)據(jù)被組織到不同的類中,在數(shù)據(jù)挖掘領(lǐng)域應(yīng)用廣泛[7]?,F(xiàn)如今已經(jīng)提出了許多基于不同特征的聚類算法,并且可以進(jìn)一步分類為基于劃分的聚類算法、基于密度的聚類算法、基于模型的聚類算法、基于層次的聚類算法和基于網(wǎng)格的聚類算法[8]。
Among them, density-based clustering methods are popular with researchers. Even in the presence of noise, a density-based clustering algorithm can find clusters of arbitrary shape in large datasets [9]. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [10] is an advanced density-based clustering algorithm that uses the minimum neighborhood of the data to discover clusters of arbitrary shape; however, the effectiveness of DBSCAN is affected by the choice of its input parameters [11].
文獻(xiàn)[12]提出了一種密度峰值聚類(clustering by fast search and find of Density Peaks, DP)算法。DP算法以密度峰值來選擇聚類中心,運(yùn)行高效,并且人工干預(yù)較少。DP算法使用了決策圖(Decision Graph)的啟發(fā)式方法來選擇聚類中心。雖然DP算法簡單高效,但其劣勢和難點不容小覷: 對于單個數(shù)據(jù)集中可能包含多個密度峰值,而在DP算法中將不同的密度峰值作為潛在的聚類中心,這就將原本一個類分成多個類簇,導(dǎo)致聚類效果不佳。
To address this problem, this paper proposes a new density-based clustering algorithm, C-DP (mixed Density Peaks clustering algorithm): first, the DP algorithm is applied to the dataset to obtain an initial clustering with the density peaks as initial cluster centers; then, drawing on the Clustering Using REpresentatives (CURE) algorithm, representative points are selected from each initial cluster, and the pair of clusters whose representative points have the minimum distance is merged, so as to improve the clustering quality. Simulation results show that the algorithm can identify clusters of arbitrary shape on both synthetic and real datasets.
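The CURE-style steps borrowed here can be sketched as follows: pick a few well-scattered representative points per sub-cluster, shrink them toward the centroid by the contraction factor α, and merge the two clusters whose representative points are closest. This is a minimal sketch under stated assumptions (Euclidean distance, farthest-point scattering heuristic); the function names and parameter defaults are illustrative, not the authors' exact implementation.

```python
import numpy as np

def representatives(cluster, c=4, alpha=0.5):
    """Select up to c well-scattered points of a sub-cluster, then shrink
    them toward the centroid by the contraction factor alpha.
    alpha near 1: representatives collapse to the centroid (spherical bias);
    alpha near 0: representatives stay on the boundary (outlier-sensitive)."""
    centroid = cluster.mean(axis=0)
    # first representative: the point farthest from the centroid;
    # each next one: the point farthest from the representatives chosen so far
    reps = [cluster[np.linalg.norm(cluster - centroid, axis=1).argmax()]]
    while len(reps) < min(c, len(cluster)):
        d = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[d.argmax()])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)   # shrink toward the centroid

def closest_pair(reps_a, reps_b):
    """Inter-cluster similarity: minimum distance over all representative-point
    pairs; C-DP repeatedly merges the pair of clusters minimizing this value."""
    return min(np.linalg.norm(a - b) for a in reps_a for b in reps_b)
```

Because the shrunken representatives trace the outline of each sub-cluster rather than a single centroid, two sub-clusters produced by density peaks of the same true cluster end up with a small minimum representative distance and are merged back together.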
1 相關(guān)工作
Density-based clustering algorithms are insensitive to outliers and noise; they can handle irregularly shaped data and discover clusters of arbitrary shape, and therefore achieve good clustering results [13].
文獻(xiàn)[12]提出的DP算法利用數(shù)據(jù)集的密度峰值來找到任意形狀數(shù)據(jù)集的類中心,并高效進(jìn)行樣本點歸類和噪聲剔除,處理大規(guī)模數(shù)據(jù)集的性能良好。
The DP algorithm is among the better-performing density clustering algorithms, but there is still room for improvement, and researchers have carried out a great deal of optimization work on it. Reference [14] combined the DP algorithm with the ChAmeleon (CA) algorithm and proposed an extended fast search clustering algorithm (Extended fast search clustering algorithm: widely density clusters, no Density Peaks, E-DP), which solves the problem that DP cannot handle a cluster containing multiple density peaks; although the time complexity of this improved algorithm is no higher than that of the original DP algorithm, its running time is longer. Reference [15] proposed a new merging-strategy algorithm based on Support Vector Machine (SVM): SVM is first used to compute a feedback value between every pair of clusters produced by DP, and clusters are then merged recursively according to these feedback values to obtain accurate clustering results; however, the time complexity of this algorithm is high. Reference [16] proposed the Fuzzy Clustering by Fast Search and Find of Density Peaks (Fuzzy-CFSFDP) algorithm for adaptively selecting cluster centers; built on DP, it uses fuzzy rules to determine the cluster centers, improving the accuracy of center selection and thus the performance of the algorithm. However, it requires that every cluster in the dataset has exactly one density peak, which limits its applicability. Reference [17] proposed the Leading clustering with Density peaks (L-DP) algorithm with a new assignment strategy: after the cluster centers are obtained by the original DP algorithm, an improved leader clustering method is applied as the point assignment strategy to obtain the correct clustering. However, this method does not consider the case where one cluster contains multiple density peaks, and it assumes that cluster centers lie close to high-density regions, whereas in fact cluster centers may also lie close to low-density regions.
4 結(jié)語
針對數(shù)據(jù)集中單個聚類中包含多個密度峰值,DP算法無法得到準(zhǔn)確聚類結(jié)果的問題,本文提出了一種混合的密度峰值聚類算法C-DP:受CURE的啟發(fā),在聚類過程中先用DP算法得到初始聚類中心進(jìn)行初始聚類,然后從初始類中選取類的代表點,并引入收縮因子α,然后根據(jù)收縮因子來移動它們,可以很好地控制類的形狀。以最小距離度量方法作為類間相似性度量標(biāo)準(zhǔn),不同類之間有最近距離代表點對的兩個類被合并。實驗結(jié)果表明,對于單個聚類包含多個密度峰值的數(shù)據(jù)集,C-DP算法與原始的DP算法相比可以更準(zhǔn)確地進(jìn)行聚類,而對于一些只有一個密度峰值的數(shù)據(jù)集,C-DP算法和原DP算法都能進(jìn)行很好的聚類。
However, for both the C-DP algorithm and the original DP algorithm, the cluster centers of a dataset are selected by the user, which makes accurate determination of the cluster centers difficult. Future work will extend the C-DP algorithm into a fully adaptive method.
參考文獻(xiàn):
[1] CASSISI C, FERRO A, GIUGNO R, et al. Enhancing density-based clustering: Parameter reduction and outlier detection [J]. Information Systems, 2013, 38(3): 317-330.
[2] LI K, ZHU C Y, LYU Q, et al. Personalized multi-modality image management and search for mobile devices [J]. Personal and Ubiquitous Computing, 2013, 17(8): 1817-1834.
[3] AHN C-S, OH S-Y. Robust vocabulary recognition clustering model using an average estimator least mean square filter in noisy environments [J]. Personal and Ubiquitous Computing, 2014, 18(6): 1295-1301.
[4] NICOVICH P R, TABARIN T, STIEGLER J, et al. Analysis of nanoscale protein clustering with quantitative localization microscopy [J]. Biophysical Journal, 2015, 108(2): 475a.
[5] YEGANOVA L, KIM W, KIM S, et al. Retro: concept-based clustering of biomedical topical sets [J]. Bioinformatics, 2014, 30(22): 3240-3248.
[6] CHANG M-S, CHEN L-H, HUNG L-J, et al. Exact algorithms for problems related to the densest k-set problem [J]. Information Processing Letters, 2014, 114(9): 510-513.
[7] JAIN A K, MURTY M N, FLYNN P J. Data clustering: a review [J]. ACM Computing Surveys,1999, 31 (3): 264-323.
[8] 周濤,陸惠玲.數(shù)據(jù)挖掘中聚類算法研究進(jìn)展[J].計算機(jī)工程與應(yīng)用,2012,48(12):100-111. (ZHOU T, LU H L. Research progress of clustering algorithms in data mining [J]. Computer Engineering and Applications, 2012, 48(12): 100-111.)
[9] GENG Y-A, LI Q, ZHENG R, et al. RECOME: a new density-based clustering algorithm using relative KNN kernel density [J]. Information Sciences, 2018, 436/437: 13-30.
[10] 張麗杰.具有穩(wěn)定飽和度的DBSCAN算法[J].計算機(jī)應(yīng)用研究,2014,31(7):1972-1975. (ZHANG L J. Stable saturation density of DBSCAN algorithm [J]. Applications Research of Computers, 2014, 31(7): 1972-1975.)
[11] ZHOU A Y, ZHOU S G, CAO J, et al. Approaches for scaling DBSCAN algorithm to large spatial databases [J]. Journal of Computer Science and Technology, 2000, 15(6): 509-526.
[12] RODRIGUEZ A, LAIO A. Clustering by fast search and find of density peaks [J]. Science, 2014, 344(6191): 1492-1496.
[13] XU J, WANG G, DENG W. DenPEHC: density peak based efficient hierarchical clustering [J]. Information Sciences, 2016, 373: 200-218.
[14] ZHANG W, LI J. Extended fast search clustering algorithm: widely density clusters, no density peaks [EB/OL]. [2018-04-06]. https://arxiv.org/ftp/arxiv/papers/1505/1505.05610.pdf.
[15] XU X, DING S, XU H, et al. A feasible density peaks clustering algorithm with a merging strategy [J/OL]. Soft Computing, 2018 [2018-03-06]. https://doi.org/10.1007/s00500-018-3183-0.
[16] MEHMOOD R, BIE R, DAWOOD H, et al. Fuzzy clustering by fast search and find of density peaks [C]// Proceedings of the 2015 International Conference on Identification, Information, and Knowledge in the Internet of Things. Washington, DC: IEEE Computer Society, 2015: 258-261.
[17] DU M, DING S. L-DP: a hybrid density peaks clustering method [C]// DMBD 2017: Proceedings of the 2017 International Conference on Data Mining and Big Data, LNCS 10387. Cham: Springer, 2017: 74-80.
[18] 沈潔,趙雷,楊季文,等.一種基于劃分的層次聚類算法[J].計算機(jī)工程與應(yīng)用,2007,43(31): 175-177. (SHEN J, ZHAO L, YANG J W, et al. Hierarchical clustering algorithm based on partition [J]. Computer Engineering and Applications, 2007,43(31):175-177.)
[19] WIWIE C, BAUMBACH J, RÖTTGER R. Comparing the performance of biomedical clustering methods [J]. Nature Methods, 2015, 12: 1033-1038.