An Improved AprioriTid Algorithm*
ZHANG Wei-ke
(School of Science, Shenyang Ligong University, Shenyang 110159, China)
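Abstract: The classic Apriori algorithm scans the database repeatedly, and the resulting I/O load degrades its running efficiency. To address this problem, an improved Apriori algorithm based on a compressed set, namely the AprioriTid_M algorithm, is proposed on the basis of a study of the principle of the Apriori algorithm and its related improvements. Effective pruning reduces the generation of invalid itemsets and the number of candidate itemsets, which improves the efficiency of the algorithm. Simulation experiments show that, both with the same support and different data volumes and with the same data volume and different supports, the AprioriTid_M algorithm is considerably better than the Apriori algorithm in performance and running time.
Key words: Apriori algorithm; AprioriTid algorithm; AprioriTid_M algorithm; association rule; confidence; itemset; support; performance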
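Apriori is one of the classic algorithms for mining association rules, and it relies on the anti-monotone property of support. The algorithm first generates candidate itemsets and then determines whether each of them is frequent [1]: every subset of a frequent itemset is frequent, and any itemset that contains an infrequent subset must itself be infrequent. The algorithm is iterative. It first finds the frequent 1-itemsets, then generates the candidate k-itemsets from the frequent (k-1)-itemsets, scans the database level by level to select the frequent k-itemsets from the candidate k-itemsets, and terminates when the remaining candidate set is empty [2].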
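The level-wise idea described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `apriori`, the use of an absolute support count `min_count`, and the toy list `TRANSACTIONS` are all hypothetical choices made for this example.

```python
from itertools import combinations

# Made-up toy data for illustration only (not the paper's data set).
TRANSACTIONS = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items in one pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        candidates = set()
        # Join: unite pairs of frequent (k-1)-itemsets that form a k-itemset.
        for a, b in combinations(prev, 2):
            c = a | b
            # Prune: keep c only if every (k-1)-subset is frequent.
            if len(c) == k and all(
                frozenset(s) in prev for s in combinations(c, k - 1)
            ):
                candidates.add(c)
        # One full scan of the transactions per level to count candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent
```

Calling `apriori(TRANSACTIONS, 2)` returns a dictionary keyed by `frozenset`. The loop stops as soon as a level yields no frequent itemsets, which mirrors the termination condition stated in the text.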
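The Apriori procedure for generating frequent itemsets is described as follows.
Input: data set D; minimum support threshold min_sup.
Output: L, the frequent itemsets in D.
L1 = find_frequent_1-itemset(D);
for (k = 2; Lk-1 ≠ ∅; k++) {
    Ck = apriori_gen(Lk-1);    // generate the candidate itemsets
    for all transactions t ∈ D {
        Ct = subset(Ck, t);    // identify all candidates contained in t
        for all candidates c ∈ Ct {
            c.count++;    // increment the support count
        }
    }
    Lk = {c ∈ Ck | c.count ≥ min_sup};    // extract the frequent k-itemsets
}
return L = ∪k Lk;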
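procedure apriori_gen(Lk-1)
for each itemset l1 ∈ Lk-1
    for each itemset l2 ∈ Lk-1
        if (l1[1] = l2[1]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
            c = join(l1, l2);    // join: generate a candidate itemset
            if has_infrequent_subset(c, Lk-1) then
                delete c;    // prune: remove the infrequent candidate
            else add c to Ck;
        }
return Ck;

procedure has_infrequent_subset(c, Lk-1)
// use prior knowledge to decide whether a candidate can be frequent
for each (k-1)-subset s of c
    if s ∉ Lk-1 then
        return TRUE;
return FALSE;

The specific steps of the algorithm are as follows.
1) Scan the data set once to obtain the support of each item and the set L1 of all frequent 1-itemsets.
2) Call the apriori_gen function [3] on the frequent (k-1)-itemsets obtained in the previous pass to produce the new candidate k-itemsets, which are then tested against the support threshold. The apriori_gen function generates the candidate k-itemsets from the frequent (k-1)-itemsets in two steps, join and prune.
① Join. Two itemsets l1 and l2 in Lk-1 are joined when their first k-2 items are equal and l1[k-1] < l2[k-1], as in the join condition of apriori_gen.
② Prune. This step is performed by the has_infrequent_subset function [5], which checks whether a candidate k-itemset contains an infrequent (k-1)-itemset. If any (k-1)-subset is infrequent, the candidate is deleted, which completes the pruning. Fig. 1 shows the flow chart of association rule mining on transaction data with the Apriori algorithm.

Fig. 1 Flow chart of association rule mining

Although the Apriori algorithm can mine association rules over data items, growing database sizes and the continued application and study of the algorithm have shown that it has two main performance bottlenecks [6].
1) Repeated scans of the transaction database increase the I/O load. In every iteration k, the Apriori algorithm scans the whole database to decide whether each element of the candidate set Ck belongs to Lk. For example, if a maximal frequent itemset contains 15 items, the transaction database must be scanned at least 15 times. For massive data sets the I/O load is very large.
2) Massive data produce enormous candidate sets. The candidate set Ck generated from Lk-1 grows exponentially. For example, 10^4 frequent 1-itemsets produce a set of about 10^7 candidate 2-itemsets [7]. The huge candidate sets waste storage space and reduce running efficiency, so the running performance of the algorithm needs to be optimized.

2 Improved Apriori algorithm

2.1 AprioriTid algorithm

The AprioriTid algorithm improves performance mainly by reducing the amount of transaction data that is scanned. If a transaction contains no large k-itemset, it cannot contain any large (k+1)-itemset. Such transactions can therefore be ignored in later passes, which reduces the number of transactions scanned in subsequent iterations without affecting the support of the candidate itemsets.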
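The idea in the preceding paragraph can be sketched in Python as follows. This is a simplified illustration, not the paper's code: the function name `apriori_tid` and the absolute threshold `min_count` are assumptions, and the Tid table is represented as a plain list of sets of itemsets without transaction identifiers.

```python
from itertools import combinations

def apriori_tid(transactions, min_count):
    """Frequent itemsets using a Tid table that replaces the database after pass 1."""
    # Tid table for k = 1: each entry holds the 1-itemsets of one transaction.
    tid_table = [{frozenset([i]) for i in t} for t in transactions]
    counts = {}
    for entry in tid_table:
        for s in entry:
            counts[s] = counts.get(s, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join and prune exactly as in Apriori.
        candidates = {
            a | b for a, b in combinations(prev, 2)
            if len(a | b) == k
            and all(frozenset(s) in prev for s in combinations(a | b, k - 1))
        }
        counts = dict.fromkeys(candidates, 0)
        next_table = []
        for entry in tid_table:
            # A candidate is contained in the transaction exactly when all of
            # its (k-1)-subsets were recorded for it at the previous level.
            present = {
                c for c in candidates
                if all(frozenset(s) in entry for s in combinations(c, k - 1))
            }
            for c in present:
                counts[c] += 1
            if present:  # transactions holding no candidate are dropped
                next_table.append(present)
        tid_table = next_table
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent
```

Only the first pass touches the raw transactions. Every later pass reads the previous Tid table, which shrinks as entries without candidates are discarded.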
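The AprioriTid procedure is described as follows.
L1 = {large 1-itemsets};    // compute the large 1-itemsets
C̄1 = D;
for (k = 2; Lk-1 ≠ ∅; k = k + 1) do
    Ck = apriori_gen(Lk-1);    // construct the candidate itemsets
    C̄k = ∅;
    for all t ∈ C̄k-1 do
        Ct = {C ∈ Ck | (C − C[k]) ∈ t.set-itemsets ∧ (C − C[k-1]) ∈ t.set-itemsets};    // candidates contained in t
        for all C ∈ Ct do C.sup = C.sup + 1; end for
        if Ct ≠ ∅ then
            C̄k = C̄k ∪ {<t.Tid, Ct>};
        end if    // build the Tid table of the candidate k-itemsets
    end for
    Lk = {C ∈ Ck | C.sup ≥ min_sup};    // compute the large k-itemsets
end for
L = ∪k Lk

2.2 AprioriTid_M algorithm

The AprioriTid algorithm scans the database D only when it computes the support of the 1-itemsets, which shortens the running time, but an oversized candidate set still slows it down. This paper therefore proposes AprioriTid_M, an improved algorithm based on a compressed set. By the principle that every nonempty subset of a frequent itemset is frequent [9], all (k-1)-subsets of a frequent k-itemset are also frequent, and the Tid table is further optimized on this basis.

Property 1. If the frequent k-itemsets Lk can generate a frequent (k+1)-itemset, then the number of itemsets in Lk must be greater than k.
Proof. Every nonempty subset of a frequent itemset is frequent, so the k+1 distinct k-item subsets of any itemset in Lk+1 all belong to Lk. Hence Lk contains at least k+1 itemsets, which completes the proof.

Property 2. If Mk is a frequent k-itemset in database D [10], then every item contained in Mk occurs at least k-1 times among the frequent (k-1)-itemsets Mk-1.
Proof. Let Nk = {x1, x2, x3, …, xk} ∈ Lk be a frequent k-itemset with xi ∈ I, i = 1, 2, …, k, where I = {I1, I2, …, Im} is the set of data items. Every subset of Nk with k-1 items is then also a frequent itemset, and all of these subsets belong to Lk-1. Let Nk-1,j = Nk − {xj}. Then xi ∈ Nk − {xj} for every j ≠ i, j = 1, 2, …, k, so xi is contained in each of the other k-1 subsets. Therefore any item xi of a frequent k-itemset in Lk occurs at least k-1 times in Lk-1, which completes the proof.

L1 = {frequent 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin    // generate all frequent itemsets
    Lk-1 = A_prune(Ck-1);
    for each itemset T1 ∈ t.set-itemsets
        for each itemset T2 ∈ t.set-itemsets
            for all candidates c ∈ Ct do
                if ((T1[1] = T2[1]) ∧ (T1[2] = T2[2]) ∧ … ∧ (T1[k-2] = T2[k-2]) ∧ (T1[k-1] < T2[k-1])) then
                    {c = join(T1, T2); add c to Ct; c.count++};
end;
Answer = ∪k Lk

Fig. 2 Example of the improved AprioriTid_M algorithm

3 Performance comparison of the algorithms

To verify the performance of the improved AprioriTid_M algorithm, the AprioriTid algorithm and the improved AprioriTid_M algorithm were used to mine the same data. The running time of the two algorithms was measured under different supports and under different data sizes. The operating system was Windows XP Professional, and SQL 2005 was used to preprocess the experimental data.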
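As a rough illustration of how a comparison of this kind can be set up, the sketch below times one level-wise miner with and without extra pruning derived from Properties 1 and 2 of Section 2.2. It is not the author's A_prune procedure or experimental code, whose details are not given in the text. The names `prune_by_properties`, `mine`, `timed` and `synthetic` are hypothetical, the data are random, and no timing figures are claimed.

```python
import random
import time
from itertools import combinations

def prune_by_properties(prev, k):
    """Shrink L_{k-1} before the join (hypothetical helper, not A_prune)."""
    if len(prev) < k:  # Property 1: too few itemsets to yield any k-itemset
        return set()
    occurrences = {}
    for itemset in prev:
        for item in itemset:
            occurrences[item] = occurrences.get(item, 0) + 1
    # Property 2: each item of a frequent k-itemset occurs at least k-1
    # times in L_{k-1}, so itemsets with a rarer item cannot contribute.
    return {s for s in prev if all(occurrences[i] >= k - 1 for i in s)}

def mine(transactions, min_count, use_properties):
    """Level-wise miner; use_properties toggles the extra pruning."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s for s, c in counts.items() if c >= min_count}
    frequent = set(level)
    k = 2
    while level:
        pool = prune_by_properties(level, k) if use_properties else level
        candidates = {
            a | b for a, b in combinations(pool, 2)
            if len(a | b) == k
            and all(frozenset(s) in pool for s in combinations(a | b, k - 1))
        }
        level = {
            c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_count
        }
        frequent |= level
        k += 1
    return frequent

def timed(transactions, min_count, use_properties):
    """Return (frequent itemsets, elapsed seconds) for one run."""
    start = time.perf_counter()
    result = mine(transactions, min_count, use_properties)
    return result, time.perf_counter() - start

def synthetic(n, seed=0):
    """Random transactions over 12 items (made-up data, not the paper's)."""
    rng = random.Random(seed)
    items = [f"i{j}" for j in range(12)]
    return [rng.sample(items, rng.randint(2, 6)) for _ in range(n)]
```

Running `timed(data, m, True)` and `timed(data, m, False)` over several values of `m` and several sizes of `data` reproduces the two experimental dimensions described above. Both variants return the same frequent itemsets, since the pruning only removes itemsets that cannot be part of a frequent k-itemset.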
Fig. 3 compares the time consumed by the AprioriTid and AprioriTid_M algorithms to generate frequent itemsets under different supports for a data set of 1 000 records. For the same data volume, the time AprioriTid takes to generate frequent itemsets varies widely as the support increases, so its performance is not stable. On the same data, the running time of the improved AprioriTid_M algorithm varies less and is clearly shorter than that of the unimproved algorithm, which indicates that AprioriTid_M improves considerably on AprioriTid in computing time and performance.

Fig. 3 Time required to generate frequent itemsets under different supports

Fig. 4 compares the running time of the two algorithms at a support of 0.5 for transaction volumes of 2 000, 3 000, 4 000, 5 000, 6 000 and 7 000. As Fig. 4 shows, both algorithms take more time as the data volume grows. For the same amount of data, however, AprioriTid_M runs in clearly less time than the algorithm before improvement, and its running time grows steadily as the data volume increases. The time consumed by AprioriTid grows sharply with the total number of transactions, so its performance is less stable.

Fig. 4 Running time of the two algorithms at a support of 0.5

Fig. 5 shows the running time of the two algorithms at a support of 0.7. At this support the improved AprioriTid_M algorithm takes less time, and its running time grows steadily with the number of transactions, whereas the running time of AprioriTid rises sharply as the number of transactions increases, so it is less stable than the improved algorithm.

Fig. 5 Running time of the two algorithms at a support of 0.7

In summary, the improved AprioriTid_M algorithm achieves better performance and running efficiency and is relatively stable.

4 Conclusions

This paper studied the core idea of the classic Apriori algorithm and analyzed its performance shortcomings. On this basis the AprioriTid algorithm was studied, and AprioriTid_M, an algorithm based on transaction set compression, was proposed. The two algorithms were compared by running time on equal amounts of transaction data under different supports, and under a fixed support with different transaction volumes. The comparison shows that AprioriTid_M outperforms AprioriTid in both performance and efficiency.

References
[1] ZHANG Chun-yan, MENG Zhi-qing, YUAN Pei. Mining algorithm for temporal text association rules in text mining [J]. Computer Science, 2013, 40(6): 219-224. (in Chinese)
[2] Silvestri C, Orlando S. Approximate mining of frequent patterns on streams [J]. Intelligent Data Analysis, 2007, 11(1): 49-73.
[3] YU Xiao-mei, CHEN Zhen-xiang, PENG Li-zhi. Traffic classification based on decision tree [J]. Journal of University of Jinan (Science and Technology), 2012, 26(3): 291-295. (in Chinese)
[4] Tsang S, Kao B, Kevin Y Y, et al. Decision trees for uncertain data [J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(1): 64-78.
[5] Kohavi R, Provost F. Applications of data mining to electronic commerce [J]. Data Mining and Knowledge Discovery, 2011, 5(1): 5-10.
[6] YAN Xiao-zhen, XIE Hong, WANG Tong. Data evaluation method using multiple regression based on correlation analysis [J]. Journal of Shenyang University of Technology, 2013, 35(2): 212-217. (in Chinese)
[7] ZHAI Yun, YANG Bing-ru, QU Wu, et al. Study on source of classification in imbalanced datasets based on new ensemble classifier [J]. Systems Engineering and Electronics, 2011, 33(1): 196-201. (in Chinese)
[8] XIANG Cheng-guan, XIONG Shi-huan, WANG Dong. Social network friends recommendation algorithm based on association rules [J]. China Sciencepaper, 2014, 9(1): 87-91. (in Chinese)
[9] ZHANG Zhi-gang, JI Gen-lin. Parallel algorithm for mining frequent item sets based on FP-Growth [J]. Computer Engineering and Applications, 2014, 50(2): 103-106. (in Chinese)
[10] HU Wei-hua, FENG Wei. Improved Apriori algorithm based on decomposed transaction matrix [J]. Journal of Computer Applications, 2014, 34(Sup2): 113-116. (in Chinese)

(Editor in charge: ZHONG Yuan; English proofreading: YIN Shu-ying)

Received: 2015-12-03.
Funding: Science and Technology Program Project of Liaoning Province (2012217005); Public Welfare Research Fund for Scientific Undertakings of Liaoning Province (2012004002).
About the author: ZHANG Wei-ke (1965-), male, born in Qinhuangdao, Hebei Province, lecturer, master's degree, mainly engaged in research on computer vision and intelligent detection and control.
DOI: 10.7688/j.issn.1000-1646.2016.03.14
CLC number: TP 311
Document code: A
Article ID: 1000-1646(2016)03-0314-05
* This paper was published online first on CNKI at 15:41 on 2016-04-22. Online publication address: http://www.cnki.net/kcms/detail/21.1189.T.20160422.1541.006.html