陳玉明,曾志強,田翠華
1.廈門理工學院 計算機與信息工程學院,福建 廈門 361024
2.江西師范大學 國家網(wǎng)絡化支撐軟件國際科技合作基地,南昌 330027
鄰域粗糙集中不確定性的熵度量方法*
陳玉明1,2+,曾志強1,田翠華1
1.廈門理工學院 計算機與信息工程學院,福建 廈門 361024
2.江西師范大學 國家網(wǎng)絡化支撐軟件國際科技合作基地,南昌 330027
CHEN Yuming,ZENG Zhiqiang,TIAN Cuihua.Uncertainty measures using entropy and neighborhood rough sets.Journal of Frontiers of Computer Science and Technology,2016,10(12):1793-1800.
針對傳統(tǒng)粗糙集理論中不確定度量方法難以適用于鄰域粗糙集模型的問題,引入信息熵的度量方法,提出了基于信息熵的鄰域粗糙集不確定性度量方法。該方法采用鄰域關系對連續(xù)型數(shù)據(jù)進行信息?;?,基于?;蟮臄?shù)據(jù)定義鄰域系統(tǒng)中的近似精度、鄰域信息熵、加權鄰域信息熵等不確定性度量。進一步提出鄰域系統(tǒng)不確定性度量的公理化表示,證明鄰域系統(tǒng)的近似精度、鄰域信息熵、加權鄰域信息熵都是公理化度量;給出其最大最小值,證明其滿足單調性原理。理論分析與實驗表明鄰域系統(tǒng)中的信息熵度量優(yōu)于近似精度度量。
鄰域粗糙集;鄰域信息熵;不確定性度量;信息系統(tǒng);近似精度
粗糙集理論由波蘭科學家Pawlak于1982年提出[1],是一種處理不精確、不確定與海量數(shù)據(jù)的理論與方法,近二十年來被廣泛應用于機器學習[2]、數(shù)據(jù)挖掘[3]、圖像處理[4]、離群點檢測[5]、特征選擇[6]、大數(shù)據(jù)分析與處理等領域[7]。粗糙集理論中度量方法的研究是機器學習領域關鍵的研究內(nèi)容之一。良好的度量工具能有效評價信息系統(tǒng)與決策系統(tǒng)的不確定性,提高機器學習中聚類與分類的精度與效率。
粗糙集理論中的不確定性度量是評價系統(tǒng)分類能力及提高分類精度的重要工具,國內(nèi)外眾多學者對此進行了研究。Pawlak[8]采用上下近似的比值構造的精度來度量等價類集合的不確定性,進一步用近似精度來度量決策系統(tǒng)的不確定性。精度與近似精度是隨屬性的增加而遞增的函數(shù),Pawlak同時提出了粗糙度與近似粗糙度兩個單調性遞減的度量工具。然而,Pawlak的不確定性度量并不精細,存在精度或者粗糙度一樣而等價類集合卻不一樣的情況。因此,很多學者從不同角度進行了改進,提出了信息質量[9]、近似質量[10]、知識粒度[11]、信息粒度[12]等度量。苗奪謙、梁吉業(yè)等人將熵的概念引入粗糙集領域,提供了更加有效而精細的度量工具,主要包括信息熵[13]、條件熵[14]、互信息[13]與粗糙熵[15]等度量。
粗糙集理論中的這些度量工具與方法已經(jīng)廣泛應用于機器學習與數(shù)據(jù)挖掘的研究。經(jīng)典粗糙集主要適用于具有離散型數(shù)據(jù)的決策系統(tǒng),而對于廣泛存在的連續(xù)型數(shù)據(jù),需進行離散化預處理,但為此造成了分類信息丟失,分類精度降低等問題。胡清華等人提出了鄰域粗糙集模型[16],能夠處理具有連續(xù)型數(shù)據(jù)的知識分類系統(tǒng),已經(jīng)廣泛用于屬性約簡[17]、特征選擇與提取[18]、分類與聚類[19]、基因選擇[20]、圖像處理[21]等領域。然而,鄰域關系并不是嚴格的等價關系,經(jīng)典的不確定性度量工具與方法并不適用于鄰域知識分類系統(tǒng)。
本文在深入研究經(jīng)典粗糙集度量工具與方法的基礎上,針對連續(xù)型數(shù)據(jù)的特點,引入鄰域粗糙集模型與信息熵度量,提出基于鄰域信息熵的不確定性度量方法。首先,對信息系統(tǒng)進行鄰域粒化,構造鄰域類集合;其次,定義鄰域精度與鄰域粗糙度概念度量鄰域類集合的不確定性,采用鄰域近似精度與鄰域近似粗糙度概念度量鄰域決策系統(tǒng)的不確定性;進一步,提出鄰域信息熵、加權鄰域信息熵等概念,用于度量連續(xù)型知識分類系統(tǒng)的不確定性及分類能力,證明了鄰域精度、鄰域信息熵及加權鄰域信息熵度量的單調性原理;最后,通過理論分析與實驗表明鄰域系統(tǒng)中的鄰域信息熵度量及加權鄰域信息熵度量優(yōu)于近似精度度量。
Pawlak粗糙集理論對離散型數(shù)據(jù)進行等價類劃分,形成等價類集合。而對于現(xiàn)實世界廣泛存在的連續(xù)型數(shù)據(jù),需要進行離散化處理后構造合適的等價類,但是離散化過程容易造成分類信息的丟失。為此,針對Pawlak粗糙集理論的局限性,引入鄰域粗糙集模型,給出鄰域粗糙集的相關概念[16],并討論鄰域粗糙集的精度度量與粗糙度度量。
信息熵是一種有效而精細的不確定性度量工具。經(jīng)典粗糙集中基于信息熵的度量并不適用于鄰域粗糙集模型,需要進行擴展與改進。因此,根據(jù)鄰域粗糙集模型的特點,引入信息熵理論,定義鄰域系統(tǒng)中鄰域信息熵的概念,證明該概念是一種公理化度量,給出其最大最小值,并證明其滿足單調性原理。進一步定義了基于鄰域信息熵與鄰域近似精度的加權度量,證明了相關性質。
Table 1 The first medicine decision system表1 醫(yī)療決策系統(tǒng)之一
從以上例子可知,鄰域近似精度、鄰域信息熵與加權鄰域信息熵度量都是隨特征子集的增加而遞增,不確定性增加,能夠度量鄰域系統(tǒng)的不確定性。然而,鄰域近似精度度量不夠精細。特征子集從{a}增加到{a,b},不確定性發(fā)生變化,鄰域近似精度的值卻沒有變化,鄰域信息熵和加權鄰域信息熵度量的值都增大,說明這兩個度量優(yōu)于鄰域近似精度度量。
為驗證鄰域信息熵度量的有效性,分別采用表1和表2中的數(shù)據(jù)進行不確定性度量實驗。度量方法分別采用精度度量、鄰域信息熵度量、加權鄰域信息熵度量。實驗中鄰域?;捎脷W氏距離,表1中的鄰域參數(shù)為0.3,表2中的鄰域參數(shù)為0.45。實驗結果如圖1和圖2所示。
Table 2 The second medicine decision system表2 醫(yī)療決策系統(tǒng)之二
Fig.1 Measure result of Table 1圖1 表1數(shù)據(jù)的度量結果
Fig.2 Measure result of Table 2圖2 表2數(shù)據(jù)的度量結果
由圖1與圖2中的度量結果可知,近似精度、鄰域信息熵、加權鄰域信息熵的值隨特征個數(shù)的增加而單調遞增,不確定性增加,能夠度量數(shù)據(jù)的不確定性。進一步分析可知,圖1中,特征個數(shù)從1增加到2,近似精度沒有變化,而鄰域信息熵和加權鄰域信息熵遞增;圖2中,特征個數(shù)從2增加到3,近似精度沒有變化,而鄰域信息熵和加權鄰域信息熵遞增。這些結果表明近似精度度量不夠精細,有時并不能反映不確定性的變化,而鄰域信息熵與加權鄰域信息熵則具有更好的不確定性度量性能。
傳統(tǒng)Pawlak粗糙集模型主要處理離散型數(shù)據(jù)集,對于連續(xù)型數(shù)據(jù)集則需要離散化預處理過程。然而,離散化算法不可避免會造成重要信息的損失,甚至降低機器學習算法的分類精度。為此,針對連續(xù)型的數(shù)據(jù)集的特點,在決策系統(tǒng)中引入鄰域關系、信息熵理論,定義鄰域近似精度、鄰域信息熵與加權鄰域信息熵等概念度量連續(xù)型數(shù)據(jù)的不確定性,并證明了鄰域近似精度、鄰域信息熵與加權鄰域信息熵的單調性,為機器學習相關分類算法的研究提供了理論基礎。
鄰域系統(tǒng)中的近似精度、鄰域信息熵等能夠度量數(shù)據(jù)的不確定性,不僅適用于連續(xù)型數(shù)據(jù)集,而且也適用于離散型數(shù)據(jù)集。因此,這些度量能夠應用于現(xiàn)實世界大量存在的同時具備以上兩種類型的復雜數(shù)據(jù)集,進一步可以基于不確定性度量構造特征重要度,應用于屬性約簡、特征選擇等領域。
[1]Pawlak Z.Rough sets[J].International Journal of Information and Computer Sciences,1982,11(1):341-356.
[2]Duan Jie,Hu Qinghua,Zhang Lingjun,et al.Feature selection for multi-label classification based on neighborhood rough sets[J].Journal of Computer Research and Development, 2015,52(1):56-65.
[3]Tseng T L,Huang C C,Fraser K,et al.Rough set based rule induction in decision making using credible classification and preference from medical application perspective[J]. Computer Methods and Programs in Biomedicine,2016, 127(4):273-289.
[4]Yue Xiaodong,Miao Duoqian,Zhong Caiming.Roughness measure approach to color image segmentation[J].Acta Automatica Sinica,2010,36(6):807-816.
[5]Jiang Feng,Du Junwei,Ge Yan,et al.Sequence outlier detection based on rough set theory[J].Acta Electronica Sinica, 2011,39(2):345-350.
[6]Chen Yumin,Miao Duoqian,Wang Ruizhi.A rough set approach to feature selection based on ant colony optimization [J].Pattern Recognition Letters,2010,31(3):226-233.
[7]Qian Jin,Lv Ping,Yue Xiaodong,et al.Hierarchical attribute reduction algorithms for big data using MapReduce[J]. Knowledge-Based Systems,2015,73:18-31.
[8]Pawlak Z.Rough sets[M].Dordrecht:Kluwer Academic Publishers,1991:45-64.
[9]Liang Jiye,Li Ru,Qian Yuhua.Distance:a more comprehensible perspective for measures in rough set theory[J]. Knowledge-Based Systems,2012,27(11):126-136.
[10]Dai Jianhua,Xu Qing.Approximations and uncertainty measures in incomplete information systems[J].Information Sciences,2012,198:62-80.
[11]Miao Duoqian,Fan Shidong.The calculation of knowledge granulation and its application[J].Systems Engineering-Theory&Practice,2002,22(1):48-56.
[12]Zhang Wenxiu,Wu Weizhi,Liang Jieye,et al.Rough set theory and methods[M].Beijing:Science Press,2001.
[13]Miao Duoqian,Wang Jue.An information representation of the concepts and operations in rough set theory[J].Journal of Software,1999,10(2):113-116.
[14]Wang Guoyin,Zhang Qinghua.Uncertainty of rough sets in different knowledge granularities[J].Chinese Journal of Computers,2008,31(9):1588-1598.
[15]Liang Jiye,Shi Zhongzhi,Li Deyu.Information entropy, rough entropy and knowledge granulation in incomplete information systems[J].International Journal of General System,2006,35(6):641-654.
[16]Hu Qinghua,Yu Daren,Xie Zongxia.Neighborhood classifiers[J].Expert Systems with Applications,2008,34(2): 866-876.
[17]Liu Yong,Huang Wenliang,Jiang Yunliang,et al.Quick attribute reduct algorithm for neighborhood rough set model [J].Information Sciences,2014,271:65-81.
[18]Xie Juanying,Li Nan,Qiao Zirui.Feature subset selection algorithms for incomplete decision systems based on neighborhood rough sets[J].Journal of Nanjing University:Natural Sciences,2011,47(4):383-390.
[19]Yao Ping Y,Lu Yongheng.Neighborhood rough set and SVM based hybrid credit scoring classifier[J].Expert Systems withApplications,2011,38(9):11300-11304.
[20]Meng Jun,Zhang Jing,Luan Yushi.Gene selection integrated with biological knowledge for plant stress response using neighborhood system and rough set theory[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015,12(2):433-444.
[21]Yu Ying,Pedrycz W,Miao Duoqian.Neighborhood rough sets based multi-label classification for automatic image annotation[J].International Journal of Approximate Reasoning, 2013,54(9):1373-1387.
[22]Chen Yumin,Wu Keshou,Chen Xuhui,et al.An entropybased uncertainty measurement approach in neighborhood systems[J].Information Sciences,2014,279:239-250.
附中文參考文獻:
[2]段潔,胡清華,張靈均,等.基于鄰域粗糙集的多標記分類特征選擇算法[J].計算機研究與發(fā)展,2015,52(1):56-65.
[4]岳曉冬,苗奪謙,鐘才明.基于粗糙性度量的彩色圖像分割方法[J].自動化學報,2010,36(6):807-816.
[5]江峰,杜軍威,葛艷,等.基于粗糙集理論的序列離群點檢測[J].電子學報,2011,39(2):345-350.
[11]苗奪謙,范世棟.知識的粒度計算及其應用[J].系統(tǒng)工程理論與實踐,2002,22(1):48-56.
[12]張文修,吳偉志,梁吉業(yè),等.粗糙集理論與方法[M].北京:科學出版社,2001.
[13]苗奪謙,王玨.粗糙集理論中概念與運算的信息表示[J].軟件學報,1999,10(2):113-116.
[14]王國胤,張清華.不同知識粒度下粗糙集的不確定性研究[J].計算機學報,2008,31(9):1588-1598.
[18]謝娟英,李楠,喬子芮.基于鄰域粗糙集的不完整決策系統(tǒng)特征選擇算法[J].南京大學學報,2011,47(4):383-390.
CHEN Yuming was born in 1977.He received the Ph.D.degree from Tongji University in 2010.Now he is an associate professor at Xiamen University of Technology,and the member of CCF.His research interests include rough sets and feature selection,etc.
陳玉明(1977—),男,江西吉安人,2010年于同濟大學獲得博士學位,現(xiàn)為廈門理工學院副教授,CCF會員,主要研究領域為粗糙集,特征選擇等。
ZENG Zhiqiang was born in 1971.He received the Ph.D.degree from Zhejiang University in 2007.Now he is the vice dean at College of Computer and Information Engineering,Xiamen University of Technology.His research interests include artificial intelligence and pattern recognition,etc.
曾志強(1971—),男,福建廈門人,2007年于浙江大學獲得博士學位,現(xiàn)為廈門理工學院計算機與信息工程學院副院長,主要研究領域為人工智能,模式識別等。
TIAN Cuihua was born in 1970.She received the Ph.D.degree from Northeastern University in 2008.Now she is an associate professor at Xiamen University of Technology.Her research interests include data mining and big data,etc.
田翠華(1970—),女,遼寧沈陽人,2008年于東北大學獲得博士學位,現(xiàn)為廈門理工學院副教授,主要研究領域為數(shù)據(jù)挖掘,大數(shù)據(jù)等。
Uncertainty Measures Using Entropy and Neighborhood Rough Sets*
CHEN Yuming1,2+,ZENG Zhiqiang1,TIAN Cuihua1
1.College of Computer and Information Engineering,Xiamen University of Technology,Xiamen,Fujian 361024,China
2.State International S&T Cooperation Base of Networked Supporting Software,Jiangxi Normal University,Nanchang 330027,China
+Corresponding author:E-mail:cym0620@163.com
In view of the fact that the uncertainty measures of classical rough set theory are difficult to be suitable for neighborhood rough set model,this paper proposes an uncertainty measurement method based on information entropy and neighborhood rough sets.By the definitions of neighborhood relation,each object in the universe is assigned with a neighborhood subset,called neighborhood granule.Some uncertainty measures of neighborhood granule are defined,including approximate accuracy,information entropy and weighted information entropy in the neighborhood system.Furthermore,this paper presents the axiomatic concept of measure,and proves that the three measures are axiomatic uncertainty measures.This paper also gives the maximum and minimum of these measures and proves their monotonicities.Theoretical analysis and experiments show that the information entropy measure in the neighborhood system is better than the approximate accuracy measure.
10.3778/j.issn.1673-9418.1605037
A
TP18
*The National Natural Science Foundation of China under Grant No.61573297(國家自然科學基金);the Natural Science Foundation of Fujian Province under Grant Nos.2015J01277,2016J01324(福建省自然科學基金);the Project of Department of Education of Fujian Province under Gant Nos.JA09217,JB13152(福建省教育廳項目);the Program for New Century Excellent Talents in Fujian Province(福建省高校新世紀優(yōu)秀人才支持計劃).
Received 2016-04,Accepted 2016-06.
CNKI網(wǎng)絡優(yōu)先出版:2016-06-27,http://www.cnki.net/kcms/detail/11.5602.TP.20160627.0929.002.html
Key words:neighborhood rough sets;neighborhood information entropy;uncertainty measure;information system; approximation accuracy