
        Oversampling ensemble algorithm based on margin theory

        2019-08-01 01:48:57 ZHANG Zongtang, CHEN Zhe, DAI Weiguo
        Journal of Computer Applications, 2019, Issue 5
        Keyword: machine learning

        ZHANG Zongtang, CHEN Zhe, DAI Weiguo

        Abstract: To address the problem that traditional ensemble algorithms are not suited to imbalanced data classification, an oversampling AdaBoost algorithm based on margin theory (MOSBoost) is proposed. First, the margins of the original samples are obtained by pre-training. Then, minority-class samples are heuristically duplicated according to their margin ranking, forming a new balanced sample set. Finally, the balanced sample set is fed to AdaBoost for training to obtain the final ensemble classifier. In experiments on UCI datasets, F-measure and G-mean are used to evaluate four algorithms: MOSBoost, AdaBoost, Random OverSampling AdaBoost (ROSBoost), and Random UnderSampling AdaBoost (RDSBoost). The experimental results show that MOSBoost outperforms the other three algorithms; compared with AdaBoost, MOSBoost improves F-measure and G-mean by 8.4% and 6.2%, respectively.

        Key words: imbalanced data; margin theory; oversampling method; ensemble classifier; machine learning

        CLC number: TP181

        Document code: A

        0 Introduction

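        In recent years, imbalanced data classification has become a hot topic in machine learning. It arises widely in real-world applications such as e-mail filtering[1], image classification[2], software defect prediction[3], medical diagnosis[4], and gene data analysis[5]. In a binary classification problem, the majority class of an imbalanced dataset has far more samples than the minority class. Traditional classification methods aim at overall classification accuracy and ignore class imbalance, which lowers the accuracy on minority-class samples. Yet minority-class samples are often the more valuable ones, so misclassifying them is costly.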
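The gap between overall accuracy and minority-class performance can be illustrated with a small sketch. It assumes the standard definitions of F-measure (harmonic mean of minority-class precision and recall) and G-mean (geometric mean of the two per-class accuracies); the function name `imbalance_metrics` and the toy labels are illustrative and do not come from the paper.

```python
import math

def imbalance_metrics(y_true, y_pred, minority=1):
    """Return (accuracy, minority-class F-measure, G-mean) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == minority and p == minority)
    fn = sum(1 for t, p in pairs if t == minority and p != minority)
    fp = sum(1 for t, p in pairs if t != minority and p == minority)
    tn = sum(1 for t, p in pairs if t != minority and p != minority)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0        # minority-class accuracy
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0   # majority-class accuracy
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return accuracy, f_measure, g_mean

# Toy case: 95 majority samples (-1), 5 minority samples (+1),
# and a classifier that always predicts the majority class.
y_true = [-1] * 95 + [1] * 5
y_pred = [-1] * 100
acc, f_measure, g_mean = imbalance_metrics(y_true, y_pred)
```

In this toy case the accuracy is high even though no minority sample is recognized, while both F-measure and G-mean drop to zero, which is why such criteria are preferred for imbalanced data.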

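        Methods for handling imbalanced data fall roughly into two levels, the algorithm level and the data level. Algorithm-level methods construct new algorithms or modify existing ones so that they favor the minority class. Data-level methods mainly use resampling to obtain a balanced sample set, which is then classified with existing classifiers. Resampling methods, comprising undersampling and oversampling, are simple in form and do not affect classifier design, and have therefore been studied extensively. By strategy, resampling can be further divided into random sampling and heuristic sampling: random sampling does not use information in the data and simply deletes or adds samples at random, whereas heuristic sampling exploits the internal characteristics of the data. Typical heuristic undersampling methods such as Tomek links[6], One-sided selection[7], and the Neighborhood Cleaning Rule[8] overcome random undersampling's tendency to lose useful information and improve performance to some extent. Among heuristic oversampling methods, the most representative are SMOTE (Synthetic Minority Oversampling TEchnique)[9] and its improved variants[10-12]. The basic assumption of SMOTE is that the convex set generated by neighboring data points of the same class also belongs to that class. Heuristic resampling methods essentially all screen samples under some criterion and depend strongly on the dataset. However, imbalanced datasets often exhibit within-class imbalance, small disjuncts, and high noise, which make the criterion hard to satisfy and thus degrade performance. On the surface this is a mismatch between dataset and criterion; in fact, these methods lack a theoretical foundation and generalize poorly.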
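As a rough illustration of the resampling strategies described above, the sketch below shows random oversampling, random undersampling, and the interpolation step behind SMOTE's convexity assumption. All function names are hypothetical, the data are made up, and the SMOTE-style helper omits the nearest-neighbor search of the published method: it only interpolates between two minority points that are given to it.

```python
import random

def random_oversample(minority, majority, rng):
    """Duplicate randomly chosen minority samples until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def random_undersample(minority, majority, rng):
    """Randomly discard majority samples until both classes have equal size."""
    return minority, rng.sample(majority, len(minority))

def smote_like_point(x, neighbor, rng):
    """Synthesize a point on the segment between a minority sample and a minority neighbor."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 1.0]]
majority = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0], [7.0, 7.0], [8.0, 8.0]]

over_min, over_maj = random_oversample(minority, majority, rng)
under_min, under_maj = random_undersample(minority, majority, rng)
synthetic = smote_like_point(minority[0], minority[1], rng)
```

Random oversampling only repeats existing points, random undersampling throws majority information away, and the interpolated point is new but is trustworthy only if the segment between the two minority samples really lies inside the minority class.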

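        AdaBoost is a classic ensemble classification algorithm that is widely used in machine learning[13-15]. AdaBoost aims to minimize the overall classification error and ignores the imbalance between classes, so it is not suited to imbalanced data classification. Margin theory is an important theoretical foundation of AdaBoost and successfully explains phenomena such as AdaBoost's resistance to overfitting. Starting from margin theory, this paper defines the minority-class margin and the majority-class margin, screens minority-class samples by the sign of their margin, and heuristically duplicates the minority-class samples with positive margin to form a new balanced sample set. Training AdaBoost on this sample set yields the MOSBoost algorithm, which improves classification performance on imbalanced data.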
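The following is a minimal sketch of the margin-based balancing idea, not the authors' procedure. It uses the usual normalized voting margin from margin theory, y·Σαt·ht(x)/Σαt, and keeps only minority samples with positive margin as duplication candidates, as the text describes. The duplication rule itself (ascending margin order, visited round-robin until the classes are balanced) is an assumption made for illustration, since this section does not specify it; `margin` and `balance_by_margin` are hypothetical names.

```python
from itertools import cycle, islice

def margin(y, votes, alphas):
    """Normalized voting margin of one sample: y * sum(a_t * h_t(x)) / sum(a_t)."""
    return y * sum(a * h for a, h in zip(alphas, votes)) / sum(alphas)

def balance_by_margin(samples, labels, margins, minority=1):
    """Duplicate positive-margin minority samples until the classes are balanced.

    ASSUMPTION: candidates are visited in ascending margin order, round-robin.
    The source text does not state the actual duplication rule.
    """
    n_min = sum(1 for y in labels if y == minority)
    n_maj = len(labels) - n_min
    candidates = sorted(
        (i for i, y in enumerate(labels) if y == minority and margins[i] > 0),
        key=lambda i: margins[i],
    )
    if not candidates:
        return list(samples), list(labels)
    picked = list(islice(cycle(candidates), max(n_maj - n_min, 0)))
    new_samples = list(samples) + [samples[i] for i in picked]
    new_labels = list(labels) + [minority] * len(picked)
    return new_samples, new_labels

# Toy pre-trained ensemble: three base classifiers with made-up weights and votes.
alphas = [0.5, 0.3, 0.2]
samples = ["m0", "m1", "m2", "M0", "M1", "M2", "M3", "M4", "M5"]
labels = [1, 1, 1, -1, -1, -1, -1, -1, -1]
votes = [[1, 1, 1], [1, -1, 1], [-1, -1, 1]] + [[-1, -1, -1]] * 6
margins = [margin(y, v, alphas) for y, v in zip(labels, votes)]
balanced_samples, balanced_labels = balance_by_margin(samples, labels, margins)
```

Here `m2` has a negative margin and is never duplicated, so the three added copies come only from `m0` and `m1`.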

        1 Related work

        1.1 AdaBoost algorithm

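        AdaBoost takes a training sample set {(x1,y1),(x2,y2),…,(xN,yN)} as input, where xi is a sample and yi is its class label; for binary classification, yi∈{-1,1}. It then runs a given base learning algorithm repeatedly over rounds t=1,2,…,T. Dt(i) denotes the weight of the i-th training sample in round t. The task of the base learning algorithm is to obtain, under the weight distribution Dt, a base classifier ht that minimizes the classification error. Once ht is trained, AdaBoost chooses a parameter αt∈R to measure the classification performance of ht, and then updates the weight distribution Dt. The final ensemble classifier F is the weighted output of the T base classifiers. The detailed procedure is given in Algorithm 1.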
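The loop described above can be sketched in plain Python. This is an illustrative version under stated assumptions rather than the paper's Algorithm 1: the base learner is taken to be a one-feature decision stump, αt is set to the standard choice 0.5·ln((1-εt)/εt), and the weights follow the usual exponential update with renormalization. The names `train_stump`, `stump_predict`, `adaboost`, and `predict` are hypothetical.

```python
import math

def train_stump(X, y, D):
    """Return (weighted error, feature, threshold, polarity) of the best decision stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                pred = [pol if x[j] <= thr else -pol for x in X]
                err = sum(d for d, p, t in zip(D, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def stump_predict(stump, x):
    """Apply a stump: output pol on one side of the threshold, -pol on the other."""
    _, j, thr, pol = stump
    return pol if x[j] <= thr else -pol

def adaboost(X, y, T):
    """Run T rounds of AdaBoost; return a list of (alpha_t, h_t) pairs."""
    n = len(X)
    D = [1.0 / n] * n                                  # D_1: uniform weights
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, D)                   # h_t fitted under D_t
        eps = min(max(stump[0], 1e-10), 1 - 1e-10)     # clip to avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)        # alpha_t
        ensemble.append((alpha, stump))
        # Raise the weights of misclassified samples, lower the rest, renormalize.
        D = [d * math.exp(-alpha * t * stump_predict(stump, x))
             for d, x, t in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    """F(x): sign of the alpha-weighted vote of the base classifiers."""
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single stump can separate.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, T=3)
predictions = [predict(model, x) for x in X]
```

Each stump alone misclassifies at least two of the six points, so any correct fit on this toy set has to come from the weighted combination.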

        References

        [1] DAI H L. Class imbalance learning via a fuzzy total margin based support vector machine[J]. Applied Soft Computing, 2015, 31(C): 172-184.

        [2] TAN J F, ZHU Y, CHEN T X, et al. Imbalanced image classification approach based on convolution network and cost-sensitivity[J]. Journal of Computer Applications, 2018, 38(7): 1862-1865, 1871.

        [3] WANG S, YAO X. Using class imbalance learning for software defect prediction[J]. IEEE Transactions on Reliability, 2013, 62(2): 434-443.

        [4] OZCIFT A, GULTEN A. Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms[J]. Computer Methods and Programs in Biomedicine, 2011, 104(3): 443-451.

        [5] YU H, NI J, ZHAO J. ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data[J]. Neurocomputing, 2013, 101: 309-318.

        [6] TOMEK I. Two modifications of CNN[J]. IEEE Transactions on Systems, Man and Cybernetics, 1976, SMC-6(11): 769-772.

        [7] KUBAT M, MATWIN S. Addressing the curse of imbalanced training sets: one-sided selection[C]// Proceedings of the 14th International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 1997: 179-186.

        [8] LAURIKKALA J. Improving identification of difficult small classes by balancing class distribution[C]// Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe. Berlin: Springer, 2001: 63-66.

        [9] CHAWLA N, BOWYER K, HALL L, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, 16(1): 321-357.

        [10] RIVERA W A. Noise reduction a priori synthetic oversampling for class imbalanced data sets[J]. Information Sciences, 2017, 408(C): 146-161.

        [11] MA L, FAN S. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests[J]. BMC Bioinformatics, 2017, 18(1): 169.

        [12] BOROWSKA K, STEPANIUK J. Imbalanced data classification: a novel resampling approach combining versatile improved SMOTE and rough sets[C]// CISIM 2016: IFIP International Conference on Computer Information Systems and Industrial Management. Berlin: Springer, 2016: 31-42.

        [13] BAIG M M, AWAIS M M, EL-ALFY E S M. AdaBoost-based artificial neural network learning[J]. Neurocomputing, 2017, 248(C): 120-126.

        [14] MINZ A, MAHOBIYA C. MR image classification using Adaboost for brain tumor type[C]// Proceedings of the 2017 IEEE 7th International Advance Computing Conference. Washington, DC: IEEE Computer Society, 2017:701-705.

        [15] WANG J, FEI K, CHENG Y. Prediction of rainfall based on improved Adaboost-BP model[J]. Journal of Computer Applications, 2017, 37(9): 2689-2693.

        [16] SCHAPIRE R E, FREUND Y, BARTLETT P, et al. Boosting the margin: a new explanation for the effectiveness of voting methods[J]. Annals of Statistics, 1998, 26(5): 1651-1686.

        [17] GAO W, ZHOU Z H. On the doubt about margin explanation of boosting[J]. Artificial Intelligence, 2013,203:1-18.

        [18] BACHE K, LICHMAN M. UCI repository of machine learning databases[DB/OL].[2018-06-20].http://www.ics.uci.edu/~mlearn/MLRepository.html.

        [19] van HULSE J, KHOSHGOFTAAR T M, NAPOLITANO A. Experimental perspectives on learning from imbalanced data[C]// Proceedings of the 24th International Conference on Machine Learning. New York: ACM, 2007: 935-942.

        [20] LIU N, WEI L W, AUNG Z. Handling class imbalance in customer behavior prediction[C]// Proceedings of the 2014 International Conference on Collaboration Technologies and Systems. Piscataway, NJ: IEEE, 2014: 100-103.
