
Credit card fraud classification based on the GAN-AdaBoost-DT imbalanced classification algorithm

2019-08-01 01:57:38 MO Zan, GAI Yanrong, FAN Guanlong

Journal of Computer Applications, 2019, Issue 2


Abstract: To address the limited classification performance of traditional single classifiers on imbalanced data, a new classification algorithm for binary-class imbalanced data sets was proposed based on Generative Adversarial Nets (GAN) and ensemble learning, namely Generative Adversarial Nets-Adaptive Boosting-Decision Tree (GAN-AdaBoost-DT). Firstly, a GAN was trained to obtain a generative model, which produced minority class samples to reduce the imbalance of the data. Then, the generated minority class samples were fed into the Adaptive Boosting (AdaBoost) framework and their weights were changed, improving the AdaBoost model and the classification performance of AdaBoost with Decision Tree (DT) as the base classifier. The Area Under the receiver operating characteristic Curve (AUC) was used as the evaluation metric. Experimental results on a credit card fraud data set show that, compared with synthetic minority over-sampling ensemble learning, the proposed algorithm increased accuracy by 4.5% and AUC by 6.5%; compared with modified synthetic minority over-sampling ensemble learning, it increased accuracy by 4.9% and AUC by 5.9%; compared with random under-sampling ensemble learning, it increased accuracy by 4.5% and AUC by 5.4%. Experimental results on other data sets from UCI and KEEL show that the proposed algorithm can improve overall accuracy on imbalanced binary classification problems and optimise classifier performance.

Key words: Generative Adversarial Nets (GAN); ensemble learning; imbalanced classification; binary-class classification; Adaptive Boosting (AdaBoost); Decision Tree (DT); credit card fraud

CLC number: TP391

Document code: A

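0 Introduction

Imbalanced data refers to data sets in which one or several classes have far more samples than the others. Classes with many samples are usually called majority classes and classes with few samples minority classes [1]. In imbalanced data sets, recognising the minority class matters most. In fault diagnosis [2], for example, machine faults form the minority class, and diagnosing a fault as normal operation delays the project and causes avoidable losses. Because of the complex characteristics of imbalanced data sets, traditional classification algorithms learn fewer classification rules for the minority class than for the majority class, and those rules perform worse [3]; this is the imbalanced classification problem. It has become one of the challenges in data mining [4] and is now common in bank credit scoring [5], anomaly detection [6], face recognition [7], medical diagnosis [8], e-mail classification [9] and other fields.

The credit card fraud detection problem studied in this paper is also an imbalanced classification problem. In credit card fraud detection, a bank uses feature variables related to a customer's credit status to predict whether a payment record is a fraudulent transaction. Fraudulent transactions are the minority class, but the financial loss caused by misclassifying a single fraudulent transaction cannot be recovered by correctly classifying hundreds or thousands of normal ones. To avoid losses from credit risk, identifying fraudulent transaction records is particularly important.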

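Current methods for handling imbalance fall into two categories. A common approach works at the data level, using under-sampling or over-sampling to redistribute the classes, for example the Synthetic Minority Over-sampling Technique (SMOTE) proposed in [10] and the Adaptive Synthetic Sampling Approach (ADASYN) proposed in [11]. Under-sampling can improve the model's performance on minority class samples, but it discards information from the majority class, so the model cannot make full use of the available information. Traditional over-sampling can generate minority class samples, but it generates them from the existing minority data and so only uses the information already contained in the current minority class; the result lacks diversity and can lead to overfitting to some extent.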
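The limitation of interpolation-based over-sampling described above can be illustrated with a minimal sketch. This is not the reference SMOTE implementation from [10]; the function name `smote_like_oversample` and its parameters are hypothetical, and it only reproduces the core idea of interpolating between a minority sample and one of its nearest minority neighbours.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority sample and one of its k nearest minority neighbours.
    Assumes at least two distinct minority samples (illustrative only)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbour)))
    return synthetic

# Four minority samples on the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority, n_new=10, k=2)
```

Every synthetic point lies on a segment between two existing minority samples, so it can never leave the region those samples already span. That is the lack of diversity the paragraph refers to.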

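The other category works at the algorithm level and includes ensemble learning and cost-sensitive learning. Ensemble learning combines multiple classifiers to avoid the bias that a single classifier introduces when predicting on imbalanced data [12]. Examples are the SMOTEBoost algorithm of [13], which introduces SMOTE in each iteration of the Adaptive Boosting (AdaBoost) model, and the RUSBoost algorithm of [14], which introduces Random Under-Sampling (RUS) in each iteration of AdaBoost. Cost-sensitive learning assigns a higher cost to misclassifying the minority class during the iterations of the algorithm [15] and is usually combined with ensemble learning. Cost-sensitive methods only modify the algorithm, add no computational overhead, are efficient, and can effectively improve classification on imbalanced data. However, because the cost-sensitive loss is introduced subjectively, the design of the loss function affects how the iterations behave, and the applicability of these methods is generally weak [16].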
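To make the point about subjectively chosen costs concrete, here is a small sketch of a cost-weighted error measure. It is not the AdaCost formulation of [15]; the function `misclassification_cost` and the cost values are assumptions chosen only for illustration.

```python
def misclassification_cost(y_true, y_pred, cost_fn=10.0, cost_fp=1.0):
    """Total cost of a set of predictions: cost_fn for every missed minority
    sample (label 1) and cost_fp for every majority sample (label 0)
    wrongly flagged as minority. Both cost values are illustrative."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return cost_fn * fn + cost_fp * fp

# 95 majority samples followed by 5 minority samples.
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100                      # misses all 5 minority samples
cautious = [1] * 20 + [0] * 75 + [1] * 4 + [0]   # 20 false alarms, 1 miss

for cost_fn in (1.0, 10.0):
    print(cost_fn,
          misclassification_cost(y_true, always_majority, cost_fn=cost_fn),
          misclassification_cost(y_true, cautious, cost_fn=cost_fn))
```

With equal costs the classifier that ignores the minority class looks better; with a tenfold cost on missed minority samples the ranking flips. Which classifier a cost-sensitive method prefers therefore depends directly on costs that the designer must pick.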

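Therefore, this paper generates minority class samples at the data level to balance the data and thereby improve the performance of traditional classification algorithms. Generative Adversarial Nets (GAN) [17] is a generative model proposed in 2014. Compared with traditional generative models, its generator produces synthetic data that approximates the real data from noise rather than directly from real samples, which can increase data diversity and avoid overfitting.

A single method can hardly meet the requirements of different imbalanced data sets and generally applies poorly across them, whereas a combined prediction model can exploit the strengths of each individual model and improve overall prediction performance. This paper therefore proposes the Generative Adversarial Nets-Adaptive Boosting-Decision Tree (GAN-AdaBoost-DT) algorithm for imbalanced binary classification. The algorithm first uses a GAN to generate minority class samples so that the data become balanced. It then applies the AdaBoost ensemble learning framework with Decision Tree (DT) as the base classifier, using the ensemble idea to improve the classification ability of DT on imbalanced data sets. The Area Under the receiver operating characteristic Curve (AUC) is used as the criterion for evaluating the classifier.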
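The boosting and evaluation steps described above can be sketched in plain Python. This is a textbook discrete AdaBoost with one-dimensional threshold stumps (depth-1 decision trees) and a rank-based AUC, not the authors' implementation: the GAN generation step and the paper's weight modification are omitted, and all function names and the toy data are assumptions.

```python
import math

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, sign) of the best 1-D stump.
    The stump predicts `sign` when x >= threshold and `-sign` otherwise."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if (sign if x >= threshold else -sign) != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best

def adaboost_fit(xs, ys, rounds=3):
    """Discrete AdaBoost, labels in {-1, +1}; returns [(alpha, threshold, sign)]."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, sign = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, sign))
        # Raise the weight of misclassified samples, lower the rest, renormalise.
        weights = [w * math.exp(-alpha * y * (sign if x >= threshold else -sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_score(ensemble, x):
    """Weighted vote of all stumps; larger means more likely minority (+1)."""
    return sum(alpha * (sign if x >= threshold else -sign)
               for alpha, threshold, sign in ensemble)

def auc(ys, scores):
    """Probability that a random +1 sample outscores a random -1 sample."""
    pos = [s for y, s in zip(ys, scores) if y == 1]
    neg = [s for y, s in zip(ys, scores) if y == -1]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data: the minority class (+1) sits at x = 5, 9 and 10.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [-1, -1, -1, -1, 1, -1, -1, -1, 1, 1]
for rounds in (1, 3):
    model = adaboost_fit(xs, ys, rounds=rounds)
    print(rounds, auc(ys, [adaboost_score(model, x) for x in xs]))
```

On this toy data a single stump cannot rank the isolated minority sample at x = 5 above the majority samples, while three boosting rounds rank every minority sample first. AUC is computed from the ranking of scores, so it does not depend on a decision threshold.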

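1 Related work

1.1 GAN algorithm

GAN is a generative model proposed in 2014 on the basis of zero-sum game theory. It consists of a generative model (G) and a discriminative model (D), both based on neural networks. The generative model produces data from a noise space z, and the discriminative model judges whether a sample is real or was produced by the generative model. The process amounts to a two-player game: G is trained so that the data it generates approach the distribution of the real data, while D is trained to distinguish real data from generated data. The two are optimised alternately, so the performance of both D and G keeps improving until the two networks reach a dynamic equilibrium in which D judges the data generated by G to be real with a probability close to 0.5. At that point the data produced by the generator approximate the real data. The computation flow is shown in Figure 1.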
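For reference, the two-player game described above is usually written as the minimax objective from the original GAN paper [17]. The symbols $p_{\text{data}}$ (real data distribution), $p_z$ (noise prior) and $p_g$ (distribution of generated data) follow that paper and do not appear in the text above.

```latex
\min_{G}\max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]

% For a fixed generator, the optimal discriminator is
D^{*}_{G}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)}
```

When the generated distribution matches the real one, $p_g = p_{\text{data}}$, the optimal discriminator outputs $D^{*}_{G}(x) = 1/2$ for every input. This is the equilibrium at which the discriminator's probability is close to 0.5.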

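4 Conclusion

To address the poor performance of traditional classification algorithms on imbalanced classification problems, this paper proposes GAN-AdaBoost-DT, an algorithm for imbalanced binary classification. The algorithm improves AdaBoost with a GAN: in each AdaBoost iteration it uses the GAN to generate minority class data, lowering the imbalance ratio and thereby improving the classification performance of AdaBoost-DT. Experimental results on the credit card fraud data set show that the method improves the recognition rate on imbalanced data and improves overall classifier performance. Experimental results on five data sets from UCI and KEEL show that the method achieves a higher recognition rate and better classification performance than the other algorithms compared.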

References:

[1] SEARLE S R. Linear Models for Unbalanced Data [M]. New York: John Wiley & Sons, 1987: 145-153.

[2] YANG Z, TANG W H, SHINTEMIROV A, et al. Association rule mining-based dissolved gas analysis for fault diagnosis of power transformers [J]. IEEE Transactions on Systems, Man & Cybernetics, Part C: Applications and Reviews, 2009, 39(6): 597-610.

[3] SUN Y, KAMEL M S, WONG A K C, et al. Cost-sensitive boosting for classification of imbalanced data [J]. Pattern Recognition, 2007, 40(12): 3358-3378.

[4] YANG Q, WU X. 10 challenging problems in data mining research [J]. International Journal of Information Technology & Decision Making, 2011, 5(4): 597-604.

[5] BROWN I, MUES C. An experimental comparison of classification algorithms for imbalanced credit scoring data sets [J]. Expert Systems with Applications, 2012, 39(3): 3446-3453.

[6] TAVALLAEE M, STAKHANOVA N, GHORBANI A A. Toward credible evaluation of anomaly-based intrusion-detection methods [J]. IEEE Transactions on Systems, Man & Cybernetics, Part C: Applications and Reviews, 2010, 40(5): 516-524.

[7] LIU Y-H, CHEN Y-T. Total margin based adaptive fuzzy support vector machines for multiview face recognition [C]// Proceedings of the 2005 IEEE International Conference on Systems, Man and Cybernetics. Washington, DC: IEEE Computer Society, 2005, 2: 1704-1711.

[8] MAZUROWSKI M A, HABAS P A, ZURADA J M, et al. Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance [J]. Neural Networks, 2008, 21(2/3): 427-436.

[9] BERMEJO P, GAMEZ J A, PUERTA J M. Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets [J]. Expert Systems with Applications, 2011, 38(3): 2072-2080.

[10] CHAWLA N V, BOWYER K W, HALL L O, et al. SMOTE: Synthetic Minority Over-Sampling Technique [J]. Journal of Artificial Intelligence Research, 2002, 16(1): 321-357.

[11] HE H, BAI Y, GARCIA E A, et al. ADASYN: adaptive synthetic sampling approach for imbalanced learning [C]// Proceedings of the 2008 International Joint Conference on Neural Networks. Piscataway, NJ: IEEE, 2008: 1322-1328.

[12] FREUND Y, SCHAPIRE R E. Experiments with a new boosting algorithm [C]// Proceedings of the Thirteenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann, 1996: 148-156.

[13] CHAWLA N V, LAZAREVIC A, HALL L O, et al. SMOTEBoost: improving prediction of the minority class in boosting [C]// Proceedings of the 2003 European Conference on Knowledge Discovery in Databases, LNCS 2838. Berlin: Springer, 2003: 107-119.

[14] SEIFFERT C, KHOSHGOFTAAR T M, van HULSE J, et al. RUSBoost: a hybrid approach to alleviating class imbalance [J]. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 2010, 40(1): 185-197.

[15] FAN W, STOLFO S J, ZHANG J, et al. AdaCost: misclassification cost-sensitive boosting [C]// Proceedings of the 16th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann, 1999: 97-105.

[16] CATENI S, COLLA V, VANNUCCI M. A method for resampling imbalanced datasets in binary classification tasks for real-world problems [J]. Neurocomputing, 2014, 135: 32-41.

[17] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets [C]// NIPS'14 Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2014, 2: 2672-2680.

[18] GOODFELLOW I. NIPS 2016 tutorial: generative adversarial networks [EB/OL]. (2016-12-31) [2017-09-24]. https://arxiv.org/pdf/1701.00160.pdf.

[19] LI J, MONROE W, SHI T, et al. Adversarial learning for neural dialogue generation [EB/OL]. (2017-07-13) [2018-05-02]. https://arxiv.org/pdf/1701.06547v1.pdf.

[20] YU L, ZHANG W, WANG J, et al. SeqGAN: sequence generative adversarial nets with policy gradient [EB/OL]. (2017-08-25) [2018-05-02]. https://arxiv.org/pdf/1609.05473.pdf.

[21] HU W W, TAN Y. Generating adversarial malware examples for black-box attacks based on GAN [EB/OL]. (2017-02-20) [2018-05-02]. https://arxiv.org/pdf/1702.05983v1.pdf.

[22] CHIDAMBARAM M, QI Y. Style transfer generative adversarial networks: learning to play chess differently [EB/OL]. (2017-05-07) [2018-07-02]. https://arxiv.org/pdf/1702.06762v1.pdf.

[23] FREUND Y, SCHAPIRE R E. A decision-theoretic generalization of on-line learning and an application to boosting [J]. Journal of Computer & System Sciences, 1997, 55(1): 119-139.

[24] HUNT E, KRIVANEK J. The effects of pentylenetetrazole and methylphenoxypropane on discrimination learning [J]. Psychopharmacology, 1966, 9(1): 1-16.

[25] BOSE I, FARQUAD M A H. Preprocessing unbalanced data using support vector machine [J]. Decision Support Systems, 2012, 53(1): 226-233.

[26] ZHANG S, ZHANG H X. Modified KNN algorithm for multi-label learning [J]. Application Research of Computers, 2011, 28(12): 4445-4450. (in Chinese)

[27] LI Y J, GUO H X, LI Y N, et al. A boosting based on ensemble learning algorithm in imbalanced data classification [J]. Systems Engineering - Theory & Practice, 2016, 36(1): 189-199. (in Chinese)

