Li Jianchun, Li Zhi, Wan Li, Li Jian
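Abstract: Missing data is a common and unavoidable problem in clinical trials. Albumin values may be missing because medical equipment is lacking or because patients skip the albumin test. With the wide adoption of machine learning, many researchers have applied it to missing-data estimation. This paper proposes an algorithm that combines random forests with clustering, the double random forest regression method, and applies it to estimating missing albumin values. Compared with the nearest-neighbor, decision tree, and random forest methods, double random forest regression improves accuracy and robustness to varying degrees. The algorithm offers a new approach to handling missing values effectively and can serve as a reference for other research on missing-value estimation.
Key words: hemodialysis; albumin; random forest; missing value; data missing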
DOI:10.11907/rjdk.173135
CLC number: TP319
Document code: A  Article ID: 1672-7800(2018)005-0124-03
Abstract: Missing data is a common problem in clinical trials. Albumin (ALB) is an important indicator because it is associated with prognosis and mortality in patients with renal failure. Owing to a lack of medical equipment or to patients skipping the albumin test, albumin values may be missing. With the widespread application of machine learning, many researchers have applied it to missing-data estimation in order to improve dataset quality, with good results. This paper proposes a method that combines two passes of random forest regression with clustering, namely Random Forest Regression-Kmeans-Random Forest Regression (RKR), and applies it to estimating missing albumin values. The algorithm exploits the strength of random forests in predicting nonlinear datasets. The process has three parts. First, random forest regression imputes the missing albumin values. Second, the Kmeans method clusters the dataset into six classes. Third, random forest regression is applied again to impute the missing albumin values. In terms of accuracy and robustness, the method performs better than nearest-neighbor regression, decision tree regression, and random forest regression. The algorithm provides a new approach to the efficient processing of missing values and can serve as a reference for other researchers studying missing-value estimation.
Key Words:hemodialysis; albumin; random forest; missing value; data missing
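0 Introduction
Missing data is a common and unavoidable problem in clinical trials. Albumin (ALB) is a very important indicator for patients with renal failure and is associated with their prognosis and mortality [1-4]. Albumin values may be missing because medical equipment is lacking or because patients skip the albumin test. With the wide adoption of machine learning, many researchers have applied it to missing-data estimation, using methods such as multiple linear regression, K-Nearest Neighbor (KNN), Bayesian Principal Component Analysis (BPCA) [11], and Decision Tree (DT) [5-8]. However, these methods do not make full use of the particular characteristics of patient examination data, and their estimation accuracy is limited [10-12]. Random Forest (RF) [9] builds on the DT algorithm; its advantage is that it reduces the overfitting to which DT is prone, which makes it a feasible means of dealing with missing data. Two problems remain, however: ① the final prediction of RF regression is the average of the predictions of the individual trees, which introduces some error; ② when estimating missing values, many researchers predict the missing values directly without considering the influence of the characteristics of the records that contain them, which introduces further error [14-15].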
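Problem ① can be illustrated with a short sketch. It assumes scikit-learn's RandomForestRegressor as the RF implementation and uses synthetic data rather than the dialysis dataset; both choices are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic nonlinear data, purely illustrative (not the paper's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Predictions of the individual trees, shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X[:10]) for tree in forest.estimators_])

# The forest's regression output is the arithmetic mean over its trees.
forest_pred = forest.predict(X[:10])
tree_mean = per_tree.mean(axis=0)

# Spread of the individual trees around that mean; averaging hides it.
tree_spread = per_tree.std(axis=0)
```

Here `forest_pred` and `tree_mean` agree up to floating-point rounding, while `tree_spread` shows how much the individual trees disagree on each record.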
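To address these problems, this paper proposes a missing-value estimation method that combines random forests with K-means clustering, namely the double random forest regression method (Random Forest Regression-Kmeans-Random Forest Regression, RKR), and uses the Normalized Mean Square Error (NMSE) [13] and the Normalized Root Mean Square Deviation (NRMSD) [6] to measure the accuracy and stability of the algorithm.
1 Basic Principles and Method
1.1 Double Random Forest (RKR) Method
Double random forest (RKR) combines random forests with K-means clustering. Random Forest Regression (RFR) first produces an initial estimate of the missing values and fills them in; K-means clustering is then run on the completed data. Experiments showed that six clusters worked best. Once the six subsamples are obtained, RFR is applied again within each subsample that contains missing values to estimate them. The experimental results show that the algorithm effectively improves the accuracy of missing-value estimation.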
The procedure consists of the following steps: ① Obtain the complete dataset DataSet0, randomly select a specified proportion of records to form the training set DataSetTrain, and blank the values of the target indicator in the remaining records to form the test set DataSetTest. ② Train a Random Forest (RF) on DataSetTrain and estimate the missing values in DataSetTest, yielding a new dataset DataSetTest1. Merge DataSetTest1 with DataSetTrain into a new dataset DataSet1, and use K-means clustering to partition DataSet1 into six clusters: DataCluster0, DataCluster1, DataCluster2, DataCluster3, DataCluster4, and DataCluster5. ③ In DataCluster0, blank the target-indicator values of the records that also belong to DataSetTest1; the records in DataCluster0 whose target indicator is not blank form DataClusterTrain0, and the remaining records form DataClusterTest0. ④ Train an RF on DataClusterTrain0, predict the missing target-indicator values in DataClusterTest0, and store the predictions in the dataset DataSetPredicted. ⑤ Repeat steps ③ and ④ for DataCluster1 through DataCluster5.
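Steps ① to ⑤ can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, the function name `rkr_impute` is hypothetical, the number of trees and the random seeds are arbitrary, and clustering on the predictors together with the first-pass target is an assumption, since the paper does not state which columns are clustered or whether they are scaled. Keeping the first-pass estimate when a cluster has no training records is likewise an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


def rkr_impute(X_train, y_train, X_test, n_clusters=6, random_state=0):
    """Hypothetical helper: second-pass estimates of the blanked target."""
    # Step 2a: the first random forest pass fills the blanked target values.
    rf1 = RandomForestRegressor(n_estimators=100, random_state=random_state)
    rf1.fit(X_train, y_train)
    y_first = rf1.predict(X_test)

    # Step 2b: merge training records with the first-pass test records
    # (DataSet1) and cluster them. No feature scaling is applied here.
    X_all = np.vstack([X_train, X_test])
    y_all = np.concatenate([y_train, y_first])
    is_test = np.concatenate(
        [np.zeros(len(X_train), bool), np.ones(len(X_test), bool)]
    )
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(np.column_stack([X_all, y_all]))

    # Steps 3-5: inside each cluster, treat the test targets as blank again
    # and re-estimate them from that cluster's training records only.
    y_pred = y_first.copy()
    for c in range(n_clusters):
        train_c = (labels == c) & ~is_test
        test_c = (labels == c) & is_test
        if not test_c.any() or not train_c.any():
            continue  # assumption: keep the first-pass estimate
        rf2 = RandomForestRegressor(n_estimators=100, random_state=random_state)
        rf2.fit(X_all[train_c], y_all[train_c])
        y_pred[test_c[len(X_train):]] = rf2.predict(X_all[test_c])
    return y_pred


# Usage on synthetic data (illustrative only, not the dialysis dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)
mask = np.zeros(300, bool)
mask[rng.choice(300, size=30, replace=False)] = True  # 10% "missing"
estimates = rkr_impute(X[~mask], y[~mask], X[mask])
```

The returned array plays the role of DataSetPredicted: one estimate per test record, in the order of `X_test`.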
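2 Experimental Results and Analysis
A total of five experiments were conducted. The algorithms compared were K-neighbors regression (KNeighbors Regressor, KNR), decision tree regression (DecisionTree Regressor, DTR), random forest regression (Random Forest Regressor, RFR), and the double random forest regression proposed in this paper (Random Forest Regressor-Kmeans-Random Forest Regressor, RKR). The four algorithms estimated the missing values with test sets of 1%, 5%, 10%, 15%, and 20% of the data, and the normalized mean square error (NMSE) and normalized root mean square deviation (NRMSD) were used to measure accuracy and stability.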
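The paper does not spell out the formulas for the two metrics. The sketch below uses one common convention as an assumption: NMSE as the mean squared error divided by the variance of the true values, and NRMSD as the root mean square deviation divided by the range of the true values. Other normalizations (for example, by the standard deviation or the mean) are also in use, so the numbers are not necessarily comparable with those in the figures.

```python
import numpy as np


def nmse(y_true, y_pred):
    """Mean squared error normalized by the variance of the true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)


def nrmsd(y_true, y_pred):
    """Root mean square deviation normalized by the range of the true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmsd = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmsd / (y_true.max() - y_true.min())


# Made-up albumin-like values, for illustration only.
y_true = [30, 35, 40, 45]
y_pred = [32, 35, 38, 45]
print(nmse(y_true, y_pred))   # approximately 0.064
print(nrmsd(y_true, y_pred))  # approximately 0.0943
```

Under both conventions a lower value means a closer fit, and a value of 0 means the estimates match the true values exactly.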
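2.1 Experimental Data
The experimental data come from the nephrology department of the General Hospital of Chengdu Military Region and cover January 2013 to November 2015. After preprocessing, laboratory test data for 511 dialysis patients were selected, comprising albumin (ALB), urea nitrogen (Bun), sex (SEX), age (AGE), height (HEIGHT), weight (WEIGHT), body mass index (BMI), diastolic blood pressure (DBP), systolic blood pressure (SBP), calcium (CA), phosphorus (P), potassium (K), parathyroid hormone (PTH), alkaline phosphatase (AP), sodium (NA), and serum creatinine (SCR). These 16 indicators, which warrant close attention in dialysis patients, were used as features. Albumin (ALB) was chosen as the indicator to be estimated (the dependent variable), and the other 15 indicators served as independent variables. The original data were randomly divided into a training set and a test set; the regression models were fitted on the training set and then applied to the test set to obtain the estimates.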
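The split-and-fit protocol for the three baseline regressors might look like the following sketch. It assumes scikit-learn with default hyperparameters and a synthetic stand-in with the same shape as the dataset described above (511 records, 15 predictors, one target); the generated values, the seeds, and the use of plain mean squared error as the score are illustrative assumptions, not details from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: 511 records, 15 predictors, one target column.
rng = np.random.default_rng(0)
X = rng.normal(size=(511, 15))
y = 38 + 2 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.5, size=511)

models = {
    "KNR": KNeighborsRegressor(),
    "DTR": DecisionTreeRegressor(random_state=0),
    "RFR": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for frac in (0.01, 0.05, 0.10, 0.15, 0.20):
    # Random split; the test targets act as the "missing" values.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=frac, random_state=0
    )
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        results[(name, frac)] = float(np.mean((y_te - pred) ** 2))
```

Each entry of `results` holds the test error of one model at one test-set proportion, which is the kind of grid that the comparison figures summarize.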
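2.2 Experimental Results
The comparison of the four algorithms under the two metrics is shown in Figures 1 and 2.
Figure 1 shows that, with NMSE as the metric, the decision tree method (DTR) gives the worst predictions and double random forest (RKR) the best at every test-set proportion. When the test-set proportion is below 10%, K-neighbors regression (KNR), random forest (RFR), and double random forest all perform well; above 10%, K-neighbors regression predicts worse than random forest and double random forest.
Figure 2 shows that, with NRMSD as the metric, the decision tree method (DTR) again gives the worst predictions and double random forest (RKR) the best at every test-set proportion. When the test-set proportion is below 5%, K-neighbors regression (KNR), random forest (RFR), and double random forest all perform well; above 5%, K-neighbors regression predicts worse than random forest and double random forest.
In summary, the comparison with the K-nearest-neighbor, decision tree, and random forest methods shows that the double random forest algorithm fills in missing albumin (ALB) values for dialysis patients fairly accurately and with high stability.
3 Conclusion
To address the problem of missing data in clinical trials, this paper proposes an algorithm that combines random forests with clustering, the double random forest regression method, and applies it to estimating missing albumin values. Compared with the nearest-neighbor, decision tree, and random forest methods, double random forest regression improves accuracy and robustness to varying degrees. The algorithm offers a new approach to handling missing values effectively and can serve as a reference for other research on missing-value estimation.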
References:
[1] PAN S K, LIU D W, LIU Z S. Effects of different dialysis modalities on the prognosis of acute kidney injury[J]. Practical Journal of Clinical Medicine,2017(2):16-19.
[2] MA L, ZHAO S. Risk factors for mortality in patients undergoing hemodialysis: a systematic review and meta-analysis[J]. International Journal of Cardiology,2017.
[3] ERIGUCHI R, OBI Y, STREJA E, et al. Longitudinal associations among renal urea clearance–corrected normalized protein catabolic rate, serum albumin, and mortality in patients on hemodialysis[J]. Clinical Journal of the American Society of Nephrology,2017.
[4] FAN H, YANG J, LIU L, et al. Effect of serum albumin on the prognosis of elderly patients with stage 3-4 chronic kidney disease[J]. International Urology & Nephrology,2017.
[5] LUO S, LAWSON A B, HE B, et al. Bayesian multiple imputation for missing multivariate longitudinal data from a Parkinson's disease clinical trial[J]. Statistical Methods in Medical Research,2012.
[6] WANG X, JIANG Z, FENG H. Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme[J]. BMC Bioinformatics,2006,7(1):1-10.
[7] SHAH A D, BARTLETT J W, CARPENTER J, et al. Comparison of random forest and parametric imputation models for imputing missing data using mice: a caliber study[J]. American Journal of Epidemiology,2014,179(6):764.
[8] BABU G A, SUMANA G, RAJASEKHAR M. Computer-aided diagnosis of polycystic kidney disease using ANN[J]. World Academy of Science, Engineering and Technology, International Journal of Medical, Health, Biomedical, Bioengineering and Pharmaceutical Engineering,2013,7(12):933-937.
[9] ZHANG H, WU P, YIN A, et al. Prediction of soil organic carbon in an intensively managed reclamation zone of eastern China: a comparison of multiple linear regressions and the random forest model[J]. Science of the Total Environment,2017,592:704-713.
[10] TROYANSKAYA O, CANTOR M, SHERLOCK G, et al. Missing value estimation methods for DNA microarrays[J]. Bioinformatics,2001,17(6):520.
[11] OBA S, SATO M A, TAKEMASA I, et al. A Bayesian missing value estimation method for gene expression profile data[J]. Bioinformatics,2003,19(16):2088-2096.
[12] KIM H, GOLUB G H. Missing value estimation for DNA microarray gene expression data: local least squares imputation[J]. Bioinformatics,2005,21(2):187-198.
[13] LI R H, LI Z, TONG L. Application of an ant colony path optimization decision tree in the staging diagnosis of chronic kidney disease[J]. Software Guide,2017,16(2):135-138.
[14] ZHANG S, WU X, ZHU M. Efficient missing data imputation for supervised learning[M]. 2010.
[15] LI H, ZHAO C, SHAO F, et al. A hybrid imputation approach for microarray missing value estimation[J]. BMC Genomics,2015,16(S9):S1.
(Responsible editor: Huang ?)