A Constrained CRM Regression Method for Interval Data
GUO Jun-peng, ZHAO Ru, LI Wen-hua
(College of Management and Economics, Tianjin University, Tianjin 300072, China)
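Abstract: The Center-Range Method (CRM) is a common approach to fitting regression models to interval data; it fits the interval midpoints and radii separately. This paper examines a shortcoming of CRM: when the interval midpoints fluctuate widely while the radii are comparatively small, fitting midpoints and radii separately can leave many predicted intervals with no overlap at all with the corresponding sample intervals. To address this problem, an improved interval regression method is proposed that builds on CRM by adding constraints while minimizing the sum of squared midpoint and radius errors. Evaluation indicators for interval regression are reviewed under three headings (root mean square error, coefficient of determination, and ratio), and indicators from these groups are selected to evaluate the proposed constrained method. The constrained method is evaluated through Monte Carlo simulation, and an empirical analysis is carried out on interval samples constructed from the CSI 300 index and the ChinaAMC SSE 50 ETF for May to June 2013. Both the simulation experiments and the empirical analysis show that the constrained method effectively reduces the number of samples whose predicted interval has no overlap with the sample interval, and that it also has a clear advantage in average accuracy and in the average ratio of the predicted interval contained in the observed interval.
Keywords: regression analysis; interval data; constraint; center and range method (CRM)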
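Studying the relationship between dependent and independent variables is an important task in data analysis, pattern recognition, data mining, machine learning, and many other fields [1]. Regression analysis is a widely used method for determining that relationship. Traditional regression analysis deals mainly with conventional point data. In recent years, many researchers have studied more complex settings involving fuzzy numbers from different perspectives, including fuzzy mathematics [2-12], interval-valued symbolic data analysis [13-18], and computer science [19]. An interval number is also a kind of fuzzy number, and interval regression analysis has already found many applications [20-22].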
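Tanaka et al. [2] were the first to propose a fuzzy linear regression model and a method for solving it. In their model the independent variables are point data, while the dependent variable and the coefficients are triangular fuzzy numbers. The coefficients are estimated by solving a linear program that minimizes the widths of the coefficients subject to the predicted intervals fully covering the sample intervals at a given confidence level. This method has three main drawbacks [3]. First, many estimated coefficients have zero radius. Second, the solution depends on the reference point: if each independent variable is replaced by its deviation from its mean, the estimated coefficients change substantially. Third, requiring the predicted interval at a given confidence level to fully cover the sample interval at that level makes the estimated coefficients wide whenever some observed intervals have large radii or outliers are present. Various improvements addressing these problems were proposed in [4-6]. Sakawa and Yano [7] and Hojati et al. [3] considered the case in which all data are fuzzy numbers. Chen and Hsueh [8] proposed a mathematical programming method based on a distance criterion [23] that effectively reduces the total estimation error, but its computational efficiency drops markedly for large data sets. Chen and Hsueh [9] converted fuzzy numbers into numerical intervals by taking cut sets, and then solved the model by minimizing the sum of squared differences between the left and right endpoints of the predicted and sample intervals. Hladík and Černý [10] proposed a regression method based on tolerance analysis, which first determines the midpoints of the interval coefficients using a conventional estimation method for point data and then determines the radii, thereby extending each coefficient to an interval. The method treats three cases according to the data types of the variables: both the independent and dependent variables are point data; the independent variables are point data and the dependent variable is interval data; both are interval data. It applies to many types of data, but it requires an initial value for the error rate and removes outliers by subjective analysis, which makes the method indeterminate and makes an optimal result hard to identify. Li and Guo [22] started from the theory of error propagation and studied a regression method for interval-valued symbolic data based on it. Boukezzoula et al. [11] proposed a midpoint-radius regression method in which model domains are established separately for the midpoints and radii of the observed data and used as a constraint in the programming model.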
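Symbolic data analysis (SDA) [24] reduces data dimensionality through "data packaging", which enables fast analysis and processing of massive data sets. Interval data are a common type of symbolic data, and regression for interval-valued symbolic data has produced many results. Billard and Diday [13] first proposed a linear fitting method for interval-valued symbolic data, known as the Center Method (CM). It assumes that the upper and lower bounds of the interval share the same coefficients, takes the sum of the upper-bound and lower-bound errors as the error of the interval, and estimates the coefficients by minimizing the sum of squared errors. Billard and Diday [14] regressed the upper and lower bounds separately, giving the MinMax method, which turns one interval regression problem into two point regression problems. In [13] the midpoint and radius of the interval are considered jointly; Lima Neto and de Carvalho [1] proposed the Centre and Range Method (CRM), which estimates the midpoint and the radius separately. CRM cannot guarantee that the right endpoint of a predicted interval is no smaller than its left endpoint. Lima Neto and de Carvalho [15] therefore proposed the Constrained Center and Range Method (CCRM), which constrains the radius coefficients to be non-negative when minimizing the radius error; since radii are non-negative, the predicted radii are also non-negative, which guarantees well-formed predicted intervals. González-Rodríguez et al. [16] proposed a simple linear regression method for interval data under the condition that the Hukuhara difference [25] exists. Blanco-Fernández et al. [17] proposed a new regression model, M, and gave a method for solving it.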
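The midpoint-radius (MR) form and the endpoint (EP) form are two ways of representing interval data [11]. Compared with the EP form, the MR form has several advantages [11]: it separates the uncertainty of the interval (its radius) from its trend (its midpoint), and it is the more natural representation in many practical situations. Model M of Blanco-Fernández et al. [17] can also be viewed as using the MR form; it captures both the uncertainty and the trend of the interval data in a single expression. Its weakness is that, when model M is rewritten in the usual form with separate relationships for the midpoints and the radii, the coefficients and the constant term of the radius relationship must all be positive, which need not hold in practice. CRM and CCRM fit the midpoints and radii separately. However, when the midpoints fluctuate widely while the radii are comparatively small, many sample intervals lie on either side of the fitted regression line and far from it, so the predicted intervals may have no overlap with these distant sample intervals. Take the data in Table 1 as an example; the corresponding sample intervals and predicted intervals are shown in Figure 1. For 3 of the observations, the predicted interval does not overlap the sample interval at all.
Table 1. Example data
Figure 1. Rectangle plot of the example data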
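Based on the above analysis, this paper adopts the model in which the midpoint and radius relationships are expressed separately and, building on CRM, adds constraints to obtain a constrained regression method. CRM is briefly reviewed first.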
(2)
The model is solved by minimizing the sum of squared midpoint and radius errors, as shown in Eq. (3).
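For orientation, the CRM model and its least-squares objective can be sketched as follows. This is a sketch under an assumption: the notation follows the usual presentation of CRM in Lima Neto and de Carvalho [1], with superscripts c and r for midpoint (center) and radius, n samples and p regressors. The symbols are illustrative and may differ from those in Eqs. (1) to (3).

```latex
\begin{align*}
y_i^{c} &= \beta_0^{c} + \sum_{j=1}^{p} \beta_j^{c}\, x_{ij}^{c} + \varepsilon_i^{c},
  \qquad i = 1, \dots, n, \\
y_i^{r} &= \beta_0^{r} + \sum_{j=1}^{p} \beta_j^{r}\, x_{ij}^{r} + \varepsilon_i^{r},
  \qquad i = 1, \dots, n, \\
\min_{\beta^{c},\, \beta^{r}} \quad
  & \sum_{i=1}^{n} \bigl(\varepsilon_i^{c}\bigr)^{2}
  + \sum_{i=1}^{n} \bigl(\varepsilon_i^{r}\bigr)^{2}.
\end{align*}
```

Because the two sums share no coefficients, the objective separates into two ordinary least-squares problems, one for the midpoints and one for the radii.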
Equations (1) and (2) can also be written as Eqs. (4) and (5).
(4)
(5)
(7)
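The CRM procedure reviewed above can be sketched in a few lines of code. This is a minimal illustration, assuming that the CRM estimates are the ordinary least-squares solutions of the midpoint and radius regressions, as in [1]. The function names (`crm_fit`, `crm_predict`, and the helpers) are hypothetical and not part of the paper.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def solve_linear_system(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def ols(x_rows: Sequence[Sequence[float]], y: Sequence[float]) -> Vector:
    """Least squares with an intercept; returns [b0, b1, ..., bp]."""
    design = [[1.0] + list(row) for row in x_rows]
    k = len(design[0])
    # Normal equations (X'X) b = X'y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def linear_value(beta: Sequence[float], row: Sequence[float]) -> float:
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


def crm_fit(x_lo, x_hi, y_lo, y_hi) -> Tuple[Vector, Vector]:
    """Fit midpoints and radii separately; inputs are lower/upper bounds."""
    x_c = [[(l + u) / 2 for l, u in zip(rl, ru)] for rl, ru in zip(x_lo, x_hi)]
    x_r = [[(u - l) / 2 for l, u in zip(rl, ru)] for rl, ru in zip(x_lo, x_hi)]
    y_c = [(l + u) / 2 for l, u in zip(y_lo, y_hi)]
    y_r = [(u - l) / 2 for l, u in zip(y_lo, y_hi)]
    return ols(x_c, y_c), ols(x_r, y_r)


def crm_predict(beta_c, beta_r, x_lo_row, x_hi_row) -> Tuple[float, float]:
    """Predicted interval [center - radius, center + radius] for one sample."""
    c = linear_value(beta_c, [(l + u) / 2 for l, u in zip(x_lo_row, x_hi_row)])
    r = linear_value(beta_r, [(u - l) / 2 for l, u in zip(x_lo_row, x_hi_row)])
    return c - r, c + r
```

Nothing in this sketch ties the predicted interval to the observed one, which is the gap the constrained method is meant to close.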
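As discussed in the introduction, model M cannot properly represent cases in which the constant term of the linear radius relationship is negative. This paper therefore adopts the CRM model, which builds separate linear relationships for the midpoints and radii of the interval data, as in Eqs. (1) and (2). To make the predicted intervals overlap the sample intervals as far as possible, constraints are added while the sum of squared midpoint and radius errors is minimized. The method of Tanaka et al. [2] requires the predicted interval at a given confidence level to fully cover the sample interval at that level, which often makes the predicted intervals too wide. That method also applies only when the regression coefficients and the dependent variable are fuzzy numbers and the independent variables are point data. In the model built here, both the independent and the dependent variables are interval data and the coefficients are point data. To keep the predicted intervals from becoming too wide, the constraints are relaxed: the fit only requires each predicted interval to intersect its sample interval, not to cover it entirely. CCRM also adds constraints to CRM, but it mainly addresses ill-formed predicted intervals, in which the left endpoint exceeds the right endpoint, whereas the constrained method here mainly addresses predicted intervals that do not intersect the sample intervals.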
(8)
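The constraints in Eq. (8) ensure that the predicted interval of each observed sample intersects the corresponding sample interval. They involve the right and left endpoints of the predicted intervals and the left and right endpoints of the corresponding observed intervals: for every sample, the right endpoint of the predicted interval must be greater than or equal to the left endpoint of the observed interval, and the left endpoint of the predicted interval must be less than or equal to the right endpoint of the observed interval. Any pair of intervals satisfying both conditions has at least one point in common.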
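The verbal description of the constraints can be sketched as the following program. This is a reconstruction from the description, not the typeset Eq. (8). It assumes the CRM notation of [1], with predicted midpoint and radius written with hats; the symbols are illustrative.

```latex
\begin{aligned}
\min_{\beta^{c},\, \beta^{r}} \quad
  & \sum_{i=1}^{n} \bigl(y_i^{c} - \hat{y}_i^{c}\bigr)^{2}
  + \sum_{i=1}^{n} \bigl(y_i^{r} - \hat{y}_i^{r}\bigr)^{2} \\
\text{s.t.} \quad
  & \hat{y}_i^{c} + \hat{y}_i^{r} \;\ge\; y_i^{c} - y_i^{r},
    \qquad i = 1, \dots, n, \\
  & \hat{y}_i^{c} - \hat{y}_i^{r} \;\le\; y_i^{c} + y_i^{r},
    \qquad i = 1, \dots, n, \\
\text{where} \quad
  & \hat{y}_i^{c} = \beta_0^{c} + \sum_{j=1}^{p} \beta_j^{c}\, x_{ij}^{c},
    \qquad
    \hat{y}_i^{r} = \beta_0^{r} + \sum_{j=1}^{p} \beta_j^{r}\, x_{ij}^{r}.
\end{aligned}
```

Under this reading the objective is a convex quadratic in the coefficients and both families of constraints are linear in them, which agrees with the statement in the English abstract that the program is convex and can be solved through the K-T conditions.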
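3.1 Evaluation indicators
3.1.1 Root mean square error
The root mean square error (RMSE) is a common indicator for evaluating interval linear regression methods. Lima Neto and de Carvalho [1] computed the RMSE separately for the lower and upper bounds of the interval. Li and Guo [22] proposed an RMSE based on the Hausdorff distance [27], and Chuang [19] likewise proposed an RMSE for interval numbers based on the Hausdorff distance [27]. Bargiela et al. [12] proposed a root-mean-square measure for fuzzy numbers that takes into account their minimum, maximum, and median values.
3.1.2 Coefficient of determination
Lima Neto and de Carvalho [1] computed correlation coefficients separately for the lower and upper bounds of the interval. For models in which the midpoint and radius are expressed separately, Lima Neto and de Carvalho [15] proposed three possible ways of computing a coefficient of determination.
3.1.3 Ratio
Hu and He [18] defined the notion of accuracy ratio and also counted the number of samples whose prediction accuracy is zero. Hojati et al. [3] proposed measuring how well the predicted and observed intervals cover each other by computing the proportion of the observed interval contained in the predicted interval and the proportion of the predicted interval contained in the observed interval.
In addition, this paper records, for each data set, the average number of samples whose prediction accuracy is 0.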
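The ratio-type indicators can be illustrated with a short sketch. The paper's exact formulas are not reproduced here, so two definitions are assumed: the accuracy of a prediction is the width of the intersection of the observed and predicted intervals divided by the width of their union (an accuracy ratio in the style of Hu and He [18]), and the containment ratio is the share of the predicted interval that lies inside the observed interval. All function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Interval = Tuple[float, float]  # (lower endpoint, upper endpoint)


def intersection_width(a: Interval, b: Interval) -> float:
    """Width of the overlap of two intervals, or 0.0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def accuracy_rate(obs: Interval, pred: Interval) -> float:
    """Assumed definition: width(intersection) / width(union)."""
    inter = intersection_width(obs, pred)
    if inter == 0.0:
        return 0.0
    # For overlapping intervals the union is itself an interval.
    union = max(obs[1], pred[1]) - min(obs[0], pred[0])
    return inter / union


def contained_ratio(obs: Interval, pred: Interval) -> float:
    """Assumed definition: share of the predicted interval inside the observed one."""
    pred_width = pred[1] - pred[0]
    if pred_width == 0.0:
        # Degenerate prediction (a single point): either inside or outside.
        return 1.0 if obs[0] <= pred[0] <= obs[1] else 0.0
    return intersection_width(obs, pred) / pred_width


def summarize(observed: Sequence[Interval],
              predicted: Sequence[Interval]) -> Dict[str, float]:
    """Average accuracy, average containment ratio, and zero-accuracy count."""
    acc = [accuracy_rate(o, p) for o, p in zip(observed, predicted)]
    rat = [contained_ratio(o, p) for o, p in zip(observed, predicted)]
    n = len(acc)
    return {
        "mean_accuracy": sum(acc) / n,
        "mean_contained_ratio": sum(rat) / n,
        "zero_accuracy_count": sum(1 for a in acc if a == 0.0),
    }
```

Under these assumed definitions, two intervals that touch at a single endpoint have zero accuracy, so a prediction can satisfy the overlap constraints with equality and still be counted as a zero-accuracy sample.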
3.2 Monte Carlo simulation
3.2.1 Data generation
3.2.2 Simulation experiments
Table 2. Scenarios considered in the Monte Carlo simulation
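In each experiment, a set of coefficients is generated first. Then 500 midpoints of the independent variables are generated and combined with the coefficients and an error term to produce the midpoints of the dependent variable. Finally, the radii of the dependent and independent variables are generated, giving 500 simulated observations. Of these, 400 are selected at random as the training set for estimating the coefficients and the remaining 100 form the test set for prediction; the predicted intervals are compared with the sample intervals and the three evaluation indicators are computed. To guard against chance effects in a single run, the process of generating 500 simulated observations and computing the indicators is repeated 100 times, giving 100 sets of indicator values. t-tests on these values are then used to assess the relative merits of the methods on each indicator. To guard against chance effects from the coefficients, a new set of coefficients is then generated and the whole procedure is repeated, for 50 coefficient sets in total.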
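One replication of this procedure can be sketched as follows. The distributions and parameter ranges of the paper's scenarios are not reproduced here, so every range in this sketch (`coef_range`, `center_range`, `error_range`, `radius_range`) is a hypothetical placeholder, as are the uniform distributions and the function names.

```python
import random
from typing import List, Tuple

# One observation: (x midpoints, x radii, y midpoint, y radius)
Row = Tuple[List[float], List[float], float, float]


def simulate(n: int = 500, p: int = 1,
             coef_range: Tuple[float, float] = (-10.0, 10.0),   # hypothetical
             center_range: Tuple[float, float] = (20.0, 40.0),  # hypothetical
             error_range: Tuple[float, float] = (-5.0, 5.0),    # hypothetical
             radius_range: Tuple[float, float] = (1.0, 5.0),    # hypothetical
             seed: int = 0) -> Tuple[List[float], List[Row]]:
    """Generate one coefficient set and n simulated interval observations."""
    rng = random.Random(seed)
    beta = [rng.uniform(*coef_range) for _ in range(p + 1)]
    rows: List[Row] = []
    for _ in range(n):
        x_c = [rng.uniform(*center_range) for _ in range(p)]
        # Dependent midpoint = linear combination of midpoints plus an error.
        y_c = (beta[0] + sum(b * x for b, x in zip(beta[1:], x_c))
               + rng.uniform(*error_range))
        x_r = [rng.uniform(*radius_range) for _ in range(p)]
        y_r = rng.uniform(*radius_range)
        rows.append((x_c, x_r, y_c, y_r))
    return beta, rows


def split(rows: List[Row], n_train: int = 400,
          seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Random train/test split (400/100 for n = 500)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

In the full experiment this replication would be run 100 times per coefficient set and for 50 coefficient sets, with the evaluation indicators computed on each test set.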
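3.2.3 Experimental results
The methods are compared using t-tests at the 1% significance level. When the null hypothesis is that method A is not better than method B and the alternative is that A is better than B, the direction of the test depends on the indicator. For the root mean square error, larger values indicate a worse method, so the null hypothesis is that the mean value for A is no smaller than that for B, and the alternative is that it is smaller. For the average accuracy and the average ratio, larger values indicate a better method, so the null hypothesis is that the mean value for A is no larger than that for B, and the alternative is that it is larger. When the two methods are tested for equivalence, the null hypothesis is that the indicator values of A and B are equal and the alternative is that they differ. Tables 3 to 5 report the t-test results for the hypotheses that the constrained method is better than, worse than, and equivalent to CRM, respectively.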
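A one-sided comparison of this kind can be sketched as follows. The paper does not state which form of t-test is used, so a paired t-test on the 100 replications is assumed here. Reading the rejection rate as the share of coefficient sets in which the null hypothesis is rejected is also an assumption. The critical value 2.365 approximates the one-sided 1% point of the t distribution with 99 degrees of freedom, and all names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for the mean of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def a_better_larger_is_better(a: Sequence[float], b: Sequence[float],
                              critical: float = 2.365) -> bool:
    """Reject 'A is not better than B' when larger indicator values are better."""
    return paired_t_statistic(a, b) > critical


def a_better_smaller_is_better(a: Sequence[float], b: Sequence[float],
                               critical: float = 2.365) -> bool:
    """Reject 'A is not better than B' when smaller values (e.g. RMSE) are better."""
    return paired_t_statistic(a, b) < -critical


def rejection_rate(decisions: Sequence[bool]) -> float:
    """Share of experiments in which the null hypothesis was rejected."""
    return sum(1 for d in decisions if d) / len(decisions)
```

The default critical value only applies to samples of 100 paired values; for other sample sizes it would need to be replaced.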
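In scenario C2, Table 3 shows that the rejection rates for all three indicators exceed 50%. For one value of p, the rejection rates for the average accuracy and the average ratio both reach 100%; for the other, one of them reaches 96% and the other 100%. Table 4 shows rejection rates of 0 for all three indicators, which indicates that the constrained method is better than CRM in every respect. Table 5 shows that the two methods differ in root mean square error in 52% of cases for one value of p and in 60% for the other, whereas for the average accuracy and the average ratio the rejection rates are mostly 100%. Taken together, the three tables show that the constrained method is better than CRM in average accuracy and average ratio; in root mean square error, the constrained method has an advantage in more than 50% of cases and the two methods are essentially the same otherwise. Overall, the constrained method is better than CRM.
In scenario C3, Table 5 shows rejection rates of 100% for all three indicators, so the two methods differ on every indicator. Tables 3 and 4 show that the constrained method performs better in average accuracy and average ratio, whereas CRM performs better in root mean square error; each method has its strengths.
Tables 6 and 7 report, for scenarios C2 and C3 respectively, the number of test samples with zero prediction accuracy in 10 data sets generated at random with fixed parameters. Comparing Tables 6 and 7 shows that in scenario C3, that is, when the range of the errors is large relative to the range of the midpoints, the constrained method predicts many more intervals that intersect the sample intervals, whereas with CRM roughly 20% or more of the predicted intervals have no intersection with the sample intervals and hence an accuracy of 0.
In summary, when the fluctuation of the interval midpoints is small relative to the interval width, the constrained method and CRM give essentially the same results. When it is moderate, the constrained method is better than CRM in every respect and has a slight advantage in the number of sample intervals with zero prediction accuracy. When it is large, the constrained method falls short of CRM in interval root mean square error, but it has a clear advantage in average accuracy and in the average ratio of the predicted interval contained in the sample interval, and an even more pronounced advantage in the number of sample intervals with zero prediction accuracy.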
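Table 3. t-test results for the hypothesis that the constrained method is better than CRM
Table 4. t-test results for the hypothesis that CRM is better than the constrained method
Table 5. t-test results for the hypothesis that CRM and the constrained method are equivalent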
Table 6. Regression predictions on 10 simulated data sets under scenario C2, CRM versus the constrained method (number of test samples with zero prediction accuracy)

No.   p = 1                  p = 3
      Constrained   CRM      Constrained   CRM
1     0             0        0             0
2     0             0        0             3
3     0             0        2             2
4     1             1        0             1
5     0             0        1             0
6     0             0        0             0
7     0             0        0             1
8     0             0        0             0
9     0             0        0             0
10    0             0        1             1
Table 7. Regression predictions on 10 simulated data sets under scenario C3, CRM versus the constrained method (number of test samples with zero prediction accuracy)

No.   p = 1                  p = 3
      Constrained   CRM      Constrained   CRM
1     0             27       1             19
2     1             24       0             32
3     2             21       5             31
4     0             18       0             27
5     1             23       3             23
6     2             34       0             21
7     0             25       2             31
8     1             31       2             21
9     0             24       4             24
10    1             25       5             29
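In stock index futures arbitrage, the spot position has to be constructed indirectly [28]. ETFs can be used to build the spot portfolio. The study in [28] shows that the returns of the CSI 300 index are highly correlated with those of several funds, including the ChinaAMC SSE 50 ETF and the Hua An SSE 180 ETF, which suggests that the CSI 300 index itself is also highly correlated with the ChinaAMC SSE 50 ETF. This paper takes the daily highs and lows of the CSI 300 index and the ChinaAMC SSE 50 ETF over the two months from May to June 2013 as sample data, constructs interval data from them, and studies the linear relationship between the two. Figure 3 is a rectangle plot of the sample data; the linear relationship between the two is fairly clear.
Figure 3. Rectangle plot of the CSI 300 index and ChinaAMC SSE 50 ETF interval samples
With the CSI 300 index intervals as the dependent variable and the ChinaAMC SSE 50 ETF intervals as the independent variable, interval regressions were fitted with CRM and with the constrained method. The resulting regression equations are as follows:
Figures 4 and 5 show the predictions of CRM and of the constrained method, respectively. To keep the plots readable, only 15 observations are drawn. Visually, the predictions of the two methods differ little, and both fit the sample data well.
Figure 4. Predictions of CRM
Figure 5. Predictions of the constrained method
Table 8. Evaluation indicators for the regression predictions of CRM and the constrained method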
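This paper improves CRM by adding constraints, in order to address the problem that many predicted intervals have no overlap with the sample intervals when the interval midpoints fluctuate widely and the interval radii are comparatively small. Both the Monte Carlo simulation experiments and the empirical analysis based on stock market data demonstrate the advantages of the constrained method.
This work suggests the following directions for further research:
(1) Outliers among the interval samples can seriously degrade prediction, so how to exclude outliers is a question worth studying.
(2) How to keep the predicted intervals overlapping the sample intervals as far as possible while also controlling the root mean square error, so as to strike a balance between the two.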
[1] Lima Neto EA, de Carvalho FAT. Centre and Range method for fitting a linear regression model to symbolic interval data[J]. Computational Statistics & Data Analysis, 2008, 52(3): 1500-1515.
[2] Tanaka H, Uejima S, Asai K. Linear regression analysis with fuzzy model[J]. IEEE Trans. Systems Man Cybern, 1982, 12: 903-907.
[3] Hojati M, Bector CR, Smimou K. A simple method for computation of fuzzy linear regression[J]. European Journal of Operational Research, 2005, 166(1): 172-184.
[4] Savic DA, Pedrycz W. Evaluation of fuzzy linear regression models[J]. Fuzzy Sets and Systems, 1991, 39(1): 51-63.
[5] Tanaka H, Ishibuchi H. Identification of possibilistic linear systems by quadratic membership functions of fuzzy parameters[J]. Fuzzy sets and Systems, 1991, 41(2): 145-160.
[6] Tanaka H, Hayashi I, Watada J. Possibilistic linear regression analysis for fuzzy data[J]. European Journal of Operational Research, 1989, 40(3): 389-396.
[7] Sakawa M, Yano H. Multiobjective fuzzy linear regression analysis for fuzzy input-output data[J]. Fuzzy Sets and Systems, 1992, 47(2): 173-181.
[8] Chen LH, Hsueh CC. A mathematical programming method for formulating a fuzzy regression model based on distance criterion[J]. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 2007, 37(3): 705-712.
[9] Chen LH, Hsueh CC. Fuzzy regression models using the least-squares method based on the concept of distance[J]. Fuzzy Systems, IEEE Transactions on, 2009, 17(6): 1259-1272.
[10] Hladík M, Černý M. Interval regression by tolerance analysis approach[J]. Fuzzy Sets and Systems, 2012, 193: 85-107.
[11] Boukezzoula R, Galichet S, Bisserier A. A Midpoint-Radius approach to regression with interval data[J]. International Journal of Approximate Reasoning, 2011, 52(9): 1257-1271.
[12] Bargiela A, Pedrycz W, Nakashima T. Multiple regression with fuzzy data[J]. Fuzzy Sets and Systems, 2007, 158(19): 2169-2188.
[13] Billard L, Diday E. Regression analysis for interval-valued data[M]. Springer Berlin Heidelberg, 2000: 369-374.
[14] Billard L, Diday E. Symbolic regression analysis[M]. Springer Berlin Heidelberg, 2002: 281-288.
[15] Lima Neto EA, de Carvalho FAT. Constrained linear regression models for symbolic interval-valued variables[J]. Computational Statistics & Data Analysis, 2010, 54(2): 333-347.
[16] González-Rodríguez G, Blanco Á, Corral N, et al. Least squares estimation of linear regression models for convex compact random sets[J]. Advances in Data Analysis and Classification, 2007, 1(1): 67-81.
[17] Blanco-Fernández A, Corral N, González-Rodríguez G. Estimation of a flexible simple linear model for interval data based on set arithmetic[J]. Computational Statistics & Data Analysis, 2011, 55(9): 2568-2578.
[18] Hu C, He LT. An application of interval methods to stock market forecasting[J]. Reliable Computing, 2007, 13(5): 423-434.
[19] Chuang CC. Extended support vector interval regression networks for interval input-output data[J]. Information Sciences, 2008, 178(3): 871-891.
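[20] Hu F, Shi YP, Wang QW. Are the remittances of Chinese migrant workers altruistic? An analysis based on an interval regression model[J]. Journal of Financial Research, 2008(1): 175-190. (in Chinese)
[21] Hu F, Wang QW. Determinants of the remittances of Chinese migrant workers: an application of an interval regression model[J]. Statistical Research, 2007, 24(10): 20-25. (in Chinese)
[22] Li WH, Guo JP. Regression analysis of interval-valued symbolic data and its application[J]. Journal of Management Sciences in China, 2010, 13(4): 38-43. (in Chinese)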
[23] Chen LH, Lu HW. An approximate approach for ranking fuzzy numbers based on left and right dominance[J]. Computers & Mathematics with Applications, 2001, 41(12): 1589-1602.
[24] Bock HH, Diday E. Analysis of symbolic data: exploratory methods for extracting statistical information from complex data[M]. Springer, 2000.
[25] Hukuhara M. Intégration des applications mesurables dont la valeur est un compact convexe[J]. Funkcialaj Ekvacioj, 1967, 10: 205-223.
[26] Wu YH, Du G. Fundamentals of Management Science[M]. Tianjin: Tianjin University Press, 2009. (in Chinese)
[27] De Carvalho FDAT, De Souza RMCR, Chavent M, et al. Adaptive Hausdorff distances and dynamic clustering of symbolic interval data[J]. Pattern Recognition Letters, 2006, 27(3): 167-179.
[28] Fang B. An empirical study of arbitrage in CSI 300 stock index futures[J]. Journal of Xidian University (Social Science Edition), 2010, 20(3): 78-85. (in Chinese)
A Constrained CRM Regression Method for Interval Data
GUO Jun-peng, ZHAO Ru, LI Wen-hua
(College of Management and Economics, Tianjin University, Tianjin 300072, China)
Regression analysis is a statistical process for determining the relationships among variables. Traditional regression analysis takes point data as its object. However, many quantities cannot be observed as single values even though their ranges of variation are available. For example, the stock index on a given day is not a single fixed value, because it changes over time. A variety of methods for estimating the coefficients of interval regression models have been studied in fuzzy theory, symbolic data analysis (SDA), and computer science. This paper also studies an estimation method for interval data.
Symbolic data analysis is a theory of extracting systematic knowledge from huge data sets. In the framework of SDA, many regression methods have been developed. The Centre method (CM) assumes that the lower and upper bounds of the interval have the same coefficients, which are obtained by minimizing the sum of the squared lower-bound and upper-bound errors. The MinMax method assumes that the coefficients of the lower and upper bounds are different and estimates them separately by least squares. The Center and Range method (CRM) represents the intervals by their mid-points and ranges. Two linear regression relationships are constructed, one for the center series and one for the radius series, and their coefficients are calculated by least squares. CRM performs the best among these methods. One of its shortcomings is that a predicted interval may be meaningless, because the predicted radius is sometimes less than zero. The Constrained Center and Range method (CCRM) was proposed to solve this problem. In CCRM, all the coefficients of the radius relationship are non-negative, which ensures that the forecast radius is non-negative.
Another disadvantage of CRM is that it only fits the mid-points and the ranges of the intervals and does nothing to guarantee that each predicted interval overlaps the sample interval. When the errors of the center series vary over a large range, many predicted intervals may have no overlap with the samples. The objective of this paper is to solve this problem. A new constrained center and range method is developed by adding constraints to CRM. The constraints ensure that the predicted intervals overlap the sample intervals. The constrained method can be expressed as a nonlinear program, which is proved to be convex and can therefore be solved through the K-T conditions. To evaluate the method, this paper reviews the main evaluation indicators in current use and groups them into three kinds: the root mean square error, the coefficient of determination, and the ratio. The indicators used in the paper are the root mean square error of the interval, the average accuracy rate, the average percentage of the predicted intervals contained in the observed intervals, and the average number of forecasts with 0% accuracy.
Both Monte Carlo simulations and empirical analysis are used to evaluate the method. In the Monte Carlo simulation experiments, both simple regression (p = 1) and multivariate regression (p = 3) are considered. In each case, three scenarios are considered, which differ mainly in the range of the errors. The results show that the larger the range of the errors, the better the new method performs in the average accuracy rate, the average percentage of the predicted intervals contained in the observed intervals, and the number of forecasts with 0% accuracy. The results of simple regression differ from those of multivariate regression. The linear relationship between the CSI 300 and the 50ETF is also studied. The results indicate that the new method outperforms CRM in all the measures except the root mean square error. In short, the Monte Carlo simulations and the empirical analysis show that the constrained method performs better than or the same as CRM in these three measures. Sometimes the constrained method has better results than CRM in all the measures.
When there are outliers in the samples, the new constrained method may not perform well. How to identify and remove outliers is therefore an important research topic. How to balance guaranteeing the overlaps against obtaining a lower root mean square error is another topic worth studying.
Keywords: regression analysis; interval; constraint; center and range method (CRM)
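Chinese editor: Du ??; English editor: Charlie C. Chen
CLC number: O212.4
Document code: A
Article ID: 1004-6062(2016)04-0196-07
DOI: 10.13587/j.cnki.jieem.2016.04.025
Received: 2013-07-12
Revised: 2014-05-25
Funding: National Natural Science Foundation of China (71271147, 71003072); Independent Innovation Foundation of Tianjin University (2014XS-0024)
About the author: GUO Jun-peng (1973-), male, from Weifang, Shandong; professor and doctoral supervisor, College of Management and Economics, Tianjin University; research interests: management science, symbolic data analysis.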