A Review of Human Action Recognition Based on Deep Learning
ZHU Yu1  ZHAO Jiang-Kun1  WANG Yi-Ning1  ZHENG Bing-Bing1
Human action recognition and deep learning theory are active research topics in intelligent video analysis, and in recent years they have received wide attention from both academia and industry; they form the theoretical basis of intelligent video analysis and understanding, video surveillance, human-computer interaction, and many other fields. Deep learning algorithms have already been applied successfully to speech recognition, image recognition, and other areas. Having achieved remarkable results in feature extraction from still images, deep learning theory has gradually been extended to action recognition in videos, which involve temporal sequences. After reviewing traditional action recognition methods such as those based on spatio-temporal interest points, this paper introduces and analyzes recent progress in human action recognition based on different deep learning frameworks, including the convolutional neural network (CNN), independent subspace analysis (ISA), restricted Boltzmann machine (RBM), and recurrent neural network (RNN), together with the corresponding recognition models; the performance, progress, advantages, and disadvantages of each class of methods are analyzed and summarized.
Action recognition, deep learning, convolutional neural network, restricted Boltzmann machine
Citation: Zhu Yu, Zhao Jiang-Kun, Wang Yi-Ning, Zheng Bing-Bing. A review of human action recognition based on deep learning. Acta Automatica Sinica, 2016, 42(6): 848-857
Vision-based human action recognition assigns an action-category label to a video that contains human motion. With the continuing development of video sensors and information technology, this research has become a topic with broad application prospects in video surveillance, human-computer interfaces, and content-based video retrieval. Automated surveillance has a large impact on production and daily life and can be applied to monitoring in shopping malls, public squares, and industrial production. As a key technology of human-computer interaction, it can also be deployed in the home as part of a smart-home system, for example to watch for dangerous behavior of children or the elderly. Traditional video retrieval relies on manual annotation, which involves considerable subjectivity; applying human action recognition in this field would greatly improve both the efficiency of building indexes and the quality of search results.
Human action recognition consists of two main stages: feature representation, and action recognition and understanding. Figure 1 shows the overall framework. Feature representation extracts, from the video data, features that capture the key information of the sequence; this stage plays a crucial role in the whole pipeline, because the quality of the features directly affects the final recognition performance. In the recognition and understanding stage, the feature vectors obtained in the previous stage are used to train a machine learning model, and feature vectors computed during testing or in the application scenario are fed into the trained model to identify the action category.
Fig. 1 The flowchart of action recognition
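As a minimal illustration of this two-stage pipeline, the sketch below pairs a placeholder feature extractor with an SVM classifier; the extract_features function and its frame-difference statistics are purely hypothetical stand-ins for the descriptors discussed in the rest of this paper.

```python
# Minimal sketch of the two-stage pipeline (feature representation, then
# recognition); the feature extractor here is a hypothetical placeholder.
import numpy as np
from sklearn.svm import SVC

def extract_features(video: np.ndarray) -> np.ndarray:
    """Placeholder: mean and std of frame differences for a (T, H, W) video;
    a real system would use HOG3D, CNN features, etc."""
    diffs = np.abs(np.diff(video.astype(np.float32), axis=0))
    return np.array([diffs.mean(), diffs.std()])

def train_action_classifier(videos, labels):
    X = np.stack([extract_features(v) for v in videos])  # feature representation
    clf = SVC(kernel="rbf")                              # recognition stage
    clf.fit(X, labels)
    return clf
```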
Early feature extraction methods for human action recognition were based on human geometric features [1] or on motion information [2]. With the introduction of multi-scale descriptors built on prior knowledge, such as HOG (histogram of oriented gradients) [3] and SIFT (scale-invariant feature transform) [4], spatio-temporal interest point methods that exploit the temporal information of video sequences, such as HOG3D (histogram of gradients 3D), have developed considerably [5-11].
After feature extraction, the above methods typically use standard pattern recognition algorithms such as the support vector machine (SVM) for classification. In recent years, the emergence of deep learning theory [12-14] has laid the foundation for unsupervised, automatic feature learning, and deep learning frameworks applied to action recognition have advanced rapidly as well. After introducing the traditional algorithms, this paper focuses on the research progress of deep learning algorithms in action recognition.
Human action recognition faces several major challenges:
1) Intra-class and inter-class variation
Many actions exhibit large intra-class variability; for example, the walking of different people, or of the same person at different times, may differ in speed or stride length. At the same time, different actions can be highly similar, such as jogging and running in the KTH database.
2) Scene and video acquisition conditions
Complex or dynamically changing backgrounds, as well as changes in illumination or weather during an action, strongly affect both the choice of feature extraction algorithm and its results. Other acquisition factors, such as camera shake, also have an influence.
A number of public human action databases are available for researchers to download; using public databases makes it convenient to verify the feasibility of an algorithm and to compare the performance of different algorithms.
1) Weizmann action database [15]
This database was recorded at the Weizmann Institute of Science in Israel and contains 10 actions (walking, running, jumping forward, jumping sideways, bending, waving one hand, waving both hands, jumping in place, jumping jacks, and hopping on one leg), each performed by 10 subjects. The background and viewpoint are fixed, and the foreground silhouettes are provided with the database. Figure 2 shows example actions from the Weizmann database.
2) KTH database [5]
This database contains six actions (walking, jogging, running, boxing, hand waving, and hand clapping) performed by 25 different subjects in four scenarios, for a total of 599 videos. Apart from camera zooming and slight camera motion, the background is relatively static. Figure 3 shows example actions from the KTH database.
Fig. 2 Examples from the Weizmann database
Fig. 3 Examples from the KTH database
3) UCF Sports database [16-17]
This database contains 150 video sequences collected from broadcast sports channels such as the BBC and ESPN, covering a wide range of scene types and viewpoints. It consists of 10 action classes: diving, golf swinging, kicking, lifting, horseback riding, running, skateboarding, swinging (bench), swinging (side), and walking. Bounding boxes of the human body are provided with the database. The videos exhibit variation in human appearance, viewpoint, illumination, and background, as well as camera motion. Figure 4 shows example actions from the UCF Sports database.
4) Hollywood database [18]
This database was collected from 32 Hollywood movies and contains eight action classes: answering the phone, getting out of a car, handshaking, hugging, kissing, sitting down, sitting up, and standing up, with 1707 videos in total. Figure 5 shows example actions from the Hollywood database. Hollywood 2 extends the action categories of the Hollywood database to 12 classes.
The remainder of this paper is organized as follows. This section has introduced the research background and the commonly used datasets. Section 1 reviews traditional methods based on hand-crafted feature extraction. Section 2 presents the theoretical foundations of several deep learning algorithms and their progress in human action recognition. Finally, the paper is concluded with an analysis of the advantages and disadvantages of deep learning based algorithms.
1 Traditional methods based on hand-crafted feature extraction
Traditional feature extraction methods are designed manually, based on human observation, to capture the characteristics of actions. They fall into two main groups: methods based on human geometric features or motion information, and methods based on spatio-temporal interest points.
1.1 Feature extraction methods based on human geometric features or motion information
Recognizing actions from the geometric shape of the human body is the most direct approach. Fujiyoshi et al. [1] represented the human pose in each frame by a star skeleton whose five vertices correspond to the limbs and the head, and used the vectors formed by these five points and the centroid as the action feature vector. Yang et al. [19] collected the 3D coordinates of skeletal joints from human depth images and used the body contour formed by these joints as the feature for action recognition. Methods based on human geometric shape are limited by the difficulty of modeling the body's geometry: the moving human body is deformable and cannot be described by a simple mathematical model throughout the motion. On this basis, representation methods based on motion information were proposed.
Representation methods based on motion information mainly consider how each frame changes along the temporal dimension, and optical-flow-based methods are typical examples. Chaudhry et al. [2] half-wave rectified the two components of the optical flow field into motion vectors along four directions (up, down, left, and right), normalized them, and formed the final motion descriptor, as shown in Fig. 6. The work of Bobick et al. followed the same line of thought but extracted different features for recognition: they used motion energy images (MEI) [20] and motion history images (MHI) [21] to describe human motion in image sequences [22]. Representation methods based on human geometric shape or motion information all operate in a region of interest centered on the human body. In the Weizmann database the region of interest is provided; although it is not provided in the KTH database, the background changes little and the region of interest is easy to obtain with motion detection. These methods therefore perform well in ordinary scenes, but in complex scenes, where the exact position of the human body cannot be obtained, their performance drops sharply. Table 1 summarizes the results of these two kinds of methods on the various databases.
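The sketch below illustrates the motion history image idea in the spirit of Bobick and Davis [20-22]: pixels that moved recently are bright and older motion decays linearly. The frame-difference threshold and the duration tau are illustrative assumptions, not values taken from the cited work.

```python
# A minimal motion history image (MHI) sketch using simple frame differencing.
import numpy as np

def motion_history_image(frames: np.ndarray, tau: int = 30, thresh: float = 25.0) -> np.ndarray:
    """frames: (T, H, W) gray-scale video; returns the MHI after the last frame."""
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    prev = frames[0].astype(np.float32)
    for frame in frames[1:]:
        cur = frame.astype(np.float32)
        moving = np.abs(cur - prev) > thresh                  # where motion occurred
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))  # refresh or decay
        prev = cur
    return mhi / tau                                          # normalize to [0, 1]
```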
Fig. 4 Examples from the UCF Sports database
Fig. 5 Examples from the Hollywood database
Fig. 6 Motion information representation based on optical flow
Table 1 Results of recognition methods based on geometric shape or motion information (%)
1.2 Feature extraction methods based on spatio-temporal interest points
Methods based on spatio-temporal interest points achieve good results when the background is relatively complex. Schuldt et al. [5] extended the Harris spatial interest point detector to 3D spatio-temporal interest points: Gaussian smoothing and local corner extraction are carried out in the 3D spatio-temporal volume to detect interest points, and pixel histograms are computed around each interest point to form the feature vector describing the action. Dollar et al. pointed out that this approach detects too few stable interest points and proposed filtering with Gabor filters along the temporal and spatial dimensions [6], so that the number of detected interest points changes with the size of the local neighborhood block. Rapantzikos et al. proposed applying the discrete wavelet transform along each of the three dimensions [7] and selecting salient spatio-temporal points from the low-pass and high-pass filter responses in each dimension; to embed color and motion cues, they further incorporated color and motion information when computing the salient points. The local spatio-temporal block can be described by a grid that contains the observed local neighborhood pixels and is treated as one feature block, thereby reducing the influence of local spatio-temporal variations. Knopp et al. [8] extended the 2D SURF (speeded up robust features) descriptor to 3D, where each cell of the 3D SURF feature contains all the Haar-wavelet responses; Kläser et al. [9] extended the HOG descriptor of local gradient orientations to 3D to form HOG3D, in which each block consists of regular polyhedra and dense sampling of spatio-temporal blocks can be carried out quickly at multiple scales (see Fig. 7 for the pipeline). Wang et al. [10] compared various local feature descriptors (HOG3D, HOG/HOF [11], extended SURF) and found that descriptors integrating gradient and optical flow information perform well; among these descriptors HOG3D gives the best results. Table 2 lists the results of the various methods on the KTH, UCF Sports, and Hollywood databases.
Fig. 7 Computation of the HOG3D descriptor
Table 2 Results of feature extraction methods based on spatio-temporal interest points on the KTH, UCF Sports, and Hollywood databases (%)
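As a rough sketch of the periodic detector of Dollar et al. [6], the code below applies spatial Gaussian smoothing and a quadrature pair of 1D temporal Gabor filters, then keeps local maxima of the response map; the parameter values and the thresholding rule are assumptions for illustration only.

```python
# Spatio-temporal interest point detection via spatial Gaussian smoothing
# plus temporal Gabor filtering (response R = even^2 + odd^2).
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d, maximum_filter

def cuboid_detector(video: np.ndarray, sigma: float = 2.0, tau: float = 3.0):
    """video: (T, H, W) gray-scale array; returns (t, y, x) interest points."""
    v = gaussian_filter(video.astype(np.float32), sigma=(0, sigma, sigma))
    t = np.arange(-3 * tau, 3 * tau + 1)
    omega = 4.0 / tau
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)   # even Gabor
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)   # odd Gabor
    r = convolve1d(v, h_ev, axis=0) ** 2 + convolve1d(v, h_od, axis=0) ** 2
    peaks = (r == maximum_filter(r, size=5)) & (r > r.mean() + 3 * r.std())
    return np.argwhere(peaks)
```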
2 Action recognition methods based on deep learning
Deep networks [12-14] can learn features from data without supervision, in a way that is consistent with how humans perceive the world; when enough training samples are available, the features learned by deep networks often carry semantic meaning and are better suited to object and action recognition. Deep learning algorithms can be divided into four families: supervised convolutional neural networks, deep networks based on autoencoders, deep belief networks (DBN) [23-24] based on restricted Boltzmann machines (RBM), and deep networks based on recurrent neural networks (RNN).
Fig. 8 The structure of the 3D CNN
2.1 Action recognition based on 3D convolutional neural networks
A convolutional neural network (CNN) [25-28] is an artificial neural network based on deep learning theory. It mainly uses weight sharing to alleviate the parameter explosion of ordinary neural networks. In the forward pass, the input data are convolved with kernels and the results are passed through a nonlinear function to form the output of the layer; such a layer is called a convolutional layer. Between convolutional layers there are subsampling layers, which provide invariance of local features and reduce the spatial resolution of the feature maps. The convolutional and subsampling layers are generally followed by a fully connected network used for the final recognition.
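A minimal sketch of this layer pattern (shared-weight convolution, nonlinearity, subsampling, and a fully connected classifier) is given below in PyTorch; the channel counts, kernel sizes, and the assumed 64 x 64 input are illustrative and not taken from any of the cited networks.

```python
# Convolution -> subsampling -> fully connected, the pattern described above.
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(),   # convolutional layer
            nn.MaxPool2d(2),                              # subsampling layer
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 13 * 13, num_classes)  # assumes 64x64 input

    def forward(self, x):                 # x: (N, 1, 64, 64)
        x = self.features(x)
        return self.classifier(x.flatten(1))
```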
Ji et al. [29] extended the traditional CNN to a 3D CNN that incorporates temporal information and computes features over both the spatial and temporal dimensions of the video. During convolution, each feature map is connected to data from several consecutive frames, so that motion information is captured. The first layer of this network consists of hardwired kernels producing gray-scale values, gradients along the x and y directions, and optical flow along the x and y directions; the network further contains three convolutional layers, two subsampling layers, and one fully connected layer, and its structure is shown in Fig. 8. Chéron et al. [30] used single-frame appearance and optical-flow data around human pose joints as the inputs of deep convolutional networks for feature extraction, aggregated the responses statistically into a single feature vector for the whole video, and trained an SVM for the final classification. Varol et al. [31] applied 3D CNNs over video blocks of fixed temporal length. Karpathy et al. [32] used a multiresolution convolutional neural network to extract video features: the input video is split into two independent streams, a low-resolution stream and a stream at the original resolution; both streams alternate convolutional, normalization, and pooling layers, and they are finally merged into two fully connected layers for subsequent recognition, as shown in Fig. 9. Simonyan et al. [33] likewise used a two-stream convolutional network for video action recognition, dividing the video into a static-frame stream and an inter-frame motion stream: the static stream uses single-frame data, the motion stream uses optical flow, a deep CNN extracts features from each stream, and the resulting features are finally classified with an SVM. Table 3 lists the results of these methods on the KTH and UCF101 databases; UCF101 is an action recognition database collected from real-life YouTube videos and contains 101 classes.
Fig. 9 The structure of the multiresolution convolutional neural network
Table 3 Results of action recognition algorithms based on CNN (%)
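The following sketch shows how the 2D pattern above generalizes to a 3D CNN in the spirit of Ji et al. [29], with convolution and pooling over (time, height, width) of a short frame stack; the layer sizes and the 16-frame clip length are assumptions rather than the architecture of the cited paper.

```python
# A small 3D CNN: kernels slide over time as well as space.
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),          # pool only spatially at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                  # then pool the temporal axis too
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                  # clip: (N, 3, 16, 112, 112)
        return self.classifier(self.features(clip).flatten(1))

scores = Tiny3DCNN()(torch.randn(2, 3, 16, 112, 112))   # -> (2, 101)
```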
2.2 Unsupervised action recognition based on autoencoders
An autoencoder [26, 34-35] is an unsupervised learning algorithm that uses the back-propagation algorithm with the target values set equal to the input values, as shown in Fig. 10.
The autoencoder tries to learn a function hW,b(x) such that hW,b(x) ≈ x, that is, an approximation of the identity mapping so that the output of the model is almost equal to its input. Le et al. [36] extended independent subspace analysis (ISA) [37] to 3D video data and used this unsupervised learning algorithm to model video blocks. The method first applies the ISA algorithm to small input patches, then convolves the learned network with larger input regions, and combines the responses obtained from the convolution as the input of the next layer, as shown in Fig. 11. The resulting descriptor was applied to video data and evaluated on three well-known action recognition databases; Table 4 reports the results on the KTH, UCF Sports, and Hollywood 2 databases. It can be seen that the ISA algorithm achieves markedly better performance on the Hollywood 2 dataset, which has complex environments, nearly 10% higher than the spatio-temporal interest point algorithms.
Fig. 10 The structure of the autoencoder
Table 4 Results of ISA on the three databases (%)
Fig. 11 The structure of ISA-3D
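A minimal autoencoder sketch, showing how hW,b(x) ≈ x is enforced by a reconstruction loss, is given below; the layer sizes and the use of flattened patches are assumptions, and the code does not reproduce the ISA model of Le et al. [36].

```python
# Encoder-decoder trained so that the output reconstructs the input.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, dim_in: int = 1024, dim_code: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_code), nn.Sigmoid())
        self.decoder = nn.Linear(dim_code, dim_in)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 1024)                       # e.g. flattened video patches
loss = nn.functional.mse_loss(model(x), x)     # target equals the input
loss.backward()
opt.step()
```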
2.3 Restricted Boltzmann machines and their extensions
A restricted Boltzmann machine (RBM) [38-40] is a probabilistic generative model over input (visible) units v and output (hidden) units h. The visible and hidden units are connected through a weight matrix W and two bias vectors b and c, while there are no connections among the visible units or among the hidden units. Given a configuration (v, h), the energy of the model is defined as
E(v, h) = − ∑_i b_i v_i − ∑_j c_j h_j − ∑_{i,j} v_i W_{ij} h_j
where b and c are the biases of the visible and hidden layers, respectively. The corresponding joint probability distribution is
P(v, h) = exp(−E(v, h)) / Z
where Z is the partition function that ensures the distribution P is normalized.
If the visible and hidden units are binary (0 or 1), the conditional distribution of h given v and the conditional distribution of v given h are
P(h_j = 1 | v) = σ(c_j + ∑_i W_{ij} v_i),  P(v_i = 1 | h) = σ(b_i + ∑_j W_{ij} h_j)
where σ(·) is an activation function such as σ(x) = 1/(1 + e^{−x}) or σ(x) = tanh(x). The contrastive divergence algorithm is used to minimize the reconstruction error, and training on the data yields the three parameters of the distribution, W, b, and c.
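The sketch below performs one CD-1 update for a binary RBM using the conditional distributions above; the learning rate and the single-sample (non-batched) form are simplifying assumptions.

```python
# One contrastive-divergence (CD-1) step for a binary RBM.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01, rng=None):
    """v0: (n_visible,) binary data vector; W: (n_visible, n_hidden)."""
    if rng is None:
        rng = np.random.default_rng()
    ph0 = sigmoid(c + W.T @ v0)                   # P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(b + W @ h0)                     # reconstruction P(v = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(c + W.T @ v1)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))   # positive - negative phase
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```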
Images and videos are real-valued data, for which a binary distribution is not appropriate. To apply the RBM to such data, the visible units can be replaced by linear units with Gaussian noise [39, 41], while the hidden units remain binary. The energy function then becomes
E(v, h) = ∑_i (v_i − b_i)² / (2σ_i²) − ∑_j c_j h_j − ∑_{i,j} (v_i / σ_i) W_{ij} h_j
where σ_i is the standard deviation of the Gaussian noise of visible unit i. The two corresponding conditional distributions are
P(h_j = 1 | v) = σ(c_j + ∑_i W_{ij} v_i / σ_i),  P(v_i | h) = N(b_i + σ_i ∑_j W_{ij} h_j, σ_i²)
where N(μ, σ²) denotes a Gaussian distribution with mean μ and variance σ².
The conditional restricted Boltzmann machine (CRBM) [40, 42] extends the RBM along the temporal dimension by connecting the visible units of past time steps to the units of the current time step. For binary data the two conditional distributions become
P(h_j = 1 | v, v^{<t}) = σ(c_j + ∑_i W_{ij} v_i + ∑_k B_{kj} v_k^{<t}),
P(v_i = 1 | h, v^{<t}) = σ(b_i + ∑_j W_{ij} h_j + ∑_k A_{ki} v_k^{<t})
where v^{<t} collects the visible units of the past time steps, and A and B are the weights from the past visible units to the current visible and hidden units, respectively. The parameters θ = {W, b, c, A, B} can likewise be optimized with the contrastive divergence algorithm.
Taylor et al. [42] proposed using the conditional restricted Boltzmann machine to model human actions. Chen et al. [43] proposed the space-time deep belief network (ST-DBN), which uses convolutional RBMs to combine spatial pooling layers and temporal pooling layers and extracts invariant features from video data; it achieved a recognition rate of 91.13% on the KTH database.
2.4 Recurrent neural networks and their extensions
In the field of deep learning, traditional feedforward neural networks (FNN) have achieved remarkable success. As research has advanced, however, it has become clear that FNN models cannot learn the sequential structure of information such as sound, text, or video. To address this, recurrent neural networks (RNN) [44-46], which capture the sequential dependencies in data, have developed rapidly. An RNN feeds the hidden-layer state of previous time steps back as part of the input at the current time step, so that information along the temporal dimension is preserved. The structure of an RNN is shown in Fig. 12: the hidden-layer output y_j is passed through the weights w to form the system output, and the previous value y_j(t−1) is fed back as an input to the system at the current time step.
Fig. 12 The structure of the RNN
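A few lines of NumPy suffice to show the recurrence of Fig. 12, with the previous hidden state fed back into the current step; the weight shapes are assumptions and no training is shown.

```python
# Vanilla RNN forward pass: h_t depends on both x_t and h_{t-1}.
import numpy as np

def rnn_forward(xs, W_in, W_rec, W_out):
    """xs: (T, n_in); W_in: (n_hid, n_in); W_rec: (n_hid, n_hid); W_out: (n_out, n_hid)."""
    h = np.zeros(W_rec.shape[0])
    ys = []
    for x in xs:
        h = np.tanh(W_in @ x + W_rec @ h)   # previous hidden state re-enters here
        ys.append(W_out @ h)
    return np.stack(ys)
```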
The long short-term memory (LSTM) [47-49] RNN is an extension of the plain RNN, introduced mainly to alleviate the vanishing gradient problem of RNN models, as shown in Fig. 13. An LSTM unit receives the output of the previous time step, the current cell state, and the current input, updates the state through an input gate, a forget gate, and an output gate, and produces the final output.
Denoting the input gate by i_t, the forget gate by f_t, and the output gate by o_t, the forget gate decides which part of the previous state should be discarded, the input gate decides which part of the current input should be written into the state, and the output gate decides which part of the information combining the current input, the previous output, and the state is emitted as the final output. In the standard formulation the gates are
i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + b_i)
f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + b_f)
o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t)
where ⊙ denotes element-wise multiplication.
Fig. 13 The LSTM unit
Ng et al. [50] used LSTMs to model videos, connecting the outputs of an underlying CNN as inputs to the LSTM at successive time steps, and achieved a recognition rate of 82.6% on the UCF101 database. Donahue et al. [51] proposed the long-term recurrent convolutional network (LRCN), which combines a CNN and an LSTM for video feature extraction: single-frame features are obtained with the CNN, and the CNN outputs are then fed through the LSTM in temporal order, so that the video is represented along both the spatial and temporal dimensions; this achieved an average recognition rate of 82.92% on UCF101.
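The sketch below follows the LRCN idea of Donahue et al. [51] at toy scale: per-frame CNN features are passed in temporal order through an LSTM and the last output is classified; the small CNN and the feature and hidden sizes are stand-ins for the much deeper networks used in the cited work.

```python
# Per-frame CNN features fed to an LSTM for video classification.
import torch
import torch.nn as nn

class TinyLRCN(nn.Module):
    def __init__(self, num_classes: int = 101, feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(16 * 4 * 4, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, video):                       # video: (N, T, 3, H, W)
        n, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).view(n, t, -1)  # per-frame features
        out, _ = self.lstm(feats)                   # temporal modeling
        return self.fc(out[:, -1])                  # classify from the last time step

logits = TinyLRCN()(torch.randn(2, 8, 3, 64, 64))   # -> (2, 101)
```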
3 Conclusion
This paper has reviewed and analyzed traditional action recognition methods and human action recognition methods based on deep learning. Traditional methods place strong requirements on the environment and recording conditions of the video, and their feature extraction is designed manually from prior knowledge. Deep learning based methods, in contrast, do not require hand-designed feature extraction; they can be trained directly on video data to learn the most effective representations. This approach adapts well to the data and, in particular, can achieve better results when labeled data are scarce. Convolutional neural networks achieved strong results in image recognition, so CNN-based methods attracted attention early on, and the 3D CNN extended to action recognition performs well; however, it is a supervised method and requires a large amount of labeled data during training. Unsupervised deep learning algorithms, such as ISA, have also obtained good results in action recognition: methods based on autoencoders and RBMs can be trained on unlabeled data to learn effective spatio-temporal feature representations. Since video carries information along the temporal dimension, RNNs are well suited to modeling it, but the vanishing gradient problem prevents plain RNNs from handling long videos; the LSTM was proposed to solve this problem. As research deepens, more and better deep learning frameworks for human action recognition can be expected. It should also be noted, however, that deep learning based methods train slowly and require very large amounts of sample data; solving these problems awaits further development of the algorithms.
References
1 Fujiyoshi H,Lipton A J,Kanade T.Real-time human motion analysis by image skeletonization.IEICE Transactions on Information and Systems,2004,E87-D(1):113-120
2 Chaudhry R,Ravichandran A,Hager G,Vidal R.Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions.In:Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition.Miami,F(xiàn)L:IEEE,2009.1932-1939
3 Dalal N,Triggs B.Histograms of oriented gradients for human detection.In:Proceedings of the 2005 IEEE Conference on Computer Vision and Pattern Recognition.San Diego,CA,USA:IEEE,2005.886-893
4 Lowe D G.Object recognition from local scale-invariant features.In:Proceedings of the 7th IEEE International Conference on Computer Vision.Kerkyra:IEEE,1999.1150-1157
5 Schuldt C,Laptev I,Caputo B.Recognizing human actions:a local SVM approach.In:Proceedings of the 17th International Conference on Pattern Recognition.Cambridge:IEEE,2004.32-36
6 Dollar P,Rabaud V,Cottrell G,Belongie S.Behavior recognition via sparse spatio-temporal features.In:Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.Beijing,China:IEEE,2005.65-72
7 Rapantzikos K,Avrithis Y,Kollias S.Dense saliency-based spatiotemporal feature points for action recognition.In:Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition.Miami,F(xiàn)L:IEEE,2009. 1454-1461
8 Knopp J,Prasad M,Willems G,Timofte R,Van Gool L. Hough transform and 3D SURF for robust three dimensional classification.In:Proceedings of the 11th European Conference on Computer Vision(ECCV 2010).Berlin Heidelberg:Springer.2010.589-602
9 Kläser A,Marszałek M,Schmid C.A spatio-temporal descriptor based on 3D-gradients.In:Proceedings of the 19th British Machine Vision Conference.Leeds:BMVA Press,2008.99.1-99.10
10 Wang H,Ullah M M,Klaser A,Laptev I,Schmid C.Evaluation of local spatio-temporal features for action recognition. In:Proceedings of the 2009 British Machine Vision Conference.London,UK:BMVA Press,2009.124.1-124.11
11 Wang H,Kläser A,Schmid C,Liu C L.Action recognition by dense trajectories.In:Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).Providence,RI:IEEE,2011.3169-3176
12 Hinton G E.Learning multiple layers of representation. Trends in Cognitive Sciences,2007,11(10):428-434
13 Deng L,Yu D.Deep learning: methods and applications.Foundations and Trends?in Signal Processing,2014,7(3-4):197-387
14 Schmidhuber J.Deep learning in neural networks: an overview.Neural Networks,2015,61:85-117
15 Gorelick L,Blank M,Shechtman E,Irani M,Basri R.Actions as space-time shapes.In:Proceedings of the 10th IEEE International Conference on Computer Vision.Beijing,China:IEEE,2005.1395-1402
16 Soomro K,Zamir A R.Action recognition in realistic sports videos.Computer Vision in Sports.Switzerland:Springer. 2014.181-208
17 Rodriguez M D,Ahmed J,Shah M.Action mach a spatiotemporal maximum average correlation height filter for action recognition.In:Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition.Anchorage,AK:IEEE,2008.1-8
18 Marszalek M,Laptev I,Schmid C.Actions in context.In:Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition.Miami,F(xiàn)L:IEEE,2009. 2929-2936
19 Yang X D,Tian Y L.Effective 3D action recognition using EigenJoints.Journal of Visual Communication and Image Representation,2014,25(1):2-11
20 Bobick A,Davis J.An appearance-based representation of action.In:Proceedings of the 13th International Conference on Pattern Recognition.Vienna:IEEE,1996.307-312
21 Weinland D,Ronfard R,Boyer E.Free viewpoint action recognition using motion history volumes.Computer Vision and Image Understanding,2006,104(2-3):249-257
22 Bobick A F,Davis J W.The recognition of human movement using temporal templates.IEEE Transactions on Pattern Analysis and Machine Intelligence,2001,23(3):257-267
23 Sarikaya R,Hinton G E,Deoras A.Application of deep belief networks for natural language understanding.IEEE/ACM Transactions on Audio,Speech,and Language Processing,2014,22(4):778-784
24 Ren Y F,Wu Y.Convolutional deep belief networks for feature extraction of EEG signal.In:Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN).Beijing,China:IEEE,2014.2850-2853
25 Bengio Y.Learning deep architectures for AI.Foundations and Trends?in Machine Learning,2009,2(1):1-127
26 LeCun Y,Ranzato M.Deep learning tutorial.In:Tutorials in International Conference on Machine Learning(ICML13). Atlanta,USA:Citeseer,2013.
27 Krizhevsky A,Sutskever I,Hinton G E.Imagenet classification with deep convolutional neural networks.In:Proceedings of Advances in Neural Information Processing Systems. Lake Tahoe,Nevada,United States,2012.1097-1105
28 Bouvrie J.Notes on Convolutional Neural Networks.MIT CBCL Technical Report,2006,38-44
29 Ji S W,Xu W,Yang M,Yu K.3D convolutional neural networks for human action recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(1):221-231
30 Chéron G,Laptev I,Schmid C.P-CNN:pose-based CNN features for action recognition.In:Proceedings of the 2015 IEEE International Conference on Computer Vision.Santiago:IEEE,2015.3218-3226
31 Varol G,Laptev I,Schmid C.Long-term temporal convolutions for action recognition.arXiv:1604.04494,2016.
32 Karpathy A,Toderici G,Shetty S,Leung T,Sukthankar R,Li F F.Large-scale video classification with convolutional neural networks.In:Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition(CVPR). Columbus,OH:IEEE,2014.1725-1732
33 Simonyan K,Zisserman A.Two-stream convolutional networks for action recognition in videos.In:Proceedings of Advances in Neural Information Processing Systems.Red Hook,NY:Curran Associates,Inc.,2014.568-576
34 Poultney C,Chopra S,Cun Y L.Efficient learning of sparse representations with an energy-based model.In:Proceedings of Advances in Neural Information Processing Systems. Cambridge,MA:MIT Press,2006.1137-1144
35 Bengio Y,Lamblin P,Popovici D,Larochelle H.Greedy layer-wise training of deep networks.In:Proceedings of Advances in Neural Information Processing Systems.Cambridge,MA:MIT Press,2006.
36 Le Q V,Zou W Y,Yeung S Y,Ng A Y.Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis.In:Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition(CVPR).Providence,RI:IEEE,2011. 3361-3368
37 Hyvärinen A,Hurri J,Hoyer P O.Natural Image Statistics:A Probabilistic Approach to Early Computational Vision. London:Springer-Verlag,2009.
38 Hinton G.A practical guide to training restricted Boltzmann machines.Momentum,2010,9(1):926
39 Fischer A,Igel C.An introduction to restricted Boltzmann machines.In:Proceedings of the 17th Iberoamerican Congress on Progress in Pattern Recognition,Image Analysis,Computer Vision,and Applications.Buenos Aires,Argentina:Springer.2012.14-36
40 Larochelle H,Bengio Y.Classification using discriminative restricted Boltzmann machines.In:Proceedings of the 25th International Conference on Machine Learning.New York:ACM,2008.536-543
41 Chen H,Murray A F.Continuous restricted Boltzmann machine with an implementable training algorithm.IEE Proceedings-Vision,Image and Signal Processing,2003,150(3):153-158
42 Taylor G W,Hinton G E.Factored conditional restricted Boltzmann machines for modeling motion style.In:Proceedings of the 26th Annual International Conference on Machine Learning.New York:ACM,2009.1025-1032
43 Chen B,Ting J A,Marlin B,de Freitas N.Deep learning of invariant spatio-temporal features from video.In:Proceedings of Conferrence on Neural Information Processing Systems(NIPS)Workshop on Deep Learning and Unsupervised Feature Learning.Whistler BC Canada,2010.
44 Pineda F J.Generalization of back-propagation to recurrent neural networks.Physical Review Letters,1987,59(19):2229-2232
45 Chung J,Gulcehre C,Cho K,Bengio Y.Empirical evaluation of gated recurrent neural networks on sequence modeling.arXiv:1412.3555,2014.
46 Omlin C W,Giles C L.Training second-order recurrent neural networks using hints.In:Proceedings of the 9th International Workshop Machine Learning.San Francisco,CA,USA:Morgan Kaufmann Publishers Inc.,1992.361-366
47 Sak H,Senior A,Beaufays F.Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition.arXiv:1402.1128,2014.
48 Hochreiter S,Schmidhuber J.Long short-term memory. Neural Computation,1997,9(8):1735-1780
49 Sak H,Senior A,Beaufays F.Long short-term memory recurrent neural network architectures for large scale acoustic modeling.In:Proceedings of the 2014 Annual Conference of International Speech Communication Association(INTERSPEECH).Singapore:ISCA,2014.338-342
50 Ng J Y H,Hausknecht M,Vijayanarasimhan S,Vinyals O,Monga R,Toderici G.Beyond short snippets:deep networks for video classification.arXiv:1503.08909,2015.
51 Donahue J,Hendricks L A,Guadarrama S,Rohrbach M,Venugopalan S,Saenko K,Darrell T.Long-term recurrent convolutional networks for visual recognition and description.arXiv:1411.4389,2014.
ZHU Yu  Professor at the School of Information Science and Engineering, East China University of Science and Technology. She received her Ph.D. degree from Nanjing University of Science and Technology, China in 1999. Her research interest covers intelligent video analysis and understanding, pattern recognition, and digital image processing methods and applications. Corresponding author of this paper.
E-mail: zhuyu@ecust.edu.cn
ZHAO Jiang-Kun  Master student at the School of Information Science and Engineering, East China University of Science and Technology. His research interest covers intelligent video analysis and pattern recognition.
E-mail: zhaojk90@gmail.com
WANG Yi-Ning  Master student at the School of Information Science and Engineering, East China University of Science and Technology. His research interest covers intelligent video analysis and pattern recognition.
E-mail: wyn885@126.com
ZHENG Bing-Bing  Master student at the School of Information Science and Engineering, East China University of Science and Technology. His research interest covers intelligent video analysis and pattern recognition.
E-mail: 13162233697@163.com
A Review of Human Action Recognition Based on Deep Learning
ZHU Yu1ZHAO Jiang-Kun1WANG Yi-Ning1ZHENG Bing-Bing1
Human action recognition is an active research topic in intelligent video analysis and has gained extensive attention in the academic and engineering communities. This technology is an important basis of intelligent video analysis, video tagging, human-computer interaction, and many other fields. Deep learning theory has achieved remarkable results in still-image feature extraction and is gradually being extended to the temporal sequences of human action videos. This paper reviews traditional hand-designed action recognition methods, such as those based on spatio-temporal interest points, and then introduces and analyzes different human action recognition frameworks based on deep learning, including the convolutional neural network (CNN), independent subspace analysis (ISA) model, restricted Boltzmann machine (RBM), and recurrent neural network (RNN). Finally, this paper summarizes the advantages and disadvantages of these methods.
Action recognition,deep learning,convolution neural network(CNN),restricted Boltzmann machine(RBM)
10.16383/j.aas.2016.c150710
Zhu Yu,Zhao Jiang-Kun,Wang Yi-Ning,Zheng Bing-Bing.A review of human action recognition based on deep learning.Acta Automatica Sinica,2016,42(6):848-857
Manuscript received October 31, 2015; accepted April 18, 2016
Supported by National Natural Science Foundation of China (61370174, 61271349) and the Fundamental Research Funds for the Central Universities (WH1214015)
Recommended by Associate Editor KE Deng-Feng
1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China