YANG Shiqiang, LUO Xiaoyu, QIAO Dan, LIU Peilei, LI Dexin
CLC number: TP391.4
Document code: A
Abstract: Concerning the fact that there is little research on continuous action recognition in the field of action recognition and that single algorithms have poor performance on continuous action recognition, a segmentation and recognition method for continuous actions was proposed based on single-action modeling, combining the sliding-window method with dynamic programming. Firstly, each single action was modeled with a composite model of a Deep Belief Network and a Hidden Markov Model (DBN-HMM). Secondly, the log-likelihood values of the trained action models and the sliding-window method were used to estimate scores over the continuous action, detecting the initial segmentation points. Thirdly, dynamic programming was used to optimize the locations of the segmentation points and to recognize each single action. Finally, segmentation and recognition experiments on continuous actions were conducted on the public action database MSR Action3D. The experimental results show that dynamic programming based on sliding windows can optimize the selection of segmentation points and thereby improve recognition accuracy, and that the method can be used for continuous action recognition.
Key words: Hidden Markov Model (HMM); action segmentation; action recognition; sliding window; dynamic programming
0 Introduction
人體動(dòng)作識(shí)別是近年來(lái)諸多鄰域研究的熱點(diǎn)[1], 如視頻監(jiān)控[2]、人機(jī)交互[3]等領(lǐng)域。隨著人口老齡化,服務(wù)機(jī)器人將在未來(lái)的日常生活中發(fā)揮重要作用,觀察和反映人類行動(dòng)將成為服務(wù)機(jī)器人的基本技能[4]。動(dòng)作識(shí)別逐漸應(yīng)用到人們生活和工作的各個(gè)方面,具有深遠(yuǎn)的應(yīng)用價(jià)值。
動(dòng)作行為一般是以連續(xù)動(dòng)作的形式來(lái)體現(xiàn),包含多個(gè)單一動(dòng)作,行為識(shí)別時(shí)根據(jù)分割與識(shí)別的前后關(guān)系,可分為直接分割和間接分割。直接分割是先根據(jù)簡(jiǎn)單的參數(shù)大小變化確定分割邊界,然后識(shí)別分割好的片段,如白棟天等[5]根據(jù)關(guān)節(jié)速度、關(guān)節(jié)角度的變化對(duì)動(dòng)作序列進(jìn)行初始分割,該方法較為簡(jiǎn)單快速,但對(duì)于較復(fù)雜的連續(xù)動(dòng)作分割誤差較大。間接分割是分割與識(shí)別同時(shí)進(jìn)行,連續(xù)動(dòng)作的分割與識(shí)別在實(shí)際中相互耦合,動(dòng)作分割結(jié)果會(huì)影響動(dòng)作識(shí)別,且動(dòng)作分割一般需要?jiǎng)幼髯R(shí)別的支持。在連續(xù)動(dòng)作的識(shí)別中使用較多的算法有動(dòng)態(tài)時(shí)間規(guī)整(Dynamic Time Warping, DTW)[6]、連續(xù)動(dòng)態(tài)規(guī)劃(Continuous Dynamic Programming, CDP)[7]和隱馬爾可夫模型(Hidden Markov Model, HMM)。
Gong et al. [8] used Dynamic Manifold Warping (DMW) to compute the similarity between two multivariate time series, achieving action segmentation and recognition. Zhu et al. [9] used an online segmentation method based on feature displacement to divide feature sequences into posture feature segments and motion feature segments, and computed by online model matching the likelihood that each feature segment can be labeled as an extracted key posture or atomic motion. Lei et al. [10] proposed a hierarchical framework combining a Convolutional Neural Network (CNN) with an HMM (CNN-HMM) that segments and recognizes continuous actions simultaneously; it extracts effective and robust action features, achieves good recognition results on action video sequences, and the HMM offers strong extensibility. Kulkarni et al. [11] designed a visual alignment technique, Dynamic Frame Warping (DFW), which trains a super template for each action video and can segment and recognize multiple actions; however, at test time the computation of distances between test frames and templates is expensive, and compared with probabilistic-statistical methods its learning ability in model training is lower. Evangelidis et al. [12] used sliding windows to construct frame-wise Fisher vectors classified by a multi-class Support Vector Machine (SVM); because the sliding-window method fixes the length of action sequences, recognition is poor for actions of the same class whose lengths differ considerably.
和HMM結(jié)合的復(fù)合模型DBN-HMM對(duì)單個(gè)動(dòng)作建模,該復(fù)合模型對(duì)時(shí)序數(shù)據(jù)具有較強(qiáng)的建模能力和模型學(xué)習(xí)能力,然后利用評(píng)分機(jī)制和滑動(dòng)窗口法對(duì)初始分割點(diǎn)進(jìn)行檢測(cè),最后用動(dòng)態(tài)規(guī)劃法進(jìn)行分割點(diǎn)優(yōu)化與識(shí)別。利用滑動(dòng)窗口可降低動(dòng)態(tài)規(guī)劃計(jì)算復(fù)雜度,而動(dòng)態(tài)規(guī)劃能彌補(bǔ)滑動(dòng)窗口固定長(zhǎng)度的缺陷,最終實(shí)現(xiàn)最優(yōu)分割點(diǎn)的檢測(cè)。
1 單個(gè)動(dòng)作建模
連續(xù)動(dòng)作識(shí)別中首先對(duì)連續(xù)動(dòng)作中的單個(gè)動(dòng)作分別建模,在此使用DBN與HMM相結(jié)合的模型DBN-HMM對(duì)動(dòng)作建模。
1.1 Feature extraction
人體動(dòng)作可以表示為三維空間中人體不同肢體的旋轉(zhuǎn)變化,結(jié)合由關(guān)節(jié)點(diǎn)組成的人體骨架模型,可由人體的20個(gè)關(guān)節(jié)點(diǎn)在空間中的三維坐標(biāo)表示人體姿態(tài),各關(guān)節(jié)點(diǎn)位置分別為:頭部、左/右肩關(guān)節(jié)、肩膀中心、左/右肘關(guān)節(jié)、脊柱中心、左/右手腕關(guān)節(jié)、左/右手、左/右髖關(guān)節(jié)、臀部中心、左/右膝蓋、左/右腳踝、左/右腳。在肢體角度模型中,一個(gè)肢體由人體20個(gè)關(guān)節(jié)點(diǎn)中兩個(gè)相鄰關(guān)節(jié)點(diǎn)在空間中的相對(duì)位置來(lái)表示。假設(shè)所有關(guān)節(jié)都是從脊柱關(guān)節(jié)點(diǎn)延伸出的,由相鄰兩個(gè)關(guān)節(jié)點(diǎn)組成的一個(gè)肢體中,靠近脊柱關(guān)節(jié)的關(guān)節(jié)點(diǎn)定義為父關(guān)節(jié)點(diǎn),另一個(gè)定義為子關(guān)節(jié)點(diǎn)。通過(guò)坐標(biāo)系轉(zhuǎn)換將世界坐標(biāo)系轉(zhuǎn)換為局部球坐標(biāo)系來(lái)表示每個(gè)肢體的相對(duì)位置信息,以每個(gè)肢體中的父關(guān)節(jié)點(diǎn)作為球坐標(biāo)系的原點(diǎn),子關(guān)節(jié)點(diǎn)與父關(guān)節(jié)點(diǎn)的連線長(zhǎng)度為r,其在球坐標(biāo)系中與Z軸的夾角為φ,投影到XOY平面上與X軸的夾角為θ,一個(gè)肢體角度模型可以表示為(r,θ,φ),如圖1所示。由于距離r包含有人體尺寸的影響,因此去掉距離r,由(θ,φ)表示肢體角度模型。
5 結(jié)語(yǔ)
針對(duì)現(xiàn)有動(dòng)作識(shí)別中對(duì)連續(xù)動(dòng)作識(shí)別研究較少,且單一算法對(duì)連續(xù)動(dòng)作識(shí)別效果較差的問題,本文給出了一種連續(xù)動(dòng)作的分割與識(shí)別方法——采用滑動(dòng)窗口法和動(dòng)態(tài)規(guī)劃法結(jié)合用于連續(xù)動(dòng)作的分割與識(shí)別。建立的DBN-HMM具有較強(qiáng)的建模能力,結(jié)合滑動(dòng)窗口和動(dòng)態(tài)規(guī)劃對(duì)連續(xù)動(dòng)作分割點(diǎn)進(jìn)行檢測(cè),使兩種方法互補(bǔ),既能降低計(jì)算復(fù)雜度又能彌補(bǔ)固定長(zhǎng)度的限制。實(shí)驗(yàn)結(jié)果表明,本文方法在復(fù)雜連續(xù)動(dòng)作的分割與識(shí)別中獲得了較好的識(shí)別結(jié)果。不過(guò)算法的識(shí)別率還有進(jìn)一步提高的空間,在后續(xù)研究中需考慮開展采集連續(xù)動(dòng)作視頻的動(dòng)作分割與識(shí)別。
References
[1] HU Q, QIN L, HUANG Q M. A survey on visual human action recognition [J]. Chinese Journal of Computers, 2013, 36(12): 2512-2524. (in Chinese)
[2] AGGARWAL J K, RYOO M S. Human activity analysis:a review[J]. ACM Computing Surveys, 2011, 43(3): Article No. 16.
[3] KOPPULA H S, SAXENA A. Anticipating human activities using object affordances for reactive robotic response [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(1): 1-14.
[4] ZHANG C, TIAN Y. RGB-D camera-based daily living activity recognition [J]. Journal of Computer Vision and Image Processing, 2012, 2(4): 1-7.
[5] BAI D T, ZHANG L, HUANG H. Recognizing continuous human actions from RGB-D videos [J]. China Science Paper, 2016(2): 168-172. (in Chinese)
[6] DARRELL T, PENTLAND A. Space-time gestures [C]// Proceedings of the 1993 IEEE Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 1993: 335-340.
[7] OKA R. Spotting method for classification of real world data[J]. Computer Journal, 1998, 41(8): 559-565.
[8] GONG D, MEDIONI G, ZHAO X. Structured time series analysis for human action segmentation and recognition [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(7): 1414-1427.
[9] ZHU G, ZHANG L, SHEN P, et al. An online continuous human action recognition algorithm based on the Kinect sensor [J]. Sensors, 2016, 16(2): 161-179.
[10] LEI J, LI G, ZHANG J, et al. Continuous action segmentation and recognition using hybrid convolutional neural network-hidden Markov model model[J]. IET Computer Vision, 2016, 10(6): 537-544.
[11] KULKARNI K, EVANGELIDIS G, CECH J, et al. Continuous action recognition based on sequence alignment[J]. International Journal of Computer Vision, 2015, 112(1): 90-114.
[12] EVANGELIDIS G D, SINGH G, HORAUD R. Continuous gesture recognition from articulated poses [C]// Proceedings of the 2014 European Conference on Computer Vision. Cham: Springer, 2014: 595-607.
[13] SONG Y, GU Y, WANG P, et al. A Kinect based gesture recognition algorithm using GMM and HMM [C]// Proceedings of the 2013 6th International Conference on Biomedical Engineering and Informatics. Piscataway, NJ: IEEE, 2013: 750-754.
[14] VITERBI A J. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm [J]. IEEE Transactions on Information Theory, 1967, 13(2): 260-269.
[15] TAYLOR G W, HINTON G E, ROWEIS S. Modeling human motion using binary latent variables [C]// Proceedings of the 19th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2007: 1345-1352.
[16] HINTON G E, OSINDERO S, TEH Y W. A fast learning algorithm for deep belief nets [J]. Neural Computation, 2006, 18(7): 1527-1554.
[17] LI W, ZHANG Z, LIU Z. Action recognition based on a bag of 3D points [C]// Proceedings of the 2010 IEEE Computer Vision and Pattern Recognition Workshops. Washington, DC: IEEE Computer Society, 2010: 9-14.