

Image Caption Generation Algorithm Based on Multi-Attention and Multi-Scale Feature Fusion

Journal of Computer Applications, 2019, No. 2

        陳龍杰 張鈺 張玉梅 吳曉軍

Abstract: Focusing on the issues of low-quality description of image details, insufficient utilization of image features, and the single-level structure of the recurrent neural network in image caption generation, an image caption generation algorithm based on multi-attention and multi-scale feature fusion was proposed. A pre-trained object detection network was used to extract image features from different layers of the convolutional neural network, and the features of each layer were fed into their own attention structures; each attention structure was connected in turn to a multi-level recurrent neural network, constructing a multi-level image caption generation network model. Residual connections were introduced into the multi-level recurrent network to improve performance and to avoid the network degradation caused by deepening the network. On the MSCOCO test set, the BLEU-1 and CIDEr scores of the proposed algorithm reach 0.804 and 1.167 respectively, clearly outperforming the top-down image caption generation algorithm based on a single attention structure; manual comparison also shows that the captions generated by the proposed algorithm express image details better.

Keywords: Long Short-Term Memory (LSTM) network; image caption; multi-attention mechanism; multi-scale feature fusion; deep neural network

CLC number: TP391.41

Document code: A

0 Introduction

Images are the most common information carrier in human social activity and contain a wealth of information. With the development of Internet technology and the spread of digital devices, image data has grown rapidly, and screening image content purely by hand has become a difficult job. How to make computers extract the information expressed by images automatically has therefore become a research hotspot in image understanding. Image caption generation is a fairly comprehensive task that combines natural language processing with computer vision: its goal is to connect visual images with written language by extracting and analyzing the features of an input image and automatically generating a passage of text describing its content. Because it converts images into textual information, image caption generation can be applied to image retrieval, robot question answering, children's education, guidance for the blind and many other areas, so research on it has substantial practical significance.

Image caption generation is a technique that overcomes the inherent limits of subjective human perception and uses computer software to generate, from one image or a sequence of images, a textual description of what they show. The quality of an image caption depends mainly on two aspects: the ability to recognize the objects and scenes contained in the image, and the degree to which information such as the relationships among objects is understood. According to the captioning model used, the methods fall into three classes: 1) template-based methods [1], whose captions depend on the template type and take a rather rigid form; 2) retrieval-based methods, which rely on the descriptions already present in a dataset and cannot generate novel captions; 3) neural-network-based methods, which combine a Convolutional Neural Network (CNN) [2] with a Recurrent Neural Network (RNN) [3], train the model end to end, and exploit the CNN's strength at feature extraction and the RNN's strength at processing word sequences to jointly guide caption generation.
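As a concrete illustration of the third class of methods, below is a minimal sketch of an end-to-end CNN+RNN captioning model. It assumes PyTorch and torchvision; the ResNet-50 backbone, the dimensions and all module names are illustrative assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleCaptioner(nn.Module):
    """Minimal CNN encoder + LSTM decoder for image captioning (sketch only)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: pre-trained ResNet-50 with the classification head removed
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])   # (B, 2048, 1, 1)
        self.img_proj = nn.Linear(2048, embed_dim)
        # RNN decoder over word embeddings
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)        # (B, 2048) image feature
        img_token = self.img_proj(feats).unsqueeze(1)  # image fed as the first token
        seq = torch.cat([img_token, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                        # word logits at every step
```

Feeding the projected image feature to the decoder as its first token, as in the Show-and-Tell family of models [6], lets the same recurrent network condition every generated word on the image.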

With the development of deep learning, neural-network-based methods came to the fore, represented by the multimodal Recurrent Neural Network (m-RNN) proposed in [4]. m-RNN was the first to split caption generation into two branch tasks, using a CNN to extract image features and an RNN to build the language model. The CNN in m-RNN adopts the AlexNet [5] architecture, while the RNN uses two embedding layers to encode the one-hot word sequence before feeding it into the recurrent layer; m-RNN couples the two by feeding the features extracted by the deep CNN into the multimodal layer that follows the recurrent layer, and the output is obtained through a Softmax layer. Although m-RNN successfully introduced CNNs into the captioning task, its RNN structure is rather simple and its learning capacity weak. Reference [6] replaced the plain RNN with a Long Short-Term Memory (LSTM) network and extracted image features with a CNN containing batch normalization layers, improving both accuracy and speed. For describing video in natural language, reference [7] used the AlexNet model and the VGG16 model proposed by the Visual Geometry Group (VGG) of the University of Oxford to extract spatial features of a video, fused them with the motion features obtained from the optical flow of adjacent frames and with temporal features, and fed the result into an LSTM description model, improving the accuracy of video description.
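The distinctive step in m-RNN is its multimodal layer, which fuses the word embedding, the recurrent state and the CNN image feature just before the Softmax output. The following is a hedged sketch of that fusion step; the dimensions are assumptions for illustration, and the activation is simplified relative to the original paper.

```python
import torch
import torch.nn as nn

class MultimodalLayer(nn.Module):
    """m-RNN-style multimodal layer: project the word embedding, the recurrent
    state and the image feature into one space, sum them, and predict a word."""
    def __init__(self, embed_dim=256, hidden_dim=256, img_dim=4096,
                 mm_dim=512, vocab_size=10000):
        super().__init__()
        self.w_proj = nn.Linear(embed_dim, mm_dim)   # word-embedding branch
        self.r_proj = nn.Linear(hidden_dim, mm_dim)  # recurrent-layer branch
        self.i_proj = nn.Linear(img_dim, mm_dim)     # CNN image-feature branch
        self.out = nn.Linear(mm_dim, vocab_size)

    def forward(self, word_emb, rnn_state, img_feat):
        # Element-wise sum of the three projected modalities
        m = torch.tanh(self.w_proj(word_emb) + self.r_proj(rnn_state)
                       + self.i_proj(img_feat))
        return torch.log_softmax(self.out(m), dim=-1)  # log-probabilities over words
```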

3 Conclusion

This paper designed an image caption generation algorithm based on a multi-attention, multi-scale fusion deep neural network. The algorithm uses the Faster R-CNN object detection model to extract image features at different scales, feeds the features of each scale in turn into separate attention structures, and finally connects them to a multi-layer recurrent language model. Residual mappings were added to the multi-layer language model to raise its learning efficiency. In the experiments, the ADAM optimization method was adopted, and gradually lowering the learning rate during training effectively promoted convergence. The proposed algorithm achieved high scores on the BLEU, ROUGE_L, METEOR and CIDEr metrics, with BLEU-1 and CIDEr reaching 0.804 and 1.167. The experimental results show that a neural network model based on multi-attention and multi-scale fusion performs strongly on image caption generation, and that adding multiple attention structures effectively improves the description of image details and spatial relations. Future work will build on existing captioning methods to study related tasks such as video description and visual question answering.
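To tie the recap above together, here is a hedged sketch of one decoding step in the spirit of the proposed model: one additive-attention module per feature scale, each feeding its own LSTM layer with a residual connection, trained with ADAM under a decaying learning rate. The class names, dimensions, three-scale setting and schedule are illustrative assumptions, not the exact settings of the experiments.

```python
import torch
import torch.nn as nn

class ScaleAttention(nn.Module):
    """Additive (soft) attention over the region features of one scale."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.f = nn.Linear(feat_dim, attn_dim)
        self.h = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        # feats: (B, R, feat_dim) regions at one scale; h: (B, hidden_dim) LSTM state
        scores = self.v(torch.tanh(self.f(feats) + self.h(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)   # attention weights over the R regions
        return (alpha * feats).sum(dim=1)      # context vector, shape (B, feat_dim)

class MultiAttentionDecoderStep(nn.Module):
    """One decoding step: each LSTM layer attends over its own feature scale,
    with a residual connection from the layer input to the layer output."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_scales=3):
        super().__init__()
        self.attend = nn.ModuleList(ScaleAttention(feat_dim, hidden_dim)
                                    for _ in range(num_scales))
        self.cells = nn.ModuleList(nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)
                                   for _ in range(num_scales))

    def forward(self, x, scale_feats, states):
        # x: (B, hidden_dim); scale_feats: list of (B, R, feat_dim); states: [(h, c)]
        new_states = []
        for attend, cell, feats, (h, c) in zip(self.attend, self.cells,
                                               scale_feats, states):
            ctx = attend(feats, h)                        # scale-specific context
            h, c = cell(torch.cat([x, ctx], dim=1), (h, c))
            x = x + h                                     # residual connection
            new_states.append((h, c))
        return x, new_states

# ADAM with a stepwise-decaying learning rate (illustrative schedule)
model = MultiAttentionDecoderStep()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)
```

The residual addition x + h mirrors the residual mappings added to the multi-layer language model, and giving each attention module a different feature scale is what lets finer image details influence the caption.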

References:

        [1] FANG H, GUPTA S, IANDOLA F, et al. From captions to visual concepts and back [C]// CVPR2015: Proceedings of the 2015 International Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2015: 1473-1482.

        [2] LeCUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553): 436-444.

        [3] HOPFIELD J J. Neural networks and physical systems with emergent collective computational abilities [J]. Proceedings of the National Academy of Sciences of the United States of America, 1982, 79(8): 2554-2558.

        [4] MAO J, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks[EB/OL]. [2018-06-10]. https://arxiv.org/pdf/1410.1090v1.pdf.

[5] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks [C]//NIPS 2012: Proceedings of the 2012 International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc., 2012: 1097-1105.

        [6] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C]//CVPR 2015: Proceedings of the 2015 International Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2015: 3156-3164.

[7] 梁銳, 朱清新, 廖淑嬌, 等. 基于多特征融合的深度視頻自然語言描述方法[J]. 計算機應用, 2017, 37(4): 1179-1184. (LIANG R, ZHU Q X, LIAO S J, et al. Deep natural language description method for video based on multi-feature fusion[J]. Journal of Computer Applications, 2017, 37(4): 1179-1184.)

[8] XU K, BA J L, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [EB/OL]. [2018-06-08]. https://arxiv.org/pdf/1502.03044.pdf.

[9] BAHDANAU D, CHO K H, BENGIO Y. Neural machine translation by jointly learning to align and translate [EB/OL]. [2018-06-10]. https://arxiv.org/pdf/1409.0473.pdf.

        [10] LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]//CVPR2017: Proceedings of the 2017 International Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2017: 3242-3250.

[11] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// NIPS 2017: Proceedings of the 2017 International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc., 2017: 6000-6010.

[12] LI J, MEI X, PROKHOROV D, et al. Deep neural network for structural prediction and lane detection in traffic scene[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(3): 690-703.

        [13] QU Y, LIN L, SHEN F, et al. Joint hierarchical category structure learning and large-scale image classification[J]. IEEE Transactions on Image Processing, 2017, 26(9): 4331-4346.

        [14] SHELHAMER E, LONG J, DARRELL T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640-651.

[15] GONG C, TAO D, LIU W, et al. Label propagation via teaching-to-learn and learning-to-teach[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(6): 1452-1465.

        [16] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// CVPR2016: Proceedings of the 2016 International Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2016: 770-778.

        [17] WANG P, LIU L, SHEN C, et al. Multi-attention network for one shot learning [C]// CVPR2017: Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2017: 22-25.

[18] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.

[19] LIN T-Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common Objects in Context [C]//ECCV2014: Proceedings of the 2014 European Conference on Computer Vision. Cham: Springer, 2014: 740-755.

        [20] KARPATHY A, LI F-F. Deep visual-semantic alignments for generating image descriptions [C]//CVPR2015: Proceedings of the 2015 International Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2015: 3128-3137.

        [21] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation [C]// ACL2002: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA: ACL, 2002: 311-318.

        [22] LIN C-Y. Rouge: a package for automatic evaluation of summaries [C]//ACL2004: Proceedings of the ACL 2004 Workshop on Text Summarization. Stroudsburg, PA: ACL, 2004: 74-81.

[23] BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments [C]//ACL2005: Proceedings of the 2005 ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg, PA: ACL, 2005: 65-72.

[24] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation [C]//CVPR2015: Proceedings of the 2015 International Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2015: 4566-4575.

        [25] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and VQA[EB/OL]. [2018-05-07]. https://arxiv.org/pdf/1707.07998.pdf.

[26] KINGMA D P, BA J. ADAM: a method for stochastic optimization [EB/OL]. [2018-04-22]. https://arxiv.org/pdf/1412.6980.pdf.

[27] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]//CVPR2017: Proceedings of the 2017 International Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2017: 1179-1195.

[28] YOU Q, JIN H, WANG Z, et al. Image captioning with semantic attention [C]//CVPR2016: Proceedings of the 2016 International Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2016: 4651-4659.

        [29] YANG Z, YUAN Y, WU Y, et al. Encode, review, and decode: Reviewer module for caption generation[EB/OL]. [2018-06-10]. https://arxiv.org/pdf/1605.07912v1.pdf.

[30] YAO T, PAN Y, LI Y, et al. Boosting image captioning with attributes[EB/OL]. [2018-03-10]. https://arxiv.org/pdf/1611.01646.pdf.
