CLC number: TP391.4
Document code: A
Insect sound feature recognition method based on three-dimensional convolutional neural network
WAN Yongjing1, WANG Bowei1*, LOU Dingfeng2
1.School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China;
2.Shenzhen Customs, Shenzhen Guangdong 518045, China
Abstract:
The quarantine of imported wood for boring pests is an important task for customs, but existing insect sound detection algorithms suffer from low accuracy and poor robustness. To solve these problems, an insect sound detection method based on the Three-Dimensional Convolutional Neural Network (3D CNN) was proposed to detect the presence of insect sound features. Firstly, the original insect audio was pre-processed with overlapping framing, and the Short-Time Fourier Transform (STFT) was applied to obtain the spectrogram of the audio. Then, the spectrogram was used as the input of a 3D CNN consisting of three convolutional layers, which judges whether insect sound features are present. The network was trained and tested with inputs of different framing lengths. Finally, performance was analyzed using accuracy, F1 score and the ROC curve as metrics. The experiments showed that the test results were best when the overlapping framing length was 5 seconds: the 3D CNN model achieved an accuracy of 96.0% and an F1 score of 0.96 on the test set, and its accuracy was nearly 18% higher than that of the Two-Dimensional Convolutional Neural Network (2D CNN) model. These results show that the proposed model can accurately extract insect sound features from audio signals and complete the pest identification task, providing an engineering solution for customs inspection and quarantine.
Key words:
Three-Dimensional Convolutional Neural Network (3D CNN); Short-Time Fourier Transform (STFT); spectrogram; insect sound detection; acoustic signal processing
0 Introduction
China's imports of raw logs have risen year after year, and the pest outbreaks caused by the harmful organisms they carry have grown correspondingly; wood-boring insects of the families Scolytidae (bark beetles), Platypodidae (pinhole borers), Cerambycidae (longhorn beetles), Buprestidae (jewel beetles) and Bostrichidae (auger beetles) are the main pests found in imported logs [1]. These pests pose a serious threat to China's ecological environment, forestry production and public life. Port quarantine typically relies on destructive inspection (visual examination and dissection of the logs), which is inefficient, labor-intensive and time-consuming [2]. In recent years, advances in acoustics, signal processing and computing have produced non-destructive detection methods. For example, Sutin et al. [3] obtained vibration signals from the wood surface with vibration sensors and proposed a time-domain pest detection method. Since pests emit sound while feeding, moving and communicating, acoustic detection is also a promising option: Qi et al. [4] collected sound signals and used a combined time-frequency method to estimate the number of pests inside wood. Although these studies achieved good results, they all rely on hand-crafted time-frequency features, and the associated equipment costs and environmental requirements severely limit their adoption in border quarantine. The rapid development of deep learning in recent years, with its successful applications in image processing, speech recognition and natural language processing, offers new methods and ideas for solving this problem.
The application of deep learning to animal sound recognition and classification dates back to studies of bat echolocation at the beginning of this century [5]. Many recognition methods based on conventional feature extraction have since emerged. Recognizing animals by their sound signals plays an important role in pest control [6] and biodiversity monitoring [7]. Conventional sound recognition comprises three stages: pre-processing, feature extraction and pattern recognition. Pre-processing usually includes normalization and denoising; although current denoising techniques work well on some datasets, handling complex environments remains a challenge. Feature extraction then converts the data into a form suitable for a classifier, which makes the final decision. Feature extraction usually means applying various signal-processing methods to derive time- and frequency-domain features. Common methods include Mel Frequency Cepstral Coefficients (MFCC) [8] and Linear Frequency Cepstral Coefficients (LFCC) [9]; these have achieved good results not only in speech recognition but in many other audio-processing applications. In addition, spectrogram-based feature extraction using the Short-Time Fourier Transform (STFT), the Wavelet Transform (WT) or the Hilbert-Huang Transform (HHT) also performs well [10]. Common classifiers include the Support Vector Machine (SVM), the Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM) and, more recently, the Deep Neural Network (DNN). A deep network can be trained on visual representations of audio: with a two-dimensional spectrogram as input, a Convolutional Neural Network (CNN) can capture energy patterns in the time and frequency domains and thereby perform the classification decision. However, with two-dimensional input a CNN tends to attend to background noise rather than to the sound event itself, and its performance depends strongly on dataset size.
To address the difficulty of extracting weak signals from high-noise backgrounds with conventional signal processing, and the inability of the Two-Dimensional Convolutional Neural Network (2D CNN) to make full use of time-frequency information, this paper proposes an acoustic Three-Dimensional Convolutional Neural Network (3D CNN) for detecting pests in wood. CNNs perform excellently in a wide range of visual recognition tasks and have recently been applied to acoustic signal processing: Kiskin et al. [11] at the University of Oxford proposed a wavelet-based 2D CNN that detects the southern house mosquito with an F1 score of 0.925, a method also shown to perform well on bird classification; Zhou et al. [12] extracted features with the Fast Fourier Transform (FFT) and n-gram language models and achieved good results in Automatic Speech Recognition (ASR) with a hybrid CNN model. Considering the periodic activity patterns of pests in wood products, this paper proposes a 3D CNN-based insect sound detection method. Compared with the 2D approach, it extracts insect sound features more effectively, outperforms the 2D CNN model in both F1 score and accuracy, and maintains high accuracy even at low signal-to-noise ratios, providing border inspection and quarantine authorities with an effective pest-detection tool.
1 Convolutional neural networks
1.1 Two-dimensional convolutional neural network
The convolutional neural network is a representative deep learning method whose rapid development in recent years has brought great progress in image recognition. By imitating biological neural networks, with lower layers representing detail and higher layers representing semantics, a deep CNN can be trained automatically from large amounts of data [13]. The computation of a two-dimensional convolution is given by Eq. (1):
y_{xy} = f\left(\sum_{i}\sum_{j} w_{ij}\, v_{(x+i)(y+j)} + b\right)    (1)
where y_{xy} is the feature-map value at position (x, y), f is the activation function, w_{ij} is the kernel weight, v_{(x+i)(y+j)} is the input value at (x+i, y+j), and b is the bias [14].
A typical CNN comprises convolutional layers, pooling layers and fully connected layers. The core computation happens in the convolutional layers, where each layer computes the dot product between its input and the layer's weights. In image processing, the depth of the receiving layer must equal the number of color channels so that all color information is captured. The output of a convolutional layer is called a feature map, and the number of feature maps is likewise determined by the layer depth. A convolutional layer is followed by an activation layer, which usually applies a nonlinear function to the previous layer's output. A pooling layer typically follows as well; it downsamples the feature maps to reduce computation while preserving the salient information of the input.
1.2 Three-dimensional convolutional neural network
The three-dimensional convolutional neural network is an extension of the two-dimensional one. In a 2D CNN, the convolutional layers take two-dimensional spatial information as input; in a 3D CNN, temporal and frequency information can be exploited more fully. The kernel of a 3D CNN has three dimensions, sometimes with the color channel treated as a fourth. The computation is given by Eq. (2):
y_{xyt} = f\left(\sum_{i}\sum_{j}\sum_{k} w_{ijk}\, v_{(x+i)(y+j)(t+k)} + b\right)    (2)
where y_{xyt} is the feature-map value at (x, y, t), f is the activation function, w_{ijk} is the kernel weight, v_{(x+i)(y+j)(t+k)} is the input value at position (x+i, y+j, t+k), and b is the bias [14].
As described above, the 3D CNN extends the 2D CNN, and the model input is reshaped into a suitable three-dimensional form as required.
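As a concrete illustration of Eq. (2), the following minimal NumPy sketch computes one output feature map of a 3D convolution with a ReLU activation. It uses valid padding and a single kernel for simplicity; the function name and test values are illustrative, not from the paper's implementation.

```python
import numpy as np

def conv3d_single(volume, kernel, bias=0.0):
    """Naive valid-mode 3D convolution (cross-correlation, as in CNN
    frameworks) per Eq. (2), with ReLU as the activation function f."""
    X, Y, T = volume.shape
    I, J, K = kernel.shape
    out = np.zeros((X - I + 1, Y - J + 1, T - K + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for t in range(out.shape[2]):
                # y_xyt = sum_ijk w_ijk * v_(x+i)(y+j)(t+k) + b
                out[x, y, t] = np.sum(kernel * volume[x:x+I, y:y+J, t:t+K]) + bias
    return np.maximum(out, 0.0)  # ReLU activation
```

In practice a framework performs this computation with many kernels in parallel; the loop form only makes the index structure of Eq. (2) explicit.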
2 Detection method based on the 3D CNN
2.1 Overview
This paper builds a model that takes the spectrogram obtained by the Short-Time Fourier Transform (STFT) as the input of a 3D CNN and applies it to the acoustic identification of pests. First, the collected audio data are normalized in pre-processing. Next, features are extracted: the STFT is used to obtain the time-frequency characteristics of the insect sounds, which are then split into frames. Finally, the two-dimensional spectral features are rearranged into a suitable three-dimensional form and fed to the 3D CNN for training.
2.2 Data source and pre-processing
The audio data come from the Shenzhen Entry-Exit Inspection and Quarantine Bureau and cover 67 recordings made in different environments between 2017 and 2018. Of the resulting segments (with framing length n = 5 s), 785 contain the sounds of 31 pest species moving and feeding inside wood, and 1534 contain non-insect sounds from various real quarantine environments. Segments containing pests are labeled 1; non-insect segments, including environmental noise, are labeled 0. The pests in the wood were identified by insect taxonomists, so the labels can be considered accurate.
Because feature extraction is also affected by the recording equipment and the loudness of the source, the audio is normalized using Eq. (3), where x[k] is the sound sequence and K is the length of the audio:
\tilde{x}[k] = x[k] \Big/ \sqrt{\tfrac{1}{K}\sum_{k} x^{2}[k]}    (3)
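A minimal NumPy sketch of this normalization, assuming the denominator in Eq. (3) is the root-mean-square (RMS) value of the sequence:

```python
import numpy as np

def rms_normalize(x):
    """Divide the sound sequence x[k] by its RMS value, per Eq. (3),
    so that recordings made at different loudness levels and with
    different equipment become comparable."""
    rms = np.sqrt(np.mean(x ** 2))
    return x / rms
```

After this step every sequence has unit RMS, so amplitude differences between recordings no longer dominate the extracted features.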
2.3 Feature extraction
To convert the training data into a form suitable for neural network training, this paper extracts time-frequency features with the STFT. The discrete STFT is given by Eq. (4):
\mathrm{STFT}\{x[k]\} \equiv X(m, \omega) = \sum_{k=-\infty}^{\infty} x[k]\, w[k-m]\, e^{-j\omega k}    (4)
Based on the pests' sound-production patterns, the STFT of the audio signal yields a spectrogram, which is then split with 50% overlap into segments of n seconds, with n taking the values 3, 4, 5, 6, 7 and 8; this enlarges the experimental sample without altering the original data. Because the dataset's sampling rate is fs = 44100 Hz, the maximum resolvable frequency is limited to 22050 Hz. The STFT is computed with a 20 ms Hanning window, 50% overlap and a 1024-point FFT. Finally, the original spectrogram serves as the input of the 2D CNN, and is rearranged into an 80×100×n volume as the input of the 3D CNN (illustrated for n = 4).
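The feature-extraction steps above can be sketched with SciPy as follows. The 20 ms window at fs = 44100 Hz is 882 samples, so the 50% hop of 441 samples gives 100 spectrogram frames per second; the helper names and the log-magnitude choice are illustrative assumptions, not the paper's original code.

```python
import numpy as np
from scipy.signal import stft

FS = 44100            # sampling rate of the dataset (Hz)
WIN = int(0.02 * FS)  # 20 ms Hanning window -> 882 samples
NFFT = 1024           # 1024-point FFT

def spectrogram(x):
    """Log-magnitude spectrogram via STFT: 20 ms Hann window,
    50% overlap, 1024-point FFT, as described in Section 2.3."""
    f, t, Z = stft(x, fs=FS, window='hann', nperseg=WIN,
                   noverlap=WIN // 2, nfft=NFFT)
    return np.log(np.abs(Z) + 1e-10)

def frame_segments(spec, frames_per_sec, n_seconds):
    """Split a spectrogram along the time axis into n-second
    segments with 50% overlap, enlarging the training sample."""
    seg = int(frames_per_sec * n_seconds)
    hop = seg // 2  # 50% overlap between consecutive segments
    return [spec[:, i:i + seg]
            for i in range(0, spec.shape[1] - seg + 1, hop)]
```

The segments would then be stacked into the 80×100×n volumes that the 3D CNN consumes (e.g. by selecting 80 frequency bins of interest from the 513 FFT bins).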
3 CNN architecture
3.1 3D CNN
The 3D CNN used in this paper extends the 2D CNN. Specifically, it contains three convolutional layers: the first has 8 kernels of size 3×3×3, the second 32, and the third 40. All convolutional layers use SAME zero padding with a stride of 1.
Each convolutional layer is followed by a batch normalization layer [15], an activation layer, a max pooling layer and a Dropout layer [16]. Batch normalization rescales the values from the previous layer so that its output has a mean close to 0 and a standard deviation close to 1, which speeds up learning while guarding against overfitting. All batch normalization layers use the same settings: momentum 0.99 and epsilon 0.001. Every activation layer uses the Rectified Linear Unit (ReLU). All max pooling layers use SAME zero padding, with window sizes of 1×2×2, 2×2×2 and 3×2×2 respectively and strides equal to the window sizes. To prevent overfitting, every Dropout rate is set to 0.5. Before the fully connected layer, a Flatten layer collapses the feature maps into a one-dimensional vector. The final layer has 2 nodes activated by the Softmax function to predict whether insects are present. The detailed structure is shown in Fig. 1 (for n = 5).
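The architecture described above can be sketched in Keras roughly as follows. The input shape is an assumption based on the 80×100×n spectrogram volumes of Section 2.3 with n = 5, and `build_3dcnn` is an illustrative name; this is a sketch of the stated layer configuration, not the authors' released code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_3dcnn(input_shape=(80, 100, 5, 1)):
    """Three conv blocks (8/32/40 filters, 3x3x3 kernels, SAME padding,
    stride 1), each followed by batch norm, ReLU, max pooling and
    dropout, then Flatten and a 2-node Softmax output."""
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    pools = [(1, 2, 2), (2, 2, 2), (3, 2, 2)]  # per-block pooling windows
    for filters, pool in zip([8, 32, 40], pools):
        model.add(layers.Conv3D(filters, (3, 3, 3), strides=1, padding='same'))
        model.add(layers.BatchNormalization(momentum=0.99, epsilon=0.001))
        model.add(layers.Activation('relu'))
        model.add(layers.MaxPooling3D(pool_size=pool, strides=pool, padding='same'))
        model.add(layers.Dropout(0.5))
    model.add(layers.Flatten())
    model.add(layers.Dense(2, activation='softmax'))  # insect / no insect
    return model
```

With this configuration, each pooling stage halves the frequency and time axes while the first axis is reduced only in the later blocks, mirroring the 1×2×2, 2×2×2 and 3×2×2 windows given in the text.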
3.2 2D CNN
To evaluate the 3D CNN fairly, a 2D CNN model similar to it was built. The difference is that the 2D CNN uses two-dimensional kernels and pooling windows: all kernels are 3×3 with a stride of 1, and all pooling windows are 2×2 with a stride of 2. All other settings match the 3D CNN.
4 Experimental results and analysis
4.1 Experimental platform
All experiments were run on the Google Colab platform, equipped with an Intel Xeon processor at 2.20 GHz, 13 GB of RAM and an Nvidia Tesla K80 GPU with 20 GB of memory, running Ubuntu. The deep learning framework is Keras with Python and TensorFlow [17] as the backend.
4.2 Evaluation metrics
Performance is evaluated with accuracy, the F1 score and the Receiver Operating Characteristic (ROC) curve [18]. The F1 score combines precision and recall into a single measure; the higher the F1 score, the better the model. Precision, recall and the F1 score are computed as in Eqs. (5), (6) and (7). All metrics are computed on the test set.
\mathrm{Precision} = \frac{\text{correctly predicted insect labels}}{\text{predicted insect labels}}    (5)
\mathrm{Recall} = \frac{\text{correctly predicted insect labels}}{\text{actual insect labels}}    (6)
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}    (7)
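Eqs. (5) to (7) can be computed directly from the predicted and true labels; a minimal sketch treating label 1 (insect present) as the positive class:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 per Eqs. (5)-(7), with label 1
    ('insect present') as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)          # Eq. (5)
    recall = tp / (tp + fn)             # Eq. (6)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (7)
    return precision, recall, f1
```

Library implementations such as scikit-learn's `precision_recall_fscore_support` compute the same quantities; the explicit form above only makes the counts in Eqs. (5) and (6) visible.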
4.3 Parameter selection
During training, both the 2D CNN and the 3D CNN models use the AdaDelta optimizer [19] with the categorical cross-entropy loss. The AdaDelta learning rate is set to 1.0 and its epsilon parameter to 10^-8. The dataset is split into training, validation and test sets in proportions of 90%, 5% and 5%. The validation set is used for hyperparameter tuning and final model selection, and the test set for the final evaluation. To avoid overfitting during training, training is stopped after 66 epochs.
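The training setup above can be sketched as follows. The data here are random placeholders standing in for the spectrogram segments, and a trivial stand-in model replaces the 3D CNN of Section 3.1; only the optimizer settings, loss and 90/5/5 split follow the text.

```python
import numpy as np
import tensorflow as tf

# Placeholder data (shapes illustrative, not the paper's dataset).
rng = np.random.default_rng(0)
X = rng.random((200, 80, 100, 5, 1), dtype=np.float32)
y = tf.keras.utils.to_categorical(rng.integers(0, 2, 200), 2)

# 90% / 5% / 5% train / validation / test split.
idx = rng.permutation(len(X))
n_tr, n_va = int(0.9 * len(X)), int(0.05 * len(X))
tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# Stand-in model; the real 3D CNN would be used here.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(80, 100, 5, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(
    optimizer=tf.keras.optimizers.Adadelta(learning_rate=1.0, epsilon=1e-8),
    loss='categorical_crossentropy',
    metrics=['accuracy'])
# Training would stop after 66 epochs, per Section 4.3:
# model.fit(X[tr], y[tr], validation_data=(X[va], y[va]), epochs=66)
```

The validation subset drives hyperparameter tuning and model selection, while the held-out test subset is touched only for the final evaluation.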
4.4 Network performance at different framing lengths
The effect of the input framing length on network performance was studied first. Choosing a suitable framing length is crucial: if it is too long, the trainable dataset shrinks and the network tends to overfit; if it is too short, a frame cannot be guaranteed to contain insect sound, causing label errors. The performance at each framing length is reported as accuracy, F1 score, True Positive Rate (TPR) and True Negative Rate (TNR) (Table 1); the ROC curves for the six framing lengths are shown in Fig. 2. The results show that performance is best at n = 4, 5 and 6. Weighing accuracy, F1 score and the ROC curve together, the model performs best at n = 5.
4.5 Comparison with the 2D CNN
The 3D CNN model is compared with the 2D CNN model on accuracy, F1 score and the ROC curve. The detailed configurations of the two models are listed in Table 2 ("—" indicates the layer has no such parameter). Table 3 shows that the 3D CNN model outperforms the 2D CNN model on both accuracy and F1 score. Fig. 3 shows the ROC curves of the two methods; the 3D CNN model performs better.
[6]YAZGA B G, KIRCI M, KIVAN M. Detection of sunn pests using sound signal processing methods [C]// Proceedings of the 2016 5th International Conference on Agro-Geoinformatics (Agro-Geoinformatics). Piscataway, NJ: IEEE, 2016: 1-6.
[7]ZILLI D, PARSON O, MERRETT G, et al. A hidden Markov model-based acoustic cicada detector for crowdsourced smartphone biodiversity monitoring[C]// Proceedings of the 23rd International Joint Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2013: 2945-2951.
[8]LIKITHA M S, GUPTA S R R, HASITHA K, et al. Speech based human emotion recognition using MFCC [C]// Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking. Piscataway, NJ: IEEE, 2017: 2257-2260.
[9]POTAMITIS I, GANCHEV T, FAKOTAKIS N. Automatic acoustic identification of crickets and cicadas[C]// Proceedings of the 9th International Symposium on Signal Processing and Its Applications. Piscataway, NJ: IEEE, 2007: 1-4.
[10]DONG X, YAN N, WEI Y. Insect sound recognition based on convolutional neural network [C]// Proceedings of the IEEE 3rd International Conference on Image, Vision and Computing. Piscataway, NJ: IEEE, 2018: 855-859.
[11]KISKIN I, ZILLI D, LI Y, et al. Bioacoustic detection with wavelet-conditioned convolutional neural networks [J]. Neural Computing and Applications, 2018: 1-13.
[12]ZHOU X, LI J, ZHOU X. Cascaded CNN-resBiLSTM-CTC: an end-to-end acoustic model for speech recognition [EB/OL]. [2019-02-01]. https://arxiv.org/pdf/1810.12001.pdf.
[13]SCHMIDHUBER J. Deep learning in neural networks: an overview[J]. Neural Networks, 2015, 61: 85-117.
[14]HUANG J, ZHOU W, LI H, et al. Sign language recognition using 3D convolutional neural networks[C]// Proceedings of the 2015 IEEE International Conference on Multimedia and Expo. Piscataway, NJ: IEEE, 2015: 1-6.
[15]IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift [C]// Proceedings of the 32nd International Conference on International Conference on Machine Learning. Brookline, MA: JMLR, 2015, 37: 448-456.
[16]SRIVASTAVA N, HINTON G, KRIZHEVSKY A, et al. Dropout: a simple way to prevent neural networks from overfitting [J]. Journal of Machine Learning Research, 2014, 15(1): 1929-1958.
[17]ABADI M, BARHAM P, CHEN J, et al. TensorFlow: a system for large-scale machine learning [C]// Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation. Berkeley, CA: USENIX Association, 2016: 265-283.
[18]DOWNEY T J, MEYER D J, PRICE R K, et al. Using the receiver operating characteristic to assess the performance of neural classifiers [C]// Proceedings of the 1999 International Joint Conference on Neural Networks. Piscataway, NJ: IEEE, 1999: 3642-3646.
[19]ZEILER M D. ADADELTA: an adaptive learning rate method [EB/OL]. [2019-02-01]. https://arxiv.org/pdf/1212.5701.pdf.
[20]SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization [C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE, 2017, 1: 618-626.
This work is partially supported by the National Natural Science Foundation of China (61872143), the National Undergraduate Innovation and Entrepreneurship Training Program of China (201810251064).
WAN Yongjing, born in 1975, Ph. D., associate professor. Her research interests include intelligent information processing, image processing, pattern recognition, audio signal processing.
WANG Bowei, born in 1997, undergraduate student. His research interests include signal processing, data mining, machine learning.
LOU Dingfeng, born in 1960, research fellow. His research interests include entomological bioacoustics.