

        A novel L-vector representation and improved cosine distance kernel for Text-dependent Speaker Verification

        LI Wei1, YOU Hanxu1, ZHU Jie1, CHEN Ning2

        (1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)

        Abstract: A text-dependent i-vector extraction scheme and a lexicon-based binary vector (L-vector) representation are proposed to improve the performance of text-dependent speaker verification. An utterance used for enrollment or test is represented by these two vectors. An improved cosine distance kernel combining the i-vector and the L-vector is constructed to discriminate both speaker identity and lexical (text) diversity with a back-end support vector machine (SVM). Experiments are conducted on part 1 and part 2 of the RSR2015 corpus. The results indicate that a relative improvement of up to 30% can be obtained over the traditional i-vector baseline.

        Key words: text-dependent speaker verification; i-vector; L-vector; cosine distance kernel

        1 Introduction

        In recent years, the i-vector based framework has demonstrated state-of-the-art performance in text-independent speaker verification[1]. Each utterance, whether for enrollment or test, is projected onto a low-rank total factor space and represented by a low-dimensional identity vector termed the i-vector. It is commonly thought that the i-vector captures the speaker- and channel-dependent information in an utterance well, and it corresponds to a global adaptation in the Gaussian Mixture Model (GMM) subspace. However, its applicability has not been widely accepted in text-dependent speaker verification[2], mainly for two reasons. Firstly, the i-vector cannot explicitly represent the lexical information of an utterance. Secondly, since utterances are very short in text-dependent speaker verification, short-term speaker features such as Mel Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP) features can only activate a subset of the Gaussian components, so it is not appropriate to globally adapt all of them.

        To cope with these two shortcomings, we firstly propose a text-dependent i-vector extraction scheme: only those Gaussian components with sufficiently many speaker frames are retained, and i-vector adaptation is performed on this subset. Secondly, a lexicon-based binary vector termed the L-vector is constructed to model the distribution of the zero-order Baum-Welch statistics, which captures the lexical information in an utterance. Finally, an improved cosine distance kernel combining the i-vector and the L-vector is constructed to measure the diversity of both speaker identity and lexical (text) content.

        2 Text-dependent i-vector extraction

        Given the speaker frame set of an utterance, we regard the corresponding zero-order Baum-Welch statistic N_c as a metric of how many frames are assigned to each Gaussian component, where c indexes the Gaussian components. According to [3], an extremely short utterance (less than 10 s) leads to an imbalanced distribution of the zero-order Baum-Welch statistics: the 50% of Gaussian components with the highest N_c already capture more than 90% of the speaker frames. In text-dependent speaker verification, enrollment and test utterances are also very short; moreover, a scarce N_c may lead to a biased estimate of the first-order Baum-Welch statistic F_c[3]. Hence it is more appropriate to perform i-vector adaptation within a subset of the Gaussian components.
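
        As a concrete illustration of the quantities involved, the following Python sketch computes the zero-order statistics N_c of one utterance from the per-frame component posteriors of a GMM/UBM. It is a minimal example, not the authors' implementation; scikit-learn's GaussianMixture merely stands in for whatever UBM toolkit is actually used.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def zero_order_stats(ubm: GaussianMixture, frames: np.ndarray) -> np.ndarray:
            """frames: (T, D) short-term features of one utterance.
            Returns N of shape (C,), where N[c] = sum_t P(c | x_t)."""
            gamma = ubm.predict_proba(frames)   # (T, C) per-frame component posteriors
            return gamma.sum(axis=0)            # (C,) zero-order Baum-Welch statistics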

        In order to select the Gaussian components with the highest N_c, a threshold function is defined as:

        S(c) = 1, if N_c ≥ ε;  S(c) = 0, otherwise    (1)

        where ε is an empirically tuned factor that adjusts the number of Gaussian components to be retained. With this filtering scheme we select the subset of Gaussian components with the highest N_c. In practice, we usually pay more attention to the number of Gaussians in the subset, which we denote by R. The text-dependent i-vector extraction can then be written as:

        w(u) = ( I + ∑_{c: S(c)=1} N_c(u) T_c^T Σ_c^{-1} T_c )^{-1} ∑_{c: S(c)=1} T_c^T Σ_c^{-1} ( F_c(u) - N_c(u) m_c )    (2)

        where u denotes the utterance involved, I is the identity matrix acting as the prior, T_c is the sub-matrix of the c-th block of the total factor matrix T, Σ_c and m_c are the speaker- and text-independent covariance matrix and mean vector of the c-th Gaussian component, and C is the total number of Gaussian components. Compared to traditional i-vector extraction[4], the S(c) filtering mechanism ensures that only the components representing the lexical information of the utterance are involved in the adaptation.
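
        To make the filtering concrete, the sketch below first builds the mask S(c), either by thresholding N_c with ε or, as is more convenient in practice, by keeping the R components with the highest N_c, and then evaluates the standard i-vector posterior mean of [4] restricted to the retained components, which is our reading of Eq. (2). All names and shapes are illustrative assumptions rather than the authors' released code, and Σ_c is assumed diagonal.

        import numpy as np

        def select_components(N: np.ndarray, eps: float = None, R: int = None) -> np.ndarray:
            """N: (C,) zero-order statistics. Returns the boolean mask S(c)."""
            if R is not None:                              # keep the R components with highest N_c
                keep = np.zeros(N.shape, dtype=bool)
                keep[np.argsort(N)[::-1][:R]] = True
                return keep
            return N >= eps                                # threshold form of Eq. (1)

        def text_dependent_ivector(N, F, keep, T, Sigma, m):
            """N: (C,), F: (C, D), keep: (C,) mask, T: (C, D, r),
            Sigma: (C, D) diagonal covariances, m: (C, D) UBM means."""
            C, D, r = T.shape
            precision = np.eye(r)                          # identity prior I
            linear = np.zeros(r)
            for c in range(C):
                if not keep[c]:
                    continue                               # S(c) = 0: component excluded from adaptation
                Tc_w = T[c] / Sigma[c][:, None]            # Sigma_c^{-1} T_c
                precision += N[c] * T[c].T @ Tc_w          # accumulate N_c T_c^T Sigma_c^{-1} T_c
                linear += Tc_w.T @ (F[c] - N[c] * m[c])    # accumulate T_c^T Sigma_c^{-1} (F_c - N_c m_c)
            return np.linalg.solve(precision, linear)      # posterior mean = text-dependent i-vector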

        3 Lexicon-based L-vector

        Although the improved i-vector can be regarded as a text-dependent local representation in the GMM space, it aims to discriminate speaker identity and cannot discriminate lexical diversity well. A lexicon-based binary vector termed the L-vector is constructed for this purpose.

        Utilizing the same S(c) as in (1), the L-vector can be written as:

        L_c = S(c),  c = 1, 2, …, C    (3)

        where the subscript c indexes the Gaussian components, the number of 1s in L is equal to R, and the dimensionality of the L-vector equals the total number of Gaussian components C. The L-vector indicates which Gaussian components are activated by a training utterance, and it thereby encodes the lexical information of the utterance.
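
        Continuing the sketch above, the L-vector is simply the mask S(c) written out as a C-dimensional binary vector; the helper below is again illustrative only.

        import numpy as np

        def l_vector(keep: np.ndarray) -> np.ndarray:
            """keep: (C,) boolean mask from select_components; returns the C-dimensional L-vector."""
            return keep.astype(np.float64)      # exactly R entries are 1, the rest are 0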

        4 Improved cosine distance kernel

        Given an enrollment utterance u1 and a test utterance u2, the corresponding speaker models λ can be represented as:

        λ_i = { w(u_i), L(u_i) },  i = 1, 2    (4)

        To calculate the similarity between u1 and u2, the improved cosine distance kernel can be written as:

        (5)
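
        The exact form of Eq. (5) is not reproduced in this text, so the sketch below should be read as one plausible combination rather than the authors' definition: it multiplies the cosine similarity of the two i-vectors by the cosine similarity of the two L-vectors, rewarding agreement in both speaker identity and lexical content.

        import numpy as np

        def cosine(a: np.ndarray, b: np.ndarray) -> float:
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

        def improved_kernel(w1, L1, w2, L2) -> float:
            """w1, w2: i-vectors (e.g. after LDA); L1, L2: L-vectors of the two utterances."""
            return cosine(w1, w2) * cosine(L1, L2)   # assumed combination, not the published Eq. (5)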

        5 Experiments and results

        All experiments were carried out on part 1 and part 2 of the Robust Speaker Recognition 2015 (RSR2015) corpus[5-6], which is designed for text-dependent speaker recognition in scenarios based on fixed pass-phrases (part 1) and fixed commands (part 2). It contains audio recordings from 300 speakers, 143 female and 157 male, aged between 17 and 42, and is divided into background (bkg), development (dev) and evaluation (eval) subsets. Among the 300 speakers, 50 male and 47 female speakers are in the background set, 50/47 in the development set, and 57/49 in the evaluation set.

        Our experiments used MFCCs (19 cepstral coefficients together with log energy) as the short-term speaker features, with speech/silence segmentation performed by an energy-based voice activity detector (VAD). A 25 ms Hamming window with a 10 ms shift was used. The 20-dimensional feature vector was normalized by cepstral mean subtraction (CMS), and 20 first-order Δ and 10 second-order ΔΔ coefficients were appended, for a total dimension of 50.
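
        A hedged sketch of this 50-dimensional front-end is given below, with librosa standing in for the authors' feature extractor. Librosa's zeroth cepstral coefficient is used in place of the log-energy term, the energy-based VAD is omitted, and the choice of which 10 coefficients receive second-order deltas is an assumption, since the paper does not specify it.

        import numpy as np
        import librosa

        def front_end(wav_path: str, sr: int = 16000) -> np.ndarray:
            y, sr = librosa.load(wav_path, sr=sr)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                        n_fft=int(0.025 * sr),       # 25 ms window
                                        hop_length=int(0.010 * sr),  # 10 ms shift
                                        window="hamming")
            mfcc -= mfcc.mean(axis=1, keepdims=True)                 # cepstral mean subtraction (CMS)
            d1 = librosa.feature.delta(mfcc, order=1)                # 20 first-order deltas
            d2 = librosa.feature.delta(mfcc, order=2)[:10]           # 10 second-order deltas (first 10 coeffs)
            return np.vstack([mfcc, d1, d2]).T                       # (T, 50) feature matrix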

        Gender-dependent universal background models (UBM) with 512 Gaussian components were trained on the bkg set. Gender-dependent total factor matrices with a rank of 300 were trained on the combination of the bkg and dev sets. In the back-end support vector machine (SVM) classification system, the speaker models λ extracted from the bkg set were used as impostor models to train the SVM. Linear discriminant analysis (LDA) was applied as a channel compensation technique before SVM training; LDA was estimated on the combination of the bkg and dev sets, and the optimal LDA dimension in our experiments is 260. The eval set was used to evaluate system performance. The evaluations on part 1 and part 2 were independent, and the corpora of part 1 and part 2 do not overlap. Two types of trials from the evaluation protocol described in [6] were used: CLIENT-wrong (the test utterance is spoken by the target user with a wrong pass-phrase) and IMP-true (the test utterance is spoken by an impostor with the correct pass-phrase). As mentioned in the previous section, the only parameter that has to be empirically tuned in our system is R, which ranges from 350 to 512.
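
        The back-end can be sketched with scikit-learn's support for precomputed kernels: the target speaker's enrollment model is the single positive example, the λ models from the bkg set act as impostor (negative) examples, and the Gram matrix is filled with the improved kernel from Section 4. Names and data layout are illustrative assumptions.

        import numpy as np
        from sklearn.svm import SVC

        def train_speaker_svm(enroll, impostors):
            """enroll: (w, L) pair of the target speaker; impostors: list of (w, L) pairs from bkg."""
            models = [enroll] + list(impostors)
            y = np.array([1] + [0] * len(impostors))
            gram = np.array([[improved_kernel(w1, L1, w2, L2) for (w2, L2) in models]
                             for (w1, L1) in models])
            svm = SVC(kernel="precomputed").fit(gram, y)
            return svm, models

        def score_trial(svm, models, test):
            """test: (w, L) pair of the test utterance; returns the SVM decision score."""
            w_t, L_t = test
            k = np.array([[improved_kernel(w_t, L_t, w, L) for (w, L) in models]])
            return float(svm.decision_function(k)[0])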

        Results are given in terms of equal error rate (EER) and decision cost function (DCF). Tables 1 and 2 present the results of the traditional text-independent i-vector baseline system and of our lexicon-based text-dependent i-vector system.

        The results in Tables 1 and 2 show that as the value of R decreases from 512 to 430 (for CLIENT-wrong) or to 450 (for IMP-true), the system gains a significant performance improvement. In the CLIENT-wrong trials, the best improvement is obtained when R is set to 430: our lexicon-based text-dependent i-vector system achieves a relative improvement of 26% on part 1 and 28% on part 2 for male trials, and 30% on part 1 and 20% on part 2 for female trials. In the IMP-true trials, the lexical contents of the target speaker and the impostor are identical, so the improvement is less pronounced; the best result is obtained when R is set to 450, with a relative improvement of 9.7% on part 1 and 9.5% on part 2 for male trials, and 15% on part 1 and 10.9% on part 2 for female trials. In a real application, setting R to 430 yields a globally optimal performance for our text-dependent i-vector system.

        6 Conclusion

        We have proposed a lexicon-based local representation algorithm for a text-dependent i-vector speaker verification system. A subset of the Gaussian components that is most relevant to the lexical information is selected, and the text-dependent i-vector of an enrollment or test utterance is extracted based on this subset. Moreover, a lexicon-based L-vector is constructed to discriminate lexical diversity, and an improved cosine kernel is designed to measure the similarity of both speaker identity and lexical content between two utterances. Experimental results show that a relative improvement of up to 30% in EER can be obtained compared to the traditional text-independent i-vector system. Since the system still depends strongly on the empirically chosen value of R, our future work will focus on adaptive approaches for tuning R automatically from the speaker data.

        References:

        [1] Dehak N, Kenny P, Dehak R, et al. Front-end factor analysis for speaker verification[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(4):788-798.

        [2] Aronowitz H. Text dependent speaker verification using a small development set[C]//Odyssey: The Speaker and Language Recognition Workshop. Singapore, 2012.

        [3] Li W, Fu T F, Zhu J, et al. Sparsity analysis and compensation for i-vector based speaker verification[M]//Ronzhin A, Potapova R, Fakotakis N. Speech and Computer. Springer International Publishing, 2015:381-388.

        [4] Kenny P, Boulianne G, Dumouchel P. Eigenvoice modeling with sparse training data[J]. IEEE Transactions on Speech and Audio Processing, 2005, 13(3):345-354.

        [5] Larcher A, Lee K A, Ma B, et al. Phonetically-constrained PLDA modeling for text-dependent speaker verification with multiple short utterances[C]//Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver: IEEE, 2013:7673-7677.

        [6] Larcher A, Lee K A, Ma B, et al. RSR2015: database for text-dependent speaker verification using multiple pass-phrases[C]//Proceedings of Interspeech. 2012.

        (Executive editor: BAO Zhenyu)

        An L-vector representation and an improved cosine distance kernel for text-dependent speaker verification

        LI Wei1, YOU Hanxu1, ZHU Jie1, CHEN Ning2

        (1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)

        Key words: text-dependent speaker verification; i-vector; L-vector; cosine kernel

        Abstract: An i-vector extraction method and an L-vector representation for text-dependent speaker verification are proposed. An utterance used for enrollment or verification is represented jointly by the i-vector and the L-vector. An improved kernel for back-end support vector machine (SVM) classification is also proposed, which can simultaneously discriminate differences in speaker identity and differences in text content. System performance is evaluated on part 1 and part 2 of the RSR2015 corpus, and the experimental results show that the improved algorithm achieves up to 30% relative improvement over the traditional i-vector baseline system.

        CLC number:TP 912.3

        Document code: A

        Article ID: 1000-5137(2016)02-0243-05

        Received date:2016-02-29

        Foundation item: This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61271349, 61371147 and 11433002, and by the Shanghai Jiao Tong University joint research fund for Biomedical Engineering under Grant YG2012ZD04.

        Corresponding author: ZHU Jie, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, No. 800, Dongchuan Rd., Shanghai 200240, China. E-mail: zhujie@sjtu.edu.cn
