

        A novel L-vector representation and improved cosine distance kernel for Text-dependent Speaker Verification

        LI Wei1, YOU Hanxu1, ZHU Jie1, CHEN Ning2

        (1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)

        Abstract: A text-dependent i-vector extraction scheme and a lexicon-based binary vector (L-vector) representation are proposed to improve the performance of text-dependent speaker verification. An utterance used for enrollment or test is represented by these two vectors. An improved cosine distance kernel combining the i-vector and the L-vector is constructed to discriminate both speaker identity and lexical (text) diversity with a back-end support vector machine (SVM). Experiments are conducted on part 1 and part 2 of the RSR2015 corpus. The results indicate that a relative improvement of up to 30% can be obtained over the traditional i-vector baseline.

        Key words: text-dependent speaker verification; i-vector; L-vector; cosine distance kernel

        1 Introduction

        In recent years, the i-vector based framework has demonstrated state-of-the-art performance in text-independent speaker verification[1]. Each utterance, whether for enrollment or test, is projected onto a low-rank total factor space and represented by a low-dimensional identity vector termed the i-vector. It is commonly thought that the i-vector captures the speaker- and channel-dependent information in an utterance well, and it corresponds to a global adaptation in the Gaussian Mixture Model (GMM) subspace. However, its applicability has not been widely accepted in text-dependent speaker verification[2], mainly for two reasons. Firstly, the i-vector cannot explicitly represent the lexical information of an utterance. Secondly, since utterances are very short in text-dependent speaker verification, short-term speaker features such as Mel Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP) features can only activate a subset of the Gaussian components, so it is not appropriate to globally adapt all of them.

        To cope with these two shortcomings, we firstly propose a text-dependent i-vector extraction scheme: only those Gaussian components with sufficiently many speaker frames are retained, and i-vector adaptation is performed on this subset. Secondly, a lexicon-based binary vector termed the L-vector is constructed to model the distribution of the zero-order Baum-Welch statistics, which captures the lexical information in an utterance. Finally, an improved cosine distance kernel combining the i-vector and the L-vector is constructed to measure the diversity of both speaker identity and lexical (text) content.

        2 Text-dependent i-vector extraction

        Given the speaker frame set of an utterance, we regard the corresponding zero-order Baum-Welch statistic N_c as a metric of how many frames are assigned to each Gaussian component, where c indexes the Gaussian components. According to [3], an extremely short utterance (less than 10 s) leads to an imbalanced distribution of the zero-order Baum-Welch statistics: the 50% of Gaussian components with the highest N_c already capture more than 90% of the speaker frames. In text-dependent speaker verification, enrollment and test utterances are also very short; moreover, a scarce N_c may lead to a biased estimate of the first-order Baum-Welch statistic F_c[3]. Hence it is more appropriate to perform i-vector adaptation within a subset of the Gaussian components.
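
        As a concrete illustration of the quantities involved, the following Python sketch computes the zero-order statistics N_c of one utterance from the per-frame component posteriors of a GMM/UBM. It is a minimal example, not the authors' implementation; scikit-learn's GaussianMixture merely stands in for whatever UBM toolkit is actually used.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def zero_order_stats(ubm: GaussianMixture, frames: np.ndarray) -> np.ndarray:
            """frames: (T, D) short-term features of one utterance.
            Returns N of shape (C,), where N[c] = sum_t P(c | x_t)."""
            gamma = ubm.predict_proba(frames)   # (T, C) per-frame component posteriors
            return gamma.sum(axis=0)            # (C,) zero-order Baum-Welch statistics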

        In order to select the Gaussian components with the highest N_c, a threshold function is defined as:

        S(c) = 1, if N_c ≥ ε;  S(c) = 0, otherwise    (1)

        where ε is an empirically tuned factor that adjusts the number of Gaussian components to be retained. With this filtering scheme we select the subset of Gaussian components with the highest N_c. In practice, we usually pay more attention to the number of Gaussians in the subset, which we denote by R. The text-dependent i-vector extraction can then be written as:

        w(u) = ( I + ∑_{c: S(c)=1} N_c(u) T_c^T Σ_c^{-1} T_c )^{-1} ∑_{c: S(c)=1} T_c^T Σ_c^{-1} ( F_c(u) - N_c(u) m_c )    (2)

        where u denotes the utterance involved, I is the identity matrix acting as the prior, T_c is the sub-matrix of the c-th block of the total factor matrix T, Σ_c and m_c are the speaker- and text-independent covariance matrix and mean vector of the c-th Gaussian component, and C is the total number of Gaussian components. Compared to traditional i-vector extraction[4], the S(c) filtering mechanism ensures that only the components representing the lexical information of the utterance are involved in the adaptation.
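
        To make the filtering concrete, the sketch below first builds the mask S(c), either by thresholding N_c with ε or, as is more convenient in practice, by keeping the R components with the highest N_c, and then evaluates the standard i-vector posterior mean of [4] restricted to the retained components, which is our reading of Eq. (2). All names and shapes are illustrative assumptions rather than the authors' released code, and Σ_c is assumed diagonal.

        import numpy as np

        def select_components(N: np.ndarray, eps: float = None, R: int = None) -> np.ndarray:
            """N: (C,) zero-order statistics. Returns the boolean mask S(c)."""
            if R is not None:                              # keep the R components with highest N_c
                keep = np.zeros(N.shape, dtype=bool)
                keep[np.argsort(N)[::-1][:R]] = True
                return keep
            return N >= eps                                # threshold form of Eq. (1)

        def text_dependent_ivector(N, F, keep, T, Sigma, m):
            """N: (C,), F: (C, D), keep: (C,) mask, T: (C, D, r),
            Sigma: (C, D) diagonal covariances, m: (C, D) UBM means."""
            C, D, r = T.shape
            precision = np.eye(r)                          # identity prior I
            linear = np.zeros(r)
            for c in range(C):
                if not keep[c]:
                    continue                               # S(c) = 0: component excluded from adaptation
                Tc_w = T[c] / Sigma[c][:, None]            # Sigma_c^{-1} T_c
                precision += N[c] * T[c].T @ Tc_w          # accumulate N_c T_c^T Sigma_c^{-1} T_c
                linear += Tc_w.T @ (F[c] - N[c] * m[c])    # accumulate T_c^T Sigma_c^{-1} (F_c - N_c m_c)
            return np.linalg.solve(precision, linear)      # posterior mean = text-dependent i-vector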

        3 Lexicon-based L-vector

        Although the improved i-vector can be regarded as a text-dependent local representation in the GMM space, it aims to discriminate speaker identity and cannot discriminate lexical diversity well. A lexicon-based binary vector termed the L-vector is constructed for this purpose.

        Utilizing the same S(c) as in (1), the L-vector can be written as:

        L_c = S(c),  c = 1, 2, …, C    (3)

        where the subscript c indexes the Gaussian components, the number of 1s in L is equal to R, and the dimensionality of the L-vector equals the total number of Gaussian components C. The L-vector indicates which Gaussian components are activated by a training utterance, and it thereby encodes the lexical information of the utterance.
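
        Continuing the sketch above, the L-vector is simply the mask S(c) written out as a C-dimensional binary vector; the helper below is again illustrative only.

        import numpy as np

        def l_vector(keep: np.ndarray) -> np.ndarray:
            """keep: (C,) boolean mask from select_components; returns the C-dimensional L-vector."""
            return keep.astype(np.float64)      # exactly R entries are 1, the rest are 0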

        4 Improved cosine distance kernel

        Given an enrollment utterance u1 and a test utterance u2, the corresponding speaker models λ can be represented as:

        λ_i = { w(u_i), L(u_i) },  i = 1, 2    (4)

        To calculate the similarity between u1 and u2, the improved cosine distance kernel can be written as:

        (5)
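
        The exact form of Eq. (5) is not reproduced in this text, so the sketch below should be read as one plausible combination rather than the authors' definition: it multiplies the cosine similarity of the two i-vectors by the cosine similarity of the two L-vectors, rewarding agreement in both speaker identity and lexical content.

        import numpy as np

        def cosine(a: np.ndarray, b: np.ndarray) -> float:
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

        def improved_kernel(w1, L1, w2, L2) -> float:
            """w1, w2: i-vectors (e.g. after LDA); L1, L2: L-vectors of the two utterances."""
            return cosine(w1, w2) * cosine(L1, L2)   # assumed combination, not the published Eq. (5)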

        5 Experiments and results

        All experiments were carried out on part 1 and part 2 of the Robust Speaker Recognition 2015 (RSR2015) corpus[5-6], which is designed for text-dependent speaker recognition in scenarios based on fixed pass-phrases (part 1) and fixed commands (part 2). It contains audio recordings from 300 speakers, 143 female and 157 male, aged between 17 and 42, and is divided into background (bkg), development (dev) and evaluation (eval) subsets. Among the 300 speakers, 50 male and 47 female speakers are in the background set, 50/47 in the development set, and 57/49 in the evaluation set.

        Our experiments used MFCCs (19 cepstral coefficients together with log energy) as the short-term speaker features, with speech/silence segmentation performed by an energy-based voice activity detector (VAD). A 25 ms Hamming window with a 10 ms shift was used. The 20-dimensional feature vector was normalized by cepstral mean subtraction (CMS), and 20 first-order Δ and 10 second-order ΔΔ coefficients were appended, for a total dimension of 50.
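
        A hedged sketch of this 50-dimensional front-end is given below, with librosa standing in for the authors' feature extractor. Librosa's zeroth cepstral coefficient is used in place of the log-energy term, the energy-based VAD is omitted, and the choice of which 10 coefficients receive second-order deltas is an assumption, since the paper does not specify it.

        import numpy as np
        import librosa

        def front_end(wav_path: str, sr: int = 16000) -> np.ndarray:
            y, sr = librosa.load(wav_path, sr=sr)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                        n_fft=int(0.025 * sr),       # 25 ms window
                                        hop_length=int(0.010 * sr),  # 10 ms shift
                                        window="hamming")
            mfcc -= mfcc.mean(axis=1, keepdims=True)                 # cepstral mean subtraction (CMS)
            d1 = librosa.feature.delta(mfcc, order=1)                # 20 first-order deltas
            d2 = librosa.feature.delta(mfcc, order=2)[:10]           # 10 second-order deltas (first 10 coeffs)
            return np.vstack([mfcc, d1, d2]).T                       # (T, 50) feature matrix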

        Gender-dependent universal background models (UBM) with 512 Gaussian components were trained on the bkg set. Gender-dependent total factor matrices with a rank of 300 were trained on the combination of the bkg and dev sets. In the back-end support vector machine (SVM) classification system, the speaker models λ extracted from the bkg set were used as impostor models to train the SVM. Linear discriminant analysis (LDA) was applied as a channel compensation technique before SVM training; LDA was estimated on the combination of the bkg and dev sets, and the optimal LDA dimension in our experiments is 260. The eval set was used to evaluate system performance. The evaluations on part 1 and part 2 were independent, and the corpora of part 1 and part 2 do not overlap. Two types of trials from the evaluation protocol described in [6] were used: CLIENT-wrong (the test utterance is spoken by the target user with a wrong pass-phrase) and IMP-true (the test utterance is spoken by an impostor with the correct pass-phrase). As mentioned in the previous section, the only parameter that has to be empirically tuned in our system is R, which ranges from 350 to 512.
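
        The back-end can be sketched with scikit-learn's support for precomputed kernels: the target speaker's enrollment model is the single positive example, the λ models from the bkg set act as impostor (negative) examples, and the Gram matrix is filled with the improved kernel from Section 4. Names and data layout are illustrative assumptions.

        import numpy as np
        from sklearn.svm import SVC

        def train_speaker_svm(enroll, impostors):
            """enroll: (w, L) pair of the target speaker; impostors: list of (w, L) pairs from bkg."""
            models = [enroll] + list(impostors)
            y = np.array([1] + [0] * len(impostors))
            gram = np.array([[improved_kernel(w1, L1, w2, L2) for (w2, L2) in models]
                             for (w1, L1) in models])
            svm = SVC(kernel="precomputed").fit(gram, y)
            return svm, models

        def score_trial(svm, models, test):
            """test: (w, L) pair of the test utterance; returns the SVM decision score."""
            w_t, L_t = test
            k = np.array([[improved_kernel(w_t, L_t, w, L) for (w, L) in models]])
            return float(svm.decision_function(k)[0])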

        Results are given in terms of equal error rate (EER) and decision cost function (DCF). Tables 1 and 2 present the results of the traditional text-independent i-vector baseline system and of our lexicon-based text-dependent i-vector system.

        The results in Tables 1 and 2 show that as the value of R decreases from 512 to 430 (for CLIENT-wrong) or to 450 (for IMP-true), the system gains a significant performance improvement. In the CLIENT-wrong trials, the best improvement is obtained when R is set to 430: our lexicon-based text-dependent i-vector system achieves a relative improvement of 26% on part 1 and 28% on part 2 for male trials, and 30% on part 1 and 20% on part 2 for female trials. In the IMP-true trials, the lexical contents of the target speaker and the impostor are identical, so the improvement is less pronounced; the best result is obtained when R is set to 450, with a relative improvement of 9.7% on part 1 and 9.5% on part 2 for male trials, and 15% on part 1 and 10.9% on part 2 for female trials. In a real application, setting R to 430 yields a globally optimal performance for our text-dependent i-vector system.

        6 Conclusion

        We have proposed a lexicon-based local representation algorithm for a text-dependent i-vector speaker verification system. A subset of the Gaussian components that is most relevant to the lexical information is selected, and the text-dependent i-vector of an enrollment or test utterance is extracted based on this subset. Moreover, a lexicon-based L-vector is constructed to discriminate lexical diversity, and an improved cosine kernel is designed to measure the similarity of both speaker identity and lexical content between two utterances. Experimental results show that a relative improvement of up to 30% in EER can be obtained compared to the traditional text-independent i-vector system. Since the system still depends strongly on the empirically chosen value of R, our future work will focus on adaptive approaches for tuning R automatically from the speaker data.

        References:

        [1] Dehak N, Kenny P, Dehak R, et al. Front-end factor analysis for speaker verification[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(4):788-798.

        [2] Aronowitz H. Text dependent speaker verification using a small development set[C]//Odyssey: The Speaker and Language Recognition Workshop. Singapore, 2012.

        [3] Li W, Fu T F, Zhu J, et al. Sparsity analysis and compensation for i-vector based speaker verification[M]//Ronzhin A, Potapova R, Fakotakis N. Speech and Computer. Springer International Publishing, 2015:381-388.

        [4] Kenny P, Boulianne G, Dumouchel P. Eigenvoice modeling with sparse training data[J]. IEEE Transactions on Speech and Audio Processing, 2005, 13(3):345-354.

        [5] Larcher A, Lee K A, Ma B, et al. Phonetically-constrained PLDA modeling for text-dependent speaker verification with multiple short utterances[C]//Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver: IEEE, 2013:7673-7677.

        [6] Larcher A, Lee K A, Ma B, et al. RSR2015: database for text-dependent speaker verification using multiple pass-phrases[C]//Proceedings of Interspeech. 2012.

        (Executive editor: BAO Zhenyu)

        An L-vector representation and an improved cosine distance kernel for text-dependent speaker verification

        LI Wei1, YOU Hanxu1, ZHU Jie1, CHEN Ning2

        (1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)

        Key words: text-dependent speaker verification; i-vector; L-vector; cosine kernel

        Abstract: An i-vector extraction method and an L-vector representation for text-dependent speaker verification are proposed. An utterance used for enrollment or verification is represented jointly by the i-vector and the L-vector. An improved kernel for back-end support vector machine (SVM) classification is also proposed, which can simultaneously discriminate differences in speaker identity and differences in text content. System performance is evaluated on part 1 and part 2 of the RSR2015 corpus, and the experimental results show that the improved algorithm achieves up to 30% relative improvement over the traditional i-vector baseline system.

        CLC number:TP 912.3

        Document code: A

        Article ID: 1000-5137(2016)02-0243-05

        Received date:2016-02-29

        Foundation item: This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61271349, 61371147 and 11433002, and by the Shanghai Jiao Tong University joint research fund for Biomedical Engineering under Grant YG2012ZD04.

        Corresponding author: ZHU Jie, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, No. 800, Dongchuan Rd., Shanghai 200240, China. E-mail: zhujie@sjtu.edu.cn
