姚茂建 李晗靜 呂會華 姚登峰
關(guān)鍵詞: 自然語(yǔ)言處理; 中文分詞; 神經(jīng)網(wǎng)絡(luò); 雙向長(zhǎng)短時(shí)記憶條件隨機(jī)場(chǎng); 字嵌入; 序列標(biāo)注
中圖分類號(hào): TN711?34; TP391.1 ? ? ? ? ? ? ? ? ? 文獻(xiàn)標(biāo)識(shí)碼: A ? ? ? ? ? ? ? ? ? ?文章編號(hào): 1004?373X(2019)01?0095?05
Abstract: Mainstream Chinese word segmentation methods based on supervised learning require large amounts of manually labeled corpora, and the locally extracted features suffer from sparsity. Therefore, a bidirectional long short-term memory conditional random field (BI_LSTM_CRF) model is proposed, which learns text features automatically and models context-dependent information in text. The CRF layer takes into account the tag information before and after each character in a sentence when inferring the tag sequence. The segmentation model achieves excellent results on the MSRA, PKU and CTB 6.0 datasets, and is further evaluated on news data, microblog data, automobile forum data and restaurant review data. The experimental results show that the BI_LSTM_CRF model attains high segmentation performance on the test sets and strong generalization ability in cross-domain data testing.
Keywords: natural language processing; Chinese word segmentation; neural network; bidirectional long short-term memory conditional random field; character embedding; sequence labeling
中文分詞是中文自然語(yǔ)言處理必需的過(guò)程,是進(jìn)一步進(jìn)行詞性標(biāo)注、機(jī)器翻譯、信息檢索的基礎(chǔ)。分詞效果直接影響著中文自然語(yǔ)言任務(wù)結(jié)果的好壞,所以中文分詞具有重要意義。然而中文是一種復(fù)雜的語(yǔ)言,存在一詞多意、未登錄詞、語(yǔ)句歧義現(xiàn)象,只有結(jié)合上下文信息才能有效地進(jìn)行分詞。近些年,中文分詞研究取得了持續(xù)發(fā)展。中文分詞常用的方法可以分為以下幾大類:基于規(guī)則和字典的方法、基于統(tǒng)計(jì)的方法、基于神經(jīng)網(wǎng)絡(luò)的方法。
基于規(guī)則和字典的方法主要思想是建立一個(gè)充分大的詞典,按照一定的算法策略將待分詞的字符序列與詞典中收錄的詞條進(jìn)行匹配,若在詞典中存在,則匹配成功,完成分詞[1]。但其對(duì)詞典依賴性很強(qiáng),對(duì)歧義和未登錄詞識(shí)別效果不佳等問(wèn)題?;诮y(tǒng)計(jì)的方法是基于訓(xùn)練語(yǔ)料庫(kù)來(lái)學(xué)習(xí)任意字符相鄰出現(xiàn)的概率,得到分詞模型,通過(guò)計(jì)算字符序列切分最大概率作為分詞結(jié)果[2]。該方法需要人工定義和提取特征,其性能也受到訓(xùn)練語(yǔ)料、特征設(shè)定的影響,存在特征過(guò)多、模型復(fù)雜、容易過(guò)擬合的問(wèn)題。隨著深度學(xué)習(xí)的快速發(fā)展,近年來(lái)神經(jīng)網(wǎng)絡(luò)算法被廣泛用于自然語(yǔ)言處理任務(wù)中。由于神經(jīng)網(wǎng)絡(luò)可以從原始數(shù)據(jù)中自主學(xué)習(xí)特征,不僅替代了人工提取特征的工作量,同時(shí)也避免了人為特征設(shè)定的局限性。
To improve Chinese word segmentation performance, this paper applies the BI_LSTM_CRF neural network to the segmentation task, using it to build more expressive character representations. We systematically compare the 4-tag and 6-tag word-position labeling schemes on the test sets; the experimental results show that the 6-tag scheme represents the word-position information within words better and delivers superior performance. The neural segmentation model with 6-tag labeling was further tested on news data, microblog data, automobile forum data, and restaurant review data; the results show that the BI_LSTM_CRF segmentation model also generalizes well in cross-domain testing.
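Since segmentation is cast as character-level sequence labeling, the two labeling schemes amount to different tag inventories. Below is a minimal sketch of converting a gold segmentation into tags; the 4-tag set is the usual {B, M, E, S}, and for the 6-tag set we assume the common {B, B2, B3, M, E, S} inventory (cf. reference [7]), since the paper does not spell the tag set out in this excerpt:

def tag_word(word, six_tags=False):
    """Return the word-position tags for the characters of one word."""
    n = len(word)
    if n == 1:
        return ["S"]
    if not six_tags:
        return ["B"] + ["M"] * (n - 2) + ["E"]
    # 6-tag scheme: the 2nd and 3rd characters get dedicated tags B2, B3.
    return ["B", "B2", "B3"][:n - 1] + ["M"] * max(0, n - 4) + ["E"]

def label_sentence(words, six_tags=False):
    """Flatten a segmented sentence into (characters, tags)."""
    chars, tags = [], []
    for w in words:
        chars.extend(w)
        tags.extend(tag_word(w, six_tags))
    return chars, tags

words = ["中文", "分詞", "是", "自然語言處理"]
print(label_sentence(words, six_tags=False)[1])
# ['B', 'E', 'B', 'E', 'S', 'B', 'M', 'M', 'M', 'M', 'E']
print(label_sentence(words, six_tags=True)[1])
# ['B', 'E', 'B', 'E', 'S', 'B', 'B2', 'B3', 'M', 'M', 'E']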
長(zhǎng)短時(shí)記憶(Long Short?term Memory,LSTM)網(wǎng)絡(luò)是遞歸神經(jīng)網(wǎng)絡(luò)(Recurrent Neural Network,RNN)的一種變種,在很多任務(wù)上表現(xiàn)的比RNN更好,可以學(xué)習(xí)長(zhǎng)期依賴信息。1997年,Schuster等人在LSTM網(wǎng)絡(luò)模型基礎(chǔ)上提出了雙向長(zhǎng)短時(shí)記憶(Bidirectional Recurrent Neural Networks,BI_RNN)模型,由于是雙向輸入,在記憶長(zhǎng)時(shí)信息方面比LSTM更具有優(yōu)勢(shì)。以上述神經(jīng)網(wǎng)絡(luò)為基礎(chǔ)的模型在處理與時(shí)間相關(guān)的序列任務(wù)中取得了很大的成功,通常模型都能對(duì)長(zhǎng)短時(shí)依賴信息進(jìn)行表達(dá)。
文獻(xiàn)[3]對(duì)神經(jīng)網(wǎng)絡(luò)建立概率語(yǔ)言模型,該方法對(duì)n?gram模型有顯著的改進(jìn),并且利用了較長(zhǎng)的上下文信息。文獻(xiàn)[4]使用神經(jīng)網(wǎng)絡(luò)結(jié)構(gòu)處理中文自然語(yǔ)言任務(wù),描述了一種感知器訓(xùn)練神經(jīng)網(wǎng)絡(luò)的替代算法,以加速整個(gè)訓(xùn)練過(guò)程。文獻(xiàn)[5]將LSTM網(wǎng)絡(luò)模型應(yīng)用于中文分詞中,以解決上下文長(zhǎng)距離依賴關(guān)系,并取得了不錯(cuò)的分詞效果。2016年,Yao等人提出采用BI_LSTM網(wǎng)絡(luò)模型處理中文分詞,該模型將過(guò)去和未來(lái)上下文中文信息都考慮進(jìn)去,中文分詞效果得到了提高。2017年,李雪蓮等針對(duì)LSTM神經(jīng)網(wǎng)絡(luò)模型復(fù)雜、訓(xùn)練時(shí)間長(zhǎng)等問(wèn)題,提出基于GRU(Gate Recurrent Unit)模型,使得模型訓(xùn)練更加簡(jiǎn)化并且取得了與LSTM模型相當(dāng)?shù)姆衷~效果。
This paper studies Chinese word segmentation with the BI_LSTM_CRF neural network. In the experiments, the model was tested on the MSRA, PKU, and CTB 6.0 datasets and the 4-tag and 6-tag labeling schemes were compared; the results show that the 6-tag model achieves better segmentation performance. The 6-tag model was then tested on data from several domains, namely news, microblog, automobile forum, and restaurant review data; the results show that it also performs well on cross-domain Chinese word segmentation, demonstrating good generalization ability.
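Segmentation performance in experiments like these is conventionally scored with word-level precision, recall, and F1 over predicted word spans; the paper does not include evaluation code, so the following is a minimal sketch of that standard metric under that assumption:

def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 between gold and predicted spans."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = ["中文", "分詞", "是", "基礎"]
pred = ["中文", "分", "詞", "是", "基礎"]
print(prf(gold, pred))  # (0.6, 0.75, 0.666...)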
Note: The corresponding author of this paper is 李晗靜.
References
[1] WU A. Word segmentation in sentence analysis [C]// Proceedings of 1998 International Conference on Chinese Information Processing. Beijing: Chinese Information Society, 1998: 1-10.
[2] LAFFERTY J D, MCCALLUM A, PEREIRA F C N. Conditional random fields: probabilistic models for segmenting and labeling sequence data [C]// Proceedings of the Eighteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann Publishers Inc., 2001: 282-289.
[3] BENGIO Y, VINCENT P, JANVIN C. A neural probabilistic language model [J]. Journal of machine learning research, 2003, 3(6): 1137-1155.
[4] ZHENG X, CHEN H, XU T. Deep learning for Chinese word segmentation and POS tagging [C]// 2013 Conference on Empirical Methods in Natural Language Processing. Seattle: Association for Computational Linguistics, 2013: 647-657.
[5] CHEN X, QIU X, ZHU C, et al. Long short-term memory neural networks for Chinese word segmentation [C]// 2015 Conference on Empirical Methods in Natural Language Processing. [S.l.: s.n.], 2015: 1197-1206.
[6] GRAVES A. Long short-term memory [M]// Supervised sequence labelling with recurrent neural networks. Berlin: Springer, 2012: 37-45.
[7] ZHAO H, HUANG C N, LI M, et al. An improved Chinese word segmentation system with conditional random field [C]// Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing. [S.l.: s.n.], 2006: 162-165.
[8] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [EB/OL]. [2013-09-07]. http://www.surdeanu.info/mihai/teaching/ista555-spring15/readings/mikolov2013.pdf.
[9] LAI S, LIU K, HE S, et al. How to generate a good word embedding [J]. IEEE intelligent systems, 2016, 31(6): 5-14.
[10] YAO Y, HUANG Z. Bi-directional LSTM recurrent neural network for Chinese word segmentation [C]// 2016 International Conference on Neural Information Processing. Berlin: Springer, 2016: 345-353.
[11] STRUBELL E, VERGA P, BELANGER D, et al. Fast and accurate entity recognition with iterated dilated convolutions [C]// Proceedings of 2017 Conference on Empirical Methods in Natural Language Processing. [S.l.: s.n.], 2017: 2664-2669.