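A French Named Entity Recognition Model Based on Deep Neural Networks

2019-08-01 01:48:57 YAN Hong, CHEN Xingshu, WANG Wenxian, WANG Haizhou, YIN Mingyong

Journal of Computer Applications, 2019, Issue 5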

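Abstract: Existing work on French named entity recognition (NER) falls into two groups: machine learning models that rely mostly on the character-level morphological features of words, and multilingual general-purpose NER models that rely on the semantic features carried by character and word embeddings. Neither considers semantic, morphological, and grammatical features together. To address this gap, a deep neural network model for French NER, CGC-fr, was designed. First, word embeddings, character embeddings, and grammatical feature vectors are extracted from the text. Next, a convolutional neural network (CNN) extracts each word's character features from its character embedding sequence. Finally, a bidirectional gated recurrent unit network (BiGRU) with a conditional random field (CRF) classifier identifies the named entities in French text from the word embeddings, character features, and grammatical feature vectors. In the experiments, CGC-fr reached an F1 score of 82.16% on the test set, which is 5.67, 1.79, and 1.06 percentage points higher than the machine learning model NERC-fr and the multilingual neural network models LSTM-CRF and Char attention, respectively. The results indicate that CGC-fr, which fuses the three kinds of features, has an advantage over the other models.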

Keywords: named entity recognition; French; deep neural network; natural language processing; sequence labeling

CLC number: TP391.1

Document code: A

Abstract: In existing French Named Entity Recognition (NER) research, machine learning models mostly use the character morphological features of words, while multilingual generic named entity models use the semantic features represented by character and word embeddings; neither takes semantic, character morphological, and grammatical features into account together. To address this shortcoming, a deep neural network based model, CGC-fr, was designed to recognize French named entities. Firstly, word embeddings, character embeddings, and grammar feature vectors were extracted from the text. Then, character features were extracted from the character embedding sequence of each word by a Convolutional Neural Network (CNN). Finally, a Bidirectional Gated Recurrent Unit network (BiGRU) and a Conditional Random Field (CRF) were used to label named entities in French text according to the word embeddings, character features, and grammar feature vectors. In the experiments, the F1 value of CGC-fr reached 82.16% on the test set, which is 5.67, 1.79, and 1.06 percentage points higher than that of the NERC-fr, LSTM-CRF (where LSTM is a Long Short-Term Memory network), and Char attention models, respectively. The experimental results show that CGC-fr, which combines the three kinds of features, has an advantage over the other models.

Key words: Named Entity Recognition (NER); French; neural network; Natural Language Processing (NLP); sequence labeling

0 Introduction

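Named entity recognition (NER) is the process of identifying the names or symbols of specific types of things in text [1]. It extracts the more meaningful items, such as person, organization, and location names, so that downstream natural language processing tasks can use the named entities to obtain the information they need. With globalization, information exchange between countries is increasingly frequent. Compared with Chinese-language information, foreign-language information has more influence on how other countries view China, and multilingual public opinion analysis has emerged in response. French is relatively influential among non-English languages, and French text is an important target of multilingual public opinion analysis. As a basic task of French text analysis, French NER is therefore important.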

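Little research has focused specifically on French NER. Early work was mainly based on rules and dictionaries [2]; later work typically fed hand-selected features into machine learning models to identify the named entities in text [3-7]. Azpeitia et al. [3] proposed the NERC-fr model, which uses a maximum entropy method to recognize French named entities, with features including word suffixes, character windows, neighboring words, word prefixes, word length, and whether the first letter is capitalized. The method achieved good results, but most of its features describe the morphological structure of words rather than their semantics, and the lack of semantic features may have limited its recognition accuracy.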
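
As an illustration of the kind of hand-selected, shape-based features listed above, the following minimal Python sketch builds a feature dictionary per token. It is not the NERC-fr implementation: the function name `token_features`, the affix length of 3, the `<BOS>`/`<EOS>` placeholders, and the example sentence are all assumptions made for this example, and the character-window feature is left out for brevity.

```python
def token_features(tokens, i, affix_len=3):
    """Return shape-based features for tokens[i] (hypothetical helper)."""
    word = tokens[i]
    return {
        "prefix": word[:affix_len],            # word prefix
        "suffix": word[-affix_len:],           # word suffix
        "length": len(word),                   # word length
        "is_capitalized": word[:1].isupper(),  # first letter capitalized?
        # Neighboring words, with placeholders at the sentence boundaries.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }


# Illustrative sentence, not taken from the paper's dataset.
sentence = ["Emmanuel", "Macron", "visite", "Lyon"]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

Every value here is derived from the surface form of the word or its neighbors, which is why such a feature set carries little semantic information.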

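In recent years deep neural networks have performed well in natural language processing. Hammerton [8] applied a long short-term memory network (LSTM) to English NER. Rei et al. [9] proposed the multilingual Char attention model, which uses an attention mechanism to combine word embeddings and character embeddings and feeds the result as features into a bidirectional long short-term memory network (BiLSTM) to obtain named entities through sequence labeling. Lample et al. [10] proposed the LSTM-CRF model, a BiLSTM followed by a conditional random field (CRF); it is also multilingual and uses character and word embeddings as features to recognize English named entities. When LSTM-CRF is applied to French, however, its results fall well short of those on English. A possible reason is that it does not use the grammatical features of the language, given that French grammar is considerably more complex than English grammar.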

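To take semantic, character-morphological, and grammatical features into account together and extract French named entities more accurately, this paper designs the CGC-fr model. The model uses word embeddings to represent the semantic features of the words in the text and a convolutional neural network (CNN) to extract the character-morphological features contained in the character embeddings. These are concatenated with pre-extracted French grammatical features and fed into a composite network that combines a bidirectional gated recurrent unit network (BiGRU) with a conditional random field to recognize French named entities. CGC-fr makes full use of these features: experiments measure the contribution of each feature, and comparisons with other models show that CGC-fr, which fuses the three kinds of features, has an advantage. In addition, this paper contributes a French dataset of 1005 articles and 29016 entities, adding to the data available for French NER so that later studies are less constrained by the lack of datasets.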
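
The feature-fusion step described above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the function names, the filter width of 2, the vector sizes, and all numeric values are hypothetical, the convolution has no bias or nonlinearity, and the grammatical feature is assumed here to be a one-hot vector. The BiGRU and CRF layers that would consume the resulting vectors are not shown.

```python
def char_cnn(char_vectors, filters, width=2):
    """1-D convolution over character vectors followed by max-over-time pooling.

    Assumes the word has at least `width` characters; a real model would pad.
    """
    # Flatten each window of `width` consecutive character vectors.
    windows = [
        [x for vec in char_vectors[i:i + width] for x in vec]
        for i in range(len(char_vectors) - width + 1)
    ]
    # One output per filter: the maximum dot product over all windows.
    return [
        max(sum(w * x for w, x in zip(filt, window)) for window in windows)
        for filt in filters
    ]


def word_representation(word_emb, char_vectors, grammar_vec, filters):
    """Concatenate semantic, character-level, and grammatical features."""
    return word_emb + char_cnn(char_vectors, filters) + grammar_vec


# Toy inputs (all values are made up for illustration).
chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]         # 3 characters, dimension 2
filters = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]  # 2 filters of size 2 x 2
word_emb = [0.5, -0.5]                                # word embedding
grammar_vec = [0.0, 1.0, 0.0]                         # assumed one-hot grammar tag

rep = word_representation(word_emb, chars, grammar_vec, filters)
```

In this sketch each word ends up with a single vector of length 7 (2 + 2 + 3); a sequence of such vectors, one per word, is what a recurrent layer would then read in both directions.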

4 Conclusion

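This paper designed CGC-fr, a deep neural network model for French named entity recognition, and built a French NER dataset. CGC-fr takes the word embeddings of the words in French text as semantic features, extracts the morphological-structure features of each word from its character embedding sequence, and combines these with grammatical features to recognize named entities. This increases the diversity of features over traditional statistical machine learning methods, enriches what the features capture, and avoids the neglect of French grammar in multilingual general-purpose methods. Experiments compared the contribution of each feature in the model and verified their effectiveness. CGC-fr was also compared with the maximum entropy model NERC-fr and the multilingual models Char attention and LSTM-CRF. The results show that CGC-fr achieves a higher F1 score than all three, which verifies the effectiveness of fusing the three kinds of features for French NER and further improves the recognition rate of French named entities.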

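The model also has shortcomings. First, in French text its recognition rate for organization names lags well behind that for the other two entity types; the model does not perform well on entity types whose surface forms vary widely. Second, compared with the relatively high NER accuracy achieved for English, French NER still has considerable room for improvement.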

References

[1] NADEAU D, SEKINE S. A survey of named entity recognition and classification[J]. Lingvisticae Investigationes, 2007, 30(1): 3-26.

[2] WOLINSKI F, VICHOT F, DILLET B. Automatic processing of proper names in texts[C]// Proceedings of the 7th Conference on European Chapter of the Association for Computational Linguistics. San Francisco, CA: Morgan Kaufmann Publishers, 1995: 23-30.

[3] AZPEITIA A, CUADROS M, GAINES S, et al. NERC-fr: supervised named entity recognition for French[C]// TSD 2014: Proceedings of the 2014 International Conference on Text, Speech and Dialogue. Berlin: Springer, 2014: 158-165.

[4] POIBEAU T. The multilingual named entity recognition framework[C]// Proceedings of the 10th Conference on European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2003: 155-158.

[5] PETASIS G, VICHOT F, WOLINSKI F, et al. Using machine learning to maintain rule-based named-entity recognition and classification systems[C]// Proceedings of the 39th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2001: 426-433.

[6] WU D, NGAI G, CARPUAT M. A stacked, voted, stacked model for named entity recognition[C]// Proceedings of the 7th Conference on Natural Language Learning at HLT. Stroudsburg, PA: Association for Computational Linguistics, 2003: 200-203.

[7] NOTHMAN J, RINGLAND N, RADFORD W, et al. Learning multilingual named entity recognition from Wikipedia[J]. Artificial Intelligence, 2013, 194: 151-175.

[8] HAMMERTON J. Named entity recognition with long short-term memory[C]// Proceedings of the 7th Conference on Natural Language Learning at HLT. Stroudsburg, PA: Association for Computational Linguistics, 2003: 172-175.

[9] REI M, CRICHTON G, PYYSALO S. Attending to characters in neural sequence labeling models[J/OL]. arXiv Preprint, 2016, 2016: arXiv:1611.04361[2016-11-14]. https://arxiv.org/abs/1611.04361.

[10] LAMPLE G, BALLESTEROS M, SUBRAMANIAN S, et al. Neural architectures for named entity recognition[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2016: 260-270.

[11] LE Q, MIKOLOV T. Distributed representations of sentences and documents[C]// Proceedings of the 31st International Conference on Machine Learning. New York: JMLR.org, 2014: 1188-1196.

[12] PENNINGTON J, SOCHER R, MANNING C. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1532-1543.

[13] SANTOS C D, ZADROZNY B. Learning character-level representations for part-of-speech tagging[C]// Proceedings of the 31st International Conference on Machine Learning. New York: JMLR.org, 2014: 1818-1826.

[14] CHO K, van MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1724-1734.

[15] SANG E F, VEENSTRA J. Representing text chunks[C]// Proceedings of the 9th Conference on European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 1999: 173-179.
