Bursty Topic Discovery in Short Text Streams: An Improved BBTM Algorithm

2017-03-24 13:16:27 Lin Te

Computer Knowledge and Technology (电脑知识与技术), 2017, Issue 1

Lin Te (林特)

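Abstract: The BBTM model overcomes data sparsity and redundancy and is an effective method for discovering bursty topics in short text streams. However, its method for quantifying the burst probability of biterms is crude and partly unreasonable, and it gives biased estimates of the burst probability of biterms associated with periodic topics. This paper therefore proposes an improved method that combines automaton-based burst enumeration for biterms with the normal distribution. Experiments show that the method supplies more accurate prior knowledge for modeling, which improves the model's sensitivity to bursty topics and the accuracy of topic extraction.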

Keywords: short text; burstiness; BBTM; burst enumeration; normal distribution

CLC number: TP181  Document code: A  Article ID: 1009-3044(2017)01-0248-03

Abstract: BBTM is an effective model for bursty topic discovery in short text streams that handles data sparsity and redundancy well. However, the method BBTM uses to quantify the burstiness of biterms is poor and irrational, and it makes wrong assumptions about the bursty probability of biterms related to periodic topics. An improved algorithm is therefore presented, based on enumerating bursty biterms with a state automaton and on the normal distribution. Experiments show that the improved algorithm gives more precise prior knowledge for modeling and thus raises the sensitivity and accuracy of bursty topic discovery.

Key words: short texts; bursty; BBTM; enumerating bursts; normal distribution

1 Overview

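In recent years, short-text data has flooded social network platforms, and a large number of bursty topics lie hidden in it; such topics are often closely tied to trending events on social networks. Bursty topic discovery in short text streams distills massive volumes of online text and provides an indispensable research basis for public opinion analysis, business intelligence, and news storyline tracking. However, the large amount of redundant information makes bursty topics harder to discover, and the sparsity of short texts significantly affects the accuracy of topic extraction.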

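Previous work has extracted bursty topics from text streams mainly in two ways. One classic approach first detects bursty features in the text and then clusters them [1][2][3]. However, bursty features are ambiguous, which significantly affects clustering quality, so complex heuristic tuning and post-processing become indispensable. In addition, representing a topic by bursty features alone loses basic information in the text and makes the topic hard to understand and interpret. The other approach extracts bursty topics with topic models [4]. Conventional topic models, however, are designed to reveal the main topics of a text collection and cannot be used directly to extract bursty topics, so post-processing is still indispensable [5][6]. Because most main topics are not bursty, heuristic post-processing cannot make up for the shortcomings of the model itself. Yan et al. proposed a topic model aimed specifically at bursty topic discovery, the BBTM model [7]. Its core idea is to quantify the burst probability of biterms and use it as prior knowledge for BTM modeling.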
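To make the notion of a biterm concrete: in BTM-style models a biterm is an unordered pair of words that co-occur in the same short text. The following minimal sketch counts biterms per time slice; the function names and the toy documents are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return the unordered word pairs (biterms) co-occurring in one short text."""
    # Sort each pair so that (a, b) and (b, a) count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def count_biterms(docs):
    """Count biterm occurrences over all tokenized documents in one time slice."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Hypothetical toy time slice with two already-tokenized short texts.
slice_docs = [["earthquake", "japan", "tsunami"], ["japan", "earthquake"]]
counts = count_biterms(slice_docs)
```

Per-slice counts of this kind are the raw material from which a burst probability for each biterm can be estimated.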

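The algorithm BBTM uses to quantify burst probability is unreasonable: the burst probability of any biterm is always smaller than its non-burst probability, and this error in the prior knowledge makes the model more inclined to assign biterms to non-bursty topics than to bursty topics. This paper introduces Kleinberg's burst enumeration algorithm [8] to assess the burst state of biterms and defines a method for quantifying burst probability, improving on the quantification method in BBTM.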
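As a rough illustration of burst enumeration with a two-state automaton in the spirit of Kleinberg [8], the sketch below labels each time slice of a single biterm as base (0) or burst (1) by finding the cheapest state sequence with the Viterbi algorithm. The parameter names `s` and `gamma`, their default values, and the flat cost `gamma` for entering the burst state are assumptions made for this sketch; the parameterization actually used in the paper is not shown in this text.

```python
import math


def detect_bursts(r, d, s=2.0, gamma=1.0):
    """Label each time slice 0 (base state) or 1 (burst state).

    r[t]: documents containing the biterm in slice t
    d[t]: all documents in slice t
    """
    if sum(r) == 0:
        return [0] * len(r)
    p0 = min(sum(r) / sum(d), 0.9999)   # baseline rate of relevant documents
    p1 = min(s * p0, 0.9999)            # elevated rate in the burst state
    probs = (p0, p1)

    def emit_cost(state, rt, dt):
        # Negative log binomial likelihood of rt hits in dt documents.
        p = probs[state]
        log_comb = (math.lgamma(dt + 1) - math.lgamma(rt + 1)
                    - math.lgamma(dt - rt + 1))
        return -(log_comb + rt * math.log(p) + (dt - rt) * math.log(1 - p))

    cost = [0.0, math.inf]              # the automaton starts in the base state
    back = []
    for rt, dt in zip(r, d):
        new_cost, pointers = [], []
        for state in (0, 1):
            # Entering the burst state costs gamma; every other move is free.
            cand = [cost[prev] + (gamma if prev == 0 and state == 1 else 0.0)
                    for prev in (0, 1)]
            best_prev = 0 if cand[0] <= cand[1] else 1
            new_cost.append(cand[best_prev] + emit_cost(state, rt, dt))
            pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards from the cheaper final state.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    return states[::-1]


# Hypothetical counts: the biterm spikes in the fourth and fifth slices.
states = detect_bursts([1, 1, 1, 10, 12, 1, 1], [100] * 7)
```

A larger `gamma` makes the automaton more reluctant to enter the burst state, so short spikes are ignored; a larger `s` demands a stronger rise above the baseline rate before a slice counts as bursty.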

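The Novelty score of the OBTM model is below 0.2 in every time slice, far below the sensitivity to bursty topics of the other three models. BBTM is thus better suited to bursty topic discovery than conventional topic models, while the improved method proposed in this paper has a relatively stable Novelty score across time slices and shows better sensitivity than the original BBTM model.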

5 Conclusion

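This paper presents an effective way to combine bursty feature extraction with topic modeling. The improved BBTM method first estimates the burst state of each biterm by introducing burst enumeration for biterms, and then fits the burst probability of each biterm with the cumulative distribution function of the normal distribution. This gives the model more accurate prior knowledge than BBTM provides, and thereby improves its sensitivity to bursty topics and the accuracy of topic extraction.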
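The exact way the normal cumulative distribution function is applied is not given in this text. Purely as an illustration of the idea, the sketch below standardizes a biterm's current count against its historical counts and maps the result to a value in [0, 1] with the normal CDF. The function names, the use of the historical mean and standard deviation, and the `min_sigma` floor are all assumptions, not the paper's definition.

```python
import math
from statistics import mean, pstdev


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of the normal distribution N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def burst_probability(current, history, min_sigma=1.0):
    """Map the current count of a biterm to a burst probability in [0, 1].

    history: counts of the same biterm in earlier time slices (assumed input).
    min_sigma: assumed floor that avoids division by zero for flat histories.
    """
    mu = mean(history)
    sigma = max(pstdev(history), min_sigma)
    return normal_cdf(current, mu, sigma)


# Hypothetical counts: a spike far above the historical level.
p_spike = burst_probability(50, [4, 5, 6])
# A count equal to the historical mean sits exactly at the midpoint.
p_flat = burst_probability(5, [5, 5, 5])
```

Under this mapping a count well above its history yields a probability close to 1, so the burst probability is not forced to stay below the non-burst probability.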

References:

[1] Mathioudakis M, Koudas N. Twittermonitor: trend detection over the twitter stream[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010: 1155-1158.

[2] Cataldi M, Di Caro L, Schifanella C. Emerging topic detection on twitter based on temporal and social terms evaluation[C]//Proceedings of the Tenth International Workshop on Multimedia Data Mining. ACM, 2010: 4.

[3] Li C, Sun A, Datta A. Twevent: segment-based event detection from tweets[C]//Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012: 155-164.

[4] Blei D M. Probabilistic topic models[J]. Communications of the ACM, 2012, 55(4): 77-84.

[5] Diao Q, Jiang J, Zhu F, et al. Finding bursty topics from microblogs[C]//Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 2012: 536-544.

[6] Lau J H, Collier N, Baldwin T. On-line Trend Analysis with Topic Models: #twitter Trends Detection Topic Model Online[C]//COLING. 2012: 1519-1534.

[7] Yan X, Guo J, Lan Y, et al. A Probabilistic Model for Bursty Topic Discovery in Microblogs[C]//AAAI. 2015: 353-359.

[8] Kleinberg J. Bursty and hierarchical structure in streams[J]. Data Mining and Knowledge Discovery, 2003, 7(4): 373-397.

[9] Mimno D, Wallach H M, Talley E, et al. Optimizing semantic coherence in topic models[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011: 262-272.