亚洲免费av电影一区二区三区,日韩爱爱视频,51精品视频一区二区三区,91视频爱爱,日韩欧美在线播放视频,中文字幕少妇AV,亚洲电影中文字幕,久久久久亚洲av成人网址,久久综合视频网站,国产在线不卡免费播放

基于時(shí)滯特征的時(shí)序依賴情節(jié)發(fā)現(xiàn)

2019-08-01 01:57:38顧佩月劉崢李云李濤

計(jì)算機(jī)應(yīng)用 2019年2期

關(guān)鍵詞：時(shí)滯

顧佩月劉崢李云李濤

摘要：對(duì)于事件序列中的時(shí)序依賴發(fā)現(xiàn)，傳統(tǒng)的頻繁情節(jié)發(fā)現(xiàn)方法一方面使用時(shí)間窗口機(jī)制挖掘事件之間簡(jiǎn)單的關(guān)聯(lián)依賴，另一方面無(wú)法有效處理事件的交叉時(shí)序關(guān)聯(lián)。針對(duì)以上問(wèn)題，提出了時(shí)滯情節(jié)發(fā)現(xiàn)的概念，在頻繁情節(jié)發(fā)現(xiàn)的基礎(chǔ)上，設(shè)計(jì)了一種基于相鄰事件匹配集（AEM）的時(shí)滯情節(jié)發(fā)現(xiàn)算法。首先，引入時(shí)滯的概率統(tǒng)計(jì)模型進(jìn)行事件序列匹配，避免預(yù)先設(shè)定時(shí)間窗口，處理可能存在的交叉關(guān)聯(lián);然后，將時(shí)滯挖掘轉(zhuǎn)化為最優(yōu)化問(wèn)題，使用迭代的方式得到時(shí)滯情節(jié)之間的時(shí)間間隔分布;最后，利用假設(shè)檢驗(yàn)區(qū)分串行時(shí)滯情節(jié)和并行時(shí)滯情節(jié)。理論分析與實(shí)驗(yàn)結(jié)果表明，與目前最新的時(shí)滯挖掘方法迭代最近事件（ICE）算法相比，基于AEM的時(shí)滯情節(jié)發(fā)現(xiàn)算法模擬的時(shí)滯分布與真實(shí)時(shí)滯分布的平均KL距離為0.056，縮短了20.68%。基于AEM的時(shí)滯情節(jié)發(fā)現(xiàn)算法通過(guò)時(shí)滯的概率統(tǒng)計(jì)模型衡量事件多種匹配情況的可能性，獲得一對(duì)多的相鄰事件匹配集，比ICE算法中的一對(duì)一匹配更加有效地模擬了實(shí)際情況。

關(guān)鍵詞：時(shí)序依賴;事件序列;頻繁情節(jié);時(shí)滯;概率統(tǒng)計(jì)模型

中圖分類號(hào)： TP391

文獻(xiàn)標(biāo)志碼：A

Abstract： Concerning the problem that a predefined time window is usually used to mine simple association dependencies between events in the traditional frequent episode discovery， which cannot effectively handle interleaved temporal correlations between events， a concept of time-lag episode discovery was proposed. And on the basis of frequent episode discovery， Adjacent Event Matching set （AEM） based time-lag episode discovery algorithm was proposed. Firstly， a probability statistical model introduced with time-lag was introduced to realize event sequence matching and handle optional interleaved associations without a predefined time window. Then the discovery of time lag was formulated as an optimization problem which can be solved iteratively to obtain time interval distribution between time-lag episodes. Finally， the hypothesis test was used to distinguish serial and parallel time-lag episodes. The experimental results show that compared with Iterative Closest Event （ICE） algorithm which is the latest method of time-lag mining， the Kullback-Leibler （KL） divergence between true and experimental distributions discovered by AEM is 0.056 on average， which is decreased by 20.68%. AEM algorithm measures the possibility of multiple matches of events through a probability statistical model of time lag and obtains a one-to-many adjacent event matching set， which is more effective than the one-to-one matching set in ICE for simulating the actual situation.

Key words： temporal dependency; event sequence; frequent episode; time lag; probability statistical model

0 引言

近年來(lái)，隨著數(shù)據(jù)采集和存儲(chǔ)技術(shù)的不斷發(fā)展，各領(lǐng)域海量時(shí)序數(shù)據(jù)的研究逐漸成為數(shù)據(jù)挖掘和知識(shí)發(fā)現(xiàn)的熱點(diǎn)問(wèn)題[1]。時(shí)序數(shù)據(jù)通常記錄了事物隨著時(shí)間變化發(fā)展的不同狀態(tài)。

在各類領(lǐng)域中，時(shí)序數(shù)據(jù)通常會(huì)有兩種形式：第一種是時(shí)間序列，記錄連續(xù)時(shí)間內(nèi)事物變化值，例如股票數(shù)據(jù)、系統(tǒng)內(nèi)存占用率等;第二種是事件數(shù)據(jù)，記錄離散時(shí)間內(nèi)事物變化情況，例如購(gòu)物籃數(shù)據(jù)[2]、Web數(shù)據(jù)[3]、社交網(wǎng)絡(luò)事件數(shù)據(jù)[4]等。同一事物的發(fā)展往往蘊(yùn)含著一些不為人知的規(guī)律或固有模式，從事件數(shù)據(jù)中發(fā)現(xiàn)學(xué)習(xí)這些隱藏的有趣時(shí)序依賴模式，有助于對(duì)未來(lái)事物的發(fā)展進(jìn)行預(yù)測(cè)或是對(duì)引起事件的根源進(jìn)行追查。不同形式的時(shí)序數(shù)據(jù)依賴發(fā)現(xiàn)方式也有很大不同，本文主要研究離散時(shí)間上的序列集合，即事件數(shù)據(jù)中的時(shí)序依賴發(fā)現(xiàn)。

關(guān)聯(lián)規(guī)則[5]是現(xiàn)有工作中針對(duì)事件序列最簡(jiǎn)單的時(shí)序依賴發(fā)現(xiàn)，也是后續(xù)許多時(shí)序依賴發(fā)現(xiàn)方法的基礎(chǔ)。關(guān)聯(lián)規(guī)則旨在發(fā)現(xiàn)時(shí)序數(shù)據(jù)中頻繁出現(xiàn)的項(xiàng)集，比較經(jīng)典的應(yīng)用就是購(gòu)物籃數(shù)據(jù)。在此數(shù)據(jù)集中，每一條記錄代表一個(gè)客戶所有的購(gòu)買(mǎi)記錄，通過(guò)關(guān)聯(lián)分析得到客戶可能同時(shí)購(gòu)買(mǎi)的商品組合，從而幫助商場(chǎng)進(jìn)行營(yíng)銷(xiāo)。但是關(guān)聯(lián)規(guī)則的依賴事件是不考慮時(shí)間先后關(guān)系的，隨之又衍生了頻繁情節(jié)挖掘[6]，即在關(guān)聯(lián)規(guī)則的基礎(chǔ)上添加簡(jiǎn)單的時(shí)間約束，通過(guò)設(shè)定時(shí)間窗口從發(fā)現(xiàn)的依賴中得知事件發(fā)生的先后順序。隨著生產(chǎn)應(yīng)用的需求變化，時(shí)序依賴發(fā)現(xiàn)中的時(shí)間參數(shù)從先后關(guān)系再到模糊的時(shí)間區(qū)間進(jìn)而轉(zhuǎn)向具體的滯后時(shí)間挖掘。

在考慮兩類事件A、B之間的依賴時(shí)，通常是基于依賴存在于相鄰事件的假設(shè)上進(jìn)行挖掘的，即若事件A、B是依賴的，那么事件B的發(fā)生只能與它之前的第一個(gè)事件A有關(guān)。但在實(shí)際復(fù)雜的生產(chǎn)環(huán)境中，事件之間的依賴不僅僅存在于相鄰發(fā)生，也有可能出現(xiàn)交叉依賴[7]的情況，導(dǎo)致無(wú)法判斷事件之間的對(duì)應(yīng)關(guān)系。例如圖1，在關(guān)聯(lián)規(guī)則、序列模式等時(shí)序依賴發(fā)現(xiàn)中，通常只考慮彼此相鄰的兩類事件依賴，即實(shí)線箭頭表示的依賴關(guān)系，但真實(shí)的時(shí)序依賴是相互交叉的，事件B可能與它之前的第一個(gè)或第二個(gè)甚至更前面的A有關(guān)，即b3不僅可能與a3有關(guān)，也可能與a2或a1有關(guān)。以上這些需求的變化都為事件時(shí)序依賴發(fā)現(xiàn)帶來(lái)了更多的問(wèn)題和困難。

1 相關(guān)工作

來(lái)自不同領(lǐng)域的多樣化需求衍生了多種類型的事件依賴，一般根據(jù)處理的數(shù)據(jù)類型分為兩類。第一類是事務(wù)數(shù)據(jù)庫(kù)，其中每個(gè)事務(wù)是一系列事件的集合。每個(gè)事務(wù)通常記錄的是同一個(gè)實(shí)體的事件，例如銀行賬戶流水、顧客購(gòu)物記錄。這種類型的時(shí)序依賴主要用于發(fā)現(xiàn)在某些事務(wù)中頻繁出現(xiàn)的子序列，關(guān)注不同個(gè)體之間相同的依賴模式。序列模式[8-9]、全依賴模式[10]、互依賴模式[11]、部分周期模式[12]是該類事件依賴中比較典型的模式依賴發(fā)現(xiàn)。

在實(shí)際應(yīng)用中因?yàn)槟承l件限制，用戶無(wú)法獲得足夠的信息將事件數(shù)據(jù)劃分為事務(wù)，此時(shí)用戶面對(duì)的就是一系列事件。這就是第二類的時(shí)序依賴，也是本文重點(diǎn)關(guān)注的類型。其中，最基本的時(shí)序依賴是頻繁情節(jié)發(fā)現(xiàn)[6]。同一個(gè)情節(jié)中的事件發(fā)生在相近的時(shí)間區(qū)間內(nèi)，所以情節(jié)發(fā)現(xiàn)的一般方式都是采用用戶預(yù)定義時(shí)間窗口機(jī)制。通過(guò)滑動(dòng)時(shí)間窗口將發(fā)生在同一個(gè)窗口中的事件視作一個(gè)事務(wù)，然后將從事務(wù)數(shù)據(jù)庫(kù)中發(fā)現(xiàn)頻繁子序列的思想應(yīng)用到頻繁情節(jié)發(fā)現(xiàn)中。傳統(tǒng)的情節(jié)發(fā)現(xiàn)只能揭示在一個(gè)預(yù)定義窗口內(nèi)頻繁發(fā)生的有序事件，情節(jié)內(nèi)事件之間發(fā)生的具體時(shí)間間隔是未知的。例如在大小為5的時(shí)間窗口中發(fā)現(xiàn)的情節(jié)〈A，B〉表示事件B會(huì)在事件A發(fā)生之后的4個(gè)單位時(shí)間內(nèi)發(fā)生，但不能準(zhǔn)確提供具體的發(fā)生時(shí)刻。文獻(xiàn)[13]提出了固定間隔情節(jié)的概念，構(gòu)建了一種基于前綴樹(shù)的精確位置情節(jié)規(guī)則（Precise-Position Episode Rule trie， PER-trie）挖掘框架，用于獲得事件之間準(zhǔn)確發(fā)生的時(shí)間間隔區(qū)間。

以上這些挖掘算法通常都需要用戶提前選定一個(gè)合適的時(shí)間窗口大小，這在實(shí)際應(yīng)用中是一件非常困難的事情，因?yàn)閷?shí)際應(yīng)用中時(shí)序依賴發(fā)現(xiàn)的時(shí)間區(qū)間可能會(huì)從1秒到1天甚至更長(zhǎng)不等，而不合適的時(shí)間窗口可能會(huì)導(dǎo)致一些超出窗口以外的依賴被忽視。文獻(xiàn)[14]算法不需要預(yù)先定義窗口，將空間統(tǒng)計(jì)學(xué)中的思想應(yīng)用到事件的時(shí)序依賴發(fā)現(xiàn)中，用空間點(diǎn)過(guò)程的距離度量計(jì)算事件發(fā)生的無(wú)條件分布和條件分布，將成對(duì)事件之間的依賴關(guān)系問(wèn)題轉(zhuǎn)化為這兩個(gè)概率分布的比較，而事件之間的等待時(shí)間用一個(gè)卡方檢驗(yàn)來(lái)計(jì)算。Tang等[7]考慮到時(shí)間間隔的隨機(jī)性，將交錯(cuò)事件存在關(guān)聯(lián)的可能性納入依賴發(fā)現(xiàn)過(guò)程中，構(gòu)建了一個(gè)基于有序表的算法STScan （Sorted Table Scan），將成對(duì)事件之間可能存在的所有時(shí)間間隔都保存在鏈表中，通過(guò)掃描鏈表的子段得到所有的合格滯后區(qū)間。文獻(xiàn)[15]將時(shí)序依賴發(fā)現(xiàn)問(wèn)題轉(zhuǎn)化為成對(duì)事件的時(shí)間間隔分布的學(xué)習(xí)，提出了一種基于期望最大化（Expectation Maximization， EM）的方法用于發(fā)現(xiàn)時(shí)序依賴中時(shí)滯的最大似然模型。Zller[16]基于迭代最近點(diǎn)（Iterative Closest Point， ICP）算法[17]思想構(gòu)建了迭代最近事件（Iterative Closest Event， ICE）算法尋找最合適的事件匹配集合滿足時(shí)滯分布波動(dòng)性最小。

6 結(jié)語(yǔ)

本文針對(duì)事件數(shù)據(jù)中的時(shí)序依賴問(wèn)題進(jìn)行研究，提出了時(shí)滯情節(jié)發(fā)現(xiàn)這一概念，采用基于相鄰事件匹配集（AEM）的時(shí)滯情節(jié)發(fā)現(xiàn)算法，不需要預(yù)先定義時(shí)間窗口，通過(guò)時(shí)滯的概率統(tǒng)計(jì)模型發(fā)現(xiàn)成對(duì)并行和串行時(shí)滯情節(jié)。與算法ICE相比，基于AEM的時(shí)滯情節(jié)發(fā)現(xiàn)算法有效地提高了時(shí)滯模擬效果。但目前本文研究只關(guān)注了成對(duì)的時(shí)滯情節(jié)，三個(gè)及三個(gè)以上事件類型之間的時(shí)滯情節(jié)發(fā)現(xiàn)需要在事件序列匹配中構(gòu)建更加復(fù)雜的概率模型，這也是我們未來(lái)工作的重點(diǎn)研究方向，以便更好地將時(shí)滯情節(jié)發(fā)現(xiàn)應(yīng)用于實(shí)際生產(chǎn)中。

參考文獻(xiàn)：

[1] 李濤，劉崢，周綺鳳.應(yīng)用驅(qū)動(dòng)的大數(shù)據(jù)挖掘[J].中興通訊技術(shù)，2016，22（2）：49-52. （LI T， LIU Z， ZHOU Q F. Application-driven big data mining[J]. ZTE Communications， 2016，22（2）：49-52.）

[2] WU C-W， LIN Y-F， YU P S， et al. Mining high utility episodes in complex event sequences [C]// Proceedings of the 2013 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York： ACM， 2013： 536-544.

[3] 武健.時(shí)序Web數(shù)據(jù)挖掘方法[J].計(jì)算機(jī)應(yīng)用，2014，34（S2）：120-122. （WU J. Data mining method from time series Web data [J]. Journal of Computer Applications， 2014， 34（S2）： 120-122.）

[4] 郭跇秀，呂學(xué)強(qiáng)，李卓.基于突發(fā)詞聚類的微博突發(fā)事件檢測(cè)方法[J].計(jì)算機(jī)應(yīng)用，2014，34（2）：486-490. （GUO Y X， LYU X Q， LI Z. Bursty topics detection approach on Chinese microblog based on burst words clustering [J]. Journal of Computer Applications， 2014， 34（2）： 486-490.）

[5] AGRAWAL R， IMIELINSKI T， SWAMI A N. Mining association rules between sets of items in large databases [C]// Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. New York： ACM， 1993： 207-216.

[6] MANNILA H， TOIVONEN H， VERKAMO A I. Discovery of frequent episodes in event sequences [J]. Data Mining & Knowledge Discovery， 1997， 1（3）：259-289.

[7] TANG L， LI T， SHWARTZ L. Discovering lag intervals for temporal dependencies [C]// Proceedings of the 2012 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York： ACM， 2012： 633-641.

[8] AGRAWAL R， SRIKANT R. Mining sequential patterns [C]// Proceedings of the 1995 Eleventh International Conference on Data Engineering. Washington， DC： IEEE Computer Society， 1995： 3-14.

[9] 李艷輝，劉浩，袁野，等.基于差分隱私的頻繁序列模式挖掘算法[J].計(jì)算機(jī)應(yīng)用，2017，37（2）：316-321. （LI Y H， LIU H， YUAN Y. Frequent sequence pattern mining with differential privacy[J]. Journal of Computer Applications， 2017， 37（2）： 316-321.）

[10] LIANG F， MA S， HELLERSTEIN J L. Discovering fully dependent patterns [C]// Proceedings of the 2002 SIAM International Conference on Data Mining. Philadelphia， PA： Society for Industrial and Applied Mathematics （SIAM）， 2002： 511-527.

[11] MA S， HELLERSTEIN J L. Mining mutually dependent patterns [C]// Proceedings of the 2001 IEEE International Conference on Data Mining. Washington， DC： IEEE Computer Society， 2001： 409-416.

[12] MA S， HELLERSTEIN J L. Mining partially periodic event patterns with unknown periods [C]//Proceedings of the 2001 17th International Conference on Data Engineering. Piscataway， NJ： IEEE， 2001：205-214.

[13] AO X， LUO P， WANG J， et al. Mining precise-positioning episode rules from event sequences [J]. IEEE Transactions on Knowledge and Data Engineering， 2018， 30（3）： 530-543.

[14] LI T， MA S. Mining temporal patterns without predefined time windows [C]// Proceedings of the 4th IEEE International Conference on Data Mining. Washington， DC： IEEE Computer Society， 2004： 451-454.

[15] ZENG C， TANG L， LI T， et al. Mining temporal lag from fluctuating events for correlation and root cause analysis [C]// Proceedings of the 2014 10th International Conference on Network and Service Management. Washington， DC： IEEE Computer Society， 2014： 19-27.

[16] ZLLER M-A， BAUM M， HUBER M F. Framework for mining event correlations and time lags in large event sequences [C]// Proceedings of the IEEE 2017 15th International Conference of Industrial Informatics. Piscataway， NJ： IEEE， 2017： 3-4.

[17] BESL P J， McKAY N D. A method for registration of 3-D shapes [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2002， 14（2）：239-256.

[18] GUO S， LIU Z， CHEN W， et al. Event extration from streaming system logs [C]// ICISA 2018Proceedings of the 9th International Conference on Information Science and Applications， LNEE 514. Singapore： Springer， 2018： 465-474.

[19] LI T， JIANG Y， ZENG C， et al. FLAP： an end-to-end event log analysis platform for system management [C]// Proceedings of the 2017 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York： ACM， 2017： 1547-1556.

[20] FISCHLER M A， BOLLES R C. Random sample consensus： a paradigm for model fitting with applications to image analysis and automated cartography [J]. Readings in Computer Vision， 1987： 726-740.

[21] LI T， LIANG F， MA S， et al. An integrated framework on mining logs files for computing system management[C]// Proceedings of the 2005 Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York： ACM， 2005： 776-781.