
        The Research on How to Detect Plagiarism in the Theses Based on Automatic Abstraction

        2010-04-16 09:15:20 Zhao Junjie, Wang Li, Wang Pingshui
        電腦與電信 (Computer & Telecommunication), 2010, Issue 2

        Zhao Junjie, Wang Li, Wang Pingshui

        (Anhui University of Finance & Economics, Bengbu 233061, Anhui)

        Automatic abstraction uses a computer to extract, from a text or a collection of texts, a brief and coherent passage that completely and accurately reflects the main content, in order to meet the requirements of general or particular users. This paper first reviews the definition, function and classification of automatic abstraction, and then presents an automatic abstraction technique based on keyword retrieval. It also proposes a method for detecting plagiarism in theses based on automatic abstraction and analyzes the experimental results. Finally, further work is briefly outlined.

        automatic abstraction; keywords; extraction; retrieval; plagiarism detection

        Author introduction: Zhao Junjie, male, a native of Suzhou, Anhui; master's degree; lecturer. Research interests: data mining and information retrieval.

        Funding: Youth Project of the Ministry of Education Social Sciences Research Fund (project No. 07JC870006); Key Teaching and Research Project of Anhui University of Finance and Economics (project No. ACJYZD200914).

        1. Introduction

        Automatic abstraction means using a computer to extract an abstract automatically from the original literature[1]. It quickly condenses a large number of electronic texts and is an accurate and efficient way to speed up reading and the acquisition of information. An abstract is a brief and coherent passage that accurately reflects the central content of a document, and there are three main types: indicative, informative and critical[2]. This paper focuses on the informative abstract, a condensed expression of the details of the content. It lets users grasp the core content of the original paper just by reading the abstract, which saves time and improves reading efficiency. The main purpose of this study is to design an automatic abstraction technique based on keyword retrieval and apply it to the rapid detection of plagiarism in papers.

        2. Overview of Automatic Abstraction Technology

        Automatic abstraction consists of three steps: text analysis, information selection and generalization, and abstract generation. Text analysis finds the most representative components of the original content. The selection and generalization step compresses the text by summarizing it. The last step recombines the selected content and generates the abstract[3].

        There are four main methods of automatic abstraction: automatic extraction, automatic abstraction based on understanding, information extraction, and automatic abstraction based on structure[4].

        2.1 Automatic Extraction

        Automatic extraction treats a text as a linear sequence of sentences and each sentence as a linear sequence of words. It usually works in four steps: (1) calculating the weight of each word; (2) calculating the weight of each sentence; (3) sorting all the original sentences in descending order of weight and selecting the highest-weighted ones as abstract sentences; (4) outputting the abstract sentences in the order in which they appear in the original text. The calculation of word and sentence weights and the selection of abstract sentences all rest on six formal features of the text: word frequency, title, position, syntactic structure, cue words and indicative phrases. These six features indicate the theme of the text from different angles.
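        The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it uses only word frequency (one of the six features), a naive regular-expression sentence splitter for English text, and a small hypothetical stopword list. The names `split_sentences` and `extract_summary` are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = frozenset({"the", "a", "of", "and", "is", "to", "in"})

def split_sentences(text):
    """Naive splitter on ., ! and ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def extract_summary(text, k=2):
    sentences = split_sentences(text)
    tokenized = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    # (1) Word weight: raw frequency over the whole text.
    word_weight = Counter(w for words in tokenized for w in words)
    # (2) Sentence weight: mean weight of the words in the sentence.
    sentence_weight = [
        sum(word_weight[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    # (3) Rank sentences by weight, highest first, and keep the top k.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_weight[i], reverse=True)
    # (4) Output the chosen sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:k])]
```

        A fuller version would add the title, position, cue-word and other features as extra terms in the sentence weight.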

        2.2 Automatic Abstraction Based on Understanding

        The obvious difference between this method and automatic extraction lies in its use of knowledge. It not only uses linguistic knowledge to obtain the language structure of the text, but also uses domain knowledge to obtain its meaning. Finally, it generates the abstract from that meaning.

        2.3 Information Extraction

        Information extraction means automatically identifying information such as entities, relationships and events in a given set of texts, and storing or managing that information. Automatic summarization by information extraction first identifies the themes of the text and chooses an abstract framework, then analyzes the useful text fragments in depth and fills the framework with relevant phrases or sentences. Lastly, the abstract model is used to convert the content of the framework into the abstract, which is then output.

        2.4 Automatic Abstraction Based on Structure

        A text can be viewed as a network of sentences, and the abstract sentences are usually the central sentences that are linked to many others. The relationship between sentences can be judged from shared words or from conjunctions. A long article can likewise be viewed as a network of paragraphs. Each paragraph is given a feature vector, and the inner product of the feature vectors of two paragraphs is taken as their connection strength. If the connection strength exceeds a given threshold, the two paragraphs are considered semantically linked. Lastly, the central paragraphs, those linked to many other paragraphs, are extracted to form the abstract of the article.
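        The paragraph network described above can be illustrated as follows. The sketch assumes raw term counts as the feature vector and a caller-chosen threshold; the text does not specify either, and the function names are hypothetical.

```python
import re
from collections import Counter

def feature_vector(paragraph):
    """Term-count vector of a paragraph (an assumed, very simple feature)."""
    return Counter(re.findall(r"[a-z]+", paragraph.lower()))

def inner_product(u, v):
    """Connection strength of two paragraphs: inner product of their vectors."""
    return sum(count * v[term] for term, count in u.items())

def central_paragraphs(paragraphs, threshold, top_n=1):
    """Return indices of the most-linked paragraphs and the link counts."""
    vectors = [feature_vector(p) for p in paragraphs]
    n = len(paragraphs)
    links = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Two paragraphs are linked when their strength exceeds the threshold.
            if inner_product(vectors[i], vectors[j]) > threshold:
                links[i] += 1
                links[j] += 1
    ranked = sorted(range(n), key=lambda i: links[i], reverse=True)
    return sorted(ranked[:top_n]), links
```

        In practice the vectors would usually be normalized or weighted so that long paragraphs do not dominate the inner product.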

        3. A Technique of Automatic Abstraction Based on Keyword Retrieval

        3.1 Keyword Extraction

        The keyword extraction algorithm combines the following in a single framework: word segmentation and part-of-speech tagging, text preprocessing, a linear weighting algorithm, the formation and filtering of combined words, and keyword merging. Its two important data structures are the word information table and the compound word information table. The generated combined words are not treated as exceptions; they are assigned weights in a principled way and compete with the other words (those scored by the linear weighting algorithm). The two tables are then merged to obtain the final keywords[5].

        The text is first preprocessed and passed through word segmentation and part-of-speech tagging, and then the linear weighting algorithm is applied. By analyzing word frequency, part of speech and position in the Chinese text, we quantify the weighting factors and calculate the weight of each word. Candidate keywords are then extracted by weight and form the basis of the final keywords. A second list of candidate keywords is obtained by forming combined words from the linearly weighted words. Finally, duplicate items in the two tables are removed, and the keywords are output in descending order of weight. The number of keywords can be specified by the user.
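        A rough sketch of the linear weighting and merging steps is given below. It assumes the text has already been segmented and tagged, so each token arrives as a `(word, pos_tag, in_title)` triple. The coefficients `ALPHA`, `BETA`, `GAMMA`, the part-of-speech weights and the function names are all hypothetical, since the paper does not state its values, and scoring of combined words is left to the caller.

```python
from collections import Counter

# Hypothetical coefficients and part-of-speech weights (not from the paper).
ALPHA, BETA, GAMMA = 0.5, 0.3, 0.2
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}

def score_words(tokens):
    """Linear weighting over frequency, part of speech and title position."""
    freq = Counter(word for word, _, _ in tokens)
    if not freq:
        return {}
    pos_of, in_title = {}, {}
    for word, pos, title_flag in tokens:
        pos_of.setdefault(word, pos)
        in_title[word] = in_title.get(word, False) or title_flag
    max_freq = max(freq.values())
    return {
        word: ALPHA * freq[word] / max_freq
        + BETA * POS_WEIGHT.get(pos_of[word], 0.0)
        + GAMMA * (1.0 if in_title[word] else 0.0)
        for word in freq
    }

def merge_keywords(word_scores, compound_scores, top_n=5):
    """Merge the two candidate tables, drop duplicates, sort by weight."""
    merged = dict(word_scores)
    for word, score in compound_scores.items():
        merged[word] = max(score, merged.get(word, 0.0))
    ranked = sorted(merged, key=lambda w: (-merged[w], w))
    return ranked[:top_n]
```

        The `top_n` argument plays the role of the user-specified number of keywords.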

        3.2 Algorithm of Automatic Abstraction

        The automatic abstraction algorithm first segments the text with a segmentation tool[6]. It then extracts the keywords and stores them paragraph by paragraph. Each keyword is given a weight according to its order of extraction (1.0, 0.9, 0.8), and the weight of each sentence in every paragraph is calculated from the weights of the keywords it contains. Besides word frequency, the title, the position and the length of sentences are also important factors in choosing abstract sentences. According to statistics, the chance that abstract sentences appear in the title is around 95%, at the beginning of a paragraph 85%, and at the end of a paragraph 7%. Therefore, titles that contain keywords are taken directly as abstract sentences. The other sentences are sorted by weight, and the 5 sentences with the highest weight in each paragraph are picked as candidate key sentences. The abstract sentences are then selected from the candidates by considering their position and length, and finally the abstract of the whole thesis is formed. The specific process is shown in Figure 1.
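        The sentence-weighting part of this algorithm might look like the sketch below. It makes several assumptions that the text does not confirm: keyword weights keep decreasing by 0.1 after the third keyword (with a floor of 0.1), a sentence's weight is the sum of the weights of the keywords it contains, and keywords are matched as lowercase substrings. The final selection by position and length is not modeled.

```python
def keyword_weights(keywords):
    """Weights 1.0, 0.9, 0.8 by extraction order; the continuation is assumed."""
    return {kw: max(10 - i, 1) / 10 for i, kw in enumerate(keywords)}

def sentence_weight(sentence, weights):
    """Sum of the weights of the keywords that occur in the sentence."""
    text = sentence.lower()
    return sum(w for kw, w in weights.items() if kw in text)

def candidate_sentences(paragraphs, keywords, per_paragraph=5):
    """paragraphs: list of lists of sentences. Returns candidates per paragraph."""
    weights = keyword_weights(keywords)
    result = []
    for sentences in paragraphs:
        scored = [(sentence_weight(s, weights), i) for i, s in enumerate(sentences)]
        top = sorted(scored, key=lambda t: (-t[0], t[1]))[:per_paragraph]
        # Keep sentences with at least one keyword, in original order.
        kept = sorted(i for score, i in top if score > 0)
        result.append([sentences[i] for i in kept])
    return result

def title_is_abstract_sentence(title, keywords):
    """A title containing any keyword is taken directly as an abstract sentence."""
    return any(kw in title.lower() for kw in keywords)
```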

        4. Detection of Thesis Copying Based on Automatic Abstraction

        4.1 Basic Thought

        Because most papers are long, comparing them in full is time-consuming. We therefore compare their abstracts first, and compare the whole texts only if the abstracts are highly similar, in order to find the content suspected of plagiarism. However, some authors provide abstracts that are too simple (no more than 200 words) or not accurate enough, whereas a summary of the full text captures its key points better. This paper therefore generates automatic abstracts of the theses and compares those, which improves accuracy.

        4.2 Concrete Steps

        Step 1: segment the paper to be detected and the original paper;

        Step 2: extract the keywords of the paper to be detected and of the original paper, and store them;

        Step 3: calculate and sort the sentence weights in the paper to be detected and in the original paper, and generate their automatic abstracts;

        Step 4: compare the automatic abstracts of the paper to be detected and the original paper and calculate their similarity; also calculate the similarity between the abstract provided by the author and the automatic abstract;

        Step 5: if the similarity exceeds 10%, suspect plagiarism, compare the whole texts of the paper to be detected and the original paper, and output the copied content. Otherwise, the paper is not regarded as a copy.

        Figure 1 Automatic Abstraction Based on Keyword Retrieval
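        The two-stage check in Steps 4 and 5 can be sketched as below. The functions `summarize` and `similarity` are caller-supplied stand-ins for the abstraction and similarity steps. Reporting copied content as sentences that appear verbatim in both papers is an assumption of this sketch, as are all the names used.

```python
import re

def split_sentences(text):
    """Naive splitter on ., ! and ? followed by whitespace (illustration only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def detect_plagiarism(suspect, original, summarize, similarity,
                      threshold=0.10, author_abstract=None):
    """Compare automatic abstracts first; compare full texts only if needed.

    summarize(text) -> str and similarity(a, b) -> float in [0, 1] are
    supplied by the caller.
    """
    suspect_abstract = summarize(suspect)
    original_abstract = summarize(original)
    report = {
        "abstract_similarity": similarity(suspect_abstract, original_abstract),
        "author_vs_auto": None,
        "suspected": False,
        "copied": [],
    }
    # Step 4, second part: author-provided abstract vs. automatic abstract.
    if author_abstract is not None:
        report["author_vs_auto"] = similarity(author_abstract, suspect_abstract)
    # Step 5: full-text comparison only above the 10% threshold.
    if report["abstract_similarity"] > threshold:
        report["suspected"] = True
        original_sentences = set(split_sentences(original))
        report["copied"] = [s for s in split_sentences(suspect)
                            if s in original_sentences]
    return report
```

        Passing the two functions as arguments keeps the control flow of the five steps separate from any particular abstraction or similarity method.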

        4.3 Experimental Result

        Three copied files, D1, D2 and D3, were constructed as test samples, with plagiarism proportions of about 20%, 30% and 50% respectively. The main purpose is to test how different proportions of plagiarism affect the result of the comparison. Similarity is calculated by word-frequency statistics, that is, as the proportion of similar words among the total words[7]. Figure 2 shows an interface of the automatic abstraction system. Table 1 lists the three copied files D1, D2 and D3 with their corresponding abstracts, together with their similarity to the original in terms of whole text, author-provided abstract and automatic abstract.
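        One simple reading of "the proportion of similar words among the total words" is shown below. The exact formula is given in [7] and not reproduced here, so this is an assumed interpretation: shared word occurrences, counted with multiplicity, divided by the number of words in the text being checked.

```python
import re
from collections import Counter

def word_similarity(candidate, reference):
    """Share of the candidate's word occurrences that also occur in the reference."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    shared = sum((cand & ref).values())
    return shared / total
```

        For example, `word_similarity("the cat sat on the mat", "the cat lay on a rug")` counts `the`, `cat` and `on` once each out of six words. Chinese text would first need word segmentation, since `\w+` does not split it into words.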

        Figure 2 Automatic Abstract for a Certain Document

        Table 1 Experimental Result

        4.4 Basic Summary

        The experimental results show that the similarity computed on the whole text and the similarity computed on the automatic abstract are both very close to the proportion of copying. The abstract provided by the author, however, sometimes produces errors because of its limited accuracy and length. The abstract generated by automatic abstraction based on keyword retrieval roughly summarizes the text and can stand in for the text to be detected. Of course, this is only a preliminary inspection; detailed comparison of the texts still needs to be done.

        In addition, the keywords given by some authors are few and not very accurate. This system usually extracts 5-8 keywords that reflect the theme of the text, so the automatic abstract based on keyword retrieval is more accurate.

        5. Conclusion

        A good-quality abstract can, to a certain extent, take the place of the original text in retrieval and so reduce the time spent on information retrieval. Researchers at home and abroad continue to explore accurate and efficient algorithms for automatic abstraction. The method in this paper still has room for improvement. At present, the abstract generated for a paper of 7000 words is generally about 700 words, and the more words or paragraphs the text has, the longer the abstract becomes. It is therefore necessary to reduce the length of the abstract to within 500 words. In practice, we can merge several paragraphs, or pick the key sentences per subsection rather than per paragraph.

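        [1] Chai Xiaoli. Research and Application of Automatic Abstracting Technology (自動文摘技術的研究與應用)[D]. Master's thesis, Changchun University of Science and Technology, 2006.

        [2] Huang Liqiong. Research on Chinese Automatic Abstracting and Its Evaluation Methods (中文自動文摘及評價方法的研究)[D]. Master's thesis, Chongqing University, 2007.

        [3] Guo Yanhui, Zhong Yixin, et al. A Survey of Automatic Abstracting (自動文摘綜述)[J]. 情報學報, 2002, 21(5): 582-591.

        [4] Liu Ting, Wang Kaizhu. Four Main Methods of Automatic Abstracting (自動文摘的四種主要方法)[J]. 情報學報, 1999, 18(1): 10-19.

        [5] Zhang Hongying. A Keyword Extraction Algorithm for Chinese Text Based on Fuzzy Processing (基于模糊處理的中文文本關鍵詞提取算法)[J]. 現代圖書情報技術, 2009, (5): 39-43.

        [6] Li Ronglu. Research on Text Classification and Related Technologies (文本分類及其相關技術研究)[D]. Doctoral dissertation, Fudan University, 2005.

        [7] Zhao Junjie. An Algorithm for Judging Paper Plagiarism Based on Paragraph Word-Frequency Statistics (一種基于段落詞頻統計的論文抄襲判定算法)[J]. 計算機技術與發展, 2009, 19(4): 231-233, 238.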
