Jingfa LIU,F(xiàn)an LI,Ruoyao DING,Zi’ang LIU
1Guangzhou Key Laboratory of Multilingual Inte lligent Processing,Guan gdong University of Foreign Studies,Guangzhou 510006,China
2School of Information Scienceand Technology,Guangdong University of Foreign Studies,Guangzhou 510006,China
3School of Computer and Software,Nanjing University of Information Science&Technology,Nanjing 210044,China
4 Faculty of Science,University of Alberta,Edmonton T6 G2H6,Canada
?E-mail:jfliu@nuist.edu.cn;bj2014_lifan@163.com
Abstract:At present,focused crawler is a crucial method for obtaining effective domain knowledge from massive heterogeneous networks.For most current focused crawling technologies,there are some difficulties in obtaining high-quality crawling results.The main difficulties are the establishment of topic benchmark models,the assessment of topic relevance of hyperlinks,and the design of crawling strategies.In this paper,we use domain ontology to build a topic benchmark model for a specific topic,and propose a novel multiple-filtering strategy based on local ontology and global ontology(MFSLG).A comprehensive priority evaluation method(CPEM)based on the web text and link structure is introduced to improve the computation precision of topic relevance for unvisited hyperlinks,and a simulated annealing(SA)method is used to avoid the focused crawler falling into local optima of the search.By incorporating SA into the focused crawler with MFSLG and CPEM for the first time,two novel focused crawler strategies based on ontology and SA(FCOSA),including FCOSA with only global ontology(FCOSA_G)and FCOSA with both local ontology and global ontology(FCOSA_LG),are proposed to obtain topic-relevant webpages about rainstorm disasters from the network.Experimental results show that the proposed crawlers outperform the other focused crawling strategies on different performance metric indices.
Key words:Focused crawler;Ontology;Priority evaluation;Simulated annealing;Rainstorm disaster
Crawler,which is an important part of search en‐gines for information retrieval(IR),is a technology for automatically obtaining webpages.To acquire domainspecific knowledge,traditional crawlers have difficulties in implementing semantic analysis.Therefore,focused crawler technologies with topic preference charac‐teristics have received great attention in recent years(Bajpai and Arora,2018).A focused crawler(Tsikrika et al.,2016)aims to retrieve large-quantity and highquality topic-relevant webpages in a short time.Fo‐cused crawler has many applications in the fields of business,transmission,biomedicine,and meteorology(Boukadi et al.,2018;Liu B et al.,2020).This paper addresses focused crawling on the topic of rainstorm disaster,which is one of the most frequent meteoro‐logical disasters.It is extremely important to obtain early warning to ensure preventive measures and emergency response avoidance or reduce the loss caused by rainstorm disasters,and important to ensure the safety of people and property.
However,the scale of webpages on the Internet is massive and continuously growing.The content of webpages is highly dynamic and complex.The infor‐mation about webpages related to rainstorm disasters is sparse,showing the characteristics of big data.In the field of IR,traditional focused crawlers face great challenges in improving their accuracy.The main difficulties are the establishment of topic bench‐mark models,the assessment of topic relevance(in‐cluding hyperlinks and texts),and the design of crawler strategies.
Semantic description methods on a given topic are the most popular topic modeling strategies,and include mainly the conceptual graph(CG)(Du et al.,2013,2017;Guan and Luo,2016)and domain ontol‐ogy(Du et al.,2014;Zhu et al.,2017;Capuano et al.,2020;Khadir et al.,2021;Lakzaei and Shmasfard,2021).Currently,the methods of determining the vir‐tual concept are not uniform in CGs,and some meth‐ods require user interaction,which may cause topic deviations because of insufficient user knowledge or inaccurate understanding.Therefore,most crawl‐ing methods use domain ontology to specify the do‐main knowledge hierarchy,and compute conceptual semantic weights of the topic words occurring in the domain ontology.These are ultimately used to com‐pute the topic relevance score to judge whether one text or hyperlink is relevant to a specific topic.There‐after,during the crawling process,most scholars per‐form hyperlink filtering based on the priority(topic relevance)and a preset threshold.However,existing filtering methods on hyperlinks are all single-filtering processes,and a hyperlink filtering method with multi‐ple criteria has not been considered.
In the focused crawler,the primary methods to predict the priority of an unvisited hyperlink(web‐page)include two categories:hyperlink structure based method(Du et al.,2017)and web text analysis based method(Liu WJ and Du,2014;Prakash and Kumar,2015;Cheng et al.,2018).However,most re‐search ignores the impact of the combination of these two methods,and the indicators considered are not comprehensive.
In the design of crawler strategies,breadth first search(BFS)(Vidal et al.,2006)and optimal priority search(OPS)(Rawat and Patil,2013)are frequently applied.BFS ignores unvisited hyperlinks’priori‐ties,so the performance of BFS is generally inferior to that of OPS.Most scholars now use the OPS crawler strategy,but the OPS strategy is a greedy algorithm,where it is easy for the search to be trapped in local optima.To avoid inherent flaws of the greedy algo‐rithm,researchers have recently proposed some heu‐ristic crawler methods based on meta-heuristic strate‐gies,such as particle swarm optimization(PSO)(Tong,2008),the genetic algorithm(GA)(Jing et al.,2016),the ant colony optimization(ACO)algorithm(Chen et al.,2011),and the tabu search(TS)algorithm(Liu JF et al.,2020).However,some improvements and developments are still required to enhance their effectiveness.
The simulated annealing(SA)algorithm(He et al.,2009)has a strong global search capability and can accept the sub-optimal links based on Me‐tropolis sampling and avoid the focused crawling falling into local search.Thus,in this paper we apply the SA algorithm and ontology technology(which combines a multiple-filtering strategy and a compre‐hensive priority evaluation method(CPEM))to exe‐cute focused crawling.Experimental results of the fo‐cused crawlers on the rainstorm disaster show the ef‐fectiveness of the proposed method.The main contri‐butions of this paper are as follows:
1.A novel multiple-filtering strategy based on local ontology and global ontology(MFSLG)is pro‐posed to find more topic-relevant hyperlinks.
2.A CPEM considering four indicators(topic relevance of the webpage containing the unvisited hyperlink,topic relevance of anchor text,the Page-Rank(PR)value,and topic relevance of the webpage to which the unvisited hyperlink points)is used to evaluate the unvisited hyperlinks.
3.An annealing strategy based on Metropolis sampling is applied to avoid the focused crawler fall‐ing into a local optimal search.
4.Anew focused crawler combining domain on‐tology and the SA algorithm has been used to obtain the effective domain knowledge of the rainstorm di‐saster for the first time.
Formal concept analysis(FCA)is a semiautomated method of constructing ontology.The“concept lattice”is the core data structure of knowl‐edge representation and is the core mathematical the‐ory of FCA(Yang et al.,2008).Each node of the con‐cept lattice is a concept that consists of an extension and an intension of the concept.We analyze the con‐ceptual hierarchy relation and conceptual intrinsic link to construct the concept lattice.The construction of the concept lattice includes three main steps:
1.Data extraction
First,we obtain topic-relevant academic papers from the database CNKI(i.e.,China National Knowl‐edge Infrastructure)as data sources,and extract titles,abstracts,and keywords from the academic papers as a candidate set of domain terminologies.Then,we use the word segmentation technology to find the core vocabulary of the field and count the number of occurrences of each word.
2.Formal context creation
A formal context is also called a“Document–Terms”matrix,and can be defined as a triple:F=(Documents,Terms,Relation).Documents,Terms,and Relation represent the document collection,ter‐minology,and relationship between documents and terms,respectively.
3.Concept lattice construction
The concept lattice is a Hasse graph(Zhu et al.,2017).Each node in the concept lattice is a conceptC(Denotation,Connotation),where Denotation repre‐sents the extension of conceptCand Denotation∈Documents,and Connotation represents the intension of conceptCand Connotation∈Terms.Generally,one can use the tool ConExp(https://sourceforge.net/projects/conexp/)to construct the concept lattice semiautomatically,that is,to generate the Hasse graph.
Example 1(Construction of the concept lattice)First,we obtain five academic papers related to the topic of“rainstorm disaster”from CNKI.The selected field terms are Terms 1–8(Term 1,rainstorm disaster;Term 2,disaster management;Term 3,emergency alert level;Term 4,weather monitoring;Term 5,hy‐draulic engineering;Term 6,city waterlogging;Term 7,floodwater;Term 8,landslide).According to the above method,we build binary relations of the formal con‐text(Table 1)and the Hasse graph of the concept lattice(Fig.1).
Table 1 Binary relations of the formal context
In Fig.1,Terms 1–8 represent the attribute sets and Docs 1–5 represent the object sets.The upper half circle represents the attribute,and the lower half represents the object.If the attribute part of a node is blue,it means that there is a new attribute linked to the node.If the object part of a node is black,it means that a new object is linked to the node.The attribute set of each concept node is the sum of all the attributes at the upper level of the node(inheriting the parent concept attributes),and the object set is the sum of all the objects at the lower level of the node(covering the sub-concept objects).For example,in the“Doc 1,Term 2”node in Fig.1,its attribute set is{Term 1,Term 2},and the object set is{Doc 1,Doc 2,Doc 3}.As reported by Rios-Alvarado et al.(2013),a hyponym is defined as a word of more specific meaning than a general or superordi‐nate term,and a hypernym is a word with a broad meaning constituting a category under which more specific words fall.Thus,Term 1 is the hypernym of Term 2,and Term 2 is the hyponymy of Term 1.The hyponymy of other terms in the figure can be ob‐tained similarly.
Fig.1 Hasse graph of the concept lattice that corresponds to the relations in Table 1(References to color refer to the online version of this figure)
After the concept lattice is built,we use the on‐tology web language(OWL)(http://www.w3.org/TR/owl-features/)to formalize the concept hierarchy.Each term is defined as a class,and the relationship between terms is defined as the hyponymy relation of the class.Generally,the development tool Protégé(https://protege.stanford.edu/)can be used to write and implement visualizations of the OWL.Ontology is a formal and explicit specification of a shared conceptualization(Gruber,1995).This means that ontology explicitly defines the rich rela‐tions between concepts.The key component in the ontology is the hierarchy of concepts.Domain ontol‐ogy is a formal description of the background knowl‐edge in a specific field.In a semantic sense,the iden‐tification of hypernymy or hyponymy relations be‐tween words is mandatory for building a hierarchy of concepts.In this study we build a domain ontology based on the topic of rainstorm disaster,where the hi‐erarchy of concepts is built based on hyponymy in the concept lattice and the expert experience.Also,for the ontology construction,readers can refer to WordNet and Chinese Classified thesaurus(http://cct.nlc.cn/login.aspx).In the following,we give two definitions based on the domain ontology:
Definition 1(Global ontology) Global ontology provides a relatively complete semantic model,which includes all related entities(concepts)in a specified domain and basic knowledge hierarchy among entities for sharing and reusing characteristics(Fig.2).
Definition 2(Local ontology) Local ontology is a special domain ontology whose topic is a reuse of a concept separated from the global ontology,and con‐sists of all sub-concepts and the hierarchy relations under this concept of the global ontology(Fig.2).A local ontology can be reused by adding,deleting,modi?fying,and other operations.
Fig.2 Schematic of the global ontology and local ontology
Example 2We construct a global ontology(shown in Fig.S1 in the supplementary materials)on the topic of rainstorm disaster according to the above men‐tioned method.The constructed global ontology in‐cludes 50 concepts and a six-level hierarchical struc‐ture.We use the concepts of disaster management,secondary disaster,and rainstorm level,separated from the rainstorm disaster global ontology,to build three local ontologies.
The above approach for building an ontology of the topic rainstorm disaster unifies and structures the description of the domain background knowledge,including prediction,early warning,disaster level,and related emergency management knowledge.In this paper we propose the idea of constructing a global ontology and generating some local ontologies sepa‐rated from the global ontology for the first time,which makes full use of the ontologies to describe the topics.In this study,the main aim of constructing ontology is to calculate the concept semantic similarity in crawl‐ing,while the comprehensive applications of the global ontology and local ontology in the crawling process are to make the webpages be fully analyzed to reduce the omission of topic-relevant webpages and effec‐tively prevent the topic drifting problem.
In the ontology structure,the five attribute rela‐tionships between two conceptsC1andC2can effec‐tively quantify the similarity of concepts,including semantic distance(IFDis),concept density(IFDen),concept depth(IFDep),concept coincidence degree(IFCoi),and concept semantic relationship(IFRel).The definitions of these five attribute relationships can be found in Dong et al.(2020).
Based on the above five factors,the semantic similarity Sem(C1,C2)between conceptsC1andC2is calculated as follows:
where the adjustment factors satisfya+b+c+d+e=1.In this study,they are set asa=0.7,b=0.04,c=0.11,d=0.03,ande=0.12,according to the results of many experiments.Suppose that GTK=(gtk1,gtk2,...,gtkr)indicates the vector of the topic word sets,whererrepresents the number of topic words in the global ontology.WGTKis the semantic weight vector of the topic words corresponding to the global ontology,andwgtkiindicates the weight of theithtopic word gtki.Thus,if the topic of a global ontology is GC,the method of calculatingWGTKis as follows:)
Suppose that the topics ofklocal ontologies are LC1,LC2,...,LCk.WLTKindicates the semantic weight vector of the topic word sets corresponding toklocal ontologies.WLTKiindicates the semantic weight vector of the topic words corresponding to theithlocal ontology.indicates the weight of thejthtopic word in theithlocal ontology.LTK=(LTK1,LTK2,...,LTKk)indicates a vector of topic word sets correspond‐ing toklocal ontologies.indicates the topic word set of theithlocal ontology,andNiindicates the number of topic words in theithlocal ontology.Thus,the method of calculatingWLTKis as follows:
This section introduces the topic-relevance calcu‐lation methods.We use a vector space model(VSM)to calculate the topic relevance of webpage text(Farag et al.,2018;Jia et al.,2021),and propose a CPEM for predicting the priority(topic relevance)of the un‐visited hyperlink.
In a search engine,the regular expression(RE)method(Colazzo et al.,2013)uses some metacharacters to compose information that meets certain rules.The process of using RE to match words directly is very slow.Therefore,in this study we use a word segmentation tool with open source,IK Analzyer(IK)(Wang and Meng,2014),to process word segmentation by loading custom extending dictionaries and deacti‐vating dictionaries.
A webpage is an HTML file composed of many elements and tags(Patel and Schmidt,2011).The topic word appears in different tags with different influences.We assign different weightswjto different labels(Liu JF et al.,2019a).We choose main tags from HTML and divide them into five groups,as shown in Table 2.
Table 2 Division of labels and their weights
We now describe the vectorization of webpage text.First,remove all noise in the webpage text.Then,after matching or segmenting the extracted content,count the term frequency(TF)of each topic word.The webpage text may be represented as a TF vectorDTF=(TF1,TF2,...,TFn),wherenis the number of topic words.Considering the weight of different tags extracted from HTML,the webpage text can be rep‐resented as a TF vectorDTF=((TF1,1,TF1,2,...,TF1,J),(TF2,1,TF2,2,...,TF2,J),...,(TFn,1,TFn,2,...,TFn,J)),where TFi,jrepresents the TF of theithtopic word in thejthposition(group)of the webpage text,andJrepresents the size of tag groups(here,J=5).Generally,if a feature word appears frequently in a text(i.e.,its TF is large),this feature word should be important and has good classification ability.To take full ac‐count of the importance of tag information,we use the following model to calculate the weightwdkjof theithtopic word in the webpage feature set DK={dk1,dk2,...,dkn}:
where tfi,jrepresents the normalized TF of theithtopic word at thejthposition(group)of the webpage text,max TFi,jrepresents the maximum TF of theithtopic word occurring at all positions,andwjrepresents the weight of thejthtag group.
Some previous focused crawler algorithms ignore the impact of semantics on crawlers,and consider only TF to calculate the topic relevanceR(P)of web‐pageP:
Here,we use the VSM method to calculate the topic relevanceR(P)of webpageP:
whereWTK=(wtk1,wtk2,...,wtkn) represents the seman‐tic weight vector of topic words,withwtkiindicating the weight of theithtopic word in the topic word set TK={tk1,tk2,...,tkn},andWDK=(wdk1,wdk2,...,wdkn)represents the feature weight vector of a webpage,withwdkiindicating the weight of theithtopic word in the webpage feature set DK={dk1,dk2,...,dkn}.
VSM is a well-known measure of cosine and transforms a language problem into a mathematical problem.The cosine similarity between two vectors is considered as the similarity of the text related to the given topic.When the angle between two vectors is equal to 0o,the relevance between them is maxi‐mum and equals 1,indicating that they are the most relevant.When the angle is equal to 90o,the relevance is minimum and equals 0,indicating that they are irrelevant.We set thresholdσto determine whether the webpage is related to the topic.IfR(P)is greater thanσ,webpagePis considered to be related to the topic;otherwise,webpagePis regarded as irrelevant to the topic.
For webpageP,the traditional method of calcu‐lating the PR value is as follows:
whered=0.85,mrepresents the number of in-links ofPin the crawled webpage set,PR(Pi)represents the PR value of theithin-link of webpageP,andC(Pi)represents the number of out-links of theithin-link of webpageP.
In Eq.(7),the larger the number of in-links ofPand the higher the average importance of the out-link webpage,the higher the importance ofP(i.e.,the higher the PR value).This importance has nothing to do with the topic and easily leads to topic drifting.To overcome this topic drifting problem,the topic rele‐vance of anchor text is introduced in the calculation of the PR value.In Eq.(8),except for the number of
i
n-links and the average importance of the out-link webpages,the higher the topic relevance of the in-link anchor text,the higher the importance ofP(i.e.,the larger the PR value).
whereωis the adjustment factor and is set to 0.6,andR(Ai)represents the topic relevance of anchor textAiof theithin-link ofP(Section 4.3).Assume all crawled webpages as the entire Internet.Therefore,the PR value of each webpage is constantly updated.
The anchor text usually has only a few words or phrases,but can clearly describe the main idea of the webpage to which the hyperlink points.If a topic word frequently appears in a certain anchor text and rarely appears in the other anchor texts,this topic word should be important and has a strong distin‐guishing ability.According to TF×IDF(IDF is short for inverse document frequency),the calculation for‐mula of the weightwakiof theithtopic word in an an‐chor text is as follows:
wherefirepresents the TF of theithtopic word in the anchor text,nis the number of topic words,Nrepre‐sents the number of crawled webpages,Nirepresents the number of crawled webpages containing theithtopic word of this anchor text,ands>1.By consider‐ing the cosine betweenWTKandWAK,the topic rele‐vanceR(Al)of anchor textAlis computed by
whereWAK=(wak1,wak2,...,wakn) represents the feature weight vector of the anchor text,andwakirepresents the weight of theithtopic word in the anchor text fea‐ture set AK={ak1,ak2,...,akn}.
In addition,the relevance of the next webpagePlto which hyperlinklpoints affects the priority of hyperlinkl.LetWUK=(wuk1,wuk2,...,wukn) denote the feature weight vector of the next webpagePlto which hyperlinklpoints,wherewukirepresents the weight of theithtopic word in webpagePl,and UK={uk1,uk2,...,ukn}represents the webpage feature set ofPl.According to Eq.(6),topic relevance ofPlis as follows:
To evaluate the topic relevance(or priority)of the unvisited hyperlinkl,we propose a CPEM which involves the topic relevanceR(Al)of the anchor textAlof linkl,the sum of topic relevanceR(Pi)of the webpages at which linklis located,the PR(Pl)value of webpagePlto which hyperlinklpoints,and the topic relevanceR(Pl)of webpagePl.The compre‐hensive priority Priority(l)of the unvisited hyperlinklis
whereα+β+γ+θ=1.
We predict the priority of each hyperlink by Eq.(12)and set the threshold toηfor filtering unvis‐ited hyperlinks.If Priority(l)>η,we add linklinto an ordered link-waiting queueQwaccording to the prior‐ity;otherwise,we discard it.Generally,an OPS strat‐egy is applied to select next hyperlink to crawl fromQw.However,it is easy for this strategy to cause the search to fall into local optima.To avoid this,we intro‐duce an SA algorithm to select the next hyperlink by optimizing maximum Priority(l)to enhance the global search performance of focused crawlers.
In this section,we first introduce an SA algo‐rithm and give its improved version for selecting the next link in the focused crawler.Then,the framework of the focused crawler strategy based on the SA(FCSA)algorithm is proposed.Finally,by incorporat‐ing MFSLG and CPEMinto FCSA,the focused crawler strategy based on the ontology and SA(FCOSA)is presented.
The SA algorithm has been widely used in com‐binatorial optimization problems(Liu JF et al.,2010).The idea of SA originates from the annealing process of solid matter in physics.It is a random optimiza‐tion algorithm and generally starts from a high initial temperature.It iterates from an initial solution and updates the solution by an effective neighborhood strategy.For the newly generated solution,it uses the Metropolis sampling criterion to determine whether it is accepted.With the stability of sampling,the sys‐tem gradually begins to cool down.The above pro‐cess is repeated until the algorithm obtains the opti‐mal solution to the problem or reaches the preset min‐imum temperature.
We introduce the SA procedure for selecting the next link in the focused crawler.This gives the suboptimal links the opportunity to be selected and pre‐vents the crawler from falling into local traps.At the same time,it helps the crawler extend the search range and find better retrieval paths by traversing the tun‐nel.The detailed steps of the SA process for selecting a link are outlined in Algorithm 1.The algorithm begins from the header-link from the link-waiting queueQw,and selects the next link fromQwby the roulette method.For the generation of seed URLs ofQw,there are three ways(Liu JF et al.,2019a):(1)manual method—collecting the seed URLs from experts in the field;(2)auto-generated method—entering the specified topic words,for example,rainstorm disaster and disaster management,into regular search engines(e.g.,Baidu and Google)in sequence,and selecting the URLs listed in the preceding pages as seed URLs;(3)mixed mode—combining the manual method and auto-generated method.We use the mixed mode to generate seed URLs ofQw.In the executing pro‐cess of the SA algorithm,there are some important parameters which need to be set,including the initial annealing temperatureT,the controlled annealing speed parameterC,and the number of inner cyclesMduring the annealing process.The parametersT,C,andMare all empirical values.The parameterCcon‐trols the rate of temperature dropping.When the tem‐peratureTtends to 0,the possibility of accepting suboptimal links also tends to 0.
To shorten the runtime of the algorithm,we intro‐duce the target completion rate(com-rate)to adjust the temperature.Suppose that the target of the focused crawler is to download 15 000 webpages from the In‐ternet,and that the number of current downloaded webpages is DP.Thus,the target completion rate is de‐fined as com-rate=DP/15 000.With the increase of the downloaded DP,the initial temperature of the SAalgo‐rithm is reduced to accelerate the convergence.The improved SA(ISA)for selecting a link is obtained by modifying step 1 in the above SA algorithm by the following steps:Compute the target completion rate com-rate.If com-rate=0,setT=1;otherwise,setT=T×com-rate.SetM=10 andC=0.9.
Algorithm 1 SA(Q w)Input:Q w Output:a link 1:Set T=1,M=10,and C=0.9 2:Choose the header-link from Q w and mark it as current 3:Set q=1 4:Select randomly a link from Q w by the roulette method,and mark it as next 5:If random[0,1]<exp[(Priority(next)-Priority(current))/T]then Accept next for the next link,and let current=next Else Do not accept next for the next link,and keep current unchanged End if 6:If q>M then go to step 7 Else let q=q+1,and go to step 4 End if 7:Let T=C×T 8:If T≤0.01 then Output current Else go to step 3 End if
This subsection introduces the specific process of FCSA.FCSA starts with the first link ordered inQw.For each obtained webpage,we pre-treat it and extract the webpage text features and subsequently calculate the relevance of the webpage by Eq.(6)to judge whether the obtained webpage is related to the topic.Then,all child-links included in the webpage are extracted and their comprehensive priorities are calculated by Eq.(12).Once the comprehensive priority of a child-link exceeds the threshold valueη(η>0),the child-link is inserted intoQw.To overcome the defect of the greedy algorithms such as the OPS strategy,we use the SA algorithm(Section 5.1)to implement the link selection operation.The iterative steps of FC‐SA are outlined in Algorithm 2.
In FCSA,the fact that the computation of topic relevance relies simply on TK andWTKhas limita‐tions in controlling the crawler search(TK andWTKconsist of several simple topic words with semantic weights).To prevent the focused crawler from access‐ing irrelevant links and to lead it to access as many topic-relevant links as possible,we make full use of ontology features by constructing a global ontology and three local ontologies.This method will make topic search more extensive and can retrieve more topicrelevant hyperlinks.
FCOSA is divided into two groups.One is a fo‐cused crawler strategy based on only global ontology(FCOSA_G),and the other is a focused crawler strat‐egy based on both the global ontology and local on‐tology(FCOSA_LG)(Algorithm 3).In FCOSA_LG,we use the local ontology for the first filtering,where LTK is used to compute the topic relevance of the page(to which each child-link points),and the global ontology for the second filtering,where GTK is used to compute the comprehensive priority of each saved child-link.The FCOSA_G algorithm is obtained by deleting step 9 in the FCOSA_LG algorithm.In both FCOSA_G and FCOSA_LG algorithms,we use the ISAstrategy to select a link.
Algorithm 2 Focused crawler strategy based on the simulated annealing(FCSA)Input:seed URLs Output:downloaded webpages 1:Add the seed URLs to Q w.Setσandη.Let DP=0 and LP=0 2:Select the first link ordered in Q w,and mark it as Headerlink.The webpage to which Header-link points is marked as Current-page 3:Remove Header-link from Q w and download the Currentpage 4:Let DP=DP+1 5:Remove the noise and extract tag information(Table 2)from the Current-page,and gain the feature vector DK of the Current-page 6:Calculate the topic relevance R(Current-page)of the Current-page text according to Eq.(6)7:If R(Current-page)>σthen Download the Current-page,and let LP=LP+1 End if 8:Extract all the child-links and the corresponding anchor texts from Current-page,and remove repeated links//For the irrelevant webpage,the purpose of extracting//all the child-links in the page is to provide a chance for//the crawler to pass through the tunnel 9:For i=1 to k do//k is the size of the child-links Calculate the comprehensive priority of child-link i according to Eq.(12)If Priority(child-link i)>ηthen Insert child-link i into Q w Else give up child-link i End if End For 10:Recalculate PR values of all downloaded webpages and update the comprehensive priority values of all links in Q w 11:If Q w is not empty then Let l=SA(Q w)Insert link l into the head of Q w Else the algorithm ends End if 12:If DP<15 000 then Go to step 2 Else the algorithm ends End if
To evaluate the performance of the proposed focused crawler algorithms FCSA,F(xiàn)COSA_G,and FCOSA_LG,we implement BFS(Vidal et al.,2006),OPS(Rawat and Patil,2013),the focused crawler based on the web space evolutionary(WSE)algo‐rithm(Liu JF et al.,2019b),the focused crawler based on the improved tabu search(ITS)algorithm(Liu JF et al.,2020),the focused crawler based on the ontology and ITS algorithm(On-ITS)(Liu JF et al.,2020),and three algorithms proposed in this study.All algorithms are compiled in the Java language and run on a personal computer with Intel Pentium G3260,3.30 GHz processor,and 4.0 GB RAM.We execute a series of experimental tests and analyze the computa‐tional results.
Algorithm 3 FCOSA_LG Input:seed URLs Output:downloaded webpages 1:Add seed URLs to Q w.Setσ,φ,andη.Let DP=0 and LP=0 2:Select the first link ordered in Q w,and mark it as Headerlink.The webpage to which Header-link points is marked as the Current-page 3:Remove Header-link from Q w and download the Currentpage 4:Let DP=DP+1 5:Remove the noise and extract tag information(Table 2)from the Current-page.Use IK for word segmentation and gain the feature vector DK of the Current-page 6:Calculate the topic relevance R(Current-page)of the Current-page text according to Eq.(6)7:If R(Current-page)>σthen Download the Current-page and let LP=LP+1 End if 8:Extract all the child-links and the corresponding anchor texts from the Current-page,and remove repeated links//Local ontology is used to implement the first filtering//of child-links 9:For i=1 to k1 do//k1 is the size of the child-links For j=1 to k2 do//k2 is the number of local ontologies,and k2=3 in this//study Calculate the topic relevance Ri,j of child-link i based on the j th local ontology according to Eq.(11):Ri,j=Sim(LTK j,UK)If Ri,j≥φthen//φis a positive parameter Save child-link i break Else if Rij<φand j=k2 Discard child-link i End if End for End for//Global ontology is used to implement the second fil?//tering of the saved child-links 10:For j=1 to k3 do//k3 is the number of the saved child-links Calculate the comprehensive priority of the child-link j according to Eq.(12),where TK is replaced by GTK=(gtk1,gtk2,...,gtk r)If Priority(child-link j)>ηthen//ηis a positive parameter Insert child-link j into Q w Else give up child-link j End if End for
11:Recalculate PR values of all the downloaded webpages and update the comprehensive priority values of all links in Q w 12:If Q w is not empty then Let l=ISA(Q w)//Return link l Insert link l into the head of Q w Else the algorithm ends End if 13:If DP<15 000 then Go to step 2 Else the algorithm ends End if
The performance metric indices of crawler algo‐rithms generally include recall rate(Recall)and accu‐racy(Accuracy):
where TP represents the total number of topic-relevant webpages in the whole Internet,LP represents the number of downloaded topic-relevant webpages,and DP represents the total number of downloaded web‐pages.Since it is difficult to count the total number of topic-relevant webpages in the whole Internet,we choose Accuracy as the standard for comparison.
In addition,we use the average relevance(AR)and standard deviation(SD)of the downloaded web‐pages and the downloaded topic-relevant webpages to analyze the results of the algorithms.AR and SD of the downloaded topic-relevant webpages are cal‐culated using Eqs.(15)and(16),respectively.AR and SD of the downloaded webpages are calculated using Eqs.(17)and(18),respectively.
where ARLPdenotes the average relevance of all down‐loaded topic-relevant webpages LP,and SDLPis the standard deviation of all downloaded topic-relevant webpages compared to ARLPand is used to measure the spread of the topic relevance of all downloaded topic-relevant webpages LP.ARDPand SDDPhave similar meanings.
As mentioned,we use a mixed mode to generate seed URLs(Section 5.1).The total number of seed URLs(shown in Table S1 in the supplementary mate‐rials)is 30.The settings of experimental parameters or strategies for different algorithms are listed in Table 3.In particular,we use the same TK to evalu‐ate the relevance of webpages,and the pre-defined thresholdσis set to 0.7(Liu WJ and Du,2014)in all experiments.The thresholdσis used to measure whether the webpage is a topic-relevant page.In addi‐tion,the experimental environments and initializa‐tion conditions of different algorithms are the same.The number of retrieved webpages starts with 100,and then 500,and progressively is increased by 500.All algorithms end when DP reaches 15 000 orQwis empty.
Table 4 displays the results of Accuracy,LP,AR,and SD when DP reaches 1000,5000,10 000,and 15 000.Table 4 shows that when DP reaches 1000,OPS finds the optimal results of Accuracy,LP,ARLP,and SDLP,ITS achieves the optimal value of ARDP,and FCOSA_LG finds the lowest SDDP.When DP reaches 5000,OPS finds the optimal values of Accu‐racy,LP,and SDLP,while FCOSA_LG obtains the op‐timal results of ARLP,ARDP,and SDDP.When DP reaches 10 000,F(xiàn)COSA_LG obtains the highest Ac‐curacy,LP,ARLP,ARDP,and the least SDDP.For the FCOSA_LG algorithm,when DP reaches 15 000,its ARLPis greater than 0.8 and ARDPexceeds 0.75.In addition,F(xiàn)COSA_LG always has the lowest SDDP,while the OPS algorithm has the smallest SDLP.This can be explained by the fact that FCOSA_LG always extracts child-links whose comprehensive priority exceeds the preset thresholdη,and that OPS always downloads the most relevant webpages in the pro‐cess of crawling.
Table 3 Experimental strategies and parameters of different algorithms
The results of Accuracy and LP reflect the ability of the crawlers to retrieve absolute quantities of topicrelevant pages.When DP reaches 15 000,Accuracy of the FCOSA_LG algorithm is about 0.03 higher than those of WSE and On-ITS,about 0.13 higher than that of ITS,about 0.18 higher than that of FCSA,about 0.1 higher than that of FCOSA_G,and far higher than those of BFS and OPS.As a whole,the FCOSA_LG crawling algorithm stabilizes in a higher Accuracy range,and is superior to the other algo‐rithms.The results of ARLP,SDLP,ARDP,and SDDPreflect the stability of crawlers.When DP reaches 15 000,compared with FCSA,F(xiàn)COSA_G,and the other algorithms,the FCOSA_LG algorithm obtains the best results of ARLP,ARDP,and SDDP.This indicates that the proposed FCOSA_LG algorithm is more conducive to the global retrieval of topic-relevant webpages.This also proves that the combination of the SA algorithm and the MFSLG strategy to guide crawlers to filter hyperlinks can improve the stability of focused crawlers.Table 4 lists the runtime of dif‐ferent algorithms when DP=15 000.As can be seen,BFS and OPS have a slightly short retrieval time,but their results are far worse than those of the proposed crawlers.
Table 4 Experimental results of different algorithms about evaluation indices of Accuracy,LP,ARLP,SDLP,ARDP,SDDP,and retrieval time and when DP reaches 1000,5000,10 000,and 15 000
In addition,the Friedman test(Derrac et al.,2011)is a non-parametric statistical test and is commonly used to compare the performances of two or more algorithms.Here,when DP=15 000,the results obtained by eight different algorithms for the three representa‐tive indices Accuracy,ARDP,and SDDPare converted to ranks.The best performing algorithm for each in‐dex should have the rank of 1,the second best ranks 2,and so on.Table 5 depicts the ranks of eight algo‐rithms according to these three evaluation indices by the Friedman test.As can be seen,F(xiàn)COSA_LG algo‐rithm with the minimum average for three indices is the best performing algorithm of the eight algo‐rithms,On-ITS ranks second,WSE ranks third,and BFS is the worst.
From Tables 4 and 5,it is not hard to see that on the whole,F(xiàn)COSA_LG overmatches FCOSA_G and FCOSA_G is superior to FCSA.This further verifies the effectiveness of the improved strategies in FCOSA_LG and FCOSA_G.For selecting topic-relevant webpag‐es,the multiple-filtering strategy based on the global ontology and local ontology in FCOSA_LG outper‐forms the simple filtering strategy based on the global ontology in FCOSA_G.For topic description,the do‐main ontology topic model used in FCOSA_LG and FCOSA_G outperforms the topic model in FCSA.
Table 5 Friedman test ranks of eight algorithms for the three representative evaluation indices of Accuracy,ARDP,and SDDP when DP reaches 15 000
The parameter values in the focused crawlers are important.We run FCOSA_LG algorithm with different parameters to analyze the experimental re‐sults.FCOSA_LG algorithm has three important pa‐rameters,σ,φ,andη,which denote the threshold of the topic relevance of the webpage,the threshold of the topic relevance of the child-link based on the local ontology,and the threshold of the comprehen‐sive priority of the hyperlink,respectively.To test the effects of the thresholdsσ,φ,andηon the experi‐ments,we select some representative values to exe‐cute FCOSA_LG algorithm.The algorithm is run 20 times under each parameter independently.
The thresholdφis set to 0.05,0.10,0.15,0.20,and 0.25,ηis set to 0.10,0.15,and 0.20,andσis set to 0.5,0.6,and 0.7.Table 6 records the Accuracy and LP of FCOSA_LG algorithm based on different thresh‐olds.The larger the values ofφandη,the better the Accuracy achieved by the algorithm.In these three cases,when bothφandηare set to 0.15,the algo‐rithm can download 15 000 pages and the Accuracy and LP are the best.If the values of thresholds are too large,their filtering capacities are excessive and the algorithm ends prematurely.In this study,φandηare both set to 0.15.From Table 6,it can be seen that when the thresholdσis smaller,F(xiàn)COSA_LG algorithm can obtain better Accuracy and higher LP.However,some actual irrelevant webpages are counted in LP.Therefore,in this study,σis fixed to 0.7 according the threshold value in Liu WJ and Du(2014).
In addition,five independent experimental re‐sults of FCOSA_LG algorithm whenσ=0.7,φ=0.15,η=0.15,and DP=15 000 are shown in Table 7.The aver‐age Accuracy obtained from five independent crawl‐ers is 0.7565.ARLPand ARDPobtained from the five independent runs have an average value of 0.8056 and 0.7235,respectively.The average values of SDLPand SDDPare low,i.e.,0.0374 and 0.1492,respectively.Through the analysis of the above experiments,whenφ=0.15 andη=0.15,F(xiàn)COSA_LG algorithm performs the best and has good stability.
Table 6 Computational results obtained by FCOSA_LG algorithm with different threshold sizes ofσ,φ,andηwhen DP=15 000
Table 7 Experimental results of FCOSA_LG algorithm over five independent times whenσ=0.7,φ=0.15,η=0.15,and DP=15 000
Traditional approaches of crawlers generally pursue the maximum number of retrieved webpages,regardless of the content of the webpages.However,in most real-world searches,crawlers have an explicit target theme.It is also noticeable that most existing methods guide the search process using the domain conceptual hierarchy to describe the topic benchmark model and the greedy strategy to control the search direction in the Web.In contrast,we proposed a new strategy based on a focused crawler which is related to the topic of rainstorm disasters and gives more weight to the judgment of webpage topic relevance.We constructed the ontology as the topic benchmark model,and converted the problem of evaluating hy‐perlinks into a single-objective optimization problem.We proposed three algorithms,F(xiàn)CSA,F(xiàn)COSA_G,and FCOSA_LG,to direct the search of the focused crawlers.To improve the accuracy of the algorithms and prevent the phenomenon of topic drifting,we proposed a novel CPEM for unvisited hyperlinks and a novel multiple-filtering strategy based on MFSLG to find more topic-relevant webpages.
Experimental results on rainstorm disaster showed that the proposed FCSA,F(xiàn)COSA_G,and FCOSA_LG algorithms are effective in implementing the focused crawler.The FCOSA_LG algorithm achieved state-ofthe-art performance and was capable of finding more topic-relevant webpages.The crawler algorithms pro‐posed here can effectively obtain relevant knowledge about rainstorm disasters from the network,and pro‐vide a reference plan for disaster warning and pre‐ventive measures.In addition,crawlers can promote the construction of ontology knowledge in the domain of rainstorm disasters.
The main challenges of the proposed focused crawlers based on ontology and the annealing strat‐egy include automated ontology construction and the design of an annealing strategy.Although a semiautomated method of constructing ontology based on FCAwas proposed in this paper,the automated ontol‐ogy construction which involves automatic extraction of topic words and their relationships still needs fur‐ther research.In addition,because of the complexity of the network structure,crawlers easily fall into a loop trap,and it wastes computation resources.Al‐though the SAalgorithm can partially solve this prob‐lem,it is affected by the temperature cooling rate.If the cooling rate is too low,the crawler will spend more time.If the cooling rate is too high,the crawler may skip the process of obtaining the optimal crawl‐ing direction.Therefore,further investigation is needed to ensure the global nature of the focused crawler.In addition,the topic relevance evaluation of the hyper‐links in CPEM is weighted by multiple evaluating indi‐cators,and this is regarded as a single-objective op‐timization problem.However,the reasonable weight factors are hard to determine.In the future work,we plan to continue research on automated ontology construction,design of the annealing strategy,and the applications of multi-objective intelligent optimi‐zation algorithms in focused crawlers.We believe this work can achieve good results.
Contributors
Jingfa LIU designed the research.Fan LI drafted the paper,implemented the software,and performed the experi‐ments.Ruoyao DING and Zi’ang LIU revised and finalized the paper.
Compliance with ethics guidelines
Jingfa LIU,F(xiàn)an LI,Ruoyao DING,and Zi’ang LIU declare that they have no conflict of interest.
List of supplementary materials
Fig.S1 A global ontology structure about the topic of rainstorm disaster
Table S1 Seed URLs
Frontiers of Information Technology & Electronic Engineering2022年8期