A new focused crawler using an improved tabu search algorithm incorporating ontology and host information


Jingfa LIU, Zhen WANG†‡², Guo ZHONG, Zhihe YANG

¹School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China

²China Unicom Central South Research Institute, Changsha 410000, China

Abstract: To solve the problems of incomplete topic description and repetitive crawling of visited hyperlinks in traditional focused crawling methods, in this paper, we propose a novel focused crawler using an improved tabu search algorithm with domain ontology and host information (FCITS_OH), where a domain ontology is constructed by formal concept analysis to describe topics at the semantic and knowledge levels. To avoid crawling visited hyperlinks and expand the search range, we present an improved tabu search (ITS) algorithm and a strategy of host information memory. In addition, a comprehensive priority evaluation method based on Web text and link structure is designed to improve the assessment of topic relevance for unvisited hyperlinks. Experimental results in both the tourism and rainstorm disaster domains show that the proposed focused crawlers outperform traditional focused crawlers on different performance metrics.

Key words: Focused crawler; Tabu search algorithm; Ontology; Host information; Priority evaluation

1 Introduction

Currently, Internet resources are growing explosively, the data update speed is increasing, and users' needs for Web information are becoming more personalized. Traditional search engines can no longer satisfy the need for customized information, so focused crawlers (Chakrabarti et al., 1999; Deng, 2020) have been presented to collect topical information. Compared with traditional crawlers, a focused crawler can retrieve larger quantities of higher-quality topic-relevant webpages. Therefore, in recent years, focused crawlers have attracted the attention of many scholars (Yu and Liu, 2015; Hosseinkhani et al., 2021).

At present, focused crawlers face three main issues: topic description, evaluation of the topic relevance of unvisited hyperlinks, and design of crawling strategies. The methods of topic description mainly include topic words (Fei and Liu, 2018), context graphs (CGs) (Du et al., 2014; Guan and Luo, 2016), and domain ontology (Khan and Sharma, 2016; Rani et al., 2017). Topic words are collected through the experience of domain experts, but they suffer from semantic ambiguity. The construction of CGs relies on the user's historical crawling information and may deviate from the topic if the user lacks topic-relevant knowledge. Because ontology can describe a specific domain at the semantic and knowledge levels, most semantic-based crawlers (Khan and Sharma, 2016; Lakzaei and Shamsfard, 2021) use ontology to describe topics.

The methods of evaluating unvisited hyperlinks include hyperlink-structure-based methods and webpage-content-based methods. Hyperlink-structure-based methods, such as the PageRank algorithm (Yuan et al., 2017) and the hyperlink-induced topic search (HITS) algorithm (Asano et al., 2007), focus on the structure itself and ignore topic relevance, which may cause crawlers to suffer from "topic drifting." Webpage-content-based methods mainly evaluate the priorities of unvisited hyperlinks by calculating and analyzing the relevance of the webpage text and anchor text; examples include the fish-search algorithm (de Bra et al., 1994) and the shark-search algorithm (Prakash and Kumar, 2015). These algorithms ignore the characteristics of the global hyperlink structure and perform well only when searching nearby webpages. Most researchers ignore the impact of combining these two kinds of methods, and the considered metrics are not sufficiently comprehensive.

The crawling strategy determines the order in which hyperlinks with different priorities are visited. The traditional algorithms mainly include breadth-first search (BFS) (Li et al., 2015) and optimal priority search (OPS) (Rawat and Patil, 2013). BFS neglects the accessing order of hyperlinks during crawling, so it has the worst performance. OPS takes only the best priority value into account, and its greedy strategy leads to a greater possibility of falling into the choice of a hyperlink with no prospects. To avoid the inherent flaws of greedy algorithms, many scholars have proposed intelligent focused crawling methods based on metaheuristic strategies. For instance, He et al. (2009) proposed a focused crawling strategy based on the simulated annealing (SA) algorithm, allowing crawlers to obtain suboptimal hyperlinks for expanding the search range. Yan and Pan (2018) considered users' browsing behavior to optimize genetic operations and proposed a heuristic focused crawling strategy based on an improved genetic algorithm (GA). Tong (2008) considered the distribution characteristics of website resources and proposed a heuristic focused crawling strategy based on an adaptive dynamic evolutionary particle swarm optimization (PSO) algorithm. Xiao and Chen (2018) analyzed the priority of crawlers in global crawling and proposed a focused crawling strategy based on the gray wolf optimization (GWO) algorithm. Recently, Liu JF et al. (2022a) proposed a heuristic focused crawling strategy combining ontology learning and the multiobjective ant colony algorithm (OLMOACO). In OLMOACO, a method of the nearest farthest candidate solution (NFCS) combined with fast nondominated sorting is used to select a set of Pareto-optimal hyperlinks and guide the crawlers' search directions. Liu JF et al. (2022b) built a multiobjective optimization model for evaluating unvisited hyperlinks based on Web text and link structure, and proposed a focused crawling strategy combining the Web space evolution algorithm and domain ontology (FCWSEO). Both the OLMOACO and FCWSEO algorithms guide the crawling direction by building a multiobjective optimization model to select the next hyperlink to visit. However, both algorithms tend to crawl the visited hyperlinks under a few hosts, which causes the crawler to converge prematurely.

To overcome the above issues, in this paper, we propose a novel focused crawler using an improved tabu search algorithm with domain ontology and host information (FCITS_OH). The main contributions of this paper are as follows:

1. Two domain ontologies of tourism and rainstorm disaster based on formal concept analysis (FCA) are constructed to describe topics at the semantic and knowledge levels.

2. In the crawling process, an improved tabu search (ITS) strategy with host information is presented to select the next hyperlink, where the modified tabu object and acceptance principles are used to avoid crawling the visited hyperlinks in the focused crawler, and the host information memory of hyperlinks is proposed to prevent the crawler from cycling under a few hosts, which controls the convergence speed of the algorithm.
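3. A comprehensive priority evaluation method based on Web text and link structure is designed to improve the assessment of topic relevance for unvisited hyperlinks.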

2 Topic description based on domain ontology

In this study, we use domain ontology to describe the topic. This section first introduces the construction process of domain ontology based on the FCA method and then computes the topic semantic weighted vector based on domain ontology semantics.

2.1 Ontology construction

FCA (Zhu et al., 2017) is a semiautomatic method of constructing ontology, whose main data structure is the concept lattice. The process of generating a concept lattice is concept clustering, which formalizes the hierarchical relationship between concepts. The detailed steps of ontology construction in this study are as follows: (1) select five keywords for the determined domain and search for the keywords through search engines such as Baidu and Google to obtain the top 50 webpages from each search engine; (2) use the tool IK-Analyzer (Wang and Meng, 2014) to perform word segmentation; (3) extract document sets and term sets that describe the topic; (4) build a document–term matrix, which is input into the tool ConExp (https://sourceforge.net/projects/conexp/) to generate a concept lattice and obtain a Hasse diagram; (5) describe the hierarchical relations among concepts in ontology Web language (OWL) (https://www.w3.org/TR/owl-features/); (6) visualize the ontology with Protégé (https://protege.stanford.edu/).

Applying the above method, we construct a tourism ontology and a rainstorm disaster ontology. The tourism ontology includes seven branches: tourist attractions, tourism purpose, accommodation, service agencies, tourism routes, means of transportation, and tourist. The whole ontology includes 61 concepts and a seven-level hierarchical structure. The rainstorm disaster ontology includes three branches: disaster management, secondary disaster, and disaster grade. The whole ontology contains 50 concepts and a six-level hierarchical structure.

2.2 Topic semantic weighted vector computation

Referring to Liu JF et al. (2022a), we consider five impact factors, including semantic distance (IFDis), concept density (IFDen), concept depth (IFDep), concept coincidence degree (IFCoi), and concept semantic relationship (IFRel), to measure the topic semantic similarity between concepts based on the constructed domain ontology. The calculation formula of the semantic similarity between concept C1 and concept C2, Sem(C1, C2), is given in Eq. (1).
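A sketch of Eq. (1), assuming the weighted linear combination implied by the adjustment factors defined below, is

$$\mathrm{Sem}(C_1,C_2)=k_1\,\mathrm{IF_{Dis}}+k_2\,\mathrm{IF_{Den}}+k_3\,\mathrm{IF_{Dep}}+k_4\,\mathrm{IF_{Coi}}+k_5\,\mathrm{IF_{Rel}}.\tag{1}$$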

Here, the adjustment factors k1, k2, k3, k4, and k5 are non-negative and satisfy k1+k2+k3+k4+k5=1. To obtain the topic semantic weighted vector, we first determine a topic concept C, which is tourism or rainstorm disaster in this study. Suppose the topic word vector is T=(t1, t2, ..., tn). Calculate the semantic similarity between each topic word ti (i=1, 2, ..., n) and topic concept C based on Eq. (1) to obtain the corresponding topic semantic weighted vector WT=(wt1, wt2, ..., wtn), where wti (i=1, 2, ..., n) is the weight of the ith topic word ti in T. Thus, the topic semantic weighted vector between topic concept C and topic word vector T is given in Eq. (2).
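A sketch of Eq. (2) consistent with this construction, assuming each weight is the ontology-based semantic similarity between the topic concept and the corresponding topic word, is

$$\boldsymbol{W}_T=(w_{t_1},w_{t_2},\ldots,w_{t_n}),\quad w_{t_i}=\mathrm{Sem}(C,t_i),\ i=1,2,\ldots,n.\tag{2}$$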

3 Topic relevance evaluation of webpages and hyperlinks

We use the vector space model (VSM) (Farag et al., 2018) to calculate the topic relevance of a webpage and propose a comprehensive priority evaluation method for predicting the topic relevance of unvisited hyperlinks.

3.1 Topic relevance of webpages

Most webpages are represented as HTML files, and the content of a webpage is presented in the form of tags. Tags at different positions carry different degrees of importance in the entire webpage. We choose the main tags from HTML files and divide them into five groups. Each tag group is assigned a specific weight Wk (k=1, 2, ..., 5), as shown in Table 1.

Table 1 Division of labels and their weights

We map the webpage text into a webpage feature vector D=(d1, d2, ..., dn) and obtain the corresponding webpage feature weighted vector WD=(wd1, wd2, ..., wdn), where wdi (i=1, 2, ..., n) represents the weight of the ith feature word and is computed by the improved term frequency-inverse document frequency (TF-IDF) (Wu YL et al., 2017). Its expression is given in Eq. (3).
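A minimal sketch of Eq. (3), assuming the improved TF-IDF aggregates the tag-group-weighted normalized term frequencies defined below, is

$$w_{d_i}=\sum_{k=1}^{K}W_k\,\frac{tf_{i,k}}{\max\!f_{i,k}}.\tag{3}$$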

Here, K=5, tf_{i,k} represents the normalized term frequency (TF) of the ith topic word at the kth position (group) of the webpage text, maxf_{i,k} represents the maximum TF of the ith topic word over all tag groups, and Wk represents the weight of the kth group of tags. We adopt VSM (Farag et al., 2018) to calculate the topic relevance R(p) of webpage p. Its expression is shown in Eq. (4).
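Consistent with the cosine-similarity description below, Eq. (4) can be reconstructed as the VSM cosine similarity between the topic semantic weighted vector WT and the webpage feature weighted vector WD:

$$R(p)=\cos(\boldsymbol{W}_T,\boldsymbol{W}_D)=\frac{\sum_{i=1}^{n}w_{t_i}w_{d_i}}{\sqrt{\sum_{i=1}^{n}w_{t_i}^{2}}\sqrt{\sum_{i=1}^{n}w_{d_i}^{2}}}.\tag{4}$$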

VSM is a well-known measure of cosine similarity and transforms a language problem into a mathematical problem. The cosine similarity between two vectors is considered the similarity of the text related to the given topic. When the angle between two vectors is equal to 0°, the relevance between them is the maximum and equals 1, indicating that they are the most relevant. When the angle is equal to 90°, the relevance is the minimum and equals 0, indicating that they are irrelevant. Assume that the threshold of the webpage topic relevance is α. If R(p)>α, then webpage p is considered topic-relevant.

3.2 Topic relevance of anchor text

The anchor text usually has only a few words or phrases, but it is an important resource for predicting the relevance of the webpage to which the hyperlink points. Generally, the TF-IDF model (Wu YL et al., 2017) is used to evaluate the importance of keywords. However, using TF alone to measure the importance of a word in the whole anchor text is not comprehensive. Therefore, we use the improved BM25 model (Wu TY, 2018) to evaluate the importance of keywords in the anchor text. It retains the important IDF indicator of the TF-IDF model and improves the computation of TF. The BM25 algorithm is generally used to evaluate the relevance between words and documents. In this study, we use the BM25 algorithm to obtain the weights of words in the anchor text. The weight wai of the ith topic word in the anchor text is calculated by Eq. (5).
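A sketch of Eq. (5), assuming a BM25-style weight with a base-a logarithmic IDF term (the role of a>1 is an assumption) and the saturation and length normalization defined below, averaged over the m webpages containing the anchor text, is

$$w_{a_i}=\log_a\!\frac{N}{N_i}\cdot\frac{1}{m}\sum_{j=1}^{m}\frac{f_{i,j}\,(k+1)}{f_{i,j}+k\left(1-b+b\,\frac{dl_j}{\mathrm{avgdl}}\right)}.\tag{5}$$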

Here, N is the number of crawled webpages, Ni denotes the number of webpages containing the ith topic word, a>1, m represents the number of webpages containing the anchor text of the considered hyperlink, k=2 and b=0.75 represent adjustment factors, dlj represents the length of the jth webpage (i.e., the number of words) containing the anchor text, avgdl is the average length of all crawled webpages, and f_{i,j} denotes the frequency of the ith topic word in the anchor text located in the jth webpage. After obtaining the anchor text feature weighted vector WA=(wa1, wa2, ..., wan), we calculate the cosine similarity between the topic semantic weighted vector WT and the anchor text feature weighted vector WA to obtain the topic relevance R(Al) of anchor text Al. The topic relevance of anchor text Al is computed by Eq. (6).
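Consistent with the cosine-similarity description above, Eq. (6) can be reconstructed as

$$R(A_l)=\cos(\boldsymbol{W}_T,\boldsymbol{W}_A)=\frac{\sum_{i=1}^{n}w_{t_i}w_{a_i}}{\sqrt{\sum_{i=1}^{n}w_{t_i}^{2}}\sqrt{\sum_{i=1}^{n}w_{a_i}^{2}}},\tag{6}$$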

where A=(a1, a2, ..., an) denotes the anchor text feature vector.

3.3 Improved PageRank value computation

The PageRank algorithm is an essential algorithm for evaluating unvisited hyperlinks. For webpage p, the traditional calculation formula of the PageRank (PR) value is given in Eq. (7).
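A reconstruction of Eq. (7), assuming the classic damped form that matches the variables defined below, is

$$\mathrm{PR}(p)=(1-d)+d\sum_{i=1}^{h}\frac{\mathrm{PR}(p_i)}{C(p_i)}.\tag{7}$$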

Here, d is the damping factor and is set to 0.85, h represents the total number of in-links of webpage p, pi is the ith in-link webpage of webpage p, PR(pi) denotes the PR value of webpage pi, and C(pi) represents the total number of out-links of webpage pi. To avoid the topic drifting of traditional PR calculation, referring to Ma et al. (2016), we integrate the anchor text topic relevance into the PR value calculation and propose an improved PR value calculation method for webpage p, shown in Eq. (8).
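A sketch of Eq. (8), assuming the anchor-text relevance is blended into each in-link's contribution through the adjustment factor ω, is

$$\mathrm{PR}(p)=(1-d)+d\sum_{i=1}^{h}\left[\omega\,\frac{\mathrm{PR}(p_i)}{C(p_i)}+(1-\omega)\,R(A_i)\right].\tag{8}$$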

Here, ω represents an adjustment factor and is set to 0.6 in this study, and R(Ai) represents the topic relevance of the anchor text Ai of the ith in-link of webpage p (Section 3.2).

3.4 Topic relevance evaluation of hyperlinks

A comprehensive priority evaluation method is given to evaluate the topic relevance of an unvisited hyperlink l. Its expression is given in Eq. (9).
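A reconstruction of Eq. (9) consistent with the weighted factors described below, assuming the webpage relevance is averaged over the m parent webpages, is

$$P(l)=r_1\,R(A_l)+r_2\,\frac{1}{m}\sum_{i=1}^{m}R(p_i)+r_3\,\mathrm{PR}(p_l).\tag{9}$$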

Here, r1, r2, and r3 represent weighted factors and satisfy r1+r2+r3=1, P(l) represents the comprehensive priority value of the unvisited hyperlink l, R(Al) represents the topic relevance of the anchor text Al of hyperlink l, R(pi) represents the topic relevance of webpage pi that contains hyperlink l, m is the number of webpages containing hyperlink l, and PR(pl) is the PR value of webpage pl containing hyperlink l. To filter irrelevant hyperlinks, we set a comprehensive priority threshold β. If P(l)≥β, the unvisited hyperlink l is considered topic-relevant and is added to the waiting queue (Qwait).

4 Focused crawling strategy based on the improved tabu search algorithm

In this section, we first introduce the tabu search (TS) algorithm and subsequently propose the improved tabu search (ITS) algorithm by modifying the tabu object and acceptance principles. Finally, by incorporating the domain ontology and host information memory into the focused crawling strategy based on ITS, a new focused crawler using ITS with ontology and host information (FCITS_OH) is proposed.

4.1 Tabu search algorithm

The TS algorithm was first proposed by Fred Glover. The TS algorithm (Liu JF et al., 2021) is in essence a random heuristic algorithm based on local search. It generates new candidate solutions in the neighborhood of the current solution. The basic flow of TS is as follows: (1) Given an initial solution, select some candidate solutions from the neighborhood of the current solution. (2) If the objective function value of the optimal candidate solution is better than that of the current optimal solution, ignore its tabu property, replace both the current solution and the current optimal solution with the optimal candidate solution, add it into the tabu list, and simultaneously update the term of each object in the tabu list; otherwise, select the nontabu optimal solution from the candidate solutions as the new current solution, add it into the tabu list, and update the term of each object in the tabu list. (3) Repeat the above process until the algorithm meets the ending condition. The TS algorithm involves some related elements, such as the objective function, neighborhood, tabu list, and aspiration criterion, which directly affect the optimization performance of the algorithm.

4.2 Objective function

The objective function, also called the fitness function, is used to compute the objective value of a solution. In the focused crawler, the objective function is expressed by the comprehensive priority of hyperlink l (Eq. (9)), and P(l) represents the objective function value.

4.3 Neighborhood set and extended neighborhood set

Definition 1 (Neighborhood set) The set of all hyperlinks in the webpage to which the current hyperlink Plink points is called the neighborhood set of Plink, denoted as N(Plink).

Definition 2 (Candidate neighborhood set) The set of hyperlinks with a comprehensive priority higher than the threshold β, located in the webpage to which the current hyperlink Plink points, is called the candidate neighborhood set of Plink, denoted as C(Plink). Obviously, C(Plink) ⊆ N(Plink).

Definition 3 (Extended neighborhood set) The set of hyperlinks whose comprehensive priority is higher than the threshold β in the webpage where the current hyperlink Plink is located is called the extended neighborhood set of Plink, denoted as E(Plink).

In the entire crawling process, the traditional neighborhood search range considers only hyperlinks in the webpage to which the current hyperlink Plink points, i.e., the neighborhood set or candidate neighborhood set. To expand the search range of the crawler, our ITS algorithm extends the neighborhood set to the extended neighborhood set. If the candidate neighborhood set of the current hyperlink Plink has been accessed a specified number of times and no suitable hyperlink has been found, the next hyperlink is selected from the extended neighborhood set.

4.4 Tabu list

The tabu list involves the tabu object and the tabu length. The tabu object is an object stored in the tabu list. When updating the crawler queue based on the neighborhood set N(Plink), it is possible for the crawler to repeatedly select a certain hyperlink Plink with the highest comprehensive priority. To avoid this, in the traditional TS algorithm, if the comprehensive priority of Plink is higher than that of the current optimal hyperlink, the algorithm ignores its tabu property, replaces both the current optimal hyperlink and the current hyperlink with Plink, and at the same time sets it as a tabu object and puts it into the tabu list; otherwise, the nontabu hyperlink with the highest comprehensive priority from N(Plink) is selected as the current hyperlink and regarded as a new tabu object. However, in the ITS algorithm, we do not consider whether the current hyperlink Plink is a tabu object or not. As long as each of the comprehensive priorities of five randomly selected hyperlinks from C(Plink) is lower than Plink's comprehensive priority, we set Plink as a tabu object, put Plink into the tabu list, and then select a nontabu hyperlink with the highest comprehensive priority from E(Plink) as the current hyperlink. Obviously, when the hyperlink is selected from E(Plink), Plink is not selected again. This improved tabu object strategy not only gives the current hyperlink more opportunities to select a next hyperlink with better comprehensive priority from the candidate neighborhood set, but also effectively extends the search range of the crawler through the extended neighborhood set.

The tabu length denotes the maximum number of iterations for which a tabu object remains in the tabu list without being selected, disregarding the aspiration criterion. In this study, the tabu length is set to five.
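A minimal sketch of this bookkeeping, assuming hyperlinks are identified by their URLs (the class and method names are hypothetical, not the authors' implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Tabu list with term counting: each tabooed hyperlink stays for at most
// TABU_LENGTH iterations and is released when its term reaches 0.
class TabuList {
    private static final int TABU_LENGTH = 5;
    private final Map<String, Integer> terms = new HashMap<>(); // URL -> remaining term

    void add(String url) { terms.put(url, TABU_LENGTH); }

    boolean isTabu(String url) { return terms.containsKey(url); }

    // Called once per iteration: decrement every term and release
    // the objects whose term has reached 0.
    void update() {
        terms.replaceAll((url, t) -> t - 1);
        terms.values().removeIf(t -> t <= 0);
    }
}
```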

4.5 Aspiration criterion and improved acceptance principles

The aspiration criterion means that when a tabooed hyperlink has a higher comprehensive priority than the current optimal hyperlink, the tabu property of this tabooed hyperlink is ignored, and it is accepted as the current hyperlink. In the traditional TS algorithm, when the tabooed hyperlink does not satisfy the aspiration criterion, the nontabu hyperlink with the highest comprehensive priority is selected from the neighborhood set as the current hyperlink (ignoring its comparison with the current hyperlink). This method easily accepts hyperlinks with a low comprehensive priority. The ITS algorithm refines the acceptance principles by the following steps while retaining the aspiration criterion:

1. If hyperlink Glink selected from C(Plink) is a tabu object and satisfies the aspiration criterion, Glink is released and accepted as the current hyperlink Plink.

2. If hyperlink Glink is a tabu object and does not satisfy the aspiration criterion, Glink is not accepted as the current hyperlink. Thereafter, a new hyperlink is randomly selected from C(Plink). If its comprehensive priority is higher than that of the current hyperlink, it is accepted as the new current hyperlink; otherwise, another hyperlink is selected from C(Plink) and judged for acceptance. This process is repeated up to five times or until a selected hyperlink is accepted. If none of the five can be accepted, we set the hyperlink Plink as a tabu object and put it into the tabu list. Then, a nontabu hyperlink with the highest comprehensive priority is selected from E(Plink) as the current hyperlink Plink. The tabu list is updated, and the object whose term is 0 is released.

3. If hyperlink Glink is not a tabu object and its comprehensive priority is higher than that of the current hyperlink Plink, Glink is accepted as the current hyperlink Plink.

4. If hyperlink Glink is not a tabu object and its comprehensive priority is not higher than that of the current hyperlink Plink, Glink is not accepted as the current hyperlink. Thereafter, five different hyperlinks are selected from C(Plink), as in step 2. If the comprehensive priority of a selected hyperlink is higher than that of the current hyperlink, this hyperlink is accepted as the new current hyperlink. If none of them can be accepted as the current hyperlink, we set the hyperlink Plink as a tabu object and put it into the tabu list. Then, a nontabu hyperlink with the highest comprehensive priority is selected from E(Plink) as the current hyperlink Plink. The tabu list is updated, and the object whose term is 0 is released.

4.6 Focused crawler based on the improved tabu search algorithm

The ITS algorithm is obtained by improving the tabu object and acceptance principles of the traditional TS algorithm. The ITS algorithm is applied to determine the next hyperlink to be visited from the waiting queue Qwait.

First, initialize the tabu list H1. Suppose that Hlink is the current optimal hyperlink and that Plink is the current hyperlink selected randomly from Qwait. Construct a candidate neighborhood set C(Plink) and an extended neighborhood set E(Plink) based on the current hyperlink Plink. Randomly select a hyperlink Glink from C(Plink) as a candidate hyperlink. Then, judge whether Glink is accepted according to the improved acceptance principles. If it is accepted, replace the current hyperlink Plink by Glink and output the hyperlink Plink. If it is not accepted, select another candidate hyperlink Glink from C(Plink) and continue the judgment process. If five different candidate hyperlinks are selected and none of them is accepted, then we set Plink as a tabu object and put it into tabu list H1. Reselect a nontabu hyperlink with the highest comprehensive priority from the extended neighborhood set E(Plink) as the current hyperlink. Update the tabu list H1 by subtracting 1 from the term of each tabu object in the tabu list, and release the object whose term is 0. The above iteration process is repeated until a hyperlink is accepted. The detailed process of the ITS(Qwait) algorithm is presented in Algorithm A1 of the Appendix.
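A compact sketch of this selection loop is given below. The Link record, its candidateSet/extendedSet accessors, and the priority() value (the comprehensive priority P(l) of Eq. (9)) are hypothetical stand-ins for Algorithm A1, and TabuList is the sketch from Section 4.4:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical link representation; C(Plink) and E(Plink) follow
// Definitions 2 and 3, and priority is P(l) from Eq. (9).
record Link(String url, double priority,
            List<Link> candidateSet, List<Link> extendedSet) {}

class ITS {
    private static final int MAX_TRIES = 5;
    private final TabuList tabu = new TabuList(); // sketch from Section 4.4
    private final Random rng = new Random();
    private Link best; // current optimal hyperlink Hlink

    Link selectNext(List<Link> qWait) {
        Link plink = qWait.get(rng.nextInt(qWait.size())); // current hyperlink
        while (true) {
            List<Link> cand = plink.candidateSet();
            Link accepted = null;
            for (int i = 0; i < MAX_TRIES && !cand.isEmpty(); i++) {
                Link glink = cand.get(rng.nextInt(cand.size()));
                boolean aspiration = best == null || glink.priority() > best.priority();
                // Principles 1 and 3: accept a tabooed link satisfying the
                // aspiration criterion, or a nontabu link that beats Plink.
                if (tabu.isTabu(glink.url()) ? aspiration
                                             : glink.priority() > plink.priority()) {
                    accepted = glink;
                    break;
                }
            }
            if (accepted != null) { plink = accepted; break; }
            // Principles 2 and 4: after five rejections, taboo Plink and move to
            // the best nontabu hyperlink of the extended neighborhood E(Plink).
            tabu.add(plink.url());
            Link next = plink.extendedSet().stream()
                    .filter(l -> !tabu.isTabu(l.url()))
                    .max(Comparator.comparingDouble(Link::priority))
                    .orElse(null);
            tabu.update(); // decrement terms; release objects whose term is 0
            if (next == null) break; // dead end: keep the current Plink
            plink = next;
        }
        if (best == null || plink.priority() > best.priority()) best = plink;
        return plink;
    }
}
```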

4.7 Focused crawler combining ontology and the improved tabu search algorithm

By introducing the ITS algorithm into the focused crawler and using the domain ontology to describe the topic, we design a focused crawling strategy combining ontology and the ITS algorithm (FCOITS), which is used to fetch topic-relevant webpages from the Internet.

First, determine the topic, build the domain ontology for this topic, and add the seed uniform resource locators (URLs) to Qwait. Suppose that α is the threshold of topic-relevant webpages and that β is the threshold of the hyperlink's comprehensive priority. Then, the ITS(Qwait) algorithm is used to select the next hyperlink phead to visit and download the webpage phead-page to which the hyperlink phead points. If R(phead-page)>α, it is considered a topic-relevant webpage; otherwise, it is considered an irrelevant webpage. Subsequently, all hyperlinks in the webpage phead-page are extracted and added to the set of child-links. Calculate the comprehensive priority of every hyperlink child-linki in child-links based on Eq. (9). If P(child-linki)>β, add child-linki to Qwait; otherwise, discard it. The above iteration process is repeated until the end conditions are met. Fig. 1 shows the flowchart of the proposed FCOITS algorithm. The detailed process of the FCOITS algorithm is presented in Algorithm A2 of the Appendix.

Fig. 1 Flowchart of the proposed FCOITS algorithm (DP: number of downloaded webpages; LP: number of downloaded topic-relevant webpages; Qwait: waiting queue)
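The main loop of FCOITS can be sketched as follows, reusing the ITS and Link sketches above; the Web interface (fetcher, link extractor, and the VSM relevance R(p) of Eq. (4)) is a hypothetical stand-in for Algorithm A2:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical crawling primitives: fetcher, link extractor, and R(p).
interface Web {
    String download(String url);
    List<Link> extract(String page);
    double relevance(String page); // R(p) of Eq. (4)
}

class FCOITS {
    private final ITS its = new ITS();            // sketch from Section 4.6
    private final Web web;
    private final double alpha, beta;             // thresholds (Section 5)
    private final List<Link> qWait = new ArrayList<>();
    private final List<String> relevantUrls = new ArrayList<>();

    FCOITS(Web web, List<Link> seeds, double alpha, double beta) {
        this.web = web;
        this.alpha = alpha;
        this.beta = beta;
        qWait.addAll(seeds);
    }

    void crawl(int maxPages) {
        int dp = 0; // number of downloaded webpages
        while (dp < maxPages && !qWait.isEmpty()) {
            Link phead = its.selectNext(qWait);   // next hyperlink to visit
            qWait.remove(phead);
            String page = web.download(phead.url());
            dp++;
            if (web.relevance(page) > alpha) relevantUrls.add(phead.url());
            for (Link child : web.extract(page)) {
                if (child.priority() > beta) qWait.add(child); // Eq. (9) filter
            }
        }
    }
}
```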

4.8 Focused crawler combining the FCOITS algorithm and host information

It is possible for the crawler to crawl recursively under a few hosts, resulting in premature convergence and limiting its ability to retrieve more topic-relevant webpages. The hostgraph (Jiang and Zhang, 2007) reveals the connection between hyperlinks and common hosts. For example, "klme.nuist.edu.cn" in the hyperlink "https://klme.nuist.edu.cn/index.htm" is the host, and any hyperlink that contains it is under the same host. In this study, we analyze the hyperlink's syntactic structure, leverage the host of the hyperlink, and propose a new focused crawler that integrates host information into FCOITS, called FCITS_OH.

At the beginning of FCITS_OH, the hyperlinks that are located under different hosts and have higher comprehensive priorities are selected as seed hyperlinks to avoid premature convergence of the algorithm. Then, the selected hyperlinks are put into Qwait. The ITS(Qwait) algorithm is applied to obtain the head hyperlink phead of Qwait, whose host is marked by phead_host. Suppose that the number of hosts in Qwait is QN_host. The number of downloaded webpages is DP. The algorithm ends when DP reaches 15 000, and the algorithm completion rate is defined by com_rate=DP/15 000. During crawling, if some hyperlinks are visited many times under the same host, the crawler selects another hyperlink in Qwait located at a host different from the current one. To prevent the crawler from circularly crawling under a few hosts, a tabu list H2 for hosts is defined. The following four steps are repeated: (1) If phead_host is a tabooed host, call the ITS(Qwait) algorithm to obtain another head hyperlink phead of Qwait. (2) If com_rate<0.3 and QN_host<10, select three hyperlinks in descending order of comprehensive priority from the discarded hyperlinks whose hosts do not belong to the set of hosts in Qwait, and add them into Qwait. This is conducive to expanding the number of hosts of the hyperlinks in Qwait. (3) If the number of visited hyperlinks under the current host phead_host is smaller than 50, continue to visit other hyperlinks under the current host phead_host; otherwise, compute the percentage ph_ratio (the ratio of the number of topic-relevant webpages to the number of all visited webpages under the current host phead_host). (4) If the number of visited hyperlinks under the current host phead_host is smaller than 100 and ph_ratio>0.8, continue to visit other hyperlinks under the current host phead_host; otherwise, set phead_host as a tabooed host and put it into H2. After the head hyperlink phead is obtained, continue the remaining steps of Algorithm A2 until the end conditions are met. The detailed process of the FCITS_OH algorithm is presented in Algorithm A3 of the Appendix.
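A minimal sketch of this host information memory, assuming the thresholds quoted above (50 and 100 visits, 0.8 relevant ratio) and hypothetical class and method names; host extraction uses java.net.URI:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Host memory implementing steps (3) and (4): a host is tabooed once it has
// been visited 50 times with a topic-relevant ratio of at most 0.8, or 100
// times regardless of the ratio.
class HostMemory {
    private final Set<String> tabuHosts = new HashSet<>();        // tabu list H2
    private final Map<String, Integer> visits = new HashMap<>();  // visits per host
    private final Map<String, Integer> relevant = new HashMap<>();

    static String hostOf(String url) {
        return URI.create(url).getHost(); // e.g., "klme.nuist.edu.cn"
    }

    boolean isTabu(String host) { return tabuHosts.contains(host); }

    void recordVisit(String host, boolean topicRelevant) {
        visits.merge(host, 1, Integer::sum);
        if (topicRelevant) relevant.merge(host, 1, Integer::sum);
        int v = visits.get(host);
        double phRatio = relevant.getOrDefault(host, 0) / (double) v;
        if (v >= 100 || (v >= 50 && phRatio <= 0.8)) tabuHosts.add(host);
    }
}
```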

5 Experimental results

In this study, the initial seed hyperlinks are acquired from Baidu, which is the most authoritative and widely used search engine in China. We obtain webpages by searching the keywords "tourism" and "rainstorm disaster" separately, and choose the 30 top-ranked webpages as the initial seed hyperlinks in the tourism domain and the rainstorm disaster domain (Tables S1 and S2 in the supplementary materials).

In addition, some important parameters (α and β) have a great impact on the experimental results. For example, if the topic relevance threshold α is too high, the number of crawled topic-relevant webpages will be reduced because some topic-relevant webpages are filtered out. If the threshold α is too low, some irrelevant webpages will be wrongly considered topic-relevant. We conducted parameter experiments on different values of α in the range of 0.5–0.8 based on the lattice search method in different domains, referring to Liu WJ and Du (2014). The results show that with α=0.7 in the tourism domain and α=0.62 in the rainstorm disaster domain, the crawler correctly captures topic-relevant webpages and achieves the best performance. The other parameters are set similarly. Here, we set β=0.19 for the tourism domain and β=0.15 for the rainstorm disaster domain. In addition, r1=0.55, r2=0.25, and r3=0.20.

5.1 Performance metrics

The effectiveness of focused crawlers can generally be evaluated by accuracy (AC) and recall (RC). AC equals the ratio of the number of downloaded topic-relevant webpages to the total number of downloaded webpages. RC equals the ratio of the number of downloaded topic-relevant webpages to the total number of topic-relevant webpages on the Internet. Because it is difficult to count the total number of topic-relevant webpages on the Internet, we do not use RC as an evaluation metric in this study. In addition, we use the average topic relevance (AR) and the standard deviation (SD) of downloaded webpages as evaluation metrics. These three metrics are defined below.
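Sketches of the three metrics, assuming the definitions given in the text (LP is the number of downloaded topic-relevant webpages and DP the number of downloaded webpages), are

$$\mathrm{AC}=\frac{\mathrm{LP}}{\mathrm{DP}},\qquad \mathrm{AR}=\frac{1}{\mathrm{DP}}\sum_{i=1}^{\mathrm{DP}}R(p_i),\qquad \mathrm{SD}=\sqrt{\frac{1}{\mathrm{DP}}\sum_{i=1}^{\mathrm{DP}}\bigl(R(p_i)-\mathrm{AR}\bigr)^{2}}.$$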

Here, SD is the standard deviation of the topic relevance of all downloaded webpages with respect to AR, used to measure the spread of the topic relevance of all downloaded webpages. The value of SD is in [0, 1].

5.2 Experimental results of different crawlers

In this study, we first test six focused crawling algorithms in the tourism domain and then test seven focused crawling algorithms in the rainstorm disaster domain under the same experimental environment, including BFS (Li et al., 2015), OPS (Rawat and Patil, 2013), the focused crawler based on the simulated annealing algorithm (FCSA) (Liu JF et al., 2019), FCWSEO (Liu JF et al., 2022b), OLMOACO (Liu JF et al., 2022a), FCOITS, and FCITS_OH. The last two algorithms are proposed in this study. We implement all crawling algorithms in the Java language and run them on an Intel Core i7-7700 PC with a 3.6 GHz CPU and 8.0 GB RAM. When the number of downloaded webpages reaches 15 000, all algorithms tend to be stable and terminate. The same evaluation metrics are used to test the different crawling algorithms on the two topics of tourism and rainstorm disaster, which is conducive to investigating the validity, superiority, and adaptability of each algorithm.

5.2.1 Experimental results in the tourism domain

Experimental results of LP, AC, AR, and SD for the six crawling algorithms BFS, OPS, FCSA, FCWSEO, FCOITS, and FCITS_OH in the tourism domain are shown in Figs. 2–5 for comparison. Fig. 2 shows the results of LP obtained by the six crawling algorithms in the tourism domain. With the increase in the number of downloaded webpages, the LP of the five crawling algorithms other than BFS increases rapidly. Obviously, the LP obtained by FCITS_OH is significantly greater than that of the five other crawling algorithms; the LP obtained by FCITS_OH is 13 082 when DP reaches 15 000. Fig. 3 shows the results of AC obtained by the six crawling algorithms in the tourism domain. It is not hard to see from the figure that the AC of FCITS_OH becomes higher than that of the five other crawling algorithms after DP exceeds 8000. The AC of the BFS, OPS, FCSA, FCWSEO, FCOITS, and FCITS_OH crawling algorithms is 0.3740, 0.6820, 0.7503, 0.8086, 0.8453, and 0.8721, respectively, when DP reaches 15 000.

Fig. 2 Results of LP obtained by six crawling algorithms in the tourism domain (LP: number of downloaded topic-relevant webpages; DP: number of downloaded webpages)

Fig. 3 Results of AC obtained by six crawling algorithms in the tourism domain (AC: accuracy; DP: number of downloaded webpages)

Fig. 4 shows the results of AR obtained by the six crawling algorithms in the tourism domain. According to Fig. 4, the AR of FCITS_OH is obviously higher than that of the five other crawling algorithms after DP exceeds 8000. The AR of the BFS, OPS, FCSA, FCWSEO, FCOITS, and FCITS_OH crawling algorithms is 0.4247, 0.6966, 0.7292, 0.7553, 0.7806, and 0.7912, respectively, when DP reaches 15 000. Fig. 5 shows the results of SD obtained by the six crawling algorithms in the tourism domain. From Fig. 5, FCITS_OH maintains a low SD throughout the crawling process. The SD of the BFS, OPS, FCSA, FCWSEO, FCOITS, and FCITS_OH crawling algorithms is stable at 0.2848, 0.2317, 0.1769, 0.1293, 0.1413, and 0.1340, respectively, when DP reaches 15 000. The SD reflects the stability of the topic relevance of the webpages captured by an algorithm; the lower the SD, the more stable the algorithm. Although the SD of FCITS_OH is slightly higher than that of FCWSEO, FCITS_OH outperforms FCWSEO on the other evaluation metrics.

Fig. 4 Results of AR obtained by six crawling algorithms in the tourism domain (AR: average topic relevance; DP: number of downloaded webpages)

Fig. 5 Results of SD obtained by six crawling algorithms in the tourism domain (SD: standard deviation; DP: number of downloaded webpages)

5.2.2 Experimental results in the rainstorm disaster domain

Experimental results of the seven crawling algorithms BFS, OPS, FCSA, FCWSEO, OLMOACO, FCOITS, and FCITS_OH in the rainstorm disaster domain for the four evaluation metrics (LP, AC, AR, and SD) are shown in Figs. 6–9. Fig. 6 shows the results of LP obtained by the seven crawling algorithms in the rainstorm disaster domain. From Fig. 6, we find that when DP reaches 15 000, FCITS_OH obtains 12 393 topic-relevant webpages, indicating that FCITS_OH can collect more topic-relevant webpages than the six other crawling algorithms. Fig. 7 shows the results of AC obtained by the seven crawling algorithms in the rainstorm disaster domain. From Fig. 7, we find that the AC of FCITS_OH tends to stabilize gradually after DP exceeds 10 000. Finally, the AC of the BFS, OPS, FCSA, FCWSEO, OLMOACO, FCOITS, and FCITS_OH crawling algorithms is 0.2366, 0.6542, 0.7004, 0.8103, 0.7417, 0.7969, and 0.8262, respectively. Compared with the six other crawling algorithms, FCITS_OH has a higher AC.

Fig. 6 Results of LP obtained by seven crawling algorithms in the rainstorm disaster domain (LP: number of downloaded topic-relevant webpages; DP: number of downloaded webpages)

Fig. 7 Results of AC obtained by seven crawling algorithms in the rainstorm disaster domain (AC: accuracy; DP: number of downloaded webpages)

Fig. 8 shows the results of AR obtained by the seven crawling algorithms in the rainstorm disaster domain. Throughout the crawling process, the AR of FCITS_OH is relatively high and stable among the seven crawling algorithms. Finally, the AR of the BFS, OPS, FCSA, FCWSEO, OLMOACO, FCOITS, and FCITS_OH crawling algorithms is stable at 0.2947, 0.6376, 0.6627, 0.8200, 0.7781, 0.7306, and 0.7421, respectively. Fig. 8 shows that although the AR of FCWSEO and OLMOACO is slightly higher than that of FCITS_OH, the effect of FCITS_OH in grabbing topic-relevant webpages is relatively stable.

Fig. 8 Results of AR obtained by seven crawling algorithms in the rainstorm disaster domain (AR: average topic relevance; DP: number of downloaded webpages)

Fig. 9 shows the results of SD obtained by the seven crawling algorithms in the rainstorm disaster domain. The SD of FCITS_OH maintains a downward trend throughout the crawling process. Finally, the SD of the BFS, OPS, FCSA, FCWSEO, OLMOACO, FCOITS, and FCITS_OH crawling algorithms is 0.3096, 0.2599, 0.1953, 0.1570, 0.1375, 0.1495, and 0.1444, respectively. Although the SD of FCITS_OH is slightly higher than that of OLMOACO, the two are comparable.

Fig. 9 Results of SD obtained by seven crawling algorithms in the rainstorm disaster domain (SD: standard deviation; DP: number of downloaded webpages)

5.3 Analysis and discussion

From Figs. 2, 3, 4, 6, 7, and 8, it is not hard to find that the OPS algorithm performs better in the early crawling stage in the tourism and rainstorm disaster domains, but its performance degrades in the later crawling stage as a result of its greedy strategy. The OPS algorithm always selects the highest-priority hyperlink from the waiting queue to crawl the webpage. When it falls into the choice of a hyperlink with no prospects, the webpage to which that hyperlink points may contain few valuable hyperlinks, which is not conducive to expanding the search range. The FCSA algorithm is also a kind of greedy strategy but modifies the optimal search by accepting hyperlinks with relatively low priority with a certain probability. However, the performance of the FCSA algorithm highly depends on its parameters, such as the initial temperature and annealing speed, which are difficult to determine. Therefore, the ability of the FCSA algorithm to grab topic-relevant webpages is only slightly higher than that of the BFS and OPS algorithms.

Fig. 3 shows that the FCITS_OH algorithm outperforms the FCWSEO algorithm on AC in the tourism domain, and it can also be seen from Fig. 7 that the FCITS_OH algorithm outperforms the FCWSEO and OLMOACO algorithms on AC in the rainstorm disaster domain. The FCWSEO and OLMOACO algorithms grow fast in the early stage and tend to stabilize without improvement later in the crawling process. This is because the FCWSEO algorithm is a kind of multiobjective optimization algorithm that produces nondominated hyperlinks within circular regions. However, as the circular regions expand, it easily catches some hyperlinks that have no prospects compared with the adjacent hyperlinks, which affects the crawling performance. The OLMOACO algorithm readily accumulates pheromones to find an optimal search path at the beginning, but as the crawl proceeds, its feedback mechanism makes it difficult to further improve the pheromone of the optimal path. As a result, it is challenging to continue to enhance its ability to fetch more topic-relevant webpages. The FCITS_OH algorithm uses the tabu object and aspiration criterion to avoid crawling visited hyperlinks and introduces host information to expand the search range, so it is easier for it to find the optimal crawling path and fetch more topic-relevant hyperlinks in the entire crawling process.

The specific values of all evaluation metrics and the running time of the abovementioned seven crawling algorithms in the tourism domain and the rainstorm disaster domain when DP reaches 15 000 are displayed in Table 2. From Table 2, we can find that the running time of BFS is the shortest, while FCWSEO and OLMOACO require longer running times than the other crawling algorithms in the tourism domain and the rainstorm disaster domain, respectively. This is because FCWSEO and OLMOACO are multiobjective optimization crawling algorithms, in which the optimization process of hyperlink selection based on a multiobjective optimization model increases the time consumption. The running time of FCITS_OH is slightly longer than that of the other crawling algorithms except FCWSEO and OLMOACO, because it takes extra running time to construct the ontology and extract host information.

To further investigate the effectiveness of the improved tabu object and acceptance principle strategies in the ITS algorithm, we design the focused crawler based on the improved tabu search algorithm (FCITS) and the focused crawler based on the traditional tabu search algorithm (FCTS). For convenience of presentation, Table 2 also shows the LP, AC, AR, SD, and running time of FCITS and FCTS in the tourism and rainstorm disaster domains when DP reaches 15 000. We find that the experimental results of FCITS on all evaluation metrics except the running time are better than those of FCTS, which further confirms the effectiveness of the improved strategies in the ITS algorithm. With regard to the running time of the ITS algorithm, we analyze the time complexity of the ITS algorithm and the TS algorithm in the crawling process as follows.

Suppose that there are m hyperlinks in the waiting queue Qwait. The time complexity of selecting a hyperlink Plink from Qwait is O(m). The time consumption of selecting a hyperlink Glink from the neighborhood C(Plink) is assumed to be k1~O(m). The time consumption for determining the tabu object is constant, and the time complexity of computing the topic relevance of the hyperlink is k2×O(DP×n). Here, k2 is the time consumption of word segmentation, word frequency statistics, and link extraction from webpages; O(DP×n) represents the time complexity of calculating R(pi), PR(pl), and R(Al); DP and n are the number of downloaded webpages and the number of topic words, respectively. The time consumption of selecting a hyperlink from the extended neighborhood set is assumed to be k3~O(m). Therefore, the time complexity of the ITS algorithm can be expressed as O(m)×[k1×k2×O(DP×n)×k3]. Because k1~O(m), k2~O(n), and k3~O(m), the time complexity of the ITS algorithm is O(m³×DP×n²).

Different from the ITS algorithm, the TS algorithm selects a nontabu link with the best comprehensive priority from the neighborhood C(Plink) when the tabooed hyperlink does not satisfy the aspiration criterion, so its time complexity is O(m²×DP×n²). By comparing the time complexities of ITS and TS, it can be seen that the time complexity of the ITS algorithm is higher than that of the TS algorithm, which results in a longer running time for the FCITS algorithm than for the FCTS algorithm.

It can be seen from Table 2 that FCITS_OH does not achieve the optimal results on all evaluation metrics. To better evaluate the effectiveness and superiority of FCITS_OH, the Friedman test (Derrac et al., 2011), a nonparametric statistical test, is used to comprehensively evaluate the performance of these algorithms. In this study, when DP=15 000, the results obtained by the nine crawling algorithms for the four representative metrics (LP, AC, AR, and SD) are converted to average ranks. The best-performing algorithm for each metric has a rank of 1, the second best a rank of 2, and so on; the smaller the average rank, the better the performance. Table 3 displays the experimental results of the nine crawling algorithms on the four evaluation metrics under the Friedman test when DP reaches 15 000. From Table 3, we can find that the FCITS_OH algorithm is the best-performing algorithm among the nine algorithms in the two domains in terms of the four metrics. In summary, the experimental results show that FCITS_OH achieves impressive and satisfactory results on most performance evaluation metrics, particularly prevailing over the other eight crawlers in LP and AC. Therefore, we can conclude that the proposed FCITS_OH crawler is an effective semantic retrieval method.

Table 3 Friedman ranks of nine crawling algorithms for the four representative evaluation metrics in the tourism and rainstorm disaster domains when DP reaches 15 000

6 Conclusions

The drawback of traditional crawlers is that they cannot provide enough topic-relevant information for a specific domain. To overcome the shortcomings of traditional crawlers, this paper focuses on focused crawlers. We propose a novel focused crawling algorithm, namely, FCITS_OH. Specifically, we construct a domain ontology based on the FCA method for topic description at the semantic and knowledge levels. The ITS strategy and host information are used to select the next hyperlink in the focused crawler. In addition, we design a comprehensive priority evaluation method for evaluating unvisited hyperlinks and preventing the problem of topic drifting. To demonstrate the effectiveness and superiority of the FCITS_OH algorithm, we compare the experimental results of FCITS_OH and FCOITS with those of BFS, OPS, FCSA, and FCWSEO in the literature in the tourism domain, and with those of BFS, OPS, FCSA, OLMOACO, and FCWSEO in the rainstorm disaster domain, under the same experimental environment. The experimental results show that FCITS_OH outperforms the other focused crawling algorithms and has the ability to collect larger quantities of higher-quality webpages. Furthermore, we compare the experimental results of FCTS based on the original TS with those of FCITS based on ITS; the results confirm the effectiveness of the proposed ITS.

The proposed FCITS_OH has some disadvantages, however, such as no consideration of the tunnel crossing technique; it is possible for a hyperlink to cross an irrelevant webpage to reach a relevant webpage. In addition, in the topic-relevance evaluation of unvisited hyperlinks in the focused crawlers, the traditional single-objective optimization method based on the weighted sum is adopted, which makes it difficult to determine reasonable weight coefficients. In future work, we intend to study focused crawlers based on the tunnel crossing technique and multiobjective intelligent optimization algorithms to improve our evaluation metrics.

Contributors

Jingfa LIU designed the research. Zhen WANG drafted the paper, implemented the software, and performed the experiments. Guo ZHONG and Zhihe YANG revised and finalized the paper.

Compliance with ethics guidelines

Jingfa LIU, Zhen WANG, Guo ZHONG, and Zhihe YANG declare that they have no conflict of interest.

Data availability

Data are available in a public repository.

List of supplementary materials

Table S1 Seed uniform resource locators (URLs) in the tourism domain

Table S2 Seed uniform resource locators (URLs) in the rainstorm disaster domain

Appendix: Proposed focused crawling algorithms
