Meilian Lu, Zhihe Qu, Mengxing Wang, Zhen Qin
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract: Citation networks are often used for academic recommendation. However, it is difficult to achieve high recommendation accuracy and low time complexity because citation networks are often very large and sparse, and different citations serve different purposes. Moreover, some citations carry unreasonable information, as in the case of intentional self-citation. To improve the accuracy of citation network-based academic recommendation and reduce its time complexity, we propose an academic recommendation method for recommending authors and papers. In this method, an author-paper bilayer citation network is built; an enhanced topic model, the Author Community Topic Time Model (ACTTM), is proposed to detect high-quality author communities in the author layer; and a set of attributes is proposed to comprehensively depict the author and paper nodes in the bilayer citation network. Experimental results show that the proposed ACTTM can detect high-quality author communities and facilitate low time complexity, and that the proposed academic recommendation method can effectively improve recommendation accuracy.
Keywords: academic recommendation; topic model; community detection; bilayer citation network
Along with the rapid development of computer technology, network technology and digital information technology, the Internet has become the main way for researchers to obtain academic resources. However, obtaining academic resources from the Internet has drawbacks: the amount of academic resources is enormous, and some of them are of low quality. Researchers may spend a lot of time looking for the resources they need and still fail to find them. Academic recommendation is a promising technology that provides researchers with recommended academic resources according to their personalized requirements.
Authors and papers are two typical and important academic resources. When conducting research, researchers often need to read many papers, learn what other researchers in certain areas are doing, or find new research topics. Specifically, for researchers who are new to an area, influential authors and their papers can help them quickly learn the state of the research areas they are interested in and become familiar with those areas. For senior researchers, who may be involved in multiple research areas, influential authors and their papers can help broaden their research visions.
The authors propose a new academic recommendation method for recommending authors and papers. In this method, ACTTM is proposed to model author communities based on paper content, and a bilayer citation network is constructed according to paper citations and author citations.
In some existing studies, citation networks [1, 2] were built to recommend authors or papers in academic recommenders and further to detect research fronts by analyzing network structure and node attributes. The results in [3] show that a recommender system based on a citation network can recommend not only influential authors in the relevant research areas but also authors who can expand researchers’ perspectives. However, citation networks are usually large and the citation relationships within them are very sparse, so recommending academic resources based on the entire citation network has low computational efficiency [4]. Moreover, the quality of a citation network depends heavily on the citation information: low-quality citation information degrades the network and further reduces the accuracy of the academic recommender. Therefore, academic recommenders based on citation networks face two fundamental issues: increasing computational efficiency and improving the quality of citation information.
To address these two issues, some studies divided the entire citation network into multiple communities and recommended academic resources within highly correlated communities. Textual analysis [5, 6] and topic-based methods [7, 8, 9] were often used to detect author or paper communities. These studies showed that topic-based algorithms can overcome the degradation of community detection when citation information is scarce, and can group highly similar authors into the same community to achieve high-quality communities. However, these algorithms cannot analyze citation information with graph theory when building communities, and cannot extract the specific authors or papers that have higher authority or lead the research fronts.
Some studies combined citation networks and topic models to remedy these defects. Wang X. et al. [10] proposed CitationLDA, which combines LDA (Latent Dirichlet Allocation) [11] with citation information to discover research topics and their evolution. However, the scalability of such methods needs to be improved, since their computational time complexity grows with the scale of the citation network. Enriching the information in citation networks can improve their quality from another point of view. Yujing Wang et al. [12] utilized journal, author and time information to reinforce the citation network and rank scientific articles. A Livne et al. [13] enriched the node attributes in the citation network with citation numbers. However, these methods did not consider the community structure of the citation network, so their computational time complexity and recommendation accuracy remain to be further improved.
In this paper, we propose an enhanced academic recommendation method to recommend influential authors and high-quality papers, effectively improving the accuracy and time complexity of citation network-based academic recommenders. First, we build an author-paper bilayer citation network. Then we propose an enhanced topic model, the Author Community Topic Time Model (ACTTM), to detect high-quality author communities within the author citation layer, thus reducing the time complexity and increasing the recommendation accuracy. Finally, we propose a set of attributes to comprehensively depict the author nodes and paper nodes of the bilayer citation network, which enriches the citation information and further increases the recommendation accuracy.
The rest of the paper is organized as follows. Part 2 introduces related work on citation network-based academic recommendation, academic topic modeling and community detection. Parts 3, 4 and 5 respectively introduce the proposed ACTTM, the bilayer citation network and the academic recommendation method. Part 6 presents the experiments and analyses the performance of the proposed models and algorithms. Part 7 concludes our work and discusses future study.
In the academic field, authors, papers and their relations compose a complex social network, in which there are two special types of social relationships, both called citation relations. The citation relationships between papers and between authors form paper citation networks and author citation networks, respectively. The citation information between papers indicates the content correlation among papers and the transfer of knowledge. The citation information between authors implies research groups composed of scholars with similar research interests, which further represents the status of a specific research area and reflects future research trends.
Wu et al. [14] argued that, depending on the citation relations used, three types of citation networks can be constructed: the directly connected network, the coupling network and the co-citation network. They evaluated the three types in terms of discovering research fronts and concluded that the directly connected network can discern emerging fields in the least time, while the co-citation network performs worst. Therefore, in this paper we build the bilayer citation network according to the direct citation relations among papers and among authors.
Amancio et al. [1] regarded a citation relation as the outcome of a paper selecting its references according to several factors, and took content similarity, citation counts and publication year into account to build the citation network. They showed that content similarity was the most relevant factor for reproducing the topology of citation networks. Menczer [2] proposed a model that generates document networks from popularity and textual content, which yielded remarkably accurate predictions of both degree and similarity distributions in the networks. However, their textual similarity was measured with simple term frequencies, which is quite crude.
As for recommending authors and papers, many features of authors and papers are involved, such as citation relationships, co-author relationships and research areas. F Battiston et al. [15] investigated the multiplex nature of communities in collaboration networks and proposed a model using triadic closure to explain the appearance, coexistence and co-evolution of communities at the different layers of the multiplex. MP Viana et al. [16] integrated a time factor to consider the cohesive relationship. However, these methods only considered the topological relationships between authors and ignored the semantic information, which makes them unsuitable for discovering valuable papers. Co-author networks do show the similarities between authors’ research areas well, and co-authorships tend to emerge organically when researchers share scientific interests or are involved in joint research projects. However, in a co-author network, the nodes with high degree may simply be active authors who published more papers rather than influential authors in a specific research area. So a co-author network can help to recommend authors, but may not be appropriate for recommending influential authors. Research areas, in turn, are partitioned too coarsely to distinguish authors. S Wang et al. [17] improved the predictive accuracy of network service quality by comprehensively considering various attributes such as time and location. So in this paper, from the perspective of semantic relevance, we first build a basic citation network and then enrich the citation relations by extracting information from users’ operations on papers. However, S Wang et al. [18] mentioned that the existence of subjective users and malicious users will reduce the accuracy of recommendations. Therefore, in addition to users’ operations, we also use the topology information and semantic information of the network to enrich the node attributes and thus improve recommendation accuracy.
Text analysis can provide semantic information, and topic models provide an effective means of text analysis and have therefore been used in studies of academic networks. Using web-based methods and text analysis methods, FN Silva et al. [5] automatically built a scientific map of a given scientific area, which facilitates further investigation of the designated field. A Lancichinetti et al. [6] proposed a topic mapping method to detect communities.
Michal Rosen-Zvi et al. [19] proposed the AT (Author-Topic) model, which extends LDA with authors to generate the topic distribution of authors; in it, the text-topic distribution is replaced by the author-topic distribution. Wang et al. [20] argued that a topic is influenced not only by the co-occurrence frequency of words but also by changes over time. They therefore added a variable $t$, taken as an observable parameter, to the LDA model and proposed the TOT (Topic Over Time) model, which can be used to obtain the paper-topic distribution and the topic-time distribution. Jie Tang et al. [21] proposed the ATT (Author-Time-Topic) model by combining the AT model and the TOT model and adding a new variable $t$, by which the author-topic information and the changes of topics over time can be extracted. However, the above topic models did not detect academic communities and did not reflect continuous changes of topics over time. To address this problem, W Elshamy [22] proposed ciDTM (continuous-time infinite Dynamic Topic Model). However, this method is very complex to implement and its model parameters are very complex to set, which may affect the final accuracy.
Existing community detection algorithms are mainly based on network topology or topic relevance. Topology-based algorithms include graph splitting methods and graph aggregation methods. The representative splitting method is the GN algorithm [23] proposed by Girvan and Newman, and the representative aggregation method is the BGLL algorithm [24] proposed by Blondel, Guillaume et al. However, the performance of topology-based community detection depends on the density of the network, whereas citation networks are highly sparse. Moreover, topology-based algorithms generally do not consider the topic relations between papers, so inappropriate citations may cause authors in different research fields to be classified into the same community. The communities these algorithms construct from a citation network are therefore not very appropriate.
The community detection algorithms based on topic relevance use a topic model to extract the topic relations among papers or authors, and detect communities according to the extracted information. Daud Ali et al. [7] proposed VAT (Venue Author Topic) to model authors and conferences, then used the author-topic distribution obtained by VAT to detect author communities. The main idea of the algorithm is to utilize topic models to extract the similarities between individuals. However, it did not take advantage of the other abundant network information formed by individuals. Hwang S Y et al. [25] constructed a co-authorship network and used LDA to analyze the topic distribution of each paper; an author’s topic distribution was then obtained from the topic distributions of all papers written by that author, and the author with the most similar topic distribution in the network was recommended. However, when the network is large, finding similar authors across the entire network may take a great deal of time. Daifeng Li et al. [8] added community information to the AT model and proposed ACTM (Author-Community-Topic Model) to detect author communities; they further added time information to ACTM and put forward DCTM (Dynamic Community Topic Model) to model topic communities during a given period. However, DCTM did not reflect the continuous change of topic communities, and the effect of its communities is limited. T Liu et al. [9] transformed the citation network into a paper similarity network and proposed a citation similarity-based community detection method to detect the communities in citation networks. However, that work only considered paper co-citation relations and may produce many isolated nodes.
The innovations of this paper are as follows: 1) we build an author-paper bilayer citation network with citation information and propose a set of attributes to enrich the features of network nodes, which helps improve the accuracy of the academic recommender; 2) we propose the ACTTM model to detect author communities in the bilayer network so that academic resources are recommended within communities, which effectively reduces the computational time complexity, further increases the accuracy of the academic recommender, avoids the problem caused by sparse citation information, and reflects the continuous evolution of topics over time.
To avoid the high computational time complexity of academic recommendation based on the entire citation network and to improve recommendation accuracy, we consider the following steps. First, we build an author-paper bilayer citation network. Then we select author groups whose research interests are similar to those of users by detecting author communities in the author citation layer, which reduces the time complexity and increases recommendation accuracy. Finally, we recommend authors and papers based on the selected author communities.
However, compared with the large number of authors and papers, the citation relations between them are very few, so the author-paper citation network is highly sparse. Under this condition, it is difficult to detect reasonable communities using topology-based community detection algorithms. So in this paper, we put forward a topic-based model, called the Author Community Topic Time Model (ACTTM), to detect author communities in the highly sparse author citation network.
ACTTM improves the AT model [19] and the TOT model [20] by introducing a community parameter and a continuous time parameter. It enhances the semantic relations between authors in the author citation network by extracting author relations from paper content, and it reflects the continuous evolution of communities by introducing a time factor into the topic model. By mapping the author communities detected through ACTTM onto the author citation network, we can detect more valuable author communities there. Furthermore, because academic resources are recommended based on the author communities a user is interested in instead of the entire author citation network, the computational time complexity of the recommender can be greatly reduced, and the recommendation accuracy is expected to improve.
There are two ways to represent a topic model: the graph model, which gives the model structure, and the generation model, which gives the process of generating topics. Since the proposed ACTTM is an enhanced topic model, we describe it with both.
3.1.1 ACTTM Graph Model
The graph model of ACTTM is shown in figure 1, and the meanings of the parameters in figure 1 are shown in table 1.
The distributions of authors over communities, communities over topics, and topics over words are multinomial distributions. The distribution of topics over time is a beta distribution. $a_d$, $w$ and $t$ are observable parameters.
3.1.2 ACTTM Generation Model
The generation model of ACTTM can be expressed as the following process:
1) For each paper $d$ in the paper set $D$, the co-author set of paper $d$ is $a_d$, and the word set of the abstract of paper $d$ is $N_d$.
2) For each author $x$ in $a_d$, compute the author-community eigenvector $\chi$ of author $x$, which is drawn from a Dirichlet distribution with a given positive parameter $\lambda$. Then select a community $c$ from the multinomial distribution with parameter $\chi$.
3) Compute the community-topic eigenvector $\Theta$ of community $c$, which is drawn from a Dirichlet distribution with a given positive parameter $\alpha$. Then select a topic $k$ from the multinomial distribution with parameter $\Theta$.
4) Compute the topic-word eigenvector $\Phi$ of topic $k$, which is drawn from a Dirichlet distribution with a given positive parameter $\beta$. Then select a word $w$ from the multinomial distribution with parameter $\Phi$.
5) Compute the topic-time eigenvector $\Psi$ of topic $k$ over time, which holds the two parameters of a beta distribution. Then select a timestamp $t$, namely the publishing time of the paper, from the beta distribution with parameter $\Psi$.
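The generative steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the toy sizes, the hyperparameter values, the uniform choice of a co-author for each word (borrowed from the AT model), and the rescaling of timestamps to the interval (0, 1) are all assumptions made for the example.

```python
import numpy as np

def generate_paper(authors, n_words, chi, theta, phi, psi, rng):
    """Draw (author, community, topic, word, timestamp) tuples for one paper.

    chi:   (A, C) author-community multinomials
    theta: (C, K) community-topic multinomials
    phi:   (K, V) topic-word multinomials
    psi:   (K, 2) Beta parameters of each topic over normalized time
    """
    tokens = []
    for _ in range(n_words):
        x = rng.choice(authors)                      # co-author, chosen uniformly (assumption)
        c = rng.choice(chi.shape[1], p=chi[x])       # community drawn from author x
        k = rng.choice(theta.shape[1], p=theta[c])   # topic drawn from community c
        w = rng.choice(phi.shape[1], p=phi[k])       # word drawn from topic k
        t = rng.beta(psi[k, 0], psi[k, 1])           # normalized timestamp drawn from topic k
        tokens.append((int(x), int(c), int(k), int(w), float(t)))
    return tokens

rng = np.random.default_rng(0)
A, C, K, V = 4, 3, 5, 20                             # toy sizes (hypothetical)
chi = rng.dirichlet(np.full(C, 0.5), size=A)         # lambda = 0.5 (hypothetical value)
theta = rng.dirichlet(np.full(K, 0.5), size=C)       # alpha = 0.5 (hypothetical value)
phi = rng.dirichlet(np.full(V, 0.1), size=K)         # beta = 0.1 (hypothetical value)
psi = np.ones((K, 2))                                # flat Beta(1, 1) over time
paper = generate_paper([0, 2], 8, chi, theta, phi, psi, rng)
```

When the model is fitted rather than simulated, every word of a paper would carry the same observed timestamp, namely the publishing time of that paper.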
Fig. 1. Graph model of ACTTM.
Table. I. Meanings of the parameters in ACTTM graph model.
Gibbs sampling is used to estimate the distributions $\chi$, $\Theta$, $\Phi$ and $\Psi$ in ACTTM. The initial values of $\chi$, $\Theta$ and $\Phi$ are respectively determined by $\lambda$, $\alpha$ and $\beta$, and $\psi_1$ and $\psi_2$ are the initial values of the two parameters of $\Psi$. In our scenario, their empirical values are set following [21].
At each iteration, every word $w \in N_d$ is concurrently assigned to a community $c \in C$, a topic $k \in K$ and an author $x \in a_d$. For every possible assignment, the assignment probability is calculated as (1):
Here $V$ is the total number of words, and the assignment variables indicate that word $w$ in paper $d$ is assigned to author $x$, community $c$ and topic $k$, respectively. $t_w$ is the timestamp of word $w$, namely the publishing time of the paper to which the word belongs. The subscript $-w$ denotes the assignments of word $w$ during the previous $(w-1)$ iteration. The author-community count denotes the number of times author $x$ is assigned to community $c$ during the current iteration, and similar notation applies to the remaining counts. The two parameters of the topic-time distribution of topic $k$ take $\psi_1$ and $\psi_2$ as their initial values.
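As an illustration of how such an assignment probability could be evaluated, the sketch below multiplies three smoothed count ratios (author to community, community to topic, topic to word) by the beta density of the word's timestamp, which is the usual collapsed Gibbs form for AT- and TOT-style models. That factorization, the count-matrix names `n_xc`, `n_ck`, `n_kw`, and the function names are assumptions for the example; the exact expression is the one given in (1).

```python
import numpy as np
from math import lgamma, log, exp

def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t, for t strictly inside (0, 1)."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(t) + (b - 1) * log(1 - t))

def assignment_probs(w, t_w, authors, n_xc, n_ck, n_kw, psi, lam, alpha, beta):
    """Normalized probability of each (author, community, topic) triple for word w.

    The counts must already exclude the current assignment of w.
    n_xc: (A, C), n_ck: (C, K), n_kw: (K, V) count matrices; psi: (K, 2).
    """
    C, K = n_ck.shape
    V = n_kw.shape[1]
    rows = n_xc[authors]                                                   # (|a_d|, C)
    p_c = (rows + lam) / (rows.sum(axis=1, keepdims=True) + C * lam)
    p_k = (n_ck + alpha) / (n_ck.sum(axis=1, keepdims=True) + K * alpha)   # (C, K)
    p_w = (n_kw[:, w] + beta) / (n_kw.sum(axis=1) + V * beta)              # (K,)
    p_t = np.array([beta_pdf(t_w, a, b) for a, b in psi])                  # (K,)
    probs = p_c[:, :, None] * p_k[None, :, :] * (p_w * p_t)[None, None, :]
    return probs / probs.sum()
```

A new triple can then be drawn with `rng.choice(probs.size, p=probs.ravel())` followed by `np.unravel_index`, after which the three count matrices are incremented for the chosen assignment.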
At the end of each iteration, the topic-time distributions of all topics are updated according to (2), and the next iteration is conducted based on the updated $\Psi$.
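One common way to perform such an update, used by the TOT model [20], is the method of moments: the two beta parameters of a topic are matched to the mean and variance of the timestamps currently assigned to it. The sketch below assumes this is also what the update step amounts to here; the function name and the fallback to the previous parameters for degenerate topics are choices made for the example.

```python
import numpy as np

def update_psi(timestamps_by_topic, psi_prev):
    """Method-of-moments refit of Beta parameters, one (psi_1, psi_2) pair per topic.

    timestamps_by_topic: list of sequences of normalized timestamps in (0, 1)
    psi_prev: (K, 2) parameters kept for topics that cannot be refitted
    """
    psi = np.array(psi_prev, dtype=float)
    for k, ts in enumerate(timestamps_by_topic):
        ts = np.asarray(ts, dtype=float)
        if ts.size < 2 or ts.var() == 0.0:
            continue                      # too little data: keep the previous parameters
        m, v = ts.mean(), ts.var()
        common = m * (1.0 - m) / v - 1.0
        if common <= 0.0:
            continue                      # variance too large for a valid Beta fit
        psi[k] = (m * common, (1.0 - m) * common)
    return psi
```

For timestamps 0.2, 0.4 and 0.6 the mean is 0.4 and the variance is about 0.0267, which gives parameters of about 3.2 and 4.8, i.e. a beta distribution whose mean is again 0.4.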
After multiple iterations, the difference between the assignments of adjacent iterations for all the words of the same paper falls below a given threshold $\tau$, and the topic-time distribution obtained at the last iteration is regarded as the final $\Psi$. Then $\chi$, $\Theta$ and $\Phi$ are estimated according to (4), (5) and (6), which yields the final author-community distribution matrix $\chi$, community-topic distribution matrix $\Theta$ and topic-word distribution matrix $\Phi$.
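For LDA-style models, the final distributions are usually read off the count matrices as smoothed, row-normalized ratios. The helper below is a minimal sketch under the assumption that (4), (5) and (6) take this standard form; `smoothed_rows` is a hypothetical name.

```python
import numpy as np

def smoothed_rows(counts, prior):
    """Row-wise estimate (n_ij + prior) / (sum_j n_ij + J * prior)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + prior) / (counts.sum(axis=1, keepdims=True) + counts.shape[1] * prior)
```

Under this assumption, the author-community matrix would be obtained from the author-community counts with prior $\lambda$, the community-topic matrix from the community-topic counts with prior $\alpha$, and the topic-word matrix from the topic-word counts with prior $\beta$.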
The above Gibbs sampling procedure is described as algorithm 1.
To increase the amount of information in the citation network and the accuracy of the academic recommender, we first build a basic bilayer citation network, which includes an author-layer citation network, a paper-layer citation network, and the writing relations between authors and papers. We then utilize the ACTTM proposed in section 3 to detect author communities in the author-layer citation network, and propose a set of metrics to depict the attributes of authors and papers more accurately. In this way, our academic recommendation takes advantage of citation information, topic community information and node attributes, which reduces the time complexity of the recommender system and improves its efficiency and accuracy.
The proposed bilayer citation network is defined as $G = (G_a, G_p, E, C)$, where $G_a$ and $G_p$ respectively denote the author layer and the paper layer of $G$, $E$ denotes the edge set between the nodes of the author layer and the paper layer, and $C$ denotes the community set of the author layer.
Define the author-layer citation network as $G_a = (V_a, E_a)$. Here $V_a$ denotes all the author nodes in the author layer, which includes $D_a$ authors, i.e., all authors in the dataset. Each author node in the author layer is characterized by an eigenvector that contains the identifier $i$ of author $a_i$; the authority vector of author $a_i$, whose $j$-th entry is the authority of author $a_i$ in community $c_j$; and the diversity and popularity of author $a_i$. $E_a$ denotes the edge set of the author layer. If the papers of author $a_i$ cite the papers of author $a_q$, then a directed edge from $a_i$ to $a_q$ is established, and the edge weight equals the citation number $num_{iq}$.
Define the paper-layer citation network as $G_p = (V_p, E_p)$. Here $V_p$ denotes all the paper nodes in the paper layer, which includes $D_p$ papers, i.e., all papers in the dataset. Each paper node in the paper layer is characterized by an eigenvector that contains the identifier $d$ of paper $p_d$ together with the authority, the diversity and the popularity of paper $p_d$. $E_p$ denotes the edge set of the paper layer. If paper $p_d$ cites paper $p_n$, then a directed edge $(d, n)$ from $p_d$ to $p_n$ is established, and the weight of the edge is 1.
The bilayer citation network is constructed through three steps: constructing basic bilayer citation network, detecting author communities, and computing the attributes of author nodes and paper nodes.
The basic bilayer citation network is composed of all author nodes $V_a$, all paper nodes $V_p$ and all edges $E_a$, $E_p$ and $E$. In this step, however, the author and paper nodes are not yet characterized with additional attributes.
The citation information among papers is extracted from the references of papers. According to the extracted citation information of all papers and the terms defined in section 4.1, if paper $p_d$ cites paper $p_n$, then there is a directed edge $(d, n)$ from $p_d$ to $p_n$, and the edge weight is 1. Considering the citation relations among all the papers in $V_p$ in turn, we get the edge set $E_p$.
The citation information among authors is extracted from the references of the papers written by authors and co-authors. According to the extracted author citation information and the terms defined in section 4.1, if author $a_i$ cites the papers of author $a_q$, then there is a directed edge from $a_i$ to $a_q$ with weight $num_{iq}$, where $num_{iq}$ equals the number of citations from author $a_i$ to author $a_q$. Considering all the authors in $V_a$ in turn, we get the edge set $E_a$.
According to the author information of papers, if author $a_i$ wrote paper $p_d$, then there is an undirected edge between $a_i$ and $p_d$. Considering all the authors and papers in the bilayer network, we get the edge set $E$.
A simplified example of the basic bilayer citation network constructed as above is shown in figure 2(a).
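The three edge sets can be derived in a single pass over the paper records. The following is a minimal sketch; the input layout (a dictionary mapping each paper identifier to its author list and reference list) and the function name are assumptions made for illustration, and references that point outside the dataset are simply skipped.

```python
from collections import defaultdict

def build_bilayer(papers):
    """papers: {paper_id: {"authors": [...], "refs": [...]}} (hypothetical layout)."""
    paper_edges = set()                  # paper layer: (citing, cited), weight 1
    author_edges = defaultdict(int)      # author layer: (citing, cited) -> citation number
    writes = set()                       # inter-layer edges: (author, paper)
    for d, rec in papers.items():
        for a in rec["authors"]:
            writes.add((a, d))
        for n in rec["refs"]:
            if n not in papers:
                continue                 # cited paper is not part of the dataset
            paper_edges.add((d, n))
            for a_i in rec["authors"]:
                for a_q in papers[n]["authors"]:
                    author_edges[(a_i, a_q)] += 1   # self-citations are kept as-is here
    return paper_edges, dict(author_edges), writes

toy = {
    "p1": {"authors": ["a1", "a2"], "refs": ["p2", "p3"]},
    "p2": {"authors": ["a3"], "refs": ["p3"]},
    "p3": {"authors": ["a3"], "refs": ["p9"]},
}
paper_edges, author_edges, writes = build_bilayer(toy)
```

In the toy data, author a1 cites two papers by a3, so the author-layer edge from a1 to a3 has weight 2.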
To reduce the computational time complexity and improve the accuracy of academic recommendation, we use the ACTTM proposed in section 3 to detect author communities in the author layer of the basic bilayer network. The authors are thus aggregated into communities based on semantic similarity, and academic recommendation only needs to be done within the several communities most relevant to users instead of the entire citation network.
According to the method described in section 3.2, author communities can be detected by using Gibbs sampling to train ACTTM, which generates the eigenvector of author $a_i$ over all communities, where $\chi_{ij}$ indicates the distribution weight of author $a_i$ over community $c_j$.
Compare $\chi_{ij}$ with a given threshold governed by $\kappa$, where $\kappa$ is a community subjection threshold proposed in this paper and determined by experiments; it decides the number of communities an author belongs to and the number of authors in each community. If $\chi_{ij}$ meets the threshold, then author $a_i$ belongs to community $c_j$. As a result, the subjection community list of author $a_i$ can be obtained, where $S_i$ is the total number of communities to which author $a_i$ belongs. After obtaining the subjection community lists of all authors, the author list of each community can be generated. In other words, the final author communities in the author layer of the bilayer citation network are detected.
Through the procedure above, authors with small distribution weights over a specific community are excluded from it, so the community-based academic recommendation described in section 5 becomes more efficient and its accuracy improves.
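The membership step can be sketched as a simple filter over the author-community weights. The exact threshold expression in terms of $\kappa$ is not reproduced here, so the sketch takes the cutoff as a plain argument; the function name and the dictionary layout are assumptions for the example.

```python
def community_membership(chi, threshold):
    """chi: {author: [weight over each community]}.

    Returns (author -> list of communities, community -> list of authors).
    """
    by_author, by_community = {}, {}
    for author, weights in chi.items():
        members = [j for j, w in enumerate(weights) if w >= threshold]
        by_author[author] = members                  # its length is the author's community count
        for j in members:
            by_community.setdefault(j, []).append(author)
    return by_author, by_community

by_author, by_community = community_membership(
    {"a1": [0.7, 0.2, 0.1], "a2": [0.4, 0.4, 0.2]}, threshold=0.3
)
```

With a cutoff of 0.3, author a1 belongs only to community 0, while a2 belongs to communities 0 and 1; community 2 receives no authors.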
Figure 2(b) is an example of author communities detected in the simplified bilayer citation network in figure 2(a).
S Wang et al. [17] improved the predictive accuracy of network service quality by considering various attributes such as time and location comprehensively. In this paper, we also consider node attributes from multiple dimensions to improve the recommendation accuracy.
Specifically, to remedy the defects that the quality of the citation network depends excessively on the quality of citation information and that the citation network is highly sparse, the author and paper nodes in our proposed bilayer citation network are characterized with multiple attributes, which depict the authors and papers more accurately and improve the effect of the academic recommender. Our main objective is to recommend influential authors and influential papers to researchers so that they can quickly become familiar with the relevant research areas or broaden their research horizons. Accordingly, three attributes, namely authority, diversity and popularity, are proposed for paper nodes and for author nodes. The metrics of these attributes are described in the following sections.
4.4.1 Paper Authority
Paper authority represents the influence of a paper’s content in a specific research field. In general, the higher the citation number, the greater the contribution of a paper to the relevant research area, and the higher the paper authority. So we consider that paper authority is decided by citation number, and the in-degree of the paper node in the paper layer of the basic bilayer citation network illustrated in Fig 2(a) is used to measure the paper authority, that is:
Where $degree(d)$ is the in-degree of paper $d$, which is the number of directed edges pointing to paper $p_d$ in the paper layer of the bilayer citation network, and the accompanying factor is a linear normalization parameter.
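A minimal sketch of this measure is given below. It counts incoming paper-layer edges and divides by the largest in-degree in the dataset; that particular choice of linear normalization, like the function name, is an assumption made for the example.

```python
def paper_authority(paper_edges, paper_ids):
    """paper_edges: iterable of (citing, cited) pairs; returns {paper: authority in [0, 1]}."""
    in_degree = {p: 0 for p in paper_ids}
    for _, cited in paper_edges:
        in_degree[cited] += 1
    top = max(in_degree.values(), default=0)     # normalization constant (assumption)
    return {p: (deg / top if top else 0.0) for p, deg in in_degree.items()}

authority = paper_authority(
    [("p1", "p2"), ("p1", "p3"), ("p2", "p3")], ["p1", "p2", "p3"]
)
```

Here p3 is cited twice and receives authority 1.0, p2 is cited once and receives 0.5, and p1 is never cited and receives 0.0.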
Fig. 2. A simplified example of bilayer citation network.
4.4.2 Paper Diversity
Paper diversity represents the degree to which a paper covers multiple research areas. Papers with higher diversity can broaden researchers’ visions. We consider that the more topics a paper covers, the greater its diversity, and that the smaller the variance of its topic distribution, the higher its diversity. So paper diversity is measured by the number of topics that a paper covers and the variance of its topic distribution.
The topic distribution of papers can be obtained by using Gibbs sampling to train an LDA model over all papers in the paper layer. The eigenvector of paper $p_d$ over topics can then be generated, in which $v_{dk}$ denotes the distribution weight of paper $p_d$ over topic $k$.
The number of topics that paper $p_d$ covers is denoted as $\zeta_d$, which counts the topics whose weight $v_{dk}$ passes a given threshold.
This threshold will be determined by subsequent experiments in section 6.2.5.
The topic distribution variance of paper $p_d$ is computed around the average topic distribution.
Finally, the diversity of paper $p_d$ can be measured as (8):
Where $\delta_p \in (0,1)$ is a coefficient for coordinating the topic number and the topic distribution variance, whose value is set in the subsequent experiments.
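To make the two ingredients concrete, the sketch below combines the share of covered topics with an evenness term derived from the variance. The specific weighted sum, the scaling of the variance by its maximum possible value $(K-1)/K^2$, and the parameter names are assumptions for illustration and may differ from the exact form of (8).

```python
import numpy as np

def paper_diversity(v_d, topic_threshold, delta):
    """v_d: topic weights of one paper (sums to 1); returns a score in [0, 1]."""
    v_d = np.asarray(v_d, dtype=float)
    K = v_d.size
    zeta = int((v_d >= topic_threshold).sum())   # number of topics the paper covers
    var = float(v_d.var())                       # variance around the average weight 1/K
    var_max = (K - 1) / K**2                     # variance of a one-hot distribution
    evenness = 1.0 - var / var_max if var_max > 0 else 1.0
    return delta * zeta / K + (1.0 - delta) * evenness
```

A paper spread evenly over four topics scores 1.0, whereas a paper concentrated on a single topic scores only `delta / 4`, which matches the intuition that more topics and a smaller variance both raise diversity.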
4.4.3 Paper Popularity
Paper popularity represents how popular a paper is among users. If a paper is operated on by many users, it is considered to accord with the recent research trend and to have higher popularity.
In this paper, we define the operation record of a paper as $(uid, pid, operate, time)$, where $uid$ is the identifier of the user, $pid$ is the identifier of the paper operated on by the user, $operate$ denotes the operation type and $time$ represents the operation time. Four types of operations are defined here: $operate=1$ denotes that a user published a paper, $operate=2$ that a user commented on a paper, $operate=3$ that a user read a paper, and $operate=4$ that a user collected a paper.
According to the operation records, the popularity of paper $p_d$ is calculated as (9):
Where $sum\{\}$ is the total number of operation records.
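A simple reading of this measure is the share of all operation records that concern a given paper. The sketch below implements that reading; treating all four operation types equally and dividing by the total record count are assumptions for the example rather than the confirmed form of (9).

```python
from collections import Counter

def paper_popularity(records):
    """records: iterable of (uid, pid, operate, time) tuples; returns {pid: share of records}."""
    per_paper = Counter(pid for _, pid, _, _ in records)
    total = sum(per_paper.values())
    return {pid: n / total for pid, n in per_paper.items()} if total else {}

popularity = paper_popularity([
    ("u1", "p1", 3, 1), ("u2", "p1", 4, 2), ("u3", "p1", 2, 3), ("u1", "p2", 3, 4),
])
```

In this toy log, three of the four records concern p1, so its popularity is 0.75 and that of p2 is 0.25.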
4.4.4 Author Authority
Firstly, the distribution weight $\chi_{ij}$ of author $a_i$ over community $c_j$ is extracted from the eigenvector of authors over communities. Then the in-degree of author $a_i$ is computed as the normalized sum of the weights $num_{qi}$ over all edges pointing to node $a_i$, where $num_{qi}$ is the weight of the directed edge from node $a_q$ to node $a_i$ and the normalization uses a linear normalization parameter. Finally, the authority of author $a_i$ over community $c_j$ can be measured as (10):
The authority eigenvector of author $a_i$ over $S_i$ communities can be expressed as (11):
Where $S_i$ is the total number of communities that author $a_i$ covers, which is determined in section 4.3.
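The sketch below shows one way the two quantities could be combined: the weighted in-degree of an author is normalized and then scaled by the author's weight in each community it belongs to. The product form and the argument names are assumptions for illustration; the exact expressions are those of (10) and (11).

```python
def author_authority(chi_i, in_edges, norm):
    """chi_i: {community: weight} over the communities the author belongs to.
    in_edges: {citing_author: citation number}; norm: linear normalization constant.
    Returns {community: authority of the author in that community}.
    """
    weighted_in_degree = sum(in_edges.values()) / norm
    return {c: w * weighted_in_degree for c, w in chi_i.items()}

authority_vec = author_authority({0: 0.6, 2: 0.4}, {"a1": 3, "a2": 1}, norm=8)
```

With four incoming citations and a normalization constant of 8, the normalized in-degree is 0.5, so the author's authority is 0.3 in community 0 and 0.2 in community 2.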
4.4.5 Author Diversity
Author diversity represents the degree to which an author covers multiple research areas. If an author’s papers involve multiple research areas, the author’s research area is considered to be extensible and the author is considered helpful in broadening other researchers’ visions. Similar to paper diversity, the number of communities the author belongs to and the community distribution variance are both taken into account to measure author diversity: the more communities an author belongs to, the greater the author’s diversity, and the smaller the author’s community distribution variance, the higher the author’s diversity.
Based on section 4.3, the number of communities that author $a_i$ belongs to can be expressed as $S_i$.
Finally, the diversity of author $a_i$ can be measured by (12):
Where $\delta_a \in (0,1)$ is a coefficient for coordinating the community number and the community distribution variance, which is set as $\delta_a=0.6$ in subsequent experiments.
4.4.6 Author Popularity
Author popularity can be measured based on paper popularity, as shown in (13):
Where $PS_i$ denotes all the papers published by author $a_i$, and the remaining term is the popularity of paper $p_d$.
Based on the bilayer citation network constructed in section 4, we propose an academic recommendation scheme that includes three parts: constructing the author-paper bilayer citation network, modeling the user interest community, and academic recommendation.
To recommend personalized academic resources to users, we model the user interest community based on the user’s operation records. First, the user interest community list is generated by extracting the user’s operation records and applying ACTTM-based topic prediction. Then the user attributes, including user authority and user diversity, are measured using methods similar to those in section 4.4. Finally, the user interest community model is established from the user interest community list and the user attributes.
The user interest community model of user $u_m$ consists of four components: the user identifier $m$; the authority vector of $u_m$ over all the author communities detected by the ACTTM proposed in section 3; the diversity of $u_m$, which indicates the diversity of the papers operated on by $u_m$; and the interest community list $C_m$ of $u_m$. The methods for measuring these components are described below.
As described in section 4.4.3, each operation record of a user carries a user identifier uid. According to uid, the papers operated on by user $u_m$ can be extracted and regarded as papers written by $u_m$, so ACTTM can also be used to generate the user-community distribution, in which each entry is the probability that user $u_m$ belongs to community $c_j$. The communities that meet the selection condition, which is defined relative to the maximum distribution of user $u_m$ over all communities, are selected to constitute the user interest community list, where $N_m$ is the number of communities in the list.
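The selection of interest communities can be sketched as follows. The exact condition is an assumption here: a community is kept when its probability is at least a fixed fraction of the user’s maximum probability. The function name `interest_communities` and the parameter `ratio` are hypothetical and introduced only for illustration.

```python
def interest_communities(user_distribution, ratio=0.8):
    """Select communities whose probability is close to the user's maximum.

    user_distribution: {community_id: probability that the user belongs to it}
    ratio: hypothetical fraction of the maximum probability a community
           must reach to be kept (assumed form of the selection condition).
    Returns community ids ordered by decreasing probability.
    """
    if not user_distribution:
        return []
    peak = max(user_distribution.values())
    selected = [c for c, p in user_distribution.items() if p >= ratio * peak]
    return sorted(selected, key=lambda c: user_distribution[c], reverse=True)


example = interest_communities({3: 0.50, 7: 0.45, 12: 0.05}, ratio=0.8)
```

In this example communities 3 and 7 are kept, so the list length (the quantity $N_m$ in the text) is 2.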
The user authority vector is an eigenvector whose entries are the authority of user $u_m$ over each community $c_j$, measured in the same way as author authority.
Similar to author diversity, user diversity is measured in the following steps. First, the number of communities $N_m$ in the user interest community list is obtained. Then the variance of the user’s distribution over the interested communities is computed with respect to the average distribution. Finally, the diversity of user $u_m$ is computed from these two quantities, where $\delta_u \in (0,1)$ is a linear normalization parameter, which is set to $\delta_u = 0.6$ in subsequent experiments.
5.2.1 Recommending authors
Based on the bilayer citation network and user interest communities, we take into account the attributes proposed in section 4.4 to compute the similarity between the target user and the authors within the user’s interest communities, and generate the author recommendation list for the user. The details are as follows:
(1) Obtain the authors of each community $c_{mj}$ to which user $u_m$ belongs.
(2) Compute the weight coefficients of user $u_m$ for the authors in community $c_{mj}$ according to authority and diversity.
(3) Compute the preference of user $u_m$ for each author $a_i$ in community $c_{mj}$.
(4) Sort the authors in community $c_{mj}$ by this preference. Select the top-ranked authors, with the number selected depending on topN and the authority of the target user $u_m$, to construct the initial author recommendation list, where topN is a constant value.
(5) Generate the initial author recommendation lists for all interest communities of user $u_m$ and merge them. After eliminating duplicate authors, the final author recommendation list is obtained.
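Steps (4) and (5) above can be sketched in Python as follows. The preference scores are taken as given inputs, and the rule that the number of authors drawn from a community equals topN scaled by the user’s authority in that community (rounded up) is an assumption made for illustration; the function name and data layout are likewise hypothetical.

```python
import math


def recommend_authors(preferences, user_authority, top_n=50):
    """Merge per-community author rankings into one duplicate-free list.

    preferences: {community_id: {author_id: preference score of the user}}
    user_authority: {community_id: user's authority there, assumed in [0, 1]}
    top_n: the constant topN from the text.
    """
    merged = []
    seen = set()
    for community, scores in preferences.items():
        # Step (4): sort the authors of this community by preference.
        ranked = sorted(scores, key=scores.get, reverse=True)
        # Assumed quota rule: topN scaled by the user's authority.
        quota = math.ceil(top_n * user_authority.get(community, 0.0))
        for author in ranked[:quota]:
            # Step (5): merge the initial lists and drop duplicates.
            if author not in seen:
                seen.add(author)
                merged.append(author)
    return merged


prefs = {1: {"a": 0.9, "b": 0.5, "c": 0.1}, 2: {"b": 0.8, "d": 0.7}}
authority = {1: 0.5, 2: 1.0}
final_list = recommend_authors(prefs, authority, top_n=2)
```

With these toy inputs the first community contributes one author and the second contributes two, giving a final list of three distinct authors.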
5.2.2 Recommending papers
Based on the final author recommendation list above and the writing relations between authors and papers, the paper recommendation list for user $u_m$ can also be generated as described below.
For each author in the author recommendation list:
(1) Compute the preferences of user $u_m$ for the papers written by the author.
(3) Generate the initial paper recommendation lists of all authors in the final author recommendation list and merge them. After eliminating duplicate papers, the final paper recommendation list is obtained.
In order to evaluate the proposed ACTTM model and the academic recommendation method based on our proposed bilayer citation network, we built an academic recommendation platform, in which several academic recommendation algorithms proposed by our research team, including the method proposed in this paper, were implemented and deployed. The dataset of the platform includes paper information, author information, and citation information, which we crawled from Microsoft Academic Search (http://academic.research.microsoft.com) and stored locally for academic research. It covers 250,000 papers published from 1997 to 2012 and the corresponding 234,000 authors in the computer science area. Through this platform, registered users can search, download, read, collect and share papers and authors, and these user operations are recorded in the system log of the platform. Based on users’ historical operations, the platform can recommend authors and papers to them. Unfortunately, the platform currently has few registered users, so only the operation records of 20 registered users were extracted from the system log for scheme validation. The definition of an operation record is given in section 4.4.3.
As mentioned in section 4.3, the community number affects the training results of ACTTM and further affects the community detection results in the bilayer citation network. In this experiment, we use modularity to measure the influence of the community number on the ACTTM-based community detection results. We set the topic number $K=100$, $\lambda=50/C$, $\alpha=50/K$, $\beta=0.01$, $\psi_1=\psi_2=1$, and vary the community number $C$ from 10 to 100. According to sections 4.3 and 4.4, the community subjection threshold $\kappa$ and the topic cover threshold may also influence the recommendation accuracy, so we vary both from 0 to 1. According to the recommendation accuracy, we finally set $C=30$, $\kappa=0.8$ and the topic cover threshold to 0.9.
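The parameter settings can be collected into a small configuration helper. This is only a sketch: the function name `acttm_hyperparameters`, the dictionary keys, and the step of 10 used to sweep the community number are assumptions for illustration, and the garbled setting of the two ψ parameters is read as both being equal to 1. The values $K=100$, $\lambda=50/C$, $\alpha=50/K$, $\beta=0.01$ and the range 10 to 100 come from the text.

```python
def acttm_hyperparameters(num_communities, num_topics=100):
    """Return the hyperparameters for one value of the community number C."""
    return {
        "C": num_communities,
        "K": num_topics,
        "lambda": 50.0 / num_communities,  # lambda = 50 / C
        "alpha": 50.0 / num_topics,        # alpha = 50 / K
        "beta": 0.01,
        "psi1": 1.0,
        "psi2": 1.0,
    }


# Sweep C from 10 to 100; the step of 10 is an assumption.
sweep = [acttm_hyperparameters(c) for c in range(10, 101, 10)]
```

Each entry of `sweep` could then be passed to a training routine and scored by modularity.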
The following experiments are designed to demonstrate the superiority of the proposed academic recommendation method.
Experiment 1: train ACTTM and analyze the results, including author-community distribution, community-topic distribution, and topic-time distribution.
Experiment 2: compare the proposed ACTTM based community detection algorithm with two baseline community detection methods, and verify the superiority of ACTTM.
Experiment 3: verify and analyze the influence of different types of node attributes on the recommendation results, and compare our proposed academic recommendation method based on ACTTM community and bilayer citation network with four other recommendation algorithms.
6.2.1 Results of ACTTM training
With the parameters determined in section 6.2.1, we extracted all the paper information and author information from the dataset in our recommender platform to train ACTTM, in which, the paper information includes paper title, abstract and keywords. By this process, we obtained the distribution probability of authors over communities, the distribution probability of communities over topics and the variation of topics over time.
The results are presented and analyzed below.
(1) Author-community distribution
To show the results clearly, the community distributions of 30 authors are randomly selected from the results for the 234,000 authors in the dataset. Figure 3 shows the distribution probability of the 30 authors over 30 communities, in which each line represents the distribution probability of an author.
As shown in figure 3, the author-community distribution presents two trends. One kind of author has a high distribution probability over only one community, which indicates that these authors focus on a single area with unitary research content; they are called single-area authors in this paper. The other kind, called multi-disciplinary authors, has a high distribution probability over several communities. The single-area authors can be further divided into two kinds: junior researchers who have published only a few papers, and senior researchers who have published many papers whose contents cover the same area. We consider that senior researchers have conducted more in-depth research in their area. The multi-disciplinary authors tend to span multiple areas. For users who want to expand their research area, the papers of multi-disciplinary authors can provide good direction and inspiration.
Fig. 3. Author-community distribution.
(2) Community-topic distribution
Figure 4(a) shows the distribution probability of 30 communities over 60 topics, in which, each line represents the distribution of one community.
We select the 17th community (information retrieval) and the 28th community (network security) in figure 4(a) to analyze the community-topic distribution, as shown in figure 4(b). The higher the distribution probability of a community over a topic, the greater the probability that the research field belongs to the topic.
Figure 4(b) shows that the distribution probabilities of the 17th community over the 5th topic (textual retrieval), the 37th topic (ontology and metadata) and the 58th topic (semantic analysis) are very high. Thus, it can be inferred that the research content of the authors in this community most likely belongs to these three topics. The situation is similar for the 8th topic (attack analysis) and the 30th topic (IPSec) of the 28th community.
The results above indicate that the information retrieval field has three research directions: textual retrieval, ontology and metadata, and semantic analysis, while the network security field includes two research directions: attack analysis and protocol security.
(3) Topic-time distribution
Figure 5 shows the evolution of 60 topics from 1997 to 2012. It can be seen from figure 5 that, with the passage of time and the development of technology, computer science is continuously developing, so the curves in figure 5 generally show an increasing trend. For a specific topic, if its distribution probability grows rapidly, the topic may be a current research hotspot in which more and more researchers are interested. When a curve declines from a certain time, the number of relevant researchers has begun to decrease, possibly because the relevant technology has matured or because people are gradually losing interest in it.
6.2.2 ACTTM-based Author Community Detection
This experiment compares the ACTTM-based community detection algorithm with the following two community detection algorithms in terms of modularity.
(1) Community detection based on NF [26]: NF is used to detect author communities in the author layer based on author citation information.
(2) Community detection based on VAT [7] and NF: first, the author-topic distribution is extracted based on VAT and the KL distance between authors is calculated from this distribution; then the similarity between authors is computed. If the similarity between two authors is greater than a given threshold, the authors are classified into the same community.
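The KL-based step of method (2) can be sketched as follows. Several details are assumptions, since they are not specified here: the KL distance is symmetrized by averaging both directions, the similarity is taken as `1 / (1 + distance)`, and the function names and the toy distributions are hypothetical. The pairs returned are those whose similarity exceeds the threshold, i.e. the authors that would be linked.

```python
import math
from itertools import combinations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps is a small smoothing constant that avoids division by zero.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def topic_similarity(p, q):
    """Assumed similarity: 1 / (1 + symmetrized KL distance), in (0, 1]."""
    distance = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return 1.0 / (1.0 + distance)


def related_pairs(author_topics, threshold):
    """Return author pairs whose topic similarity exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(author_topics), 2)
        if topic_similarity(author_topics[a], author_topics[b]) > threshold
    ]


topics = {
    "a": [0.7, 0.2, 0.1],
    "b": [0.6, 0.3, 0.1],
    "c": [0.05, 0.05, 0.9],
}
pairs = related_pairs(topics, threshold=0.9)
```

Here authors "a" and "b" have nearly identical topic distributions and are linked, while "c" is not linked to either.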
According to the proposed ACTTM, the 234,000 authors in the dataset are classified into 30 different communities, where the community number is set to 30 according to the previous experiment in section 6.2.1. To present the detected communities clearly in a graph, we randomly select 8,000 authors belonging to 12 communities, and each selected author appears only in the community with the maximum author-community distribution, as shown in figure 6(a). Figure 6(b) shows 690 authors randomly selected from the 12 communities in figure 6(a).
For the 690 authors in figure 6(b), we generate the author citation network according to their citation relations and use the NF algorithm to detect communities in the network; the results are shown in figure 6(c). Because the citation relations are few and dispersed, the authors are classified into many communities, each of which includes only a few authors. In particular, as shown in the central area of figure 6(c), many isolated nodes are each treated as a community with only one node. Compared with the results in figure 6(b), the community detection results based on citation information alone are very poor.
Fig. 4. Community-topic distribution based on ACTTM.
Fig. 5. The development and migration of topics during 1997-2012.
For the same 690 authors, a topic-related network is constructed using the VAT model. Applying the NF community detection algorithm to this topic-related network yields the community detection results in figure 6(d). This shows that community detection based on the pure citation network relies excessively on citation information, while community detection based on topic relevance alleviates this problem and gives better results. However, comparing figure 6(d) with figure 6(b), we can see that ACTTM-based community detection considers the content relevance among authors, so its communities are more compact and are not affected by the sparsity of the citation information.
We use community modularity to evaluate the community detection algorithms above. The modularity of overlapping communities [27] is denoted $Q_{ov}$ and is computed from the following quantities: $O_v$ is the number of communities containing node $v$, $A$ is the network adjacency matrix, $m$ is the number of edges in the network, and $k_v$ is the degree of node $v$. The maximum value of $Q_{ov}$ is 1; the closer $Q_{ov}$ is to 1, the more obvious the community structure.
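The symbols $O_v$, $A$, $m$ and $k_v$ match a widely used form of overlapping modularity, $Q_{ov} = \frac{1}{2m}\sum_{c}\sum_{v,w \in c}\frac{1}{O_v O_w}\left[A_{vw} - \frac{k_v k_w}{2m}\right]$. Assuming this form (an assumption, since it is inferred from the symbol definitions rather than taken from [27]), a minimal Python sketch for an undirected, unweighted graph could look as follows. The function name and input format are illustrative.

```python
def overlapping_modularity(edges, communities):
    """Overlapping modularity under the assumed formula.

    edges: iterable of undirected (v, w) pairs; self-loops are ignored.
    communities: list of sets of nodes; a node may appear in several sets.
    """
    edge_set = {frozenset(e) for e in edges if e[0] != e[1]}
    m = len(edge_set)
    if m == 0:
        raise ValueError("the network must contain at least one edge")

    # k_v: degree of each node.
    degree = {}
    for edge in edge_set:
        for v in edge:
            degree[v] = degree.get(v, 0) + 1

    # O_v: number of communities containing each node.
    membership = {}
    for community in communities:
        for v in community:
            membership[v] = membership.get(v, 0) + 1

    total = 0.0
    for community in communities:
        for v in community:
            for w in community:
                a_vw = 1.0 if v != w and frozenset((v, w)) in edge_set else 0.0
                expected = degree.get(v, 0) * degree.get(w, 0) / (2.0 * m)
                total += (a_vw - expected) / (membership[v] * membership[w])
    return total / (2.0 * m)


# Two triangles joined by a single edge, split into their natural communities.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
q_split = overlapping_modularity(toy_edges, [{1, 2, 3}, {4, 5, 6}])
q_single = overlapping_modularity(toy_edges, [{1, 2, 3, 4, 5, 6}])
```

When no node belongs to more than one community, every $O_v$ equals 1 and this expression reduces to the standard (non-overlapping) modularity, which gives a convenient sanity check.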
Fig. 6. Results of different community detection methods.
In our experiments, the modularity values of the three methods (the citation network based NF method, the VAT-based NF method, and the ACTTM method) are 0.175, 0.406 and 0.532, respectively. The modularity of the citation network based NF algorithm is the lowest. This is because the citation relations in the citation network are sparse, so this algorithm detects the largest number of communities. The VAT-based NF community detection method extracts correlation information among authors through the VAT topic model, which enriches the relevance among authors, so its community results are better than those of the citation network based NF algorithm and its modularity is clearly improved. The modularity of the ACTTM-based community detection algorithm is higher than that of both other algorithms, which makes it more suitable for personalized academic recommendation.
In summary, the proposed ACTTM can model author-community information, community-topic information, and topic-time information. According to the results of ACTTM training, the author-community distribution can be used to discover authors’ research areas; the community-topic distribution can be used to analyze the topics that communities belong to, which makes it possible to extract detailed research topics from a large research area; and the topic-time distribution can be used to analyze topic and community evolution over continuous time and to obtain eligible communities during a certain period of time. In particular, when citation information is sparse, the communities detected using ACTTM are more reasonable, which is more beneficial for personalized academic recommendation.
6.2.3 Evaluating the effect of the academic recommendation method based on ACTTM community and bilayer citation network
In this experiment, we evaluate the influence of different node attributes on the results of our academic recommendation method, and compare our method with the following four baseline methods:
● NF-based recommendation method (NF-based): uses the NF algorithm [26] instead of ACTTM to detect author communities, then uses the method described in section 5 to recommend authors and papers.
● VAT-based recommendation method (VAT-based): uses the community detection algorithm based on the VAT topic model [7] to detect author communities, then uses the method described in section 5 to recommend authors and papers.
● Traditional item-based collaborative filtering algorithm (CF-Item): uses the traditional item-based collaborative filtering algorithm to recommend authors and papers.
● Community-based collaborative filtering algorithm (SAC1) [28]: improves the NF algorithm by detecting communities according to both the edges among nodes and the attributes of nodes, then applies collaborative filtering within each community to recommend authors and papers.
Our study has two objectives: 1) to recommend influential authors and papers to junior researchers in their research field, so that they become familiar with the field rapidly; 2) to recommend authors and papers to senior researchers that can broaden their research horizons. Therefore, the authors, papers and users in our recommendation method are each characterized by three attributes: authority, diversity and popularity. To analyze the influence of different attributes on the recommendation results, different attribute combinations are tested based on the methods described in section 5.2, and recommendation accuracy is used to evaluate the recommendation effect. Specifically, based on the system log of our recommendation platform, we extracted 70% of the authors and papers that users operated on to build the user interest model and obtain the recommendation results. The remaining 30% of the operated authors and papers were used to evaluate recommendation accuracy.
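The 70%/30% evaluation protocol can be sketched as follows. The random split, the fixed seed, and the definition of accuracy as the fraction of recommended items that appear in the held-out 30% are assumptions for illustration, since the exact accuracy formula is not stated here; the function names are hypothetical.

```python
import random


def split_operations(items, train_fraction=0.7, seed=0):
    """Randomly split a user's operated items into modeling and held-out parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def recommendation_accuracy(recommended, held_out):
    """Assumed metric: share of distinct recommended items found in held_out."""
    distinct = set(recommended)
    if not distinct:
        return 0.0
    return len(distinct & set(held_out)) / len(distinct)


train, test = split_operations(range(10))
accuracy = recommendation_accuracy(["p1", "p2", "p3", "p4"], ["p2", "p4", "p9"])
```

The `train` part would be used to build the user interest model, and the recommendation list produced from it would be scored against `test`.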
Table 2 shows the recommendation accuracy of our proposed recommendation method (ACTTM-based) under different attribute combinations, where $TopN=50$ and $\kappa=0.8$. For comparison, the experimental results of the other four methods are also provided in table 2.
The experimental results in table 2 show that, when all three attributes are considered together, the recommendation effect of the ACTTM-based method is the best, and the recommender can provide users with a better experience according to their different needs. Furthermore, the ACTTM-based method improves the recommendation accuracy by considering both the authority attribute and the diversity attribute, while the popularity attribute has only a small influence on the recommendation results. This means that considering node attributes can improve recommendation accuracy. However, user operations on our platform are currently limited, so the contribution of popularity to the recommendation accuracy is limited. As the number of users on our platform increases, the number of user operations will also increase, and the impact of the popularity attribute on the recommendation results will become more obvious. In this case, how to prevent malicious user feedback [18] is also something we should consider.
Table II. Recommendation accuracy based on different attribute combinations
However, compared with the NF-based method and the VAT-based method, the recommendation effect of the ACTTM-based method differs because of the different community structures. The NF-based community detection algorithm takes only the citation relations between authors into consideration and ignores the content similarity of the cited papers. As a result, two authors in similar fields without citation relations may be classified into different communities, and the recommender then cannot provide users with accurate results. The VAT-based community detection method considers only the similarity of research content between authors and ignores the citation information, which decreases the recommendation diversity and cannot meet the different requirements of different users. The ACTTM-based method proposed in this paper considers both the citation relations and the research content similarity between authors, and so improves the recommendation accuracy markedly.
Table 2 also shows that the recommendation accuracy of the item-based collaborative filtering algorithm (CF-Item) is the lowest. This is because the CF-Item method is mainly influenced by user behaviors, while the number of users and their behaviors on our experimental platform are very limited. Compared with the traditional collaborative filtering algorithm, the SAC1 method slightly improves the recommendation effect, but its results are still relatively poor.
By analyzing the recommendation accuracy of all the methods above, we can draw the following conclusion: compared with the collaborative filtering based on items, the accuracy of our recommendation method is increased by 376%; compared with the NF based recommendation method, the accuracy is increased by 150%; compared with the VAT-based method, the accuracy is increased by 49%.
In view of the deficiencies of citation network based academic recommendation methods, such as sparse information, high computational time complexity and low recommendation accuracy, we propose a new academic recommendation method for recommending authors and papers. In this method, ACTTM is proposed to model author communities based on paper content, and a bilayer citation network is constructed according to paper citations and author citations. The ACTTM community information is then mapped to the author layer of the bilayer citation network to enrich the relevance between authors and reduce the time complexity of the recommender. Moreover, we propose a set of attributes, namely authority, popularity and diversity, to characterize authors, papers and users in multiple dimensions, improve the recommendation accuracy and better meet users’ personalized requirements.
To evaluate our proposed method, we built an academic recommender platform and carried out detailed comparative experiments on a real dataset. The experimental results show that, compared with the baseline community detection methods, ACTTM can construct high-quality and reasonable author communities based on paper contents, which accurately reflect the actual relationships among authors and effectively reduce the time complexity of the recommender. The proposed attributes can be used to extract user interest more accurately and clearly improve the recommendation effect. Compared with the baseline recommendation algorithms, our recommendation method based on ACTTM community and bilayer citation network can recommend papers and authors from the perspectives of both depth and breadth, and clearly improves the recommendation accuracy.
However, this work has limitations that need further study. First, when calculating diversity, we consider only the variance of the author-community distribution and the paper-topic distribution, which may not accurately reflect the characteristics of papers and authors. Second, the number of user operation records on the current platform is relatively small, which limits the evaluation of the recommendation results. Therefore, in future work, we will expand the application scope of the recommender platform and use a large number of user operation records to evaluate the characteristics and advantages of our method.
ACKNOWLEDGEMENTS
The authors would like to thank the reviewers for their detailed reviews and constructive comments, which have helped improve the quality of this paper. This work is supported by the grants from Natural Science Foundation of China (Project No. 61471060).