Yichao SHAO,Zhiqiu HUANG,Weiwei LI,Yaoshen YU
1School of Computer Science and Technology,Nanjing University of Aeronautics and Astronauti cs,Nanjin g 211100,China
2 Key Laboratory of Safety-Critical Software,Ministry of Industry and Information Technology,Nan jing 211100,China
3Collaborative Innovation Center of Novel Software Technology and Industrialization,Nanjing 210016,China
?E-mail:shaoyichao@nuaa.edu.cn;zqhuang@nuaa.edu.cn
Abstract:Software developers often write code that has similar functionality to existing code segments.A code recommendation tool that helps developers reuse these code fragments can significantly improve their efficiency.Several methods have been proposed in recent years.Some use sequence matching algorithms to find the related recommendations.Most of these methods are time-consuming and can leverage only low-level textual information from code.Others extract features from code and obtain similarity using numerical feature vectors.However,the similarity of feature vectors is often not equivalent to the original code’s similarity.Structural information is lost during the process of transforming abstract syntax trees into vectors.We propose an approximate sub-tree matching based method to solve this problem.Unlike existing tree-based approaches that match feature vectors,it retains the tree structure of the query code in the matching process to find code fragments that best match the current query.It uses a fast approximation sub-tree matching algorithm by transforming the sub-tree matching problem into the match between the tree and the list.In this way,the structural information can be used for code recommendation tasks that have high time requirements.We have constructed several real-world code databases covering different languages and granularities to evaluate the effectiveness of our method.The results show that our method outperforms two compared methods,SENSORY and Aroma,in terms of the recall value on all the datasets,and can be applied to large datasets.
Key words:Code reuse;Code recommendation;Tree similarity;Structure information
In software development,developers tend to re‐use existing code which can achieve the desired func‐tionality or behavior to assist their development pro‐cedure.Research shows that,on average,software developers spend about 19%of their development time on web searches,looking mainly for code exam‐ples for their tasks(Rahman et al.,2016).There‐fore,an automatic code snippet recommendation tool could help developers greatly improve development efficiency.
However,searching instructive code based on a user’s programming context is not always easy.Most of the current code recommendation tools are based on textual matching.They represent the context code fragment as tokens or lines of code,calculating the code similarity through sequence matching or a bagof-words model,and return the most relevant code snippets(Ye and Fischer,2002;Holmes and Murphy,2005;Antunes et al.,2014;Rahman and Roy,2014).However,these methods fail to capture high-level structural information,which means that it would be hard to obtain the best result when there is no highly similar match in the code repository.
Abstract syntax tree(AST)carries the structural information of code,but a traditional tree matching process consumes too much time for code recom‐mendation tasks.To solve this problem,some re‐searchers proposed to extract features from the AST of the code and used these,rather than trees,for match‐ing(Jiang LX et al.,2007;Luan et al.,2018;?ura?ík et al.,2020).However,because the ideal candidate code length should be larger than the query code,these candidate codes often contain more sub-tree structures and feature patterns.That is to say,al‐though two code snippets may obtain a high simi‐larity score,they may not be related code because the distribution of these features in candidate code can be different from that in query code.Therefore,methods using such techniques may bring more false-positive results.
In this study,we propose a novel code recom‐mendation method based on AST sub-tree matching.Compared with methods that transform all ASTs into multiple features,we retain the tree structure of the query code in the matching process.A candidate code with a complete query code structure will gain a higher similarity score than those that share the same feature sets but have different overall struc‐tures,and thus will have a higher chance of being recommended.We use hash values to record every sub-tree in the AST and speed up the matching pro‐cess by comparing the hash value only of sub-trees with the same number of nodes.Our method in‐cludes three stages:data processing,coarse-grained search,and fine-grained re-ranking.First,we con‐struct the code database using AST.In this stage,we calculate the hash value and the number of nodes for each sub-tree in AST,and store them using a list structure.This helps reduce the costs of time and space in the matching process.Second,we traverse the AST of a given query code snippet to calculate the similarity between the query code and the code‐base’s candidate code.In this stage,the similarity measurement of two code snippets is the number of nodes in all the most similar sub-trees of their ASTs.Based on the similarity score,a top-Kcandidate set will be generated.Third,we obtain the AST preor‐der traversal sequence for both the query code and each top-Kcandidate code.We calculate their simi‐larity using the Smith–Waterman(SW)algorithm to fully mine the sequence and continuity information in the code and re-rank the top-Klist.In this way,we can eventually obtain the recommending result list.
We have carried out an extensive empirical eval‐uation of our method applied to several large data‐sets,covering different languages and granularities.We compared our method with two strong baseline studies,and the experimental results showed that our method outperforms the two compared methods on all the datasets.
This study has the following main contributions:
1.We introduce a new code recommendation al‐gorithm based on sub-tree hashing and the SW algo‐rithm.It takes programming context as input and rec‐ommends relevant code snippets to assist developers in software development.
2.We implement a code recommendation tool for Java and C.Experimental results show that it has good performance in terms of both time consumption and accuracy for different recommending tasks.
In this section,we describe some related stud‐ies,split into two categories:(1)tree similarity detec‐tion;(2)code recommendation.
1.Tree similarity detection.Since tree-like data structure has gained more popularity in recent years,many studies have proposed to focus on similarity detection on trees.However,since most existing methods cannot be extended to large-scale datasets,how to achieve fast tree similarity detection is still an open question.Some researchers use tree edit dis‐tances to measure the similarity(Zhang and Shasha,1989;Shasha et al.,1994;Chen and Zhang,2014).Such edit-distance-based methods are widely used,but they obtain the similarity through the difference between two trees.This is inappropriate for tasks in which one tree contains another,which happens fre‐quently in code recommendation.Other researchers have proposed to extract features from trees and transform the tree similarity detection problem into feature matching on numerical vectors,which is fast and easy(Jiang LX et al.,2007;Luan et al.,2018;?ura?ík et al.,2020).However,the similarity of feature sets is not equivalent to the similarity of trees.These methods abandon the original structure of the trees and the continuity information of nodes,and may re‐turn false high similarity results.
2.Code recommendation.Code snippet recom‐mendation is a hot issue that attracts many research‐ers.Most researchers focus on improving the preci‐sion and speed of recommending to improve the users’experience.Techniques like information retrieval(Sahavechaphan and Claypool,2006;Jiang H et al.,2019)and pattern matching(Jiang LX et al.,2007;?ura?ík et al.,2020)are widely used in their work.Most existing methods are based on textual informa‐tion.They treat code snippets as a set of tokens or code lines(Ye and Fischer,2002;Holmes and Mur‐phy,2005;Antunes et al.,2014;Rahman and Roy,2014;Ai et al.,2019).Some researchers parse the code snippets into AST and use tree-based similarity detection techniques to obtain recommendation re‐sults(Luan et al.,2018;?ura?ík et al.,2020).The ad‐vantage of such methods is that they exploit the syn‐tax information of the code,but they often consume more time.
Fig.1 illustrates the overall architecture of our framework.To use the codebase efficiently,we first calculate and store the hash value and node number of each sub-tree in their ASTs.Then,we traverse the AST of the given query code to calculate the similarity between the query code and all the candi‐date code in the codebase.Multiple candidate code segments with the highest similarity will be selected.Finally,we calculate the AST preorder sequence simi‐larity between the query code and the top-Kcandi‐date code using the SW algorithm.We re-rank the candidate code list based on the combination of two kinds of similarity and return the final recommenda‐tion result.
Fig.1 The framework of code recommendation(Cand.:Candidate)
Next,we describe the details of each step using the code fragment in Listing 1 as a running example.
Listing 1 A simple piece of code used as the run‐ning example through Section 3 for(int i=0;i<10;i++)a+=0.5;end for
In this part,we show the procedure for data pro‐cessing;i.e.,how we transform candidate source code files into the form needed for the recommendation.
3.1.1 Tree parsing
Since our method is based on AST,the first thing we need to do is to parse the code.We parse Java code with Javalang(https://github.com/c2nes/javalang),and parse C code using Clang(https://clang.llvm.org/).Our method can work at either the file level or the method level.For individual Java methods,we add a foo class so that the parser can process them,while for file-level code,we exclude package declarations and import statements,since they are often generated by the integrated development environment(IDE)and not relevant for recommending purposes.
3.1.2 Featurization
We first map each node in the AST to a hash value.Fig.2 visualizes the simplified parse tree and the corresponding hash value for each node of the code snippet in Listing 1.In principle,different AST nodes should have different values.However,to im‐prove the detection effect,we map similar nodes to the same hash value to avoid the reduction of simi‐larity caused by minor differences among similar nodes(Yang et al.,2018).For example,in Fig.2,both FLOATING_LITERAL and INTEGER_LITERAL belong to numerical types,so they are given the same hash value of 13.We classify nodes according to their functions.The specific mapping rules may change according to different languages or parsers.In our experiment,we use Javalang as the parser of Java,and record the classification details in Appen‐dix A.
Fig.2 AST and the hash value of each node for code in Listing 1
Nodes under the same classification will be mapped to the same hash value.In a specific imple‐mentation,it can be supplemented or adjusted ac‐cording to the required recommending effect.
Then we calculate the hash value of each subtree in the AST.As Fig.3 shows,we use a<node number,hash value>tuple to represent the informa‐tion of the sub-tree rooted at the current node.We re‐cord the node number of each sub-tree and simply de‐fine the hash value of a tree as the sum of the hash values of each node in this tree.For the snippeti<10 in Listing 1,its hash value would be Hash(BINA‐RY_OPERATOR)+Hash(DECL_REF)+Hash(INTE‐GER_LITERAL)=2+17+13=32,and obviously its node number is 3.
Fig.3 Hash tree for code in Listing 1,where each node re‐cords the sub-tree information rooted at the current node
To speed up the subsequent matching and save memory space,we discard the tree structure and store sub-tree information in the form of lists.The intermediate representation of the code fragment in Listing 1 is shown in Fig.4.The first dimension of the list represents the number of nodes from 1 toN,and the hash values of sub-trees with corresponding node numbers are joined into one list.Each hash value list is kept to speed up the search process.We also re‐cord the index of code and the total number of nodes in the tree at the start of the list.
Fig.4 Intermediate representation of candidate code in Listing 1
In this part,we calculate the similarity scores between the query code and each candidate code,and obtain a candidate code set containingKcode snippets with the highest similarities.
Like candidate code processing,we map the AST of a given query code to a hash tree,similar to Fig.3.We first filter out the candidate codes that cannot reach the size threshold according to the num‐ber of nodes.We traverse the query hash tree in pre‐order and search for the hash value of the current node in the candidate hash value list.Since sub-trees with different node numbers represent different struc‐tures,only pairs of sub-trees with the same node number need to be compared.For each sub-tree node in the query hash tree,we obtain the hash value list with the same node number and search for the target hash value using a binary search.If found,the subtree represented by the current node is regarded as a match,and all its child nodes will be skipped in the subsequent traversal.We take the total number of nodes under all common sub-trees as the similarity score and recordKcode fragments with the highest score.Algorithm 1 shows the process of search,which we will introduce in three parts:preliminary filtering,sub-tree validation,and approximate matching.
3.2.1 Preliminary filter based on the node number
When applying code recommendations to real tasks,the vast majority of code snippets in code data‐bases are irrelevant to the query code.However,we need to process all possible pairs of methods to find out potential results.This can be extremely costly,es‐pecially on very large datasets.For candidate code fragments whose number of nodes is smaller than that of the query code,their maximum similarity score is the ratio of the number of common nodes to the total number of query nodes.We use size-based heuristics to aggressively eliminate unlikely candi‐date code snippets upfront.The intuition is that two methods with considerably different sizes are very unlikely to implement the same,or even similar,functionality(Saini et al.,2018).In addition,useful code recommendations are usually larger than query code so that developers can obtain a reference from the extra part.So,for each candidate code,we first judge whether it satisfies Eq.(1):
where|candidate_n|represents the node number of the current candidate tree,|query|represents the node number of the query tree,andλis an adjustment factor.
If Eq.(1)is satisfied,subsequent matching will be carried out;otherwise,the candidate code will be omitted.This heuristic can lead to some false nega‐tives,especially for code fragments with the same function but differing in texture or structure(Saini et al.,2018).However,in our experiments described below in Section 4.2.3,we observe little or no impact of this on the recall value of the final result under ap‐propriate parameter settings.
Algorithm 1 The search algorithm Input:query tree,Q;lists of candidate code,L;size of target candidate set,K Output:top-K result list,T 1 V←?2 for each i in L 3 if sizeMatched(Q,i)//Filter based on the node number 4 V←V∪{<i,traverseTree(Q,i)>}5 end if 6 end for 7 T←Rank candidate set V and obtain the top-K result//Traverse the query tree recursively 8 function traverseTree(N:sub-tree node,C:list)9 totalNum←node number of the query tree 10 sub-treeNum←sub-node number of current tree N 11 matchedNum←0 12 if isSearched(N,C[sub-treeNum])//Validate the sub-tree 13 if validated(totalNum,sub-treeNum,N,C)14 return N 15 end if 16 end if//If not found,traverse all the sub-trees 17 for each j in sub-tree of N 18 matchedNum←matchedNum+traverseTree(j,C)19 end for//Approximate matching for the current node 20 if exceedThreshold(matchedNum,sub-treeNum)21 matchedNum←matchedNum+1 22 end if 23 return matchedNum 24 end
3.2.2 Sub-tree validation
Since we use only one number to represent the hash value,sub-trees containing different nodes may be mapped to the same hash value,which we call a hash collision.Generally,with the increase of the node number of sub-trees,hash collision is more likely to occur,and when the proportion of sub-trees in the query tree increases,their influence on the final result is more significant.For these sub-trees,addi‐tional verification is performed.We adopt a simple but effective method,that is,to verify whether the subsequent nodes of this sub-tree root can also be found in the candidate sub-trees.If they all can be found,the two sub-trees are considered to be matched.We use the following formula to decide the number of nodes to be verified:
where[t]denotes the largest integer no more thant,cdenotes the node number of the current sub-tree,andqdenotes the node number of the whole query tree.
We find that almost all false-positive recom‐mending results caused by a hash collision can be eliminated at a low extra cost using this verification algorithm.The corresponding experimental results are given in Appendix B.
3.2.3 Approximate sub-tree matching
We can obtain the total number of sub-tree nodes repeated between the query code and each candidate code through the matching process.However,exact sub-tree matching cannot fully reflect the structural similarity.Here,we use the query tree in Fig.5 as an example.For illustration,we mark the node ID and omit the node number and hash value in this figure.
Fig.5 A query tree example in a match without(a)and with(b)approximate sub-tree matching(The solid circle represents a matched node and the dashed circle repre‐sents an unmatched node)
Consider a candidate code fragment whose AST is almost the same as that of the query code,the only difference being that node 10 is either missed or re‐placed by a node with a different hash value.In this instance,all sub-trees containing node 10 obtain dif‐ferent hash values,which means that all sub-tree nodes on the path from node 10 to the root of the query tree cannot be successfully matched(Fig.5a).Hence,the similarity for this tree is only 7/12=58.3%,though there is only a different node between the query code and the candidate code.Such minor differences in AST are common in software development(Baxter et al.,1998),and it is reasonable to think that it will be more evident for unfinished code.
So,rather than searching for identical sub-trees,we prefer to find similar ones.We set a similarity thresholdTto judge whether a sub-tree is matched.For each none-leaf node in the query tree,if no exact match is found in the candidate hash list,additional judgments will be made based on the sub-tree rooted at the current node.That is,if the proportion of matched nodes in the whole sub-tree exceedsT,the current node will be deemed to be matched.For ex‐ample,as Fig.5b shows,if we setTas 80%,among all nodes on the path from root node 1 to node 10,nodes 1 and 4 are matched while nodes 5,8,and 10 are unmatched,and the similarity score should be 9/12=75%,which is a better reflection of code similarity.
According to our above algorithm,when the number of common leaf nodes is fixed,code frag‐ments with a similar structure to query code can ob‐tain a higher similarity score.In some cases,although some code fragments are quite different from the structure of the query code,they may score higher for similarity than related code because they contain more common sub-trees.Such a situation can easily occur when there is no near-exact matching for the current query in the codebase.In addition,several candidate codes may share the same similarity score,in which case we cannot determine the order of these methods in the recommendation list.To solve these problems and reduce false positive results,we intro‐duce a re-ranking method based on the SWalgorithm.
The SW algorithm is used for sequence align‐ment,which means finding a similar region between two sequences.The purpose of the algorithm is not to compare the whole sequence,but to find similar fragments between the two sequences and determine their similarity.
For query code and each candidate code,we tra‐verse their hash tree in a depth-first order to generate node sequences.Then we use the SW algorithm to obtain the similarity scores between them.The de‐tails of SW are presented as follows(Smith and Wa‐terman,1981):
Assume thatA=a1a2…anandB=b1b2…bmare node sequences to be aligned.A similarity is given between sequence elementsaiandbj:
We set up a matrixHwhose size is(n+1)×(m+1).Its first row and first column are initialized with 0.Based on the initial value of scoring matrixH,the scores of other elements are calculated using the fol‐lowing formula:
The highest value ofHij’s is returned as the simi‐larity score between two sequences.
We combine the SW sequence similarity with the tree-list similarity obtained in the filtering phase as the basis for the re-ranking process,rather than us‐ing a single similarity score.This is because the SW algorithm cannot deal with different code statement orders,while tree-list similarity covers such cases and can supplement the final result.The final similar‐ity score is calculated using Eq.(5):
where TLSim denotes the tree-list similarity from the search phase,SWSim denotes the SWsequence simi‐larity from the re-ranking phase,andδis an adjust‐ment factor.
In this section,we describe our evaluation of the effectiveness of the method proposed in this study.Our experiments were conducted on a 2.60 GHz CPU(Intel i5)PC running Windows 10 OS with 8 GB memory.
4.1.1 Datasets
We used three real-world datasets to evaluate the performance of our method on code recommenda‐tion:a Java code repository IJaDataset(Svajlenko and Roy,2021)and two C code datasets,OJSystem(Mou et al.,2016)and OJNUAA,from pedagogical programming open judge(OJ)systems.
IJaDataset 2.0 is a large Java repository from the SECold Project(https://sites.google.com/site/aseg‐secold/projects/seclone),containing 25 000 open-source Java projects.We extracted method-level code frag‐ments and kept those with the number of executable lines between 20 and 30.Then we randomly sampled 300 code fragments and manually created partial code snippets by deleting 30%–50%of code lines from the original code snippets,aiming to check whether our method could produce appropriate re?commendations when an exacted match existed in the code database.Three developers with three years of Java project development experience were asked to judge the returned results.If two or more developers thought that the result corresponded to the query code,then we took and recorded it.
OJSystem contains 104 programming problems together with different source codes submitted by stu‐dents for each problem.There are 500 files submitted under each problem.Two different source codes un‐der programming problems can be regarded as similar if they realize the same functionality.We randomly se‐lected 20 code files as queries under each program‐ming problem and removed them from the candidate set to see if our method could recommend other files under the same problem according to these queries.
We constructed OJNUAA similar to OJSystem.It contains 99 programming problems,each with more than 200 files submitted from a school pro‐gramming online judgment system.OJNUAA covers a wide range of programming problems from simple to complex,making the average length of code files smaller than that in OJSystem,and the submitted codes for fundamental problems are mostly similar to each other.We also randomly collected 20 code files as queries for each problem on this dataset.
The statistical information of the above datasets(https://github.com/melond/RecommendationSamples)is shown in Table 1.
Table 1 Dataset size information
4.1.2 Metrics
We chose Recall@Kand the mean reciprocal rank(MRR)as the measurement metrics to evaluate the performance of our proposed method.
Recall@Kis defined as the percentage of query code snippets for which the original method body is found in the top-Kmethods in the re-ranked search result list.Recall@Kis calculated as
where|Relevance@K|denotes the number of query code snippets for which the original method body is found in the top-Kmethods in the result list,and|Queries|denotes the number of all queries.
MRR is defined as the average of the reciprocal ranks for all queries,where the reciprocal rank of a query is the inverse of the rank of the first relevant result.A larger MRR value means a higher ranking for the first relevant methods.MRR is calculated as
whereQdenotes the total number of query codes,and FRankqdenotes the rank of the first relevant re‐sult for queryq.
We investigated two research questions(RQs)to validate the effectiveness of our method.
RQ1:How effective is our method in recom‐mending code snippets?
In this study,we attempted to propose a new code recommendation tool.We introduced an approxi‐mate sub-tree matching algorithm to search for candi‐date code sets and used the SW algorithm to re-rank the list.In this RQ,to verify our method’s effective‐ness,we compared our experimental results with those of two strong baseline works:SENSORY and Aroma.
SENSORY(Ai et al.,2019)is a practical code recommendation tool.It uses the granularity of code statements rather than tokens as input,using the Burrows–Wheeler transform(BWT)algorithm to per‐form an ordered subsequence search.Since SENSORY is a Java method level tool while OJ databases are constructed by C files,we used Clang to parse the AST,so SENSORY could process the C code.
Aroma(Luan et al.,2018)is a state-of-the-art code recommendation tool.It vectorizes the AST features and uses a bag-of-words-like strategy to cal‐culate the similarity between the query code and code in the database.It then re-ranks the result list and uses an intersection phase to merge different code snip‐pets in the result list.Since the intersection phase will change the original code snippets from datasets,a manual judgement may be required to assess the correctness of the final result.It is hard to apply to large-scale automated validation and may introduce manual biases.Therefore,we did not adopt the final intersection stage in the experiment,but obtained the experimental results according to the previous part.
Table 2 shows the experimental results of these three methods applied to different datasets.The re‐sults show that our method outperformed the com‐pared methods in most situations.
Table 2 The experimental results from different methods over all datasets
In the experiment on IJaDataset,we treated the results corresponding to the query as a hit,and all three methods achieved good performance.This shows that these methods can make relevant recom‐mendations when there is an exact match in the code dataset in most cases.On OJSystem and OJNUAA,our method performed the best,and SENSORY performed poorly.This is because SENSORY uses a strict code line matching strategy.Its matching pro‐cess is based on textual information,while the OJ da‐tasets are classified according to functionality.There‐fore,it is difficult for SENSORY to recommend the correct result when no exact textually similar candidate code is in the codebase.Aroma chooses to convert the AST into the form of feature vectors and uses feature vectors to calculate the similarity.How‐ever,this bag-of-words-like method ignores the distribution of features in the original AST.Many ir‐relevant codes will appear with high similarity be‐cause they contain similar features to the query code,even though they may not be in the same dis‐tribution.Such a phenomenon is more obvious for large code fragments in the codebase and may cause the actual more relevant code to be excluded,thus af‐fecting the final result.In contrast,our method con‐siders the code structure information and stores the candidate codes in the form of sub-tree hash values,so that the candidate codes with a more complete subtree structure will have higher similarity.We believe that these are the reasons why our method performed the best on these datasets.
Recommending speed is also important for code recommendation tools.The average time complexity of our method for data processing and matching areO(n)andO(mf)respectively,wherendenotes the total AST node number of all candidate codes,mde‐notes the node number of the query tree,andfrepre‐sents the number of candidate code snippets after size-based filtering.Table 3 shows the recommend‐ing time cost of three methods applied to IJaDataset,which contains 718 525 files and 18 237 218 code lines.SENSORY is based on statement sequence matching,so it is time-consuming and not scalable to large datasets.Since Aroma computes similarity based on numerical vectors,it had the best time per‐formance.The time cost of our method was higher than that of Aroma,but still acceptable on such a large dataset.
Table A1
Table A1 Categories of node types for Javalang
Table 3 Time costs of three methods applied to IJaDataset
Our method requires additional storage space to store the hash representation of the AST nodes for code snippets in the code database.Similarly,taking IJaDa‐taset as an example,its source file occupies about 0.63 GB,while the intermediate representation in our method occupies 2.47 GB,which is about four times that of the code source file.The growth of additional space is linearly related to the size of the code data‐base,and it is acceptable for today’s data centers.
In summary,our code recommendation method worked well at the method level and file level,and outperformed the baseline methods over all datasets.Also,our method had a good time performance and can be applied to large datasets.Based on the experi‐mental results,we believe that our method can help developers quickly find the code they need.
RQ2:How do the parameters affect the experi‐mental results?
To explore the effects of various parameters on the experimental method,we performed a parameter sensitivity experiment.We tested the two parameters that have the greatest impact on the method:λin Eq.(1)andδin Eq.(3).The parameters match,mis‐match,and gap,required when using the SW algo‐rithm,do not have a great influence on the experi‐mental results(Yang et al.,2018).Therefore,we set them to the empirical values 2,-2,and-1,respec‐tively(Kamalpriya and Singh,2018).
To answer RQ2,because recommendations should be made repeatedly under each parameter setting in our experiment,and the experiment using IJaDataset needs much manual verification,we chose only OJSystem and OJNUAA for automated validation.
Our method uses Eq.(1)for the size similarity filter.A larger adjustment factorλcan filter out more candidate codes and speed up the overall recommen‐dation,but may also lead to more false negatives.We predefined the range ofλfrom 0 to 1.4 and used Recall@1 as the metric to estimate the recommend‐ing effect.
The experimental results are shown in Fig.6.The recall rate was close whenλwas between 0 and 0.8,and a significant decrease occurred whenλwas greater than 0.8.We conclude that the size-based fil‐tering had little or no impact on the final recommen‐dation results whenλwas lower than 0.8.At the same time,to reduce unnecessary matching as much as possible and speed up the recommendation,we de‐cided to setλto 0.8.
Fig.6 Results of our method with differentλvalues over two datasets
Our method uses Eq.(3)to combine tree-list similarity and sequence similarity into one final score.We balanced the influence of two kinds of sim‐ilarity by parameterδ;e.g.,the lower the value ofδ,the greater the influence of sequence similarity on the final result.We re-ranked the first 50 code segments after the coarse-grained search due to the high time consumption of the SW algorithm.We predefined the range ofδfrom 0.1 to 100,and used Recall@1 as the metric to estimate the recommending effect.
The experimental results over two datasets using Recall@1 are shown in Fig.7.The recall value using OJNUAA was better than that using OJSystem.This is because OJNUAA covers a broader range and includes many fundamental programming problems.The answers to these basic problems are similar,so the recommendation will be much easier.In addition,the boundaries of some programming problems in OJSystem are not clear,and there are several prob‐lems with similar or almost identical functionality,which will also have a specific impact on the results.
Using the OJSystem dataset,our method achieved the best recall results whenδwas 0.6,and asδcon‐tinued to grow,the recall rate finally dropped to around 0.7(Fig.7).Using OJNUAA,the recall reached the maximum value whenδwas set to 0.8.The reason for this difference is that,since the average number of lines of code for each file in OJSystem is larger than that in OJNUAA,sequence similarity plays a more critical role in the final result.The best selec‐tion ofδwas different for different datasets,but when the value ofδwas between 0.6 and 0.8,our method achieved good results over both datasets.Also,we conclude that the final recommendation re‐sults can be significantly improved after the re-rank‐ing phase.
Fig.7 Results of our method with differentδvalues over two datasets
In summary,the best selection ofδmay change according to the dataset,but our method can achieve good results for both datasets whenδis between 0.6 and 0.8.
In this section,we will discuss some limitations associated with our work.
1.The query set.Our method is not being directly implemented by developers in the actual development process because such an experimental setting needs a lot of manual validation and may in‐troduce errors caused by individual biases.Instead,we performed large-scale automated evaluations to test the accuracy of our method.We constructed three real-world code databases and carried out two different types of experiments.On IJaDataset,we tested whether the methods could recommend matching code according to partial code,while for OJSystem and OJNUAA,we evaluated the effectiveness of recom‐mending systems when there may not be a near-ex‐act match in the codebase.Although these two kinds of experimental settings cannot fully cover real de‐velopment scenarios,they can reflect the performance of the code recommendation system in typical cases.
2.Parameter setting.During the experiment,we set several parameters based on experience or exist‐ing experimental results,such as the thresholdTin approximate sub-tree matching and score and penalty parameters in the SW algorithm.We did not conduct complete experiments to analyze the effect of each parameter on the final result,which may have affected the performance of our method to some extent.How‐ever,since the experimental results under the current parameter settings were satisfactory,the parameters in the experimental process should be reasonable.We will fully investigate the effects of each parameter and their combinations in our future work.
In this study,we have presented a search-based code recommendation technique that combines an ap‐proximate sub-tree matching algorithm and a sequence matching algorithm.Specifically,for each candidate code,we first calculate and store the hash value and node number of each sub-tree in their AST.Then,we traverse the AST of the given query code to calculate the similarity between the query code and all the can‐didate codes in the codebase.Top-Kcandidate code segments are selected according to a similarity score.We also calculate the AST preorder sequence similar‐ity between the query code and the top-Kcandidate code using the SW algorithm.Finally,we re-rank the candidate code list based on the two kinds of similarity and return the final recommendation result.To evalu‐ate the effectiveness of our method,we have conducted experiments on several real-world code databases.Experimental results have shown that our method can recommend code with high recall and significantly outperforms compared methods.
In the future,we will consider introducing a clustering process to group functionally similar code fragments together and accelerate the search.We will also consider optimizing the storage structure of the intermediate representation to further reduce the ad‐ditional storage space required.
Contributors
Yichao SHAO and Zhiqiu HUANG designed the algorithm,implemented the tool,and drafted the paper.Weiwei LI and Yaosheng YU helped organize the paper.Zhiqiu HUANG and Yaosheng YU revised and finalized the paper.
Compliance with ethics guidelines
Yichao SHAO,Zhiqiu HUANG,Weiwei LI,and Yaosheng YU declare that they have no conflict of interest.
Appendix A:Categories of node types for Javalang
Appendix B:Recommendation results before and after sub-tree validation
Eq.(2)is used to verify the sub-tree pairs that may produce a hash collision.We tested Recall@1 before and after using hash verification on the OJSys‐tem dataset and collected the false-positive results that appeared in the first place in the recommenda‐tion list.We analyzed the proportion of recommenda‐tion errors caused by hash collision.The experimen‐tal results are shown in Table B1.
Table B1 Recommendation results before and after sub‐tree validation
Table B1 shows that 24.3%of the false-positive results were caused by hash collisions without the verification phase.After implementing hash valida‐tion using Eq.(1),we observed very few cases caused by hash collision.We conclude that the hash verification process can largely eliminate the nega‐tive impact of hash collision.
Frontiers of Information Technology & Electronic Engineering2022年8期