Li WANG, Xianglong KONG, Jiahui WANG, Bixin LI
1School of Computer Science and Engineering,Southeast University,Nanjing 210096,China
2Jiangsu Automation Research Institute,Lianyungang 222061,China
3Huawei Digital Technology Lab,Suzhou 215125,China
E-mail: wangli1218@seu.edu.cn; xlkong@seu.edu.cn; 18262609320@163.com; bx.li@seu.edu.cn
Abstract: It is difficult to keep software architecture up to date with code changes during software evolution. Inconsistency is caused by the limitations of standard development specifications and human power resources, which may impact software maintenance. To solve this problem, we propose an incremental software architecture recovery (ISAR) technique. Our technique obtains dependency information from changed code blocks and identifies dependencies of different strength levels. Then, we use double classifiers to recover the architecture based on the method of mapping code-level changes to architecture-level updates. ISAR is evaluated on 10 open-source projects, and the results show that it performs more effectively and efficiently than the compared techniques. We also find that the impact of low-quality architectural documentation on effectiveness remains stable during software evolution.
Key words: Architecture recovery; Software evolution; Code change
Software architecture encompasses the principles and decisions of design and plays an important role in the software lifecycle, especially in software evolution. It is difficult to maintain up-to-date architectural documentation because it should contain knowledge from all software stakeholders; therefore, significant literature exists concerning software architecture recovery (Kong et al., 2018; Cho et al., 2019; Schmitt Laser et al., 2020; Pourasghar et al., 2021). Software architecture recovery refers to the process of identifying and extracting architectural information from lower-level representations of a software system, such as source code (Mendonça and Kramer, 1998). Software architecture recovery is a costly task in both academia and industry. For example, Garcia et al. (2013b) took two years of effort to recover the architecture of Google Chromium with the assistance of related developers.
Because of the high cost of software architecture recovery, efforts have been dedicated to automated recovery, which applies automatic methods to extract architectural information, such as code dependency and module functionality (Lima et al., 2019; Link et al., 2019; Silva et al., 2019; Sözer, 2019; Lee and Lee, 2020). Most of the current techniques rely on the information extracted from source code (Andritsos and Tzerpos, 2005; Tamburri and Kazman, 2018). Code-based recovery techniques extract dataflow- or control-flow-based dependencies between code entities, and then identify components by applying clustering or prediction methods. The complexity and changeability of software code expose two hidden drawbacks of architecture recovery techniques. On one hand, researchers usually extract architectural information by analyzing and clustering code entities based on the dependencies between them. Information loss is inevitable during aggregation and extraction, which may impact the effectiveness of architecture recovery. On the other hand, software evolution is constant, and it is costly to apply architecture recovery techniques to constantly update the architecture. Software evolution refers to the dynamic behaviors of software maintenance and continuous updating throughout its lifecycle (Lehman, 1996; Ali and Maqbool, 2009). Code change is the most important form of software evolution, and it includes addition, deletion, and modification (Mens and Tourwe, 2004). The effectiveness and efficiency of automated architecture recovery are directly impacted by the efforts of code analysis and clustering methods.
There are also some empirical studies on current code-based recovery techniques (Anquetil and Lethbridge, 2003; Maqbool and Babri, 2004; Kobayashi et al., 2012; Garcia et al., 2013a); however, the conclusions from these studies usually differ from each other, and there are no "silver bullets" for architecture recovery. No technique always performs better than the others in the recovery of massive projects. In practice, most large-scale projects have one or more high-quality architecture documents which are generated at the beginning of development or are revised through maintenance. Well-documented architecture clearly presents the structure of specific versions, and developers usually pay much attention to the design documents at the beginning of a project's lifecycle. Current software architecture recovery techniques still have a lot of room for improvement in effectiveness due to information loss during dependency processing (Garcia et al., 2013b). Therefore, our goal is to recover architecture based on existing high-quality design documents. We can track code changes during software evolution, and we aim to build a mapping mechanism between code-level changes and architecture-level updates. In this way, we can recover architecture based on a partial dependency graph that is related to the changed code entities.
In this study, we propose an incremental software architecture recovery (ISAR) technique which consists of three steps. We first extract information from the changed code (Tufano et al., 2019). Then, we apply file-level code preprocessing on a dependency graph to map code-level changes to architecture-level updates. Finally, we recover the architecture, including the updates caused by code evolution. Our technique requires documentation of the previous version to update the architecture, and its quality directly impacts the performance of ISAR. The final step applies double classifiers to adjust the recovered architecture.
To evaluate the effectiveness and efficiency of ISAR, we conduct experiments on 10 open-source projects with two other architecture recovery techniques, i.e., Bunch (https://www.cs.drexel.edu/~spiros/bunch) (Mancoridis et al., 1999) and directory-based dependency processing (DBDP) (Kong et al., 2018). The results of our experiments show that ISAR performs the best in terms of effectiveness and efficiency. We also find that although ISAR's effectiveness decreases obviously during evolution, it stabilizes after several released versions. This means that ISAR can work well with low-quality architectural documentation. In summary, our paper makes the following contributions:
1. We propose an ISAR technique which recovers architecture based on existing architectures and related code changes.
2. We build a mapping between code-level changes and architecture-level updates, which can help researchers improve software recovery techniques.
3. We evaluate ISAR on 10 projects against two other recovery techniques, and find that our approach can generally improve both effectiveness and efficiency.
In this section, we present the details of the ISAR technique. The technique comprises three main steps: information extraction from code, file-level code preprocessing, and incremental software architecture update.
The ISAR framework is presented in Fig. 1. First, the changed code files and the file-level dependency graph are obtained by analyzing the code before and after software evolution. Second, the changed code files and the file-level dependency graph are preprocessed to determine the changed elements. These changed elements are preliminarily clustered to form an incremental entity set. Finally, these incremental entities are processed using double classifiers to achieve a top-down incremental update of the software architecture.
Fig. 1 Incremental software architecture recovery (ISAR) framework
The purpose of code information extraction is to obtain the changed code files and to build the file-level dependency graph after code changes. Code changes can be divided into five levels: directory level, file level, class level, method level, and statement level. The elements and types of changes differ between levels. Table 1 describes the elements and operations that cause changes at different levels.
In this study, we focus on file-level code changes. The changed code elements need to be clustered from bottom to top into three kinds of changed files, as shown in Table 2. We use the multi-level change detection tool (Wu et al., 2005) to obtain changed code files. The tool uses the abstract syntax tree (AST) built by Java development tools (JDT) to parse two different versions of code and to obtain the multi-level changed files between the target versions.
Table 1 Elements and operations at different levels
Table 2 Three types of file-level changes
The file-level dependency graph is an abstraction of the dependency information between files. As shown in Fig. 2, a node represents a code file and a directed edge represents a dependency between two files. For example, A → B means that file A depends on file B. In the pair labeling a dependency edge, the first element denotes the dependency type and the second element denotes the number of dependencies of that type. The numbers on the dependency edges indicate the file dependency strength. The dependency types and the number of dependencies between files are obtained by analyzing the AST, which provides the basis for generating file-level dependency graphs.
Fig. 2 An example of the file-level dependency graph
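The file-level dependency graph described above can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation: the class name and method names are our own, and we assume edges carry a per-type dependency count, as in the ⟨type, number⟩ labels of Fig. 2.

```python
from collections import defaultdict

class FileDependencyGraph:
    """Sketch of the file-level dependency graph: nodes are code files;
    a directed edge A -> B carries, per dependency type, the number of
    dependencies of that type (illustrative data model)."""

    def __init__(self):
        # edges[(src, dst)][dep_type] -> count
        self.edges = defaultdict(lambda: defaultdict(int))

    def add_dependency(self, src, dst, dep_type, count=1):
        self.edges[(src, dst)][dep_type] += count

    def dependencies(self, src, dst):
        # Returns the {dep_type: count} label on edge src -> dst (empty if none).
        return dict(self.edges.get((src, dst), {}))

g = FileDependencyGraph()
g.add_dependency("A.java", "B.java", "call", 3)
g.add_dependency("A.java", "B.java", "import", 1)
print(g.dependencies("A.java", "B.java"))  # {'call': 3, 'import': 1}
```

Such a structure can be populated directly while walking the AST of each file, one `add_dependency` call per detected reference.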
In this study, we focus on 10 types of dependencies: generalization, implementation, combination, association, call, instantiation, parameter, return, declaration type, and import. Among them, generalization, implementation, and combination are the top three strongest dependencies (Glukhikh et al., 2012; Lutellier et al., 2018), and may have stronger relationships with architecture updates. The strength of a code dependency can reflect the degree of dependence at the architecture level. The code entities that have stronger dependencies are more likely to be clustered into one module during architecture recovery. If A and B represent two files, the strength of the dependency between them is defined as follows:

Strength(A, B) = (Σ_{i=1}^{n} δ_i · DependType_i) / LOC_A,

where LOC_A represents the number of valid lines in file A, DependType_i (i = 1, 2, ···, n) represents the number of dependencies of type i, and δ_i represents the weight of dependency type i.
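The strength computation can be sketched in a few lines. This assumes the weighted-sum form of the formula above (a reconstruction from the definitions of LOC_A, DependType_i, and δ_i); the weight values used here are purely illustrative, not the paper's calibrated δ_i.

```python
def dependency_strength(dep_counts, weights, loc_a):
    """Weighted sum of per-type dependency counts, normalized by the
    valid lines of the depending file A (normalization is an assumption).
    dep_counts: {dep_type: count}; weights: {dep_type: delta_i}."""
    return sum(weights.get(t, 0.0) * n for t, n in dep_counts.items()) / loc_a

# Illustrative delta values: stronger dependency types get larger weights.
weights = {"generalization": 1.0, "call": 0.5, "import": 0.2}
print(dependency_strength({"call": 4, "import": 2}, weights, 100))  # 0.024
```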
The purpose of file-level code preprocessing is to obtain the incremental entity set. First, when a file is added, deleted, or modified, the dependencies between the file and other files may change. However, even when a file does not change, its dependencies may change due to changes in other files. There are four types of dependency changes in code files: adding a new type of dependency, deleting a type of dependency, increasing the number of dependencies, and reducing the number of dependencies. In this study, a code file whose dependencies change is defined as an incremental file (Table 3).
According to Table 3, the incremental files can be divided into six sets: Set_Incre_Add, Set_Incre_Del, Set_Incre_TypeAdd, Set_Incre_TypeDel, Set_Incre_DepAdd, and Set_Incre_DepDel. There is also a set of non-incremental files, i.e., Set_NoIncreFile.
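The assignment of a file to these sets can be sketched as follows. The set names follow the paper, but the classification logic is our own illustrative assumption about how the four kinds of dependency changes map to the sets.

```python
def classify_incremental(old_deps, new_deps):
    """Sketch: old_deps/new_deps map {dep_type: count} for one file before
    and after a change; None means the file is absent in that version."""
    if old_deps is None:
        return {"Set_Incre_Add"}
    if new_deps is None:
        return {"Set_Incre_Del"}
    sets = set()
    if new_deps.keys() - old_deps.keys():
        sets.add("Set_Incre_TypeAdd")   # a new type of dependency appeared
    if old_deps.keys() - new_deps.keys():
        sets.add("Set_Incre_TypeDel")   # a type of dependency disappeared
    for t in new_deps.keys() & old_deps.keys():
        if new_deps[t] > old_deps[t]:
            sets.add("Set_Incre_DepAdd")
        elif new_deps[t] < old_deps[t]:
            sets.add("Set_Incre_DepDel")
    return sets or {"Set_NoIncreFile"}

print(sorted(classify_incremental({"call": 2}, {"call": 3, "import": 1})))
# ['Set_Incre_DepAdd', 'Set_Incre_TypeAdd']
```

Note that a single file can fall into several sets at once, as in the example above.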
Table 3 Information of incremental file types
Second, we cluster incremental files with strong dependencies, namely, generalization, implementation, and combination. The incremental files with strong dependencies are clustered to form code modules. This approach can effectively reduce the number of objects to be updated and thus reduce the time needed for updating. Because the code module concept is added at this step, the concept of "entity" needs to be introduced. The incremental file set is upgraded to the incremental entity set. The elements in the incremental entity set include incremental files and incremental modules. The files in Set_Incre_Del are not clustered because they will be removed directly during the updating process. The clustering rules are as follows:
Rule 1: When incremental file A is strongly dependent on non-incremental file B, A and B are clustered. A is classified into the module containing B.
Rule 2: When incremental file A is strongly dependent on incremental file B, A and B are merged into module C. The dependency between C and other files is the union set of the dependencies of A and B. The dependency strength is calculated by summation.
Rule 3: When module C strongly depends on unchanged file D, we cluster all incremental files in C to D and delete C from the incremental entity set.
In Fig. 3, A, B, D, and F are code files and C is a code module. Suppose that A and B have a strong dependency of type 1 and that the dependency strength is 0.3. According to the above rules, A and B are clustered into C. The type of dependency between C and D is type 2, and the dependency strength is 0.1 + 0.2 = 0.3; the type of dependency between C and F is type 3, and the dependency strength is 0.2.
Fig. 3 An example of file clustering
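Rule 2 (merging two strongly dependent incremental files into a module, with dependency strengths summed) can be sketched as below. The data model, `{target: {dep_type: count}}` per file, and the function name are our own assumptions for illustration.

```python
STRONG = {"generalization", "implementation", "combination"}

def merge_strong(file_a, file_b, deps_a, deps_b):
    """Sketch of Rule 2: merge two incremental files into a module whose
    outgoing dependencies are the union of the files' dependencies, with
    per-type counts summed. deps_* map target -> {dep_type: count}."""
    merged = {}
    for deps in (deps_a, deps_b):
        for target, per_type in deps.items():
            if target in (file_a, file_b):
                continue  # edges between A and B become internal to the module
            bucket = merged.setdefault(target, {})
            for t, n in per_type.items():
                bucket[t] = bucket.get(t, 0) + n
    return merged

deps_a = {"B": {"generalization": 1}, "D": {"call": 2}}
deps_b = {"D": {"call": 1}, "F": {"import": 2}}
print(merge_strong("A", "B", deps_a, deps_b))
# {'D': {'call': 3}, 'F': {'import': 2}}
```

This mirrors the worked example in Fig. 3: the merged module keeps the union of the outgoing edges, and the strengths of shared targets add up.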
Incremental software architecture update uses double classifiers to classify the entities in the six kinds of incremental sets obtained by preprocessing, to realize a top-down incremental update. It can ensure the consistency of code and architecture. Incremental software architecture update includes the following two steps:
Step 1: We delete the incremental entities and their dependencies. The components of incremental entities may change after the classification by double classifiers. These incremental entities and their dependencies need to be removed from the file dependency graph. Algorithm 1 presents the details. First, we add the file dependency graph, module-file dependency graph, component-module hierarchy graph, and component dependency graph to the software architecture graph. Second, all the nodes and edges of entities in Set_Incre_Del, Set_Incre_TypeAdd, Set_Incre_TypeDel, Set_Incre_DepAdd, and Set_Incre_DepDel are removed from the module-file dependency graph. Then, we delete the modules and entities that do not exist in the new version of the architecture graph. Finally, all elements in the six incremental entity sets except Set_NoIncreFile are merged and added into UPDATE for subsequent entity classification.
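The core of this deletion step can be sketched as pruning the incremental entities and their edges from a dependency graph while queuing them for re-classification. This is a simplified stand-in for Algorithm 1: the graph here is a flat `{entity: set(of dependencies)}` map rather than the paper's four-layer architecture graph.

```python
def remove_incremental(module_file_deps, incremental):
    """Sketch of the deletion step: drop incremental entities and every
    edge touching them from a dependency graph, and queue them in the
    UPDATE set for re-classification (illustrative data model)."""
    pruned = {e: {d for d in deps if d not in incremental}
              for e, deps in module_file_deps.items() if e not in incremental}
    update = set(incremental)  # entities awaiting classification
    return pruned, update

graph = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
pruned, update = remove_incremental(graph, {"B"})
print(pruned)  # {'A': {'C'}, 'C': set()}
```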
Step 2: We apply entity classification using double classifiers, which can update the software architecture incrementally. The two classifiers used in this study are a Bayesian classifier and an Orphan adoption algorithm classifier.
2.3.1 Bayesian classifier
The Bayesian classifier follows the Bayes theorem to check whether an incremental entity can be classified into a component according to the clustering probability between them. The basic formula is

P(C_j | F_i) = P(F_i | C_j) · P(C_j) / P(F_i),

where j ranges from 1 to n (n represents the number of components) and i ranges from 1 to m (m represents the number of entities to be classified). P(C_j) and P(F_i) are the prior probabilities of component C_j and entity F_i, respectively. P(F_i | C_j) is the conditional probability of F_i given C_j. P(C_j | F_i) is the posterior probability of C_j given F_i.
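The Bayesian step can be sketched as computing the posterior for every component and normalizing by the evidence P(F_i). The priors and likelihoods below are illustrative placeholders; how the paper estimates them from the dependency graph is not specified here.

```python
def bayes_posteriors(priors, likelihoods, entity):
    """Sketch of the Bayesian classifier: P(C_j | F_i) is proportional to
    P(F_i | C_j) * P(C_j), normalized over all components.
    priors: {component: P(C_j)}; likelihoods: {component: {entity: P(F_i | C_j)}}."""
    joint = {c: priors[c] * likelihoods[c].get(entity, 0.0) for c in priors}
    evidence = sum(joint.values())  # P(F_i) by the law of total probability
    if evidence == 0:
        return {c: 0.0 for c in priors}
    return {c: p / evidence for c, p in joint.items()}

priors = {"C1": 0.5, "C2": 0.5}
likelihoods = {"C1": {"F": 0.75}, "C2": {"F": 0.25}}
print(bayes_posteriors(priors, likelihoods, "F"))  # {'C1': 0.75, 'C2': 0.25}
```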
Fig. 4 Component-file dependency graph
2.3.2 Orphan adoption algorithm classifier
The Orphan adoption algorithm classifier (Tzerpos and Holt, 2000) uses naming rules and structural rules to find the components with the highest strength of dependence for Orphan resources (i.e., incremental entities). When the dependence strength is the same, we use tie-breakers to carry out the final classification. The rules are as follows:
Rule 1: The incremental entity is included in the candidate component that reduces the penetration.
Rule 2: The incremental entity is included in the candidate component whose output increases the least.
2.3.3 Combined use of these two classifiers
First, the Bayesian classifier is used to calculate the probability of the incremental entity F being allocated to each component, and the probabilities are sorted in non-increasing order of value. The probabilities that incremental entity F is assigned to components C_i and C_j are P(C_i|F) and P(C_j|F), respectively. P(C_i|F) and P(C_j|F) are the maximum probability and the second maximum probability, respectively. The usage rules of these two combined classifiers are as follows:
Rule 1: If P(C_i|F) ≥ 50% and P(C_i|F) − P(C_j|F) ≥ α, then incremental entity F is allocated to component C_i, where 0 ≤ α ≤ 1. The larger the value of α, the greater the probability that incremental entity F is allocated to component C_i. Through our experiments in Section 3, α is taken as 0.3.
Rule 2: If P(C_i|F) − P(C_j|F) < α and P(C_i|F) is not 0, we select the top-k components that have the highest probabilities of incremental entity F being allocated to them. The Orphan adoption algorithm classifier is used to select the component to which incremental entity F should finally be allocated. The larger the value of k, the higher the classification accuracy and the higher the time cost. k is taken as 10 in this work.
In rule 1, only the Bayesian classifier is used to classify incremental entity F. This is because the probability that entity F is allocated to component C_i is far greater than those of the other components. If the Orphan adoption algorithm classifier were used for a secondary classification, the result would be the same with a significant probability, which would only increase the time cost. In rule 2, both the Bayesian classifier and the Orphan adoption algorithm classifier are used to classify the incremental entities. This is because the Bayesian classifier alone cannot complete the classification accurately. The Orphan adoption algorithm classifier is used to analyze the dependencies between the incremental entities.
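The two usage rules above amount to a simple decision function: take the Bayesian answer when it is confident, otherwise hand the top-k candidates to the Orphan adoption classifier. The following sketch only encodes the dispatch logic; the thresholds follow the paper (α = 0.3, k = 10), while the function shape is our own.

```python
def choose_component(posteriors, alpha=0.3, k=10):
    """Sketch of the combined rule: if the best posterior is at least 50%
    and leads the runner-up by alpha, accept it directly ('bayes');
    otherwise return the top-k candidates for the Orphan adoption
    classifier to decide ('orphan_adoption')."""
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (_, p2) = ranked[0], (ranked[1] if len(ranked) > 1 else (None, 0.0))
    if p1 >= 0.5 and p1 - p2 >= alpha:
        return ("bayes", best)
    return ("orphan_adoption", [c for c, _ in ranked[:k]])

print(choose_component({"C1": 0.8, "C2": 0.1, "C3": 0.1}))
# ('bayes', 'C1')
print(choose_component({"C1": 0.4, "C2": 0.35, "C3": 0.25}, k=2))
# ('orphan_adoption', ['C1', 'C2'])
```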
According to the double-classifier usage rules, we classify the files and modules in the incremental entity set UPDATE from the preprocessing step. First, entities in UPDATE are classified using the double classifiers. When an incremental entity is classified into a new component, all the non-incremental files associated with the incremental entity are put into UPDATE for re-classification. Second, we classify the entities that have been classified in the first step at the module level. Finally, we adjust the component clustering, including adding, deleting, and splitting components. Details of the incremental software architecture update algorithm are shown in Algorithm 2. The rules used in Algorithm 2 are as follows:
Rule 1: To add components, there are two situations for entities in Failset that cannot be classified. One is that some entities (including files and modules) do not have any dependency relationship with other entities; they are isolated entities. We add components to which these isolated entities can be classified. The other is that there are dependency relationships among these entities. We cluster these entities from bottom to top to form new components.
Rule 2: When no entity is detected in a component, we delete the component.
Rule 3: When the component scale is larger than 30% of the system scale, the component is considered too large and is split.
The incremental software architecture update technique can generate accurate relationships between code elements based on historical versions of the architecture, which can help developers understand and maintain software projects. In this section, we aim to answer the following research questions:
RQ1: How effective is the software architecture update technique?
RQ2: How does the software architecture update technique perform with the evolution of projects?
RQ3: How efficient is the software architecture update technique?
3.1.1 Subject projects
To answer the above research questions, we selected 10 Java projects from GitHub according to their popularity, i.e., Okhttp, Mybatis, Mockito, Junit, Retrofit, Jadx, Terrier, Clone, Freecol, and Fastjson, which have been widely used in existing studies (Sievi-Korte et al., 2019; Kong et al., 2020; Jia et al., 2021; Liu et al., 2021; Pourasghar et al., 2021; Zhang et al., 2021). For each project, we selected the first five released versions on GitHub. All of these projects have high-quality documentation, which makes the manual recovery of software architecture possible for our team.
The initial architectures of the studied projects were recovered manually according to the existing works (Garcia et al., 2013b; Kong et al., 2018). The manual recovery included four steps. First, we extracted the file-level dependency graph based on Eclipse JDT and listed the detailed relationships between files. Second, we checked the partitioning of file directories and clustered the coupled files within the same directory into a module. Third, we iteratively grouped the modules according to their dependency relationships and functionalities, which were partially obtained through the online documentation. Finally, an architect from Huawei Digital Technology Lab (i.e., the third author of this paper) revised the architecture based on the publicly available information and her experience.
Table 4 presents the details of these selected projects. LOC denotes the number of valid lines in the subject projects, which is calculated using CLOC (https://github.com/AlDanial/cloc) in the experiments. The description presents the basic functionality of each project.
Table 4 Subject system statistics
3.1.2 Selected recovery techniques
To evaluate the effectiveness of our incremental software architecture recovery technique, we selected Bunch (Mancoridis et al., 1998) and DBDP (Kong et al., 2018) to compare the recovered architectures. Bunch is a classical architecture recovery technique and has been widely used in existing studies (Schmitt Laser et al., 2020; Campo et al., 2021). DBDP is another effective architecture recovery technique, which extracts information from code and the structure of directories. DBDP is viewed as a promising method for extracting architectural information from code and directories (Pourasghar et al., 2021). DBDP is compatible with the two kinds of hill-climbing algorithms used by Bunch to resolve the optimization problem, i.e., nearest- and steepest-ascent hill climbing (NAHC and SAHC). There are two reasons for the selection of these techniques. First, these two techniques have generated promising results in existing studies (Schmitt Laser et al., 2020; Pourasghar et al., 2021). Second, it is possible for us to obtain the source code of these two techniques, which helps us easily apply automation in our experiments. In our experiments, we collected the results generated by NAHC and SAHC and used the more accurate one in the comparison. The clustering objective is Bunch's TurboMQ, and the results are presented at the median level of the graph.
3.1.3 Measurements
MoJoSim is widely used to measure the effectiveness of software architecture recovery techniques (Wu et al., 2005; Bittencourt and Guerrero, 2009; Bazylevych and Burtnyk, 2015; Lutellier et al., 2015, 2018). It can calculate the similarity between the recovered architecture and the ground-truth architecture. A high similarity value means that the recovered architecture is accurate. It is defined by the following formula:

MoJoSim(A, B) = (1 − mno(A, B)/n) × 100%,

where A indicates the architecture that is updated by our technique, B indicates the ground-truth architecture, mno(A, B) is the minimum number of Move and Join operations needed to convert A to B, and n is the number of architecture entities. A 100% value of MoJoSim(A, B) means that the updated architecture A is the same as the ground-truth architecture B.
Turbo MQ extends the basic modularization quality (Basic MQ) to support a dependency graph with edge weights. Turbo MQ can measure the quality of the recovered architecture without the ground-truth architecture. Turbo MQ evaluates the architecture based on the degree of high cohesion and low coupling, and is calculated by the following formula:

Turbo MQ = Σ_i CF_i,

where CF_i (i.e., the cluster factor of module i) is defined as

CF_i = 2μ_i / (2μ_i + Σ_{j≠i} (ε_ij + ε_ji)).

To calculate the cluster factors of modules i and j, we collected μ_i as the number of intra-relationships of module i and ε_ij + ε_ji as the number of inter-relationships between cluster i and cluster j. A higher value of Turbo MQ means that the organization of the architecture is better, i.e., it better satisfies the "high cohesion and low coupling" design principle.
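The Turbo MQ computation above can be sketched directly from the cluster-factor definition. The `intra`/`inter` data model (per-module intra-edge weight and a directed inter-module edge-weight map) is our own illustrative representation.

```python
def turbo_mq(intra, inter):
    """Sketch of Turbo MQ: the sum of cluster factors
    CF_i = 2*mu_i / (2*mu_i + sum_{j != i} (eps_ij + eps_ji)),
    where intra[i] = mu_i (intra-module edge weight) and
    inter[(i, j)] = eps_ij (edge weight from module i to module j)."""
    modules = set(intra)
    total = 0.0
    for i in modules:
        coupling = sum(inter.get((i, j), 0) + inter.get((j, i), 0)
                       for j in modules if j != i)
        if intra[i] == 0 and coupling == 0:
            continue  # an isolated, empty module contributes nothing
        total += 2 * intra[i] / (2 * intra[i] + coupling)
    return total

intra = {"M1": 2, "M2": 6}
inter = {("M1", "M2"): 4}
print(turbo_mq(intra, inter))  # CF1 = 4/8, CF2 = 12/16 -> 1.25
```

Each CF_i lies in [0, 1], so Turbo MQ grows with the number of cohesive, loosely coupled modules.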
3.1.4 Experimental steps
For each studied subject, we performed the following steps:
Step 1: We collected the first five released versions of the 10 projects in Table 4 from GitHub, which gave us 50 different programs in total.
Step 2: To obtain the ground-truth architecture, we implemented architecture recovery manually on the first version of each project. The manual recovery was designed according to existing works (Garcia et al., 2013b; Kong et al., 2018). The architecture was generated based on the file-level dependency graph and the publicly available documents.
Step 3: For each studied project, we recovered the architecture with Bunch and DBDP to build the comparative experiments.
Step 4: For each selected project, we collected the code changes from its previous version. The first version of a project was marked as the initial version, so it had no changes. There were a total of 166 code blocks which were changed during project evolution. These changes helped us recover the architecture based on the previous architecture version.
Step 5: For the last four versions of the projects, we recovered the architecture with our incremental software architecture recovery technique based on the previous version, and collected all the results to analyze the effectiveness and efficiency.
In the experiments, the file-level dependency graph was obtained through Eclipse JDT. We ran the three architecture recovery techniques, i.e., ISAR, Bunch, and DBDP, on a Ryzen 3950X server with 128 GB of memory.
3.2.1 RQ1: improvements in effectiveness
To evaluate the effectiveness of our technique, we applied ISAR, Bunch, and DBDP on the second version of the studied projects. There are two reasons to choose the second version: the ground-truth architecture of the second version is available according to Step 2, and the ground-truth architecture of its previous version, i.e., the first version, is also available. Our technique, i.e., ISAR, needs the architecture of the previous version of the target project because it establishes a mapping between code-level changes and architecture-level updates. The ground-truth architecture of the second version was obtained manually. We collected the results and calculated the MoJoSim scores on the basis of the related ground-truth architecture. Table 5 presents the results.
Table 5 MoJoSim scores of the studied techniques
From the results in Table 5, we have the following observations:
First, ISAR obtains the highest MoJoSim scores on the second version of the studied projects. This means that, compared with the architectures produced by Bunch and DBDP, the architecture generated by ISAR has the highest similarity with the ground-truth architecture. The overall ISAR MoJoSim score of 0.90 is a very promising result. We recovered the architecture based on the previous architecture and the code changes, so the results take advantage of the manual recovery. Because DBDP is implemented based on Bunch, it usually obtains a higher score than Bunch.
Second, the MoJoSim scores of the Retrofit and Jadx projects are much lower than those of the other studied projects. We looked into the source code of these two projects and found that their dependencies are much more centralized than those of the others. These two projects each have a high-weight module in their architecture, which seriously affects the scores of the other small modules. For these two projects, all the studied techniques generated several different small modules, which makes the MoJoSim scores lower than those of the other projects.
Finding 1: For the studied projects, ISAR performs the best in terms of effectiveness.
3.2.2 RQ2: effectiveness during evolution
All the studied projects have high-quality design documents, which helped us obtain the ground-truth architecture of the first version. When the projects change with various requirements, the architecture may not be updated in a timely manner. Therefore, we evaluated ISAR on the last four versions of the studied projects to investigate how the effectiveness changes during evolution. Table 6 presents the Turbo MQ results of ISAR on the projects. Because ISAR needs the previous version to generate the architecture, we could not collect results for the first version. A high Turbo MQ score means that the structure of the architecture is good; i.e., it satisfies the "high cohesion and low coupling" design principle.
From Table 6, we can see that the effectiveness of ISAR decreases obviously during evolution. The decline is acute at first and becomes stable between the last two versions. The reason for the downtrend is that our technique recovers architecture on the basis of the previous version, which results in a loss in the stock of information during evolution. However, we still have a good result because the decline stabilizes after several released versions, so our technique can play a proper role for projects that do not have high-quality previous design documents.
Table 6 Turbo MQ scores on the projects of ISAR
Finding 2: Although low-quality architectural documentation can obviously impact the effectiveness of ISAR, it can still obtain promising results, i.e., an average Turbo MQ score of 8.21 on the last version.
3.2.3 RQ3: efficiency improvements
We collected time consumption data of ISAR, Bunch, and DBDP on the last four versions of the studied projects. Table 7 presents the execution time of the three studied techniques on the four versions of the selected projects. From the table, we can see that the execution time of ISAR is much shorter than those of the other two techniques. This is because our technique does not start with the whole dependency graph of the target project. ISAR needs to translate only the architecture-level changes to the previous version, whereas the other two techniques need to apply the clustering algorithm several times during recovery. The complex structure of the dependency graph makes architecture recovery time-consuming. For the small-sized projects, ISAR's efficiency improvements are not obvious. In cases of recovery on large-sized projects, i.e., Freecol and Fastjson, ISAR can significantly improve efficiency. Consequently, we claim that ISAR can obtain the most accurate architecture in the shortest time.
Table 7 Time consumption of studied techniques
Finding 3: ISAR performs the best in terms of efficiency due to the reduction of target dependencies.
In summary, ISAR can obviously improve the effectiveness of architecture recovery with the help of existing high-quality architectural documentation. If there is no available previous-version architecture, architects may obtain the design through some existing recovery techniques or manual work, and then ISAR can significantly improve the efficiency of recovery for the following versions.
1. Threats to construct validity
The metrics used in these experiments threaten the construct validity. To reduce this threat, we selected widely used measurements, i.e., MoJoSim and Turbo MQ. We used Turbo MQ scores to evaluate the effectiveness of ISAR during evolution without the ground-truth architecture. In future work, we will evaluate the techniques in terms of more measurement types.
2. Threats to internal validity
The main threat to internal validity is potential negligence in our manual recovery. The projects in our experiments have been widely used in existing works, and we obtained some detailed design documents from online communities, which helped us a lot during recovery. To examine the accuracy of the architectures we recovered manually, we studied as many related documents as possible and discussed the correctness of the recovered architectures with some professional architects from industry. To further reduce this threat, we will try to communicate with the related developers to obtain the detailed design documents and send them the results we recovered manually.
The other threat to internal validity is the potential faults in existing tool configurations or in our data analysis. To reduce this threat, we carefully reviewed all the configurations, code, and data analysis scripts used in the study.
3. Threats to external validity
The selected versions of the projects, the dependency extraction tools, and the automatic architecture recovery tools used in our experiments may pose threats to external validity. The projects used in our experiments are all popular Java programs on GitHub. We obtained the source code of the first five released versions from their online repositories. We used Eclipse JDT to extract code dependencies and built a file-level dependency graph.
We selected two existing techniques for comparison, i.e., Bunch and DBDP. Bunch is implemented based on a genetic algorithm and a hill-climbing algorithm. Because of its iterative random process, the recovered architecture is not unique for a given project. DBDP is based on Bunch, but it applies a new dependency preprocessing approach to improve effectiveness. To reduce these threats, we will conduct the study with more dependency extraction tools, more architecture recovery tools, and more projects.
In the literature, there are many software architecture recovery techniques. According to the module extraction method, current techniques can be divided into four categories: cluster-, pattern-, graph-, and classification-based architecture recovery techniques.
Cluster-based architecture recovery techniques commonly extract the structure of modules based on clustering algorithms, e.g., hill climbing and genetic algorithms (Mitchell and Mancoridis, 2006; Sözer, 2019). These techniques may improve modular clustering for object-oriented systems (Zhao et al., 2015), take advantage of cluster ensembles to obtain an accurate architecture (Cho et al., 2019), apply collaborative clustering to software modularization (Naseem et al., 2013), or apply meta-heuristic search techniques (Mitchell, 2003; Yang et al., 2021) during clustering. Teymourian et al. (2020) proposed a fast clustering algorithm for large projects, which operates on dependency matrices. Most clustering algorithms are hierarchical and search-based; cluster-based techniques can present an explicit structural design, but they are usually time-consuming.
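To illustrate the search-based clustering these techniques employ, the sketch below is a simplified, hypothetical variant of Bunch-style hill climbing (not Bunch's actual implementation): it starts from a random partition of the dependency graph and accepts single-file moves that improve a Turbo MQ-style objective.

```python
import random
from collections import defaultdict

def cluster_mq(edges, clustering):
    # Turbo MQ-style objective: reward intra-cluster edges,
    # penalize edges that cross cluster boundaries.
    intra, inter = defaultdict(int), defaultdict(int)
    for s, d in edges:
        if clustering[s] == clustering[d]:
            intra[clustering[s]] += 1
        else:
            inter[clustering[s]] += 1
            inter[clustering[d]] += 1
    return sum(2 * intra[c] / (2 * intra[c] + inter[c])
               for c in set(clustering.values()) if intra[c] + inter[c])

def hill_climb(nodes, edges, k, iters=1000, seed=0):
    """Start from a random k-way partition; repeatedly move one random
    node to a random cluster and keep the move only if MQ improves."""
    rng = random.Random(seed)
    clustering = {n: rng.randrange(k) for n in nodes}
    best = cluster_mq(edges, clustering)
    for _ in range(iters):
        n = rng.choice(nodes)
        old = clustering[n]
        clustering[n] = rng.randrange(k)
        score = cluster_mq(edges, clustering)
        if score > best:
            best = score
        else:
            clustering[n] = old   # reject a non-improving move
    return clustering, best
```

Because the search is randomized, repeated runs can yield different partitions of the same project, which matches the non-determinism noted for Bunch above; the iteration count also hints at why such search is time-consuming on large systems.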
Pattern-based architecture recovery techniques commonly follow a pattern-driven approach to help understand software (Tzerpos and Holt, 2000). This kind of technique relies on common patterns, such as the directory structure pattern, source file pattern, leaf set pattern, and body-head pattern. Tamburri and Kazman (2018) investigated a general usage pattern to match code entities with architectural information. Monroy and Pinzger (2021) mined patterns through interaction with stakeholders and obtained the architecture based on a conceptual model. Teymourian et al. (2020) recovered architecture descriptions based on centrality measures, which can guide architects in distributing entities to different architectural layers. The reliance on predefined architectural patterns limits the applicability of these techniques, because most of them are domain-specific.
Graph-based architecture recovery techniques usually generate architecture in one of two ways. One is to construct a module dependency graph and use vectors to divide the graph into subgraphs, increasing cohesion within and reducing coupling between nodes (Akthar and Rafi, 2010). The other converts the recovery process into a graph pattern matching process by constructing a software entity relationship graph (Sartipi, 2003). Pourasghar et al. (2021) combined a pattern-based method and a cluster-based method to build a graph-based clustering algorithm that uses the depth of relationships to calculate the similarity between artifacts. The main drawbacks of graph-based techniques are that the results are usually locally optimal solutions and that the techniques become very slow on large graphs.
Classification-based architecture recovery techniques usually use Bayesian classification to assign updated software modules to subsystems, thereby updating incomplete or outdated documentation (Maqbool and Babri, 2007). However, this use of Bayesian classification focuses only on code additions and ignores deletions and modifications. Link et al. (2021) applied text classification to the identification and extraction of architectural information. The effectiveness of these techniques is limited by the classifier, which must be built on rich architectural knowledge.
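To make this style of classification concrete, the following is a minimal multinomial naive Bayes sketch that assigns a changed file to a subsystem based on its identifier tokens. The token features and all names here are our illustrative assumptions, not the exact formulation of Maqbool and Babri (2007):

```python
import math
from collections import Counter

def train_counts(labeled_files):
    """labeled_files: {subsystem: [token list per file]}.
    Returns per-subsystem token counts, token totals, and priors."""
    counts, totals, priors = {}, {}, {}
    n = sum(len(docs) for docs in labeled_files.values())
    for sub, docs in labeled_files.items():
        c = Counter(t for doc in docs for t in doc)
        counts[sub], totals[sub] = c, sum(c.values())
        priors[sub] = len(docs) / n
    return counts, totals, priors

def classify(tokens, counts, totals, priors, vocab_size):
    """Multinomial naive Bayes with add-one smoothing:
    pick the subsystem maximizing log P(sub) + sum log P(token|sub)."""
    best_sub, best_lp = None, float("-inf")
    for sub in counts:
        lp = math.log(priors[sub])
        for t in tokens:
            lp += math.log((counts[sub][t] + 1) / (totals[sub] + vocab_size))
        if lp > best_lp:
            best_sub, best_lp = sub, lp
    return best_sub
```

Note how the scheme naturally handles an added file (classify its tokens) but has no built-in notion of a deleted or modified file, which is the limitation discussed above.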
We present an incremental software architecture recovery technique driven by code changes, which uses a Bayesian classifier and an orphan adoption classifier to update the software architecture incrementally based on the original architectural documentation. Different from traditional methods, our technique takes into account the similarity of code across versions during evolution, and we update the previous architecture from the changed spots instead of performing comprehensive clustering or classification.
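The orphan adoption step can be sketched as follows. This is a simplified illustration of Tzerpos and Holt's orphan adoption idea with hypothetical names, assigning a newly added file to the existing cluster it depends on most heavily; ISAR's actual classifier may weight dependencies differently:

```python
from collections import Counter

def adopt_orphan(orphan, deps, clustering):
    """Assign a newly added file (the 'orphan') to an existing cluster.
    deps: (src, dst) dependency pairs that touch the orphan.
    clustering: maps each known file to its cluster id.
    Returns the cluster with the most dependency 'votes', or None."""
    votes = Counter()
    for src, dst in deps:
        other = dst if src == orphan else src
        if other in clustering:
            votes[clustering[other]] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Because adoption only inspects the dependencies of the changed entity, it avoids re-clustering the whole system, which is the source of ISAR's efficiency gains over comprehensive recovery.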
Traditional software architecture recovery techniques do not take into account the similarity between two versions of a software architecture during evolution. This loss of information results in excessive time consumption during architecture recovery. Therefore, in this paper we presented an incremental software architecture recovery technique driven by code changes, i.e., ISAR. We used a Bayesian classifier and an orphan adoption classifier to update the software architecture based on features of the previous version and the changed entities. We evaluated ISAR through experiments on 10 projects, comparing it with two other recovery techniques. The results showed that ISAR performs best in terms of effectiveness and efficiency. There is still room to improve the effectiveness of architecture recovery; follow-up work can explore combining multiple classifiers to improve it further.
Contributors
Bixin LI designed the technical framework. Li WANG and Xianglong KONG implemented the approach and drafted the paper. Jiahui WANG proposed the initial idea and confirmed the correctness of the recovered architecture. Bixin LI revised and finalized the paper.
Compliance with ethics guidelines
Li WANG, Xianglong KONG, Jiahui WANG, and Bixin LI declare that they have no conflict of interest.
Frontiers of Information Technology & Electronic Engineering, 2022, Issue 5