Emerging topic identification from app reviews via adaptive online biterm topic modeling*

2022-05-21 08:37:24 Wan ZHOU, Yong WANG, Cuiyun GAO, Fei YANG

Frontiers of Information Technology & Electronic Engineering, 2022, Issue 5

Wan ZHOU, Yong WANG, Cuiyun GAO, Fei YANG

1 School of Information and Computer, Anhui Polytechnic University, Wuhu 241000, China

2 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210000, China

3 School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518000, China

4 Zhejiang Lab, Hangzhou 310000, China

E-mail: yongwang@ahpu.edu.cn

Abstract: Emerging topics in app reviews highlight the topics (e.g., software bugs) with which users are concerned during certain periods. Identifying emerging topics accurately, and in a timely manner, could help developers update apps more effectively. Methods for identifying emerging topics in app reviews based on topic models or clustering methods have been proposed in the literature. However, the accuracy of emerging topic identification is reduced because reviews are short and offer limited information. To solve this problem, an improved emerging topic identification (IETI) approach is proposed in this work. Specifically, we adopt natural language processing techniques to reduce noisy data, and identify emerging topics in app reviews using the adaptive online biterm topic model. Then we interpret the meaning of emerging topics through relevant phrases and sentences. We adopt the official app changelogs as ground truth, and evaluate IETI on six common apps. The experimental results indicate that IETI is more accurate than the baseline in identifying emerging topics, with improvements in the F1 score of 0.126 for phrase labels and 0.061 for sentence labels. Finally, we release the code of IETI on GitHub (https://github.com/wanizhou/IETI).

Key words: App reviews;Emerging topic identification;Topic model;Natural language processing

1 Introduction

App reviews,the most straightforward feedback of users’immediate experience,contain subjective and objective evaluations of software features(Nguyen et al.,2015),such as praise or criticism for a software function.Emerging topics in app reviews refer to topics related to app features with which users are concerned during certain periods,such as new bugs that affect user experience (e.g.,software crash)or existing undesirable features(e.g.,too many advertisements)(Gu and Kim,2015).

Identifying emerging topics accurately and quickly can guide developers in updating their apps. According to statistics, no fewer than four million apps were available in Google Play and the Apple App Store as of the fourth quarter of 2020 (https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/). To keep apps competitive and popular, it is crucial to continuously add rich features while maintaining a user-friendly experience (McIlroy et al., 2016). Moreover, emerging topics give app developers informative evidence of what is going on with their apps, so they can effectively maintain and update apps by repairing software bugs and adding new functions to meet user needs (Sarro et al., 2015). For example, a popular game app in China named Gray Raven Punishing received many bad reviews on app stores in December 2019. The reason was that the card-drawing probability changed when a new version was released, and the developers failed to identify the bug quickly (https://www.bilibili.com/read/cv4175642/). This situation could have been mitigated if emerging issues had been identified in a timely manner from app reviews.

To the best of our knowledge,the most recent work on emerging topic identification is IDentify Emerging App(IDEA)issues(Gao et al.,2018),which can be directly applied to identify emerging topics from app reviews.Specifically,different versions of app reviews act as the input,and IDEA proposes adaptive online latent Dirichlet allocation(AOLDA) based on the topic model online latent Dirichlet allocation(OLDA)(AlSumait et al.,2008)to capture topic evolution and identify emerging topics.

However, prior studies fail to consider two characteristics of app reviews, which limits their ability to identify emerging topics accurately. First, app reviews are generally short and provide limited context. According to Chen et al. (2014) and Genc-Nayebi and Abran (2017), the average length of an app review was 71 characters, and only about 30% of reviews provided valuable information for updating apps. The sampling algorithm often fails to converge because of the information sparsity in short text. Second, the impact of misspelled words and abbreviations in app reviews has rarely been considered in identifying emerging topics (Noei et al., 2021). Because the text of app reviews frequently deviates from standard language, topic models are likely to interpret it as a distinct cluster, which increases the number of incorrectly identified topics.

In this study,we propose an improved emerging topic identification approach named IETI,to solve the above problems and accurately identify emerging topics in app reviews.Specifically,we adopt natural language processing techniques to reduce noisy data,including correcting misspelled words and extending abbreviations.To address the brevity characteristic of app reviews,we track topic evolutions using the adaptive online biterm topic model (AOBTM)(Hadi and Fard,2020) and identify emerging topics by outlier detection methods.

To validate the effectiveness of IETI,we adopt the official app changelogs as our ground truth,and calculate semantic similarity between emerging topics and app changelogs to determine the validity of emerging topics.We conduct experiments on the same six popular apps,which are from Google Play and Apple App Store.Experimental results show that IETI can identify emerging topics more accurately than the baselines,with improvements in the F1 score of 0.126 in phrase labels and 0.061 in sentence labels.

The main contributions are as follows:(1) We suggest more sophisticated preprocessing methods for reducing noise in app reviews,including the correction of misspelled words and abbreviations,and more stringent filtering rules.(2) We apply the AOBTM topic model to track topic evolution in app reviews,and propose IETI to identify emerging topics in app reviews.(3) We evaluate the effectiveness of IETI by conducting experiments on six popular apps,and release the IETI codes on Github(https://github.com/wanizhou/IETI).

2 Preliminaries

2.1 App review emerging topic

According to Huang et al.(2017),the term“emerging topics”refers to topics that are diffusely discussed in the current time slice,but seldom mentioned in previous time slices.We consider the example in Gao et al.(2019).Fig.1 shows the statistics for the number of reviews on the topic“sound”from June 29,2017 to July 8,2017.After releasing WeChat version X on July 5,the number of reviews related to“sound”rose abruptly.According to Huang et al.(2017),the topic“sound”is an emerging topic.In reality,the version X has a functional bug when sharing sounds and pictures,leading to the rise of the topic“sound”in WeChat.The following definition of emerging topic is given:

Fig.1 The number of user reviews related to“sound”from June 29,2017 to July 8,2017

Definition 1(Emerging topic of app reviews) One topic that has rarely been discussed in reviews during previous time slices but is mentioned in many user reviews in the current time slice,is defined as an app review emerging topic.Notably,the time slices referred to in this paper are divided into different app versions,which means that each time slice corresponds to a specific app version.

2.2 Topic model

The benchmark topic models based on the Dirichlet hypothesis include mainly latent Dirichlet allocation(LDA),Dirichlet multinomial mixture(DMM),correlated topic model (CTM),and biterm topic model (BTM) (Jin et al.,2018).Due to the sparseness of short texts,the LDA sampling algorithm (Gibbs sampling) may not converge completely in computation,which ultimately reduces the accuracy of topic detection results.Details about the LDA model are available in Blei et al.(2003).

BTM solves the sparse text problem by constructing biterms to improve the accuracy of short text clustering (Li CL et al., 2017). Fig. 2 shows the schematic of BTM. Different from LDA, BTM is not concerned with whether the document belongs to one or more topics; it assumes that the two words in each biterm belong to the same topic, and that each topic is a multinomial distribution over words. The generation process of each biterm is as follows:

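1. Generate the topic distribution vector $\theta$ from the Dirichlet prior distribution with parameter $\alpha$: $\theta \sim \mathrm{Dir}(\alpha)$.

2. For each topic $k$ ($k = 1, 2, \ldots, K$), the topic-word distribution vector $\Phi_k$ is generated from the Dirichlet prior distribution with parameter $\beta$: $\Phi_k \sim \mathrm{Dir}(\beta)$.

3. For each biterm $b = \{w_1, w_2\}$, the topic assignment $z_b$ of $b$ is generated from $\theta$: $z_b \sim \mathrm{Multi}(\theta)$. For the $i$th position in $b$, the term $w_{b,i}$ is generated from the topic-word distribution of $z_b$: $w_{b,i} \sim \mathrm{Multi}(\Phi_{z_b})$.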

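BTM obtains the hidden variables $\theta$ and $\Phi$ by maximizing the following joint probability:

$$p(B \mid \theta, \Phi) = \prod_{b \in B} \sum_{k=1}^{K} \theta_k \, \Phi_{k, w_{b,1}} \, \Phi_{k, w_{b,2}},$$

where $B$ is the set of all biterms. Then BTM employs the collapsed Gibbs sampling algorithm to convert the conditional probability $p(z, \theta, \Phi \mid b, \alpha, \beta)$ into the conditional probability $p(z_b \mid z_{-b}, B, \alpha, \beta)$ of the topic assignment $z_b$ given the remaining topic assignments:

$$p(z_b = k \mid z_{-b}, B, \alpha, \beta) \propto (n_k + \alpha) \frac{(n_{w_1|k} + \beta)(n_{w_2|k} + \beta)}{\left(\sum_{w} n_{w|k} + V\beta\right)^2},$$

where $n_k$ is the number of biterms assigned to topic $k$, $n_{w_i|k}$ is the number of times that word $w_i$ is assigned to topic $k$, and $V$ is the number of unique words in the corpus.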
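The biterm construction and one collapsed Gibbs sampling step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard BTM conditional, proportional to (n_k + α)(n_{w1|k} + β)(n_{w2|k} + β) / (Σ_w n_{w|k} + Vβ)², and the function names `extract_biterms` and `topic_conditional` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def topic_conditional(biterm, n_k, n_wk, alpha, beta, vocab_size):
    """Normalized p(z_b = k | z_-b, B, alpha, beta) for a single biterm.

    n_k[k]  : number of biterms assigned to topic k (biterm b excluded)
    n_wk[k] : Counter of word -> assignments to topic k (biterm b excluded)
    """
    w1, w2 = biterm
    weights = []
    for k in range(len(n_k)):
        total_k = sum(n_wk[k].values())
        denom = (total_k + vocab_size * beta) ** 2
        numer = (n_k[k] + alpha) * (n_wk[k][w1] + beta) * (n_wk[k][w2] + beta)
        weights.append(numer / denom)
    norm = sum(weights)
    return [weight / norm for weight in weights]


biterms = extract_biterms(["app", "crash", "login"])
probs = topic_conditional(
    ("app", "crash"),
    n_k=[5, 1],
    n_wk=[Counter({"app": 4, "crash": 3, "login": 3}), Counter({"login": 2})],
    alpha=0.1,
    beta=0.01,
    vocab_size=3,
)
```

In a full sampler, a topic would be drawn from `probs` for each biterm in turn and the counts updated after every draw.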

2.3 App changelogs

App changelogs describe the improvements that developers made to the app in new versions. Although app changelogs may not cover all the changes to the previous version of the app, they represent the prescribed minimum and the most significant parts of the change. For example, Fig. 3 shows the changelogs of the shopping app eBay in version 6.12.1, and we can observe that version 6.12.1 mainly fixed a few bugs that caused functional errors.

Fig.3 Changelogs for eBay in version 6.12.1

If the app emerging topics identified in this study are semantically similar to the keywords described in app changelogs,then these topics can be considered to be covered in changelogs and have practical significance in guiding developers in updating and maintaining their apps.Therefore,we adopt app changelogs as the ground truth in verifying the accuracy of the identified emerging topics.Specific evaluation methods will be introduced in the following section.

3 Methodology

Fig.4 exhibits the IETI framework.App reviews of diverse versions and the official changelogs serve as the input to the IETI.The outputs contain the evaluation score for identified emerging topics and the relevant phrases and sentences that are used to interpret the meaning of the identified emerging topics.

Fig.4 Overview of the improved emerging topic identification (IETI)

IETI consists of three phases. In the first phase, we preprocess app reviews to reduce noisy data, and the outputs include processed reviews and candidate topic labels. In the second phase, emerging topics are identified by AOBTM and anomaly detection methods. In the third phase, emerging topics are interpreted by labels, which are of two types: the phrases most relevant to the topics, and the sentences most relevant to the topics. Finally, we evaluate the effectiveness of emerging topics through the app changelogs.

3.1 Preprocessing

Because the submission of app reviews is limited to mobile devices and inconvenient keyboards,app reviews contain a mass of noisy data,such as misspelled words,abbreviations,and repetitive words.In this subsection,we preprocess initial app reviews to reduce noisy data and extract key phrases.

1.Word formatting

We adopt some preprocessing steps from IDEA. First, we convert all words to lowercase. Then we use the natural language toolkit NLTK (http://www.nltk.org) for lemmatization and replace all numbers with a placeholder token.

Because abbreviations and misspelled words may increase the number of misidentified topics, we need to restore abbreviations and correct misspelled words. Specifically, we restore abbreviations with the NLTK toolkit. Then we fine-tune a model for correcting misspelled words using the open-source framework PyCorrector. Because the default model of PyCorrector is a Chinese pre-trained model, we select the English pre-trained model Bert-base-uncased (Devlin et al., 2019) from Hugging Face and introduce the custom dictionary Wiki Dictionary to improve text error correction.

There are still some meaningless words in the reviews, such as articles (a, an, the) and personal pronouns (I, you, ...). These words are necessary elements of sentence formation, but their contribution to the identification of emerging topics is negligible. Therefore, we combine the stop word list provided by NLTK with our own stop word list to filter these words.
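The lowercase, number replacement, and stop word steps can be sketched in a few lines. This is only an illustration under stated assumptions: the stop word list is a tiny hypothetical stand-in for the combined NLTK and custom lists, the placeholder token `<num>` is our own choice, and lemmatization, abbreviation restoration, and spelling correction are deliberately left out.

```python
import re

# Hypothetical stand-in for the combined NLTK + custom stop word list.
STOP_WORDS = {"a", "an", "the", "i", "you", "it", "is", "to", "and"}
# Placeholder of our own choosing for numbers.
NUMBER_TOKEN = "<num>"


def format_review(text):
    """Lowercase a review, replace numbers, and drop stop words."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    tokens = [NUMBER_TOKEN if token.isdigit() else token for token in tokens]
    return [token for token in tokens if token not in STOP_WORDS]


formatted = format_review("The app crashed 3 times and I lost it")
```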

2.PMI phrase extraction

A phrase (in this study,a combination of two or more words) can commonly convey more semantic information than a single word.To better understand the actual meaning of emerging topics,we employ labeled data at the phrase or sentence level to interpret identified emerging topics.

We adopt the typical phrase extraction method pointwise mutual information (PMI) (https://en.wikipedia.org/wiki/Pointwise_mutual_information), which is based on co-occurrence frequency, to identify more meaningful combinations of words and link them together with underscores. The PMI calculation formula is as follows:

$$\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i w_j)}{p(w_i)\, p(w_j)},$$

where $p(w_i w_j)$ represents the co-occurrence probability of $w_i$ and $w_j$, and $p(w_i)$ represents the probability of $w_i$ in the whole review collection. A higher PMI value indicates that $w_i$ and $w_j$ are more likely to form a meaningful combination. By setting a threshold, we select the appropriate phrases as candidate labels.
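A small sketch of PMI-based phrase extraction follows. It assumes that only adjacent word pairs are candidate phrases, that probabilities are relative frequencies over the whole collection, and that the logarithm is base 2; none of these choices is stated in the text, and `pmi_phrases` is a hypothetical name.

```python
import math
from collections import Counter


def pmi_phrases(reviews, threshold):
    """Score adjacent word pairs by PMI; keep pairs at or above the threshold."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in reviews:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    phrases = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        score = math.log2(p_pair / (p_w1 * p_w2))
        if score >= threshold:
            # Link the two words with an underscore to form one token.
            phrases[f"{w1}_{w2}"] = score
    return phrases


toy_reviews = [
    ["app", "crash", "login"],
    ["app", "crash", "often"],
    ["login", "page", "slow"],
]
strict = pmi_phrases(toy_reviews, threshold=3.0)
loose = pmi_phrases(toy_reviews, threshold=2.0)
```

On a corpus this small the scores are dominated by rare pairs, which is why a minimum co-occurrence count is often applied alongside the PMI threshold in practice.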

3.2 Emerging topic identification

In this subsection,we aim to identify emerging topics in the current app version by considering topics in previous versions.We adopt AOBTM (Hadi and Fard,2020)to capture topic evolutions in different versions,and identify emerging topics from topic evolutions by the anomaly discovery method.

3.2.1 Topic evolutions

Whenever a new batch of text data is input,BTM requires retraining to capture potential topic distributions,which takes a lot of time.OBTM(Cheng et al.,2014) can fuse the features of topic distribution in the previous time slice to model the reviews in the current time slice.However,OBTM considers only the influence of the topic distribution in one time slice.In practice,developers might consider emerging topics from previous versions to ensure the effectiveness of version updates.

Similar to AOLDA,AOBTM allows users to customize the version window,that is,to control the number of time slices that affect the topic distribution characteristics in the current time slice.AOBTM can also control the influence weight of the distribution of topic features in historical time slices when calculating the topic distribution features of short texts in the current time slice.Different from AOLDA,AOBTM is more suitable for modeling of continuously input app reviews because the underlying model of AOBTM is BTM,which is more suitable for topic modeling of short texts.Fig.5 shows the details of AOBTM.

Fig.5 Overview of the adaptive online biterm topic model (AOBTM)

In AOBTM, different versions of app reviews are expressed as $R = \{R_1, R_2, \ldots, R_t, \ldots\}$ (where $t$ represents the app version). Reviews for each version are entered in turn, and each review is treated as a separate corpus. We use $\alpha$ and $\beta$ to represent the prior distributions of the corpus topics and topic words, respectively; both are defined initially. Let $K$ denote the number of topics. For the $k$th topic, $\phi_k^t$ represents the probability of all input terms in time slice $t$. The parameter $w$ defines the number of previous app versions that need to be considered when inferring the topic distribution of reviews in the current version. AOBTM adaptively integrates the topic distributions of the previous $w$ versions, denoted as $\{\phi^{t-1}, \ldots, \phi^{t-i}, \ldots, \phi^{t-w}\}$, to generate the prior probability distribution $\beta^t$ of the $t$th version. The adaptive integration sums the topic distributions of different versions with different weights $\gamma^{t,i}$. $\beta^t$ is calculated as follows:

where $i$ represents the $i$th version before the current version $t$, and $n_{w|k}^{t}$ represents the number of times that word $w$ is assigned to topic $k$ in version (time slice) $t$. The weight parameter $\gamma_k^{t,i}$ is determined by the similarity between the distributions of the $k$th topic under the $(t-i)$th time slice. The calculation method is as follows:

where the tensor dot product $\phi_k^{t-i} \cdot \beta_k^{t-1}$ calculates the similarity between the topic distribution $\phi_k^{t-i}$ and the prior probability distribution $\beta_k^{t-1}$ of the $(t-1)$th version. This adaptive aggregation allows the topic distribution characteristics in the previous time slices to impact the topic distribution characteristics in the current time slice differently.
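The idea of similarity-weighted aggregation can be illustrated with a short sketch for a single topic. It is not the AOBTM update itself: normalizing the dot-product similarities with a softmax, and summing the historical topic-word distributions directly, are assumptions made here for illustration, and both function names are hypothetical.

```python
import math


def adaptive_weights(prev_phis, prev_prior):
    """Turn dot-product similarities into weights that sum to 1 (softmax, assumed)."""
    sims = [sum(p * b for p, b in zip(phi, prev_prior)) for phi in prev_phis]
    shift = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(sim - shift) for sim in sims]
    total = sum(exps)
    return [value / total for value in exps]


def aggregate_prior(prev_phis, prev_prior):
    """Weighted sum of the topic-word distributions of the previous w versions."""
    gammas = adaptive_weights(prev_phis, prev_prior)
    vocab_size = len(prev_prior)
    return [
        sum(gamma * phi[v] for gamma, phi in zip(gammas, prev_phis))
        for v in range(vocab_size)
    ]


# Topic k in versions t-1 and t-2 (window size w = 2), over a 3-word vocabulary.
phis = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
prior_t_minus_1 = [0.6, 0.3, 0.1]
gammas = adaptive_weights(phis, prior_t_minus_1)
new_prior = aggregate_prior(phis, prior_t_minus_1)
```

The version whose topic distribution is closer to the previous prior receives the larger weight, which is the behavior the adaptive aggregation is meant to provide.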

3.2.2 Outlier detection

The topic evolutions (i.e., $\beta^t, \beta^{t-1}, \ldots, \beta^{t-i}$) describe the topic distribution in different versions. There will be significant differences between the topic distributions in consecutive versions when emerging topics exist. Therefore, we capture outliers (i.e., emerging topics) in topic evolution using an anomaly detection method.

First, we select the classical Jensen-Shannon (JS) divergence (https://dit.readthedocs.io/en/latest/measures/divergences/jensen_shannon_divergence.html) to measure the difference in the $k$th topic between two consecutive app versions, $t-1$ and $t$. The JS divergence quantifies the difference between two probability distributions $P$ and $Q$:

$$D_{JS}(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M), \quad M = \frac{1}{2}(P + Q), \quad D_{KL}(P \,\|\, M) = \sum_i P(i) \log \frac{P(i)}{M(i)},$$

where $P(i)$ is the $i$th item in $P$. A higher JS divergence value represents a larger difference between the two distributions.
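For reference, the JS divergence between two discrete distributions can be computed as below. The sketch uses the standard definition, D_JS(P‖Q) = ½ D_KL(P‖M) + ½ D_KL(Q‖M) with M = ½(P + Q), and the natural logarithm; the choice of logarithm base is an assumption.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence; terms with p(i) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

With the natural logarithm the value lies between 0 (identical distributions) and ln 2 (distributions with disjoint support).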

Second, we employ a typical outlier detection method (Rousseeuw and Hubert, 2011) to capture exceptional topics. The method assumes that the divergence obeys a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. An anomalous topic is then detected by setting a threshold $\delta$. For version $t$ of the app, the threshold $\delta_t$ is defined dynamically in the following steps:

1. We calculate $D_{JS}$ of each topic for the previous $w$ versions and express the results as a $w \times K$ matrix of $D_{JS}$ values ($K$ is the number of topics).

2. We calculate the mean $\mu$ and the variance $\sigma^2$ of all values in the $D_{JS}$ matrix.

3. We set the threshold $\delta_t = \mu + 1.25\sigma$; the coefficient 1.25 corresponds to accepting about 10% of the topics as emerging topics.

For the $t$th version, a topic is regarded as an emerging topic when its $D_{JS}$ value exceeds the defined threshold $\delta_t$.
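The three thresholding steps can be sketched as follows. The sketch assumes the population standard deviation is used for σ (the text does not say which estimator is applied), and `emerging_topics` is a hypothetical name. The last line checks the 10% figure: under a Gaussian assumption, the probability of exceeding μ + 1.25σ is about 0.106.

```python
from statistics import NormalDist, mean, pstdev


def emerging_topics(history, current, coeff=1.25):
    """Flag topics whose current divergence exceeds mu + coeff * sigma.

    history : w x K matrix of D_JS values from the previous w versions
    current : K D_JS values for the current version t
    """
    values = [value for row in history for value in row]
    threshold = mean(values) + coeff * pstdev(values)
    flagged = [k for k, value in enumerate(current) if value > threshold]
    return flagged, threshold


history = [[0.10, 0.12, 0.08], [0.11, 0.09, 0.10]]  # w = 2, K = 3
current = [0.10, 0.30, 0.09]
flagged, threshold = emerging_topics(history, current)

# Share of a Gaussian distribution lying above mu + 1.25 * sigma.
upper_tail = 1 - NormalDist().cdf(1.25)
```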

3.3 Topic interpretation

Single words are not enough to understand the actual meaning of topics.For example,Table 1 shows the results of topics from a short text by BTM.We employ the five words with the highest probabilities under the first three topics to explain the topics.We observe that topics 1 and 2 use the same word“browser”to interpret topics.However,we cannot understand the actual semantic information of the topic“browser.”It may be a browser crash or browser incompatibility.Furthermore,it is not clear whether the word“browser”in topics 1 and 2 contains the same semantic information.

Table 1 Output of the BTM

In this subsection,to better understand the actual meaning of emerging topics,we employ relevant phrases and sentences to interpret emerging topics.

3.3.1 Candidate labels

1.Phrases

In Section 3.1,we obtain the phrases extracted by PMI as the candidate labels.On this basis,we further constrain the phrase requirements:(1) the length of each word in the phrase should be no less than 4;(2) the phrases should contain at least one noun or one verb,but no adverbs or determiners.The remaining phrases are regarded as our candidate phrase-level labels.

2.Sentences

We adopt the NLTK tool to segment the reviews and further filter the sentences as candidate sentence-level labels:(1)the sentence should include candidate phrase labels;(2) the length of the sentence must be no less than 5 words;(3)noisy data in sentences should be filtered out.
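The filtering rules for candidate labels can be expressed as two small predicates. This is a sketch under several assumptions: rule (1) for phrases is read as at least four characters per word, part-of-speech tags are supplied by the caller as Penn Treebank tags (for example from an NLTK tagger), phrases are two words joined by an underscore, and the noise-filtering rule for sentences is not implemented. The function names are hypothetical.

```python
MIN_WORD_CHARS = 4      # phrase rule (1), read here as characters per word
MIN_SENTENCE_WORDS = 5  # sentence rule (2)


def keep_phrase(phrase, pos_tags):
    """Apply the phrase rules; pos_tags holds one Penn Treebank tag per word."""
    words = phrase.split("_")
    if any(len(word) < MIN_WORD_CHARS for word in words):
        return False
    has_noun_or_verb = any(tag.startswith(("NN", "VB")) for tag in pos_tags)
    has_adverb_or_determiner = any(
        tag.startswith("RB") or tag == "DT" for tag in pos_tags
    )
    return has_noun_or_verb and not has_adverb_or_determiner


def keep_sentence(sentence, phrase_labels):
    """Keep sentences of at least five words that contain a candidate phrase."""
    words = sentence.lower().split()
    if len(words) < MIN_SENTENCE_WORDS:
        return False
    adjacent_pairs = {f"{a}_{b}" for a, b in zip(words, words[1:])}
    return bool(adjacent_pairs & set(phrase_labels))


labels = ["waste_money"]
kept = keep_sentence("this update will waste money for everyone", labels)
```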

3.3.2 Similarity score

We employ the method in Mei et al. (2007) to calculate the similarity between the candidate label data (where the phrase is called $a$, the sentence is called $s$, and $l$ denotes a label that is either a phrase or a sentence) and the emerging topic distribution. The similarity score calculation formula is as follows:

$\mathrm{Score}_{sen}(l)$ represents the sentiment score of label $l$. The star rating and text length of app reviews can reflect users' concerns: on the one hand, a low star rating generally means that users are reporting issues with the app; on the other hand, long reviews often provide more valuable information (Gao et al., 2018). $\mathrm{Score}_{sen}(l)$ is calculated as follows:

where $r_l$ and $h_l$ represent the star rating and text length of the app review containing label $l$, respectively.

Based on $\mathrm{Score}(l)$, the top three candidate phrases and sentences with the highest similarity scores are selected to interpret emerging topics.

4 Experiments

4.1 Dataset

We selected apps that meet the following four criteria: (1) the apps are popular in the app stores, which means that developers update them constantly; (2) the apps come from different app categories and different platforms, to ensure the generalization of IETI; (3) each app has more than 5000 user reviews, which ensures the effective training of IETI; (4) most versions have detailed changelog records, which facilitates the evaluation of the model. We selected apps that meet these criteria in descending order from the rankings of different types of apps in the Apple App Store and Google Play. Overall, we obtained 164,026 reviews from 89 app versions of six apps. Table 2 describes the dataset in more detail. All project code runs on macOS with an Apple M1 CPU, and the programming language is Python 3.

Table 2 Subject apps

4.2 Evaluation method

We employed the official app changelogs as the ground truth, and manually extracted keywords from the changelogs as verification data. We adopted Word2Vec to compute the semantic similarity between label $l$ (phrases and sentences) and the keywords in the changelogs. A similarity threshold, set to 0.6, determines whether an emerging topic is covered in the changelogs. Specifically, we split each label into single words and calculated the similarity between each word and the keywords in the app changelogs. If the similarity score of a word is greater than the threshold, we consider the label to be covered by the changelogs and mark it as a valid issue prediction.
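The coverage check can be sketched with cosine similarity over word vectors. The two-dimensional vectors below are toy values invented for illustration; in practice they would come from a trained Word2Vec model, and `label_is_covered` is a hypothetical name. Words without a vector are skipped, which is also an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_is_covered(label, keywords, vectors, threshold=0.6):
    """A label is covered if any of its words is close to any changelog keyword."""
    for word in label.replace("_", " ").split():
        if word not in vectors:
            continue
        for keyword in keywords:
            if keyword in vectors and cosine(vectors[word], vectors[keyword]) > threshold:
                return True
    return False


# Toy embeddings standing in for Word2Vec output.
toy_vectors = {"crash": [1.0, 0.0], "bug": [0.9, 0.1], "weather": [0.0, 1.0]}
covered = label_is_covered("app_crash", ["bug"], toy_vectors)
not_covered = label_is_covered("weather", ["bug"], toy_vectors)
```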

In addition, we employed three performance metrics to evaluate the effectiveness of IETI. The first metric, Precision, measures the accuracy of the detected emerging topics. The second metric, Recall, evaluates whether all topics detected by IETI (both emerging and non-emerging topics) are covered in the changelogs. The third metric, the F1 score, measures the balance between Precision and Recall.

$$\mathrm{Precision} = \frac{I(E \cap G)}{I(E)}, \quad \mathrm{Recall} = \frac{I(L \cap G)}{I(G)}, \quad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where $E$, $G$, and $L$ represent the sets of detected emerging topics, keywords in changelogs, and all topics (including emerging and non-emerging ones), respectively, and $I(\cdot)$ counts the number of elements in the set. During evaluation, we set the parameters as $w = 3$, $K = 10$, PMI = 5, $\mu = 0.75$, and $\lambda = 0.75$, and initialized $\alpha = 0.1$ and $\beta = 0.01$.
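One way to compute the three metrics is sketched below. It assumes Precision = I(E ∩ G) / I(E), Recall = I(L ∩ G) / I(G), and F1 as their harmonic mean, and it reads the intersections as follows: an emerging topic counts toward E ∩ G if it matches at least one changelog keyword, and a keyword counts toward L ∩ G if any topic matches it. Both readings, and the name `precision_recall_f1`, are assumptions.

```python
def precision_recall_f1(emerging, all_topics, keywords):
    """emerging / all_topics map each topic to the changelog keywords it matches."""
    covered_emerging = sum(1 for matched in emerging.values() if matched & keywords)
    matched_keywords = set().union(*all_topics.values()) & keywords

    precision = covered_emerging / len(emerging) if emerging else 0.0
    recall = len(matched_keywords) / len(keywords) if keywords else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


changelog_keywords = {"crash", "login", "sound"}
emerging = {"topic_1": {"crash"}, "topic_2": set()}
all_topics = {"topic_1": {"crash"}, "topic_2": set(), "topic_3": {"login"}}
precision, recall, f1 = precision_recall_f1(emerging, all_topics, changelog_keywords)
```

Here one of the two emerging topics is covered (Precision 0.5) and two of the three keywords are matched by some topic (Recall 2/3), giving an F1 score of 4/7.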

4.3 Experimental results

Table 3 shows the evaluation results of emerging topic identification,and we analyzed the performance improvement of IETI in identifying emerging topics from the following three aspects:

1.For phrase labels,IETI achieved 0.628,0.529,and 0.572 of Precision,Recall,and F1 score on average,respectively.Compared to IDEA,the average values of these three metrics of IETI were increased by 0.094,0.107,and 0.126,respectively.

2.For sentence labels,IETI achieved 0.672,0.628,and 0.647 of Precision,Recall,and F1 score on average,respectively.Compared to IDEA,the average values of these three metrics of IETI were increased by 0.068,0.025,and 0.061,respectively.

3.Precision can accurately provide developers with the information on emerging topics,and Recall represents the ability of the model to identify emerging and common topics.These metrics help developers avoid missing some unexpected issues encountered by users.Precision and Recall collectively measure the ability of IETI to identify emerging topics,and IETI generally outperforms IDEA and OLDA in both metrics.

4.4 Analysis and discussion

4.4.1 Ablation study

We next investigated whether the improvement of the evaluation indexes created by IETI in Table 3 is attributable mainly to the unique preprocessing of the model,or AOBTM,or the combined effect of AOBTM and preprocessing.Therefore,we designed the following models for the ablation study:

1.IDEA+:Replace the preprocessing of IDEA with the preprocessing of IETI,and the topic model is AOLDA.

2.IETI-:Replace the preprocessing of IETI with the preprocessing of IDEA,and the topic model is AOBTM.

Then we applied IDEA,IDEA+,IETI-,and IETI to the same data sets.Table 4 shows the evaluation results of emerging topics identified by each model.

Compared with IDEA,Precision,Recall,and F1 score of IDEA+or IETI-were increased on both the phrase level and sentence level.Therefore,we confirmed that both the unique preprocessing and AOBTM of IETI improved the accuracy of emerging topic identification.We did not compare the contributions of the preprocessing and AOBTM to emerging topic identification,because we found that IDEA+and IETI-had some fluctuations in the comparison of evaluation indexes.

Compared with IDEA+and IETI-,the Precision,Recall,and F1 score metrics of the IETI were increased on both the phrase level and sentence level.This shows that the improvement in evaluation indexes brought by IETI is attributable mainly to the combined effect of AOBTM and preprocessing.

4.4.2 Label diversity

Both IETI and the baselines adopt labels (phrases or sentences) to evaluate the effectiveness of emerging topics. However, the labels may differ, and label diversity is reflected in two aspects: on the one hand, labels may be constructed differently and carry different semantics; on the other hand, labels may be constructed differently but carry similar semantics. Table 5 shows the emerging topics in NOAA radar. Note that the order of the labels of an emerging topic is random, which changes the order of emerging topics but does not affect their evaluation results. In Table 5, we can observe that the phrase labels used by IETI were not completely the same as those used by IDEA. The label "lightning strike" appeared only in the experimental results of IDEA; although the phrases "waste money" and "pay money" were constructed differently, their semantics were similar.

Table 3 Comparison results of emerging topics identified by different models

Table 4 Ablation study results of different models

Table 5 Emerging topic detection results of NOAA radar in version 1.7

The difference between IETI and the baselines is embodied in the label differences.Our experimental results demonstrate that IETI adopts more accurate labels to interpret emerging topics or to reduce the number of falsely identified emerging topics,which means that IETI can obtain higher F1 score.

4.4.3 Labels

Based on the results in Table 3, the average Precision, Recall, and F1 score of IETI on the six apps were 0.628, 0.529, and 0.572, respectively, for phrase labels, and 0.672, 0.628, and 0.647, respectively, for sentence labels. Therefore, compared to phrase labels, adopting sentence labels to interpret the identified emerging topics increases the average Precision, Recall, and F1 score by 0.044, 0.099, and 0.075, respectively. Furthermore, sentence-level labels provide more semantic detail about the identified emerging topics. For example, in NOAA radar's emerging topic identification results, emerging topics appear in version 3.1, and the label phrase with the highest probability is "weather apps." However, we cannot tell specifically what "weather apps" refers to. The label sentence with the highest probability is "if Yahoo weather can then there be no reason others can not implement it as well," from which we can see that "weather apps" is likely to refer to "Yahoo weather," and that the user considers "Yahoo weather" superior to "NOAA radar" in some app features.

We recommend employing labels at the sentence level to interpret the identified emerging topics.This approach can improve the accuracy of emerging topic recognition.More importantly,sentences can help developers better understand the practical significance of emerging topics.

4.4.4 Parameter influence

In this subsection,we demonstrate the impact of the two core parameters (the number of topicsKand the window sizew) on the performance of IETI.Other parameters in IETI have little influence on emerging topic identification,or are used only for algorithm initialization,which does not have practical guiding significance (Gao et al.,2018;Hadi and Fard,2020).

1.Number of topicsK

The topic model based on the Dirichlet hypothesis distribution demands that the topic numberKis set in advance when modeling the text.Consequently,in benchmark models such as LDA and BTM,the results of topic modeling will be influenced byK.AOBTM is an improvement on BTM,andKlikewise affects the results of emerging topic identification.Commonly,the larger the value ofK,the more emerging topics are identified,and the higher the Precision.However,the number of misidentified emerging topics will also be increased,resulting in a lower Recall.

We attempt to understand the impact of $K$ on the emerging topic identification results for the six apps. Fig. 6 shows the effect of different numbers of topics on the F1 score of IETI. To compare IETI with IDEA under uniform conditions, we set $K$ equal to 10. After applying different numbers of topics to the six apps, we conjecture that a smaller number of topics yields higher F1 scores on small-scale text, whereas a larger number of topics yields higher F1 scores on large-scale text.

Fig.6 Impact of the number of topics K on F1 score:(a) NOAA radar;(b) YouTube

2.Window sizew

When calculating the topic distribution of the current version,AOBTM will integrate some topic features of previous versions.Therefore,the number of previous versions will have a certain impact on the AOBTM calculation results.In this study,the number of previous versions is denoted as parameter window sizew.Fig.7 displays the impact ofwon emerging topic recognition results.When we set differentwvalues for six apps to identify emerging topics,we observed that the F1 score of the model was relatively high whenwequaled 2 or 3.To maintain the consistency of the comparative experimental conditions,wwas set to 3.

Fig.7 Impact of the window size w on F1 score:(a) NOAA radar;(b) YouTube

5 Threats to validity

1.Limitations of app changelogs

As mentioned,the emerging topics identified by IETI may not be covered in the app changelogs.However,because changelogs may not cover all the changes to the previous version of the app,these topics may represent unresolved issues in the updating procedures of multiple versions.It is difficult to formally define the practical meaning of these topics,but we can display these topics.Compared with reading each review,developers can reduce the time overhead by reviewing these topics.

2.Subjective emotions

We observed that some emerging topics are a result of the users’subjective emotions rather than specific software bugs or software features,such as the phrase label“waste money”in Table 5.These topics are rarely covered by app changelogs,which reduces the accuracy of emerging topic identification.However,these topics are valuable for app development,and analogously,we can also display these topics to developers.

3.Datasets and time slice

Our experimental data involved six types of apps,but we cannot predict whether our model is applicable to all types of apps.We mitigated this issue by selecting representative apps from different platforms and categories.In addition,our experiments slice reviews by app version,whereas finer-grained time slices could help the model identify emerging topics more quickly when accepting continuous input.To alleviate this issue,we can divide time into finer slices and treat each slice as a new app version.
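
Treating time slices as app versions can be sketched as follows. This is an illustrative example under stated assumptions: reviews are (date, text) pairs, slices have a fixed length in days, and each slice index plays the role of a pseudo-version. The function name `slice_reviews` and the fixed-length slicing are hypothetical and are not taken from the IETI code.

```python
from collections import defaultdict
from datetime import date


def slice_reviews(reviews, start, slice_days):
    """Group (posted_date, text) reviews into fixed-length time slices.

    Returns a dict mapping slice index (0, 1, 2, ...) to the review texts
    posted in that slice. Each slice can then be fed to the topic model
    as if it were a separate app version. Reviews before `start` are skipped.
    """
    if slice_days < 1:
        raise ValueError("slice_days must be at least 1")
    slices = defaultdict(list)
    for posted, text in reviews:
        offset = (posted - start).days
        if offset < 0:
            continue  # ignore reviews posted before the first slice
        slices[offset // slice_days].append(text)
    return dict(slices)


# Toy example with weekly slices.
reviews = [
    (date(2022, 1, 1), "app crashes on launch"),
    (date(2022, 1, 7), "too many ads"),
    (date(2022, 1, 8), "login fails after update"),
]
print(slice_reviews(reviews, start=date(2022, 1, 1), slice_days=7))
```

Shorter slices shorten the delay before a new topic can be flagged, but each slice then contains fewer reviews, which makes the short-text sparsity problem more pronounced.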

6 Related works

This study applies the emerging topic identification method to app review analysis,so we divide the related works into two areas:app review analysis and emerging topic identification.

6.1 App review analysis

Because app reviews can provide valuable app-related information from the perspective of immediate user experience,they are vital in app development.In recent years,more researchers have focused on automatically mining valuable information from app reviews (Liu YZ et al.,2019).The most common area of research in app review analysis is app review classification (Li YC et al.,2017;Jha and Mahmoud,2019;Su et al.,2019).Some research has adopted machine learning techniques to classify app reviews and evaluate the effectiveness of models through manual tagging (Darbanibasmanj et al.,2019).Guzman et al.(2015) developed a set of machine learning classifiers that can automatically classify app user reviews into categories related to app development.The authors tested the effectiveness of the classifier on manually labeled data sets;the final precision of the classifier reached 0.74 and the recall reached 0.59.Maalej and Nabil (2015) applied review metadata,text categorization,natural language processing,and sentiment analysis techniques to classify app review text into four categories:bug reports,feature requests,user experience,and ratings.Calefato et al.(2018) proposed a specially trained classifier called Senti4SD,which aims to classify original user reviews into reviews with various emotions.Aslam et al.(2020) adopted a convolutional neural network to classify app reviews,and the precision,recall,and F1 score of this method on a public data set were 0.95,0.94,and 0.95,respectively.

In recent years,topic models have been widely used in app review analysis research.Some studies have applied topic models to app reviews for tasks such as clustering or dimensionality reduction.Park et al.(2015) proposed AppLDA,based on the LDA topic model,which models app reviews and app descriptions.The authors verified the effectiveness of AppLDA with more than one million reviews from 43 041 apps.Liu YD et al.(2016) proposed stratify app reviews (SAR),which is based on LDA.SAR classifies informative reviews into different levels and groups app reviews according to the context of users’ attention.Wang et al.(2018) proposed group spamming latent Dirichlet allocation (GSLDA) to identify spam from reviews.Noei et al.(2021) adopted LDA to extract key topics from app reviews in different app categories.

6.2 Emerging topic identification

Some research has focused on identifying emerging topics from social media such as Twitter (Verasakulvong et al.,2018;Choi and Park,2019).However,the characteristics of social media comments are different from those of app reviews.The main relevant research work for apps focuses on the changing trend of app topics over time and employs the traditional abnormal topic detection method to identify emerging topics (Gao et al.,2015).For example,Vu et al.(2016) proposed a phrase-based automatic method for extracting user opinions from app reviews,which adopts part-of-speech templates to extract phrases in reviews and monitors the outbreaks of phrase clusters with negative emotions over a period of time to extract user opinions.Zeng et al.(2018) employed a memory network for topic modeling and classification of short texts.The network adopts a novel topic memory mechanism to track the change in short text topics in various time slices.Fan and Ma (2014) summarized the application of LDA in the detection of emerging topics.

Identifying emerging topics in app reviews is a challenging task.IDEA was the first to apply online topic modeling methods to identify emerging topics from app reviews (Gao et al.,2018).In the framework of IDEA,the authors proposed a method called AOLDA to adaptively capture topic evolution in continuous app version reviews and identified emerging topics using anomaly detection.Based on the six most popular apps from the Apple App Store and Google Play,the evaluation scores of emerging topic identification by IDEA were as follows:the mean values of precision,recall,and F1 score were 0.604,0.603,and 0.586,respectively.

7 Conclusions

Emerging topics in app reviews can guide developers in updating and maintaining their apps.In this study,we proposed IETI to identify emerging topics in continuous app versions.IETI reduced noisy data in reviews through richer preprocessing methods and applied AOBTM to emerging topic identification.We validated IETI’s emerging topic identification effectiveness on six real-world apps.In the future,we will filter reviews related to app features or bugs to improve emerging topic identification accuracy and improve generalization of IETI by mixing reviews and fine-grained time slices.

Contributors

Wan ZHOU and Yong WANG designed the research.Wan ZHOU processed the data and drafted the paper.Yong WANG,Cuiyun GAO,and Fei YANG helped organize the paper.Wan ZHOU and Yong WANG revised and finalized the paper.

Compliance with ethics guidelines

Wan ZHOU,Yong WANG,Cuiyun GAO,and Fei YANG declare that they have no conflict of interest.
