Jiaqi GAO, Jingqi LI, Hongming SHAN, Yanyun QU, James Z. WANG, Fei-Yue WANG, Junping ZHANG
1 Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China
2 Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Shanghai 200433, China
3 Shanghai Center for Brain Science and Brain-inspired Technology, Shanghai 201210, China
4 School of Information Science and Technology, Xiamen University, Xiamen 361005, China
5 College of Information Sciences and Technology, the Pennsylvania State University, University Park, PA 16802, USA
6 State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract: Crowd counting has important applications in public safety and pandemic control. A robust and practical crowd counting system has to be capable of continuously learning with the newly incoming domain data in real-world scenarios instead of fitting one domain only. Off-the-shelf methods have some drawbacks when handling multiple domains: (1) the models achieve limited performance (or even drop dramatically) among old domains after training on images from new domains, due to the discrepancies in intrinsic data distributions from various domains, which is called catastrophic forgetting; (2) a model well trained in a specific domain achieves imperfect performance among other unseen domains because of domain shift; (3) storage overhead increases linearly, whether all the data are mixed for training or dozens of separate models are trained for different domains when new ones are available. To overcome these issues, we investigate a new crowd counting task in the incremental domain training setting, called lifelong crowd counting. Its goal is to alleviate catastrophic forgetting and improve the generalization ability using a single model updated by the incremental domains. Specifically, we propose a self-distillation learning framework as a benchmark (forget less, count better, or FLCB) for lifelong crowd counting, which helps the model leverage previous meaningful knowledge in a sustainable manner for better crowd counting to mitigate the forgetting when new data arrive. A new quantitative metric, normalized Backward Transfer (nBwT), is developed to evaluate the forgetting degree of the model in the lifelong learning process. Extensive experimental results demonstrate the superiority of our proposed benchmark in achieving a low catastrophic forgetting degree and strong generalization ability.
Key words: Crowd counting; Knowledge distillation; Lifelong learning
Crowd counting aims to predict the number of people in an image or a video sequence. Accurate crowd counting for crowded scenes has important applications such as traffic control, preventing stampedes, and estimating participation in large public events like parades. For example, during a pandemic, authorities may need to maintain social distancing in public spaces to minimize the risk of infection. Thus, crowd counting systems are usually deployed in multiple diverse scenarios, such as malls, museums, and public squares. For one site, the running system is expected to continually handle non-stationary data with different densities, illumination, occlusion, and head scales. For multiple sites, the system should also consider dozens of scenes and perspective information.
As data are increasingly produced and labeling is time-consuming, the new domain data available for training are usually collected and labeled incrementally. We may ask: how can we sustainably handle the crowd counting problem in multiple domains using a single model when the newly available domain data arrive? We try to find the best potential solution to this question from the following aspects.
Currently, most crowd counting approaches (Zhang YY et al., 2016; Sam et al., 2017; Sindagi and Patel, 2017; Cao et al., 2018; Li YH et al., 2018; Chen XY et al., 2019; Liu WZ et al., 2019; Ma et al., 2019, 2020; Tan et al., 2019; Bai et al., 2020; Jiang XH et al., 2020b; Tian et al., 2020; Song et al., 2021) concentrate on training an independent model for each single domain dataset. They heavily rely on the assumption that images from both the training set and test set are independent and identically distributed. Although producing promising counting performance in the corresponding domain, such a training strategy, as shown in Fig. 1a, has drawbacks in dealing with multiple and incremental new datasets, which are common in the real world, e.g., when limited labeled data from a new site are available before applying the model at the site. One drawback is that these separately trained models often have low generalization ability when dealing with new, unseen domain data due to the domain shift evidenced in Table 1. Another is that saving multiple different sets of trained parameters from distinct domains for inference is not economical when deploying them to hundreds of thousands of real-world sites. Training a shared and universal model from scratch by mixing all the data (also known as joint training) or sequential training for each newly incoming dataset may improve the performance on the unseen domains (Figs. 1b and 1c). Nevertheless, both paradigms still have some limitations. The joint training strategy (Ma et al., 2021; Yan et al., 2021) requires storing all training data from previous domains when the newly available data arrive, leading to lengthy training time and high storage overhead. Meanwhile, the sequential training strategy will dramatically deteriorate the model's performance among previous domains after training on the new domain data, i.e., catastrophic forgetting.
Table 1 The mean absolute error (MAE) scores of our reproduced DM-Count (Wang BY et al., 2020) model separately trained in a single dataset and tested over other datasets, showing an obvious performance drop due to domain discrepancy
To deal with the aforementioned forgetting, generalization, and storage overhead issues, inspired by the learning mechanism of mammals, we investigate a new task of crowd counting in this study, termed lifelong crowd counting, which can sustainably learn with the new domain data and concurrently alleviate catastrophic forgetting and performance drop among preceding domains under the domain-incremental training settings (Fig. 1d). Note that the goal of the proposed lifelong crowd counting task is different from that of previous cross- and multi-domain crowd counting tasks (Chen BH et al., 2021; Ma et al., 2021; Yan et al., 2021). During the whole lifelong learning process with incremental training data, the goal is to maximize the overall performance among all domains (previously trained, newly arriving, and unseen) instead of focusing only on the target domain performance. We consider the trade-off between the forgetting degree and the generalization ability of the models. In particular, we develop a novel benchmark of domain-incremental lifelong crowd counting with the help of knowledge self-distillation techniques. The proposed benchmark has both strong generalization ability on unseen domains and low forgetting degrees among seen domains. This enables the model to have sustainable counting capability when new data arrive in the future. In our experiments, we use four widely used crowd counting backbones, CSRNet (Li YH et al., 2018), SFANet (Zhu L et al., 2019), DM-Count (Wang BY et al., 2020), and DKPNet (Chen BH et al., 2021), to illustrate the effectiveness and superiority of our proposed framework.
Fig. 1 The conceptual differences of four training paradigms: (a) directly training an individual model for each dataset; (b) training a unified model by mixing all datasets from different domains; (c) leveraging previous data or models to improve the performance on the target domain dataset; (d) ours: lifelong learning with incremental domains to improve the performance among all domains. In (c), the dashed lines indicate that the past domain data may be used repeatedly to improve the performance in the target domain dataset. In (d), our proposed FLCB (forget less, count better) model does not replay any previous domain data and evaluates all domain datasets at the training stage. Without storing previous domain data, FLCB itself can still sustainably handle the crowd counting problem among multiple domains, being updated by the newly available domain dataset only
The contributions of this work can be summarized as follows:
1. To the best of our knowledge, this is the first work to investigate lifelong crowd counting by considering the catastrophic forgetting and generalization ability issues. Our method may serve as a benchmark for further research in the lifelong crowd counting community.
2. We design a balanced domain forgetting loss function (BDFLoss) to prevent the model from dramatically forgetting the previous knowledge when being trained on the newly arriving crowd counting dataset.
3. We propose a new quantitative metric, normalized Backward Transfer (nBwT) of lifelong crowd counting, to measure the forgetting degree of trained models among seen data domains. We treat the mean absolute error (MAE) as the criterion for evaluating model generalization on the unseen data domain.
4. Extensive experiments indicate that our proposed method has a lower degree of forgetting compared with sequential training and outperforms the joint training strategy on the unseen domain with a much lower MAE score and lower time and space complexity.
Traditional detection- and regression-based methods extract handcrafted features such as scale invariant feature transform (SIFT) (Lowe, 1999) and histogram of oriented gradient (HoG) (Dalal and Triggs, 2005) to detect individual heads (Dalal and Triggs, 2005; Leibe et al., 2005; Tuzel et al., 2008; Dollar et al., 2012) or directly regress the count number (Chan and Vasconcelos, 2009). Nevertheless, these models cannot learn the spatial information of person distribution to make accurate predictions in highly congested scenes. Most of the latest crowd counting approaches are built upon deep learning methods to estimate a density map for a given image. Many researchers design various architectures like fully convolutional networks (Wang C et al., 2015; Zhang C et al., 2015), multi-column networks (Boominathan et al., 2016; Zhang YY et al., 2016; Sam et al., 2017; Sindagi and Patel, 2017), scale aggregation or scale pyramid networks (Cao et al., 2018; Chen XY et al., 2019; Liu LB et al., 2019; Jiang XH et al., 2020b; Zhao et al., 2020; Song et al., 2021), and attention mechanisms (Guo et al., 2019; Liu N et al., 2019; Zhu L et al., 2019; Jiang XH et al., 2020a; Sindagi and Patel, 2020) to extract multi-scale feature representations that deal with scale variation and non-uniform distribution issues. CSRNet (Li YH et al., 2018) points out the multi-scale feature redundancies among multi-branch architectures and proposes a new deeper single-column convolutional neural network (CNN) with dilated convolutions to capture different receptive fields. ADCNet (Bai et al., 2020) extends the discrete dilated ratio (integer value) to a continuous value to match the large-scale variation and self-corrects the density map using the expectation-maximization (EM) algorithm. Local region modeling methods (Liu L et al., 2020; Jiang SQ et al., 2020) also help correct the local information. Most off-the-shelf crowd counting models focus on single domain learning. The models will be retrained when the new domain data arrive. In our study, we focus on using a single model to handle multiple incremental datasets for crowd counting.
Many researchers explore the cross-domain problems (Wang Q et al., 2019, 2022; Wu et al., 2021; Zou et al., 2021; Liu WZ et al., 2022) in crowd counting, including cross-scene (Zhang C et al., 2015), cross-view (Zhang Q et al., 2021), and cross-modal (Liu LB et al., 2021). The adversarial scoring network (Zou et al., 2021) is applied to adapt to the target domain from coarse to fine granularity. In addition, cross-domain features can be extracted by the message-passing mechanisms based on a graph neural network (Luo et al., 2020). A semantic extractor (Han et al., 2020) has been designed to capture the semantic consistency between the source domain and target domain to enhance the adapted model. A large synthetic dataset (GCC) (Wang Q et al., 2019) has been released to study the transferability from synthetic data to real-world data. Quite a few researchers (Shi et al., 2019; Xiong et al., 2019; Yang et al., 2020) investigated similar tasks like vehicle counting based on the same crowd counting architectures. Learning with multiple domains simultaneously (Chen BH et al., 2021; Ma et al., 2021; Yan et al., 2021) has also been preliminarily explored, and requires mixing all the data for training at the same time. DCANet (Yan et al., 2021) uses a channel-attention-guided multi-dilation module to assist the model in learning a domain-invariant representation, while DKPNet (Chen BH et al., 2021) propagates the domain-specific knowledge with the help of variational attention techniques. Ma et al. (2021) developed a scale alignment component to learn an adaptive rescaling factor for each image patch for better crowd counting. In reality, such cross-domain approaches need a careful alignment module design and place more emphasis on the target domain performance only, while the multi-domain learning methods require more storage overhead to save old domain data. These methods often achieve limited performance in previous (source) domains. In contrast, our proposed lifelong crowd counting task is based on training the domains incrementally (one by one) using a single model, alleviating the catastrophic performance drop in the previous domains (forget less), and maintaining the overall performance in all domains (count better). Inspired by the learning mechanisms of mammals, the lifelong crowd counting system can mimic the biological brain to learn sustainably in its lifetime, i.e., integrating new knowledge increasingly while maintaining previous memories.
Lifelong learning attempts to alleviate the catastrophic forgetting issues and enhance the model generalization ability when a system increasingly faces non-stationary data. The mainstream strategies are applied to image classification (Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017; Rebuffi et al., 2017; Li ZZ and Hoiem, 2018; Belouadah and Popescu, 2019) and numerical prediction tasks (He YJ and Sick, 2021), and can be categorized into four groups: model-growth approaches (Rusu et al., 2016), rehearsal-based techniques (Lopez-Paz and Ranzato, 2017; Rebuffi et al., 2017), regularization (Kirkpatrick et al., 2017; Rebuffi et al., 2017), and distillation mechanisms (Li ZZ and Hoiem, 2018). Specifically, the model-growth (e.g., progressive neural network (PNN) (Rusu et al., 2016)) and rehearsal-based methods (e.g., GEM (Lopez-Paz and Ranzato, 2017)) require more computational and memory costs because they either instantiate a new network or replay old data when learning new classes or tasks. LwF (Li ZZ and Hoiem, 2018) is a combination of the distillation networks and fine-tuning to boost the overall performance. However, the aforementioned classification-based lifelong learning approaches cannot migrate to the crowd counting task directly because counting is an open-set problem (Xiong et al., 2019) by nature, whose value ranges from zero to positive infinity in theory. Latent feature representations with general visual knowledge together with high-level semantic information at the output layer play a crucial role in such dense prediction tasks. Therefore, in this paper, we propose a simple yet effective self-distillation loss at both the feature level and the output level for lifelong crowd counting to alleviate catastrophic forgetting with a low time and space complexity.
In this section, we will first introduce concrete formalized definitions of typical crowd counting and the proposed lifelong crowd counting. After that, we describe the details of our proposed domain-incremental self-distillation lifelong crowd counting benchmark, including model architectures and the proposed loss function.
3.1.1 Typical crowd counting
A typical crowd counting task can be regarded as a density map regression problem, training and validating in a single domain, as shown in Fig. 1a. Suppose that one dataset $D_M = \langle X_M, Y_M \rangle$ contains $M$ training images and the corresponding annotations. Then, a binary map $B$ is easy to obtain given the coordinates of pedestrian heads per image, which can be formally defined as follows:
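A form consistent with the description above is sketched below. The pixel index $x$ is a notational assumption introduced here for illustration, not a symbol taken from the text.

```latex
B(x) =
\begin{cases}
1, & \text{if a pedestrian head is annotated at pixel } x,\\
0, & \text{otherwise.}
\end{cases}
```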
The ground truth density map $Y$ is generated by employing the Gaussian kernel $G_\sigma$ to smooth the binary map:
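The smoothing step is commonly written as the convolution $Y = G_\sigma * B$. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the function names, the zero padding at the borders, and the kernel truncation at four standard deviations are all assumptions made for illustration.

```python
import numpy as np

def binary_map(head_points, height, width):
    """Set B to 1 at every annotated head position given as (row, col)."""
    b = np.zeros((height, width), dtype=np.float64)
    for r, c in head_points:
        if 0 <= r < height and 0 <= c < width:
            b[int(r), int(c)] = 1.0
    return b

def gaussian_kernel_1d(sigma, truncate=4.0):
    """Normalized 1-D Gaussian kernel, truncated at `truncate` * sigma (assumption)."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def density_map(b, sigma):
    """Smooth B with a separable Gaussian (rows first, then columns).

    Assumes both image sides are at least as long as the kernel.
    """
    k = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, b)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, rows)

# Two annotated heads on a 64 x 64 image; a small sigma keeps the kernels inside the image.
heads = [(20, 30), (40, 10)]
B = binary_map(heads, 64, 64)
Y = density_map(B, sigma=2.0)
```

Because the kernel is normalized, the density map integrates to the number of heads as long as no kernel is cut off by the image border, which is what allows the count to be read off as the sum of the map.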
3.1.2 Lifelong crowd counting
We propose a new, challenging, yet practical crowd counting task, i.e., lifelong crowd counting, for investigating the catastrophic forgetting and model generalization problems in training on domain-incremental datasets. Different from previous works that maintained good performance only in a single target domain, the lifelong crowd counting model can be sustainably optimized over the newly incoming datasets to maximize the performance among all domains.
For convenience, we first define some key notations as follows and introduce the details of the lifelong crowd counting process. A sequence of $N$ domain datasets $\{D_1, D_2, \ldots, D_N\}$ is prepared to train the lifelong crowd counter $G^*(\cdot;\psi)$ with parameters $\psi$ one by one. $X_t$ and $Y_t$ are the training images and corresponding ground truth density maps with $M_t$ samples from the $t$-th domain $D_t$, respectively. Here, we assume that different datasets come from different domains with their own distinct data distributions, because they are normally captured from different cameras or different scenarios like streets, museums, and gymnasiums. The model is initially trained from scratch over the first domain and then trained and optimized by the rest of the datasets sequentially. The optimal objective $\psi^*$ is defined as follows:
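One way to write this objective, consistent with the symbols explained in the next sentence, is sketched below. The per-sample loss $\mathcal{L}$ and the sample indexing $X_i^{(t)}, Y_i^{(t)}$ are assumptions introduced here for illustration.

```latex
\psi^{*} = \arg\min_{\psi} \sum_{t=1}^{N} \frac{1}{M_t} \sum_{i=1}^{M_t}
\mathcal{L}\bigl(G^{(t)}(X_i^{(t)};\psi),\; Y_i^{(t)}\bigr)
```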
where $G^{(t)}(\cdot;\psi)$ represents the $t$-th model for training the $t$-th dataset with $M_t$ samples. The ultimate model is expected to achieve decent performance among seen and unseen domains. It is worth pointing out that lifelong crowd counting is distinct from cross-domain tasks in both its optimization objective and its training settings. In lifelong crowd counting, the goal is to maximize the performance on both seen and unseen domains instead of maximizing the target domain performance only. Specifically, when the training data from previous domains are absent or unavailable, lifelong crowd counters can still work efficiently because they are trained and updated only by the newly arriving domain datasets one after another.
Our proposed framework focuses on tackling the catastrophic forgetting and generalization issues under domain-incremental training settings. In this study, we simply regard different crowd counting datasets as different domains because the statistics (mean and variance) of person count are different. Detailed explanations of the domain concept can be found in the supplementary materials. To be more specific, we propose a novel domain-incremental self-distillation lifelong crowd counting benchmark for sustainable learning with newly arriving data and without an obvious performance drop among previous domains. The key factor is how to effectively leverage the previously learned meaningful knowledge when training over the data from a new domain for better crowd counting. Inspired by the knowledge distillation technique, we expect to use a well-trained model among old domains (teacher model) to guide the currently optimized model with new domain data (student model) to mitigate performance drop among previous domains, considering that the old data may be unavailable. The overview of our proposed framework is illustrated in Fig. 2. We design a self-distillation mechanism plugged into both feature- and output-level layers of the network to constrain the output distribution similarities between the teacher and student models, which can reuse the learned knowledge when facing the new domain data without storing or training on the old data repeatedly. Details will be given in Section 3.3. The ultimate model is expected to be deployed to an arbitrary domain to estimate the person count.
Fig.2 Overall architecture of our proposed domain-incremental self-distillation learning benchmark (FLCB)
For better understanding, the overall training pipeline is described in detail in Algorithm 1. A queue $Q$ collects $N$ increasingly arriving datasets from different domains to be trained one by one. First, we initialize the first model $G^{(1)}(\cdot;\psi)$ by training on the first available dataset $D_1$ in queue $Q$. Another queue $P$ is prepared for future evaluation, receiving the test set popped from $Q$. After that, the model will be trained and optimized by the subsequent datasets from $D_2$ to $D_N$, repeating the following main steps until queue $Q$ is empty:
1. Pop the $t$-th dataset $D_t$ from queue $Q$ for training.
2. Copy the parameters of the last well-trained model $G^{(t-1)}$ to model $F(\cdot;\theta)$ as a teacher network for distillation.
3. Train the current $t$-th model $G^{(t)}(\cdot;\psi)$ over the $t$-th dataset $D_t$ via the self-distillation loss we propose.
4. Push the $t$-th dataset $D_t$ into queue $P$ for evaluation when the model converges.
Note that the parameters $\theta$ of model $F(\cdot;\theta)$ are frozen during the lifelong training process. The fixed model is regarded as a teacher network to guide the current student network $G^{(t)}(\cdot;\psi)$ with learnable parameters $\psi$ to remember old meaningful knowledge for better crowd counting. Eventually, we obtain the final model with the best parameters $\psi^*$, which can continue to be trained using our proposed framework when newly incoming labeled data are ready in the future. Because we do not need to store any previously seen training data to be replayed to train our model, the time and space complexities are approximately $O(N)$ and $\Omega(M)$, respectively, superior to $O(N^2)$ and $\Omega(N \times M)$ of joint training, where $M$ is the maximum of $M_i$. Although the distillation mechanism is required to save an additional model, its storage overhead is negligible compared to storing the entire dataset for retraining.
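The queue-based procedure above can be sketched in a framework-agnostic way as follows. This is an illustrative outline rather than Algorithm 1 itself: `lifelong_train`, `train_fn`, and the toy dictionary model are hypothetical names introduced here, and the actual optimization is hidden behind the `train_fn` callable.

```python
from collections import deque
import copy

def lifelong_train(datasets, init_model, train_fn):
    """Train one model over datasets D_1..D_N without replaying old data.

    datasets   : list of domain datasets, in arrival order
    init_model : untrained model (any deep-copyable object)
    train_fn   : hypothetical callable (student, teacher, dataset) -> trained student;
                 teacher is None for the first domain
    """
    Q = deque(datasets)  # datasets waiting to be trained
    P = deque()          # datasets already seen, kept for evaluation only

    first = Q.popleft()
    model = train_fn(init_model, None, first)  # initialize G^(1) on D_1
    P.append(first)

    while Q:
        d_t = Q.popleft()                      # step 1: pop D_t
        teacher = copy.deepcopy(model)         # step 2: frozen copy of G^(t-1)
        model = train_fn(model, teacher, d_t)  # step 3: train G^(t) with distillation
        P.append(d_t)                          # step 4: keep D_t for evaluation
    return model, list(P)

# Toy stand-in for training: the "model" only records which domains it has seen.
def toy_train(student, teacher, dataset):
    student = copy.deepcopy(student)
    student["seen"].append(dataset)
    student["teacher_seen"] = None if teacher is None else list(teacher["seen"])
    return student

final_model, eval_queue = lifelong_train(
    ["SHA", "SHB", "QNRF", "NWPU"], {"seen": []}, toy_train
)
```

In this sketch only the current dataset and one teacher copy are held at any step, which mirrors the storage argument made above.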
To balance the model plasticity (the ability to learn new data) and stability (the ability to remember previous knowledge), we propose a novel balanced domain forgetting loss function, i.e., BDFLoss, consisting mainly of a counting loss and a self-distillation loss. We integrate the optimal transport loss into our basic $L_1$ counting loss in this study because it has tighter generalization error bounds (Wang BY et al., 2020). The $L_1$ counting loss is defined as follows:
where the $L_1(\cdot,\cdot)$ loss computes the difference between the predicted and actual counts.
The optimal transport loss $L_{\mathrm{OT}}$ is used to minimize the distribution discrepancy between the predicted density maps and the point-annotated binary maps, defined as follows:
where $W_c(\mu, v; C)$ is the optimal transport loss with the transport cost $C$. It aims at minimizing the cost to transform one probability distribution $\mu$ to another $v$. $C$ is defined as the quadratic transport cost here. $\alpha^*$ and $\beta^*$ are the optimal solutions to its dual problem:
To improve the approximation of the low-density regions of images, we embed a normalized regularization term $L_r$, defined as follows:
Thus, the total count loss is made up of the three aforementioned loss functions with two hyper-parameters, $\eta$ and $\gamma$, which are set to 0.1 and 0.01, respectively, in our experiments.
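For reference, the counting-loss components described above can be collected as follows. This sketch assumes the DM-Count formulation (Wang BY et al., 2020) that the text builds on. The symbol $\hat{Y}$ for the predicted density map, the use of $\|\cdot\|_1$ for the count of a map, the choice of the point-annotated map $B$ as the target, and the $\|B\|_1$ scaling of the regularizer are assumptions carried over from that formulation rather than details stated here.

```latex
\begin{aligned}
L_{1}(\hat{Y}, B) &= \bigl|\, \|\hat{Y}\|_1 - \|B\|_1 \,\bigr|,\\
L_{\mathrm{OT}}(\hat{Y}, B) &= W_c\!\left(\frac{\hat{Y}}{\|\hat{Y}\|_1}, \frac{B}{\|B\|_1}; C\right)
 = \left\langle \alpha^{*}, \frac{\hat{Y}}{\|\hat{Y}\|_1} \right\rangle
 + \left\langle \beta^{*}, \frac{B}{\|B\|_1} \right\rangle,\\
W_c(\mu, v; C) &= \max_{\alpha,\beta}\; \langle \alpha, \mu \rangle + \langle \beta, v \rangle
 \quad \text{s.t.}\quad \alpha_i + \beta_j \le C_{ij}\ \ \forall\, i, j,\\
L_{r}(\hat{Y}, B) &= \frac{1}{2}\left\| \frac{\hat{Y}}{\|\hat{Y}\|_1} - \frac{B}{\|B\|_1} \right\|_1,\\
L_{\mathrm{count}} &= L_{1} + \eta\, L_{\mathrm{OT}} + \gamma\, \|B\|_1\, L_{r}.
\end{aligned}
```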
When training on the $t$-th domain, the performance among previous domains may degrade dramatically, i.e., catastrophic forgetting, if no constraints are imposed. The self-distillation loss $L_{\mathrm{distill}}$ is designed to help the model forget less and count better during the lifelong learning process. To be more specific, we regard the current training model $G^{(t)}(\cdot)$ as the student model, which can be guided by the teacher model $G^{(t-1)}(\cdot)$ well trained at the previous step (Fig. 2). The student model is expected not to forget previously learned knowledge when training in the new domain. Normally, the deep layers of a CNN with a large receptive field contain task-specific and high-level semantic information, while the intermediate layers include general visual knowledge. They are mutually beneficial and complementary, and assist the model in remembering the helpful knowledge learned previously during the lifelong crowd counting process. Thus, we deploy the self-distillation loss at both the feature level and the output level when the $t$-th new domain dataset arrives for training.
where $H(\cdot)$ denotes the feature extractor of model $G(\cdot)$. Since the similarity metric is not our crucial research point in this study, we just choose the $L_2$ loss for simplicity. To sum up, the BDFLoss is made up of these two components with the hyper-parameter $\lambda$:
where $\lambda$ serves as a trade-off between model plasticity and stability. It is the same as vanilla sequential fine-tuning when $\lambda$ is equal to 0.
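As an illustration of how the two components combine, the sketch below assumes $L_{\mathrm{distill}} = L_2(H^{(t)}(X), H^{(t-1)}(X)) + L_2(G^{(t)}(X), G^{(t-1)}(X))$ and $L_{\mathrm{BDF}} = L_{\mathrm{count}} + \lambda L_{\mathrm{distill}}$. These forms, the use of a mean squared difference for the $L_2$ loss, and all function names are assumptions made for this example; the counting loss is passed in as a precomputed number.

```python
import numpy as np

def l2_loss(a, b):
    """Mean squared difference between two arrays of the same shape (assumed L2 form)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def distill_loss(student_feat, teacher_feat, student_out, teacher_out):
    """Feature-level plus output-level self-distillation term."""
    return l2_loss(student_feat, teacher_feat) + l2_loss(student_out, teacher_out)

def bdf_loss(count_loss, student_feat, teacher_feat, student_out, teacher_out, lam=0.5):
    """Counting loss plus lam times the distillation term."""
    return count_loss + lam * distill_loss(
        student_feat, teacher_feat, student_out, teacher_out
    )

# Toy tensors standing in for feature maps and predicted density maps.
feat_s, feat_t = np.ones((2, 2)), np.zeros((2, 2))      # feature-level L2 = 1.0
out_s, out_t = np.full((2, 2), 2.0), np.zeros((2, 2))   # output-level L2 = 4.0

total = bdf_loss(3.0, feat_s, feat_t, out_s, out_t, lam=0.5)       # 3.0 + 0.5 * 5.0
no_distill = bdf_loss(3.0, feat_s, feat_t, out_s, out_t, lam=0.0)  # counting loss only
```

Setting `lam=0.0` leaves only the counting loss, which corresponds to the sequential fine-tuning case mentioned above.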
Our proposed domain-incremental self-distillation lifelong crowd counting benchmark is model-agnostic. To illustrate its effectiveness, we integrate it into several state-of-the-art crowd counting backbone models without bells and whistles: CSRNet (Li YH et al., 2018), SFANet (Zhu L et al., 2019), DM-Count (Wang BY et al., 2020), and DKPNet (Chen BH et al., 2021). Because the attention map supervision of SFANet may introduce some biases in the experimental comparisons and the source code of DKPNet is not released, we make the following modifications in our experiments. For SFANet, we enable the network to learn the attention map adaptively based on training images without generating additional attention maps for supervision. For DKPNet, we use a modified DKPNet-baseline, because we focus only on investigating the effectiveness of our proposed framework in forgetting and generalization under different model capacities.
In this section, we will briefly introduce four datasets used in our experiments, the training settings, and some hyper-parameter selections.
We train and evaluate our model on the public crowd counting datasets, i.e., ShanghaiTech PartA (Zhang YY et al., 2016), ShanghaiTech PartB (Zhang YY et al., 2016), UCF-QNRF (Idrees et al., 2018), NWPU-Crowd (Wang Q et al., 2021), and JHU-Crowd++ (Sindagi et al., 2019) (Table 2). To illustrate the generalization of different training paradigms, we have to select one of them as the unseen dataset that is never used for training during the domain-incremental lifelong learning process. In our experiments, we take the JHU-Crowd++ dataset as the unseen one because it has a variety of diverse scenarios and unconstrained environmental conditions (Sindagi et al., 2019). The synthetic dataset GCC (Wang Q et al., 2019) is also used to analyze the synthetic-to-real generalization performance under the lifelong crowd counting settings.
We strictly follow the same basic image preprocessing settings as in most recent literature (Li YH et al., 2018; Ma et al., 2019; Zhu L et al., 2019; Wang BY et al., 2020). The crop size is 256×256 for SHA, and 512×512 for the SHB, QNRF, and NWPU datasets. To generate the density map as ground truth, we adopt the fixed Gaussian kernel whose variance $\sigma$ is set to 15 for all datasets. Several useful augmentations like random horizontal flipping with a probability of 0.5 and normalization are applied to the images before training. The hyper-parameter $\lambda$ in the loss function is set to 0.5 to achieve a trade-off between model plasticity and stability. We use a fixed learning rate of $1\times10^{-5}$, a weight decay of $5\times10^{-4}$, and an Adam optimizer in all our experiments. We use the PyTorch framework and an NVIDIA GeForce RTX 3090 GPU workstation.
The catastrophic forgetting phenomenon often exists in domain-incremental learning. To evaluate how much old knowledge the model forgets in the previous domains and to make a fair comparison with other methods, we propose a new metric, called normalized Backward Transfer (nBwT). With the help of nBwT, the total forgetfulness over $t$ incremental domains can be measured to determine whether the model is equipped with the sustainable learning ability. The normalization operation we introduce in nBwT eliminates the potential negative impact of the different learning difficulties in different domains.
where $e_{t,i}$ is the test MAE score of the $i$-th dataset when obtaining the optimal model on the $t$-th dataset, and $i < t$. $\mathrm{nBwT}_t$ is the accumulation of the forgetting performance among all previous $t-1$ domain datasets. The non-zero divisor $e_{i,i}$ is a normalization factor. The larger the nBwT value is, the greater the model forgetting degree is. A value smaller than 0 indicates that the model has attained a positive performance improvement among previously trained datasets. The theoretical lower bound of $\mathrm{nBwT}_t$ is attained when $e_{t,i}$ equals zero.
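A small sketch of how such a metric can be computed is given below. It assumes the averaged form $\mathrm{nBwT}_t = \frac{1}{t-1}\sum_{i=1}^{t-1} (e_{t,i} - e_{i,i}) / e_{i,i}$; the $1/(t-1)$ averaging is an assumption made here (a plain sum would differ only by that factor), and the MAE values in the example are made up for illustration.

```python
def nbwt(e, t):
    """Normalized backward transfer after training on the t-th dataset.

    e : square table where e[a][b] is the test MAE on dataset b+1 measured after
        training up to dataset a+1 (0-based indices into the table)
    t : 1-based index of the most recently trained dataset, t >= 2
    """
    if t < 2:
        raise ValueError("nBwT needs at least one previously trained dataset")
    terms = [(e[t - 1][i] - e[i][i]) / e[i][i] for i in range(t - 1)]
    return sum(terms) / (t - 1)

# Made-up MAE table for three domains; entries above the diagonal are unused.
errors = [
    [100.0,  0.0,  0.0],
    [120.0, 10.0,  0.0],
    [110.0, 12.0, 50.0],
]

after_second = nbwt(errors, 2)  # (120 - 100) / 100
after_third = nbwt(errors, 3)   # mean of (110 - 100) / 100 and (12 - 10) / 10
```

Under this averaged form, each term is bounded below by -1, which is reached when the later error on a previous dataset drops to zero.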
Furthermore, we propose two reasonable and impartial criteria, i.e., mMAE and mRMSE, the respective means of MAE and root mean square error (RMSE) over $N$ datasets, to roughly evaluate the overall counting precision of the lifelong crowd counting task:
where $M_i$ denotes the number of images from the $i$-th test set. $\hat{Y}_j$ and $Y_j$ are the predicted count and actual count of the $j$-th image, respectively. mMAE and mRMSE reduce to the standard MAE and RMSE, respectively, when $N$ is equal to 1.
In addition, we still use the standard MAE score on the unseen JHU-Crowd++ dataset to compare the model generalization under different training strategies.
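The two criteria can be illustrated with the sketch below, which assumes an unweighted mean of the per-dataset MAE and RMSE, i.e., $\mathrm{mMAE} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{M_i}\sum_{j=1}^{M_i}|\hat{Y}_j - Y_j|$ and the analogous form for mRMSE. The unweighted averaging, the function names, and the counts in the example are assumptions made for illustration.

```python
import math

def mae(pred, true):
    """Mean absolute error between predicted and actual counts of one test set."""
    return sum(abs(p - y) for p, y in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root mean square error between predicted and actual counts of one test set."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(pred, true)) / len(true))

def m_mae(per_dataset):
    """per_dataset: list of (predicted_counts, actual_counts) pairs, one per test set."""
    return sum(mae(p, y) for p, y in per_dataset) / len(per_dataset)

def m_rmse(per_dataset):
    """Unweighted mean of the per-dataset RMSE values."""
    return sum(rmse(p, y) for p, y in per_dataset) / len(per_dataset)

# Two made-up test sets with two images and one image, respectively.
results = [
    ([10.0, 20.0], [12.0, 17.0]),  # MAE 2.5, RMSE sqrt(6.5)
    ([100.0], [90.0]),             # MAE 10.0, RMSE 10.0
]

overall_mae = m_mae(results)
overall_rmse = m_rmse(results)
```

With a single dataset in the list, both functions return the ordinary MAE and RMSE of that dataset, matching the $N = 1$ case noted above.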
In this section, we first evaluate the overall performance and generalization ability of our proposed FLCB framework by comparison with two classical continual learning approaches (Kirkpatrick et al., 2017; Li ZZ and Hoiem, 2018) (Table 3). Then, we demonstrate the difference between FLCB and three other learning strategies, especially for analyzing their respective forgetting degrees among the trained datasets (SHA, SHB, QNRF, and NWPU), and their generalization abilities on the unseen dataset (JHU-Crowd++). The synthetic-to-real experiments are also conducted considering the data privacy issues and some ethical policies.
Table 2 The number of images used to train models on different datasets
As shown in Table 3, we reproduce two of the classical lifelong learning methods and modify them to adapt to our crowd counting task, because most lifelong learning methods focus on the classification task, while crowd counting is a regression-like task. The average performances in past domains and unseen domains of our proposed FLCB method all surpass those of the LwF and EwC approaches. We compare the quantitative results between the baselines and our proposed method based on four benchmark models. The results in Table 4 demonstrate that our method can remarkably alleviate the catastrophic forgetting phenomenon on all models with the lowest mMAE, mRMSE, and nBwT (i.e., forgetting degree) under the domain-incremental training settings. We also report the model parameters and the multiply-accumulate operations (MACs) for each benchmark model. The forgetting degree in the intermediate process is detailed in Table 5. The results imply that the model will forget less and count better when more labeled datasets are involved in the lifelong learning process. This indicates that our framework can remember the old yet meaningful knowledge from the last well-trained model when handling the new domain dataset.
The proposed balanced domain forgetting loss (BDFLoss) is composed of the optimal transport counting loss and the self-distillation loss. The hyper-parameter $\lambda$ plays a dominant role in our proposed BDFLoss to control how much previously learned meaningful knowledge should be retained when learning on new domain data. In other words, $\lambda$ is a trade-off between model plasticity and stability. The greater the value of $\lambda$ is, the more attention is paid to leveraging the distilled knowledge. If $\lambda$ is equal to 0, the loss degenerates to vanilla sequential training without any constraint from previous knowledge. We empirically choose $\lambda = 0.5$ to conduct our main experiments in this study. In this subsection, we also investigate whether different $\lambda$ values have a visible effect on forgetting. The extensive results demonstrate that $\lambda = 0.5$ is a reasonable choice (Table 6).
Table 3 The results with different domain-incremental lifelong learning methods
Table 4 Quantitative results with different paradigms to compare the forgetting degree and overall performance
5.3.1 Real-to-real generalization
To build a robust model for better crowd counting, we expect the model to obtain acceptable performance in unseen domains, because labeling crowd images is extremely expensive and time-consuming in the real world. After the final models converge, we test them directly on the unseen JHU-Crowd++ dataset (Table 7). Note that the images from JHU-Crowd++ are never used for training during the lifelong learning process. Our proposed FLCB achieves lower prediction errors in terms of MAE and RMSE on the unseen dataset, indicating a stronger generalization ability compared with the joint training strategy. Furthermore, taking DKPNet as an example, we conduct an ablation study of distillation at different levels in the intermediate lifelong learning process. Every time training on a new incoming dataset is finished, the model is evaluated on the unseen dataset. The results, shown in Table 8, illustrate that its performance is boosted progressively with incremental data from different domains. They also indicate that the model can count better on the unseen domain when feature-level and output-level distillation are combined and complement each other. Training in different orders may lead to fluctuating performance in unseen domains. We present these results in the supplementary materials because they could be related to curriculum learning, which is not our main focus in this study.
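The evaluation protocol described above (train on one incoming domain at a time, then test on a held-out unseen dataset after each stage) can be sketched as a short loop. This is a schematic under stated assumptions: `train_fn` and `eval_fn` are hypothetical callables supplied by the user, the previous model is kept as a frozen copy to serve as the distillation teacher, and nothing here reflects the actual training code.

```python
import copy


def lifelong_evaluate(domains, unseen_data, train_fn, eval_fn, model=None):
    """Train sequentially on each domain and evaluate on unseen data.

    domains     : list of (name, data) pairs, in training order
    unseen_data : held-out dataset that is never passed to train_fn
    train_fn    : hypothetical callable (model, teacher, data) -> model
    eval_fn     : hypothetical callable (model, data) -> error value
    Returns a list of (name, error_on_unseen) pairs, one per stage.
    """
    history = []
    teacher = None  # no previous model exists before the first domain
    for name, data in domains:
        model = train_fn(model, teacher, data)
        # Freeze a copy of the freshly trained model as the next teacher.
        teacher = copy.deepcopy(model)
        history.append((name, eval_fn(model, unseen_data)))
    return history


if __name__ == "__main__":
    # Toy stand-ins: the "model" is just the number of samples seen so far.
    def toy_train(model, teacher, data):
        return (model or 0) + len(data)

    def toy_eval(model, data):
        return 100.0 / model

    stages = [("domain_A", [1, 2]), ("domain_B", [1, 2, 3])]
    print(lifelong_evaluate(stages, unseen_data=[], train_fn=toy_train, eval_fn=toy_eval))
```

Recording one score per stage, rather than only the final one, is what allows the progressive trend across incremental domains to be inspected.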
Table 5 Forgetting performance in the intermediate process of lifelong crowd counting among four models with FLCB
Table 6 Forgetting degree comparison results with different values of the hyperparameter λ
5.3.2 Synthetic-to-real generalization
Considering data privacy and some ethical policies (i.e., the real-world training images may be unobtainable), we conduct the training with the same lifelong settings on the synthetic crowd dataset (GCC) (Wang Q et al., 2019) and investigate the generalization on the unseen real-world dataset (ShanghaiTech Part B). The GCC dataset is collected from the GTA5 game environment and contains 15 212 synthetic images with diverse scenes. A synthetic dataset provides precise annotations for training without time-consuming manual labeling. We split the GCC synthetic dataset into four subsets to mimic the same lifelong training settings. We again analyze the forgetting phenomenon among the incremental synthetic subsets (Table 9), as well as the generalization performance on the unseen dataset. With the final model, our FLCB benchmark achieves the lowest mMAE, mRMSE, and nBwT among previously seen datasets and decent performance on the unseen real-world dataset. Furthermore, the generalization results (Table 10) verify the superiority of our proposed benchmark.
Table 7 Generalization comparison of different training strategies on the unseen JHU-Crowd++ dataset
Table 8 Generalization comparison on the unseen JHU-Crowd++ dataset with self-distillation at different levels during the entire lifelong learning process
Table 9 Experimental results of DKPNet with the synthetic-to-real training settings
Table 10 The test MAE and RMSE scores on the unseen ShanghaiTech Part B dataset after training on the synthetic GCC subsets
In summary, our proposed lifelong crowd counting benchmark FLCB helps crowd counters forget less and count better, sustainably handling multiple-domain crowd counting with a single model, which indicates that it has the potential to tackle more complicated scenes in the future.
For a qualitative comparison, we visualize the predicted density maps under different training strategies. As illustrated in Fig. 3, the sequential training methods perform poorly in old domains after training on images from a new domain. Our proposed lifelong crowd counting benchmark estimates crowd density on both seen and unseen datasets more accurately and outperforms the other training paradigms.
5.5.1 Limitations
In this paper, we attempt to develop a single model to handle incremental datasets from different domains for better lifelong crowd counting. Judging from both quantitative and qualitative results, our proposed FLCB achieves a better performance trade-off across all domain datasets than other methods. However, there are still some limitations that may drive future research directions in lifelong crowd counting. On the one hand, according to the visualization results, our proposed FLCB method seems to have difficulty in dealing with missing annotations (yellow bounding boxes) and background noise (green bounding boxes), such as the loudspeaker box in Fig. 3. On the other hand, we do not integrate any replay-based strategies into our experiments, considering the training time and storage overhead. Efficient data sampling strategies and replay-based approaches may boost lifelong crowd counting, which deserves to be investigated in the future.
Fig. 3 The visualization results of different training paradigms. The top row shows the predictions and compares the forgetting degree on the first training dataset (SHA), while the bottom row illustrates the predictions and compares the generalization ability on the unseen dataset (JHU) (red: FLCB can correctly discriminate non-human objects such as traffic lights; green: FLCB may be affected by background noise such as loudspeakers; yellow: FLCB may not handle missing annotations well, which is not the key research point in our work). References to color refer to the online version of this figure.
5.5.2 Lifelong learning vs.self-supervised learning
We would like to discuss lifelong learning and self-supervised learning from a pretraining perspective. They have in common that both are expected to lay the foundation for artificial general intelligence. Recent literature (Caron et al., 2020; Chen T et al., 2020; Grill et al., 2020; He KM et al., 2020; Niu et al., 2020, 2022; Huang et al., 2022; Niu and Wang, 2022a, 2022b) shows the power of self-supervised learning as a novel pretraining paradigm to empower multiple downstream tasks. To an extent, lifelong learning could be regarded as a kind of pretraining method, because it learns shared knowledge and general representations to boost performance. However, lifelong learning usually requires labeled data for training to enhance model capacity, whereas self-supervised learning does not. From our perspective, both types of learning could provide a good pretrained network or initialization for training on other domain datasets or downstream tasks, and lifelong learning may empower self-supervised learning in the future.
We propose a domain-incremental self-distillation learning benchmark for lifelong crowd counting to deal with the catastrophic forgetting and model generalization issues using a single model when training on new datasets from different domains one after another. With the help of the BDFLoss function that we have designed, the model can forget less and count better during the entire lifelong crowd counting process. Additionally, our proposed metric nBwT can be used to measure the forgetting degree in future lifelong crowd counting models. Extensive experiments demonstrate that our proposed benchmark has a lower forgetting degree than the sequential training baseline and a stronger generalization ability compared with the joint training strategy. Our proposed method is a simple yet effective way to sustainably handle the crowd counting problem among multiple domains using a single model with limited storage overhead when newly available domain data arrive. It can be incorporated into any existing backbone as a plug-and-play training strategy for better crowd counting in the real world. Although our work considers crowd counting, the proposed framework has the potential to be applied to other regression-related image or video tasks.
Contributors
Jiaqi GAO designed the research and drafted the paper. Jingqi LI contributed ideas for experiments and analysis. Jingqi LI, Hongming SHAN, Yanyun QU, James Z. WANG, Fei-Yue WANG, and Junping ZHANG helped organize and revise the paper. Jiaqi GAO, Hongming SHAN, and Junping ZHANG finalized the paper.
Compliance with ethics guidelines
Jiaqi GAO, Jingqi LI, Hongming SHAN, Yanyun QU, James Z. WANG, Fei-Yue WANG, and Junping ZHANG declare that they have no conflict of interest.
Data availability
The data that support the findings of this study are available from the corresponding authors upon reasonable request.
List of supplementary materials
1 Domain concept and gaps of different datasets
2 Effect of different training orders
Fig. S1 Data distributions of four benchmark datasets
Table S1 Forgetting degree comparison results with different training orders
Table S2 Generalization comparison results with different training orders on the unseen JHU-Crowd++ dataset
Frontiers of Information Technology & Electronic Engineering, 2023, Issue 2