

        A Real-Time Multi-Vehicle Tracking Framework in Intelligent Vehicular Networks

        2021-07-26
        China Communications, 2021, Issue 6

        Huiyuan Fu,Jun Guan,Feng Jing,Chuanming Wang,Huadong Ma

        1 Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia,Beijing University of Posts and Telecommunications,Beijing 100876,China

        2 Horizon Robotics,Beijing 100080,China

        Abstract: In this paper, we provide a new approach for intelligent traffic transportation in intelligent vehicular networks, which aims at collecting vehicles' locations, trajectories and other key driving parameters to meet the requirements of time-critical autonomous driving. The key of our method is a multi-vehicle tracking framework for the traffic monitoring scenario. Our proposed framework is composed of three modules: multi-vehicle detection, multi-vehicle association and miss-detected vehicle tracking. For the first module, we integrate a self-attention mechanism into a detector based on key point estimation for a better detection effect. For the second module, we apply multi-dimensional information to promote robustness, including vehicle re-identification (Re-ID) features, historical trajectory information, and spatial position information. For the third module, we re-track the vehicles missed by the first detection module due to occlusions. Besides, we utilize asymmetric convolution and depth-wise separable convolution to reduce the model's parameters for speed-up. Extensive experimental results show the effectiveness of our proposed multi-vehicle tracking framework.

        Keywords:multiple object tracking;vehicle detection;vehicle re-identification;single object tracking;machine learning

        I.INTRODUCTION

        Nowadays, vehicular networks still face many challenges in supporting emerging time-critical applications, which need fast responses and accurate vehicle locations, trajectories and other driving conditions to provide emergency communications [1–5].

        We consider that a framework built on the traffic monitoring scenario can be applied more effectively to vehicular networks.

        The real-time multi-vehicle tracking framework can detect and track vehicles through traffic monitoring devices, obtaining the positions and trajectories of surrounding vehicles in real time, thereby realizing advance warning of dangerous states, reducing traffic accidents, and improving traffic efficiency. In our proposed framework, we first collect the video signal of the road, then detect and track the multiple vehicles using deep learning technology to obtain the locations of the vehicles. The key of the framework is to design a powerful approach for accurate multi-vehicle tracking.

        Currently, the related work on multi-object tracking can be summarized as follows: object detection, and single or multiple object tracking.

        Previously, object detection methods were usually based on background-differencing methods [6–8] to extract the vehicles. However, these are very time-consuming and cannot be applied well in practical scenes. Recent deep-neural-network-based methods have demonstrated their ability on the object detection task [9–19]. The recent CenterNet [20] directly estimates object key points by generating a heat map of the key points, which greatly improves the test speed and accuracy.
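        As a sketch of how key-point-based detection reads out object centers, the toy NumPy decoder below keeps heat-map cells that are 3×3 local maxima above a threshold; it illustrates the idea only and is not CenterNet's actual implementation:

```python
import numpy as np

def decode_heatmap(heatmap, threshold=0.5):
    """Pick keypoint peaks: cells that are 3x3 local maxima above threshold."""
    h, w = heatmap.shape
    # Pad with -inf so border cells compare only against real neighbors.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    peaks = []
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 3, x:x + 3]
            if heatmap[y, x] >= threshold and heatmap[y, x] == window.max():
                peaks.append((y, x, float(heatmap[y, x])))
    return peaks

# A toy 5x5 heat map with one strong vehicle-center response at (2, 3).
hm = np.zeros((5, 5))
hm[2, 3] = 0.9
hm[2, 2] = 0.4  # below threshold and not a local maximum
print(decode_heatmap(hm))  # [(2, 3, 0.9)]
```

        In the real detector the peak locations are refined by the predicted offsets and expanded into boxes by the predicted sizes.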

        Single object tracking can be divided into traditional algorithms based on handcrafted features [21, 22], correlation filtering [23–27] and deep learning algorithms [28, 29]. Among them, the state-of-the-art work is SiameseRPN [28], which adds a region proposal network to SiameseFC [29] to achieve multi-scale adaptation to the object for tracking accuracy.

        In multi-object tracking, the methods aim to estimate trajectories of multiple objects by finding target locations and maintaining target identities across frames. They mainly adopt the tracking-by-detection strategy and handle the task by linking detections across frames using data association algorithms. [30–32] are real-time in the data association module, but do not consider the time consumption of the detector and feature extraction, so the entire tracking algorithm is not real-time. Meanwhile, these approaches heavily rely on the quality of the detection results: if a detection is missing or inaccurate, the target object is prone to be lost. [33–35] solve the problems of occlusion and similar appearance in multi-object tracking, but they do not meet real-time conditions. Therefore, these methods are not suitable for time-critical applications.

        To promote detection accuracy, we propose to integrate a self-attention mechanism module into a detector based on key point estimation for a better detection effect. Meanwhile, we propose to use an optimized non-local neural network to improve the detection metric and reduce the number of parameters. To enable robust multi-vehicle association, we propose to combine vehicle re-identification information, trajectory information and spatial position information in an online Hungarian-algorithm-based tracking step. What's more, other approaches heavily rely on the quality of detection results. To ensure that miss-detected vehicles caused by occlusions are correctly tracked, we track these vehicles independently to recover their trajectories and add them to the original tracking list. Meanwhile, we adopt asymmetric convolution to promote speed in the feature extraction part, and utilize depth-wise separable convolution to reduce the module's parameters in the correlation part to speed the whole approach up. Therefore, the entire tracking algorithm is a real-time multi-vehicle tracking framework. Extensive experiments are conducted, and the experimental results show that our proposed framework is effective compared with the state-of-the-art approaches.

        The remainder of this paper is organized as follows. Section II introduces the detailed framework of our method. Then, Section III is devoted to the discussion of experimental results on the multi-vehicle tracking framework. The main conclusions are summarized in Section IV.

        II.OUR METHOD

        In this section, we introduce our proposed method in detail. It is divided into three modules: multi-vehicle detection, multi-vehicle association, and miss-detected vehicle tracking. The proposed multi-vehicle tracking framework is shown in Figure 1. In the vehicle detection module, the framework first obtains continuous video frames from traffic monitoring devices in the intelligent vehicular networks. Then, each frame is passed through the attention-mechanism-based network to detect multiple vehicles. After post-processing, the detected regions of vehicles are sent to the vehicle association module. In the vehicle association module, vehicles in consecutive frames are matched through the data association step. We fuse three different dimensions of information to make the vehicle association more reliable, producing the vehicle trajectory by connecting and drawing the locations of the same vehicle. In the miss-detected vehicle tracking module, we adopt a single-vehicle tracking strategy for miss-detections. Trajectory information is saved if the vehicle matches successfully in N consecutive frames; otherwise, the trajectory is deleted. To reduce model parameters, we adopt an asymmetric convolution layer and a separable convolution layer in this module, and its speed can reach real time.

        Figure 1.The proposed multi-vehicle tracking framework.

        2.1 Multi-vehicle Detection Module

        The vehicle detection module aims to detect vehicles in each image accurately. Since there are difficulties such as deformation and occlusion in vehicle detection, we propose a vehicle detection method based on key points, which can clearly reflect the positional relationships of vehicles on the feature map.

        Meanwhile, we propose an attention mechanism to model the relationship between the regions of each vehicle in the image, improving the accuracy of object detection while preserving real-time performance. Traditional convolution and image filtering are local image operations, while the standard non-local filtering operation must be correlated with all the pixels in the image, which costs a large amount of time. Hence, denoting a small region of the feature map as a block, we use a block-by-block scheme to calculate the correlation among vehicles. In this scheme, more similar blocks are assigned larger weights, which reflects the main idea of non-local mean filtering: highlighting common points and weakening differences. Its mathematical expression is shown as follows:

        y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j),  (1)

        where x is the input feature map and i, j are the indices of the blocks in the feature map. The response value is calculated by enumerating all possible locations j. The function f(·, ·) calculates the similarity between i and j, the function g(·) computes the representation of the input feature of the j-th block, and the final response value is normalized by C(x).
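        A minimal NumPy sketch of this block-wise non-local response, using the embedded-Gaussian similarity for f(·, ·) and the identity for g(·) (our simplifying assumptions, not the paper's exact choices):

```python
import numpy as np

def non_local_response(x):
    """x: (N, D) block features. Returns the normalized weighted responses:
    each block's output is a similarity-weighted sum over all blocks."""
    f = np.exp(x @ x.T)               # pairwise similarity f(x_i, x_j)
    C = f.sum(axis=1, keepdims=True)  # normalization term C(x)
    return (f / C) @ x                # weighted sum over all blocks j

blocks = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
y = non_local_response(blocks)
# Similar blocks (rows 0 and 1) pull each other's responses together,
# which is the "highlight common points, weaken differences" behavior.
```

        The detector implements the same idea with stacked 1×1 convolutions on the feature map rather than explicit loops.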

        We propose a non-local neural network, which stacks convolution operations, to achieve the process of non-local means in image processing [36]. Through convolution and pooling operations, the consecutive frames obtained from traffic monitoring devices are downsampled, and the feature map is reduced to 16×16 to obtain higher-dimensional information. Each pixel in the feature map can be seen as a small region of the input image, thus we use the feature at each position of the feature map as the block feature. [37] optimizes the non-local neural network by calculating only the similarity of each point on the feature map in the horizontal direction and in the vertical direction. In this way, the amount of calculation for an m×n feature map is reduced from m×n×m×n to 2×m×n×(m+n). We use the optimized non-local neural network to incorporate the self-attention mechanism into the model, realizing the aggregation of information between blocks by generating an attention map with the feature map size.
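        The claimed reduction can be checked by counting similarity pairs directly:

```python
# Pairwise-similarity counts on an m x n feature map:
# full non-local attention relates every position to every position,
# while the row/column-restricted variant relates each position only to
# the positions in its own row and column.
def full_nonlocal_pairs(m, n):
    return m * n * m * n

def crisscross_pairs(m, n):
    return 2 * m * n * (m + n)

m = n = 16  # the 16x16 downsampled map used in the detector
print(full_nonlocal_pairs(m, n), crisscross_pairs(m, n))  # 65536 16384
```

        At 16×16 this is a 4× reduction, and the gap widens quadratically for larger feature maps.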

        The process of the optimized non-local neural network is shown in Figure 2. The original feature map with dimension C×H×W is sent to three 1×1 convolutions to generate three individual feature maps, K, Q, and V. The feature dimension of K and Q is D×H×W. To reduce the amount of calculation, we make D smaller than C. Feature maps Q and K then perform an affinity operation to obtain an attention map of size (W+H−1)×H×W. Each pixel in the attention map represents its relationship with the pixels in the same column and the same row. The softmax function then performs normalization to obtain the final (W+H−1)×H×W attention map. Consequently, the attention map aggregated with the feature map V is added to the original feature map to obtain the final weighted feature map. After the attention module, a deformable convolution is used for feature extraction, and a transposed convolution is used for upsampling to restore the feature map to 1/4 the size of the original image. It should be noted that the vehicle-center prediction for each pixel on the feature map can only be achieved when the image is restored to a size close to the original image; otherwise, a pixel in the feature map may cover multiple vehicles in the original image, affecting the detection effect. Finally, the outputs of three branches respectively predict the key points of the vehicles, the regression of the object bounding box, and the offset of the vehicle.

        Figure 2.Self-attention mechanism structure.

        2.2 Multi-vehicle Association Module

        The inter-frame object connection is an essential key for multi-vehicle tracking. Once an object-matching error occurs, the tracking trajectory will be incorrectly updated, which has a large impact on future frame associations. In this module, we combine three kinds of information to construct the similarity matrix: vehicle re-identification (Re-ID) features, historical trajectory information, and spatial position information. Then, the Hungarian algorithm, an online data association method, is used to solve the loss matrix converted from the similarity matrix, so as to obtain the optimal matching of vehicle objects between frames.
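        As a sketch of this matching step, the Hungarian assignment on a loss matrix can be run with SciPy's `linear_sum_assignment`; the 3×3 loss matrix below is hypothetical, not from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical loss matrix: rows = existing tracks, cols = new detections;
# entry (i, j) is the matching cost between track i and detection j.
loss = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.3]])

rows, cols = linear_sum_assignment(loss)  # minimizes the total cost
matches = list(zip(rows.tolist(), cols.tolist()))
print(matches)  # [(0, 0), (1, 1), (2, 2)]
```

        In practice, pairs whose cost exceeds a threshold are rejected after assignment, so unmatched detections can start new tracks.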

        In recent years, re-identification technology has developed rapidly; with the proposal of the VeRi and VehicleID datasets, research on vehicle re-identification has gradually been carried out [38, 39]. The works of [40–42] use different feature reorganization and network structures to realize vehicle re-identification, which are complex and do not meet the speed requirement. We use an improved ResNet-18 network to directly extract the vehicle re-identification features. The network configuration is shown in Table 1.

        Table 1.Vehicle reid feature extraction network configuration.

        Through this network, extracted features such as vehicle appearance and vehicle type are used to measure the similarity of features in adjacent frames. The similarity of Re-ID features can be expressed as follows:

        S^(1)(i, j) = cosine(φ(d_i), φ(d_j)),  (2)

        where d_i and d_j are the detected vehicle regions in two adjacent frames.

        Among them, φ is the vehicle re-identification feature extraction network, and cosine(·, ·) calculates the cosine distance of two features. In the traffic scenario, the IoU (Intersection over Union) between the regions of vehicles in adjacent frames can reflect their similarity on the spatial scale. The same vehicle in two adjacent frames usually has a large overlap. Therefore, the spatial matching of two vehicles can be achieved by the overlapping rate, which can be expressed as follows:

        S^(2)(i, j) = IoU(r_i, r_j),  (3)

        IoU(A, B) = area(A ∩ B) / area(A ∪ B),  (4)

        where r_i and r_j are the bounding-box regions of the two vehicles.
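        A minimal sketch of these two similarity terms in NumPy (box coordinates and feature vectors below are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two Re-ID feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes in adjacent frames."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two half-overlapping 10x10 boxes share a third of their union.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333...
```

        The real pipeline computes these for every track/detection pair to fill the similarity matrix.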

        In equation (3), S^(2)(i, j) represents the spatial similarity of two rectangular regions in two adjacent frames, and IoU is calculated as in equation (4).

        However, a fast speed gives the vehicle a low IoU rate, and experiments show that mismatches occur with low IoU values. Therefore, we use the angle between the trajectory and the detection object to measure the motion similarity. It can be expressed as follows:

        For Figure 3, we use equation (6) to calculate the trajectory similarity of A1, A2, and B2:

        Figure 3.The instance of trajectory representation.

        And we use equation (7) to calculate the trajectory similarity of B1, A2 and B2.

        Obviously, the similarity of B1→B2 is higher than that of B1→A2, yielding the matching B1B2 and A1A2. But in a local comparison, the similarity of A1→B2 is higher than that of A1→A2, which would not be the globally optimal solution. After matching through the Hungarian algorithm, the globally optimal solution is still B1B2 and A1A2.

        However, we find that the center pixel coordinates of the same vehicle can change greatly between frames. In this case, the cosine similarity measure is not credible, and the corresponding score should be low. Since traffic monitoring devices usually do not change direction, we can use the vehicle speed to measure how well the motion trajectory matches. The speed is calculated as follows:

        Equation (8) represents the distance in pixels moved per unit time. For video sequences, v_i^(t−1) represents the speed of the i-th object in frame t−1, and (x_i^(t−1), y_i^(t−1)) represents the pixel coordinates of the i-th object in frame t−1. The similarity value obtained from trajectory information is more reliable than the IoU-based spatial similarity when V is relatively large, while the similarity obtained from spatial information is more reliable when V is relatively small. Thus we sum the spatial similarity and the motion similarity as S23 to consider both cases, which can be expressed as follows:

        Where λ is a hyper-parameter, which represents the maximum pixel distance the object moves between two adjacent frames. V is derived from equation (8).
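        Since the printed form of equation (9) is not reproduced here, the sketch below shows one plausible speed-dependent fusion of the spatial and motion similarities; the weighting `min(V/λ, 1)` is our assumption, not the paper's exact formula:

```python
def fused_similarity(s_spatial, s_motion, v, lam):
    """Hypothetical reading of the fusion step: weight motion similarity
    more when the speed v is large relative to lam, and spatial (IoU)
    similarity more when v is small."""
    w = min(v / lam, 1.0)  # assumed weighting, not the paper's exact form
    return w * s_motion + (1.0 - w) * s_spatial

# Slow vehicle: rely mostly on IoU overlap.
print(fused_similarity(s_spatial=0.8, s_motion=0.2, v=5, lam=50))   # 0.74
# Fast vehicle: rely mostly on the trajectory direction.
print(fused_similarity(s_spatial=0.1, s_motion=0.9, v=60, lam=50))  # 0.9
```

        Either way, the qualitative behavior matches the text: IoU dominates at low speed, trajectory information dominates at high speed.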

        Based on the above analysis, we propose a multi-dimensional information fusion method. Vehicle association is carried out by comprehensively considering the vehicle's appearance feature information, spatial location information, and movement trajectory information. The conversion of the similarity matrix to the loss matrix is shown in equations (10) and (11):

        The best matching result can be found from the loss matrix through the Hungarian algorithm. The procedure is described in Algorithm 1.

        2.3 Miss-detected Vehicle Tracking

        Dense vehicles lead to miss-detection, which is a serious problem for multi-vehicle tracking. In this module, single object tracking is used to predict vehicles to overcome the challenge of miss-detection. If the single object tracking module is to be applied in a real-time multi-vehicle tracking method, it needs a faster speed. The Siamese-RPN network can achieve multi-scale adaptation to the object for tracking accuracy, which greatly alleviates the tracking-box drift caused by occlusion. Therefore, we build a lightweight object tracking network based on the Siamese-RPN network. Compared with Siamese-RPN, it has a faster tracking speed.

        Algorithm 1. Vehicle association.
        Require: track_i, det_j (1 ≤ i ≤ m, 1 ≤ j ≤ n)
        Ensure: List[(int, int)], the set of matched pairs
        1: for i = 1, 2, ..., m do
        2:   for j = 1, 2, ..., n do
        3:     calculate S1, S2, S3 by Eq. (2), (3), (5)
        4:   end for
        5: end for
        6: filter S1 by S1 < δ
        7: calculate S23 by Eq. (9) and filter S23 by S23

        To reduce the amount of calculation, we adopt a universal asymmetric convolution layer to replace ordinary convolution. Meanwhile, we also adopt a depth-wise separable convolution layer, which is the core of the MobileNet network, to calculate the correlation between the template and the search area. We present the single-vehicle tracking model in Figure 4 and the detailed network structure of the proposed framework in Table 2.
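        The parameter savings from these two substitutions can be checked with simple counts (bias terms omitted; a 3×3 convolution with 64 input and output channels is an illustrative configuration, not the paper's exact layer):

```python
def standard_conv_params(c_in, c_out, k):
    """Ordinary k x k convolution."""
    return c_in * c_out * k * k

def asymmetric_conv_params(c_in, c_out, k):
    """k x k conv replaced by a k x 1 conv followed by a 1 x k conv."""
    return c_in * c_out * k + c_out * c_out * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k per channel, then 1 x 1 pointwise across channels."""
    return c_in * k * k + c_in * c_out

c = 64
print(standard_conv_params(c, c, 3))        # 36864
print(asymmetric_conv_params(c, c, 3))      # 24576
print(depthwise_separable_params(c, c, 3))  # 4672
```

        The depth-wise separable form cuts the correlation layer's parameters by roughly 8× at this width, which is what makes the single-vehicle tracker light enough for real time.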

        Table 2. Network configuration of the lightweight single-vehicle tracking model.

        Figure 4.Lightweight single object tracking model structure diagram.

        III.EXPERIMENTS AND ANALYSIS

        In this section,a series of experiments are conducted on the vehicle detection module and vehicle tracking module of our method.Then the multi-vehicle tracking method is evaluated to verify its practicability in intelligent vehicular networks.

        3.1 Vehicle Detection Module Experiment

        To evaluate the performance of our proposed vehicle detection module, we use videos collected by ourselves in the traffic monitoring scene and label the vehicles in each video clip. Meanwhile, we also adopt the LSVH dataset [43] to enrich our own dataset. LSVH is a public highway-surveillance dataset. In this paper, 12,000 pictures from LSVH are combined with the self-labeled dataset for a total of 19,854 pictures. The training set and test set are divided in an 8:2 ratio. Simultaneously, the test set is divided into simple, medium and difficult subsets according to the density of vehicles on the road.

        The evaluation metrics are the mAP (mean Average Precision) used in the Pascal VOC competition to measure accuracy [44] and FPS to measure speed. To ensure fairness, we re-implement existing methods in the same PyTorch framework and evaluate them on an NVIDIA M40 GPU. Table 3 shows the performance.

        MobileNet-RPN+ is an improvement based on Faster R-CNN. It replaces the VGG16 backbone network in Faster R-CNN with the MobileNet network as a resource/accuracy tradeoff and shows stronger speed performance. Our module is improved on the basis of CenterNet. We can see that our detection module reaches higher accuracy on the three different test levels with the help of the self-attention mechanism. However, the standard non-local neural network has a huge number of parameters, which does not meet real-time conditions. Thus, with the help of the optimized non-local neural network, our detection module reaches 59 fps, a good speed/accuracy tradeoff. We also compare these methods on the UA-DETRAC dataset. It contains 60 training video sequences and 40 test video sequences, covering test scenarios under different weather conditions. We still use AP (%) and FPS as the evaluation metrics, and the results are shown in Table 4. It can be seen from the experimental results that our module is still very advantageous in terms of speed, and it also surpasses the Faster R-CNN algorithm in terms of accuracy. Compared to R-FCN, our method is ten times faster with a small loss of accuracy. In the traffic monitoring scenario, real-time performance is more significant, so we sacrifice some accuracy to ensure speed. Finally, we try to explain the improved method by visualizing the self-attention mechanism in Figure 5. Obviously, after adding the self-attention mechanism, the noise points are reduced and the vehicle positions are more prominent. Therefore, the proposed method is effective.

        Figure 5. Visualization of the feature map without the self-attention mechanism (left), with the attention mechanism (middle), and the output result (right).

        3.2 Miss-detected Vehicle Tracking Module Experiment

        For miss-detected vehicle tracking in traffic monitoring scenarios, we crop the ILSVRC VID [45] and YouTube-BB [46] video datasets to generate the training set. We select pairs of frames within 100 frames of each other to generate template-frame and detection-frame pairs. The training details are shown in Table 5.

        Table 3.Comparison of vehicle detection algorithms in our dataset.

        Table 4.Comparison of vehicle detection algorithms in the UA-DETRAC dataset.

        Table 5.Training model parameters in single vehicle tracking.

        We use the OTB100 dataset [47] as the test dataset with the OPE (one-pass evaluation) protocol, which uses the first frame as the ground truth; the tracking algorithm then predicts the position of the target in the subsequent frames. By comparing with the ground truth, the average success rate and precision are obtained. It can be seen from Table 6 that the model proposed in this paper, compared with existing models, improves the tracking speed while ensuring accuracy.

        3.3 Comprehensive Test of Multi-vehicle Tracking Method

        We collect 7 traffic surveillance videos as our dataset to test our method. First, we show the details of the dataset in Table 7. The maximum number of vehicles in a single frame reflects the complexity of the scene. From Table 7, it can be seen that our method outputs the tracking result in real time when the scene is relatively simple; when the scene is more complex, the test speed decreases. Next, we test the accuracy of the method and compare it with MDPTracking [48] and DeepSORT. The experiment uses the MOTA metric in CLEAR MOT [49], and its formulation is:

        MOTA = 1 − Σ_t (m_t + fp_t + mme_t) / Σ_t g_t,  (12)

        where g_t is the number of ground-truth objects at frame t.

        Among them, m_t, fp_t and mme_t represent the number of missed detections, false detections, and matching errors at frame t, respectively. Subtracting the rate of all erroneous tracking from 1 in equation (12) gives the accuracy of tracking. Table 8 shows the test results of the corresponding scenarios. Due to the detector's improvement by the self-attention mechanism, the comprehensive consideration of spatial location, historical trajectory, and vehicle feature information for vehicle association, and the miss-detected vehicle tracking by single object tracking, the method in this paper has a higher accuracy in both dense and sparse vehicle scenes, ensuring stable multi-vehicle tracking. Figure 6 shows that ID jumps occur in the DeepSORT method due to problems such as occlusion, while our method matches correctly, which verifies the robustness of our proposed method.
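        A minimal sketch of the MOTA computation described above (the per-frame counts below are toy values, not the paper's data):

```python
def mota(misses, false_positives, mismatches, ground_truths):
    """MOTA from CLEAR MOT: 1 minus the rate of all tracking errors.
    Each argument is a per-frame list; ground_truths counts the
    ground-truth objects present in each frame."""
    errors = sum(misses) + sum(false_positives) + sum(mismatches)
    return 1.0 - errors / sum(ground_truths)

# Toy 3-frame sequence with 10 ground-truth vehicles per frame:
# 2 misses, 1 false positive and 1 identity mismatch in total.
score = mota(misses=[1, 0, 1], false_positives=[0, 1, 0],
             mismatches=[0, 0, 1], ground_truths=[10, 10, 10])
print(round(score, 4))  # 0.8667
```

        Because all three error types share one denominator, MOTA penalizes a missed vehicle, a spurious box and an ID switch equally.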

        Figure 6.ID jumps will happen in DeepSORT in(a)and(b),while our method matches correctly in(c)and(d).

        Table 6.Comparison of single vehicle tracking algorithms in the OTB100 dataset.

        Table 7.Multi-vehicle tracking framework performance test.

        Table 8.Comparison of the accuracy of multi-vehicle tracking methods.

        IV.CONCLUSION

        In this paper, we propose a real-time multi-vehicle tracking framework for intelligent vehicular networks in the traffic monitoring scenario, which consists of three modules: multi-vehicle detection, multi-vehicle association, and miss-detected vehicle tracking. The key technology is that we re-track the miss-detected vehicles with occlusions for better multi-vehicle tracking. We have evaluated our proposed approach with a large number of experiments, and the experimental results show that it outperforms the state-of-the-art methods.

        ACKNOWLEDGEMENT

        This work was supported in part by the Beijing Natural Science Foundation (L191004), the National Natural Science Foundation of China under No. 61720106007 and No. 61872047, the Beijing Nova Program under No. Z201100006820124, the Funds for Creative Research Groups of China under No. 61921003, and the 111 Project (B18008).
