

        Thermal Infrared Salient Human Detection Model Combined with Thermal Features in Airport Terminal

        2022-09-15


        School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, P.R. China

        Abstract: Target detection against low-light backgrounds is one of the main tasks of night patrol robots in airport terminals. However, algorithms that can run on a robot platform with limited computing resources often struggle to guarantee the accuracy of human body detection in the terminal. A novel thermal infrared salient human detection model combined with thermal features, called TFSHD, is proposed. The TFSHD model is still based on U-Net, but the decoder structure has been redesigned and the model has been made lightweight. In order to improve detection accuracy in complex scenes, a fusion module composed of a thermal branch and a saliency branch is added to the decoder of the TFSHD model. Furthermore, a predictive loss function that is more sensitive to high-temperature regions of the image is designed. Additionally, to reduce the computing resource requirements of the algorithm, a model lightweight scheme is adopted that simplifies the encoder network structure and controls the number of decoder channels. Experimental results on four data sets show that the proposed method not only ensures high detection accuracy and robustness, but also meets the real-time detection needs of patrol robots with a detection speed above 40 f/s.

        Key words: thermal infrared image; human body detection; saliency; thermal features; lightweight model

        0 Introduction

        With the vigorous development of civil aviation, airport terminal safety patrol has gradually become one of the important tasks for ensuring airport security. The existing safety patrol mode of airport terminals is mainly manual patrol, often supplemented by information technology means in the central control room, such as patrol personnel positioning and video monitoring. As a result, airport managers must face problems including increased labor cost and labor intensity. In addition, because patrol inspection in the airport terminal is mainly performed by people, the staff's sense of responsibility has become one of the key factors affecting the inspection effect. If the staff are distracted during patrol inspection, potential safety hazards in the airport terminal can easily arise. In recent years, patrol robots have been widely used in many fields, including power patrol, hazy weather detection[1] and intruder detection[2]. Therefore, using robots to carry out patrol inspection in airport terminals has become an inevitable trend in the development of intelligent security technology.

        In order to meet the needs of night patrol in airport terminals, patrol robots should be able to accurately identify pedestrians in low-light or even no-light environments. Therefore, it is necessary to select a thermal infrared camera as the monitoring camera of the patrol robot. Essentially, the core of robot patrol is human body detection on the images captured by the camera. Traditional human body detection algorithms mostly rely on manual features, such as histograms of oriented gradients (HOG)[3], integral channel features (ICF)[4] and the deformable part model (DPM)[5]. Such methods can achieve certain effects in visible light scenes. However, it is often difficult to achieve good results when they are directly applied to thermal infrared images. The reason is that, compared with visible light images, thermal infrared images have many disadvantages, including lack of texture features, blurred visual effects, low resolution and low signal-to-noise ratio.

        In recent years, some researchers have proposed extending saliency detection methods to human body detection in thermal infrared scenes. According to the characteristics of low contrast and high noise between target and background in infrared images, an associated saliency based visual attention model was proposed[6]. In this method, the associative saliency generated from region saliency and edge contrast is used to improve the accuracy and robustness of infrared target segmentation. Similarly, based on the visual attention mechanism, an infrared image-based saliency extraction algorithm was also proposed[7]. Using pedestrian brightness and appearance characteristics, a pedestrian detection method for infrared images was implemented by saliency propagation between designed domains[8].

        However, only the shallow features of the image are used in the above methods. In order to complete the detection task, these traditional methods need to design manual features for a class of targets in a specific scene and extract valid features from the image using those manual features. Although features can be quickly extracted from images manually, such features fail to cope with the misrecognition caused by factors such as changes in human pose and occlusion in complex scenes. In recent years, deep neural network models have been proposed for target detection. Deep neural networks perform end-to-end learning through multi-layer neural networks and can directly use training samples to deeply mine the potential features of data. For this reason, the feature representation of different human postures in complex scenes can be obtained by self-learning with deep neural network models, which effectively avoids the shortcomings of traditional manually designed features.

        Human body detection methods based on deep neural networks generally fall into two types, namely target detection and image segmentation. For target detection methods, commonly used models mainly include R-CNN[9] and YOLO[10]. This type of method regards target detection as a region detection problem and ensures high accuracy by performing the two tasks of classification and positioning at the same time. When the YOLOv3 model[11] is directly applied to the thermal infrared scene, the detection and positioning of the human body in thermal infrared images can be realized. However, there are obvious defects in the target detection model. On one hand, it is prone to missed detections; on the other hand, it requires substantial computing resources. For the patrol task of the terminal building, human detection can only be achieved with the limited computing resources of the patrol robots, and missed detections are not allowed. It can be seen that the target detection algorithm represented by YOLOv3 is not suitable for the actual needs of the terminal patrol task.

        The image segmentation model is another widely used human body detection method, in which pixels are the detection units. When it is applied to a thermal infrared scene, edge pixels may be incorrectly detected, and the segmented target may be incomplete. Nevertheless, most areas of suspicious targets can still be correctly detected by image segmentation, which does not affect the recognition of suspicious targets. Therefore, it is still possible to effectively avoid missed detections. In addition, the computing resource requirements of image segmentation models are much lower than those of target detection models. Existing research shows that CNNs can be used to design pixel-wise classifiers for thermal infrared images[12]. As a fully convolutional neural network, the U-Net model[13] can be used for rapid and accurate detection of human targets in thermal infrared images. In fact, as an image segmentation method, the saliency detection model based on the U-Net network can not only detect most salient human body objects in different application scenarios, but can also be adapted to patrol robot platforms with limited computing resources. This provides a feasible technical framework for human body detection during airport terminal inspections.

        However, when using image segmentation to detect the human body in thermal infrared images, there are still some objective factors that affect target recognition in the airport terminal. First of all, pedestrians in the terminal have various postures such as standing, walking, sitting and squatting. At the same time, fixed objects in the terminal, including seats, beams and pillars, will partially occlude pedestrian targets. Secondly, when thermal infrared cameras are used to obtain target images, thermal sources such as light sources and display screens in the terminal will also be imaged. Furthermore, when the patrol robot observes the surrounding environment from a horizontal perspective, the human body will appear at different scales in the image due to different distances from the camera.

        Therefore, in order to reduce missed detections and cope with adverse factors such as multi-posture, multi-scale, local occlusion and thermal source interference, we propose a novel thermal infrared image saliency detection algorithm based on the U-Net model in this paper. We still adopt U-Net as the architecture of our method, but its encoder is replaced by a VGG network[14] and a fusion module is added to the decoder. In this way, after the overlay convolution operation, the saliency decoder feature map contains both the thermal features and the saliency features of the detection target. Furthermore, when designing the loss function of the final prediction map, the weight of the pixels in the high-temperature area of the thermal map is increased. As a result, the algorithm becomes more sensitive to high-temperature areas in the image, which reduces the adverse effects of various interference factors on human body detection in the airport terminal. Finally, by simplifying the VGG network structure and controlling the number of decoder channels, the model is made lightweight and can be better adapted to patrol robots with limited computing resources.

        Our contributions can be summarized as follows:

        (1) We adopt a VGG network as the encoder and improve the decoder mechanism of the U-Net network. Consequently, the adaptability of the model to the night scene of the airport terminal is improved by the effective fusion of the thermal features and saliency features of the detection target.

        (2) We design a learning method that uses the saliency map and the thermal map to train the saliency branch and the thermal branch in the decoder, respectively. Furthermore, by redesigning the loss function of the final prediction map, the accuracy of human body target detection is improved.

        (3) By simplifying the VGG network structure and controlling the number of decoder channels, the complexity of our model is reduced, and as a result, the computing resource demand of our algorithm is also reduced.

        1 Related Work

        1.1 Thermal features in thermal infrared images

        Because the temperature of the human body is usually higher than that of surrounding objects[15], thermal features are among the most efficient features for characterizing the human body in thermal infrared images. In addition, thermal features can be easily extracted, so they are widely used in various human body detection algorithms. The main factors affecting the gray value of an object in thermal infrared images are temperature and radiation[16], which have nothing to do with lighting conditions. A larger gray value indicates a relatively higher object temperature. Therefore, extracting thermal features from thermal infrared images based on gray values has become one of the most important methods in thermal infrared detection tasks. However, objects with obvious thermal features in the image are sometimes not only human bodies, but also various devices such as light sources and display screens. These interfering factors bring challenges to human detection tasks.

        In order to cope with the above-mentioned unfavorable factors, combining thermal features with other human body features has become one approach to human body detection. Fernández-Caballero et al.[15] proposed a thermal-infrared pedestrian ROI extraction method by fusing thermal features and dynamic information. Zheng et al.[17] proposed an infrared human body detection method based on saliency propagation, in which both the thermal and appearance features of the human body are used. In addition, strengthening and highlighting the human body area in the infrared image is another effective method to overcome external interference factors. Mi et al.[18] proposed a method to highlight the human body part in thermal infrared images by enhancing the thermal contrast between the human body and the background. From the perspective of highlighting thermal feature distribution and gradient features, Lu et al.[19] proposed a saliency detection method for infrared images based on contrast and distribution.

        Inspired by the above methods, and taking into account the fact that human body temperature is often higher than the ambient temperature, we use the thermal map extracted from the thermal infrared images to improve the robustness of the algorithm against external interference.

        1.2 Saliency detection based on U-Net network

        Saliency detection is the task of segmenting the most visually distinctive object or region from an image. Early saliency detection methods mostly used manual features that rely on image contrast. However, such methods are not applicable when the contrast between the target and the background is low. In recent years, the wide application of deep neural networks has broken the limitation of insufficient accuracy of manual features in low-contrast images. Among them, the U-Net network[13] is one of the most popular methods. It can perform pixel-level segmentation on the input image according to an encoder-decoder architecture. In the U-Net network, features of the same size in the encoder and decoder are merged by superposition. This feature fusion plays an important role in combining contextual information, so that the U-Net network still performs well when facing images with insufficient contrast. As a result, the output of the network also has higher quality than that of earlier models.

        Nevertheless, when the U-Net network is used for complex target detection, the segmentation of edge pixels is still poor. Luo et al.[20] designed a network structure that can integrate local and global features to improve the performance of salient regions by penalizing boundary errors in the loss. Similarly, BASNet is a network model based on U-Net. With the help of a hybrid loss function at the pixel, patch and map levels, a boundary-aware salient object detection method is realized[21]. Considering the relationship between contour and saliency, Zhou et al.[22] extended the decoder into two branches consisting of a saliency branch and a contour branch. In this way, the detection of edge pixels can be improved by learning the association between the saliency map and the contour map. In addition, introducing context feature information into the model is also one of the effective ways to improve the accuracy of the algorithm[23-24]. Such methods usually use attention mechanisms to select and fuse multi-level contextual information. It can be seen that when designing a saliency detection method based on the U-Net network, it is often necessary to improve the original network architecture according to the actual situation of the application scenario.

        Inspired by the above work, we improve the architecture of the U-Net network. The thermal features contained in the thermal map are then used to improve the segmentation performance for the salient human body. Furthermore, by making the network model lightweight, the algorithm can run efficiently on a patrol robot system with limited computing resources.

        2 Salient Human Detection Model Combined with Thermal Features

        2.1 Framework

        Based on the fact that human body temperature is often higher than the temperature of the surrounding environment, a salient human detection model combined with thermal features (TFSHD) is proposed in this paper. The framework of our method is demonstrated in Fig.1, where E_i represents the encoder feature map, A_i the feature map output by the embedded module, D_i the original decoder feature map, and S_i the saliency decoder feature map. The fusion module includes the thermal branch H(i) and the saliency branch S(i). In addition, S_0 indicates the final output image obtained after up-sampling all the saliency decoder feature maps. Similar to the traditional U-Net network, our neural network model is still an encoder-decoder architecture. However, in order to make effective use of the thermal feature information, a VGG network is adopted as the encoder, and the decoder of the original U-Net network is expanded. The expanded decoder consists of three modules, namely the original decoder module, the saliency decoder module and the fusion module. Among them, the fusion module contains two parts, i.e. the thermal branch H(i) and the saliency branch S(i).

        Fig.1 Framework of TFSHD

        2.2 Extraction of thermal maps

        As described in Section 2.1, thermal feature information is beneficial for significantly improving human body detection performance. For this reason, the widely used thermal map is used as the original representation of thermal features in our paper. For thermal infrared cameras, temperature and radiation are the main factors forming thermal infrared images. Correspondingly, the temperature information in the thermal infrared image is mainly reflected in the gray value of the pixel: the larger the gray value, the higher the temperature of the pixel. In the airport terminal at night, the temperature of the human body is relatively high compared with the surrounding environment. It can be inferred that areas with higher gray values in thermal infrared images are likely to be human body target areas. Therefore, as shown in Fig.2, we obtain the thermal map by segmenting the high gray value region from the input image.

        Fig.2 Input images on the left and extracted thermal maps on the right

        However, there are not only pedestrians but also other high-temperature sources in the robot working scene. In addition, the gray value of human targets in thermal infrared images is also affected by factors such as clothing and distance from the camera. Thus, the region with medium or high gray value in the thermal infrared image is selected as the thermal map. Similar to the method in Ref.[15], a segmentation method based on a gray threshold is used to segment objects and regions with thermal features in the image.

        where the gray threshold θ_TA is calculated from the standard deviation σ_I and the average value Ī of the input image I.
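        As a minimal illustration of the gray-threshold extraction described above, the following Python sketch segments medium and high gray value regions from a thermal infrared image. The specific form of the threshold θ_TA used here (mean plus one standard deviation) is an assumption for illustration only; the exact definition is given by Eq.(1).

```python
import numpy as np

def extract_thermal_map(image: np.ndarray) -> np.ndarray:
    """Illustrative gray-threshold segmentation of medium/high gray regions.

    `image` is a single-channel thermal infrared image. The combination of
    mean and standard deviation below is an assumed form of theta_TA; the
    paper's Eq.(1) defines the exact threshold.
    """
    mean_i = image.mean()
    sigma_i = image.std()
    theta_ta = mean_i + sigma_i                      # assumed threshold form
    thermal_map = (image >= theta_ta).astype(np.float32)
    return thermal_map
```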

        2.3 Encoder module

        In the U-Net network, the main function of the encoder module is to extract feature maps of different scales from the image. We choose the VGG neural network as the feature extraction network of the encoder module. When the VGG network is used for feature extraction, the extracted feature maps of different scales form a progressive hierarchy, which provides a variety of deep and shallow information for later feature integration. In addition, the VGG neural network serves as the encoder module of the TFSHD model, whose purpose is to extract feature maps of the input image. Therefore, in order to make the model lightweight, the fully connected layers, which form the last part of the VGG network, are discarded. In fact, only the first five scales of encoder feature maps extracted by the VGG neural network are used in our method. As shown in Fig.1, the encoder feature map extracted by the VGG network is represented by E_i, where i = 1, 2, …, 5.

        It should be noted that although there are more lightweight feature extraction networks such as ShuffleNet[25], this type of network is usually more suitable for small neural networks. If we directly select such a lighter-weight network for feature extraction, the up-sampling in the decoder module will often cause serious distortion because there is not enough context information in the feature maps.
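        The following PyTorch sketch illustrates how such a five-scale VGG encoder might be assembled with the fully connected layers discarded. The choice of VGG16 and the exact layer indices at which E_1 to E_5 are taken are assumptions, since the paper only specifies that the first five scales of VGG feature maps are used.

```python
import torch.nn as nn
from torchvision.models import vgg16

class VGGEncoder(nn.Module):
    """Illustrative five-scale VGG encoder with the classifier (FC) part dropped.

    The VGG16 variant and the stage boundaries below are assumptions; the
    paper only states that the first five encoder feature maps are used.
    """
    def __init__(self):
        super().__init__()
        features = vgg16(weights=None).features       # convolutional backbone only
        # Split the backbone into five stages, one per feature-map scale.
        self.stages = nn.ModuleList([
            features[0:4],     # -> E_1
            features[4:9],     # -> E_2
            features[9:16],    # -> E_3
            features[16:23],   # -> E_4
            features[23:30],   # -> E_5
        ])

    def forward(self, x):
        encoder_maps = []
        for stage in self.stages:
            x = stage(x)
            encoder_maps.append(x)
        return encoder_maps    # [E_1, ..., E_5], progressively smaller scales
```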

        2.4 Lightweight operation

        The computing resources of patrol robots are very limited, but the target detection task has high real-time requirements. For this reason, model lightweighting must be considered when designing the TFSHD model. Model lightweighting is an effective way to reduce the computing resource requirements of the algorithm and improve its operating efficiency. By analyzing the structure of the TFSHD model, we can see that there are two main ways to achieve this. One is the simplification of the VGG encoder network described in Section 2.3, and the other is reducing the number of channels of the feature maps.

        When using a deep network to learn feature maps, it is necessary to change the number of channels. In the traditional U-Net network, the change of the channel number is generally realized by convolution, but the convolution operation is time-consuming and generates more parameters. Similar to the interactive two-stream decoder (ITSD) model[22], the convolution operation in the U-Net network is abandoned and replaced by an embedded module. As shown in Fig.1, A_i is the output obtained when the encoder feature map E_i passes through the embedded module, and the number of channels of A_i is changed relative to E_i. Thus, in the process of obtaining the feature map A_i, the model parameters and the amount of calculation are reduced, which also serves the purpose of model lightweighting. The operation performed by the embedded module is shown as follows

        where A_i represents the feature map obtained after the feature map E_i is processed by the embedded module. Compared with E_i, the number of channels in A_i is changed. As the index of the encoder feature map, i ranges over [1, 5]. The term indicated in Eq.(2) represents a channel of the encoder feature map E_i, whose channel index is determined by j and k. It should be noted that j and k are integers, and n and m represent the number of input and output channels, respectively. Note that n must be divisible by m with no remainder. As shown in Eq.(2), in order to change the number of channels, the embedded module is essentially realized by taking the maximum value over each group of channels, where the number of channels in each group is n/m.
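        A minimal PyTorch sketch of this channel-reduction idea is given below: the n input channels are split into m groups and the group-wise maximum forms each output channel. The contiguous grouping of channels is an assumption about how Eq.(2) assigns channels to groups.

```python
import torch

def embedded_module(e_i: torch.Tensor, m: int) -> torch.Tensor:
    """Illustrative channel reduction by group-wise maximum (no convolution).

    `e_i` has shape (B, n, H, W); the n input channels are split into m
    groups of n/m channels each, and the maximum over each group becomes one
    output channel, as described for Eq.(2). The contiguous grouping is an
    assumption about the channel-to-group assignment.
    """
    b, n, h, w = e_i.shape
    assert n % m == 0, "n must be divisible by m with no remainder"
    grouped = e_i.view(b, m, n // m, h, w)   # (B, m, n/m, H, W)
    a_i = grouped.max(dim=2).values          # group-wise maximum -> (B, m, H, W)
    return a_i
```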

        2.5 Decoder module

        In order to effectively use the thermal feature information of the detection target to improve the recognition of the salient human body, the decoder of the traditional U-Net network is expanded in the TFSHD model. The expanded decoder contains three modules, i.e., the original decoder module, the saliency decoder module and the fusion module. The main purpose of the fusion module is to fuse the thermal features and saliency features extracted from the original decoder feature map, so as to generate a saliency decoder feature map that is more sensitive to high-temperature regions.

        The original decoder module in Fig.1 consists of a series of up-sampling and concat operations, convolutional layers and activation functions. As shown in Eq.(3), the original decoder feature map D_i is obtained by gradually fusing features of various scales, where i = 1, 2, …, 5.

        where T indicates that the learning method used in the training process is supervised learning, and its subscript indicates the corresponding module. The symbol cat indicates the concat operation and the symbol up the up-sampling operation. Specifically, D_i is obtained by superimposing A_i with the up-sampled D_{i+1}, where A_i is the feature map output by the embedded module and D_{i+1} is the original decoder feature map of the layer above D_i. It should be noted that A_i is obtained from the feature map E_i of the same encoder layer, but the number of channels must be changed in the embedded module to obtain A_i. When i is equal to 5, D_{i+1} in Eq.(3) is equal to A_5, and no up-sampling operation is performed. In addition, in order to improve the efficiency of the algorithm, the up-sampling method adopted by the original decoder module is bilinear interpolation.
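        The following sketch illustrates one step of the original decoder as described by Eq.(3): A_i is concatenated with the bilinearly up-sampled D_{i+1} and passed through a convolution and activation. The kernel size, channel counts and activation function are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OriginalDecoderStep(nn.Module):
    """Illustrative step of the original decoder (Eq.(3)).

    D_i is produced by concatenating A_i with the bilinearly up-sampled
    D_{i+1} and applying a convolution and activation; the 3x3 kernel and
    channel counts are assumptions, not values stated in the paper.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, a_i: torch.Tensor, d_next: torch.Tensor) -> torch.Tensor:
        # Up-sample D_{i+1} to the spatial size of A_i (bilinear interpolation).
        d_up = F.interpolate(d_next, size=a_i.shape[-2:], mode="bilinear",
                             align_corners=False)
        d_i = F.relu(self.conv(torch.cat([a_i, d_up], dim=1)))
        return d_i
```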

        In order to improve the robustness of the TFSHD model to suspicious human targets with partial occlusion, multiple scales and multiple postures, a saliency decoder feature map that is more sensitive to the high-temperature regions of the thermal infrared image is trained. Thus, a fusion module is designed in this paper. With the help of this module, the thermal and saliency feature information contained in the original decoder feature map D_i is extracted. Furthermore, the two types of information are fused and applied to the learning of the saliency decoder feature map S_i. However, if the two kinds of feature information are fused directly and then used to learn the saliency decoder feature map, the saliency map finally output by the neural network still cannot meet the needs of practical application.

        Based on the above considerations, a fusion module as shown in Fig.3 is designed in this paper. Firstly, we learn the thermal feature information and saliency feature information contained in the original decoder feature maps D_i of five different scales. Secondly, we use these two types of information to construct the thermal branch H(i) and the saliency branch S(i), respectively. Finally, we use supervised learning to train the parameters of these two branches separately. Specifically, as shown in Eq.(4), the manually labeled salient human body serves as training data, and the saliency branch is trained by supervised learning. Similarly, as shown in Eq.(5), the thermal map obtained from the input image serves as training data, and the thermal branch is also trained by supervised learning. In Eqs.(4,5), the symbol T indicates that the learning method used in the training process is supervised learning, and its subscript represents the corresponding branch.

        Fig.3 Fusion module

        As shown in Eq.(6), after training the thermal branch H(i) and the saliency branch S(i), S_{i+1}, H(i) and S(i) are adopted together in the saliency decoder module to generate the saliency decoder feature map S_i. In Eq.(6), T still indicates the method of supervised learning and its subscript the corresponding module. In addition, "cat" represents the concat operation, "cp" the same method of changing the number of channels as the embedded module, and "up" the up-sampling operation. Eq.(6) is as follows

        where the saliency decoder feature map S_i is obtained by fusing the up-sampled S_{i+1} with the two branches H(i) and S(i), whose channel numbers are changed respectively, and S_{i+1} is the saliency decoder feature map of the layer above S_i. Similar to the learning of the original decoder feature map, when i is equal to 5, S_{i+1} in Eq.(6) is equal to D_5, and no up-sampling operation is performed.
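        A sketch of one saliency decoder step of Eq.(6) is given below, assuming that the "cp" operation is the same group-wise maximum used by the embedded module. The convolution applied after concatenation and the channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_pool(x: torch.Tensor, m: int) -> torch.Tensor:
    """The assumed "cp" operation: group-wise maximum, as in the embedded module."""
    b, n, h, w = x.shape
    return x.view(b, m, n // m, h, w).max(dim=2).values

class SaliencyDecoderStep(nn.Module):
    """Illustrative saliency decoder step (Eq.(6)).

    S_i is learned from the up-sampled S_{i+1} concatenated with the
    channel-pooled thermal branch H(i) and saliency branch S(i). The 3x3
    convolution and channel counts are assumptions.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, s_next, h_branch, s_branch):
        channels = s_next.shape[1]
        s_up = F.interpolate(s_next, size=h_branch.shape[-2:], mode="bilinear",
                             align_corners=False)
        # Pool both branches down to the same channel count as S_{i+1}.
        fused = torch.cat([s_up,
                           channel_pool(h_branch, channels),
                           channel_pool(s_branch, channels)], dim=1)
        return F.relu(self.conv(fused))
```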

        Since the two branches H(i) and S(i) strengthen the thermal information and saliency information, the feature map S_i learned by the saliency decoder module also strengthens the thermal and saliency features in the thermal infrared image accordingly. When these feature maps are restored to the final output image, this contributes to the human body segmentation effect in high-temperature areas.

        2.6 Output module

        As shown in Eq.(7), the final output of the model is represented by S_0. It is obtained by integrating five saliency decoder feature maps S_i (i = 1, 2, …, 5) of different scales. Specifically, the five S_i of different scales are respectively up-sampled to the same scale as the input image, and then they are superimposed and fused to obtain S_0.

        In addition, in order to realize model lightweighting, a mixed up-sampling method, namely bilinear interpolation combined with the nearest neighbor method, is adopted in Eq.(7).
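        The following sketch illustrates how the five saliency decoder maps could be up-sampled and superimposed to form S_0 according to Eq.(7). Which scales use bilinear interpolation and which use the nearest neighbor method is not specified in the text, so the split below is an assumption.

```python
import torch
import torch.nn.functional as F

def output_module(saliency_maps, out_size):
    """Illustrative fusion of the five saliency decoder maps into S_0 (Eq.(7)).

    Each S_i is up-sampled to the input resolution and the results are
    superimposed. The assignment of bilinear vs. nearest-neighbor up-sampling
    to particular scales is an assumption; the paper only states that a
    mixture of the two methods is used for lightweighting.
    """
    upsampled = []
    for idx, s_i in enumerate(saliency_maps):
        # Assumed split: cheaper nearest-neighbor for the coarsest maps.
        mode = "nearest" if idx >= 3 else "bilinear"
        kwargs = {} if mode == "nearest" else {"align_corners": False}
        upsampled.append(F.interpolate(s_i, size=out_size, mode=mode, **kwargs))
    return torch.stack(upsampled, dim=0).sum(dim=0)   # superimpose and fuse
```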

        2.7 Loss function

        In the process of salient human body detection by TFSHD, three prediction outputs are involved, i.e., the thermal branch output, the saliency branch output in the fusion module, and the final output of the TFSHD model. As shown in Eqs.(8-10), each prediction map is calculated from the feature map of the corresponding branch.

        where i = 1, 2, …, 5. Thus, by changing the number of feature map channels, a prediction map with a channel number of 1 is obtained. The first two are the prediction images corresponding to the two branches H(i) and S(i), and the last is the final output prediction map.

        For the loss function of the two branch structures, we choose the binary cross-entropy loss, which is widely used in segmentation tasks[26]. It is used to calculate the loss between the real image and the corresponding predicted image. The loss functions corresponding to the thermal branch H(i) and the saliency branch S(i) are defined in Eqs.(11,12), respectively.

        where n represents the total number of pixels and m the pixel index. In addition, the symbol G denotes a real image, its superscript represents the image name on the corresponding branch, and its subscript represents the location of the current pixel. Similarly, the symbol P denotes a predicted image, its superscript represents the name of the predicted image on the corresponding branch, and its subscript represents the location of the current pixel. It should be noted that the values of G are either 0 or 1, and the values of P are in the interval [0,1].
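        A minimal sketch of the branch loss of Eqs.(11,12) is given below; it is the standard pixel-averaged binary cross-entropy between the branch prediction P and the corresponding ground truth G. The small epsilon is added only for numerical stability and is an assumption.

```python
import torch

def branch_bce_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Illustrative binary cross-entropy loss used for both branches (Eqs.(11,12)).

    `target` (G) is a binary ground-truth map (saliency map or thermal map)
    and `pred` (P) is the branch prediction in [0, 1]; the loss is averaged
    over all n pixels.
    """
    eps = 1e-7
    pred = pred.clamp(eps, 1.0 - eps)
    loss = -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred))
    return loss.mean()
```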

        Similar to the definitions of the branch losses, the loss function of the final output map S_0 is defined in Eq.(13). However, in order to make the final output map S_0 of the TFSHD model pay more attention to the pixels in high-temperature regions, a weighting term of one plus the thermal-map value is added to the loss function L_S0. This means that the pixels in the high-temperature region of the thermal map receive a higher weight. In addition, when calculating the prediction loss of the final output map S_0, only the information of the real image of the first level of the saliency branch is used. For the same reason, when weighting with thermal information, only the real image of the first level of the thermal branch is used.

        Thus, the total loss function L of the TFSHD model can be defined as the weighted sum of the above three loss functions. The definition of L is shown as follows
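        The following sketch combines the three losses as described above. The per-pixel weight of one plus the thermal value in L_S0 follows the description of Eq.(13); the equal weighting of the three terms in the final sum is an assumption, since the weights of Eq.(14) are not reproduced here.

```python
import torch

def _bce(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Per-pixel binary cross-entropy (no reduction)."""
    pred = pred.clamp(eps, 1.0 - eps)
    return -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred))

def total_loss(pred_h, pred_s, pred_s0, gt_thermal, gt_saliency):
    """Illustrative total loss of the TFSHD model (sketch of Eqs.(13,14)).

    The thermal and saliency branch losses are standard BCE; the final-output
    loss weights each pixel by (1 + thermal value) so that high-temperature
    regions contribute more. Equal weights in the sum are an assumption.
    """
    l_h = _bce(pred_h, gt_thermal).mean()        # thermal branch, Eq.(11)
    l_s = _bce(pred_s, gt_saliency).mean()       # saliency branch, Eq.(12)
    # Final output, Eq.(13): thermally weighted BCE against the saliency label.
    l_s0 = ((1.0 + gt_thermal) * _bce(pred_s0, gt_saliency)).mean()
    return l_h + l_s + l_s0                      # assumed form of Eq.(14)
```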

        3 Results and Analysis

        The GPU used for training the model is an RTX 2080 Ti, and the code is built on the PyTorch framework. For model training, the stochastic gradient descent (SGD) method is adopted. Correspondingly, the learning rate is set to 0.001 and the batch size is set to 6.
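        A minimal training-loop sketch matching these reported settings is given below. The structure of the data set samples, the number of epochs and the use of default SGD momentum are assumptions not stated in the paper; `loss_fn` stands for the total loss of Eq.(14).

```python
import torch
from torch.optim import SGD
from torch.utils.data import DataLoader, Dataset

def train(model: torch.nn.Module, train_set: Dataset, loss_fn, epochs: int = 1):
    """Illustrative training loop with SGD, learning rate 0.001, batch size 6.

    The data set is assumed to yield (image, saliency ground truth, thermal
    map) triples, and the model to return the three predictions of Eqs.(8-10).
    """
    loader = DataLoader(train_set, batch_size=6, shuffle=True)
    optimizer = SGD(model.parameters(), lr=0.001)
    for _ in range(epochs):
        for images, gt_saliency, gt_thermal in loader:
            pred_h, pred_s, pred_s0 = model(images)
            loss = loss_fn(pred_h, pred_s, pred_s0, gt_thermal, gt_saliency)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```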

        3.1 Dataset

        In order to verify the efficiency and effectiveness of the model, we use four test data sets, among which OSU[27], KAIST[28] and FLIR are public data sets, and ATH is an actual data set collected by ourselves in the airport terminal. For the OSU data set, some sequence images from irw01 to irw06 are selected. These images mainly focus on the occlusion of pedestrian targets, including partial occlusion of the body by objects in the pedestrian's hands and occlusion of the body when two pedestrians meet. The data set is taken by a fixed camera, so the background does not change. Consequently, there is a high contrast between pedestrians and the background.

        Both the KAIST and FLIR data sets consist of images collected on the streets using an in-vehicle camera. These images often show different imaging effects due to different camera parameters. In addition, pedestrians at different distances from the camera also appear at different scales in the image. Considering the actual working scenario of airport patrol robots, we focus on the different scales of the human body, cyclists and crowds in the data sets.

        The ATH data set is an actual data set we collected in the real scene of the airport terminal. Similar to the KAIST and FLIR data sets, the ATH data set also faces the problem of varying image scale caused by the different distances between the human body and the camera. However, unlike other standard data sets, the ATH data set contains samples from scenarios such as departure gates, seat areas and windows that are unique to airport terminals. In addition, the human bodies in the data set appear in a variety of postures such as standing, squatting and bending. Moreover, the data set also contains multiple occlusion scenarios, such as mutual occlusion between human bodies, or occlusion by seats, columns and clothing. Compared with the standard test data sets, the ATH data set is more realistic and its scenes are more complicated.

        3.2 Evaluation metrics

        In order to evaluate the performance of the model, we select F-measure and mIOU, evaluation indicators commonly used for segmentation models. In general, the higher the F-measure and mIOU values, the better the performance of the model. As shown in Eq.(15), F-measure is the weighted harmonic mean of the precision and recall of the salient human image segmentation.

        For patrol robots, it is more important to correctly detect a suspicious target than to correctly detect all areas of the suspicious target's body. This means that when using F-measure to measure model performance, the precision of the salient human image segmentation should take up a larger proportion. As recommended in Ref.[29], the value of β² is set to 0.3. The definitions of precision and recall are shown in Eq.(16).

        where TP represents the number of positive samples that are correctly predicted, and FP and FN represent the number of negative samples detected as positive samples and the number of positive samples predicted as negative samples, respectively.

        mIOU, defined in Eq.(17), represents the intersection-over-union ratio between the positive samples in the real image and the positive samples in the predicted image.
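        The two metrics can be computed as in the following sketch, with β² = 0.3 as recommended above. The binarization of the predicted saliency map before evaluation is left to the caller and is an assumption, as is the small epsilon used to avoid division by zero.

```python
import numpy as np

def f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """Illustrative F-measure with beta^2 = 0.3 (Eqs.(15,16)).

    `pred` and `gt` are binary masks of the same shape.
    """
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    precision = tp / (tp + fp + 1e-7)
    recall = tp / (tp + fn + 1e-7)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-7)

def miou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Illustrative intersection-over-union of the positive class (Eq.(17))."""
    inter = np.logical_and(pred == 1, gt == 1).sum()
    union = np.logical_or(pred == 1, gt == 1).sum()
    return inter / (union + 1e-7)
```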

        3.3 Experimental results and analysis

        To verify the validity of the model, the TFSHD model is compared with five saliency methods, including Hsaliency[30], Amulet[31], BASNet[21], CPD[32] and SRM[33]. Among these five models, Hsaliency is a traditional saliency detection method, and the other four are saliency detection methods based on deep neural network models. Table 1 shows the experimental results of the above six methods on four data sets, including OSU, KAIST, FLIR and ATH. Table 2 shows the FPS (frames per second) values of the methods on two different devices, namely RTX 2080 Ti and GTX 1060.

        As shown in Table 1, as a traditional saliency detection method, the experimental performance of the Hsaliency method on the four data sets is significantly lower than that of the other five methods. Among the four data sets, the overall performance of the Hsaliency method is best on the OSU data set. This shows that the traditional saliency method has a certain effect on data sets with strong contrast between the target and the background. In contrast, the TFSHD model and the other four models have better adaptability to human body detection in various complex scenes due to the use of a deep network structure.

        Table 1 Experimental results on data sets of OSU,KAIST,FLIR and ATH

        Table 2 FPS values of methods on two different devices

        Across the experimental results of the five deep learning methods, the detection speed of the TFSHD method is only lower than that of the CPD method and significantly higher than that of the other methods. From the F-measure values in Table 1, the TFSHD method is higher than the other methods on three data sets, and slightly lower than the Amulet method only on the KAIST data set. Similarly, from the mIOU values in Table 1, the TFSHD method is still optimal on the FLIR and ATH data sets, and only slightly lower than the BASNet method on the other two data sets. In general, the performance of the Amulet method is close to that of the TFSHD method on these four data sets, and both have high accuracy and robustness. However, as shown in Table 2, the Amulet method requires high computational power, which makes it difficult to meet the real-time detection requirements of patrol robots. In addition, although the CPD method is slightly better than our method in terms of detection speed, its overall detection accuracy is the worst among the five deep models. Therefore, the experimental performance of the TFSHD method on the four data sets has clear advantages, especially on the ATH data set.

        In Fig.4 we show some visualization examples of the six methods on the four data sets. For the OSU data set, almost all models achieve good detection results due to its single background and high contrast. Due to the influence of outdoor radiation on imaging, the final imaging quality of long-distance targets in the KAIST data set is very poor. This leads to missed detections of distant small-scale targets by the CPD and BASNet methods. The FLIR data set has a high average gray value due to differences in camera parameters, which makes it difficult for the Hsaliency method to extract effective features from it. The ATH data set is collected from the airport and contains various complex situations such as multi-scale, multi-posture and partial occlusion. These factors directly prevent models with poor robustness from obtaining good results. From another point of view, since the ATH data set reflects the real state of pedestrians in airport terminals more clearly, the practical application effect of each model in airport terminals is also more clearly reflected by its results on the ATH data set. Based on Table 1 and Fig.4, the TFSHD method has the best experimental results on the ATH data set.

        Fig.4 Visual comparison of TFSHD model with other saliency models on four data sets

        It can be seen from Table 2 and Fig.4 that the BASNet method has both good robustness and fast running speed because it pays attention to model lightweighting and to the segmentation of target edge pixels at the same time. However, the BASNet model often misses detections when dealing with small-scale targets in the distance. In comparison, the TFSHD model pays more attention to the pixels in high-temperature areas, which ensures that the model still performs well even when facing small-scale targets. Additionally, the lightweight design of the TFSHD model allows it to complete its computation with fewer parameters and less computing power.

        As shown in Table 2, the TFSHD method can obtain detection results at a speed above 40 f/s on the RTX 2080 Ti device, and above 12 f/s on a GTX 1060 device with limited computing power. This detection speed is sufficient for real-time detection tasks with a thermal imaging camera.

        Fig.5 shows the segmentation results of the TFSHD method on human body targets in airport ticket gates, rest areas, passages and other scenes. As can be seen from Fig.5, the TFSHD method achieves good segmentation results for multi-posture targets such as bending, sitting and standing in the above scenes, as well as for multi-scale targets such as small-scale targets in the distance and large-scale targets nearby. When the human body is occluded by objects such as ticket gates and seats, the target can still be accurately segmented by the TFSHD method.

        Fig.5 Visual images of TFSHD model in various scenarios of airport terminal

        Fig.6 shows the effect of the proposed method on human body target segmentation when there is interference from other thermal sources. Among the three application scenarios, the last column is the night terminal scenario. In addition to the human body, these scenes also contain other thermal sources, such as lights, screens and vehicles. In fact, as mentioned above, the TFSHD method improves the sensitivity of the model to the thermal features of the human body region and optimizes the detection of salient human bodies by means of the thermal map of the regions with medium and high gray values in the infrared image. In this way, thermal sources other than the human body are treated as interference factors. As shown in Fig.6, after the human body region segmentation of the image is completed, the location, center and size of the detection target can also be obtained through region division, and can be displayed to the user in a visual way.

        Fig.6 Detection results of TFSHD model in scenarios with multiple thermal sources

        4 Conclusions

        We propose a novel salient human detection model called TFSHD. The proposed TFSHD is based on the traditional U-Net network with an encoder-decoder architecture. However, in order to optimize the detection of the salient human body by using the thermal information in the image, the decoder in TFSHD is composed of three components, i.e., the original decoder module, the saliency decoder module and the fusion module. With the help of this optimized architecture, the thermal features in the image are used for model parameter training and the learning of salient feature maps. The experimental results on four data sets show that the proposed method is superior to the other five methods in terms of prediction accuracy and robustness. The experimental results on the actual terminal data set ATH further show that the proposed method can effectively segment salient human bodies in multi-posture, multi-scale and partial occlusion situations, and can efficiently complete human body detection in different scenarios of the airport terminal. Additionally, a series of model lightweight designs is adopted in our paper. Thus the detection results can be obtained at a speed above 40 f/s, which meets the real-time detection requirements of patrol robots.
