Shaopeng LIU, Guohui TIAN†, Yongcheng CUI, Xuyang SHAO
School of Control Science and Engineering, Shandong University, Jinan 250061, China
Abstract: This paper focuses on the problem of active object detection (AOD). AOD is important for service robots completing tasks in the family environment: it guides a robot to approach the target object by taking appropriate moving actions. Most current AOD methods are based on reinforcement learning and suffer from low training efficiency and testing accuracy. Therefore, an AOD model based on a deep Q-learning network (DQN) with a novel training algorithm is proposed in this paper. The DQN model is designed to fit the Q-values of various actions, and includes a state space, feature extraction, and a multilayer perceptron. In contrast to existing research, a novel memory-based training algorithm is designed for the proposed DQN model to improve training efficiency and testing accuracy. In addition, a method of generating the end state is presented to judge when to stop the AOD task during training. Sufficient comparison experiments and ablation studies performed on an AOD dataset show that the presented method outperforms comparable methods and that the proposed training algorithm is more effective than the raw training algorithm.
Key words: Active object detection; Deep Q-learning network; Training method; Service robots
The family service robot’s main function is to provide various services for its users (Shuai and Chen, 2019), and in recent years it has received much attention from the research community. When performing service tasks in a family environment, for example, finding a service-related object (Wang et al., 2019; Liu et al., 2022b), the service robot needs to not only perceive the environment but also detect the target object. The purpose of object detection is to recognize and locate objects.
With the development of deep learning, the accuracy of object detection has been greatly improved by state-of-the-art object detection methods (Pu et al., 2021). In recent years, anchor-free networks such as ExtremeNet (Zhou et al., 2019) and CenterNet (Duan et al., 2019) have become popular owing to their advantages in detecting small objects. The development and success of object detection in the field of computer vision have provided new solutions for detection tasks in robotics. A vision system using the YOLO algorithm has been proposed to detect objects that can be obstacles in the path of a mobile robot (Dos Reis et al., 2019). Wan and Goudos (2020) employed Faster R-CNN for multi-class fruit detection for a harvesting robot in the field of smart farming.
In robotic applications, the detection task is executed over multiple images using active motion control of the robot to obtain an optimal viewpoint and approach the target object for further manipulation, which is called robot active object detection (AOD) (Ammirato et al., 2017). With the guidance of AOD, the robot can make appropriate movements (e.g., moving forward and turning left) to get close to the target object for better detection. It is worth noting that AOD builds on object detection, because the detection result of each image is the necessary basis for AOD to take the next action.
In Zhang et al. (2017), viewpoint control strategies were presented for active object recognition, a task quite similar to AOD. With the development of reinforcement learning (RL) techniques, RL-based AOD methods have become popular. Paletta and Pinz (2000) first introduced RL in the context of AOD using Q-learning. Recent works on AOD have focused on deep RL (DRL) approaches (Ammirato et al., 2017; Han et al., 2019). Ammirato et al. (2018) used a DRL method to train a network to take an action at each step and built an AOD dataset. Han et al. (2019) first introduced the deep Q-learning network (DQN) with a dueling architecture for robot AOD to predict multiple actions at each step. Nevertheless, these works trained DQN models with the basic training algorithm, which can lead to low training efficiency and testing accuracy. At the same time, designing an appropriate state space and model structure can enhance the performance of robot AOD.
In this study, a DQN-based AOD model with a novel training algorithm is proposed for service robots. The DQN-based model is designed, including the state space, feature extraction, and a multilayer perceptron (MLP). In the state space, the RGB image and the bounding box of the target object are selected as the state information to reflect the environmental changes that occur after the robot moves. Feature extraction contains two types of extractors for the different types of state information. The convolutional neural network (CNN) feature extractor is composed mainly of 16 residual blocks (He et al., 2016) to generate a feature vector of the RGB image. The location feature extractor normalizes the values of the bounding box to produce a location feature vector. The two kinds of feature vectors are combined into a fusion feature to be input into the MLP. The three-layer perceptron fits the Q-values of the different actions. Meanwhile, a novel training algorithm based on memory (TAM) is designed to speed up exploration with high efficiency during training and to increase the testing accuracy. The proposed training algorithm can generate more training data with positive rewards and prevent the model from repeatedly learning data with negative rewards. In addition, when to stop exploration during training is a key problem. To solve this problem, the end state is used to judge whether the robot has arrived at the end point, and a generation method for the end state is presented. Extensive comparison experiments and ablation studies prove that our proposed method is superior to other methods, with better training efficiency and higher testing accuracy in robot AOD tasks.
The main contributions of this paper can be summarized as follows:
1. We propose a DQN-based AOD model with a novel training algorithm for service robots.
2. The designed TAM with the generation method of the end state improves the training efficiency and testing accuracy of the DQN-based model for robot AOD.
3. The comparative experiments and ablation studies based on an AOD dataset show the competitive performance of our method and provide analysis that explains the superiority of the presented training algorithm.
The robot AOD can be regarded as an action decision problem (Han et al., 2019). When detecting an object, the robot should move to an appropriate observation location where it can have a better detection view of the target. This motion process is modeled by the tuple (S, A, f_r, T, π). The description of each element is provided as follows:
1. S = {s_t, s_{t+1}, ..., s_{t+n}} denotes the space of robot states at different time points. The content of each s is the robot’s observation of its environment.
2. A = {a_1, a_2, ..., a_6} is an action space including six motion directions of the robot: clockwise rotation (rotate_cw), counterclockwise rotation (rotate_ccw), left, right, forward, and backward.
3. f_r expresses a reward function aiming to give a score to the robot action a_t in the current state s_t.
4. T, the state transition function, brings the robot from the current state s_t into the next state s_{t+1} after the robot executes an action a_t.
5. π, the action selection strategy, is the map from the state space S to the action space A, which guides the robot to execute an action a_t according to the current state s_t.
With the formulation established, the workflow of AOD is demonstrated in Fig. 1. In the initial state s_t, a robot starts to detect the target object appearing in the current observation o_t. In the next step, t+1, the robot chooses an action a_t based on the action selection strategy π to get into the next state s_{t+1} with a new observation o_{t+1}. Meanwhile, the robot obtains a reward r_t from the environmental feedback. The reward function (1) is used to guide the robot to choose the correct action in each state; its principle is that a good action receives a positive reward while a bad action receives a negative reward. Combining functions (2) and (3), this state transition process is expressed by

s_{t+1} = T(s_t, a_t),

where s_{t+1} is closely related to the last state s_t and the last action a_t.
An appropriate action can lead the robot to a better observation view than that of the last state. For example, in Fig. 1, when the bounding box of the target object lies on the right side of o_t in s_t, the rotate_cw action will be executed so that the robot obtains a better view in which the target object is in the center of o_{t+1}. If the robot acts incorrectly in s_t, such as rotate_ccw, it will lose the target object in the observation of the next state. Thus, the action selection strategy is crucial. The problem solved in this study is to generate an optimal action selection strategy π* based on DQN (Mnih et al., 2015) for the robot to have a better detection view. This issue can be optimized by the action-value function

Q^π(s_t, a_t) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ | s_t, a_t, π],

which represents the sum of all the robot’s rewards from start to end in an AOD task by performing an action a_t based on strategy π in each state s_t. The purpose is to find the optimal action-value function

Q*(s, a) = max_π Q^π(s, a)

for solving the robot AOD problem. In this work, the action-value function is approximated by a designed deep neural network:

Q(s, a; θ) ≈ Q*(s, a),

where θ denotes the parameters of the deep neural network.
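As an illustrative sketch only (the step size, discount factor, and toy dimensions below are assumptions, not the paper's values), the tabular form of the Bellman-style update that the network Q(s, a; θ) approximates looks like:

```python
import numpy as np

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step; the DQN replaces the table q with a
    parameterized network Q(s, a | theta) trained toward the same target."""
    target = reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (target - q[state, action])
    return q

# Toy example: 3 states, 6 actions (matching the paper's action space size).
q = np.zeros((3, 6))
q = q_update(q, state=0, action=2, reward=1.0, next_state=1)  # q[0, 2] -> 0.1
```

With an initially zero table, the updated entry becomes alpha * reward = 0.1, while all other entries stay at zero.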
Fig. 1 Workflow of active object detection (AOD) expressed in four parts: state, observation, action, and reward (Red point and bounding box represent the target object in the observation part. References to color refer to the online version of this figure)
The DQN state space represents the environmental information perceived by the robot and the changes caused by the robot’s movement. The state space is the basis for the robot to make decisions and evaluate its long-term benefits. The quality of the state space design directly determines whether the DQN algorithm can converge or not.
In the AOD task, the state space is mostly represented by the robot’s visual information. As shown in Fig. 2, the state space contains two parts: the captured scene image and the bounding box. The scene image captured by an RGB camera displays the current environmental state information, which can reflect the spatial location of the robot in the current state. Meanwhile, the bounding box provides the location of the target object. The values of the bounding box are normalized to prevent the gradient from vanishing. The environmental information combined with the bounding box of the target object is beneficial in leading the robot to approach the target object. Because the emphasis of this study is the action selection strategy for robot AOD rather than the object detection model, the bounding box in the state is assumed to be available from an object detection model, such as Faster R-CNN (Ren et al., 2017).
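A minimal sketch of the bounding-box normalization step. The corner format (x1, y1, x2, y2) and the example image size are assumptions, as the paper does not state its coordinate convention:

```python
def normalize_box(box, img_w, img_h):
    """Scale pixel coordinates (x1, y1, x2, y2) into [0, 1] by image size,
    so the location feature has the same scale as the CNN activations."""
    x1, y1, x2, y2 = box
    return (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)

# A hypothetical 640x480 observation with a box around the target object.
loc_feature = normalize_box((128, 96, 320, 240), 640, 480)
# -> (0.2, 0.2, 0.5, 0.5)
```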
The model of the proposed DQN for the robot AOD is shown in Fig. 2, including the state information as the input, the feature extraction module with two extractors (CNN feature extractor and location feature extractor), and the three-layer MLP.
The model is designed as a double-input network because the state contains two types of state information. To extract the features of the RGB image, the CNN feature extractor is designed based on residual blocks. As shown in Fig. 3, the CNN feature extractor contains a convolution and pooling layer, 16 residual blocks, a pooling layer, and a flatten layer. In each residual block, there are batch normalization (BN) layers, activation layers using the rectified linear unit (ReLU), and two kinds of convolution layers (1×1 conv and 3×3 conv). The BN layers speed up network training and help avoid gradient explosion and gradient vanishing. The 1×1 conv adds non-linear activation to the previous layer, and the 3×3 conv increases the network depth. In the final layer of the CNN feature extractor, the flatten layer processes the features into a one-dimensional (1D) feature vector. For the location feature extractor, the coordinate values of the bounding box are normalized to generate a 1D feature vector, which is combined with the CNN feature to create a fusion feature using the concat operation.
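The fusion step can be sketched as a simple concatenation of the two 1D vectors. The 512-dimensional CNN feature size below is a hypothetical placeholder, since the paper does not state the flattened vector length:

```python
import numpy as np

# Hypothetical sizes: a 512-d flattened CNN feature and a 4-d box feature.
cnn_feature = np.random.rand(512)             # output of the residual CNN + flatten
loc_feature = np.array([0.2, 0.2, 0.5, 0.5])  # normalized bounding box

# The concat operation joins the two vectors into one fusion feature
# that is fed to the MLP.
fusion_feature = np.concatenate([cnn_feature, loc_feature])
```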
After the feature extraction mentioned above,the MLP is used to fit the Q-value functions of different actions. Considering the complexity and training difficulty of the network, the MLP in the DQN includes only three layers: the input layer (IL), hidden layer (HL), and output layer (OL). The input
Fig. 2 Model of the proposed deep Q-learning network (DQN) for the robot active object detection (AOD),including the state information as the input, the feature extraction module with two extractors (convolutional neural network (CNN) feature extractor and location feature extractor), and the three-layer multilayer perceptron (MLP)
Fig. 3 Network structure of the designed convolutional neural network (CNN) feature extractor, including a convolution and pooling layer, 16 residual blocks, a pooling layer, and a flatten layer (The output of the CNN feature extractor is a one-dimensional feature vector)
of the IL is the fusion feature and the HL contains 64 neurons. There are six neurons in the OL, because there are six categories of action in the robot AOD. The activation function used in the MLP is ReLU. The DQN model is designed as a non-linear approximation of the optimal action value Q*(s, a) in function (7). During training, all parameters θ of the DQN are optimized.
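A minimal numpy sketch of this three-layer forward pass. The fusion-feature size and the random weights are hypothetical stand-ins for the trained parameters:

```python
import numpy as np

def mlp_forward(x, w1, b1, w2, b2):
    """IL -> HL (64 neurons, ReLU) -> OL (6 raw Q-values, one per action)."""
    h = np.maximum(0.0, x @ w1 + b1)   # hidden layer with ReLU
    return h @ w2 + b2                 # output layer: Q-values, no activation

rng = np.random.default_rng(0)
d_in = 516                             # hypothetical fusion-feature size
w1, b1 = rng.normal(size=(d_in, 64)) * 0.01, np.zeros(64)
w2, b2 = rng.normal(size=(64, 6)) * 0.01, np.zeros(6)

q_values = mlp_forward(rng.normal(size=d_in), w1, b1, w2, b2)
best_action = int(np.argmax(q_values))  # index into the six motion actions
```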
To improve the training performance of the DQN model for robot AOD, a novel TAM is proposed in this subsection. In the raw DQN training algorithm, in each state the robot takes either a random action with a certain probability or an action generated by the trained model. The action taken may lead to a failed exploration, which receives a negative reward. For the same or similar states, the robot may take the same incorrect actions repeatedly, which produces a lot of training data with negative rewards. However, too much training data with negative rewards has an adverse effect on the training efficiency and testing accuracy of the DQN model. Consequently, an exploration method with memory is added to TAM to avoid repeating erroneous actions in the same state, which reduces the amount of training data with negative rewards. The TAM is summarized in Algorithm 1. When taking an action, the flag continue_search is set to avoid repeatedly making wrong actions (lines 16–18 and lines 36–38 in Algorithm 1).
The reward function used in the TAM is based on the work in Liu et al.(2022a):
where S_b is the area of the bounding box and D is the distance between the center of the bounding box and the center of the observation image.
Each episode must end when the robot has arrived near the target object location and obtained a good observation point. Therefore, judging whether the robot has arrived at the appropriate place is a key problem during training, which is the same as generating the end state s_e (line 6 in Algorithm 1) of the episode. Therefore, the generation method of the end state (GMES) is proposed based on the position and area of the target object bounding box, and is summarized in Algorithm 2. In the final state, the robot can achieve a good view of the target object. In some cases, the two actions in adjacent states are inverse (e.g., a_t = forward and a_{t+1} = backward), which results in a reciprocating motion of the robot. To overcome this issue, the last action a_l is recorded to avoid choosing the inverse action in the current state, and the action with the second largest value in the action-score list is then selected as the best action (lines 7–10 in Algorithm 2). All the generated end states are checked and saved for the TAM. An example of end state generation is demonstrated in Fig. 4.
Algorithm 1 TAM for robot AOD
Require: initialized training data D
Ensure: well-trained DQN model
1: Define a storage S and initialize the evaluation DQN Q_e(s, a|θ_e) and the target DQN Q_t(s, a|θ_t) with random parameters θ_e and θ_t
2: N_s = 0 /* N_s is the number of steps */
3: for episode = 1 : M do
4:   N_s = N_s + 1
5:   Initialize a scene image I_i with a target object o_id and the start state s_t from D
6:   Obtain s_e /* s_e is the end state of this episode */
7:   done = False
8:   continue_search = False
9:   action_list = {a_1, a_2, a_3, a_4, a_5, a_6}
10:  while not done do
11:    if N_s > 200 then
12:      Select a random action a_t with probability ε; otherwise, a_t = argmax_a Q_e(s_t, a|θ_e)
13:    else
14:      Select a random action a_t
15:    end if
16:    if continue_search and action_list is not null then
17:      a_t = a_s /* sample an action from action_list */
18:    end if
19:    s_{t+1} = T(s_t, a_t)
20:    if s_{t+1} is null then
21:      continue_search = True
22:      s_{t+1} = s_t, r_t = −1
23:    else
24:      if s_{t+1} = s_e then
25:        done = True, r_t = 1
26:      else
27:        continue_search = True, r_t = f_r(s_t, a_t)
28:      end if
29:    end if
30:    Store (s_t, a_t, r_t, s_{t+1}) in S
31:    s_t = s_{t+1}
32:    if N_s > 2000 then
33:      Sample a mini-batch B_s from S randomly
34:      Use B_s to train Q_e(s, a|θ_e) and update θ_e by the RMSProp optimizer
35:    end if
36:    if continue_search then
37:      Remove a_t from action_list
38:    end if
39:    if N_s % 2000 == 0 then
40:      θ_t = θ_e
41:    end if
42:  end while
43: end for
44: Return Q_t(s, a|θ_t)
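The memory mechanism of Algorithm 1 (lines 16–18 and 36–38) can be sketched as follows. Function and variable names are illustrative, not the authors' code, and actions are indexed 0–5 for the six motions:

```python
import random

def select_action(q_values, action_list, continue_search, epsilon=0.1):
    """Epsilon-greedy choice; however, when the last move failed in this
    state (continue_search is True), resample only from the actions that
    have not yet failed here (action_list) instead of repeating errors."""
    if continue_search and action_list:
        return random.choice(action_list)
    if random.random() < epsilon:
        return random.choice(range(6))
    return max(range(6), key=lambda a: q_values[a])

# After action 3 fails, it is removed from the memory list,
# so a retry in the same state can never pick it again.
action_list = [0, 1, 2, 3, 4, 5]
failed = 3
action_list.remove(failed)
retry = select_action([0.1] * 6, action_list, continue_search=True)
```

This is the behavior that reduces the number of negative-reward transitions in the replay storage.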
An active vision dataset benchmark (AVDB) (Ammirato et al., 2017, 2018) has been proposed to develop and compare robot AOD methods in a real-world environment, and here it is used to verify the performance of the proposed method. The AVDB includes 14 family environments and one office environment to simulate a robot moving to find target objects in an indoor environment. Most homes were scanned once and some of them (e.g., Home_001 and Home_003) were scanned twice. Some images of various homes in the AVDB are shown in Fig. 5. The common manipulable object instances in a household environment appearing in the BigBIRD dataset (Singh et al., 2014) were placed in every home for object detection. There are 33 instance classes in the AVDB, and some examples are shown in Fig. 6.
Algorithm 2 Generation method of the end state (GMES)
Require: initialized state s_1 and the target object o
Ensure: end state s_o^{s_1}
1: done = True
2: Current state s_t = s_1
3: Last action a_l = None
4: Last state s_l = None
5: while done do
6:   Obtain the best action a_max in s_t by Eq. (9)
7:   if a_max is the inverse action of a_l then
8:     a_max = the action that owns the second largest value in Eq. (9)
9:   end if
10:  a_l = a_max
11:  s_l = s_t
12:  Next state s_{t+1} = T(s_t, a_max)
13:  if a_max == backward then
14:    done = False
15:  end if
16: end while
17: s_o^{s_1} = s_l
18: Return s_o^{s_1}
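The inverse-action fallback in lines 7–10 of Algorithm 2 can be sketched as below. The action names come from the paper's action space; the score values are made up for illustration:

```python
# The six motions and their inverses, as described in the paper.
INVERSE = {"forward": "backward", "backward": "forward",
           "left": "right", "right": "left",
           "rotate_cw": "rotate_ccw", "rotate_ccw": "rotate_cw"}

def pick_action(scores, last_action):
    """Choose the top-scoring action, falling back to the second-best
    when the top choice would undo the previous move (reciprocating motion)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if last_action is not None and ranked[0] == INVERSE[last_action]:
        return ranked[1]
    return ranked[0]

scores = {"forward": 0.9, "backward": 0.8, "left": 0.3,
          "right": 0.2, "rotate_cw": 0.1, "rotate_ccw": 0.0}
first = pick_action(scores, last_action=None)        # "forward"
second = pick_action(scores, last_action="backward")  # forward would undo it
```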
Fig. 4 An example of the end state generation based on the GMES, where the bounding box represents the target object
As shown in Table 1, the data collected from 10 homes were used to train the models, and the trained models were tested in two homes. Two twice-scanned scenes (e.g., Home_001 and Home_005) appeared in both the training and test data. In the second-scan scenes (e.g., Home_001_2 and Home_005_2), some instances were moved around, some were removed completely, and some new instances were added.
The success rate (SR) and average step (AS, representing the average number of steps) were used to compare the performances of different AOD methods. SR is defined as the ratio of the number of successful tasks N_s to the total number of tasks N, i.e., SR = N_s / N, and AS is computed by

AS = (1/N_s) Σ_{t=1}^{N_s} L_t,
Fig. 5 Some scene images in the active vision dataset benchmark (AVDB)
Fig. 6 Some examples of the object instances in the active vision dataset benchmark (AVDB)
Table 1 Experimental environment setup
where L_t is the step number of the successful task t. The training parameters are given in Table 2.
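Under the definitions above (assuming AS averages the step count over successful tasks only), the two metrics can be computed as:

```python
def evaluate(episodes):
    """episodes: list of (success: bool, steps: int) per AOD task."""
    n_success = sum(1 for ok, _ in episodes if ok)
    sr = n_success / len(episodes)
    # AS averages the step count L_t over the successful tasks only.
    avg_step = sum(steps for ok, steps in episodes if ok) / max(n_success, 1)
    return sr, avg_step

# Hypothetical results: 3 successes out of 4 tasks.
sr, avg_steps = evaluate([(True, 18), (True, 24), (False, 50), (True, 21)])
# -> sr = 0.75, avg_steps = 21.0
```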
Table 2 Training parameters
The proposed AOD method was compared with the following DQN-based models: DQN, DDQN, D3QN, and DQN_dueling.
1. DQN (Mnih et al., 2015): This is a deep Q-learning model that has the same architecture as in Fig. 2. The difference between the DQN and the proposed method is that the DQN is trained by the raw training algorithm.
2. DDQN: The double Q-learning training strategy (van Hasselt et al., 2016) is used to train the DQN.
3. D3QN: The DDQN with a dueling architecture.
4. DQN_dueling: A dueling architecture is added to the DQN based on the idea of Han et al. (2019).
DQN, DDQN, D3QN, DQN_dueling, and our method were trained with the same parameters in Table 2 to ensure fairness. After 4650 training steps, the SR and AS test results of the above methods are provided in Table 3. Moreover, the well-trained performance of our method is compared with those of some state-of-the-art AOD methods in Table 4.
The following conclusions can be drawn from the results of the comparison experiments:
Table 3 Results of different DQN-based methods
Table 4 Performance comparison with some state-of-the-art AOD methods
1. Among the DQN-based methods, our method has the highest SR (0.7935) and the second-lowest AS (21.0265).
2. Because the environment Home_001_2 is more complicated than Home_005_2, the SR in Home_001_2 is lower than that in Home_005_2. However, the SR of our method is larger than 0.6 in Home_001_2, which is much higher than those of the other methods.
3. Compared with the DQN, the average AS of our method is 2.8758 smaller and the average SR of our method is 0.3328 higher, especially in Home_001_2, where the SR gain is about 0.4. This proves that the model trained by the proposed TAM can improve the performance of robot AOD with a higher SR and a lower AS.
4. Table 4 shows that our method is more accurate than the state-of-the-art methods in the AOD tasks. Compared with the AOD model in Xu et al. (2021), the SR of our method is 0.07 higher.
In summary, our method outperforms the other methods on AOD tasks.
To analyze the performance difference between the proposed TAM and the raw training algorithm (RTA), we completed an experiment concerning data generation in the training process. When the DQN model is trained based on TAM and RTA, every generated transition T = (s_t, a_t, r_t, s_{t+1}) is recorded. Transitions are classified into two categories: positive transitions (PTs) and negative transitions (NTs). A PT receives a positive reward while an NT receives a negative reward. During training, the DQN model learns from various transitions, and the number of transitions in one learning step is the mini-batch size n_bs. To show the number of PTs in the current learning step t, the PT proportion in the current mini-batch (PTPB) is calculated by

PTPB(t) = N_PT(t) / n_bs,

where N_PT(t) is the number of PTs in the mini-batch at step t. The PT proportion from the beginning (t = 1) to the current learning step (t = n) (PTPM) is computed by

PTPM(n) = (Σ_{t=1}^{n} N_PT(t)) / (n · n_bs).
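A small sketch of these two proportions, assuming each mini-batch is represented by its list of transition rewards (the example rewards are made up):

```python
def ptpb(batch_rewards):
    """Proportion of positive transitions within one mini-batch."""
    return sum(1 for r in batch_rewards if r > 0) / len(batch_rewards)

def ptpm(batches):
    """Cumulative positive-transition proportion over the first n batches."""
    total = sum(len(b) for b in batches)
    positive = sum(1 for b in batches for r in b if r > 0)
    return positive / total

# Two hypothetical mini-batches of transition rewards (n_bs = 4).
batches = [[1, -1, -1, 1], [1, 1, -1, -1]]
per_batch = ptpb(batches[0])    # -> 0.5
cumulative = ptpm(batches)      # -> 0.5
```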
The curves of T, PT, PTPB, and PTPM based on RTA and TAM are demonstrated in Fig. 7.
As shown in Fig. 7, the PTPM based on TAM ranges from 0.10 to 0.33, much higher than that based on RTA (which ranges from 0 to 0.15). The PTPB based on TAM is above 0.2 in every learning step, larger than that based on RTA (about 0.05). The PT curves based on RTA and TAM are almost straight lines. With increasing learning steps, the number of PTs grows, but the growth rate of PTs based on TAM is faster. Therefore, the DQN model can learn more PTs based on the proposed TAM and avoids repeatedly learning the NTs that occur with RTA, which is reflected in our method outperforming the other models with a higher SR and a lower AS. Meanwhile, more PTs can effectively improve the performance of the DQN model.
To study the relationship between object detection accuracy (ODA) and AOD performance, our model was tested with different ODAs. From the results in Table 5, it can be seen that ODA affects the performance of the AOD model. A low ODA leads to a small SR; for example, the average SR is only 0.58 when ODA is 0.7. As ODA increases, the SR improves. When ODA is 1.0, our model obtains the largest average SR of 0.84. Therefore, it is important to ensure that the object detection model has a high ODA when applying the AOD model to an actual scene.
Fig. 7 Curves of T, PT, PTPB, and PTPM based on RTA (a) and TAM (b) during 201 learning steps
In this subsection we discuss the ablation studies concerning the state space design and the training algorithm. In the proposed DQN model, the state space includes an RGB image observed by the robot (rgb) and the bounding box of the target object (box). Considering that the robot usually has an RGB-D camera, the depth image (depth) was added to the state space to train and test the proposed DQN model, to explain the rationality of our state space design. The DQN model testing results with various state spaces are given in Table 6. It can be seen from Table 6 that rgb+box has the best SR and AS. The RGB image can represent the environment better than the depth image, so rgb+box is more accurate and efficient than depth+box. When combining the RGB image and the depth image as the state information, the performance improvement of the DQN model based on rgb+depth+box is not obvious, because the feature extraction of the depth image increases the complexity of the DQN model. Therefore, the designed state space rgb+box is reasonable for the proposed DQN model.
To assess the effectiveness of the TAM, training efficiency experiments were carried out. The proposed DQN model was trained based on different training algorithms, RTA and TAM. During training, the model was saved every 465 learning steps and was tested in Home_001_2 and Home_005_2. The testing curves of RTA and TAM are shown in Fig. 8, and the best testing results of the two training algorithms are provided in Table 7. From the curves in Fig. 8, the training speed of RTA is lower than that of TAM: RTA needs 27×465 steps to obtain its best SR of 0.79, whereas the DQN model trained by TAM achieves an SR of 0.8 after only 5×465 steps. From the results in Table 7, the average SR of the TAM is 0.84, much higher than that of the RTA (0.79). Meanwhile, the AS of the TAM is smaller than that of the RTA. In summary, the proposed
Table 5 Results of the DQN model based on different object detection accuracy (ODA) values
Table 6 Results of the DQN model with different state spaces
Table 7 Results of the DQN model based on different training algorithms (TAs)
Fig. 8 Test curves of SR based on RTA and TAM in Home_001_2(a),Home_005_2(b),and the average SR (c)
TAM can improve the training performance and test accuracy of the DQN model,and guarantee high detection efficiency.
In this paper, a DQN-based AOD model with a novel training algorithm is proposed to increase the training efficiency and improve the performance of the DQN model. In particular, a training algorithm based on memory is designed to avoid repetitive learning of data with negative rewards and to increase the amount of data with positive rewards in the training process. Our method is evaluated on an AOD dataset, and the experimental results demonstrate that our method can improve the performance of robot AOD and speed up the DQN training process. In addition, the ablation studies explain the rationality of the proposed method. In the future, we will study how a robot can make an appropriate action prediction if the target object is not detected at the beginning of the AOD task.

Contributors
Shaopeng LIU and Guohui TIAN designed the research. Shaopeng LIU addressed the problems, processed the data, and drafted the paper. Guohui TIAN, Yongcheng CUI, and Xuyang SHAO helped organize the paper. Shaopeng LIU and Guohui TIAN revised and finalized the paper.
Compliance with ethics guidelines
Shaopeng LIU, Guohui TIAN, Yongcheng CUI, and Xuyang SHAO declare that they have no conflict of interest.
Frontiers of Information Technology & Electronic Engineering, 2022, No. 11