Jie HUANG, Zhibin MO, Zhenyi ZHANG, Yutao CHEN
1 School of Electrical Engineering and Automation, Fuzhou University, Fuzhou 350108, China
2 5G+Industrial Internet Institute, Fuzhou University, Fuzhou 350108, China
3 Key Laboratory of Industrial Automation Control Technology and Information Processing of Fujian Province, Fuzhou University, Fuzhou 350108, China
E-mail: yutao.chen@fzu.edu.cn
Abstract: In this study, a novel reinforcement learning task supervisor (RLTS) with memory in a behavioral control framework is proposed for human-multi-robot coordination systems (HMRCSs). Existing HMRCSs suffer from high decision-making time cost and large task tracking errors caused by repeated human intervention, which restricts the autonomy of multi-robot systems (MRSs). Moreover, existing task supervisors in the null-space-based behavioral control (NSBC) framework need to formulate many priority-switching rules manually, which makes it difficult to realize an optimal behavioral priority adjustment strategy in the case of multiple robots and multiple tasks. The proposed RLTS with memory provides a detailed integration of the deep Q-network (DQN) and a long short-term memory (LSTM) knowledge base within the NSBC framework, to achieve an optimal behavioral priority adjustment strategy in the presence of task conflict and to reduce the frequency of human intervention. Specifically, the proposed RLTS with memory begins by memorizing human intervention history when the robot systems are not confident in emergencies, and then reloads the history information when encountering the same situation that has been tackled by humans previously. Simulation results demonstrate the effectiveness of the proposed RLTS. Finally, an experiment using a group of mobile robots subject to external noise and disturbances validates the effectiveness of the proposed RLTS with memory in uncertain real-world environments.
Key words: Human-multi-robot coordination systems; Null-space-based behavioral control; Task supervisor; Reinforcement learning; Knowledge base
Human-multi-robot coordination systems (HMRCSs) (Zheng et al., 2017; Lippi and Marino, 2018) have been used in homes (Lee and Kim, 2018), military detection (Gans and Rogers, 2021), rescue robots (Queralta et al., 2020), industrial processes (Robla-Gómez et al., 2017), and outer space exploration (Bluethmann et al., 2003). Coordination between humans and robots can improve control efficiency and robustness, and allows robots to successfully complete specific predetermined tasks while encountering emergencies such as partial failure. On this topic, researchers have proposed different methods to achieve efficient and practical human-multi-robot coordination. An automated advising agent system has been introduced to solve the problem of supervising and operating multiple robots simultaneously in a large-scale human-multi-robot coordination system (HMRCS) for search and rescue (Rosenfeld et al., 2017). A distributed control strategy based on robust adaptive control has been designed to realize efficient physical interaction between a human and multiple manipulators (Lippi et al., 2019). An intelligent robot navigation system has been proposed to realize human-multi-robot coordination control, where multiple robots can implement tasks with priority under external disturbances such as human and robot motion (Bajcsy et al., 2019). A robot fault human information processing (RF-HIP) model has been designed (Honig and Oron-Gilad, 2018) to solve communication and perception problems among humans and robots, in which efficient decision-making and control are achieved under the condition of perception error and response failure in some robots.
Although these approaches achieve effective cooperation among humans and robots, they do not consider repeated human intervention in the task execution process when robots encounter similar emergencies or failures. This may require constant human attention, increasing the burden of human monitoring and control and the probability of making mistakes. In addition, frequent human participation and repeated intervention can result in significant decision-making time cost and task tracking errors, which seriously affect the task execution process and may even cause safety problems. To tackle these problems, researchers have recently proposed learning methods to memorize human intervention information, called human-in-the-loop (HIL) hybrid enhanced intelligence (Zheng et al., 2017). For example, an HIL hybrid enhanced intelligent closed-loop system has been built by introducing machine learning and human knowledge into robotic decision-making (Fu et al., 2019). A hierarchical emotional episodic memory method has been proposed in social human-robot collaboration (Lee and Kim, 2018), where robots are able to remember and manage human experiences and predict and prevent emergencies.
However, conflicts among humans and robots are inevitable in human-robot cooperation as robots are flexibly combined and designated to complete tasks with increasing complexity. Null-space-based behavioral control (NSBC) (Antonelli and Chiaverini, 2006) is a practical method for resolving task conflicts. NSBC ensures that tasks with higher priority are fully executed, while those with lower priority can be partially executed using null-space projection and system redundancy. To this end, one of the key issues in NSBC is to design an implicit centralized supervisor to manage multiple tasks that may be in conflict. Traditional supervisors include the finite state automaton (FSA) method (Baizid et al., 2017), fuzzy logic method (Moreno et al., 1993; Huang et al., 2019), and model predictive control (MPC) (Chen et al., 2020), which can realize real-time dynamic switching of task priority. However, FSA and fuzzy logic methods need to manually formulate priority-switching rules. These methods are not intuitive when their task space and state space are large. In addition, the supervisor based on MPC has the disadvantages of requiring an accurate mathematical model and a high cost of real-time computation.
The task supervisor design is even more troublesome in the case of human intervention in HMRCSs, in which questions like when and how humans intervene are not easy to answer. In early studies, a human drift diffusion model (DDM) was proposed to account for human intervention in the NSBC framework (Huang et al., 2020). However, these works do not consider the repeated and frequent human intervention problems, which may cause high decision-making time cost and significant task tracking errors. Motivated by these issues, we propose a novel reinforcement learning task supervisor (RLTS) with memory in the NSBC framework (Mo et al., 2022). A deep Q-network (DQN) and a long short-term memory (LSTM) knowledge base are employed to address the problems of dynamic task priority adjustment and repeated human participation. In particular, the proposed RLTS with memory first memorizes human intervention history from the situations when robot systems are not confident in emergencies, and then reloads the history information when encountering the situation that was previously tackled by humans. RLTS is first trained off-line, and can obtain dynamic task priority adjustment strategies on-line depending on the environment. This overcomes the defects of the traditional FSA and MPC task supervisors. This paper significantly extends Mo et al. (2022) in the following areas: (1) An artificial potential field model is introduced to improve the state selection accuracy of the RLTS, so that the proposed RLTS with memory can accurately determine whether to trigger human intervention, and the robots have the ability to avoid dynamic obstacles. (2) In addition to numerical simulations, experiments are conducted using real mobile robots to demonstrate the effectiveness of the proposed RLTS.
where G_t represents the discounted sum of all rewards from state S_t at time t to the final state.
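As a minimal sketch of the discounted return G_t defined above, the sum can be accumulated backward over a reward sequence (the function name and the default discount factor are illustrative choices, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    by folding the reward sequence from the final state backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, with rewards [1, 1] and gamma = 0.5, the return is 1 + 0.5·1 = 1.5.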
In DQNs, the agent tries to learn the optimal action-value function q*(s, a) through value iteration updates (Mnih et al., 2015; Wang et al., 2020). In the value iteration process, DQNs introduce a deep neural network q_w(s, a) with parameter w to replace the Q-table in Q-learning (Watkins and Dayan, 1992). The parameter w is learned by randomly sampling a mini-batch n_m of transitions from an experience replay buffer and minimizing the squared temporal-difference (TD) error. The cost function can be calculated as
where G = r + γ max_{a'} q̂(s', a'; w^-) is the TD goal, also called the expected state-action reward, w^- is the neural network parameter of the target Q-network, s and s' are the current and next states, respectively, and a and a' are the currently selected action and the next action, respectively.
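The TD goal and squared TD error above can be sketched as follows; the networks are represented here as plain callables returning Q-value vectors, which is an illustrative simplification of the paper's neural networks:

```python
import numpy as np

def td_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error over a mini-batch of transitions.

    q_net / target_net: callables mapping a state to a vector of Q-values
    (target_net plays the role of the frozen target Q-network with w^-).
    batch: iterable of (s, a, r, s_next) transitions.
    """
    errors = []
    for s, a, r, s_next in batch:
        g = r + gamma * np.max(target_net(s_next))  # TD goal G
        errors.append((g - q_net(s)[a]) ** 2)       # squared TD error
    return float(np.mean(errors))
```

In practice the gradient of this loss with respect to w is what the optimizer steps on, while w^- is only refreshed periodically.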
The NSBC approach can be designed in a three-level structure consisting of elementary behaviors,composite behaviors,and a task supervisor(Baizid et al.,2015,2017).The speed output of each elementary behavior can be combined and superimposed according to the geometric rules of null-space projection to obtain reference speed signals of the robots.
2.3.1 Elementary behaviors
In NSBC,elementary behaviors are the atomic task functions to be controlled at the kinematic level.They can be expressed by a function that involves the degree of freedom of the system and variables to be controlled.
Define ρ ∈ R^m as the task variable and δ ∈ R^n as the system configuration. ρ_i is the function related to δ_i, i = 1, 2, ..., κ, where κ is the number of robots in the HMRCS. Thus, the corresponding task function of each robot can be expressed as
The corresponding differential relationship of Eq.(4)is
where J_i(δ_i) ∈ R^{m×n} is the configuration-related task Jacobian matrix of the ith robot and v_i ∈ R^n is the stacked velocity vector of the ith robot. n depends on the specific system and its controllable degrees of freedom; for example, for a mobile robot, n = 3. The system configuration here refers to the position and orientation. The reference velocity v_d can be calculated by converting the local linear mapping (5) into a least-squares formulation. Integrating the reference velocity would incur a certain drift in the reconstructed position of the robot, which can be compensated for by the following closed-loop inverse kinematics (CLIK) algorithm (Antonelli and Chiaverini, 2006):
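A minimal sketch of the CLIK velocity computation described above, assuming the common form v = J⁺(ρ̇_d + Λ ρ̃) with a pseudoinverse of the task Jacobian and a positive-definite gain matrix Λ (symbol names are illustrative):

```python
import numpy as np

def clik_velocity(J, rho_dot_d, rho_err, Lam):
    """Closed-loop inverse kinematics: v = pinv(J) @ (rho_dot_d + Lam @ rho_err).

    J: m x n task Jacobian; rho_dot_d: desired task velocity;
    rho_err: task error rho_d - rho, whose feedback term compensates
    the numerical drift caused by integrating the reference velocity.
    """
    return np.linalg.pinv(J) @ (rho_dot_d + Lam @ rho_err)
```

With zero task error this reduces to the plain least-squares inverse mapping; the gain Λ trades drift rejection against sensitivity to measurement noise.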
Remark 1  In this study, a behavior is also called a task or mission in behavioral control.
2.3.2 Composite behaviors
Composite behaviors are combinations of multiple elementary behaviors and are determined by the priority of tasks.
Let ρ_j ∈ R^{m_j} be the jth task function, where j = 1, 2, ..., K and m_j denotes the space dimension of the jth task. Define a time-related priority function g(j, t): N_K × [0, ∞) → N_K, N_K = {1, 2, ..., K}, representing a mapping between the task function index and the priority index. Then, the composite behavior combination rules can be defined as follows:
1. j = 1 is assumed to be the top priority. j_α > j_β indicates that j_β has a higher priority than j_α. The behavior with priority j_α cannot interfere with the behaviors with priority j_β, ∀ j_α, j_β ∈ N_K, j_α ≠ j_β. Behaviors with a lower priority are allowed to be executed in the null space of all behaviors with a higher priority.
2. The behavior Jacobian matrices J_{g(j,t)} ∈ R^{m_j×n}, j = 1, 2, ..., K, determine the mappings from the generalized velocities of the system to the behavior velocities.
3. The dimension m_K of the lowest-priority task can be greater than the remaining degrees of freedom, so the total dimension of all behaviors may exceed the dimension n of the behavior space.
4. g(j, t) is determined by a task supervisor according to task requirements and sensor feedback information.
By distributing the given priorities to multiple elementary behaviors, the velocity output of composite behaviors at time t can be expressed by the following recursive expression:
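The recursive null-space fusion can be sketched as follows, using one common formulation in which each lower-priority velocity is projected into the null space of the stacked higher-priority Jacobians (the augmented-Jacobian projector is an assumption; NSBC variants differ in the exact recursion):

```python
import numpy as np

def compose_velocities(tasks):
    """Fuse prioritized task velocities via null-space projection.

    tasks: list of (J, rho_dot) pairs ordered from highest to lowest
    priority. Lower-priority contributions are projected into the null
    space of all higher-priority Jacobians, so they cannot perturb
    higher-priority tasks.
    """
    n = tasks[0][0].shape[1]
    v = np.zeros(n)
    N = np.eye(n)          # accumulated null-space projector
    J_stack = None
    for J, rho_dot in tasks:
        v = v + N @ np.linalg.pinv(J) @ rho_dot
        J_stack = J if J_stack is None else np.vstack([J_stack, J])
        N = np.eye(n) - np.linalg.pinv(J_stack) @ J_stack
    return v
```

For instance, if the top-priority task fixes the x component of a planar velocity, a lower-priority task can only act on the remaining y component.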
Remark 2  Among all elementary behaviors, only those in conflict need to be allocated priorities, while mutually independent behaviors can be executed autonomously with the available degrees of freedom.
2.3.3 Behavioral control task supervisor
In the NSBC approach,the real-time switching between composite behaviors must be assigned by a so-called task supervisor(Baizid et al.,2015),which can dynamically manage and adjust the composite behaviors.It can be triggered flexibly according to task requirements and sensor information.The design of the behavioral control task supervisor module will be detailed later.
The goal of this study is to develop an optimal adjustment strategy of the behavioral priority and reduce the frequency of repeated human participation.First,we have the following commonly used assumptions:
Assumption 1  All robot tasks are local tasks. There is no information interaction between robots. Each robot in the HMRCS pursues its own maximum benefit.
Assumption 2  Robots are autonomously controlled by a robot controller. Human intervention will take over control only when the robot system is not confident enough or fails to complete the tasks within a limited time period.
In this section,we describe the design of a novel RLTS with memory to control the HMRCS.It is designed to obtain an optimal behavioral priority adjustment strategy and reduce the frequency of repeated human participation in the HMRCS by elaborately integrating the NSBC approach,DQN,and a specific memory base.
The basic framework of HIL hybrid enhanced intelligence was systematically summarized in Zheng et al. (2017). When autonomous unmanned systems are in an abnormal situation, in which robots are unable to execute tasks within a limited time period or the host computer is not confident enough to complete specific tasks, the systems ask for human assistance and automatically update operation information to the knowledge base after estimating the confidence and cognitive state of the host computer. Accuracy and credibility can be enhanced by introducing human prediction and intervention, and the frequency of human participation can be reduced thanks to the knowledge base.
Based on the aforementioned framework,a novel HMRCS under the NSBC-based framework is designed,as shown in Fig.1.It consists of six key elements:an autonomous decision maker,an RLTS with memory,the typical NSBC scheme,the actual physical environment,a data processing station based on DDM,and an HIL decision maker.They are marked①-⑥in the block diagram.A detailed description of each element is given below:
Fig.1 The novel human-multi-robot coordination system (HMRCS) under the null-space-based behavioral control (NSBC) framework (KB: knowledge base)
1. The autonomous decision maker. As shown in element ①, this element is responsible for the robots' autonomous motion according to preset programs or control laws. In our application scenario, the preset or designated programs include the tracking task, obstacle avoidance task, and collision avoidance task.
2.The RLTS with memory.As shown in element②,the role of the supervisor is to obtain an optimal behavioral priority adjustment strategy and dynamically adjust the task priority during the task execution process.It learns,records,and reloads the control input of human intervention.This module will be further described in Section 3.3.
3.The NSBC scheme.As shown in element③,this element is responsible for controlling implementations.The desired speed control signal is obtained by fusing elementary task outputs based on the task priority from the RLTS.Then,a proportion integration differentiation(PID)controller embedded in the robots tracks the desired speed.
4.The actual physical environment.As shown in element④,this element refers to the real environment including obstacles,terrains,and robots.Robots transform their states through actions or behavior execution,observe the environment through sensor feedback,and make their decisions for the next action.
5.Data processing station based on DDM.As shown in element⑤,this element consists of an information collector,a filter,and a DDM(Bogacz et al.,2006).DDM is introduced to model the accuracy and reaction time of the value-based human choice in the intervention process.During task execution,the decision information is collected in real time,which helps determine the starting and ending time of human intervention.Once the accumulated information reaches a decision threshold,human intervention is activated.The decision threshold is determined by minimizing the Bayes risk brought by human intervention(Huang et al.,2020).
6. HIL decision maker. As shown in element ⑥, this element is responsible for supervising and taking over the multi-robot system (MRS) when necessary, generating human control inputs to the robots. The robot autonomous decision maker takes control again after the risk or failure is eliminated. In this study, two human tasks are considered, including the monitoring task and the human intervention task. Other human behaviors such as planning and recording are omitted, but they can also be taken into account in the proposed framework.
Based on the NSBC task supervisor, the design of our RLTS with memory is discussed in detail in this subsection. The proposed RLTS includes a reinforcement learning task allocation supervisor and an LSTM knowledge base. The supervisor is responsible mainly for dealing with task conflicts in the HMRCS task execution process, realizing effective human-robot coordination, and adjusting task priorities in real time. The LSTM knowledge base mainly provides memory and storage, and improves the independent decision-making ability and intelligence of the HMRCS.
3.3.1 Design of RLTS
In the RLTS with memory, the optimal mapping relationship between system states and composite tasks is obtained by off-line training. This ensures that each robot selects its optimal task priority order. To enhance the adaptability of the HMRCS in unknown environments, an artificial potential field model is introduced to help select states in reinforcement learning, to increase the accuracy of action or behavior selection, and to accelerate the training convergence. In this subsection, a DQN with a dueling structure is employed to accelerate the convergence of the neural network. The pseudo code of the proposed supervisor is given in Algorithm 1. Define E as a static environment, S_i as the set of states of the ith robot, B as the set of behaviors, D as the experience replay buffer with capacity N_D, M as the total number of training episodes, and T_step as the time step of one episode.
Algorithm 1  Reinforcement learning task supervisor (RLTS) for the ith robot
1: Input: total number of training episodes M, ε-greedy policy decay coefficient γ_ε, time step T_step of an episode, and the target network update step length N_C
2: Initialize replay buffer D_i with capacity N_D
3: Initialize the initial value of the ε-greedy policy as ε_0
4: Initialize the action-value function q_i(s_i, b_i; W_{i,q}, W_{i,α}, W_{i,β}) = V_i(s_i; W_{i,q}, W_{i,β}) + A_{i,b}(s_i, b_i; W_{i,q}, W_{i,α}) with random initial weights W_{i,q}, W_{i,α}, and W_{i,β}
5: for episode = 1 : M do
6:   Initialize the initial state s_0
7:   for t = 1 : T_step do
8:     Select a random behavior b_{i,t} with probability ε; otherwise, select b_{i,t} = argmax_b q(s_{i,t}, b; W_{i,q}, W_{i,α}, W_{i,β})
9:     Execute behavior b_{i,t} and observe reward r_{i,t} and next state s_{i,t+1}
10:    Store the transition (s_{i,t}, b_{i,t}, r_{i,t}, s_{i,t+1}) in D_i
11:    Sample a mini-batch n_m of transitions (s_{i,z}, b_{i,z}, r_{i,z}, s_{i,z+1}) from D_i with priority
12:    y_{i,z} = r_{i,z} + γ max_{b_{i,z+1}} q̂_i(s_{i,z+1}, b_{i,z+1}; W^-_{i,q}, W^-_{i,α}, W^-_{i,β})
13:    Perform a gradient descent step on (y_{i,z} − q_i(s_{i,z}, b_{i,z}; W_{i,q}, W_{i,α}, W_{i,β}))^2 with respect to the network parameters W_{i,q}, W_{i,α}, and W_{i,β}
14:    Set s_{i,t} = s_{i,t+1}
15:    Reset q̂_i = q_i every N_C steps
16:   end for
17: end for
The proposed supervisor satisfies the Markov property, which means that robots interact with environment E at time t with state s_t, select a behavior b_t according to the ε-greedy policy, and obtain a reward r_t, and then the robots are transferred from the current state s_t to the next state s_{t+1}. The ε-greedy policy means that the robots take the behavior with the maximum Q value with probability 1 − ε and randomly select a behavior with probability ε. In addition, the transition tuple (s_t, b_t, r_t, s_{t+1}) is stored in the experience replay buffer D. Then, at time t + 1, the training continues until M episodes are finished. To make the DQN explore optimistically in the early stage of the learning process and stay stable in the later stage, a high initial value ε_0 of the ε-greedy policy and an appropriate decay coefficient γ_ε are set in RLTS.
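The ε-greedy selection with decay described above can be sketched as follows (function names, the multiplicative decay form, and the floor value are illustrative assumptions; the paper only specifies ε_0 and a decay coefficient γ_ε):

```python
import random

def select_behavior(q_values, eps):
    """ε-greedy: random behavior with probability eps, else the
    behavior with the maximum Q value."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda b: q_values[b])

def decay_eps(eps, gamma_eps, eps_min=0.01):
    """Multiplicative decay so early episodes explore optimistically
    and later episodes exploit the learned policy."""
    return max(eps * gamma_eps, eps_min)
```

Calling decay_eps once per episode reproduces the high-exploration-then-stable schedule the supervisor relies on.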
In RLTS, the q-function is separated into two components: a value function V(s; W_q, W_β) and an advantage function A_b(s, b; W_q, W_α), where V(s; W_q, W_β) is estimated using a value function network with parameter W_β and A_b(s, b; W_q, W_α) is estimated using a state-dependent behavior advantage function network with parameter W_α. Then, the q-function is expressed by q(s, b; W_q, W_α, W_β) = V(s; W_q, W_β) + A_b(s, b; W_q, W_α). After off-line training, RLTS can guide task priority selection on-line during task execution.
In addition,an artificial potential field value corresponding to a specific position of the robot is chosen as one of the states in RLTS.The states of reinforcement learning are selected as a joint matrix,which can reflect the position,potential field,and the distance from each obstacle to the robot,designed as
where O_i(t) is the observation, and P_i(t), V_i(t), and D_{i,o}(t) are the vectors of position, potential field, and distance from each obstacle to the ith robot, respectively. To simplify the model, only the repulsive force field is considered. Thus, the potential field function of the jth obstacle (j = 1, 2, ..., φ, where φ is the number of obstacles detected by the robot sensor) for the ith robot can be defined as
where λ is a positive-definite constant gain, d(P_i, P_{j,o}) represents the distance between the ith robot and the jth detected obstacle, and d_{j,o} denotes the influence radius of the jth obstacle. The potential field of the HMRCS environment is shown in Fig.2.
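A minimal sketch of one obstacle's repulsive field, assuming the common Khatib-style form 0.5·λ·(1/d − 1/d_o)² inside the influence radius and zero outside (the paper's exact expression may differ):

```python
def repulsive_potential(lam, d, d_o):
    """Repulsive field of one obstacle for one robot.

    lam: positive gain; d: robot-obstacle distance; d_o: influence
    radius. The field is active only inside the influence radius and
    grows without bound as d shrinks toward zero.
    """
    if d >= d_o:
        return 0.0
    return 0.5 * lam * (1.0 / d - 1.0 / d_o) ** 2
```

Summing this term over all φ detected obstacles yields the potential field value used as one component of the RL state.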
Fig.2 Potential field of a specific environment
3.3.2 Design of the LSTM knowledge base
LSTM has been successfully applied to large-scale book retrieval knowledge bases (Zhou et al., 2018) and complex semantic recognition systems (Graves and Schmidhuber, 2005). In our application scenario, the LSTM knowledge base is designed to memorize human intervention information in the HMRCS, which includes the forget stage, selective memory stage, and output stage. The network can be realized as
where φ, f, κ, and c represent the input gate, forget gate, output gate, and cell vectors, respectively, h is the hidden vector, l is the data vector, σ(·) represents a logistic sigmoid function, tanh(·) is a hyperbolic tangent function, and θ represents the weight parameter matrix from one module to another. The LSTM knowledge base in the HMRCS is able to learn human intervention control information and experience history. In our HMRCS application scenario, the data can be defined as a five-dimensional time series array, consisting of the position, velocity, and potential field value of the ith robot controlled by a person. The knowledge base labels the processed history data according to different situations. Therefore, the array can be expressed as
where P_{i,x}, P_{i,y}, V_{i,x}, and V_{i,y} are vectors of the position and velocity in the x and y directions of the ith robot, respectively, and U_i denotes a vector of the potential field values.
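One LSTM step with the gates described above can be sketched as follows; the weight layout (one (W, U, b) triple per gate, stored in a dictionary θ) is an illustrative choice, not the paper's parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(l, h, c, theta):
    """One LSTM step over data vector l, hidden vector h, cell vector c.

    theta maps each gate name to a (W, U, b) triple: W acts on the
    input l, U on the hidden state h, b is the bias.
    """
    Wi, Ui, bi = theta["input"]
    Wf, Uf, bf = theta["forget"]
    Wo, Uo, bo = theta["output"]
    Wc, Uc, bc = theta["cell"]
    i = sigmoid(Wi @ l + Ui @ h + bi)                 # input gate (selective memory)
    f = sigmoid(Wf @ l + Uf @ h + bf)                 # forget gate
    o = sigmoid(Wo @ l + Uo @ h + bo)                 # output gate
    c_new = f * c + i * np.tanh(Wc @ l + Uc @ h + bc)  # cell update
    h_new = o * np.tanh(c_new)                         # output stage
    return h_new, c_new
```

Feeding the five-dimensional (position, velocity, potential field) time series through such steps is what lets the knowledge base memorize and later reload human intervention sequences.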
In this section,three mobile robots running on a flat plane are considered.The control goal of the simulation for the robots is to maintain formation and reach their target points in an unknown environment,without colliding with obstacles or the other robots.The parameters of the RLTS,environment configuration,and LSTM knowledge base are given in Table 1.
Table 1 Parameter values in the simulation
Three elementary tasks are briefly introduced:motion,obstacle avoidance,and human intervention tasks. The motion task is to control robots in approaching their target positions along a predetermined trajectory or according to robots’autonomous motion requirements.For mobile ground robots,based on the behavior design guidelines of the NSBC framework,the corresponding task function of the motion task is defined as
where p_i ∈ R^{2×1} is the position of the ith robot. The reference velocity of the motion task can be calculated as
Then,the obstacle avoidance task drives the robots to avoid obstacles along their planned path.The task function is designed to keep a safe distance between the robots and the nearest obstacle,which is defined as
where p_o ∈ R^{2×1} is the position of the obstacle. The reference velocity of the obstacle avoidance behavior can be calculated as
The human intervention task is to describe the control of humans in the HMRCS in the NSBC framework.Human intervention is activated once the accumulated information reaches a decision threshold,indicating that robots in the HMRCS might encounter problems or emergencies such as a local minimum and partial failure.The task output can be a time-dependent position or a speed signal.For mobile ground robots,the human intervention task is defined as
The reference velocity of the human intervention task can be calculated as
Finally,elementary tasks are fused according to Eq.(7),based on the task priority provided by RLTS.
The task tracking error of theithrobot is defined as
where Δt denotes the sampling interval, and k_r ∈ (0, 0.5] is a small positive constant used to improve the convergence of the neural network in the initial training phase.
In this subsection,a comparative simulation is conducted between the FSA-based task supervisor and our proposed RLTS with memory in the HMRCS framework.The important parameters used to realize our simulation are shown in Table 1,including the task and gain parameters of NSBC,configuration of robots and environmental obstacles,reinforcement learning parameters,and knowledge base parameters.In the simulation,three mobile robots move according to the preset task function trajectories.
Figs.3 and 4 show the moving trajectories of the robots using the FSA-based task supervisor and our proposed RLTS with memory, respectively. When encountering obstacles on the way, the task supervisor assigns task priorities and composes NSBC behaviors. Robots in the HMRCS sample their status information and collect real-time drift diffusion decision information during the task execution process. The real-time drift diffusion decision information collected using the FSA-based task supervisor and our proposed RLTS with memory is shown in Figs.5 and 6, respectively. The decision information is also defined as the task tracking error. When robots detect new obstacles or fall into local minima, the human intervention task is triggered after the decision information reaches the decision threshold. During human intervention, direct human control input can help robots cross the dangerous area smoothly, and the control authority is handed back to the robot controller after they are far enough away from the obstacle or other hazardous areas.
Fig.3 Trajectories of the robots using the FSA-based task supervisor with two human intervention processes in the HMRCS
4.3.1 FSA-based task supervisor analysis
Fig.7 shows the distance between the robots and the detected obstacles, and Fig.8 illustrates the switching mode of composite tasks using the FSA-based task supervisor. The analysis shows that, with the FSA-based task supervisor, human intervention takes over the control process whenever the robots encounter an emergency and the decision information reaches the decision threshold. In addition, the FSA-based supervisor lacks the ability to dynamically adjust the task priority, so the robots repeatedly create safety concerns and abrupt speed changes. As a result, a person takes over control twice with the FSA-based supervisor.
Fig.4 Trajectories of the robots using the proposed RLTS with memory with one human intervention process and one reloading intervention process in the HMRCS
Fig.5 Task tracking error of the robots using the FSA-based task supervisor with two human intervention processes in the HMRCS
Fig.6 Task tracking error of the robots using the proposed RLTS with memory with one human intervention process and one reloading intervention process in the HMRCS
Fig.7 Distance between the robots and the detected obstacles using the FSA-based task supervisor with two human intervention processes in the HMRCS
Fig.8 Robot task mode using the FSA-based task supervisor in the HMRCS with two human intervention processes:(a)robot 1;(b)robot 2;(c)robot 3
4.3.2 RLTS with memory analysis
Fig.9 shows the distance between the robots and the detected obstacles, and Fig.10 illustrates the switching mode of composite tasks using our proposed RLTS with memory. The analysis shows that, with the proposed RLTS with memory, human intervention takes over the control process only when the robots encounter the first local minimum and the decision information reaches the decision threshold. The LSTM knowledge base updates the human control information and trains the neural network simultaneously, and accurately and efficiently reloads the control information when facing a similar situation. In addition, the HMRCS realizes an optimal behavioral priority adjustment strategy for human and robot tasks, making the moving trajectories of the robots smoother and more accurate, without abrupt speed changes. Human intervention takes over only once in the simulation under the RLTS with memory.
Fig.9 Distance between the robots and the detected obstacles using the proposed RLTS with memory with one human intervention process and one reloading intervention process in the HMRCS
Fig.10 Robot task mode using the proposed RLTS with memory in the HMRCS with one human intervention process and one reloading intervention process:(a)robot 1;(b)robot 2;(c)robot 3
4.3.3 Performance comparison
After comparing the simulation results of the two methods, their performances are summarized in Table 2. The moving trajectories of the robots when implementing the proposed RLTS with memory are smoother and more accurate, without bursting into the dangerous collision area, which demonstrates that the HMRCS realizes an optimal behavioral priority adjustment strategy for human and robot tasks. After integrating learning and memory abilities into the design of the supervisor, the HMRCS can memorize human intervention history when robot systems encounter emergencies and are not confident in decision making. Then the proposed RLTS with memory reloads the history information when encountering the same situation previously tackled by humans. The frequency of repeated human intervention is thus greatly reduced, and the intelligence of the HMRCS is considerably improved. At the end of the comparative analysis, the training loss curves of the RLTS with memory during 200 and 40 000 epochs are shown in Figs.11 and 12, respectively. These training curves demonstrate that the proposed RLTS with memory has good convergence performance and practicability.
Table 2 Performance comparison
Fig.11 LSTM network training loss of the RLTS with memory during 200 epochs(the training process is carried out after a human intervention task)
In this section, we present an experiment using a group of mobile robots subject to external noise and disturbances, conducted in uncertain real-world environments. Two sets of verification were performed. The first was to verify that the RLTS can determine the optimal priority adjustment strategy to realize dynamic priority switching in real time. The second was to verify that the LSTM knowledge base can effectively reduce the frequency of human intervention. The experimental video of the proposed HMRCS is provided in the supplementary materials and can also be accessed from YouTube (https://youtu.be/gk8SRVTsp64) and Bilibili (https://www.bilibili.com/video/BV1GM4y1F7Lc).
The experimental parameters of the HMRCS were set as follows: the decision threshold was 1.48 m, the obstacle avoidance task gain was 5, the motion task gain was 0.4, and the safe distance was 1.8 m. The initial positions of robots 1, 2, and 3 were (0.5, 7.8), (0.8, 4.8), and (0.5, 1.8) m, respectively. The positions of the obstacles were set as: p_o1, (2.5, 7) m; p_o2, (7.5, 9) m; p_o3, (3.8, 6) m; p_o4, (7.8, 6) m; p_o5, (2.5, 3) m; p_o6, (7.5, 1) m. The positions of the newly detected obstacles were set as: p_o7, (4.2, 4) m; p_o8, (8.2, 4) m. The motion task functions 1, 2, and 3 were set as (0.5 + 0.1t, 8), (1.5 + 0.1t, 5), and (0.5 + 0.1t, 2) m, respectively. The remaining parameters of the RLTS and LSTM were the same as those in Table 1.
Fig.12 Training loss of the RLTS with memory during 40 000 epochs
The experimental scheme of the HMRCS with multiple mobile robots is shown in Fig.13. It is divided into four parts: an HIL decision-and-control center, a Linux-based computing control unit, a mobile robot chassis driving module, and an ultra-wideband (UWB) positioning system. Each robot was equipped with a Raspberry Pi running Ubuntu. The HIL decision-and-control center was responsible for monitoring the operation of the HMRCS and providing inputs to the human intervention task. The robots' position and speed information was collected in real time by the UWB positioning system. The configuration of the physical experimental environment is shown in Fig.14, including eight obstacles, three mobile robots, an HIL decision center, and four UWB positioning base stations and their terminals mounted on the robots. The configuration of a single mobile robot is shown in Fig.15.
Fig.13 Experimental scheme of the HMRCS
Fig.14 Configuration of the human–multi-robot coordination experimental platform
Fig.15 Configuration of a single mobile robot in the HMRCS
Snapshots of the experiment at 0,36,67,78,88,and 100 s are shown in Fig.16.The task status,robot trajectories,and decision-making information during the experiment process are also presented in the snapshots.These results showed that the RLTS can dynamically adjust the behavioral priority.The supervisor can memorize human intervention history when the robots were not confident in decision making,and then reload the history information when encountering a situation that was previously tackled by a person.
In this study, we proposed an RLTS with memory by integrating a DQN and an LSTM knowledge base within an NSBC framework to address the problems of dynamic task priority adjustment and repeated human intervention in HMRCSs. Simulations and experiments were conducted in an uncertain real-world environment to demonstrate the effectiveness of the proposed RLTS. Results showed that RLTS can successfully memorize human intervention history and reload human control input when robots are not confident, greatly improving the robustness and flexibility of the HMRCS. Our future work will focus on multi-agent reinforcement learning in the HMRCS with more complex dynamics and more complicated environmental constraints.
Contributors
Jie HUANG and Zhibin MO designed the research. Zhibin MO and Zhenyi ZHANG processed the data and drafted the paper. Jie HUANG and Yutao CHEN helped organize the paper. Jie HUANG, Zhibin MO, and Yutao CHEN revised and finalized the paper.
Compliance with ethics guidelines
Jie HUANG, Zhibin MO, Zhenyi ZHANG, and Yutao CHEN declare that they have no conflict of interest.
List of supplementary materials
Video S1 Behavioral control based on reinforcement learning for human-multi-robot coordination systems
Frontiers of Information Technology & Electronic Engineering, 2022, Issue 8