Minghe Cao and Jianzhong Wang
(School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China)
Abstract: Unmanned weapons have great potential to be widely used in future wars. Gaze-based aiming technology can be applied to control pan-tilt weapon systems remotely with high precision and efficiency. Gaze direction is a combination of head and eye movements and is therefore closely related to head motion. In this paper, a head motion detection method based on the fusion of inertial and visual information is proposed. Inertial sensors measure rotation at high frequency with good short-term performance, while vision sensors are able to eliminate drift. By combining the characteristics of both sensors, the proposed approach achieves high-frequency, real-time, and drift-free head motion detection. Experiments show that our method smooths the output, constrains the drift of inertial measurements, and achieves high detection accuracy.
Key words: gaze aiming; head motion detection; visual-inertial information fusion
Unmanned weapons will play an essential role in future wars. Remote control of unmanned weapon systems through human-machine interaction will remain mainstream for a long time. At present, the teleoperation of unmanned weapon systems relies heavily on control terminals, which brings several drawbacks, including inconvenient operation and imprecise target aiming. Teleoperation based on head-gaze motion can achieve precise aiming of pan-tilt weapon systems easily and efficiently[1].
Head and gaze motions are highly related to human inspection behavior. Head motion naturally extends the field of view available to the gaze. When people are interested in an object, the head naturally turns toward the target area. When gazing at a static target, the head remains still, while when tracking a dynamic target, the head moves to varying degrees. In eye-movement detection, the gaze axis is referenced to the head. Consequently, the detection of head pose plays an essential role in gaze tracking[2-3].
The detection of head motion can be regarded as a state estimation problem in six degrees of freedom, and several methods can be leveraged[4]. Sensors, including inertial sensors and visual sensors, can be attached directly to the head or a helmet to measure the motion. Measurements based on these sensors are easy to set up and thus have broader applications[5-6].
The Inertial Measurement Unit (IMU) is a device that detects the angular velocity and acceleration of a rigid body using a gyroscope and an accelerometer. The main benefit of employing an IMU is that no external transmitter or receiver needs to be attached to the head during detection, which removes constraints on head motion. However, a flaw of the IMU is that its outputs contain measurement noise and drift, so the integrated estimate diverges from the ground truth over time. Vision sensors exploit features of the environment, such as edges and surfaces, to acquire pose information. However, due to their high computational complexity, vision-based algorithms can hardly achieve high real-time performance. Besides, vision-only methods cannot resolve a series of issues such as motion blur, occlusion, lack of features, pure rotation, and scale ambiguity, which cause tracking and positioning failures. The fusion of visual and inertial information can efficiently compensate for the shortcomings of each.
Visual Inertial Odometry (VIO) is a method that detects the pose of a rigid body by combining inertial and visual information[7]. It can be classified into two categories: filter-based[8-10] and optimization-based[11-14]. The filter-based approaches estimate the current state using only one previous state, while the optimization-based methods consider the impact of all states and find solutions by minimizing the errors of optimization functions. Generally, the accuracy of optimization-based approaches is superior to that of filter-based approaches. However, due to the larger volume of data, the optimization-based approaches have poorer runtime performance. Alternatively, based on how the visual and inertial information are associated, VIO can also be divided into two groups: loosely-coupled[10-11] and tightly-coupled[8-9,14]. The loosely-coupled approaches process visual and inertial information separately and fuse the two in the back-end. In contrast, the tightly-coupled techniques treat visual features and inertial measurements as one state vector and find the solution based on this state vector. In tightly-coupled optimization-based approaches, various kinds of visual information are used in the front-end. OKVIS[12] uses Harris corners to extract keypoints and describes them with the BRISK descriptor. VINS-Mono[14] employs KLT sparse optical flow. VIORB[13] exploits ORB features. Other than odometry, VIO approaches can also be applied to applications such as Virtual Reality and Augmented Reality[15].
In this paper, a head motion detection method based on the fusion of visual and inertial information is proposed. The inertial information is used as a high-frequency base output to ensure real-time performance. Rectification is performed by constructing an optimization function that minimizes the reprojection errors and the inertial bound. The rectified data is then smoothed by filtering. The proposed technique delivers a high-frequency output with real-time, drift-free, and smooth head motion detection.
The gyroscope of the IMU measures the angular velocity of the body frame b, corrupted by a bias b_g and white noise n_g:
$$\tilde{\omega}_b = \omega_b + b_g + n_g \tag{1}$$
The orientation of the body frame with respect to the world frame w is represented by the unit quaternion q_wb, whose continuous-time kinematics are
$$\dot{q}_{wb} = \frac{1}{2}\, q_{wb} \otimes \begin{bmatrix} 0 \\ \omega_b \end{bmatrix} \tag{2}$$
where ⊗ denotes quaternion multiplication. In the continuous-time model, the integration of the state from the current state i to the next time step j can be expressed as
$$q_{wb_j} = q_{wb_i} \otimes \int_{t_i}^{t_j} \frac{1}{2}\, q_{b_i b_t} \otimes \begin{bmatrix} 0 \\ \tilde{\omega}_{b_t} - b_g \end{bmatrix} \mathrm{d}t \tag{3}$$
Typically, the output frequency of inertial sensors is higher than that of visual sensors, so there are multiple inertial measurements between consecutive images. In the optimization problem, every time the image poses are adjusted, the inertial measurements would need to be re-integrated, which is computationally time-consuming. To avoid this re-integration, we adopt a pre-integration model. Denoting by Δq_ij the rotation between times i and j in the frame b and letting q_wb_j = q_wb_i ⊗ Δq_ij, from Eq. (3) we have
$$\Delta q_{ij} = \int_{t_i}^{t_j} \frac{1}{2}\, \Delta q_{it} \otimes \begin{bmatrix} 0 \\ \tilde{\omega}_{b_t} - b_g \end{bmatrix} \mathrm{d}t \tag{4}$$
which depends only on the gyroscope measurements and not on the state at time i.
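As an illustration of the pre-integration in Eq. (4), the following Python sketch accumulates Δq_ij from the gyroscope samples collected between two image timestamps. The function name, the use of scipy's Rotation class, and the constant sampling interval dt are assumptions made here for illustration, not part of the original implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def preintegrate_rotation(gyro_samples, dt, gyro_bias=np.zeros(3)):
    """Accumulate the pre-integrated rotation Delta q_ij from gyroscope samples
    taken between image times i and j (Eq. (4)); dt is the IMU sampling interval."""
    dq = R.identity()
    for omega in gyro_samples:
        # each sample contributes the small rotation exp((omega - b_g) * dt),
        # which to first order equals the quaternion [1, (omega - b_g) * dt / 2]
        dq = dq * R.from_rotvec((np.asarray(omega) - gyro_bias) * dt)
    return dq  # independent of the state at time i, so no re-integration is needed

# Usage sketch (hypothetical 200 Hz IMU): q_wb_j = q_wb_i * preintegrate_rotation(samples, 1.0 / 200.0)
```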
For the visual information, several methods such as HOG, SIFT, and SURF can be leveraged to extract environmental features[16]. Oriented FAST and Rotated BRIEF (ORB) is a fast and robust local feature detector composed of a keypoint detector and a descriptor. To extract ORB features, FAST keypoints are detected first and ranked with the Harris corner measure. Because FAST features have neither an orientation nor multi-scale support, ORB builds a multi-scale image pyramid by downsampling the image by a scale factor at each level, detects keypoints at every level, and assigns an orientation to each keypoint. The keypoints are then scored and selected by quality and finally described with the BRIEF descriptor.
After extracting ORB features, feature matching can be employed to find the same environmental point in different image views. The similarity between two descriptors is obtained by computing their distance; for the binary BRIEF descriptor, we use the Hamming distance. Some features are inevitably mismatched during feature matching, so RANSAC is used to cull outliers.
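A minimal OpenCV sketch of the feature pipeline described above: ORB extraction on two views, brute-force Hamming matching, and RANSAC-based outlier rejection using a fundamental-matrix model. The file names, feature count, and RANSAC thresholds are assumed values rather than the settings used in the paper.

```python
import cv2
import numpy as np

# Extract ORB features (oriented FAST keypoints + BRIEF descriptors) on two views.
orb = cv2.ORB_create(nfeatures=1000, scaleFactor=1.2, nlevels=8)
img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input images
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match binary descriptors with the Hamming distance.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)

# Cull mismatches with RANSAC on a fundamental-matrix model.
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
inlier_matches = [m for m, keep in zip(matches, mask.ravel()) if keep]
```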
$$e_M(c_i, f_j) = z_{c_i f_j} - \pi(x, f_j) \tag{5}$$
where π(·) is the reprojection of a feature from the 3D world frame onto the 2D camera plane, z_{c_i f_j} is the observation of feature f_j in camera c_i, and x is the state vector to be optimized.
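The reprojection error of Eq. (5) can be written as a short helper under a simple pinhole model. The function and variable names below, and the representation of the camera pose as a rotation matrix R_cw and translation t_cw, are assumptions chosen for illustration.

```python
import numpy as np

def reprojection_error(K, R_cw, t_cw, p_w, z_obs):
    """e = z - pi(.): difference between the observed pixel z_obs and the projection
    of the 3D feature p_w (world frame) into the camera described by (R_cw, t_cw, K)."""
    p_c = R_cw @ p_w + t_cw              # transform the feature into the camera frame
    uv = (K @ (p_c / p_c[2]))[:2]        # perspective division, then apply intrinsics
    return z_obs - uv
```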
The head motion detection system mainly includes inertial and visual information acquisition, visual-inertial fusion based rectification, feature pointcloud initialization and update, and output filtering. The inertial module collects data from the accelerometer and gyroscope at high frequency and provides short-term motion states through the integration model. The visual module acquires images and extracts features of the surrounding environment at low frequency. The visual-inertial fusion based rectification combines the inertial bound and the visual matches to construct an optimization function and computes a locally drift-free pose by solving it. After the fusion, images that satisfy certain conditions are selected as keyframes to update the feature pointcloud. The output filter applies the low-frequency rectification to the high-frequency inertial data, which is then smoothed as the final output. The flow chart of head motion detection is illustrated in Fig. 1.
Fig.1 Flow chart of head motion detection
To ensure that the visual information provides a global reference in which the error of the visual-inertial fusion does not accumulate over time, we construct a feature pointcloud in the world coordinate frame from the visual features and their positions. Before the detection system runs, we initialize a prior ORB feature pointcloud to guarantee the accuracy and stability of this reference. The tester collects images of the environment by rotating and translating the head. After the collection, we perform full bundle adjustment (BA) over the collected images to obtain the optimized camera poses and feature positions, and finally stitch the features together as the initial feature pointcloud. With this pointcloud, subsequent images have a reference against which to build reprojection errors.
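A compact sketch of the full bundle adjustment used to build the prior pointcloud, written with scipy's least_squares as a stand-in solver. The flat observation list, the simple pinhole model, the robust Huber loss, and all names are assumptions; the intrinsic matrix K is treated as known.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation as R

def full_ba(cam_rotvecs, cam_ts, points_w, observations, K):
    """Jointly refine camera poses and 3D feature positions from pixel observations.
    observations is a list of (cam_idx, point_idx, u, v) tuples."""
    n_cams, n_pts = len(cam_rotvecs), len(points_w)

    def unpack(x):
        rv = x[:3 * n_cams].reshape(n_cams, 3)
        t = x[3 * n_cams:6 * n_cams].reshape(n_cams, 3)
        p = x[6 * n_cams:].reshape(n_pts, 3)
        return rv, t, p

    def residuals(x):
        rv, t, p = unpack(x)
        res = []
        for ci, pi, u, v in observations:
            p_c = R.from_rotvec(rv[ci]).apply(p[pi]) + t[ci]   # world -> camera frame
            uv = (K @ (p_c / p_c[2]))[:2]                      # pinhole projection
            res.extend([u - uv[0], v - uv[1]])
        return np.asarray(res)

    x0 = np.concatenate([np.ravel(cam_rotvecs), np.ravel(cam_ts), np.ravel(points_w)])
    sol = least_squares(residuals, x0, loss='huber', f_scale=1.0)  # robust loss (assumed)
    return unpack(sol.x)
```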
The visual-inertial fusion provides rectification to the inertial data. In addition, some images from the outcome of the visual-inertial fusion are selected as keyframes to update the pointcloud. Specifically, the selection conditions include: the image has sufficient ORB features, the number of features has decreased by 50% compared with the last keyframe, and the relative pose exceeds a certain threshold.
After selecting a set of keyframes, we keep the prior pointcloud fixed, employ BA to optimize the keyframe poses and features, and finally update the pointcloud by inserting the optimized features.
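The keyframe conditions above can be expressed as a small predicate. The threshold values, the argument names, and the way the conditions are combined are hypothetical, since the paper does not specify them.

```python
def is_keyframe(n_features, n_tracked_from_last_kf, n_last_kf_features,
                rel_rotation_deg, min_features=100, rot_threshold_deg=10.0):
    """Decide whether the current frame should become a keyframe: enough ORB features,
    a 50% drop in features tracked from the last keyframe, or a large relative pose change."""
    has_enough_features = n_features >= min_features
    features_dropped = n_tracked_from_last_kf < 0.5 * n_last_kf_features
    large_motion = rel_rotation_deg > rot_threshold_deg
    return has_enough_features and (features_dropped or large_motion)
```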
When computing the visual-inertial rectification, the optimization-based approach is used again. Unlike the feature pointcloud construction, for real-time rectification we optimize only the current frame j, minimizing the feature reprojection errors of all matched features together with an IMU error term linking it to the previous frame.
Specifically, given the current image c_j, we first compute the reprojection errors by extracting its ORB features and matching them against the pointcloud. Let x_i and x_j be the IMU states corresponding to two successive images. We can then establish the optimization function
$$x_j^* = \arg\min_{x_j} \Big\{ \rho\!\left(e_M(x_j)^{\mathrm T}\, \Lambda_f\, e_M(x_j)\right) + \rho\!\left(e_I(x_i, x_j)^{\mathrm T}\, \Lambda_I\, e_I(x_i, x_j)\right) \Big\} \tag{6}$$
where e_M(·) is the reprojection error term of all matched points, e_I(·) is the inertial error term of the pre-integration, Λ_f and Λ_I are the information matrices of the visual and inertial information, respectively, and ρ(·) denotes the Huber cost function. By linearizing Eq. (6), the problem can be solved with the Gauss-Newton algorithm.
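The sketch below solves a simplified version of Eq. (6) for the orientation of the current frame, using scipy's least_squares (a trust-region solver with a Huber loss) in place of a hand-written Gauss-Newton step. Translation, IMU biases, the camera-IMU extrinsics, and the full information matrices are omitted; scalar weights stand in for Λ_f and Λ_I, and all names are assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation as R

def rectify_frame(points_w, obs_uv, K, rot_i, dq_preint, rotvec_j_init,
                  w_visual=1.0, w_inertial=50.0):
    """Estimate the orientation of frame j by minimizing the reprojection error of the
    matched pointcloud features plus an inertial bound to frame i (simplified Eq. (6))."""
    def residuals(rotvec_j):
        R_j = R.from_rotvec(rotvec_j)
        # visual term: project map points with the candidate orientation
        # (pure-rotation camera model, an assumed simplification)
        p_c = R_j.inv().apply(points_w)
        uv = (K @ (p_c / p_c[:, 2:3]).T)[:2].T
        e_visual = w_visual * (obs_uv - uv).ravel()
        # inertial term: deviation from the pre-integrated relative rotation
        e_inertial = w_inertial * ((rot_i * dq_preint).inv() * R_j).as_rotvec()
        return np.concatenate([e_visual, e_inertial])

    sol = least_squares(residuals, rotvec_j_init, loss='huber', f_scale=1.0)
    return R.from_rotvec(sol.x)
```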
The inertial information provides high-frequency base measurements, while the visual-inertial fusion imposes rectifications on them. However, due to the difference in frequency, a simple fusion may produce discontinuities, leading to an unstable control signal. Therefore, the rectified head motion is passed through a fixed-window filter that weights the poses within the window and smooths the output. For a unit quaternion q = [w, x, y, z]^T, the conversion to the axis-angle representation can be expressed as
$$\theta = 2\arccos(w), \qquad \mathbf{u} = \frac{[x,\; y,\; z]^{\mathrm T}}{\sin(\theta/2)} \tag{7}$$
where the axis-angle vector satisfies ρ = θu, in which θ and u are the rotation angle and axis. Let the rotation at time i be q_i and its axis-angle representation be ρ_i, and set the window size to N. The output pose is
$$\hat{\rho}_i = \sum_{k=0}^{N-1} w_k\, \rho_{i-k}, \qquad \sum_{k=0}^{N-1} w_k = 1 \tag{8}$$
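A sketch of the fixed-window smoother of Eqs. (7)-(8). Equal weights and the conversion of the averaged axis-angle vector back to a quaternion are assumptions; scipy's as_rotvec performs the quaternion to axis-angle conversion of Eq. (7), and note that scipy stores quaternions in [x, y, z, w] order rather than the [w, x, y, z] convention used above.

```python
import numpy as np
from collections import deque
from scipy.spatial.transform import Rotation as R

class OutputSmoother:
    """Fixed-window filter over axis-angle vectors rho = theta * u (Eqs. (7)-(8))."""
    def __init__(self, window_size=10):
        self.window = deque(maxlen=window_size)

    def update(self, quat_xyzw):
        rho = R.from_quat(quat_xyzw).as_rotvec()    # Eq. (7): quaternion -> theta * u
        self.window.append(rho)
        rho_hat = np.mean(self.window, axis=0)      # Eq. (8) with equal weights (assumed)
        return R.from_rotvec(rho_hat).as_quat()     # back to a quaternion for output

# Usage sketch: smoother = OutputSmoother(window_size=10); q_out = smoother.update(q_in)
```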
Experiments were carried out to measure head motion while remotely controlling a pan-tilt system to track a target with our gaze tracking system. The experimental scene is presented in Fig. 2. A tester wore the experimental system to track a target moving in front of the pan-tilt system. The experimental system is composed of a head detection module, a gaze tracking module, and a ground-truth IMU module. The head detection module consists of an RGB-D camera and a BMI088 IMU, and the ground-truth module uses a 3DM-GX5-25 IMU. In actual operation, the head and gaze motions are combined to control the pan-tilt system; in this experiment, only head motion is considered.
Fig.2 Experiment scene
Because the inertial measurements and the visual-inertial fusion have different output frequencies, imposing the rectification directly could result in a segmented output. Fig. 3 presents the raw output of head motion when tracking the target. The x-axis is time, and the y-axis is the pitch rotation in degrees. Fig. 3a shows the result of directly imposing the rectification, and Fig. 3b gives the result of filtering with a window of size N = 10. It can be observed that the filtered curve bridges the discontinuities caused by the rectification and smooths the raw output.
Fig.3 Smoothed output
The results of our method are compared with the outcome of raw IMU integration, and the errors with respect to the ground truth are shown in Fig. 4. The x-axis is the experimental time and the y-axis is the rotation error in the pan (Fig. 4a) and tilt (Fig. 4b) directions. Fig. 4 shows that the visual-inertial approach achieves good detection accuracy, with mean errors of less than one degree in both directions. Moreover, due to measurement noise, the error of the inertial-only measurements increases with time, while the error of our visual-inertial method does not accumulate, thanks to the global feature pointcloud reference. The results show that our approach can efficiently rectify the inertial drift and achieves good detection performance in both control directions.
Fig.4 Detection error
In this paper, a head motion detection method based on the fusion of visual and inertial information is proposed. The high-frequency inertial information is used as the short-term base output owing to its good rotation-measurement characteristics. The visual information is employed to construct a feature pointcloud that provides a global, drift-free reference. The rectification exploits the inertial bound between two images and the reprojection errors against the feature pointcloud, and imposes corrections on the inertial measurements. Finally, the rectified data is smoothed by filtering. The experimental results show that the proposed visual-inertial head motion detection method can detect head motion smoothly and with good accuracy.