Xianjing Xu, Haiyan Jiang
Abstract: Surface electromyography (sEMG) is one of the basic techniques for gesture recognition because it is easy to collect and non-invasive. However, limited by feature extraction and classifier selection, the adaptability and accuracy of conventional machine learning still need to be improved as the input dimension and the number of output classes increase. Moreover, because sEMG data and image data have different characteristics, the conventional convolutional neural network (CNN) is not yet well suited to sEMG signals. In this paper, a novel hybrid model combining a CNN with a graph convolutional network (GCN) is constructed to improve gesture recognition performance. Based on the characteristics of the sEMG signal, the GCN is introduced into the model through a joint voting network to extract the muscle synergy features of the sEMG signal. This strategy optimizes the structure and convolution kernel parameters of the residual network (ResNet), and the classification accuracy on NinaPro DB1 reaches 90.07%. The experimental results and comparisons confirm the superiority of the proposed hybrid model for gesture recognition from sEMG signals.
Keywords: deep learning; graph convolutional network (GCN); gesture recognition; residual network (ResNet); surface electromyographic (sEMG) signals
Surface electromyographic (sEMG) signals are biological electrical signals recorded on the skin surface that contain the movement intentions of the human body. In human-machine interaction (HMI), sEMG signals have inherent advantages such as accessibility and non-invasiveness. They can therefore drive HMI according to the user’s own conscious intent.
An sEMG-based approach, also known as the muscle-computer interface (MCI), is widely used in prosthetic control [1], robot control [2], sign language recognition [3], and HMI [4]. For MCI, one key point is to precisely recognize gestures, including the analysis of different categories such as hand posture, object grasping, and hand movements [5]. Such sEMG-based gesture recognition has been widely used in fields such as computer games, assisted exoskeleton robots [6], and medical rehabilitation robots [7].
Recent research on sEMG-based gesture recognition mainly focuses on machine learning and deep learning. The machine learning approach [8] classifies gestures by extracting sEMG features and constructing classifiers such as the support vector machine (SVM), the naive Bayes classifier, the random forest, and the neural network. Tavakoli et al. used high-dimensional spatial features and an SVM to classify five gestures with a final classification accuracy of 95%–100% [9]. Jiang et al. combined electromyography (EMG) and force myography (FMG) for the gesture recognition of American Sign Language digits with a classification accuracy of 91.6% [10]. Waris et al. extracted sEMG signal features to classify gestures using an artificial neural network [11]. These works indicate that the artificial neural network performs better than the classical K-nearest neighbor (KNN) and SVM.
A key limitation of machine learning methods is feature extraction, which is a cumbersome process. It is therefore crucial to select an optimal classifier for gesture classification according to the selected features. The features most commonly used in sEMG-based gesture recognition include time-domain features, frequency-domain features, and time-frequency features [7,12,13]. However, these features are susceptible to muscle fatigue and electrode displacement, which reduces the robustness and accuracy of gesture recognition. It is therefore desirable to derive the optimal features from the principle of neural control of movement to improve robustness and accuracy [14]. This principle indicates that the muscle synergy, the smallest controlled unit of the central nervous system [15], shows enhanced robustness under the influence of different factors. Luo et al. applied the muscle synergy method to recognize five gestures with 96% classification accuracy [16]. Zhang et al. designed a multi-degree-of-freedom parallel control framework based on the muscle synergy and achieved 96.79% ± 2.46% classification accuracy on 18 finger movements [17]. Masoumdoost et al. extracted the muscle synergy features of six wrist movements and obtained 99.78% ± 0.45% classification accuracy with a multilayer perceptron (MLP) [18]. These studies show that muscle synergy features have promising applications in sEMG-based gesture recognition. However, it is difficult for the convolutional neural network (CNN) to extract this feature because the CNN is designed for Euclidean data, while the muscle synergy feature is graph data. Recently, some literature has reported that the graph convolutional network (GCN) has great potential in the field of action recognition [19]. The work in [20] proposed a recurrent neural network method with global context-aware capability by adding an attention mechanism (AT) to the long short-term memory (LSTM) network, reaching 77.1% classification accuracy on the benchmark dataset NTU-RGBD. Shi et al. designed a topology network, the two-stream adaptive graph convolutional network (2s-AGCN), that can adaptively learn the graph topology for different graph convolution layers and skeleton data, significantly improving recognition accuracy [21]. Therefore, applying the GCN is a promising way to improve the performance of gesture classification.
In the field of sEMG-based gesture recognition, methods based on deep learning have been shown to have more advantages than machine learning in improving classification accuracy, universality, and generalizability [22,23]. For instance, Atzori et al. [24] used a simple four-layer convolutional network to classify 52 gestures on NinaPro DB1 [25], with a 2%–5% improvement in classification accuracy compared with the classical KNN and SVM. Chen et al. [26] proposed a model based on long short-term memory and CNN to recognize gestures, which achieved a classification accuracy of 75.12% on NinaPro DB1. Wei et al. [27] split each segment of the data samples into image patches, which are processed with a parallel multi-stream CNN architecture. The classification accuracy reached 88.2% on NinaPro DB1, but the method is complex.
Here, according to the characteristics of the sEMG signal and the principle of muscle synergy, a novel hybrid model is proposed by combining the residual network (ResNet) and the GCN. The main contributions of this work are as follows.
1) A model combining ResNet and GCN is successfully used for the sEMG-based gesture recognition.
2) The proposed method obtains augmented features by combining ResNet and GCN. Specifically, ResNet is used to extract the sEMG features, GCN is used to extract the muscle synergy features, and the two are then combined to obtain the augmented features.
3) We constructed 3D sEMG images by analyzing the intrinsic mode functions (IMFs) of the sEMG signals obtained from variational mode decomposition (VMD). We also evaluated the proposed method on NinaPro DB1 and compared its performance with existing deep learning methods.
The paper is organized as follows: Section 2 presents the proposed methodology. Section 3 describes the dataset and data preprocessing. Section 4 presents the experimental results and discussion. Section 5 summarizes our work and discusses future work.
The overall architecture of the hybrid model based on ResNet and GCN is shown in Fig. 1. The multichannel sEMG signals are reconstructed into sEMG images of size $L \times M \times C$. First, ResNet and GCN are used to extract the sEMG signal feature $X$ and the muscle synergy feature $F_q$, respectively. $X$ and $F_q$ are then concatenated into the augmented feature $X_e$, which is fed into global average pooling (GAP). Finally, the neural network (NN) is used for gesture recognition to produce the final classification result.
Fig.1 Architecture of the model based on ResNet and GCN
The NN in Fig. 1 has 4 hidden layers with 256, 256, 128, and 128 neurons, respectively. In each hidden layer, a dropout layer is used to prevent overfitting during network training, and the dropout rate is set to 0.2. In the output layer, the softmax function is used as the activation function, and the L2 regularization parameter is set to 0.001. The network is trained with the stochastic gradient descent (SGD) optimizer, where the learning rate and decay are set to $10^{-3}$ and $10^{-4}$, and the number of epochs and the batch size are set to 100 and 32, respectively.
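The settings above can be collected in a few lines. The following is a minimal sketch, not the authors' code: it records the stated hyperparameters and counts the weights of the fully connected head. The input dimension (512) and the number of output classes (52) are assumptions chosen for illustration, since the text does not state the size of the pooled feature.

```python
# Hidden-layer widths, dropout rate and optimizer settings as stated in the text.
HIDDEN_UNITS = [256, 256, 128, 128]
DROPOUT_RATE = 0.2
L2_WEIGHT = 1e-3
LEARNING_RATE = 1e-3
DECAY = 1e-4
EPOCHS = 100
BATCH_SIZE = 32


def dense_parameter_count(input_dim, hidden_units, num_classes):
    """Count weights and biases of a fully connected classifier head."""
    sizes = [input_dim] + list(hidden_units) + [num_classes]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total


# ASSUMPTION: a 512-dimensional pooled feature and 52 gesture classes.
n_params = dense_parameter_count(512, HIDDEN_UNITS, 52)
print(n_params)
```

Dropout and L2 regularization add no trainable parameters, so they do not appear in the count.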
The conventional ResNet solves the problem of model degradation caused by deepening the CNN and is well suited to image classification. However, since the sEMG data of each channel corresponds to the time series signal generated by a single muscle, the classification accuracy of the conventional ResNet is not high when it is used for gesture recognition. To solve this problem and make ResNet better learn the features of sEMG signals, we adjust the convolution kernel size of each ResNet module while preserving the ResNet architecture, so that the degradation problem of deep CNN models remains addressed.
As shown in Fig. 2, the ResNet architecture is composed of 1 initial convolution layer, 8 residual blocks, and 1 fully connected layer. The initial convolution kernel is adjusted to 1×9, the stride is set to 1, and the max pooling layer is removed to avoid information loss. The convolution kernel size in each residual block is set to 3×3. The sEMG feature $X \in \mathbb{R}^{H \times W \times V}$ is obtained from the improved ResNet.
Note: Conv and BN denote the convolution layer and batch normalization, respectively. The number following the layer name denotes the number of filters, and the numbers after the at sign (@) denote the convolution kernel size.
Fig.2 Architecture and parameters of the improved ResNet
CNNs have achieved impressive performance in a wide variety of fields. Despite this merit, the CNN fails to properly handle non-Euclidean data, and the muscle synergy is a particular kind of graph relationship. The most basic characteristic of muscle synergy is that each muscle has its own features, and the synergy between any two muscles is different and uncertain. To overcome this problem, this paper introduces the GCN through a voting network to automatically extract muscle synergy features. The GCN operates directly on non-Euclidean data and can learn node features and structural information end-to-end, which gives it strong applicability.
The GCN used in this paper is shown in Fig. 3. The starting unit of the network is a joint voting network. The feature matrix of each muscle is obtained by voting on the features of the sEMG signal. The voting weight matrix function is expressed as
$$W = \Phi(\psi(X))$$
where $\psi(\cdot)$ is a transform function implemented by a 1×1 convolution, $\Phi$ is the spatial softmax normalization, and $W \in \mathbb{R}^{H \times W \times N}$ is the voting weight, where the $k$th channel $W_k \in \mathbb{R}^{H \times W}$ represents the voting matrix of the $k$th muscle.
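The spatial softmax normalization can be illustrated on a single channel. This is a minimal sketch that assumes the 1×1 convolution has already produced an $H \times W$ score map for one muscle; the function name and the example values are hypothetical.

```python
import math


def spatial_softmax(score_map):
    """Normalize an H x W score map so that all of its entries sum to 1."""
    flat = [v for row in score_map for v in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in score_map]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


# Hypothetical 2 x 3 score map for one muscle (values are made up).
scores = [[0.5, 1.0, -0.2], [2.0, 0.0, 0.3]]
weights = spatial_softmax(scores)
```

Each entry of `weights` then acts as the vote of one spatial position for that muscle, and the largest score receives the largest vote.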
Fig.3 Structures of the proposed GCN
The $k$th muscle feature, denoted as $f_k$, is calculated as follows
$$f_k = \sum_{i} W_{ki}\,\phi(X_i)$$
where $X_i$ refers to the sEMG feature of the $i$th channel, $\phi(\cdot)$ refers to the transform function implemented by a 1×1 convolution layer, and $W_{ki}$, an element of $W_k$, is the voting weight for $X_i$. Thus, we have the feature $F$ of all $N$ muscles as below:
$$F = [f_1, f_2, \ldots, f_N]$$
To extract the muscle synergy features intelligently, the connection graph $G$ between muscles is assumed to be fully connected. Then, $A \in \mathbb{R}^{N \times N}$ is defined as the matrix representing $G$, with all elements equal to 1 except the diagonal elements, which are 0; $I$ is the identity matrix, and $D$ is the degree matrix of $A$. Following the GCN defined in [28], we perform graph reasoning over the feature $F$ with matrix multiplication, resulting in the evolved muscle synergy features $F_e$
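One way to write this update, assuming the symmetric normalization of [28] and the symbols defined in the next sentence (the exact placement of $D$ is an assumption, not confirmed by this text), is:

```latex
F_e = \sigma\left( D^{-\frac{1}{2}} A_e D^{-\frac{1}{2}} F W_e \right)
```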
where $A_e = A + I$, $W_e$ is a trainable transformation matrix, and $\sigma(\cdot)$ is a transform function implemented by a 1×1 convolution with BN and ReLU. The evolved muscle synergy features can be used to augment the sEMG features. Specifically, the evolved muscle synergy features are mapped back to the muscle features and then combined with $X$ to calculate the augmented features.
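A plausible form of this mapping, written from the definitions that follow and offered as an assumption rather than as the authors' exact expression, is:

```latex
C_{ik} = W_{ik}\,\rho\left(F_e^{k}\right)
```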
where $F_e^k$ is the muscle synergy feature evolved from the $k$th muscle, $W_{ik}$ is the same as the voting weight $W_{ki}$, and $\rho(\cdot)$ is a transform function implemented by a 1×1 convolution with BN and ReLU. The final muscle synergy feature $F_q$ is then calculated as the mean value of $C_{ik}$. Finally, we concatenate $X$ and $F_q$ to obtain the augmented feature $X_e$
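A compact way to write this concatenation, assuming that $\tau$ is applied to the concatenated tensor and that $[\cdot,\cdot]$ denotes channel-wise concatenation (both are assumptions), is:

```latex
X_e = \tau\left( [X, F_q] \right)
```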
where $\tau(\cdot)$ is a transform function implemented by a 1×1 convolution with BN and ReLU.
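The graph reasoning step over a fully connected muscle graph can be sketched in plain Python. This is an illustrative simplification under stated assumptions: degrees are computed from $A + I$ as in the usual GCN rule, the transform $\sigma$ is reduced to a ReLU (the 1×1 convolution and BN are omitted), and the muscle features and weights are made-up values.

```python
def normalized_adjacency(n):
    """Return D^-1/2 (A + I) D^-1/2 for a fully connected graph of n nodes."""
    # A has ones off the diagonal; adding I makes every entry equal to 1.
    a_e = [[1.0] * n for _ in range(n)]
    # ASSUMPTION: degrees are taken from A + I, as in the usual GCN rule.
    degree = [sum(row) for row in a_e]
    return [
        [a_e[i][j] / (degree[i] ** 0.5 * degree[j] ** 0.5) for j in range(n)]
        for i in range(n)
    ]


def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
        for row in left
    ]


def relu(matrix):
    """Apply max(0, x) to every entry."""
    return [[max(0.0, v) for v in row] for row in matrix]


def graph_reasoning(features, weight):
    """One propagation step: ReLU(A_hat @ F @ W_e)."""
    a_hat = normalized_adjacency(len(features))
    return relu(matmul(matmul(a_hat, features), weight))


# Three hypothetical muscles with 2-dimensional features; identity W_e.
F = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
W_e = [[1.0, 0.0], [0.0, 1.0]]
F_e = graph_reasoning(F, W_e)
```

With a fully connected graph and this normalization, every muscle receives the average of all muscle features before the learned transformation, which is why each row of `F_e` is the same in this toy case.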
NinaPro is a publicly available database consisting of ten datasets that provide benchmark electromyography data of the upper limbs for testing the accuracy of machine learning algorithms. Among these datasets, NinaPro DB1 consists of sEMG signals recorded from 27 subjects (20 males and 7 females, aged 28 ± 3.4 years) while they performed 52 different hand activities. As shown in Tab. 1, there are 53 labels (No. 0–52) and corresponding gesture categories. Fig. 4 shows the 52 hand movements in the NinaPro DB1 dataset, which are grouped into three exercises: exercise A, exercise B, and exercise C. Exercise A involves basic movements of the fingers, such as flexions and extensions. Exercise B consists of multiple-finger flexion and extension movements as well as wrist movements. Exercise C consists of grasping household objects. Each activity is performed for 5 s, followed by a rest period of 3 s, and repeated over 10 trials. The sEMG sampling rate is 100 Hz.
Tab.1 The label of gesture activity on the NinaPro DB1
Fig.4 The 52 hand movements in the NinaPro DB1:(a) 12 basic movements of the fingers; (b) 8 isometric and isotonic hand configurations and 9 basic movements of the wrist; (c) 23 grasping and functional movements
In real-life conditions, the accuracy of sEMG-based gesture recognition is often affected by undesirable noise and disturbances, such as poor contact, electrode displacement, and the subject’s skin condition. Therefore, several signal processing steps were performed before data analysis and classification. The sEMG signals in the DB1 dataset have already been preliminarily preprocessed, including full-wave rectification, synchronization, and relabeling, so only the high-frequency noise of the acquisition device needs to be considered. In this work, as in previous studies [29–31], a first-order Butterworth filter was used to low-pass filter the raw sEMG of each channel and remove the high-frequency noise. Fig. 5 shows the comparison between the denoised signals and the raw signals.
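A first-order Butterworth low-pass filter is small enough to write out directly. The sketch below derives the coefficients with the bilinear transform and applies them sample by sample. The cutoff frequency is not given in the text, so the 1 Hz value is an assumption for illustration only; the 100 Hz sampling rate is the one stated for NinaPro DB1, and the input signal is synthetic.

```python
import math


def butterworth_lowpass_1st(signal, cutoff_hz, fs_hz):
    """Apply a first-order Butterworth low-pass filter (bilinear transform)."""
    k = math.tan(math.pi * cutoff_hz / fs_hz)  # pre-warped cutoff
    b0 = k / (1.0 + k)
    b1 = b0
    a1 = (k - 1.0) / (k + 1.0)
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in signal:
        # Difference equation: y[n] = b0*x[n] + b1*x[n-1] - a1*y[n-1]
        y = b0 * x + b1 * prev_x - a1 * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out


FS_HZ = 100.0    # sampling rate stated for NinaPro DB1
CUTOFF_HZ = 1.0  # ASSUMPTION: the cutoff is not given in the text
rectified = [abs(math.sin(0.3 * n)) for n in range(500)]  # synthetic signal
smoothed = butterworth_lowpass_1st(rectified, CUTOFF_HZ, FS_HZ)
```

The filter has unit gain at DC and zero gain at the Nyquist frequency, so the slowly varying envelope is kept while fast fluctuations are attenuated. In practice each electrode channel would be filtered independently in this way.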
Fig.5 Comparison of the raw and denoised signals: (a) the raw signals; (b) the denoised signals
To construct more recognizable sEMG images, the VMD method is used to decompose the raw signal into multiple IMFs. Each IMF has its own center frequency, and each mode is smoothed after demodulation. Finally, IMF components with narrow bands are obtained according to the frequency-domain characteristics of the actual signal. The raw sEMG signal and the IMF1–IMF4 components decomposed by the VMD method were each used for training, and the results are shown in Fig. 6. The recognition accuracy with the raw sEMG image was 86.54%, and the accuracy with IMF1–IMF4 reached 91.51%, 89.77%, 78.51%, and 54.07%, respectively. The greater contribution of the low-frequency components may be related to the stronger correlation of amplitude and other time-domain features with the gesture.
After the decomposition using the VMD method, several IMF components were chosen to reconstruct the signals. It can be seen that the accuracy obtained with 3D sEMG images constructed from IMF1–IMF3 is further improved to 92.76%. Therefore, we employed a 5-level VMD decomposition for each channel of the raw sEMG signal and constructed 3D sEMG images. Fig. 7 shows the image construction from a one-dimensional sEMG signal: the processed signal is segmented into a continuous stream of windows with a window length of 200 ms and an increment of 170 ms. Data segmentation not only multiplies the number of samples but also improves the real-time performance of the model [32]; the number of samples per subject can exceed 10 000 after processing. Since the sEMG signal is generated 30–150 ms earlier than the corresponding muscle movement, a 200 ms sliding window can meet the real-time requirements of HMI technology.
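The segmentation step can be sketched as follows, using the 200 ms window, the 170 ms increment, and the 100 Hz sampling rate stated in the text. The function name is hypothetical, and the sketch handles a single channel; dropping an incomplete final window is an assumption.

```python
def sliding_windows(samples, fs_hz=100, window_ms=200, step_ms=170):
    """Split a 1-D signal into fixed-length windows with a fixed increment."""
    window = int(round(window_ms * fs_hz / 1000))  # 20 samples at 100 Hz
    step = int(round(step_ms * fs_hz / 1000))      # 17 samples at 100 Hz
    return [
        samples[start:start + window]
        for start in range(0, len(samples) - window + 1, step)
    ]


# One second of a hypothetical single-channel signal (100 samples).
signal = list(range(100))
windows = sliding_windows(signal)
print(len(windows), len(windows[0]))
```

Here consecutive windows start 17 samples apart, so they share 3 samples, and any trailing samples that do not fill a complete window are dropped.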
Fig.6 The accuracy of the different IMF
Fig.7 Sliding windows of the sEMG data
For each gesture, we followed the same validation scheme described in previous studies [24,29,30,33]. Specifically, the test set is composed of approximately 1/3 of the movement repetitions, and the training set is composed of the remaining repetitions. With this scheme, the model is trained using the cross-entropy loss function to obtain the final classification result. The formula is as follows
$$L = -\frac{1}{n}\sum_{i=1}^{n} y_i^{*} \log y_i$$
where $y_i^{*}$ is the label of the training sample, $y_i$ is the output value of the model, and $n$ is the number of samples.
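As a small numerical illustration of the cross-entropy loss, the sketch below averages the loss over samples with one-hot labels. The one-hot encoding and the small constant `eps` that guards against `log(0)` are assumptions added for the example.

```python
import math


def cross_entropy(labels_one_hot, probabilities, eps=1e-12):
    """Mean categorical cross-entropy over n samples."""
    total = 0.0
    for target, predicted in zip(labels_one_hot, probabilities):
        total -= sum(t * math.log(p + eps) for t, p in zip(target, predicted))
    return total / len(labels_one_hot)


# Two hypothetical samples with three classes each.
labels = [[1, 0, 0], [0, 1, 0]]
outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = cross_entropy(labels, outputs)
```

The loss shrinks toward zero as the softmax output assigns more probability to the labeled class.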
The proposed approach is trained and tested on the DB1 dataset, and the classification accuracy is evaluated as follows
$$\text{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}} \times 100\%$$
Fig. 8 shows the accuracy and loss curves of the model on the training and test sets during training on one subject of NinaPro DB1. It can be seen that the model converges relatively quickly. In addition, the confusion matrix shows that the model has good gesture recognition performance.
Fig. 9 shows the recall and precision of gesture recognition on one subject’s test set. Most of the gestures achieved high recall and precision except for the gesture labeled 49, which, according to the confusion matrix in Fig. 8, is similar in power to the gestures labeled 50, 44, 39, and 34. The overall average recall and precision reached 93.01% and 92.84%, respectively. These results indicate that the proposed model has strong generalization and recognition ability.
Fig. 10 shows the evaluation results of 52 gestures performed by 27 subjects, with an average accuracy of 90.07%. The results show that the proposed model architecture is feasible.
Fig.8 Performance of the model on one subject’s data: (a) the training curve of one subject; (b) the confusion matrix of one subject’s test set
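Per-gesture precision and recall, together with overall accuracy, can be computed from the true and predicted labels as in the sketch below. The label values are made up for illustration and are not results from the paper.

```python
def precision_recall(true_labels, predicted_labels, label):
    """Return (precision, recall) for one gesture label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Hypothetical labels: gesture 49 is partly confused with gestures 50 and 44.
y_true = [49, 49, 50, 44, 49]
y_pred = [49, 50, 50, 44, 44]
p49, r49 = precision_recall(y_true, y_pred, 49)
acc = accuracy(y_true, y_pred)
```

In this toy case gesture 49 has perfect precision but low recall, which is the pattern expected when a gesture is often mistaken for similar ones.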
Fig.9 The recall rate and precision rate of various gestures of one subject’s test set
Fig.10 Classification accuracy of each subject on the NinaPro DB1
As shown in Tab. 2, the proposed model is also compared with existing networks, such as CNN, LS-SVM, CNN-RNN, Multi-view CNN, and EvCNN, on the NinaPro DB1 dataset. The sliding window size of each model in Tab. 2 is set to 200 ms. Compared with ResNet, the accuracy of the proposed model is improved by 7.82%, which demonstrates the effectiveness of the method. Compared with RCNN, multi-stream ResNet, and multi-stream convolutional networks, the gesture classification accuracy of the proposed model is improved by 1.2%, 0.42%, and 0.37%, respectively. The experimental results show that the proposed model achieves higher accuracy than these existing methods in sEMG gesture recognition.
Tab.2 Comparison of the classification accuracy (%) of different types of networks
The performance of the proposed hybrid model based on ResNet and GCN is improved due to the following reasons:
1) The sEMG signal is a strongly nonlinear and non-stationary time series, and VMD handles this kind of signal well. Time-domain characteristics such as the amplitude of the IMFs in the low-frequency domain are strongly correlated with hand gestures. Therefore, the 3D sEMG images constructed from the IMFs obtained by VMD are well suited to gesture classification.
2) The GCN in the proposed model is used to extract the muscle synergy features, which further improves the accuracy of gesture recognition. The architecture of the ResNet also contributes to the performance of the model.
This paper proposed a novel approach based on the ResNet and the GCN to achieve better gesture recognition performance in sEMG-based HMI. Based on the characteristics of sEMG data, the convolution kernel parameters in the ResNet are adjusted for gesture recognition. According to the characteristics of muscle synergy, the GCN was introduced into the model through a joint voting network to extract the muscle synergy features of the sEMG signals. Moreover, VMD was used to decompose the signal into multiple IMF components to reconstruct more recognizable 3D sEMG images. The average classification accuracy of the hybrid model on the NinaPro DB1 dataset reaches 90.07%, which is higher than that of the existing methods compared. The experimental results show that the proposed model is effective owing to the improved convolution kernel parameters of the ResNet, and that introducing the GCN further improves the accuracy of gesture recognition.
Our future work will focus on improving the robustness of the proposed hybrid model through other methods. For example, since sEMG signals are sequential data, we can try to further improve the robustness of recognition by introducing recurrent neural networks (RNNs).
Journal of Beijing Institute of Technology, 2023, Issue 2