Du Jing, Tang Manting, Zhao Li
(1 School of Information Science and Engineering, Southeast University, Nanjing 210096, China)
(2 School of Computational Engineering, Jinling Institute of Technology, Nanjing 211169, China)
Abstract: Because of the excellent performance of Transformer in sequence learning tasks, such as natural language processing, an improved Transformer-like model is proposed that is suitable for speech emotion recognition tasks. To alleviate the prohibitive time consumption and memory footprint caused by softmax inside the multihead attention unit in Transformer, a new linear self-attention algorithm is proposed. The original exponential function is replaced by a Taylor series expansion formula. On the basis of the associative property of matrix products, the time and space complexity of the softmax operation with respect to the input length is reduced from O(N²) to O(N), where N is the sequence length. Experimental results on the emotional corpora of two languages show that the proposed linear attention algorithm can achieve performance similar to the original scaled dot product attention, while the training time and memory cost are reduced by half. Furthermore, the improved model obtains more robust performance on speech emotion recognition compared with the original Transformer.
Key words: transformer; attention mechanism; speech emotion recognition; fast softmax
Speech emotion recognition, one of the key technologies of intelligent human-computer interaction, has received increasing interest[1]. Recurrent neural networks, especially long short-term memory[2] and gated recurrent[3] neural networks, have been firmly established as the main approaches to sequence modeling problems, such as speech emotion recognition[4-5]. However, a recurrent neural network typically performs recursive computation along the positions of the input and output sequences, which prevents parallel training[6]. Especially when handling ultralong sequences, the training efficiency of a recurrent neural network is extremely low because of computer memory constraints.
The Transformer model, based entirely on the self-attention mechanism and introduced by Google[7], solves the above problems effectively. By abandoning time-consuming operations, such as recurrence and convolution, the time cost, as well as the memory footprint, is greatly reduced during training. In the Transformer architecture, multihead attention (MHA) enables parallel training and, compared with the traditional self-attention mechanism, allows the model to attend to information from multiple representation subspaces at different positions so that more of the information in the sequence is retained. At present, MHA has been successfully applied in several fields. For example, India et al.[8] extended multihead self-attention to the field of speech recognition, which mainly solved the recognition problem for variable-length input speech and achieved excellent performance. In the multimodal emotion recognition task on the IEMOCAP dataset[9], MHA was used to concentrate only on the utterances relevant to the target utterance[10], improving the recognition accuracy by 2.42%. In Ref.[11], a dilated residual network combined with MHA was applied to feature learning in speech emotion recognition, which not only alleviated the loss of the features' temporal structure but also captured the relative dependence of elements in progressive feature learning, achieving 67.4% recognition accuracy on the IEMOCAP dataset.
However, the scaled dot product attention (SDPA) computing unit in MHA has quadratic complexity in time and space, which prohibits its application to ultralong sequence inputs. Therefore, Taylor linear attention (TLA) is proposed to address this limitation; it has linear complexity in the input sequence length and dramatically reduces the time cost and memory footprint. The proposed algorithm changes the way attention weights are calculated in SDPA by using the Taylor formula instead of the exponential operation in softmax and by exploiting the associative property of matrix products to avoid the tremendous memory consumption of intermediate matrices. Transformer has achieved remarkable success in natural language processing tasks, such as machine translation[12], since its introduction. In this paper, we extend Transformer to speech emotion recognition and thus propose a Transformer-like model (TLM). The proposed TLA algorithm is shown to achieve emotion recognition performance similar to that of SDPA while requiring far less computational power. Meanwhile, the TLM enhances the position information representation of the acoustic features and thereby obtains more robust emotion recognition performance.
The main implementation unit of MHA in Transformer is scaled dot product attention (SDPA), whose structure is shown in Fig.1. The main idea is to enhance the representation of the current word by introducing context information. The query vector Q in Fig.1 represents the content that the network is interested in. The key vector K is equivalent to the labels of all words in the current sample. The result of the dot product of Q and K reflects the degree of influence of the context words on the central word, and softmax is then used to normalize the correlation weights. Finally, the attention score is obtained by using the correlation matrix to weight the value vector V.
Fig.1 Scaled dot product attention
SDPA is calculated by

$A = \dfrac{QK^{\mathrm{T}}}{\sqrt{d}}$    (1)

$S = \mathrm{softmax}(A)V$    (2)

$\mathrm{softmax}(a_{ij}) = \dfrac{\mathrm{e}^{a_{ij}}}{\sum_{j=1}^{N}\mathrm{e}^{a_{ij}}}$    (3)
where $A$ is the output after scaling, with $a_{ij}$ its $(i, j)$-th element; $S$ is the output of the attention unit; and $Q$, $K$, and $V$ are generated from the input feature vector with the shape of $(N, d)$, so $Q, K, V \in \mathbb{R}^{N\times d}$, where $N$ is the input sequence length and $d$ is the input sequence's dimension. Generally, $N > d$ or even $N \gg d$ holds in ultralong sequence situations.
According to the definition of softmax, Eq.(2) can be mathematically expanded as

$S_i = \dfrac{\sum_{j=1}^{N}\mathrm{e}^{q_i k_j^{\mathrm{T}}/\sqrt{d}}\, v_j}{\sum_{j=1}^{N}\mathrm{e}^{q_i k_j^{\mathrm{T}}/\sqrt{d}}}$    (4)

where $S_i$ is the $i$-th row of $S$, and $q_i$, $k_j$, and $v_j$ denote the $i$-th row of $Q$, the $j$-th row of $K$, and the $j$-th row of $V$, respectively.
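To make the notation above concrete, the following minimal NumPy sketch implements Eqs.(1) to (3) directly; the function name and the max-subtraction trick for numerical stability are illustrative additions, not part of the original formulation.

```python
import numpy as np

def sdpa(Q, K, V):
    """Scaled dot product attention, Eqs.(1)-(2): S = softmax(QK^T / sqrt(d)) V."""
    N, d = Q.shape
    A = Q @ K.T / np.sqrt(d)                  # (N, N) scaled scores, Eq.(1)
    A = A - A.max(axis=1, keepdims=True)      # stabilize the exponentials
    W = np.exp(A)
    W = W / W.sum(axis=1, keepdims=True)      # row-wise softmax, Eq.(3)
    return W @ V                              # weighted values, Eq.(2)

# The explicit (N, N) matrix A is what makes SDPA quadratic in time and memory.
Q = np.random.randn(300, 64); K = np.random.randn(300, 64); V = np.random.randn(300, 64)
S = sdpa(Q, K, V)                             # S has shape (300, 64)
```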
Multihead attention (MHA) is critically important for parallel training in Transformer. By dividing the input vector into multiple feature subspaces and then applying the self-attention mechanism, the model can be trained in parallel while extracting the main information. Compared with the currently mainstream single-head average attention weighting, MHA improves the effective resolution by allowing the model to capture different characteristics of the speech features in different subspaces, avoiding the suppression of such characteristics by average pooling. MHA is calculated by
$H_i = \mathrm{SDPA}(Q_i, K_i, V_i) = \mathrm{SDPA}\!\left(X_iW_i^{Q},\, X_iW_i^{K},\, X_iW_i^{V}\right)$    (5)

$S = \mathrm{Concat}(H_1, H_2, \ldots, H_n)W$    (6)
where $X$ is the input feature sequence and $X_i$ is its $i$-th segment in the feature dimension; $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the linear transformation weights that generate the query $Q_i$, key $K_i$, and value $V_i$ of the $i$-th head; $H_i$ is the attention score of each head; $\mathrm{SDPA}(\cdot)$ is the self-attention unit of each head; $W$ is the output linear transformation weight; and $i = 1, 2, \ldots, n$, where $n$ is the number of heads.
First, the input feature sequence $X$ is equally divided into $n$ segments in the feature dimension, and each segment generates a group of $(Q_i, K_i, V_i)$ after a linear transformation. Then, $H_i$ is calculated for each head. The $n$ attention scores are concatenated in order, as sketched below. Finally, the total attention score is generated from the concatenated vector by a linear transformation.
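A schematic sketch of this head-splitting procedure, under the same $(N, d)$ shape convention, might look as follows; the per-head projection matrices and the attention callable are placeholders for the trained parameters and the SDPA unit, not the authors' implementation.

```python
import numpy as np

def multi_head_attention(X, weights, W_out, attention):
    """Split X along the feature dimension into n heads, attend, then concatenate (Eqs.(5)-(6))."""
    n = len(weights)                                   # number of heads
    segments = np.split(X, n, axis=-1)                 # equal split in the feature dimension
    heads = []
    for X_i, (Wq, Wk, Wv) in zip(segments, weights):
        Q_i, K_i, V_i = X_i @ Wq, X_i @ Wk, X_i @ Wv   # per-head linear transformations
        heads.append(attention(Q_i, K_i, V_i))         # H_i, Eq.(5)
    return np.concatenate(heads, axis=-1) @ W_out      # Eq.(6)

# Example: 8 heads on a (300, 128) input, each head working in a 16-dim subspace;
# the attention argument reuses the sdpa() sketch above.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 128))
weights = [tuple(rng.standard_normal((16, 16)) for _ in range(3)) for _ in range(8)]
W_out = rng.standard_normal((128, 128))
S = multi_head_attention(X, weights, W_out, attention=sdpa)   # shape (300, 128)
```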
An obvious problem arises in the use of MHA. When calculating SDPA, each head needs to use softmax to normalize the dot product of Q and K so that V can be weighted to obtain the score. As dividing subspaces in MHA does not affect the input sequence length, the length of Q and K is still N. As the input sequence length increases, the computational resource demand of each head during training grows quadratically, which is unbearable and also degrades the quality of long-distance dependency modeling in sequence learning.
To eliminate the exponential operation in Eq.(4), the exponential function is expanded as a Taylor series:

$\mathrm{e}^{x} = \sum_{m=0}^{\infty}\dfrac{x^{m}}{m!} = 1 + x + \dfrac{x^{2}}{2!} + \cdots$    (7)

Retaining only the zeroth- and first-order terms gives

$\mathrm{e}^{q_i k_j^{\mathrm{T}}/\sqrt{d}} \approx 1 + \dfrac{q_i k_j^{\mathrm{T}}}{\sqrt{d}}$    (8)

Substituting Eq.(8) into Eq.(4) yields

$S_i = \dfrac{\sum_{j=1}^{N}\left(1 + \dfrac{q_i k_j^{\mathrm{T}}}{\sqrt{d}}\right)v_j}{\sum_{j=1}^{N}\left(1 + \dfrac{q_i k_j^{\mathrm{T}}}{\sqrt{d}}\right)}$    (9)

Eq.(9) can be rewritten as

$S_i = \dfrac{\sum_{j=1}^{N}v_j + \dfrac{1}{\sqrt{d}}\sum_{j=1}^{N}\left(q_i k_j^{\mathrm{T}}\right)v_j}{N + \dfrac{1}{\sqrt{d}}\,q_i\sum_{j=1}^{N}k_j^{\mathrm{T}}}$    (10)

Eq.(10) may be further simplified to the matrix form

$S = \dfrac{\mathbf{1}_N\sum_{j=1}^{N}v_j + \dfrac{1}{\sqrt{d}}\left(QK^{\mathrm{T}}\right)V}{N\mathbf{1}_N + \dfrac{1}{\sqrt{d}}\,QK^{\mathrm{T}}\mathbf{1}_N}$    (11)

where $\mathbf{1}_N$ denotes the all-ones column vector of length $N$, and the division is performed row by row. On the basis of the associative property of matrix multiplication, i.e., $(QK^{\mathrm{T}})V = Q(K^{\mathrm{T}}V)$, Eq.(11) can be further simplified to

$S = \dfrac{\mathbf{1}_N\sum_{j=1}^{N}v_j + \dfrac{1}{\sqrt{d}}\,Q\left(K^{\mathrm{T}}V\right)}{N\mathbf{1}_N + \dfrac{1}{\sqrt{d}}\,Q\left(K^{\mathrm{T}}\mathbf{1}_N\right)}$    (12)

In Eq.(12), $K^{\mathrm{T}}V \in \mathbb{R}^{d\times d}$ and $K^{\mathrm{T}}\mathbf{1}_N \in \mathbb{R}^{d}$ need to be computed only once, so the $N\times N$ attention matrix is never explicitly formed, and the time and space complexity with respect to the sequence length is reduced from O(N²) to O(N).
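To make the regrouping in Eq.(12) concrete, the following sketch computes the linearized attention without ever materializing the N×N matrix; it follows the reconstruction above, so the exact scaling and normalization details should be read as illustrative rather than as the authors' exact implementation.

```python
import numpy as np

def taylor_linear_attention(Q, K, V):
    """Linearized attention following Eq.(12): exp(x) is replaced by 1 + x and
    (QK^T)V is regrouped as Q(K^T V), so no N-by-N matrix is ever formed."""
    N, d = Q.shape
    kv = K.T @ V / np.sqrt(d)                  # (d, d), computed once
    k_sum = K.sum(axis=0) / np.sqrt(d)         # (d,), equals (K^T 1_N) / sqrt(d)
    v_sum = V.sum(axis=0)                      # (d,)
    numerator = v_sum[None, :] + Q @ kv        # (N, d)
    denominator = N + Q @ k_sum                # (N,)
    return numerator / denominator[:, None]    # row-wise normalization

# Memory and time now grow linearly with N instead of quadratically.
Q = np.random.randn(1024, 64); K = np.random.randn(1024, 64); V = np.random.randn(1024, 64)
S = taylor_linear_attention(Q, K, V)           # shape (1024, 64)
```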
The TLM structure is shown in Fig.2.
Fig.2 Model structure
In the position encoding layer, position information must be embedded explicitly: the expression of speech emotion is related to the position of the emotional stimulus, but a model built entirely on the attention mechanism cannot learn the positional relationships among features by itself. The input features are therefore additionally encoded as follows:
$PE(p, 2i) = \sin\!\left(\dfrac{p}{10000^{2i/d}}\right)$    (13)

$PE(p, 2i+1) = \cos\!\left(\dfrac{p}{10000^{2i/d}}\right)$    (14)
where the shape of the original input vector is $(N, d)$; $p \in [0, N)$ represents the $p$-th frame of the inputs; $i \in [0, d/2 - 1]$; and $2i$ and $2i+1$ represent the even and odd dimensions of the current inputs, respectively. The position encoding vector retains the same shape as the original inputs and is then concatenated with the audio feature vector along the feature dimension to generate the input vector of the subsequent network layers with the shape of $(N, 2d)$.
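A possible NumPy realization of Eqs.(13) and (14), followed by the concatenation described above; the helper name and the example shapes are illustrative.

```python
import numpy as np

def positional_encoding(N, d):
    """Sinusoidal position encoding of shape (N, d) as in Eqs.(13)-(14)."""
    p = np.arange(N)[:, None]                        # frame index
    i = np.arange(d // 2)[None, :]                   # dimension-pair index
    angles = p / np.power(10000.0, 2 * i / d)        # (N, d/2)
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions, Eq.(13)
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions, Eq.(14)
    return pe

# Concatenation along the feature axis, as used in the TLM:
features = np.random.randn(300, 64)                  # (N, d) LFBE frames
inputs = np.concatenate([features, positional_encoding(300, 64)], axis=-1)  # (N, 2d)
```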
Then, the TLA unit is adopted in the MHA layer. Considering that MHA at different levels of BERT serves different functions (the bottom layers usually focus more on syntax, while the top layers focus more on semantics), multilayer MHA is also adopted in this paper to learn different levels of speech emotion representation.
The feed-forward layer is composed of two linear transformations with a GELU activation in between, and the calculation process is given by

$F(x) = \mathrm{GELU}(xW_1 + b_1)W_2 + b_2$    (15)

$\mathrm{GELU}(x) = x\Phi(x) \approx 0.5x\left[1 + \tanh\!\left(\sqrt{2/\pi}\left(x + 0.044715x^{3}\right)\right)\right]$    (16)
where $x$ is the input to the current layer; $W_i$ and $b_i$ ($i = 1, 2$) denote the weights and biases to be trained in the $i$-th dense layer; and $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. The Gaussian error linear unit (GELU)[13] is adopted as the activation function, which stochastically regularizes the input vector by weighting each input according to its magnitude.
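As a sketch of Eqs.(15) and (16), assuming the tanh approximation of GELU; the placeholder weights and the 512-dimensional hidden size are illustrative and not taken from Tab.1.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the Gaussian error linear unit, cf. Eq.(16)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer of Eq.(15): two dense layers with GELU in between."""
    return gelu(x @ W1 + b1) @ W2 + b2

# Example with placeholder weights for a 128-dimensional model:
rng = np.random.default_rng(0)
x = rng.standard_normal((300, 128))
W1, b1 = 0.02 * rng.standard_normal((128, 512)), np.zeros(512)
W2, b2 = 0.02 * rng.standard_normal((512, 128)), np.zeros(128)
y = feed_forward(x, W1, b1, W2, b2)        # same shape as x: (300, 128)
```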
As for the connection between layers, a residual connection is adopted, and the final output of the sublayer is normalized by
$O = \mathrm{BatchNorm}[x + \mathrm{Sublayer}(x)]$    (17)
where $x$ is the input of the sublayer, and $\mathrm{Sublayer}(\cdot)$ denotes the function implemented by the sublayer. To facilitate the residual connection, the input and output of each sublayer keep the same dimension.
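A minimal sketch of the sublayer wrapper in Eq.(17); the normalization here uses plain batch statistics without the learned scale and shift of a full batch normalization layer, so it only approximates the layer used in the model.

```python
import numpy as np

def sublayer_connection(x, sublayer, eps=1e-5):
    """Residual connection followed by normalization, Eq.(17): O = Norm[x + Sublayer(x)]."""
    y = x + sublayer(x)                        # residual connection
    mean = y.mean(axis=0, keepdims=True)       # statistics over the batch axis
    var = y.var(axis=0, keepdims=True)
    return (y - mean) / np.sqrt(var + eps)

# e.g. wrapping the feed-forward sketch above:
# O = sublayer_connection(x, lambda t: feed_forward(t, W1, b1, W2, b2))
```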
Finally, the predicted label is output from a fully connected layer through the softmax activation function.
To prevent overfitting, two regularization methods are used. One is to apply dropout before the final output of all sublayers, with a dropout ratio of $P_d = 0.1$. The other is to adopt label smoothing, whereby all one-hot encoded label vectors are smoothed by
$L' = (1 - \varepsilon)L + \dfrac{\varepsilon}{N}$    (18)
where $L$ is the label in one-hot form; $L'$ represents the label after smoothing; $N$ is the number of one-hot encoding states; and $\varepsilon = 0.1$.
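Label smoothing as in Eq.(18) can be sketched as follows; the four-class example mirrors the emotion categories used in the experiments.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing of Eq.(18): L' = (1 - eps) * L + eps / N,
    where N is the number of one-hot states (classes)."""
    num_classes = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / num_classes

# e.g. a one-hot label over the 4 emotion classes used in the experiments:
print(smooth_labels(np.array([0.0, 1.0, 0.0, 0.0])))   # [0.025 0.925 0.025 0.025]
```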
The Adam optimizer is adopted in the training process. Moreover, the warmup learning rate[14] used in the experiments is calculated as follows:
$r_s = (r_0 w)^{-0.5}\min\!\left(s^{-0.5},\ s\,w^{-1.5}\right)$    (19)
where $r_0$ is the initial learning rate; $r_s$ is the learning rate at the current training step $s$; and $w$ denotes the number of warmup steps. When the current step is less than $w$, the learning rate increases linearly; otherwise, it decreases in proportion to the inverse square root of the step number. All parameter settings of the model are shown in Tab.1.
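A sketch of the schedule in Eq.(19) as reconstructed above; the initial learning rate and warmup step values shown here are illustrative only, since the actual settings are those listed in Tab.1.

```python
def warmup_lr(step, r0, warmup):
    """Warmup schedule of Eq.(19): linear increase while step < warmup,
    then decay with the inverse square root of the step number."""
    return (r0 * warmup) ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Illustrative values (not from Tab.1): r0 = 1e-3, 4 000 warmup steps.
lrs = [warmup_lr(s, 1e-3, 4000) for s in range(1, 10001)]
```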
Tab.1 Model parameters
The experiments are performed on EmoDB[15] and URDU[16]. Information on each dataset is shown in Tab.2. Four emotions, namely anger, happiness, neutral, and sadness, are selected in the experiments.
Tab.2 Dataset information
All data samples were resampled to 16 kHz, with a pre-emphasis coefficient of 0.97. Each file was divided into frames of 25-ms width with a stride of 10 ms. Any audio file longer than 300 frames was truncated to 300 frames, while files shorter than 300 frames were padded with zeros, so 300 was taken as the sequence length. Log Mel-filter bank energies (LFBE) were then calculated for each frame with the number of filter banks set to 64. Each dataset was divided into a training set, a validation set, and a test set at a ratio of 8∶1∶1.
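One possible realization of this preprocessing pipeline is sketched below using librosa and python_speech_features; the paper does not state which toolkit was actually used, so the library choice and helper name are assumptions.

```python
import numpy as np
import librosa
from python_speech_features import logfbank

MAX_FRAMES = 300   # sequence length used in the paper

def extract_lfbe(path, n_mels=64):
    """Resample to 16 kHz, frame with 25 ms windows / 10 ms stride, pre-emphasis 0.97,
    compute 64 log Mel-filter bank energies, then truncate or zero-pad to 300 frames."""
    signal, sr = librosa.load(path, sr=16000)
    feats = logfbank(signal, samplerate=sr, winlen=0.025, winstep=0.01,
                     nfilt=n_mels, nfft=512, preemph=0.97)       # (frames, 64)
    if feats.shape[0] >= MAX_FRAMES:
        return feats[:MAX_FRAMES]
    pad = np.zeros((MAX_FRAMES - feats.shape[0], n_mels))
    return np.vstack([feats, pad])
```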
In most MHA models, such as BERT, the feature representation (word embedding) sizes are approximately 300 to 1 024, so the number of heads is empirically set to between 12 and 16[17]. Considering that our feature dimension is 128, we tried 2, 4, 8, 16, and 32 heads (factors of 128) to study the effect of the number of heads in MHA on the performance of speech emotion recognition, as shown in Fig.3. In this experiment, the number of heads was the only variable, and the other parameters remained the same as those listed in Tab.1.
Fig.3 Effect of the number of heads on the performance of emotion recognition
Fig.3 shows that the number of heads does not have a significant effect on the performance of emotion recognition. Because of the redundancy of the attention mechanism, even if the attention heads are computed independently, there is a high probability that they attend to consistent emotional information. Notably, the unweighted average recall (UAR) increases with the number of heads on URDU and EmoDB, indicating that, as the number of heads grows, relatively outlying attention heads can attend to more local emotional information, further improving the model. However, when the number of heads exceeds eight, the UAR is almost unchanged or even decreases slightly, indicating that once the number of heads reaches a certain value, the representation of emotional information provided by the multiple subspaces reaches its upper bound. A further increase in the number of heads may lead to an excessively scattered distribution of emotional information across the feature subspaces, which degrades the emotion recognition performance of the model. Therefore, an appropriate number of heads should be selected in the experiment, large enough to allow outlier heads to learn more subtle emotion expressions, but not so large that the distribution of emotional information becomes too discrete and reduces recognition performance. In this paper, the number of heads is set to eight in the subsequent experiments.
In Transformer, the word vector is added to the positional encoding vector to embed position information, which may not be applicable to speech emotion recognition. Therefore, we selected two embedding methods, namely addition and concatenation, to study the influence of the embedding type on the recognition performance. The other parameters were kept consistent with those shown in Tab.1.
Fig.4 shows the UAR curve on the test set during training. It can be intuitively seen that the recognition performance of the model with feature concatenation is better than that with feature addition. Moreover, the UAR of the model using the addition method fluctuates more after convergence, reflecting that the addition embedding makes the model's emotion recognition performance less stable. This suggests that directly adding the position encoding vector to the input speech features may invalidate the position information embedding and even cause loss of the original emotional information. Consequently, using concatenation in the TLM increases robustness and improves the recognition performance to varying degrees.
Fig.4 Effect of the embedding type of position encoding vectors on the test set
To verify the speech emotion recognition performance of the proposed method, we chose the TLM with the SDPA unit as the baseline[17], where eight heads and the concatenation method were adopted, and the other parameters were consistent with those in Tab.1. Additionally, we chose some classical models for further comparison, namely the support vector machine (SVM)[18] and ResNet[19], which represent the traditional machine learning method and the prevailing CNN framework, respectively. Each model adopted the same input as described in Section 3. The UAR results on each dataset are shown in Tab.3.
Tab.3 Recognition accuracy of different models on different emotion categories %
The Transformer-like model outperforms SVM and ResNet-50, signifying that the TLM is better suited to the speech task. Compared with the baseline, the emotion recognition performance of TLA is, on the whole, not significantly different from that of SDPA, which indicates the effectiveness of the attention unit algorithm proposed in this paper.
The changes in the UAR with the step number and with the training time over 3 000 steps for the baseline and the proposed model are shown in Figs.5 and 6, respectively, under the parameter settings shown in Tab.1. As can be seen, the proposed TLA algorithm and the SDPA algorithm perform similarly in emotion recognition, but the training time cost of the proposed TLA algorithm is far lower than that of the baseline SDPA algorithm, indicating that TLA has lower time complexity.
To further compare the complexity of the proposed TLA, four groups of Transformer-like models were trained on EmoDB. The lengths of the input sequence (LFBE frames) were chosen as 256, 512, 768, and 1 024. The processor used in the experiment was an Intel® Core(TM) i7-8700 CPU @ 3.20 GHz, the GPU was an NVIDIA GeForce RTX 2080 Ti, and the memory size was 16.0 GB. To avoid out-of-memory errors, the batch size was set to eight for training. The other parameters were kept consistent with Tab.1, and each model was iterated for 1 500 steps.
Fig.5 UAR comparison between the baseline and proposed models within 3 000 steps
Fig.6 UAR comparison between the baseline and proposed models with respect to training time within 3 000 steps
Fig.7 shows the training time of the models for the same number of iterations as the input sequence length increases: the time of the baseline grows approximately quadratically, while that of the proposed TLM grows roughly linearly, so the proposed TLM clearly has linear time complexity with respect to the input sequence length. Regarding memory usage, as shown in Fig.8, the memory footprint of the TLM is much smaller than that of the baseline. In addition, when the input feature length reaches 768, the memory usage reaches the upper limit of the available memory, so although the number of input feature frames increases in the subsequent experiments, the measured memory usage remains unchanged. Similar to the time consumption, the memory use of the baseline grows approximately quadratically, while that of the TLM grows roughly linearly, indicating that the proposed model has linear space complexity in terms of the sequence length.
Fig.7 Comparison between the time use of the baseline and proposed methods with different sequence lengths
Fig.8 Comparison between the GPU memory use of the baseline and proposed methods with different sequence lengths
1) The best performance of MHA is obtained with eight heads, indicating that increasing the number of heads improves recognition accuracy only up to a certain limit.
2) For the attention computing unit, the proposed TLA algorithm not only has similar emotion recognition performance to SDPA but also greatly reduces the time cost and memory footprint during training by making use of the Taylor formula and the associative property of matrix products, leading to linear complexity in time and space.
3) For speech emotion recognition tasks, a novel TLM is proposed, achieving final UARs of 74.9% and 80.0% on EmoDB and URDU, respectively. The experimental results demonstrate that the TLM has certain advantages in handling ultralong speech sequences and has promising practical application prospects owing to its greatly reduced demand for computing power.