A Method upon Deep Learning for Speech Emotion Recognition

Abstract. Feature extraction and emotional classification play significant roles in speech emotion recognition. It is hard to extract and select the optimal features, and researchers cannot be sure what the features should be. With deep learning approaches, features can be extracted by hierarchical abstraction layers, but this requires high computational resources and a large amount of data. In this article, we choose the static, differential, and acceleration coefficients of the log Mel-spectrogram as inputs for the deep learning model. To avoid performance degradation, we also add a skip connection with dilated convolution network integration. All representations are fed into a self-attention mechanism with bidirectional recurrent neural networks to learn long-term global features and exploit the context of each time step. Finally, we investigate the contrastive-center loss combined with the softmax loss as the loss function to improve the accuracy of emotion recognition. To validate robustness and effectiveness, we tested the proposed method on the Emo-DB and ERC2019 datasets. Experimental results show that the performance of the proposed method is strongly comparable with existing state-of-the-art methods on the Emo-DB and ERC2019, with 88% and 67%, respectively.

Introduction
Speech emotion recognition (SER) plays an important role in Human-Computer Interaction (HCI) and is also one of the most significant aspects of human communication. It has been widely applied in health, education, robotics, and customer service systems. Yoon et al. [1] proposed an SER agent for mobile communication services. Huahu et al. [2] integrated an SER model into an intelligent household robot platform. Cen et al. [3] explored emotional recognition of continuous speech and developed a real-time SER system that can be applied to an online learning system.
There are two important roles in the SER system: (1) feature extraction, which extracts features from raw audio/speech data, and (2) emotional classification, which decides the emotional state of the speech. Feature extraction and selection are among the key points in SER systems, but nobody is quite sure what the features should be. Researchers have used a variety of feature sets, such as Mel-frequency cepstral coefficients (MFCC) [4-7], linear predictive coding (LPC) [7], signal energy, pitch, and zero-crossing rate [8-11]. The best approach is to use deep learning, as it extracts features through hierarchical abstraction layers. With the development of deep learning and neural networks, the convolutional neural network (CNN) can perform better than traditional techniques in the SER challenge [12-14]. Zhao et al. [15] found that a 1-D CNN with long short-term memory (LSTM) and a 2-D CNN with LSTM could explore the local and global features from raw audio/speech and the log Mel-spectrogram, respectively. Chan et al. [16] proved that the 2-
For classifying the emotional states, many classifier schemes have been used for SER, such as the hidden Markov model (HMM), Gaussian mixture model (GMM), support vector machine (SVM), artificial neural network (ANN), k-nearest neighbors (k-NN), and others. The HMM classifier has been applied widely in speech applications and emotional classification. Schuller et al. [24] used a continuous HMM for SER, Li et al. [25] investigated a hybrid deep neural network (DNN) HMM with discriminative pre-training for SER, and Nwe et al. [4] proposed a method using short-time log frequency power coefficient (LFPC) features and classified the emotional states with a discrete HMM. As a special continuous HMM, the GMM has also been used to classify emotional states from speech with global features extracted from training utterances. Tashev et al. [26] combined a GMM-based front end with a DNN to extract both low-level and high-level features. Navyasri et al. [27] employed the GMM to classify emotional states from speech features extracted by MFCC, spectral centroid, spectral skewness, and spectral pitch chroma. Shahin et al. [28] proposed a novel hybrid sequential GMM-DNN classifier that gave significantly better accuracy than the SVM and multilayer perceptron (MLP) classifiers.
Lanjewar et al. [29] investigated and compared GMM and k-NN classifiers to recognize six emotional states from speech features extracted by MFCC, wavelets, and the pitch of vocal traces. Other researchers have optimized loss functions to train state-of-the-art DNNs for SER. Tripathi and Zhu proposed a focal loss to improve the accuracy of the emotion recognition system [30, 31]. Meng and Dai proposed a novel approach to discriminate emotional states from speech features by combining the center loss with the softmax loss as the loss function [18, 32]. This research is motivated by these previous works.
Meng et al. [18] proposed a novel architecture, ADRNN (dilated CNN with a residual block and bidirectional long short-term memory (Bi-LSTM) based on the attention mechanism), which applied the dilated CNN to extract features from the 3-D log Mel-spectrogram and combined the softmax loss with the center loss to improve the accuracy of emotion recognition. The ADRNN outperforms Chen's ACRNN (3-D attention-based convolutional recurrent neural networks) architecture with the softmax loss in [17] and the center loss in [32]. However, Qi et al. [33] proved that, due to the weakness of the center loss, the contrastive-center loss outperforms the center loss for deep neural networks. Therefore, we investigate the combination of the contrastive-center loss with the softmax loss to improve the accuracy of emotion recognition.
In this article, we choose the 3-D static, differential, and acceleration coefficients of the log Mel-spectrogram extracted from the raw signal as inputs for the proposed model. Then all features are fed into the ADCRNN architecture (attention-based dilated convolution and bidirectional recurrent neural networks) to extract high-level features. Finally, we use the contrastive-center loss with the softmax loss to classify the emotional states. Furthermore, we also adopt dropout and batch normalization (BN) to normalize the features and improve the performance of the deep neural network by avoiding the vanishing gradient problem during training. Our proposed method is tested on the benchmark Emo-DB and validated on the ERC2019. Our contributions in this research are summarized below:
• We utilize the ADCRNN to extract spatial local features, learn the sequential global features, and exploit the context of each time step from spectrogram-based inputs including static, differential, and acceleration coefficients.
• We also validate and compare the proposed loss and the center loss on both our ADCRNN and the ACRNN [17] architectures.
• Experimental results show that the proposed method outperforms the existing state-of-the-art methods on the Emo-DB and ERC2019, reaching 88% and 67% accuracy, respectively.
This article is organized as follows: Section 2 describes the methodology in detail, Section 3 shows the experimental results and comparison, and Section 4 describes the conclusion.

Generate 3-D log Mel-spectrogram
The 3-D log Mel-spectrograms (static, differential, and acceleration coefficients) are used as the inputs for our proposed method. Given a raw audio/speech signal, we compute the log Mel-filterbank energy features with a sample rate of 16 kHz, 40 filters in the filterbank, and an FFT size of 512. Besides, we choose an analysis window length of 0.025 sec and a step of 0.01 sec between successive windows. Furthermore, to obtain the 40 Mel filterbanks, we set the lowest band edge of the Mel filters, the highest band edge of the Mel filters, and the pre-emphasis filter coefficient to 300 Hz, 8,000 Hz, and 0.97, respectively.
The static coefficient of the log Mel-spectrogram is obtained following the six steps below:
• Firstly, the Mel scale frequency analysis [34] was computed as below:

M(freq) = 1125 ln(1 + freq / 700),    (1)

where M(freq) is the Mel scale converted from the frequency freq. The lowest and highest frequencies were converted to 401.25 Mels and 2,835.00 Mels, respectively.
• Secondly, we need at least 42 points to get the 40 filterbanks. So, we added 40 points spaced linearly between the lowest and highest Mel points.
• Thirdly, we inverted the Mel scale back to frequency as in Eq. (2) so that there were 42 frequency points between 300 Hz and 8,000 Hz, as mentioned before:

M^{-1}(m) = 700 (exp(m / 1125) - 1),    (2)

where M^{-1}(m) is the frequency inverted from the Mel scale m.
• Next, we had to round those frequency points to the nearest FFT bin numbers because we could not obtain the exact frequency resolution calculated above. The FFT bin numbers can be computed as follows:

f(n) = floor((nFFT + 1) · h(n) / sp),    (3)

where nFFT is the FFT size, sp is the sample rate, h(n) is the frequency point in Hertz, and floor is the function that gives the greatest integer less than or equal to its real-number input.
• Then, the filterbanks can be defined as follows:

H_m(k) = 0                                   for k < f(m-1),
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))      for f(m-1) ≤ k ≤ f(m),
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))      for f(m) ≤ k ≤ f(m+1),
H_m(k) = 0                                   for k > f(m+1),        (4)

where k is the FFT bin number, m indexes the filterbanks we wanted, and f() is the list of m + 2 points computed above. After that, we passed the power spectrum calculated by the short-time Fourier transform (STFT) through the Mel filterbanks to get the Mel-spectrogram.
• Finally, we computed the log Mel-spectrogram (logM) by taking the logarithm of the Mel-spectrogram, which is the first dimension of the inputs.
After computing the static coefficient of the log Mel-spectrogram, we obtained the differential (second dimension) and acceleration (third dimension) coefficients of the inputs as below:

Δ(M)_t = ( Σ_{n=1}^{N} n (logM_{t+n} − logM_{t−n}) ) / ( 2 Σ_{n=1}^{N} n^2 ),    (5)

where Δ(M)_t is the differential coefficient (deltas) at frame t, computed as a time derivative of the static coefficient from logM_{t+N} to logM_{t−N}, and N is set to 2 in this equation. The acceleration coefficient (delta-deltas) is computed in the same way, but with the differential (deltas) coefficient as its input.
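For illustration, the following NumPy sketch mirrors the steps above and Eqs. (1)-(5); the parameter values follow the ones stated earlier, while the function names and the commented usage lines are only illustrative:

import numpy as np

def mel_filterbank(nfilt=40, nfft=512, sr=16000, low=300.0, high=8000.0):
    # Eq. (1): convert the band edges to the Mel scale and space 42 points linearly.
    mel_points = np.linspace(1125.0 * np.log(1.0 + low / 700.0),
                             1125.0 * np.log(1.0 + high / 700.0), nfilt + 2)
    # Eq. (2): invert the Mel points back to frequency (Hertz).
    hz_points = 700.0 * (np.exp(mel_points / 1125.0) - 1.0)
    # Eq. (3): round the frequency points to the nearest FFT bin numbers.
    bins = np.floor((nfft + 1) * hz_points / sr).astype(int)
    # Eq. (4): triangular filters between consecutive bin numbers.
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for m in range(1, nfilt + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def deltas(feat, N=2):
    # Eq. (5): time derivative over a window of +/- N frames (N = 2 here).
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    return sum(n * (padded[N + n:N + n + len(feat)] - padded[N - n:N - n + len(feat)])
               for n in range(1, N + 1)) / denom

# Usage (illustrative): 'power_spec' is the STFT power spectrum with shape (frames, 257).
# log_mel = np.log(np.dot(power_spec, mel_filterbank().T))                        # static
# inputs_3d = np.stack([log_mel, deltas(log_mel), deltas(deltas(log_mel))], axis=-1)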
We combined the 3-D log Mel-spectrogram features X ∈ R^{t×f×c} as the inputs of the proposed model, where t is set to 0.3 sec as the chunk size of the audio/speech duration, f is set to the 40 filters of the filterbank, and c is set to 3 channels (dimensions) representing the static, differential, and acceleration coefficients, respectively. The 3-D log Mel-spectrogram corresponding to the wave data of an audio/speech signal is shown in Fig. 1.

1) Dilated convolution neural networks

Dilated convolutions were introduced in [35] to design a new convolutional network module for dense prediction. Based on dilated convolutions, the receptive field expands exponentially without loss of resolution or coverage, while the number of parameters grows linearly. The dilated convolution with a filter of height P and width Q is defined as follows:

y(p, q) = Σ_{i=1}^{P} Σ_{j=1}^{Q} x(p + dr·i, q + dr·j) F(i, j),    (6)

where the dilation rate dr is applied to the filter F(i, j). Therefore, the original convolution is the dilated one with dr = 1.
The illustration of a 3×3 kernel with the dilation rate dr = 2 is shown in Fig. 3.
In this subsection, we design three DCNN layers following one original CNN layer and a Max-Pool layer. We utilize the CNN layer with a 3×3 kernel size and a stride of 1. To down-sample the representation, a Max-Pool layer with a 2×4 kernel size and a 2×4 stride is added. Each DCNN layer has a 3×3 kernel size, a stride of 1, and a dilation rate of 2. Furthermore, we add a skip connection with a dilated convolution (Skip-Net) to avoid performance degradation. The padding is set to "VALID" for both the CNN and Max-Pool layers, while it is "SAME" in all the DCNN layers. Instead of ReLU, we use the Leaky ReLU activation function to mitigate the vanishing gradient problem and ensure the non-linearity of the deep neural network when the input value is negative. The Leaky ReLU activation function g(x) produces a non-zero output for a negative input:

g(x) = x     if x ≥ 0,
g(x) = αx    if x < 0,        (7)

where α is a constant in the range (0, 1). We choose α = 0.01 for the Leaky ReLU activation function in this experiment.
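As a minimal sketch (assuming tf.keras layers; the number of filters per layer is not specified above and is a placeholder), the convolutional front end described here could be written as:

import tensorflow as tf
from tensorflow.keras import layers

def dcnn_front_end(inputs, filters=64, alpha=0.01):
    # Original CNN layer: 3x3 kernel, stride 1, "VALID" padding, Leaky ReLU.
    x = layers.Conv2D(filters, 3, strides=1, padding='valid')(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU(alpha)(x)
    # Max-Pool layer: 2x4 kernel and 2x4 stride to down-sample the representation.
    x = layers.MaxPooling2D(pool_size=(2, 4), strides=(2, 4), padding='valid')(x)

    # Skip connection realized with a dilated convolution (Skip-Net).
    skip = layers.Conv2D(filters, 3, padding='same', dilation_rate=2)(x)

    # Three DCNN layers: 3x3 kernel, stride 1, dilation rate 2, "SAME" padding.
    for _ in range(3):
        x = layers.Conv2D(filters, 3, strides=1, padding='same', dilation_rate=2)(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(alpha)(x)

    # Residual addition of the skip connection to avoid performance degradation.
    return layers.Add()([x, skip])

# Illustrative input: 30 frames (0.3 sec at a 10 ms step) x 40 filters x 3 channels.
features = dcnn_front_end(tf.keras.Input(shape=(30, 40, 3)))

The batch normalization layers in the sketch correspond to the BN trick mentioned earlier.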

2) Recurrent neural networks
Recurrent neural networks (RNN) have solved many sequential problems by learning historical features. They have been widely used in natural language processing, video processing, and time series prediction. However, they have a limitation with long-term dependencies. Hochreiter et al. [36] proposed the LSTM to deal with complex, artificial long-term tasks. In this study, the effectiveness of the BiLSTM proposed by Schuster et al. [37] for SER is investigated.
The BiLSTM can learn the sequential features in both forward and backward directions by splitting the neurons of a regular RNN. Therefore, the BiLSTM can use input information from the future and the past of the current time step for prediction. Each LSTM cell is updated as in Eqs. (8)-(12):

i_t = σ_g(W_i z_t + U_i h_{t−1} + b_i),    (8)
f_t = σ_g(W_f z_t + U_f h_{t−1} + b_f),    (9)
o_t = σ_g(W_o z_t + U_o h_{t−1} + b_o),    (10)
c_t = f_t · c_{t−1} + i_t · τ_g(W_c z_t + U_c h_{t−1} + b_c),    (11)
h_t = o_t · τ_g(c_t),    (12)

where σ_g and τ_g denote the sigmoid and tanh activation functions, and the (·) operator is the element-wise product. The i_t, f_t, o_t, c_t, z_t, and h_t represent the input gate, forget gate, output gate, cell state with a self-recurrence, input vector, and hidden state at time step t, respectively. The weight matrices are denoted by W and U, and the corresponding bias vectors by b.
In this implementation, we choose 256 cell units for each LSTM direction. All features obtained from the DCNN framework are fed into the BiLSTM to learn the sequential global features.
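As an illustration of Eqs. (8)-(12), a plain NumPy version of one LSTM cell update could look like the following (the weight shapes are assumed; in practice the BiLSTM layer of the deep learning framework is used):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(z_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts holding one weight matrix / bias vector per gate.
    i_t = sigmoid(W['i'] @ z_t + U['i'] @ h_prev + b['i'])    # Eq. (8): input gate
    f_t = sigmoid(W['f'] @ z_t + U['f'] @ h_prev + b['f'])    # Eq. (9): forget gate
    o_t = sigmoid(W['o'] @ z_t + U['o'] @ h_prev + b['o'])    # Eq. (10): output gate
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ z_t + U['c'] @ h_prev + b['c'])  # Eq. (11): cell state
    h_t = o_t * np.tanh(c_t)                                  # Eq. (12): hidden state
    return h_t, c_t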

3) Attention mechanism
After performing the BiLSTM to learn sequential features, we add the attention mechanism to exploit the context of each time step. The attention-based model has been used effectively in most sequence-to-sequence and SER tasks [17, 18, 20-23, 38, 39]. In the SER task, not all emotional content of the speech signal contributes equally to representing the emotional states. Hence, in this research, the attention-based method is constructed to concentrate on the specific parts of the spectrogram-based features that are most involved in the emotional classification task.
In this experiment, the attention structure over the BiLSTM at time step t is defined as below:

Att = Σ_t β_t · h_t,    (13)

where Att is the output of the attention layer and β_t is the attention weight computed by the softmax function as follows:

β_t = exp(W · h_t) / Σ_{t'} exp(W · h_{t'}),    (14)

where the (·) operator is the element-wise product, W is a trainable parameter, and h_t is the hidden state at time step t from the BiLSTM, that is, h_t = [→h_t ; ←h_t] (the concatenation of the forward and backward hidden states). Next, we integrate the contrastive-center loss with the softmax loss to classify the higher-level representation from the attention output.
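A minimal sketch of the BiLSTM-plus-attention stage, assuming tf.keras (the reshape from the convolutional feature map to one feature vector per time step is an assumption, since the exact flattening is not detailed above):

import tensorflow as tf
from tensorflow.keras import layers

def bilstm_attention(features):
    # Collapse the frequency and channel axes so each time step has one feature vector.
    seq = layers.Reshape((features.shape[1], features.shape[2] * features.shape[3]))(features)

    # BiLSTM with 256 cell units per direction; h_t = [forward h_t ; backward h_t].
    h = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(seq)

    # Eq. (14): attention weights beta_t from a softmax over trainable scores W . h_t.
    scores = layers.Dense(1)(h)
    beta = tf.nn.softmax(scores, axis=1)

    # Eq. (13): attention output Att = sum_t beta_t * h_t.
    return tf.reduce_sum(beta * h, axis=1)

# Example with the front-end output from the previous sketch: att = bilstm_attention(features)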

Loss function for classification
For most classification tasks, the softmax cross-entropy with logits is commonly used for multi-class classification. In this article, we not only want to separate the emotional states but also to discriminate between them. To achieve this, we integrate the contrastive-center loss with the softmax loss as the loss function to update the weights during the training process. The contrastive-center loss [33] performs better than the center loss for deep neural networks and classification problems.
Due to the weakness of the center loss, the contrastive-center loss has been proposed to enforce intra-class compactness and inter-class separability as follows:

L_{ct-c} = (1/2) Σ_{i=1}^{S} ||x_i − C_{y_i}||_2^2 / ( Σ_{m=1, m≠y_i}^{E} ||x_i − C_m||_2^2 + λ ),    (15)

where S denotes the number of training samples in a mini-batch, ||x_i − C_{y_i}||_2^2 denotes the distance between a training sample and its corresponding class center, ||x_i − C_m||_2^2 denotes the distances between a training sample and the non-corresponding class centers, E denotes the number of classes, and the constant λ is set to 1 to ensure that the denominator does not equal zero. Finally, we add the softmax loss to the contrastive-center loss to obtain the final loss function used to update the weights during training:

L = L_S + L_{ct-c},    (16)

where L_S denotes the softmax (cross-entropy) loss.
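The following TensorFlow sketch expresses Eqs. (15)-(16) as loss functions; how the class centers C are created and updated is not specified above, so they are simply passed in as a tensor here:

import tensorflow as tf

def contrastive_center_loss(x, labels, centers, lam=1.0):
    # x: (S, d) features, labels: (S,) integer class ids, centers: (E, d) class centers.
    num_classes = tf.shape(centers)[0]
    # Squared Euclidean distances between every sample and every class center: (S, E).
    dists = tf.reduce_sum(tf.square(x[:, None, :] - centers[None, :, :]), axis=-1)
    one_hot = tf.one_hot(labels, num_classes)
    intra = tf.reduce_sum(dists * one_hot, axis=1)                # ||x_i - C_{y_i}||^2
    inter = tf.reduce_sum(dists * (1.0 - one_hot), axis=1) + lam  # sum over m != y_i, plus lambda
    return 0.5 * tf.reduce_sum(intra / inter)                     # Eq. (15)

def total_loss(logits, x, labels, centers):
    # Eq. (16): softmax (cross-entropy) loss plus the contrastive-center loss.
    softmax_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    return softmax_loss + contrastive_center_loss(x, labels, centers)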

Datasets
To evaluate the robustness and effectiveness of our proposed method with the 3-D log Mel-spectrogram, we used the Berlin Database of Emotional Speech (Emo-DB) and the Emotion Recognition Challenge 2019 dataset (ERC2019).

Experimental setup
The system used for running the experiments is built on an Intel Core i5 8th Gen CPU with an NVIDIA 1080 Ti graphics card. We used the TensorFlow deep learning framework [42] to implement the whole model.
For feature extraction, the python_speech_features library [43] is used to compute and extract the 3-D log Mel-spectrogram as follows (a short sketch is given after this list):
• Compute the static coefficient from the raw audio/speech signal: the window length, the step between windows, the FFT size, and the pre-emphasis coefficient are set by default to 0.025 sec, 0.01 sec, 512, and 0.97, respectively. The sampling rate, the lowest band edge of the Mel filters, and the highest band edge of the Mel filters are set to 16 kHz, 300 Hz, and 8,000 Hz, respectively.
• The differential coefficient is computed by taking the time derivative of the static coefficient.
• The acceleration coefficient is computed by taking the time derivative of the differential coefficient.
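The sketch below illustrates this extraction with the python_speech_features functions logfbank and delta, under the parameter settings listed above (the wav file name is a placeholder):

import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import logfbank, delta

rate, signal = wav.read('speech.wav')   # placeholder file; 16 kHz audio assumed

# Static coefficient: log Mel-filterbank energies with the stated parameters.
static = logfbank(signal, samplerate=rate, winlen=0.025, winstep=0.01,
                  nfilt=40, nfft=512, lowfreq=300, highfreq=8000, preemph=0.97)

# Differential and acceleration coefficients: time derivatives with N = 2.
differential = delta(static, 2)
acceleration = delta(differential, 2)

# 3-D log Mel-spectrogram input: (frames, 40 filters, 3 channels).
inputs_3d = np.stack([static, differential, acceleration], axis=-1)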
For parameter optimization, we set the batch size to 32 to be compatible with the limited memory. Then, we chose the Adam optimizer with a learning rate of 10^{-4}. Besides, we integrated the contrastive-center loss with the standard softmax loss for the proposed model to improve the classification performance. Furthermore, we employed k-fold cross-validation with k = 5 and report the mean and standard deviation of the accuracy.

1) Experiment on the Emo-DB

In Tab. 2, our proposed loss function achieves better accuracy than the center loss [18] in both the ACRNN [17] and our ADCRNN architectures, reaching 0.86 ± 0.01 and 0.88 ± 0.03, respectively. The results in Tab. 2 also show that our ADCRNN architecture, with both the proposed loss and the center loss, achieves higher accuracy than the ADRNN with the center loss [18], with 0.88 ± 0.03, 0.86 ± 0.05, and 0.85 ± 0.02, respectively.
The confusion matrix in Fig. 6 shows the model's predicted results and the ground-truth labels for each emotional state. The A, B, D, F, H, S, and N labels represent the anger, boredom, disgust, fear, happiness, sadness, and neutral emotional states, respectively. As the confusion matrix in Fig. 6 shows, the proposed method is better than the previous methods in [17, 18]. Compared with [18], our proposed method achieves 3%, 27%, and 11% better performance for the fear, happy, and bored emotional states, respectively. Besides, our proposed method is 12%, 36%, 1%, 17%, and 7% more accurate than [17] for the angry, happy, fear, bored, and neutral emotional states, respectively.

2) Experiment on the ERC2019
Tab. 3: The comparison of accuracy on the ERC2019.

Model        Loss function    Accuracy
ACRNN [17]   Center [18]      0.63 ± 0.01
ACRNN [17]   Proposed         0.63 ± 0.00
ADCRNN       Center [18]      0.65 ± 0.01
ADCRNN       Proposed         0.67 ± 0.01

In Tab. 3, our proposed loss function achieves better accuracy than the center loss [18] with our ADCRNN architecture: the proposed loss reaches 0.67 ± 0.01 while the center loss reaches 0.65 ± 0.01. Besides, the accuracy is almost the same for the center and proposed loss functions with the ACRNN architecture. The results in Tab. 3 also show that our ADCRNN architecture with the proposed loss function is better than the others on the ERC2019.
The confusion matrix in Fig. 7 shows the model's predicted results and the ground-truth labels for each emotional state. The N, H, S, A, F, and D labels represent the neutral, happiness, sadness, anger, fear, and disgust emotional states, respectively.

Conclusions
In this article, we proposed the ADCRNN architecture with the contrastive-center loss for SER systems. The model not only learns spatial local features with the DCNN, but also learns long-term global features and exploits the context of each time step with a bidirectional RNN and an attention mechanism, starting from the 3-D log Mel-spectrogram (static, differential, and acceleration coefficients) of the raw speech/audio signal. The DCNN consists of a residual block with three dilated convolution layers, each followed by a Leaky ReLU activation function, and a skip connection with one dilated convolution layer. Then, we employed the contrastive-center loss together with the softmax loss to improve the classification performance.
The proposed method was tested on the benchmark Emo-DB and also validated on the ERC2019, which was used in the Emotion Recognition Challenge. The experimental results show that the proposed model with the contrastive-center loss not only extracts the spatial features of the 3-D log Mel-spectrogram and learns the long-term global features, but also discriminates between the emotional states instead of merely separating them. Our proposed method achieved better accuracy than state-of-the-art methods, with 88% and 67% on the Emo-DB and ERC2019, respectively.