
Emotion Recognition from Speech


Introduction

Speech Emotion Recognition (SER) is a field of study in Artificial Intelligence (AI) that aims to automatically recognize human emotions from speech. The ability to understand human emotions can be beneficial in many domains, such as psychotherapy, customer service, and human-robot interaction. The RAVDESS dataset is a popular database for speech emotion recognition research. It contains audio and video recordings of actors performing short clips of emotional speech, labeled with eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised. In this project, we will explore deep learning techniques to build an SER model using the RAVDESS dataset.

Deep learning models have become the state-of-the-art approach in SER because of their ability to learn complex patterns in speech signals. Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are two popular deep learning models used for SER. CNNs are particularly suitable for speech analysis because they can automatically learn relevant features from the spectrogram of the audio signal. RNNs, on the other hand, can capture the temporal information of the audio signal by modeling the sequence of the acoustic features.

The RAVDESS dataset has been widely used in the research community to develop and evaluate deep learning models for SER. In this project, we will focus on building a CNN-based model for speech emotion recognition. We will preprocess the audio files to extract Mel-frequency cepstral coefficients (MFCCs), chroma features, and pitch and magnitude statistics. These features will be concatenated into a fixed-length feature vector for each clip and fed into a one-dimensional CNN to train a model for speech emotion recognition. We will evaluate the model's performance primarily with accuracy; the saved predictions also allow per-class precision, recall, and F1-score to be computed.

The remainder of this project is organized as follows. In the next section, we will discuss the RAVDESS dataset in detail, including its characteristics, structure, and contents. We will also discuss the preprocessing steps required to prepare the dataset for deep learning models. In the following section, we will describe the CNN architecture used for SER and the training process. We will also present the evaluation metrics used to measure the performance of the model. Finally, we will present the results of the experiments and discuss the limitations and potential improvements of the model.

Import Libraries

Code
import matplotlib.pyplot as plt
import os
import librosa
import librosa.display

import IPython.display as ipd
from IPython.display import Image

import seaborn as sns
sns.set_palette('Set2')
from librosa.core import pitch

import numpy as np
import pandas as pd

import keras
import tensorflow
from tensorflow.keras import optimizers
from tensorflow.python.keras import Sequential
from keras.models import Model
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D #, AveragePooling1D
from keras.layers import Flatten, Dropout, Activation # Input, 
from keras.layers import Dense #, Embedding
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

import wave
import sys
import csv
from PIL import Image
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader

import warnings
warnings.filterwarnings('ignore')

Preparing the Data

Plotting the waveforms of audio samples for different emotions

Anger:

Code
data, sampling_rate = librosa.load('Dataset/anger/anger001.wav')
ipd.Audio('Dataset/anger/anger001.wav')
Code
plt.figure(figsize=(12, 4))
librosa.display.waveshow(data, sr=sampling_rate, alpha=0.4)
plt.title('Waveform for Anger Audio Sample')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.savefig('anger.png')
plt.show()

Fear:

Code
data, sampling_rate = librosa.load('Dataset/fear/fear001.wav')
ipd.Audio('Dataset/fear/fear001.wav')
Code
plt.figure(figsize=(12, 4))
librosa.display.waveshow(data, sr=sampling_rate, alpha=0.4)
plt.title('Waveform for Fear Audio Sample')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.savefig('fear.png')
plt.show()

Happy:

Code
data, sampling_rate = librosa.load('Dataset/happy/happy002.wav')
ipd.Audio('Dataset/happy/happy002.wav')
Code
plt.figure(figsize=(12, 4))
librosa.display.waveshow(data, sr=sampling_rate, alpha=0.4)
plt.title('Waveform for Happy Audio Sample')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.savefig('happy.png')
plt.show()

The shape of spectrograms, which represent the frequency content of an audio signal over time, can vary depending on the characteristics of the audio signal. In the context of emotion recognition, different emotions may be associated with different patterns of spectral energy distribution and temporal dynamics in the audio signal. For example, angry or fearful speech may be characterized by higher spectral energy in the high-frequency range and shorter temporal duration, while happy or neutral speech may be characterized by lower spectral energy and longer duration. Machine learning models can be trained to automatically extract these patterns and recognize the corresponding emotions based on the spectrogram features.
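
For reference, an actual mel spectrogram of one of these clips can be generated with librosa; this is a short sketch, assuming the same Dataset path and the imports loaded above.

Code
# Sketch: a mel spectrogram of the anger clip shown above (path assumed from the anger example).
y, sr = librosa.load('Dataset/anger/anger001.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)           # convert power to decibels for display
plt.figure(figsize=(12, 4))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram for Anger Audio Sample')
plt.show()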

Functions for converting our data into CSV

The create_meta_csv function creates a CSV file that contains the file paths and corresponding labels of the WAV files in a dataset directory. The function first resolves the absolute path of the dataset directory and, if the destination path is None, falls back to the dataset path before building the path of the CSV file to be created. It then walks the dataset directory and builds a list of tuples, where each tuple contains a file path and its label (the index of the emotion name taken from the file's parent directory). Finally, the function opens the CSV file and writes the header row and the data rows from that list. The function returns True to indicate that it has completed successfully.

Code
np.random.seed(42)

def create_meta_csv(dataset_path, destination_path):

    # Get the absolute path of the dataset directory
    DATASET_PATH = os.path.abspath(dataset_path)

    # If the destination path is None, fall back to the dataset path
    if destination_path is None:
        destination_path = DATASET_PATH

    # Set the path of the CSV file that will be created
    csv_path = os.path.join(destination_path, 'dataset_attr.csv')

    # Create an empty list to hold the file paths of all the WAV files
    flist = []

    # Define a list of emotions that will be used as labels in the CSV file
    emotions = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

    # Walk the dataset directory and collect the file paths of all the WAV files
    for root, dirs, files in os.walk(DATASET_PATH, topdown=False):
        for name in files:
            if name.endswith('.wav'):
                flist.append(os.path.join(root, name))

    # Build a list of (path, label) tuples; the label is the index of the
    # emotion name, which is the parent directory of each file
    types = []
    for fullName in flist:
        parts = fullName.split(os.sep)
        types.append((fullName, emotions.index(parts[-2])))

    # Open the CSV file, write the header row, and write the data rows from the types list
    with open(csv_path, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(("path", "label"))
        writer.writerows(types)

    # Return True to indicate that the function has completed successfully
    return True

The create_and_load_meta_csv_df function creates and loads a CSV file containing metadata about an audio dataset located at dataset_path. The function also provides options to shuffle the rows of the dataset randomly and to split the dataset into training and testing sets. If randomize is True (or if a split is requested while randomize was left unset), the function shuffles the rows of the DataFrame. If split is not None, the function splits the DataFrame into training and testing sets and returns all three DataFrames; otherwise it returns just the full DataFrame. The create_meta_csv function is called to create the CSV file containing the metadata, which is then loaded into a Pandas DataFrame.

Code
def create_and_load_meta_csv_df(dataset_path, destination_path, randomize=True, split=None):

    # Call create_meta_csv to generate the CSV file containing metadata about the audio dataset
    if create_meta_csv(dataset_path, destination_path=destination_path):
        # Load the CSV file into a Pandas DataFrame
        dframe = pd.read_csv(os.path.join(destination_path, 'dataset_attr.csv'))

    # If randomize is True, or if a split is requested while randomize was left unset,
    # shuffle the rows of the DataFrame
    if randomize or (split is not None and randomize is None):
        dframe = dframe.sample(frac=1).reset_index(drop=True)

    # If split is not None, split the DataFrame into training and testing sets
    if split is not None:
        train_set, test_set = train_test_split(dframe, split)
        return dframe, train_set, test_set

    # Return the DataFrame
    return dframe

The train_test_split function takes in a DataFrame dframe and a split ratio split_ratio, which determines the proportion of data used for training. It splits the DataFrame into two separate DataFrames: the training set, which contains the first split_ratio fraction of the rows, and the testing set, which contains the remaining rows. The function then resets the index of the testing set to start at 0 and returns the two DataFrames as separate objects. This kind of split is commonly used in machine learning to evaluate the performance of a model on new, unseen data.

Code
def train_test_split(dframe, split_ratio):

    # Split the DataFrame into training and testing sets based on the split ratio
    train_data= dframe.iloc[:int((split_ratio) * len(dframe)), :]
    test_data= dframe.iloc[int((split_ratio) * len(dframe)):,:]
    
    # Reset the index of the testing set to start at 0
    test_data=test_data.reset_index(drop=True) 
    
    # Return the training and testing sets as separate DataFrames
    return train_data, test_data

Running all the above functions:

Code
# Set the path to the audio dataset directory and print it
dataset_path =  './Dataset'
print("dataset_path : ", dataset_path)

# Set the destination path to the current working directory
destination_path = os.getcwd()

# Set the number of classes in the audio dataset
classes = 7

# Set the total number of rows in the audio dataset
total_rows = 2556

# Set the randomize and clear flags to True
randomize = True
clear = True

# Create and load the metadata CSV file for the audio dataset, and split it into training and testing sets
df, train_df, test_df = create_and_load_meta_csv_df(dataset_path, destination_path=destination_path, randomize=randomize, split=0.99)
dataset_path :  ./Dataset

Seeing a sample of the training data:

Code
train_df.head()
path label
0 /Users/rd/Documents/Academic Portfolio/Speech ... 2
1 /Users/rd/Documents/Academic Portfolio/Speech ... 4
2 /Users/rd/Documents/Academic Portfolio/Speech ... 0
3 /Users/rd/Documents/Academic Portfolio/Speech ... 3
4 /Users/rd/Documents/Academic Portfolio/Speech ... 6

Visualizing the Data

Labels assigned for emotions:

  • 0 : anger
  • 1 : disgust
  • 2 : fear
  • 3 : happy
  • 4 : neutral
  • 5 : sad
  • 6 : surprise

Counting the number of emotions

Code
# Get the unique labels in the training set of the Emotion dataset and print them in sorted order
unique_labels = train_df.label.unique()
unique_labels.sort()
print("Unique labels in Emotion dataset : ")
print(*unique_labels, sep=', ')

# Get the count of each unique label in the training set of the Emotion dataset and print them
unique_labels_counts = train_df.label.value_counts(sort=False)
print("\nCount of unique labels in Emotion dataset : ")
print(*unique_labels_counts,sep=', ')
Unique labels in Emotion dataset : 
0, 1, 2, 3, 4, 5, 6

Count of unique labels in Emotion dataset : 
429, 306, 433, 433, 250, 430, 249
Code
# Histogram of the labels
plt.figure(figsize=(10, 5))
sns.countplot(x='label', data=train_df)
plt.xticks(unique_labels)
plt.ylabel('Count')
plt.xlabel('Emotion')
_ = plt.title('Distribution of Emotion in Data')
plt.savefig('emotion_distribution.png')
plt.show()

The histogram shows the distribution of label counts in the emotion dataset, with the label numbers on the x-axis and the count of each label on the y-axis. The dataset is only moderately balanced: anger (0), fear (2), happy (3), and sad (5) each have roughly 430 samples, disgust (1) has about 306, while neutral (4) and surprise (6) have only around 250 each. The majority classes therefore have roughly 1.7 times as many samples as the smallest ones, which is worth keeping in mind, since the model will see the under-represented emotions less often during training.
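
If this imbalance turns out to hurt the under-represented classes, one option is to pass per-class weights to model.fit later on. A minimal sketch, assuming train_df from above and scikit-learn's compute_class_weight:

Code
# Sketch: derive per-class weights from the training label counts (rarer labels get larger weights).
from sklearn.utils.class_weight import compute_class_weight

label_values = train_df['label'].values
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(label_values),
                               y=label_values)
class_weight = dict(zip(np.unique(label_values), weights))
print(class_weight)
# Later: model.fit(..., class_weight=class_weight)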

Pre-Processing the Data

Audio Features (a short sketch computing each of these for a single clip appears after the list below):

  • Mel Frequency Cepstral Coefficients (MFCC) : It is a feature extraction technique widely used in speech processing and recognition. It involves extracting the spectral envelope of a speech signal, typically using the Discrete Fourier Transform (DFT), and then mapping it to the mel frequency scale, which better reflects human perception of sound. From this mel-scaled spectrogram, the MFCCs are obtained by taking the logarithm of the power spectrum and performing a discrete cosine transform. The resulting MFCCs capture the most relevant information of the speech signal, such as phonetic content, speaker identity, and emotion. MFCCs are commonly used as inputs to machine learning algorithms for speech recognition and related tasks.
  • Chroma feature extraction: It is a technique used to represent the harmonic content of an audio signal in a compact manner. Chroma features are based on the pitch class profiles of musical notes, which are invariant to octave transposition and are typically represented using a circular layout called the chroma circle. Chroma features can be computed from the short-time Fourier transform (STFT) of an audio signal, by first mapping the power spectrum to the pitch class domain and then summing the energy of each pitch class over time. The resulting chroma feature matrix can be used as input to machine learning algorithms for tasks such as music genre classification, chord recognition, and melody extraction.
  • Pitch: It is a perceptual attribute of sound that allows us to distinguish between high and low frequency sounds. It is closely related to the physical property of frequency, which is the number of cycles per second that a sound wave completes. High-pitched sounds have a high frequency, while low-pitched sounds have a low frequency. In music, pitch is used to describe the perceived height or lowness of a musical note. Pitch can be manipulated by altering the frequency of a sound wave using techniques such as tuning or modulation. Pitch perception is an important aspect of human auditory processing and is essential for tasks such as speech recognition and music appreciation.
  • Magnitude: In signal processing, magnitude refers to the amplitude or strength of a signal, which is a measure of how much energy is contained in the signal. It is typically calculated as the absolute value of a complex number, which is a mathematical representation of a signal that includes both its magnitude and phase. Magnitude can be used to describe various characteristics of a signal, such as its power, energy, or intensity. For example, in the context of audio signal processing, magnitude can be used to represent the loudness or volume of a sound, while in image processing, magnitude can be used to represent the strength of different frequencies in an image.
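
As mentioned above, here is a short sketch computing each of these raw features for one clip with librosa, to show the shapes they take before any averaging (the file path and imports are assumed from the examples above):

Code
# Sketch: computing each raw feature for one clip to see its shape.
y, sr = librosa.load('Dataset/anger/anger001.wav', duration=2.5, offset=0.5)

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape (13, n_frames)
pitches, magnitudes = librosa.piptrack(y=y, sr=sr)         # shape (n_freq_bins, n_frames) each
y_harmonic, _ = librosa.effects.hpss(y)
chroma = librosa.feature.chroma_cqt(y=y_harmonic, sr=sr)   # shape (12, n_frames)

print(mfccs.shape, pitches.shape, magnitudes.shape, chroma.shape)
# Averaging over the time axis (as get_audio_features below does) turns each of these
# into a fixed-length vector, independent of the clip's exact duration.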

Functions for getting features of audio files

The function get_audio_features extracts audio features from a given audio file path using the Librosa library in Python. It loads 2.5 seconds of the audio file at the specified path (resampled to twice the passed sampling rate, with a 0.5-second offset) and separates the harmonic and percussive components of the signal. It then estimates pitches and their magnitudes with librosa's piptrack algorithm and keeps the first 20 non-zero values of the time-averaged pitch and magnitude arrays. The function also computes the mean Mel-frequency cepstral coefficients (MFCCs) of the signal and the mean chroma feature of the harmonic component using the Constant-Q Transform (CQT). Finally, it returns a list containing the MFCC, pitch, magnitude, and chroma features, which is used downstream as input to the classification model.

Code
def get_audio_features(audio_path,sampling_rate):
    
    # Load audio file at given path, resample to target rate, and extract features
    X, sample_rate = librosa.load(audio_path, res_type='kaiser_fast', duration=2.5, sr=sampling_rate*2, offset=0.5)

    # Convert sample rate to a NumPy array for consistency
    sample_rate = np.array(sample_rate)

    # Separate harmonic and percussive components of audio signal
    y_harmonic, y_percussive = librosa.effects.hpss(X)

    # Compute pitch and magnitude of audio signal using PIPtrack algorithm
    pitches, magnitudes = librosa.core.pitch.piptrack(y=X, sr=sample_rate)

    # Compute Mel-frequency cepstral coefficients (MFCCs) from audio signal
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13), axis=1)

    # Keep the first 20 non-zero values of the time-averaged pitch array
    pitches = np.trim_zeros(np.mean(pitches, axis=1))[:20]

    # Keep the first 20 non-zero values of the time-averaged magnitude array
    magnitudes = np.trim_zeros(np.mean(magnitudes, axis=1))[:20]

    # Compute chroma feature from harmonic component of audio signal using Constant-Q Transform (CQT)
    C = np.mean(librosa.feature.chroma_cqt(y=y_harmonic, sr=sampling_rate), axis=1)

    # Return a list of audio features, including MFCCs, pitch, magnitude, and chroma features
    return [mfccs, pitches, magnitudes, C]

The function get_features_dataframe takes a Pandas dataframe containing audio file paths and their corresponding labels, as well as a sampling rate as input. It uses the get_audio_features function to extract audio features for each file path in the dataframe, and stores these features in a new Pandas dataframe. The resulting dataframe is split into separate dataframes for each type of feature (i.e. mfcc, pitches, magnitudes, and C), which are then concatenated into a single dataframe. The function returns this combined feature dataframe along with a separate dataframe containing the original labels.

Code
def get_features_dataframe(dataframe, sampling_rate):
    
    # Create a new dataframe to hold the labels
    labels = pd.DataFrame(dataframe['label'])

    # Create an empty dataframe to hold the audio features
    features  = pd.DataFrame(columns=['mfcc','pitches','magnitudes','C'])

    # Loop through each audio file path in the input dataframe and compute its features
    for index, audio_path in enumerate(dataframe['path']):
        features.loc[index] = get_audio_features(audio_path, sampling_rate)

    # Split the features into separate dataframes for each feature type
    mfcc = features.mfcc.apply(pd.Series)
    pit = features.pitches.apply(pd.Series)
    mag = features.magnitudes.apply(pd.Series)
    C = features.C.apply(pd.Series)

    # Concatenate the separate feature dataframes into a single dataframe and return it with the labels
    combined_features = pd.concat([mfcc,pit,mag,C],axis=1,ignore_index=True)
    return combined_features, labels 

Getting the features of audio files using librosa (Usually takes 12-15 mins to run):

Code
trainfeatures, trainlabel = get_features_dataframe(train_df, sampling_rate)
testfeatures, testlabel = get_features_dataframe(test_df, sampling_rate)
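
Because this step is slow, the extracted feature DataFrames can be cached to disk once and reloaded on later runs; a small sketch (the file names are illustrative):

Code
# Sketch: cache the extracted features so the slow extraction step only runs once.
trainfeatures.to_pickle('trainfeatures.pkl')
trainlabel.to_pickle('trainlabel.pkl')
testfeatures.to_pickle('testfeatures.pkl')
testlabel.to_pickle('testlabel.pkl')

# On a later run, load them back instead of re-extracting:
# trainfeatures = pd.read_pickle('trainfeatures.pkl')
# trainlabel = pd.read_pickle('trainlabel.pkl')
# testfeatures = pd.read_pickle('testfeatures.pkl')
# testlabel = pd.read_pickle('testlabel.pkl')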

Fill NA values with 0:

Code
trainfeatures = trainfeatures.fillna(0)
testfeatures = testfeatures.fillna(0)

Converting the feature and label DataFrames to NumPy arrays (the labels are flattened to 1D with .ravel()):

Code
X_train = np.array(trainfeatures)
y_train = np.array(trainlabel).ravel()
X_test = np.array(testfeatures)
y_test = np.array(testlabel).ravel()

One-Hot Encoding the labels:

Code
lb = LabelEncoder()

y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.fit_transform(y_test))
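
Note that fit_transform is applied to the test labels separately here, which only matches the training encoding if every class appears in both splits. A slightly safer sketch of the same step fits the encoder on the training labels once and reuses it:

Code
# Sketch: fit the LabelEncoder on the training labels only and reuse it for the test labels.
lb = LabelEncoder()
y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.transform(y_test), num_classes=y_train.shape[1])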

Adding a channel dimension for the 1D CNN:

Code
x_traincnn =np.expand_dims(X_train, axis=2)
x_testcnn= np.expand_dims(X_test, axis=2)

CNN Model

Creating a Model

This defines a Sequential model for a 1D convolutional neural network (CNN). The model stacks several 1D convolutional layers with ReLU activations, dropout regularization, and max pooling, followed by a flatten layer and a dense output layer. Its input is the feature vector built above (13 MFCCs plus up to 20 pitch, up to 20 magnitude, and 12 chroma values — 65 in this run, as the model summary below confirms) with a single channel, and its output is a probability distribution over the seven emotion classes. The optimizer used with this model is RMSprop with a small learning rate and decay. Overall, this code defines a CNN for audio classification, where the input is a set of features extracted from an audio signal and the output is the predicted emotion of the sample.

Code
# Define a Sequential model for 1D convolutional neural network (CNN)
model = Sequential()

# Add a 1D convolutional layer with 256 filters, kernel size 5, 'same' padding, and input shape (number of features, 1)
model.add(Conv1D(256, 5,padding='same', input_shape=(x_traincnn.shape[1],x_traincnn.shape[2])))

# Add a ReLU activation function
model.add(Activation('relu'))

# Add a 1D convolutional layer with 128 filters, kernel size 5, padding same
model.add(Conv1D(128, 5,padding='same'))

# Add a ReLU activation function
model.add(Activation('relu'))

# Add a dropout layer with dropout rate 0.1
model.add(Dropout(0.1)) #Dropout Regularization

# Add a max pooling layer with pool size 8
model.add(MaxPooling1D(pool_size=(8)))

# Add a 1D convolutional layer with 128 filters, kernel size 5, padding same
model.add(Conv1D(128, 5,padding='same',))

# Add a ReLU activation function
model.add(Activation('relu'))

# Add a 1D convolutional layer with 128 filters, kernel size 5, padding same
model.add(Conv1D(128, 5,padding='same',))

# Add a ReLU activation function
model.add(Activation('relu'))

# Flatten the output from convolutional layers
model.add(Flatten()) #Flattening the input

# Add a dense layer with number of neurons equal to number of classes
model.add(Dense(y_train.shape[1]))

# Add a softmax activation function to output probabilities
model.add(Activation('softmax'))

# Define the RMSprop optimizer with learning rate and decay
opt = tensorflow.keras.optimizers.legacy.RMSprop(learning_rate=0.00001, decay=1e-6) 

Summary of the Sequential Model:

Code
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv1d (Conv1D)             (None, 65, 256)           1536      
                                                                 
 activation (Activation)     (None, 65, 256)           0         
                                                                 
 conv1d_1 (Conv1D)           (None, 65, 128)           163968    
                                                                 
 activation_1 (Activation)   (None, 65, 128)           0         
                                                                 
 dropout (Dropout)           (None, 65, 128)           0         
                                                                 
 max_pooling1d (MaxPooling1D  (None, 8, 128)           0         
 )                                                               
                                                                 
 conv1d_2 (Conv1D)           (None, 8, 128)            82048     
                                                                 
 activation_2 (Activation)   (None, 8, 128)            0         
                                                                 
 conv1d_3 (Conv1D)           (None, 8, 128)            82048     
                                                                 
 activation_3 (Activation)   (None, 8, 128)            0         
                                                                 
 flatten (Flatten)           (None, 1024)              0         
                                                                 
 dense (Dense)               (None, 7)                 7175      
                                                                 
 activation_4 (Activation)   (None, 7)                 0         
                                                                 
=================================================================
Total params: 336,775
Trainable params: 336,775
Non-trainable params: 0
_________________________________________________________________

Compile the model:

Code
model.compile(loss='categorical_crossentropy', optimizer=opt,metrics=['accuracy'])

Training and Evaluation

Training the model:

Code
cnnhistory=model.fit(x_traincnn, y_train, batch_size=16, epochs=370, validation_data=(x_testcnn, y_test), verbose=0)
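
The model trains here for a fixed 370 epochs. If overfitting becomes a concern, a Keras EarlyStopping callback can halt training once the validation loss stops improving; a minimal sketch using the same model and data:

Code
# Sketch: the same fit call with early stopping on validation loss (keeps the best weights seen).
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=25, restore_best_weights=True)
cnnhistory = model.fit(x_traincnn, y_train, batch_size=16, epochs=370,
                       validation_data=(x_testcnn, y_test),
                       callbacks=[early_stop], verbose=0)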

Plotting loss vs. epochs:

Code
plt.plot(cnnhistory.history['loss'])
plt.plot(cnnhistory.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.savefig('loss_vs_iterations.png')
plt.show()
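
The accuracy curves from the same history object can be plotted the same way (assuming the metric keys follow Keras's default 'accuracy'/'val_accuracy' naming):

Code
# Sketch: training and validation accuracy over epochs.
plt.plot(cnnhistory.history['accuracy'])
plt.plot(cnnhistory.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()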

Saving the model

Saving .h5 model file:

Code
model_name = 'Speech_Emotion_Recognition_Model.h5'
save_dir = os.path.join(os.getcwd(), 'Trained_Models')
# Save model and weights
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)
model_path = os.path.join(save_dir, model_name)
model.save(model_path)
print('Saved trained model at %s ' % model_path)
Saved trained model at /Users/rd/Documents/Academic Portfolio/Speech Emotion Recognition/Trained_Models/Speech_Emotion_Recognition_Model.h5 

Saving .json model file:

Code
import json
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)

Loading the model

Code
# loading json and creating model
from keras.models import model_from_json
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("./Trained_Models/Speech_Emotion_Recognition_Model.h5")
print("Loaded model from disk")
 
# evaluate loaded model on test data
loaded_model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
score = loaded_model.evaluate(x_testcnn, y_test, verbose=0)
print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))
Loaded model from disk
accuracy: 42.31%

Test Set Prediction

Predicting emotions on the test data

Code
preds = loaded_model.predict(x_testcnn, 
                         batch_size=32, 
                         verbose=1)

preds=preds.argmax(axis=1)
preds = preds.astype(int).flatten()
predictions = (lb.inverse_transform((preds)))
preddf = pd.DataFrame({'predictedvalues': predictions})
actual=y_test.argmax(axis=1)
actual = actual.astype(int).flatten()
actualvalues = (lb.inverse_transform((actual)))
actualdf = pd.DataFrame({'actualvalues': actualvalues})
finaldf = actualdf.join(preddf)
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 112ms/step

Actual v/s Predicted emotions

Code
finaldf[:10]
actualvalues predictedvalues
0 5 5
1 5 5
2 0 0
3 3 3
4 5 3
5 2 5
6 1 2
7 2 2
8 2 0
9 1 4
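
The introduction also mentions precision, recall, and F1-score; these can be computed from the same predictions with scikit-learn. A sketch, assuming finaldf from above (zero_division guards against classes that happen to be missing from the small test split):

Code
# Sketch: per-class precision/recall/F1 and the confusion matrix from the saved predictions.
from sklearn.metrics import classification_report, confusion_matrix

emotion_names = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
print(classification_report(finaldf['actualvalues'], finaldf['predictedvalues'],
                            labels=list(range(7)), target_names=emotion_names,
                            zero_division=0))
print(confusion_matrix(finaldf['actualvalues'], finaldf['predictedvalues'],
                       labels=list(range(7))))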

Saving the predicted values in a CSV file:

Code
finaldf.to_csv('Predictions.csv', index=False)

Demonstration on a demo audio file

Loading the demo audio file:

Code
demo_audio_path = './demo_audio.wav'
ipd.Audio(demo_audio_path)

Getting the features of the demo audio file:

Code
demo_mfcc, demo_pitch, demo_mag, demo_chrom = get_audio_features(demo_audio_path,sampling_rate)

mfcc = pd.Series(demo_mfcc)
pit = pd.Series(demo_pitch)
mag = pd.Series(demo_mag)
C = pd.Series(demo_chrom)
demo_audio_features = pd.concat([mfcc,pit,mag,C],ignore_index=True)

demo_audio_features= np.expand_dims(demo_audio_features, axis=0)
demo_audio_features= np.expand_dims(demo_audio_features, axis=2)

Predicting the emotion of the demo audio file:

Code
livepreds = loaded_model.predict(demo_audio_features, 
                         batch_size=32, 
                         verbose=1)
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 42ms/step
Code
emotions=["Anger","Disgust","Fear","Happy","Neutral", "Sad", "Surprise"]
index = livepreds.argmax(axis=1).item()
print("Emotion predicted from the model:",emotions[index])
Emotion predicted from the model: Anger
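
For reuse, the demo steps above can be wrapped into a small helper; this is a sketch (the function name is illustrative, and it assumes loaded_model, get_audio_features, and sampling_rate are defined as above):

Code
# Sketch: wrap feature extraction and prediction for a single file into one helper.
def predict_emotion(audio_path, model, sampling_rate):
    emotion_names = ["Anger", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]
    mfcc, pitch, mag, chrom = get_audio_features(audio_path, sampling_rate)
    feats = pd.concat([pd.Series(mfcc), pd.Series(pitch), pd.Series(mag), pd.Series(chrom)],
                      ignore_index=True)
    # Shape (1, n_features, 1) to match the CNN input; assumes the clip yields the full feature length
    feats = np.expand_dims(np.expand_dims(feats, axis=0), axis=2)
    probs = model.predict(feats, verbose=0)
    return emotion_names[probs.argmax(axis=1).item()]

# Usage:
# print(predict_emotion('./demo_audio.wav', loaded_model, sampling_rate))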

Conclusion

Speech Emotion Recognition (SER) is an important research area in the field of signal processing and machine learning. In this project, we used the RAVDESS dataset, which contains a diverse range of emotional speech recordings. Our goal was to train a deep learning model that can accurately classify the emotional state of a speaker from the speech signal. We used a combination of signal processing features, namely MFCC, chroma, pitch, and magnitude, extracted from the audio signals.

We preprocessed the data by loading fixed-duration clips, extracting MFCC, pitch, magnitude, and chroma features, and filling missing feature values with zeros. We then trained a neural network using the Keras API in TensorFlow to classify the emotion from the audio signal. The model architecture consists of multiple Conv1D layers with ReLU activations, dropout, and MaxPooling1D, followed by a Flatten layer and a Dense output layer with a softmax activation. We used the RMSprop optimizer to minimize the categorical cross-entropy loss and tracked accuracy to evaluate the model's performance.

Our results show that the model achieved a test accuracy of approximately 42% (42.31% when the saved model was re-evaluated above), which is a reasonable starting point given the complexity of the task and the size of the dataset; note also that the 1% test split leaves only a small number of held-out samples, so this estimate is noisy. We also observed that the model performed better on certain emotions, such as neutral and happy, while struggling to accurately classify others, such as disgust and surprise. This could be due to the fact that some emotions are more distinct and easily recognizable from the audio signal than others.

In conclusion, our study demonstrates the feasibility of using deep learning models for speech emotion recognition. The RAVDESS dataset provides a rich resource for future research in this area, with its diverse range of emotions and large number of samples. However, our study also highlights some of the challenges and limitations of this approach, such as the need for large amounts of data and the difficulty of accurately classifying certain emotions. Future work could explore alternative feature extraction techniques and model architectures to further improve the performance of speech emotion recognition systems. Additionally, the development of real-world applications for this technology could have a significant impact on fields such as psychology, human-computer interaction, and entertainment.
