Speech Emotion Recognition (SER) is a field of study in Artificial Intelligence (AI) that aims to automatically recognize human emotions from speech. The ability to understand human emotions can be beneficial in many domains, such as psychotherapy, customer service, and human-robot interaction. The RAVDESS dataset is a popular database for speech emotion recognition research; it contains audio and video recordings of actors performing short clips of emotional speech. In this project, the recordings are organized into seven emotion classes: anger, disgust, fear, happy, neutral, sad, and surprise. We will explore deep learning techniques to build an SER model on this data.
Deep learning models have become the state-of-the-art approach in SER because of their ability to learn complex patterns in speech signals. Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are two popular deep learning models used for SER. CNNs are particularly suitable for speech analysis because they can automatically learn relevant features from the spectrogram of the audio signal. RNNs, on the other hand, can capture the temporal information of the audio signal by modeling the sequence of the acoustic features.
The RAVDESS dataset has been widely used in the research community to develop and evaluate deep learning models for SER. In this project, we focus on building a CNN-based model for speech emotion recognition. We preprocess the audio files to extract Mel-frequency cepstral coefficients (MFCCs), chroma, pitch, and magnitude features. These features are concatenated into a fixed-length feature vector for each clip, which is fed into a one-dimensional CNN to train a model for speech emotion recognition. We evaluate the model's performance using classification accuracy on a held-out test set.
The remainder of this project is organized as follows. In the next section, we will discuss the RAVDESS dataset in detail, including its characteristics, structure, and contents. We will also discuss the preprocessing steps required to prepare the dataset for deep learning models. In the following section, we will describe the CNN architecture used for SER and the training process. We will also present the evaluation metrics used to measure the performance of the model. Finally, we will present the results of the experiments and discuss the limitations and potential improvements of the model.
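First, the libraries used throughout the project are imported:

```{python}
import os
import csv
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
import IPython.display as ipd

import tensorflow
from keras.models import Sequential, model_from_json
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dropout, Activation, Dense
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

sns.set_palette('Set2')
warnings.filterwarnings('ignore')
```

To get a feel for the data, we first plot the waveforms of sample audio clips for a few emotions.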
Fear:

```{python}
# Load a sample "fear" clip and plot its waveform
data, sampling_rate = librosa.load('Dataset/fear/fear001.wav')

plt.figure(figsize=(12, 4))
librosa.display.waveshow(data, sr=sampling_rate, alpha=0.4)
plt.title('Waveform for Fear Audio Sample')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.savefig('fear.png')
plt.show()
```
Happy:

```{python}
# Load a sample "happy" clip and plot its waveform
data, sampling_rate = librosa.load('Dataset/happy/happy002.wav')

plt.figure(figsize=(12, 4))
librosa.display.waveshow(data, sr=sampling_rate, alpha=0.4)
plt.title('Waveform for Happy Audio Sample')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.savefig('happy.png')
plt.show()
```
Both the waveforms above and spectrograms, which represent the frequency content of an audio signal over time, vary with the characteristics of the audio. In the context of emotion recognition, different emotions may be associated with different patterns of spectral energy distribution and temporal dynamics. For example, angry or fearful speech often shows higher energy and more rapid variation, with relatively more energy in the high-frequency range, while sad or neutral speech tends to show lower energy and flatter contours. Machine learning models can be trained to extract these patterns automatically and recognize the corresponding emotions from spectrogram-derived features.
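Since the plots above show waveforms, it can also be instructive to look at an actual spectrogram. Below is a minimal sketch that computes a mel spectrogram for the same happy clip loaded above (the output file name is illustrative):

```{python}
# Mel spectrogram of the happy clip, in decibels, for comparison with its waveform
S = librosa.feature.melspectrogram(y=data, sr=sampling_rate, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

plt.figure(figsize=(12, 4))
img = librosa.display.specshow(S_db, sr=sampling_rate, x_axis='time', y_axis='mel')
plt.colorbar(img, format='%+2.0f dB')
plt.title('Mel Spectrogram for Happy Audio Sample')
plt.savefig('happy_melspec.png')
plt.show()
```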
Functions for converting our data into CSV
The create_meta_csv function creates a CSV file containing the file paths and corresponding labels of the WAV files in a dataset directory. The function first resolves the absolute path of the dataset directory and, if the destination path is None, defaults it to the dataset path; the CSV file is then created at that destination. It walks the dataset directory and builds a list of tuples, where each tuple contains a file path and the label index derived from the name of its parent folder. Finally, the function writes the header row and the data rows to the CSV file and returns True to indicate success.
```{python}
np.random.seed(42)

def create_meta_csv(dataset_path, destination_path):
    # Get the absolute path of the dataset directory
    DATASET_PATH = os.path.abspath(dataset_path)

    # If the destination path is None, default it to the dataset path
    if destination_path == None:
        destination_path = DATASET_PATH

    # Set the path of the CSV file that will be created
    csv_path = os.path.join(destination_path, 'dataset_attr.csv')

    # Create an empty list to hold the file paths of all the WAV files
    flist = []

    # Define the list of emotions that will be used as labels in the CSV file
    emotions = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

    # Walk the dataset directory and collect the file paths of all the WAV files
    for root, dirs, files in os.walk(DATASET_PATH, topdown=False):
        for name in files:
            if name.endswith('.wav'):
                fullName = os.path.join(root, name)
                flist.append(fullName)

    # Split each file path by the directory separator
    filenames = []
    for idx, file in enumerate(flist):
        filenames.append(file.split('/'))

    # Create a list of tuples (file path, emotion label index), where the label
    # index is taken from the name of the file's parent directory
    types = []
    for idx, path in enumerate(filenames):
        types.append((flist[idx], emotions.index(path[-2])))

    # Open the CSV file, write the header row, then write the data rows
    with open(csv_path, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(("path", "label"))
        writer.writerows(types)

    # Return True to indicate that the function completed successfully
    return True
```
The create_and_load_meta_csv_df function creates and loads a CSV file containing metadata about the audio dataset located at dataset_path. It can optionally shuffle the rows and split the dataset into training and testing sets: if randomize is True, or a split is requested without an explicit randomize flag, the rows of the DataFrame are shuffled; if split is not None, the DataFrame is divided into training and testing sets. Depending on these options, the function returns either the full DataFrame alone or the full DataFrame together with the training and testing sets. The create_meta_csv function is called to create the metadata CSV file, which is then loaded into a Pandas DataFrame.
```{python}
def create_and_load_meta_csv_df(dataset_path, destination_path, randomize=True, split=None):
    # Call create_meta_csv to generate the CSV file containing metadata about the audio dataset
    if create_meta_csv(dataset_path, destination_path=destination_path):
        # Load the CSV file into a Pandas DataFrame
        dframe = pd.read_csv(os.path.join(destination_path, 'dataset_attr.csv'))

    # If randomize is True, or a split is requested without an explicit randomize flag,
    # shuffle the rows of the DataFrame
    if randomize == True or (split != None and randomize == None):
        dframe = dframe.sample(frac=1).reset_index(drop=True)

    # If split is not None, split the DataFrame into training and testing sets
    if split != None:
        train_set, test_set = train_test_split(dframe, split)
        return dframe, train_set, test_set

    # Return the DataFrame
    return dframe
```
The train_test_split function defined here takes in a DataFrame dframe and a split ratio split_ratio, which determines the proportion of data used for training. It splits the DataFrame into two: the training set, which contains the first split_ratio fraction of the rows, and the testing set, which contains the remaining rows. The function then resets the index of the testing set to start at 0 and returns the two DataFrames. Splitting a dataset into training and testing sets in this way is standard practice for evaluating a model's performance on new, unseen data.
```{python}
def train_test_split(dframe, split_ratio):
    # Split the DataFrame into training and testing sets based on the split ratio
    train_data = dframe.iloc[:int(split_ratio * len(dframe)), :]
    test_data = dframe.iloc[int(split_ratio * len(dframe)):, :]

    # Reset the index of the testing set to start at 0
    test_data = test_data.reset_index(drop=True)

    # Return the training and testing sets as separate DataFrames
    return train_data, test_data
```
Running all the above functions:
```{python}
# Set the path to the audio dataset directory and print it
dataset_path = './Dataset'
print("dataset_path : ", dataset_path)

# Set the destination path to the current working directory
destination_path = os.getcwd()

# Number of classes and total number of rows in the audio dataset
classes = 7
total_rows = 2556

# Set the randomize and clear flags to True
randomize = True
clear = True

# Create and load the metadata CSV file for the audio dataset, and split it into training and testing sets
df, train_df, test_df = create_and_load_meta_csv_df(dataset_path, destination_path=destination_path,
                                                    randomize=randomize, split=0.99)
```
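Labels assigned for the emotions: 0: anger, 1: disgust, 2: fear, 3: happy, 4: neutral, 5: sad, 6: surprise.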
```{python}
# Get the unique labels in the training set of the Emotion dataset and print them in sorted order
unique_labels = train_df.label.unique()
unique_labels.sort()
print("Unique labels in Emotion dataset : ")
print(*unique_labels, sep=', ')

# Get the count of each unique label in the training set of the Emotion dataset and print them
unique_labels_counts = train_df.label.value_counts(sort=False)
print("\nCount of unique labels in Emotion dataset : ")
print(*unique_labels_counts, sep=', ')
```
Unique labels in Emotion dataset :
0, 1, 2, 3, 4, 5, 6
Count of unique labels in Emotion dataset :
429, 306, 433, 433, 250, 430, 249
```{python}
# Histogram of the labels
plt.figure(figsize=(10, 5))
sns.countplot(x='label', data=train_df)
plt.xticks(unique_labels)
plt.ylabel('Count')
plt.xlabel('Emotion')
_ = plt.title('Distribution of Emotion in Data')
plt.savefig('emotion_distribution.png')
plt.show()
```
The histogram shows the distribution of label counts in the Emotion dataset: the x-axis shows the label numbers and the y-axis shows the count of each label. Four of the classes have roughly 430 samples each, while the other three have noticeably fewer (about 250-310). The dataset is therefore only moderately balanced; the under-represented classes may be somewhat harder for the model to learn, although every emotion is still represented by at least a few hundred samples.
Pre-Processing the Data
Audio Features:
Mel Frequency Cepstral Coefficients (MFCC): A feature extraction technique widely used in speech processing and recognition. It involves estimating the spectral envelope of a speech signal, typically using the Discrete Fourier Transform (DFT), and mapping it to the mel frequency scale, which better reflects human perception of sound. The MFCCs are then obtained by taking the logarithm of the mel-band energies and applying a discrete cosine transform. The resulting coefficients capture much of the relevant information in the speech signal, such as phonetic content, speaker identity, and emotion, and are commonly used as inputs to machine learning algorithms for speech recognition and related tasks.
Chroma feature extraction: It is a technique used to represent the harmonic content of an audio signal in a compact manner. Chroma features are based on the pitch class profiles of musical notes, which are invariant to octave transposition and are typically represented using a circular layout called the chroma circle. Chroma features can be computed from the short-term Fourier transform (STFT) of an audio signal, by first mapping the power spectrum to the pitch class domain and then summing the energy of each pitch class over time. The resulting chroma feature matrix can be used as input to machine learning algorithms for tasks such as music genre classification, chord recognition, and melody extraction.
Pitch: It is a perceptual attribute of sound that allows us to distinguish between high and low frequency sounds. It is closely related to the physical property of frequency, which is the number of cycles per second that a sound wave completes. High-pitched sounds have a high frequency, while low-pitched sounds have a low frequency. In music, pitch is used to describe the perceived height or lowness of a musical note. Pitch can be manipulated by altering the frequency of a sound wave using techniques such as tuning or modulation. Pitch perception is an important aspect of human auditory processing and is essential for tasks such as speech recognition and music appreciation.
Magnitude: In signal processing, magnitude refers to the amplitude or strength of a signal, which is a measure of how much energy the signal contains. It is typically calculated as the absolute value of a complex number, which is a mathematical representation of a signal that includes both its magnitude and phase. Magnitude can be used to describe various characteristics of a signal, such as its power, energy, or intensity. For example, in audio signal processing, magnitude can represent the loudness or volume of a sound, while in image processing it can represent the strength of different spatial frequencies in an image. A short example of computing these features with librosa is shown after this list.
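As a quick, self-contained illustration of these features (using a synthetic tone in place of a real recording, so the values themselves are not meaningful), the corresponding librosa calls and output shapes look like this:

```{python}
# Synthetic 1-second 440 Hz tone at 22.05 kHz as a stand-in for a speech clip
sr = 22050
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # shape: (13, n_frames)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # shape: (12, n_frames), STFT-based chroma
pitches, magnitudes = librosa.piptrack(y=y, sr=sr)      # per-bin pitch and magnitude estimates

print(mfcc.shape, chroma.shape, pitches.shape, magnitudes.shape)
```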
Functions for getting features of audio files
The function get_audio_features extracts audio features from a given audio file path using the librosa library. It loads a 2.5-second excerpt of the audio file (starting at an offset of 0.5 s), separates the harmonic and percussive components of the signal, and estimates pitch and magnitude with librosa's piptrack, keeping the first 20 non-zero time-averaged values of each. It also computes 13 time-averaged Mel-frequency cepstral coefficients (MFCCs) from the signal and a chroma feature from the harmonic component using the Constant-Q Transform (CQT). Finally, it returns a list containing the MFCC, pitch, magnitude, and chroma features, which serves as the input representation for the classification model.
```{python}
def get_audio_features(audio_path, sampling_rate):
    # Load a 2.5-second excerpt of the audio file (starting at 0.5 s) at twice the target rate
    X, sample_rate = librosa.load(audio_path, res_type='kaiser_fast', duration=2.5,
                                  sr=sampling_rate * 2, offset=0.5)

    # Convert the sample rate to a NumPy array for consistency
    sample_rate = np.array(sample_rate)

    # Separate harmonic and percussive components of the audio signal
    y_harmonic, y_percussive = librosa.effects.hpss(X)

    # Estimate pitch and magnitude of the audio signal using piptrack
    pitches, magnitudes = librosa.core.pitch.piptrack(y=X, sr=sample_rate)

    # Compute Mel-frequency cepstral coefficients (MFCCs), averaged over time
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13), axis=1)

    # Keep the first 20 non-zero time-averaged pitch values
    pitches = np.trim_zeros(np.mean(pitches, axis=1))[:20]

    # Keep the first 20 non-zero time-averaged magnitude values
    magnitudes = np.trim_zeros(np.mean(magnitudes, axis=1))[:20]

    # Compute the chroma feature from the harmonic component using the Constant-Q Transform (CQT)
    C = np.mean(librosa.feature.chroma_cqt(y=y_harmonic, sr=sampling_rate), axis=1)

    # Return the list of audio features: MFCCs, pitch, magnitude, and chroma
    return [mfccs, pitches, magnitudes, C]
```
The function get_features_dataframe takes a Pandas dataframe containing audio file paths and their corresponding labels, as well as a sampling rate as input. It uses the get_audio_features function to extract audio features for each file path in the dataframe, and stores these features in a new Pandas dataframe. The resulting dataframe is split into separate dataframes for each type of feature (i.e. mfcc, pitches, magnitudes, and C), which are then concatenated into a single dataframe. The function returns this combined feature dataframe along with a separate dataframe containing the original labels.
```{python}
def get_features_dataframe(dataframe, sampling_rate):
    # Create a new dataframe to hold the labels
    labels = pd.DataFrame(dataframe['label'])

    # Create an empty dataframe to hold the audio features
    features = pd.DataFrame(columns=['mfcc', 'pitches', 'magnitudes', 'C'])

    # Loop through each audio file path in the input dataframe and compute its features
    for index, audio_path in enumerate(dataframe['path']):
        features.loc[index] = get_audio_features(audio_path, sampling_rate)

    # Split the features into separate dataframes for each feature type
    mfcc = features.mfcc.apply(pd.Series)
    pit = features.pitches.apply(pd.Series)
    mag = features.magnitudes.apply(pd.Series)
    C = features.C.apply(pd.Series)

    # Concatenate the separate feature dataframes into a single dataframe and return it with the labels
    combined_features = pd.concat([mfcc, pit, mag, C], axis=1, ignore_index=True)

    return combined_features, labels
```
Getting the features of audio files using librosa (Usually takes 12-15 mins to run):
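```{python}
trainfeatures, trainlabel = get_features_dataframe(train_df, sampling_rate)
testfeatures, testlabel = get_features_dataframe(test_df, sampling_rate)
```

The extracted features are then cleaned and reshaped for the network: missing values are filled with 0, the labels are one-hot encoded, and a channel dimension is added so the feature vectors can be fed to a 1D CNN.

```{python}
# Fill any missing feature values with 0
trainfeatures = trainfeatures.fillna(0)
testfeatures = testfeatures.fillna(0)

# Convert the feature and label DataFrames to NumPy arrays
X_train = np.array(trainfeatures)
y_train = np.array(trainlabel).ravel()
X_test = np.array(testfeatures)
y_test = np.array(testlabel).ravel()

# One-hot encode the integer labels
lb = LabelEncoder()
y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.fit_transform(y_test))

# Add a channel dimension for the 1D CNN
x_traincnn = np.expand_dims(X_train, axis=2)
x_testcnn = np.expand_dims(X_test, axis=2)
```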
This defines a Sequential model for a 1D convolutional neural network (CNN). The model stacks several Conv1D layers with ReLU activations, dropout regularization, and max pooling, followed by a Flatten layer and a Dense layer with a softmax output. The input is the concatenated audio feature vector with a trailing channel dimension (shape (number of features, 1)), and the output is a probability distribution over the emotion classes. The optimizer is RMSprop with a specified learning rate and decay. Overall, this code defines a CNN for audio classification, where the input is a set of features extracted from an audio signal and the output is the predicted emotion of the sample.
```{python}
# Define a Sequential model for a 1D convolutional neural network (CNN)
model = Sequential()

# 1D convolutional layer with 256 filters, kernel size 5, 'same' padding,
# and input shape (feature vector length, 1)
model.add(Conv1D(256, 5, padding='same', input_shape=(x_traincnn.shape[1], x_traincnn.shape[2])))
model.add(Activation('relu'))

# 1D convolutional layer with 128 filters, kernel size 5, 'same' padding
model.add(Conv1D(128, 5, padding='same'))
model.add(Activation('relu'))

# Dropout regularization with rate 0.1
model.add(Dropout(0.1))

# Max pooling layer with pool size 8
model.add(MaxPooling1D(pool_size=8))

# Two more 1D convolutional layers with 128 filters and kernel size 5
model.add(Conv1D(128, 5, padding='same'))
model.add(Activation('relu'))
model.add(Conv1D(128, 5, padding='same'))
model.add(Activation('relu'))

# Flatten the output from the convolutional layers
model.add(Flatten())

# Dense layer with one neuron per class, followed by a softmax to output probabilities
model.add(Dense(y_train.shape[1]))
model.add(Activation('softmax'))

# Define the RMSprop optimizer with learning rate and decay
opt = tensorflow.keras.optimizers.legacy.RMSprop(learning_rate=0.00001, decay=1e-6)
```
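The model is then compiled with categorical cross-entropy loss and trained for 370 epochs; the training and validation loss curves can be plotted from the returned history.

```{python}
model.summary()

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# Train the model
cnnhistory = model.fit(x_traincnn, y_train, batch_size=16, epochs=370,
                       validation_data=(x_testcnn, y_test), verbose=0)

# Plot loss vs. epochs for the training and test sets
plt.plot(cnnhistory.history['loss'])
plt.plot(cnnhistory.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.savefig('loss_vs_iterations.png')
plt.show()
```

Saving the trained model and its weights to an .h5 file: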
```{python}
model_name = 'Speech_Emotion_Recognition_Model.h5'
save_dir = os.path.join(os.getcwd(), 'Trained_Models')

# Save model and weights
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)
model_path = os.path.join(save_dir, model_name)
model.save(model_path)
print('Saved trained model at %s ' % model_path)
```
Saved trained model at /Users/rd/Documents/Academic Portfolio/Speech Emotion Recognition/Trained_Models/Speech_Emotion_Recognition_Model.h5
Saving .json model file:
```{python}
import json

model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
```
Loading the model
```{python}
# Load the JSON model definition and create the model
from keras.models import model_from_json

json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)

# Load the weights into the new model
loaded_model.load_weights("./Trained_Models/Speech_Emotion_Recognition_Model.h5")
print("Loaded model from disk")

# Evaluate the loaded model on the test data
loaded_model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
score = loaded_model.evaluate(x_testcnn, y_test, verbose=0)
print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1] * 100))
```
Loaded model from disk
accuracy: 42.31%
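Using the loaded model, we predict emotions for the test set, pair them with the actual labels, and save the comparison to a CSV file; the same pipeline is then applied to a standalone demo recording (demo_audio.wav in the project directory).

```{python}
# Predict emotions on the test set and decode them back to label values
preds = loaded_model.predict(x_testcnn, batch_size=32, verbose=1)
preds = preds.argmax(axis=1).astype(int).flatten()
predictions = lb.inverse_transform(preds)
preddf = pd.DataFrame({'predictedvalues': predictions})

# Decode the actual labels the same way and join the two for comparison
actual = y_test.argmax(axis=1).astype(int).flatten()
actualvalues = lb.inverse_transform(actual)
actualdf = pd.DataFrame({'actualvalues': actualvalues})
finaldf = actualdf.join(preddf)

# Save the actual vs. predicted values
finaldf.to_csv('Predictions.csv', index=False)
```

```{python}
# Extract features for the demo audio file and reshape them for the model
demo_audio_path = './demo_audio.wav'
demo_mfcc, demo_pitch, demo_mag, demo_chrom = get_audio_features(demo_audio_path, sampling_rate)

mfcc = pd.Series(demo_mfcc)
pit = pd.Series(demo_pitch)
mag = pd.Series(demo_mag)
C = pd.Series(demo_chrom)
demo_audio_features = pd.concat([mfcc, pit, mag, C], ignore_index=True)
demo_audio_features = np.expand_dims(demo_audio_features, axis=0)
demo_audio_features = np.expand_dims(demo_audio_features, axis=2)

# Predict the emotion of the demo clip
livepreds = loaded_model.predict(demo_audio_features, batch_size=32, verbose=1)
```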
emotions=["Anger","Disgust","Fear","Happy","Neutral", "Sad", "Surprise"]index = livepreds.argmax(axis=1).item()print("Emotion predicted from the model:",emotions[index])
Emotion predicted from the model: Anger
Conclusion
Speech Emotion Recognition (SER) is an important research area in signal processing and machine learning. In this project, we used the RAVDESS dataset, which contains a diverse range of emotional speech recordings. Our goal was to train a deep learning model that can classify the emotional state of a speaker from their speech signal. We used a combination of signal processing features, namely MFCCs, chroma, pitch, and magnitude, extracted from the audio signals.
We preprocessed the data by loading fixed-length (2.5-second) excerpts of each recording and extracting frame-level features that were averaged over time. We then trained a neural network using the Keras API in TensorFlow to classify the emotion from the audio signal. The model architecture consisted of multiple Conv1D layers with max pooling, followed by a Flatten layer and a Dense output layer with a softmax activation. We used the RMSprop optimizer to minimize the categorical cross-entropy loss and tracked accuracy to evaluate the model's performance.
Our results show that the loaded model achieved a test accuracy of about 42% (matching the evaluation output above), a modest result given the complexity of the task, the hand-crafted feature set, and the very small held-out test split. We also observed that the model performed better on certain emotions, such as neutral and happy, while struggling to accurately classify others, such as disgust and surprise. This may be because some emotions are more acoustically distinct and easier to recognize from the audio signal than others.
In conclusion, our study demonstrates the feasibility of using deep learning models for speech emotion recognition tasks. The RAVDESS dataset provides a rich resource for future research in this area, with its diverse range of emotions and labeled samples. However, our study also highlights some of the challenges and limitations of this approach, such as the need for large amounts of data and the difficulty of accurately classifying certain emotions. Future work could explore alternative feature extraction techniques and model architectures to further improve the performance of speech emotion recognition systems. Additionally, the development of real-world applications for this technology could have a significant impact on fields such as psychology, human-computer interaction, and entertainment.