
Emotion Recognition from Speech


Introduction

Speech Emotion Recognition (SER) is a field of study in Artificial Intelligence (AI) that aims to automatically recognize human emotions from speech. The ability to understand human emotions can be beneficial in many domains, such as psychotherapy, customer service, and human-robot interaction. The RAVDESS dataset is a popular database for speech emotion recognition research. It contains audio and video recordings of actors performing short clips of emotional speech, labeled with eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised. In this project, we will explore deep learning techniques to build an SER model using the RAVDESS dataset.

Deep learning models have become the state-of-the-art approach in SER because of their ability to learn complex patterns in speech signals. Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are two popular deep learning models used for SER. CNNs are particularly suitable for speech analysis because they can automatically learn relevant features from the spectrogram of the audio signal. RNNs, on the other hand, can capture the temporal information of the audio signal by modeling the sequence of the acoustic features.

The RAVDESS dataset has been widely used in the research community to develop and evaluate deep learning models for SER. In this project, we will focus on building a CNN-based model for speech emotion recognition. We will preprocess the audio files to extract Mel-frequency cepstral coefficients (MFCCs), chroma features, and pitch and magnitude statistics. These features will be concatenated into a fixed-length feature vector for each clip and fed into a one-dimensional CNN to train a model for speech emotion recognition. We will evaluate the model's performance primarily with accuracy; the saved predictions also allow per-class precision, recall, and F1-score to be computed.

The remainder of this project is organized as follows. In the next section, we will discuss the RAVDESS dataset in detail, including its characteristics, structure, and contents. We will also discuss the preprocessing steps required to prepare the dataset for deep learning models. In the following section, we will describe the CNN architecture used for SER and the training process. We will also present the evaluation metrics used to measure the performance of the model. Finally, we will present the results of the experiments and discuss the limitations and potential improvements of the model.

Import Libraries

Code
import matplotlib.pyplot as plt
import os
import librosa
import librosa.display

import IPython.display as ipd
from IPython.display import Image

import seaborn as sns
sns.set_palette('Set2')
from librosa.core import pitch

import numpy as np
import pandas as pd

import keras
import tensorflow
from tensorflow.keras import optimizers
from tensorflow.python.keras import Sequential
from keras.models import Model
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D #, AveragePooling1D
from keras.layers import Flatten, Dropout, Activation # Input, 
from keras.layers import Dense #, Embedding
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

import wave
import sys
import csv
from PIL import Image
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader

import warnings
warnings.filterwarnings('ignore')

Preparing the Data

Plotting the waveforms of audio samples for different emotions

Anger:

Code
data, sampling_rate = librosa.load('Dataset/anger/anger001.wav')
ipd.Audio('Dataset/anger/anger001.wav')
Code
plt.figure(figsize=(12, 4))
librosa.display.waveshow(data, sr=sampling_rate, alpha=0.4)
plt.title('Waveform for Anger Audio Sample')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.savefig('anger.png')
plt.show()

Fear:

Code
data, sampling_rate = librosa.load('Dataset/fear/fear001.wav')
ipd.Audio('Dataset/fear/fear001.wav')
Code
plt.figure(figsize=(12, 4))
librosa.display.waveshow(data, sr=sampling_rate, alpha=0.4)
plt.title('Waveform for Fear Audio Sample')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.savefig('fear.png')
plt.show()

Happy:

Code
data, sampling_rate = librosa.load('Dataset/happy/happy002.wav')
ipd.Audio('Dataset/happy/happy002.wav')
Code
plt.figure(figsize=(12, 4))
librosa.display.waveshow(data, sr=sampling_rate, alpha=0.4)
plt.title('Waveform for Happy Audio Sample')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.savefig('happy.png')
plt.show()

The shape of spectrograms, which represent the frequency content of an audio signal over time, can vary depending on the characteristics of the audio signal. In the context of emotion recognition, different emotions may be associated with different patterns of spectral energy distribution and temporal dynamics in the audio signal. For example, angry or fearful speech may be characterized by higher spectral energy in the high-frequency range and shorter temporal duration, while happy or neutral speech may be characterized by lower spectral energy and longer duration. Machine learning models can be trained to automatically extract these patterns and recognize the corresponding emotions based on the spectrogram features.
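
For reference, an actual mel spectrogram of one of these clips can be generated with librosa; this is a short sketch, assuming the same Dataset path and the imports loaded above.

Code
# Sketch: a mel spectrogram of the anger clip shown above (path assumed from the anger example).
y, sr = librosa.load('Dataset/anger/anger001.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)           # convert power to decibels for display
plt.figure(figsize=(12, 4))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram for Anger Audio Sample')
plt.show()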

Functions for converting our data into CSV

The create_meta_csv function creates a CSV file that contains the file paths and corresponding labels of the WAV files in a dataset directory. The function first resolves the absolute path of the dataset directory and, if the destination path is None, falls back to the dataset path before building the path of the CSV file to be created. It then walks the dataset directory and builds a list of tuples, where each tuple contains a file path and its label (the index of the emotion name taken from the file's parent directory). Finally, the function opens the CSV file and writes the header row and the data rows from that list. The function returns True to indicate that it has completed successfully.

Code
np.random.seed(42)

def create_meta_csv(dataset_path, destination_path):

    # Get the absolute path of the dataset directory
    DATASET_PATH = os.path.abspath(dataset_path)

    # If the destination path is None, fall back to the dataset path
    if destination_path is None:
        destination_path = DATASET_PATH

    # Set the path of the CSV file that will be created
    csv_path = os.path.join(destination_path, 'dataset_attr.csv')

    # Create an empty list to hold the file paths of all the WAV files
    flist = []

    # Define a list of emotions that will be used as labels in the CSV file
    emotions = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

    # Walk the dataset directory and collect the file paths of all the WAV files
    for root, dirs, files in os.walk(DATASET_PATH, topdown=False):
        for name in files:
            if name.endswith('.wav'):
                flist.append(os.path.join(root, name))

    # Build a list of (path, label) tuples; the label is the index of the
    # emotion name, which is the parent directory of each file
    types = []
    for fullName in flist:
        parts = fullName.split(os.sep)
        types.append((fullName, emotions.index(parts[-2])))

    # Open the CSV file, write the header row, and write the data rows from the types list
    with open(csv_path, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(("path", "label"))
        writer.writerows(types)

    # Return True to indicate that the function has completed successfully
    return True

The create_and_load_meta_csv_df function creates and loads a CSV file containing metadata about an audio dataset located at dataset_path. The function also provides options to shuffle the rows of the dataset randomly and to split the dataset into training and testing sets. If randomize is True (or if a split is requested while randomize was left unset), the function shuffles the rows of the DataFrame. If split is not None, the function splits the DataFrame into training and testing sets and returns all three DataFrames; otherwise it returns just the full DataFrame. The create_meta_csv function is called to create the CSV file containing the metadata, which is then loaded into a Pandas DataFrame.

Code
def create_and_load_meta_csv_df(dataset_path, destination_path, randomize=True, split=None):

    # Call create_meta_csv to generate the CSV file containing metadata about the audio dataset
    if create_meta_csv(dataset_path, destination_path=destination_path):
        # Load the CSV file into a Pandas DataFrame
        dframe = pd.read_csv(os.path.join(destination_path, 'dataset_attr.csv'))

    # If randomize is True, or if a split is requested while randomize was left unset,
    # shuffle the rows of the DataFrame
    if randomize or (split is not None and randomize is None):
        dframe = dframe.sample(frac=1).reset_index(drop=True)

    # If split is not None, split the DataFrame into training and testing sets
    if split is not None:
        train_set, test_set = train_test_split(dframe, split)
        return dframe, train_set, test_set

    # Return the DataFrame
    return dframe

The train_test_split function takes in a DataFrame dframe and a split ratio split_ratio, which determines the proportion of data used for training. It splits the DataFrame into two separate DataFrames: the training set, which contains the first split_ratio fraction of the rows, and the testing set, which contains the remaining rows. The function then resets the index of the testing set to start at 0 and returns the two DataFrames as separate objects. This kind of split is commonly used in machine learning to evaluate the performance of a model on new, unseen data.

Code
def train_test_split(dframe, split_ratio):

    # Split the DataFrame into training and testing sets based on the split ratio
    train_data= dframe.iloc[:int((split_ratio) * len(dframe)), :]
    test_data= dframe.iloc[int((split_ratio) * len(dframe)):,:]
    
    # Reset the index of the testing set to start at 0
    test_data=test_data.reset_index(drop=True) 
    
    # Return the training and testing sets as separate DataFrames
    return train_data, test_data

Running all the above functions:

Code
# Set the path to the audio dataset directory and print it
dataset_path =  './Dataset'
print("dataset_path : ", dataset_path)

# Set the destination path to the current working directory
destination_path = os.getcwd()

# Set the number of classes in the audio dataset
classes = 7

# Set the total number of rows in the audio dataset
total_rows = 2556

# Set the randomize and clear flags to True
randomize = True
clear = True

# Create and load the metadata CSV file for the audio dataset, and split it into training and testing sets
df, train_df, test_df = create_and_load_meta_csv_df(dataset_path, destination_path=destination_path, randomize=randomize, split=0.99)
dataset_path :  ./Dataset

Seeing a sample of the training data:

Code
train_df.head()
path label
0 /Users/rd/Documents/Academic Portfolio/Speech ... 2
1 /Users/rd/Documents/Academic Portfolio/Speech ... 4
2 /Users/rd/Documents/Academic Portfolio/Speech ... 0
3 /Users/rd/Documents/Academic Portfolio/Speech ... 3
4 /Users/rd/Documents/Academic Portfolio/Speech ... 6

Visualizing the Data

Labels assigned for emotions:

  • 0 : anger
  • 1 : disgust
  • 2 : fear
  • 3 : happy
  • 4 : neutral
  • 5 : sad
  • 6 : surprise

Counting the number of emotions

Code
# Get the unique labels in the training set of the Emotion dataset and print them in sorted order
unique_labels = train_df.label.unique()
unique_labels.sort()
print("Unique labels in Emotion dataset : ")
print(*unique_labels, sep=', ')

# Get the count of each unique label in the training set of the Emotion dataset and print them
unique_labels_counts = train_df.label.value_counts(sort=False)
print("\nCount of unique labels in Emotion dataset : ")
print(*unique_labels_counts,sep=', ')
Unique labels in Emotion dataset : 
0, 1, 2, 3, 4, 5, 6

Count of unique labels in Emotion dataset : 
429, 306, 433, 433, 250, 430, 249
Code
# Histogram of the labels
plt.figure(figsize=(10, 5))
sns.countplot(x='label', data=train_df)
plt.xticks(unique_labels)
plt.ylabel('Count')
plt.xlabel('Emotion')
_ = plt.title('Distribution of Emotion in Data')
plt.savefig('emotion_distribution.png')
plt.show()

The histogram shows the distribution of label counts in the emotion dataset, with the label numbers on the x-axis and the count of each label on the y-axis. The dataset is only moderately balanced: anger (0), fear (2), happy (3), and sad (5) each have roughly 430 samples, disgust (1) has about 306, while neutral (4) and surprise (6) have only around 250 each. The majority classes therefore have roughly 1.7 times as many samples as the smallest ones, which is worth keeping in mind, since the model will see the under-represented emotions less often during training.
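
If this imbalance turns out to hurt the under-represented classes, one option is to pass per-class weights to model.fit later on. A minimal sketch, assuming train_df from above and scikit-learn's compute_class_weight:

Code
# Sketch: derive per-class weights from the training label counts (rarer labels get larger weights).
from sklearn.utils.class_weight import compute_class_weight

label_values = train_df['label'].values
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(label_values),
                               y=label_values)
class_weight = dict(zip(np.unique(label_values), weights))
print(class_weight)
# Later: model.fit(..., class_weight=class_weight)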

Pre-Processing the Data

Audio Features (a short sketch computing each of these for a single clip appears after the list below):

  • Mel Frequency Cepstral Coefficients (MFCC) : It is a feature extraction technique widely used in speech processing and recognition. It involves extracting the spectral envelope of a speech signal, typically using the Discrete Fourier Transform (DFT), and then mapping it to the mel frequency scale, which better reflects human perception of sound. From this mel-scaled spectrogram, the MFCCs are obtained by taking the logarithm of the power spectrum and performing a discrete cosine transform. The resulting MFCCs capture the most relevant information of the speech signal, such as phonetic content, speaker identity, and emotion. MFCCs are commonly used as inputs to machine learning algorithms for speech recognition and related tasks.
  • Chroma feature extraction: It is a technique used to represent the harmonic content of an audio signal in a compact manner. Chroma features are based on the pitch class profiles of musical notes, which are invariant to octave transposition and are typically represented using a circular layout called the chroma circle. Chroma features can be computed from the short-time Fourier transform (STFT) of an audio signal, by first mapping the power spectrum to the pitch class domain and then summing the energy of each pitch class over time. The resulting chroma feature matrix can be used as input to machine learning algorithms for tasks such as music genre classification, chord recognition, and melody extraction.
  • Pitch: It is a perceptual attribute of sound that allows us to distinguish between high and low frequency sounds. It is closely related to the physical property of frequency, which is the number of cycles per second that a sound wave completes. High-pitched sounds have a high frequency, while low-pitched sounds have a low frequency. In music, pitch is used to describe the perceived height or lowness of a musical note. Pitch can be manipulated by altering the frequency of a sound wave using techniques such as tuning or modulation. Pitch perception is an important aspect of human auditory processing and is essential for tasks such as speech recognition and music appreciation.
  • Magnitude: In signal processing, magnitude refers to the amplitude or strength of a signal, which is a measure of how much energy is contained in the signal. It is typically calculated as the absolute value of a complex number, which is a mathematical representation of a signal that includes both its magnitude and phase. Magnitude can be used to describe various characteristics of a signal, such as its power, energy, or intensity. For example, in the context of audio signal processing, magnitude can be used to represent the loudness or volume of a sound, while in image processing, magnitude can be used to represent the strength of different frequencies in an image.
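
As mentioned above, here is a short sketch computing each of these raw features for one clip with librosa, to show the shapes they take before any averaging (the file path and imports are assumed from the examples above):

Code
# Sketch: computing each raw feature for one clip to see its shape.
y, sr = librosa.load('Dataset/anger/anger001.wav', duration=2.5, offset=0.5)

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape (13, n_frames)
pitches, magnitudes = librosa.piptrack(y=y, sr=sr)         # shape (n_freq_bins, n_frames) each
y_harmonic, _ = librosa.effects.hpss(y)
chroma = librosa.feature.chroma_cqt(y=y_harmonic, sr=sr)   # shape (12, n_frames)

print(mfccs.shape, pitches.shape, magnitudes.shape, chroma.shape)
# Averaging over the time axis (as get_audio_features below does) turns each of these
# into a fixed-length vector, independent of the clip's exact duration.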

Functions for getting features of audio files

The function get_audio_features extracts audio features from a given audio file path using the Librosa library in Python. It loads 2.5 seconds of the audio file at the specified path (resampled to twice the passed sampling rate, with a 0.5-second offset) and separates the harmonic and percussive components of the signal. It then estimates pitches and their magnitudes with librosa's piptrack algorithm and keeps the first 20 non-zero values of the time-averaged pitch and magnitude arrays. The function also computes the mean Mel-frequency cepstral coefficients (MFCCs) of the signal and the mean chroma feature of the harmonic component using the Constant-Q Transform (CQT). Finally, it returns a list containing the MFCC, pitch, magnitude, and chroma features, which is used downstream as input to the classification model.

Code
def get_audio_features(audio_path,sampling_rate):
    
    # Load audio file at given path, resample to target rate, and extract features
    X, sample_rate = librosa.load(audio_path, res_type='kaiser_fast', duration=2.5, sr=sampling_rate*2, offset=0.5)

    # Convert sample rate to a NumPy array for consistency
    sample_rate = np.array(sample_rate)

    # Separate harmonic and percussive components of audio signal
    y_harmonic, y_percussive = librosa.effects.hpss(X)

    # Compute pitch and magnitude of audio signal using PIPtrack algorithm
    pitches, magnitudes = librosa.core.pitch.piptrack(y=X, sr=sample_rate)

    # Compute Mel-frequency cepstral coefficients (MFCCs) from audio signal
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=13), axis=1)

    # Keep the first 20 non-zero values of the time-averaged pitch array
    pitches = np.trim_zeros(np.mean(pitches, axis=1))[:20]

    # Keep the first 20 non-zero values of the time-averaged magnitude array
    magnitudes = np.trim_zeros(np.mean(magnitudes, axis=1))[:20]

    # Compute chroma feature from harmonic component of audio signal using Constant-Q Transform (CQT)
    C = np.mean(librosa.feature.chroma_cqt(y=y_harmonic, sr=sampling_rate), axis=1)

    # Return a list of audio features, including MFCCs, pitch, magnitude, and chroma features
    return [mfccs, pitches, magnitudes, C]

The function get_features_dataframe takes a Pandas dataframe containing audio file paths and their corresponding labels, as well as a sampling rate as input. It uses the get_audio_features function to extract audio features for each file path in the dataframe, and stores these features in a new Pandas dataframe. The resulting dataframe is split into separate dataframes for each type of feature (i.e. mfcc, pitches, magnitudes, and C), which are then concatenated into a single dataframe. The function returns this combined feature dataframe along with a separate dataframe containing the original labels.

Code
def get_features_dataframe(dataframe, sampling_rate):
    
    # Create a new dataframe to hold the labels
    labels = pd.DataFrame(dataframe['label'])

    # Create an empty dataframe to hold the audio features
    features  = pd.DataFrame(columns=['mfcc','pitches','magnitudes','C'])

    # Loop through each audio file path in the input dataframe and compute its features
    for index, audio_path in enumerate(dataframe['path']):
        features.loc[index] = get_audio_features(audio_path, sampling_rate)

    # Split the features into separate dataframes for each feature type
    mfcc = features.mfcc.apply(pd.Series)
    pit = features.pitches.apply(pd.Series)
    mag = features.magnitudes.apply(pd.Series)
    C = features.C.apply(pd.Series)

    # Concatenate the separate feature dataframes into a single dataframe and return it with the labels
    combined_features = pd.concat([mfcc,pit,mag,C],axis=1,ignore_index=True)
    return combined_features, labels 

Getting the features of audio files using librosa (Usually takes 12-15 mins to run):

Code
trainfeatures, trainlabel = get_features_dataframe(train_df, sampling_rate)
testfeatures, testlabel = get_features_dataframe(test_df, sampling_rate)
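
Because this step is slow, the extracted feature DataFrames can be cached to disk once and reloaded on later runs; a small sketch (the file names are illustrative):

Code
# Sketch: cache the extracted features so the slow extraction step only runs once.
trainfeatures.to_pickle('trainfeatures.pkl')
trainlabel.to_pickle('trainlabel.pkl')
testfeatures.to_pickle('testfeatures.pkl')
testlabel.to_pickle('testlabel.pkl')

# On a later run, load them back instead of re-extracting:
# trainfeatures = pd.read_pickle('trainfeatures.pkl')
# trainlabel = pd.read_pickle('trainlabel.pkl')
# testfeatures = pd.read_pickle('testfeatures.pkl')
# testlabel = pd.read_pickle('testlabel.pkl')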

Fill NA values with 0:

Code
trainfeatures = trainfeatures.fillna(0)
testfeatures = testfeatures.fillna(0)

Converting the feature and label DataFrames to NumPy arrays (the labels are flattened to 1D with .ravel()):

Code
X_train = np.array(trainfeatures)
y_train = np.array(trainlabel).ravel()
X_test = np.array(testfeatures)
y_test = np.array(testlabel).ravel()

One-Hot Encoding the labels:

Code
lb = LabelEncoder()

y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.fit_transform(y_test))
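
Note that fit_transform is applied to the test labels separately here, which only matches the training encoding if every class appears in both splits. A slightly safer sketch of the same step fits the encoder on the training labels once and reuses it:

Code
# Sketch: fit the LabelEncoder on the training labels only and reuse it for the test labels.
lb = LabelEncoder()
y_train = np_utils.to_categorical(lb.fit_transform(y_train))
y_test = np_utils.to_categorical(lb.transform(y_test), num_classes=y_train.shape[1])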

Adding a channel dimension for the 1D CNN:

Code
x_traincnn =np.expand_dims(X_train, axis=2)
x_testcnn= np.expand_dims(X_test, axis=2)

CNN Model

Creating a Model

This defines a Sequential model for a 1D convolutional neural network (CNN). The model stacks several 1D convolutional layers with ReLU activations, dropout regularization, and max pooling, followed by a flatten layer and a dense output layer. Its input is the feature vector built above (13 MFCCs plus up to 20 pitch, up to 20 magnitude, and 12 chroma values — 65 in this run, as the model summary below confirms) with a single channel, and its output is a probability distribution over the seven emotion classes. The optimizer used with this model is RMSprop with a small learning rate and decay. Overall, this code defines a CNN for audio classification, where the input is a set of features extracted from an audio signal and the output is the predicted emotion of the sample.

Code
# Define a Sequential model for 1D convolutional neural network (CNN)
model = Sequential()

# Add a 1D convolutional layer with 256 filters, kernel size 5, 'same' padding, and input shape (number of features, 1)
model.add(Conv1D(256, 5,padding='same', input_shape=(x_traincnn.shape[1],x_traincnn.shape[2])))

# Add a ReLU activation function
model.add(Activation('relu'))

# Add a 1D convolutional layer with 128 filters, kernel size 5, padding same
model.add(Conv1D(128, 5,padding='same'))

# Add a ReLU activation function
model.add(Activation('relu'))

# Add a dropout layer with dropout rate 0.1
model.add(Dropout(0.1)) #Dropout Regularization

# Add a max pooling layer with pool size 8
model.add(MaxPooling1D(pool_size=(8)))

# Add a 1D convolutional layer with 128 filters, kernel size 5, padding same
model.add(Conv1D(128, 5,padding='same',))

# Add a ReLU activation function
model.add(Activation('relu'))

# Add a 1D convolutional layer with 128 filters, kernel size 5, padding same
model.add(Conv1D(128, 5,padding='same',))

# Add a ReLU activation function
model.add(Activation('relu'))

# Flatten the output from convolutional layers
model.add(Flatten()) #Flattening the input

# Add a dense layer with number of neurons equal to number of classes
model.add(Dense(y_train.shape[1]))

# Add a softmax activation function to output probabilities
model.add(Activation('softmax'))

# Define the RMSprop optimizer with learning rate and decay
opt = tensorflow.keras.optimizers.legacy.RMSprop(learning_rate=0.00001, decay=1e-6) 

Summary of the Sequential Model:

Code
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv1d (Conv1D)             (None, 65, 256)           1536      
                                                                 
 activation (Activation)     (None, 65, 256)           0         
                                                                 
 conv1d_1 (Conv1D)           (None, 65, 128)           163968    
                                                                 
 activation_1 (Activation)   (None, 65, 128)           0         
                                                                 
 dropout (Dropout)           (None, 65, 128)           0         
                                                                 
 max_pooling1d (MaxPooling1D  (None, 8, 128)           0         
 )                                                               
                                                                 
 conv1d_2 (Conv1D)           (None, 8, 128)            82048     
                                                                 
 activation_2 (Activation)   (None, 8, 128)            0         
                                                                 
 conv1d_3 (Conv1D)           (None, 8, 128)            82048     
                                                                 
 activation_3 (Activation)   (None, 8, 128)            0         
                                                                 
 flatten (Flatten)           (None, 1024)              0         
                                                                 
 dense (Dense)               (None, 7)                 7175      
                                                                 
 activation_4 (Activation)   (None, 7)                 0         
                                                                 
=================================================================
Total params: 336,775
Trainable params: 336,775
Non-trainable params: 0
_________________________________________________________________

Compile the model:

Code
model.compile(loss='categorical_crossentropy', optimizer=opt,metrics=['accuracy'])

Training and Evaluation

Training the model:

Code
cnnhistory=model.fit(x_traincnn, y_train, batch_size=16, epochs=370, validation_data=(x_testcnn, y_test), verbose=0)
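
The model trains here for a fixed 370 epochs. If overfitting becomes a concern, a Keras EarlyStopping callback can halt training once the validation loss stops improving; a minimal sketch using the same model and data:

Code
# Sketch: the same fit call with early stopping on validation loss (keeps the best weights seen).
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=25, restore_best_weights=True)
cnnhistory = model.fit(x_traincnn, y_train, batch_size=16, epochs=370,
                       validation_data=(x_testcnn, y_test),
                       callbacks=[early_stop], verbose=0)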

Plotting loss vs. epochs:

Code
plt.plot(cnnhistory.history['loss'])
plt.plot(cnnhistory.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.savefig('loss_vs_iterations.png')
plt.show()
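
The accuracy curves from the same history object can be plotted the same way (assuming the metric keys follow Keras's default 'accuracy'/'val_accuracy' naming):

Code
# Sketch: training and validation accuracy over epochs.
plt.plot(cnnhistory.history['accuracy'])
plt.plot(cnnhistory.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()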

Saving the model

Saving .h5 model file:

Code
model_name = 'Speech_Emotion_Recognition_Model.h5'
save_dir = os.path.join(os.getcwd(), 'Trained_Models')
# Save model and weights
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)
model_path = os.path.join(save_dir, model_name)
model.save(model_path)
print('Saved trained model at %s ' % model_path)
Saved trained model at /Users/rd/Documents/Academic Portfolio/Speech Emotion Recognition/Trained_Models/Speech_Emotion_Recognition_Model.h5 

Saving .json model file:

Code
import json
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)

Loading the model

Code
# loading json and creating model
from keras.models import model_from_json
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("./Trained_Models/Speech_Emotion_Recognition_Model.h5")
print("Loaded model from disk")
 
# evaluate loaded model on test data
loaded_model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
score = loaded_model.evaluate(x_testcnn, y_test, verbose=0)
print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))
Loaded model from disk
accuracy: 42.31%

Test Set Prediction

Predicting emotions on the test data

Code
preds = loaded_model.predict(x_testcnn, 
                         batch_size=32, 
                         verbose=1)

preds=preds.argmax(axis=1)
preds = preds.astype(int).flatten()
predictions = (lb.inverse_transform((preds)))
preddf = pd.DataFrame({'predictedvalues': predictions})
actual=y_test.argmax(axis=1)
actual = actual.astype(int).flatten()
actualvalues = (lb.inverse_transform((actual)))
actualdf = pd.DataFrame({'actualvalues': actualvalues})
finaldf = actualdf.join(preddf)
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 112ms/step

Actual v/s Predicted emotions

Code
finaldf[:10]
actualvalues predictedvalues
0 5 5
1 5 5
2 0 0
3 3 3
4 5 3
5 2 5
6 1 2
7 2 2
8 2 0
9 1 4
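
The introduction also mentions precision, recall, and F1-score; these can be computed from the same predictions with scikit-learn. A sketch, assuming finaldf from above (zero_division guards against classes that happen to be missing from the small test split):

Code
# Sketch: per-class precision/recall/F1 and the confusion matrix from the saved predictions.
from sklearn.metrics import classification_report, confusion_matrix

emotion_names = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
print(classification_report(finaldf['actualvalues'], finaldf['predictedvalues'],
                            labels=list(range(7)), target_names=emotion_names,
                            zero_division=0))
print(confusion_matrix(finaldf['actualvalues'], finaldf['predictedvalues'],
                       labels=list(range(7))))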

Saving the predicted values in a CSV file:

Code
finaldf.to_csv('Predictions.csv', index=False)

Demonstration on a demo audio file

Loading the demo audio file:

Code
demo_audio_path = './demo_audio.wav'
ipd.Audio(demo_audio_path)

Getting the features of the demo audio file:

Code
demo_mfcc, demo_pitch, demo_mag, demo_chrom = get_audio_features(demo_audio_path,sampling_rate)

mfcc = pd.Series(demo_mfcc)
pit = pd.Series(demo_pitch)
mag = pd.Series(demo_mag)
C = pd.Series(demo_chrom)
demo_audio_features = pd.concat([mfcc,pit,mag,C],ignore_index=True)

demo_audio_features= np.expand_dims(demo_audio_features, axis=0)
demo_audio_features= np.expand_dims(demo_audio_features, axis=2)

Predicting the emotion of the demo audio file:

Code
livepreds = loaded_model.predict(demo_audio_features, 
                         batch_size=32, 
                         verbose=1)
1/1 [==============================] - ETA: 0s
1/1 [==============================] - 0s 42ms/step
Code
emotions=["Anger","Disgust","Fear","Happy","Neutral", "Sad", "Surprise"]
index = livepreds.argmax(axis=1).item()
print("Emotion predicted from the model:",emotions[index])
Emotion predicted from the model: Anger
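
For reuse, the demo steps above can be wrapped into a small helper; this is a sketch (the function name is illustrative, and it assumes loaded_model, get_audio_features, and sampling_rate are defined as above):

Code
# Sketch: wrap feature extraction and prediction for a single file into one helper.
def predict_emotion(audio_path, model, sampling_rate):
    emotion_names = ["Anger", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]
    mfcc, pitch, mag, chrom = get_audio_features(audio_path, sampling_rate)
    feats = pd.concat([pd.Series(mfcc), pd.Series(pitch), pd.Series(mag), pd.Series(chrom)],
                      ignore_index=True)
    # Shape (1, n_features, 1) to match the CNN input; assumes the clip yields the full feature length
    feats = np.expand_dims(np.expand_dims(feats, axis=0), axis=2)
    probs = model.predict(feats, verbose=0)
    return emotion_names[probs.argmax(axis=1).item()]

# Usage:
# print(predict_emotion('./demo_audio.wav', loaded_model, sampling_rate))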

Conclusion

Speech Emotion Recognition (SER) is an important research area in the field of signal processing and machine learning. In this project, we used the RAVDESS dataset, which contains a diverse range of emotional speech recordings. Our goal was to train a deep learning model that can accurately classify the emotional state of a speaker from the speech signal. We used a combination of signal processing features, namely MFCC, chroma, pitch, and magnitude, extracted from the audio signals.

We preprocessed the data by loading fixed-duration clips, extracting MFCC, pitch, magnitude, and chroma features, and filling missing feature values with zeros. We then trained a neural network using the Keras API in TensorFlow to classify the emotion from the audio signal. The model architecture consists of multiple Conv1D layers with ReLU activations, dropout, and MaxPooling1D, followed by a Flatten layer and a Dense output layer with a softmax activation. We used the RMSprop optimizer to minimize the categorical cross-entropy loss and tracked accuracy to evaluate the model's performance.

Our results show that the model achieved a test accuracy of approximately 42% (42.31% when the saved model was re-evaluated above), which is a reasonable starting point given the complexity of the task and the size of the dataset; note also that the 1% test split leaves only a small number of held-out samples, so this estimate is noisy. We also observed that the model performed better on certain emotions, such as neutral and happy, while struggling to accurately classify others, such as disgust and surprise. This could be due to the fact that some emotions are more distinct and easily recognizable from the audio signal than others.

In conclusion, our study demonstrates the feasibility of using deep learning models for speech emotion recognition. The RAVDESS dataset provides a rich resource for future research in this area, with its diverse range of emotions and large number of samples. However, our study also highlights some of the challenges and limitations of this approach, such as the need for large amounts of data and the difficulty of accurately classifying certain emotions. Future work could explore alternative feature extraction techniques and model architectures to further improve the performance of speech emotion recognition systems. Additionally, the development of real-world applications for this technology could have a significant impact on fields such as psychology, human-computer interaction, and entertainment.
