Support Vector Machine Classifier for Record Data

Import Libraries

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

from sklearn.compose import ColumnTransformer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import (
    ConfusionMatrixDisplay, accuracy_score, classification_report,
    confusion_matrix, f1_score, log_loss, precision_score, recall_score,
    roc_auc_score, roc_curve,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVC

import warnings 
warnings.filterwarnings("ignore")

Import Data

The cleaned record data is used to train a supervised SVM classifier that predicts the label variable (Podium, Top 10 or Outside Top 10).
Once the driver reference is merged in below, the data consists of 26,941 rows and 23 columns: 22 feature variables and 1 label column.
It is a historical record of the races held over the past 71 years, with one row for each position a driver held in each race.
Some of the feature variables include laps in the race, grid position held, age at time of the race, history of wins in the past, history of laps completed in the past, weather of the race, points gained in the race and many more.

Code

df = pd.read_csv('../../data/02-model-data/data_cleaned.csv')
df.head()

	season	round	season_round	driverId	raceId	circuitId	position	points	grid	laps	...	weather	stop	age_on_race	cumulative_points	cumulative_laps	pole_driverId	pole_history	win_driverId	win_history	label
0	1950	1	1950_1	642	833	9	1	9.0	1	70	...	Fine	Not Available	44	9.0	70	642	1	642	1	Podium
1	1950	1	1950_1	786	833	9	2	6.0	2	70	...	Fine	Not Available	52	6.0	70	642	0	642	0	Podium
2	1950	1	1950_1	686	833	9	3	4.0	4	70	...	Fine	Not Available	39	4.0	70	642	0	642	0	Podium
3	1950	1	1950_1	704	833	9	4	3.0	6	68	...	Fine	Not Available	46	3.0	68	642	0	642	0	Top_10
4	1950	1	1950_1	627	833	9	5	2.0	9	68	...	Fine	Not Available	45	2.0	68	642	0	642	0	Top_10

5 rows × 22 columns

Code

driver_df = pd.read_csv('../../data/00-raw-data/drivers.csv')
driver_df.head()

	driverId	driverRef	number	code	forename	surname	dob	nationality	url
0	1	hamilton	44	HAM	Lewis	Hamilton	1985-01-07	British	http://en.wikipedia.org/wiki/Lewis_Hamilton
1	2	heidfeld	\N	HEI	Nick	Heidfeld	1977-05-10	German	http://en.wikipedia.org/wiki/Nick_Heidfeld
2	3	rosberg	6	ROS	Nico	Rosberg	1985-06-27	German	http://en.wikipedia.org/wiki/Nico_Rosberg
3	4	alonso	14	ALO	Fernando	Alonso	1981-07-29	Spanish	http://en.wikipedia.org/wiki/Fernando_Alonso
4	5	kovalainen	\N	KOV	Heikki	Kovalainen	1981-10-19	Finnish	http://en.wikipedia.org/wiki/Heikki_Kovalainen

Code

df = pd.merge(df, driver_df[['driverId', 'driverRef']], on='driverId')

Code

df.shape

(26941, 23)

Data Pre-Processing and Visualization

The data was cleaned in the previous sections, but some pre-processing is still needed to make it “model-ready”.
Some unnecessary columns are dropped, and the remaining columns are separated into numeric and categorical groups.
The numeric columns are scaled using a Standard Scaler and the categorical columns are one-hot encoded to minimize loss of data. All of this is done with the help of a function which uses sklearn’s Pipeline module.
If the transformers are fitted on the whole training set before it is handed to GridSearchCV, every validation fold has already contributed to the fitted statistics (for example the scaler’s mean and variance).
Pipelines help avoid leaking statistics from the held-out fold into the trained model during cross-validation, by ensuring that the same samples are used to fit the transformers and the predictor.
Separating feature scaling from model fitting while using GridSearchCV therefore produces validation folds that are not truly unseen, which biases the cross-validation scores.
Furthermore, the data is split into training and test sets by season rather than with sklearn’s train_test_split: the training set is made up of races before 2021, and testing is done on the 2021 races.
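
The leakage argument above can be sketched on synthetic data. This is a minimal illustration, not the notebook's race data: `make_classification` stands in for the real features, and the "leaky" variant is included only for contrast. In the pipeline version the scaler is re-fitted on the training part of every split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: any numeric feature matrix would do here).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Leaky variant: the scaler sees every row, including rows that later
# end up in a validation fold.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(SVC(kernel="linear", C=0.1), X_leaky, y, cv=5)

# Pipeline variant: the scaler is a step of the estimator, so each split
# fits it on the training fold only and merely applies it to the validation fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear", C=0.1))
clean_scores = cross_val_score(pipe, X, y, cv=5)

print("scaled up front :", leaky_scores.mean())
print("inside pipeline :", clean_scores.mean())
```

On a small, well-behaved dataset the two numbers may be close; the point is that only the pipeline version gives validation folds the model has genuinely never influenced.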

Code

df.drop(['season_round', 'constructorRef', 'raceId', 'driverId'], axis=1, inplace=True)

Code

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26941 entries, 0 to 26940
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   season             26941 non-null  int64  
 1   round              26941 non-null  int64  
 2   circuitId          26941 non-null  int64  
 3   position           26941 non-null  int64  
 4   points             26941 non-null  float64
 5   grid               26941 non-null  int64  
 6   laps               26941 non-null  int64  
 7   status             26941 non-null  object 
 8   weather            26941 non-null  object 
 9   stop               26941 non-null  object 
 10  age_on_race        26941 non-null  int64  
 11  cumulative_points  26941 non-null  float64
 12  cumulative_laps    26941 non-null  int64  
 13  pole_driverId      26941 non-null  int64  
 14  pole_history       26941 non-null  int64  
 15  win_driverId       26941 non-null  int64  
 16  win_history        26941 non-null  int64  
 17  label              26941 non-null  object 
 18  driverRef          26941 non-null  object 
dtypes: float64(2), int64(12), object(5)
memory usage: 4.1+ MB

Code

df.head()

	season	round	circuitId	position	points	grid	laps	status	weather	stop	age_on_race	cumulative_points	cumulative_laps	pole_driverId	pole_history	win_driverId	win_history	label	driverRef
0	1950	1	9	1	9.0	1	70	Finished	Fine	Not Available	44	9.0	70	642	1	642	1	Podium	farina
1	1950	2	6	11	0.0	2	0	Accident	Not Available	Not Available	44	9.0	70	579	1	579	1	Outside_Top_10	farina
2	1950	4	66	1	9.0	2	42	Finished	Sunny	Not Available	44	18.0	112	579	1	642	2	Podium	farina
3	1950	5	13	4	4.0	1	35	Finished	Sunny	Not Available	44	22.0	147	642	2	579	2	Top_10	farina
4	1950	6	55	7	0.0	2	55	Mechanical_Issue	Sunny	Not Available	44	22.0	202	579	2	579	2	Top_10	farina

Code

df = df[df['season'] != 2022]

Splitting data into train and test:

Code

X_train = df[df['season'] != 2021].drop(columns = ['label'])
y_train = df.loc[df['season'] != 2021, ['season', 'round', 'driverRef', 'label']]
X_test = df[df['season'] == 2021].drop(columns = ['label'])
y_test = df.loc[(df['season'] == 2021), ['season', 'round', 'driverRef', 'label']]

Grouping the data by setting the index of train and test data into season, round and driver references:

Code

X_train = X_train.set_index(['season', 'round', 'driverRef'])
y_train = y_train.set_index(['season', 'round', 'driverRef'])
X_test = X_test.set_index(['season', 'round', 'driverRef'])
y_test = y_test.set_index(['season', 'round', 'driverRef'])

Code

numeric_features = ['circuitId', 'position', 'points', 'grid', 'laps', 'age_on_race', 'cumulative_points', 'cumulative_laps',
       'pole_driverId', 'pole_history', 'win_driverId', 'win_history']

categorical_features = ['status', 'weather', 'stop']

Code

display(X_test.head())
display(y_test.head())

			circuitId	position	points	grid	laps	status	weather	stop	age_on_race	cumulative_points	cumulative_laps	pole_driverId	pole_history	win_driverId	win_history
season	round	driverRef
2021	1	raikkonen	3	11	0.0	14	56	Finished	Sunny	Two	42	1863.0	17613	830	18	1	21
	2	raikkonen	21	13	0.0	16	63	Finished	Rainy	Three	42	1863.0	17676	1	18	830	21
	3	raikkonen	75	20	0.0	15	1	Mechanical_Issue	Cloudy	Not Available	42	1863.0	17677	822	18	1	21
	4	raikkonen	4	12	0.0	17	65	Lapped	Cloudy	One	42	1863.0	17742	1	18	1	21
	5	raikkonen	6	11	0.0	14	77	Lapped	Sunny	One	42	1863.0	17819	844	18	830	21

			label
season	round	driverRef
2021	1	raikkonen	Outside_Top_10
	2	raikkonen	Outside_Top_10
	3	raikkonen	Outside_Top_10
	4	raikkonen	Outside_Top_10
	5	raikkonen	Outside_Top_10

Creating a function with sklearn’s Pipeline module and transformers to convert categorical and numerical features:

Code

def prediction_model(model_type, model_id):
    """Build a Pipeline that preprocesses the features and then fits the given estimator."""
    # Scale numeric features with StandardScaler
    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
    # One-hot encode categorical features; categories unseen during fit are encoded as all zeros
    categorical_transformer = Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown = 'ignore'))])
    preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                                   ('cat', categorical_transformer, categorical_features)])
    # model_id becomes the step name, which prefixes the grid-search keys (e.g. 'svm__C')
    pipeline = Pipeline(steps=[('prep', preprocessor), 
                               (model_id, model_type)])
    return pipeline
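
To see what the preprocessing step produces, here is a small sketch on a hypothetical three-column frame. The column values are invented for illustration, and `sparse_threshold=0` is added (it is not in the function above) only so the result is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows; not taken from the race data.
toy = pd.DataFrame({
    "grid": [1, 5, 12, 20],
    "laps": [70, 68, 55, 3],
    "weather": ["Sunny", "Rainy", "Sunny", "Cloudy"],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[("scaler", StandardScaler())]), ["grid", "laps"]),
        ("cat", Pipeline(steps=[("ohe", OneHotEncoder(handle_unknown="ignore"))]), ["weather"]),
    ],
    sparse_threshold=0,  # assumption for readability: always return a dense array
)

encoded = preprocessor.fit_transform(toy)
# 2 scaled numeric columns + 3 one-hot columns (one per weather value seen in fit)
print(encoded.shape)

# A weather value that never appeared during fit is encoded as all zeros
# because of handle_unknown='ignore'.
unseen = pd.DataFrame({"grid": [3], "laps": [60], "weather": ["Snowy"]})
row = preprocessor.transform(unseen)
print(row[0, 2:])
```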

SVM

Support Vector Machines are a set of supervised learning methods used for classification, regression, and outlier detection. All of these are common tasks in machine learning.
There are specific types of SVMs you can use for particular machine learning problems, like support vector regression (SVR), which is an extension of support vector classification (SVC).
SVMs differ from other classification algorithms in how they choose the decision boundary: they pick the one that maximizes the distance to the nearest data points of all the classes. The decision boundary created by SVMs is called the maximum margin classifier or the maximum margin hyperplane.
A simple linear SVM classifier works by drawing a straight line between two classes. All of the data points on one side of the line are assigned to one category and the data points on the other side are assigned to the other. In general, infinitely many such lines can separate the same two classes.
What distinguishes the linear SVM from some other algorithms, like k-nearest neighbors, is that it selects one of these lines by an explicit criterion: the line that separates the data and lies as far as possible from the closest data points.
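
The maximum-margin idea can be sketched on six hand-picked points. This is an illustration under stated assumptions: the points are hypothetical, and a very large `C` is used to approximate a hard-margin SVM on separable data. For a linear kernel the margin width is 2 / ||w||, where w is the fitted weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy clusters (hypothetical points).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C leaves almost no slack, approximating a hard margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # normal vector of the separating line
b = clf.intercept_[0]     # offset of the separating line
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)

# Every training point lies on or outside the margin: y_i * (w.x_i + b) >= 1
signed = np.where(y == 1, 1.0, -1.0) * (X @ w + b)
print("min functional margin:", signed.min())
```

Only the points closest to the boundary end up as support vectors; moving any of the other points (without crossing the margin) would leave the fitted line unchanged.
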
Pros
- Effective on datasets with multiple features, like financial or medical data.
- Uses a subset of training points in the decision function called support vectors which makes it memory efficient.
- Different kernel functions can be specified for the decision function. You can use common kernels, but it’s also possible to specify custom kernels.
Cons
- If the number of features is a lot bigger than the number of data points, avoiding over-fitting when choosing kernel functions and regularization term is crucial.
- SVMs don’t directly provide probability estimates. Those are calculated using an expensive five-fold cross-validation.
- Works best on small sample sets because of its high training time.
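
The point about probability estimates can be illustrated with a short sketch on synthetic data (an assumption; `make_classification` stands in for the race data). Without `probability=True` an `SVC` only exposes `decision_function`; with it, scikit-learn runs an additional internal cross-validation during `fit` to calibrate probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic three-class data as a stand-in.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# Default SVC: raw decision scores only, no predict_proba.
plain = SVC(kernel="linear").fit(X, y)
print(hasattr(plain, "predict_proba"))
scores = plain.decision_function(X[:3])   # one score per class (one-vs-rest shape)

# probability=True: slower fit, but predict_proba becomes available.
calibrated = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)
proba = calibrated.predict_proba(X[:3])
print(np.round(proba, 3))
```

This is why the grid search later in this section passes `SVC(probability=True)`: the `neg_log_loss` scorer needs `predict_proba`.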

Model Prediction function

After fitting the model, it is important to showcase and visualize the classification results.
The model_results function predicts labels for the test data (2021 races). It displays the first 40 rows of the test results along with the prediction probabilities.
The function also fills the prediction scorecard dictionary which contains:
1. Model
2. Accuracy
3. Precision
4. Recall
5. Best parameters
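
When reading the prediction probabilities, it helps to know which column belongs to which label. A minimal sketch, assuming toy clusters with the same three label strings as this notebook: scikit-learn stores the labels in sorted order in `classes_`, and the columns of `predict_proba` follow that order.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Three well-separated toy clusters with string labels (illustrative only).
centers = {"Podium": 0.0, "Top_10": 5.0, "Outside_Top_10": 10.0}
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 30)

clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

# Labels are sorted alphabetically, regardless of the order they appear in y.
print(clf.classes_)

# Naming the columns from classes_ avoids hard-coding positions such as prob_0, prob_1, prob_2.
proba = pd.DataFrame(clf.predict_proba(X[:3]), columns=clf.classes_)
print(proba.round(4))
```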

Code

prediction_scorecard = {'model':[],
                        'accuracy_score':[],
                        'precision_score':[],
                        'recall_score':[],
                        'best_params':[]}

Code

def model_results(X_test, model, model_id):
    # Predict labels and class probabilities for the test races.
    # Note: the true labels are read from the global y_test.
    pred = model.predict(X_test)
    pred_proba = model.predict_proba(X_test)
    # Probability columns follow the sorted class labels:
    # prob_0 = Outside_Top_10, prob_1 = Podium, prob_2 = Top_10
    df_pred = pd.DataFrame(np.around(pred_proba, 4), index=X_test.index, columns=['prob_0', 'prob_1', 'prob_2'])
    df_pred['prediction'] = list(pred)
    df_pred['actual'] = y_test['label']
    df_pred['grid_position'] = X_test['grid']

    # Within each round, sort drivers by predicted podium probability (prob_1)
    df_pred = df_pred.groupby(['round']).apply(pd.DataFrame.sort_values, 'prob_1', ascending=False)

    # Save accuracy, precision, recall and the best hyperparameters
    prediction_scorecard['model'].append(model_id)
    prediction_scorecard['accuracy_score'].append(accuracy_score(df_pred['actual'], df_pred['prediction']))
    prediction_scorecard['precision_score'].append(precision_score(df_pred['actual'], df_pred['prediction'], average='micro'))
    prediction_scorecard['recall_score'].append(recall_score(df_pred['actual'], df_pred['prediction'], average='micro'))
    prediction_scorecard['best_params'].append(str(model.best_params_))
    display(df_pred.head(40))
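
A note on the `average='micro'` choice used in the scorecard: for single-label multiclass problems, micro-averaged precision and recall both equal accuracy, so the three scorecard columns carry the same number. The sketch below uses six made-up predictions to show this, with macro averaging added for contrast.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 3 of the 6 predictions are correct.
actual    = ["Podium", "Top_10", "Top_10", "Outside_Top_10", "Outside_Top_10", "Podium"]
predicted = ["Podium", "Top_10", "Outside_Top_10", "Outside_Top_10", "Top_10", "Top_10"]

acc = accuracy_score(actual, predicted)
micro_p = precision_score(actual, predicted, average="micro")
micro_r = recall_score(actual, predicted, average="micro")

# Macro averaging computes the metric per class and then takes the unweighted mean,
# so it can differ from accuracy.
macro_p = precision_score(actual, predicted, average="macro")

print(acc, micro_p, micro_r)   # all three are 0.5
print(round(macro_p, 4))       # 0.6111
```

If per-class behaviour matters (for example, how well podiums specifically are predicted), `average='macro'` or `average=None` would be more informative.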

Grid search CV

Hyperparameters are values that you specify before training a machine-learning model. They control the learning process, whereas parameters are learned from the data.
The performance of a model depends significantly on its hyperparameter values. There is no way to know the best values in advance, so in practice we try a set of candidate values and keep the ones that perform best.
Doing this manually could take a considerable amount of time and resources, so we use GridSearchCV to automate the tuning of hyperparameters.
GridSearchCV is a class in sklearn’s model_selection module for hyperparameter tuning.
It tries every combination of the values passed in the parameter grid, evaluates the model for each combination using cross-validation, and reports the best combination according to a scoring metric of your choice (accuracy, f1, etc.).
Because every combination is fitted once per fold, the process is time consuming.

Code

svm_params= {'svm__C': [0.1],
             'svm__kernel': ['linear', 'poly'],
             'svm__degree': [2, 3],
             'svm__gamma': [0.01]}
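
To make the size of this search concrete, the grid can be expanded with `ParameterGrid`. The sketch below reuses the grid above and assumes the GridSearchCV default of 5-fold cross-validation, since no `cv` argument is passed in this notebook.

```python
from sklearn.model_selection import ParameterGrid

svm_params = {'svm__C': [0.1],
              'svm__kernel': ['linear', 'poly'],
              'svm__degree': [2, 3],
              'svm__gamma': [0.01]}

grid = list(ParameterGrid(svm_params))
for candidate in grid:
    print(candidate)

n_candidates = len(grid)       # 1 * 2 * 2 * 1 = 4
n_fits = n_candidates * 5      # 5 folds each; one more fit follows for the refit
print(n_candidates, n_fits)    # 4 20

# 'degree' only affects the polynomial kernel, so the two linear candidates
# describe the same model.
linear_candidates = [p for p in grid if p['svm__kernel'] == 'linear']
print(len(linear_candidates))  # 2
```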

Types of SVM Kernels:
- Linear: These are commonly recommended for text classification because most of these types of classification problems are linearly separable.
- Polynomial: The polynomial kernel is used less often in practice; it tends to be less computationally efficient than other kernels, and its predictions are often less accurate.
- Gaussian Radial Basis Function (RBF): One of the most powerful and commonly used kernels in SVMs. Usually the choice for non-linear data. It is not part of the parameter grid above.
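
The kernels listed above can be compared side by side with a small sketch. It assumes a synthetic two-moons dataset, which is deliberately not linearly separable; the scores say nothing about the race data and will vary with the noise level and hyperparameters.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half circles: a classic non-linear toy problem.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf"]:
    # 'degree' is only used by the poly kernel; 'gamma' is used by poly and rbf.
    model = make_pipeline(StandardScaler(),
                          SVC(kernel=kernel, degree=3, gamma="scale", C=1.0))
    kernel_scores[kernel] = cross_val_score(model, X, y, cv=5).mean()

for kernel, score in kernel_scores.items():
    print(f"{kernel:>6}: {score:.3f}")
```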

Code

scoring = ['neg_log_loss', 'accuracy']

svm_cv = GridSearchCV(prediction_model(SVC(probability=True), 'svm'),
                      param_grid=svm_params,
                      scoring=scoring, 
                      refit='neg_log_loss',  
                      verbose=0)
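
With a list passed to `scoring`, GridSearchCV records every metric but needs `refit` to know which one selects the final model. A minimal sketch on synthetic data follows; the dataset, the two-value grid, and `cv=3` are assumptions chosen to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class stand-in data.
X, y = make_classification(n_samples=150, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("svm", SVC(probability=True, random_state=0))])

search = GridSearchCV(pipe,
                      param_grid={"svm__C": [0.1, 1.0], "svm__kernel": ["linear"]},
                      scoring=["neg_log_loss", "accuracy"],
                      refit="neg_log_loss",   # metric that picks best_params_
                      cv=3)
search.fit(X, y)

# Each metric gets its own mean_test_* and rank_test_* columns.
results = pd.DataFrame(search.cv_results_)
print(results[["param_svm__C", "mean_test_neg_log_loss",
               "mean_test_accuracy", "rank_test_neg_log_loss"]])

# best_score_ is reported for the refit metric (negative log loss, so <= 0).
print(search.best_params_, search.best_score_)
```

Because the model is chosen by log loss, the candidate with the best accuracy is not guaranteed to be the one that is refitted.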

Fitting and Training the SVM model

Code

# Train Model
# Pass the label column as a 1-D Series rather than a one-column DataFrame
svm_cv.fit(X_train, y_train['label'])

GridSearchCV(estimator=Pipeline(steps=[('prep',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['circuitId',
                                                                          'position',
                                                                          'points',
                                                                          'grid',
                                                                          'laps',
                                                                          'age_on_race',
                                                                          'cumulative_points',
                                                                          'cumulative_laps',
                                                                          'pole_driverId',
                                                                          'pole_history',
                                                                          'win_driverId',
                                                                          'win_history']),
                                                                        ('cat',
                                                                         Pipeline(steps=[('ohe',
                                                                                          OneHotEncoder(handle_unknown='ignore'))]),
                                                                         ['status',
                                                                          'weather',
                                                                          'stop'])])),
                                       ('svm', SVC(probability=True))]),
             param_grid={'svm__C': [0.1], 'svm__degree': [2, 3],
                         'svm__gamma': [0.01],
                         'svm__kernel': ['linear', 'poly']},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'])

Testing the SVM model

Code

# Test Model
model_results(X_test, svm_cv, 'Support Vector Machines')
svm_results = pd.DataFrame(prediction_scorecard)

				prob_0	prob_1	prob_2	prediction	actual	grid_position
round	season	round	driverRef
1	2021	1	hamilton	0.0000	1.0000	0.0000	Podium	Podium	2
			max_verstappen	0.0000	1.0000	0.0000	Podium	Podium	1
			bottas	0.0000	0.9989	0.0011	Podium	Podium	3
			norris	0.0001	0.0106	0.9893	Top_10	Top_10	7
			raikkonen	0.9990	0.0009	0.0001	Outside_Top_10	Outside_Top_10	14
			giovinazzi	0.9998	0.0002	0.0000	Outside_Top_10	Outside_Top_10	12
			ocon	0.9999	0.0001	0.0000	Outside_Top_10	Outside_Top_10	16
			ricciardo	0.0000	0.0000	1.0000	Top_10	Top_10	6
			sainz	0.0000	0.0000	1.0000	Top_10	Top_10	8
			perez	0.0000	0.0000	1.0000	Top_10	Top_10	0
			alonso	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	9
			stroll	0.0000	0.0000	1.0000	Top_10	Top_10	10
			gasly	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	5
			leclerc	0.0000	0.0000	1.0000	Top_10	Top_10	4
			vettel	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	20
			russell	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	15
			latifi	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	17
			tsunoda	0.0000	0.0000	1.0000	Top_10	Top_10	13
			mick_schumacher	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	18
			mazepin	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	19
2	2021	2	hamilton	0.0000	1.0000	0.0000	Podium	Podium	1
			max_verstappen	0.0000	1.0000	0.0000	Podium	Podium	3
			norris	0.0000	0.9954	0.0046	Podium	Podium	7
			leclerc	0.0001	0.0128	0.9871	Top_10	Top_10	4
			perez	0.9987	0.0012	0.0001	Outside_Top_10	Outside_Top_10	2
			tsunoda	0.9997	0.0003	0.0000	Outside_Top_10	Outside_Top_10	20
			raikkonen	0.9999	0.0001	0.0000	Outside_Top_10	Outside_Top_10	16
			gasly	0.0000	0.0000	1.0000	Top_10	Top_10	5
			mick_schumacher	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	18
			latifi	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	14
			russell	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	12
			giovinazzi	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	17
			stroll	0.0000	0.0000	1.0000	Top_10	Top_10	10
			alonso	0.0000	0.0000	1.0000	Top_10	Top_10	15
			ocon	0.0000	0.0000	1.0000	Top_10	Top_10	9
			sainz	0.0000	0.0000	1.0000	Top_10	Top_10	11
			bottas	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	8
			ricciardo	0.0000	0.0000	1.0000	Top_10	Top_10	6
			vettel	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	0
			mazepin	1.0000	0.0000	0.0000	Outside_Top_10	Outside_Top_10	19

The best parameters for our model are:

Code

svm_cv.best_params_

{'svm__C': 0.1, 'svm__degree': 3, 'svm__gamma': 0.01, 'svm__kernel': 'linear'}

Code

svm_results

	model	accuracy_score	precision_score	recall_score	best_params
0	Support Vector Machines	1.0	1.0	1.0	{'svm__C': 0.1, 'svm__degree': 3, 'svm__gamma'...

Conclusion

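The SVM model reaches 100% accuracy, precision and recall on the 2021 test races. Because the features include the finishing position and points, from which the label appears to be derived, this perfect score should be read with caution.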
The hyperparameters selected by the grid search are:
- C = 0.1
- kernel = linear
- degree = 3 (ignored by the linear kernel)
- gamma = 0.01 (ignored by the linear kernel)