Formula One Prediction

Support Vector Machine Classifier for Record Data

Import Libraries

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

from sklearn.compose import ColumnTransformer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import (
    ConfusionMatrixDisplay, accuracy_score, classification_report,
    confusion_matrix, f1_score, log_loss, precision_score, recall_score,
    roc_auc_score, roc_curve,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVC

import warnings 
warnings.filterwarnings("ignore")

Import Data

  • The cleaned record data is used to train an SVM classifier that predicts the label variable (Podium, Top 10 or Outside Top 10).
  • The cleaned file has 26,941 rows and 22 columns: 21 feature variables and 1 label column.
  • It is a historical record of the races held over the past 71 years, with the result of every driver in every race.
  • The feature variables include laps in the race, grid position held, age at the time of the race, history of past wins, history of laps completed, weather during the race, points gained in the race and more.
Code
df = pd.read_csv('../../data/02-model-data/data_cleaned.csv')
df.head()
season round season_round driverId raceId circuitId position points grid laps ... weather stop age_on_race cumulative_points cumulative_laps pole_driverId pole_history win_driverId win_history label
0 1950 1 1950_1 642 833 9 1 9.0 1 70 ... Fine Not Available 44 9.0 70 642 1 642 1 Podium
1 1950 1 1950_1 786 833 9 2 6.0 2 70 ... Fine Not Available 52 6.0 70 642 0 642 0 Podium
2 1950 1 1950_1 686 833 9 3 4.0 4 70 ... Fine Not Available 39 4.0 70 642 0 642 0 Podium
3 1950 1 1950_1 704 833 9 4 3.0 6 68 ... Fine Not Available 46 3.0 68 642 0 642 0 Top_10
4 1950 1 1950_1 627 833 9 5 2.0 9 68 ... Fine Not Available 45 2.0 68 642 0 642 0 Top_10

5 rows × 22 columns

Code
driver_df = pd.read_csv('../../data/00-raw-data/drivers.csv')
driver_df.head()
driverId driverRef number code forename surname dob nationality url
0 1 hamilton 44 HAM Lewis Hamilton 1985-01-07 British http://en.wikipedia.org/wiki/Lewis_Hamilton
1 2 heidfeld \N HEI Nick Heidfeld 1977-05-10 German http://en.wikipedia.org/wiki/Nick_Heidfeld
2 3 rosberg 6 ROS Nico Rosberg 1985-06-27 German http://en.wikipedia.org/wiki/Nico_Rosberg
3 4 alonso 14 ALO Fernando Alonso 1981-07-29 Spanish http://en.wikipedia.org/wiki/Fernando_Alonso
4 5 kovalainen \N KOV Heikki Kovalainen 1981-10-19 Finnish http://en.wikipedia.org/wiki/Heikki_Kovalainen
Code
df = pd.merge(df, driver_df[['driverId', 'driverRef']], on='driverId')
Code
df.shape
(26941, 23)

Data Pre-Processing and Visualization

  • The data was cleaned in earlier sections, but some pre-processing remains before it is “model-ready”.
  • Unnecessary columns are dropped and the remaining columns are split into numeric and categorical groups.
  • The numeric columns are scaled with a StandardScaler and the categorical columns are one-hot encoded to minimize loss of information. Both steps are wrapped in a function that uses sklearn’s Pipeline module.
  • If the transformers are fitted separately from the model, they are fitted on the whole training set, so statistics from each validation fold of GridSearchCV (for example the scaler’s mean and variance) leak into the data the model is trained on.
  • A Pipeline avoids this leakage: within each cross-validation split, the transformers and the model are fitted on the same training samples only, and the held-out fold is transformed with those fitted statistics.
  • Without a Pipeline, the cross-validation scores are optimistically biased, because the validation folds have already influenced the pre-processing.
  • The data is not split with sklearn’s train_test_split. Instead the split is by time: the training set contains the races before 2021 and the test set contains the 2021 races.
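
To make the leakage point concrete, here is a minimal sketch in plain Python of what happens to a single validation fold. The numbers and the helper names (fit_scaler, transform) are invented for illustration and are not part of the project code; the standardization mirrors what a standard scaler computes (mean and population standard deviation).

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation of a column."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Standardize values with previously learned statistics."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Toy column: four training rows and one held-out validation row.
train_fold = [1.0, 2.0, 3.0, 4.0]
valid_fold = [10.0]

# Leaky: the statistics are learned from training AND validation rows.
leaky_params = fit_scaler(train_fold + valid_fold)

# Leak-free: the statistics come from the training fold only, which is
# what a Pipeline inside cross-validation does for every split.
clean_params = fit_scaler(train_fold)

leaky_valid = transform(valid_fold, leaky_params)
clean_valid = transform(valid_fold, clean_params)

print("mean learned with leakage:   ", leaky_params[0])
print("mean learned without leakage:", clean_params[0])
print("validation value, leaky scaling:", round(leaky_valid[0], 3))
print("validation value, clean scaling:", round(clean_valid[0], 3))
```

With leakage, the unusual validation value pulls the learned mean towards itself and so looks less extreme than it really is relative to the training data.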
Code
df.drop(['season_round', 'constructorRef', 'raceId', 'driverId'], axis=1, inplace=True)
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26941 entries, 0 to 26940
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   season             26941 non-null  int64  
 1   round              26941 non-null  int64  
 2   circuitId          26941 non-null  int64  
 3   position           26941 non-null  int64  
 4   points             26941 non-null  float64
 5   grid               26941 non-null  int64  
 6   laps               26941 non-null  int64  
 7   status             26941 non-null  object 
 8   weather            26941 non-null  object 
 9   stop               26941 non-null  object 
 10  age_on_race        26941 non-null  int64  
 11  cumulative_points  26941 non-null  float64
 12  cumulative_laps    26941 non-null  int64  
 13  pole_driverId      26941 non-null  int64  
 14  pole_history       26941 non-null  int64  
 15  win_driverId       26941 non-null  int64  
 16  win_history        26941 non-null  int64  
 17  label              26941 non-null  object 
 18  driverRef          26941 non-null  object 
dtypes: float64(2), int64(12), object(5)
memory usage: 4.1+ MB
Code
df.head()
season round circuitId position points grid laps status weather stop age_on_race cumulative_points cumulative_laps pole_driverId pole_history win_driverId win_history label driverRef
0 1950 1 9 1 9.0 1 70 Finished Fine Not Available 44 9.0 70 642 1 642 1 Podium farina
1 1950 2 6 11 0.0 2 0 Accident Not Available Not Available 44 9.0 70 579 1 579 1 Outside_Top_10 farina
2 1950 4 66 1 9.0 2 42 Finished Sunny Not Available 44 18.0 112 579 1 642 2 Podium farina
3 1950 5 13 4 4.0 1 35 Finished Sunny Not Available 44 22.0 147 642 2 579 2 Top_10 farina
4 1950 6 55 7 0.0 2 55 Mechanical_Issue Sunny Not Available 44 22.0 202 579 2 579 2 Top_10 farina
Code
df = df[df['season'] != 2022]

Splitting data into train and test:

Code
X_train = df[df['season'] != 2021].drop(columns = ['label'])
y_train = df.loc[df['season'] != 2021, ['season', 'round', 'driverRef', 'label']]
X_test = df[df['season'] == 2021].drop(columns = ['label'])
y_test = df.loc[(df['season'] == 2021), ['season', 'round', 'driverRef', 'label']]

Grouping the data by setting the index of train and test data into season, round and driver references:

Code
X_train = X_train.set_index(['season', 'round', 'driverRef'])
y_train = y_train.set_index(['season', 'round', 'driverRef'])
X_test = X_test.set_index(['season', 'round', 'driverRef'])
y_test = y_test.set_index(['season', 'round', 'driverRef'])
Code
numeric_features = ['circuitId', 'position', 'points', 'grid', 'laps', 'age_on_race', 'cumulative_points', 'cumulative_laps',
       'pole_driverId', 'pole_history', 'win_driverId', 'win_history']

categorical_features = ['status', 'weather', 'stop']
Code
display(X_test.head())
display(y_test.head())
circuitId position points grid laps status weather stop age_on_race cumulative_points cumulative_laps pole_driverId pole_history win_driverId win_history
season round driverRef
2021 1 raikkonen 3 11 0.0 14 56 Finished Sunny Two 42 1863.0 17613 830 18 1 21
2 raikkonen 21 13 0.0 16 63 Finished Rainy Three 42 1863.0 17676 1 18 830 21
3 raikkonen 75 20 0.0 15 1 Mechanical_Issue Cloudy Not Available 42 1863.0 17677 822 18 1 21
4 raikkonen 4 12 0.0 17 65 Lapped Cloudy One 42 1863.0 17742 1 18 1 21
5 raikkonen 6 11 0.0 14 77 Lapped Sunny One 42 1863.0 17819 844 18 830 21
label
season round driverRef
2021 1 raikkonen Outside_Top_10
2 raikkonen Outside_Top_10
3 raikkonen Outside_Top_10
4 raikkonen Outside_Top_10
5 raikkonen Outside_Top_10

Creating a function with sklearn’s Pipeline module and transformers to convert categorical and numerical features:

Code
def prediction_model(model_type, model_id):
    """Wrap an estimator in a pre-processing pipeline.

    model_type is the estimator instance and model_id is the step name,
    which also becomes the prefix of its grid-search parameters (e.g. 'svm__C').
    """
    # Scale numeric features with StandardScaler
    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
    # One-hot encode categorical features; categories unseen in training are ignored
    categorical_transformer = Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown = 'ignore'))])
    preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                                   ('cat', categorical_transformer, categorical_features)])
    pipeline = Pipeline(steps=[('prep', preprocessor), 
                               (model_id, model_type)])
    return pipeline
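
The following sketch shows, on a small invented DataFrame, what a ColumnTransformer of this shape produces and how a pipeline step name turns into a parameter prefix. The toy columns and values are assumptions chosen for illustration; only the sklearn classes are the ones used above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy data (hypothetical values): two numeric columns, one categorical column.
toy = pd.DataFrame({
    "grid": [1, 5, 12, 18],
    "laps": [70, 70, 68, 30],
    "weather": ["Sunny", "Rainy", "Sunny", "Cloudy"],
})

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["grid", "laps"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["weather"]),
])
pipeline = Pipeline(steps=[("prep", preprocessor), ("svm", SVC())])

# Two scaled numeric columns plus one indicator column per weather category.
encoded = preprocessor.fit_transform(toy)
print(encoded.shape)

# The step name "svm" prefixes the estimator's parameters for grid search.
print("svm__C" in pipeline.get_params())
print("svm__kernel" in pipeline.get_params())
```

This is why the parameter grid further down uses keys such as svm__C: the part before the double underscore must match the step name passed as model_id.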

SVM

  • Support Vector Machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection, all of which are common tasks in machine learning.
  • Specific variants exist for particular problems, such as support vector regression (SVR), which is an extension of support vector classification (SVC).
  • SVMs differ from other classification algorithms in how they choose the decision boundary: they pick the one that maximizes the distance to the nearest data points of each class. This boundary is called the maximum-margin hyperplane, and the resulting model the maximum-margin classifier.
  • A simple linear SVM classifier separates two classes with a straight line: points on one side of the line are assigned to one category and points on the other side to the other category. When the classes are separable, infinitely many such lines exist.
  • What distinguishes the linear SVM from algorithms such as k-nearest neighbors is the rule it uses to pick one of those lines: it chooses the line that separates the data and lies as far as possible from the closest data points.
  • Pros
    • Effective on datasets with many features, such as financial or medical data.
    • The decision function uses only a subset of the training points, called support vectors, which makes it memory efficient.
    • Different kernel functions can be specified for the decision function. Common kernels are provided, and custom kernels can also be specified.
  • Cons
    • If the number of features is much larger than the number of data points, the kernel function and regularization term must be chosen carefully to avoid over-fitting.
    • SVMs don’t directly provide probability estimates. In sklearn these are calculated using an expensive five-fold cross-validation.
    • Works best on small sample sets because training time grows quickly with the number of samples.
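
As a small illustration of the maximum-margin idea, the sketch below fits a linear SVC on four hand-picked points and recovers the margin width as 2 / ||w||. The points and the very large C (used to approximate a hard margin) are assumptions made for this example only.

```python
import numpy as np
from sklearn.svm import SVC

# Two classes separated along the first axis: class 0 at x=0, class 1 at x=2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the separating line
b = clf.intercept_[0]     # offset of the separating line
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:")
print(clf.support_vectors_)
```

For these points the widest possible gap is the vertical strip between x = 0 and x = 2, so the fitted margin width should come out close to 2.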

Model Prediction function

  • After fitting the model, it is important to display and inspect its classification results.
  • The model_results function predicts the labels of the test data (2021 races) and displays the first 40 rows of the result along with the predicted class probabilities.
  • The function also fills the prediction scorecard dictionary which contains:
    1. Model
    2. Accuracy
    3. Precision
    4. Recall
    5. Best parameters
Code
prediction_scorecard = {'model':[],
                        'accuracy_score':[],
                        'precision_score':[],
                        'recall_score':[],
                        'best_params':[]}
Code
def model_results(X_test, model, model_id):
    # Predict the class and the class probabilities for every test row
    pred = model.predict(X_test)
    pred_proba = model.predict_proba(X_test)
    # Probability columns follow model.classes_ (sorted alphabetically):
    # prob_0 = Outside_Top_10, prob_1 = Podium, prob_2 = Top_10
    df_pred = pd.DataFrame(np.around(pred_proba, 4), index=X_test.index, columns=['prob_0', 'prob_1', 'prob_2'])
    df_pred['prediction'] = list(pred)
    df_pred['actual'] = y_test['label']  # uses the global y_test, aligned on the shared index
    df_pred['grid_position'] = X_test['grid']

    # Within each round, sort drivers by predicted podium probability (highest first)
    df_pred = df_pred.groupby(['round']).apply(pd.DataFrame.sort_values, 'prob_1', ascending=False)

    # Save accuracy, precision, recall and the best hyperparameters.
    # With average='micro' on single-label data, precision and recall both equal accuracy.
    prediction_scorecard['model'].append(model_id)
    prediction_scorecard['accuracy_score'].append(accuracy_score(df_pred['actual'], df_pred['prediction']))
    prediction_scorecard['precision_score'].append(precision_score(df_pred['actual'], df_pred['prediction'], average='micro'))
    prediction_scorecard['recall_score'].append(recall_score(df_pred['actual'], df_pred['prediction'], average='micro'))
    prediction_scorecard['best_params'].append(str(model.best_params_))
    display(df_pred.head(40))
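
One detail of the scorecard worth seeing in numbers: with micro averaging on a single-label multiclass problem, precision and recall always coincide with accuracy, so the three columns carry the same information. The sketch below computes them by hand in plain Python; the two label lists are invented for illustration.

```python
# Hypothetical labels for six race entries (not taken from the dataset).
actual = ["Podium", "Podium", "Top_10", "Top_10", "Outside_Top_10", "Outside_Top_10"]
predicted = ["Podium", "Top_10", "Top_10", "Top_10", "Outside_Top_10", "Podium"]

classes = sorted(set(actual) | set(predicted))

# Pool true positives, false positives and false negatives over all classes.
tp = fp = fn = 0
for cls in classes:
    for a, p in zip(actual, predicted):
        if p == cls and a == cls:
            tp += 1
        elif p == cls and a != cls:
            fp += 1
        elif p != cls and a == cls:
            fn += 1

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(micro_precision, micro_recall, accuracy)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the actual class), which is why the pooled ratios are identical. Macro or per-class averages would be needed to see differences between classes.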

Grid search CV

  • Hyper-parameters are values that the user specifies when building a machine-learning model. They control the learning process, whereas parameters are learned from the data.
  • The performance of a model depends significantly on the values of its hyper-parameters. There is no way to know the best values in advance, so candidate values have to be tried and compared.
  • Doing this manually could take a considerable amount of time and resources, so GridSearchCV is used to automate the tuning.
  • GridSearchCV is a class in sklearn’s model_selection module for hyper-parameter tuning.
  • It runs through all the values that are fed into the parameter grid and returns the best combination, based on a scoring metric of your choice (accuracy, f1, etc.).
  • Every combination of the values passed in the dictionary is evaluated using cross-validation.
  • The process is time consuming, because the number of model fits is the number of combinations multiplied by the number of cross-validation folds.
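
To give a feel for the cost, the sketch below enumerates a parameter grid by hand with itertools.product and counts the resulting model fits. It is an illustration of the bookkeeping, not of GridSearchCV's internals; the fold count of 5 is an assumption that matches sklearn's default when no cv argument is given.

```python
from itertools import product

# Same shape as the grid used for the SVM on this page.
svm_params = {
    "svm__C": [0.1],
    "svm__kernel": ["linear", "poly"],
    "svm__degree": [2, 3],
    "svm__gamma": [0.01],
}

# Cartesian product of all candidate values, one dict per combination.
names = sorted(svm_params)
combinations = [
    dict(zip(names, values))
    for values in product(*(svm_params[name] for name in names))
]

n_folds = 5  # assumption: default number of cross-validation folds
# One fit per combination per fold, plus one final refit on all training data.
total_fits = len(combinations) * n_folds + 1

for combo in combinations:
    print(combo)
print("combinations:", len(combinations))
print("total fits:", total_fits)
```

Even this small grid needs 21 fits, and each fit of an SVC with probability estimates enabled is itself expensive on roughly 27,000 rows.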
Code
svm_params= {'svm__C': [0.1],
             'svm__kernel': ['linear', 'poly'],
             'svm__degree': [2, 3],
             'svm__gamma': [0.01]}

Types of SVM Kernels:

  • Linear: Commonly recommended for text classification, because many problems of that type are linearly separable.
  • Polynomial: Used less often in practice, because it tends to be less computationally efficient than other kernels and its predictions are often less accurate.
  • Gaussian Radial Basis Function (RBF): One of the most powerful and commonly used kernels in SVMs, and usually the choice for non-linear data. It is not included in the parameter grid above.
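
For reference, the three kernels can be written out directly. The sketch below uses plain Python and follows the usual definitions (polynomial: (gamma * <x, z> + coef0) ** degree, RBF: exp(-gamma * ||x - z||^2)); the function names and the two sample vectors are illustrative assumptions.

```python
import math


def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=0.0):
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x = [1.0, 2.0]
z = [3.0, 4.0]

print("linear:    ", linear_kernel(x, z))
print("polynomial:", polynomial_kernel(x, z, degree=2, gamma=0.01))
print("rbf:       ", rbf_kernel(x, z, gamma=0.01))
```

Note that degree only appears in the polynomial kernel and gamma only in the polynomial and RBF kernels, so neither has any influence when the linear kernel is selected.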

Code
scoring = ['neg_log_loss', 'accuracy']

svm_cv = GridSearchCV(prediction_model(SVC(probability=True), 'svm'),
                      param_grid=svm_params,
                      scoring=scoring, 
                      refit='neg_log_loss',  
                      verbose=0)

Fitting and Training the SVM model

Code
# Train the model; pass the label column as a 1-D Series rather than a one-column DataFrame
svm_cv.fit(X_train, y_train['label'])
GridSearchCV(estimator=Pipeline(steps=[('prep',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['circuitId',
                                                                          'position',
                                                                          'points',
                                                                          'grid',
                                                                          'laps',
                                                                          'age_on_race',
                                                                          'cumulative_points',
                                                                          'cumulative_laps',
                                                                          'pole_driverId',
                                                                          'pole_history',
                                                                          'win_driverId',
                                                                          'win_history']),
                                                                        ('cat',
                                                                         Pipeline(steps=[('ohe',
                                                                                          OneHotEncoder(handle_unknown='ignore'))]),
                                                                         ['status',
                                                                          'weather',
                                                                          'stop'])])),
                                       ('svm', SVC(probability=True))]),
             param_grid={'svm__C': [0.1], 'svm__degree': [2, 3],
                         'svm__gamma': [0.01],
                         'svm__kernel': ['linear', 'poly']},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'])

Testing the SVM model

Code
# Test Model
model_results(X_test, svm_cv, 'Support Vector Machines')
svm_results = pd.DataFrame(prediction_scorecard)
prob_0 prob_1 prob_2 prediction actual grid_position
round season round driverRef
1 2021 1 hamilton 0.0000 1.0000 0.0000 Podium Podium 2
max_verstappen 0.0000 1.0000 0.0000 Podium Podium 1
bottas 0.0000 0.9989 0.0011 Podium Podium 3
norris 0.0001 0.0106 0.9893 Top_10 Top_10 7
raikkonen 0.9990 0.0009 0.0001 Outside_Top_10 Outside_Top_10 14
giovinazzi 0.9998 0.0002 0.0000 Outside_Top_10 Outside_Top_10 12
ocon 0.9999 0.0001 0.0000 Outside_Top_10 Outside_Top_10 16
ricciardo 0.0000 0.0000 1.0000 Top_10 Top_10 6
sainz 0.0000 0.0000 1.0000 Top_10 Top_10 8
perez 0.0000 0.0000 1.0000 Top_10 Top_10 0
alonso 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 9
stroll 0.0000 0.0000 1.0000 Top_10 Top_10 10
gasly 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 5
leclerc 0.0000 0.0000 1.0000 Top_10 Top_10 4
vettel 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 20
russell 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 15
latifi 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 17
tsunoda 0.0000 0.0000 1.0000 Top_10 Top_10 13
mick_schumacher 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 18
mazepin 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 19
2 2021 2 hamilton 0.0000 1.0000 0.0000 Podium Podium 1
max_verstappen 0.0000 1.0000 0.0000 Podium Podium 3
norris 0.0000 0.9954 0.0046 Podium Podium 7
leclerc 0.0001 0.0128 0.9871 Top_10 Top_10 4
perez 0.9987 0.0012 0.0001 Outside_Top_10 Outside_Top_10 2
tsunoda 0.9997 0.0003 0.0000 Outside_Top_10 Outside_Top_10 20
raikkonen 0.9999 0.0001 0.0000 Outside_Top_10 Outside_Top_10 16
gasly 0.0000 0.0000 1.0000 Top_10 Top_10 5
mick_schumacher 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 18
latifi 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 14
russell 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 12
giovinazzi 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 17
stroll 0.0000 0.0000 1.0000 Top_10 Top_10 10
alonso 0.0000 0.0000 1.0000 Top_10 Top_10 15
ocon 0.0000 0.0000 1.0000 Top_10 Top_10 9
sainz 0.0000 0.0000 1.0000 Top_10 Top_10 11
bottas 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 8
ricciardo 0.0000 0.0000 1.0000 Top_10 Top_10 6
vettel 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 0
mazepin 1.0000 0.0000 0.0000 Outside_Top_10 Outside_Top_10 19

The best parameters for our model are:

Code
svm_cv.best_params_
{'svm__C': 0.1, 'svm__degree': 3, 'svm__gamma': 0.01, 'svm__kernel': 'linear'}
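
The reported degree and gamma deserve a caveat: a linear kernel does not use either of them. The sketch below checks this on synthetic data by fitting two linear SVCs that differ only in degree and gamma and comparing their decision values. The data and the specific parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data (not the Formula One dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Same kernel and C, different degree and gamma.
model_a = SVC(kernel="linear", C=0.1, degree=2, gamma=0.01).fit(X, y)
model_b = SVC(kernel="linear", C=0.1, degree=3, gamma=10.0).fit(X, y)

# If the linear kernel ignores degree and gamma, the two models coincide.
same = np.allclose(model_a.decision_function(X), model_b.decision_function(X))
print("identical decision values:", same)
```

So the two linear candidates in the grid describe the same model, and the choice of degree = 3 over degree = 2 in the best parameters carries no meaning for the linear kernel.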
Code
svm_results
model accuracy_score precision_score recall_score best_params
0 Support Vector Machines 1.0 1.0 1.0 {'svm__C': 0.1, 'svm__degree': 3, 'svm__gamma'...

Conclusion

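  • The SVM model reports 100% accuracy, precision and recall on the 2021 races.
  • This score should be read with caution: the feature set includes position and points, which are outcomes of the race being predicted, and the label appears to be derived from the finishing position. A perfect score is therefore more likely a sign of target leakage than of real predictive power.
  • The hyperparameters selected by the grid search are:
    • C = 0.1
    • degree = 3
    • gamma = 0.01
    • kernel = linear
  • Because the selected kernel is linear, degree and gamma have no effect on the fitted model.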
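
One way to sanity-check a perfect score is to ask whether a single feature already determines the label. The sketch below applies a hand-written rule to (position, label) pairs copied from the sample rows displayed earlier on this page. The thresholds in label_from_position are an assumption inferred from those rows, not a rule confirmed by the data-cleaning code.

```python
def label_from_position(position):
    """Hypothetical labelling rule, assumed from the sample rows on this page."""
    if position <= 3:
        return "Podium"
    if position <= 10:
        return "Top_10"
    return "Outside_Top_10"


# (position, label) pairs taken from the df.head() and X_test/y_test outputs above.
sample_rows = [
    (1, "Podium"), (2, "Podium"), (3, "Podium"), (4, "Top_10"), (5, "Top_10"),
    (11, "Outside_Top_10"), (7, "Top_10"), (13, "Outside_Top_10"),
    (20, "Outside_Top_10"), (12, "Outside_Top_10"),
]

predicted = [label_from_position(position) for position, _ in sample_rows]
actual = [label for _, label in sample_rows]
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

print("accuracy of the position-only rule:", accuracy)
```

If a rule of this kind also reproduces the label on the full dataset, then any model given position as a feature can score 100% without learning anything about race performance. Dropping position and points (and any other post-race columns) from the features would give a more honest estimate.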
Source Code
---
title: <b>Support Vector Machine Classifier for Record Data</b>
format:
  html:
    theme: lumen
    toc: true
    self-contained: true
    embed-resources: true
    page-layout: full
    code-fold: true
    code-tools: true
jupyter: python3
---

# Import Libraries

```{python}
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

from sklearn.compose import ColumnTransformer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import (
    ConfusionMatrixDisplay, accuracy_score, classification_report,
    confusion_matrix, f1_score, log_loss, precision_score, recall_score,
    roc_auc_score, roc_curve,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVC

import warnings 
warnings.filterwarnings("ignore")
```

# Import Data
- The cleaned record data is used to train an SVM classifier that predicts the label variable (Podium, Top 10 or Outside Top 10).
- The cleaned file has 26,941 rows and 22 columns: 21 feature variables and 1 label column.
- It is a historical record of the races held over the past 71 years, with the result of every driver in every race.
- The feature variables include laps in the race, grid position held, age at the time of the race, history of past wins, history of laps completed, weather during the race, points gained in the race and more.

```{python}
df = pd.read_csv('../../data/02-model-data/data_cleaned.csv')
df.head()
```

```{python}
driver_df = pd.read_csv('../../data/00-raw-data/drivers.csv')
driver_df.head()
```

```{python}
df = pd.merge(df, driver_df[['driverId', 'driverRef']], on='driverId')
```

```{python}
df.shape
```

# Data Pre-Processing and Visualization
- The data was cleaned in earlier sections, but some pre-processing remains before it is "model-ready".
- Unnecessary columns are dropped and the remaining columns are split into numeric and categorical groups.
- The numeric columns are scaled with a StandardScaler and the categorical columns are one-hot encoded to minimize loss of information. Both steps are wrapped in a function that uses sklearn's Pipeline module.
- If the transformers are fitted separately from the model, they are fitted on the whole training set, so statistics from each validation fold of GridSearchCV (for example the scaler's mean and variance) leak into the data the model is trained on.
- A Pipeline avoids this leakage: within each cross-validation split, the transformers and the model are fitted on the same training samples only, and the held-out fold is transformed with those fitted statistics.
- Without a Pipeline, the cross-validation scores are optimistically biased, because the validation folds have already influenced the pre-processing.
- The data is not split with sklearn's train_test_split. Instead the split is by time: the training set contains the races before 2021 and the test set contains the 2021 races.

```{python}
df.drop(['season_round', 'constructorRef', 'raceId', 'driverId'], axis=1, inplace=True)
```

```{python}
df.info()
```

```{python}
df.head()
```

```{python}
df = df[df['season'] != 2022]
```

Splitting data into train and test:

```{python}
X_train = df[df['season'] != 2021].drop(columns = ['label'])
y_train = df.loc[df['season'] != 2021, ['season', 'round', 'driverRef', 'label']]
X_test = df[df['season'] == 2021].drop(columns = ['label'])
y_test = df.loc[(df['season'] == 2021), ['season', 'round', 'driverRef', 'label']]
```

Grouping the data by setting the index of train and test data into season, round and driver references:

```{python}
X_train = X_train.set_index(['season', 'round', 'driverRef'])
y_train = y_train.set_index(['season', 'round', 'driverRef'])
X_test = X_test.set_index(['season', 'round', 'driverRef'])
y_test = y_test.set_index(['season', 'round', 'driverRef'])
```

```{python}
numeric_features = ['circuitId', 'position', 'points', 'grid', 'laps', 'age_on_race', 'cumulative_points', 'cumulative_laps',
       'pole_driverId', 'pole_history', 'win_driverId', 'win_history']

categorical_features = ['status', 'weather', 'stop']
```

```{python}
display(X_test.head())
display(y_test.head())
```

Creating a function with sklearn's Pipeline module and transformers to convert categorical and numerical features:

```{python}
def prediction_model(model_type, model_id):
    """Wrap an estimator in a pre-processing pipeline.

    model_type is the estimator instance and model_id is the step name,
    which also becomes the prefix of its grid-search parameters (e.g. 'svm__C').
    """
    # Scale numeric features with StandardScaler
    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
    # One-hot encode categorical features; categories unseen in training are ignored
    categorical_transformer = Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown = 'ignore'))])
    preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                                   ('cat', categorical_transformer, categorical_features)])
    pipeline = Pipeline(steps=[('prep', preprocessor), 
                               (model_id, model_type)])
    return pipeline
```

# SVM
- `Support Vector Machines` (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection.
- There are specific types of SVMs for particular machine learning problems, such as support vector regression (SVR), which is an extension of support vector classification (SVC).
- SVMs differ from other classification algorithms in how they choose the decision boundary: they pick the boundary that maximizes the distance to the nearest data points of all the classes. This boundary is called the maximum margin classifier or the maximum margin hyperplane.
- A simple linear SVM classifier works by drawing a straight line between two classes. All of the data points on one side of the line are assigned to one category and the data points on the other side are assigned to the other category. When the classes are separable, there are infinitely many such lines to choose from.
- What sets the linear SVM apart from some other algorithms, like k-nearest neighbors, is that it chooses the best of these lines: the one that separates the data and is as far as possible from the closest data points.
- Pros
    - Effective on datasets with multiple features, like financial or medical data.
    - Uses a subset of training points in the decision function called support vectors which makes it memory efficient.
    - Different kernel functions can be specified for the decision function. You can use common kernels, but it's also possible to specify custom kernels.
- Cons
    - If the number of features is much greater than the number of data points, avoiding over-fitting when choosing the kernel function and regularization term is crucial.
    - SVMs don't directly provide probability estimates. Those are calculated using an expensive five-fold cross-validation (enabled in sklearn's `SVC` with `probability=True`).
    - Training time grows quickly with the number of samples, so SVMs work best on small sample sets.
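
The maximum-margin idea can be sketched on a handful of made-up 2-D points. This is an illustration only, not part of the race-data analysis: the points, the variable names, and the very large `C` (used here to approximate a hard margin) are all assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy clusters (made-up points, not F1 data)
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin linear SVM
linear_svm = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

# For a linear kernel the boundary is w . x + b = 0
w = linear_svm.coef_[0]
b = linear_svm.intercept_[0]

# Distance between the two margin lines w . x + b = -1 and w . x + b = +1
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", linear_svm.support_vectors_)
print("margin width:", round(margin_width, 3))
```

Only the points closest to the boundary end up in `support_vectors_`; the remaining points could be removed without changing the fitted line, which is why the decision function is memory efficient.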

## Model Prediction function
- After fitting the model, it is important to showcase and visualize the classification results.
- The `model_results` function predicts on the test data (2021 races). It displays the first 40 rows of the predictions along with the prediction probabilities.
- The function also fills the prediction scorecard dictionary, which contains:
    1. Model
    2. Accuracy
    3. Precision
    4. Recall
    5. Best parameters

```{python}
prediction_scorecard = {'model':[],
                        'accuracy_score':[],
                        'precision_score':[],
                        'recall_score':[],
                        'best_params':[]}
```

```{python}
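def model_results(X_test, model, model_id):
    """Predict on X_test, record metrics in prediction_scorecard, and display 40 rows.

    Relies on the global y_test and assumes the fitted model exposes exactly
    three classes (the columns prob_0, prob_1, prob_2).
    """
    # Predict class labels and class probabilities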
    pred = model.predict(X_test)
    pred_proba = model.predict_proba(X_test)
    df_pred = pd.DataFrame(np.around(pred_proba, 4), index=X_test.index, columns=['prob_0', 'prob_1', 'prob_2'])
    df_pred['prediction'] = list(pred)
    df_pred['actual'] = y_test['label']
    df_pred['grid_position'] = X_test['grid']

    # Include row if an 'actual' or 'predicted' podium occurred for calculating accuracy
    # df_pred['sort'] = df_pred['prediction'] + df_pred['actual']
    # df_pred = df_pred[df_pred['sort'] > 0]
    # df_pred.reset_index(inplace=True)
    df_pred = df_pred.groupby(['round']).apply(pd.DataFrame.sort_values, 'prob_1', ascending=False)
    # df_pred.drop(['sort'], axis=1, inplace=True)
    # df_pred.reset_index(drop=True, inplace=True) 
    
    # Save accuracy, precision, recall, and best parameters to the scorecard
    prediction_scorecard['model'].append(model_id)
    prediction_scorecard['accuracy_score'].append(accuracy_score(df_pred['actual'], df_pred['prediction']))
    prediction_scorecard['precision_score'].append(precision_score(df_pred['actual'], df_pred['prediction'], average='micro'))
    prediction_scorecard['recall_score'].append(recall_score(df_pred['actual'], df_pred['prediction'], average='micro'))
    prediction_scorecard['best_params'].append(str(model.best_params_))
    display(df_pred.head(40))
```
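
One property of the metrics stored in the scorecard is worth noting: for single-label multiclass predictions, micro-averaged precision and recall both equal accuracy, so the three scorecard values will always match. The sketch below shows this on made-up labels (the `actual` and `predicted` lists are hypothetical), with macro averaging included for comparison.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for three classes (0, 1, 2); not taken from the race data
actual    = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
predicted = [0, 1, 1, 1, 2, 2, 2, 2, 2, 0]

acc = accuracy_score(actual, predicted)

# Micro averaging pools every prediction before computing the ratio
micro_p = precision_score(actual, predicted, average="micro")
micro_r = recall_score(actual, predicted, average="micro")

# Macro averaging computes the metric per class, then takes the unweighted mean
macro_p = precision_score(actual, predicted, average="macro")
macro_r = recall_score(actual, predicted, average="macro")

print("accuracy:", acc)
print("micro precision / recall:", micro_p, micro_r)
print("macro precision / recall:", round(macro_p, 4), round(macro_r, 4))
```

If per-class behaviour matters (for example, how well podium finishes are identified), `average="macro"` or `average=None` would be more informative than the micro average.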

## Grid search CV
- Hyperparameters are variables that the user specifies while building a machine-learning model. Hyperparameters control the learning process, while parameters are learned from the data.
- The performance of a model depends significantly on the values of its hyperparameters. There is no way to know the best values in advance, so we need to try a range of candidate values to find the ones that work best.
- Doing this manually could take a considerable amount of time and resources, so we use `GridSearchCV` to automate the tuning of hyperparameters.
- `GridSearchCV` is sklearn's class for exhaustive hyperparameter tuning.
- It runs through all the parameter values that are fed into the parameter grid and returns the best combination, based on a scoring metric of your choice (accuracy, f1, etc.).
- `GridSearchCV` tries every combination of the values passed in the dictionary and evaluates the model for each combination using cross-validation.
- The process is time consuming because every combination is fitted once per cross-validation fold.

```{python}
svm_params= {'svm__C': [0.1],
             'svm__kernel': ['linear', 'poly'],
             'svm__degree': [2, 3],
             'svm__gamma': [0.01]}
```
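
As a rough sketch of what the search will do with this grid, `ParameterGrid` can list the candidate combinations. The copy `svm_params_demo` and the fold count are assumptions stated for illustration: `GridSearchCV` uses 5-fold cross-validation when `cv` is not passed.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid above so this sketch is self-contained
svm_params_demo = {'svm__C': [0.1],
                   'svm__kernel': ['linear', 'poly'],
                   'svm__degree': [2, 3],
                   'svm__gamma': [0.01]}

# Every combination of the listed values is one candidate
candidates = list(ParameterGrid(svm_params_demo))
for candidate in candidates:
    print(candidate)

n_folds = 5  # assumed: the GridSearchCV default when cv is not given
# One fit per candidate per fold, plus one final refit on all training data
total_fits = len(candidates) * n_folds + 1
print("candidates:", len(candidates), "total fits:", total_fits)
```

Since `degree` only applies to the polynomial kernel, the two `linear` candidates in this grid describe the same model.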

Types of SVM kernels:

- `Linear`: Commonly recommended for text classification because many of these classification problems are linearly separable.
- `Polynomial`: Used less often in practice because it tends to be less computationally efficient than other kernels and its predictions are often less accurate.
- `Gaussian Radial Basis Function (RBF)`: One of the most powerful and commonly used kernels in SVMs. Usually the choice for non-linear data.
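
The difference between kernels is easiest to see on data that a straight line cannot separate. The sketch below is an assumption-laden illustration, not part of the race-data analysis: it uses sklearn's synthetic `make_circles` dataset, and the sample size, noise level, and hyperparameter values are chosen arbitrarily.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other
X_c, y_c = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_c, y_c, test_size=0.25, random_state=0)

# Fit one SVC per kernel and record its test accuracy
kernel_scores = {}
for kernel in ["linear", "poly", "rbf"]:
    # degree is only used by the 'poly' kernel; gamma is ignored by 'linear'
    clf = SVC(kernel=kernel, degree=2, gamma="scale", C=1.0).fit(X_tr, y_tr)
    kernel_scores[kernel] = clf.score(X_te, y_te)

print(kernel_scores)
```

Which kernel works best depends on the data, which is why the kernel is usually included in the parameter grid rather than fixed in advance.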

```{python}
scoring = ['neg_log_loss', 'accuracy']

svm_cv = GridSearchCV(prediction_model(SVC(probability=True), 'svm'),
                      param_grid=svm_params,
                      scoring=scoring, 
                      refit='neg_log_loss',  
                      verbose=0)
```
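
When several scoring metrics are passed, `refit` decides which one selects the best candidate, while the others are still recorded in `cv_results_`. The sketch below shows this on sklearn's built-in iris dataset rather than the race data; the names `demo_cv` and `summary` and the small parameter grid are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# probability=True is needed because neg_log_loss scores predicted probabilities
demo_cv = GridSearchCV(SVC(probability=True, random_state=0),
                       param_grid={"C": [0.1, 1.0], "kernel": ["linear", "rbf"]},
                       scoring=["neg_log_loss", "accuracy"],
                       refit="neg_log_loss")
demo_cv.fit(X_iris, y_iris)

# One row per candidate, with the mean cross-validated score for each metric
summary = pd.DataFrame(demo_cv.cv_results_)[
    ["params", "mean_test_neg_log_loss", "rank_test_neg_log_loss", "mean_test_accuracy"]
]
print(summary)

# best_params_ is the candidate ranked first on the refit metric
print(demo_cv.best_params_)
```

Comparing the rank on log loss with the accuracy column is a quick way to check whether the two metrics agree on the best candidate.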

## Fitting and Training the SVM model

```{python}
# Train Model
svm_cv.fit(X_train, y_train)
```

## Testing the SVM model

```{python}
# Test Model
model_results(X_test, svm_cv, 'Support Vector Machines')
svm_results = pd.DataFrame(prediction_scorecard)
```

The best parameters for our model are:

```{python}
svm_cv.best_params_
```

```{python}
svm_results
```

## Conclusion
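- The SVM model reports 100% accuracy, precision, and recall on the test data (2021 races). Scores this high are unusual and may reflect features, such as points gained in the race, that encode the finishing position directly.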
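- The best hyperparameters reported by the grid search are:
    - C = 0.1
    - degree = 3
    - gamma = 0.01
    - kernel = linear
- Because the selected kernel is linear, `degree` and `gamma` have no effect on the fitted model; only `C` and `kernel` matter here.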