Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
library(e1071)
library(caTools)
library(yardstick)
For binary classification, the first factor level is assumed to be the event.
Use the argument `event_level = "second"` to alter this as needed.
Attaching package: 'yardstick'
The following objects are masked from 'package:caret':
precision, recall, sensitivity, specificity
The following object is masked from 'package:readr':
spec
library(naivebayes)
naivebayes 0.9.7 loaded
library(ggplot2)
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
library(sjPlot)
Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
library(klaR)
Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
Rows: 26941 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): season_round, status, constructorRef, weather, stop, label
dbl (16): season, round, driverId, raceId, circuitId, position, points, grid...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
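The import and pre-processing chunks are collapsed in the rendered page; they are reproduced here from the document source (the per-column conversions are lightly condensed into a loop):

```r
library(readr)

# Import the cleaned race data; show_col_types = FALSE silences the column-spec message above
df <- read_csv("../../data/02-model-data/data_cleaned.csv", show_col_types = FALSE)

# Encode the character columns as integer codes (as in the source, condensed into a loop)
for (col in c("status", "constructorRef", "weather", "stop", "label")) {
  df[[col]] <- as.integer(factor(df[[col]], levels = unique(df[[col]])))
}

# Drop the first three identifier columns and make the target a factor
df <- df[-c(1:3)]
df$label <- as.factor(df$label)

# 80/20 train/test split
set.seed(1973)
sample <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, prob = c(0.8, 0.2))
train <- df[sample, ]
test <- df[!sample, ]
```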
Bayes’ Theorem: In probability theory and statistics, Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule), named after Thomas Bayes, describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if the risk of developing health problems is known to increase with age, Bayes’ theorem allows the risk to an individual of a known age to be assessed more accurately (by conditioning it on their age) than simply assuming that the individual is typical of the population as a whole.
The Naive Bayes algorithm is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
A fruit might be categorized as an apple, for instance, if it is red, rounded, and around 3 inches in diameter. Even if these characteristics depend on one another or on the presence of other characteristics, each of these traits separately increases the likelihood that this fruit is an apple, which is why it is called “Naive.”
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes can outperform even highly sophisticated classification methods on some problems.
Bayes’ Theorem can be written as the following equation (the factored form for several predictors is shown after the list of terms below): \[ P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)} \]
In the above equation:
P(C|X) is the posterior probability of class (C, target) given predictor (X, attributes).
P(C) is the prior probability of class.
P(X|C) is the likelihood which is the probability of predictor given class.
P(X) is the prior probability of predictor.
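With several predictors, the “naive” independence assumption lets the likelihood factor across features, so for \(X = (x_1, \dots, x_n)\) the classifier scores each class as
\[ P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C) \]
The denominator \(P(X)\) is the same for every class, so the class with the highest score is the prediction.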
How does Bayes’ Theorem work?
Let’s take an example: a pathology lab runs a test for a disease “D” with two results, “Positive” and “Negative.” The lab guarantees the test is 99% accurate: if you have the disease, it returns positive 99% of the time, and if you don’t, it returns negative 99% of the time. If 3% of all people have this disease and your test comes back “positive,” what is the probability that you actually have the disease?
To solve this problem, we use conditional probability.
Probability of having Disease D: P(D) = 0.03 = 3%
Probability that the test is “positive” given the patient has the disease: P(Pos | D) = 0.99 = 99%
Probability of not having Disease D: P(~D) = 0.97 = 97%
Probability that the test is “positive” given the patient does not have the disease: P(Pos | ~D) = 0.01 = 1%
To calculate the probability that the patient actually has the disease, i.e. P(D | Pos), we use Bayes’ theorem:
P(Pos) = P(Pos | D) · P(D) + P(Pos | ~D) · P(~D) = 0.99 × 0.03 + 0.01 × 0.97 = 0.0394
P(D | Pos) = P(Pos | D) · P(D) / P(Pos) = (0.99 × 0.03) / 0.0394 ≈ 0.7538
So there is roughly a 75% chance that the patient is actually suffering from the disease. This is how Bayes’ Theorem works. (Reference: https://dataaspirant.com/naive-bayes-classifier-machine-learning)
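As a quick check, a few lines of R reproduce this arithmetic (an illustration only, not part of the race-data pipeline):

```r
# Posterior probability of disease given a positive test, via Bayes' theorem
p_d <- 0.03        # prior: P(D)
p_pos_d <- 0.99    # sensitivity: P(Pos | D)
p_pos_nd <- 0.01   # false positive rate: P(Pos | ~D)

p_pos <- p_pos_d * p_d + p_pos_nd * (1 - p_d)  # total probability of a positive test: 0.0394
p_pos_d * p_d / p_pos                          # P(D | Pos) ~ 0.7538
```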
Types of Naive Bayes Algorithms:
Gaussian Naïve Bayes Classifier: In Gaussian Naïve Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian (normal) distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values (see the sketch after this list).
Multinomial Naïve Bayes Classifier: Feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification.
Bernoulli Naïve Bayes Classifier: In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence (i.e. a word occurs in a document or not) features are used rather than term frequencies (i.e. frequency of a word in the document).
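The Gaussian variant is the one relevant here, since the model below is fit on numeric features. A minimal sketch on the built-in iris data (an illustration, not part of this analysis) shows how naive_bayes() handles numeric predictors with per-class normal densities:

```r
library(naivebayes)

# With numeric predictors, naive_bayes() estimates a Gaussian (mean, sd) per feature and class
m_gauss <- naive_bayes(Species ~ ., data = iris)
m_gauss                                           # printed tables show the class-conditional means/sds
predict(m_gauss, head(iris[, -5]), type = "prob") # posterior class probabilities for a few rows
```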
Applications of the Naive Bayes Algorithm:
Real-time Prediction: Naive Bayes is an eager learning classifier and it is fast, so it can be used for making predictions in real time.
Multi-class Prediction: The algorithm is also well known for multi-class prediction; we can predict the probability of multiple classes of the target variable.
Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are widely used in text classification (they handle multi-class problems well and benefit from the independence assumption) and often achieve higher success rates there than other algorithms. As a result, Naive Bayes is widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
Recommendation System: A Naive Bayes classifier combined with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.
Advantages:
It is easy and fast to predict the class of a test data set, and it also performs well in multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better than other models such as logistic regression, and it needs less training data.
It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Disadvantages:
If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the “zero frequency” problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation (see the sketch after this list).
Naive Bayes is also known to be a poor probability estimator, so its predicted class probabilities should not be taken too literally.
Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible that we get a set of predictors which are completely independent.
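A minimal sketch of Laplace smoothing with the naivebayes package, using its laplace argument (the value 1 is illustrative and was not used in the original analysis, which kept the default of 0):

```r
library(naivebayes)

# Adding a pseudo-count of 1 to every category count means a level unseen in
# training no longer forces a zero class-conditional probability.
model_smooth <- naive_bayes(label ~ ., data = train, laplace = 1)
```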
Above are the density line plots of each feature variable for all three label values.
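For reference, the chunk (from the document source) that fits the two models and draws these per-feature density plots uses klaR::NaiveBayes for plotting and naivebayes::naive_bayes for the predictions below:

```r
set.seed(1973)
model1 <- NaiveBayes(label ~ ., data = train)   # klaR model, used only for plot()
model <- naive_bayes(label ~ ., data = train)   # naivebayes model, used for prediction
plot(model1)                                    # class-conditional density plot per feature
```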
train_pred = predict(model, train)
train_cm = table(train_pred, train$label)
confusionMatrix(train_cm)
Warning: predict.naive_bayes(): more features in the newdata are provided as
there are probability tables in the object. Calculation is performed based on
features to be found in the tables.
Confusion Matrix and Statistics
train_pred 1 2 3
1 2375 327 0
2 310 3468 256
3 0 2569 12242
Overall Statistics
Accuracy : 0.8393
95% CI : (0.8344, 0.8442)
No Information Rate : 0.58
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6971
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.8845 0.5449 0.9795
Specificity 0.9827 0.9627 0.7161
Pos Pred Value 0.8790 0.8597 0.8265
Neg Pred Value 0.9836 0.8346 0.9620
Prevalence 0.1246 0.2954 0.5800
Detection Rate 0.1102 0.1610 0.5682
Detection Prevalence 0.1254 0.1872 0.6874
Balanced Accuracy 0.9336 0.7538 0.8478
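To make the per-class rows concrete, here is a small sketch (re-entering the train confusion matrix by hand; not from the original code) that reproduces the Class 1 figures, where balanced accuracy is the mean of sensitivity and specificity:

```r
# Train confusion matrix: rows are predictions, columns are the true labels
train_tab <- matrix(c(2375, 310, 0,
                      327, 3468, 2569,
                      0, 256, 12242),
                    nrow = 3,
                    dimnames = list(pred = c("1", "2", "3"), truth = c("1", "2", "3")))

tp <- train_tab["1", "1"]
fn <- sum(train_tab[, "1"]) - tp   # class-1 cases predicted as something else
fp <- sum(train_tab["1", ]) - tp   # other classes predicted as class 1
tn <- sum(train_tab) - tp - fn - fp

sens <- tp / (tp + fn)   # 0.8845
spec <- tn / (tn + fp)   # 0.9827
(sens + spec) / 2        # balanced accuracy ~ 0.9336
```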
test_pred = predict(model, test)
test_cm = table(test_pred, test$label)
confusionMatrix(test_cm)
Warning: predict.naive_bayes(): more features in the newdata are provided as
there are probability tables in the object. Calculation is performed based on
features to be found in the tables.
Confusion Matrix and Statistics
test_pred 1 2 3
1 669 89 1
2 85 864 68
3 0 661 2957
Overall Statistics
Accuracy : 0.8324
95% CI : (0.8222, 0.8423)
No Information Rate : 0.561
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.694
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.8873 0.5353 0.9772
Specificity 0.9806 0.9595 0.7209
Pos Pred Value 0.8814 0.8496 0.8173
Neg Pred Value 0.9817 0.8286 0.9611
Prevalence 0.1398 0.2992 0.5610
Detection Rate 0.1240 0.1602 0.5482
Detection Prevalence 0.1407 0.1885 0.6707
Balanced Accuracy 0.9339 0.7474 0.8490
We get a training accuracy of 83.93% from our model, with balanced accuracy of 93% for Podium, 75% for Top_10 and 84% for Outside_Top_10. The test accuracy is 83.24%, with balanced accuracy of 93% for Podium, 74% for Top_10 and 85% for Outside_Top_10. There is not a lot of difference between the train and test accuracy, which suggests our model is not overfitted.
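As a sanity check on the headline number, the overall test accuracy is simply the proportion of cases on the diagonal of the test confusion matrix (matrix re-entered by hand for illustration):

```r
test_tab <- matrix(c(669, 85, 0,
                     89, 864, 661,
                     1, 68, 2957),
                   nrow = 3,
                   dimnames = list(pred = c("1", "2", "3"), truth = c("1", "2", "3")))

sum(diag(test_tab)) / sum(test_tab)   # ~ 0.8324
```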
Confusion matrix plot for the test data:
test_cm_df = data.frame(test_cm)
colnames(test_cm_df) = c('pred', 'truth', 'y')
ggplot(data = test_cm_df, mapping = aes(x = truth, y = pred)) +
  geom_tile(aes(fill = y), colour = "white") +
  labs(title = 'Confusion Matrix of Test Data') +
  scale_x_discrete(labels = c("1" = "Podium", "2" = "Top_10", "3" = "Outside_Top_10")) +
  scale_y_discrete(labels = c("1" = "Podium", "2" = "Top_10", "3" = "Outside_Top_10")) +
  geom_text(aes(label = sprintf("%1.0f", y)), vjust = 1, colour = 'white') +
  # scale_fill_gradient(low = "cyan", high = "darkgoldenrod1") +
  theme_bw() +
  theme(legend.position = "none")