It has been seven years since I first started following Formula One closely, and it is still my favourite sport to watch. It does not seem complex while you are watching it, but hundreds of factors affect whether a driver wins a race. Once you start learning the mechanics behind F1 cars, you begin to see just how many there are. I have followed F1 for years and will continue to do so. Still, the past three months of working on this project have taught me a lot of new things, and my domain knowledge made the specifics easier to understand. I have learnt more from this project than from watching races, and I would like to share my findings with everyone.
Data visualization is an important step in learning more about a dataset. It helps us understand the data better and answers many data science questions.
Are career points and career wins related, for both constructors and drivers? Michael Schumacher has more career wins than any other driver except Lewis Hamilton, yet he has fewer points than several of them. From domain knowledge we also know that Schumacher raced mostly from the early 1990s to the mid-2000s, which leads us to the conclusion that the number of points awarded for a win has increased over the years. The gap may also mean that drivers other than Schumacher gained more points without winning (positions 2-10 also score points). McLaren has been in F1 since very early on, which explains why it ranks second in race wins. But McLaren has gathered fewer points in the past few years than it did before 2000. Since a win was worth fewer points before 2000 and McLaren is now a mid-tier team, it makes sense that it has fewer overall points despite being second in race wins. Other factors also come into play when comparing career points and wins.
What led to the downfall of great drivers like Sebastian Vettel and Fernando Alonso? The points scored per season by Alonso and Vettel peaked during the 2011-2014 seasons, while Hamilton's peaked during 2017-2020. The decline of Alonso and Vettel coincided with the rise of Hamilton, who became more successful after 2014.
What really happened in 2021 (one of the best seasons ever) between Hamilton and Max? Max struggled in the early rounds with a lot of mechanical failures, but he collected many podiums later in the season, constantly chipping away at Hamilton’s lead. Hamilton was fairly consistent throughout the season, so the podiums and points Max gained came as a bit of a surprise towards the end, when it became clear that Max was only 2 points behind Hamilton.
What are the sentiments of fans of different teams? Mercedes fans posted the largest share of negative tweets (27.3%) of all the teams, but Ferrari and Red Bull are not far behind (26.7%), which suggests that the more famous a team is, the higher the chance that its fans are negative. McLaren ranks highest in positive tweets (52.3%).
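The point about wins being worth more points in recent years can be made concrete with a small worked example. The sketch below scores the same set of finishing positions under the 1991-2002 table (10-6-4-3-2-1) and the table used from 2010 onwards (25-18-15-12-10-8-6-4-2-1). The season results are hypothetical, and bonus points (fastest lap, sprint races) are ignored for simplicity.

```python
# Points awarded per finishing position under two F1 scoring systems.
POINTS_1991_2002 = [10, 6, 4, 3, 2, 1]                  # only the top 6 score
POINTS_2010_ON = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]    # the top 10 score


def season_points(finishes, table):
    """Sum the points for a list of 1-based finishing positions."""
    return sum(table[p - 1] for p in finishes if 1 <= p <= len(table))


# Hypothetical season: 5 wins, 3 second places, 2 seventh places.
finishes = [1] * 5 + [2] * 3 + [7] * 2

old_total = season_points(finishes, POINTS_1991_2002)
new_total = season_points(finishes, POINTS_2010_ON)
print(old_total, new_total)  # prints: 68 191
```

The same results are worth almost three times as many points under the newer table, which is why raw career points are hard to compare across eras.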
Naive Bayes Algorithm is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Record Data: The model reaches a training accuracy of 83.93%, with balanced accuracy of 93% for Podium, 75% for Top_10 and 84% for Outside_Top_10. Test accuracy is 83.24%, with balanced accuracy of 93% for Podium, 74% for Top_10 and 85% for Outside_Top_10. There is little difference between train and test accuracy, which suggests that the model is not overfitted. Looking at the balanced accuracy on both the train and test sets, the label “Top_10” is the hardest to predict, since its balanced accuracy is lower than that of the other two labels.
Twitter Data: Tweet text is noisy even after running Count Vectorizer, and it may contain many words that are common across labels in the same domain, so it makes sense that the model does not do very well at predicting the labels. Even after tuning the hyperparameters, the highest accuracy we get from the model is 76%.
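As a rough illustration of the Twitter setup described above, the sketch below feeds bag-of-words counts from scikit-learn's `CountVectorizer` into a Naive Bayes classifier. The tweets and the `positive`/`negative` labels are invented for the example, and `MultinomialNB` is an assumption, since the exact Naive Bayes variant and label set used in the project are not stated here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets; the real project data is not reproduced here.
tweets = [
    "what a great race amazing win",
    "brilliant overtake love this team",
    "terrible strategy awful pit stop",
    "so disappointed terrible race",
]
labels = ["positive", "positive", "negative", "negative"]

# CountVectorizer turns each tweet into word counts; MultinomialNB then
# treats every word as independent evidence for a label.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(tweets, labels)

predictions = list(model.predict(["amazing win love it", "awful terrible strategy"]))
print(predictions)

# Words shared between labels (here "race") carry little signal, which is
# one reason accuracy on tweets tends to be limited.
vocabulary = sorted(model.named_steps["countvectorizer"].vocabulary_)
```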
Decision Trees: A decision tree builds a classification model in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while incrementally developing the associated tree. The final result is a tree with decision nodes and leaf nodes. Decision trees mimic the way people reason through a decision, so they are easy to understand.
Tweet text is noisy even after running Count Vectorizer, and it may contain many words that are common across labels in the same domain, so it makes sense that the decision tree does not do very well at predicting the labels either. Even after tuning the hyperparameters, the highest accuracy we get from the model is 74%.
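To show what "decision nodes and leaf nodes" look like in practice, here is a minimal sketch that fits a small tree with scikit-learn and prints its rules. The single feature (`grid`, the starting position) and the eight rows are hypothetical; only the three label names come from the record data described earlier.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical rows: starting grid position -> finishing class.
X = [[1], [2], [3], [5], [8], [10], [14], [18]]
y = [
    "Podium", "Podium", "Podium",
    "Top_10", "Top_10", "Top_10",
    "Outside_Top_10", "Outside_Top_10",
]

# A shallow tree keeps the rules readable: each decision node tests a
# threshold on the feature, and each leaf holds a predicted class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Text rendering of the decision nodes and leaves.
rules = export_text(tree, feature_names=["grid"])
print(rules)

predictions = list(tree.predict([[2], [7], [16]]))
```

Printing the rules makes it easy to read the model as a sequence of threshold questions, which is the interpretability benefit mentioned above.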
Support Vector Machines: SVMs differ from other classification algorithms in how they choose the decision boundary: they pick the one that maximizes the distance to the nearest data points of all the classes. The decision boundary created by SVMs is called the maximum margin classifier or the maximum margin hyperplane.
We run an SVM on our record data, using GridSearchCV to search for the best hyperparameters.
After running 810 iterations over different hyperparameter combinations, we get 99% accuracy on our dataset, which is far higher than the Naive Bayes model (84%). This shows the limitations of the Naive Bayes model, even though it is faster to compute (30 seconds compared to 160 seconds for the SVM).
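The search described above can be sketched as follows. This is not the project's actual grid: the data is synthetic (`make_classification`), and the parameter values and the 3-fold setting are assumptions chosen to keep the example fast. The total number of model fits is the number of parameter combinations times the number of folds, so, for instance, 162 combinations with 5 folds would give 810 fits.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the record data.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: 3 * 2 * 2 = 12 combinations.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", 0.1],
}
n_folds = 3
n_fits = len(ParameterGrid(param_grid)) * n_folds  # 12 * 3 = 36 fits

# Every combination is cross-validated; the best one is refit on all
# of the training data.
search = GridSearchCV(SVC(), param_grid, cv=n_folds, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(n_fits, search.best_params_, round(test_accuracy, 3))
```

Scoring on a held-out test split, as in the last step, helps check whether a very high accuracy from the search also holds on unseen data.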
Clustering: Clustering is a machine learning method that groups vectors or observations (sets of objects) into clusters.
Using various hyperparameters and algorithms, we conclude that splitting the race positions (target variable) into 3 sections (1-3, 4-10, 11-20) was the best option, as the clustering algorithms gave the same grouping.
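One common way to check how many clusters a dataset supports is to compare silhouette scores across several values of k. The sketch below does this with k-means on synthetic data that has three well-separated groups. Both k-means and the silhouette score are assumptions here, since the specific algorithms and tuning criteria used in the project are not named, and the data is generated rather than taken from the race records.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustration only).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [20, 0]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for several k and record the silhouette score of each;
# higher scores mean tighter, better separated clusters.
scores = {}
for k in range(2, 6):
    cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, cluster_labels)

best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```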
Association Rule Mining: ARM is a technique for identifying frequent patterns, correlations, associations, or causal structures in data sets found in a variety of databases, including relational databases, transactional databases, and other types of data repositories. Many interesting relations among the rules can be seen from the network graph:
If the weather is windy, the season is 2016 and Rosberg is on pole (1st on the starting grid), it is highly likely that he will finish in the Top 3 (Podium).
If the status of the race is Lapped and Hamilton has won the race, the driver's finishing position is most likely Outside Top 10.
For the 2021 season, if Max Verstappen is on pole and the weather conditions are Sunny, it is likely that he will win the race.
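Rules like the ones above are usually ranked by support, confidence and lift. The following sketch computes those three metrics in plain Python for a single rule over a handful of invented transactions. The item names (`pole=...`, `weather=...`, `result=...`) and the rows are hypothetical and only mirror the shape of the third rule.

```python
# Each transaction is the set of items describing one race (hypothetical rows).
transactions = [
    {"pole=Verstappen", "weather=Sunny", "result=Win"},
    {"pole=Verstappen", "weather=Sunny", "result=Win"},
    {"pole=Verstappen", "weather=Rainy", "result=Podium"},
    {"pole=Hamilton", "weather=Sunny", "result=Win"},
    {"pole=Hamilton", "weather=Sunny", "result=Podium"},
]


def support(itemset, rows):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= row for row in rows) / len(rows)


def rule_metrics(antecedent, consequent, rows):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    sup_both = support(antecedent | consequent, rows)
    confidence = sup_both / support(antecedent, rows)
    lift = confidence / support(consequent, rows)
    return sup_both, confidence, lift


sup, conf, lift = rule_metrics(
    {"pole=Verstappen", "weather=Sunny"}, {"result=Win"}, transactions
)
print(sup, conf, round(lift, 3))
```

A lift above 1 means the antecedent makes the consequent more likely than its baseline frequency, which is what makes a rule worth showing in the network graph.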