Introduction
I collected one week of English tweets from Twitter for the 10 Formula One teams to build different Naive Bayes models that predict which team a tweet belongs to.
The data was cleaned in the previous sections.
Overview of Data Cleaning:
F1 tweets provide context about the sentiments of fans all over the world. Since some fans do not speak English, they tend to tweet in languages other than English. After extracting 1,000 tweets for each team from Twitter, I kept only the English-language tweets for better understandability.
Various pre-processing tasks were applied to the tweet text: excess blank spaces, stopwords, numbers and punctuation were removed.
Furthermore, the tweets were tokenized and lemmatized for further analysis.
I also calculated the sentiment of each tweet in order to understand the emotions of the fans behind these tweets and to better motivate this project.
The master table consists of 3,928 rows, 9 columns and 1 label column.
The label column is based on the ten teams racing in the current (2022) season:
Ferrari
Mercedes
Redbull
Williams
Alpha Tauri
Alfa Romeo
McLaren
Alpine
Haas
Aston Martin
Import Libraries
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from wordcloud import WordCloud, STOPWORDS
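Import Data
The cleaned tweet data produced in the earlier sections is read in next; the path below is the one used elsewhere in this project.
Code
df = pd.read_csv('../../data/01-modified-data/all_teams_sentiment_df.csv')
df.head()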
Data Pre-Processing and Visualization
The cleaned data needs some pre-processing before it can be fed into the Naive Bayes models.
Overview of Pre-Processing:
Dropping unnecessary columns: we only need the tweet text column and the label column.
During data cleaning, removing numbers also changed the word “f1” to “f”, and the newline character “\n” was not removed. These issues also need to be fixed during pre-processing.
The label column contains team names, which need to be encoded as numeric labels.
The data has to be split into X (tweet texts) and y (labels); a sketch of these last two steps appears after the tweet-count plot below.
Code
# Restore "f1" (digit removal during cleaning turned it into "f") and strip stray newline characters
df['text'] = df['text'].str.replace(' f ', ' f1 ', regex=False)
df['text'] = df['text'].str.strip('\n')
Code
df1 = df[['Team', 'text']]
Visualizing the Number of tweets for each Team:
Code
df1['Team'].value_counts()
Williams 574
Mclaren 550
Mercedes 534
Alpine 523
Ferrari 453
Haas 419
Redbull 322
Aston Martin 267
Alfa Romeo 180
Alpha Tauri 106
Name: Team, dtype: int64
Code
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=df1['Team'].value_counts().index,
            y=df1['Team'].value_counts().values, ax=ax)
ax.set_xlabel('Team')
ax.set_ylabel('Number of Tweets')
ax.set_title('Number of Tweets per Team')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
plt.savefig('../../images/Number of Tweets per Team.png')
plt.show()
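The remaining pre-processing steps, splitting the data into X and y and encoding the team names as numeric labels, can be sketched as follows. LabelEncoder assigns each team an integer code, and label_list here simply keeps the team names in the order of those codes so it can be used to label plots later (a slight condensation of the cell used in this project).
Code
X = df1['text']                            # tweet texts
y = df1['Team']                            # team names

labelencoder = LabelEncoder()              # encode each team name as an integer
y = labelencoder.fit_transform(y)
label_list = list(labelencoder.classes_)   # team names in the order of their numeric codes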
Count Vectorizer
Whenever we work on any NLP-related problem, we process a lot of textual data. The textual data, after processing, needs to be fed into the model.
Characters and words are incomprehensible to machines. So, when dealing with text data, we must represent it numerically so that the machine can understand it.
The Count Vectorizer method converts text to numerical data.
CountVectorizer tokenizes the text (tokenization means dividing the sentences into words) and performs very basic preprocessing: it removes punctuation marks and converts all words to lowercase.
CountVectorizer creates a matrix in which each unique word is represented by a column of the matrix, and each text sample from the document is a row in the matrix. The value of each cell is nothing but the count of the word in that particular text sample.
Inside CountVectorizer, these words are not stored as strings; rather, each is given an index value, and the resulting matrix is stored in sparse form (a sparse matrix).
In our dataset we take the tweets of the 10 different teams and vectorize them so that they can be fed into our Naive Bayes models; a condensed sketch of this step is shown below.
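The sketch below mirrors the approach used here: CountVectorizer with English stop words, a labelled document-term matrix, and a 75/25 train/test split. It uses the names MyCV_content, TrainDF, TestDF, TrainLabels and TestLabels that the model cells below rely on; X and y come from the pre-processing sketch above, and the random_state is added only to make the split reproducible.
Code
# Build a document-term matrix from the tweets, attach the numeric labels, and split it
MyCV_content = CountVectorizer(input='content', stop_words='english')
My_DTM2 = MyCV_content.fit_transform(X)              # sparse counts: one row per tweet
ColNames = MyCV_content.get_feature_names_out()      # vocabulary: one column per word

My_DF_content = pd.DataFrame(My_DTM2.toarray(), columns=ColNames)
My_DF_content['LABEL'] = y                           # numeric team label for each tweet

TrainDF, TestDF = train_test_split(My_DF_content, test_size=0.25, random_state=1973)
TrainLabels = TrainDF['LABEL']
TestLabels = TestDF['LABEL']
TrainDF = TrainDF.drop(['LABEL'], axis=1)
TestDF = TestDF.drop(['LABEL'], axis=1)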
Naive Bayes Model
Bayes' Theorem: In probability theory and statistics, Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule), named after Thomas Bayes, describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, if the risk of developing health problems is known to increase with age, Bayes’ theorem allows the risk to an individual of a known age to be assessed more accurately (by conditioning it on their age) than simply assuming that the individual is typical of the population as a whole.
Naive Bayes Algorithm is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
A fruit might be categorized as an apple, for instance, if it is red, rounded, and around 3 inches in diameter. Even if these characteristics depend on one another or on the presence of other characteristics, each of these traits separately increases the likelihood that this fruit is an apple, which is why it is called “Naive.”
Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes’ Theorem can be shown by this equation: \[ P(C|X) = \frac {P(X|C) * P(C)}{P(X)} \]
In the above equation:
P(C|X) is the posterior probability of class (C, target) given predictor (X, attributes).
P(C) is the prior probability of class.
P(X|C) is the likelihood which is the probability of predictor given class.
P(X) is the prior probability of predictor.
How does Bayes Theorem work?
Let’s take an example: a pathology lab performs a test for a disease “D” with two possible results, “Positive” and “Negative.” They guarantee that the test is 99% accurate: if you have the disease, it will come back positive 99% of the time, and if you don’t have the disease, it will come back negative 99% of the time. If 3% of all people have this disease and your test comes back “positive,” what is the probability that you actually have the disease?
To solve this problem, we have to use conditional probability.
Probability of suffering from Disease D: P(D) = 0.03 (3%)
Probability of a “positive” test result given the patient has the disease: P(Pos | D) = 0.99 (99%)
Probability of not suffering from Disease D: P(~D) = 0.97 (97%)
Probability of a “positive” test result given the patient does not have the disease: P(Pos | ~D) = 0.01 (1%)
To calculate the probability that the patient actually has the disease, i.e. P(D | Pos), we use Bayes’ theorem:
\[ P(Pos) = P(Pos|D)\,P(D) + P(Pos|\sim D)\,P(\sim D) = 0.99 \times 0.03 + 0.01 \times 0.97 = 0.0394 \]
\[ P(D|Pos) = \frac{P(Pos|D)\,P(D)}{P(Pos)} = \frac{0.99 \times 0.03}{0.0394} \approx 0.7538 \]
So there is roughly a 75% chance that the patient is actually suffering from the disease. This is how Bayes’ theorem works.
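The same arithmetic can be checked in a few lines of Python:
Code
# Verify the worked example: P(D) = 0.03, P(Pos|D) = 0.99, P(Pos|~D) = 0.01
p_d, p_pos_d, p_pos_not_d = 0.03, 0.99, 0.01
p_pos = p_pos_d * p_d + p_pos_not_d * (1 - p_d)
print(p_pos)                   # 0.0394
print(p_pos_d * p_d / p_pos)   # ~0.7538, i.e. about a 75% chance of actually having the disease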
Types of Naive Bayes Algorithms:
Gaussian Naïve Bayes Classifier: In Gaussian Naïve Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian (normal) distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.
Multinomial Naïve Bayes Classifier: Feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification.
Bernoulli Naïve Bayes Classifier: In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence (i.e. a word occurs in a document or not) features are used rather than term frequencies (i.e. frequency of a word in the document).
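All three variants expose the same fit/predict interface in scikit-learn. A minimal illustration on toy data (the arrays below are purely hypothetical and only show which kind of features each variant expects):
Code
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y_toy = [0, 1, 0]
GaussianNB().fit([[5.1, 3.5], [6.2, 2.9], [5.0, 3.6]], y_toy)           # continuous features
MultinomialNB(alpha=1).fit([[2, 0, 1], [0, 3, 0], [1, 1, 4]], y_toy)    # word counts (used in this project)
BernoulliNB().fit([[1, 0, 1], [0, 1, 0], [1, 1, 1]], y_toy)             # binary word presence/absence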
Applications of Naive Bayes Algorithm:
Real-time prediction: Naive Bayes is an eager learning classifier and it is certainly fast, so it can be used for making predictions in real time.
Multi-class prediction: this algorithm is also well known for its multi-class prediction capability; we can predict the probability of multiple classes of the target variable.
Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence assumption) and have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
Recommendation systems: a Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
Advantages:
It is easy and fast to predict the class of a test data set, and it also performs well in multi-class prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and it needs less training data.
It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Disadvantages:
If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as “Zero Frequency.” To solve this, we can use a smoothing technique; one of the simplest is Laplace estimation.
On the other hand, Naive Bayes is also known to be a poor estimator, so the probability outputs from predict_proba should not be taken too seriously.
Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
Multinomial Naïve Bayes Classifier
There are thousands of software tools for the analysis of numerical data, but far fewer for text. Multinomial Naive Bayes is one of the most popular supervised learning classifiers for categorical text data.
Text classification is gaining popularity because there is an enormous amount of information available in email, documents, websites, etc. that needs to be analyzed. Knowing the context around a certain type of text helps reveal how users perceive a piece of software or a product.
The Multinomial Naive Bayes algorithm is a probabilistic learning method that is mostly used in Natural Language Processing (NLP). The algorithm is based on Bayes’ theorem and predicts the tag of a text, such as an email or a newspaper article. It calculates the probability of each tag for a given sample and then outputs the tag with the highest probability.
A Naive Bayes classifier is a collection of algorithms that all share one common principle: each feature being classified is independent of every other feature. The presence or absence of one feature does not affect the presence or absence of another.
Since we are dealing with text data (tweets) converted into numerical form using CountVectorizer, Multinomial Naive Bayes is a natural fit here.
Laplace smoothing is a technique that handles the problem of zero probability in Naïve Bayes. It is controlled by the parameter ‘alpha’ in sklearn’s MultinomialNB, as shown in the formula below. For this exercise we will fit models with alpha = 1, 5 and 10.
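Concretely, with Laplace (additive) smoothing the estimated probability of a word w in class c becomes: \[ P(w|c) = \frac{count(w, c) + \alpha}{\sum_{w'} count(w', c) + \alpha\,|V|} \] where |V| is the vocabulary size. With alpha > 0, a word never seen in a class still receives a small non-zero probability, and larger alpha values smooth the estimates more aggressively.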
Model 1 (alpha = 1)
Code
MyModelNB = MultinomialNB(alpha=1)
NB1 = MyModelNB.fit(TrainDF, TrainLabels)
Preds = MyModelNB.predict(TestDF)
Pred_Proba = MyModelNB.predict_proba(TestDF)
print(metrics.classification_report(TestLabels, Preds))
cnf_matrix1 = confusion_matrix(TestLabels, Preds)

# Visualise the confusion matrix
labels = label_list
ax1 = plt.subplot()
sns.heatmap(cnf_matrix1, annot=True, fmt='g', ax=ax1)
ax1.set_xlabel('Predicted labels')
ax1.set_ylabel('True labels')
ax1.set_title('Confusion Matrix for Model 1')
ax1.set_xticklabels(labels, rotation=45, horizontalalignment='right')
ax1.set_yticklabels(labels, rotation=45, horizontalalignment='right')
plt.savefig('../../images/Confusion Matrix for Model 1.png')
plt.show()
plt.close()
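Model 2 (alpha = 5)
The alpha = 5 model follows the same pattern as Model 1; this cell mirrors the corresponding step of the project.
Code
MyModelNB2 = MultinomialNB(alpha=5)
NB2 = MyModelNB2.fit(TrainDF, TrainLabels)
Preds2 = MyModelNB2.predict(TestDF)
Pred_Proba2 = MyModelNB2.predict_proba(TestDF)
print(metrics.classification_report(TestLabels, Preds2))
cnf_matrix2 = confusion_matrix(TestLabels, Preds2)

# Visualise the confusion matrix
labels = label_list
ax1 = plt.subplot()
sns.heatmap(cnf_matrix2, annot=True, fmt='g', ax=ax1)
ax1.set_xlabel('Predicted labels')
ax1.set_ylabel('True labels')
ax1.set_title('Confusion Matrix for Model 2')
ax1.set_xticklabels(labels, rotation=45, horizontalalignment='right')
ax1.set_yticklabels(labels, rotation=45, horizontalalignment='right')
plt.savefig('../../images/Confusion Matrix for Model 2.png')
plt.show()
plt.close()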
Model 3 (alpha = 10)
Code
MyModelNB3 = MultinomialNB(alpha=10)
NB3 = MyModelNB3.fit(TrainDF, TrainLabels)
Preds3 = MyModelNB3.predict(TestDF)
Pred_Proba3 = MyModelNB3.predict_proba(TestDF)
print(metrics.classification_report(TestLabels, Preds3))
cnf_matrix3 = confusion_matrix(TestLabels, Preds3)

# Visualise the confusion matrix
labels = label_list
ax1 = plt.subplot()
sns.heatmap(cnf_matrix3, annot=True, fmt='g', ax=ax1)
ax1.set_xlabel('Predicted labels')
ax1.set_ylabel('True labels')
ax1.set_title('Confusion Matrix for Model 3')
ax1.set_xticklabels(labels, rotation=45, horizontalalignment='right')
ax1.set_yticklabels(labels, rotation=45, horizontalalignment='right')
plt.savefig('../../images/Confusion Matrix for Model 3.png')
plt.show()
plt.close()
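Conclusions
The accuracy and f1-score increase as alpha increases.
For alpha from 1 to 5 there is a greater increase in accuracy (~5%) than from 5 to 10 (~1%).
We need a better model than Naive Bayes to achieve higher accuracy.
Some other Interpretations and Visualizations
Getting the 20 most used words in tweets for each team helps understand the fans of each team. A condensed sketch of that step follows; it reads the per-class word log-probabilities from the fitted Model 3, assumes NB3, MyCV_content and label_list from the cells above, and collects the results into a top_words dictionary.
Code
feature_names = np.array(MyCV_content.get_feature_names_out())
top_words = {}
for class_idx, team in enumerate(label_list):
    # feature_log_prob_[i] holds log P(word | class i); sort descending and keep the top 20
    order = NB3.feature_log_prob_[class_idx, :].argsort()[::-1]
    top_words[team] = feature_names[order[:20]]
    print(team, top_words[team])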
We can see that for each team the most used words are the team name, its drivers’ names, and variants of “f1” or “formula one”.
Some top-tier teams’ tweets also contain the names of rival teams and their drivers.
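The word clouds below were drawn from those top-20 lists. A condensed sketch of that step reuses the top_words dictionary from the previous sketch; fit_words takes a word-to-frequency mapping, so the uniform counts here only control which words appear, not their relative sizes.
Code
from collections import Counter

for team in label_list:
    freqs = Counter(top_words[team])                       # each top word gets frequency 1
    wc = WordCloud(background_color='black').fit_words(freqs)
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(wc, interpolation='bilinear')
    ax.axis('off')
    ax.set_title(f'WordCloud for: {team}')
    plt.show()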
WordCloud for: Aston Martin
WordCloud for: Ferrari
WordCloud for: Haas
WordCloud for: Mclaren
WordCloud for: Mercedes
WordCloud for: Redbull
WordCloud for: Williams