Formula One Prediction
  • Home
  • Data Gathering
    • Record Data
    • Twitter Data
  • Data Cleaning
    • Record Data in R
    • Record Data in Python
    • Twitter Data
  • EDA
  • Naive Bayes’
    • Record Data
    • Twitter Data
  • Decision Trees
  • SVM
  • Clustering
  • ARM
  • Conclusion

On this page

  • Data Cleaning
  • Import Libraries
  • Import Data
  • Cleaning the Data

Record Data Cleaning in Python

  • Show All Code
  • Hide All Code

  • View Source

Data Cleaning

  • In the era of big data, cleaning or scrubbing your data has become an essential part of the data management process. Even though data cleaning can be tedious at times, it is absolutely crucial for getting accurate business intelligence (BI) that can drive your strategic decisions.
  • Incorrect or inconsistent data leads to false conclusions. And so, how well you clean and understand the data has a high impact on the quality of the results.
  • Data cleaning involve different techniques based on the problem and the data type. Different methods can be applied with each has its own trade-offs. Overall, incorrect data is either removed, corrected, or imputed.
  • Data cleaning is the process of removing incorrect, duplicate, or otherwise erroneous data from a dataset. These errors can include incorrectly formatted data, redundant entries, mislabeled data, and other issues; they often arise when two or more datasets are combined together. Data cleaning improves the quality of your data as well as any business decisions that you draw based on the data.
  • There is no one right way to clean a dataset, as every set is different and presents its own unique slate of errors that need to be corrected. Many data cleaning techniques can now be automated with the help of dedicated software, but some portion of the work must be done manually to ensure the greatest accuracy. Usually this work is done by data quality analysts, BI analysts, and business users.
  • Every organization’s data cleaning methods will vary according to their individual needs as well as the particular constraints of the dataset. However, most data cleaning steps follow a standard framework:
    1. Determine the critical data values you need for your analysis.
    2. Collect the data you need, then sort and organize it.
    3. Identify duplicate or irrelevant values and remove them.
    4. Search for missing values and fill them in, so you have a complete dataset.
    5. Fix any remaining structural or repetitive errors in the dataset.
    6. Identify outliers and remove them, so they will not interfere with your analysis.
    7. Validate your dataset to ensure it is ready for data transformation and analysis.
    8. Once the set has been validated, perform your transformation and analysis.
  • Data Cleaning vs. Data Cleansing vs. Data Scrubbing
    • You might sometimes hear the terms data cleansing or data scrubbing used instead of data cleaning. In most situations, these terms are all being used interchangeably and refer to the exact same thing. Data scrubbing may sometimes be used to refer to a specific aspect of data cleaning—namely, removing duplicate or bad data from datasets.
    • You should also know that data scrubbing can have a slightly different meaning within the specific context of data storage; in this case, it refers to an automated function that evaluates storage systems and disk drives to identify any bad sectors or blocks and to confirm the data in them can be read.
    • Note that all three of these terms—data cleaning, data cleansing, and data scrubbing—are different from data transformation, which is the act of taking clean data and converting it into a new format or structure. Data transformation is a separate process that comes after data cleaning.
  • Benefits of Data Cleaning:
    • Not having clean data exacts a high price: IBM estimates that bad data costs the U.S. over $3 trillion each year. That’s because data-driven decisions are only as good as the data you are relying on. Bad quality data leads to equally bad quality decisions. If the data you are basing your strategy on is inaccurate, then your strategy will have the same issues present in the data, even if it seems sound. In fact, sometimes no data at all is better than bad data.
    • Cleaning your data results in many benefits for your organization in both the short- and long-term. It leads to better decision making, which can boost your efficiency and your customer satisfaction, in turn giving your business a competitive edge. Over time, it also reduces your costs of data management by preemptively removing errors and other mistakes that would necessitate performing analysis over and over again.

Import Libraries

Code
import pandas as pd
import numpy as np
import ast
import json
import requests
from bs4 import BeautifulSoup
from lxml import html
import requests
from numpy import NaN
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

Import Data

Code
df = pd.read_csv('../../data/01-modified-data/data_cleaning_R.csv')
df.head()

Cleaning the Data

Code
a = 0
b = 1
c = 2
d = 3
e = 4

df['Driver_Position'] = 0
df['Driver_Points'] = 0
df['driverRef'] = 0
df['constructorRef'] = 0
df['grid_pos'] = 0
df['race_status'] = 0
df['race_time'] = 0
df['completed_laps'] = 0

for i in range(1096):
    
    df['Driver_Position'][a] = df['Driver_Position_1'][a]
    df['Driver_Position'][b] = df['Driver_Position_2'][b]
    df['Driver_Position'][c] = df['Driver_Position_3'][c]
    df['Driver_Position'][d] = df['Driver_Position_4'][d]
    df['Driver_Position'][e] = df['Driver_Position_5'][e]
    
    df['Driver_Points'][a] = df['Driver_Points_1'][a]
    df['Driver_Points'][b] = df['Driver_Points_2'][b]
    df['Driver_Points'][c] = df['Driver_Points_3'][c]
    df['Driver_Points'][d] = df['Driver_Points_4'][d]
    df['Driver_Points'][e] = df['Driver_Points_5'][e]
    
    df['driverRef'][a] = df['driverRef_1'][a]
    df['driverRef'][b] = df['driverRef_2'][b]
    df['driverRef'][c] = df['driverRef_3'][c]
    df['driverRef'][d] = df['driverRef_4'][d]
    df['driverRef'][e] = df['driverRef_5'][e]
    
    df['constructorRef'][a] = df['constructorRef_1'][a]
    df['constructorRef'][b] = df['constructorRef_2'][b]
    df['constructorRef'][c] = df['constructorRef_3'][c]
    df['constructorRef'][d] = df['constructorRef_4'][d]
    df['constructorRef'][e] = df['constructorRef_5'][e]
    
    df['grid_pos'][a] = df['grid_pos_1'][a]
    df['grid_pos'][b] = df['grid_pos_2'][b]
    df['grid_pos'][c] = df['grid_pos_3'][c]
    df['grid_pos'][d] = df['grid_pos_4'][d]
    df['grid_pos'][e] = df['grid_pos_5'][e]
    
    df['race_status'][a] = df['race_status_1'][a]
    df['race_status'][b] = df['race_status_2'][b]
    df['race_status'][c] = df['race_status_3'][c]
    df['race_status'][d] = df['race_status_4'][d]
    df['race_status'][e] = df['race_status_5'][e]
    
    df['race_time'][a] = df['race_time_1'][a]
    df['race_time'][b] = df['race_time_2'][b]
    df['race_time'][c] = df['race_time_3'][c]
    df['race_time'][d] = df['race_time_4'][d]
    df['race_time'][e] = df['race_time_5'][e]
    
    df['completed_laps'][a] = df['completed_laps_1'][a]
    df['completed_laps'][b] = df['completed_laps_2'][b]
    df['completed_laps'][c] = df['completed_laps_3'][c]
    df['completed_laps'][d] = df['completed_laps_4'][d]
    df['completed_laps'][e] = df['completed_laps_5'][e]
    
    a = a+5
    b = b+5
    c = c+5
    d = d+5
    e = e+5
    
drop = ['Driver_Position_1', 'Driver_Position_2', 'Driver_Position_3', 'Driver_Position_4', 'Driver_Position_5',
             'Driver_Points_1', 'Driver_Points_2', 'Driver_Points_3', 'Driver_Points_4', 'Driver_Points_5',
             'driverRef_1', 'driverRef_2', 'driverRef_3', 'driverRef_4', 'driverRef_5',
             'constructorRef_1', 'constructorRef_2', 'constructorRef_3', 'constructorRef_4', 'constructorRef_5',
             'grid_pos_1', 'grid_pos_2', 'grid_pos_3', 'grid_pos_4', 'grid_pos_5', 
             'completed_laps_1', 'completed_laps_2', 'completed_laps_3', 'completed_laps_4', 'completed_laps_5',
             'race_status_1', 'race_status_2', 'race_status_3', 'race_status_4', 'race_status_5',
             'race_time_1', 'race_time_2', 'race_time_3', 'race_time_4', 'race_time_5']

df.drop(drop, axis = 1, inplace=True)

df.to_csv('../../data/01-modified-data/race_result_cleaned.csv')

Getting and Un-packing Driver standings Information from our Raw data to merge it in the master data:

Code
ds_df = pd.read_csv('../../data/00-raw-data/driver_standings.csv')
ast.literal_eval(ds_df['DriverStandings'][0])[0]
Code
ds_df
Code
season_list = list(df['season'].unique())
ds_df_1 = pd.DataFrame(columns=['driverRef'] + season_list)
ds_df_1['driverRef'] = ds_df['driverRef']
ds_df_1.fillna(0, inplace=True)
Code
ds_df_1
Code
for i in range(len(ds_df)):
    
    for j in range(len(ast.literal_eval(ds_df['DriverStandings'][i]))):
        
        points = ast.literal_eval(ds_df['DriverStandings'][i])[j]['points']
        driver = ast.literal_eval(ds_df['DriverStandings'][i])[j]['Driver']['driverId']
        
        if points == '':
            points = 0
            
        for k in range(len(ds_df_1)):
            
            if driver == ds_df_1['driverRef'][k]:
                ds_df_1[int(ds_df['season'][i])][k] = points
Code
ds_df_1
Code
race_results_df = pd.read_csv('../../data/00-raw-data/race_results.csv')
Code
race_results_df
Code
driver_list = list(driver_df['driverRef'].unique())

Finding the Cumulative Points of a driver for every race he has raced:

Code
point_history_dict = {i : 0 for i in driver_list}
Code
df['Cumulative_Points'] = 0

for i in range(len(race_results_df)):
    
    res_dict = ast.literal_eval(race_results_df['Results'][i])
    
    for j in range(len(res_dict)):
        
        for k in range(len(df)):
        
            if (race_results_df['season'][i] == df['season'][k]) & (race_results_df['round'][i] == df['round'][k]) & (res_dict[j]['Driver']['driverId'] == df['driverRef'][k]):
                    
                if res_dict[j]['points'] == '':
                    res_dict[j]['points'] = 0
                
                df['Cumulative_Points'][k] = point_history_dict[res_dict[j]['Driver']['driverId']] + float(res_dict[j]['points'])    
                point_history_dict[res_dict[j]['Driver']['driverId']] += float(res_dict[j]['points'])    
Code
driver_df

Finding the year of birth of each driver at the time of the race:

Code
df['yob'] = 0
for i in range(len(df)):
    
    for j in range(len(driver_df)):
        
        if df['driverRef'][i] == driver_df['driverRef'][j]:
            df['yob'][i] = int(driver_df['dob'][j][:4])

Finding the cumulative wins of each driver up until each race:

Code
win_history_dict = {i : 0 for i in driver_list}    
Code
df['Cumulative_Wins'] = 0

for i in range(len(df)):
    
    if df['Driver_Position'][i] == 1:
        
        df['Cumulative_Wins'][i] = win_history_dict[df['driverRef'][i]] + 1
        win_history_dict[df['driverRef'][i]] += 1
    
    else:
        df['Cumulative_Wins'][i] = win_history_dict[df['driverRef'][i]]
Code
df.tail(40)
Code
podium_per_race = {}

for i in range(len(race_results_df)):
    
    result = ast.literal_eval(race_results_df['Results'][i])
     
    for j in range(len(result)):
         
        if result[j]['grid'] == '1':
            
            driver = result[j]['Driver']['driverId']
             
            podium_per_race[str(race_results_df['season'][i]) + '_' + str(race_results_df['round'][i])] = driver
Code
df['Podiums'] = ''

for i in range(len(df)):
    
    for keys in podium_per_race:
    
        if str(df['season'][i]) + '_' + str(df['round'][i]) == keys:
            
            df['Podiums'][i] = podium_per_race[keys]
Code
df1 = df.groupby(['season', 'round'], as_index = False)['Podiums'].max()
Code
df1['counter'] = ''

for i in range(len(df1)):
    
    df1['counter'][i] = dict(Counter(df1['Podiums'][:i+1]))
Code
df = pd.merge(df, df1, on = ['season', 'round'], how = 'left')
Code
df.drop(['Podiums_y'], axis = 1, inplace=True)
Code
df.rename(columns = {'Podiums_x' : 'Podiums'}, inplace=True)
Code
df['podium_count'] = 0

for i in range(len(df)):
    
    if df['driverRef'][i] in df['counter'][i].keys():
        df['podium_count'][i] = df['counter'][i][df['driverRef'][i]]
    
    else:
        df['podium_count'][i] = 0

Merging the season and round into one column for easy computations:

Code
season_info = df.groupby(['season', 'round'], as_index = False).count().iloc[:, :2]
Code
race_results_df['season_round'] = ''

for i in range(len(race_results_df)):
    
    race_results_df['season_round'][i] = str(race_results_df['season'][i]) + '_' + str(race_results_df['round'][i])
Code
season_info['season_round'] = ''

for i in range(len(season_info)):
    
    season_info['season_round'][i] = str(season_info['season'][i]) + '_' + str(season_info['round'][i])
Code
df1 = pd.merge(season_info, race_results_df, on = 'season_round', how = 'inner')
Code
df1.drop(['season_y', 'round_y', 'Unnamed: 0'], axis = 1, inplace=True)
Code
df1

Unzipping the results of each driver in each race to create our main dataframe:

Code
main_df = pd.DataFrame()

for i in range(len(df1)):
    
    data1 = pd.DataFrame(ast.literal_eval(df1['Results'][i]))
    data1['season'] = df1['season_x'][i]
    data1['round'] = df1['round_x'][i]
    data1['season_round'] = df1['season_round'][i]
    data1['circuitId'] = ast.literal_eval(df1['Circuit'][i])['circuitId']
    data1['date'] = df1['date'][i]
    data1['url'] = df1['url'][i]
    data1['raceName '] = df1['raceName'][i]
    
    main_df = pd.concat([main_df, data1], axis = 0)
Code
main_df.reset_index(inplace=True)
Code
main_df
Code
main_df['driverRef'] = ''
main_df['yob'] = 0
main_df['constructorRef'] = ''

for i in range(len(main_df)):
    
    main_df['driverRef'][i] = main_df['Driver'][i]['driverId']
    main_df['yob'][i] = int(main_df['Driver'][i]['dateOfBirth'][:4])
    main_df['constructorRef'][i] = main_df['Constructor'][i]['constructorId']

Dropping useless columns:

Code
main_df.drop(['Driver', 'Constructor', 'index', 'number', 'positionText', 'Time'], axis = 1, inplace=True)
Code
main_df

Webscraping weather information of each race of each season from wikipedia infobox:

Code
def get_infobox(url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text)

    table = bs.find('table', {'class' :'infobox vevent'})
    result = {}
    row_count = 0
    if table is None:
        pass
    else:
        for tr in table.find_all('tr'):
            if tr.find('th'):
                key = tr.find('th').text.strip()
                if tr.find('td'):
                    value = tr.find('td').text.strip()
                    result[key] = value
                else:
                    result[key] = ''
            else:
                pass
    return result
Code
url_df = main_df.groupby(['season_round'], as_index = False)['url'].max()
Code
url_df['weather'] = np.nan

for i in range(0, len(url_df)):
    infobox = get_infobox(url_df['url'][i])
    infobox_keys = get_infobox(url_df['url'][i]).keys()
    
    if 'Weather' in infobox_keys:
        url_df['weather'][i] = infobox['Weather']
    else:
        url_df['weather'][i] = np.nan
Code
main_df = pd.merge(main_df, url_df, on = 'season_round', how = 'left')
Code
main_df

Classifying weather into Rainy, Snowy, Windy, Cloudy, Fine, Sunny:

Code
for i in range(len(main_df)):
    
    if main_df['weather'][i] is not NaN:
        
        if ('rain' in main_df['weather'][i].lower()) | ('wet' in main_df['weather'][i].lower()) | ('showers' in main_df['weather'][i].lower()) | ('drizzle' in main_df['weather'][i].lower()) | ('rainy' in main_df['weather'][i].lower()) | ('drizzly' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Rainy'
        
        elif ('snow' in main_df['weather'][i].lower()) | ('cold' in main_df['weather'][i].lower()) | ('frost' in main_df['weather'][i].lower()) | ('icy' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Snowy'
        
        elif ('windy' in main_df['weather'][i].lower()) | ('wind' in main_df['weather'][i].lower()) | ('gusty' in main_df['weather'][i].lower()) | ('blustery' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Windy'
        
        elif ('cloudy' in main_df['weather'][i].lower()) | ('overcast' in main_df['weather'][i].lower()) | ('partly cloudy' in main_df['weather'][i].lower()) | ('partly sunny' in main_df['weather'][i].lower()) | ('hazy' in main_df['weather'][i].lower()) | ('foggy' in main_df['weather'][i].lower()) | ('fog' in main_df['weather'][i].lower()) | ('misty' in main_df['weather'][i].lower()) | ('mist' in main_df['weather'][i].lower()) | ('cloud' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Cloudy'
            
        elif ('fine' in main_df['weather'][i].lower()) | ('fair' in main_df['weather'][i].lower()) | ('mild' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Fine'
        
        elif ('sunny' in main_df['weather'][i].lower()) | ('clear' in main_df['weather'][i].lower()) | ('warm' in main_df['weather'][i].lower()) | ('hot' in main_df['weather'][i].lower()) | ('dry' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Sunny'
            
        else:
            main_df['weather'][i] = 'Sunny'
    
    else:
        pass
Code
main_df['weather'].value_counts()
Code
main_df.drop(['url_y', 'raceName\t', 'FastestLap'], axis = 1, inplace=True)
Code
main_df

Importing other raw datas for further cleaning and engineering:

Code
pitstop_df = pd.read_csv('../../data/00-raw-data/pit_stops.csv')
pitstop_count = pitstop_df.groupby(['raceId', 'driverId'], as_index=False)['stop'].count()
Code
race_df = pd.read_csv('../../data/00-raw-data/races.csv')
race_df.rename(columns={'year':'season'}, inplace=True)
race_df =  race_df[['raceId', 'season', 'round', 'circuitId']]
race_df.head()
Code
driver_df = pd.read_csv('../../data/00-raw-data/drivers.csv')
driver_df.head()
Code
pitstop_count

Merging raw datasets with main data to get additional feature columns:

Code
main_df = pd.merge(main_df, driver_df, on = ['driverRef'], how = 'left')
Code
main_df.drop(['url_x', 'driverRef', 'code', 'number', 'nationality', 'forename', 'surname', 'dob'], axis = 1, inplace=True)
Code
main_df = pd.merge(main_df, race_df, on = ['season', 'round'], how = 'left')
Code
main_df.rename(columns={'circuitId_x' : 'circuitRef', 'circuitId_y' : 'circuitId'}, inplace=True)
main_df
Code
main_df = pd.merge(main_df, pitstop_count, on = ['raceId', 'driverId'], how = 'left')
Code
main_df
Code
for i in range(len(main_df)):
    
    if ('Lap' in main_df['status'][i]) | ('lap' in main_df['status'][i]):
        main_df['status'][i] = 'Lapped'
    
    elif 'Accident' in main_df['status'][i]:
        main_df['status'][i] = 'Accident'
        
    elif 'Finished' in main_df['status'][i]:
        main_df['status'][i] = 'Finished'
        
    else:
        main_df['status'][i] = 'Mechanical_Issue'
Code
main_df
Code
main_df['age_on_race'] = main_df['season'] - main_df['yob']
main_df.drop('yob', axis = 1, inplace=True)

Replacing null values in Pit stop and Weather columns with string “Not Available”:

Code
main_df['stop'].fillna('Not Available', inplace=True)
main_df['weather'].fillna('Not Available', inplace=True)

Converting other numerical values in stop column to text:

Code
for i in range(len(main_df)):
    
    if main_df['stop'][i] == 1.0:
        main_df['stop'][i] = 'One'
    
    elif main_df['stop'][i] == 2.0:
        main_df['stop'][i] = 'Two'
    
    elif main_df['stop'][i] == 3.0:
        main_df['stop'][i] = 'Three'
    
    elif main_df['stop'][i] == 4.0:
        main_df['stop'][i] = 'Four'
    
    elif main_df['stop'][i] == 5.0:
        main_df['stop'][i] = 'Five'
    
    elif main_df['stop'][i] == 6.0:
        main_df['stop'][i] = 'Six'
        
    elif main_df['stop'][i] == 7.0:
        main_df['stop'][i] = 'Seven'
        
    else:
        pass
Code
main_df
Code
main_df[['position', 'points', 'grid', 'laps', 'season', 'round', 'driverId', 'raceId', 'circuitId', 'age_on_race']] = main_df[['position', 'points', 'grid', 'laps', 'season', 'round', 'driverId', 'raceId', 'circuitId', 'age_on_race']].apply(pd.to_numeric)

Getting the Cumulative sum of points and laps completed by each driver for every race:

Code
main_df['cumulative_points'] = main_df.groupby(['driverId'])['points'].cumsum()
main_df['cumulative_laps'] = main_df.groupby(['driverId'])['laps'].cumsum()
Code
main_df

Getting the number of poles and wins achieved till a specific race for every driver in every race:

Code
grid_df = main_df[['season', 'round', 'driverId']][main_df['grid'] == 1]
Code
main_df = pd.merge(main_df, grid_df, on = ['season', 'round'], how = 'left')
main_df.rename(columns={'driverId_x' : 'driverId', 'driverId_y' : 'pole_driverId'}, inplace=True)
Code
history_dict = {i : 0 for i in list(main_df['driverId'].unique())} 
Code
main_df['pole_history'] = 0
    
for j in range(len(main_df)):
        
    if (main_df['driverId'][j] == main_df['pole_driverId'][j]):
            
        history_dict[main_df['driverId'][j]] += 1
        main_df['pole_history'][j] = history_dict[main_df['driverId'][j]]
        
    else:
        main_df['pole_history'][j] = history_dict[main_df['driverId'][j]]
Code
win_df = main_df[['season_round', 'driverId']][main_df['position'] == 1]
Code
main_df = pd.merge(main_df, win_df, on = ['season_round'], how = 'left')
Code
main_df.rename(columns={'driverId_x' : 'driverId', 'driverId_y' : 'win_driverId'}, inplace=True)
Code
main_df
Code
history_dict = {i : 0 for i in list(main_df['driverId'].unique())} 

main_df['win_history'] = 0
    
    
for j in range(len(main_df)):
        
    if (main_df['driverId'][j] == main_df['win_driverId'][j]):
            
        history_dict[main_df['driverId'][j]] += 1
        main_df['win_history'][j] = history_dict[main_df['driverId'][j]]
        
    else:
        main_df['win_history'][j] = history_dict[main_df['driverId'][j]]
Code
main_df.drop(['circuitRef', 'url', 'date'], axis = 1, inplace=True)
Code
main_df.columns
Code
main_df = main_df[['season', 'round', 'season_round', 'driverId', 'raceId',
                    'circuitId', 'position', 'points', 'grid', 'laps', 'status',
                    'constructorRef', 'weather', 'stop', 'age_on_race', 'cumulative_points',
                    'cumulative_laps', 'pole_driverId', 'pole_history', 'win_driverId',
                    'win_history']]
Code
main_df

Creating a label column for Supervised Machine learning with 3 label variables - Podium (Top 3 finish), Top_10 (4-10 finish) and Outisde_Top_10 (10+ finish):

Code
main_df['label'] = ''

for i in range(len(main_df)):
    
    if main_df['position'][i] in [1, 2, 3]:
        main_df['label'][i] = 'Podium'
        
    elif main_df['position'][i] in [4, 5, 6, 7, 8, 9, 10]:
        main_df['label'][i] = 'Top_10'
    
    else:
        main_df['label'][i] = 'Outside_Top_10'
Code
main_df

Exporting the dataframe to a csv file:

Code
main_df.to_csv('../../data/01-modified-data/data_cleaned.csv', index=False)
Source Code
---
title: <b>Record Data Cleaning in Python</b>
format:
  html:
    theme: lumen
    toc: true
    self-contained: true
    embed-resources: true
    page-layout: full
    code-fold: true
    code-tools: true
jupyter: python3
execute: 
  enabled: false
---

# Data Cleaning
- In the era of big data, cleaning or scrubbing your data has become an essential part of the data management process. Even though data cleaning can be tedious at times, it is absolutely crucial for getting accurate business intelligence (BI) that can drive your strategic decisions.
- Incorrect or inconsistent data leads to false conclusions. And so, how well you clean and understand the data has a high impact on the quality of the results.
- Data cleaning involve different techniques based on the problem and the data type. Different methods can be applied with each has its own trade-offs. Overall, incorrect data is either removed, corrected, or imputed.
- Data cleaning is the process of removing incorrect, duplicate, or otherwise erroneous data from a dataset. These errors can include incorrectly formatted data, redundant entries, mislabeled data, and other issues; they often arise when two or more datasets are combined together. Data cleaning improves the quality of your data as well as any business decisions that you draw based on the data.
- There is no one right way to clean a dataset, as every set is different and presents its own unique slate of errors that need to be corrected. Many data cleaning techniques can now be automated with the help of dedicated software, but some portion of the work must be done manually to ensure the greatest accuracy. Usually this work is done by data quality analysts, BI analysts, and business users.
- Every organization’s data cleaning methods will vary according to their individual needs as well as the particular constraints of the dataset. However, most data cleaning steps follow a standard framework:
    1. Determine the critical data values you need for your analysis.
    2. Collect the data you need, then sort and organize it.
    3. Identify duplicate or irrelevant values and remove them.
    4. Search for missing values and fill them in, so you have a complete dataset.
    5. Fix any remaining structural or repetitive errors in the dataset.
    6. Identify outliers and remove them, so they will not interfere with your analysis.
    7. Validate your dataset to ensure it is ready for data transformation and analysis.
    8. Once the set has been validated, perform your transformation and analysis.
- `Data Cleaning vs. Data Cleansing vs. Data Scrubbing`
    - You might sometimes hear the terms data cleansing or data scrubbing used instead of data cleaning. In most situations, these terms are all being used interchangeably and refer to the exact same thing. Data scrubbing may sometimes be used to refer to a specific aspect of data cleaning—namely, removing duplicate or bad data from datasets.
    - You should also know that data scrubbing can have a slightly different meaning within the specific context of data storage; in this case, it refers to an automated function that evaluates storage systems and disk drives to identify any bad sectors or blocks and to confirm the data in them can be read.
    - Note that all three of these terms—data cleaning, data cleansing, and data scrubbing—are different from data transformation, which is the act of taking clean data and converting it into a new format or structure. Data transformation is a separate process that comes after data cleaning.
- `Benefits of Data Cleaning:`
    - Not having clean data exacts a high price: IBM estimates that bad data costs the U.S. over $3 trillion each year. That’s because data-driven decisions are only as good as the data you are relying on. Bad quality data leads to equally bad quality decisions. If the data you are basing your strategy on is inaccurate, then your strategy will have the same issues present in the data, even if it seems sound. In fact, sometimes no data at all is better than bad data.
    - Cleaning your data results in many benefits for your organization in both the short- and long-term. It leads to better decision making, which can boost your efficiency and your customer satisfaction, in turn giving your business a competitive edge. Over time, it also reduces your costs of data management by preemptively removing errors and other mistakes that would necessitate performing analysis over and over again.

# Import Libraries

```{python}
import pandas as pd
import numpy as np
import ast
import json
import requests
from bs4 import BeautifulSoup
from lxml import html
import requests
from numpy import NaN
from collections import Counter
import warnings
warnings.filterwarnings('ignore')
```

# Import Data

```{python}
df = pd.read_csv('../../data/01-modified-data/data_cleaning_R.csv')
df.head()
```

# Cleaning the Data

```{python}
a = 0
b = 1
c = 2
d = 3
e = 4

df['Driver_Position'] = 0
df['Driver_Points'] = 0
df['driverRef'] = 0
df['constructorRef'] = 0
df['grid_pos'] = 0
df['race_status'] = 0
df['race_time'] = 0
df['completed_laps'] = 0

for i in range(1096):
    
    df['Driver_Position'][a] = df['Driver_Position_1'][a]
    df['Driver_Position'][b] = df['Driver_Position_2'][b]
    df['Driver_Position'][c] = df['Driver_Position_3'][c]
    df['Driver_Position'][d] = df['Driver_Position_4'][d]
    df['Driver_Position'][e] = df['Driver_Position_5'][e]
    
    df['Driver_Points'][a] = df['Driver_Points_1'][a]
    df['Driver_Points'][b] = df['Driver_Points_2'][b]
    df['Driver_Points'][c] = df['Driver_Points_3'][c]
    df['Driver_Points'][d] = df['Driver_Points_4'][d]
    df['Driver_Points'][e] = df['Driver_Points_5'][e]
    
    df['driverRef'][a] = df['driverRef_1'][a]
    df['driverRef'][b] = df['driverRef_2'][b]
    df['driverRef'][c] = df['driverRef_3'][c]
    df['driverRef'][d] = df['driverRef_4'][d]
    df['driverRef'][e] = df['driverRef_5'][e]
    
    df['constructorRef'][a] = df['constructorRef_1'][a]
    df['constructorRef'][b] = df['constructorRef_2'][b]
    df['constructorRef'][c] = df['constructorRef_3'][c]
    df['constructorRef'][d] = df['constructorRef_4'][d]
    df['constructorRef'][e] = df['constructorRef_5'][e]
    
    df['grid_pos'][a] = df['grid_pos_1'][a]
    df['grid_pos'][b] = df['grid_pos_2'][b]
    df['grid_pos'][c] = df['grid_pos_3'][c]
    df['grid_pos'][d] = df['grid_pos_4'][d]
    df['grid_pos'][e] = df['grid_pos_5'][e]
    
    df['race_status'][a] = df['race_status_1'][a]
    df['race_status'][b] = df['race_status_2'][b]
    df['race_status'][c] = df['race_status_3'][c]
    df['race_status'][d] = df['race_status_4'][d]
    df['race_status'][e] = df['race_status_5'][e]
    
    df['race_time'][a] = df['race_time_1'][a]
    df['race_time'][b] = df['race_time_2'][b]
    df['race_time'][c] = df['race_time_3'][c]
    df['race_time'][d] = df['race_time_4'][d]
    df['race_time'][e] = df['race_time_5'][e]
    
    df['completed_laps'][a] = df['completed_laps_1'][a]
    df['completed_laps'][b] = df['completed_laps_2'][b]
    df['completed_laps'][c] = df['completed_laps_3'][c]
    df['completed_laps'][d] = df['completed_laps_4'][d]
    df['completed_laps'][e] = df['completed_laps_5'][e]
    
    a = a+5
    b = b+5
    c = c+5
    d = d+5
    e = e+5
    
drop = ['Driver_Position_1', 'Driver_Position_2', 'Driver_Position_3', 'Driver_Position_4', 'Driver_Position_5',
             'Driver_Points_1', 'Driver_Points_2', 'Driver_Points_3', 'Driver_Points_4', 'Driver_Points_5',
             'driverRef_1', 'driverRef_2', 'driverRef_3', 'driverRef_4', 'driverRef_5',
             'constructorRef_1', 'constructorRef_2', 'constructorRef_3', 'constructorRef_4', 'constructorRef_5',
             'grid_pos_1', 'grid_pos_2', 'grid_pos_3', 'grid_pos_4', 'grid_pos_5', 
             'completed_laps_1', 'completed_laps_2', 'completed_laps_3', 'completed_laps_4', 'completed_laps_5',
             'race_status_1', 'race_status_2', 'race_status_3', 'race_status_4', 'race_status_5',
             'race_time_1', 'race_time_2', 'race_time_3', 'race_time_4', 'race_time_5']

df.drop(drop, axis = 1, inplace=True)

df.to_csv('../../data/01-modified-data/race_result_cleaned.csv')
```

Getting and Un-packing Driver standings Information from our Raw data to merge it in the master data:

```{python}
ds_df = pd.read_csv('../../data/00-raw-data/driver_standings.csv')
ast.literal_eval(ds_df['DriverStandings'][0])[0]
```

```{python}
ds_df
```

```{python}
season_list = list(df['season'].unique())
ds_df_1 = pd.DataFrame(columns=['driverRef'] + season_list)
ds_df_1['driverRef'] = ds_df['driverRef']
ds_df_1.fillna(0, inplace=True)
```

```{python}
ds_df_1
```

```{python}
for i in range(len(ds_df)):
    
    for j in range(len(ast.literal_eval(ds_df['DriverStandings'][i]))):
        
        points = ast.literal_eval(ds_df['DriverStandings'][i])[j]['points']
        driver = ast.literal_eval(ds_df['DriverStandings'][i])[j]['Driver']['driverId']
        
        if points == '':
            points = 0
            
        for k in range(len(ds_df_1)):
            
            if driver == ds_df_1['driverRef'][k]:
                ds_df_1[int(ds_df['season'][i])][k] = points
```

```{python}
ds_df_1
```

```{python}
race_results_df = pd.read_csv('../../data/00-raw-data/race_results.csv')
```

```{python}
race_results_df
```

```{python}
driver_list = list(driver_df['driverRef'].unique())
```

Finding the Cumulative Points of a driver for every race he has raced:

```{python}
point_history_dict = {i : 0 for i in driver_list}
```

```{python}
df['Cumulative_Points'] = 0

for i in range(len(race_results_df)):
    
    res_dict = ast.literal_eval(race_results_df['Results'][i])
    
    for j in range(len(res_dict)):
        
        for k in range(len(df)):
        
            if (race_results_df['season'][i] == df['season'][k]) & (race_results_df['round'][i] == df['round'][k]) & (res_dict[j]['Driver']['driverId'] == df['driverRef'][k]):
                    
                if res_dict[j]['points'] == '':
                    res_dict[j]['points'] = 0
                
                df['Cumulative_Points'][k] = point_history_dict[res_dict[j]['Driver']['driverId']] + float(res_dict[j]['points'])    
                point_history_dict[res_dict[j]['Driver']['driverId']] += float(res_dict[j]['points'])    
```

```{python}
driver_df
```

Finding the year of birth of each driver at the time of the race:

```{python}
df['yob'] = 0
for i in range(len(df)):
    
    for j in range(len(driver_df)):
        
        if df['driverRef'][i] == driver_df['driverRef'][j]:
            df['yob'][i] = int(driver_df['dob'][j][:4])
```

Finding the cumulative wins of each driver up until each race:

```{python}
win_history_dict = {i : 0 for i in driver_list}    
```

```{python}
df['Cumulative_Wins'] = 0

for i in range(len(df)):
    
    if df['Driver_Position'][i] == 1:
        
        df['Cumulative_Wins'][i] = win_history_dict[df['driverRef'][i]] + 1
        win_history_dict[df['driverRef'][i]] += 1
    
    else:
        df['Cumulative_Wins'][i] = win_history_dict[df['driverRef'][i]]
        
```

```{python}
df.tail(40)
```

```{python}
podium_per_race = {}

for i in range(len(race_results_df)):
    
    result = ast.literal_eval(race_results_df['Results'][i])
     
    for j in range(len(result)):
         
        if result[j]['grid'] == '1':
            
            driver = result[j]['Driver']['driverId']
             
            podium_per_race[str(race_results_df['season'][i]) + '_' + str(race_results_df['round'][i])] = driver
```

```{python}
df['Podiums'] = ''

for i in range(len(df)):
    
    for keys in podium_per_race:
    
        if str(df['season'][i]) + '_' + str(df['round'][i]) == keys:
            
            df['Podiums'][i] = podium_per_race[keys]
```

```{python}
df1 = df.groupby(['season', 'round'], as_index = False)['Podiums'].max()
```

```{python}
df1['counter'] = ''

for i in range(len(df1)):
    
    df1['counter'][i] = dict(Counter(df1['Podiums'][:i+1]))
```

```{python}
df = pd.merge(df, df1, on = ['season', 'round'], how = 'left')
```

```{python}
df.drop(['Podiums_y'], axis = 1, inplace=True)
```

```{python}
df.rename(columns = {'Podiums_x' : 'Podiums'}, inplace=True)
```

```{python}
df['podium_count'] = 0

for i in range(len(df)):
    
    if df['driverRef'][i] in df['counter'][i].keys():
        df['podium_count'][i] = df['counter'][i][df['driverRef'][i]]
    
    else:
        df['podium_count'][i] = 0
```

Merging the season and round into one column for easy computations:

```{python}
season_info = df.groupby(['season', 'round'], as_index = False).count().iloc[:, :2]
```

```{python}
race_results_df['season_round'] = ''

for i in range(len(race_results_df)):
    
    race_results_df['season_round'][i] = str(race_results_df['season'][i]) + '_' + str(race_results_df['round'][i])
```

```{python}
season_info['season_round'] = ''

for i in range(len(season_info)):
    
    season_info['season_round'][i] = str(season_info['season'][i]) + '_' + str(season_info['round'][i])
```

```{python}
df1 = pd.merge(season_info, race_results_df, on = 'season_round', how = 'inner')
```

```{python}
df1.drop(['season_y', 'round_y', 'Unnamed: 0'], axis = 1, inplace=True)
```

```{python}
df1
```

Unzipping the results of each driver in each race to create our main dataframe:

```{python}
main_df = pd.DataFrame()

for i in range(len(df1)):
    
    data1 = pd.DataFrame(ast.literal_eval(df1['Results'][i]))
    data1['season'] = df1['season_x'][i]
    data1['round'] = df1['round_x'][i]
    data1['season_round'] = df1['season_round'][i]
    data1['circuitId'] = ast.literal_eval(df1['Circuit'][i])['circuitId']
    data1['date'] = df1['date'][i]
    data1['url'] = df1['url'][i]
    data1['raceName '] = df1['raceName'][i]
    
    main_df = pd.concat([main_df, data1], axis = 0)
```

```{python}
main_df.reset_index(inplace=True)
```

```{python}
main_df
```

```{python}
main_df['driverRef'] = ''
main_df['yob'] = 0
main_df['constructorRef'] = ''

for i in range(len(main_df)):
    
    main_df['driverRef'][i] = main_df['Driver'][i]['driverId']
    main_df['yob'][i] = int(main_df['Driver'][i]['dateOfBirth'][:4])
    main_df['constructorRef'][i] = main_df['Constructor'][i]['constructorId']
    
```

Dropping useless columns:

```{python}
main_df.drop(['Driver', 'Constructor', 'index', 'number', 'positionText', 'Time'], axis = 1, inplace=True)
```

```{python}
main_df
```

Webscraping weather information of each race of each season from wikipedia infobox:

```{python}
def get_infobox(url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text)

    table = bs.find('table', {'class' :'infobox vevent'})
    result = {}
    row_count = 0
    if table is None:
        pass
    else:
        for tr in table.find_all('tr'):
            if tr.find('th'):
                key = tr.find('th').text.strip()
                if tr.find('td'):
                    value = tr.find('td').text.strip()
                    result[key] = value
                else:
                    result[key] = ''
            else:
                pass
    return result
```

```{python}
url_df = main_df.groupby(['season_round'], as_index = False)['url'].max()
```

```{python}
url_df['weather'] = np.nan

for i in range(0, len(url_df)):
    infobox = get_infobox(url_df['url'][i])
    infobox_keys = get_infobox(url_df['url'][i]).keys()
    
    if 'Weather' in infobox_keys:
        url_df['weather'][i] = infobox['Weather']
    else:
        url_df['weather'][i] = np.nan
```

```{python}
main_df = pd.merge(main_df, url_df, on = 'season_round', how = 'left')
```

```{python}
main_df
```

Classifying weather into Rainy, Snowy, Windy, Cloudy, Fine, Sunny:

```{python}
for i in range(len(main_df)):
    
    if main_df['weather'][i] is not NaN:
        
        if ('rain' in main_df['weather'][i].lower()) | ('wet' in main_df['weather'][i].lower()) | ('showers' in main_df['weather'][i].lower()) | ('drizzle' in main_df['weather'][i].lower()) | ('rainy' in main_df['weather'][i].lower()) | ('drizzly' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Rainy'
        
        elif ('snow' in main_df['weather'][i].lower()) | ('cold' in main_df['weather'][i].lower()) | ('frost' in main_df['weather'][i].lower()) | ('icy' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Snowy'
        
        elif ('windy' in main_df['weather'][i].lower()) | ('wind' in main_df['weather'][i].lower()) | ('gusty' in main_df['weather'][i].lower()) | ('blustery' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Windy'
        
        elif ('cloudy' in main_df['weather'][i].lower()) | ('overcast' in main_df['weather'][i].lower()) | ('partly cloudy' in main_df['weather'][i].lower()) | ('partly sunny' in main_df['weather'][i].lower()) | ('hazy' in main_df['weather'][i].lower()) | ('foggy' in main_df['weather'][i].lower()) | ('fog' in main_df['weather'][i].lower()) | ('misty' in main_df['weather'][i].lower()) | ('mist' in main_df['weather'][i].lower()) | ('cloud' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Cloudy'
            
        elif ('fine' in main_df['weather'][i].lower()) | ('fair' in main_df['weather'][i].lower()) | ('mild' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Fine'
        
        elif ('sunny' in main_df['weather'][i].lower()) | ('clear' in main_df['weather'][i].lower()) | ('warm' in main_df['weather'][i].lower()) | ('hot' in main_df['weather'][i].lower()) | ('dry' in main_df['weather'][i].lower()):
            main_df['weather'][i] = 'Sunny'
            
        else:
            main_df['weather'][i] = 'Sunny'
    
    else:
        pass
```

```{python}
main_df['weather'].value_counts()
```

```{python}
main_df.drop(['url_y', 'raceName\t', 'FastestLap'], axis = 1, inplace=True)
```

```{python}
main_df
```

Importing other raw datas for further cleaning and engineering:

```{python}
pitstop_df = pd.read_csv('../../data/00-raw-data/pit_stops.csv')
pitstop_count = pitstop_df.groupby(['raceId', 'driverId'], as_index=False)['stop'].count()
```

```{python}
race_df = pd.read_csv('../../data/00-raw-data/races.csv')
race_df.rename(columns={'year':'season'}, inplace=True)
race_df =  race_df[['raceId', 'season', 'round', 'circuitId']]
race_df.head()
```

```{python}
driver_df = pd.read_csv('../../data/00-raw-data/drivers.csv')
driver_df.head()
```

```{python}
pitstop_count
```

Merging raw datasets with main data to get additional feature columns:

```{python}
main_df = pd.merge(main_df, driver_df, on = ['driverRef'], how = 'left')
```

```{python}
main_df.drop(['url_x', 'driverRef', 'code', 'number', 'nationality', 'forename', 'surname', 'dob'], axis = 1, inplace=True)
```

```{python}
main_df = pd.merge(main_df, race_df, on = ['season', 'round'], how = 'left')
```

```{python}
main_df.rename(columns={'circuitId_x' : 'circuitRef', 'circuitId_y' : 'circuitId'}, inplace=True)
main_df
```

```{python}
main_df = pd.merge(main_df, pitstop_count, on = ['raceId', 'driverId'], how = 'left')
```

```{python}
main_df
```

```{python}
for i in range(len(main_df)):
    
    if ('Lap' in main_df['status'][i]) | ('lap' in main_df['status'][i]):
        main_df['status'][i] = 'Lapped'
    
    elif 'Accident' in main_df['status'][i]:
        main_df['status'][i] = 'Accident'
        
    elif 'Finished' in main_df['status'][i]:
        main_df['status'][i] = 'Finished'
        
    else:
        main_df['status'][i] = 'Mechanical_Issue'
```

```{python}
main_df
```

```{python}
main_df['age_on_race'] = main_df['season'] - main_df['yob']
main_df.drop('yob', axis = 1, inplace=True)
```

Replacing null values in Pit stop and Weather columns with string "Not Available":

```{python}
main_df['stop'].fillna('Not Available', inplace=True)
main_df['weather'].fillna('Not Available', inplace=True)
```

Converting other numerical values in stop column to text:

```{python}
for i in range(len(main_df)):
    
    if main_df['stop'][i] == 1.0:
        main_df['stop'][i] = 'One'
    
    elif main_df['stop'][i] == 2.0:
        main_df['stop'][i] = 'Two'
    
    elif main_df['stop'][i] == 3.0:
        main_df['stop'][i] = 'Three'
    
    elif main_df['stop'][i] == 4.0:
        main_df['stop'][i] = 'Four'
    
    elif main_df['stop'][i] == 5.0:
        main_df['stop'][i] = 'Five'
    
    elif main_df['stop'][i] == 6.0:
        main_df['stop'][i] = 'Six'
        
    elif main_df['stop'][i] == 7.0:
        main_df['stop'][i] = 'Seven'
        
    else:
        pass
```

```{python}
main_df
```

```{python}
main_df[['position', 'points', 'grid', 'laps', 'season', 'round', 'driverId', 'raceId', 'circuitId', 'age_on_race']] = main_df[['position', 'points', 'grid', 'laps', 'season', 'round', 'driverId', 'raceId', 'circuitId', 'age_on_race']].apply(pd.to_numeric)
```

Getting the Cumulative sum of points and laps completed by each driver for every race:

```{python}
main_df['cumulative_points'] = main_df.groupby(['driverId'])['points'].cumsum()
main_df['cumulative_laps'] = main_df.groupby(['driverId'])['laps'].cumsum()
```

```{python}
main_df
```

Getting the number of poles and wins achieved till a specific race for every driver in every race:

```{python}
grid_df = main_df[['season', 'round', 'driverId']][main_df['grid'] == 1]
```

```{python}
main_df = pd.merge(main_df, grid_df, on = ['season', 'round'], how = 'left')
main_df.rename(columns={'driverId_x' : 'driverId', 'driverId_y' : 'pole_driverId'}, inplace=True)
```

```{python}
history_dict = {i : 0 for i in list(main_df['driverId'].unique())} 
```

```{python}
main_df['pole_history'] = 0
    
for j in range(len(main_df)):
        
    if (main_df['driverId'][j] == main_df['pole_driverId'][j]):
            
        history_dict[main_df['driverId'][j]] += 1
        main_df['pole_history'][j] = history_dict[main_df['driverId'][j]]
        
    else:
        main_df['pole_history'][j] = history_dict[main_df['driverId'][j]]
```

```{python}
win_df = main_df[['season_round', 'driverId']][main_df['position'] == 1]
```

```{python}
main_df = pd.merge(main_df, win_df, on = ['season_round'], how = 'left')
```

```{python}
main_df.rename(columns={'driverId_x' : 'driverId', 'driverId_y' : 'win_driverId'}, inplace=True)
```

```{python}
main_df
```

```{python}
history_dict = {i : 0 for i in list(main_df['driverId'].unique())} 

main_df['win_history'] = 0
    
    
for j in range(len(main_df)):
        
    if (main_df['driverId'][j] == main_df['win_driverId'][j]):
            
        history_dict[main_df['driverId'][j]] += 1
        main_df['win_history'][j] = history_dict[main_df['driverId'][j]]
        
    else:
        main_df['win_history'][j] = history_dict[main_df['driverId'][j]]
```

```{python}
main_df.drop(['circuitRef', 'url', 'date'], axis = 1, inplace=True)
```

```{python}
main_df.columns
```

```{python}
main_df = main_df[['season', 'round', 'season_round', 'driverId', 'raceId',
                    'circuitId', 'position', 'points', 'grid', 'laps', 'status',
                    'constructorRef', 'weather', 'stop', 'age_on_race', 'cumulative_points',
                    'cumulative_laps', 'pole_driverId', 'pole_history', 'win_driverId',
                    'win_history']]
```

```{python}
main_df
```

Creating a label column for Supervised Machine learning with 3 label variables - Podium (Top 3 finish), Top_10 (4-10 finish) and Outisde_Top_10 (10+ finish):

```{python}
main_df['label'] = ''

for i in range(len(main_df)):
    
    if main_df['position'][i] in [1, 2, 3]:
        main_df['label'][i] = 'Podium'
        
    elif main_df['position'][i] in [4, 5, 6, 7, 8, 9, 10]:
        main_df['label'][i] = 'Top_10'
    
    else:
        main_df['label'][i] = 'Outside_Top_10'
    
```

```{python}
main_df
```

Exporting the dataframe to a csv file:

```{python}
main_df.to_csv('../../data/01-modified-data/data_cleaned.csv', index=False)
```