EDA on Medical Data of Thrombosis Diagnosis

Data Summary

Collagen is a fibrous protein found in cartilage and other connective tissue. Collagen diseases are autoimmune diseases in which the body's immune system attacks its own skin, tissues, and organs. For example, if a patient produces antibodies against lung tissue, respiratory function is progressively lost, which can be fatal. The extent and causes of these diseases are only partially understood, which makes their classification a challenging task.
Thrombosis is an important and severe complication of these diseases and one of the major causes of death in collagen diseases. Physicians recently found that Thrombosis is closely related to anti-cardiolipin antibodies. The databases used in this project were donated by one of these physicians and come from a university hospital whose patients were referred for collagen diseases by their local physicians, home doctors, and other medical specialists.

Initial Questions

  1. Are some age bands more likely than others to be diagnosed with higher degrees of Thrombosis?
  2. Are females more likely than males to be diagnosed with Thrombosis, or is it the other way around?
  3. How can other medical tests be incorporated to improve the accuracy of the diagnosis?

Data Munging

Import Libraries

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import datetime
import snakecase
import re

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)

Import Data

Code
a = pd.read_csv("data/TSUMOTO_A.CSV")
b = pd.read_csv("data/TSUMOTO_B.CSV")
c = pd.read_csv("data/TSUMOTO_C.CSV")

Data Sample

Tsumoto_A - Basic information about patients (input by experts). This dataset includes all patients.

Code
a.head()
      ID SEX Birthday Description First Date Admission     Diagnosis
0   2110   F  2/13/34    94.02.14   93.02.10         +      RA susp.
1  11408   F   5/2/37    96.12.01   73.01.01         +           PSS
2  12052   F  4/14/56    91.08.13        NaN         +           SLE
3  14872   F  9/21/53    97.08.13        NaN         +          MCTD
4  27654   F  3/25/36         NaN   92.02.03         +  RA, SLE susp

Tsumoto_B - Special laboratory examinations (input by experts, measured by the Laboratory on Collagen Diseases). This dataset does not include all patients, only those who took these special tests.

Code
b.head()
         ID Examination Date  aCL IgG  aCL IgM  ANA ANA Pattern  aCL IgA          Diagnosis  KCT RVVT  LAC Symptoms  Thrombosis
0   14872.0          5/27/97      1.3      1.6  256           P      0.0          MCTD, AMI  NaN  NaN    -      AMI           1
1   48473.0         12/21/92      4.3      4.6  256         P,S      3.3                SLE    -    -    -      NaN           0
2  102490.0          4/20/95      2.3      2.5    0         NaN      3.5                PSS  NaN  NaN  NaN      NaN           0
3  108788.0           5/6/97      0.0      0.0   16           S      0.0                NaN  NaN  NaN    -      NaN           0
4  122405.0           4/2/98      0.0      4.0    4           P      0.0  SLE, SjS, vertigo  NaN  NaN  NaN      NaN           0

Tsumoto_C - Laboratory examinations stored in the hospital information system (stored from 1980 to 1999.3). The data consist of ordinary laboratory examinations, each with a timestamp.

Code
c.head()
ID Date GOT GPT LDH ALP TP ALB UA UN CRE T-BIL T-CHO TG CPK GLU WBC RBC HGB HCT PLT PT APTT FG PIC TAT TAT2 U-PRO IGG IGA IGM CRP RA RF C3 C4 RNP SM SC170 SSA SSB CENTROMEA DNA DNA-II Unnamed: 44 Unnamed: 45
0 2110 860419 24 12 152 63 7.5 4.5 3.4 16.0 0.6 0.4 192 NaN NaN NaN 5.9 4.69 9 31.6 380 NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2110 860430 25 12 162 76 7.9 4.6 4.7 18.0 0.6 0.4 187 76 31 NaN 6.9 4.73 9.3 31.8 323 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <0.002 - NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2110 860502 22 8 144 68 7 4.2 5 18.0 0.6 0.4 191 NaN 23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN - NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2110 860506 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2110 860507 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Data Cleaning

Tsumoto_A and Tsumoto_B Merge

Merge B with A on ID:

Code
b_new = b.dropna(subset=["ID"])  # DROP ROWS WITH NA IN ID COLUMN
merge_df = pd.merge(
    b_new, a[["ID", "SEX", "Birthday", "Diagnosis"]], on="ID", how="left"
)
merge_df.head()
ID Examination Date aCL IgG aCL IgM ANA ANA Pattern aCL IgA Diagnosis_x KCT RVVT LAC Symptoms Thrombosis SEX Birthday Diagnosis_y
0 14872.0 5/27/97 1.3 1.6 256 P 0.0 MCTD, AMI NaN NaN - AMI 1 F 9/21/53 MCTD
1 48473.0 12/21/92 4.3 4.6 256 P,S 3.3 SLE - - - NaN 0 F 10/7/48 SLE
2 102490.0 4/20/95 2.3 2.5 0 NaN 3.5 PSS NaN NaN NaN NaN 0 F 4/1/82 PSS
3 108788.0 5/6/97 0.0 0.0 16 S 0.0 NaN NaN NaN - NaN 0 F 3/15/42 SJS
4 122405.0 4/2/98 0.0 4.0 4 P 0.0 SLE, SjS, vertigo NaN NaN NaN NaN 0 F 5/22/61 SJS
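
Because this is a left merge, any examination row whose ID has no counterpart in Tsumoto_A keeps NaN for SEX, Birthday, and Diagnosis_y. The sketch below shows one way such rows could be counted with the `indicator=True` option of `pd.merge`. The frames `exams` and `patients` are made-up stand-ins for `b_new` and `a`, not the real data.

```python
import pandas as pd

# Toy stand-ins: float IDs on the examination side, integer IDs on the patient side.
exams = pd.DataFrame({"ID": [1.0, 2.0, 3.0], "Thrombosis": [0, 1, 0]})
patients = pd.DataFrame({"ID": [1, 3], "SEX": ["F", "M"]})

# indicator=True adds a "_merge" column that says where each row came from.
checked = pd.merge(exams, patients, on="ID", how="left", indicator=True)

# Rows marked "left_only" found no matching patient record.
unmatched = (checked["_merge"] == "left_only").sum()
print(unmatched)
```

Running the same check on the real frames would show how many examinations lack demographic information before any values are filled in.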

Missing Values

Code
print("The number of rows and columns in the new merged dataframe are:", merge_df.shape)
print("--------------------------------------------------------------------------")
print(
    "The number of missing values in the new merged dataframe are: \n",
    merge_df.isna().sum(),
)
The number of rows and columns in the new merged dataframe are: (770, 16)
--------------------------------------------------------------------------
The number of missing values in the new merged dataframe are: 
 ID                    0
Examination Date      7
aCL IgG               0
aCL IgM               0
ANA                  20
ANA Pattern         236
aCL IgA               0
Diagnosis_x         312
KCT                 625
RVVT                625
LAC                 549
Symptoms            695
Thrombosis            0
SEX                 354
Birthday            353
Diagnosis_y         353
dtype: int64
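
The counts above are easier to judge as shares of the 770 rows. The following is a minimal sketch with the counts copied from the printed output; the names `n_rows`, `missing_counts`, and `missing_share` are illustrative and not part of the original notebook.

```python
# Missing-value counts copied from the output above (770 rows in total).
n_rows = 770
missing_counts = {
    "ANA Pattern": 236,
    "Diagnosis_x": 312,
    "KCT": 625,
    "RVVT": 625,
    "LAC": 549,
    "Symptoms": 695,
    "SEX": 354,
}

# Share of rows missing each column, as a percentage rounded to one decimal.
missing_share = {col: round(100 * n / n_rows, 1) for col, n in missing_counts.items()}

# Print the columns from most to least sparse.
for col, share in sorted(missing_share.items(), key=lambda kv: -kv[1]):
    print(f"{col:<12} {share:5.1f}%")
```

KCT, RVVT, and LAC are each missing in more than 70% of the rows. Symptoms is sparser still, at roughly 90%.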

Handling missing values in the merged dataset:

Code
# MERGING THE 2 DIAGNOSIS COLUMNS INTO 1
def split_diagnosis(value):
    # TURN A COMMA-SEPARATED STRING INTO CLEANED, LOWERCASE TERMS (EMPTY LIST FOR NAN)
    if pd.isna(value):
        return []
    return [term.strip().lower() for term in value.split(",")]


def combine_diagnoses(diag_x, diag_y):
    # UNION OF BOTH SOURCES WITH DUPLICATES REMOVED; PLACEHOLDER WHEN BOTH ARE MISSING
    terms = list(set(split_diagnosis(diag_x) + split_diagnosis(diag_y)))
    return terms if terms else "No Diagnosis"


# CREATING A NEW COLUMN CALLED DIAGNOSIS AND FILLING IT WITH THE COMBINED DIAGNOSIS
merge_df["Diagnosis"] = pd.Series(
    [
        combine_diagnoses(x, y)
        for x, y in zip(merge_df["Diagnosis_x"], merge_df["Diagnosis_y"])
    ],
    index=merge_df.index,
    dtype=object,
)

# DROPPING THE ORIGINAL DIAGNOSIS COLUMNS
merge_df.drop(["Diagnosis_x", "Diagnosis_y"], axis=1, inplace=True)

# FILLING NAN VALUES IN THE BIRTHDAY COLUMN WITH A DATE OF 0/0/0
merge_df["Birthday"].fillna("0/0/0", inplace=True)
# DROPPING ROWS WITH NAN VALUES IN THE EXAMINATION DATE COLUMN
merge_df.dropna(subset=["Examination Date"], inplace=True)
merge_df.reset_index(drop=True, inplace=True)

# CREATING A NEW COLUMN CALLED AGE: EXAMINATION YEAR MINUS BIRTH YEAR
# (BOTH YEARS ARE TWO-DIGIT, SO THIS ASSUMES ALL DATES FALL IN THE 1900s)
merge_df["Age"] = [
    "Not Available"
    if birthday == "0/0/0"
    else int(exam_date.split("/")[2]) - int(birthday.split("/")[2])
    for exam_date, birthday in zip(merge_df["Examination Date"], merge_df["Birthday"])
]

# DROPPING THE KCT, RVVT, AND LAC COLUMNS SINCE THEY HAVE MORE THAN 70% MISSING VALUES
merge_df.drop(["KCT", "RVVT", "LAC"], axis=1, inplace=True)
# FILLING MISSING VALUES IN THE SYMPTOMS COLUMN WITH "None"
merge_df["Symptoms"].fillna("None", inplace=True)
merge_df.reset_index(drop=True, inplace=True)
# FILLING MISSING VALUES IN THE ANA COLUMN WITH "0"
merge_df["ANA"].fillna("0", inplace=True)

# FILLING THE VALUES IN THE ANA PATTERN COLUMN WITH "None" IF THE ANA COLUMN IS "0"
merge_df.loc[merge_df["ANA"] == "0", "ANA Pattern"] = "None"

# FILLING MISSING VALUES IN THE ANA PATTERN COLUMN WITH "Not Available"
merge_df["ANA Pattern"].fillna("Not Available", inplace=True)
# FILLING MISSING VALUES IN THE SEX COLUMN WITH "Not Available"
merge_df["SEX"].fillna("Not Available", inplace=True)
# DROPPING ROWS WITH MISSING VALUES IN THE ENTIRE DATAFRAME
merge_df.dropna(inplace=True)
merge_df.reset_index(drop=True, inplace=True)

print(
    "The number of missing values in the new merged dataframe are: \n",
    merge_df.isna().sum(),
)
The number of missing values in the new merged dataframe are: 
 ID                  0
Examination Date    0
aCL IgG             0
aCL IgM             0
ANA                 0
ANA Pattern         0
aCL IgA             0
Symptoms            0
Thrombosis          0
SEX                 0
Birthday            0
Diagnosis           0
Age                 0
dtype: int64

Final dataframe sample after handling missing values:

Code
merge_df.head()
ID Examination Date aCL IgG aCL IgM ANA ANA Pattern aCL IgA Symptoms Thrombosis SEX Birthday Diagnosis Age
0 14872.0 5/27/97 1.3 1.6 256 P 0.0 AMI 1 F 9/21/53 [mctd, ami] 44
1 48473.0 12/21/92 4.3 4.6 256 P,S 3.3 None 0 F 10/7/48 [sle] 44
2 102490.0 4/20/95 2.3 2.5 0 None 3.5 None 0 F 4/1/82 [pss] 13
3 108788.0 5/6/97 0.0 0.0 16 S 0.0 None 0 F 3/15/42 [sjs] 55
4 122405.0 4/2/98 0.0 4.0 4 P 0.0 None 0 F 5/22/61 [vertigo, sjs, sle] 37

Converting the Examination Date and Date of the Test to Date Time Format:

Code
c["Date"] = pd.to_datetime(c["Date"], format="%y%m%d")
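# "%x" IS LOCALE-DEPENDENT, SO THE M/D/YY LAYOUT OF THE EXAMINATION DATES IS SPELLED OUT EXPLICITLY
merge_df["Examination Date"] = pd.to_datetime(
    merge_df["Examination Date"], format="%m/%d/%y"
)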
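
Two-digit years need care when dates like these are parsed with an explicit format such as `%m/%d/%y`. The sketch below uses only the standard-library `datetime` module, where `%y` maps 69-99 to 1969-1999 and 00-68 to 2000-2068. That pandas applies the same pivot is an assumption here, not something shown in this report. The two date strings are taken from the samples above.

```python
from datetime import datetime

# An examination date from the 1990s parses to the intended century.
exam = datetime.strptime("5/27/97", "%m/%d/%y")

# A birthday from the 1930s lands in the wrong century under the same rule.
birthday = datetime.strptime("2/13/34", "%m/%d/%y")

print(exam.year)      # 1997
print(birthday.year)  # 2034
```

The examination dates fall in the 1990s and are therefore unaffected. The Birthday column would need an explicit century correction if it were ever converted to datetime instead of being handled as a string.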

Tsumoto_C Merge

Merging the dataframe of Thrombosis examinations (B) and patient demographic information (A) with the hospital laboratory records of every patient (C):

Code
df = pd.merge(merge_df[["ID", "Examination Date"]], c, on=["ID"], how="left")
df.head()
ID Examination Date Date GOT GPT LDH ALP TP ALB UA UN CRE T-BIL T-CHO TG CPK GLU WBC RBC HGB HCT PLT PT APTT FG PIC TAT TAT2 U-PRO IGG IGA IGM CRP RA RF C3 C4 RNP SM SC170 SSA SSB CENTROMEA DNA DNA-II Unnamed: 44 Unnamed: 45
0 14872.0 1997-05-27 1981-02-17 22 30 179 41 6.6 4.1 3.5 13.0 0.9 0.3 193 79 NaN NaN 10 4.44 12.5 38.2 256 NaN NaN NaN NaN NaN NaN 0 NaN NaN NaN - - NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 14872.0 1997-05-27 1981-03-16 22 19 155 37 7.2 4.5 3.2 12.0 0.8 0.3 185 NaN 13 NaN 7.2 4.5 11.8 37.7 199 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN - NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 14872.0 1997-05-27 1981-04-06 21 20 143 42 7.5 4.7 4.1 12.0 0.8 0.3 199 NaN 11 NaN 5.6 4.76 12.8 39.4 199 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN - NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 14872.0 1997-05-27 1981-04-27 32 20 218 43 7.2 4.5 3.5 9.0 0.9 0.3 185 NaN 22 NaN 3.3 4.76 12.3 39.4 183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN - NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 14872.0 1997-05-27 1981-05-01 33 19 277 38 6.9 4.4 3.3 11.0 0.8 0.3 184 NaN 17 NaN 5.8 4.43 11.7 36 215 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN - NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Missing Values

Dropping Rows and Columns according to Missing Values in the Merged Dataset:

Code
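# DROPPING THE EMPTY "Unnamed" COLUMNS CREATED WHILE IMPORTING THE DATA, ALONG WITH CRP
df.drop(["Unnamed: 44", "Unnamed: 45", "CRP"], axis=1, inplace=True)
# DROPPING THE COLUMNS AND ROWS WITH MORE THAN 70% MISSING VALUES
# (thresh IS THE MINIMUM NUMBER OF NON-MISSING VALUES NEEDED TO KEEP A COLUMN OR ROW)
df.dropna(thresh=len(df) * 0.3, axis=1, inplace=True)
df.dropna(thresh=len(df.columns) * 0.3, inplace=True)
# drop=True AVOIDS ADDING THE OLD INDEX AS AN EXTRA "index" COLUMN
df.reset_index(drop=True, inplace=True)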

Feature Extraction - Tagging the hospital records of Patient on the basis of the Thrombosis Examination:

Code
# "Before" IF THE LAB TEST WAS TAKEN ON OR BEFORE THE THROMBOSIS EXAMINATION DATE, ELSE "After"
days_until_exam = (df["Examination Date"] - df["Date"]).dt.days
df["B/A_Tag"] = np.where(days_until_exam >= 0, "Before", "After")

Converting all the values in the dataframe to float (from string):

Code
col_conv = [
    "GOT",
    "GPT",
    "LDH",
    "ALP",
    "TP",
    "ALB",
    "UA",
    "UN",
    "CRE",
    "T-BIL",
    "T-CHO",
    "TG",
    "WBC",
    "RBC",
    "HGB",
    "HCT",
    "PLT",
    "U-PRO",
    "C3",
    "C4",
]
for col in col_conv:
    if df[col].dtype == "object":
        for i in range(len(df)):
            if df[col].isna()[i] == False:
                if type(df[col][i]) == str:
                    df[col][i] = re.sub(r"[^\d.]", "", df[col][i])
                    if df[col][i] == "":
                        df[col][i] = 0
                    else:
                        df[col][i] = float(df[col][i])
                else:
                    pass
            else:
                pass
    else:
        pass

# CREATING A COPY OF THE DATAFRAME TO AVOID CHANGING THE ORIGINAL DATAFRAME
df1 = df.copy()

# CONVERTING THE COLUMNS OF THE COPY TO FLOAT (df2 BELOW IS BUILT FROM df1, NOT df)
for col in col_conv:
    df1[col] = pd.to_numeric(df1[col], errors="coerce").astype(float)

# FILLING MISSING VALUES IN THE COLUMNS WITH 0
# (NOTE: THESE ZEROS ARE INCLUDED IN THE GROUP MEANS COMPUTED BELOW)
df1.fillna(0, inplace=True)

# CREATING A NEW DATAFRAME WITH ONLY THE COLUMNS NEEDED FOR THE GROUPING
df2 = df1[
    [
        "ID",
        "GOT",
        "GPT",
        "LDH",
        "ALP",
        "TP",
        "ALB",
        "UA",
        "UN",
        "CRE",
        "T-BIL",
        "T-CHO",
        "TG",
        "WBC",
        "RBC",
        "HGB",
        "HCT",
        "PLT",
        "U-PRO",
        "C3",
        "C4",
        "B/A_Tag",
    ]
]

Grouping the dataframe by ID and Before/After tag and taking the mean of each test's values:

Code
df3 = df2.groupby(["ID", "B/A_Tag"], as_index=False, dropna=True).mean()
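
Since missing values were filled with 0 before grouping, a test that was not measured on a given date still enters the mean as a zero. The toy frame `labs` below is made up to illustrate the difference and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# One patient, three lab dates, GOT not measured on the second date.
labs = pd.DataFrame(
    {
        "ID": [1, 1, 1],
        "B/A_Tag": ["Before", "Before", "Before"],
        "GOT": [20.0, np.nan, 40.0],
    }
)

# mean() skips NaN by default, so only the two measured values are averaged.
mean_skipping_nan = labs.groupby(["ID", "B/A_Tag"], as_index=False).mean()

# After fillna(0) the unmeasured date counts as a real zero.
mean_with_zeros = labs.fillna(0).groupby(["ID", "B/A_Tag"], as_index=False).mean()

print(mean_skipping_nan["GOT"].iloc[0])  # 30.0
print(mean_with_zeros["GOT"].iloc[0])    # 20.0
```

This is worth keeping in mind when reading the per-patient means and the zero minimums in the summary statistics later on.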

Final Merge

Merging the above dataset with our First Merge Dataset to get the final Data:

Code
final_df = pd.merge(df3, merge_df, on=["ID"], how="left")
final_df.sort_values(by=["ID"], inplace=True)
# CONVERTING THE COLUMN NAMES TO SNAKECASE
final_df.columns = final_df.columns.map(snakecase.convert)

Final Data Cleaning:

Code
# CREATING AGE BANDS FOR THE AGE COLUMN
def to_age_band(age):
    # KEEP THE PLACEHOLDER FOR PATIENTS WITHOUT A BIRTHDAY
    if age == "Not Available":
        return age
    if 0 <= age <= 18:
        return "0-18"
    if 19 <= age <= 30:
        return "19-30"
    if 31 <= age <= 45:
        return "31-45"
    if 46 <= age <= 60:
        return "46-60"
    return "61+"


final_df["age"] = final_df["age"].map(to_age_band)

# CONVERTING ANA COLUMN TO NUMERIC (STRIPPING ANY NON-NUMERIC CHARACTERS FIRST)
final_df["ana"] = final_df["ana"].str.replace(r"[^\d.]", "", regex=True)
final_df["ana"] = pd.to_numeric(final_df["ana"], errors="coerce").astype(float)

# CREATING A THROMBOSIS DIAGNOSIS COLUMN
final_df["thrombosis_diagnosis"] = final_df["thrombosis"].apply(
    lambda x: 1 if x > 0 else 0
)

# DROPPING UNNECESSARY COLUMNS
final_df.drop(["birthday", "examination _date"], axis=1, inplace=True)

# SAVING THE FINAL DATAFRAME TO A CSV FILE
final_df.to_csv("data/FINAL_DATA.csv", index=False)
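
As an aside, the age banding can also be written with `pd.cut`. This is only a sketch of an alternative, not the method used in this report. The series `ages` is made up, and the "Not Available" placeholder would have to be handled separately, for example by converting it to NaN first.

```python
import numpy as np
import pandas as pd

# Made-up ages that sit on both sides of every band boundary.
ages = pd.Series([5, 18, 19, 30, 31, 45, 46, 60, 61, 90])

# Intervals are right-closed: (-1, 18], (18, 30], (30, 45], (45, 60], (60, inf).
bands = pd.cut(
    ages,
    bins=[-1, 18, 30, 45, 60, np.inf],
    labels=["0-18", "19-30", "31-45", "46-60", "61+"],
)

print(bands.tolist())
```

One difference from the row-wise version: an age below the first bin edge becomes NaN here instead of falling through to "61+".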

Final Data Sample:

Code
final_df.head()
id b/a__tag got gpt ldh alp tp alb ua un cre t-bil t-cho tg wbc rbc hgb hct plt u-pro c3 c4 a_cl _ig_g a_cl _ig_m ana ana _pattern a_cl _ig_a symptoms thrombosis sex diagnosis age thrombosis_diagnosis
0 14872.0 Before 20.872340 15.829787 119.255319 42.085106 5.378723 3.285106 2.889362 8.319149 0.568085 0.259574 135.829787 15.234043 6.855319 3.720851 10.578723 32.304255 144.680851 0.000000 0.000000 0.00 1.3 1.6 256.0 P 0.0 AMI 1 F [mctd, ami] 31-45 1
1 48473.0 Before 38.036364 45.745455 111.818182 83.763636 6.103636 3.598182 1.765455 11.181818 0.667273 0.500000 126.290909 23.490909 7.478182 4.728727 10.689091 33.834545 111.236364 0.163636 0.000000 0.00 4.3 4.6 256.0 P,S 3.3 None 0 F [sle] 31-45 0
2 102490.0 Before 23.885714 16.685714 134.628571 117.942857 7.271429 4.097143 4.694286 12.400000 0.814286 0.557143 147.200000 40.457143 6.594286 3.335429 10.271429 31.265714 232.714286 0.028571 0.000000 0.00 2.3 2.5 0.0 None 3.5 None 0 F [pss] 0-18 0
3 108788.0 Before 13.256410 8.717949 68.410256 38.897436 4.620513 2.443590 2.687179 7.384615 0.458974 0.394872 86.230769 5.820513 3.269231 3.594103 10.894872 33.502564 175.025641 0.076923 0.000000 0.00 0.0 0.0 16.0 S 0.0 None 0 F [sjs] 46-60 0
4 122405.0 Before 30.833333 22.950000 126.766667 69.966667 7.626667 3.616667 4.045000 11.800000 0.708333 0.391667 184.283333 8.100000 2.981667 3.795167 10.638333 33.691667 217.616667 0.000000 32.233333 6.45 0.0 4.0 4.0 P 0.0 None 0 F [vertigo, sjs, sle] 31-45 0

Exploratory Analysis

Read in the Data from Local Machine:

Code
final_df = pd.read_csv("data/FINAL_DATA.csv")
final_df.head()
id b/a__tag got gpt ldh alp tp alb ua un cre t-bil t-cho tg wbc rbc hgb hct plt u-pro c3 c4 a_cl _ig_g a_cl _ig_m ana ana _pattern a_cl _ig_a symptoms thrombosis sex diagnosis age thrombosis_diagnosis
0 14872.0 Before 20.872340 15.829787 119.255319 42.085106 5.378723 3.285106 2.889362 8.319149 0.568085 0.259574 135.829787 15.234043 6.855319 3.720851 10.578723 32.304255 144.680851 0.000000 0.000000 0.00 1.3 1.6 256.0 P 0.0 AMI 1 F ['mctd', 'ami'] 31-45 1
1 48473.0 Before 38.036364 45.745455 111.818182 83.763636 6.103636 3.598182 1.765455 11.181818 0.667273 0.500000 126.290909 23.490909 7.478182 4.728727 10.689091 33.834545 111.236364 0.163636 0.000000 0.00 4.3 4.6 256.0 P,S 3.3 None 0 F ['sle'] 31-45 0
2 102490.0 Before 23.885714 16.685714 134.628571 117.942857 7.271429 4.097143 4.694286 12.400000 0.814286 0.557143 147.200000 40.457143 6.594286 3.335429 10.271429 31.265714 232.714286 0.028571 0.000000 0.00 2.3 2.5 0.0 None 3.5 None 0 F ['pss'] 0-18 0
3 108788.0 Before 13.256410 8.717949 68.410256 38.897436 4.620513 2.443590 2.687179 7.384615 0.458974 0.394872 86.230769 5.820513 3.269231 3.594103 10.894872 33.502564 175.025641 0.076923 0.000000 0.00 0.0 0.0 16.0 S 0.0 None 0 F ['sjs'] 46-60 0
4 122405.0 Before 30.833333 22.950000 126.766667 69.966667 7.626667 3.616667 4.045000 11.800000 0.708333 0.391667 184.283333 8.100000 2.981667 3.795167 10.638333 33.691667 217.616667 0.000000 32.233333 6.45 0.0 4.0 4.0 P 0.0 None 0 F ['vertigo', 'sjs', 'sle'] 31-45 0
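
After the round trip through CSV, the diagnosis column holds strings such as "['mctd', 'ami']" rather than Python lists, as the quotes in the sample above show. If list values are needed again, one option is `ast.literal_eval` from the standard library. The helper name `parse_diagnosis` is hypothetical.

```python
import ast


def parse_diagnosis(value):
    """Turn "['mctd', 'ami']" back into a list; leave other values unchanged."""
    if isinstance(value, str) and value.startswith("["):
        return ast.literal_eval(value)
    return value


print(parse_diagnosis("['mctd', 'ami']"))  # ['mctd', 'ami']
print(parse_diagnosis("No Diagnosis"))     # No Diagnosis
```

It could be applied with `final_df["diagnosis"].map(parse_diagnosis)`. The "No Diagnosis" placeholder passes through unchanged.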

Setting the default style of the plots:

Code
sns.set_style("white")
sns.set_palette("Accent")

Preliminary Analysis

Information about the dataset features:

Code
final_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 682 entries, 0 to 681
Data columns (total 33 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    682 non-null    float64
 1   b/a__tag              682 non-null    object 
 2   got                   682 non-null    float64
 3   gpt                   682 non-null    float64
 4   ldh                   682 non-null    float64
 5   alp                   682 non-null    float64
 6   tp                    682 non-null    float64
 7   alb                   682 non-null    float64
 8   ua                    682 non-null    float64
 9   un                    682 non-null    float64
 10  cre                   682 non-null    float64
 11  t-bil                 682 non-null    float64
 12  t-cho                 682 non-null    float64
 13  tg                    682 non-null    float64
 14  wbc                   682 non-null    float64
 15  rbc                   682 non-null    float64
 16  hgb                   682 non-null    float64
 17  hct                   682 non-null    float64
 18  plt                   682 non-null    float64
 19  u-pro                 682 non-null    float64
 20  c3                    682 non-null    float64
 21  c4                    682 non-null    float64
 22  a_cl _ig_g            682 non-null    float64
 23  a_cl _ig_m            682 non-null    float64
 24  ana                   682 non-null    float64
 25  ana _pattern          682 non-null    object 
 26  a_cl _ig_a            682 non-null    float64
 27  symptoms              682 non-null    object 
 28  thrombosis            682 non-null    int64  
 29  sex                   682 non-null    object 
 30  diagnosis             682 non-null    object 
 31  age                   682 non-null    object 
 32  thrombosis_diagnosis  682 non-null    int64  
dtypes: float64(25), int64(2), object(6)
memory usage: 176.0+ KB

Statistical Analysis of the Dataset:

Code
final_df.describe()
id got gpt ldh alp tp alb ua un cre t-bil t-cho tg wbc rbc hgb hct plt u-pro c3 c4 a_cl _ig_g a_cl _ig_m ana a_cl _ig_a thrombosis thrombosis_diagnosis
count 6.820000e+02 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000 682.000000
mean 4.701481e+06 25.554588 25.432193 348.515716 128.068644 6.288748 3.487965 3.632142 11.902266 0.545950 0.420021 148.807125 80.852254 6.463791 4.089160 12.100520 36.489660 224.730387 8.664434 49.893716 15.610479 16.571584 280.929619 478.392962 77.456452 0.199413 0.139296
std 1.369721e+06 29.433264 27.838277 204.129160 85.133350 1.476372 0.909155 1.331038 5.051944 0.237703 0.420512 57.803023 65.466995 2.810230 0.496685 1.524636 4.370481 85.121063 50.517087 26.652478 9.944405 109.647864 7165.056400 1095.472267 1858.855122 0.545815 0.346509
min 1.487200e+04 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.942857 2.090000 6.100000 18.150000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 4.537218e+06 15.255682 12.000000 254.836957 83.458766 5.696410 3.166250 2.807353 9.054924 0.425875 0.263356 116.139161 39.856456 4.483442 3.811571 11.300000 34.076639 173.307018 0.000000 33.000000 8.498464 0.000000 1.500000 4.000000 0.000000 0.000000 0.000000
50% 5.317729e+06 19.000000 17.000000 327.722222 113.061224 6.630128 3.693990 3.600000 11.313333 0.520000 0.355495 153.000000 70.200521 5.812500 4.124375 12.176562 36.732796 215.785714 0.000000 51.873016 15.000000 1.000000 2.200000 16.000000 1.800000 0.000000 0.000000
75% 5.555816e+06 25.232955 27.887821 409.270534 153.805978 7.200000 4.100000 4.323295 14.000000 0.618566 0.500000 187.967742 108.187500 7.778217 4.395000 13.189904 39.483462 263.503906 1.428571 67.080357 21.245192 2.000000 4.000000 256.000000 6.300000 0.000000 0.000000
max 5.779550e+06 422.500000 320.000000 3159.000000 940.500000 9.700000 5.000000 9.706218 60.264249 3.543161 8.635294 433.000000 569.000000 26.866667 5.687143 16.900000 49.685714 685.000000 1000.000000 154.500000 71.000000 1502.400000 187122.000000 4096.000000 48547.000000 3.000000 1.000000

Visual Analysis

Distribution of the Age of the Patients:

Code
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A COUNT PLOT
sns.countplot(
    x="age", data=final_df, ax=ax, order=["0-18", "19-30", "31-45", "46-60", "61+"]
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_title("Age Distribution of the Patients", fontsize=15)
ax.set_xlabel("Age Bands", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)

# SETTING THE GRID LINES AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
_ = ax.bar_label(ax.containers[0])

plt.show()

The largest group of patients tested for Thrombosis was in the 19-30 age band. This does not mean that this age group is more likely than others to be diagnosed with Thrombosis; it only shows that 19-30 is the most common age group in the dataset and was tested for Thrombosis most often.

The Distribution of the number of people who were diagnosed with Thrombosis along with those who were not:

Code
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))

# CREATING A BAR PLOT
sns.barplot(
    x=final_df["thrombosis_diagnosis"].value_counts().index,
    y=final_df["thrombosis_diagnosis"].value_counts().values,
    ax=ax,
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_title("Distribution of Positive and Negative Thrombosis Patients", fontsize=15)
ax.set_xlabel("Diagnosis", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
_ = ax.set_xticklabels(["Negative", "Positive"])

# SETTING THE GRID LINES AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
_ = ax.bar_label(ax.containers[0])

plt.show()

Only about 14% of the records in the final dataset (95 of 682) have a positive Thrombosis diagnosis.
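
The positive share can be cross-checked against the `describe()` output above, which reports 682 rows and a mean of 0.139296 for `thrombosis_diagnosis`. The variable names below are illustrative. Each row is one patient and Before/After tag combination, not necessarily one unique patient.

```python
# Figures taken from final_df.describe(): 682 rows, mean of thrombosis_diagnosis = 0.139296.
n_rows = 682
positive_mean = 0.139296

# The mean of a 0/1 column times the row count gives the number of positives.
n_positive = round(n_rows * positive_mean)
n_negative = n_rows - n_positive

print(n_positive, n_negative)                   # 95 587
print(round(100 * n_positive / n_rows, 1))      # 13.9  (share of all records)
print(round(100 * n_positive / n_negative, 1))  # 16.2  (ratio of positives to negatives)
```

The share of all records and the ratio of positives to negatives are different quantities, so it helps to state which one is being quoted.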

Outlier Detection of all the features present in the Dataset:

Code
# DIVIDING THE COLUMNS INTO GROUPS
cols1 = ["got", "gpt", "ldh", "alp", "tp", "alb"]
cols2 = ["ua", "un", "cre", "t-bil", "t-cho", "tg", "u-pro"]
cols3 = ["wbc", "rbc", "hgb", "hct", "plt", "c3", "c4"]
cols4 = ["a_cl _ig_g", "a_cl _ig_m", "ana", "a_cl _ig_a"]

# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(2, 2, figsize=(10, 10))

# CREATING A BOXPLOT FOR EACH GROUP
sns.boxplot(data=final_df[cols1], ax=ax[0, 0])
sns.boxplot(data=final_df[cols2], ax=ax[0, 1])
sns.boxplot(data=final_df[cols3], ax=ax[1, 0])
sns.boxplot(data=final_df[cols4], ax=ax[1, 1])

# SETTING THE TITLE
plt.suptitle("Outlier Detection of the Features", fontsize=16, y=0.94)
ax[0, 0].set_title("Blood Chemistry", fontsize=14)
ax[0, 1].set_title("Urinalysis", fontsize=14)
ax[1, 0].set_title("Complete Blood Count", fontsize=14)
ax[1, 1].set_title("Immunology", fontsize=14)

# SETTING THE TICKSIZE AND GRID LINES
for i in range(2):
    for j in range(2):
        ax[i, j].tick_params(labelsize=11)
        ax[i, j].grid(True, axis="y", linestyle=":", linewidth=1)

plt.show()

Some features, such as “u-pro” (Proteinuria), “ldh” (Lactate Dehydrogenase) and “plt” (Platelet), have many outliers, while features such as “acl_igA” and “acl_igM” have fewer outliers, but those outliers are extremely large values that need to be dealt with.
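
To put numbers on what the boxplots show, outliers can be counted with the 1.5 × IQR rule, which is the default whisker rule for these boxplots. The helper `count_iqr_outliers` and the series `toy` are hypothetical and not part of the original analysis.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series) -> int:
    """Count points lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())


# Made-up values with a single extreme point.
toy = pd.Series([10, 12, 11, 13, 12, 11, 300])
print(count_iqr_outliers(toy))  # 1
```

On the real data this could be applied per column, for example `final_df[cols1].apply(count_iqr_outliers)`.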

Distribution of Diagnosis of Thrombosis with respect to the gender of the patient:

Code
# CREATING A NEW DATAFRAME FOR GENDER PERCENT DISTRIBUTION
x, y = "sex", "thrombosis_diagnosis"
new = final_df[final_df[x] != "Not Available"]
new = new.groupby(x)[y].value_counts(normalize=True)
new = new.mul(100)
new = new.rename("Percent").reset_index()
new["thrombosis_diagnosis"] = new["thrombosis_diagnosis"].apply(
    lambda x: "Thrombosis" if x > 0 else "No Thrombosis"
)

# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))

# CREATING A BAR PLOT
sns.barplot(x="sex", y="Percent", hue="thrombosis_diagnosis", data=new, ax=ax)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_ylim(0, 100)
ax.set_title(
    "Thrombosis Diagnosis Distribution w.r.t Gender of the Patient", fontsize=15
)
ax.set_xlabel("Gender", fontsize=13)
ax.set_ylabel("Percent", fontsize=13)
ax.tick_params(labelsize=11)
ax.set_xticklabels(["Female", "Male"])

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
ax.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
for p in ax.patches:
    txt = str(p.get_height().round(2)) + "%"
    txt_x = p.get_x()
    txt_y = p.get_height()
    ax.text(txt_x + 0.1, txt_y, txt, va="bottom", fontsize=11)

plt.savefig("plot_outputs/plot-01.png")
plt.show()

The graph above suggests that females in this dataset are more likely than males to test positive for Thrombosis. However, these records come from a single hospital, so the observation cannot be generalized. Generalizing would require more information about the demographics of the area where the university hospital is located and the extent to which those factors affected the patients.
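
Percentages alone hide how many patients are in each group, so it can help to look at the raw counts next to them. The sketch below uses `pd.crosstab` on a made-up frame `toy`; the numbers are not from the dataset.

```python
import pandas as pd

# Made-up example: four female and two male records.
toy = pd.DataFrame(
    {
        "sex": ["F", "F", "F", "F", "M", "M"],
        "thrombosis_diagnosis": [1, 0, 0, 0, 0, 0],
    }
)

# Raw counts per sex and diagnosis.
counts = pd.crosstab(toy["sex"], toy["thrombosis_diagnosis"])

# Row-wise percentages: each sex sums to 100.
percent = pd.crosstab(toy["sex"], toy["thrombosis_diagnosis"], normalize="index") * 100

print(counts)
print(percent.round(1))
```

A small group can produce an extreme percentage from very few records, which is one more reason to be cautious about generalizing.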

Distribution of the severity of Thrombosis among different Age groups:

Code
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))

# CREATING A COUNT PLOT (POSITIVE CASES ONLY)
positive_df = final_df[final_df["thrombosis"] > 0]
sns.countplot(
    data=positive_df,
    x="thrombosis",
    hue="age",
    hue_order=["0-18", "19-30", "31-45", "46-60", "61+"],
    ax=ax,
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Distribution of Degrees of Thrombosis w.r.t Age Bands", fontsize=15)
ax.set_title(
    "1 = Positive | 2 = Positive and very severe | 3 = Positive and extremely severe",
    fontsize=11,
)
ax.set_xlabel("Thrombosis", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
ax.set_yticks(np.arange(0, 21, 5))

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
for container in ax.containers:
    _ = ax.bar_label(container)
ax.legend(title="Age Bands", title_fontsize=11, fontsize=10)

plt.savefig("plot_outputs/plot-02.png")
plt.show()

Mild Thrombosis has many more patients in every age band than the other degrees of Thrombosis, and patients aged 19 to 45 account for most of those diagnoses. The pattern changes for Severe Thrombosis, where patients who are not yet adults (0-18 years) are diagnosed most often. Extremely Severe Thrombosis appears in the 31-60 age group. In this data, people aged 61 and above are the least likely to be diagnosed with Thrombosis. Because the plot shows counts, these comparisons also reflect how many patients each age band contributes to the dataset.

Predictors that are correlated to Thrombosis (target variable):

Code
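# DROPPING THE ID AND THE DERIVED BINARY TARGET BEFORE COMPUTING CORRELATIONS
corr_df = final_df.drop(["id", "thrombosis_diagnosis"], axis=1)

# CREATING A CORRELATION MATRIX FROM THE NUMERIC COLUMNS ONLY
# (numeric_only=True SKIPS TEXT COLUMNS SUCH AS sex, age AND diagnosis)
corr = corr_df.corr(numeric_only=True)
corr["thrombosis"].sort_values(ascending=False)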

# SETTING THE FIGURE SIZE
plt.figure(figsize=(8, 5))

# CREATING A BAR PLOT BASED ON THE CORRELATION COEFFICIENTS
corr["thrombosis"].sort_values(ascending=False).plot(kind="bar")

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.title("Correlation between Thrombosis and other Variables", fontsize=15)
plt.grid(True, axis="y", linestyle=":", linewidth=1)
plt.ylabel("Correlation Coefficient", fontsize=13)
plt.xlabel("Variables", fontsize=13)
plt.tick_params(labelsize=11)

plt.savefig("plot_outputs/plot-03.png")
plt.show()

Features that are highly correlated with Thrombosis (target variable) are:
- acl_igG (Anticardiolipin Antibody IgG)
- ANA (Antinuclear Antibody)
- U-Pro (Proteinuria)
- GPT (ALT, glutamic pyruvic transaminase)
- GOT (AST, glutamic oxaloacetic transaminase)
- C3 (Complement 3)
- C4 (Complement 4)
- RBC (Red Blood Cells)
- HCT (Hematocrit)
- PLT (Platelet)
- HGB (Hemoglobin)
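
A bar plot sorted by signed correlation puts strong negative correlations at the far end, so ranking by absolute value can make the strongest relationships easier to spot. The sketch below uses a made-up frame `toy` with hypothetical columns `marker_up` and `marker_down`; it does not reproduce the report's results.

```python
import pandas as pd

# Made-up data: one marker rises with thrombosis, the other falls.
toy = pd.DataFrame(
    {
        "thrombosis": [0, 0, 1, 2, 3],
        "marker_up": [1.0, 2.0, 3.0, 4.0, 5.0],
        "marker_down": [9.0, 8.0, 6.0, 3.0, 1.0],
        "label": ["a", "b", "c", "d", "e"],
    }
)

# Correlation of every numeric column with the target, excluding the target itself.
corr = toy.corr(numeric_only=True)["thrombosis"].drop("thrombosis")

# Order by strength (absolute value) while keeping the sign visible.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

In this toy example the negatively correlated marker ranks first because its absolute correlation is larger.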

Final Plots

Comparing Different Medical Tests with Tests specific to Thrombosis (Anti-Cardiolipin Antibody (IgG)):
Since Anti-Cardiolipin Antibody (IgG) is the correlated feature to Thrombosis, we will compare it with other medical tests to see if they can be used to improve the accuracy of the diagnosis.

Anti-Cardiolipin Antibody (IgG) V/S ALT glutamic pyruvic transaminase (GPT):

Code
# CREATING NEW DATAFRAMES FOR COMPARISON BETWEEN MEDICAL TESTS AND THROMBOSIS SPECIFIC TESTS
# .copy() GIVES INDEPENDENT DATAFRAMES SO THE COLUMN ASSIGNMENTS BELOW DO NOT TOUCH final_df
new_df = final_df[final_df["a_cl _ig_g"] < 100].copy()
new_df["thrombosis_diagnosis"] = new_df["thrombosis_diagnosis"].apply(
    lambda x: "Thrombosis" if x > 0 else "No Thrombosis"
)
# KEEPING ONLY THE PATIENTS DIAGNOSED WITH THROMBOSIS
new_df1 = final_df[
    (final_df["a_cl _ig_g"] < 100) & (final_df["thrombosis_diagnosis"] > 0)
].copy()
new_df1["thrombosis"] = new_df1["thrombosis"].apply(
    lambda x: "Severe" if x > 1 else "Mild"
)

# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="gpt",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="gpt",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle(
    "Anti-Cardiolipin Antibody V/S ALT glutamic pyruvic transaminase", fontsize=15
)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=12)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "ALT glutamic pyruvic transaminase (GPT)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-04.png")
plt.show()

ALT glutamic pyruvic transaminase (GPT) has a normal range of <60. Many patients who were diagnosed with Thrombosis had a GPT below 60, and only a few patients with GPT outside the normal range were also diagnosed with Thrombosis. Looking at the degrees of Thrombosis, most diagnosed patients have a normal GPT. The one patient with a GPT above 300 has a severe case of Thrombosis, which is notable. This suggests that GPT is not a useful test for diagnosing Mild Thrombosis, but the relation between abnormally high GPT and the severity of Thrombosis could be studied further with a larger volume of data.
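
Reading shares off a scatter plot is approximate. A cross-tabulation can put numbers on how many patients fall inside the normal range for each diagnosis. The sketch below is illustrative only: the values are made up, and only the column names `gpt` and `thrombosis_diagnosis` and the <60 limit are taken from the text above.

```python
import pandas as pd

# Toy data (hypothetical values) using the column names from the report
toy = pd.DataFrame(
    {
        "gpt": [12, 25, 48, 75, 310, 30, 55, 90],
        "thrombosis_diagnosis": [1, 1, 1, 1, 1, 0, 0, 0],
    }
)

GPT_UPPER_LIMIT = 60  # normal range quoted in the text: GPT < 60

# Label each record as inside or outside the normal range
toy["gpt_range"] = (toy["gpt"] < GPT_UPPER_LIMIT).map(
    {True: "normal", False: "abnormal"}
)

# Row-normalised cross-tabulation: share of normal/abnormal GPT per diagnosis
share = pd.crosstab(toy["thrombosis_diagnosis"], toy["gpt_range"], normalize="index")
print(share)
```

The same pattern would apply to the other tests by swapping the column name and the limit.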

Anti-Cardiolipin Antibody (IgG) V/S AST glutamic oxaloacetic transaminase (GOT):

Code
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="got",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="got",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle(
    "Anti-Cardiolipin Antibody V/S AST glutamic oxaloacetic transaminase", fontsize=15
)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(
    0.5, 0.02, "AST glutamic oxaloacetic transaminase (GOT)", ha="center", fontsize=13
)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-05.png")
plt.show()

AST glutamic oxaloacetic transaminase (GOT) has a normal range of <60 (similar to GPT). Many patients who were diagnosed with Thrombosis had a GOT below 60, and some patients with GOT outside the normal range were also diagnosed with Thrombosis. Looking at the degrees of Thrombosis, most diagnosed patients have a normal GOT. The patients with a GOT above 100 have a severe case of Thrombosis, which is notable. This suggests that GOT is not a useful test for diagnosing Mild Thrombosis but may help when diagnosing cases of Severe Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Complement 3 (C3):

Code
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="c3",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="c3",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Complement 3", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Complement 3 (C3)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-06.png")
plt.show()

Complement 3 (C3) has a normal range of >35. Many patients who were diagnosed with Thrombosis had a C3 below 35, which suggests that patients with an abnormal C3 are more prone to Thrombosis. Looking at the degrees of Thrombosis, Mild and Severe cases are equally distributed. This suggests that C3 is a useful test to consider while diagnosing Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Complement 4 (C4):

Code
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="c4",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="c4",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Complement 4", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Complement 4 (C4)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-07.png")
plt.show()

Complement 4 (C4) has a normal range of >10. Almost all patients who were diagnosed with Thrombosis had a C4 below 18, which suggests that patients who have an abnormal C4 (from 0 to 15) are more prone to Thrombosis. Looking at the degrees of Thrombosis, Mild and Severe cases are equally distributed. This suggests that C4 is a useful test to consider while diagnosing Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Hemoglobin (HGB):

Code
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="hgb",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="hgb",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Hemoglobin", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Hemoglobin (HGB)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-08.png")
plt.show()

Hemoglobin (HGB) has a normal range between 10 and 17. All patients who were diagnosed with Thrombosis had an HGB between 10 and 17. Since the values stay within the normal range, HGB is not a useful test to consider while diagnosing Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Hematocrit (HCT):

Code
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="hct",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="hct",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Hematocrit", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Hematocrit (HCT)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-09.png")
plt.show()

Hematocrit (HCT) has a normal range between 29 and 52. More than 95% of the patients who were diagnosed with Thrombosis had an HCT between 29 and 52. Since the values mostly stay within the normal range, HCT is not a useful test to consider while diagnosing Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Platelet Count (PLT):

Code
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="plt",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="plt",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Platelet Count", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Platelet Count (PLT)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-10.png")
plt.show()

Platelet (PLT) has a normal range between 100 and 400. More than 95% of the patients who were diagnosed with Thrombosis had a PLT between 100 and 400. Since the values mostly stay within the normal range, PLT is not a useful test to consider while diagnosing Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Proteinuria (U-PRO):

Code
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="u-pro",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="u-pro",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Proteinuria", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Proteinuria (U-PRO)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-11.png")
plt.show()

Proteinuria (U-PRO) has a normal range between 0 and 30. Many patients who were diagnosed with Thrombosis had a U-PRO between 0 and 30, and only a few patients with U-PRO outside the normal range were also diagnosed with Thrombosis. Looking at the degrees of Thrombosis, most diagnosed patients have a normal U-PRO. The patients with a U-PRO above 100 have a severe case of Thrombosis, which is notable. This suggests that U-PRO is not a useful test for diagnosing Mild Thrombosis, but the relation between abnormally high U-PRO and the severity of Thrombosis could be studied further with a larger volume of data.

Technical Summary

The three datasets provided by the University Hospital revolve mainly around the Thrombosis diagnosis. All of them contain the ID of the patient, which was used as the join key. Together they roughly cover the demographics of each patient, the medical test history of each patient and the factors for the Thrombosis diagnosis of some patients. A large share of the data was missing in all these datasets, since some patients opt out of providing their personal information and/or take only a few medical tests per visit. These missing values were imputed after going through the entire datasets and with the help of some domain knowledge. Along with the imputation, the columns and rows with more than 70% missing values were dropped from the data to avoid imputing wrong values. Some feature extraction was also performed, such as calculating the age of the patient at the time the Thrombosis tests were performed, which was later used in the analysis.
After getting the final dataframe, Exploratory Data Analysis (Preliminary and Visual) was performed to understand the data better and answer the initial questions. Most of the patients who were tested for Thrombosis were between 19 and 45 years old, and only a few patients aged 61 or older were tested. Further, the distribution of Positive and Negative Thrombosis patients was highly uneven (only 16% of the patients tested were diagnosed as positive). If further statistical analysis and machine learning algorithms are to be applied to predict the Thrombosis results, more data for Positive Thrombosis patients needs to be introduced, since the dataset is currently imbalanced. Patients from the age of 19 to 45 have the highest probability of being diagnosed with Mild Thrombosis. This pattern changes for Severe Thrombosis, where patients who are not adults yet (0-18 years) are the most likely to be diagnosed. Females are more likely to test positive for Thrombosis than males, but these reports come from just one hospital, so this observation cannot be generalized to the entire world.
Apart from Thrombosis-specific tests such as acl_igG (Anticardiolipin Antibody IgG) and ANA (Antinuclear Antibody), common medical tests such as U-Pro (Proteinuria), GPT (ALT glutamic pyruvic transaminase), GOT (AST glutamic oxaloacetic transaminase), C3 and C4 (Complement 3 and 4), RBC (Red Blood Cells), HCT (Hematocrit), PLT (Platelet) and HGB (Hemoglobin) were found to be correlated with the Thrombosis diagnosis. Since correlation coefficients cannot be the only measure of whether these tests are relevant while diagnosing Thrombosis, further visual analysis was performed to check whether these tests show any significant patterns to look out for while diagnosing Thrombosis.
Tests like C3 and C4 showed significant patterns, while tests like HGB, HCT and Platelet count did not show any significant patterns while diagnosing Thrombosis. Some tests like U-PRO, GOT and GPT showed a few patterns that might be worth looking at while diagnosing extremely severe cases of Thrombosis. Overall, the data required a lot of assumptions while imputing the missing values, and the analysis performed was only preliminary. This analysis can be broken down further to a granular level to understand the factors affecting the Thrombosis diagnosis, by applying Machine Learning algorithms to predict the Thrombosis diagnosis.

Source Code
---
title: EDA on Medical Data of Thrombosis Diagnosis
title-location: center
format:
  html:
    embed-resources: true
    self-contained: true
    theme: lumen
    toc: true
    toc-location: right
    toc-depth: 6
    code-fold: true
    code-tools: true
    page-layout: full
jupyter: python3
---

# Data Summary

`Collagen` is a fibrous protein found in cartilage and other connective tissue. Collagen diseases are autoimmune diseases in which the immune system of the body attacks its own skin, tissues, and organs. For example, if a patient generates antibodies against the lungs, they will lose the ability to breathe and will die. The extent and causes of these diseases are only partially known and not well understood, and hence their classification can be a challenging task.<br>
One of these diseases is `Thrombosis`, which is an important and severe complication and is also one of the major causes of death in collagen diseases. It was recently discovered by medical physicians that Thrombosis is closely related to anti-cardiolipin antibodies. The databases used in this project were donated by one of these physicians from a University Hospital, to which patients with collagen diseases were referred by their local physicians, home doctors, and other medical specialists.

# Initial Questions

1. Is it possible for some age bands to be more likely to get diagnosed with higher degrees of Thrombosis than others?
2. Are females more likely to get diagnosed with Thrombosis or is it vice-versa?
3. How can other medical tests be incorporated to improve the accuracy of the diagnosis?

# Data Munging

## Import Libraries

```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import datetime
import snakecase
import re

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
```

## Import Data

```{python}
a = pd.read_csv("data/TSUMOTO_A.CSV")
b = pd.read_csv("data/TSUMOTO_B.CSV")
c = pd.read_csv("data/TSUMOTO_C.CSV")
```

### Data Sample

**Tsumoto_A - Basic Information about Patients (Input by Experts). This dataset includes all patients**

```{python}
a.head()
```

**Tsumoto_B - Special Laboratory Examinations (Input by Experts) (Measured by the Laboratory on Collagen Diseases). This dataset does not include all the patients, but includes the patients with these special tests**

```{python}
b.head()
```

**Tsumoto_C - Laboratory Examinations stored in Hospital Information Systems (Stored from 1980 to 1999.3). The data includes ordinary laboratory examinations and has timestamps**

```{python}
c.head()
```

## Data Cleaning

### Tsumoto_A and Tsumoto_B Merge

**Merge B with A on `ID`:**

```{python}
b_new = b.dropna(subset=["ID"])  # DROP ROWS WITH NA IN ID COLUMN
merge_df = pd.merge(
    b_new, a[["ID", "SEX", "Birthday", "Diagnosis"]], on="ID", how="left"
)
merge_df.head()
```
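
Both tables carry a `Diagnosis` column, so the merge renames them with the default `_x` and `_y` suffixes, which is where `Diagnosis_x` and `Diagnosis_y` in the next step come from. A minimal sketch on made-up rows (the `a_toy` and `b_toy` frames and their values are hypothetical stand-ins, and `indicator=True` is added here only for inspection):

```python
import pandas as pd

# Toy stand-ins for TSUMOTO_B (special lab exams) and TSUMOTO_A (patient info)
b_toy = pd.DataFrame(
    {"ID": [1, 2, 4], "Diagnosis": ["SLE", "SjS", None], "Thrombosis": [0, 1, 0]}
)
a_toy = pd.DataFrame(
    {"ID": [1, 2, 3], "SEX": ["F", "M", "F"], "Diagnosis": ["SLE, APS", "SjS", "RA"]}
)

# how="left" keeps every row of b_toy; IDs missing from a_toy get NaN in a_toy's columns
merged = pd.merge(b_toy, a_toy, on="ID", how="left", indicator=True)

print(merged.columns.tolist())
print(merged[["ID", "_merge"]])
```

The `_merge` column marks rows that found no match in the patient table (`left_only`), which is one source of the missing `SEX` and `Birthday` values handled below.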

#### Missing Values

```{python}
print("The number of rows and columns in the new merged dataframe are:", merge_df.shape)
print("--------------------------------------------------------------------------")
print(
    "The number of missing values in the new merged dataframe are: \n",
    merge_df.isna().sum(),
)
```

**Handling missing values in the merged dataset:**

```{python}
# MERGING THE 2 DIAGNOSIS COLUMNS INTO 1
def combine_diagnoses(diag_x, diag_y):
    """Return the unique, lower-cased diagnoses found in either column."""
    combined = []
    for value in (diag_x, diag_y):
        # MISSING VALUES ARE NaN (FLOAT), SO ONLY STRINGS ARE SPLIT
        if isinstance(value, str):
            combined += [item.strip().lower() for item in value.split(",")]
    # REMOVING THE DUPLICATES; ROWS WITHOUT ANY DIAGNOSIS GET A PLACEHOLDER
    return list(set(combined)) if combined else "No Diagnosis"


# CREATING A NEW COLUMN CALLED DIAGNOSIS AND FILLING IT WITH THE APPROPRIATE DIAGNOSIS
# (BUILDING THE WHOLE COLUMN AT ONCE AVOIDS CHAINED ASSIGNMENT SUCH AS merge_df[col][i] = ...)
merge_df["Diagnosis"] = [
    combine_diagnoses(x, y)
    for x, y in zip(merge_df["Diagnosis_x"], merge_df["Diagnosis_y"])
]

# DROPPING THE ORIGINAL DIAGNOSIS COLUMNS
merge_df.drop(["Diagnosis_x", "Diagnosis_y"], axis=1, inplace=True)

# FILLING NAN VALUES IN THE BIRTHDAY COLUMN WITH A DATE OF 0/0/0
merge_df["Birthday"].fillna("0/0/0", inplace=True)
# DROPPING ROWS WITH NAN VALUES IN THE EXAMINATION DATE COLUMN
merge_df.dropna(subset=["Examination Date"], inplace=True)
merge_df.reset_index(drop=True, inplace=True)

# CREATING A NEW COLUMN CALLED AGE: YEAR OF THE EXAMINATION DATE MINUS YEAR OF THE BIRTHDAY
def age_at_examination(row):
    if row["Birthday"] == "0/0/0":
        return "Not Available"
    exam_year = int(row["Examination Date"].split("/")[2])
    birth_year = int(row["Birthday"].split("/")[2])
    return exam_year - birth_year


merge_df["Age"] = merge_df.apply(age_at_examination, axis=1)

# DROPPING THE KCT, RVVT, AND LAC COLUMNS SINCE THEY HAVE MORE THAN 70% MISSING VALUES
merge_df.drop(["KCT", "RVVT", "LAC"], axis=1, inplace=True)
# FILLING MISSING VALUES IN THE SYMPTOMS COLUMN WITH "None"
merge_df["Symptoms"].fillna("None", inplace=True)
merge_df.reset_index(drop=True, inplace=True)
# FILLING MISSING VALUES IN THE ANA COLUMN WITH "0"
merge_df["ANA"].fillna("0", inplace=True)

# FILLING THE VALUES IN THE ANA PATTERN COLUMN WITH "None" IF THE ANA COLUMN IS "0"
merge_df.loc[merge_df["ANA"] == "0", "ANA Pattern"] = "None"

# FILLING MISSING VALUES IN THE ANA PATTERN COLUMN WITH "Not Available"
merge_df["ANA Pattern"].fillna("Not Available", inplace=True)
# FILLING MISSING VALUES IN THE SEX COLUMN WITH "Not Available"
merge_df["SEX"].fillna("Not Available", inplace=True)
# DROPPING ROWS WITH MISSING VALUES IN THE ENTIRE DATAFRAME
merge_df.dropna(inplace=True)
merge_df.reset_index(drop=True, inplace=True)

print(
    "The number of missing values in the new merged dataframe are: \n",
    merge_df.isna().sum(),
)
```

**Final dataframe sample after handling missing values:**

```{python}
merge_df.head()
```

**Converting the Examination Date and Date of the Test to Date Time Format:**

```{python}
c["Date"] = pd.to_datetime(c["Date"], format="%y%m%d")
merge_df["Examination Date"] = pd.to_datetime(merge_df["Examination Date"], format="%x")
```
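
The `"%y%m%d"` format reads a compact two-digit-year date such as `970813`. A small sketch on hypothetical values shows what the conversion produces, and how two-digit years are assigned a century (a detail worth keeping in mind for any date before 1969):

```python
import pandas as pd

# Hypothetical raw values in the compact YYMMDD layout
raw_dates = pd.Series(["970813", "800101", "990331"])
parsed = pd.to_datetime(raw_dates, format="%y%m%d")
print(parsed.dt.strftime("%Y-%m-%d").tolist())

# Two-digit years 69-99 are read as 19xx, while 00-68 are read as 20xx
pivot = pd.to_datetime(pd.Series(["690101", "680101"]), format="%y%m%d")
print(pivot.dt.year.tolist())
```

Since the hospital records run from 1980 to 1999, the test dates stay on the 19xx side of that cut-off.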

### Tsumoto_C Merge

**Merging the Dataframe of Demographic information of the Patient (A) and Thrombosis examination (B) with the Hospital Records of every Patient (C):**

```{python}
df = pd.merge(merge_df[["ID", "Examination Date"]], c, on=["ID"], how="left")
df.head()
```

#### Missing Values

**Dropping Rows and Columns according to Missing Values in the Merged Dataset:**

```{python}
# DROPPING UNNECESSARY COLUMNS CREATED WHILE IMPORTING THE DATA, ALONG WITH THE CRP COLUMN
df.drop(["Unnamed: 44", "Unnamed: 45", "CRP"], axis=1, inplace=True)
# DROPPING THE COLUMNS AND ROWS WITH MORE THAN 70% MISSING VALUES
df.dropna(thresh=len(df) * 0.3, axis=1, inplace=True)
df.dropna(thresh=len(df.columns) * 0.3, inplace=True)
# drop=True DISCARDS THE OLD INDEX INSTEAD OF ADDING IT AS AN EXTRA "index" COLUMN
df.reset_index(drop=True, inplace=True)
```
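
In `dropna`, `thresh` is the minimum number of non-missing values a column (or row) needs in order to be kept, so requiring 30% non-missing values is the same as allowing at most 70% missing. The sketch below checks this on a made-up frame; the column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows (hypothetical values)
toy = pd.DataFrame(
    {
        "ID": range(1, 11),
        "mostly_missing": [1.0, 2.0] + [np.nan] * 8,  # 80% missing
        "half_missing": [1.0] * 5 + [np.nan] * 5,     # 50% missing
    }
)

# Fraction of missing values per column
missing_share = toy.isna().mean()
print(missing_share)

# Keep the columns with at most 70% missing values
kept = toy.loc[:, missing_share <= 0.7]

# Same result through dropna: at least 3 of the 10 values must be present
kept_thresh = toy.dropna(axis=1, thresh=3)
print(kept.columns.tolist(), kept_thresh.columns.tolist())
```

Printing `missing_share` before dropping anything is a cheap way to see which columns are about to disappear.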

**Feature Extraction - Tagging the hospital records of Patient on the basis of the Thrombosis Examination:**

```{python}
# RECORDS DATED ON OR BEFORE THE THROMBOSIS EXAMINATION ARE TAGGED "Before", ALL OTHERS "After"
days_before_exam = (df["Examination Date"] - df["Date"]).dt.days
df["B/A_Tag"] = np.where(days_before_exam >= 0, "Before", "After")
```

**Converting all the values in the dataframe to float (from string):**

```{python}
col_conv = [
    "GOT",
    "GPT",
    "LDH",
    "ALP",
    "TP",
    "ALB",
    "UA",
    "UN",
    "CRE",
    "T-BIL",
    "T-CHO",
    "TG",
    "WBC",
    "RBC",
    "HGB",
    "HCT",
    "PLT",
    "U-PRO",
    "C3",
    "C4",
]
for col in col_conv:
    if df[col].dtype == "object":
        # KEEPING ONLY DIGITS AND DECIMAL POINTS IN THE STRING VALUES
        is_text = df[col].map(lambda value: isinstance(value, str))
        cleaned = df.loc[is_text, col].str.replace(r"[^\d.]", "", regex=True)
        # STRINGS WITHOUT ANY DIGITS BECOME 0
        df.loc[is_text, col] = pd.to_numeric(cleaned.replace("", "0"), errors="coerce")

# CREATING A COPY OF THE DATAFRAME TO AVOID CHANGING THE ORIGINAL DATAFRAME
df1 = df.copy()
# FILLING MISSING VALUES IN THE COLUMNS WITH 0
df1.fillna(0, inplace=True)

# CONVERTING THE COLUMNS OF THE COPY TO FLOAT, SINCE df2 BELOW IS BUILT FROM df1
for col in col_conv:
    df1[col] = pd.to_numeric(df1[col], errors="coerce").astype(float)

# CREATING A NEW DATAFRAME WITH ONLY THE COLUMNS NEEDED FOR THE GROUPING
df2 = df1[
    [
        "ID",
        "GOT",
        "GPT",
        "LDH",
        "ALP",
        "TP",
        "ALB",
        "UA",
        "UN",
        "CRE",
        "T-BIL",
        "T-CHO",
        "TG",
        "WBC",
        "RBC",
        "HGB",
        "HCT",
        "PLT",
        "U-PRO",
        "C3",
        "C4",
        "B/A_Tag",
    ]
]
```

**Grouping the dataframe by `ID` and `Before-After Tag` and taking the mean of the values of each test:**

```{python}
df3 = df2.groupby(["ID", "B/A_Tag"], as_index=False, dropna=True).mean()
```
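
The grouping collapses the many dated lab records of a patient into at most two rows, one averaging the records before the Thrombosis examination and one averaging those after it. A minimal sketch with made-up `GOT` values:

```python
import pandas as pd

# Toy lab records (hypothetical values): several visits per patient
toy = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2],
        "B/A_Tag": ["Before", "Before", "After", "After", "After"],
        "GOT": [20.0, 30.0, 40.0, 50.0, 70.0],
    }
)

# One row per (patient, tag) pair holding the mean of each test
summary = toy.groupby(["ID", "B/A_Tag"], as_index=False).mean()
print(summary)
```

Patient 2 has no records before the examination in this toy data, so only an "After" row is produced for that ID.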

### Final Merge

**Merging the above dataset with our First Merge Dataset to get the final Data:**

```{python}
final_df = pd.merge(df3, merge_df, on=["ID"], how="left")
final_df.sort_values(by=["ID"], inplace=True)
# CONVERTING THE COLUMN NAMES TO SNAKECASE
final_df.columns = final_df.columns.map(snakecase.convert)
```

**Final Data Cleaning:**

```{python}
# CREATING AGE BANDS FOR THE AGE COLUMN
def to_age_band(age):
    if age == "Not Available":
        return age
    if 0 <= age <= 18:
        return "0-18"
    if 19 <= age <= 30:
        return "19-30"
    if 31 <= age <= 45:
        return "31-45"
    if 46 <= age <= 60:
        return "46-60"
    return "61+"


final_df["age"] = final_df["age"].apply(to_age_band)

# CONVERTING ANA COLUMN TO NUMERIC (astype(str) ALSO HANDLES VALUES THAT ARE NOT STRINGS)
final_df["ana"] = final_df["ana"].astype(str).str.replace(r"[^\d.]", "", regex=True)
final_df["ana"] = pd.to_numeric(final_df["ana"], errors="coerce").astype(float)

# CREATING A THROMBOSIS DIAGNOSIS COLUMN
final_df["thrombosis_diagnosis"] = final_df["thrombosis"].apply(
    lambda x: 1 if x > 0 else 0
)

# DROPPING UNNECESSARY COLUMNS
final_df.drop(["birthday", "examination _date"], axis=1, inplace=True)

# SAVING THE FINAL DATAFRAME TO A CSV FILE
final_df.to_csv("data/FINAL_DATA.csv", index=False)
```
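
When every age is numeric, the age banding above can also be expressed with `pd.cut`. This is an alternative sketch rather than the approach used in this report, and it assumes the "Not Available" entries have been set aside first:

```python
import numpy as np
import pandas as pd

# Ages on both sides of every band boundary
ages = pd.Series([0, 18, 19, 30, 31, 45, 46, 60, 61, 90])

# Bins are right-inclusive by default: (-1, 18], (18, 30], (30, 45], (45, 60], (60, inf]
bands = pd.cut(
    ages,
    bins=[-1, 18, 30, 45, 60, np.inf],
    labels=["0-18", "19-30", "31-45", "46-60", "61+"],
)
print(pd.DataFrame({"age": ages, "band": bands}))
```

The result is an ordered categorical, so plots and group-bys keep the bands in age order without an explicit `order` argument.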

**Final Data Sample:**

```{python}
final_df.head()
```

# Exploratory Analysis

Read in the Data from Local Machine:

```{python}
final_df = pd.read_csv("data/FINAL_DATA.csv")
final_df.head()
```

Setting the default style of the plots:

```{python}
sns.set_style("white")
sns.set_palette("Accent")
```

## Preliminary Analysis

**Information about the dataset features:**

```{python}
final_df.info()
```

**Statistical Analysis of the Dataset:**

```{python}
final_df.describe()
```

## Visual Analysis

**Distribution of the Age of the Patients:**

```{python}
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A COUNT PLOT
sns.countplot(
    x="age", data=final_df, ax=ax, order=["0-18", "19-30", "31-45", "46-60", "61+"]
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_title("Age Distribution of the Patients", fontsize=15)
ax.set_xlabel("Age Bands", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)

# SETTING THE GRID LINES AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
_ = ax.bar_label(ax.containers[0])

plt.show()
```

_The majority of the patients who were tested for Thrombosis were 19-30 years old. However, this does not mean that this age group is more likely to be diagnosed with Thrombosis than others. It only shows that this is the most common age group in the dataset and was tested for Thrombosis most often._

**The Distribution of the number of people who were diagnosed with Thrombosis along with those who were not:**

```{python}
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))

# CREATING A BAR PLOT
sns.barplot(
    x=final_df["thrombosis_diagnosis"].value_counts().index,
    y=final_df["thrombosis_diagnosis"].value_counts().values,
    ax=ax,
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_title("Distribution of Positive and Negative Thrombosis Patients", fontsize=15)
ax.set_xlabel("Diagnosis", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
_ = ax.set_xticklabels(["Negative", "Positive"])

# SETTING THE GRID LINES AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
_ = ax.bar_label(ax.containers[0])

plt.show()
```

_Only 16% of the patients who went for the Thrombosis Test were diagnosed Positive._

**Outlier Detection of all the features present in the Dataset:**

```{python}
# DIVIDING THE COLUMNS INTO GROUPS
cols1 = ["got", "gpt", "ldh", "alp", "tp", "alb"]
cols2 = ["ua", "un", "cre", "t-bil", "t-cho", "tg", "u-pro"]
cols3 = ["wbc", "rbc", "hgb", "hct", "plt", "c3", "c4"]
cols4 = ["a_cl _ig_g", "a_cl _ig_m", "ana", "a_cl _ig_a"]

# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(2, 2, figsize=(10, 10))

# CREATING A BOXPLOT FOR EACH GROUP
sns.boxplot(data=final_df[cols1], ax=ax[0, 0])
sns.boxplot(data=final_df[cols2], ax=ax[0, 1])
sns.boxplot(data=final_df[cols3], ax=ax[1, 0])
sns.boxplot(data=final_df[cols4], ax=ax[1, 1])

# SETTING THE TITLE
plt.suptitle("Outlier Detection of the Features", fontsize=16, y=0.94)
ax[0, 0].set_title("Blood Chemistry", fontsize=14)
ax[0, 1].set_title("Urinalysis", fontsize=14)
ax[1, 0].set_title("Complete Blood Count", fontsize=14)
ax[1, 1].set_title("Immunology", fontsize=14)

# SETTING THE TICKSIZE AND GRID LINES
for i in range(2):
    for j in range(2):
        ax[i, j].tick_params(labelsize=11)
        ax[i, j].grid(True, axis="y", linestyle=":", linewidth=1)

plt.show()
```

_Some features, such as "u-pro" (Proteinuria), "ldh" (Lactate Dehydrogenase) and "plt" (Platelet), have many outliers, while features such as "acl_igA" and "acl_igM" have fewer outliers, but those outliers take very large values and need to be dealt with._
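
The box plots mark points beyond the whiskers as outliers. A rough count per feature can be obtained with the usual 1.5 × IQR rule, which is also seaborn's default whisker length. This is only a sketch on hypothetical toy values; `iqr_outlier_counts` is an illustrative helper, not part of the original analysis.

```python
import pandas as pd


def iqr_outlier_counts(df, cols, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in cols:
        values = df[col].dropna()
        q1, q3 = values.quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return pd.Series(counts, name="n_outliers")


# Hypothetical toy values (not taken from the dataset)
toy = pd.DataFrame(
    {
        "ldh": [180, 190, 200, 210, 220, 230, 900],
        "alb": [4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6],
    }
)
outliers = iqr_outlier_counts(toy, ["ldh", "alb"])
print(outliers)
```

Applied to the report's data, the call would take `final_df` and the column groups `cols1` to `cols4` defined above.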

**Distribution of Diagnosis of Thrombosis with respect to the gender of the patient:**

```{python}
# CREATING A NEW DATAFRAME FOR GENDER PERCENT DISTRIBUTION
x, y = "sex", "thrombosis_diagnosis"
new = final_df[final_df[x] != "Not Available"]
new = new.groupby(x)[y].value_counts(normalize=True)
new = new.mul(100)
new = new.rename("Percent").reset_index()
new["thrombosis_diagnosis"] = new["thrombosis_diagnosis"].apply(
    lambda x: "Thrombosis" if x > 0 else "No Thrombosis"
)

# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))

# CREATING A BAR PLOT
sns.barplot(x="sex", y="Percent", hue="thrombosis_diagnosis", data=new, ax=ax)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_ylim(0, 100)
ax.set_title(
    "Thrombosis Diagnosis Distribution w.r.t Gender of the Patient", fontsize=15
)
ax.set_xlabel("Gender", fontsize=13)
ax.set_ylabel("Percent", fontsize=13)
ax.tick_params(labelsize=11)
ax.set_xticklabels(["Female", "Male"])

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
ax.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
for p in ax.patches:
    txt = f"{p.get_height():.2f}%"
    txt_x = p.get_x()
    txt_y = p.get_height()
    ax.text(txt_x + 0.1, txt_y, txt, va="bottom", fontsize=11)

plt.savefig("plot_outputs/plot-01.png")
plt.show()
```

_The graph above shows that, in this dataset, females test positive for Thrombosis at a higher rate than males. However, these reports come from a single hospital, so the observation cannot be generalized to the entire world. Generalizing would require more information about the demographics of the area where the University Hospital is located and the extent to which those factors affected the patients._
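
Percentages hide how many patients stand behind each bar, so it can help to look at the raw counts next to the row percentages. The sketch below uses `pd.crosstab` on hypothetical toy data; the `"F"`/`"M"` codes are an assumption about how `sex` is encoded.

```python
import pandas as pd

# Hypothetical toy data: 8 female and 2 male patients (values are made up)
toy = pd.DataFrame(
    {
        "sex": ["F"] * 8 + ["M"] * 2,
        "thrombosis_diagnosis": [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    }
)

# Raw counts per gender and diagnosis
counts = pd.crosstab(toy["sex"], toy["thrombosis_diagnosis"])

# Percentages within each gender (each row sums to 100)
row_pct = (
    pd.crosstab(toy["sex"], toy["thrombosis_diagnosis"], normalize="index")
    .mul(100)
    .round(1)
)

print(counts)
print(row_pct)
```

When one group is much smaller than the other, its percentages move a lot with every additional patient, which is another reason to read the gender comparison with caution.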

**Distribution of the severity of Thrombosis among different Age groups:**

```{python}
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))

# CREATING A COUNT PLOT
sns.countplot(
    x="thrombosis",
    hue="age",
    data=final_df[final_df["thrombosis"] > 0],
    ax=ax,
    hue_order=["0-18", "19-30", "31-45", "46-60", "61+"],
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Distribution of Degrees of Thrombosis w.r.t Age Bands", fontsize=15)
ax.set_title(
    "1 = Positive | 2 = Positive and very severe | 3 = Positive and extremely severe",
    fontsize=11,
)
ax.set_xlabel("Thrombosis", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
ax.set_yticks(np.arange(0, 21, 5))

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
for container in ax.containers:
    _ = ax.bar_label(container)
ax.legend(title="Age Bands", title_fontsize=11, fontsize=10)

plt.savefig("plot_outputs/plot-02.png")
plt.show()
```

_Mild Thrombosis has far more patients than the other degrees in every age band, and patients aged 19 to 45 are diagnosed with Mild Thrombosis most often. The pattern changes for Severe Thrombosis, where patients who are not yet adults (0-18 years) are diagnosed most often. Extremely Severe Thrombosis is diagnosed in patients in the 31-60 age group. Patients aged 61 and above have the fewest Thrombosis diagnoses, although these are raw counts and also reflect how many patients in each band were tested._

**Predictors that are correlated to Thrombosis (target variable):**

```{python}
# SELECTING THE COLUMNS FOR THE CORRELATION MATRIX
corr_df = final_df.drop(["id", "thrombosis_diagnosis"], axis=1)

# CREATING A CORRELATION MATRIX
# select_dtypes KEEPS corr() FROM FAILING ON TEXT COLUMNS SUCH AS "age" AND "sex"
corr = corr_df.select_dtypes("number").corr()

# SETTING THE FIGURE SIZE
plt.figure(figsize=(8, 5))

# CREATING A BAR PLOT BASED ON THE CORRELATION COEFFICIENTS
corr["thrombosis"].sort_values(ascending=False).plot(kind="bar")

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.title("Correlation between Thrombosis and other Variables", fontsize=15)
plt.grid(True, axis="y", linestyle=":", linewidth=1)
plt.ylabel("Correlation Coefficient", fontsize=13)
plt.xlabel("Variables", fontsize=13)
plt.tick_params(labelsize=11)

plt.savefig("plot_outputs/plot-03.png")
plt.show()
```

_Features that are highly correlated with Thrombosis (target variable) are:_ <br>
- acl_igG (Anticardiolipin Antibody IgG)<br>
- ANA (Antinuclear Antibody)<br>
- U-Pro (Proteinuria)<br>
- GPT (ALT glutamic pyruvic transaminase)<br>
- GOT (AST glutamic oxaloacetic transaminase)<br>
- C3 (Complement 3)<br>
- C4 (Complement 4)<br>
- RBC (Red Blood Cells)<br>
- HCT (Hematocrit)<br>
- PLT (Platelet)<br>
- HGB (Hemoglobin)<br>
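
A feature can matter because of a strong negative correlation as well as a strong positive one, so ranking by the absolute value of the coefficient is often more informative than sorting from highest to lowest. The following is a minimal sketch on hypothetical toy data; the column names mirror those in `final_df`, but the values are made up.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame(
    {
        "thrombosis": [0, 0, 1, 1, 2, 3],
        "a_cl _ig_g": [1.0, 2.0, 4.0, 5.0, 8.0, 9.0],
        "c3": [90, 85, 60, 70, 40, 30],
        "hgb": [12, 14, 12, 14, 13, 13],
    }
)

# Pearson correlation of every numeric feature with the target
corr_with_target = toy.select_dtypes("number").corr()["thrombosis"].drop("thrombosis")

# Order features by strength of association, keeping the sign for interpretation
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

In this toy example `c3` is negatively correlated with the target and `hgb` is uncorrelated, so `hgb` ends up last regardless of sign.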

# Final Plots

**Comparing Different Medical Tests with Tests specific to Thrombosis (Anti-Cardiolipin Antibody (IgG)):**<br>
Since Anti-Cardiolipin Antibody (IgG) is the feature most correlated with Thrombosis, we compare it with other medical tests to see whether they can be used to improve the accuracy of the diagnosis.

**_Anti-Cardiolipin Antibody (IgG) V/S ALT glutamic pyruvic transaminase (GPT):_**

```{python}
# CREATING NEW DATAFRAMES FOR COMPARISON BETWEEN MEDICAL TESTS AND THROMBOSIS SPECIFIC TESTS
# .copy() AVOIDS THE pandas SettingWithCopyWarning WHEN THE COLUMNS ARE RELABELLED BELOW
new_df = final_df[final_df["a_cl _ig_g"] < 100].copy()
new_df["thrombosis_diagnosis"] = new_df["thrombosis_diagnosis"].apply(
    lambda x: "Thrombosis" if x > 0 else "No Thrombosis"
)
new_df1 = final_df[
    (final_df["a_cl _ig_g"] < 100) & (final_df["thrombosis_diagnosis"] > 0)
].copy()
new_df1["thrombosis"] = new_df1["thrombosis"].apply(
    lambda x: "Severe" if x > 1 else "Mild"
)

# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="gpt",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="gpt",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle(
    "Anti-Cardiolipin Antibody V/S ALT glutamic pyruvic transaminase", fontsize=15
)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=12)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "ALT glutamic pyruvic transaminase (GPT)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-04.png")
plt.show()
```

_ALT glutamic pyruvic transaminase (GPT) has a normal range of <60. Most patients diagnosed with Thrombosis had a GPT <60, with only a few exceptions outside the normal range. The same holds when the diagnosis is split by degree of Thrombosis: most diagnosed patients have a normal GPT. Interestingly, the one patient with a GPT of 300+ has a severe case of thrombosis. This suggests that GPT is not a useful test for diagnosing Mild Thrombosis, but the relationship between abnormally high GPT and the severity of Thrombosis could be investigated further with a larger volume of data._

**_Anti-Cardiolipin Antibody (IgG) V/S AST glutamic oxaloacetic transaminase (GOT):_**

```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="got",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="got",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle(
    "Anti-Cardiolipin Antibody V/S AST glutamic oxaloacetic transaminase", fontsize=15
)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(
    0.5, 0.02, "AST glutamic oxaloacetic transaminase (GOT)", ha="center", fontsize=13
)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-05.png")
plt.show()
```

_AST glutamic oxaloacetic transaminase (GOT) has a normal range of <60 (similar to GPT). Most patients diagnosed with Thrombosis had a GOT <60, with some exceptions outside the normal range. The same holds when the diagnosis is split by degree of Thrombosis: most diagnosed patients have a normal GOT. Interestingly, the patients with a GOT of 100+ have a severe case of thrombosis. This suggests that GOT is not a useful test for diagnosing Mild Thrombosis but can be used when diagnosing cases of Severe Thrombosis._

**_Anti-Cardiolipin Antibody (IgG) V/S Complement 3 (C3):_**

```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="c3",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="c3",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Complement 3", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Complement 3 (C3)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-06.png")
plt.show()
```

_Complement 3 (C3) has a normal range of >35. Many patients diagnosed with Thrombosis had a C3 <35, which suggests that patients with an abnormal C3 are more prone to Thrombosis. Split by degree, Mild and Severe Thrombosis are evenly distributed. This suggests that C3 is a useful test to consider when diagnosing Thrombosis._
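
The claim that an abnormal result makes Thrombosis more likely can be checked by comparing the positive share on each side of the threshold. The sketch below uses hypothetical toy data and an illustrative helper, `positive_rate_by_threshold`; the threshold of 35 is the C3 normal-range limit quoted above.

```python
import numpy as np
import pandas as pd


def positive_rate_by_threshold(df, col, threshold, target_col="thrombosis_diagnosis"):
    """Positive share (%) for patients below vs. at or above a threshold."""
    known = df.dropna(subset=[col])  # ignore patients without this test
    group = np.where(known[col] < threshold, "below", "at_or_above")
    return known.groupby(group)[target_col].mean().mul(100).round(1)


# Hypothetical toy data (values are made up)
toy = pd.DataFrame(
    {
        "c3": [20, 30, 25, 50, 60, 80, 90, 100],
        "thrombosis_diagnosis": [1, 1, 0, 0, 0, 1, 0, 0],
    }
)
c3_rates = positive_rate_by_threshold(toy, "c3", threshold=35)
print(c3_rates)
```

A clearly higher positive share below the threshold than above it would support using the test as a diagnostic signal; similar shares on both sides would not.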

**_Anti-Cardiolipin Antibody (IgG) V/S Complement 4 (C4):_**

```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="c4",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="c4",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Complement 4", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Complement 4 (C4)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-07.png")
plt.show()
```

_Complement 4 (C4) has a normal range of >10. Almost all patients diagnosed with Thrombosis had a C4 <18, which suggests that patients with a low C4 (from 0 to 15) are more prone to Thrombosis. Split by degree, Mild and Severe Thrombosis are evenly distributed. This suggests that C4 is a useful test to consider when diagnosing Thrombosis._

**_Anti-Cardiolipin Antibody (IgG) V/S Hemoglobin (HGB):_**

```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="hgb",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="hgb",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Hemoglobin", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Hemoglobin (HGB)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-08.png")
plt.show()
```

_Hemoglobin (HGB) has a normal range between 10 and 17. All patients diagnosed with Thrombosis had an HGB between 10 and 17, so the test does not set them apart from patients with a normal result. This shows that HGB is not a useful test to consider when diagnosing Thrombosis._

**_Anti-Cardiolipin Antibody (IgG) V/S Hematocrit (HCT):_**

```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="hct",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="hct",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Hematocrit", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Hematocrit (HCT)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-09.png")
plt.show()
```

_Hematocrit (HCT) has a normal range between 29 and 52. More than 95% of the patients diagnosed with Thrombosis had an HCT between 29 and 52, so the test does not set them apart from patients with a normal result. This shows that HCT is not a useful test to consider when diagnosing Thrombosis._
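
Statements such as "more than 95% of the diagnosed patients were within the normal range" can be computed directly instead of being read off the scatter plot. This is a minimal sketch on hypothetical toy data; `share_within_range` is an illustrative helper and the bounds are the HCT normal range quoted above.

```python
import pandas as pd


def share_within_range(df, col, low, high, target_col="thrombosis_diagnosis"):
    """Share (%) of positively diagnosed patients whose result lies in [low, high]."""
    positives = df.loc[df[target_col] > 0, col].dropna()
    # Series.between is inclusive on both ends by default
    return round(100 * positives.between(low, high).mean(), 1)


# Hypothetical toy data (values are made up)
toy = pd.DataFrame(
    {
        "hct": [25, 35, 40, 45, 38, 41, 60],
        "thrombosis_diagnosis": [1, 1, 1, 1, 0, 1, 0],
    }
)
hct_share = share_within_range(toy, "hct", low=29, high=52)
print(hct_share)
```

The same helper could be reused for the other tests by changing the column name and the bounds.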

**_Anti-Cardiolipin Antibody (IgG) V/S Platelet Count (PLT):_**

```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="plt",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="plt",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Platelet Count", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Platelet Count (PLT)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-10.png")
plt.show()
```

_Platelet (PLT) has a normal range between 100 and 400. More than 95% of the patients diagnosed with Thrombosis had a PLT between 100 and 400, so the test does not set them apart from patients with a normal result. This shows that PLT is not a useful test to consider when diagnosing Thrombosis._

**_Anti-Cardiolipin Antibody (IgG) V/S Proteinuria (U-PRO):_**

```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="u-pro",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="u-pro",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Proteinuria", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Proteinuria (U-PRO)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-11.png")
plt.show()
```

_Proteinuria (U-PRO) has a normal range between 0 and 30. Most patients diagnosed with Thrombosis had a U-PRO between 0 and 30, with only a few exceptions outside the normal range. The same holds when the diagnosis is split by degree of Thrombosis: most diagnosed patients have a normal U-PRO. Interestingly, the patients with a U-PRO of 100+ have a severe case of thrombosis. This suggests that U-PRO is not a useful test for diagnosing Mild Thrombosis, but the relationship between abnormally high U-PRO and the severity of Thrombosis could be investigated further with a larger volume of data._

# Technical Summary

The three datasets provided by the University Hospital revolve mainly around the Thrombosis diagnosis. All of them contain the patient ID, which was used as the key for the joins. `These datasets roughly covered the demographic aspect of each patient, the medical test history of each patient and the factors for thrombosis diagnosis of some patients`. All three had a large amount of missing data, since some patients opt out of providing personal information and/or take only a few medical tests per visit. `These missing values were carefully imputed` after going through the entire datasets, with the help of some domain knowledge. Columns and rows with more than 70% missing values were dropped rather than imputed, to avoid imputing wrong values. Some `feature extraction` was also performed, such as calculating the age of the patient at the time the Thrombosis tests were performed, which was useful later in the analysis.<br>
After the final dataframe was built, `Exploratory Data Analysis (Preliminary and Visual)` was performed to understand the data better and answer the initial questions. Most of the patients tested for Thrombosis were between 19 and 45 years old, and only a few were 61 or older. The split between Positive and Negative patients was heavily skewed (only 16% of the patients tested were diagnosed as positive). If further statistical analysis or machine learning algorithms are to be applied to predict Thrombosis results, more data on Positive patients needs to be introduced, since the dataset is currently imbalanced. Patients aged 19 to 45 are diagnosed with Mild Thrombosis most often. This pattern changes for Severe Thrombosis, where patients who are not yet adults (0-18 years) are diagnosed most often. Females are more likely to test positive for Thrombosis than males, but these reports come from a single hospital and the observation cannot be generalized to the entire world.<br>
Apart from Thrombosis-specific tests such as acl_igG (Anticardiolipin Antibody IgG) and ANA (Antinuclear Antibody), common medical tests such as U-Pro (Proteinuria), GPT (ALT glutamic pyruvic transaminase), GOT (AST glutamic oxaloacetic transaminase), C3 and C4 (Complement 3 and 4), RBC (Red Blood Cells), HCT (Hematocrit), PLT (Platelet) and HGB (Hemoglobin) were found to be correlated with the Thrombosis diagnosis. Since correlation coefficients cannot be the only measure of whether these tests are relevant when diagnosing Thrombosis, further visual analysis was performed to check for significant patterns in these tests.<br>
Tests like C3 and C4 showed significant patterns, while HGB, HCT and Platelet count did not. Some tests, such as U-PRO, GOT and GPT, showed a few patterns that might be worth looking at when diagnosing extremely severe cases of Thrombosis. Overall, the data required many assumptions while imputing the missing values, and the analysis performed was only preliminary. `This analysis can be broken down further to a granular level, to understand the factors affecting the Thrombosis diagnosis, by applying Machine Learning algorithms to predict it.`
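
The 70% missing-value rule described above can be sketched as follows. This is an illustration on hypothetical toy data, not the code used earlier in the report; the helper name `drop_sparse` and the choice to filter columns before rows are assumptions.

```python
import numpy as np
import pandas as pd


def drop_sparse(df, max_missing=0.70):
    """Drop columns, then rows, whose share of missing values exceeds max_missing."""
    # Share of missing values per column
    kept_cols = df.loc[:, df.isna().mean(axis=0) <= max_missing]
    # Share of missing values per row, measured on the remaining columns
    return kept_cols.loc[kept_cols.isna().mean(axis=1) <= max_missing]


# Hypothetical toy data: "rare_test" is missing for 3 of 4 patients (75%)
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "got": [30.0, np.nan, 28.0, 35.0],
        "rare_test": [np.nan, np.nan, np.nan, 1.2],
    }
)
cleaned = drop_sparse(toy)
print(cleaned)
```

Filtering columns first means that rows are judged only on the columns that survive, so a patient is not dropped merely because of tests that were removed anyway.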