Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import datetime
import snakecase
import re
"ignore")
warnings.filterwarnings("display.max_columns", None) pd.set_option(
Collagen
is a fibrous protein found in cartilage and other connective tissue. Collagen diseases are autoimmune diseases in which the immune system of the body attacks its own skin, tissues, and organs. For example, if a patient generates antibodies for lung, they will lose their ability to do respiration and will die. The extent and causes of these diseases are partially known and not well understood and hence their classification can be a challenging task.
One of these diseases is Thrombosis
, which is an important and severe complication and is also one of the major causes of death in collagen diseases. It was recently discovered by medical physicians that Thrombosis is closely related to anti-cardiolipin antibodies. The Databases used in this project are donated by one of these physicians from a University Hospital where patients came regarding collagen diseases and were recommended by their local physicians, home doctors, and other medical specialists.
Tsumoto_A - Basic Information about Patients (Input by Experts). This dataset includes all patients
ID | SEX | Birthday | Description | First Date | Admission | Diagnosis | |
---|---|---|---|---|---|---|---|
0 | 2110 | F | 2/13/34 | 94.02.14 | 93.02.10 | + | RA susp. |
1 | 11408 | F | 5/2/37 | 96.12.01 | 73.01.01 | + | PSS |
2 | 12052 | F | 4/14/56 | 91.08.13 | NaN | + | SLE |
3 | 14872 | F | 9/21/53 | 97.08.13 | NaN | + | MCTD |
4 | 27654 | F | 3/25/36 | NaN | 92.02.03 | + | RA, SLE susp |
Tsumoto_B - Special Laboratory Examinations (Input by Experts) (Measured by the Laboratory on Collagen Diseases)This dataset does not include all the patients, but includes the patients with these special tests
ID | Examination Date | aCL IgG | aCL IgM | ANA | ANA Pattern | aCL IgA | Diagnosis | KCT | RVVT | LAC | Symptoms | Thrombosis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14872.0 | 5/27/97 | 1.3 | 1.6 | 256 | P | 0.0 | MCTD, AMI | NaN | NaN | - | AMI | 1 |
1 | 48473.0 | 12/21/92 | 4.3 | 4.6 | 256 | P,S | 3.3 | SLE | - | - | - | NaN | 0 |
2 | 102490.0 | 4/20/95 | 2.3 | 2.5 | 0 | NaN | 3.5 | PSS | NaN | NaN | NaN | NaN | 0 |
3 | 108788.0 | 5/6/97 | 0.0 | 0.0 | 16 | S | 0.0 | NaN | NaN | NaN | - | NaN | 0 |
4 | 122405.0 | 4/2/98 | 0.0 | 4.0 | 4 | P | 0.0 | SLE, SjS, vertigo | NaN | NaN | NaN | NaN | 0 |
Tsumoto_C - Laboratory Examinations stored in Hospital Information Systems (Stored from 1980 to 1999.3) All the data includes ordinary laboratory examinations and have temporal stamps
ID | Date | GOT | GPT | LDH | ALP | TP | ALB | UA | UN | CRE | T-BIL | T-CHO | TG | CPK | GLU | WBC | RBC | HGB | HCT | PLT | PT | APTT | FG | PIC | TAT | TAT2 | U-PRO | IGG | IGA | IGM | CRP | RA | RF | C3 | C4 | RNP | SM | SC170 | SSA | SSB | CENTROMEA | DNA | DNA-II | Unnamed: 44 | Unnamed: 45 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2110 | 860419 | 24 | 12 | 152 | 63 | 7.5 | 4.5 | 3.4 | 16.0 | 0.6 | 0.4 | 192 | NaN | NaN | NaN | 5.9 | 4.69 | 9 | 31.6 | 380 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2110 | 860430 | 25 | 12 | 162 | 76 | 7.9 | 4.6 | 4.7 | 18.0 | 0.6 | 0.4 | 187 | 76 | 31 | NaN | 6.9 | 4.73 | 9.3 | 31.8 | 323 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <0.002 | - | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2110 | 860502 | 22 | 8 | 144 | 68 | 7 | 4.2 | 5 | 18.0 | 0.6 | 0.4 | 191 | NaN | 23 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | - | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2110 | 860506 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2110 | 860507 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Merge B with A on ID
:
ID | Examination Date | aCL IgG | aCL IgM | ANA | ANA Pattern | aCL IgA | Diagnosis_x | KCT | RVVT | LAC | Symptoms | Thrombosis | SEX | Birthday | Diagnosis_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14872.0 | 5/27/97 | 1.3 | 1.6 | 256 | P | 0.0 | MCTD, AMI | NaN | NaN | - | AMI | 1 | F | 9/21/53 | MCTD |
1 | 48473.0 | 12/21/92 | 4.3 | 4.6 | 256 | P,S | 3.3 | SLE | - | - | - | NaN | 0 | F | 10/7/48 | SLE |
2 | 102490.0 | 4/20/95 | 2.3 | 2.5 | 0 | NaN | 3.5 | PSS | NaN | NaN | NaN | NaN | 0 | F | 4/1/82 | PSS |
3 | 108788.0 | 5/6/97 | 0.0 | 0.0 | 16 | S | 0.0 | NaN | NaN | NaN | - | NaN | 0 | F | 3/15/42 | SJS |
4 | 122405.0 | 4/2/98 | 0.0 | 4.0 | 4 | P | 0.0 | SLE, SjS, vertigo | NaN | NaN | NaN | NaN | 0 | F | 5/22/61 | SJS |
The number of rows and columns in the new merged dataframe are: (770, 16)
--------------------------------------------------------------------------
The number of missing values in the new merged dataframe are:
ID 0
Examination Date 7
aCL IgG 0
aCL IgM 0
ANA 20
ANA Pattern 236
aCL IgA 0
Diagnosis_x 312
KCT 625
RVVT 625
LAC 549
Symptoms 695
Thrombosis 0
SEX 354
Birthday 353
Diagnosis_y 353
dtype: int64
Handling missing values in the merged dataset:
# MERGING THE 2 DIAGNOSIS COLUMNS INTO 1
# CREATING LISTS OF DIAGNOSIS FOR EACH ROW
for i in range(len(merge_df)):
if merge_df["Diagnosis_x"].isna()[i] == False:
merge_df["Diagnosis_x"][i] = merge_df["Diagnosis_x"][i].split(",")
if merge_df["Diagnosis_y"].isna()[i] == False:
merge_df["Diagnosis_y"][i] = merge_df["Diagnosis_y"][i].split(",")
# CREATING A NEW COLUMN CALLED DIAGNOSIS AND FILLING IT WITH THE APPROPRIATE DIAGNOSIS
merge_df["Diagnosis"] = ""
for i in range(len(merge_df)):
if (merge_df["Diagnosis_x"].isna()[i] == False) & (
merge_df["Diagnosis_y"].isna()[i] == True
):
merge_df["Diagnosis"][i] = merge_df["Diagnosis_x"][i]
elif (merge_df["Diagnosis_x"].isna()[i] == True) & (
merge_df["Diagnosis_y"].isna()[i] == False
):
merge_df["Diagnosis"][i] = merge_df["Diagnosis_y"][i]
elif (merge_df["Diagnosis_x"].isna()[i] == False) & (
merge_df["Diagnosis_y"].isna()[i] == False
):
merge_df["Diagnosis"][i] = list(
set(merge_df["Diagnosis_x"][i] + merge_df["Diagnosis_y"][i])
)
# REMOVING THE DUPLICATES IN THE DIAGNOSIS COLUMN
for i in range(len(merge_df)):
for j in range(len(merge_df["Diagnosis"][i])):
merge_df["Diagnosis"][i][j] = merge_df["Diagnosis"][i][j].strip()
merge_df["Diagnosis"][i][j] = merge_df["Diagnosis"][i][j].lower()
merge_df["Diagnosis"][i] = list(set(merge_df["Diagnosis"][i]))
if merge_df["Diagnosis"][i] == []:
merge_df["Diagnosis"][i] = "No Diagnosis"
else:
pass
# DROPPING THE ORIGINAL DIAGNOSIS COLUMNS
merge_df.drop(["Diagnosis_x", "Diagnosis_y"], axis=1, inplace=True)
# FILLING NAN VALUES IN THE BIRTHDAY COLUMN WITH A DATE OF 0/0/0
merge_df["Birthday"].fillna("0/0/0", inplace=True)
# DROPPING ROWS WITH NAN VALUES IN THE EXAMINATION DATE COLUMN
merge_df.dropna(subset=["Examination Date"], inplace=True)
merge_df.reset_index(drop=True, inplace=True)
# CREATING A NEW COLUMN CALLED AGE AND FILLING IT WITH THE DIFFERENCE BETWEEN THE EXAMINATION DATE AND BIRTHDAY
merge_df["Age"] = 0
for i in range(len(merge_df)):
if merge_df["Birthday"][i] == "0/0/0":
merge_df["Age"][i] = "Not Available"
elif merge_df["Birthday"][i] != "0/0/0":
merge_df["Age"][i] = int(merge_df["Examination Date"][i].split("/")[2]) - int(
merge_df["Birthday"][i].split("/")[2]
)
else:
merge_df["Age"][i] = "Not Available"
# DROPPING THE KCT, RVVT, AND LAC COLUMNS SINCE THEY HAVE MORE THAN 70% MISSING VALUES
merge_df.drop(["KCT", "RVVT", "LAC"], axis=1, inplace=True)
# FILLING MISSING VALUES IN THE SYMPTOMS COLUMN WITH "None"
merge_df["Symptoms"].fillna("None", inplace=True)
merge_df.reset_index(drop=True, inplace=True)
# FILLING MISSING VALUES IN THE ANA COLUMN WITH "0"
merge_df["ANA"].fillna("0", inplace=True)
# FILLING THE VALUES IN THE ANA PATTERN COLUMN WITH "None" IF THE ANA COLUMN IS "0"
for i in range(len(merge_df)):
if merge_df["ANA"][i] == "0":
merge_df["ANA Pattern"][i] = "None"
else:
pass
# FILLING MISSING VALUES IN THE ANA PATTERN COLUMN WITH "Not Available"
merge_df["ANA Pattern"].fillna("Not Available", inplace=True)
# FILLING MISSING VALUES IN THE SEX COLUMN WITH "Not Available"
merge_df["SEX"].fillna("Not Available", inplace=True)
# DROPPING ROWS WITH MISSING VALUES IN THE ENTIRE DATAFRAME
merge_df.dropna(inplace=True)
merge_df.reset_index(drop=True, inplace=True)
print(
"The number of missing values in the new merged dataframe are: \n",
merge_df.isna().sum(),
)
The number of missing values in the new merged dataframe are:
ID 0
Examination Date 0
aCL IgG 0
aCL IgM 0
ANA 0
ANA Pattern 0
aCL IgA 0
Symptoms 0
Thrombosis 0
SEX 0
Birthday 0
Diagnosis 0
Age 0
dtype: int64
Final dataframe sample after handling missing values:
ID | Examination Date | aCL IgG | aCL IgM | ANA | ANA Pattern | aCL IgA | Symptoms | Thrombosis | SEX | Birthday | Diagnosis | Age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14872.0 | 5/27/97 | 1.3 | 1.6 | 256 | P | 0.0 | AMI | 1 | F | 9/21/53 | [mctd, ami] | 44 |
1 | 48473.0 | 12/21/92 | 4.3 | 4.6 | 256 | P,S | 3.3 | None | 0 | F | 10/7/48 | [sle] | 44 |
2 | 102490.0 | 4/20/95 | 2.3 | 2.5 | 0 | None | 3.5 | None | 0 | F | 4/1/82 | [pss] | 13 |
3 | 108788.0 | 5/6/97 | 0.0 | 0.0 | 16 | S | 0.0 | None | 0 | F | 3/15/42 | [sjs] | 55 |
4 | 122405.0 | 4/2/98 | 0.0 | 4.0 | 4 | P | 0.0 | None | 0 | F | 5/22/61 | [vertigo, sjs, sle] | 37 |
Converting the Examination Date and Date of the Test to Date Time Format:
Merging the Dataframe of Thrombosis examination (A) and Demographic information of the Patient (B) with the Hospital Records of every Patient (C):
ID | Examination Date | Date | GOT | GPT | LDH | ALP | TP | ALB | UA | UN | CRE | T-BIL | T-CHO | TG | CPK | GLU | WBC | RBC | HGB | HCT | PLT | PT | APTT | FG | PIC | TAT | TAT2 | U-PRO | IGG | IGA | IGM | CRP | RA | RF | C3 | C4 | RNP | SM | SC170 | SSA | SSB | CENTROMEA | DNA | DNA-II | Unnamed: 44 | Unnamed: 45 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14872.0 | 1997-05-27 | 1981-02-17 | 22 | 30 | 179 | 41 | 6.6 | 4.1 | 3.5 | 13.0 | 0.9 | 0.3 | 193 | 79 | NaN | NaN | 10 | 4.44 | 12.5 | 38.2 | 256 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | - | - | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 14872.0 | 1997-05-27 | 1981-03-16 | 22 | 19 | 155 | 37 | 7.2 | 4.5 | 3.2 | 12.0 | 0.8 | 0.3 | 185 | NaN | 13 | NaN | 7.2 | 4.5 | 11.8 | 37.7 | 199 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | - | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 14872.0 | 1997-05-27 | 1981-04-06 | 21 | 20 | 143 | 42 | 7.5 | 4.7 | 4.1 | 12.0 | 0.8 | 0.3 | 199 | NaN | 11 | NaN | 5.6 | 4.76 | 12.8 | 39.4 | 199 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | - | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 14872.0 | 1997-05-27 | 1981-04-27 | 32 | 20 | 218 | 43 | 7.2 | 4.5 | 3.5 | 9.0 | 0.9 | 0.3 | 185 | NaN | 22 | NaN | 3.3 | 4.76 | 12.3 | 39.4 | 183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | - | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 14872.0 | 1997-05-27 | 1981-05-01 | 33 | 19 | 277 | 38 | 6.9 | 4.4 | 3.3 | 11.0 | 0.8 | 0.3 | 184 | NaN | 17 | NaN | 5.8 | 4.43 | 11.7 | 36 | 215 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | - | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Dropping Rows and Columns according to Missing Values in the Merged Dataset:
# DROPPING UNNCESSARY COLUMNS CREATED WHILE IMPORTING THE DATA
df.drop(["Unnamed: 44", "Unnamed: 45", "CRP"], axis=1, inplace=True)
# DROPPING THE COLUMNS AND ROWS WITH MORE THAN 70% MISSING VALUES
df.dropna(thresh=len(df) * 0.3, axis=1, inplace=True)
df.dropna(thresh=len(df.columns) * 0.3, inplace=True)
df.reset_index(inplace=True)
Feature Extraction - Tagging the hospital records of Patient on the basis of the Thrombosis Examination:
Converting all the values in the dataframe to float (from string):
col_conv = [
"GOT",
"GPT",
"LDH",
"ALP",
"TP",
"ALB",
"UA",
"UN",
"CRE",
"T-BIL",
"T-CHO",
"TG",
"WBC",
"RBC",
"HGB",
"HCT",
"PLT",
"U-PRO",
"C3",
"C4",
]
for col in col_conv:
if df[col].dtype == "object":
for i in range(len(df)):
if df[col].isna()[i] == False:
if type(df[col][i]) == str:
df[col][i] = re.sub(r"[^\d.]", "", df[col][i])
if df[col][i] == "":
df[col][i] = 0
else:
df[col][i] = float(df[col][i])
else:
pass
else:
pass
else:
pass
# CREATING A COPY OF THE DATAFRAME TO AVOID CHANGING THE ORIGINAL DATAFRAME
df1 = df.copy()
# FILLING MISSING VALUES IN THE COLUMNS WITH 0
df1.fillna(0, inplace=True)
# CONVERTING THE COLUMNS TO FLOAT
for col in col_conv:
df[col] = pd.to_numeric(df[col], errors="coerce").astype(float)
# CREATING A NEW DATAFRAME WITH ONLY THE COLUMNS NEEDED FOR THE GROUPING
df2 = df1[
[
"ID",
"GOT",
"GPT",
"LDH",
"ALP",
"TP",
"ALB",
"UA",
"UN",
"CRE",
"T-BIL",
"T-CHO",
"TG",
"WBC",
"RBC",
"HGB",
"HCT",
"PLT",
"U-PRO",
"C3",
"C4",
"B/A_Tag",
]
]
Grouping the dataframe by ID
and Before-After Tag
and taking the mean of the values of each tests:
Merging the above dataset with our First Merge Dataset to get the final Data:
Final Data Cleaning:
# CREATING AGE BANDS FOR THE AGE COLUMN
for i in range(len(final_df)):
if final_df["age"][i] == "Not Available":
pass
else:
if 0 <= final_df["age"][i] <= 18:
final_df["age"][i] = "0-18"
elif 19 <= final_df["age"][i] <= 30:
final_df["age"][i] = "19-30"
elif 31 <= final_df["age"][i] <= 45:
final_df["age"][i] = "31-45"
elif 46 <= final_df["age"][i] <= 60:
final_df["age"][i] = "46-60"
else:
final_df["age"][i] = "61+"
# CONVERTING ANA COLUMN TO NUMERIC
for i in range(len(final_df)):
final_df["ana"][i] = re.sub(r"[^\d.]", "", final_df["ana"][i])
final_df["ana"] = pd.to_numeric(final_df["ana"], errors="coerce").astype(float)
# CREATING A THROMBOSIS DIAGNOSIS COLUMN
final_df["thrombosis_diagnosis"] = final_df["thrombosis"].apply(
lambda x: 1 if x > 0 else 0
)
# DROPPING UNNECESSARY COLUMNS
final_df.drop(["birthday", "examination _date"], axis=1, inplace=True)
# SAVING THE FINAL DATAFRAME TO A CSV FILE
final_df.to_csv("data/FINAL_DATA.csv", index=False)
Final Data Sample:
id | b/a__tag | got | gpt | ldh | alp | tp | alb | ua | un | cre | t-bil | t-cho | tg | wbc | rbc | hgb | hct | plt | u-pro | c3 | c4 | a_cl _ig_g | a_cl _ig_m | ana | ana _pattern | a_cl _ig_a | symptoms | thrombosis | sex | diagnosis | age | thrombosis_diagnosis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14872.0 | Before | 20.872340 | 15.829787 | 119.255319 | 42.085106 | 5.378723 | 3.285106 | 2.889362 | 8.319149 | 0.568085 | 0.259574 | 135.829787 | 15.234043 | 6.855319 | 3.720851 | 10.578723 | 32.304255 | 144.680851 | 0.000000 | 0.000000 | 0.00 | 1.3 | 1.6 | 256.0 | P | 0.0 | AMI | 1 | F | [mctd, ami] | 31-45 | 1 |
1 | 48473.0 | Before | 38.036364 | 45.745455 | 111.818182 | 83.763636 | 6.103636 | 3.598182 | 1.765455 | 11.181818 | 0.667273 | 0.500000 | 126.290909 | 23.490909 | 7.478182 | 4.728727 | 10.689091 | 33.834545 | 111.236364 | 0.163636 | 0.000000 | 0.00 | 4.3 | 4.6 | 256.0 | P,S | 3.3 | None | 0 | F | [sle] | 31-45 | 0 |
2 | 102490.0 | Before | 23.885714 | 16.685714 | 134.628571 | 117.942857 | 7.271429 | 4.097143 | 4.694286 | 12.400000 | 0.814286 | 0.557143 | 147.200000 | 40.457143 | 6.594286 | 3.335429 | 10.271429 | 31.265714 | 232.714286 | 0.028571 | 0.000000 | 0.00 | 2.3 | 2.5 | 0.0 | None | 3.5 | None | 0 | F | [pss] | 0-18 | 0 |
3 | 108788.0 | Before | 13.256410 | 8.717949 | 68.410256 | 38.897436 | 4.620513 | 2.443590 | 2.687179 | 7.384615 | 0.458974 | 0.394872 | 86.230769 | 5.820513 | 3.269231 | 3.594103 | 10.894872 | 33.502564 | 175.025641 | 0.076923 | 0.000000 | 0.00 | 0.0 | 0.0 | 16.0 | S | 0.0 | None | 0 | F | [sjs] | 46-60 | 0 |
4 | 122405.0 | Before | 30.833333 | 22.950000 | 126.766667 | 69.966667 | 7.626667 | 3.616667 | 4.045000 | 11.800000 | 0.708333 | 0.391667 | 184.283333 | 8.100000 | 2.981667 | 3.795167 | 10.638333 | 33.691667 | 217.616667 | 0.000000 | 32.233333 | 6.45 | 0.0 | 4.0 | 4.0 | P | 0.0 | None | 0 | F | [vertigo, sjs, sle] | 31-45 | 0 |
Read in the Data from Local Machine:
id | b/a__tag | got | gpt | ldh | alp | tp | alb | ua | un | cre | t-bil | t-cho | tg | wbc | rbc | hgb | hct | plt | u-pro | c3 | c4 | a_cl _ig_g | a_cl _ig_m | ana | ana _pattern | a_cl _ig_a | symptoms | thrombosis | sex | diagnosis | age | thrombosis_diagnosis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14872.0 | Before | 20.872340 | 15.829787 | 119.255319 | 42.085106 | 5.378723 | 3.285106 | 2.889362 | 8.319149 | 0.568085 | 0.259574 | 135.829787 | 15.234043 | 6.855319 | 3.720851 | 10.578723 | 32.304255 | 144.680851 | 0.000000 | 0.000000 | 0.00 | 1.3 | 1.6 | 256.0 | P | 0.0 | AMI | 1 | F | ['mctd', 'ami'] | 31-45 | 1 |
1 | 48473.0 | Before | 38.036364 | 45.745455 | 111.818182 | 83.763636 | 6.103636 | 3.598182 | 1.765455 | 11.181818 | 0.667273 | 0.500000 | 126.290909 | 23.490909 | 7.478182 | 4.728727 | 10.689091 | 33.834545 | 111.236364 | 0.163636 | 0.000000 | 0.00 | 4.3 | 4.6 | 256.0 | P,S | 3.3 | None | 0 | F | ['sle'] | 31-45 | 0 |
2 | 102490.0 | Before | 23.885714 | 16.685714 | 134.628571 | 117.942857 | 7.271429 | 4.097143 | 4.694286 | 12.400000 | 0.814286 | 0.557143 | 147.200000 | 40.457143 | 6.594286 | 3.335429 | 10.271429 | 31.265714 | 232.714286 | 0.028571 | 0.000000 | 0.00 | 2.3 | 2.5 | 0.0 | None | 3.5 | None | 0 | F | ['pss'] | 0-18 | 0 |
3 | 108788.0 | Before | 13.256410 | 8.717949 | 68.410256 | 38.897436 | 4.620513 | 2.443590 | 2.687179 | 7.384615 | 0.458974 | 0.394872 | 86.230769 | 5.820513 | 3.269231 | 3.594103 | 10.894872 | 33.502564 | 175.025641 | 0.076923 | 0.000000 | 0.00 | 0.0 | 0.0 | 16.0 | S | 0.0 | None | 0 | F | ['sjs'] | 46-60 | 0 |
4 | 122405.0 | Before | 30.833333 | 22.950000 | 126.766667 | 69.966667 | 7.626667 | 3.616667 | 4.045000 | 11.800000 | 0.708333 | 0.391667 | 184.283333 | 8.100000 | 2.981667 | 3.795167 | 10.638333 | 33.691667 | 217.616667 | 0.000000 | 32.233333 | 6.45 | 0.0 | 4.0 | 4.0 | P | 0.0 | None | 0 | F | ['vertigo', 'sjs', 'sle'] | 31-45 | 0 |
Setting the default style of the plots:
Information about the dataset features:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 682 entries, 0 to 681
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 682 non-null float64
1 b/a__tag 682 non-null object
2 got 682 non-null float64
3 gpt 682 non-null float64
4 ldh 682 non-null float64
5 alp 682 non-null float64
6 tp 682 non-null float64
7 alb 682 non-null float64
8 ua 682 non-null float64
9 un 682 non-null float64
10 cre 682 non-null float64
11 t-bil 682 non-null float64
12 t-cho 682 non-null float64
13 tg 682 non-null float64
14 wbc 682 non-null float64
15 rbc 682 non-null float64
16 hgb 682 non-null float64
17 hct 682 non-null float64
18 plt 682 non-null float64
19 u-pro 682 non-null float64
20 c3 682 non-null float64
21 c4 682 non-null float64
22 a_cl _ig_g 682 non-null float64
23 a_cl _ig_m 682 non-null float64
24 ana 682 non-null float64
25 ana _pattern 682 non-null object
26 a_cl _ig_a 682 non-null float64
27 symptoms 682 non-null object
28 thrombosis 682 non-null int64
29 sex 682 non-null object
30 diagnosis 682 non-null object
31 age 682 non-null object
32 thrombosis_diagnosis 682 non-null int64
dtypes: float64(25), int64(2), object(6)
memory usage: 176.0+ KB
Statistical Analysis of the Dataset:
id | got | gpt | ldh | alp | tp | alb | ua | un | cre | t-bil | t-cho | tg | wbc | rbc | hgb | hct | plt | u-pro | c3 | c4 | a_cl _ig_g | a_cl _ig_m | ana | a_cl _ig_a | thrombosis | thrombosis_diagnosis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 6.820000e+02 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 | 682.000000 |
mean | 4.701481e+06 | 25.554588 | 25.432193 | 348.515716 | 128.068644 | 6.288748 | 3.487965 | 3.632142 | 11.902266 | 0.545950 | 0.420021 | 148.807125 | 80.852254 | 6.463791 | 4.089160 | 12.100520 | 36.489660 | 224.730387 | 8.664434 | 49.893716 | 15.610479 | 16.571584 | 280.929619 | 478.392962 | 77.456452 | 0.199413 | 0.139296 |
std | 1.369721e+06 | 29.433264 | 27.838277 | 204.129160 | 85.133350 | 1.476372 | 0.909155 | 1.331038 | 5.051944 | 0.237703 | 0.420512 | 57.803023 | 65.466995 | 2.810230 | 0.496685 | 1.524636 | 4.370481 | 85.121063 | 50.517087 | 26.652478 | 9.944405 | 109.647864 | 7165.056400 | 1095.472267 | 1858.855122 | 0.545815 | 0.346509 |
min | 1.487200e+04 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.942857 | 2.090000 | 6.100000 | 18.150000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 4.537218e+06 | 15.255682 | 12.000000 | 254.836957 | 83.458766 | 5.696410 | 3.166250 | 2.807353 | 9.054924 | 0.425875 | 0.263356 | 116.139161 | 39.856456 | 4.483442 | 3.811571 | 11.300000 | 34.076639 | 173.307018 | 0.000000 | 33.000000 | 8.498464 | 0.000000 | 1.500000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 5.317729e+06 | 19.000000 | 17.000000 | 327.722222 | 113.061224 | 6.630128 | 3.693990 | 3.600000 | 11.313333 | 0.520000 | 0.355495 | 153.000000 | 70.200521 | 5.812500 | 4.124375 | 12.176562 | 36.732796 | 215.785714 | 0.000000 | 51.873016 | 15.000000 | 1.000000 | 2.200000 | 16.000000 | 1.800000 | 0.000000 | 0.000000 |
75% | 5.555816e+06 | 25.232955 | 27.887821 | 409.270534 | 153.805978 | 7.200000 | 4.100000 | 4.323295 | 14.000000 | 0.618566 | 0.500000 | 187.967742 | 108.187500 | 7.778217 | 4.395000 | 13.189904 | 39.483462 | 263.503906 | 1.428571 | 67.080357 | 21.245192 | 2.000000 | 4.000000 | 256.000000 | 6.300000 | 0.000000 | 0.000000 |
max | 5.779550e+06 | 422.500000 | 320.000000 | 3159.000000 | 940.500000 | 9.700000 | 5.000000 | 9.706218 | 60.264249 | 3.543161 | 8.635294 | 433.000000 | 569.000000 | 26.866667 | 5.687143 | 16.900000 | 49.685714 | 685.000000 | 1000.000000 | 154.500000 | 71.000000 | 1502.400000 | 187122.000000 | 4096.000000 | 48547.000000 | 3.000000 | 1.000000 |
Distribution of the Age of the Patients:
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A COUNT PLOT
sns.countplot(
x="age", data=final_df, ax=ax, order=["0-18", "19-30", "31-45", "46-60", "61+"]
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_title("Age Distribution of the Patients", fontsize=15)
ax.set_xlabel("Age Bands", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
# SETTING THE GRID LINES AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
_ = ax.bar_label(ax.containers[0])
plt.show()
Majority of the patients that tested for thrombosis were in the age range of 19-30 years. However, this does not mean that this age group is more likely to get diagnosed with Thrombosis than others. This just shows that the age group of 19-30 years is the most common age group in the dataset and were tested for thrombosis the most number of times.
The Distribution of the number of people who were diagnosed with Thrombosis along with those who were not:
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A BAR PLOT
sns.barplot(
x=final_df["thrombosis_diagnosis"].value_counts().index,
y=final_df["thrombosis_diagnosis"].value_counts().values,
ax=ax,
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_title("Distribution of Positive and Negative Thrombosis Patients", fontsize=15)
ax.set_xlabel("Diagnosis", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
_ = ax.set_xticklabels(["Negative", "Positive"])
# SETTING THE GRID LINES AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
_ = ax.bar_label(ax.containers[0])
plt.show()
Only 16% of the patients who went for the Thrombosis Test were diagnosed Positive.
Outlier Detection of all the features present in the Dataset:
# DIVIDING THE COLUMNS INTO GROUPS
cols1 = ["got", "gpt", "ldh", "alp", "tp", "alb"]
cols2 = ["ua", "un", "cre", "t-bil", "t-cho", "tg", "u-pro"]
cols3 = ["wbc", "rbc", "hgb", "hct", "plt", "c3", "c4"]
cols4 = ["a_cl _ig_g", "a_cl _ig_m", "ana", "a_cl _ig_a"]
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
# CREATING A BOXPLOT FOR EACH GROUP
sns.boxplot(data=final_df[cols1], ax=ax[0, 0])
sns.boxplot(data=final_df[cols2], ax=ax[0, 1])
sns.boxplot(data=final_df[cols3], ax=ax[1, 0])
sns.boxplot(data=final_df[cols4], ax=ax[1, 1])
# SETTING THE TITLE
plt.suptitle("Outlier Detection of the Features", fontsize=16, y=0.94)
ax[0, 0].set_title("Blood Chemistry", fontsize=14)
ax[0, 1].set_title("Urinalysis", fontsize=14)
ax[1, 0].set_title("Complete Blood Count", fontsize=14)
ax[1, 1].set_title("Immunology", fontsize=14)
# SETTING THE TICKSIZE AND GRID LINES
for i in range(2):
for j in range(2):
ax[i, j].tick_params(labelsize=11)
ax[i, j].grid(True, axis="y", linestyle=":", linewidth=1)
plt.show()
Some features like “u-pro (Proteinuria)”, “ldh (Lactate Dehydrogenase)” and “plt (Platelet) have a lot of outliers present while features like”acl_igA” and “acl_igM” have less number of outliers but it is a very large value which needs to be dealt with.
Distribution of Diagnosis of Thrombosis with respect to the gender of the patient:
# CREATING A NEW DATAFRAME FOR GENDER PRECENT DISTRIBUTION
x, y = "sex", "thrombosis_diagnosis"
new = final_df[final_df[x] != "Not Available"]
new = new.groupby(x)[y].value_counts(normalize=True)
new = new.mul(100)
new = new.rename("Percent").reset_index()
new["thrombosis_diagnosis"] = new["thrombosis_diagnosis"].apply(
lambda x: "Thrombosis" if x > 0 else "No Thrombosis"
)
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A BAR PLOT
sns.barplot(x="sex", y="Percent", hue="thrombosis_diagnosis", data=new, ax=ax)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_ylim(0, 100)
ax.set_title(
"Thrombosis Diagnosis Distribution w.r.t Gender of the Patient", fontsize=15
)
ax.set_xlabel("Gender", fontsize=13)
ax.set_ylabel("Percent", fontsize=13)
ax.tick_params(labelsize=11)
ax.set_xticklabels(["Female", "Male"])
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
ax.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
for p in ax.patches:
txt = str(p.get_height().round(2)) + "%"
txt_x = p.get_x()
txt_y = p.get_height()
ax.text(txt_x + 0.1, txt_y, txt, va="bottom", fontsize=11)
plt.savefig("plot_outputs/plot-01.png")
plt.show()
From the above graph it is evident that Females are more likely to test positive for Thrombosis as compared to Males. But these reports are from just one hospital and this observation cannot be generalized for the entire world. To make that generalization we would need more information about the Demographics of the area where the University is located and to what extent have those factors affected the patients.
Distribution of the severity of Thrombosis among different Age groups:
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A COUNT PLOT
sns.countplot(
x=final_df["thrombosis"][final_df["thrombosis"] > 0],
ax=ax,
hue=final_df["age"],
hue_order=["0-18", "19-30", "31-45", "46-60", "61+"],
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Distribution of Degrees of Thrombosis w.r.t Age Bands", fontsize=15)
ax.set_title(
"1 = Positive | 2 = Positive and very severe | 3 = Positive and extremely severe",
fontsize=11,
)
ax.set_xlabel("Thrombosis", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
ax.set_yticks(np.arange(0, 21, 5))
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
for container in ax.containers:
_ = ax.bar_label(container)
ax.legend(title="Age Bands", title_fontsize=11, fontsize=10)
plt.savefig("plot_outputs/plot-02.png")
plt.show()
There are a lot of patients for Mild Thrombosis in all age bands as compared to other Degrees of Thrombosis, but Patients from the age of 19 to 45 have the highest probability of getting diagnosed with Mild Thrombosis. This pattern changes for Severe Thrombosis where patients who are not adults yet (0-18 years) are most likely to get diagnosed with Severe Thrombosis. And finally, patients in the age group of 31-60 get diagnosed with Extremely Severe Thrombosis. It can be inferred from this data that people of age 61 and above are least likely to get diagnosed with Thrombosis.
Predictors that are correlated to Thrombosis (target variable):
# CREATING A CORRELATION MATRIX
corr_df = final_df.drop(["id", "thrombosis_diagnosis"], axis=1)
# CREATING A CORRELATION PLOT
corr = corr_df.corr()
corr["thrombosis"].sort_values(ascending=False)
# SETTING THE FIGURE SIZE
plt.figure(figsize=(8, 5))
# CREATING A BAR PLOT BASED ON THE CORRELATION COEFFICIENTS
corr["thrombosis"].sort_values(ascending=False).plot(kind="bar")
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.title("Correlation between Thrombosis and other Variables", fontsize=15)
plt.grid(True, axis="y", linestyle=":", linewidth=1)
plt.ylabel("Correlation Coefficient", fontsize=13)
plt.xlabel("Variables", fontsize=13)
plt.tick_params(labelsize=11)
plt.savefig("plot_outputs/plot-03.png")
plt.show()
Features that are highly correlated to Thrombosis (target variable) are:
- acl_igG (Anticardiolipin Antibody IgG)
- ANA (Antinuclear Antibody)
- U-Pro (Proteinuria)
- GPT (ALT glutamic pylvic transaminase)
- GOT (AST glutamic oxaloacetic transaminase)
- C3 (Complement 3)
- C4 (Complement 4)
- RBC (Red Blood Cells)
- HCT (Hematoclit)
- PLT (Platelet)
- HGB (Hemoglobin)
Comparing Different Medical Tests with Tests specific to Thrombosis (Anti-Cardiolipin Antibody (IgG)):
Since Anti-Cardiolipin Antibody (IgG) is the correlated feature to Thrombosis, we will compare it with other medical tests to see if they can be used to improve the accuracy of the diagnosis.
Anti-Cardiolipin Antibody (IgG) V/S ALT glutamic pylvic transaminase (GPT):
# CREATING A NEW DATAFRAME FOR COMPARISON BETWEEN MEDICAL TESTS AND THROMBOSIS SPECIFIC TESTS
new_df = final_df[final_df["a_cl _ig_g"] < 100]
new_df["thrombosis_diagnosis"] = new_df["thrombosis_diagnosis"].apply(
lambda x: "Thrombosis" if x > 0 else "No Thrombosis"
)
new_df1 = final_df[
(final_df["a_cl _ig_g"] < 100) & (final_df["thrombosis_diagnosis"] > 0)
]
new_df1["thrombosis"] = new_df["thrombosis"].apply(
lambda x: "Severe" if x > 1 else "Mild"
)
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="gpt",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="gpt",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle(
"Anti-Cardiolipin Antibody V/S ALT glutamic pylvic transaminase", fontsize=15
)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=12)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "ALT glutamic pylvic transaminase (GPT)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-04.png")
plt.show()
ALT glutamic pylvic transaminase (GPT) has a normal range of <60. It can be observed that many patients who were diagnosed with Thrombosis had their GPT <60 and there were only a few exceptions where patients with GPT outside the normal range were also diagnosed with Thrombosis. Delving further into Degrees of Thrombosis, it can be observed that most of the patients who are diagnosed with thrombosis have normal GPT except a few patients. Concentrating on the patient who has GPT of 300+, the patient has a severe case of thrombosis which is interesting. This shows that GPT is not a good test to be taken into consideration while diagnosing Mild Thrombosis but further research with regards to abnormally high GPT and Severeness of Thrombosis can be performed with a larger volume of data.
Anti-Cardiolipin Antibody (IgG) V/S AST glutamic oxaloacetic transaminase (GOT):
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="got",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="got",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle(
"Anti-Cardiolipin Antibody V/S AST glutamic oxaloacetic transaminase", fontsize=15
)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(
0.5, 0.02, "AST glutamic oxaloacetic transaminase (GOT)", ha="center", fontsize=13
)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-05.png")
plt.show()
AST glutamic oxaloacetic transaminase (GOT) has a normal range of <60 (similar to GPT). It can be observed that many patients who were diagnosed with Thrombosis had their GOT <60 and there were some exceptions where patients with GPT outside the normal range were also diagnosed with Thrombosis. Delving further into Degrees of Thrombosis, it can be observed that most of the patients who are diagnosed with thrombosis have normal GPT except a few patients. Concentrating on the patients who have GOT of 100+, the patients have a severe case of thrombosis which is interesting. This shows that GPT is not a good test to be taken into consideration while diagnosing Mild Thrombosis but can be used while diagnosing cases of Severe Thrombosis.
Anti-Cardiolipin Antibody (IgG) V/S Complement 3 (C3):
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="c3",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="c3",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Complement 3", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Complement 3 (C3)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-06.png")
plt.show()
Complement 3 (C3) has a normal range of >35. It can be observed that many patients who were diagnosed with Thrombosis had their C3 <35 which implies that the patients who have an abnormal C3 are more prone to Thrombosis. Delving further into Degrees of Thrombosis, there is an equal distribution for Mild and Severe degrees of Thrombosis. This shows that C3 is a good test to be taken into consideration while diagnosing Thrombosis.
Anti-Cardiolipin Antibody (IgG) V/S Complement 4 (C4):
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="c4",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="c4",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Complement 4", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Complement 4 (C4)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-07.png")
plt.show()
Complement 4 (C4) has a normal range of >10. It can be observed that almost all patients who were diagnosed with Thrombosis had their C4 <18 which implies that the patients who have an abnormal C4 (from 0 to 15) are more prone to Thrombosis. Delving further into Degrees of Thrombosis, there is an equal distribution for Mild and Severe degrees of Thrombosis. This shows that C4 is a good test to be taken into consideration while diagnosing Thrombosis.
Anti-Cardiolipin Antibody (IgG) V/S Hemoglobin (HGB):
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="hgb",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="hgb",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Hemoglobin", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Hemoglobin (HGB)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-08.png")
plt.show()
Hemoglobin (HGB) has a normal range between 10 and 17. It can be observed that all patients who were diagnosed with Thrombosis had their HGB between 10 and 17. This shows that HGB is not a good test to be taken into consideration while diagnosing Thrombosis.
Anti-Cardiolipin Antibody (IgG) V/S Hematoclit (HCT):
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="hct",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="hct",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Hematoclit", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Hematoclit (HCT)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-09.png")
plt.show()
Hematoclit (HCT) has a normal range between 29 and 52. It can be observed that more than 95% of the patients who were diagnosed with Thrombosis had their HCT between 29 and 52. This shows that HCT is not a good test to be taken into consideration while diagnosing Thrombosis.
Anti-Cardiolipin Antibody (IgG) V/S Platelet Count (PLT):
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="plt",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="plt",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Platelet Count", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Platelet Count (PLT)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-10.png")
plt.show()
Platelet (PLT) has a normal range between 100 and 400. It can be observed that more than 95% of the patients who were diagnosed with Thrombosis had their PLT between 100 and 400. This shows that PLT is not a good test to be taken into consideration while diagnosing Thrombosis.
Anti-Cardiolipin Antibody (IgG) V/S Proteinuria (U-PRO):
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="u-pro",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="u-pro",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Proteinuria", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Proteinuria (U-PRO)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-11.png")
plt.show()
Proteinuria (U-PRO) has a normal range between 0 and 30. It can be observed that many patients who were diagnosed with Thrombosis had their U-PRO between 0 and 30 and there were only a few exceptions where patients with U-PRO outside the normal range were also diagnosed with Thrombosis. Delving further into Degrees of Thrombosis, it can be observed that most of the patients who are diagnosed with thrombosis have normal U-PRO except a few patients. Concentrating on the patients who have U-PRO of 100+, the patients have a severe case of thrombosis which is interesting. This shows that U-PRO is not a good test to be taken into consideration while diagnosing Mild Thrombosis but further research with regards to abnormally high U-PRO and Severeness of Thrombosis can be performed with a larger volume of data.
The 3 datasets provided by the University Hospital revolved mainly around the Thrombosis Diagnosis. All the datasets had ID of the patient which was the joining parameter used to perform analytical joins. These datasets roughly covered the demographic aspect of each patient, the medical test history of each patient and the factors for thrombosis diagnosis of some patients
. There were tons of missing data in all these datasets since there are patients who opt-out of providing their personal information and/or have done only a few medical tests per visit. These missing values were carefully imputed
after going through entire datasets and also with the help of some domain knowledge. Along with the imputation, the columns and rows which had more than 70% missing values were dropped from the data to avoid assumption of wrong values to impute. Some feature extraction
was performed such as calculating the Age of the Patient at the time when Thrombosis tests were performed which would be later useful for analysis.
After getting the final dataframe, Exploratory Data Analysis (Preliminary and Visual)
was performed to help understand the data better and answer the initial questions. Most of the Patients who tested for Thrombosis were between ages 19 to 45 and only a few patients who were more than 61 years of age tested for Thrombosis. Further, the distribution of Positive and Negative Thrombosis patients after testing had a huge difference (only 16% of the patients tested were diagnosed as positive). If further statistical analysis and machine learning algorithms are to be applied to predict the Thrombosis results, more data for Positive thrombosis patients need to be introduced since the dataset is imbalanced currently. Patients from the age of 19 to 45 have the highest probability of getting diagnosed with Mild Thrombosis. This pattern changes for Severe Thrombosis where patients who are not adults yet (0-18 years) are most likely to get diagnosed with Severe Thrombosis. Females are more likely to test positive for Thrombosis as compared to Males but these reports are from just one hospital and this observation cannot be generalized for the entire world.
Apart from Thrombosis specific tests such as acl_igG (Anticardiolipin Antibody IgG) and ANA (Antinuclear Antibody), common medical tests such as U-Pro (Proteinuria), GPT (ALT glutamic pylvic transaminase), GOT (AST glutamic oxaloacetic transaminase), C3 and C4 (Complement 3 and 4), RBC (Red Blood Cells), HCT (Hematoclit), PLT (Platelet) and HGB (Hemoglobin) were found to be correlated to the Thrombosis Diagnosis. Since, correlation coefficients cannot be the only measure to see if these tests are relevant while diagnosis Thrombosis, further Visual analysis was performed if there are any significant patterns in these tests to look out for while diagnosing Thrombosis.
Tests like C3 and C4 had significant patterns while tests like HGB, HCT and Platelet count did not show any significant patterns while diagnosing Thrombosis. Some tests like U-PRO, GOT and GPT had few patterns that might be good to look at while diagnosing extreme severe cases of Thrombosis. Overall the data required a lot of assumptions while imputing the missing values and the analysis performed was only preliminary. These analysis can be further broken down to the granular level to understand the factors affecting Thrombosis Diagnosis by performing Machine Learning algorithms to predict the Thrombosis Diagnosis.
---
title: EDA on Medical Data of Thrombosis Diagnosis
title-location: center
format:
html:
embed-resources: true
self-contained: true
theme: lumen
toc: true
toc-location: right
toc-depth: 6
code-fold: true
code-tools: true
page-layout: full
jupyter: python3
---
# Data Summary
`Collagen` is a fibrous protein found in cartilage and other connective tissue. Collagen diseases are autoimmune diseases in which the immune system of the body attacks its own skin, tissues, and organs. For example, if a patient generates antibodies for lung, they will lose their ability to do respiration and will die. The extent and causes of these diseases are partially known and not well understood and hence their classification can be a challenging task.<br>
One of these diseases is `Thrombosis`, which is an important and severe complication and is also one of the major causes of death in collagen diseases. It was recently discovered by medical physicians that Thrombosis is closely related to anti-cardiolipin antibodies. The Databases used in this project are donated by one of these physicians from a University Hospital where patients came regarding collagen diseases and were recommended by their local physicians, home doctors, and other medical specialists.
# Initial Questions
1. Is it possible for some age bands to be more likely to get diagnosed with higher degrees of Thrombosis than others?
2. Are females more likely to get diagnosed with Thrombosis or is it vice-versa?
3. How can other medical tests be incorporated to improve the accuracy of the diagnosis?
# Data Munging
## Import Libraries
```{python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import datetime
import snakecase
import re
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
```
## Import Data
```{python}
a = pd.read_csv("data/TSUMOTO_A.CSV")
b = pd.read_csv("data/TSUMOTO_B.CSV")
c = pd.read_csv("data/TSUMOTO_C.CSV")
```
### Data Sample
**Tsumoto_A - Basic Information about Patients (Input by Experts). This dataset includes all patients**
```{python}
a.head()
```
**Tsumoto_B - Special Laboratory Examinations (Input by Experts) (Measured by the Laboratory on Collagen Diseases)This dataset does not include all the patients, but includes the patients with these special tests**
```{python}
b.head()
```
**Tsumoto_C - Laboratory Examinations stored in Hospital Information Systems (Stored from 1980 to 1999.3) All the data includes ordinary laboratory examinations and have temporal stamps**
```{python}
c.head()
```
## Data Cleaning
### Tsumoto_A and Tsumoto_B Merge
**Merge B with A on `ID`:**
```{python}
b_new = b.dropna(subset=["ID"]) # DROP ROWS WITH NA IN ID COLUMN
merge_df = pd.merge(
b_new, a[["ID", "SEX", "Birthday", "Diagnosis"]], on="ID", how="left"
)
merge_df.head()
```
#### Missing Values
```{python}
print("The number of rows and columns in the new merged dataframe are:", merge_df.shape)
print("--------------------------------------------------------------------------")
print(
"The number of missing values in the new merged dataframe are: \n",
merge_df.isna().sum(),
)
```
**Handling missing values in the merged dataset:**
```{python}
# MERGING THE 2 DIAGNOSIS COLUMNS INTO 1
# CREATING LISTS OF DIAGNOSIS FOR EACH ROW
for i in range(len(merge_df)):
if merge_df["Diagnosis_x"].isna()[i] == False:
merge_df["Diagnosis_x"][i] = merge_df["Diagnosis_x"][i].split(",")
if merge_df["Diagnosis_y"].isna()[i] == False:
merge_df["Diagnosis_y"][i] = merge_df["Diagnosis_y"][i].split(",")
# CREATING A NEW COLUMN CALLED DIAGNOSIS AND FILLING IT WITH THE APPROPRIATE DIAGNOSIS
merge_df["Diagnosis"] = ""
for i in range(len(merge_df)):
if (merge_df["Diagnosis_x"].isna()[i] == False) & (
merge_df["Diagnosis_y"].isna()[i] == True
):
merge_df["Diagnosis"][i] = merge_df["Diagnosis_x"][i]
elif (merge_df["Diagnosis_x"].isna()[i] == True) & (
merge_df["Diagnosis_y"].isna()[i] == False
):
merge_df["Diagnosis"][i] = merge_df["Diagnosis_y"][i]
elif (merge_df["Diagnosis_x"].isna()[i] == False) & (
merge_df["Diagnosis_y"].isna()[i] == False
):
merge_df["Diagnosis"][i] = list(
set(merge_df["Diagnosis_x"][i] + merge_df["Diagnosis_y"][i])
)
# REMOVING THE DUPLICATES IN THE DIAGNOSIS COLUMN
for i in range(len(merge_df)):
for j in range(len(merge_df["Diagnosis"][i])):
merge_df["Diagnosis"][i][j] = merge_df["Diagnosis"][i][j].strip()
merge_df["Diagnosis"][i][j] = merge_df["Diagnosis"][i][j].lower()
merge_df["Diagnosis"][i] = list(set(merge_df["Diagnosis"][i]))
if merge_df["Diagnosis"][i] == []:
merge_df["Diagnosis"][i] = "No Diagnosis"
else:
pass
# DROPPING THE ORIGINAL DIAGNOSIS COLUMNS
merge_df.drop(["Diagnosis_x", "Diagnosis_y"], axis=1, inplace=True)
# FILLING NAN VALUES IN THE BIRTHDAY COLUMN WITH A DATE OF 0/0/0
merge_df["Birthday"].fillna("0/0/0", inplace=True)
# DROPPING ROWS WITH NAN VALUES IN THE EXAMINATION DATE COLUMN
merge_df.dropna(subset=["Examination Date"], inplace=True)
merge_df.reset_index(drop=True, inplace=True)
# CREATING A NEW COLUMN CALLED AGE AND FILLING IT WITH THE DIFFERENCE BETWEEN THE EXAMINATION DATE AND BIRTHDAY
merge_df["Age"] = 0
for i in range(len(merge_df)):
if merge_df["Birthday"][i] == "0/0/0":
merge_df["Age"][i] = "Not Available"
elif merge_df["Birthday"][i] != "0/0/0":
merge_df["Age"][i] = int(merge_df["Examination Date"][i].split("/")[2]) - int(
merge_df["Birthday"][i].split("/")[2]
)
else:
merge_df["Age"][i] = "Not Available"
# DROPPING THE KCT, RVVT, AND LAC COLUMNS SINCE THEY HAVE MORE THAN 70% MISSING VALUES
merge_df.drop(["KCT", "RVVT", "LAC"], axis=1, inplace=True)
# FILLING MISSING VALUES IN THE SYMPTOMS COLUMN WITH "None"
merge_df["Symptoms"].fillna("None", inplace=True)
merge_df.reset_index(drop=True, inplace=True)
# FILLING MISSING VALUES IN THE ANA COLUMN WITH "0"
merge_df["ANA"].fillna("0", inplace=True)
# FILLING THE VALUES IN THE ANA PATTERN COLUMN WITH "None" IF THE ANA COLUMN IS "0"
for i in range(len(merge_df)):
if merge_df["ANA"][i] == "0":
merge_df["ANA Pattern"][i] = "None"
else:
pass
# FILLING MISSING VALUES IN THE ANA PATTERN COLUMN WITH "Not Available"
merge_df["ANA Pattern"].fillna("Not Available", inplace=True)
# FILLING MISSING VALUES IN THE SEX COLUMN WITH "Not Available"
merge_df["SEX"].fillna("Not Available", inplace=True)
# DROPPING ROWS WITH MISSING VALUES IN THE ENTIRE DATAFRAME
merge_df.dropna(inplace=True)
merge_df.reset_index(drop=True, inplace=True)
print(
"The number of missing values in the new merged dataframe are: \n",
merge_df.isna().sum(),
)
```
**Final dataframe sample after handling missing values:**
```{python}
merge_df.head()
```
**Converting the Examination Date and Date of the Test to Date Time Format:**
```{python}
c["Date"] = pd.to_datetime(c["Date"], format="%y%m%d")
merge_df["Examination Date"] = pd.to_datetime(merge_df["Examination Date"], format="%x")
```
### Tsumoto_C Merge
**Merging the Dataframe of Thrombosis examination (A) and Demographic information of the Patient (B) with the Hospital Records of every Patient (C):**
```{python}
df = pd.merge(merge_df[["ID", "Examination Date"]], c, on=["ID"], how="left")
df.head()
```
#### Missing Values
**Dropping Rows and Columns according to Missing Values in the Merged Dataset:**
```{python}
# DROPPING UNNCESSARY COLUMNS CREATED WHILE IMPORTING THE DATA
df.drop(["Unnamed: 44", "Unnamed: 45", "CRP"], axis=1, inplace=True)
# DROPPING THE COLUMNS AND ROWS WITH MORE THAN 70% MISSING VALUES
df.dropna(thresh=len(df) * 0.3, axis=1, inplace=True)
df.dropna(thresh=len(df.columns) * 0.3, inplace=True)
df.reset_index(inplace=True)
```
**Feature Extraction - Tagging the hospital records of Patient on the basis of the Thrombosis Examination:**
```{python}
df["B/A_Tag"] = ""
for i in range(len(df)):
if (df["Examination Date"][i] - df["Date"][i]).days >= 0:
df["B/A_Tag"][i] = "Before"
else:
df["B/A_Tag"][i] = "After"
```
**Converting all the values in the dataframe to float (from string):**
```{python}
col_conv = [
"GOT",
"GPT",
"LDH",
"ALP",
"TP",
"ALB",
"UA",
"UN",
"CRE",
"T-BIL",
"T-CHO",
"TG",
"WBC",
"RBC",
"HGB",
"HCT",
"PLT",
"U-PRO",
"C3",
"C4",
]
for col in col_conv:
if df[col].dtype == "object":
for i in range(len(df)):
if df[col].isna()[i] == False:
if type(df[col][i]) == str:
df[col][i] = re.sub(r"[^\d.]", "", df[col][i])
if df[col][i] == "":
df[col][i] = 0
else:
df[col][i] = float(df[col][i])
else:
pass
else:
pass
else:
pass
# CREATING A COPY OF THE DATAFRAME TO AVOID CHANGING THE ORIGINAL DATAFRAME
df1 = df.copy()
# FILLING MISSING VALUES IN THE COLUMNS WITH 0
df1.fillna(0, inplace=True)
# CONVERTING THE COLUMNS TO FLOAT
for col in col_conv:
df[col] = pd.to_numeric(df[col], errors="coerce").astype(float)
# CREATING A NEW DATAFRAME WITH ONLY THE COLUMNS NEEDED FOR THE GROUPING
df2 = df1[
[
"ID",
"GOT",
"GPT",
"LDH",
"ALP",
"TP",
"ALB",
"UA",
"UN",
"CRE",
"T-BIL",
"T-CHO",
"TG",
"WBC",
"RBC",
"HGB",
"HCT",
"PLT",
"U-PRO",
"C3",
"C4",
"B/A_Tag",
]
]
```
**Grouping the dataframe by `ID` and `Before-After Tag` and taking the mean of the values of each tests:**
```{python}
df3 = df2.groupby(["ID", "B/A_Tag"], as_index=False, dropna=True).mean()
```
### Final Merge
**Merging the above dataset with our First Merge Dataset to get the final Data:**
```{python}
final_df = pd.merge(df3, merge_df, on=["ID"], how="left")
final_df.sort_values(by=["ID"], inplace=True)
# CONVERTING THE COLUMN NAMES TO SNAKECASE
final_df.columns = final_df.columns.map(snakecase.convert)
```
**Final Data Cleaning:**
```{python}
# CREATING AGE BANDS FOR THE AGE COLUMN
for i in range(len(final_df)):
if final_df["age"][i] == "Not Available":
pass
else:
if 0 <= final_df["age"][i] <= 18:
final_df["age"][i] = "0-18"
elif 19 <= final_df["age"][i] <= 30:
final_df["age"][i] = "19-30"
elif 31 <= final_df["age"][i] <= 45:
final_df["age"][i] = "31-45"
elif 46 <= final_df["age"][i] <= 60:
final_df["age"][i] = "46-60"
else:
final_df["age"][i] = "61+"
# CONVERTING ANA COLUMN TO NUMERIC
for i in range(len(final_df)):
final_df["ana"][i] = re.sub(r"[^\d.]", "", final_df["ana"][i])
final_df["ana"] = pd.to_numeric(final_df["ana"], errors="coerce").astype(float)
# CREATING A THROMBOSIS DIAGNOSIS COLUMN
final_df["thrombosis_diagnosis"] = final_df["thrombosis"].apply(
lambda x: 1 if x > 0 else 0
)
# DROPPING UNNECESSARY COLUMNS
final_df.drop(["birthday", "examination _date"], axis=1, inplace=True)
# SAVING THE FINAL DATAFRAME TO A CSV FILE
final_df.to_csv("data/FINAL_DATA.csv", index=False)
```
**Final Data Sample:**
```{python}
final_df.head()
```
# Exploratory Analysis
Read in the Data from Local Machine:
```{python}
final_df = pd.read_csv("data/FINAL_DATA.csv")
final_df.head()
```
Setting the default style of the plots:
```{python}
sns.set_style("white")
sns.set_palette("Accent")
```
## Preliminary Analysis
**Information about the dataset features:**
```{python}
final_df.info()
```
**Statistical Analysis of the Dataset:**
```{python}
final_df.describe()
```
## Visual Analysis
**Distribution of the Age of the Patients:**
```{python}
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A COUNT PLOT
sns.countplot(
x="age", data=final_df, ax=ax, order=["0-18", "19-30", "31-45", "46-60", "61+"]
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_title("Age Distribution of the Patients", fontsize=15)
ax.set_xlabel("Age Bands", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
# SETTING THE GRID LINES AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
_ = ax.bar_label(ax.containers[0])
plt.show()
```
_Majority of the patients that tested for thrombosis were in the age range of 19-30 years. However, this does not mean that this age group is more likely to get diagnosed with Thrombosis than others. This just shows that the age group of 19-30 years is the most common age group in the dataset and were tested for thrombosis the most number of times._
**The Distribution of the number of people who were diagnosed with Thrombosis along with those who were not:**
```{python}
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A BAR PLOT
sns.barplot(
x=final_df["thrombosis_diagnosis"].value_counts().index,
y=final_df["thrombosis_diagnosis"].value_counts().values,
ax=ax,
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_title("Distribution of Positive and Negative Thrombosis Patients", fontsize=15)
ax.set_xlabel("Diagnosis", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
_ = ax.set_xticklabels(["Negative", "Positive"])
# SETTING THE GRID LINES AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
_ = ax.bar_label(ax.containers[0])
plt.show()
```
_Only 16% of the patients who went for the Thrombosis Test were diagnosed Positive._
**Outlier Detection of all the features present in the Dataset:**
```{python}
# DIVIDING THE COLUMNS INTO GROUPS
cols1 = ["got", "gpt", "ldh", "alp", "tp", "alb"]
cols2 = ["ua", "un", "cre", "t-bil", "t-cho", "tg", "u-pro"]
cols3 = ["wbc", "rbc", "hgb", "hct", "plt", "c3", "c4"]
cols4 = ["a_cl _ig_g", "a_cl _ig_m", "ana", "a_cl _ig_a"]
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
# CREATING A BOXPLOT FOR EACH GROUP
sns.boxplot(data=final_df[cols1], ax=ax[0, 0])
sns.boxplot(data=final_df[cols2], ax=ax[0, 1])
sns.boxplot(data=final_df[cols3], ax=ax[1, 0])
sns.boxplot(data=final_df[cols4], ax=ax[1, 1])
# SETTING THE TITLE
plt.suptitle("Outlier Detection of the Features", fontsize=16, y=0.94)
ax[0, 0].set_title("Blood Chemistry", fontsize=14)
ax[0, 1].set_title("Urinalysis", fontsize=14)
ax[1, 0].set_title("Complete Blood Count", fontsize=14)
ax[1, 1].set_title("Immunology", fontsize=14)
# SETTING THE TICKSIZE AND GRID LINES
for i in range(2):
for j in range(2):
ax[i, j].tick_params(labelsize=11)
ax[i, j].grid(True, axis="y", linestyle=":", linewidth=1)
plt.show()
```
_Some features like "u-pro (Proteinuria)", "ldh (Lactate Dehydrogenase)" and "plt (Platelet) have a lot of outliers present while features like "acl_igA" and "acl_igM" have less number of outliers but it is a very large value which needs to be dealt with._
**Distribution of Diagnosis of Thrombosis with respect to the gender of the patient:**
```{python}
# CREATING A NEW DATAFRAME FOR GENDER PRECENT DISTRIBUTION
x, y = "sex", "thrombosis_diagnosis"
new = final_df[final_df[x] != "Not Available"]
new = new.groupby(x)[y].value_counts(normalize=True)
new = new.mul(100)
new = new.rename("Percent").reset_index()
new["thrombosis_diagnosis"] = new["thrombosis_diagnosis"].apply(
lambda x: "Thrombosis" if x > 0 else "No Thrombosis"
)
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A BAR PLOT
sns.barplot(x="sex", y="Percent", hue="thrombosis_diagnosis", data=new, ax=ax)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_ylim(0, 100)
ax.set_title(
"Thrombosis Diagnosis Distribution w.r.t Gender of the Patient", fontsize=15
)
ax.set_xlabel("Gender", fontsize=13)
ax.set_ylabel("Percent", fontsize=13)
ax.tick_params(labelsize=11)
ax.set_xticklabels(["Female", "Male"])
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
ax.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
for p in ax.patches:
txt = str(p.get_height().round(2)) + "%"
txt_x = p.get_x()
txt_y = p.get_height()
ax.text(txt_x + 0.1, txt_y, txt, va="bottom", fontsize=11)
plt.savefig("plot_outputs/plot-01.png")
plt.show()
```
_From the above graph it is evident that Females are more likely to test positive for Thrombosis as compared to Males. But these reports are from just one hospital and this observation cannot be generalized for the entire world. To make that generalization we would need more information about the Demographics of the area where the University is located and to what extent have those factors affected the patients._
**Distribution of the severity of Thrombosis among different Age groups:**
```{python}
# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A COUNT PLOT
sns.countplot(
x=final_df["thrombosis"][final_df["thrombosis"] > 0],
ax=ax,
hue=final_df["age"],
hue_order=["0-18", "19-30", "31-45", "46-60", "61+"],
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Distribution of Degrees of Thrombosis w.r.t Age Bands", fontsize=15)
ax.set_title(
"1 = Positive | 2 = Positive and very severe | 3 = Positive and extremely severe",
fontsize=11,
)
ax.set_xlabel("Thrombosis", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
ax.set_yticks(np.arange(0, 21, 5))
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
for container in ax.containers:
_ = ax.bar_label(container)
ax.legend(title="Age Bands", title_fontsize=11, fontsize=10)
plt.savefig("plot_outputs/plot-02.png")
plt.show()
```
_There are a lot of patients for Mild Thrombosis in all age bands as compared to other Degrees of Thrombosis, but Patients from the age of 19 to 45 have the highest probability of getting diagnosed with Mild Thrombosis. This pattern changes for Severe Thrombosis where patients who are not adults yet (0-18 years) are most likely to get diagnosed with Severe Thrombosis. And finally, patients in the age group of 31-60 get diagnosed with Extremely Severe Thrombosis. It can be inferred from this data that people of age 61 and above are least likely to get diagnosed with Thrombosis._
**Predictors that are correlated to Thrombosis (target variable):**
```{python}
# CREATING A CORRELATION MATRIX
corr_df = final_df.drop(["id", "thrombosis_diagnosis"], axis=1)
# CREATING A CORRELATION PLOT
corr = corr_df.corr()
corr["thrombosis"].sort_values(ascending=False)
# SETTING THE FIGURE SIZE
plt.figure(figsize=(8, 5))
# CREATING A BAR PLOT BASED ON THE CORRELATION COEFFICIENTS
corr["thrombosis"].sort_values(ascending=False).plot(kind="bar")
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.title("Correlation between Thrombosis and other Variables", fontsize=15)
plt.grid(True, axis="y", linestyle=":", linewidth=1)
plt.ylabel("Correlation Coefficient", fontsize=13)
plt.xlabel("Variables", fontsize=13)
plt.tick_params(labelsize=11)
plt.savefig("plot_outputs/plot-03.png")
plt.show()
```
_Features that are highly correlated to Thrombosis (target variable) are:_ <br>
- acl_igG (Anticardiolipin Antibody IgG)<br>
- ANA (Antinuclear Antibody)<br>
- U-Pro (Proteinuria)<br>
- GPT (ALT glutamic pylvic transaminase)<br>
- GOT (AST glutamic oxaloacetic transaminase)<br>
- C3 (Complement 3)<br>
- C4 (Complement 4)<br>
- RBC (Red Blood Cells)<br>
- HCT (Hematoclit)<br>
- PLT (Platelet)<br>
- HGB (Hemoglobin)<br>
# Final Plots
**Comparing Different Medical Tests with Tests specific to Thrombosis (Anti-Cardiolipin Antibody (IgG)):**<br>
Since Anti-Cardiolipin Antibody (IgG) is the correlated feature to Thrombosis, we will compare it with other medical tests to see if they can be used to improve the accuracy of the diagnosis.
**_Anti-Cardiolipin Antibody (IgG) V/S ALT glutamic pylvic transaminase (GPT):_**
```{python}
# CREATING A NEW DATAFRAME FOR COMPARISON BETWEEN MEDICAL TESTS AND THROMBOSIS SPECIFIC TESTS
new_df = final_df[final_df["a_cl _ig_g"] < 100]
new_df["thrombosis_diagnosis"] = new_df["thrombosis_diagnosis"].apply(
lambda x: "Thrombosis" if x > 0 else "No Thrombosis"
)
new_df1 = final_df[
(final_df["a_cl _ig_g"] < 100) & (final_df["thrombosis_diagnosis"] > 0)
]
new_df1["thrombosis"] = new_df["thrombosis"].apply(
lambda x: "Severe" if x > 1 else "Mild"
)
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="gpt",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="gpt",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle(
"Anti-Cardiolipin Antibody V/S ALT glutamic pylvic transaminase", fontsize=15
)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=12)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "ALT glutamic pylvic transaminase (GPT)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-04.png")
plt.show()
```
_ALT glutamic pylvic transaminase (GPT) has a normal range of <60. It can be observed that many patients who were diagnosed with Thrombosis had their GPT <60 and there were only a few exceptions where patients with GPT outside the normal range were also diagnosed with Thrombosis. Delving further into Degrees of Thrombosis, it can be observed that most of the patients who are diagnosed with thrombosis have normal GPT except a few patients. Concentrating on the patient who has GPT of 300+, the patient has a severe case of thrombosis which is interesting. This shows that GPT is not a good test to be taken into consideration while diagnosing Mild Thrombosis but further research with regards to abnormally high GPT and Severeness of Thrombosis can be performed with a larger volume of data._
**_Anti-Cardiolipin Antibody (IgG) V/S AST glutamic oxaloacetic transaminase (GOT):_**
```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="got",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="got",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle(
"Anti-Cardiolipin Antibody V/S AST glutamic oxaloacetic transaminase", fontsize=15
)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(
0.5, 0.02, "AST glutamic oxaloacetic transaminase (GOT)", ha="center", fontsize=13
)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-05.png")
plt.show()
```
_AST glutamic oxaloacetic transaminase (GOT) has a normal range of <60 (similar to GPT). It can be observed that many patients who were diagnosed with Thrombosis had their GOT <60 and there were some exceptions where patients with GPT outside the normal range were also diagnosed with Thrombosis. Delving further into Degrees of Thrombosis, it can be observed that most of the patients who are diagnosed with thrombosis have normal GPT except a few patients. Concentrating on the patients who have GOT of 100+, the patients have a severe case of thrombosis which is interesting. This shows that GPT is not a good test to be taken into consideration while diagnosing Mild Thrombosis but can be used while diagnosing cases of Severe Thrombosis._
**_Anti-Cardiolipin Antibody (IgG) V/S Complement 3 (C3):_**
```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="c3",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="c3",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Complement 3", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Complement 3 (C3)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-06.png")
plt.show()
```
_Complement 3 (C3) has a normal range of >35. It can be observed that many patients who were diagnosed with Thrombosis had their C3 <35 which implies that the patients who have an abnormal C3 are more prone to Thrombosis. Delving further into Degrees of Thrombosis, there is an equal distribution for Mild and Severe degrees of Thrombosis. This shows that C3 is a good test to be taken into consideration while diagnosing Thrombosis._
**_Anti-Cardiolipin Antibody (IgG) V/S Complement 4 (C4):_**
```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="c4",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="c4",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Complement 4", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Complement 4 (C4)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-07.png")
plt.show()
```
_Complement 4 (C4) has a normal range of >10. It can be observed that almost all patients who were diagnosed with Thrombosis had their C4 <18 which implies that the patients who have an abnormal C4 (from 0 to 15) are more prone to Thrombosis. Delving further into Degrees of Thrombosis, there is an equal distribution for Mild and Severe degrees of Thrombosis. This shows that C4 is a good test to be taken into consideration while diagnosing Thrombosis._
**_Anti-Cardiolipin Antibody (IgG) V/S Hemoglobin (HGB):_**
```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="hgb",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="hgb",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Hemoglobin", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Hemoglobin (HGB)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-08.png")
plt.show()
```
_Hemoglobin (HGB) has a normal range between 10 and 17. It can be observed that all patients who were diagnosed with Thrombosis had their HGB between 10 and 17. This shows that HGB is not a good test to be taken into consideration while diagnosing Thrombosis._
**_Anti-Cardiolipin Antibody (IgG) V/S Hematoclit (HCT):_**
```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="hct",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="hct",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Hematoclit", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Hematoclit (HCT)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-09.png")
plt.show()
```
_Hematoclit (HCT) has a normal range between 29 and 52. It can be observed that more than 95% of the patients who were diagnosed with Thrombosis had their HCT between 29 and 52. This shows that HCT is not a good test to be taken into consideration while diagnosing Thrombosis._
**_Anti-Cardiolipin Antibody (IgG) V/S Platelet Count (PLT):_**
```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="plt",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="plt",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Platelet Count", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Platelet Count (PLT)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-10.png")
plt.show()
```
_Platelet (PLT) has a normal range between 100 and 400. It can be observed that more than 95% of the patients who were diagnosed with Thrombosis had their PLT between 100 and 400. This shows that PLT is not a good test to be taken into consideration while diagnosing Thrombosis._
**_Anti-Cardiolipin Antibody (IgG) V/S Proteinuria (U-PRO):_**
```{python}
# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
# CREATING SCATTER PLOTS
sns.scatterplot(
data=new_df,
y="a_cl _ig_g",
x="u-pro",
ax=ax1,
hue="thrombosis_diagnosis",
s=50,
edgecolor="black",
linewidth=0.4,
)
sns.scatterplot(
data=new_df1,
y="a_cl _ig_g",
x="u-pro",
ax=ax2,
hue="thrombosis",
s=50,
edgecolor="black",
linewidth=0.4,
palette="YlOrBr",
)
# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Proteinuria", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Proteinuria (U-PRO)", ha="center", fontsize=13)
# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)
plt.savefig("plot_outputs/plot-11.png")
plt.show()
```
_Proteinuria (U-PRO) has a normal range between 0 and 30. It can be observed that many patients who were diagnosed with Thrombosis had their U-PRO between 0 and 30 and there were only a few exceptions where patients with U-PRO outside the normal range were also diagnosed with Thrombosis. Delving further into Degrees of Thrombosis, it can be observed that most of the patients who are diagnosed with thrombosis have normal U-PRO except a few patients. Concentrating on the patients who have U-PRO of 100+, the patients have a severe case of thrombosis which is interesting. This shows that U-PRO is not a good test to be taken into consideration while diagnosing Mild Thrombosis but further research with regards to abnormally high U-PRO and Severeness of Thrombosis can be performed with a larger volume of data._
# Technical Summary
The 3 datasets provided by the University Hospital revolved mainly around the Thrombosis Diagnosis. All the datasets had ID of the patient which was the joining parameter used to perform analytical joins. `These datasets roughly covered the demographic aspect of each patient, the medical test history of each patient and the factors for thrombosis diagnosis of some patients`. There were tons of missing data in all these datasets since there are patients who opt-out of providing their personal information and/or have done only a few medical tests per visit. `These missing values were carefully imputed` after going through entire datasets and also with the help of some domain knowledge. Along with the imputation, the columns and rows which had more than 70% missing values were dropped from the data to avoid assumption of wrong values to impute. Some `feature extraction` was performed such as calculating the Age of the Patient at the time when Thrombosis tests were performed which would be later useful for analysis.<br>After getting the final dataframe, `Exploratory Data Analysis (Preliminary and Visual)` was performed to help understand the data better and answer the initial questions. Most of the Patients who tested for Thrombosis were between ages 19 to 45 and only a few patients who were more than 61 years of age tested for Thrombosis. Further, the distribution of Positive and Negative Thrombosis patients after testing had a huge difference (only 16% of the patients tested were diagnosed as positive). If further statistical analysis and machine learning algorithms are to be applied to predict the Thrombosis results, more data for Positive thrombosis patients need to be introduced since the dataset is imbalanced currently. Patients from the age of 19 to 45 have the highest probability of getting diagnosed with Mild Thrombosis. This pattern changes for Severe Thrombosis where patients who are not adults yet (0-18 years) are most likely to get diagnosed with Severe Thrombosis. Females are more likely to test positive for Thrombosis as compared to Males but these reports are from just one hospital and this observation cannot be generalized for the entire world.<br>Apart from Thrombosis specific tests such as acl_igG (Anticardiolipin Antibody IgG) and ANA (Antinuclear Antibody), common medical tests such as U-Pro (Proteinuria), GPT (ALT glutamic pylvic transaminase), GOT (AST glutamic oxaloacetic transaminase), C3 and C4 (Complement 3 and 4), RBC (Red Blood Cells), HCT (Hematoclit), PLT (Platelet) and HGB (Hemoglobin) were found to be correlated to the Thrombosis Diagnosis. Since, correlation coefficients cannot be the only measure to see if these tests are relevant while diagnosis Thrombosis, further Visual analysis was performed if there are any significant patterns in these tests to look out for while diagnosing Thrombosis.<br>Tests like C3 and C4 had significant patterns while tests like HGB, HCT and Platelet count did not show any significant patterns while diagnosing Thrombosis. Some tests like U-PRO, GOT and GPT had few patterns that might be good to look at while diagnosing extreme severe cases of Thrombosis. Overall the data required a lot of assumptions while imputing the missing values and the analysis performed was only preliminary. `These analysis can be further broken down to the granular level to understand the factors affecting Thrombosis Diagnosis by performing Machine Learning algorithms to predict the Thrombosis Diagnosis.`