EDA on Medical Data of Thrombosis Diagnosis

Data Summary

Collagen is a fibrous protein found in cartilage and other connective tissue. Collagen diseases are autoimmune diseases in which the body's immune system attacks its own skin, tissues, and organs. For example, if a patient produces antibodies against lung tissue, respiratory function deteriorates, which can be fatal. The extent and causes of these diseases are only partially understood, which makes their classification a challenging task.
Thrombosis is an important and severe complication of these diseases and one of the major causes of death in collagen diseases. Physicians recently discovered that Thrombosis is closely related to anti-cardiolipin antibodies. The databases used in this project were donated by one of these physicians at a University Hospital, where patients with collagen diseases were referred by their local physicians, home doctors, and other medical specialists.

Initial Questions

Are some age bands more likely than others to be diagnosed with higher degrees of Thrombosis?
Are females more likely to be diagnosed with Thrombosis than males, or vice versa?
How can other medical tests be incorporated to improve the accuracy of the diagnosis?

Data Munging

Import Libraries

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import datetime
import snakecase
import re

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)

Import Data

Code

a = pd.read_csv("data/TSUMOTO_A.CSV")
b = pd.read_csv("data/TSUMOTO_B.CSV")
c = pd.read_csv("data/TSUMOTO_C.CSV")

Data Sample

Tsumoto_A - Basic Information about Patients (Input by Experts). This dataset includes all patients.

Code

a.head()

	ID	SEX	Birthday	Description	First Date	Admission	Diagnosis
0	2110	F	2/13/34	94.02.14	93.02.10	+	RA susp.
1	11408	F	5/2/37	96.12.01	73.01.01	+	PSS
2	12052	F	4/14/56	91.08.13	NaN	+	SLE
3	14872	F	9/21/53	97.08.13	NaN	+	MCTD
4	27654	F	3/25/36	NaN	92.02.03	+	RA, SLE susp

Tsumoto_B - Special Laboratory Examinations (Input by Experts) (Measured by the Laboratory on Collagen Diseases). This dataset does not include all patients, only those who took these special tests.

Code

b.head()

	ID	Examination Date	aCL IgG	aCL IgM	ANA	ANA Pattern	aCL IgA	Diagnosis	KCT	RVVT	LAC	Symptoms	Thrombosis
0	14872.0	5/27/97	1.3	1.6	256	P	0.0	MCTD, AMI	NaN	NaN	-	AMI	1
1	48473.0	12/21/92	4.3	4.6	256	P,S	3.3	SLE	-	-	-	NaN	0
2	102490.0	4/20/95	2.3	2.5	0	NaN	3.5	PSS	NaN	NaN	NaN	NaN	0
3	108788.0	5/6/97	0.0	0.0	16	S	0.0	NaN	NaN	NaN	-	NaN	0
4	122405.0	4/2/98	0.0	4.0	4	P	0.0	SLE, SjS, vertigo	NaN	NaN	NaN	NaN	0

Tsumoto_C - Laboratory Examinations stored in Hospital Information Systems (Stored from 1980 to 1999.3). The data consists of ordinary laboratory examinations, each with a time stamp.

Code

c.head()

	ID	Date	GOT	GPT	LDH	ALP	TP	ALB	UA	UN	CRE	T-BIL	T-CHO	TG	CPK	GLU	WBC	RBC	HGB	HCT	PLT	PT	APTT	FG	PIC	TAT	TAT2	U-PRO	IGG	IGA	IGM	CRP	RA	RF	C3	C4	RNP	SM	SC170	SSA	SSB	CENTROMEA	DNA	DNA-II	Unnamed: 44	Unnamed: 45
0	2110	860419	24	12	152	63	7.5	4.5	3.4	16.0	0.6	0.4	192	NaN	NaN	NaN	5.9	4.69	9	31.6	380	NaN	NaN	NaN	NaN	NaN	NaN	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2110	860430	25	12	162	76	7.9	4.6	4.7	18.0	0.6	0.4	187	76	31	NaN	6.9	4.73	9.3	31.8	323	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	<0.002	-	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	2110	860502	22	8	144	68	7	4.2	5	18.0	0.6	0.4	191	NaN	23	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	2110	860506	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	2110	860507	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Data Cleaning

Tsumoto_A and Tsumoto_B Merge

Merge B with A on ID:

Code

b_new = b.dropna(subset=["ID"])  # DROP ROWS WITH NA IN ID COLUMN
merge_df = pd.merge(
    b_new, a[["ID", "SEX", "Birthday", "Diagnosis"]], on="ID", how="left"
)
merge_df.head()

	ID	Examination Date	aCL IgG	aCL IgM	ANA	ANA Pattern	aCL IgA	Diagnosis_x	KCT	RVVT	LAC	Symptoms	Thrombosis	SEX	Birthday	Diagnosis_y
0	14872.0	5/27/97	1.3	1.6	256	P	0.0	MCTD, AMI	NaN	NaN	-	AMI	1	F	9/21/53	MCTD
1	48473.0	12/21/92	4.3	4.6	256	P,S	3.3	SLE	-	-	-	NaN	0	F	10/7/48	SLE
2	102490.0	4/20/95	2.3	2.5	0	NaN	3.5	PSS	NaN	NaN	NaN	NaN	0	F	4/1/82	PSS
3	108788.0	5/6/97	0.0	0.0	16	S	0.0	NaN	NaN	NaN	-	NaN	0	F	3/15/42	SJS
4	122405.0	4/2/98	0.0	4.0	4	P	0.0	SLE, SjS, vertigo	NaN	NaN	NaN	NaN	0	F	5/22/61	SJS
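Because this is a left join, examinations in B whose ID has no match in A keep their rows but receive NaN for SEX, Birthday and Diagnosis_y. A minimal sketch of how the number of unmatched examinations could be checked with the `indicator` option of `pd.merge`; the two small frames are made up for illustration and only stand in for `b_new` and `a`.

```python
import pandas as pd

# Hypothetical stand-ins for b_new (examinations) and a (patient information)
exams = pd.DataFrame({"ID": [1, 2, 3], "Thrombosis": [0, 1, 0]})
patients = pd.DataFrame({"ID": [1, 3], "SEX": ["F", "M"]})

# indicator=True adds a "_merge" column that says where each row came from
checked = pd.merge(exams, patients, on="ID", how="left", indicator=True)

# Rows marked "left_only" are examinations without a matching patient record
unmatched = int((checked["_merge"] == "left_only").sum())
print(checked)
print("Examinations without patient information:", unmatched)
```

Running the same check on the real frames would show how many of the 770 merged rows have no demographic information.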

Missing Values

Code

print("The number of rows and columns in the new merged dataframe are:", merge_df.shape)
print("--------------------------------------------------------------------------")
print(
    "The number of missing values in the new merged dataframe are: \n",
    merge_df.isna().sum(),
)

The number of rows and columns in the new merged dataframe are: (770, 16)
--------------------------------------------------------------------------
The number of missing values in the new merged dataframe are: 
 ID                    0
Examination Date      7
aCL IgG               0
aCL IgM               0
ANA                  20
ANA Pattern         236
aCL IgA               0
Diagnosis_x         312
KCT                 625
RVVT                625
LAC                 549
Symptoms            695
Thrombosis            0
SEX                 354
Birthday            353
Diagnosis_y         353
dtype: int64
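Raw counts are easier to compare as percentages of the 770 rows: KCT and RVVT are missing in about 81% of rows (625 of 770), LAC in about 71% (549 of 770) and Symptoms in about 90% (695 of 770). A small sketch of how such percentages can be computed; the frame `demo` is a made-up stand-in for `merge_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for merge_df
demo = pd.DataFrame(
    {
        "KCT": [np.nan, np.nan, np.nan, "-"],
        "ANA": ["256", np.nan, "16", "4"],
        "Thrombosis": [1, 0, 0, 0],
    }
)

# isna().mean() is the fraction of missing values per column
missing_pct = demo.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.round(1))
```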

Handling missing values in the merged dataset:

Code

# MERGING THE 2 DIAGNOSIS COLUMNS INTO 1 LIST OF UNIQUE DIAGNOSES PER ROW
def combine_diagnoses(diag_x, diag_y):
    labels = set()
    for diag in (diag_x, diag_y):
        if pd.notna(diag):
            # SPLITTING ON COMMAS, THEN STRIPPING WHITESPACE AND LOWERCASING EACH DIAGNOSIS
            labels.update(part.strip().lower() for part in diag.split(","))
    # ROWS WITHOUT ANY DIAGNOSIS GET THE PLACEHOLDER "No Diagnosis"
    return list(labels) if labels else "No Diagnosis"


# CREATING A NEW COLUMN CALLED DIAGNOSIS AND FILLING IT WITH THE COMBINED DIAGNOSIS
# (BUILT IN ONE ASSIGNMENT, BECAUSE CHAINED ASSIGNMENT LIKE df[col][i] = ... IS UNRELIABLE IN PANDAS)
merge_df["Diagnosis"] = [
    combine_diagnoses(x, y)
    for x, y in zip(merge_df["Diagnosis_x"], merge_df["Diagnosis_y"])
]

# DROPPING THE ORIGINAL DIAGNOSIS COLUMNS
merge_df.drop(["Diagnosis_x", "Diagnosis_y"], axis=1, inplace=True)

# FILLING NAN VALUES IN THE BIRTHDAY COLUMN WITH A DATE OF 0/0/0
merge_df["Birthday"] = merge_df["Birthday"].fillna("0/0/0")
# DROPPING ROWS WITH NAN VALUES IN THE EXAMINATION DATE COLUMN
merge_df.dropna(subset=["Examination Date"], inplace=True)
merge_df.reset_index(drop=True, inplace=True)

# CREATING A NEW COLUMN CALLED AGE: EXAMINATION YEAR MINUS BIRTH YEAR (BOTH ARE TWO-DIGIT YEARS)
merge_df["Age"] = [
    "Not Available"
    if birthday == "0/0/0"
    else int(exam_date.split("/")[2]) - int(birthday.split("/")[2])
    for exam_date, birthday in zip(merge_df["Examination Date"], merge_df["Birthday"])
]

# DROPPING THE KCT, RVVT, AND LAC COLUMNS SINCE THEY HAVE MORE THAN 70% MISSING VALUES
merge_df.drop(["KCT", "RVVT", "LAC"], axis=1, inplace=True)
# FILLING MISSING VALUES IN THE SYMPTOMS COLUMN WITH "None"
merge_df["Symptoms"] = merge_df["Symptoms"].fillna("None")
merge_df.reset_index(drop=True, inplace=True)
# FILLING MISSING VALUES IN THE ANA COLUMN WITH "0"
merge_df["ANA"] = merge_df["ANA"].fillna("0")

# SETTING THE ANA PATTERN COLUMN TO "None" WHEREVER THE ANA COLUMN IS "0"
merge_df.loc[merge_df["ANA"] == "0", "ANA Pattern"] = "None"

# FILLING MISSING VALUES IN THE ANA PATTERN COLUMN WITH "Not Available"
merge_df["ANA Pattern"] = merge_df["ANA Pattern"].fillna("Not Available")
# FILLING MISSING VALUES IN THE SEX COLUMN WITH "Not Available"
merge_df["SEX"] = merge_df["SEX"].fillna("Not Available")
# DROPPING ROWS WITH MISSING VALUES IN THE ENTIRE DATAFRAME
merge_df.dropna(inplace=True)
merge_df.reset_index(drop=True, inplace=True)

print(
    "The number of missing values in the new merged dataframe are: \n",
    merge_df.isna().sum(),
)

The number of missing values in the new merged dataframe are: 
 ID                  0
Examination Date    0
aCL IgG             0
aCL IgM             0
ANA                 0
ANA Pattern         0
aCL IgA             0
Symptoms            0
Thrombosis          0
SEX                 0
Birthday            0
Diagnosis           0
Age                 0
dtype: int64

Final dataframe sample after handling missing values:

Code

merge_df.head()

	ID	Examination Date	aCL IgG	aCL IgM	ANA	ANA Pattern	aCL IgA	Symptoms	Thrombosis	SEX	Birthday	Diagnosis	Age
0	14872.0	5/27/97	1.3	1.6	256	P	0.0	AMI	1	F	9/21/53	[mctd, ami]	44
1	48473.0	12/21/92	4.3	4.6	256	P,S	3.3	None	0	F	10/7/48	[sle]	44
2	102490.0	4/20/95	2.3	2.5	0	None	3.5	None	0	F	4/1/82	[pss]	13
3	108788.0	5/6/97	0.0	0.0	16	S	0.0	None	0	F	3/15/42	[sjs]	55
4	122405.0	4/2/98	0.0	4.0	4	P	0.0	None	0	F	5/22/61	[vertigo, sjs, sle]	37

Converting the Examination Date and Date of the Test to Date Time Format:

Code

c["Date"] = pd.to_datetime(c["Date"], format="%y%m%d")
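# "%m/%d/%y" SPELLS OUT THE MONTH/DAY/YEAR LAYOUT INSTEAD OF RELYING ON THE LOCALE-DEPENDENT "%x"
merge_df["Examination Date"] = pd.to_datetime(
    merge_df["Examination Date"], format="%m/%d/%y"
)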
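One caveat with two-digit years: the `%y` directive maps 69-99 to 1969-1999 and 00-68 to 2000-2068. That is harmless for examination dates from the 1990s, but a birthday such as 2/13/34 would be read as 2034, which is presumably why the notebook subtracts the two-digit years directly. The sketch below shows one possible alternative that parses both dates and corrects the century. It assumes that no patient was born after their examination; the dates are illustrative values in the same format as the data.

```python
import pandas as pd

# Illustrative values in the same m/d/yy format as the data
birthdays = pd.Series(["2/13/34", "9/21/53", "4/1/82"])
exam_dates = pd.Series(["2/14/94", "5/27/97", "4/20/95"])

born = pd.to_datetime(birthdays, format="%m/%d/%y")
examined = pd.to_datetime(exam_dates, format="%m/%d/%y")

# Assumption: a birthday that parses to a date after the examination
# actually belongs to the previous century
born_year = born.dt.year.where(born <= examined, born.dt.year - 100)

# Age in completed years at the examination date
had_birthday = (examined.dt.month > born.dt.month) | (
    (examined.dt.month == born.dt.month) & (examined.dt.day >= born.dt.day)
)
age = examined.dt.year - born_year - (~had_birthday).astype(int)
print(age.tolist())
```

Ages computed this way can be one year lower than the plain year difference when the examination falls before the birthday in that year.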

Tsumoto_C Merge

Merging the dataframe of Thrombosis examinations (B) and demographic information of the patients (A) with the hospital records of every patient (C):

Code

df = pd.merge(merge_df[["ID", "Examination Date"]], c, on=["ID"], how="left")
df.head()

	ID	Examination Date	Date	GOT	GPT	LDH	ALP	TP	ALB	UA	UN	CRE	T-BIL	T-CHO	TG	CPK	GLU	WBC	RBC	HGB	HCT	PLT	PT	APTT	FG	PIC	TAT	TAT2	U-PRO	IGG	IGA	IGM	CRP	RA	RF	C3	C4	RNP	SM	SC170	SSA	SSB	CENTROMEA	DNA	DNA-II	Unnamed: 44	Unnamed: 45
0	14872.0	1997-05-27	1981-02-17	22	30	179	41	6.6	4.1	3.5	13.0	0.9	0.3	193	79	NaN	NaN	10	4.44	12.5	38.2	256	NaN	NaN	NaN	NaN	NaN	NaN	0	NaN	NaN	NaN	-	-	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	14872.0	1997-05-27	1981-03-16	22	19	155	37	7.2	4.5	3.2	12.0	0.8	0.3	185	NaN	13	NaN	7.2	4.5	11.8	37.7	199	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	14872.0	1997-05-27	1981-04-06	21	20	143	42	7.5	4.7	4.1	12.0	0.8	0.3	199	NaN	11	NaN	5.6	4.76	12.8	39.4	199	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	14872.0	1997-05-27	1981-04-27	32	20	218	43	7.2	4.5	3.5	9.0	0.9	0.3	185	NaN	22	NaN	3.3	4.76	12.3	39.4	183	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	14872.0	1997-05-27	1981-05-01	33	19	277	38	6.9	4.4	3.3	11.0	0.8	0.3	184	NaN	17	NaN	5.8	4.43	11.7	36	215	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Missing Values

Dropping Rows and Columns according to Missing Values in the Merged Dataset:

Code

# DROPPING THE UNNAMED COLUMNS CREATED WHILE IMPORTING THE DATA, ALONG WITH THE CRP COLUMN
df.drop(["Unnamed: 44", "Unnamed: 45", "CRP"], axis=1, inplace=True)
# DROPPING THE COLUMNS AND ROWS WITH MORE THAN 70% MISSING VALUES
df.dropna(thresh=len(df) * 0.3, axis=1, inplace=True)
df.dropna(thresh=len(df.columns) * 0.3, inplace=True)
df.reset_index(inplace=True)
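The `thresh` argument of `dropna` is the minimum number of non-missing values a column (or row) needs in order to be kept, so `len(df) * 0.3` keeps anything that is at least 30% filled. A small sketch on made-up data; the column names are borrowed from Tsumoto_C only for readability.

```python
import numpy as np
import pandas as pd

# Made-up data: GOT is 80% filled, TAT is only 10% filled
demo = pd.DataFrame(
    {
        "GOT": [24.0, 25.0, 22.0, np.nan, np.nan, 30.0, 28.0, 21.0, 19.0, 23.0],
        "TAT": [np.nan] * 9 + [1.0],
    }
)

# Minimum number of non-missing values required to keep a column (30% of the rows)
min_present = int(np.ceil(len(demo) * 0.3))
kept = demo.dropna(thresh=min_present, axis=1)
print(list(kept.columns))
```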

Feature Extraction - Tagging each hospital record as taken before or after the patient's Thrombosis examination:

Code

# "Before" IF THE LAB TEST WAS TAKEN ON OR BEFORE THE EXAMINATION DATE, "After" OTHERWISE
df["B/A_Tag"] = np.where(df["Date"] <= df["Examination Date"], "Before", "After")

Converting all the values in the dataframe to float (from string):

Code

col_conv = [
    "GOT",
    "GPT",
    "LDH",
    "ALP",
    "TP",
    "ALB",
    "UA",
    "UN",
    "CRE",
    "T-BIL",
    "T-CHO",
    "TG",
    "WBC",
    "RBC",
    "HGB",
    "HCT",
    "PLT",
    "U-PRO",
    "C3",
    "C4",
]
for col in col_conv:
    if df[col].dtype == "object":
        # KEEPING ONLY DIGITS AND DECIMAL POINTS IN STRING ENTRIES; STRINGS LEFT EMPTY BECOME 0
        df[col] = df[col].map(
            lambda v: float(re.sub(r"[^\d.]", "", v) or 0) if isinstance(v, str) else v
        )

# CREATING A COPY OF THE DATAFRAME TO AVOID CHANGING THE ORIGINAL DATAFRAME
df1 = df.copy()
# FILLING MISSING VALUES IN THE COLUMNS WITH 0
df1.fillna(0, inplace=True)

# CONVERTING THE COLUMNS OF THE COPY TO FLOAT, SINCE THE COPY IS WHAT GETS GROUPED BELOW
for col in col_conv:
    df1[col] = pd.to_numeric(df1[col], errors="coerce").astype(float)

# CREATING A NEW DATAFRAME WITH ONLY THE COLUMNS NEEDED FOR THE GROUPING
df2 = df1[
    [
        "ID",
        "GOT",
        "GPT",
        "LDH",
        "ALP",
        "TP",
        "ALB",
        "UA",
        "UN",
        "CRE",
        "T-BIL",
        "T-CHO",
        "TG",
        "WBC",
        "RBC",
        "HGB",
        "HCT",
        "PLT",
        "U-PRO",
        "C3",
        "C4",
        "B/A_Tag",
    ]
]

Grouping the dataframe by ID and Before/After tag and taking the mean of each test:

Code

df3 = df2.groupby(["ID", "B/A_Tag"], as_index=False, dropna=True).mean()
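Since missing measurements were filled with 0 before this step, every missing test counts as a reading of 0 and pulls the group mean down. In the final sample, for instance, patient 14872 has a mean TP of about 5.38 although the individual TP readings shown earlier lie between 6.6 and 7.5. The sketch below contrasts the two behaviours on made-up data; the NaN-aware variant is an alternative for comparison, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up lab records: patient 1 has one missing TP measurement
labs = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2],
        "B/A_Tag": ["Before", "Before", "Before", "Before", "After"],
        "TP": [7.5, 7.9, np.nan, 6.6, 7.2],
    }
)

# Zero-filled mean: the missing measurement counts as a reading of 0
zero_filled = labs.fillna(0).groupby(["ID", "B/A_Tag"], as_index=False).mean()

# NaN-aware mean: groupby().mean() skips missing values by default
nan_aware = labs.groupby(["ID", "B/A_Tag"], as_index=False).mean()

print(zero_filled)
print(nan_aware)
```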

Final Merge

Merging the above dataset with our First Merge Dataset to get the final Data:

Code

final_df = pd.merge(df3, merge_df, on=["ID"], how="left")
final_df.sort_values(by=["ID"], inplace=True)
# CONVERTING THE COLUMN NAMES TO SNAKECASE
final_df.columns = final_df.columns.map(snakecase.convert)
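The converted names contain stray spaces and slashes (for example `a_cl _ig_g` and `b/a__tag`), which the rest of the notebook then has to spell out exactly. A minimal sketch of a regex-based alternative; `to_snake` is a hypothetical helper, not part of the `snakecase` package, and adopting it would mean renaming the columns referenced later in the notebook.

```python
import re


def to_snake(name: str) -> str:
    """Lower-case a column name and replace runs of non-alphanumerics with one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


columns = ["aCL IgG", "B/A_Tag", "ANA Pattern", "T-BIL", "Examination Date"]
converted = [to_snake(c) for c in columns]
print(converted)
```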

Final Data Cleaning:

Code

# CREATING AGE BANDS FOR THE AGE COLUMN
def to_age_band(age):
    if age == "Not Available":
        return age
    if 0 <= age <= 18:
        return "0-18"
    if 19 <= age <= 30:
        return "19-30"
    if 31 <= age <= 45:
        return "31-45"
    if 46 <= age <= 60:
        return "46-60"
    return "61+"


final_df["age"] = final_df["age"].map(to_age_band)

# CONVERTING ANA COLUMN TO NUMERIC
final_df["ana"] = final_df["ana"].map(lambda v: re.sub(r"[^\d.]", "", v))
final_df["ana"] = pd.to_numeric(final_df["ana"], errors="coerce").astype(float)

# CREATING A THROMBOSIS DIAGNOSIS COLUMN
final_df["thrombosis_diagnosis"] = final_df["thrombosis"].apply(
    lambda x: 1 if x > 0 else 0
)

# DROPPING UNNECESSARY COLUMNS
final_df.drop(["birthday", "examination _date"], axis=1, inplace=True)

# SAVING THE FINAL DATAFRAME TO A CSV FILE
final_df.to_csv("data/FINAL_DATA.csv", index=False)

Final Data Sample:

Code

final_df.head()

	id	b/a__tag	got	gpt	ldh	alp	tp	alb	ua	un	cre	t-bil	t-cho	tg	wbc	rbc	hgb	hct	plt	u-pro	c3	c4	a_cl _ig_g	a_cl _ig_m	ana	ana _pattern	a_cl _ig_a	symptoms	thrombosis	sex	diagnosis	age	thrombosis_diagnosis
0	14872.0	Before	20.872340	15.829787	119.255319	42.085106	5.378723	3.285106	2.889362	8.319149	0.568085	0.259574	135.829787	15.234043	6.855319	3.720851	10.578723	32.304255	144.680851	0.000000	0.000000	0.00	1.3	1.6	256.0	P	0.0	AMI	1	F	[mctd, ami]	31-45	1
1	48473.0	Before	38.036364	45.745455	111.818182	83.763636	6.103636	3.598182	1.765455	11.181818	0.667273	0.500000	126.290909	23.490909	7.478182	4.728727	10.689091	33.834545	111.236364	0.163636	0.000000	0.00	4.3	4.6	256.0	P,S	3.3	None	0	F	[sle]	31-45	0
2	102490.0	Before	23.885714	16.685714	134.628571	117.942857	7.271429	4.097143	4.694286	12.400000	0.814286	0.557143	147.200000	40.457143	6.594286	3.335429	10.271429	31.265714	232.714286	0.028571	0.000000	0.00	2.3	2.5	0.0	None	3.5	None	0	F	[pss]	0-18	0
3	108788.0	Before	13.256410	8.717949	68.410256	38.897436	4.620513	2.443590	2.687179	7.384615	0.458974	0.394872	86.230769	5.820513	3.269231	3.594103	10.894872	33.502564	175.025641	0.076923	0.000000	0.00	0.0	0.0	16.0	S	0.0	None	0	F	[sjs]	46-60	0
4	122405.0	Before	30.833333	22.950000	126.766667	69.966667	7.626667	3.616667	4.045000	11.800000	0.708333	0.391667	184.283333	8.100000	2.981667	3.795167	10.638333	33.691667	217.616667	0.000000	32.233333	6.45	0.0	4.0	4.0	P	0.0	None	0	F	[vertigo, sjs, sle]	31-45	0

Exploratory Analysis

Read in the Data from Local Machine:

Code

final_df = pd.read_csv("data/FINAL_DATA.csv")
final_df.head()

	id	b/a__tag	got	gpt	ldh	alp	tp	alb	ua	un	cre	t-bil	t-cho	tg	wbc	rbc	hgb	hct	plt	u-pro	c3	c4	a_cl _ig_g	a_cl _ig_m	ana	ana _pattern	a_cl _ig_a	symptoms	thrombosis	sex	diagnosis	age	thrombosis_diagnosis
0	14872.0	Before	20.872340	15.829787	119.255319	42.085106	5.378723	3.285106	2.889362	8.319149	0.568085	0.259574	135.829787	15.234043	6.855319	3.720851	10.578723	32.304255	144.680851	0.000000	0.000000	0.00	1.3	1.6	256.0	P	0.0	AMI	1	F	['mctd', 'ami']	31-45	1
1	48473.0	Before	38.036364	45.745455	111.818182	83.763636	6.103636	3.598182	1.765455	11.181818	0.667273	0.500000	126.290909	23.490909	7.478182	4.728727	10.689091	33.834545	111.236364	0.163636	0.000000	0.00	4.3	4.6	256.0	P,S	3.3	None	0	F	['sle']	31-45	0
2	102490.0	Before	23.885714	16.685714	134.628571	117.942857	7.271429	4.097143	4.694286	12.400000	0.814286	0.557143	147.200000	40.457143	6.594286	3.335429	10.271429	31.265714	232.714286	0.028571	0.000000	0.00	2.3	2.5	0.0	None	3.5	None	0	F	['pss']	0-18	0
3	108788.0	Before	13.256410	8.717949	68.410256	38.897436	4.620513	2.443590	2.687179	7.384615	0.458974	0.394872	86.230769	5.820513	3.269231	3.594103	10.894872	33.502564	175.025641	0.076923	0.000000	0.00	0.0	0.0	16.0	S	0.0	None	0	F	['sjs']	46-60	0
4	122405.0	Before	30.833333	22.950000	126.766667	69.966667	7.626667	3.616667	4.045000	11.800000	0.708333	0.391667	184.283333	8.100000	2.981667	3.795167	10.638333	33.691667	217.616667	0.000000	32.233333	6.45	0.0	4.0	4.0	P	0.0	None	0	F	['vertigo', 'sjs', 'sle']	31-45	0
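After the round trip through CSV, the diagnosis column no longer holds Python lists but their string representation (note the quotes in `['mctd', 'ami']`). If the lists are needed again, they can be parsed back. A minimal sketch using `ast.literal_eval`; the second row and its ID are made up to show the "No Diagnosis" placeholder.

```python
import ast
import io

import pandas as pd

# Small stand-in for final_df; ID 99999 is hypothetical
original = pd.DataFrame(
    {"id": [14872, 99999], "diagnosis": [["mctd", "ami"], "No Diagnosis"]}
)

# Writing to CSV and reading back turns each list into a plain string
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)


def parse_diagnosis(value):
    # Only strings that look like a list are parsed; placeholders stay untouched
    if isinstance(value, str) and value.startswith("["):
        return ast.literal_eval(value)
    return value


reloaded["diagnosis"] = reloaded["diagnosis"].map(parse_diagnosis)
print(reloaded["diagnosis"].tolist())
```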

Setting the default style of the plots:

Code

sns.set_style("white")
sns.set_palette("Accent")

Preliminary Analysis

Information about the dataset features:

Code

final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 682 entries, 0 to 681
Data columns (total 33 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    682 non-null    float64
 1   b/a__tag              682 non-null    object 
 2   got                   682 non-null    float64
 3   gpt                   682 non-null    float64
 4   ldh                   682 non-null    float64
 5   alp                   682 non-null    float64
 6   tp                    682 non-null    float64
 7   alb                   682 non-null    float64
 8   ua                    682 non-null    float64
 9   un                    682 non-null    float64
 10  cre                   682 non-null    float64
 11  t-bil                 682 non-null    float64
 12  t-cho                 682 non-null    float64
 13  tg                    682 non-null    float64
 14  wbc                   682 non-null    float64
 15  rbc                   682 non-null    float64
 16  hgb                   682 non-null    float64
 17  hct                   682 non-null    float64
 18  plt                   682 non-null    float64
 19  u-pro                 682 non-null    float64
 20  c3                    682 non-null    float64
 21  c4                    682 non-null    float64
 22  a_cl _ig_g            682 non-null    float64
 23  a_cl _ig_m            682 non-null    float64
 24  ana                   682 non-null    float64
 25  ana _pattern          682 non-null    object 
 26  a_cl _ig_a            682 non-null    float64
 27  symptoms              682 non-null    object 
 28  thrombosis            682 non-null    int64  
 29  sex                   682 non-null    object 
 30  diagnosis             682 non-null    object 
 31  age                   682 non-null    object 
 32  thrombosis_diagnosis  682 non-null    int64  
dtypes: float64(25), int64(2), object(6)
memory usage: 176.0+ KB

Statistical Analysis of the Dataset:

Code

final_df.describe()

	id	got	gpt	ldh	alp	tp	alb	ua	un	cre	t-bil	t-cho	tg	wbc	rbc	hgb	hct	plt	u-pro	c3	c4	a_cl _ig_g	a_cl _ig_m	ana	a_cl _ig_a	thrombosis	thrombosis_diagnosis
count	6.820000e+02	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000	682.000000
mean	4.701481e+06	25.554588	25.432193	348.515716	128.068644	6.288748	3.487965	3.632142	11.902266	0.545950	0.420021	148.807125	80.852254	6.463791	4.089160	12.100520	36.489660	224.730387	8.664434	49.893716	15.610479	16.571584	280.929619	478.392962	77.456452	0.199413	0.139296
std	1.369721e+06	29.433264	27.838277	204.129160	85.133350	1.476372	0.909155	1.331038	5.051944	0.237703	0.420512	57.803023	65.466995	2.810230	0.496685	1.524636	4.370481	85.121063	50.517087	26.652478	9.944405	109.647864	7165.056400	1095.472267	1858.855122	0.545815	0.346509
min	1.487200e+04	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.942857	2.090000	6.100000	18.150000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	4.537218e+06	15.255682	12.000000	254.836957	83.458766	5.696410	3.166250	2.807353	9.054924	0.425875	0.263356	116.139161	39.856456	4.483442	3.811571	11.300000	34.076639	173.307018	0.000000	33.000000	8.498464	0.000000	1.500000	4.000000	0.000000	0.000000	0.000000
50%	5.317729e+06	19.000000	17.000000	327.722222	113.061224	6.630128	3.693990	3.600000	11.313333	0.520000	0.355495	153.000000	70.200521	5.812500	4.124375	12.176562	36.732796	215.785714	0.000000	51.873016	15.000000	1.000000	2.200000	16.000000	1.800000	0.000000	0.000000
75%	5.555816e+06	25.232955	27.887821	409.270534	153.805978	7.200000	4.100000	4.323295	14.000000	0.618566	0.500000	187.967742	108.187500	7.778217	4.395000	13.189904	39.483462	263.503906	1.428571	67.080357	21.245192	2.000000	4.000000	256.000000	6.300000	0.000000	0.000000
max	5.779550e+06	422.500000	320.000000	3159.000000	940.500000	9.700000	5.000000	9.706218	60.264249	3.543161	8.635294	433.000000	569.000000	26.866667	5.687143	16.900000	49.685714	685.000000	1000.000000	154.500000	71.000000	1502.400000	187122.000000	4096.000000	48547.000000	3.000000	1.000000

Visual Analysis

Distribution of the Age of the Patients:

Code

# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))
# CREATING A COUNT PLOT
sns.countplot(
    x="age", data=final_df, ax=ax, order=["0-18", "19-30", "31-45", "46-60", "61+"]
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_title("Age Distribution of the Patients", fontsize=15)
ax.set_xlabel("Age Bands", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)

# SETTING THE GRID LINES AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
_ = ax.bar_label(ax.containers[0])

plt.show()

The majority of the patients tested for Thrombosis were in the 19-30 age band. This does not mean that this age group is more likely to be diagnosed with Thrombosis than others; it only shows that it is the most common age group in the dataset and was tested for Thrombosis most often.

The Distribution of the number of people who were diagnosed with Thrombosis along with those who were not:

Code

# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))

# CREATING A BAR PLOT
sns.barplot(
    x=final_df["thrombosis_diagnosis"].value_counts().index,
    y=final_df["thrombosis_diagnosis"].value_counts().values,
    ax=ax,
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_title("Distribution of Positive and Negative Thrombosis Patients", fontsize=15)
ax.set_xlabel("Diagnosis", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
_ = ax.set_xticklabels(["Negative", "Positive"])

# SETTING THE GRID LINES AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
_ = ax.bar_label(ax.containers[0])

plt.show()

Only about 14% of the records in the final dataset (95 of 682) carry a positive Thrombosis diagnosis.
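The share of positive records can be read directly off the data with `value_counts(normalize=True)`. The sketch below rebuilds the column from the `describe()` output above (count 682, mean 0.139296, which corresponds to 95 positive and 587 negative records) rather than loading the real file.

```python
import pandas as pd

# Reconstructed from describe(): 682 records, of which 95 are positive
diagnosis = pd.Series([0] * 587 + [1] * 95, name="thrombosis_diagnosis")

# normalize=True returns fractions instead of counts
share = diagnosis.value_counts(normalize=True).mul(100).round(1)
print(share)
```

Dividing positives by negatives instead (95 / 587) would give roughly 16%, which is a ratio between the two classes and not the share of all records.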

Outlier Detection of all the features present in the Dataset:

Code

# DIVIDING THE COLUMNS INTO GROUPS
cols1 = ["got", "gpt", "ldh", "alp", "tp", "alb"]
cols2 = ["ua", "un", "cre", "t-bil", "t-cho", "tg", "u-pro"]
cols3 = ["wbc", "rbc", "hgb", "hct", "plt", "c3", "c4"]
cols4 = ["a_cl _ig_g", "a_cl _ig_m", "ana", "a_cl _ig_a"]

# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(2, 2, figsize=(10, 10))

# CREATING A BOXPLOT FOR EACH GROUP
sns.boxplot(data=final_df[cols1], ax=ax[0, 0])
sns.boxplot(data=final_df[cols2], ax=ax[0, 1])
sns.boxplot(data=final_df[cols3], ax=ax[1, 0])
sns.boxplot(data=final_df[cols4], ax=ax[1, 1])

# SETTING THE TITLE
plt.suptitle("Outlier Detection of the Features", fontsize=16, y=0.94)
ax[0, 0].set_title("Blood Chemistry", fontsize=14)
ax[0, 1].set_title("Urinalysis", fontsize=14)
ax[1, 0].set_title("Complete Blood Count", fontsize=14)
ax[1, 1].set_title("Immunology", fontsize=14)

# SETTING THE TICKSIZE AND GRID LINES
for i in range(2):
    for j in range(2):
        ax[i, j].tick_params(labelsize=11)
        ax[i, j].grid(True, axis="y", linestyle=":", linewidth=1)

plt.show()

Some features, such as "u-pro" (Proteinuria), "ldh" (Lactate Dehydrogenase) and "plt" (Platelet), have many outliers, while features such as "acl_igA" and "acl_igM" have fewer outliers, but those outliers take very large values and need to be dealt with.
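To put numbers on what the boxplots show, the outliers can be counted with the same 1.5 × IQR rule that seaborn uses by default for the whiskers. A minimal sketch on made-up values; `count_iqr_outliers` is a hypothetical helper.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series) -> int:
    """Count values lying more than 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())


# Made-up values; the column names only mirror the notebook's features
demo = pd.DataFrame(
    {
        "plt": [150.0, 180.0, 200.0, 220.0, 250.0, 685.0],
        "alb": [3.1, 3.4, 3.6, 3.7, 4.0, 4.1],
    }
)
outlier_counts = {col: count_iqr_outliers(demo[col]) for col in demo.columns}
print(outlier_counts)
```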

Distribution of Diagnosis of Thrombosis with respect to the gender of the patient:

Code

# CREATING A NEW DATAFRAME FOR GENDER PERCENT DISTRIBUTION
x, y = "sex", "thrombosis_diagnosis"
new = final_df[final_df[x] != "Not Available"]
new = new.groupby(x)[y].value_counts(normalize=True)
new = new.mul(100)
new = new.rename("Percent").reset_index()
new["thrombosis_diagnosis"] = new["thrombosis_diagnosis"].apply(
    lambda x: "Thrombosis" if x > 0 else "No Thrombosis"
)

# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))

# CREATING A BAR PLOT
sns.barplot(x="sex", y="Percent", hue="thrombosis_diagnosis", data=new, ax=ax)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
ax.set_ylim(0, 100)
ax.set_title(
    "Thrombosis Diagnosis Distribution w.r.t Gender of the Patient", fontsize=15
)
ax.set_xlabel("Gender", fontsize=13)
ax.set_ylabel("Percent", fontsize=13)
ax.tick_params(labelsize=11)
ax.set_xticklabels(["Female", "Male"])

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
ax.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
for p in ax.patches:
    txt = str(p.get_height().round(2)) + "%"
    txt_x = p.get_x()
    txt_y = p.get_height()
    ax.text(txt_x + 0.1, txt_y, txt, va="bottom", fontsize=11)

plt.savefig("plot_outputs/plot-01.png")
plt.show()

The graph above shows that, in this dataset, females are more likely to test positive for Thrombosis than males. However, these records come from a single hospital, so the observation cannot be generalized to the entire world. Generalizing would require more information about the demographics of the area where the University Hospital is located and the extent to which those factors affected the patients.

Distribution of the severity of Thrombosis among different Age groups:

Code

# INITIALISING FIGURE AND AXES
fig, ax = plt.subplots(figsize=(8, 5))

# CREATING A COUNT PLOT
sns.countplot(
    x=final_df["thrombosis"][final_df["thrombosis"] > 0],
    ax=ax,
    hue=final_df["age"],
    hue_order=["0-18", "19-30", "31-45", "46-60", "61+"],
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Distribution of Degrees of Thrombosis w.r.t Age Bands", fontsize=15)
ax.set_title(
    "1 = Positive | 2 = Positive and very severe | 3 = Positive and extremely severe",
    fontsize=11,
)
ax.set_xlabel("Thrombosis", fontsize=13)
ax.set_ylabel("Count", fontsize=13)
ax.tick_params(labelsize=11)
ax.set_yticks(np.arange(0, 21, 5))

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax.grid(True, axis="y", linestyle=":", linewidth=1)
for container in ax.containers:
    _ = ax.bar_label(container)
ax.legend(title="Age Bands", title_fontsize=11, fontsize=10)

plt.savefig("plot_outputs/plot-02.png")
plt.show()

Mild Thrombosis has far more patients than the other degrees in every age band, and patients aged 19 to 45 account for most of the Mild Thrombosis diagnoses. The pattern changes for Severe Thrombosis, where patients who are not yet adults (0-18 years) are diagnosed most often. Finally, Extremely Severe Thrombosis is diagnosed in patients aged 31-60. In this dataset, patients aged 61 and above have the fewest Thrombosis diagnoses.
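The plot compares counts, which also depend on how many records each age band contains. To compare how likely a diagnosis is within each band, the positives can be divided by the band size. A sketch on made-up records, in which two bands have the same number of positives but very different rates:

```python
import pandas as pd

# Made-up records: one positive in each band, but the bands differ in size
demo = pd.DataFrame(
    {
        "age": ["19-30"] * 6 + ["61+"] * 2,
        "thrombosis_diagnosis": [1, 0, 0, 0, 0, 0, 1, 0],
    }
)

# Positives, band size and the resulting rate per age band
summary = demo.groupby("age")["thrombosis_diagnosis"].agg(
    positives="sum", records="count"
)
summary["rate_pct"] = summary["positives"] / summary["records"] * 100
print(summary)
```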

Predictors that are correlated to Thrombosis (target variable):

Code

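# DROPPING THE COLUMNS THAT SHOULD NOT BE PART OF THE CORRELATION ANALYSIS
corr_df = final_df.drop(["id", "thrombosis_diagnosis"], axis=1)

# CREATING A CORRELATION MATRIX FROM THE NUMERIC COLUMNS ONLY
# (numeric_only=True NEEDS PANDAS 1.5+; WITHOUT IT, RECENT PANDAS RAISES ON TEXT COLUMNS)
corr = corr_df.corr(numeric_only=True)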

# SETTING THE FIGURE SIZE
plt.figure(figsize=(8, 5))

# CREATING A BAR PLOT BASED ON THE CORRELATION COEFFICIENTS
corr["thrombosis"].sort_values(ascending=False).plot(kind="bar")

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.title("Correlation between Thrombosis and other Variables", fontsize=15)
plt.grid(True, axis="y", linestyle=":", linewidth=1)
plt.ylabel("Correlation Coefficient", fontsize=13)
plt.xlabel("Variables", fontsize=13)
plt.tick_params(labelsize=11)

plt.savefig("plot_outputs/plot-03.png")
plt.show()

Features that are highly correlated to Thrombosis (target variable) are:
- acl_igG (Anticardiolipin Antibody IgG)
- ANA (Antinuclear Antibody)
- U-Pro (Proteinuria)
- GPT (ALT glutamic pyruvic transaminase)
- GOT (AST glutamic oxaloacetic transaminase)
- C3 (Complement 3)
- C4 (Complement 4)
- RBC (Red Blood Cells)
- HCT (Hematocrit)
- PLT (Platelet)
- HGB (Hemoglobin)
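The ranking above comes from `DataFrame.corr`, which computes Pearson correlation by default. Because the thrombosis column is an ordinal score (0 to 3) and several tests contain extreme values, a rank-based Spearman correlation is a possible cross-check. The sketch below uses made-up values and assumes pandas 1.5 or later for `numeric_only`.

```python
import pandas as pd

# Made-up values; "a_cl_ig_g" includes one extreme reading
demo = pd.DataFrame(
    {
        "thrombosis": [0, 0, 0, 1, 1, 2, 3],
        "a_cl_ig_g": [0.0, 1.3, 2.0, 4.3, 9.0, 30.0, 1502.4],
        "sex": ["F", "F", "M", "F", "F", "M", "F"],
    }
)

# numeric_only=True leaves out text columns such as "sex"
pearson = demo.corr(numeric_only=True)["thrombosis"]
spearman = demo.corr(method="spearman", numeric_only=True)["thrombosis"]

print(pearson.round(3))
print(spearman.round(3))
```

Spearman works on ranks, so a single extreme reading cannot dominate the coefficient the way it can with Pearson.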

Final Plots

Comparing Different Medical Tests with Tests specific to Thrombosis (Anti-Cardiolipin Antibody (IgG)):
Since Anti-Cardiolipin Antibody (IgG) is the feature most correlated with Thrombosis, we compare it with other medical tests to see whether they can be used to improve the accuracy of the diagnosis.

Anti-Cardiolipin Antibody (IgG) V/S ALT glutamic pyruvic transaminase (GPT):

Code

# CREATING A NEW DATAFRAME FOR COMPARISON BETWEEN MEDICAL TESTS AND THROMBOSIS SPECIFIC TESTS
# (.copy() SO THAT THE NEW COLUMNS ARE NOT WRITTEN TO A VIEW OF final_df)
new_df = final_df[final_df["a_cl _ig_g"] < 100].copy()
new_df["thrombosis_diagnosis"] = new_df["thrombosis_diagnosis"].apply(
    lambda x: "Thrombosis" if x > 0 else "No Thrombosis"
)
new_df1 = final_df[
    (final_df["a_cl _ig_g"] < 100) & (final_df["thrombosis_diagnosis"] > 0)
].copy()
new_df1["thrombosis"] = new_df1["thrombosis"].apply(
    lambda x: "Severe" if x > 1 else "Mild"
)

# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="gpt",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="gpt",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle(
    "Anti-Cardiolipin Antibody V/S ALT glutamic pyruvic transaminase", fontsize=15
)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=12)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "ALT glutamic pyruvic transaminase (GPT)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-04.png")
plt.show()

ALT glutamic pyruvic transaminase (GPT) has a normal range of <60. Most patients diagnosed with Thrombosis had GPT <60, with only a few exceptions outside the normal range. The same holds when the diagnosed patients are split by Degree of Thrombosis: most have normal GPT. The one patient with GPT of 300+ has a severe case of thrombosis, which is interesting but is a single observation. GPT is therefore not a useful test for diagnosing Mild Thrombosis, although the relationship between abnormally high GPT and the severity of Thrombosis could be studied further with a larger volume of data.
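Statements such as "most diagnosed patients have normal GPT" can be made concrete by computing the share of in-range values per group. The helper below is a hypothetical sketch on made-up values; `share_in_range` is not part of the notebook, and only the <60 cutoff comes from the text above.

```python
import pandas as pd

def share_in_range(df, column, low, high, by):
    """Fraction of rows per `by` group whose `column` lies in [low, high)."""
    in_range = (df[column] >= low) & (df[column] < high)
    return in_range.groupby(df[by]).mean()

# Toy values, not taken from the dataset.
toy = pd.DataFrame({
    "gpt": [12, 35, 58, 75, 310, 20, 44, 90],
    "thrombosis_diagnosis": ["No Thrombosis"] * 4 + ["Thrombosis"] * 4,
})

# GPT normal range of <60, as stated in the text.
result = share_in_range(toy, "gpt", low=0, high=60, by="thrombosis_diagnosis")
print(result)
```

If the in-range share is about the same for diagnosed and undiagnosed patients, the test carries little diagnostic information, which is the comparison the scatter plots make visually.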

Anti-Cardiolipin Antibody (IgG) V/S AST glutamic oxaloacetic transaminase (GOT):

Code

# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="got",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="got",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle(
    "Anti-Cardiolipin Antibody V/S AST glutamic oxaloacetic transaminase", fontsize=15
)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(
    0.5, 0.02, "AST glutamic oxaloacetic transaminase (GOT)", ha="center", fontsize=13
)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-05.png")
plt.show()

AST glutamic oxaloacetic transaminase (GOT) has a normal range of <60 (similar to GPT). Most patients diagnosed with Thrombosis had GOT <60, with some exceptions outside the normal range. Split by Degree of Thrombosis, most diagnosed patients again have normal GOT. The patients with GOT of 100+ have a severe case of thrombosis, which is interesting. GOT is therefore not a useful test for diagnosing Mild Thrombosis but can be considered when diagnosing cases of Severe Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Complement 3 (C3):

Code

# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="c3",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="c3",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Complement 3", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Complement 3 (C3)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-06.png")
plt.show()

Complement 3 (C3) has a normal range of >35. Many patients diagnosed with Thrombosis had C3 <35, which suggests that patients with abnormally low C3 are more prone to Thrombosis. Split by Degree of Thrombosis, Mild and Severe cases are distributed similarly. C3 is therefore a useful test to consider when diagnosing Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Complement 4 (C4):

Code

# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="c4",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="c4",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Complement 4", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Complement 4 (C4)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-07.png")
plt.show()

Complement 4 (C4) has a normal range of >10. Almost all patients diagnosed with Thrombosis had C4 <18, concentrated at the low end of the scale (from 0 to 15), which suggests that patients with low C4 are more prone to Thrombosis. Split by Degree of Thrombosis, Mild and Severe cases are distributed similarly. C4 is therefore a useful test to consider when diagnosing Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Hemoglobin (HGB):

Code

# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="hgb",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="hgb",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Hemoglobin", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Hemoglobin (HGB)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-08.png")
plt.show()

Hemoglobin (HGB) has a normal range between 10 and 17. All patients diagnosed with Thrombosis had HGB within this range, so HGB is not a useful test to consider when diagnosing Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Hematocrit (HCT):

Code

# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="hct",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="hct",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Hematocrit", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Hematocrit (HCT)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-09.png")
plt.show()

Hematocrit (HCT) has a normal range between 29 and 52. More than 95% of the patients diagnosed with Thrombosis had HCT between 29 and 52, so HCT is not a useful test to consider when diagnosing Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Platelet Count (PLT):

Code

# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="plt",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="plt",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Platelet Count", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Platelet Count (PLT)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-10.png")
plt.show()

Platelet (PLT) has a normal range between 100 and 400. It can be observed that more than 95% of the patients who were diagnosed with Thrombosis had their PLT between 100 and 400. This shows that PLT is not a good test to be taken into consideration while diagnosing Thrombosis.

Anti-Cardiolipin Antibody (IgG) V/S Proteinuria (U-PRO):

Code

# INITIALISING FIGURE AND AXES
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))

# CREATING SCATTER PLOTS
sns.scatterplot(
    data=new_df,
    y="a_cl _ig_g",
    x="u-pro",
    ax=ax1,
    hue="thrombosis_diagnosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
)
sns.scatterplot(
    data=new_df1,
    y="a_cl _ig_g",
    x="u-pro",
    ax=ax2,
    hue="thrombosis",
    s=50,
    edgecolor="black",
    linewidth=0.4,
    palette="YlOrBr",
)

# SETTING THE TITLE, X-AXIS LABEL, Y-AXIS LABEL AND TICKS SIZE
plt.suptitle("Anti-Cardiolipin Antibody V/S Proteinuria", fontsize=15)
ax1.set_ylabel("Anti-Cardiolipin Antibody (IgG)", fontsize=13)
ax1.set_xlabel("")
ax1.tick_params(labelsize=11)
ax2.set_ylabel("")
ax2.set_xlabel("")
ax2.tick_params(labelsize=11)
fig.text(0.5, 0.02, "Proteinuria (U-PRO)", ha="center", fontsize=13)

# SETTING THE GRID LINES, LEGEND AND BAR LABELS
ax1.legend(title="Diagnosis", title_fontsize=11, fontsize=10, loc="upper right")
ax1.grid(True, axis="y", linestyle=":", linewidth=1)
ax2.legend(title="Degree", title_fontsize=11, fontsize=10, loc="upper right")
ax2.grid(True, axis="y", linestyle=":", linewidth=1)

plt.savefig("plot_outputs/plot-11.png")
plt.show()

Proteinuria (U-PRO) has a normal range between 0 and 30. Most patients diagnosed with Thrombosis had U-PRO between 0 and 30, with only a few exceptions outside the normal range. Split by Degree of Thrombosis, most diagnosed patients again have normal U-PRO. The patients with U-PRO of 100+ have a severe case of thrombosis, which is interesting. U-PRO is therefore not a useful test for diagnosing Mild Thrombosis, although the relationship between abnormally high U-PRO and the severity of Thrombosis could be studied further with a larger volume of data.

Technical Summary

The 3 datasets provided by the University Hospital revolve mainly around the Thrombosis Diagnosis. All the datasets contain the ID of the patient, which was the key used to join them. Together they roughly cover the demographics of each patient, the medical test history of each patient, and the factors for thrombosis diagnosis of some patients. There was a large amount of missing data in all these datasets, since some patients opt out of providing their personal information and/or have only a few medical tests done per visit. These missing values were carefully imputed after going through the entire datasets and with the help of some domain knowledge. Along with the imputation, the columns and rows that had more than 70% missing values were dropped from the data to avoid imputing values on too little evidence. Some feature extraction was also performed, such as calculating the Age of the Patient at the time the Thrombosis tests were performed, which was useful later in the analysis.
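The 70% rule described above can be sketched as follows. This is an illustrative helper on made-up data, not the code used earlier in the notebook: the function name `drop_sparse`, the toy columns, and the choice to filter columns before rows are all assumptions.

```python
import numpy as np
import pandas as pd

def drop_sparse(df, max_missing=0.7):
    """Drop columns, then rows, whose share of missing values exceeds max_missing."""
    col_missing = df.isna().mean(axis=0)
    out = df.loc[:, col_missing <= max_missing]
    row_missing = out.isna().mean(axis=1)
    return out.loc[row_missing <= max_missing]

# Toy data; the values are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mostly_empty": [np.nan, np.nan, np.nan, 5.0],  # 75% missing, dropped
    "c3": [40.0, np.nan, 38.0, 51.0],
    "c4": [12.0, np.nan, 9.0, 15.0],
    "hgb": [13.1, np.nan, 11.8, 14.0],
})

cleaned = drop_sparse(toy)
print(cleaned)
```

Filtering columns first means a row is judged only on the columns that survive, so a patient is not discarded because of tests that almost nobody took.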
After getting the final dataframe, Exploratory Data Analysis (Preliminary and Visual) was performed to understand the data better and answer the initial questions. Most of the patients who were tested for Thrombosis were between ages 19 and 45, and only a few patients above 61 years of age were tested. The split between Positive and Negative Thrombosis patients was very uneven (only 16% of the patients tested were diagnosed as positive). If further statistical analysis and machine learning algorithms are to be applied to predict the Thrombosis results, more data for Positive thrombosis patients needs to be introduced, since the dataset is currently imbalanced. Patients aged 19 to 45 have the highest probability of being diagnosed with Mild Thrombosis. This pattern changes for Severe Thrombosis, where patients who are not yet adults (0-18 years) are most likely to be diagnosed. Females are more likely to test positive for Thrombosis than Males, but these reports come from just one hospital and the observation cannot be generalized to the entire world.
Apart from Thrombosis-specific tests such as acl_igG (Anticardiolipin Antibody IgG) and ANA (Antinuclear Antibody), common medical tests such as U-Pro (Proteinuria), GPT (ALT glutamic pyruvic transaminase), GOT (AST glutamic oxaloacetic transaminase), C3 and C4 (Complement 3 and 4), RBC (Red Blood Cells), HCT (Hematocrit), PLT (Platelet) and HGB (Hemoglobin) were found to be correlated with the Thrombosis Diagnosis. Since correlation coefficients cannot be the only measure of whether these tests are relevant when diagnosing Thrombosis, further Visual analysis was performed to see whether these tests show any significant patterns to look out for.
Tests like C3 and C4 showed significant patterns, while tests like HGB, HCT and Platelet count did not. Some tests like U-PRO, GOT and GPT showed a few patterns that might be worth looking at when diagnosing severe cases of Thrombosis. Overall, the data required a lot of assumptions while imputing the missing values, and the analysis performed was only preliminary. This analysis can be broken down further to a granular level, to understand the factors affecting Thrombosis Diagnosis, by applying Machine Learning algorithms to predict the Thrombosis Diagnosis.
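The class imbalance noted in this summary (about 16% positive) matters for any follow-up modelling. When collecting more positive cases is not possible, one common workaround is to weight each class inversely to its frequency. The sketch below uses a synthetic label column built to match the 16% figure; the Series and its name are stand-ins, not the project's data.

```python
import pandas as pd

# Synthetic labels: 16 positive and 84 negative, mirroring the 16% share.
labels = pd.Series([1] * 16 + [0] * 84, name="thrombosis_diagnosis")

# Inverse-frequency weights: n_samples / (n_classes * class_count).
counts = labels.value_counts()
weights = len(labels) / (counts.size * counts)
print(weights)
```

This is the same heuristic that scikit-learn applies when an estimator is given `class_weight="balanced"`, so the weights could be passed to a classifier as a dictionary if the analysis is extended in that direction.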