This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The datasets consists of several medical predictor variables and one target variable, Outcome. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Acccording to NIH, "Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. Sometimes your body doesn’t make enough—or any—insulin or doesn’t use insulin well. Glucose then stays in your blood and doesn’t reach your cells.
Over time, having too much glucose in your blood can cause health problems. Although diabetes has no cure, you can take steps to manage your diabetes and stay healthy.
Sometimes people call diabetes “a touch of sugar” or “borderline diabetes.” These terms suggest that someone doesn’t really have diabetes or has a less serious case, but every case of diabetes is serious.
Type 1 diabetes If you have type 1 diabetes, your body does not make insulin. Your immune system attacks and destroys the cells in your pancreas that make insulin. Type 1 diabetes is usually diagnosed in children and young adults, although it can appear at any age. People with type 1 diabetes need to take insulin every day to stay alive.
Type 2 diabetes If you have type 2 diabetes, your body does not make or use insulin well. You can develop type 2 diabetes at any age, even during childhood. However, this type of diabetes occurs most often in middle-aged and older people. Type 2 is the most common type of diabetes.
Gestational diabetes Gestational diabetes develops in some women when they are pregnant. Most of the time, this type of diabetes goes away after the baby is born. However, if you’ve had gestational diabetes, you have a greater chance of developing type 2 diabetes later in life. Sometimes diabetes diagnosed during pregnancy is actually type 2 diabetes.
Other types of diabetes Less common types include monogenic diabetes, which is an inherited form of diabetes, and cystic fibrosis-related diabetes ."
`
"The Pima (or Akimel O'odham, also spelled Akimel O'otham, "River People", formerly known as Pima) are a group of Native Americans living in an area consisting of what is now central and southern Arizona. The majority population of the surviving two bands of the Akimel O'odham are based in two reservations: the Keli Akimel O'otham on the Gila River Indian Community (GRIC) and the On'k Akimel O'odham on the Salt River Pima-Maricopa Indian Community (SRPMIC)." Wikipedia
#data Manipulation
import pandas as pd
#Mathematical operation
import numpy as np
#data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
#Remove Warnings
import warnings
warnings.filterwarnings('ignore')
#Ml Algoriothm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score,classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.metrics import accuracy_score,mean_absolute_error,mean_squared_error,r2_score,confusion_matrix
#Load the Dataset
data = pd.read_csv(r"C:\Users\Lenovo\Documents\jupyter\DataSets\diabetes.csv")
data.sample(10)
# Creating copy of actual dataframe
df=data.copy()
#check for shape
data.shape
(768, 9)
From above cell we see that there are 768 observation and 9 features in our data
#check for columns
data.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')
# check info of each column
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 768 entries, 0 to 767 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pregnancies 768 non-null int64 1 Glucose 768 non-null int64 2 BloodPressure 768 non-null int64 3 SkinThickness 768 non-null int64 4 Insulin 768 non-null int64 5 BMI 768 non-null float64 6 DiabetesPedigreeFunction 768 non-null float64 7 Age 768 non-null int64 8 Outcome 768 non-null int64 dtypes: float64(2), int64(7) memory usage: 54.1 KB
From above cell we see that there are two column contain float values and 7 column contain integer value also there are no missing values in our data
#check for duplicate value
data.duplicated().sum()
0
From above cell we see that there are no duplicate values in our data
# summary statistics of numerical columns
data.describe()
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
univariate analysis is the first step in statistical analysis, providing a foundation for understanding the properties of individual variables before moving on to more complex analyses involving multiple variables.
ax=sns.countplot(x='Pregnancies',data=data,palette='magma')
for i in ax.containers:
ax.bar_label(i)
plt.title("Pregnancies")
plt.show()
sns.displot(x='Glucose',data=data,kind='hist',kde=True,palette='Set1')
plt.title('Distribution of Glucose')
plt.show()
From above chart we see that there are some people whose glucose level is 0 mg/dl that is not possible.
so we can say that we have incorrect information of these people.
now let's remove this incorrect data
#Let's check incorrect data
data[data['Glucose']==0]
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
75 | 1 | 0 | 48 | 20 | 0 | 24.7 | 0.140 | 22 | 0 |
182 | 1 | 0 | 74 | 20 | 23 | 27.7 | 0.299 | 21 | 0 |
342 | 1 | 0 | 68 | 35 | 0 | 32.0 | 0.389 | 22 | 0 |
349 | 5 | 0 | 80 | 32 | 0 | 41.0 | 0.346 | 37 | 1 |
502 | 6 | 0 | 68 | 41 | 0 | 39.0 | 0.727 | 41 | 1 |
From above data we can see that above data is incorrect beacause these people's have insulin level is also 0 mu that is also not possible, so let's drop these rows
x=data[data['Glucose']==0].index.to_list()
data.drop(index=x,inplace=True)
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))
sns.histplot(df['Glucose'],kde=True,ax=axes[0])
sns.histplot(data['Glucose'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of Glucose before replacing')
axes[1].set_title('Distribution of Glucose after replacing')
plt.show()
sns.displot(x='BloodPressure',data=data,kind='hist',kde=True)
plt.title('Distribution of Blood Pressure')
plt.show()
From above distribution plot we see that there are a few people whose blood pressure is 0 mm(hg) that is also not possible
BloodPressure
column.#check incorrect data
print("Shape of incorrect data:", data[data['BloodPressure']==0].shape)
data[data['BloodPressure']==0]
Shape of incorrect data: (35, 9)
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
15 | 7 | 100 | 0 | 0 | 0 | 30.0 | 0.484 | 32 | 1 |
49 | 7 | 105 | 0 | 0 | 0 | 0.0 | 0.305 | 24 | 0 |
60 | 2 | 84 | 0 | 0 | 0 | 0.0 | 0.304 | 21 | 0 |
78 | 0 | 131 | 0 | 0 | 0 | 43.2 | 0.270 | 26 | 1 |
81 | 2 | 74 | 0 | 0 | 0 | 0.0 | 0.102 | 22 | 0 |
172 | 2 | 87 | 0 | 23 | 0 | 28.9 | 0.773 | 25 | 0 |
193 | 11 | 135 | 0 | 0 | 0 | 52.3 | 0.578 | 40 | 1 |
222 | 7 | 119 | 0 | 0 | 0 | 25.2 | 0.209 | 37 | 0 |
261 | 3 | 141 | 0 | 0 | 0 | 30.0 | 0.761 | 27 | 1 |
266 | 0 | 138 | 0 | 0 | 0 | 36.3 | 0.933 | 25 | 1 |
269 | 2 | 146 | 0 | 0 | 0 | 27.5 | 0.240 | 28 | 1 |
300 | 0 | 167 | 0 | 0 | 0 | 32.3 | 0.839 | 30 | 1 |
332 | 1 | 180 | 0 | 0 | 0 | 43.3 | 0.282 | 41 | 1 |
336 | 0 | 117 | 0 | 0 | 0 | 33.8 | 0.932 | 44 | 0 |
347 | 3 | 116 | 0 | 0 | 0 | 23.5 | 0.187 | 23 | 0 |
357 | 13 | 129 | 0 | 30 | 0 | 39.9 | 0.569 | 44 | 1 |
426 | 0 | 94 | 0 | 0 | 0 | 0.0 | 0.256 | 25 | 0 |
430 | 2 | 99 | 0 | 0 | 0 | 22.2 | 0.108 | 23 | 0 |
435 | 0 | 141 | 0 | 0 | 0 | 42.4 | 0.205 | 29 | 1 |
453 | 2 | 119 | 0 | 0 | 0 | 19.6 | 0.832 | 72 | 0 |
468 | 8 | 120 | 0 | 0 | 0 | 30.0 | 0.183 | 38 | 1 |
484 | 0 | 145 | 0 | 0 | 0 | 44.2 | 0.630 | 31 | 1 |
494 | 3 | 80 | 0 | 0 | 0 | 0.0 | 0.174 | 22 | 0 |
522 | 6 | 114 | 0 | 0 | 0 | 0.0 | 0.189 | 26 | 0 |
533 | 6 | 91 | 0 | 0 | 0 | 29.8 | 0.501 | 31 | 0 |
535 | 4 | 132 | 0 | 0 | 0 | 32.9 | 0.302 | 23 | 1 |
589 | 0 | 73 | 0 | 0 | 0 | 21.1 | 0.342 | 25 | 0 |
601 | 6 | 96 | 0 | 0 | 0 | 23.7 | 0.190 | 28 | 0 |
604 | 4 | 183 | 0 | 0 | 0 | 28.4 | 0.212 | 36 | 1 |
619 | 0 | 119 | 0 | 0 | 0 | 32.4 | 0.141 | 24 | 1 |
643 | 4 | 90 | 0 | 0 | 0 | 28.0 | 0.610 | 31 | 0 |
697 | 0 | 99 | 0 | 0 | 0 | 25.0 | 0.253 | 22 | 0 |
703 | 2 | 129 | 0 | 0 | 0 | 38.5 | 0.304 | 41 | 0 |
706 | 10 | 115 | 0 | 0 | 0 | 0.0 | 0.261 | 30 | 1 |
From above we see that there are 35 observations with incorrect data in Glucose column
BloodPressure
columndata['BloodPressure'].replace(0,np.nan,inplace=True)
data['BloodPressure'].fillna(data['BloodPressure'].mean(skipna=True),inplace=True)
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))
sns.histplot(df['BloodPressure'],kde=True,ax=axes[0])
sns.histplot(data['BloodPressure'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of Blood Pressure before replacing')
axes[1].set_title('Distribution of Blood Pressure after replacing')
plt.show()
sns.displot(x='SkinThickness',data=data,kind='hist',kde=True)
plt.title('Distribution of Skin Thickness')
plt.show()
From above displot we see that there are two people who does not follow the trend so we can consider them as a outlier Let's remove the outlier
data[data['SkinThickness']>60]
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
445 | 0 | 180 | 78.0 | 63 | 14 | 59.4 | 2.420 | 25 | 1 |
579 | 2 | 197 | 70.0 | 99 | 0 | 34.7 | 0.575 | 62 | 1 |
# drop outliers
data.drop(index=[445,579],inplace=True)
sns.displot(x='SkinThickness',data=data,kind='hist',kde=True)
plt.title('Distribution of Skin Thickness')
plt.show()
From above distribution chart we see that there are also might be incorrect data in skin thickness
column
data[data['SkinThickness']==0]
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
2 | 8 | 183 | 64.000000 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
5 | 5 | 116 | 74.000000 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
7 | 10 | 115 | 72.438187 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
9 | 8 | 125 | 96.000000 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |
10 | 4 | 110 | 92.000000 | 0 | 0 | 37.6 | 0.191 | 30 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
757 | 0 | 123 | 72.000000 | 0 | 0 | 36.3 | 0.258 | 52 | 1 |
758 | 1 | 106 | 76.000000 | 0 | 0 | 37.5 | 0.197 | 26 | 0 |
759 | 6 | 190 | 92.000000 | 0 | 0 | 35.5 | 0.278 | 66 | 1 |
762 | 9 | 89 | 62.000000 | 0 | 0 | 22.5 | 0.142 | 33 | 0 |
766 | 1 | 126 | 60.000000 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
227 rows × 9 columns
From above data we can see that there are 227 observation with incorrect values in Skin Thickness column
data[data['SkinThickness']!=0]['SkinThickness'].describe()
count 534.000000 mean 28.955056 std 9.960421 min 7.000000 25% 22.000000 50% 29.000000 75% 36.000000 max 60.000000 Name: SkinThickness, dtype: float64
From above cell we see that the IQR range of Skin Thickness is 22-36 So I am replacing the 0 values with any random value in range 22-36
def change(x):
if x==0:
return np.random.randint(22,36)
else:
return x
data['SkinThickness']=data['SkinThickness'].apply(change)
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))
sns.histplot(df['SkinThickness'],kde=True,ax=axes[0])
sns.histplot(data['SkinThickness'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of Skin Thickness before replacing')
axes[1].set_title('Distribution of Skin Thickness after replaing')
plt.show()
sns.displot(x='Insulin',data=data,kind='hist',kde=True)
plt.title('Distribution of Insulin')
plt.show()
From above displot we see that there are some people who does not follow the trend so we can consider them as a outlier Let's remove the outlier
#checking the outlier
data[data['Insulin']>600]
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
13 | 1 | 189 | 60.0 | 23 | 846 | 30.1 | 0.398 | 59 | 1 |
228 | 4 | 197 | 70.0 | 39 | 744 | 36.7 | 2.329 | 31 | 0 |
247 | 0 | 165 | 90.0 | 33 | 680 | 52.3 | 0.427 | 23 | 0 |
#REMOVE THE OUTLIER
data.drop(index=[13,228,247],inplace=True)
sns.displot(x='Insulin',data=data,kind='hist',kde=True)
plt.title('Distribution of Insulin')
plt.show()
From above distribution chart we see that there are also might be incorrect data in Insulin
column
# check incorrect data
data[(data['Insulin']==0)&(data['Outcome']==0)]
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 85 | 66.000000 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
5 | 5 | 116 | 74.000000 | 23 | 0 | 25.6 | 0.201 | 30 | 0 |
7 | 10 | 115 | 72.438187 | 24 | 0 | 35.3 | 0.134 | 29 | 0 |
10 | 4 | 110 | 92.000000 | 35 | 0 | 37.6 | 0.191 | 30 | 0 |
12 | 10 | 139 | 80.000000 | 27 | 0 | 27.1 | 1.441 | 57 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
756 | 7 | 137 | 90.000000 | 41 | 0 | 32.0 | 0.391 | 39 | 0 |
758 | 1 | 106 | 76.000000 | 23 | 0 | 37.5 | 0.197 | 26 | 0 |
762 | 9 | 89 | 62.000000 | 29 | 0 | 22.5 | 0.142 | 33 | 0 |
764 | 2 | 122 | 70.000000 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
767 | 1 | 93 | 70.000000 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
234 rows × 9 columns
From above data we can see that there are 234 observation with incorrect values in Insulin column because these people is not diabetic so insulin level of these people is not possible to be 0 mu.
data[(data['Insulin']==0)&(data['Outcome']==0)]['Insulin'].replace(0,data['Insulin'].median(),inplace=True)
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))
sns.histplot(df['Insulin'],kde=True,ax=axes[0])
sns.histplot(data['Insulin'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of Insulin before replacing')
axes[1].set_title('Distribution of Insulin after replacing')
plt.show()
sns.displot(x='BMI',data=data,kind='hist',kde=True)
plt.title('Distribution of BMI')
plt.show()
From above distribution chart we see that there are also might be incorrect data in BMI
column
data[data['BMI']==0]
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
---|---|---|---|---|---|---|---|---|---|
9 | 8 | 125 | 96.000000 | 27 | 0 | 0.0 | 0.232 | 54 | 1 |
49 | 7 | 105 | 72.438187 | 22 | 0 | 0.0 | 0.305 | 24 | 0 |
60 | 2 | 84 | 72.438187 | 22 | 0 | 0.0 | 0.304 | 21 | 0 |
81 | 2 | 74 | 72.438187 | 31 | 0 | 0.0 | 0.102 | 22 | 0 |
145 | 0 | 102 | 75.000000 | 23 | 0 | 0.0 | 0.572 | 21 | 0 |
371 | 0 | 118 | 64.000000 | 23 | 89 | 0.0 | 1.731 | 21 | 0 |
426 | 0 | 94 | 72.438187 | 31 | 0 | 0.0 | 0.256 | 25 | 0 |
494 | 3 | 80 | 72.438187 | 34 | 0 | 0.0 | 0.174 | 22 | 0 |
522 | 6 | 114 | 72.438187 | 29 | 0 | 0.0 | 0.189 | 26 | 0 |
684 | 5 | 136 | 82.000000 | 24 | 0 | 0.0 | 0.640 | 69 | 0 |
706 | 10 | 115 | 72.438187 | 25 | 0 | 0.0 | 0.261 | 30 | 1 |
From above data we can see that there are 11 observation with incorrect values in BMI column
#replace values
data['BMI'].replace(0,data['BMI'].mean(),inplace=True)
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))
sns.histplot(df['BMI'],kde=True,ax=axes[0])
sns.histplot(data['BMI'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of BMI before replacing')
axes[1].set_title('Distribution of BMI after replacing')
plt.show()
sns.displot(x='DiabetesPedigreeFunction',data=data,kind='hist',kde=True)
plt.title('Percentage degree function of diabetes')
plt.show()
sns.displot(x='Age',data=data,kind='hist',kde=True)
plt.title('Distribution of Age')
plt.show()
data.Outcome.value_counts()
0 495 1 263 Name: Outcome, dtype: int64
plt.pie(data.Outcome.value_counts(),labels=['Healthy','Diabetic'],autopct='%.2f%%')
plt.title('Outcome')
plt.show()
From above pie chart we see that there are 65.30%
persons are Healthy and 34.70%
persons are Diabetic
bivariate analysis is a fundamental tool for understanding the relationship between two variables, providing insights into how they are connected and how they might impact each other.
data.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')
sns.catplot(x="Outcome", y="Age", kind="swarm", data=data)
plt.title("Age Vs Outcome")
plt.show()
From above graph, it is clear that most of the patients are in the age group of 20-30 years. Patients in the age range 40-55 years are more likely to be diabetic, as compared to other age groups.
sns.boxplot(x='Outcome', y='Glucose', data=data).set_title('Glucose vs Diabetes')
plt.show()
From above boxplot we see that the people with 120 mg/dl or more than 120 mg/dl glucose level is more likely to be diabetic and people with around 100 mg/dl or less than 100 mg/dl glucose level is possible to be non diabetic
sns.violinplot(x='Outcome', y='BloodPressure', data=data, ).set_title('Blood Pressure and Diabetes')
plt.show()
From above voilinplot we can say that the distribution of the blood pressure for the diabetic patients is slightly higher than the non-diabetic patients.
sns.boxplot(x='Outcome',y='Insulin',data=data).set_title('Insulin vs Diabetes')
plt.show()
Here the boxplot shows the distribution of insulin level in patients. In non diabetic patients the insulin level is near to 100, whereas in diabetic patients the insulin level is near to 200. we can say that if one person has insulin level high then thier is a possibility person is diabetec
sns.violinplot(x='Outcome',y='BMI',data=data)
plt.title("BMI vs Diabetes")
plt.show()
From above violinplot reveals the BMI distribution, where the non dibetic patients have a increased spread from 25 to 35 with narrows after 35. However in diabetic patients there is increased spread at 35 and increased spread 45-50 as compared to non diabetic patients.Therefore BMI is a good predictor of diabetes and obese people are more likely to be diabetic.
sns.countplot(x=data['Pregnancies'],hue=data['Outcome'])
plt.title("Pregnancies Vs Diabetes")
plt.xticks(rotation=90)
plt.show()
From above countplot we see that female with more pregnancies more likely to be diabetec
Reason why female with more pregnancies more likely to be diabetic-
# Visualize the correlation map
plt.figure(figsize=(10,5))
sns.heatmap(data.corr(),annot=True,cmap='RdYlBu',fmt='.2f',
annot_kws=None,
linewidths=1)
plt.title("Understand the correlation with each columns")
plt.show()
data.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')
x=data[['Pregnancies', 'Glucose','BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']]
x
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | |
---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72.0 | 35 | 0 | 33.6 | 0.627 | 50 |
1 | 1 | 85 | 66.0 | 29 | 0 | 26.6 | 0.351 | 31 |
2 | 8 | 183 | 64.0 | 23 | 0 | 23.3 | 0.672 | 32 |
3 | 1 | 89 | 66.0 | 23 | 94 | 28.1 | 0.167 | 21 |
4 | 0 | 137 | 40.0 | 35 | 168 | 43.1 | 2.288 | 33 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
763 | 10 | 101 | 76.0 | 48 | 180 | 32.9 | 0.171 | 63 |
764 | 2 | 122 | 70.0 | 27 | 0 | 36.8 | 0.340 | 27 |
765 | 5 | 121 | 72.0 | 23 | 112 | 26.2 | 0.245 | 30 |
766 | 1 | 126 | 60.0 | 33 | 0 | 30.1 | 0.349 | 47 |
767 | 1 | 93 | 70.0 | 31 | 0 | 30.4 | 0.315 | 23 |
758 rows × 8 columns
In above cell I created a new dataframe that contains highly correlated feature to target variable (Outcome).
y=data[['Outcome']]
y
Outcome | |
---|---|
0 | 1 |
1 | 0 |
2 | 1 |
3 | 0 |
4 | 1 |
... | ... |
763 | 0 |
764 | 0 |
765 | 0 |
766 | 1 |
767 | 0 |
758 rows × 1 columns
In above cell I separate target variable Outcome
with other features
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
In Above cell-
training
and testing
sets using the train_test_split()
function from sklearn.The training set is used to train the models, and the testing set is used to evaluate their performance.
Take test_size
ratio of 75:25
it means we give 75% data to training and 30% for testing
random_state
= 1
(it means that it takes one observations randomly.)#Decision Tree Classifier
dtc=DecisionTreeClassifier(random_state=1)
# Train the model on the training data
dtc.fit(x_train,y_train)
# Make predictions on the test data
dtc_pred=dtc.predict(x_test)
# Evaluate the model's performance
dtc_accuracy=accuracy_score(dtc_pred,y_test)*100
print(f"Accuracy: {dtc_accuracy:.2f}%")
report = classification_report(dtc_pred, y_test)
# Print the classification report
print("Classification Report:")
print(report)
Accuracy: 68.86% Classification Report: precision recall f1-score support 0 0.78 0.76 0.77 156 1 0.51 0.53 0.52 72 accuracy 0.69 228 macro avg 0.64 0.65 0.64 228 weighted avg 0.69 0.69 0.69 228
Precision: From above cell precision represent that,
Recall: From above cell represent that,
f1-score:The F1-score is the harmonic mean of precision and recall and provides a balanced measure of the classifier's performance. It takes into account both false positives and false negatives. A higher F1-score indicates better performance.
#Naive Bayes
nv=GaussianNB()
# Train the model on the training data
nv.fit(x_train,y_train)
# Make predictions on the test data
nv_pred=nv.predict(x_test)
# Evaluate the model's performance
nv_accuracy=accuracy_score(nv_pred,y_test)*100
print(f"Accuracy: {nv_accuracy:.2f}%")
report = classification_report(nv_pred,y_test)
# Print the classification report
print("Classification Report:")
print(report)
Accuracy: 74.56% Classification Report: precision recall f1-score support 0 0.81 0.81 0.81 153 1 0.61 0.61 0.61 75 accuracy 0.75 228 macro avg 0.71 0.71 0.71 228 weighted avg 0.75 0.75 0.75 228
Precision: From above cell,
Recall: From above cell,
F1-score: From above cell,
Support: From above cell,
Accuracy: From above cell,
Macro Avg and Weighted Avg: These are the averages of precision, recall, and F1-score across classes, weighted or unweighted by class support, respectively.
#Logistic Regression
model = LogisticRegression()
# Train the model on the training data
model.fit(x_train, y_train)
# Make predictions on the test data
lr_pred = model.predict(x_test)
# Evaluate the model's performance
lr_accuracy = accuracy_score(y_test, lr_pred)*100
print(f"Accuracy: {lr_accuracy:.2f}%")
report = classification_report(lr_pred,y_test)
# Print the classification report
print("Classification Report:")
print(report)
Accuracy: 75.88% Classification Report: precision recall f1-score support 0 0.88 0.78 0.83 172 1 0.51 0.68 0.58 56 accuracy 0.76 228 macro avg 0.69 0.73 0.71 228 weighted avg 0.79 0.76 0.77 228
Precision: From above cell,
Recall: From above cell,
F1-score: From above cell,
Support: From above cell,
Accuracy: From above cell,
Macro Avg and Weighted Avg: These are the averages of precision, recall, and F1-score across classes, weighted or unweighted by class support, respectively.
# Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(x_train, y_train)
rf_accuracy = rf_model.score(x_test, y_test)*100
print(f'Random Forest Accuracy: {rf_accuracy:.2f}%')
# Make predictions on the test data
rf_pred = rf_model.predict(x_test)
report = classification_report(rf_pred,y_test)
# Print the classification report
print("Classification Report:")
print(report)
Random Forest Accuracy: 77.63% Classification Report: precision recall f1-score support 0 0.88 0.81 0.84 166 1 0.57 0.69 0.63 62 accuracy 0.78 228 macro avg 0.72 0.75 0.73 228 weighted avg 0.79 0.78 0.78 228
Precision: From above cell,
Recall: From above cell,
F1-score: From above cell,
Support: From above cell,
Accuracy: From above cell,
Macro Avg and Weighted Avg: These are the averages of precision, recall, and F1-score across classes, weighted or unweighted by class support, respectively.
#SVM
svm_class=svm.SVC(kernel='linear')
# Train the model on the training data
svm_class.fit(x_train,y_train)
# Make predictions on the test data
svm_class_pred=svm_class.predict(x_test)
# Evaluate the model's performance
svm_accuracy=accuracy_score(svm_class_pred,y_test)*100
print(f"Accuracy: {svm_accuracy:.2f}%")
report = classification_report(svm_class_pred,y_test)
# Print the classification report
print("Classification Report:")
print(report)
Accuracy: 78.51% Classification Report: precision recall f1-score support 0 0.88 0.81 0.85 166 1 0.59 0.71 0.64 62 accuracy 0.79 228 macro avg 0.73 0.76 0.74 228 weighted avg 0.80 0.79 0.79 228
Precision: From above cell,
Recall (Sensitivity or True Positive Rate): From above cell,
F1-score: From above cell,
Support: From above cell,
Accuracy: From above cell,
Macro Avg and Weighted Avg: These are the averages of precision, recall, and F1-score across classes, weighted or unweighted by class support, respectively.
# k-Nearest Neighbors (k-NN) Classifier
k_neighbors = 5
knn_model = KNeighborsClassifier(n_neighbors=k_neighbors)
knn_model.fit(x_train, y_train)
# Make predictions on the test data
knn_pred = knn_model.predict(x_test)
knn_accuracy = knn_model.score(x_test, y_test)*100
print(f'Accuracy: {knn_accuracy:.2f}%')
report = classification_report(knn_pred,y_test)
# Print the classification report
print("Classification Report:")
print(report)
Accuracy: 77.19% Classification Report: precision recall f1-score support 0 0.84 0.83 0.83 155 1 0.64 0.66 0.65 73 accuracy 0.77 228 macro avg 0.74 0.74 0.74 228 weighted avg 0.77 0.77 0.77 228
Precision: From above cell,
Recall (Sensitivity or True Positive Rate): From above cell,
F1-score: From above cell,
Support: From above cell,
Accuracy: From above cell,
Macro Avg and Weighted Avg: These are the averages of precision, recall, and F1-score across classes, weighted or unweighted by class support, respectively.
sns.heatmap(confusion_matrix(y_test, dtc_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for Decision Tree')
plt.show()
The diagonal boxes in the confusion matrix represent the number of correct predictions (true positives) made for each class. The label at the top of the diagonal represents the predicted class, while the label on the left side of the diagonal represents the actual class.
The off-diagonal boxes in the confusion matrix show the number of incorrect predictions (false positives). These are instances where the model predicted a sample to belong to a particular class, but it actually belonged to a different class.
ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(dtc_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value Decision Tree')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()
Above distribution plot clearly visualizes the accuracy of the model. The red color represents the actual values and the green color represents the predicted values. The more the overlapping of the two colors, the more accurate of the model is.
print('Accuracy Score: ',accuracy_score(y_test,dtc_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,dtc_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,dtc_pred))
print('R2 Score: ',r2_score(y_test,dtc_pred))
Accuracy Score: 0.6885964912280702 Mean Absolute Error: 0.31140350877192985 Mean Squared Error: 0.31140350877192985 R2 Score: -0.410718954248366
sns.heatmap(confusion_matrix(y_test, nv_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for Naive Bayes(GaussianNB)')
plt.show()
ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(nv_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value Naive Bayes')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()
In above graph we see that the colors are more overlapping it means that Gaussian naive bayes is good fit for data.
print('Accuracy Score: ',accuracy_score(y_test,nv_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,nv_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,nv_pred))
print('R2 Score: ',r2_score(y_test,nv_pred))
Accuracy Score: 0.7456140350877193 Mean Absolute Error: 0.2543859649122807 Mean Squared Error: 0.2543859649122807 R2 Score: -0.15241830065359463
sns.heatmap(confusion_matrix(y_test, lr_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for Logistic Regression')
plt.show()
ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(lr_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value Logistic Regression')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()
print('Accuracy Score: ',accuracy_score(y_test,lr_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,lr_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,lr_pred))
print('R2 Score: ',r2_score(y_test,lr_pred))
Accuracy Score: 0.7587719298245614 Mean Absolute Error: 0.2412280701754386 Mean Squared Error: 0.2412280701754386 R2 Score: -0.09281045751633976
sns.heatmap(confusion_matrix(y_test, rf_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for Random forest')
plt.show()
ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(rf_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value Random Forest Classifier')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()
print('Accuracy Score: ',accuracy_score(y_test,rf_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,rf_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,rf_pred))
print('R2 Score: ',r2_score(y_test,rf_pred))
Accuracy Score: 0.7763157894736842 Mean Absolute Error: 0.2236842105263158 Mean Squared Error: 0.2236842105263158 R2 Score: -0.013333333333333197
sns.heatmap(confusion_matrix(y_test, svm_class_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for Support Vector Machine')
plt.show()
ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(svm_class_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value Support Vector Machine')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()
print('Accuracy Score: ',accuracy_score(y_test,svm_class_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,svm_class_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,svm_class_pred))
print('R2 Score: ',r2_score(y_test,svm_class_pred))
Accuracy Score: 0.7850877192982456 Mean Absolute Error: 0.2149122807017544 Mean Squared Error: 0.2149122807017544 R2 Score: 0.026405228758169974
sns.heatmap(confusion_matrix(y_test, knn_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for K nearest neighbors')
plt.show()
ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(knn_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value K nearest neighbors')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()
print('Accuracy Score: ',accuracy_score(y_test,knn_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,knn_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,knn_pred))
print('R2 Score: ',r2_score(y_test,knn_pred))
Accuracy Score: 0.7719298245614035 Mean Absolute Error: 0.22807017543859648 Mean Squared Error: 0.22807017543859648 R2 Score: -0.033202614379084894
l = [dtc_accuracy, lr_accuracy, nv_accuracy, knn_accuracy, rf_accuracy, svm_accuracy]
m = ['Decision Tree', 'Logistic Regression', 'Naive Bayes', 'KNN', 'Random Forest', 'Support Vector Machine']
plt.bar(m, l, color=['red', 'green', 'blue', 'cyan', 'magenta', 'yellow'])
# Adding bar labels to the bars
for index, value in enumerate(l):
plt.text(index, value, str(round(value, 2)), ha='center', va='bottom')
plt.xticks(rotation=90)
plt.title("Accuracies of Comparison of Models")
plt.show()
Based on the above analysis and above graph-
Office:- 660, Sector 14A, Vasundhara, Ghaziabad, Uttar Pradesh - 201012, India