Pima Indian Diabetes Classification with EDA

Pima Indian diabetes classification

Table of Contents¶

Data Preprocessing-
- 1.1 Importing libraries
- 1.2 Data Preprocessing
Exploratory Data Analysis-
- 2.1 Univariate Analysis
- 2.2 Bivariate Analysis
Classification-
- 3.1 Splitting the dataset
- 3.2 Decision Tree Classifier
- 3.3 Naive Bayes(Gaussian NB)
- 3.4 Logistic Regression
- 3.5 Random Forest Classifer
- 3.6 Support Vector Machine
- 3.7 K-nearest Neighbors Classifer
Model Evaluation-
- 4.1 Decision Tree
- 4.2 Naive Bayes(Gaussian NB)
- 4.3 Logistic Regressio
- 4.4 Random Forest
- 4.5 Support Vector Machine
- 4.6 K-nearest Neighbors
- 4.7 Comparision of Accuracies
Conclusion

About the Dataset¶

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The datasets consists of several medical predictor variables and one target variable, Outcome. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Dataset Link - https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database¶

What is diabetes ?¶

Acccording to NIH, "Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. Sometimes your body doesn’t make enough—or any—insulin or doesn’t use insulin well. Glucose then stays in your blood and doesn’t reach your cells.

Over time, having too much glucose in your blood can cause health problems. Although diabetes has no cure, you can take steps to manage your diabetes and stay healthy.

Sometimes people call diabetes “a touch of sugar” or “borderline diabetes.” These terms suggest that someone doesn’t really have diabetes or has a less serious case, but every case of diabetes is serious.

What are the different types of diabetes? The most common types of diabetes are type 1, type 2, and gestational diabetes.

Type 1 diabetes If you have type 1 diabetes, your body does not make insulin. Your immune system attacks and destroys the cells in your pancreas that make insulin. Type 1 diabetes is usually diagnosed in children and young adults, although it can appear at any age. People with type 1 diabetes need to take insulin every day to stay alive.
Type 2 diabetes If you have type 2 diabetes, your body does not make or use insulin well. You can develop type 2 diabetes at any age, even during childhood. However, this type of diabetes occurs most often in middle-aged and older people. Type 2 is the most common type of diabetes.
Gestational diabetes Gestational diabetes develops in some women when they are pregnant. Most of the time, this type of diabetes goes away after the baby is born. However, if you’ve had gestational diabetes, you have a greater chance of developing type 2 diabetes later in life. Sometimes diabetes diagnosed during pregnancy is actually type 2 diabetes.

Other types of diabetes Less common types include monogenic diabetes, which is an inherited form of diabetes, and cystic fibrosis-related diabetes ."

Data Dictionary¶

Pregnancies - Number of times pregnant
Glucose - Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure - Diastolic blood pressure (mm Hg)
SkinThickness - Triceps skin fold thickness (mm)
Insulin - 2-Hour serum insulin (mu U/ml)
BMI - Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction - Diabetes pedigree function
Age - Age (years)
Outcome - it consists of data whether person is diabetec(1) or healthy(0) `

GLUCOSE - glucose is a type of sugar that gives our bodies the energy to function. In diabetes, the glucose level in the blood becomes too high, and it needs to be managed carefully to avoid health complications.

Blood Pressure - Having high blood pressure when you already have diabetes can worsen the diabetes-related complications. High blood pressure puts extra stress on the heart and blood vessels, increasing the risk of heart disease, stroke, kidney problems, and eye issues.

Skin Thickness - Skin thickness refers to the measurement of the thickness of the skin, which is the outer covering of the human body. The skin is the largest organ in the body and plays a vital role in protecting the underlying tissues, regulating body temperature, and providing sensory information to the brain.

Insulin - Insulin is a hormone produced by a gland in the body called the pancreas. When we eat food, especially carbohydrates, our body breaks down the food into glucose, which is a type of sugar. Glucose is the primary source of energy for our cells, including those in our muscles and organs.

DiabetesPedigreeFunction - the Diabetes Pedigree Function is a tool used to estimate the genetic risk of type 2 diabetes based on the family history of the disease. It is not a direct measure of diabetes management or treatment but can be helpful in identifying individuals at higher risk, prompting early preventive measures and health monitoring.

Who is Pima Indians ?¶

"The Pima (or Akimel O'odham, also spelled Akimel O'otham, "River People", formerly known as Pima) are a group of Native Americans living in an area consisting of what is now central and southern Arizona. The majority population of the surviving two bands of the Akimel O'odham are based in two reservations: the Keli Akimel O'otham on the Gila River Indian Community (GRIC) and the On'k Akimel O'odham on the Salt River Pima-Maricopa Indian Community (SRPMIC)." Wikipedia

1.1 Importing required libraries¶

In [1]:

#data Manipulation
import pandas as pd

#Mathematical operation
import numpy as np

#data Visualization
import matplotlib.pyplot as plt
import seaborn as sns


#Remove Warnings
import warnings
warnings.filterwarnings('ignore')

#Ml Algoriothm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score,classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.metrics import accuracy_score,mean_absolute_error,mean_squared_error,r2_score,confusion_matrix

1.2 Data Preprocessing¶

In the data preprocessing phase we first examined the shape of the data to understand its dimension
Next, We checked for null values in the dataset and removed if any were found
Additionally, we performed a check for duplicate values and removed them to ensure data integrity
To gain insights into relationships between different variables we performed univariate,bivariate and multivariate analysis and then visualized the correlation map using a heatmap. This visualization allowed us to identify the patterns and dependencies among the features in the dataset.

In [2]:

#Load the Dataset
data = pd.read_csv(r"C:\Users\Lenovo\Documents\jupyter\DataSets\diabetes.csv")
data.sample(10)

# Creating copy of actual dataframe
df=data.copy()

In [3]:

#check for shape
data.shape

Out[3]:

(768, 9)

From above cell we see that there are 768 observation and 9 features in our data

In [4]:

#check for columns
data.columns

Out[4]:

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [5]:

# check info of each column
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

From above cell we see that there are two column contain float values and 7 column contain integer value also there are no missing values in our data

In [6]:

#check for duplicate value
data.duplicated().sum()

Out[6]:

From above cell we see that there are no duplicate values in our data

In [7]:

# summary statistics of numerical columns
data.describe()

Out[7]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

2.1 Univariate Analysis¶

univariate analysis is the first step in statistical analysis, providing a foundation for understanding the properties of individual variables before moving on to more complex analyses involving multiple variables.

Pregnancies¶

In [8]:

ax=sns.countplot(x='Pregnancies',data=data,palette='magma')
for i in ax.containers:
    ax.bar_label(i)
plt.title("Pregnancies")
plt.show()

Glucose¶

In [9]:

sns.displot(x='Glucose',data=data,kind='hist',kde=True,palette='Set1')
plt.title('Distribution of Glucose')
plt.show()

From above chart we see that there are some people whose glucose level is 0 mg/dl that is not possible.

so we can say that we have incorrect information of these people.

now let's remove this incorrect data

In [10]:

#Let's check incorrect data 
data[data['Glucose']==0]

Out[10]:

	Pregnancies	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
75	1	48	20	0	24.7	0.140	22	0
182	1	74	20	23	27.7	0.299	21	0
342	1	68	35	0	32.0	0.389	22	0
349	5	80	32	0	41.0	0.346	37	1
502	6	68	41	0	39.0	0.727	41	1

From above data we can see that above data is incorrect beacause these people's have insulin level is also 0 mu that is also not possible, so let's drop these rows

In [11]:

x=data[data['Glucose']==0].index.to_list()
data.drop(index=x,inplace=True)

In [12]:

fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))

sns.histplot(df['Glucose'],kde=True,ax=axes[0])
sns.histplot(data['Glucose'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of Glucose before replacing')
axes[1].set_title('Distribution of Glucose after replacing')

plt.show()

Blood Pressure¶

In [13]:

sns.displot(x='BloodPressure',data=data,kind='hist',kde=True)
plt.title('Distribution of Blood Pressure')
plt.show()

From above distribution plot we see that there are a few people whose blood pressure is 0 mm(hg) that is also not possible

In medical practice, a blood pressure reading of 0 mm(hg) would typically indicate that the blood pressure measurement was not taken correctly or that there is a significant error in the measurement process. In real-life situations, blood pressure levels below normal (hypotension) can occur, but they would not reach zero. Severe hypotension can lead to fainting, dizziness, and other health problems.
so we can say that there are also incorrect data in our BloodPressure column.
now we have two option either we can remove this data or we can replace this incorrect data
Firstly we have to check that incorrect data

In [14]:

#check incorrect data
print("Shape of incorrect data:", data[data['BloodPressure']==0].shape)
data[data['BloodPressure']==0]

Shape of incorrect data: (35, 9)

Out[14]:

	Pregnancies	Glucose	SkinThickness	BMI	DiabetesPedigreeFunction	Age	Outcome
7	10	115	0	35.3	0.134	29	0
15	7	100	0	30.0	0.484	32	1
49	7	105	0	0.0	0.305	24	0
60	2	84	0	0.0	0.304	21	0
78	0	131	0	43.2	0.270	26	1
81	2	74	0	0.0	0.102	22	0
172	2	87	23	28.9	0.773	25	0
193	11	135	0	52.3	0.578	40	1
222	7	119	0	25.2	0.209	37	0
261	3	141	0	30.0	0.761	27	1
266	0	138	0	36.3	0.933	25	1
269	2	146	0	27.5	0.240	28	1
300	0	167	0	32.3	0.839	30	1
332	1	180	0	43.3	0.282	41	1
336	0	117	0	33.8	0.932	44	0
347	3	116	0	23.5	0.187	23	0
357	13	129	30	39.9	0.569	44	1
426	0	94	0	0.0	0.256	25	0
430	2	99	0	22.2	0.108	23	0
435	0	141	0	42.4	0.205	29	1
453	2	119	0	19.6	0.832	72	0
468	8	120	0	30.0	0.183	38	1
484	0	145	0	44.2	0.630	31	1
494	3	80	0	0.0	0.174	22	0
522	6	114	0	0.0	0.189	26	0
533	6	91	0	29.8	0.501	31	0
535	4	132	0	32.9	0.302	23	1
589	0	73	0	21.1	0.342	25	0
601	6	96	0	23.7	0.190	28	0
604	4	183	0	28.4	0.212	36	1
619	0	119	0	32.4	0.141	24	1
643	4	90	0	28.0	0.610	31	0
697	0	99	0	25.0	0.253	22	0
703	2	129	0	38.5	0.304	41	0
706	10	115	0	0.0	0.261	30	1

From above we see that there are 35 observations with incorrect data in Glucose column

Let's replace this data with mean of BloodPressure column
Why i choose mean to replace 0:
- Because we have limited data and removing observations with incorrect data might lead to a loss of useful information
- because the slope is high and data's distribution is normal
- generally the blood Pressure of people is around 70 - 80 and mean of the data is also in between 70 - 80 so we replace with mean instead of median.

In [15]:

data['BloodPressure'].replace(0,np.nan,inplace=True)

data['BloodPressure'].fillna(data['BloodPressure'].mean(skipna=True),inplace=True)

In [16]:

fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))

sns.histplot(df['BloodPressure'],kde=True,ax=axes[0])
sns.histplot(data['BloodPressure'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of Blood Pressure before replacing')
axes[1].set_title('Distribution of Blood Pressure after replacing')

plt.show()

skin thickness¶

In [17]:

sns.displot(x='SkinThickness',data=data,kind='hist',kde=True)
plt.title('Distribution of Skin Thickness')
plt.show()

From above displot we see that there are two people who does not follow the trend so we can consider them as a outlier Let's remove the outlier

In [18]:

data[data['SkinThickness']>60]

Out[18]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
445	0	180	78.0	63	14	59.4	2.420	25	1
579	2	197	70.0	99	0	34.7	0.575	62	1

In [19]:

# drop outliers
data.drop(index=[445,579],inplace=True)

In [20]:

sns.displot(x='SkinThickness',data=data,kind='hist',kde=True)
plt.title('Distribution of Skin Thickness')
plt.show()

From above distribution chart we see that there are also might be incorrect data in skin thickness column

because there are also several people whose skin thickness is 0 mm that is not possible generally
Let's check that incorrect data

In [21]:

data[data['SkinThickness']==0]

Out[21]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
2	8	183	64.000000	0	0	23.3	0.672	32	1
5	5	116	74.000000	0	0	25.6	0.201	30	0
7	10	115	72.438187	0	0	35.3	0.134	29	0
9	8	125	96.000000	0	0	0.0	0.232	54	1
10	4	110	92.000000	0	0	37.6	0.191	30	0
...	...	...	...	...	...	...	...	...	...
757	0	123	72.000000	0	0	36.3	0.258	52	1
758	1	106	76.000000	0	0	37.5	0.197	26	0
759	6	190	92.000000	0	0	35.5	0.278	66	1
762	9	89	62.000000	0	0	22.5	0.142	33	0
766	1	126	60.000000	0	0	30.1	0.349	47	1

227 rows × 9 columns

From above data we can see that there are 227 observation with incorrect values in Skin Thickness column

We have less data so we shouldn't remove these observsations so we have to replace it.
how to replace?
- the variance of data is high and slope is low and the spread of the data is also high so instead of taking mean and median.
- let's check the IQR range of skin thickness

In [22]:

data[data['SkinThickness']!=0]['SkinThickness'].describe()

Out[22]:

count    534.000000
mean      28.955056
std        9.960421
min        7.000000
25%       22.000000
50%       29.000000
75%       36.000000
max       60.000000
Name: SkinThickness, dtype: float64

From above cell we see that the IQR range of Skin Thickness is 22-36 So I am replacing the 0 values with any random value in range 22-36

In [23]:

def change(x):
    if x==0:
        return np.random.randint(22,36)
    else:
        return x

In [24]:

data['SkinThickness']=data['SkinThickness'].apply(change)

In [25]:

fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))

sns.histplot(df['SkinThickness'],kde=True,ax=axes[0])
sns.histplot(data['SkinThickness'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of Skin Thickness before replacing')
axes[1].set_title('Distribution of Skin Thickness after replaing')

plt.show()

Insulin¶

In [26]:

sns.displot(x='Insulin',data=data,kind='hist',kde=True)
plt.title('Distribution of Insulin')
plt.show()

From above displot we see that there are some people who does not follow the trend so we can consider them as a outlier Let's remove the outlier

In [27]:

#checking the outlier
data[data['Insulin']>600]

Out[27]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
13	1	189	60.0	23	846	30.1	0.398	59	1
228	4	197	70.0	39	744	36.7	2.329	31	0
247	0	165	90.0	33	680	52.3	0.427	23	0

In [28]:

#REMOVE THE OUTLIER
data.drop(index=[13,228,247],inplace=True)

In [29]:

sns.displot(x='Insulin',data=data,kind='hist',kde=True)
plt.title('Distribution of Insulin')
plt.show()

From above distribution chart we see that there are also might be incorrect data in Insulin column

because there are also several people whose insulin level is 0 mu that is not possible for all people it is possible only for diabetic people
Let's check that incorrect data

In [30]:

# check incorrect data
data[(data['Insulin']==0)&(data['Outcome']==0)]

Out[30]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
1	1	85	66.000000	29	0	26.6	0.351	31	0
5	5	116	74.000000	23	0	25.6	0.201	30	0
7	10	115	72.438187	24	0	35.3	0.134	29	0
10	4	110	92.000000	35	0	37.6	0.191	30	0
12	10	139	80.000000	27	0	27.1	1.441	57	0
...	...	...	...	...	...	...	...	...	...
756	7	137	90.000000	41	0	32.0	0.391	39	0
758	1	106	76.000000	23	0	37.5	0.197	26	0
762	9	89	62.000000	29	0	22.5	0.142	33	0
764	2	122	70.000000	27	0	36.8	0.340	27	0
767	1	93	70.000000	31	0	30.4	0.315	23	0

234 rows × 9 columns

From above data we can see that there are 234 observation with incorrect values in Insulin column because these people is not diabetic so insulin level of these people is not possible to be 0 mu.

Now let's replace these values with median
- Because data is right skewed
- Median is more robust to outliers

In [31]:

data[(data['Insulin']==0)&(data['Outcome']==0)]['Insulin'].replace(0,data['Insulin'].median(),inplace=True)

In [32]:

fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))

sns.histplot(df['Insulin'],kde=True,ax=axes[0])
sns.histplot(data['Insulin'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of Insulin before replacing')
axes[1].set_title('Distribution of Insulin after replacing')

plt.show()

t2TAQwEEJRkLR4QMoZOR1nI5oTC+nyi10FdMwbhrtmptZatfeuc051989TAwBf2fWQ6Xo8kQAAAEAkl3YAAAAIJdoBAAAglGgHAACAUKIdAAAAQol2AAAACCXaAQAAIJRoBwAAgFCiHQAAAEKJdgAAAAgl2gEAACCUaAcAAIBQL+nNAJf8iDscAAAAAElFTkSuQmCC

BMI¶

In [33]:

sns.displot(x='BMI',data=data,kind='hist',kde=True)
plt.title('Distribution of BMI')
plt.show()

From above distribution chart we see that there are also might be incorrect data in BMI column

because there are also several people whose BMI is 0 that is not possible
Let's check that incorrect data

In [34]:

data[data['BMI']==0]

Out[34]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	DiabetesPedigreeFunction	Age	Outcome
9	8	125	96.000000	27	0	0.232	54	1
49	7	105	72.438187	22	0	0.305	24	0
60	2	84	72.438187	22	0	0.304	21	0
81	2	74	72.438187	31	0	0.102	22	0
145	0	102	75.000000	23	0	0.572	21	0
371	0	118	64.000000	23	89	1.731	21	0
426	0	94	72.438187	31	0	0.256	25	0
494	3	80	72.438187	34	0	0.174	22	0
522	6	114	72.438187	29	0	0.189	26	0
684	5	136	82.000000	24	0	0.640	69	0
706	10	115	72.438187	25	0	0.261	30	1

From above data we can see that there are 11 observation with incorrect values in BMI column

We shouldn't remove these observation because we have less data
We have to replace these values with mean
- the slope of the data is high and spread of data is low so we choose mean instead of taking median

In [35]:

#replace values
data['BMI'].replace(0,data['BMI'].mean(),inplace=True)

In [36]:

fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))

sns.histplot(df['BMI'],kde=True,ax=axes[0])
sns.histplot(data['BMI'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of BMI before replacing')
axes[1].set_title('Distribution of BMI after replacing')

plt.show()

Diabetes pedigree function¶

In [37]:

sns.displot(x='DiabetesPedigreeFunction',data=data,kind='hist',kde=True)
plt.title('Percentage degree function of diabetes')
plt.show()

Age¶

In [38]:

sns.displot(x='Age',data=data,kind='hist',kde=True)
plt.title('Distribution of Age')
plt.show()

Outcome¶

In [39]:

data.Outcome.value_counts()

Out[39]:

0    495
1    263
Name: Outcome, dtype: int64

In [40]:

plt.pie(data.Outcome.value_counts(),labels=['Healthy','Diabetic'],autopct='%.2f%%')
plt.title('Outcome')
plt.show()

From above pie chart we see that there are 65.30% persons are Healthy and 34.70% persons are Diabetic

2.2 Bivariate Analysis¶

bivariate analysis is a fundamental tool for understanding the relationship between two variables, providing insights into how they are connected and how they might impact each other.

In [41]:

data.columns

Out[41]:

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

Age Vs Outcome¶

In [42]:

sns.catplot(x="Outcome", y="Age", kind="swarm", data=data)
plt.title("Age Vs Outcome")
plt.show()

From above graph, it is clear that most of the patients are in the age group of 20-30 years. Patients in the age range 40-55 years are more likely to be diabetic, as compared to other age groups.

Glucose Vs Outcome¶

In [43]:

sns.boxplot(x='Outcome', y='Glucose', data=data).set_title('Glucose vs Diabetes')
plt.show()

From above boxplot we see that the people with 120 mg/dl or more than 120 mg/dl glucose level is more likely to be diabetic and people with around 100 mg/dl or less than 100 mg/dl glucose level is possible to be non diabetic

Blood Pressure Vs Outcome¶

In [44]:

sns.violinplot(x='Outcome', y='BloodPressure', data=data, ).set_title('Blood Pressure and Diabetes')
plt.show()

+OKdPnyYsLOyOji2EKB6KWlK9BYUQQgghnIC07AghhBDCrUnYEUIIIYRbk7AjhBBCCLcmYUcIIYQQbk3CjhBCCCHcmoQdIYQQQrg1CTtCCCGEcGsSdoQQQgjh1iTsCCGEEMKtSdgRQgghhFuTsCOEEEIItyZhRwghhBBu7f8Bym1RYUoSzI8AAAAASUVORK5CYII=

From above voilinplot we can say that the distribution of the blood pressure for the diabetic patients is slightly higher than the non-diabetic patients.

Insulin Vs Outcome¶

In [45]:

sns.boxplot(x='Outcome',y='Insulin',data=data).set_title('Insulin vs Diabetes')
plt.show()

Here the boxplot shows the distribution of insulin level in patients. In non diabetic patients the insulin level is near to 100, whereas in diabetic patients the insulin level is near to 200. we can say that if one person has insulin level high then thier is a possibility person is diabetec

BMI Vs Outcome¶

In [46]:

sns.violinplot(x='Outcome',y='BMI',data=data)
plt.title("BMI vs Diabetes")
plt.show()

From above violinplot reveals the BMI distribution, where the non dibetic patients have a increased spread from 25 to 35 with narrows after 35. However in diabetic patients there is increased spread at 35 and increased spread 45-50 as compared to non diabetic patients.Therefore BMI is a good predictor of diabetes and obese people are more likely to be diabetic.

Pregnancies Vs Outcome¶

In [47]:

sns.countplot(x=data['Pregnancies'],hue=data['Outcome'])
plt.title("Pregnancies Vs Diabetes")
plt.xticks(rotation=90)
plt.show()

From above countplot we see that female with more pregnancies more likely to be diabetec

Reason why female with more pregnancies more likely to be diabetic-

Hormonal changes during pregnancy can lead to insulin resistance, where the body's cells become less responsive to insulin. This condition can result in high blood sugar levels, which may pose risks to both the mother and the baby.

Correlation matrix¶

In [48]:

# Visualize the correlation map
plt.figure(figsize=(10,5))
sns.heatmap(data.corr(),annot=True,cmap='RdYlBu',fmt='.2f',
    annot_kws=None,
    linewidths=1)
plt.title("Understand the correlation with each columns")
plt.show()

3. ML Modelling¶

3.1 Splitting the dataset¶

In [49]:

data.columns

Out[49]:

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [50]:

x=data[['Pregnancies', 'Glucose','BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']]
x

Out[50]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age
0	6	148	72.0	35	0	33.6	0.627	50
1	1	85	66.0	29	0	26.6	0.351	31
2	8	183	64.0	23	0	23.3	0.672	32
3	1	89	66.0	23	94	28.1	0.167	21
4	0	137	40.0	35	168	43.1	2.288	33
...	...	...	...	...	...	...	...	...
763	10	101	76.0	48	180	32.9	0.171	63
764	2	122	70.0	27	0	36.8	0.340	27
765	5	121	72.0	23	112	26.2	0.245	30
766	1	126	60.0	33	0	30.1	0.349	47
767	1	93	70.0	31	0	30.4	0.315	23

758 rows × 8 columns

In above cell I created a new dataframe that contains highly correlated feature to target variable (Outcome).

In [51]:

y=data[['Outcome']]
y

Out[51]:

	Outcome
0	1
1	0
2	1
3	0
4	1
...	...
763	0
764	0
765	0
766	1
767	0

758 rows × 1 columns

In above cell I separate target variable Outcome with other features

In [52]:

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)

In Above cell-

The code splits the dataset into training and testing sets using the train_test_split() function from sklearn.
The training set is used to train the models, and the testing set is used to evaluate their performance.
Take test_size ratio of 75:25 it means we give 75% data to training and 30% for testing
random_state = 1 (it means that it takes one observations randomly.)

3.2 Decision Tree Classifier¶

In [53]:

#Decision Tree Classifier
dtc=DecisionTreeClassifier(random_state=1)

# Train the model on the training data
dtc.fit(x_train,y_train)

# Make predictions on the test data
dtc_pred=dtc.predict(x_test)

# Evaluate the model's performance
dtc_accuracy=accuracy_score(dtc_pred,y_test)*100
print(f"Accuracy: {dtc_accuracy:.2f}%")

report = classification_report(dtc_pred, y_test)

# Print the classification report
print("Classification Report:")
print(report)
        

Accuracy: 68.86%
Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.76      0.77       156
           1       0.51      0.53      0.52        72

    accuracy                           0.69       228
   macro avg       0.64      0.65      0.64       228
weighted avg       0.69      0.69      0.69       228

Precision: From above cell precision represent that,
- for class 0 (non-diabetic), the precision is 0.79, which means 79% of the samples predicted as non-diabetic are correct.
- For class 1 (diabetic), the precision is 0.47, which means 47% of the samples predicted as diabetic are correct.
Recall: From above cell represent that,
- for class 0 (non-diabetic), the recall is 0.75, which means the model correctly identifies 75% of the actual non-diabetic samples.
- For class 1 (diabetic), the recall is 0.52, which means the model correctly identifies 52% of the actual diabetic samples.
f1-score:The F1-score is the harmonic mean of precision and recall and provides a balanced measure of the classifier's performance. It takes into account both false positives and false negatives. A higher F1-score indicates better performance.
- From above cell f1-score represent that, for class 0 (non-diabetic) is 0.77, and for class 1 (diabetic) is 0.49.
support: The support is the number of occurrences of each class in the actual data.
- In this case, there are 161 samples of class 0 (non-diabetic) and 67 samples of class 1 (diabetic).
Accuracy: Accuracy is the overall correct predictions divided by the total number of samples.
- In this case, the accuracy is 0.68, which means the model correctly classifies 68% of the total samples.

3.3 Naive Baiyes (GaussianNB)¶

In [54]:

#Naive Bayes
nv=GaussianNB()

# Train the model on the training data
nv.fit(x_train,y_train)

# Make predictions on the test data
nv_pred=nv.predict(x_test)

# Evaluate the model's performance
nv_accuracy=accuracy_score(nv_pred,y_test)*100
print(f"Accuracy: {nv_accuracy:.2f}%")

report = classification_report(nv_pred,y_test)

# Print the classification report
print("Classification Report:")
print(report)

Accuracy: 74.56%
Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.81      0.81       153
           1       0.61      0.61      0.61        75

    accuracy                           0.75       228
   macro avg       0.71      0.71      0.71       228
weighted avg       0.75      0.75      0.75       228

Precision: From above cell,
- The precision for class 0 (non-diabetic) is 0.81, which means 81% of the samples predicted as non-diabetic are correct. For class 1 (diabetic), the precision is 0.61, indicating that 61% of the samples predicted as diabetic are correct.
Recall: From above cell,
- The recall for class 0 (non-diabetic) is 0.81, meaning the model correctly identifies 81% of the actual non-diabetic samples. For class 1 (diabetic), the recall is 0.61, which means the model correctly identifies 61% of the actual diabetic samples.
F1-score: From above cell,
- The F1-score for class 0 (non-diabetic) is 0.81, and for class 1 (diabetic) is 0.61. The F1-score takes into account both precision and recall and provides a balanced measure of the classifier's performance.
Support: From above cell,
- The support is the number of occurrences of each class in the actual data. In this case, there are 153 samples of class 0 (non-diabetic) and 75 samples of class 1 (diabetic).
Accuracy: From above cell,
- The overall accuracy of the Gaussian NB model is 74.56%, which means the model correctly classifies 74.56% of the total samples.
Macro Avg and Weighted Avg: These are the averages of precision, recall, and F1-score across classes, weighted or unweighted by class support, respectively.

3.4 Logistic Regression Classifier¶

In [55]:

#Logistic Regression
model = LogisticRegression()

# Train the model on the training data
model.fit(x_train, y_train)

# Make predictions on the test data
lr_pred = model.predict(x_test)

# Evaluate the model's performance
lr_accuracy = accuracy_score(y_test, lr_pred)*100
print(f"Accuracy: {lr_accuracy:.2f}%")

report = classification_report(lr_pred,y_test)

# Print the classification report
print("Classification Report:")
print(report)

Accuracy: 75.88%
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.78      0.83       172
           1       0.51      0.68      0.58        56

    accuracy                           0.76       228
   macro avg       0.69      0.73      0.71       228
weighted avg       0.79      0.76      0.77       228

Precision: From above cell,
- The precision for class 0 (non-diabetic) is 0.90, which means 90% of the samples predicted as non-diabetic are correct. For class 1 (diabetic), the precision is 0.48, indicating that 48% of the samples predicted as diabetic are correct.
Recall: From above cell,
- The recall for class 0 (non-diabetic) is 0.78, meaning the model correctly identifies 78% of the actual non-diabetic samples. For class 1 (diabetic), the recall is 0.69, which means the model correctly identifies 69% of the actual diabetic samples.
F1-score: From above cell,
- The F1-score for class 0 (non-diabetic) is 0.83, and for class 1 (diabetic) is 0.57. The F1-score takes into account both precision and recall and provides a balanced measure of the classifier's performance.
Support: From above cell,
- The support is the number of occurrences of each class in the actual data. In this case, there are 176 samples of class 0 (non-diabetic) and 52 samples of class 1 (diabetic).
Accuracy: From above cell,
- The overall accuracy of the Logistic Regression model is 75.88%, which means the model correctly classifies 75.88% of the total samples.
Macro Avg and Weighted Avg: These are the averages of precision, recall, and F1-score across classes, weighted or unweighted by class support, respectively.

3.5 Random Forest Classifer¶

In [56]:

# Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(x_train, y_train)

rf_accuracy = rf_model.score(x_test, y_test)*100
print(f'Random Forest Accuracy: {rf_accuracy:.2f}%')

# Make predictions on the test data
rf_pred = rf_model.predict(x_test)


report = classification_report(rf_pred,y_test)

# Print the classification report
print("Classification Report:")
print(report)

Random Forest Accuracy: 77.63%
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.81      0.84       166
           1       0.57      0.69      0.63        62

    accuracy                           0.78       228
   macro avg       0.72      0.75      0.73       228
weighted avg       0.79      0.78      0.78       228

Precision: From above cell,
- The precision for class 0 (non-diabetic) is 0.84, which means 84% of the samples predicted as non-diabetic are correct. For class 1 (diabetic), the precision is 0.60, indicating that 60% of the samples predicted as diabetic are correct.
Recall: From above cell,
- The recall for class 0 (non-diabetic) is 0.81, meaning the model correctly identifies 81% of the actual non-diabetic samples. For class 1 (diabetic), the recall is 0.65, which means the model correctly identifies 65% of the actual diabetic samples.
F1-score: From above cell,
- The F1-score for class 0 (non-diabetic) is 0.83, and for class 1 (diabetic) is 0.63. The F1-score takes into account both precision and recall and provides a balanced measure of the classifier's performance.
Support: From above cell,
- The support is the number of occurrences of each class in the actual data. In this case, there are 159 samples of class 0 (non-diabetic) and 69 samples of class 1 (diabetic).
Accuracy: From above cell,
- The overall accuracy of the Random Forest model is 76%, which means the model correctly classifies 76% of the total samples.
Macro Avg and Weighted Avg: These are the averages of precision, recall, and F1-score across classes, weighted or unweighted by class support, respectively.

3.6 Support Vector Machine¶

In [57]:

#SVM
svm_class=svm.SVC(kernel='linear')


# Train the model on the training data
svm_class.fit(x_train,y_train)


# Make predictions on the test data
svm_class_pred=svm_class.predict(x_test)

# Evaluate the model's performance
svm_accuracy=accuracy_score(svm_class_pred,y_test)*100
print(f"Accuracy: {svm_accuracy:.2f}%")

report = classification_report(svm_class_pred,y_test)

# Print the classification report
print("Classification Report:")
print(report)

Accuracy: 78.51%
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.81      0.85       166
           1       0.59      0.71      0.64        62

    accuracy                           0.79       228
   macro avg       0.73      0.76      0.74       228
weighted avg       0.80      0.79      0.79       228

Precision: From above cell,
- The precision for class 0 (non-diabetic) is 0.88, which means 88% of the samples predicted as non-diabetic are correct. For class 1 (diabetic), the precision is 0.59, indicating that 59% of the samples predicted as diabetic are correct.
Recall (Sensitivity or True Positive Rate): From above cell,
- The recall for class 0 (non-diabetic) is 0.81, meaning the model correctly identifies 81% of the actual non-diabetic samples. For class 1 (diabetic), the recall is 0.71, which means the model correctly identifies 71% of the actual diabetic samples.
F1-score: From above cell,
- The F1-score for class 0 (non-diabetic) is 0.85, and for class 1 (diabetic) is 0.64. The F1-score takes into account both precision and recall and provides a balanced measure of the classifier's performance.
Support: From above cell,
- The support is the number of occurrences of each class in the actual data. In this case, there are 166 samples of class 0 (non-diabetic) and 62 samples of class 1 (diabetic).
Accuracy: From above cell,
- The overall accuracy of the SVM model is 79%, which means the model correctly classifies 79% of the total samples.
Macro Avg and Weighted Avg: These are the averages of precision, recall, and F1-score across classes, weighted or unweighted by class support, respectively.

3.7 k-nearest Neighbor Classifer¶

In [58]:

# k-Nearest Neighbors (k-NN) Classifier
k_neighbors = 5

knn_model = KNeighborsClassifier(n_neighbors=k_neighbors)

knn_model.fit(x_train, y_train)

# Make predictions on the test data
knn_pred = knn_model.predict(x_test)

knn_accuracy = knn_model.score(x_test, y_test)*100
print(f'Accuracy: {knn_accuracy:.2f}%')

report = classification_report(knn_pred,y_test)

# Print the classification report
print("Classification Report:")
print(report)

Accuracy: 77.19%
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.83      0.83       155
           1       0.64      0.66      0.65        73

    accuracy                           0.77       228
   macro avg       0.74      0.74      0.74       228
weighted avg       0.77      0.77      0.77       228

Precision: From above cell,
- The precision for class 0 (non-diabetic) is 0.82, which means 82% of the samples predicted as non-diabetic are correct. For class 1 (diabetic), the precision is 0.67, indicating that 67% of the samples predicted as diabetic are correct.
Recall (Sensitivity or True Positive Rate): From above cell,
- The recall for class 0 (non-diabetic) is 0.83, meaning the model correctly identifies 83% of the actual non-diabetic samples. For class 1 (diabetic), the recall is 0.65, which means the model correctly identifies 65% of the actual diabetic samples.
F1-score: From above cell,
- The F1-score for class 0 (non-diabetic) is 0.83, and for class 1 (diabetic) is 0.66. The F1-score takes into account both precision and recall and provides a balanced measure of the classifier's performance.
Support: From above cell,
- The support is the number of occurrences of each class in the actual data. In this case, there are 151 samples of class 0 (non-diabetic) and 77 samples of class 1 (diabetic).
Accuracy: From above cell,
- The overall accuracy of the KNN model is 77%, which means the model correctly classifies 77% of the total samples.
Macro Avg and Weighted Avg: These are the averages of precision, recall, and F1-score across classes, weighted or unweighted by class support, respectively.

4. Model Evaluation¶

4.1 Decision Tree Classifer¶

In [59]:

sns.heatmap(confusion_matrix(y_test, dtc_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for Decision Tree')
plt.show()

The diagonal boxes in the confusion matrix represent the number of correct predictions (true positives) made for each class. The label at the top of the diagonal represents the predicted class, while the label on the left side of the diagonal represents the actual class.

The off-diagonal boxes in the confusion matrix show the number of incorrect predictions (false positives). These are instances where the model predicted a sample to belong to a particular class, but it actually belonged to a different class.

In [60]:

ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(dtc_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value Decision Tree')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()

Above distribution plot clearly visualizes the accuracy of the model. The red color represents the actual values and the green color represents the predicted values. The more the overlapping of the two colors, the more accurate of the model is.

Performance Metrics¶

In [61]:

print('Accuracy Score: ',accuracy_score(y_test,dtc_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,dtc_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,dtc_pred))
print('R2 Score: ',r2_score(y_test,dtc_pred))

Accuracy Score:  0.6885964912280702
Mean Absolute Error:  0.31140350877192985
Mean Squared Error:  0.31140350877192985
R2 Score:  -0.410718954248366

4.2 Naive Bayes(GaussianNB)¶

In [62]:

sns.heatmap(confusion_matrix(y_test, nv_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for Naive Bayes(GaussianNB)')
plt.show()

In [63]:

ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(nv_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value Naive Bayes')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()

In above graph we see that the colors are more overlapping it means that Gaussian naive bayes is good fit for data.

Performance Metrics¶

In [64]:

print('Accuracy Score: ',accuracy_score(y_test,nv_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,nv_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,nv_pred))
print('R2 Score: ',r2_score(y_test,nv_pred))

Accuracy Score:  0.7456140350877193
Mean Absolute Error:  0.2543859649122807
Mean Squared Error:  0.2543859649122807
R2 Score:  -0.15241830065359463

4.3 Logistic Regression¶

In [65]:

sns.heatmap(confusion_matrix(y_test, lr_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for Logistic Regression')
plt.show()

In [66]:

ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(lr_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value Logistic Regression')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()

Performance Metrics¶

In [67]:

print('Accuracy Score: ',accuracy_score(y_test,lr_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,lr_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,lr_pred))
print('R2 Score: ',r2_score(y_test,lr_pred))

Accuracy Score:  0.7587719298245614
Mean Absolute Error:  0.2412280701754386
Mean Squared Error:  0.2412280701754386
R2 Score:  -0.09281045751633976

4.4 Random Forest¶

In [68]:

sns.heatmap(confusion_matrix(y_test, rf_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for Random forest')
plt.show()

In [69]:

ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(rf_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value Random Forest Classifier')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()

Performance Metrics¶

In [70]:

print('Accuracy Score: ',accuracy_score(y_test,rf_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,rf_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,rf_pred))
print('R2 Score: ',r2_score(y_test,rf_pred))

Accuracy Score:  0.7763157894736842
Mean Absolute Error:  0.2236842105263158
Mean Squared Error:  0.2236842105263158
R2 Score:  -0.013333333333333197

4.5 Support vector machine¶

In [71]:

sns.heatmap(confusion_matrix(y_test, svm_class_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for Support Vector Machine')
plt.show()

In [72]:

ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(svm_class_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value Support Vector Machine')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()

Performance Metrics¶

In [73]:

print('Accuracy Score: ',accuracy_score(y_test,svm_class_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,svm_class_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,svm_class_pred))
print('R2 Score: ',r2_score(y_test,svm_class_pred))

Accuracy Score:  0.7850877192982456
Mean Absolute Error:  0.2149122807017544
Mean Squared Error:  0.2149122807017544
R2 Score:  0.026405228758169974

4.6 Knn¶

In [74]:

sns.heatmap(confusion_matrix(y_test, knn_pred), annot=True, cmap='RdYlBu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Confusion Matrix for K nearest neighbors')
plt.show()

PrrryhTpkyOzoeIPi2FEG9d8E9EREQkIVY2iIiISFZMNoiIiEhWTDaIiIhIVkw2iIiISFZMNoiIiEhWTDaIiIhIVkw2iIiISFZMNoiIiEhWTDaIiIhIVkw2iIiISFZMNoiIiEhWTDaIiIhIVv8Pka3KaOH3Lp4AAAAASUVORK5CYII=

In [75]:

ax = sns.distplot(y_test, color='r', label='Actual Value',hist=False)
sns.distplot(knn_pred, color='g', label='Predicted Value',hist=False,ax=ax)
plt.title('Actual vs Predicted Value K nearest neighbors')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.legend()
plt.show()

Performance Metrics¶

In [76]:

print('Accuracy Score: ',accuracy_score(y_test,knn_pred))
print('Mean Absolute Error: ',mean_absolute_error(y_test,knn_pred))
print('Mean Squared Error: ',mean_squared_error(y_test,knn_pred))
print('R2 Score: ',r2_score(y_test,knn_pred))

Accuracy Score:  0.7719298245614035
Mean Absolute Error:  0.22807017543859648
Mean Squared Error:  0.22807017543859648
R2 Score:  -0.033202614379084894

4.7 Comparision of Accuracies¶

In [77]:

l = [dtc_accuracy, lr_accuracy, nv_accuracy, knn_accuracy, rf_accuracy, svm_accuracy]
m = ['Decision Tree', 'Logistic Regression', 'Naive Bayes', 'KNN', 'Random Forest', 'Support Vector Machine']

plt.bar(m, l, color=['red', 'green', 'blue', 'cyan', 'magenta', 'yellow'])

# Adding bar labels to the bars
for index, value in enumerate(l):
    plt.text(index, value, str(round(value, 2)), ha='center', va='bottom')

plt.xticks(rotation=90)
plt.title("Accuracies of Comparison of Models")
plt.show()

5. Conclusion-¶

Based on the above analysis and above graph-

the Support Vector Machine (SVM) model appears to perform slightly better than the other models for the given data. It achieves the highest accuracy among all the models and exhibits balanced F1-scores for both classes.
- SVM performed well, where the data is linearly separable or can be transformed into a higher-dimensional space using kernels to achieve separability.
- It's a powerful and versatile algorithm suitable for a wide range of classification problems, especially in binary classification tasks
- It is performing best for small size data
The Decision Tree Classifier has the lowest accuracy among all the models, indicating that it correctly classifies only about 68.42% of the total samples. The accuracy is relatively lower compared to other models.
- The Decision Tree Classifier is a simple and interpretable algorithm, In our data there is various continuos column hence decision tree is not good for our data
- It is performing well when we have more categorical columns, in our data there is more continuous column

In [ ]:

Pima Indian Diabetes Classification with EDA

Table of Contents¶

About the Dataset¶

Dataset Link - https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database¶

What is diabetes ?¶

Data Dictionary¶

Who is Pima Indians ?¶

1.1 Importing required libraries¶

1.2 Data Preprocessing¶

2.1 Univariate Analysis¶

Pregnancies¶

Glucose¶

Blood Pressure¶

skin thickness¶

Insulin¶

BMI¶

Diabetes pedigree function¶

Age¶

Outcome¶

2.2 Bivariate Analysis¶

Age Vs Outcome¶

Glucose Vs Outcome¶

Blood Pressure Vs Outcome¶

Insulin Vs Outcome¶

BMI Vs Outcome¶

Pregnancies Vs Outcome¶

Correlation matrix¶

3. ML Modelling¶

3.1 Splitting the dataset¶

3.2 Decision Tree Classifier¶

3.3 Naive Baiyes (GaussianNB)¶

3.4 Logistic Regression Classifier¶

3.5 Random Forest Classifer¶

3.6 Support Vector Machine¶

3.7 k-nearest Neighbor Classifer¶

4. Model Evaluation¶

4.1 Decision Tree Classifer¶

Performance Metrics¶

4.2 Naive Bayes(GaussianNB)¶

Performance Metrics¶

4.3 Logistic Regression¶

Performance Metrics¶

4.4 Random Forest¶

Performance Metrics¶

4.5 Support vector machine¶

Performance Metrics¶

4.6 Knn¶

Performance Metrics¶

4.7 Comparision of Accuracies¶

5. Conclusion-¶

Talk to our Industry Experts for Career Counselling

Company

Platform

Resources

Get in touch