This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
According to the NIH, "Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. Sometimes your body doesn’t make enough—or any—insulin or doesn’t use insulin well. Glucose then stays in your blood and doesn’t reach your cells.
Over time, having too much glucose in your blood can cause health problems. Although diabetes has no cure, you can take steps to manage your diabetes and stay healthy.
Sometimes people call diabetes “a touch of sugar” or “borderline diabetes.” These terms suggest that someone doesn’t really have diabetes or has a less serious case, but every case of diabetes is serious.
Type 1 diabetes: If you have type 1 diabetes, your body does not make insulin. Your immune system attacks and destroys the cells in your pancreas that make insulin. Type 1 diabetes is usually diagnosed in children and young adults, although it can appear at any age. People with type 1 diabetes need to take insulin every day to stay alive.
Type 2 diabetes: If you have type 2 diabetes, your body does not make or use insulin well. You can develop type 2 diabetes at any age, even during childhood. However, this type of diabetes occurs most often in middle-aged and older people. Type 2 is the most common type of diabetes.
Gestational diabetes: Gestational diabetes develops in some women when they are pregnant. Most of the time, this type of diabetes goes away after the baby is born. However, if you’ve had gestational diabetes, you have a greater chance of developing type 2 diabetes later in life. Sometimes diabetes diagnosed during pregnancy is actually type 2 diabetes.
Other types of diabetes: Less common types include monogenic diabetes, which is an inherited form of diabetes, and cystic fibrosis-related diabetes."
"The Pima (or Akimel O'odham, also spelled Akimel O'otham, "River People", formerly known as Pima) are a group of Native Americans living in an area consisting of what is now central and southern Arizona. The majority population of the surviving two bands of the Akimel O'odham are based in two reservations: the Keli Akimel O'otham on the Gila River Indian Community (GRIC) and the On'k Akimel O'odham on the Salt River Pima-Maricopa Indian Community (SRPMIC)." Wikipedia
# Data manipulation
import pandas as pd
# Mathematical operations
import numpy as np
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# ML algorithms and evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
#Load the Dataset
data = pd.read_csv(r"C:\Users\Lenovo\Documents\jupyter\DataSets\diabetes.csv")
data.sample(10)
# Creating copy of actual dataframe
df=data.copy()
#check for shape
data.shape
(768, 9)
From the above cell we see that there are 768 observations and 9 columns in our data: 8 predictor variables plus the target, Outcome.
#check for columns
data.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')
# check info of each column
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
From the above cell we see that two columns contain float values and seven columns contain integer values. There are also no null values in our data.
#check for duplicate value
data.duplicated().sum()
0
From the above cell we see that there are no duplicate rows in our data.
# summary statistics of numerical columns
data.describe()
Statistic | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|---|
count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
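The `min` row above is 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, so the absence of nulls does not rule out missing measurements that were recorded as 0. A minimal sketch of counting such zeros per column follows. It assumes a 0 in these columns means "not measured", which is an assumption rather than something the data itself confirms. The small frame reuses a few rows that appear in this notebook's outputs, and `suspect_cols` and `zero_counts` are illustrative names.

```python
import pandas as pd

# A few rows copied from this notebook's outputs (index labels 7, 75, 182, 349)
sample = pd.DataFrame(
    {
        "Glucose": [115, 0, 0, 0],
        "BloodPressure": [0, 48, 74, 80],
        "SkinThickness": [0, 20, 20, 32],
        "Insulin": [0, 0, 23, 0],
        "BMI": [35.3, 24.7, 27.7, 41.0],
    },
    index=[7, 75, 182, 349],
)

# Columns where a value of 0 is assumed (not confirmed) to mean "not measured"
suspect_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Comparing to 0 gives a boolean frame; summing counts the True values per column
zero_counts = (sample[suspect_cols] == 0).sum()
print(zero_counts)
```

On the full dataset the same expression would be applied to `data[suspect_cols]`.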
Univariate analysis is the first step in statistical analysis. It examines the properties of each variable on its own, which provides a foundation before moving on to more complex analyses involving multiple variables.
# Count of patients for each number of pregnancies
ax=sns.countplot(x='Pregnancies',data=data,palette='magma')
# Label every bar with its count
for i in ax.containers:
    ax.bar_label(i)
plt.title("Pregnancies")
plt.show()
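The counts drawn on the bars can also be read as numbers with `value_counts`. The sketch below uses only the five Pregnancies values from the rows displayed further down in this notebook (index labels 75, 182, 342, 349, 502), so the result describes that small sample and not the full dataset.

```python
import pandas as pd

# Pregnancies values of five rows shown in this notebook's outputs
pregnancies = pd.Series([1, 1, 1, 5, 6], name="Pregnancies")

# value_counts tallies each distinct value; sort_index orders by pregnancy count
counts = pregnancies.value_counts().sort_index()
print(counts)
```

On the full frame the equivalent call would be `data['Pregnancies'].value_counts().sort_index()`.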
sns.displot(x='Glucose',data=data,kind='hist',kde=True,palette='Set1')
plt.title('Distribution of Glucose')
plt.show()
From the above chart we see that some patients have a glucose level of 0 mg/dL, which is not possible, so the information recorded for these patients is incorrect.
Let's inspect this incorrect data and then remove it.
#Let's check incorrect data
data[data['Glucose']==0]
Index | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|---|
75 | 1 | 0 | 48 | 20 | 0 | 24.7 | 0.140 | 22 | 0 |
182 | 1 | 0 | 74 | 20 | 23 | 27.7 | 0.299 | 21 | 0 |
342 | 1 | 0 | 68 | 35 | 0 | 32.0 | 0.389 | 22 | 0 |
349 | 5 | 0 | 80 | 32 | 0 | 41.0 | 0.346 | 37 | 1 |
502 | 6 | 0 | 68 | 41 | 0 | 39.0 | 0.727 | 41 | 1 |
These rows are incorrect: besides the glucose reading of 0, four of the five also have an insulin level of 0 mu U/ml, which is not possible either. Let's drop these rows.
x=data[data['Glucose']==0].index.to_list()
data.drop(index=x,inplace=True)
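The same drop can also be written as a boolean mask that returns a new frame and leaves the original untouched. This is only a sketch on a toy frame built from a few rows in this notebook's outputs. `toy` and `cleaned` are hypothetical names, and only three columns are kept for brevity.

```python
import pandas as pd

# Rows with index labels 7, 75, 182 and 15 from this notebook's outputs
toy = pd.DataFrame(
    {
        "Glucose": [115, 0, 0, 100],
        "Insulin": [0, 0, 23, 0],
        "Outcome": [0, 0, 0, 1],
    },
    index=[7, 75, 182, 15],
)

# Keep only rows whose Glucose reading is non-zero; toy itself is not modified
cleaned = toy[toy["Glucose"] != 0]

print(cleaned.index.tolist())
print(len(toy) - len(cleaned), "rows dropped")
```

Because the filter does not mutate its input, the unfiltered frame stays available for before/after comparisons without a separate copy.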
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))
sns.histplot(df['Glucose'],kde=True,ax=axes[0])
sns.histplot(data['Glucose'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of Glucose before dropping')
axes[1].set_title('Distribution of Glucose after dropping')
plt.show()
sns.displot(x='BloodPressure',data=data,kind='hist',kde=True)
plt.title('Distribution of Blood Pressure')
plt.show()
From the above distribution plot we see that a few patients have a blood pressure of 0 mmHg, which is also not possible. Let's check the BloodPressure column.
#check incorrect data
print("Shape of incorrect data:", data[data['BloodPressure']==0].shape)
data[data['BloodPressure']==0]
Shape of incorrect data: (35, 9)
Index | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|---|
7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
15 | 7 | 100 | 0 | 0 | 0 | 30.0 | 0.484 | 32 | 1 |
49 | 7 | 105 | 0 | 0 | 0 | 0.0 | 0.305 | 24 | 0 |
60 | 2 | 84 | 0 | 0 | 0 | 0.0 | 0.304 | 21 | 0 |
78 | 0 | 131 | 0 | 0 | 0 | 43.2 | 0.270 | 26 | 1 |
81 | 2 | 74 | 0 | 0 | 0 | 0.0 | 0.102 | 22 | 0 |
172 | 2 | 87 | 0 | 23 | 0 | 28.9 | 0.773 | 25 | 0 |
193 | 11 | 135 | 0 | 0 | 0 | 52.3 | 0.578 | 40 | 1 |
222 | 7 | 119 | 0 | 0 | 0 | 25.2 | 0.209 | 37 | 0 |
261 | 3 | 141 | 0 | 0 | 0 | 30.0 | 0.761 | 27 | 1 |
266 | 0 | 138 | 0 | 0 | 0 | 36.3 | 0.933 | 25 | 1 |
269 | 2 | 146 | 0 | 0 | 0 | 27.5 | 0.240 | 28 | 1 |
300 | 0 | 167 | 0 | 0 | 0 | 32.3 | 0.839 | 30 | 1 |
332 | 1 | 180 | 0 | 0 | 0 | 43.3 | 0.282 | 41 | 1 |
336 | 0 | 117 | 0 | 0 | 0 | 33.8 | 0.932 | 44 | 0 |
347 | 3 | 116 | 0 | 0 | 0 | 23.5 | 0.187 | 23 | 0 |
357 | 13 | 129 | 0 | 30 | 0 | 39.9 | 0.569 | 44 | 1 |
426 | 0 | 94 | 0 | 0 | 0 | 0.0 | 0.256 | 25 | 0 |
430 | 2 | 99 | 0 | 0 | 0 | 22.2 | 0.108 | 23 | 0 |
435 | 0 | 141 | 0 | 0 | 0 | 42.4 | 0.205 | 29 | 1 |
453 | 2 | 119 | 0 | 0 | 0 | 19.6 | 0.832 | 72 | 0 |
468 | 8 | 120 | 0 | 0 | 0 | 30.0 | 0.183 | 38 | 1 |
484 | 0 | 145 | 0 | 0 | 0 | 44.2 | 0.630 | 31 | 1 |
494 | 3 | 80 | 0 | 0 | 0 | 0.0 | 0.174 | 22 | 0 |
522 | 6 | 114 | 0 | 0 | 0 | 0.0 | 0.189 | 26 | 0 |
533 | 6 | 91 | 0 | 0 | 0 | 29.8 | 0.501 | 31 | 0 |
535 | 4 | 132 | 0 | 0 | 0 | 32.9 | 0.302 | 23 | 1 |
589 | 0 | 73 | 0 | 0 | 0 | 21.1 | 0.342 | 25 | 0 |
601 | 6 | 96 | 0 | 0 | 0 | 23.7 | 0.190 | 28 | 0 |
604 | 4 | 183 | 0 | 0 | 0 | 28.4 | 0.212 | 36 | 1 |
619 | 0 | 119 | 0 | 0 | 0 | 32.4 | 0.141 | 24 | 1 |
643 | 4 | 90 | 0 | 0 | 0 | 28.0 | 0.610 | 31 | 0 |
697 | 0 | 99 | 0 | 0 | 0 | 25.0 | 0.253 | 22 | 0 |
703 | 2 | 129 | 0 | 0 | 0 | 38.5 | 0.304 | 41 | 0 |
706 | 10 | 115 | 0 | 0 | 0 | 0.0 | 0.261 | 30 | 1 |
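From the above we see that there are 35 observations with incorrect data in the BloodPressure column. Let's replace these zeros with the mean of the BloodPressure column.
# Mark zeros as missing, then fill them with the mean of the remaining values
# (plain assignment is used because inplace=True on a column selection may not update the frame in newer pandas versions)
data['BloodPressure'] = data['BloodPressure'].replace(0, np.nan)
data['BloodPressure'] = data['BloodPressure'].fillna(data['BloodPressure'].mean())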
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(12,5))
sns.histplot(df['BloodPressure'],kde=True,ax=axes[0])
sns.histplot(data['BloodPressure'],kde=True,ax=axes[1])
axes[0].set_title('Distribution of Blood Pressure before replacing')
axes[1].set_title('Distribution of Blood Pressure after replacing')
plt.show()
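The imputation step can be checked by hand on a small series. The sketch below uses four BloodPressure readings from rows shown earlier (48, 74, 68, 80) plus one zero. The median variant is included only for comparison: it is a commonly used alternative when a distribution is skewed, not something this notebook applies.

```python
import numpy as np
import pandas as pd

# Four readings from this notebook's outputs plus one zero standing in for a missing value
bp = pd.Series([48, 74, 68, 80, 0], name="BloodPressure", dtype="float64")

# Treat 0 as missing so it does not pull the mean down
bp_nan = bp.replace(0, np.nan)

# mean() and median() skip NaN by default
mean_filled = bp_nan.fillna(bp_nan.mean())
median_filled = bp_nan.fillna(bp_nan.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

Here the zero is replaced by 67.5 under mean imputation and by 71.0 under median imputation. Had the zero been left in, the mean of all five values would have been 54.0.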