About Dataset¶
This dataset contains insightful information related to insurance claims, giving us an in-depth look into the demographic patterns of those receiving them. The dataset contains information on patient age, gender, BMI (Body Mass Index), blood pressure levels, diabetic status, number of children, smoking status and region. By analyzing these key factors across geographical areas and across different demographics such as age or gender we can gain a greater understanding of who is most likely to receive an insurance claim. This understanding gives us valuable insight that can be used to inform our decision making when considering potential customers for our services. On a broader scale it can inform public policy by allowing for more targeted support for those who are most in need and vulnerable. These kinds of insights are extremely valuable and this dataset provides us with the tools we need to uncover them!
Dataset Link -Kaggle link
Data Dictionary¶
index: A unique identifier for each entry in the dataset.
PatientID: A unique identifier for each patient in the dataset.
age: The age of the patient.
gender: The gender of the patient.
bmi: The Body Mass Index (BMI) of the patient, which is a measure of body fat based on height and weight.
bloodpressure: The blood pressure level of the patient.
diabetic: Indicates whether the patient has diabetes (Yes/No).
children: The number of children the patient has.
smoker: Indicates whether the patient smokes (Yes/No).
region: The geographical region where the patient resides.
claim: The amount of the insurance claim made by the patient.
Installing Dependencies¶
👉 Skip this step if the packages are already installed
1. !pip install numpy
2. !pip install pandas
3. !pip install matplotlib
4. !pip install seaborn
Step 1 - Data Preprocessing and Cleaning¶
Importing Required Libraries¶
# Numerical operations
import numpy as np
# Data manipulation
import pandas as pd
#Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Remove warnings
import warnings
warnings.filterwarnings('ignore')
#Load the dataset
insurance=pd.read_csv(r"C:\Users\Lenovo\Downloads\content\Insurance Data Analysis\insurance_data - insurance_data.csv")
# Print top 5 rows
insurance.head()
| | index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 39.0 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
1 | 1 | 2 | 24.0 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
2 | 2 | 3 | NaN | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
3 | 3 | 4 | NaN | male | 33.7 | 80 | No | 0 | No | northwest | 1136.40 |
4 | 4 | 5 | NaN | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |
insurance
| | index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 39.0 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
1 | 1 | 2 | 24.0 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
2 | 2 | 3 | NaN | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
3 | 3 | 4 | NaN | male | 33.7 | 80 | No | 0 | No | northwest | 1136.40 |
4 | 4 | 5 | NaN | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1335 | 1335 | 1336 | 44.0 | female | 35.5 | 88 | Yes | 0 | Yes | northwest | 55135.40 |
1336 | 1336 | 1337 | 59.0 | female | 38.1 | 120 | No | 1 | Yes | northeast | 58571.07 |
1337 | 1337 | 1338 | 30.0 | male | 34.5 | 91 | Yes | 3 | Yes | northwest | 60021.40 |
1338 | 1338 | 1339 | 37.0 | male | 30.4 | 106 | No | 0 | Yes | southeast | 62592.87 |
1339 | 1339 | 1340 | 30.0 | female | 47.4 | 101 | No | 0 | Yes | southeast | 63770.43 |
1340 rows × 11 columns
# check for shape
insurance.shape
(1340, 11)
The output above shows that the dataset contains 1340 observations (rows) and 11 columns.
# Check the info of each column
insurance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340 entries, 0 to 1339
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   index          1340 non-null   int64
 1   PatientID      1340 non-null   int64
 2   age            1335 non-null   float64
 3   gender         1340 non-null   object
 4   bmi            1340 non-null   float64
 5   bloodpressure  1340 non-null   int64
 6   diabetic       1340 non-null   object
 7   children       1340 non-null   int64
 8   smoker         1340 non-null   object
 9   region         1337 non-null   object
 10  claim          1340 non-null   float64
dtypes: float64(3), int64(4), object(4)
memory usage: 115.3+ KB
The output above shows that there are 4 object columns, 4 integer columns, and 3 float columns.
# Checking null values
insurance.isnull().sum()
index            0
PatientID        0
age              5
gender           0
bmi              0
bloodpressure    0
diabetic         0
children         0
smoker           0
region           3
claim            0
dtype: int64
The output above shows missing values in two columns, age (5) and region (3), so we need to fill them.
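As a small complement to the raw counts, the share of missing values per column can help judge how much imputation will affect each column. The sketch below uses a made-up toy frame (the name `toy` and its values are assumptions for illustration, not the real insurance data).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the insurance data (values are illustrative only)
toy = pd.DataFrame({
    "age": [39.0, 24.0, np.nan, np.nan, 51.0],
    "region": ["southeast", None, "northwest", "southeast", "northeast"],
    "claim": [1121.87, 1131.51, 1135.94, 1136.40, 1137.01],
})

# Count and percentage of missing values per column
missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Keep only the columns that actually have gaps
summary = summary[summary["missing"] > 0]
print(summary)
```

On the real data the same two lines would be applied to `insurance` instead of `toy`.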
Fill missing values in age column¶
Since the real ages are unknown, we need a strategy for filling the missing values in the age column. The age column contains continuous data, and missing values in a continuous column are generally filled with the mean or the median of that column. In our case we fill the missing values with the mean of the age column.
insurance.age.isna().sum()
5
insurance[insurance.age.isnull()]
| | index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | 3 | NaN | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
3 | 3 | 4 | NaN | male | 33.7 | 80 | No | 0 | No | northwest | 1136.40 |
4 | 4 | 5 | NaN | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |
5 | 5 | 6 | NaN | male | 34.4 | 96 | Yes | 0 | No | northwest | 1137.47 |
6 | 6 | 7 | NaN | male | 37.3 | 86 | Yes | 0 | No | northwest | 1141.45 |
mean_age=insurance.age.mean()
mean_age
38.07865168539326
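# Assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame in newer pandas versions
insurance['age'] = insurance['age'].fillna(mean_age)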
insurance.age.isna().sum()
0
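The choice between mean and median matters mostly when a column contains extreme values. The following minimal sketch, on an invented series of ages (not taken from the dataset), shows how a single extreme value pulls the mean upward while the median stays put.

```python
import numpy as np
import pandas as pd

# Illustrative ages with one extreme value and two gaps (not the real data)
ages = pd.Series([22.0, 25.0, 27.0, 30.0, 96.0, np.nan, np.nan])

mean_age = ages.mean()      # pulled upward by the extreme value
median_age = ages.median()  # robust to the extreme value

# fillna returns a new Series; the original is left untouched
filled_mean = ages.fillna(mean_age)
filled_median = ages.fillna(median_age)

print(f"mean={mean_age:.1f}, median={median_age:.1f}")
```

When the two statistics are close, as is typical for a roughly symmetric column, either choice gives similar results.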
Fill the missing values in region column¶
The region column contains categorical data. Missing values in a categorical column are generally filled with the mode of that column, that is, its most frequent category.
insurance.region.isnull().sum()
3
insurance[insurance.region.isna()]
| | index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
---|---|---|---|---|---|---|---|---|---|---|---|
13 | 13 | 14 | 32.0 | male | 27.6 | 100 | No | 0 | No | NaN | 1252.41 |
14 | 14 | 15 | 40.0 | male | 28.7 | 81 | Yes | 0 | No | NaN | 1253.94 |
15 | 15 | 16 | 32.0 | male | 30.4 | 86 | Yes | 0 | No | NaN | 1256.30 |
mode_region=insurance.region.mode()
mode_region
0 southeast Name: region, dtype: object
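# Fill with the most frequent region computed above instead of a hard-coded string
insurance['region'] = insurance['region'].fillna(mode_region.iloc[0])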
insurance.region.isnull().sum()
0
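The two rules used so far (mean for continuous columns, mode for categorical ones) can be combined into one helper. This is only a sketch under assumptions: the function name `impute_simple` and the toy frame are hypothetical, and it assumes every column with gaps has at least one non-missing value.

```python
import numpy as np
import pandas as pd

def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps filled by the mean and other gaps by the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() returns a Series because ties are possible; take the first entry
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy frame for illustration only
toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0, 40.0],
    "region": ["southeast", "northwest", None, "southeast"],
})
cleaned = impute_simple(toy)
print(cleaned)
```

Working on a copy keeps the original frame available for comparison before and after imputation.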
# check for duplicate
insurance.duplicated().sum()
0
The output above shows that the dataset contains no duplicate rows.
Remove Unnecessary Column - index¶
insurance.drop(columns='index',inplace=True)
Step 2 - Data Analysis¶
insurance
| | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39.000000 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
1 | 2 | 24.000000 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
2 | 3 | 38.078652 | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
3 | 4 | 38.078652 | male | 33.7 | 80 | No | 0 | No | northwest | 1136.40 |
4 | 5 | 38.078652 | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1335 | 1336 | 44.000000 | female | 35.5 | 88 | Yes | 0 | Yes | northwest | 55135.40 |
1336 | 1337 | 59.000000 | female | 38.1 | 120 | No | 1 | Yes | northeast | 58571.07 |
1337 | 1338 | 30.000000 | male | 34.5 | 91 | Yes | 3 | Yes | northwest | 60021.40 |
1338 | 1339 | 37.000000 | male | 30.4 | 106 | No | 0 | Yes | southeast | 62592.87 |
1339 | 1340 | 30.000000 | female | 47.4 | 101 | No | 0 | Yes | southeast | 63770.43 |
1340 rows × 10 columns
Let's check the distribution of each column¶
AGE¶
# Create a figure and a set of subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
# Plot a histogram with a KDE curve
sns.histplot(insurance.age, kde=True, ax=axes[0])
axes[0].set_title('Distribution of Age')
# Plot a boxplot
sns.boxplot(x=insurance.age, ax=axes[1])
axes[1].set_title('Boxplot of Age')
# Display the plots
plt.tight_layout()
plt.show()
From the plots above, most values in the age column lie between 30 and 50, and the median age is around 36 to 38. The distribution looks approximately normal, and the boxplot shows no outliers.
BMI¶
# Create a figure and a set of subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
# Plot a histogram with a KDE curve
sns.histplot(insurance.bmi, kde=True, ax=axes[0])
axes[0].set_title('Distribution of BMI')
# Plot a boxplot
sns.boxplot(x=insurance.bmi, ax=axes[1])
axes[1].set_title('Boxplot of BMI')
# Display the plots
plt.tight_layout()
plt.show()
The plots above indicate that the middle 50% of values in the bmi column (the interquartile range) lie roughly between 25 and 35, the median bmi is about 30, and there are some outliers.
# Inspect the outlier rows (BMI above 50)
l=insurance[insurance.bmi>50].index.to_list()
insurance[insurance.bmi>50]
| | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim |
---|---|---|---|---|---|---|---|---|---|---|
9 | 10 | 30.0 | male | 53.1 | 97 | No | 0 | No | northwest | 1163.46 |
141 | 142 | 46.0 | male | 50.4 | 89 | Yes | 1 | No | southeast | 2438.06 |
1299 | 1300 | 50.0 | male | 52.6 | 110 | No | 1 | Yes | southeast | 44501.40 |
# dropping outliers whose BMI value is above 50
insurance.drop(index=l,inplace=True)
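The notebook drops rows using a fixed cutoff of 50. A common alternative, and the rule seaborn's boxplot uses by default to draw its whiskers, is to flag points more than 1.5 times the IQR beyond the quartiles. The sketch below illustrates that rule on invented BMI values; the helper name `iqr_bounds` is hypothetical and this is not the method used in the analysis above.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences located k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative BMI values (not taken from the dataset)
bmi = pd.Series([22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 35.0, 53.0])

lower, upper = iqr_bounds(bmi)
# Values outside the fences are treated as outliers under this rule
outliers = bmi[(bmi < lower) | (bmi > upper)]
print(lower, upper)
print(outliers)
```

A data-driven fence adapts to each column, whereas a fixed cutoff such as 50 is easier to explain; which one is preferable depends on the analysis.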
# Create a figure and a set of subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
# Plot a histogram with a KDE curve
sns.histplot(insurance.bmi, kde=True, ax=axes[0])
axes[0].set_title('Distribution of BMI')
# Plot a boxplot
sns.boxplot(x=insurance.bmi, ax=axes[1])
axes[1].set_title('Boxplot of BMI')
# Display the plots
plt.tight_layout()
plt.show()
Blood Pressure¶
# Plot a histogram with a KDE curve
sns.histplot(insurance.bloodpressure, kde=True)
plt.title('Distribution of Blood Pressure')
plt.show()
The histogram above indicates that the bloodpressure column is right-skewed: most readings cluster at the lower end, while a few patients have readings as high as about 140.
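A visual impression of skew can be backed up with a number. As a rough sketch on invented readings (not the real column), pandas' `Series.skew()` returns a positive value for a right-skewed sample, and the mean sits above the median.

```python
import pandas as pd

# Illustrative blood pressure readings with a long right tail (not the real data)
bp = pd.Series([80, 82, 85, 86, 88, 90, 91, 95, 110, 140])

skewness = bp.skew()      # positive values indicate a right (positive) skew
mean_bp = bp.mean()
median_bp = bp.median()

print(f"skewness={skewness:.2f}, mean={mean_bp:.1f}, median={median_bp:.1f}")
```

The same calls can be applied to `insurance.bloodpressure` to check the reading of the histogram.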
Gender¶
ax=sns.countplot(x=insurance.gender)
# Label each bar with its count
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in Gender Column")
plt.show()
The count plot above shows that the data contains 675 male patients and 662 female patients.
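The counts shown on the bars can also be obtained without a plot, together with each category's share. A minimal sketch on a made-up series (the values are illustrative, not the dataset's):

```python
import pandas as pd

# Illustrative gender labels (not the real data)
gender = pd.Series(["male", "female", "male", "female", "male"])

# Absolute counts per category, most frequent first
counts = gender.value_counts()
# Share of each category as a percentage, rounded to one decimal
shares = gender.value_counts(normalize=True).mul(100).round(1)

print(counts)
print(shares)
```

Applied to `insurance.gender`, the same calls give the numbers behind the count plot.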
Diabetic¶
ax=sns.countplot(x=insurance.diabetic)
# Label each bar with its count
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in Diabetic Column")
plt.show()