About Dataset¶
This dataset contains insightful information related to insurance claims, giving us an in-depth look into the demographic patterns of those receiving them. The dataset contains information on patient age, gender, BMI (Body Mass Index), blood pressure levels, diabetic status, number of children, smoking status and region. By analyzing these key factors across geographical areas and across different demographics such as age or gender we can gain a greater understanding of who is most likely to receive an insurance claim. This understanding gives us valuable insight that can be used to inform our decision making when considering potential customers for our services. On a broader scale it can inform public policy by allowing for more targeted support for those who are most in need and vulnerable. These kinds of insights are extremely valuable and this dataset provides us with the tools we need to uncover them!
Dataset Link -Kaggle link
Data Dictionary¶
index: A unique identifier for each entry in the dataset.
PatientID: A unique identifier for each patient in the dataset.
age: The age of the patient.
gender: The gender of the patient.
bmi: The Body Mass Index (BMI) of the patient, which is a measure of body fat based on height and weight.
bloodpressure: The blood pressure level of the patient.
diabetic: Indicates whether the patient has diabetes (Yes/No).
children: The number of children the patient has.
smoker: Indicates whether the patient smokes (Yes/No).
region: The geographical region where the patient resides.
claim: The amount of the insurance claim made by the patient.
Installing Dependencies¶
👉 Skip this step if the packages are already installed
1. !pip install numpy
2. !pip install pandas
3. !pip install matplotlib
4. !pip install seaborn
Step 1 - Data Preprocessing and Cleaning¶
Importing Required Libraries¶
# Numerical operations
import numpy as np
# Data manipulation
import pandas as pd
#Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Remove warnings
import warnings
warnings.filterwarnings('ignore')
#Load the dataset
insurance=pd.read_csv(r"C:\Users\Lenovo\Downloads\content\Insurance Data Analysis\insurance_data - insurance_data.csv")
# Print top 5 rows
insurance.head()
index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 39.0 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
1 | 1 | 2 | 24.0 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
2 | 2 | 3 | NaN | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
3 | 3 | 4 | NaN | male | 33.7 | 80 | No | 0 | No | northwest | 1136.40 |
4 | 4 | 5 | NaN | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |
insurance
index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 39.0 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
1 | 1 | 2 | 24.0 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
2 | 2 | 3 | NaN | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
3 | 3 | 4 | NaN | male | 33.7 | 80 | No | 0 | No | northwest | 1136.40 |
4 | 4 | 5 | NaN | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1335 | 1335 | 1336 | 44.0 | female | 35.5 | 88 | Yes | 0 | Yes | northwest | 55135.40 |
1336 | 1336 | 1337 | 59.0 | female | 38.1 | 120 | No | 1 | Yes | northeast | 58571.07 |
1337 | 1337 | 1338 | 30.0 | male | 34.5 | 91 | Yes | 3 | Yes | northwest | 60021.40 |
1338 | 1338 | 1339 | 37.0 | male | 30.4 | 106 | No | 0 | Yes | southeast | 62592.87 |
1339 | 1339 | 1340 | 30.0 | female | 47.4 | 101 | No | 0 | Yes | southeast | 63770.43 |
1340 rows × 11 columns
# check for shape
insurance.shape
(1340, 11)
The output above shows that the dataset contains 1,340 observations and 11 columns.
# Check info of each column
insurance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340 entries, 0 to 1339
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   index          1340 non-null   int64
 1   PatientID      1340 non-null   int64
 2   age            1335 non-null   float64
 3   gender         1340 non-null   object
 4   bmi            1340 non-null   float64
 5   bloodpressure  1340 non-null   int64
 6   diabetic       1340 non-null   object
 7   children       1340 non-null   int64
 8   smoker         1340 non-null   object
 9   region         1337 non-null   object
 10  claim          1340 non-null   float64
dtypes: float64(3), int64(4), object(4)
memory usage: 115.3+ KB
The output above shows 4 object columns, 4 integer columns, and 3 float columns.
# Checking null values
insurance.isnull().sum()
index            0
PatientID        0
age              5
gender           0
bmi              0
bloodpressure    0
diabetic         0
children         0
smoker           0
region           3
claim            0
dtype: int64
The output above shows missing values in the age column (5) and the region column (3), so we need to fill them.
Fill missing values in age column¶
The age column holds continuous data, and we have no way to recover the true ages. A common approach for a continuous column is to fill missing values with the column's mean or median. Here we fill them with the mean of the age column.
insurance.age.isna().sum()
5
insurance[insurance.age.isnull()]
index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim | |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | 3 | NaN | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
3 | 3 | 4 | NaN | male | 33.7 | 80 | No | 0 | No | northwest | 1136.40 |
4 | 4 | 5 | NaN | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |
5 | 5 | 6 | NaN | male | 34.4 | 96 | Yes | 0 | No | northwest | 1137.47 |
6 | 6 | 7 | NaN | male | 37.3 | 86 | Yes | 0 | No | northwest | 1141.45 |
mean_age=insurance.age.mean()
mean_age
38.07865168539326
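# Assign back: chained fillna(..., inplace=True) is not guaranteed to update the DataFrame under pandas Copy-on-Write
insurance['age'] = insurance['age'].fillna(mean_age)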
insurance.age.isna().sum()
0
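The choice between mean and median mentioned above matters mostly when a column has extreme values. The following is a minimal sketch on a small made-up Series (the values are hypothetical and not taken from the dataset) that compares the two fill strategies.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values); the single large value pulls the mean upward
ages = pd.Series([22.0, 25.0, 31.0, np.nan, 38.0, 90.0, np.nan])

# Series.mean() and Series.median() skip NaN by default
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

print("mean:", ages.mean(), "median:", ages.median())
print("missing after fill:", mean_filled.isna().sum(), median_filled.isna().sum())
```

When the distribution is roughly symmetric, as the age column appears to be in this notebook, the two strategies give similar results; the median is usually the safer choice for skewed columns.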
Fill the missing values in region column¶
The region column contains categorical data. A common approach for a categorical column is to fill missing values with the mode, that is, the most frequent category.
insurance.region.isnull().sum()
3
insurance[insurance.region.isna()]
index | PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim | |
---|---|---|---|---|---|---|---|---|---|---|---|
13 | 13 | 14 | 32.0 | male | 27.6 | 100 | No | 0 | No | NaN | 1252.41 |
14 | 14 | 15 | 40.0 | male | 28.7 | 81 | Yes | 0 | No | NaN | 1253.94 |
15 | 15 | 16 | 32.0 | male | 30.4 | 86 | Yes | 0 | No | NaN | 1256.30 |
mode_region=insurance.region.mode()
mode_region
0    southeast
Name: region, dtype: object
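# Assign back the filled column; mode_region[0] is the most frequent category ('southeast')
insurance['region'] = insurance['region'].fillna(mode_region[0])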
insurance.region.isnull().sum()
0
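The mode-based fill described above can be sketched without hard-coding the category name. This is a minimal example on a made-up Series (hypothetical values); note that `mode()` returns a Series because ties are possible, so the first entry is taken explicitly.

```python
import numpy as np
import pandas as pd

# Toy region values (hypothetical), with two missing entries
regions = pd.Series(["southeast", "northwest", np.nan, "southeast", "southwest", np.nan])

# mode() ignores NaN by default and returns a Series; take its first entry
most_common = regions.mode().iloc[0]
regions_filled = regions.fillna(most_common)

print(most_common)
print(regions_filled.value_counts())
```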
# check for duplicate
insurance.duplicated().sum()
0
The output above shows that the dataset contains no duplicate rows.
Remove Unnecessary Column - index¶
insurance.drop(columns='index',inplace=True)
Step 2 - Data Analysis¶
insurance
PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39.000000 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 |
1 | 2 | 24.000000 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 |
2 | 3 | 38.078652 | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 |
3 | 4 | 38.078652 | male | 33.7 | 80 | No | 0 | No | northwest | 1136.40 |
4 | 5 | 38.078652 | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1335 | 1336 | 44.000000 | female | 35.5 | 88 | Yes | 0 | Yes | northwest | 55135.40 |
1336 | 1337 | 59.000000 | female | 38.1 | 120 | No | 1 | Yes | northeast | 58571.07 |
1337 | 1338 | 30.000000 | male | 34.5 | 91 | Yes | 3 | Yes | northwest | 60021.40 |
1338 | 1339 | 37.000000 | male | 30.4 | 106 | No | 0 | Yes | southeast | 62592.87 |
1339 | 1340 | 30.000000 | female | 47.4 | 101 | No | 0 | Yes | southeast | 63770.43 |
1340 rows × 10 columns
Let's Check the Distribution of Each Column¶
AGE¶
# Create a figure and a set of subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
# Plot a histogram with a KDE curve
sns.histplot(insurance.age, kde=True, ax=axes[0])
axes[0].set_title('Distribution of Age')
# Plot a boxplot
sns.boxplot(x=insurance.age, ax=axes[1])
axes[1].set_title('Boxplot of Age')
# Display the plots
plt.tight_layout()
plt.show()
The plots above show that most values in the age column lie between 30 and 50, with a median of about 36 to 38. The distribution looks approximately normal, and the boxplot shows no outliers.
BMI¶
# Create a figure and a set of subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
# Plot a histogram with a KDE curve
sns.histplot(insurance.bmi, kde=True, ax=axes[0])
axes[0].set_title('Distribution of BMI')
# Plot a boxplot
sns.boxplot(x=insurance.bmi, ax=axes[1])
axes[1].set_title('Boxplot of BMI')
# Display the plots
plt.tight_layout()
plt.show()
The plots above indicate that the middle 50% of bmi values (the interquartile range) lies roughly between 25 and 35, with a median of about 30. The boxplot also shows some outliers.
# Inspect the outlier rows (BMI above 50)
l=insurance[insurance.bmi>50].index.to_list()
insurance[insurance.bmi>50]
PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim | |
---|---|---|---|---|---|---|---|---|---|---|
9 | 10 | 30.0 | male | 53.1 | 97 | No | 0 | No | northwest | 1163.46 |
141 | 142 | 46.0 | male | 50.4 | 89 | Yes | 1 | No | southeast | 2438.06 |
1299 | 1300 | 50.0 | male | 52.6 | 110 | No | 1 | Yes | southeast | 44501.40 |
# dropping outliers whose BMI value is above 50
insurance.drop(index=l,inplace=True)
# Create a figure and a set of subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
# Plot a histogram with a KDE curve
sns.histplot(insurance.bmi, kde=True, ax=axes[0])
axes[0].set_title('Distribution of BMI')
# Plot a boxplot
sns.boxplot(x=insurance.bmi, ax=axes[1])
axes[1].set_title('Boxplot of BMI')
# Display the plots
plt.tight_layout()
plt.show()
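The cells above remove outliers with a fixed cutoff of BMI above 50. As an alternative, the points a boxplot draws beyond its whiskers follow the 1.5 × IQR rule by default, and that rule can be computed directly. The sketch below uses a small made-up Series (hypothetical values), not the full dataset.

```python
import pandas as pd

# Toy BMI values (hypothetical); one value is far above the rest
bmi = pd.Series([23.2, 30.1, 33.3, 33.7, 34.1, 27.6, 28.7, 30.4, 53.1])

# First and third quartiles, and the interquartile range
q1, q3 = bmi.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = bmi[(bmi < lower) | (bmi > upper)]

print(f"bounds: {lower:.1f} to {upper:.1f}")
print(outliers)
```

A data-driven bound like this adapts to the column, whereas a fixed cutoff such as 50 is easier to explain; which one is appropriate depends on the analysis.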
Blood Pressure¶
# Plot a histogram with a KDE curve
sns.histplot(insurance.bloodpressure, kde=True)
plt.title('Distribution of Blood Pressure')
plt.show()
The histogram above indicates that the bloodpressure column is right-skewed, with a tail of patients whose blood pressure reaches about 140.
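The right skew described above can also be checked numerically rather than read off the plot. This is a minimal sketch on made-up readings (hypothetical values): a positive sample skewness and a mean above the median are both consistent with a right-skewed distribution.

```python
import pandas as pd

# Toy blood pressure readings (hypothetical), with a long right tail
bp = pd.Series([80, 82, 85, 86, 87, 88, 90, 91, 95, 100, 120, 140])

# Series.skew() returns the sample skewness; values above 0 indicate right skew
skewness = bp.skew()

print("skewness:", round(skewness, 2))
print("mean:", round(bp.mean(), 2), "median:", bp.median())
```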
Gender¶
ax=sns.countplot(x=insurance.gender)
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in Gender Column")
plt.show()
The count plot above shows 675 male patients and 662 female patients.
Diabetic¶
ax=sns.countplot(x=insurance.diabetic)
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in Diabetic Column")
plt.show()
From the count plot above, we can conclude that there are 641 patients with diabetes, while 696 patients do not have diabetes in our dataset.
Children¶
ax=sns.countplot(x=insurance.children)
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in children Column")
plt.show()
From the above count plot, we observe that there are 575 patients who have no children, 322 patients who have one child, and 240 patients who have two children, and so on.
Smoker¶
ax=sns.countplot(x=insurance.smoker)
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in Smoker Column")
plt.show()
The count plot indicates that 1064 patients are non-smokers, while 273 patients are smokers.
Region¶
ax=sns.countplot(x=insurance.region)
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in region Column")
plt.show()
The count plot illustrates that the majority of the patients are from the southeast region, followed by the northwest region and then the southwest and so on...
Claim¶
# Plot a histogram with a KDE curve
sns.histplot(insurance.claim, kde=True)
plt.title('Distribution of Claim amount')
# Display the plots
plt.show()
The histogram suggests that most patients receive claim amounts below 20,000, while a small proportion receive claims exceeding 50,000.
Let's Ask Some Questions of the Data¶
What is the distribution of BMI among different age groups and regions?
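def agegroup(x):
    # Map an age to a bracket; each bracket includes its lower bound
    if 18 <= x < 29:
        return '18-29'
    elif 29 <= x < 39:
        return '29-39'
    elif 39 <= x < 49:
        return '39-49'
    else:
        # Ages of 49 and above (and any age below 18) fall through to this label
        return '49-60'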
insurance['agegroup']=insurance.age.apply(agegroup)
insurance
PatientID | age | gender | bmi | bloodpressure | diabetic | children | smoker | region | claim | agegroup | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 39.000000 | male | 23.2 | 91 | Yes | 0 | No | southeast | 1121.87 | 39-49 |
1 | 2 | 24.000000 | male | 30.1 | 87 | No | 0 | No | southeast | 1131.51 | 18-29 |
2 | 3 | 38.078652 | male | 33.3 | 82 | Yes | 0 | No | southeast | 1135.94 | 29-39 |
3 | 4 | 38.078652 | male | 33.7 | 80 | No | 0 | No | northwest | 1136.40 | 29-39 |
4 | 5 | 38.078652 | male | 34.1 | 100 | No | 0 | No | northwest | 1137.01 | 29-39 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1335 | 1336 | 44.000000 | female | 35.5 | 88 | Yes | 0 | Yes | northwest | 55135.40 | 39-49 |
1336 | 1337 | 59.000000 | female | 38.1 | 120 | No | 1 | Yes | northeast | 58571.07 | 49-60 |
1337 | 1338 | 30.000000 | male | 34.5 | 91 | Yes | 3 | Yes | northwest | 60021.40 | 29-39 |
1338 | 1339 | 37.000000 | male | 30.4 | 106 | No | 0 | Yes | southeast | 62592.87 | 29-39 |
1339 | 1340 | 30.000000 | female | 47.4 | 101 | No | 0 | Yes | southeast | 63770.43 | 29-39 |
1337 rows × 11 columns
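The same age brackets can be produced without a custom function by using `pd.cut`. The sketch below is an alternative under stated assumptions, not the notebook's method: the bin edges are chosen to mirror the `agegroup` function, the upper edge of 61 is an assumption so that age 60 is included, and the ages are a made-up sample.

```python
import pandas as pd

# Toy ages (hypothetical), including values on the bracket boundaries
ages = pd.Series([18.0, 24.0, 29.0, 38.078652, 39.0, 48.9, 49.0, 59.0, 60.0])

# right=False makes each bin include its lower edge: [18, 29), [29, 39), [39, 49), [49, 61)
bins = [18, 29, 39, 49, 61]
labels = ['18-29', '29-39', '39-49', '49-60']
agegroup_cut = pd.cut(ages, bins=bins, labels=labels, right=False)

print(agegroup_cut.tolist())
```

One difference worth noting: `pd.cut` leaves ages outside the bins as missing values, whereas the `else` branch of the function assigns them to '49-60'.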
plt.figure(figsize=(12, 8))
sns.violinplot(x='agegroup', y='bmi', hue='region', data=insurance, inner='quart', palette='Set2')
plt.title('Distribution of BMI across Age Groups and Regions')
plt.xlabel('Age Group')
plt.ylabel('BMI')
plt.show()
Based on the above plots, we can observe that in the southwest region, the BMI range falls between 20 and 40, with the median value ranging from 32 to 36 across all age groups. Similarly, in the northwest region, the BMI range spans from 20 to 35, with the median value falling between 26 and 30 for all age groups. In the southeast region, the BMI range remains consistent, ranging from 29 to 32 across all age groups. Lastly, in the northeast region, the majority of data falls within the range of 20 to 35, with the median value ranging from 28 to 32 across all age groups.
Find average age of diabetic people across different regions?
plt.figure(figsize=(12, 8))
ax = sns.barplot(x='region', y='age', data=insurance[insurance.diabetic=='Yes'], ci=None, palette='Set3')
for p in ax.patches:
    ax.annotate('{:.2f}'.format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.title('Average age of Diabetic people Across Different Regions')
plt.xlabel('Region')
plt.ylabel('Age')
plt.show()
Among diabetic patients, the Southwest region has the highest average age at 38.92, followed by the Southeast at 37.92 and the Northeast at 37.32. The Northwest region has the lowest average age at 36.35.
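The averages annotated on the bars can also be computed as a table, which makes the ranking explicit. This is a minimal sketch on a small made-up DataFrame (hypothetical rows that reuse the notebook's column names), not the real data.

```python
import pandas as pd

# Toy data (hypothetical) using the notebook's column names
df = pd.DataFrame({
    "region": ["southeast", "southeast", "northwest", "northwest", "southwest", "southwest"],
    "diabetic": ["Yes", "No", "Yes", "Yes", "Yes", "No"],
    "age": [40.0, 20.0, 30.0, 36.0, 45.0, 50.0],
})

# Keep diabetic patients only, then average age per region, highest first
avg_age = (
    df[df["diabetic"] == "Yes"]
    .groupby("region")["age"]
    .mean()
    .sort_values(ascending=False)
)

print(avg_age)
```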
Is there a correlation between blood pressure levels and BMI for patients with and without diabetes?
plt.figure(figsize=(10, 6))
sns.scatterplot(data=insurance, x="bloodpressure", y="bmi", hue="diabetic")
plt.title('Correlation Between Blood Pressure and BMI for Patients with and without Diabetes')
plt.xlabel('Blood Pressure')
plt.ylabel('BMI')
plt.show()
From the scatter plot, it is evident that the majority of the patients who claimed insurance fall within the blood pressure range of 80 to 110 and the BMI range of 20 to 40, irrespective of their diabetic status.
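A scatter plot shows where the points cluster, but the question asks about correlation, which can be quantified with a Pearson coefficient per group. The sketch below runs on made-up values (hypothetical, reusing the notebook's column names) and only illustrates the calculation; it says nothing about the real dataset.

```python
import pandas as pd

# Toy data (hypothetical) using the notebook's column names
df = pd.DataFrame({
    "diabetic": ["Yes"] * 5 + ["No"] * 5,
    "bloodpressure": [80, 90, 100, 110, 120, 80, 90, 100, 110, 120],
    "bmi": [22.0, 25.0, 27.0, 31.0, 35.0, 30.0, 30.5, 29.0, 31.0, 29.5],
})

# Pearson correlation between blood pressure and BMI within each diabetic group
corr_by_group = {
    status: group["bloodpressure"].corr(group["bmi"])
    for status, group in df.groupby("diabetic")
}

for status, r in corr_by_group.items():
    print(status, round(r, 3))
```

Values near 0 suggest no linear relationship, while values near 1 or -1 suggest a strong one.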
How does the number of children in a family impact the likelihood of having a diabetic condition?
sns.countplot(hue='diabetic',x='children',data=insurance)
plt.title('Diabetic condition based on the number of children')
plt.show()
From the countplot above, it is evident that the number of children does not seem to significantly impact the likelihood of a patient being diabetic.
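Raw counts can be hard to compare because the groups differ in size (far more patients have no children than have several). Comparing the share of diabetic patients within each group is one way to check the conclusion above. This is a minimal sketch on made-up rows (hypothetical values, notebook column names).

```python
import pandas as pd

# Toy data (hypothetical) using the notebook's column names
df = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2],
    "diabetic": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"],
})

# normalize="index" converts each row of counts into proportions that sum to 1
share = pd.crosstab(df["children"], df["diabetic"], normalize="index")

print(share)
```

If the "Yes" share is similar across rows, the number of children has little apparent association with diabetic status.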
What is the average insurance claim amount for different age groups and genders?
plt.figure(figsize=(12, 8))
ax=sns.barplot(x='agegroup', y='claim', hue='gender', data=insurance, ci=None, palette='Set3')
for label in ax.containers:
    ax.bar_label(label)
plt.title('Average Insurance Claim Amount for Different Age Groups and Genders')
plt.xlabel('Age Group')
plt.ylabel('Average Claim Amount')
plt.show()
From the bar plot above, males receive a higher average claim amount in every age group except 49-60. In the 18-29 age group, the average claim amount is 14,708 for males and 11,912 for females. In the 29-39 age group, males receive 12,981 on average and females 12,708. In the 39-49 age group, the averages are 14,347 for males and 12,477 for females. In the 49-60 age group the pattern reverses: males receive an average of 11,921, while females receive 12,788.
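The per-group averages quoted above can be tabulated with a pivot table, and a difference column makes it easy to see in which age groups one gender's average is higher. The sketch uses made-up rows (hypothetical values, notebook column names); `male_minus_female` is a name introduced here for illustration.

```python
import pandas as pd

# Toy data (hypothetical) using the notebook's column names
df = pd.DataFrame({
    "agegroup": ["18-29", "18-29", "18-29", "49-60", "49-60", "49-60"],
    "gender": ["male", "male", "female", "male", "female", "female"],
    "claim": [1000.0, 3000.0, 1500.0, 4000.0, 5000.0, 7000.0],
})

# Mean claim for every age group / gender combination
avg_claim = df.pivot_table(index="agegroup", columns="gender", values="claim", aggfunc="mean")

# Positive values mean the male average is higher in that age group
avg_claim["male_minus_female"] = avg_claim["male"] - avg_claim["female"]

print(avg_claim)
```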
Can we identify any specific trends in insurance claims based on smoking habits and regions?
plt.figure(figsize=(12, 8))
ax=sns.barplot(x='smoker', y='claim', hue='region', data=insurance, ci=None, palette='Set3')
for label in ax.containers:
    ax.bar_label(label)
plt.title('Insurance Claims Based on Smoking Habits and Regions')
plt.xlabel('Smoker')
plt.ylabel('Insurance Claim Amount')
plt.show()
From the bar plot, we observed distinct trends in insurance claims based on smoking habits and regions. Smokers in the southeast region had the highest average claim amount, reaching 34,737. Additionally, in the northwest and southwest regions, smokers had average claim amounts of 30,192 and 32,269, respectively. Meanwhile, non-smokers in the southeast region had the lowest average claim amount at 7,388, followed by the northwest and southwest regions with average claim amounts of 8,004 and 8,294, respectively. Surprisingly, non-smokers in the northeast region had a notably higher average claim amount of 11,666.
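Averages for small groups can be unstable, so it helps to see the group sizes next to the means before reading much into a surprising value. The following minimal sketch aggregates both at once on made-up rows (hypothetical values, notebook column names).

```python
import pandas as pd

# Toy data (hypothetical) using the notebook's column names
df = pd.DataFrame({
    "smoker": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "region": ["southeast", "southeast", "southeast", "northeast", "northeast", "northeast"],
    "claim": [30000.0, 40000.0, 7000.0, 11000.0, 28000.0, 13000.0],
})

# Mean claim and number of patients for every smoker / region combination
summary = df.groupby(["smoker", "region"])["claim"].agg(["mean", "count"])

print(summary)
```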
Recommendations and Conclusions:¶
Tailored Insurance Offerings: Insurance providers could benefit from tailoring their offerings based on the unique demographic trends observed in the dataset. This could involve adjusting premiums or coverage to better suit the needs and risks associated with various age groups, genders, and regions.
Focus on Health and Wellness Programs: Encouraging and incentivizing health and wellness programs, particularly in regions with higher average claim amounts, could prove beneficial. By promoting healthier lifestyles and disease prevention, insurance providers can potentially reduce the risk of claims and improve overall customer well-being.
Targeted Policyholder Support: Understanding the prevalent factors associated with insurance claims, such as BMI and smoking habits, can guide the development of targeted support for policyholders. Offering resources and guidance for managing and improving these factors could help individuals reduce their risk of health issues and claims.
Enhanced Risk Assessment: Utilizing the insights gained from the analysis, insurance companies can enhance their risk assessment models. By factoring in specific variables like age, gender, and region, insurers can better predict and manage potential risks, leading to more accurate underwriting and improved pricing strategies.
Policyholder Education and Awareness: Educating policyholders about the impact of various factors on insurance claims, such as BMI, blood pressure, and lifestyle choices, can empower them to make informed decisions about their health. This can lead to better health outcomes, reduced insurance claims, and ultimately lower insurance costs for individuals and providers alike.
In conclusion, this analysis highlights the crucial role of data-driven insights in understanding and managing insurance claims effectively. By leveraging these findings, insurance providers can improve their offerings, enhance customer experiences, and foster healthier and more resilient communities.
Practice Questions¶
- Is there a relationship between BMI and the number of children in a family, and does it differ by gender?
- How do different regions compare in terms of the prevalence of specific health conditions, such as diabetes and high blood pressure?
- Are there any significant differences in insurance claims based on gender for patients with specific health conditions?
- How do different demographic factors collectively contribute to the prediction of insurance claim amounts?