About Dataset¶

This dataset contains insightful information related to insurance claims, giving us an in-depth look into the demographic patterns of those receiving them. The dataset contains information on patient age, gender, BMI (Body Mass Index), blood pressure levels, diabetic status, number of children, smoking status and region. By analyzing these key factors across geographical areas and across different demographics such as age or gender we can gain a greater understanding of who is most likely to receive an insurance claim. This understanding gives us valuable insight that can be used to inform our decision making when considering potential customers for our services. On a broader scale it can inform public policy by allowing for more targeted support for those who are most in need and vulnerable. These kinds of insights are extremely valuable and this dataset provides us with the tools we need to uncover them!

Dataset Link -Kaggle link

Data Dictionary¶

index: A unique identifier for each entry in the dataset.
PatientID: A unique identifier for each patient in the dataset.
age: The age of the patient.
gender: The gender of the patient.
bmi: The Body Mass Index (BMI) of the patient, which is a measure of body fat based on height and weight.
bloodpressure: The blood pressure level of the patient.
diabetic: Indicates whether the patient has diabetes (Yes/No).
children: The number of children the patient has.
smoker: Indicates whether the patient smokes (Yes/No).
region: The geographical region where the patient resides.
claim: The amount of the insurance claim made by the patient.

Installing dependencies¶

👉 Skip this step if the packages are already installed

1. !pip install numpy
2. !pip install pandas
3. !pip install matplotlib
4. !pip install seaborn

Step -1 Data Preprocessing and Cleaning¶

Importing Required library¶

In [1]:

# Numerical operations
import numpy as np

# Data manipulation
import pandas as pd

#Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:

#Load the dataset
insurance=pd.read_csv(r"C:\Users\Lenovo\Downloads\content\Insurance Data Analysis\insurance_data - insurance_data.csv")

# Print top 5 rows
insurance.head()

Out[2]:

	index	PatientID	age	gender	bmi	bloodpressure	diabetic	smoker	region	claim
0	0	1	39.0	male	23.2	91	Yes	No	southeast	1121.87
1	1	2	24.0	male	30.1	87	No	No	southeast	1131.51
2	2	3	NaN	male	33.3	82	Yes	No	southeast	1135.94
3	3	4	NaN	male	33.7	80	No	No	northwest	1136.40
4	4	5	NaN	male	34.1	100	No	No	northwest	1137.01

In [3]:

insurance

Out[3]:

	index	PatientID	age	gender	bmi	bloodpressure	diabetic	children	smoker	region	claim
0	0	1	39.0	male	23.2	91	Yes	0	No	southeast	1121.87
1	1	2	24.0	male	30.1	87	No	0	No	southeast	1131.51
2	2	3	NaN	male	33.3	82	Yes	0	No	southeast	1135.94
3	3	4	NaN	male	33.7	80	No	0	No	northwest	1136.40
4	4	5	NaN	male	34.1	100	No	0	No	northwest	1137.01
...	...	...	...	...	...	...	...	...	...	...	...
1335	1335	1336	44.0	female	35.5	88	Yes	0	Yes	northwest	55135.40
1336	1336	1337	59.0	female	38.1	120	No	1	Yes	northeast	58571.07
1337	1337	1338	30.0	male	34.5	91	Yes	3	Yes	northwest	60021.40
1338	1338	1339	37.0	male	30.4	106	No	0	Yes	southeast	62592.87
1339	1339	1340	30.0	female	47.4	101	No	0	Yes	southeast	63770.43

1340 rows × 11 columns

In [4]:

# check for shape
insurance.shape

Out[4]:

(1340, 11)

The cell above shows that the dataset contains 1340 observations and 11 columns.

In [5]:

# Check info of each column
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340 entries, 0 to 1339
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          1340 non-null   int64  
 1   PatientID      1340 non-null   int64  
 2   age            1335 non-null   float64
 3   gender         1340 non-null   object 
 4   bmi            1340 non-null   float64
 5   bloodpressure  1340 non-null   int64  
 6   diabetic       1340 non-null   object 
 7   children       1340 non-null   int64  
 8   smoker         1340 non-null   object 
 9   region         1337 non-null   object 
 10  claim          1340 non-null   float64
dtypes: float64(3), int64(4), object(4)
memory usage: 115.3+ KB

The cell above shows that there are 4 object columns, 4 integer columns, and 3 float columns.

In [6]:

# Checking null values
insurance.isnull().sum()

Out[6]:

index            0
PatientID        0
age              5
gender           0
bmi              0
bloodpressure    0
diabetic         0
children         0
smoker           0
region           3
claim            0
dtype: int64

The cell above shows that the age and region columns contain some missing values, so we need to fill them.

Fill missing values in age column¶

The true ages are unknown, so the missing values in the age column have to be imputed. The age column holds continuous data, and missing values in a continuous column are commonly filled with the column's mean or median. Here we fill them with the mean of the age column.
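As a minimal sketch of the choice between mean and median, the toy Series below is filled both ways. The values are hypothetical and are not taken from the dataset; they only illustrate that an extreme value pulls the mean upward while the median stays put.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries and one extreme value (illustration only)
ages = pd.Series([22.0, 25.0, np.nan, 30.0, 31.0, np.nan, 90.0])

# mean() and median() skip NaN by default
mean_fill = ages.fillna(ages.mean())      # the mean is pulled upward by the extreme value
median_fill = ages.fillna(ages.median())  # the median is robust to the extreme value

print("mean:", ages.mean(), "median:", ages.median())
print("remaining NaN:", mean_fill.isna().sum(), median_fill.isna().sum())
```

When a column has strong outliers or a skewed shape, the median is usually the safer fill value; for a roughly symmetric column the two give similar results.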

In [7]:

insurance.age.isna().sum()

Out[7]:

5

In [8]:

insurance[insurance.age.isnull()]

Out[8]:

	index	PatientID	age	gender	bmi	bloodpressure	diabetic	smoker	region	claim
2	2	3	NaN	male	33.3	82	Yes	No	southeast	1135.94
3	3	4	NaN	male	33.7	80	No	No	northwest	1136.40
4	4	5	NaN	male	34.1	100	No	No	northwest	1137.01
5	5	6	NaN	male	34.4	96	Yes	No	northwest	1137.47
6	6	7	NaN	male	37.3	86	Yes	No	northwest	1141.45

In [9]:

mean_age=insurance.age.mean()
mean_age

Out[9]:

38.07865168539326

In [10]:

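# Assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame in newer pandas versions
insurance['age'] = insurance['age'].fillna(mean_age)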

In [11]:

insurance.age.isna().sum()

Out[11]:

0

Fill the missing values in region column¶

The region column holds categorical data. Missing values in a categorical column are commonly filled with the mode, that is, the most frequent category in that column.
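The mode-based fill described above can be sketched on a small example. The region labels below are hypothetical and only illustrate the mechanics; note that `mode()` returns a Series (ties are possible), so the first entry is taken.

```python
import numpy as np
import pandas as pd

# Hypothetical region labels with two missing entries (illustration only)
regions = pd.Series(["southeast", "northwest", np.nan, "southeast", np.nan, "northeast"])

# mode() ignores NaN and returns a Series; [0] picks the first (most frequent) category
most_frequent = regions.mode()[0]
regions_filled = regions.fillna(most_frequent)

print(most_frequent)
print(regions_filled.tolist())
```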

In [12]:

insurance.region.isnull().sum()

Out[12]:

3

In [13]:

insurance[insurance.region.isna()]

Out[13]:

	index	PatientID	age	gender	bmi	bloodpressure	diabetic	smoker	region	claim
13	13	14	32.0	male	27.6	100	No	No	NaN	1252.41
14	14	15	40.0	male	28.7	81	Yes	No	NaN	1253.94
15	15	16	32.0	male	30.4	86	Yes	No	NaN	1256.30

In [14]:

mode_region=insurance.region.mode()
mode_region

Out[14]:

0    southeast
Name: region, dtype: object

In [15]:

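# mode() returns a Series, so take its first entry and assign the filled column back
insurance['region'] = insurance['region'].fillna(mode_region[0])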

In [16]:

insurance.region.isnull().sum()

Out[16]:

0

In [17]:

# Check for duplicate rows
insurance.duplicated().sum()

Out[17]:

0

The cell above shows that there are no duplicate rows in our dataset.

Remove Unnecessary column - `index`¶

In [18]:

insurance.drop(columns='index',inplace=True)

Step -2 Data analysis¶

In [19]:

insurance

Out[19]:

	PatientID	age	gender	bmi	bloodpressure	diabetic	children	smoker	region	claim
0	1	39.000000	male	23.2	91	Yes	0	No	southeast	1121.87
1	2	24.000000	male	30.1	87	No	0	No	southeast	1131.51
2	3	38.078652	male	33.3	82	Yes	0	No	southeast	1135.94
3	4	38.078652	male	33.7	80	No	0	No	northwest	1136.40
4	5	38.078652	male	34.1	100	No	0	No	northwest	1137.01
...	...	...	...	...	...	...	...	...	...	...
1335	1336	44.000000	female	35.5	88	Yes	0	Yes	northwest	55135.40
1336	1337	59.000000	female	38.1	120	No	1	Yes	northeast	58571.07
1337	1338	30.000000	male	34.5	91	Yes	3	Yes	northwest	60021.40
1338	1339	37.000000	male	30.4	106	No	0	Yes	southeast	62592.87
1339	1340	30.000000	female	47.4	101	No	0	Yes	southeast	63770.43

1340 rows × 10 columns

Let's check the distribution of each column¶

`AGE`¶

In [20]:

# Create a figure and a set of subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

# Plot a histogram with a KDE curve
sns.histplot(insurance.age, kde=True, ax=axes[0])
axes[0].set_title('Distribution of Age')

# Plot a boxplot
sns.boxplot(x=insurance.age, ax=axes[1])
axes[1].set_title('Boxplot of Age')

# Display the plots
plt.tight_layout()
plt.show()

From the plots above, most of the values in the age column lie between 30 and 50, and the median age is around 36 to 38. The distribution looks approximately normal, and the boxplot shows no outliers.

`BMI`¶

In [21]:

# Create a figure and a set of subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

# Plot a histogram with a KDE curve
sns.histplot(insurance.bmi, kde=True, ax=axes[0])
axes[0].set_title('Distribution of BMI')

# Plot a boxplot
sns.boxplot(x=insurance.bmi, ax=axes[1])
axes[1].set_title('Boxplot of BMI')

# Display the plots
plt.tight_layout()
plt.show()

The plots above indicate that the interquartile range (IQR) of the bmi column spans roughly 25 to 35, the median BMI is about 30, and there are some outliers.
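The points a boxplot draws as outliers are, by default, the values that lie more than 1.5 × IQR beyond the quartiles. A minimal sketch of that rule on hypothetical BMI values (not the dataset) looks like this:

```python
import pandas as pd

# Hypothetical BMI values (illustration only, not the dataset)
bmi = pd.Series([22.0, 25.0, 27.0, 29.0, 30.0, 31.0, 33.0, 35.0, 53.0])

q1, q3 = bmi.quantile(0.25), bmi.quantile(0.75)
iqr = q3 - q1

# Fences used by the default boxplot whiskers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = bmi[(bmi < lower) | (bmi > upper)]

print(q1, q3, lower, upper)
print(outliers.tolist())
```

The notebook itself uses a simpler fixed cut-off (BMI above 50) rather than the IQR fences; the sketch only shows where the boxplot's outlier markers come from.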

In [22]:

# Inspect the outliers (rows with BMI above 50) and keep their index labels
l=insurance[insurance.bmi>50].index.to_list()
insurance[insurance.bmi>50]

Out[22]:

	PatientID	age	gender	bmi	bloodpressure	diabetic	children	smoker	region	claim
9	10	30.0	male	53.1	97	No	0	No	northwest	1163.46
141	142	46.0	male	50.4	89	Yes	1	No	southeast	2438.06
1299	1300	50.0	male	52.6	110	No	1	Yes	southeast	44501.40

In [23]:

# dropping outliers whose BMI value is above 50 
insurance.drop(index=l,inplace=True)

In [24]:

# Create a figure and a set of subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

# Plot a histogram with a KDE curve
sns.histplot(insurance.bmi, kde=True, ax=axes[0])
axes[0].set_title('Distribution of BMI')

# Plot a boxplot
sns.boxplot(x=insurance.bmi, ax=axes[1])
axes[1].set_title('Boxplot of BMI')

# Display the plots
plt.tight_layout()
plt.show()

`Blood Pressure`¶

In [25]:

# Plot a histogram with a KDE curve
sns.histplot(insurance.bloodpressure, kde=True)
plt.title('Distribution of Blood Pressure')

plt.show()

The histogram above indicates that the blood pressure column is right-skewed, with a small number of patients whose blood pressure is as high as 140.
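Skewness read from a histogram can be cross-checked numerically. The sketch below uses hypothetical readings (not the dataset) and two simple indicators: the sample skewness from `Series.skew()` and the comparison of mean and median.

```python
import pandas as pd

# Hypothetical blood pressure readings with a long right tail (illustration only)
bp = pd.Series([80, 82, 85, 88, 90, 91, 95, 100, 120, 140])

skewness = bp.skew()  # sample skewness; a positive value indicates a right-skewed column

print("skewness:", round(skewness, 2))
print("mean:", bp.mean(), "median:", bp.median())  # mean above median also suggests a right tail
```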

`Gender`¶

In [26]:

ax=sns.countplot(x=insurance.gender)
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in Gender Column")
plt.show()

The count plot above shows that the dataset contains 675 male patients and 662 female patients.

`Diabetic`¶

In [27]:

ax=sns.countplot(x=insurance.diabetic)
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in Diabetic Column")
plt.show()

From the count plot above, we can conclude that there are 641 patients with diabetes, while 696 patients do not have diabetes in our dataset.

`Children`¶

In [28]:

ax=sns.countplot(x=insurance.children)
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in children Column")
plt.show()

From the above count plot, we observe that there are 575 patients who have no children, 322 patients who have one child, and 240 patients who have two children, and so on.

`Smoker`¶

In [29]:

ax=sns.countplot(x=insurance.smoker)
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in Smoker Column")
plt.show()

The count plot indicates that 1064 patients are non-smokers, while 273 patients are smokers.
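Bar heights in a count plot are easy to attribute to the wrong category, so it can help to confirm them with `value_counts()`. A minimal sketch on hypothetical flags (not the dataset):

```python
import pandas as pd

# Hypothetical smoker flags (illustration only)
smoker = pd.Series(["No", "No", "Yes", "No", "Yes", "No"])

counts = smoker.value_counts()                # absolute count per category
shares = smoker.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares.round(2))
```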

`Region`¶

In [30]:

ax=sns.countplot(x=insurance.region)
for i in ax.containers:
    ax.bar_label(i)
plt.title("Distribution of each category in region Column")
plt.show()

The count plot shows that most patients are from the southeast region, followed by the northwest and then the southwest.

`Claim`¶

In [31]:

# Plot a histogram with a KDE curve
sns.histplot(insurance.claim, kde=True)
plt.title('Distribution of Claim amount')

# Display the plots
plt.show()

The histogram suggests that the majority of patients receive claim amounts below 20,000, while a small proportion receive claim amounts exceeding 50,000.

Let's ask some questions of the data¶

What is the distribution of BMI among different age groups and regions?

In [32]:

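def agegroup(x):
    # Bins are half-open: the lower bound is included, the upper bound is excluded
    if 18 <= x < 29:
        return '18-29'
    elif 29 <= x < 39:
        return '29-39'
    elif 39 <= x < 49:
        return '39-49'
    else:
        # Any other value lands here, which assumes ages run from 18 to 60
        return '49-60'
insurance['agegroup'] = insurance['age'].apply(agegroup)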

In [33]:

insurance

Out[33]:

	PatientID	age	gender	bmi	bloodpressure	diabetic	children	smoker	region	claim	agegroup
0	1	39.000000	male	23.2	91	Yes	0	No	southeast	1121.87	39-49
1	2	24.000000	male	30.1	87	No	0	No	southeast	1131.51	18-29
2	3	38.078652	male	33.3	82	Yes	0	No	southeast	1135.94	29-39
3	4	38.078652	male	33.7	80	No	0	No	northwest	1136.40	29-39
4	5	38.078652	male	34.1	100	No	0	No	northwest	1137.01	29-39
...	...	...	...	...	...	...	...	...	...	...	...
1335	1336	44.000000	female	35.5	88	Yes	0	Yes	northwest	55135.40	39-49
1336	1337	59.000000	female	38.1	120	No	1	Yes	northeast	58571.07	49-60
1337	1338	30.000000	male	34.5	91	Yes	3	Yes	northwest	60021.40	29-39
1338	1339	37.000000	male	30.4	106	No	0	Yes	southeast	62592.87	29-39
1339	1340	30.000000	female	47.4	101	No	0	Yes	southeast	63770.43	29-39

1337 rows × 11 columns
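The same grouping can also be expressed with `pd.cut`, shown here as an alternative sketch rather than the notebook's method. The ages are hypothetical, and the bin edges assume ages between 18 and 60; unlike the function above, values outside the bins become NaN instead of falling into the last group.

```python
import pandas as pd

# Hypothetical ages covering the bin edges (illustration only)
ages = pd.Series([18.0, 28.9, 29.0, 38.078652, 39.0, 49.0, 60.0])

# right=False makes each bin half-open: [18, 29), [29, 39), [39, 49), [49, 61)
bins = [18, 29, 39, 49, 61]
labels = ["18-29", "29-39", "39-49", "49-60"]
agegroup = pd.cut(ages, bins=bins, labels=labels, right=False)

print(agegroup.tolist())
```

The result is an ordered categorical column, so plots and group-by tables keep the age groups in their natural order.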

In [34]:

plt.figure(figsize=(12, 8))
sns.violinplot(x='agegroup', y='bmi', hue='region', data=insurance, inner='quart', palette='Set2')
plt.title('Distribution of BMI across Age Groups and Regions')
plt.xlabel('Age Group')
plt.ylabel('BMI')
plt.show()

Based on the above plots, we can observe that in the southwest region, the BMI range falls between 20 and 40, with the median value ranging from 32 to 36 across all age groups. Similarly, in the northwest region, the BMI range spans from 20 to 35, with the median value falling between 26 and 30 for all age groups. In the southeast region, the BMI range remains consistent, ranging from 29 to 32 across all age groups. Lastly, in the northeast region, the majority of data falls within the range of 20 to 35, with the median value ranging from 28 to 32 across all age groups.

What is the average age of diabetic people across different regions?

In [35]:

plt.figure(figsize=(12, 8))
ax = sns.barplot(x='region', y='age', data=insurance[insurance.diabetic=='Yes'], ci=None, palette='Set3')
for p in ax.patches:
    ax.annotate('{:.2f}'.format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='center', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.title('Average age of Diabetic people Across Different Regions')
plt.xlabel('Region')
plt.ylabel('Age')
plt.show()

Looking at the average age of diabetic individuals by region, the highest average is in the Southwest region (38.92), followed by the Southeast region (37.92) and the Northeast region (37.32). The Northwest region has the lowest average age, 36.35.
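The values annotated on the bars are group means, so they can also be produced as a sorted table with `groupby`. The rows below are hypothetical; only the column names follow the data dictionary.

```python
import pandas as pd

# Hypothetical rows (illustration only); column names follow the data dictionary
df = pd.DataFrame({
    "region": ["southeast", "southeast", "northwest", "northwest", "southwest"],
    "diabetic": ["Yes", "No", "Yes", "Yes", "Yes"],
    "age": [40.0, 30.0, 35.0, 47.0, 50.0],
})

# Keep diabetic patients, average age per region, highest first
avg_age = (
    df[df["diabetic"] == "Yes"]
    .groupby("region")["age"]
    .mean()
    .sort_values(ascending=False)
)

print(avg_age)
```

Sorting the result makes the ranking explicit, which is easier to read than comparing bar heights by eye.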

Is there a correlation between blood pressure levels and BMI for patients with and without diabetes?

In [36]:

plt.figure(figsize=(10, 6))
sns.scatterplot(data=insurance, x="bloodpressure", y="bmi", hue="diabetic")
plt.title('Correlation Between Blood Pressure and BMI for Patients with and without Diabetes')
plt.xlabel('Blood Pressure')
plt.ylabel('BMI')
plt.show()

From the scatter plot, most patients fall within the blood pressure range of 80 to 110 and the BMI range of 20 to 40, irrespective of their diabetic status. The plot on its own does not quantify how strongly the two variables are correlated.
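To put a number on the relationship, one option is the Pearson correlation coefficient computed separately for each diabetic group. The sketch below uses hypothetical values chosen so that one group is perfectly correlated and the other negatively correlated; it says nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical measurements (illustration only); column names follow the data dictionary
df = pd.DataFrame({
    "bloodpressure": [80, 90, 100, 110, 85, 95, 105, 115],
    "bmi": [22.0, 26.0, 30.0, 34.0, 33.0, 27.0, 31.0, 25.0],
    "diabetic": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})

# Pearson correlation between blood pressure and BMI within each diabetic group
corr_by_group = {
    status: group["bloodpressure"].corr(group["bmi"])
    for status, group in df.groupby("diabetic")
}

for status, value in corr_by_group.items():
    print(status, round(value, 3))
```

Values near 0 indicate little linear relationship, while values near 1 or -1 indicate a strong one.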

How does the number of children in a family impact the likelihood of having a diabetic condition?

In [37]:

sns.countplot(hue='diabetic',x='children',data=insurance)
plt.title('Diabetic condition based on the number of children')
plt.show()

From the countplot above, it is evident that the number of children does not seem to significantly impact the likelihood of a patient being diabetic.
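Raw counts are hard to compare when the groups differ in size, so a row-normalised cross-tabulation is a useful complement to the count plot. A minimal sketch on hypothetical rows (not the dataset):

```python
import pandas as pd

# Hypothetical rows (illustration only); column names follow the data dictionary
df = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2],
    "diabetic": ["Yes", "No", "No", "Yes", "Yes", "No", "No", "No"],
})

# normalize="index" turns each row into shares that sum to 1
share = pd.crosstab(df["children"], df["diabetic"], normalize="index")

print(share)
```

If the share of diabetic patients is similar in every row, the number of children is unlikely to be related to diabetic status.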

What is the average insurance claim amount for different age groups and genders?

In [38]:

plt.figure(figsize=(12, 8))
ax=sns.barplot(x='agegroup', y='claim', hue='gender', data=insurance, ci=None, palette='Set3')
for label in ax.containers:
    ax.bar_label(label)
plt.title('Average Insurance Claim Amount for Different Age Groups and Genders')
plt.xlabel('Age Group')
plt.ylabel('Average Claim Amount')
plt.show()

From the above bar plot, we observe that males receive a higher average claim amount in every age group except 49-60. In the 18-29 age group, the average claim amount for males is 14,708, while for females, it is 11,912.30. In the 29-39 age group, males receive 12,981 on average, whereas females receive 12,708. In the 39-49 age group, the average claim amount for males is 14,347, whereas for females, it is 12,477. In the 49-60 age group, the pattern reverses: males receive an average of 11,921, while females receive 12,788.
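The comparison between genders can be checked directly with a pivot table of group means and the difference between its columns. The claim amounts below are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical claims (illustration only); column names follow the notebook
df = pd.DataFrame({
    "agegroup": ["18-29", "18-29", "18-29", "49-60", "49-60", "49-60"],
    "gender": ["male", "female", "male", "male", "female", "female"],
    "claim": [1000.0, 1500.0, 3000.0, 4000.0, 5000.0, 7000.0],
})

# Mean claim per age group (rows) and gender (columns)
avg_claim = df.pivot_table(index="agegroup", columns="gender", values="claim", aggfunc="mean")

# Positive values mean the male average is higher in that age group
male_minus_female = avg_claim["male"] - avg_claim["female"]

print(avg_claim)
print(male_minus_female)
```

Checking the sign of the difference per row avoids overgeneralising from the first few bars.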

Can we identify any specific trends in insurance claims based on smoking habits and regions?

In [39]:

plt.figure(figsize=(12, 8))
ax=sns.barplot(x='smoker', y='claim', hue='region', data=insurance, ci=None, palette='Set3')
for label in ax.containers:
    ax.bar_label(label)
plt.title('Insurance Claims Based on Smoking Habits and Regions')
plt.xlabel('Smoker')
plt.ylabel('Insurance Claim Amount')
plt.show()

From the bar plot, we observed distinct trends in insurance claims based on smoking habits and regions. Smokers in the southeast region had the highest average claim amount, reaching 34,737. Additionally, in the northwest and southwest regions, smokers had average claim amounts of 30,192 and 32,269, respectively. Meanwhile, non-smokers in the southeast region had the lowest average claim amount at 7,388, followed by the northwest and southwest regions with average claim amounts of 8,004 and 8,294, respectively. Surprisingly, non-smokers in the northeast region had a notably higher average claim amount of 11,666.
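The highest and lowest group averages can be located programmatically with a two-key `groupby` followed by `idxmax` and `idxmin`. The rows below are hypothetical and do not reproduce the figures quoted above.

```python
import pandas as pd

# Hypothetical claims (illustration only); column names follow the data dictionary
df = pd.DataFrame({
    "smoker": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "region": ["southeast", "northwest", "southeast", "southeast", "northeast", "northeast"],
    "claim": [30000.0, 25000.0, 40000.0, 5000.0, 9000.0, 11000.0],
})

# Mean claim for every (smoker, region) combination
avg = df.groupby(["smoker", "region"])["claim"].mean()

highest = avg.idxmax()  # (smoker, region) pair with the largest mean claim
lowest = avg.idxmin()   # (smoker, region) pair with the smallest mean claim

print(avg)
print("highest:", highest, "lowest:", lowest)
```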

Recommendations and Conclusions:¶

Tailored Insurance Offerings: Insurance providers could benefit from tailoring their offerings based on the unique demographic trends observed in the dataset. This could involve adjusting premiums or coverage to better suit the needs and risks associated with various age groups, genders, and regions.
Focus on Health and Wellness Programs: Encouraging and incentivizing health and wellness programs, particularly in regions with higher average claim amounts, could prove beneficial. By promoting healthier lifestyles and disease prevention, insurance providers can potentially reduce the risk of claims and improve overall customer well-being.
Targeted Policyholder Support: Understanding the prevalent factors associated with insurance claims, such as BMI and smoking habits, can guide the development of targeted support for policyholders. Offering resources and guidance for managing and improving these factors could help individuals reduce their risk of health issues and claims.
Enhanced Risk Assessment: Utilizing the insights gained from the analysis, insurance companies can enhance their risk assessment models. By factoring in specific variables like age, gender, and region, insurers can better predict and manage potential risks, leading to more accurate underwriting and improved pricing strategies.
Policyholder Education and Awareness: Educating policyholders about the impact of various factors on insurance claims, such as BMI, blood pressure, and lifestyle choices, can empower them to make informed decisions about their health. This can lead to better health outcomes, reduced insurance claims, and ultimately lower insurance costs for individuals and providers alike.

In conclusion, this analysis highlights the crucial role of data-driven insights in understanding and managing insurance claims effectively. By leveraging these findings, insurance providers can improve their offerings, enhance customer experiences, and foster healthier and more resilient communities.

Practice Questions -¶

Is there a relationship between BMI and the number of children in a family, and does it differ by gender?
How do different regions compare in terms of the prevalence of specific health conditions, such as diabetes and high blood pressure?
Are there any significant differences in insurance claims based on gender for patients with specific health conditions?
How do different demographic factors collectively contribute to the prediction of insurance claim amounts?

Exploratory Data Analysis on Insurance Data in Python
