Data Analysis Project by Anushka Sharma, Team edSlash.
This dataset captures detailed information about customer transactions, providing a comprehensive view of purchasing behavior across various demographics, regions, and product categories. It includes features such as customer demographics (age, gender, location), product details (item purchased, category, size, color, and season), and transaction specifics (purchase amount, payment method, shipping type, and discounts applied).
👉Ignore it if already installed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"C:\Users\harsh\Downloads\shopping_trends.csv")
df.head()
Customer ID | Age | Gender | Item Purchased | Category | Purchase Amount (USD) | Location | Size | Color | Season | Review Rating | Subscription Status | Payment Method | Shipping Type | Discount Applied | Promo Code Used | Previous Purchases | Preferred Payment Method | Frequency of Purchases | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 55 | Male | Blouse | Clothing | 53 | Kentucky | L | Gray | Winter | 3.1 | Yes | Credit Card | Express | Yes | Yes | 14 | Venmo | Fortnightly |
1 | 2 | 19 | Male | Sweater | Clothing | 64 | Maine | L | Maroon | Winter | 3.1 | Yes | Bank Transfer | Express | Yes | Yes | 2 | Cash | Fortnightly |
2 | 3 | 50 | Male | Jeans | Clothing | 73 | Massachusetts | S | Maroon | Spring | 3.1 | Yes | Cash | Free Shipping | Yes | Yes | 23 | Credit Card | Weekly |
3 | 4 | 21 | Male | Sandals | Footwear | 90 | Rhode Island | M | Maroon | Spring | 3.5 | Yes | PayPal | Next Day Air | Yes | Yes | 49 | PayPal | Weekly |
4 | 5 | 45 | Male | Blouse | Clothing | 49 | Oregon | M | Turquoise | Spring | 2.7 | Yes | Cash | Free Shipping | Yes | Yes | 31 | PayPal | Annually |
df.describe()
Customer ID | Age | Purchase Amount (USD) | Review Rating | Previous Purchases | |
---|---|---|---|---|---|
count | 3900.000000 | 3900.000000 | 3900.000000 | 3900.000000 | 3900.000000 |
mean | 1950.500000 | 44.068462 | 59.764359 | 3.749949 | 25.351538 |
std | 1125.977353 | 15.207589 | 23.685392 | 0.716223 | 14.447125 |
min | 1.000000 | 18.000000 | 20.000000 | 2.500000 | 1.000000 |
25% | 975.750000 | 31.000000 | 39.000000 | 3.100000 | 13.000000 |
50% | 1950.500000 | 44.000000 | 60.000000 | 3.700000 | 25.000000 |
75% | 2925.250000 | 57.000000 | 81.000000 | 4.400000 | 38.000000 |
max | 3900.000000 | 70.000000 | 100.000000 | 5.000000 | 50.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3900 entries, 0 to 3899 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Customer ID 3900 non-null int64 1 Age 3900 non-null int64 2 Gender 3900 non-null object 3 Item Purchased 3900 non-null object 4 Category 3900 non-null object 5 Purchase Amount (USD) 3900 non-null int64 6 Location 3900 non-null object 7 Size 3900 non-null object 8 Color 3900 non-null object 9 Season 3900 non-null object 10 Review Rating 3900 non-null float64 11 Subscription Status 3900 non-null object 12 Payment Method 3900 non-null object 13 Shipping Type 3900 non-null object 14 Discount Applied 3900 non-null object 15 Promo Code Used 3900 non-null object 16 Previous Purchases 3900 non-null int64 17 Preferred Payment Method 3900 non-null object 18 Frequency of Purchases 3900 non-null object dtypes: float64(1), int64(4), object(14) memory usage: 579.0+ KB
we have non null values
df.dtypes
Customer ID int64 Age int64 Gender object Item Purchased object Category object Purchase Amount (USD) int64 Location object Size object Color object Season object Review Rating float64 Subscription Status object Payment Method object Shipping Type object Discount Applied object Promo Code Used object Previous Purchases int64 Preferred Payment Method object Frequency of Purchases object dtype: object
sns.displot(x = df.Age, kde=True , bins= 28)
D:\Installed Softwares\Anaconda\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x16eeeb06b50>
def agegrp(data):
if data < 20 :
return "below 20 "
elif (data >= 20) & (data < 30):
return "20-30"
elif (data >= 30) & (data < 40):
return "30-40"
elif (data >= 40) & (data < 50):
return "40-50"
elif (data >= 50) & (data < 60):
return "50-60"
elif (data >= 60) & (data <= 70):
return "60-70"
else:
return np.nan
d1=df.copy()
d1["age group"]= d1.Age.apply(agegrp)
d1
Customer ID | Age | Gender | Item Purchased | Category | Purchase Amount (USD) | Location | Size | Color | Season | Review Rating | Subscription Status | Payment Method | Shipping Type | Discount Applied | Promo Code Used | Previous Purchases | Preferred Payment Method | Frequency of Purchases | age group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 55 | Male | Blouse | Clothing | 53 | Kentucky | L | Gray | Winter | 3.1 | Yes | Credit Card | Express | Yes | Yes | 14 | Venmo | Fortnightly | 50-60 |
1 | 2 | 19 | Male | Sweater | Clothing | 64 | Maine | L | Maroon | Winter | 3.1 | Yes | Bank Transfer | Express | Yes | Yes | 2 | Cash | Fortnightly | below 20 |
2 | 3 | 50 | Male | Jeans | Clothing | 73 | Massachusetts | S | Maroon | Spring | 3.1 | Yes | Cash | Free Shipping | Yes | Yes | 23 | Credit Card | Weekly | 50-60 |
3 | 4 | 21 | Male | Sandals | Footwear | 90 | Rhode Island | M | Maroon | Spring | 3.5 | Yes | PayPal | Next Day Air | Yes | Yes | 49 | PayPal | Weekly | 20-30 |
4 | 5 | 45 | Male | Blouse | Clothing | 49 | Oregon | M | Turquoise | Spring | 2.7 | Yes | Cash | Free Shipping | Yes | Yes | 31 | PayPal | Annually | 40-50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3895 | 3896 | 40 | Female | Hoodie | Clothing | 28 | Virginia | L | Turquoise | Summer | 4.2 | No | Cash | 2-Day Shipping | No | No | 32 | Venmo | Weekly | 40-50 |
3896 | 3897 | 52 | Female | Backpack | Accessories | 49 | Iowa | L | White | Spring | 4.5 | No | PayPal | Store Pickup | No | No | 41 | Bank Transfer | Bi-Weekly | 50-60 |
3897 | 3898 | 46 | Female | Belt | Accessories | 33 | New Jersey | L | Green | Spring | 2.9 | No | Credit Card | Standard | No | No | 24 | Venmo | Quarterly | 40-50 |
3898 | 3899 | 44 | Female | Shoes | Footwear | 77 | Minnesota | S | Brown | Summer | 3.8 | No | PayPal | Express | No | No | 24 | Venmo | Weekly | 40-50 |
3899 | 3900 | 52 | Female | Handbag | Accessories | 81 | California | M | Beige | Spring | 3.1 | No | Bank Transfer | Store Pickup | No | No | 33 | Venmo | Quarterly | 50-60 |
3900 rows × 20 columns
sns.barplot(x= d1["age group"], y=d1.index)
<Axes: xlabel='age group'>
no significant change in shopping pattern of each age group. thus people of all age shopped regularly .
sns.countplot(x=d1.Gender)
<Axes: xlabel='Gender', ylabel='count'>
we see that males have purchased more than females.
plt.figure(figsize=(10,6))
sns.countplot(x=d1.Location)
plt.xticks(rotation=90)
plt.show()
We can see highest purchasing from "MONTANA" and lowest purchasing from Rhode Island
sns.countplot(x=d1.Category, hue=d1.Gender)
<Axes: xlabel='Category', ylabel='count'>
highest shopped category is "Clothing"
meanprice=d1.groupby("Category")["Purchase Amount (USD)"].mean()
meanprice
Category Accessories 59.838710 Clothing 60.025331 Footwear 60.255426 Outerwear 57.172840 Name: Purchase Amount (USD), dtype: float64
totalpurchase=d1.groupby("Category")["Purchase Amount (USD)"].sum()
totalpurchase
Category Accessories 74200 Clothing 104264 Footwear 36093 Outerwear 18524 Name: Purchase Amount (USD), dtype: int64
from above, we can conclude that highest shopped category is "Clothing" with total sales of 104264 USD. but the mean sales of "footware" category is high which means more expensive items from footware category were sold with mean being 60 USD.
sns.countplot(x=d1["Frequency of Purchases"])
plt.xticks(rotation=35)
plt.show()
we can say that the frequency of purchase is also uniformally distributed.
sns.countplot(x=d1["Payment Method"])
plt.xticks(rotation=35)
plt.show()
Although the data is uniformally distributed for each payment method but we can see a slight increase in credit card purchases.
productcount=d1["Item Purchased"].value_counts()
productcount
Item Purchased Blouse 171 Jewelry 171 Pants 171 Shirt 169 Dress 166 Sweater 164 Jacket 163 Belt 161 Sunglasses 161 Coat 161 Sandals 160 Socks 159 Skirt 158 Shorts 157 Scarf 157 Hat 154 Handbag 153 Hoodie 151 Shoes 150 T-shirt 147 Sneakers 145 Boots 144 Backpack 143 Gloves 140 Jeans 124 Name: count, dtype: int64
sns.countplot(x=d1.Size)
<Axes: xlabel='Size', ylabel='count'>
most selling size is M
sns.countplot(x=d1.Color)
plt.xticks(rotation=90)
plt.show()
sns.countplot(x=d1.Season)
<Axes: xlabel='Season', ylabel='count'>
d1["Frequency of Purchases"].unique().tolist()
['Fortnightly', 'Weekly', 'Annually', 'Quarterly', 'Bi-Weekly', 'Monthly', 'Every 3 Months']
fortnight=d1[d1["Frequency of Purchases"]=="Fortnightly"]
weekly=d1[d1["Frequency of Purchases"]=="Weekly"]
annually=d1[d1["Frequency of Purchases"]=="Annually"]
quarterly=d1[d1["Frequency of Purchases"]=="Quarterly"]
week2=d1[d1["Frequency of Purchases"]=="Bi-Weekly"]
monthly=d1[d1["Frequency of Purchases"]=="Monthly"]
month3=d1[d1["Frequency of Purchases"]=="Every 3 Months"]
def revrating(data):
if data <= 2 :
return "below 2"
elif (data > 2) & (data <=3 ):
return "2-3"
elif (data > 3) & (data <=4):
return "3-4"
elif (data > 4) & (data <=5):
return "4-5"
else:
return np.nan
d1["rating group"]= d1["Review Rating"].apply(revrating)
subscribed=d1[d1["Subscription Status"]=="Yes"]
sns.countplot(x=subscribed.Gender)
<Axes: xlabel='Gender', ylabel='count'>
fom the above observation we can see that all the subscriber are MALES.
sns.countplot(x=subscribed["Item Purchased"])
plt.xticks(rotation=90)
plt.show()
most purchased item is sweater.
sns.countplot(x=subscribed["Payment Method"])
plt.xticks(rotation=45)
plt.show()
sns.countplot(x=subscribed["Discount Applied"])
<Axes: xlabel='Discount Applied', ylabel='count'>
Discount is applied for all the subscribers.
sns.countplot(x=subscribed["Promo Code Used"])
<Axes: xlabel='Promo Code Used', ylabel='count'>
Promo code is used for all the subscribers.
sns.countplot(x=subscribed["age group"])
<Axes: xlabel='age group', ylabel='count'>
maximum purchases are from 50-60 years.
sns.countplot(x=subscribed["rating group"])
<Axes: xlabel='rating group', ylabel='count'>
most of the subscriber gave 3-4 rating.
sns.displot(x=d1["Review Rating"])
D:\Installed Softwares\Anaconda\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
<seaborn.axisgrid.FacetGrid at 0x16ef377c9d0>
revcount = d1.groupby("rating group")["Category"].value_counts().reset_index(name = "Count")
revcount
rating group | Category | Count | |
---|---|---|---|
0 | 2-3 | Clothing | 400 |
1 | 2-3 | Accessories | 255 |
2 | 2-3 | Footwear | 123 |
3 | 2-3 | Outerwear | 69 |
4 | 3-4 | Clothing | 715 |
5 | 3-4 | Accessories | 507 |
6 | 3-4 | Footwear | 239 |
7 | 3-4 | Outerwear | 129 |
8 | 4-5 | Clothing | 622 |
9 | 4-5 | Accessories | 478 |
10 | 4-5 | Footwear | 237 |
11 | 4-5 | Outerwear | 126 |
revcount = d1.groupby("Category")["Review Rating"].mean()
revcount.sort_values()
Category Clothing 3.723143 Outerwear 3.746914 Accessories 3.768629 Footwear 3.790651 Name: Review Rating, dtype: float64
discount_impact=d1.groupby("Discount Applied")["Purchase Amount (USD)"].mean()
discount_impact
Discount Applied No 60.130454 Yes 59.279070 Name: Purchase Amount (USD), dtype: float64
when discount is applied the average purchase amount is =59 usd and without discount it is 60 usd.
promo_impact=d1.groupby("Promo Code Used")["Purchase Amount (USD)"].mean()
promo_impact
Promo Code Used No 60.130454 Yes 59.279070 Name: Purchase Amount (USD), dtype: float64
def discountencode(data):
if data=="Yes": return 1
else : return 0
d1["discount"]=d1["Discount Applied"].apply(discountencode)
mapping={
"Weekly": 4,
"Fortnightly": 2,
"Monthly": 1,
"Quarterly": 0.25,
"Annually": 0.083
}
d1["Frequency"] = d1["Frequency of Purchases"].map(mapping)
d1
Customer ID | Age | Gender | Item Purchased | Category | Purchase Amount (USD) | Location | Size | Color | Season | ... | Shipping Type | Discount Applied | Promo Code Used | Previous Purchases | Preferred Payment Method | Frequency of Purchases | age group | rating group | discount | Frequency | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 55 | Male | Blouse | Clothing | 53 | Kentucky | L | Gray | Winter | ... | Express | Yes | Yes | 14 | Venmo | Fortnightly | 50-60 | 3-4 | 1 | 2.000 |
1 | 2 | 19 | Male | Sweater | Clothing | 64 | Maine | L | Maroon | Winter | ... | Express | Yes | Yes | 2 | Cash | Fortnightly | below 20 | 3-4 | 1 | 2.000 |
2 | 3 | 50 | Male | Jeans | Clothing | 73 | Massachusetts | S | Maroon | Spring | ... | Free Shipping | Yes | Yes | 23 | Credit Card | Weekly | 50-60 | 3-4 | 1 | 4.000 |
3 | 4 | 21 | Male | Sandals | Footwear | 90 | Rhode Island | M | Maroon | Spring | ... | Next Day Air | Yes | Yes | 49 | PayPal | Weekly | 20-30 | 3-4 | 1 | 4.000 |
4 | 5 | 45 | Male | Blouse | Clothing | 49 | Oregon | M | Turquoise | Spring | ... | Free Shipping | Yes | Yes | 31 | PayPal | Annually | 40-50 | 2-3 | 1 | 0.083 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3895 | 3896 | 40 | Female | Hoodie | Clothing | 28 | Virginia | L | Turquoise | Summer | ... | 2-Day Shipping | No | No | 32 | Venmo | Weekly | 40-50 | 4-5 | 0 | 4.000 |
3896 | 3897 | 52 | Female | Backpack | Accessories | 49 | Iowa | L | White | Spring | ... | Store Pickup | No | No | 41 | Bank Transfer | Bi-Weekly | 50-60 | 4-5 | 0 | NaN |
3897 | 3898 | 46 | Female | Belt | Accessories | 33 | New Jersey | L | Green | Spring | ... | Standard | No | No | 24 | Venmo | Quarterly | 40-50 | 2-3 | 0 | 0.250 |
3898 | 3899 | 44 | Female | Shoes | Footwear | 77 | Minnesota | S | Brown | Summer | ... | Express | No | No | 24 | Venmo | Weekly | 40-50 | 3-4 | 0 | 4.000 |
3899 | 3900 | 52 | Female | Handbag | Accessories | 81 | California | M | Beige | Spring | ... | Store Pickup | No | No | 33 | Venmo | Quarterly | 50-60 | 3-4 | 0 | 0.250 |
3900 rows × 23 columns
correlation=d1[["discount", "Frequency"]].corr()
correlation
discount | Frequency | |
---|---|---|
discount | 1.000000 | 0.012103 |
Frequency | 0.012103 | 1.000000 |
sns.heatmap(correlation, annot=True)
<Axes: >
sns.countplot(x=d1["Shipping Type"])
plt.xticks(rotation=45)
plt.show()
shipping_cat=d1.groupby(["Category", "Shipping Type"]).size()
shipping_cat.sort_values()
Category Shipping Type Outerwear Standard 49 2-Day Shipping 50 Next Day Air 51 Store Pickup 53 Express 57 Free Shipping 64 Footwear 2-Day Shipping 90 Next Day Air 93 Express 96 Store Pickup 98 Standard 100 Free Shipping 122 Accessories Free Shipping 195 Express 203 2-Day Shipping 206 Standard 208 Next Day Air 211 Store Pickup 217 Clothing 2-Day Shipping 281 Store Pickup 282 Express 290 Next Day Air 293 Free Shipping 294 Standard 297 dtype: int64
ship_rate=d1.groupby("Shipping Type")["Review Rating"].mean()
ship_rate.sort_values()
Shipping Type Store Pickup 3.706462 Free Shipping 3.717926 Next Day Air 3.720833 2-Day Shipping 3.761085 Express 3.776935 Standard 3.817737 Name: Review Rating, dtype: float64
seasonal_behavior = d1.groupby("Season")["Purchase Amount (USD)"].mean()
seasonal_behavior
Season Fall 61.556923 Spring 58.737738 Summer 58.405236 Winter 60.357364 Name: Purchase Amount (USD), dtype: float64
season_cat= d1.groupby(["Season","Category"]).size()
season_cat.sort_values()
Season Category Summer Outerwear 75 Winter Outerwear 80 Spring Outerwear 81 Fall Outerwear 88 Footwear 136 Winter Footwear 140 Summer Footwear 160 Spring Footwear 163 Accessories 301 Winter Accessories 303 Summer Accessories 312 Fall Accessories 324 Summer Clothing 408 Fall Clothing 427 Winter Clothing 448 Spring Clothing 454 dtype: int64
In spring, customer purchased more clothing .
numerical_data = d1[["Age", "Purchase Amount (USD)", "Review Rating", "Previous Purchases"]]
correlation_matrix = numerical_data.corr()
correlation_matrix
Age | Purchase Amount (USD) | Review Rating | Previous Purchases | |
---|---|---|---|---|
Age | 1.000000 | -0.010424 | -0.021949 | 0.040445 |
Purchase Amount (USD) | -0.010424 | 1.000000 | 0.030776 | 0.008063 |
Review Rating | -0.021949 | 0.030776 | 1.000000 | 0.004229 |
Previous Purchases | 0.040445 | 0.008063 | 0.004229 | 1.000000 |
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.xticks(rotation=15)
plt.show()
Conclusion from the above analysis is that :
Office:- 660, Sector 14A, Vasundhara, Ghaziabad, Uttar Pradesh - 201012, India