Exploratory Data Analysis on Sales Data in Python

Analyse the given dataset completely and perform the following questions.

Convert the text based categorical columns to numeric columns
Convert the Order Date column into Day, Month, Year and Time columns separately
Create a new column and store the Price data as rounded off in integer format
Create a new column and store the PIN Codes from the Address column
Create a new column named City and store the city names from the Address column
Group Product column into different categories such as Phones, Accessories, etc according to the data available.
Create 12 separate date ordered dataframes for every
What was the best month for sales? How much was earned that month? Explain via graph/chart
What city sold the most product?
What time should we display advertisements to maximize likelihood of customer’s buying product?
Which product was sold the least as per your categories and what might be the reason behind that. Explain your analysis.
Which product is most likely to be sold more during winter season, summer season and rainy season? Explain the possible reason behind the High-Volume Sale
Which phone is sold most during the month of March?
Which headphones are the most expensive?
Check if the price of any product is changing in different If yes, explain the possible reason behind the price change.
People prefer which product more in earphones, wired or wireless?
Which product in every category in more likely to be ordered in bulk?

Data Analysis Project by Anushka Khemaria, Team edSlash.

eda_salesds

Introduction¶

This dataset contains sales records from an unspecified store in USA, capturing various details about each transaction, including the product, quantity ordered, price, order date, and purchase address. With over 186,000 records, it offers a comprehensive view of the store's sales activities over time.

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

df=pd.read_csv(r"C:\Users\DELL\Downloads\Sales Dataset\Sales Dataset\SalesAnalysis.csv")
df.head()

Out[2]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001
1	NaN	NaN	NaN	NaN	NaN	NaN
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001

Columns Description¶

Order ID: A unique identifier for each order.
Product: The name of the product sold. ( Electronic Items)
Quantity Ordered: The number of units sold in each transaction.
Price Each: The price of a single unit of the product. (In Dollars)
Order Date: The date and time when the order was placed. (mm/dd/yy hr:min)
Purchase Address: The address where the product was delivered.

Cleaning the data¶

In [3]:

df.dtypes

Out[3]:

Order ID            object
Product             object
Quantity Ordered    object
Price Each          object
Order Date          object
Purchase Address    object
dtype: object

We observe that all the data is in object format so we will convert the relevant data to int or float depending on the requirement.

In [4]:

df.Product.value_counts()

Out[4]:

USB-C Charging Cable          21903
Lightning Charging Cable      21658
AAA Batteries (4-pack)        20641
AA Batteries (4-pack)         20577
Wired Headphones              18882
Apple Airpods Headphones      15549
Bose SoundSport Headphones    13325
27in FHD Monitor               7507
iPhone                         6842
27in 4K Gaming Monitor         6230
34in Ultrawide Monitor         6181
Google Phone                   5525
Flatscreen TV                  4800
Macbook Pro Laptop             4724
ThinkPad Laptop                4128
20in Monitor                   4101
Vareebadd Phone                2065
LG Washing Machine              666
LG Dryer                        646
Product                         355
Name: Product, dtype: int64

Here we see that an item named "Product" is coming 355 times in the data.

In [5]:

df["Quantity Ordered"].value_counts()

Out[5]:

1                   168552
2                    13324
3                     2920
4                      806
Quantity Ordered       355
5                      236
6                       80
7                       24
8                        5
9                        3
Name: Quantity Ordered, dtype: int64

Similarly , here we see that "Quantity ordered" is occuring 355 in the data which can be a mistake so first we will observe and then remove those specific column with irrelevant data.

In [6]:

df["Price Each"].value_counts()

Out[6]:

11.95         21903
14.95         21658
2.99          20641
3.84          20577
11.99         18882
150           15450
99.99         13325
149.99         7507
700            6804
389.99         6230
379.99         6181
600            5490
300            4780
1700           4702
999.99         4128
109.99         4101
400            2056
600.0          1347
Price Each      355
150.0            99
700.0            38
1700.0           22
300.0            20
400.0             9
Name: Price Each, dtype: int64

Here we see "Price each" is also repeating 355 times . Thus , we will filter the data based on the occurance of name " quantity ordered" in the dataset .

In [7]:

index_remove=df[df["Quantity Ordered"]=="Quantity Ordered"]
index_remove.head()

Out[7]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
519	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
1149	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
1155	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
2878	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
2893	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address

We observer that there are 355 columns which have column name in them insted of data so we will remove them.

Removing irrelevant rows.¶

First we will collect the index of such rows with irrelevant data in list and then we will use this list to drop the rows from the data.

In [8]:

index_remove=df[df["Quantity Ordered"]=="Quantity Ordered"].index.tolist()
data_string = [','.join(map(str, index_remove))]
data_string = list(map(int, data_string[0].split(',')))
print(data_string)

[519, 1149, 1155, 2878, 2893, 3036, 3209, 3618, 4138, 4645, 4794, 5303, 6939, 7497, 8635, 9189, 9329, 9973, 10885, 10958, 11272, 11619, 11742, 11778, 11874, 12455, 13362, 14022, 14522, 14860, 15280, 17208, 17324, 17339, 17744, 18594, 18626, 19674, 19795, 19810, 20112, 20687, 20865, 20951, 21051, 21739, 24641, 24817, 25370, 25503, 26140, 26292, 26950, 27107, 27596, 27602, 27988, 28600, 28721, 29088, 29387, 30648, 31099, 31495, 33269, 34102, 34187, 34206, 34612, 35537, 35687, 37002, 37146, 37761, 38098, 38175, 38488, 40811, 41916, 41939, 42010, 42730, 42910, 43067, 43555, 46967, 47181, 47406, 47479, 49042, 49468, 49720, 49872, 49974, 50103, 50435, 50712, 50792, 50943, 51380, 51542, 51943, 52448, 52769, 53592, 53731, 54142, 54586, 54616, 56994, 57415, 57561, 57736, 58602, 58639, 58653, 58871, 58937, 59342, 59524, 59581, 60029, 60826, 61610, 62468, 62812, 67039, 68620, 68649, 68741, 69444, 70010, 70662, 70794, 71159, 71170, 72448, 73140, 74053, 75091, 75221, 75818, 76858, 78045, 78661, 78708, 78894, 78990, 79849, 80221, 80717, 80763, 80783, 80956, 81103, 81764, 82340, 82428, 82435, 83126, 84483, 85068, 85398, 85430, 85520, 85530, 85549, 85920, 85977, 86559, 86722, 87155, 87229, 87579, 89259, 89307, 89951, 91178, 91799, 92631, 93320, 93325, 94767, 94853, 95559, 95838, 95870, 95970, 96500, 96518, 97014, 97659, 99368, 99892, 100547, 100559, 102084, 102425, 102454, 102688, 104945, 105696, 106127, 106193, 107242, 107295, 107370, 107748, 107991, 108602, 109285, 109313, 109486, 109519, 110083, 110961, 111021, 112090, 112148, 112636, 112654, 112918, 113104, 113676, 114812, 115304, 116302, 117537, 117568, 118033, 118218, 118744, 119036, 119496, 119679, 119889, 121453, 121462, 121464, 121807, 121844, 123434, 123532, 124425, 124955, 125563, 125662, 126010, 126111, 126494, 127873, 128103, 129288, 129593, 130041, 130047, 130303, 130982, 131083, 131241, 131437, 131640, 133194, 134551, 134780, 135908, 136452, 136723, 136994, 138296, 138643, 138744, 138850, 139015, 139253, 139431, 139785, 140246, 140441, 140560, 141344, 141904, 141988, 142302, 142352, 143301, 143326, 144319, 144687, 145838, 146250, 147069, 147122, 148373, 149084, 149456, 149906, 150791, 152145, 152548, 153138, 153778, 154171, 154271, 154383, 155306, 155618, 156965, 157173, 159033, 159050, 160520, 160671, 160967, 161352, 161868, 162292, 163496, 163808, 164347, 166833, 166976, 167273, 168535, 170324, 170469, 170511, 170617, 170929, 171225, 171810, 172006, 172143, 172816, 172856, 173402, 173673, 175088, 176843, 178896, 178954, 179487, 180743, 181368, 183502, 183808, 184489, 184666, 184779, 185118, 185164, 185551, 186563, 186632, 186738]

Above we created a list of all the indices now we will use this to remove rows.

In [9]:

dsremove=[519,1149,1155,2878,2893,3036,3209,3618,4138,4645,4794,5303,6939,7497,8635,9189,9329,9973,10885,10958,11272,11619,11742,11778,11874,12455,13362,14022,14522,14860,15280,17208,17324,17339,17744,18594,18626,19674,19795,19810,20112,20687,20865,20951,21051,21739,24641,24817,25370,25503,26140,26292,26950,27107,27596,27602,27988,28600,28721,29088,29387,30648,31099,31495,33269,34102,34187,34206,34612,35537,35687,37002,37146,37761,38098,38175,38488,40811,41916,41939,42010,42730,42910,43067,43555,46967,47181,47406,47479,49042,49468,49720,49872,49974,50103,50435,50712,50792,50943,51380,51542,51943,52448,52769,53592,53731,54142,54586,54616,56994,57415,57561,57736,58602,58639,58653,58871,58937,59342,59524,59581,60029,60826,61610,62468,62812,67039,68620,68649,68741,69444,70010,70662,70794,71159,71170,72448,73140,74053,75091,75221,75818,76858,78045,78661,78708,78894,78990,79849,80221,80717,80763,80783,80956,81103,81764,82340,82428,82435,83126,84483,85068,85398,85430,85520,85530,85549,85920,85977,86559,86722,87155,87229,87579,89259,89307,89951,91178,91799,92631,93320,93325,94767,94853,95559,95838,95870,95970,96500,96518,97014,97659,99368,99892,100547,100559,102084,102425,102454,102688,104945,105696,106127,106193,107242,107295,107370,107748,107991,108602,109285,109313,109486,109519,110083,110961,111021,112090,112148,112636,112654,112918,113104,113676,114812,115304,116302,117537,117568,118033,118218,118744,119036,119496,119679,119889,121453,121462,121464,121807,121844,123434,123532,124425,124955,125563,125662,126010,126111,126494,127873,128103,129288,129593,130041,130047,130303,130982,131083,131241,131437,131640,133194,134551,134780,135908,136452,136723,136994,138296,138643,138744,138850,139015,139253,139431,139785,140246,140441,140560,141344,141904,141988,142302,142352,143301,143326,144319,144687,145838,146250,147069,147122,148373,149084,149456,149906,150791,152145,152548,153138,153778,154171,154271,154383,155306,155618,156965,157173,159033,159050,160520,160671,160967,161352,161868,162292,163496,163808,164347,166833,166976,167273,168535,170324,170469,170511,170617,170929,171225,171810,172006,172143,172816,172856,173402,173673,175088,176843,178896,178954,179487,180743,181368,183502,183808,184489,184666,184779,185118,185164,185551,186563,186632,186738]
data= df.drop(dsremove, axis=0)

The specified rows are removed . Now we will check for null data and remove or replace those null values based on the observations.

Handling missing data.¶

In [10]:

data.isnull().sum()

Out[10]:

Order ID            545
Product             545
Quantity Ordered    545
Price Each          545
Order Date          545
Purchase Address    545
dtype: int64

We see that there are 545 missing values in each column. So we will remove them by using dropna .

In [11]:

data.dropna(inplace=True)
data.head()

Out[11]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001

In [12]:

data.isnull().sum()

Out[12]:

Order ID            0
Product             0
Quantity Ordered    0
Price Each          0
Order Date          0
Purchase Address    0
dtype: int64

Null values are now removed .

Dropping duplicate values¶

Now we will see if there are any duplicate rows in the data.

In [13]:

duplicate_list=data[data.duplicated()]
duplicate_list

Out[13]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
31	176585	Bose SoundSport Headphones	1	99.99	04/07/19 11:31	823 Highland St, Boston, MA 02215
1302	177795	Apple Airpods Headphones	1	150	04/27/19 19:45	740 14th St, Seattle, WA 98101
1684	178158	USB-C Charging Cable	1	11.95	04/28/19 21:13	197 Center St, San Francisco, CA 94016
3805	180207	Apple Airpods Headphones	1	150	04/13/19 01:46	196 7th St, Los Angeles, CA 90001
4196	180576	Lightning Charging Cable	1	14.95	04/18/19 17:23	431 Park St, Dallas, TX 75001
...	...	...	...	...	...	...
184940	257530	USB-C Charging Cable	1	11.95	09/03/19 11:13	192 Johnson St, San Francisco, CA 94016
186173	258715	Lightning Charging Cable	1	14.95	09/15/19 16:50	550 10th St, Portland, OR 97035
186508	259035	27in FHD Monitor	1	149.99	09/29/19 13:52	327 Lake St, San Francisco, CA 94016
186782	259296	Apple Airpods Headphones	1	150	09/28/19 16:48	894 6th St, Dallas, TX 75001
186785	259297	Lightning Charging Cable	1	14.95	09/15/19 18:54	138 Main St, Boston, MA 02215

264 rows × 6 columns

We observe that there are some duplicate rows as well so we will drop them also .

In [14]:

data=data.drop_duplicates()
data.head()

Out[14]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215
3	176560	Google Phone	1	600	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001

Hence , duplicate rows are dropped and now we can move to further analysis based on the tasks provided.

Transforming the data¶

Task 1. Convert the text based categorical columns to numeric columns¶

The task involves converting text-based categorical columns to numeric columns to facilitate numerical analysis and machine learning operations.

In [15]:

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 185686 entries, 0 to 186849
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Order ID          185686 non-null  object
 1   Product           185686 non-null  object
 2   Quantity Ordered  185686 non-null  object
 3   Price Each        185686 non-null  object
 4   Order Date        185686 non-null  object
 5   Purchase Address  185686 non-null  object
dtypes: object(6)
memory usage: 9.9+ MB

We observe that all the columns are in object format.

1.Data Type Conversion¶

In [16]:

data['Order ID']         = data['Order ID'].astype('int32')
data['Quantity Ordered'] = data['Quantity Ordered'].astype('int32')
data['Price Each']       = data['Price Each'].astype('float64')
data.dtypes

Out[16]:

Order ID              int32
Product              object
Quantity Ordered      int32
Price Each          float64
Order Date           object
Purchase Address     object
dtype: object

The data types of 'Order ID', 'Quantity Ordered', and 'Price Each' columns are converted to 'int32' and 'float64', respectively, ensuring these columns can be used for numerical computations.

2. Label Encoding of Product Column¶

In [17]:

a=data.Product.unique().tolist()
def productnum(data):                                #label encoding
#     global a
    return a.index(data) + 1

data['productnum']=data.Product.apply(productnum)
data.head()

Out[17]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum
0	176558	USB-C Charging Cable	2	11.95	04/19/19 08:46	917 1st St, Dallas, TX 75001	1
2	176559	Bose SoundSport Headphones	1	99.99	04/07/19 22:30	682 Chestnut St, Boston, MA 02215	2
3	176560	Google Phone	1	600.00	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	3
4	176560	Wired Headphones	1	11.99	04/12/19 14:38	669 Spruce St, Los Angeles, CA 90001	4
5	176561	Wired Headphones	1	11.99	04/30/19 09:27	333 8th St, Los Angeles, CA 90001	4

The 'Product' column is label-encoded, converting each unique product name into a unique numeric identifier, stored in the new 'productnum' column. This transformation allows the categorical product data to be used in numerical analyses and machine learning models.

Task 2. Convert the Order Date column into Day, Month, Year and Time columns separately .¶

The task involves converting the 'Order Date' column into separate columns for the day, month, year, and time to facilitate detailed time-based analysis.

1. Converting Order Date to DateTime Format¶

In [18]:

data["Order Date"] = pd.to_datetime(data["Order Date"])
data.head()

Out[18]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	1
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	2
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	3
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4

In [19]:

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 185686 entries, 0 to 186849
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Order ID          185686 non-null  int32         
 1   Product           185686 non-null  object        
 2   Quantity Ordered  185686 non-null  int32         
 3   Price Each        185686 non-null  float64       
 4   Order Date        185686 non-null  datetime64[ns]
 5   Purchase Address  185686 non-null  object        
 6   productnum        185686 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int32(2), int64(1), object(2)
memory usage: 9.9+ MB

The 'Order Date' column is converted to a DateTime format, enabling the extraction of specific date and time components. This step is essential for performing further time-based operations and analyses.

2. Extracting Day, Month, Year, and Time Components¶

In [20]:

data["Day"]   = data["Order Date"].dt.day
data["Month"] = data["Order Date"].dt.month
data["Year"]  = data["Order Date"].dt.year
data["Time"]  = data["Order Date"].dt.time
data.head()

Out[20]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	1	19	4	2019	08:46:00
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	2	7	4	2019	22:30:00
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	3	12	4	2019	14:38:00
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	12	4	2019	14:38:00
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4	30	4	2019	09:27:00

New columns 'Day', 'Month', 'Year', and 'Time' are created by extracting respective components from the 'Order Date' column. This detailed breakdown of the date and time allows for more granular analysis of sales trends and patterns over different periods.

Task 3. Create a new column and store the Price data as rounded off in integer format¶

The task involves creating a new column that stores the price data in a rounded-off integer format, making it easier to analyze and interpret.

1. Rounding and Converting Price Data to Integer Format¶

In [21]:

data['Price Rounded'] = data['Price Each'].round().astype(int)
data.head()

Out[21]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time	Price Rounded
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	1	19	4	2019	08:46:00	12
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	2	7	4	2019	22:30:00	100
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	3	12	4	2019	14:38:00	600
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	12	4	2019	14:38:00	12
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4	30	4	2019	09:27:00	12

A new column 'Price Rounded' is created by rounding off the 'Price Each' values to the nearest integer and converting them to integer format. This transformation simplifies the price data, facilitating more straightforward analysis and potential aggregation without decimal precision.

Task 4. Create a new column and store the PIN Codes from the Address column¶

The task involves extracting the PIN codes from the 'Purchase Address' column and storing them in a new column for more accessible geographical analysis.

1. Splitting the Address Column to Extract PIN Codes¶

In [22]:

address=data["Purchase Address"].str.split(",", expand=True)
pin=address[2].str.split(" ",expand=True)
pin.head()

Out[22]:

	1	2
0	TX	75001
2	MA	02215
3	CA	90001
4	CA	90001
5	CA	90001

The 'Purchase Address' column is split into multiple components using commas and spaces, isolating the segment that contains the PIN codes. This intermediate step helps in precisely locating and extracting the PIN codes from the address.

2. Storing the Extracted PIN Codes in a New Column¶

In [23]:

data["PIN"]=pin[2]
data.head()

Out[23]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time	Price Rounded	PIN
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	1	19	4	2019	08:46:00	12	75001
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	2	7	4	2019	22:30:00	100	02215
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	3	12	4	2019	14:38:00	600	90001
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	12	4	2019	14:38:00	12	90001
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4	30	4	2019	09:27:00	12	90001

A new column 'PIN' is created to store the extracted PIN codes from the address. This addition allows for more detailed geographical analysis, enabling the identification of sales patterns and trends based on specific locations.

Task 5. Create a new column named City and store the city names from the Address column¶

The task involves creating a new column to store city names extracted from the 'Purchase Address' column, facilitating geographical analysis based on city-level data.

1. Inspecting the Address Column¶

In [24]:

address.head()

Out[24]:

	0	1	2
0	917 1st St	Dallas	TX 75001
2	682 Chestnut St	Boston	MA 02215
3	669 Spruce St	Los Angeles	CA 90001
4	669 Spruce St	Los Angeles	CA 90001
5	333 8th St	Los Angeles	CA 90001

By inspecting the 'address' DataFrame, we observe that the city names are located in the second segment (index 1) of the split addresses. This step confirms the correct position of the city names, ensuring accurate extraction.

2. Storing the Extracted City Names in a New Column¶

In [25]:

data["City"]=address[1]
data.head()

Out[25]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time	Price Rounded	PIN	City
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	1	19	4	2019	08:46:00	12	75001	Dallas
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	2	7	4	2019	22:30:00	100	02215	Boston
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	3	12	4	2019	14:38:00	600	90001	Los Angeles
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	12	4	2019	14:38:00	12	90001	Los Angeles
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4	30	4	2019	09:27:00	12	90001	Los Angeles

A new column 'City' is created to store the city names extracted from the 'Purchase Address' column. This addition allows for detailed geographical analysis, enabling the identification of sales trends and patterns based on specific cities.

Task 6. Group Product column into different categories such as Phones, Accessories, etc according to the data available.¶

The task involves grouping the products into different categories such as Phones, Accessories, etc., based on their types, to enable better analysis and reporting.

1. Identifying Unique Products¶

In [26]:

data.Product.unique().tolist()

Out[26]:

['USB-C Charging Cable',
 'Bose SoundSport Headphones',
 'Google Phone',
 'Wired Headphones',
 'Macbook Pro Laptop',
 'Lightning Charging Cable',
 '27in 4K Gaming Monitor',
 'AA Batteries (4-pack)',
 'Apple Airpods Headphones',
 'AAA Batteries (4-pack)',
 'iPhone',
 'Flatscreen TV',
 '27in FHD Monitor',
 '20in Monitor',
 'LG Dryer',
 'ThinkPad Laptop',
 'Vareebadd Phone',
 'LG Washing Machine',
 '34in Ultrawide Monitor']

By listing the unique product names, we can categorize them effectively. This step provides a clear overview of the product types available in the dataset, which is essential for accurate grouping.

2. Defining the Product Grouping Function¶

In [27]:

def grpproduct(data):
    if data in ["USB-C Charging Cable",'Lightning Charging Cable']:
        return "Charging Cable"
    elif data in ['Bose SoundSport Headphones','Wired Headphones','Apple Airpods Headphones']:
        return "Headphones"
    elif data in ['Google Phone','iPhone','Vareebadd Phone']:
        return "Phone"
    elif data in ['Macbook Pro Laptop','ThinkPad Laptop']:
        return "Laptop"
    elif data in ['27in 4K Gaming Monitor','27in FHD Monitor','20in Monitor','34in Ultrawide Monitor']:
        return "Monitor"
    elif data in ['AA Batteries (4-pack)','AAA Batteries (4-pack)']:
        return "Batteries"
    elif data in ['Flatscreen TV',]:
        return "TV"
    elif data in ['LG Dryer','LG Washing Machine']:
        return "Home appliances"
    else:
        return np.nan

The function grpproduct is defined to categorize each product into predefined categories. This function maps each product to its corresponding category based on the provided list.

3. Creating the Product Category Column¶

In [28]:

data["Product Category"]=data.Product.apply(grpproduct)
data

Out[28]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time	Price Rounded	PIN	City	Product Category
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	1	19	4	2019	08:46:00	12	75001	Dallas	Charging Cable
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	2	7	4	2019	22:30:00	100	02215	Boston	Headphones
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	3	12	4	2019	14:38:00	600	90001	Los Angeles	Phone
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	12	4	2019	14:38:00	12	90001	Los Angeles	Headphones
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4	30	4	2019	09:27:00	12	90001	Los Angeles	Headphones
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
186845	259353	AAA Batteries (4-pack)	3	2.99	2019-09-17 20:56:00	840 Highland St, Los Angeles, CA 90001	10	17	9	2019	20:56:00	3	90001	Los Angeles	Batteries
186846	259354	iPhone	1	700.00	2019-09-01 16:00:00	216 Dogwood St, San Francisco, CA 94016	11	1	9	2019	16:00:00	700	94016	San Francisco	Phone
186847	259355	iPhone	1	700.00	2019-09-23 07:39:00	220 12th St, San Francisco, CA 94016	11	23	9	2019	07:39:00	700	94016	San Francisco	Phone
186848	259356	34in Ultrawide Monitor	1	379.99	2019-09-19 17:30:00	511 Forest St, San Francisco, CA 94016	19	19	9	2019	17:30:00	380	94016	San Francisco	Monitor
186849	259357	USB-C Charging Cable	1	11.95	2019-09-30 00:18:00	250 Meadow St, San Francisco, CA 94016	1	30	9	2019	00:18:00	12	94016	San Francisco	Charging Cable

185686 rows × 15 columns

A new column 'Product Category' is created by applying the grpproduct function to the 'Product' column. This categorization allows for better organization and analysis of the data, enabling insights into the performance of different product categories.

Task 7. Create 12 separate date ordered dataframes for every month.¶

The task involves creating 12 separate dataframes, each corresponding to a month, to facilitate month-wise analysis of the sales data.

1. Identifying Unique Years¶

In [29]:

data.Year.unique().tolist()

Out[29]:

[2019, 2020]

By listing the unique years, we can filter the data accordingly. This step confirms the range of years present in the dataset, which is crucial for segregating the data by month.

2. Filtering Data by Year¶

In [30]:

data19=data[data.Year==2019]
data20=data[data.Year==2020]

The data is filtered into separate dataframes for the years 2019 and 2020. This segregation ensures that month-wise analysis can be accurately performed within each year.

3. Initializing Data Structures for Months¶

In [31]:

ds_list={}
monthnum=[1,2,3,4,5,6,7,8,9,10,11,12]
monthlist=["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]

A dictionary ds_list and lists monthnum and monthlist are initialized to store the month-wise dataframes and to facilitate iteration over the months. This setup is essential for creating and labeling the separate monthly dataframes.

4. Creating Separate DataFrames for Each Month¶

In [32]:

for i in monthnum:
    ds_list[monthlist[i-1]]=data.groupby("Month").get_group(i)

The data dataframe is grouped by the 'Month' column, and separate dataframes for each month are stored in the ds_list dictionary. This step ensures that each month has a dedicated dataframe for detailed analysis.

5: Displaying the January DataFrame:¶

In [33]:

ds_list["Jan"].head()

Out[33]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time	Price Rounded	PIN	City	Product Category
31957	297150	Lightning Charging Cable	1	14.95	2020-01-01 00:38:00	427 Wilson St, Dallas, TX 75001	6	1	1	2020	00:38:00	15	75001	Dallas	Charging Cable
32659	297817	iPhone	1	700.00	2020-01-01 00:22:00	519 13th St, New York City, NY 10001	11	1	1	2020	00:22:00	700	10001	New York City	Phone
32660	297817	Lightning Charging Cable	2	14.95	2020-01-01 00:22:00	519 13th St, New York City, NY 10001	6	1	1	2020	00:22:00	15	10001	New York City	Charging Cable
32819	297969	Google Phone	1	600.00	2020-01-01 00:54:00	542 2nd St, San Francisco, CA 94016	3	1	1	2020	00:54:00	600	94016	San Francisco	Phone
33200	298344	Wired Headphones	4	11.99	2020-01-01 03:34:00	731 7th St, New York City, NY 10001	4	1	1	2020	03:34:00	12	10001	New York City	Headphones

Displaying the first few rows of the January dataframe confirms that the data for January has been correctly extracted and stored. This validation step helps ensure the accuracy of the monthly segregation.

Task 8. What was the best month for sales? How much was earned that month? Explain via graph/chart¶

The goal is to identify the best month for sales and determine how much was earned during that month. We will use the sales data to calculate total sales for each month and then visualize this data using a bar chart to easily identify the best month for sales.

1. Calculate Total Sales¶

In [34]:

data["Total Sales"]= data["Quantity Ordered"]*data["Price Rounded"]

This step creates a new column, Total Sales, by multiplying the quantity of items ordered (Quantity Ordered) by the price per item (Price Rounded). This provides the total sales amount for each individual order.

2. Group Sales by Month and Sum Them:¶

In [35]:

sale=data.groupby("Month")["Total Sales"].sum()
sale

Out[35]:

Month
1     1821811
2     2200572
3     2805594
4     3389991
5     3151289
6     2576850
7     2647047
8     2241581
9     2094958
10    3735611
11    3198606
12    4609333
Name: Total Sales, dtype: int32

Here, the data is grouped by the Month column, and the total sales for each month are summed up. This results in a new Series, sale, where the index is the month and the values are the total sales for that month.

3. Find the Maximum Sales Value:¶

In [36]:

sale.max()

Out[36]:

This line finds the highest value in the sale Series, representing the month with the highest total sales. It identifies the peak sales performance across all months.

4. Plot the Sales Data:¶

In [37]:

sns.barplot(x=sale.index , y= sale.values)

Out[37]:

<AxesSubplot:xlabel='Month'>

A bar plot is created using the seaborn library, with months on the x-axis and total sales values on the y-axis. This visual representation makes it easy to compare sales performance across different months and clearly identify the best month for sales.

The bar plot visualizes the total earnings per month. It confirms that December was the best month for sales, with the highest total earnings of 4,583,267.77 dollars. This visualization effectively communicates the peak sales month, making it easy to identify trends and patterns.

Conclusion :¶

Based on the calculated total sales for each month, the results indicate the following:

The month with the highest sales is December, with total earnings of 4,608,295.70.
The month with the second highest sales is October, with total earnings of 3,734,777.86.
The month with the lowest sales is January, with total earnings of 1,821,413.16.

Task 9. What city sold the most product?¶

The task is to identify the city with the highest product sales and visualize the distribution of sales across different cities.

1. Counting the Number of Sales Per City¶

In [38]:

data.City.value_counts()

Out[38]:

 San Francisco    44662
 Los Angeles      29564
 New York City    24847
 Boston           19901
 Atlanta          14863
 Dallas           14797
 Seattle          14713
 Portland         12449
 Austin            9890
Name: City, dtype: int64

The output indicates that San Francisco has the highest number of sales transactions (44,662), followed by Los Angeles (29,564) and New York City (24,847). This step helps identify the city with the most sales activity.

2. Visualizing Sales Transactions Per City¶

In [39]:

sns.countplot(x=data.City)
plt.xticks(rotation=90)
plt.show()

The count plot visualizes the number of sales transactions across different cities. San Francisco is clearly the city with the highest sales volume, as reflected by the tallest bar in the plot. The rotation of x-axis labels ensures all city names are readable. This visualization provides a clear comparison of sales performance by city, highlighting San Francisco as the leading city in terms of product sales.

Conclusion :¶

The analysis reveals that San Francisco leads in sales transactions compared to other cities. The visual representation confirms this by showing San Francisco's prominent position in the sales distribution, making it evident which city contributes the most to the overall sales figures.

Task 10. What time should we display advertisements to maximize likelihood of customer's buying product?¶

The task aims to determine the optimal time to display advertisements by analyzing the hourly distribution of purchases to maximize the likelihood of customers buying products.

1. Extracting the Hour of Purchase¶

In [40]:

data["Hour"]=data["Order Date"].dt.hour
data.head()

Out[40]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time	Price Rounded	PIN	City	Product Category	Total Sales	Hour
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	1	19	4	2019	08:46:00	12	75001	Dallas	Charging Cable	24	8
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	2	7	4	2019	22:30:00	100	02215	Boston	Headphones	100	22
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	3	12	4	2019	14:38:00	600	90001	Los Angeles	Phone	600	14
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	12	4	2019	14:38:00	12	90001	Los Angeles	Headphones	12	14
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4	30	4	2019	09:27:00	12	90001	Los Angeles	Headphones	12	9

The 'Hour' column is created to capture the hour at which each purchase was made. This step prepares the data for time-based analysis, allowing us to identify peak purchasing hours.

2. Visualizing the Number of Purchases by Hour¶

In [41]:

sns.countplot(x=data.Hour)

Out[41]:

<AxesSubplot:xlabel='Hour', ylabel='count'>

The count plot displays the number of purchases made during each hour of the day. It highlights that 11-12 hour and 19 hour has the highest frequency of purchases. This visualization helps identify peak hours when customers are most active.

3. Visualizing the Distribution of Purchases by Hour¶

In [42]:

sns.displot(x=data.Hour, kde=True, bins=24)
plt.title('Distribution of Purchases by Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Purchases')
plt.grid(True)
plt.xticks(range(0, 24), rotation=90)
plt.show()

The distribution plot with a KDE curve shows the frequency of purchases throughout the day. The plot reveals 12 and 19 hours have the highest density of purchases, indicating the optimal times for displaying advertisements. The grid and x-axis labels make the plot easy to interpret, showing that the highest number of purchases typically occurs in the afternoon or late evening hours, suggesting these times might be ideal for ad placements.

Task 11. Which product was sold the least as per your categories and what might be the reason behind that. Explain your analysis.¶

The task is to identify which product was sold the least according to the product categories and analyze potential reasons behind its lower sales.

1. Counting Sales per Product¶

In [43]:

procount=data.groupby("Product Category")["Product"].value_counts().reset_index(name="Count")
procount

Out[43]:

	Product Category	Product	Count
0	Batteries	AAA Batteries (4-pack)	20612
1	Batteries	AA Batteries (4-pack)	20558
2	Charging Cable	USB-C Charging Cable	21859
3	Charging Cable	Lightning Charging Cable	21610
4	Headphones	Wired Headphones	18849
5	Headphones	Apple Airpods Headphones	15525
6	Headphones	Bose SoundSport Headphones	13298
7	Home appliances	LG Washing Machine	666
8	Home appliances	LG Dryer	646
9	Laptop	Macbook Pro Laptop	4721
10	Laptop	ThinkPad Laptop	4126
11	Monitor	27in FHD Monitor	7498
12	Monitor	27in 4K Gaming Monitor	6225
13	Monitor	34in Ultrawide Monitor	6174
14	Monitor	20in Monitor	4098
15	Phone	iPhone	6840
16	Phone	Google Phone	5522
17	Phone	Vareebadd Phone	2065
18	TV	Flatscreen TV	4794

The output shows the count of sales for each product within its category. From this, it is evident that the products with the least sales are the LG Washing Machine and LG Dryer, with counts of 666 and 646, respectively, in the 'Home appliances' category. These are the least sold products among the listed categories.

2. Calculating Mean Price per Product¶

In [44]:

meanprice=data.groupby("Product")["Price Each"].mean().reset_index()
meanprice.columns=["Product", "Mean Price"]
meanprice

Out[44]:

	Product	Mean Price
0	20in Monitor	109.99
1	27in 4K Gaming Monitor	389.99
2	27in FHD Monitor	149.99
3	34in Ultrawide Monitor	379.99
4	AA Batteries (4-pack)	3.84
5	AAA Batteries (4-pack)	2.99
6	Apple Airpods Headphones	150.00
7	Bose SoundSport Headphones	99.99
8	Flatscreen TV	300.00
9	Google Phone	600.00
10	LG Dryer	600.00
11	LG Washing Machine	600.00
12	Lightning Charging Cable	14.95
13	Macbook Pro Laptop	1700.00
14	ThinkPad Laptop	999.99
15	USB-C Charging Cable	11.95
16	Vareebadd Phone	400.00
17	Wired Headphones	11.99
18	iPhone	700.00

The mean price data reveals that products like the LG Washing Machine and LG Dryer are priced at 600.00 dollars , which is relatively high compared to other products in the dataset. This suggests that their high prices may be a contributing factor to their lower sales volumes.

3. Merging Sales Count with Mean Price¶

In [45]:

merged=pd.merge(procount, meanprice, on="Product", how="left")
merged

Out[45]:

	Product Category	Product	Count	Mean Price
0	Batteries	AAA Batteries (4-pack)	20612	2.99
1	Batteries	AA Batteries (4-pack)	20558	3.84
2	Charging Cable	USB-C Charging Cable	21859	11.95
3	Charging Cable	Lightning Charging Cable	21610	14.95
4	Headphones	Wired Headphones	18849	11.99
5	Headphones	Apple Airpods Headphones	15525	150.00
6	Headphones	Bose SoundSport Headphones	13298	99.99
7	Home appliances	LG Washing Machine	666	600.00
8	Home appliances	LG Dryer	646	600.00
9	Laptop	Macbook Pro Laptop	4721	1700.00
10	Laptop	ThinkPad Laptop	4126	999.99
11	Monitor	27in FHD Monitor	7498	149.99
12	Monitor	27in 4K Gaming Monitor	6225	389.99
13	Monitor	34in Ultrawide Monitor	6174	379.99
14	Monitor	20in Monitor	4098	109.99
15	Phone	iPhone	6840	700.00
16	Phone	Google Phone	5522	600.00
17	Phone	Vareebadd Phone	2065	400.00
18	TV	Flatscreen TV	4794	300.00

The merged data provides a comprehensive view of both the sales count and mean price for each product. The LG Washing Machine and LG Dryer have the lowest sales counts and are also among the higher-priced items. This pricing could be a significant factor in their low sales, as higher-priced products may have a smaller customer base or face more competition compared to lower-priced items.

Conclusion :¶

The analysis indicates that the LG Washing Machine and LG Dryer were sold the least. Their high prices, relative to other products, likely contribute to their lower sales volumes. Higher-priced items may have fewer buyers due to budget constraints or lower demand compared to more affordable products. Additionally, factors such as product necessity, market competition, and customer preferences may also influence the sales figures.

Task 12. Which product is most likely to be sold more during winter season, summer season and rainy season? Explain the possible reason behind the High-Volume Sale¶

The task is to determine which products are most likely to be sold more during different seasons (winter, summer, and rainy) and to explain the reasons behind these seasonal sales trends.

1. Categorizing Data by Season¶

In [46]:

def season(data):
    if data in [11,12,1,2]:  return "Winter"
    elif data in [3,4,5,6]:  return "Summer"
    elif data in [7,8,9,10]: return "Rainy"
    else :                   return np.nan
    
data["Season"]=data.Month.apply(season)
data.head()

Out[46]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time	Price Rounded	PIN	City	Product Category	Total Sales	Hour	Season
0	176558	USB-C Charging Cable	2	11.95	2019-04-19 08:46:00	917 1st St, Dallas, TX 75001	1	19	4	2019	08:46:00	12	75001	Dallas	Charging Cable	24	8	Summer
2	176559	Bose SoundSport Headphones	1	99.99	2019-04-07 22:30:00	682 Chestnut St, Boston, MA 02215	2	7	4	2019	22:30:00	100	02215	Boston	Headphones	100	22	Summer
3	176560	Google Phone	1	600.00	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	3	12	4	2019	14:38:00	600	90001	Los Angeles	Phone	600	14	Summer
4	176560	Wired Headphones	1	11.99	2019-04-12 14:38:00	669 Spruce St, Los Angeles, CA 90001	4	12	4	2019	14:38:00	12	90001	Los Angeles	Headphones	12	14	Summer
5	176561	Wired Headphones	1	11.99	2019-04-30 09:27:00	333 8th St, Los Angeles, CA 90001	4	30	4	2019	09:27:00	12	90001	Los Angeles	Headphones	12	9	Summer

The 'Season' column is created based on the month of the year, categorizing each month into Winter, Summer, or Rainy season. This categorization is crucial for analyzing seasonal sales patterns.

2. Grouping and Counting Sales by Season¶

In [47]:

data.groupby("Season")["Product"].value_counts()

Out[47]:

Season  Product                   
Rainy   USB-C Charging Cable          6894
        Lightning Charging Cable      6782
        AA Batteries (4-pack)         6466
        AAA Batteries (4-pack)        6409
        Wired Headphones              5889
        Apple Airpods Headphones      4831
        Bose SoundSport Headphones    4180
        27in FHD Monitor              2351
        iPhone                        2051
        27in 4K Gaming Monitor        2018
        34in Ultrawide Monitor        1885
        Google Phone                  1707
        Flatscreen TV                 1540
        Macbook Pro Laptop            1478
        ThinkPad Laptop               1302
        20in Monitor                  1283
        Vareebadd Phone                616
        LG Washing Machine             207
        LG Dryer                       181
Summer  Lightning Charging Cable      7435
        USB-C Charging Cable          7250
        AA Batteries (4-pack)         7095
        AAA Batteries (4-pack)        6972
        Wired Headphones              6463
        Apple Airpods Headphones      5321
        Bose SoundSport Headphones    4592
        27in FHD Monitor              2587
        iPhone                        2400
        34in Ultrawide Monitor        2167
        27in 4K Gaming Monitor        2034
        Google Phone                  1913
        Macbook Pro Laptop            1649
        Flatscreen TV                 1581
        ThinkPad Laptop               1420
        20in Monitor                  1369
        Vareebadd Phone                734
        LG Dryer                       246
        LG Washing Machine             244
Winter  USB-C Charging Cable          7715
        Lightning Charging Cable      7393
        AAA Batteries (4-pack)        7231
        AA Batteries (4-pack)         6997
        Wired Headphones              6497
        Apple Airpods Headphones      5373
        Bose SoundSport Headphones    4526
        27in FHD Monitor              2560
        iPhone                        2389
        27in 4K Gaming Monitor        2173
        34in Ultrawide Monitor        2122
        Google Phone                  1902
        Flatscreen TV                 1673
        Macbook Pro Laptop            1594
        20in Monitor                  1446
        ThinkPad Laptop               1404
        Vareebadd Phone                715
        LG Dryer                       219
        LG Washing Machine             215
Name: Product, dtype: int64

The output shows which products have the highest sales counts for each season. The most popular products by season are:

Winter: USB-C Charging Cable
Summer: Lightning Charging Cable
Rainy: USB-C Charging Cable This indicates that certain products perform better during specific seasons.

3. Visualizing Product Sales by Season¶

In [48]:

sns.countplot(x=data.Product, hue=data.Season)
plt.xticks(rotation=90)
plt.show()

The count plot visually represents the sales of different products across seasons. It confirms that:

Winter: USB-C Charging Cable is the top-selling product.
Summer: Lightning Charging Cable leads in sales.
Rainy: USB-C Charging Cable also performs well.

Conclusion :¶

Winter: The USB-C Charging Cable tops the sales list, likely because people may purchase more tech accessories during the colder months when they stay indoors more often and might need additional charging solutions for their devices.
Summer: The Lightning Charging Cable leads in sales. During summer, people might be on the go more frequently and need portable charging solutions, which explains the higher sales of this product.
Rainy: The USB-C Charging Cable again stands out. Rainy weather often leads to increased indoor activities, which could drive higher sales of tech products as people spend more time using their electronic devices.

Overall, these trends highlight how seasonal weather and activities influence purchasing behavior, suggesting that advertisements and promotions could be tailored to these seasonal patterns to maximize sales.

Task 13. Which phone is sold most during the month of March?¶

The task is to identify which phone model had the highest sales in March.

1. Filtering Data for March¶

In [49]:

march=data.groupby("Month").get_group(3)
march.head()

Out[49]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time	Price Rounded	PIN	City	Product Category	Total Sales	Hour	Season
55543	150533	AAA Batteries (4-pack)	1	2.99	2019-03-01 03:06:00	270 Dogwood St, San Francisco, CA 94016	10	1	3	2019	03:06:00	3	94016	San Francisco	Batteries	3	3	Summer
55551	150541	AAA Batteries (4-pack)	1	2.99	2019-03-01 01:03:00	683 Adams St, Portland, OR 97035	10	1	3	2019	01:03:00	3	97035	Portland	Batteries	3	1	Summer
55834	150812	Wired Headphones	1	11.99	2019-03-01 02:18:00	915 Maple St, San Francisco, CA 94016	4	1	3	2019	02:18:00	12	94016	San Francisco	Headphones	12	2	Summer
56817	151749	Bose SoundSport Headphones	1	99.99	2019-03-01 01:54:00	993 Washington St, San Francisco, CA 94016	2	1	3	2019	01:54:00	100	94016	San Francisco	Headphones	100	1	Summer
57120	152036	ThinkPad Laptop	1	999.99	2019-03-01 05:56:00	477 Washington St, San Francisco, CA 94016	16	1	3	2019	05:56:00	1000	94016	San Francisco	Laptop	1000	5	Summer

This step filters the dataset to include only sales data from March. It provides a subset of the data where the month is March, setting the stage for further analysis of phone sales during this month.

2. Filtering for Phone Products¶

In [50]:

mphone=march[march["Product Category"]=="Phone"]
mphone.head()

Out[50]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time	Price Rounded	PIN	City	Product Category	Total Sales	Hour	Season
105263	162009	iPhone	1	700.0	2019-03-28 20:59:00	942 Church St, Austin, TX 73301	11	28	3	2019	20:59:00	700	73301	Austin	Phone	700	20	Summer
105282	162026	iPhone	1	700.0	2019-03-22 22:31:00	71 Pine St, Austin, TX 73301	11	22	3	2019	22:31:00	700	73301	Austin	Phone	700	22	Summer
105286	162030	Google Phone	1	600.0	2019-03-09 16:50:00	463 4th St, San Francisco, CA 94016	3	9	3	2019	16:50:00	600	94016	San Francisco	Phone	600	16	Summer
105289	162033	iPhone	1	700.0	2019-03-19 11:06:00	267 Johnson St, Dallas, TX 75001	11	19	3	2019	11:06:00	700	75001	Dallas	Phone	700	11	Summer
105293	162037	iPhone	1	700.0	2019-03-23 02:11:00	277 Elm St, Los Angeles, CA 90001	11	23	3	2019	02:11:00	700	90001	Los Angeles	Phone	700	2	Summer

This step further filters the March data to include only products categorized as phones. It isolates the phone sales data from the overall March data, focusing on the relevant product category.

3. Counting Phone Sales in March¶

In [51]:

mphone.Product.value_counts()

Out[51]:

iPhone             537
Google Phone       463
Vareebadd Phone    174
Name: Product, dtype: int64

The output reveals the number of units sold for each phone model in March:

iPhone: 537 units
Google Phone: 463 units
Vareebadd Phone: 174 units The iPhone was the most sold phone model in March, followed by the Google Phone and Vareebadd Phone.

Conclusion :¶

The iPhone led in sales during March with 537 units sold, significantly outperforming other phone models. This high volume of sales might be attributed to factors such as brand popularity, promotions, or consumer preferences for the iPhone. The analysis highlights which phone model resonates most with customers in March, providing valuable insights for inventory management, marketing strategies, and sales forecasting.

Task 14. Which headphones are the most expensive?¶

The task is to identify which headphones are the most expensive based on their price data.

1. Filtering and Identifying the Most Expensive Headphones¶

In [52]:

hdphones=data[data["Product Category"]=="Headphones"]
hdphones[hdphones["Price Each"]==hdphones["Price Each"].max()].head()

Out[52]:

	Order ID	Product	Quantity Ordered	Price Each	Order Date	Purchase Address	productnum	Day	Month	Year	Time	Price Rounded	PIN	City	Product Category	Total Sales	Hour	Season
16	176572	Apple Airpods Headphones	1	150.0	2019-04-04 20:30:00	149 Dogwood St, New York City, NY 10001	9	4	4	2019	20:30:00	150	10001	New York City	Headphones	150	20	Summer
21	176576	Apple Airpods Headphones	1	150.0	2019-04-28 11:42:00	771 Ridge St, Los Angeles, CA 90001	9	28	4	2019	11:42:00	150	90001	Los Angeles	Headphones	150	11	Summer
22	176577	Apple Airpods Headphones	1	150.0	2019-04-04 19:25:00	260 Spruce St, Dallas, TX 75001	9	4	4	2019	19:25:00	150	75001	Dallas	Headphones	150	19	Summer
23	176578	Apple Airpods Headphones	1	150.0	2019-04-09 23:35:00	513 Church St, Boston, MA 02215	9	9	4	2019	23:35:00	150	02215	Boston	Headphones	150	23	Summer
38	176591	Apple Airpods Headphones	1	150.0	2019-04-21 07:21:00	600 Maple St, Austin, TX 73301	9	21	4	2019	07:21:00	150	73301	Austin	Headphones	150	7	Summer

This step filters the dataset to include only products in the "Headphones" category and identifies the headphone model with the highest price. The output reveals that the Apple Airpods Headphones are the most expensive headphones, priced at 150.00 dollars.

2. Reviewing Merged Data for Headphones¶

In [53]:

merged

Out[53]:

	Product Category	Product	Count	Mean Price
0	Batteries	AAA Batteries (4-pack)	20612	2.99
1	Batteries	AA Batteries (4-pack)	20558	3.84
2	Charging Cable	USB-C Charging Cable	21859	11.95
3	Charging Cable	Lightning Charging Cable	21610	14.95
4	Headphones	Wired Headphones	18849	11.99
5	Headphones	Apple Airpods Headphones	15525	150.00
6	Headphones	Bose SoundSport Headphones	13298	99.99
7	Home appliances	LG Washing Machine	666	600.00
8	Home appliances	LG Dryer	646	600.00
9	Laptop	Macbook Pro Laptop	4721	1700.00
10	Laptop	ThinkPad Laptop	4126	999.99
11	Monitor	27in FHD Monitor	7498	149.99
12	Monitor	27in 4K Gaming Monitor	6225	389.99
13	Monitor	34in Ultrawide Monitor	6174	379.99
14	Monitor	20in Monitor	4098	109.99
15	Phone	iPhone	6840	700.00
16	Phone	Google Phone	5522	600.00
17	Phone	Vareebadd Phone	2065	400.00
18	TV	Flatscreen TV	4794	300.00

In [54]:

mhp=merged[merged["Product Category"]=="Headphones"]
mhp

Out[54]:

	Product Category	Product	Count	Mean Price
4	Headphones	Wired Headphones	18849	11.99
5	Headphones	Apple Airpods Headphones	15525	150.00
6	Headphones	Bose SoundSport Headphones	13298	99.99

This step shows the merged data for headphones, listing the units sold and their prices. The output confirms:

Wired Headphones: 11.99 d
Apple Airpods Headphones: 150.00 d
Bose SoundSport Headphones: 99.99 d The Apple Airpods Headphones have the highest price at 150.00 d, verifying the result from the first step.

Conclusion:¶

The Apple Airpods Headphones are the most expensive headphones in the dataset, with a price of $150.00. This high price point is likely due to the brand's premium positioning, advanced features, or market demand. Understanding the pricing of different products helps in pricing strategy decisions, market positioning, and inventory management.

Task 15. Check if the price of any product is changing in different months. If yes, explain the possible reason behind the price change¶

The task is to determine whether the price of any product changes across different months, and if so, to explore potential reasons for such variations.

1. Calculating Monthly Average Prices for Each Product¶

In [55]:

jan8=pd.DataFrame(data=data.groupby(["Product","Month"])["Price Each"].mean())
#pd.set_option('display.max_rows', None)
jan8

Out[55]:

		Price Each
Product	Month
20in Monitor	1	109.99
	2	109.99
	3	109.99
	4	109.99
	5	109.99
...	...	...
iPhone	8	700.00
	9	700.00
	10	700.00
	11	700.00
	12	700.00

228 rows × 1 columns

This step groups the data by Product and Month, then calculates the average price of each product for each month.

The result shows that there was no observed change in the average price of any product across different months. This implies that the prices remained consistent throughout the year.

Task 16. People prefer which product more in earphones, wired or wireless?¶

In [56]:

mhp

Out[56]:

	Product Category	Product	Count	Mean Price
4	Headphones	Wired Headphones	18849	11.99
5	Headphones	Apple Airpods Headphones	15525	150.00
6	Headphones	Bose SoundSport Headphones	13298	99.99

if we consider "bose soundsport headphones" to be wireless we can see that sum of apple and bose headphones (wireless) is greater than wired headphones, ie, 28823(sum of apple and bose headphones sold) > 18849 (wired headphones sold ).

Task 17. Which product in every category in more likely to be ordered in bulk?¶

The task is to determine which product within each category is most likely to be ordered in bulk, defined as orders of higher quantities.

1. Filtering Data by Product Category¶

In [57]:

catlist=data["Product Category"].unique().tolist()
catdic={}
for i in catlist:
    dt10=data[data["Product Category"]==i]
    

This step initializes the analysis by creating a list of unique product categories and iterates over each category to filter the dataset accordingly. However, the specific handling of each category (e.g., calculating bulk orders) is not completed in this snippet.

2. Counting Order Quantities¶

In [58]:

data["Quantity Ordered"].value_counts()

Out[58]:

1    168291
2     13321
3      2920
4       806
5       236
6        80
7        24
8         5
9         3
Name: Quantity Ordered, dtype: int64

The output shows the frequency of different order quantities. Orders of 1 are the most common, with 168,291 occurrences, while orders of higher quantities (e.g., 2, 3, 4) are less frequent. This distribution highlights that while bulk orders (e.g., quantities of 4 or more) are less common, they are still present in the dataset.

3. Analyzing Bulk Orders for Headphones¶

In [59]:

dt10=data[data["Product Category"]=="Headphones"]
dt10.groupby("Product")["Quantity Ordered"].max()

Out[59]:

Product
Apple Airpods Headphones      3
Bose SoundSport Headphones    3
Wired Headphones              4
Name: Quantity Ordered, dtype: int32

This step focuses on the "Headphones" category and identifies the maximum quantity ordered for each headphone product:

Apple Airpods Headphones: Maximum order quantity of 3
Bose SoundSport Headphones: Maximum order quantity of 3
Wired Headphones: Maximum order quantity of 4

Among the headphones, the Wired Headphones are most likely to be ordered in bulk, with the highest maximum order quantity of 4.

Overall Conclusion and Summary¶

Data Cleaning and Transformation:¶

Conversion of Categorical to Numeric: Converted text-based columns like Product into numerical formats for easier analysis.
Date Parsing: Split Order Date into separate columns for Day, Month, Year, and Time to facilitate time-based analysis.
Price Rounding: Created a Price Rounded column for simplicity in cost-related calculations.
PIN Code Extraction: Extracted PIN codes from the Purchase Address for geographical analysis.
City Extraction: Isolated city names from addresses for city-based sales analysis.
Product Categorization: Grouped products into categories (e.g., Phones, Accessories) to analyze sales trends within these categories.

Exploratory Data Analysis (EDA):¶

Best Month for Sales: Identified December as the best month for sales, with the highest total revenue. This suggests that end-of-year sales and holidays drive higher sales volumes.
City with Highest Sales: Found that San Francisco had the highest number of product sales, indicating it as a major market for the products.
Optimal Advertisement Time: Determined that the most purchases occur in the evening hours, suggesting that advertising during these times might be more effective.
Least Sold Product: Identified LG Washing Machine as the least sold product. The lower sales could be due to its high price compared to other products or lower consumer demand.
Seasonal Sales Patterns:
- Winter: USB-C Charging Cable is the most popular, likely due to increased indoor activity and device usage.
- Summer: Lightning Charging Cable sees higher sales, possibly due to the need for portable charging solutions during travel.
- Rainy Season: USB-C Charging Cable leads in sales again, indicating higher indoor activities during the rainy season.
Most Expensive Headphones: Apple Airpods Headphones are the most expensive, priced at D 150, reflecting their premium market positioning.
Bulk Orders by Product Category: For the Headphones category, Wired Headphones are most frequently ordered in larger quantities compared to other products.

Recommendations:¶

Marketing Strategy: Focus advertising efforts around December and in the evenings to capture peak purchasing times.
City-Specific Marketing: Tailor marketing and sales strategies for San Francisco to leverage its high sales potential.
Seasonal Promotions: Adjust product promotions and inventory based on seasonal trends to maximize sales during peak periods.
Product Pricing and Inventory: Consider revising the pricing strategy for high-ticket items like LG Washing Machines and analyze potential factors affecting their sales.

In [ ]:

Exploratory Data Analysis on Sales Data in Python

Introduction¶

Columns Description¶

Cleaning the data¶

Removing irrelevant rows.¶

Handling missing data.¶

Dropping duplicate values¶

Transforming the data¶

Task 1. Convert the text based categorical columns to numeric columns¶

1.Data Type Conversion¶

2. Label Encoding of Product Column¶

Task 2. Convert the Order Date column into Day, Month, Year and Time columns separately .¶

1. Converting Order Date to DateTime Format¶

2. Extracting Day, Month, Year, and Time Components¶

Task 3. Create a new column and store the Price data as rounded off in integer format¶

1. Rounding and Converting Price Data to Integer Format¶

Task 4. Create a new column and store the PIN Codes from the Address column¶

1. Splitting the Address Column to Extract PIN Codes¶

2. Storing the Extracted PIN Codes in a New Column¶

Task 5. Create a new column named City and store the city names from the Address column¶

1. Inspecting the Address Column¶

2. Storing the Extracted City Names in a New Column¶

Task 6. Group Product column into different categories such as Phones, Accessories, etc according to the data available.¶

1. Identifying Unique Products¶

2. Defining the Product Grouping Function¶

3. Creating the Product Category Column¶

Task 7. Create 12 separate date ordered dataframes for every month.¶

1. Identifying Unique Years¶

2. Filtering Data by Year¶

3. Initializing Data Structures for Months¶

4. Creating Separate DataFrames for Each Month¶

5: Displaying the January DataFrame:¶

Task 8. What was the best month for sales? How much was earned that month? Explain via graph/chart¶

1. Calculate Total Sales¶

2. Group Sales by Month and Sum Them:¶

3. Find the Maximum Sales Value:¶

4. Plot the Sales Data:¶

Conclusion :¶

Task 9. What city sold the most product?¶

1. Counting the Number of Sales Per City¶

2. Visualizing Sales Transactions Per City¶

Conclusion :¶

Task 10. What time should we display advertisements to maximize likelihood of customer's buying product?¶

1. Extracting the Hour of Purchase¶

2. Visualizing the Number of Purchases by Hour¶

3. Visualizing the Distribution of Purchases by Hour¶

Task 11. Which product was sold the least as per your categories and what might be the reason behind that. Explain your analysis.¶

1. Counting Sales per Product¶

2. Calculating Mean Price per Product¶

3. Merging Sales Count with Mean Price¶

Conclusion :¶

Task 12. Which product is most likely to be sold more during winter season, summer season and rainy season? Explain the possible reason behind the High-Volume Sale¶

1. Categorizing Data by Season¶

2. Grouping and Counting Sales by Season¶

3. Visualizing Product Sales by Season¶

Conclusion :¶

Task 13. Which phone is sold most during the month of March?¶

1. Filtering Data for March¶

2. Filtering for Phone Products¶

3. Counting Phone Sales in March¶

Conclusion :¶

Task 14. Which headphones are the most expensive?¶

1. Filtering and Identifying the Most Expensive Headphones¶

2. Reviewing Merged Data for Headphones¶

Conclusion:¶

Task 15. Check if the price of any product is changing in different months. If yes, explain the possible reason behind the price change¶

1. Calculating Monthly Average Prices for Each Product¶

Task 16. People prefer which product more in earphones, wired or wireless?¶

Task 17. Which product in every category in more likely to be ordered in bulk?¶

1. Filtering Data by Product Category¶

2. Counting Order Quantities¶

3. Analyzing Bulk Orders for Headphones¶

Overall Conclusion and Summary¶

Data Cleaning and Transformation:¶

Exploratory Data Analysis (EDA):¶

Recommendations:¶

Talk to our Industry Experts for Career Counselling

Company

Platform

Resources