Analyse the given dataset completely and perform the following questions.
Data Analysis Project by Anushka Khemaria, Team edSlash.
This dataset contains sales records from an unspecified store in USA, capturing various details about each transaction, including the product, quantity ordered, price, order date, and purchase address. With over 186,000 records, it offers a comprehensive view of the store's sales activities over time.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv(r"C:\Users\DELL\Downloads\Sales Dataset\Sales Dataset\SalesAnalysis.csv")
df.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | |
---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
df.dtypes
Order ID object Product object Quantity Ordered object Price Each object Order Date object Purchase Address object dtype: object
We observe that all the data is in object format so we will convert the relevant data to int or float depending on the requirement.
df.Product.value_counts()
USB-C Charging Cable 21903 Lightning Charging Cable 21658 AAA Batteries (4-pack) 20641 AA Batteries (4-pack) 20577 Wired Headphones 18882 Apple Airpods Headphones 15549 Bose SoundSport Headphones 13325 27in FHD Monitor 7507 iPhone 6842 27in 4K Gaming Monitor 6230 34in Ultrawide Monitor 6181 Google Phone 5525 Flatscreen TV 4800 Macbook Pro Laptop 4724 ThinkPad Laptop 4128 20in Monitor 4101 Vareebadd Phone 2065 LG Washing Machine 666 LG Dryer 646 Product 355 Name: Product, dtype: int64
Here we see that an item named "Product" is coming 355 times in the data.
df["Quantity Ordered"].value_counts()
1 168552 2 13324 3 2920 4 806 Quantity Ordered 355 5 236 6 80 7 24 8 5 9 3 Name: Quantity Ordered, dtype: int64
Similarly , here we see that "Quantity ordered" is occuring 355 in the data which can be a mistake so first we will observe and then remove those specific column with irrelevant data.
df["Price Each"].value_counts()
11.95 21903 14.95 21658 2.99 20641 3.84 20577 11.99 18882 150 15450 99.99 13325 149.99 7507 700 6804 389.99 6230 379.99 6181 600 5490 300 4780 1700 4702 999.99 4128 109.99 4101 400 2056 600.0 1347 Price Each 355 150.0 99 700.0 38 1700.0 22 300.0 20 400.0 9 Name: Price Each, dtype: int64
Here we see "Price each" is also repeating 355 times . Thus , we will filter the data based on the occurance of name " quantity ordered" in the dataset .
index_remove=df[df["Quantity Ordered"]=="Quantity Ordered"]
index_remove.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | |
---|---|---|---|---|---|---|
519 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
1149 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
1155 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
2878 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
2893 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
We observer that there are 355 columns which have column name in them insted of data so we will remove them.
First we will collect the index of such rows with irrelevant data in list and then we will use this list to drop the rows from the data.
index_remove=df[df["Quantity Ordered"]=="Quantity Ordered"].index.tolist()
data_string = [','.join(map(str, index_remove))]
data_string = list(map(int, data_string[0].split(',')))
print(data_string)
[519, 1149, 1155, 2878, 2893, 3036, 3209, 3618, 4138, 4645, 4794, 5303, 6939, 7497, 8635, 9189, 9329, 9973, 10885, 10958, 11272, 11619, 11742, 11778, 11874, 12455, 13362, 14022, 14522, 14860, 15280, 17208, 17324, 17339, 17744, 18594, 18626, 19674, 19795, 19810, 20112, 20687, 20865, 20951, 21051, 21739, 24641, 24817, 25370, 25503, 26140, 26292, 26950, 27107, 27596, 27602, 27988, 28600, 28721, 29088, 29387, 30648, 31099, 31495, 33269, 34102, 34187, 34206, 34612, 35537, 35687, 37002, 37146, 37761, 38098, 38175, 38488, 40811, 41916, 41939, 42010, 42730, 42910, 43067, 43555, 46967, 47181, 47406, 47479, 49042, 49468, 49720, 49872, 49974, 50103, 50435, 50712, 50792, 50943, 51380, 51542, 51943, 52448, 52769, 53592, 53731, 54142, 54586, 54616, 56994, 57415, 57561, 57736, 58602, 58639, 58653, 58871, 58937, 59342, 59524, 59581, 60029, 60826, 61610, 62468, 62812, 67039, 68620, 68649, 68741, 69444, 70010, 70662, 70794, 71159, 71170, 72448, 73140, 74053, 75091, 75221, 75818, 76858, 78045, 78661, 78708, 78894, 78990, 79849, 80221, 80717, 80763, 80783, 80956, 81103, 81764, 82340, 82428, 82435, 83126, 84483, 85068, 85398, 85430, 85520, 85530, 85549, 85920, 85977, 86559, 86722, 87155, 87229, 87579, 89259, 89307, 89951, 91178, 91799, 92631, 93320, 93325, 94767, 94853, 95559, 95838, 95870, 95970, 96500, 96518, 97014, 97659, 99368, 99892, 100547, 100559, 102084, 102425, 102454, 102688, 104945, 105696, 106127, 106193, 107242, 107295, 107370, 107748, 107991, 108602, 109285, 109313, 109486, 109519, 110083, 110961, 111021, 112090, 112148, 112636, 112654, 112918, 113104, 113676, 114812, 115304, 116302, 117537, 117568, 118033, 118218, 118744, 119036, 119496, 119679, 119889, 121453, 121462, 121464, 121807, 121844, 123434, 123532, 124425, 124955, 125563, 125662, 126010, 126111, 126494, 127873, 128103, 129288, 129593, 130041, 130047, 130303, 130982, 131083, 131241, 131437, 131640, 133194, 134551, 134780, 135908, 136452, 136723, 136994, 138296, 138643, 138744, 138850, 139015, 139253, 139431, 139785, 140246, 140441, 140560, 141344, 141904, 141988, 142302, 142352, 143301, 143326, 144319, 144687, 145838, 146250, 147069, 147122, 148373, 149084, 149456, 149906, 150791, 152145, 152548, 153138, 153778, 154171, 154271, 154383, 155306, 155618, 156965, 157173, 159033, 159050, 160520, 160671, 160967, 161352, 161868, 162292, 163496, 163808, 164347, 166833, 166976, 167273, 168535, 170324, 170469, 170511, 170617, 170929, 171225, 171810, 172006, 172143, 172816, 172856, 173402, 173673, 175088, 176843, 178896, 178954, 179487, 180743, 181368, 183502, 183808, 184489, 184666, 184779, 185118, 185164, 185551, 186563, 186632, 186738]
Above we created a list of all the indices now we will use this to remove rows.
dsremove=[519,1149,1155,2878,2893,3036,3209,3618,4138,4645,4794,5303,6939,7497,8635,9189,9329,9973,10885,10958,11272,11619,11742,11778,11874,12455,13362,14022,14522,14860,15280,17208,17324,17339,17744,18594,18626,19674,19795,19810,20112,20687,20865,20951,21051,21739,24641,24817,25370,25503,26140,26292,26950,27107,27596,27602,27988,28600,28721,29088,29387,30648,31099,31495,33269,34102,34187,34206,34612,35537,35687,37002,37146,37761,38098,38175,38488,40811,41916,41939,42010,42730,42910,43067,43555,46967,47181,47406,47479,49042,49468,49720,49872,49974,50103,50435,50712,50792,50943,51380,51542,51943,52448,52769,53592,53731,54142,54586,54616,56994,57415,57561,57736,58602,58639,58653,58871,58937,59342,59524,59581,60029,60826,61610,62468,62812,67039,68620,68649,68741,69444,70010,70662,70794,71159,71170,72448,73140,74053,75091,75221,75818,76858,78045,78661,78708,78894,78990,79849,80221,80717,80763,80783,80956,81103,81764,82340,82428,82435,83126,84483,85068,85398,85430,85520,85530,85549,85920,85977,86559,86722,87155,87229,87579,89259,89307,89951,91178,91799,92631,93320,93325,94767,94853,95559,95838,95870,95970,96500,96518,97014,97659,99368,99892,100547,100559,102084,102425,102454,102688,104945,105696,106127,106193,107242,107295,107370,107748,107991,108602,109285,109313,109486,109519,110083,110961,111021,112090,112148,112636,112654,112918,113104,113676,114812,115304,116302,117537,117568,118033,118218,118744,119036,119496,119679,119889,121453,121462,121464,121807,121844,123434,123532,124425,124955,125563,125662,126010,126111,126494,127873,128103,129288,129593,130041,130047,130303,130982,131083,131241,131437,131640,133194,134551,134780,135908,136452,136723,136994,138296,138643,138744,138850,139015,139253,139431,139785,140246,140441,140560,141344,141904,141988,142302,142352,143301,143326,144319,144687,145838,146250,147069,147122,148373,149084,149456,149906,150791,152145,152548,153138,153778,154171,154271,154383,155306,155618,156965,157173,159033,159050,160520,160671,160967,161352,161868,162292,163496,163808,164347,166833,166976,167273,168535,170324,170469,170511,170617,170929,171225,171810,172006,172143,172816,172856,173402,173673,175088,176843,178896,178954,179487,180743,181368,183502,183808,184489,184666,184779,185118,185164,185551,186563,186632,186738]
data= df.drop(dsremove, axis=0)
The specified rows are removed . Now we will check for null data and remove or replace those null values based on the observations.
data.isnull().sum()
Order ID 545 Product 545 Quantity Ordered 545 Price Each 545 Order Date 545 Purchase Address 545 dtype: int64
We see that there are 545 missing values in each column. So we will remove them by using dropna .
data.dropna(inplace=True)
data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | |
---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 |
data.isnull().sum()
Order ID 0 Product 0 Quantity Ordered 0 Price Each 0 Order Date 0 Purchase Address 0 dtype: int64
Null values are now removed .
Now we will see if there are any duplicate rows in the data.
duplicate_list=data[data.duplicated()]
duplicate_list
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | |
---|---|---|---|---|---|---|
31 | 176585 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 11:31 | 823 Highland St, Boston, MA 02215 |
1302 | 177795 | Apple Airpods Headphones | 1 | 150 | 04/27/19 19:45 | 740 14th St, Seattle, WA 98101 |
1684 | 178158 | USB-C Charging Cable | 1 | 11.95 | 04/28/19 21:13 | 197 Center St, San Francisco, CA 94016 |
3805 | 180207 | Apple Airpods Headphones | 1 | 150 | 04/13/19 01:46 | 196 7th St, Los Angeles, CA 90001 |
4196 | 180576 | Lightning Charging Cable | 1 | 14.95 | 04/18/19 17:23 | 431 Park St, Dallas, TX 75001 |
... | ... | ... | ... | ... | ... | ... |
184940 | 257530 | USB-C Charging Cable | 1 | 11.95 | 09/03/19 11:13 | 192 Johnson St, San Francisco, CA 94016 |
186173 | 258715 | Lightning Charging Cable | 1 | 14.95 | 09/15/19 16:50 | 550 10th St, Portland, OR 97035 |
186508 | 259035 | 27in FHD Monitor | 1 | 149.99 | 09/29/19 13:52 | 327 Lake St, San Francisco, CA 94016 |
186782 | 259296 | Apple Airpods Headphones | 1 | 150 | 09/28/19 16:48 | 894 6th St, Dallas, TX 75001 |
186785 | 259297 | Lightning Charging Cable | 1 | 14.95 | 09/15/19 18:54 | 138 Main St, Boston, MA 02215 |
264 rows × 6 columns
We observe that there are some duplicate rows as well so we will drop them also .
data=data.drop_duplicates()
data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | |
---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 |
3 | 176560 | Google Phone | 1 | 600 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 |
Hence , duplicate rows are dropped and now we can move to further analysis based on the tasks provided.
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 185686 entries, 0 to 186849 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order ID 185686 non-null object 1 Product 185686 non-null object 2 Quantity Ordered 185686 non-null object 3 Price Each 185686 non-null object 4 Order Date 185686 non-null object 5 Purchase Address 185686 non-null object dtypes: object(6) memory usage: 9.9+ MB
We observe that all the columns are in object format.
data['Order ID'] = data['Order ID'].astype('int32')
data['Quantity Ordered'] = data['Quantity Ordered'].astype('int32')
data['Price Each'] = data['Price Each'].astype('float64')
data.dtypes
Order ID int32 Product object Quantity Ordered int32 Price Each float64 Order Date object Purchase Address object dtype: object
The data types of 'Order ID', 'Quantity Ordered', and 'Price Each' columns are converted to 'int32' and 'float64', respectively, ensuring these columns can be used for numerical computations.
a=data.Product.unique().tolist()
def productnum(data): #label encoding
# global a
return a.index(data) + 1
data['productnum']=data.Product.apply(productnum)
data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | |
---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 04/19/19 08:46 | 917 1st St, Dallas, TX 75001 | 1 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 04/07/19 22:30 | 682 Chestnut St, Boston, MA 02215 | 2 |
3 | 176560 | Google Phone | 1 | 600.00 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 3 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 04/12/19 14:38 | 669 Spruce St, Los Angeles, CA 90001 | 4 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 04/30/19 09:27 | 333 8th St, Los Angeles, CA 90001 | 4 |
The 'Product' column is label-encoded, converting each unique product name into a unique numeric identifier, stored in the new 'productnum' column. This transformation allows the categorical product data to be used in numerical analyses and machine learning models.
data["Order Date"] = pd.to_datetime(data["Order Date"])
data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | |
---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 1 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 2 |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 3 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 |
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 185686 entries, 0 to 186849 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order ID 185686 non-null int32 1 Product 185686 non-null object 2 Quantity Ordered 185686 non-null int32 3 Price Each 185686 non-null float64 4 Order Date 185686 non-null datetime64[ns] 5 Purchase Address 185686 non-null object 6 productnum 185686 non-null int64 dtypes: datetime64[ns](1), float64(1), int32(2), int64(1), object(2) memory usage: 9.9+ MB
The 'Order Date' column is converted to a DateTime format, enabling the extraction of specific date and time components. This step is essential for performing further time-based operations and analyses.
data["Day"] = data["Order Date"].dt.day
data["Month"] = data["Order Date"].dt.month
data["Year"] = data["Order Date"].dt.year
data["Time"] = data["Order Date"].dt.time
data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 1 | 19 | 4 | 2019 | 08:46:00 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 2 | 7 | 4 | 2019 | 22:30:00 |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 3 | 12 | 4 | 2019 | 14:38:00 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 12 | 4 | 2019 | 14:38:00 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 | 30 | 4 | 2019 | 09:27:00 |
New columns 'Day', 'Month', 'Year', and 'Time' are created by extracting respective components from the 'Order Date' column. This detailed breakdown of the date and time allows for more granular analysis of sales trends and patterns over different periods.
data['Price Rounded'] = data['Price Each'].round().astype(int)
data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | Price Rounded | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 1 | 19 | 4 | 2019 | 08:46:00 | 12 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 2 | 7 | 4 | 2019 | 22:30:00 | 100 |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 3 | 12 | 4 | 2019 | 14:38:00 | 600 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 12 | 4 | 2019 | 14:38:00 | 12 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 | 30 | 4 | 2019 | 09:27:00 | 12 |
A new column 'Price Rounded' is created by rounding off the 'Price Each' values to the nearest integer and converting them to integer format. This transformation simplifies the price data, facilitating more straightforward analysis and potential aggregation without decimal precision.
address=data["Purchase Address"].str.split(",", expand=True)
pin=address[2].str.split(" ",expand=True)
pin.head()
0 | 1 | 2 | |
---|---|---|---|
0 | TX | 75001 | |
2 | MA | 02215 | |
3 | CA | 90001 | |
4 | CA | 90001 | |
5 | CA | 90001 |
The 'Purchase Address' column is split into multiple components using commas and spaces, isolating the segment that contains the PIN codes. This intermediate step helps in precisely locating and extracting the PIN codes from the address.
data["PIN"]=pin[2]
data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | Price Rounded | PIN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 1 | 19 | 4 | 2019 | 08:46:00 | 12 | 75001 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 2 | 7 | 4 | 2019 | 22:30:00 | 100 | 02215 |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 3 | 12 | 4 | 2019 | 14:38:00 | 600 | 90001 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 12 | 4 | 2019 | 14:38:00 | 12 | 90001 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 | 30 | 4 | 2019 | 09:27:00 | 12 | 90001 |
A new column 'PIN' is created to store the extracted PIN codes from the address. This addition allows for more detailed geographical analysis, enabling the identification of sales patterns and trends based on specific locations.
address.head()
0 | 1 | 2 | |
---|---|---|---|
0 | 917 1st St | Dallas | TX 75001 |
2 | 682 Chestnut St | Boston | MA 02215 |
3 | 669 Spruce St | Los Angeles | CA 90001 |
4 | 669 Spruce St | Los Angeles | CA 90001 |
5 | 333 8th St | Los Angeles | CA 90001 |
By inspecting the 'address' DataFrame, we observe that the city names are located in the second segment (index 1) of the split addresses. This step confirms the correct position of the city names, ensuring accurate extraction.
data["City"]=address[1]
data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | Price Rounded | PIN | City | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 1 | 19 | 4 | 2019 | 08:46:00 | 12 | 75001 | Dallas |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 2 | 7 | 4 | 2019 | 22:30:00 | 100 | 02215 | Boston |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 3 | 12 | 4 | 2019 | 14:38:00 | 600 | 90001 | Los Angeles |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 12 | 4 | 2019 | 14:38:00 | 12 | 90001 | Los Angeles |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 | 30 | 4 | 2019 | 09:27:00 | 12 | 90001 | Los Angeles |
A new column 'City' is created to store the city names extracted from the 'Purchase Address' column. This addition allows for detailed geographical analysis, enabling the identification of sales trends and patterns based on specific cities.
The task involves grouping the products into different categories such as Phones, Accessories, etc., based on their types, to enable better analysis and reporting.
data.Product.unique().tolist()
['USB-C Charging Cable', 'Bose SoundSport Headphones', 'Google Phone', 'Wired Headphones', 'Macbook Pro Laptop', 'Lightning Charging Cable', '27in 4K Gaming Monitor', 'AA Batteries (4-pack)', 'Apple Airpods Headphones', 'AAA Batteries (4-pack)', 'iPhone', 'Flatscreen TV', '27in FHD Monitor', '20in Monitor', 'LG Dryer', 'ThinkPad Laptop', 'Vareebadd Phone', 'LG Washing Machine', '34in Ultrawide Monitor']
By listing the unique product names, we can categorize them effectively. This step provides a clear overview of the product types available in the dataset, which is essential for accurate grouping.
def grpproduct(data):
if data in ["USB-C Charging Cable",'Lightning Charging Cable']:
return "Charging Cable"
elif data in ['Bose SoundSport Headphones','Wired Headphones','Apple Airpods Headphones']:
return "Headphones"
elif data in ['Google Phone','iPhone','Vareebadd Phone']:
return "Phone"
elif data in ['Macbook Pro Laptop','ThinkPad Laptop']:
return "Laptop"
elif data in ['27in 4K Gaming Monitor','27in FHD Monitor','20in Monitor','34in Ultrawide Monitor']:
return "Monitor"
elif data in ['AA Batteries (4-pack)','AAA Batteries (4-pack)']:
return "Batteries"
elif data in ['Flatscreen TV',]:
return "TV"
elif data in ['LG Dryer','LG Washing Machine']:
return "Home appliances"
else:
return np.nan
The function grpproduct is defined to categorize each product into predefined categories. This function maps each product to its corresponding category based on the provided list.
data["Product Category"]=data.Product.apply(grpproduct)
data
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | Price Rounded | PIN | City | Product Category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 1 | 19 | 4 | 2019 | 08:46:00 | 12 | 75001 | Dallas | Charging Cable |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 2 | 7 | 4 | 2019 | 22:30:00 | 100 | 02215 | Boston | Headphones |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 3 | 12 | 4 | 2019 | 14:38:00 | 600 | 90001 | Los Angeles | Phone |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 12 | 4 | 2019 | 14:38:00 | 12 | 90001 | Los Angeles | Headphones |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 | 30 | 4 | 2019 | 09:27:00 | 12 | 90001 | Los Angeles | Headphones |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
186845 | 259353 | AAA Batteries (4-pack) | 3 | 2.99 | 2019-09-17 20:56:00 | 840 Highland St, Los Angeles, CA 90001 | 10 | 17 | 9 | 2019 | 20:56:00 | 3 | 90001 | Los Angeles | Batteries |
186846 | 259354 | iPhone | 1 | 700.00 | 2019-09-01 16:00:00 | 216 Dogwood St, San Francisco, CA 94016 | 11 | 1 | 9 | 2019 | 16:00:00 | 700 | 94016 | San Francisco | Phone |
186847 | 259355 | iPhone | 1 | 700.00 | 2019-09-23 07:39:00 | 220 12th St, San Francisco, CA 94016 | 11 | 23 | 9 | 2019 | 07:39:00 | 700 | 94016 | San Francisco | Phone |
186848 | 259356 | 34in Ultrawide Monitor | 1 | 379.99 | 2019-09-19 17:30:00 | 511 Forest St, San Francisco, CA 94016 | 19 | 19 | 9 | 2019 | 17:30:00 | 380 | 94016 | San Francisco | Monitor |
186849 | 259357 | USB-C Charging Cable | 1 | 11.95 | 2019-09-30 00:18:00 | 250 Meadow St, San Francisco, CA 94016 | 1 | 30 | 9 | 2019 | 00:18:00 | 12 | 94016 | San Francisco | Charging Cable |
185686 rows × 15 columns
A new column 'Product Category' is created by applying the grpproduct function to the 'Product' column. This categorization allows for better organization and analysis of the data, enabling insights into the performance of different product categories.
data.Year.unique().tolist()
[2019, 2020]
By listing the unique years, we can filter the data accordingly. This step confirms the range of years present in the dataset, which is crucial for segregating the data by month.
data19=data[data.Year==2019]
data20=data[data.Year==2020]
The data is filtered into separate dataframes for the years 2019 and 2020. This segregation ensures that month-wise analysis can be accurately performed within each year.
ds_list={}
monthnum=[1,2,3,4,5,6,7,8,9,10,11,12]
monthlist=["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
A dictionary ds_list and lists monthnum and monthlist are initialized to store the month-wise dataframes and to facilitate iteration over the months. This setup is essential for creating and labeling the separate monthly dataframes.
for i in monthnum:
ds_list[monthlist[i-1]]=data.groupby("Month").get_group(i)
The data dataframe is grouped by the 'Month' column, and separate dataframes for each month are stored in the ds_list dictionary. This step ensures that each month has a dedicated dataframe for detailed analysis.
ds_list["Jan"].head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | Price Rounded | PIN | City | Product Category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
31957 | 297150 | Lightning Charging Cable | 1 | 14.95 | 2020-01-01 00:38:00 | 427 Wilson St, Dallas, TX 75001 | 6 | 1 | 1 | 2020 | 00:38:00 | 15 | 75001 | Dallas | Charging Cable |
32659 | 297817 | iPhone | 1 | 700.00 | 2020-01-01 00:22:00 | 519 13th St, New York City, NY 10001 | 11 | 1 | 1 | 2020 | 00:22:00 | 700 | 10001 | New York City | Phone |
32660 | 297817 | Lightning Charging Cable | 2 | 14.95 | 2020-01-01 00:22:00 | 519 13th St, New York City, NY 10001 | 6 | 1 | 1 | 2020 | 00:22:00 | 15 | 10001 | New York City | Charging Cable |
32819 | 297969 | Google Phone | 1 | 600.00 | 2020-01-01 00:54:00 | 542 2nd St, San Francisco, CA 94016 | 3 | 1 | 1 | 2020 | 00:54:00 | 600 | 94016 | San Francisco | Phone |
33200 | 298344 | Wired Headphones | 4 | 11.99 | 2020-01-01 03:34:00 | 731 7th St, New York City, NY 10001 | 4 | 1 | 1 | 2020 | 03:34:00 | 12 | 10001 | New York City | Headphones |
Displaying the first few rows of the January dataframe confirms that the data for January has been correctly extracted and stored. This validation step helps ensure the accuracy of the monthly segregation.
The goal is to identify the best month for sales and determine how much was earned during that month. We will use the sales data to calculate total sales for each month and then visualize this data using a bar chart to easily identify the best month for sales.
data["Total Sales"]= data["Quantity Ordered"]*data["Price Rounded"]
This step creates a new column, Total Sales, by multiplying the quantity of items ordered (Quantity Ordered) by the price per item (Price Rounded). This provides the total sales amount for each individual order.
sale=data.groupby("Month")["Total Sales"].sum()
sale
Month 1 1821811 2 2200572 3 2805594 4 3389991 5 3151289 6 2576850 7 2647047 8 2241581 9 2094958 10 3735611 11 3198606 12 4609333 Name: Total Sales, dtype: int32
Here, the data is grouped by the Month column, and the total sales for each month are summed up. This results in a new Series, sale, where the index is the month and the values are the total sales for that month.
sale.max()
4609333
This line finds the highest value in the sale Series, representing the month with the highest total sales. It identifies the peak sales performance across all months.
sns.barplot(x=sale.index , y= sale.values)
<AxesSubplot:xlabel='Month'>
A bar plot is created using the seaborn library, with months on the x-axis and total sales values on the y-axis. This visual representation makes it easy to compare sales performance across different months and clearly identify the best month for sales.
The bar plot visualizes the total earnings per month. It confirms that December was the best month for sales, with the highest total earnings of 4,583,267.77 dollars. This visualization effectively communicates the peak sales month, making it easy to identify trends and patterns.
Based on the calculated total sales for each month, the results indicate the following:
data.City.value_counts()
San Francisco 44662 Los Angeles 29564 New York City 24847 Boston 19901 Atlanta 14863 Dallas 14797 Seattle 14713 Portland 12449 Austin 9890 Name: City, dtype: int64
The output indicates that San Francisco has the highest number of sales transactions (44,662), followed by Los Angeles (29,564) and New York City (24,847). This step helps identify the city with the most sales activity.
sns.countplot(x=data.City)
plt.xticks(rotation=90)
plt.show()
The count plot visualizes the number of sales transactions across different cities. San Francisco is clearly the city with the highest sales volume, as reflected by the tallest bar in the plot. The rotation of x-axis labels ensures all city names are readable. This visualization provides a clear comparison of sales performance by city, highlighting San Francisco as the leading city in terms of product sales.
The analysis reveals that San Francisco leads in sales transactions compared to other cities. The visual representation confirms this by showing San Francisco's prominent position in the sales distribution, making it evident which city contributes the most to the overall sales figures.
The task aims to determine the optimal time to display advertisements by analyzing the hourly distribution of purchases to maximize the likelihood of customers buying products.
data["Hour"]=data["Order Date"].dt.hour
data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | Price Rounded | PIN | City | Product Category | Total Sales | Hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 1 | 19 | 4 | 2019 | 08:46:00 | 12 | 75001 | Dallas | Charging Cable | 24 | 8 |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 2 | 7 | 4 | 2019 | 22:30:00 | 100 | 02215 | Boston | Headphones | 100 | 22 |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 3 | 12 | 4 | 2019 | 14:38:00 | 600 | 90001 | Los Angeles | Phone | 600 | 14 |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 12 | 4 | 2019 | 14:38:00 | 12 | 90001 | Los Angeles | Headphones | 12 | 14 |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 | 30 | 4 | 2019 | 09:27:00 | 12 | 90001 | Los Angeles | Headphones | 12 | 9 |
The 'Hour' column is created to capture the hour at which each purchase was made. This step prepares the data for time-based analysis, allowing us to identify peak purchasing hours.
sns.countplot(x=data.Hour)
<AxesSubplot:xlabel='Hour', ylabel='count'>
The count plot displays the number of purchases made during each hour of the day. It highlights that 11-12 hour and 19 hour has the highest frequency of purchases. This visualization helps identify peak hours when customers are most active.
sns.displot(x=data.Hour, kde=True, bins=24)
plt.title('Distribution of Purchases by Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Purchases')
plt.grid(True)
plt.xticks(range(0, 24), rotation=90)
plt.show()
The distribution plot with a KDE curve shows the frequency of purchases throughout the day. The plot reveals 12 and 19 hours have the highest density of purchases, indicating the optimal times for displaying advertisements. The grid and x-axis labels make the plot easy to interpret, showing that the highest number of purchases typically occurs in the afternoon or late evening hours, suggesting these times might be ideal for ad placements.
The task is to identify which product was sold the least according to the product categories and analyze potential reasons behind its lower sales.
procount=data.groupby("Product Category")["Product"].value_counts().reset_index(name="Count")
procount
Product Category | Product | Count | |
---|---|---|---|
0 | Batteries | AAA Batteries (4-pack) | 20612 |
1 | Batteries | AA Batteries (4-pack) | 20558 |
2 | Charging Cable | USB-C Charging Cable | 21859 |
3 | Charging Cable | Lightning Charging Cable | 21610 |
4 | Headphones | Wired Headphones | 18849 |
5 | Headphones | Apple Airpods Headphones | 15525 |
6 | Headphones | Bose SoundSport Headphones | 13298 |
7 | Home appliances | LG Washing Machine | 666 |
8 | Home appliances | LG Dryer | 646 |
9 | Laptop | Macbook Pro Laptop | 4721 |
10 | Laptop | ThinkPad Laptop | 4126 |
11 | Monitor | 27in FHD Monitor | 7498 |
12 | Monitor | 27in 4K Gaming Monitor | 6225 |
13 | Monitor | 34in Ultrawide Monitor | 6174 |
14 | Monitor | 20in Monitor | 4098 |
15 | Phone | iPhone | 6840 |
16 | Phone | Google Phone | 5522 |
17 | Phone | Vareebadd Phone | 2065 |
18 | TV | Flatscreen TV | 4794 |
The output shows the count of sales for each product within its category. From this, it is evident that the products with the least sales are the LG Washing Machine and LG Dryer, with counts of 666 and 646, respectively, in the 'Home appliances' category. These are the least sold products among the listed categories.
meanprice=data.groupby("Product")["Price Each"].mean().reset_index()
meanprice.columns=["Product", "Mean Price"]
meanprice
Product | Mean Price | |
---|---|---|
0 | 20in Monitor | 109.99 |
1 | 27in 4K Gaming Monitor | 389.99 |
2 | 27in FHD Monitor | 149.99 |
3 | 34in Ultrawide Monitor | 379.99 |
4 | AA Batteries (4-pack) | 3.84 |
5 | AAA Batteries (4-pack) | 2.99 |
6 | Apple Airpods Headphones | 150.00 |
7 | Bose SoundSport Headphones | 99.99 |
8 | Flatscreen TV | 300.00 |
9 | Google Phone | 600.00 |
10 | LG Dryer | 600.00 |
11 | LG Washing Machine | 600.00 |
12 | Lightning Charging Cable | 14.95 |
13 | Macbook Pro Laptop | 1700.00 |
14 | ThinkPad Laptop | 999.99 |
15 | USB-C Charging Cable | 11.95 |
16 | Vareebadd Phone | 400.00 |
17 | Wired Headphones | 11.99 |
18 | iPhone | 700.00 |
The mean price data reveals that products like the LG Washing Machine and LG Dryer are priced at 600.00 dollars , which is relatively high compared to other products in the dataset. This suggests that their high prices may be a contributing factor to their lower sales volumes.
merged=pd.merge(procount, meanprice, on="Product", how="left")
merged
Product Category | Product | Count | Mean Price | |
---|---|---|---|---|
0 | Batteries | AAA Batteries (4-pack) | 20612 | 2.99 |
1 | Batteries | AA Batteries (4-pack) | 20558 | 3.84 |
2 | Charging Cable | USB-C Charging Cable | 21859 | 11.95 |
3 | Charging Cable | Lightning Charging Cable | 21610 | 14.95 |
4 | Headphones | Wired Headphones | 18849 | 11.99 |
5 | Headphones | Apple Airpods Headphones | 15525 | 150.00 |
6 | Headphones | Bose SoundSport Headphones | 13298 | 99.99 |
7 | Home appliances | LG Washing Machine | 666 | 600.00 |
8 | Home appliances | LG Dryer | 646 | 600.00 |
9 | Laptop | Macbook Pro Laptop | 4721 | 1700.00 |
10 | Laptop | ThinkPad Laptop | 4126 | 999.99 |
11 | Monitor | 27in FHD Monitor | 7498 | 149.99 |
12 | Monitor | 27in 4K Gaming Monitor | 6225 | 389.99 |
13 | Monitor | 34in Ultrawide Monitor | 6174 | 379.99 |
14 | Monitor | 20in Monitor | 4098 | 109.99 |
15 | Phone | iPhone | 6840 | 700.00 |
16 | Phone | Google Phone | 5522 | 600.00 |
17 | Phone | Vareebadd Phone | 2065 | 400.00 |
18 | TV | Flatscreen TV | 4794 | 300.00 |
The merged data provides a comprehensive view of both the sales count and mean price for each product. The LG Washing Machine and LG Dryer have the lowest sales counts and are also among the higher-priced items. This pricing could be a significant factor in their low sales, as higher-priced products may have a smaller customer base or face more competition compared to lower-priced items.
The analysis indicates that the LG Washing Machine and LG Dryer were sold the least. Their high prices, relative to other products, likely contribute to their lower sales volumes. Higher-priced items may have fewer buyers due to budget constraints or lower demand compared to more affordable products. Additionally, factors such as product necessity, market competition, and customer preferences may also influence the sales figures.
The task is to determine which products are most likely to be sold more during different seasons (winter, summer, and rainy) and to explain the reasons behind these seasonal sales trends.
def season(data):
if data in [11,12,1,2]: return "Winter"
elif data in [3,4,5,6]: return "Summer"
elif data in [7,8,9,10]: return "Rainy"
else : return np.nan
data["Season"]=data.Month.apply(season)
data.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | Price Rounded | PIN | City | Product Category | Total Sales | Hour | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 176558 | USB-C Charging Cable | 2 | 11.95 | 2019-04-19 08:46:00 | 917 1st St, Dallas, TX 75001 | 1 | 19 | 4 | 2019 | 08:46:00 | 12 | 75001 | Dallas | Charging Cable | 24 | 8 | Summer |
2 | 176559 | Bose SoundSport Headphones | 1 | 99.99 | 2019-04-07 22:30:00 | 682 Chestnut St, Boston, MA 02215 | 2 | 7 | 4 | 2019 | 22:30:00 | 100 | 02215 | Boston | Headphones | 100 | 22 | Summer |
3 | 176560 | Google Phone | 1 | 600.00 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 3 | 12 | 4 | 2019 | 14:38:00 | 600 | 90001 | Los Angeles | Phone | 600 | 14 | Summer |
4 | 176560 | Wired Headphones | 1 | 11.99 | 2019-04-12 14:38:00 | 669 Spruce St, Los Angeles, CA 90001 | 4 | 12 | 4 | 2019 | 14:38:00 | 12 | 90001 | Los Angeles | Headphones | 12 | 14 | Summer |
5 | 176561 | Wired Headphones | 1 | 11.99 | 2019-04-30 09:27:00 | 333 8th St, Los Angeles, CA 90001 | 4 | 30 | 4 | 2019 | 09:27:00 | 12 | 90001 | Los Angeles | Headphones | 12 | 9 | Summer |
The 'Season' column is created based on the month of the year, categorizing each month into Winter, Summer, or Rainy season. This categorization is crucial for analyzing seasonal sales patterns.
data.groupby("Season")["Product"].value_counts()
Season Product Rainy USB-C Charging Cable 6894 Lightning Charging Cable 6782 AA Batteries (4-pack) 6466 AAA Batteries (4-pack) 6409 Wired Headphones 5889 Apple Airpods Headphones 4831 Bose SoundSport Headphones 4180 27in FHD Monitor 2351 iPhone 2051 27in 4K Gaming Monitor 2018 34in Ultrawide Monitor 1885 Google Phone 1707 Flatscreen TV 1540 Macbook Pro Laptop 1478 ThinkPad Laptop 1302 20in Monitor 1283 Vareebadd Phone 616 LG Washing Machine 207 LG Dryer 181 Summer Lightning Charging Cable 7435 USB-C Charging Cable 7250 AA Batteries (4-pack) 7095 AAA Batteries (4-pack) 6972 Wired Headphones 6463 Apple Airpods Headphones 5321 Bose SoundSport Headphones 4592 27in FHD Monitor 2587 iPhone 2400 34in Ultrawide Monitor 2167 27in 4K Gaming Monitor 2034 Google Phone 1913 Macbook Pro Laptop 1649 Flatscreen TV 1581 ThinkPad Laptop 1420 20in Monitor 1369 Vareebadd Phone 734 LG Dryer 246 LG Washing Machine 244 Winter USB-C Charging Cable 7715 Lightning Charging Cable 7393 AAA Batteries (4-pack) 7231 AA Batteries (4-pack) 6997 Wired Headphones 6497 Apple Airpods Headphones 5373 Bose SoundSport Headphones 4526 27in FHD Monitor 2560 iPhone 2389 27in 4K Gaming Monitor 2173 34in Ultrawide Monitor 2122 Google Phone 1902 Flatscreen TV 1673 Macbook Pro Laptop 1594 20in Monitor 1446 ThinkPad Laptop 1404 Vareebadd Phone 715 LG Dryer 219 LG Washing Machine 215 Name: Product, dtype: int64
The output shows which products have the highest sales counts for each season. The most popular products by season are:
sns.countplot(x=data.Product, hue=data.Season)
plt.xticks(rotation=90)
plt.show()
The count plot visually represents the sales of different products across seasons. It confirms that:
Winter: The USB-C Charging Cable tops the sales list, likely because people may purchase more tech accessories during the colder months when they stay indoors more often and might need additional charging solutions for their devices.
Summer: The Lightning Charging Cable leads in sales. During summer, people might be on the go more frequently and need portable charging solutions, which explains the higher sales of this product.
Rainy: The USB-C Charging Cable again stands out. Rainy weather often leads to increased indoor activities, which could drive higher sales of tech products as people spend more time using their electronic devices.
Overall, these trends highlight how seasonal weather and activities influence purchasing behavior, suggesting that advertisements and promotions could be tailored to these seasonal patterns to maximize sales.
march=data.groupby("Month").get_group(3)
march.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | Price Rounded | PIN | City | Product Category | Total Sales | Hour | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55543 | 150533 | AAA Batteries (4-pack) | 1 | 2.99 | 2019-03-01 03:06:00 | 270 Dogwood St, San Francisco, CA 94016 | 10 | 1 | 3 | 2019 | 03:06:00 | 3 | 94016 | San Francisco | Batteries | 3 | 3 | Summer |
55551 | 150541 | AAA Batteries (4-pack) | 1 | 2.99 | 2019-03-01 01:03:00 | 683 Adams St, Portland, OR 97035 | 10 | 1 | 3 | 2019 | 01:03:00 | 3 | 97035 | Portland | Batteries | 3 | 1 | Summer |
55834 | 150812 | Wired Headphones | 1 | 11.99 | 2019-03-01 02:18:00 | 915 Maple St, San Francisco, CA 94016 | 4 | 1 | 3 | 2019 | 02:18:00 | 12 | 94016 | San Francisco | Headphones | 12 | 2 | Summer |
56817 | 151749 | Bose SoundSport Headphones | 1 | 99.99 | 2019-03-01 01:54:00 | 993 Washington St, San Francisco, CA 94016 | 2 | 1 | 3 | 2019 | 01:54:00 | 100 | 94016 | San Francisco | Headphones | 100 | 1 | Summer |
57120 | 152036 | ThinkPad Laptop | 1 | 999.99 | 2019-03-01 05:56:00 | 477 Washington St, San Francisco, CA 94016 | 16 | 1 | 3 | 2019 | 05:56:00 | 1000 | 94016 | San Francisco | Laptop | 1000 | 5 | Summer |
This step filters the dataset to include only sales data from March. It provides a subset of the data where the month is March, setting the stage for further analysis of phone sales during this month.
mphone=march[march["Product Category"]=="Phone"]
mphone.head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | Price Rounded | PIN | City | Product Category | Total Sales | Hour | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
105263 | 162009 | iPhone | 1 | 700.0 | 2019-03-28 20:59:00 | 942 Church St, Austin, TX 73301 | 11 | 28 | 3 | 2019 | 20:59:00 | 700 | 73301 | Austin | Phone | 700 | 20 | Summer |
105282 | 162026 | iPhone | 1 | 700.0 | 2019-03-22 22:31:00 | 71 Pine St, Austin, TX 73301 | 11 | 22 | 3 | 2019 | 22:31:00 | 700 | 73301 | Austin | Phone | 700 | 22 | Summer |
105286 | 162030 | Google Phone | 1 | 600.0 | 2019-03-09 16:50:00 | 463 4th St, San Francisco, CA 94016 | 3 | 9 | 3 | 2019 | 16:50:00 | 600 | 94016 | San Francisco | Phone | 600 | 16 | Summer |
105289 | 162033 | iPhone | 1 | 700.0 | 2019-03-19 11:06:00 | 267 Johnson St, Dallas, TX 75001 | 11 | 19 | 3 | 2019 | 11:06:00 | 700 | 75001 | Dallas | Phone | 700 | 11 | Summer |
105293 | 162037 | iPhone | 1 | 700.0 | 2019-03-23 02:11:00 | 277 Elm St, Los Angeles, CA 90001 | 11 | 23 | 3 | 2019 | 02:11:00 | 700 | 90001 | Los Angeles | Phone | 700 | 2 | Summer |
This step further filters the March data to include only products categorized as phones. It isolates the phone sales data from the overall March data, focusing on the relevant product category.
mphone.Product.value_counts()
iPhone 537 Google Phone 463 Vareebadd Phone 174 Name: Product, dtype: int64
The output reveals the number of units sold for each phone model in March:
The iPhone led in sales during March with 537 units sold, significantly outperforming other phone models. This high volume of sales might be attributed to factors such as brand popularity, promotions, or consumer preferences for the iPhone. The analysis highlights which phone model resonates most with customers in March, providing valuable insights for inventory management, marketing strategies, and sales forecasting.
hdphones=data[data["Product Category"]=="Headphones"]
hdphones[hdphones["Price Each"]==hdphones["Price Each"].max()].head()
Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address | productnum | Day | Month | Year | Time | Price Rounded | PIN | City | Product Category | Total Sales | Hour | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16 | 176572 | Apple Airpods Headphones | 1 | 150.0 | 2019-04-04 20:30:00 | 149 Dogwood St, New York City, NY 10001 | 9 | 4 | 4 | 2019 | 20:30:00 | 150 | 10001 | New York City | Headphones | 150 | 20 | Summer |
21 | 176576 | Apple Airpods Headphones | 1 | 150.0 | 2019-04-28 11:42:00 | 771 Ridge St, Los Angeles, CA 90001 | 9 | 28 | 4 | 2019 | 11:42:00 | 150 | 90001 | Los Angeles | Headphones | 150 | 11 | Summer |
22 | 176577 | Apple Airpods Headphones | 1 | 150.0 | 2019-04-04 19:25:00 | 260 Spruce St, Dallas, TX 75001 | 9 | 4 | 4 | 2019 | 19:25:00 | 150 | 75001 | Dallas | Headphones | 150 | 19 | Summer |
23 | 176578 | Apple Airpods Headphones | 1 | 150.0 | 2019-04-09 23:35:00 | 513 Church St, Boston, MA 02215 | 9 | 9 | 4 | 2019 | 23:35:00 | 150 | 02215 | Boston | Headphones | 150 | 23 | Summer |
38 | 176591 | Apple Airpods Headphones | 1 | 150.0 | 2019-04-21 07:21:00 | 600 Maple St, Austin, TX 73301 | 9 | 21 | 4 | 2019 | 07:21:00 | 150 | 73301 | Austin | Headphones | 150 | 7 | Summer |
This step filters the dataset to include only products in the "Headphones" category and identifies the headphone model with the highest price. The output reveals that the Apple Airpods Headphones are the most expensive headphones, priced at 150.00 dollars.
merged
Product Category | Product | Count | Mean Price | |
---|---|---|---|---|
0 | Batteries | AAA Batteries (4-pack) | 20612 | 2.99 |
1 | Batteries | AA Batteries (4-pack) | 20558 | 3.84 |
2 | Charging Cable | USB-C Charging Cable | 21859 | 11.95 |
3 | Charging Cable | Lightning Charging Cable | 21610 | 14.95 |
4 | Headphones | Wired Headphones | 18849 | 11.99 |
5 | Headphones | Apple Airpods Headphones | 15525 | 150.00 |
6 | Headphones | Bose SoundSport Headphones | 13298 | 99.99 |
7 | Home appliances | LG Washing Machine | 666 | 600.00 |
8 | Home appliances | LG Dryer | 646 | 600.00 |
9 | Laptop | Macbook Pro Laptop | 4721 | 1700.00 |
10 | Laptop | ThinkPad Laptop | 4126 | 999.99 |
11 | Monitor | 27in FHD Monitor | 7498 | 149.99 |
12 | Monitor | 27in 4K Gaming Monitor | 6225 | 389.99 |
13 | Monitor | 34in Ultrawide Monitor | 6174 | 379.99 |
14 | Monitor | 20in Monitor | 4098 | 109.99 |
15 | Phone | iPhone | 6840 | 700.00 |
16 | Phone | Google Phone | 5522 | 600.00 |
17 | Phone | Vareebadd Phone | 2065 | 400.00 |
18 | TV | Flatscreen TV | 4794 | 300.00 |
mhp=merged[merged["Product Category"]=="Headphones"]
mhp
Product Category | Product | Count | Mean Price | |
---|---|---|---|---|
4 | Headphones | Wired Headphones | 18849 | 11.99 |
5 | Headphones | Apple Airpods Headphones | 15525 | 150.00 |
6 | Headphones | Bose SoundSport Headphones | 13298 | 99.99 |
This step shows the merged data for headphones, listing the units sold and their prices. The output confirms:
The Apple Airpods Headphones are the most expensive headphones in the dataset, with a price of $150.00. This high price point is likely due to the brand's premium positioning, advanced features, or market demand. Understanding the pricing of different products helps in pricing strategy decisions, market positioning, and inventory management.
The task is to determine whether the price of any product changes across different months, and if so, to explore potential reasons for such variations.
jan8=pd.DataFrame(data=data.groupby(["Product","Month"])["Price Each"].mean())
#pd.set_option('display.max_rows', None)
jan8
Price Each | ||
---|---|---|
Product | Month | |
20in Monitor | 1 | 109.99 |
2 | 109.99 | |
3 | 109.99 | |
4 | 109.99 | |
5 | 109.99 | |
... | ... | ... |
iPhone | 8 | 700.00 |
9 | 700.00 | |
10 | 700.00 | |
11 | 700.00 | |
12 | 700.00 |
228 rows × 1 columns
This step groups the data by Product and Month, then calculates the average price of each product for each month.
The result shows that there was no observed change in the average price of any product across different months. This implies that the prices remained consistent throughout the year.
mhp
Product Category | Product | Count | Mean Price | |
---|---|---|---|---|
4 | Headphones | Wired Headphones | 18849 | 11.99 |
5 | Headphones | Apple Airpods Headphones | 15525 | 150.00 |
6 | Headphones | Bose SoundSport Headphones | 13298 | 99.99 |
if we consider "bose soundsport headphones" to be wireless we can see that sum of apple and bose headphones (wireless) is greater than wired headphones, ie, 28823(sum of apple and bose headphones sold) > 18849 (wired headphones sold ).
catlist=data["Product Category"].unique().tolist()
catdic={}
for i in catlist:
dt10=data[data["Product Category"]==i]
This step initializes the analysis by creating a list of unique product categories and iterates over each category to filter the dataset accordingly. However, the specific handling of each category (e.g., calculating bulk orders) is not completed in this snippet.
data["Quantity Ordered"].value_counts()
1 168291 2 13321 3 2920 4 806 5 236 6 80 7 24 8 5 9 3 Name: Quantity Ordered, dtype: int64
The output shows the frequency of different order quantities. Orders of 1 are the most common, with 168,291 occurrences, while orders of higher quantities (e.g., 2, 3, 4) are less frequent. This distribution highlights that while bulk orders (e.g., quantities of 4 or more) are less common, they are still present in the dataset.
dt10=data[data["Product Category"]=="Headphones"]
dt10.groupby("Product")["Quantity Ordered"].max()
Product Apple Airpods Headphones 3 Bose SoundSport Headphones 3 Wired Headphones 4 Name: Quantity Ordered, dtype: int32
This step focuses on the "Headphones" category and identifies the maximum quantity ordered for each headphone product:
Among the headphones, the Wired Headphones are most likely to be ordered in bulk, with the highest maximum order quantity of 4.
Seasonal Sales Patterns:
Office:- 660, Sector 14A, Vasundhara, Ghaziabad, Uttar Pradesh - 201012, India