About Dataset -¶
Dataset contain daily reports of all Power stations (regional-wise) with Expected power generation, actual power generation, Deviation, etc. in unit MU ( A million units, designated MU, is a gigawatt-hour ) and MW (megawatt)
Dataset Link -Kaggle link
Data Dictionary¶
Date
: The date of the power generation dataPower Station
: The names of the power stations for which the power generation data is recorded.Monitored Capacity (MW)
: The monitored capacity of the power stations, denoted in megawatts (MW).Total Capacity Under Maintenance (MW)
: The total capacity under maintenance for the power station, denoted in megawatts (MW).Planned Maintenance (MW)
: The capacity under planned maintenance for the power station, denoted in megawatts (MW).Forced Maintenance (MW)
: The capacity under forced maintenance for the power station, denoted in megawatts (MW).Other Reasons (MW)
: The capacity under maintenance due to other reasons for the power station, denoted in megawatts (MW).Programme (MW)
: The estimated or programmed power needed to be generated by the power station, denoted in megawatts (MW).Actual (MW)
: The actual power generated by the power station, denoted in megawatts (MW).Excess(+) / Shortfall (-) (MW)
: The difference between the actual and expected power generation. A positive value indicates overproduction, and a negative value indicates underproduction, denoted in megawatts (MW).Deviation (MW)
: The deviation of the actual power generated from the programmed power needed, denoted in megawatts (MW).
Installing dependency¶
👉Ignore It if already installed
1. !pip install numpy
2. !pip install pandas
3. !pip install matplotlib
4. !pip install seaborn
step -1 Data Preprocessing and Cleaning¶
Importing Required library¶
# perform linear operations
import numpy as np
# Data manipulation
import pandas as pd
#Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Remove warnings
import warnings
warnings.filterwarnings('ignore')
#Load the dataset
power=pd.read_csv(r"C:\Users\Lenovo\Downloads\content\Power Generation Data Analysis\PowerGeneration.csv")
# Print top 5 rows
power.head()
Date | Power Station | Monitored Cap.(MW) | Total Cap. Under Maintenace (MW) | Planned Maintanence (MW) | Forced Maintanence(MW) | Other Reasons (MW) | Programme (MW) | Actual (MW) | Excess(+) / Shortfall (-) (MW) | Deviation (MW) | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-09-2017 | Delhi | 2235.4 | 135.0 | 0.0 | 135.0 | 0.0 | 13.29 | 18.29 | 5.00 | 37.62 |
1 | 01-09-2017 | STPL | 1350.0 | 1350.0 | 0.0 | 1350.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
2 | 01-09-2017 | SPPL | 150.0 | 150.0 | 0.0 | 150.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
3 | 01-09-2017 | SPL | 3960.0 | 0.0 | 0.0 | 0.0 | 0.0 | 92.13 | 96.16 | 4.03 | 4.37 |
4 | 01-09-2017 | SKS | 600.0 | 300.0 | 0.0 | 300.0 | 0.0 | 6.84 | 7.18 | 0.34 | 4.97 |
# check for shape
power.shape
(334994, 11)
From above cell we see that the dataset is quite large it contains 334994 observations and 11 columns
#Check info of each colummn
power.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 334994 entries, 0 to 334993 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 334994 non-null object 1 Power Station 334994 non-null object 2 Monitored Cap.(MW) 334994 non-null float64 3 Total Cap. Under Maintenace (MW) 334994 non-null float64 4 Planned Maintanence (MW) 334994 non-null float64 5 Forced Maintanence(MW) 334994 non-null float64 6 Other Reasons (MW) 334994 non-null float64 7 Programme (MW) 334994 non-null float64 8 Actual (MW) 334994 non-null float64 9 Excess(+) / Shortfall (-) (MW) 334994 non-null float64 10 Deviation (MW) 334994 non-null float64 dtypes: float64(9), object(2) memory usage: 28.1+ MB
From above cell we see that there are 2 object column and 9 column contain float values
# Checking null values
power.isnull().sum()
Date 0 Power Station 0 Monitored Cap.(MW) 0 Total Cap. Under Maintenace (MW) 0 Planned Maintanence (MW) 0 Forced Maintanence(MW) 0 Other Reasons (MW) 0 Programme (MW) 0 Actual (MW) 0 Excess(+) / Shortfall (-) (MW) 0 Deviation (MW) 0 dtype: int64
From above cell we see that there are no missing values in our data
# check for duplicate
power.duplicated().sum()
142110
From above cell we see that the dataset contain 142110 duplicate observations out of 334994 observations, So we have to drop these duplicate rows
Here are some reasons why dropping duplicate values is common in data analysis:
Accuracy: Duplicate values can lead to inflated or inaccurate results, affecting statistical measures such as mean, median, and standard deviation.
Data integrity: Duplicates can create confusion and compromise the integrity of the dataset, leading to erroneous interpretations and conclusions.
Efficiency: Large datasets with duplicate values require more computational resources and time for analysis. Removing duplicates can streamline the analysis process and improve efficiency.
Redundancy: Duplicate values add redundancy to the dataset, taking up unnecessary storage space and potentially complicating data management.
power.drop_duplicates(inplace=True)
From above you noticed that the date
is stored in object format, let's convert it into datetime format.
power['Date']=pd.to_datetime(power['Date'])
power.Date.value_counts()
2017-01-09 192884 Name: Date, dtype: int64
power.shape
(192884, 11)
From above cells you noticed that the date is same for all the data or you can say that we have only one day data, hence we can drop the date column because it is not meaningfull now
power.drop(columns=['Date'],inplace=True)
# final data we have
power
Power Station | Monitored Cap.(MW) | Total Cap. Under Maintenace (MW) | Planned Maintanence (MW) | Forced Maintanence(MW) | Other Reasons (MW) | Programme (MW) | Actual (MW) | Excess(+) / Shortfall (-) (MW) | Deviation (MW) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Delhi | 2235.4 | 135.0 | 0.0 | 135.0 | 0.0 | 13.29 | 18.29 | 5.00 | 37.62 |
1 | STPL | 1350.0 | 1350.0 | 0.0 | 1350.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
2 | SPPL | 150.0 | 150.0 | 0.0 | 150.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
3 | SPL | 3960.0 | 0.0 | 0.0 | 0.0 | 0.0 | 92.13 | 96.16 | 4.03 | 4.37 |
4 | SKS | 600.0 | 300.0 | 0.0 | 300.0 | 0.0 | 6.84 | 7.18 | 0.34 | 4.97 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
334988 | TOR. POW. (SUGEN) | 1147.5 | 0.0 | 0.0 | 0.0 | 0.0 | 17.11 | 21.88 | 4.77 | 27.88 |
334989 | TATA PCL | 1430.0 | 500.0 | 0.0 | 500.0 | 0.0 | 16.50 | 16.98 | 0.48 | 2.91 |
334990 | TATA MAH. | 447.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.29 | 3.96 | -0.33 | -7.69 |
334991 | SVPPL | 63.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 1.22 | 1.22 | 0.00 |
334993 | ONGC | 726.6 | 0.0 | 0.0 | 0.0 | 0.0 | 11.78 | 14.56 | 2.78 | 23.60 |
192884 rows × 10 columns
shape=power.shape
print(f'The final dataset contains {shape[0]} observations and {shape[1]} columns')
The final dataset contains 192884 observations and 10 columns
Step -2 Data analysis¶
Let's ask some questions from our data
What is the average power generation across all power stations during the recorded period?¶
average_power_gen=power.groupby(['Power Station'])['Actual (MW)'].mean()
average_power_gen=average_power_gen.reset_index()
average_power_gen.sort_values(by='Actual (MW)',ascending=False,inplace=True)
average_power_gen
Power Station | Actual (MW) | |
---|---|---|
118 | NTPC Ltd. | 155.729818 |
103 | Maharashtra | 145.404000 |
40 | DVC | 108.029323 |
169 | Telangana | 97.712310 |
139 | Rajasthan | 92.398414 |
... | ... | ... |
23 | BSES AP | 0.000000 |
93 | LVTPL | 0.000000 |
92 | LVS POWER | 0.000000 |
90 | LBPL | 0.000000 |
31 | CPL | 0.000000 |
182 rows × 2 columns
# Top 10 power station according to average power generation
top_10_power_station=average_power_gen.head(10)
top_10_power_station
Power Station | Actual (MW) | |
---|---|---|
118 | NTPC Ltd. | 155.729818 |
103 | Maharashtra | 145.404000 |
40 | DVC | 108.029323 |
169 | Telangana | 97.712310 |
139 | Rajasthan | 92.398414 |
152 | SPL | 88.774070 |
172 | Uttar Pradesh | 84.987898 |
181 | West Bengal | 81.439698 |
14 | Andhra Pradesh. | 78.139310 |
168 | Tamil Nadu | 77.649947 |
ax=sns.barplot(x='Power Station',y='Actual (MW)',data=top_10_power_station)
for label in ax.containers:
ax.bar_label(label)
plt.xticks(rotation=45)
plt.title("Top 10 Power Station according to Power generation")
plt.show()
The bar plot demonstrates that NTPC Ltd. leads as the top power station in actual power generation, closely followed by Maharashtra, DVC, and Telangana, among others, solidifying their significant contributions to the power generation landscape.
Which power stations demonstrate the highest and lowest levels of excess or shortfall in power generation, and can we identify any underlying reasons for these discrepancies?¶
average_deviation = power.groupby('Power Station')['Deviation (MW)'].mean()
average_deviation=average_deviation.reset_index()
# Identify the power stations with the highest and lowest deviations
highest_deviation_stations = average_deviation.nlargest(5,'Deviation (MW)')
lowest_deviation_stations = average_deviation.nsmallest(5,'Deviation (MW)')
ax=sns.barplot(x='Power Station',y='Deviation (MW)',data=highest_deviation_stations)
for label in ax.containers:
ax.bar_label(label)
plt.title('Power stations with the highest levels of excess in power generation:')
plt.show()
Based on the above bar plot, it is evident that PPNPGCl
exhibits the highest deviation, followed by IEPL
in the second position, MEL
in the third position, GMR BHHPL
in the fourth position, and CLPINDIA
in the fifth position.
sns.barplot(x='Power Station',y='Deviation (MW)',data=lowest_deviation_stations)
plt.title('Power stations with the highest levels of shortfall in power generation:')
plt.show()
Based on the above bar plot, it is apparent that PVUNL
has the highest levels of shortfall in power generation, followed by NPGCPL
in the second position, NTPGPL
in the third position, NUPPL
in the fourth position, and GREL
in the fifth position.
How does the monitored capacity of the power stations correlate with their actual power generation, and are there any notable trends or patterns in this relationship?¶
correlation = power['Monitored Cap.(MW)'].corr(power['Actual (MW)'])
# Visualize the relationship between monitored capacity and actual power generation using a scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Monitored Cap.(MW)', y='Actual (MW)', data=power, color='skyblue')
plt.title('Relationship between Monitored Capacity and Actual Power Generation', fontsize=14)
plt.xlabel('Monitored Capacity (MW)', fontsize=12)
plt.ylabel('Actual Power Generation (MW)', fontsize=12)
plt.text(0.5, 0.9, f"Correlation: {correlation:.2f}", transform=plt.gca().transAxes, fontsize=12)
plt.show()