About Dataset -¶
Dataset contain daily reports of all Power stations (regional-wise) with Expected power generation, actual power generation, Deviation, etc. in unit MU ( A million units, designated MU, is a gigawatt-hour ) and MW (megawatt)
Dataset Link -Kaggle link
Data Dictionary¶
Date
: The date of the power generation dataPower Station
: The names of the power stations for which the power generation data is recorded.Monitored Capacity (MW)
: The monitored capacity of the power stations, denoted in megawatts (MW).Total Capacity Under Maintenance (MW)
: The total capacity under maintenance for the power station, denoted in megawatts (MW).Planned Maintenance (MW)
: The capacity under planned maintenance for the power station, denoted in megawatts (MW).Forced Maintenance (MW)
: The capacity under forced maintenance for the power station, denoted in megawatts (MW).Other Reasons (MW)
: The capacity under maintenance due to other reasons for the power station, denoted in megawatts (MW).Programme (MW)
: The estimated or programmed power needed to be generated by the power station, denoted in megawatts (MW).Actual (MW)
: The actual power generated by the power station, denoted in megawatts (MW).Excess(+) / Shortfall (-) (MW)
: The difference between the actual and expected power generation. A positive value indicates overproduction, and a negative value indicates underproduction, denoted in megawatts (MW).Deviation (MW)
: The deviation of the actual power generated from the programmed power needed, denoted in megawatts (MW).
Installing dependency¶
👉Ignore It if already installed
1. !pip install numpy
2. !pip install pandas
3. !pip install matplotlib
4. !pip install seaborn
step -1 Data Preprocessing and Cleaning¶
Importing Required library¶
# perform linear operations
import numpy as np
# Data manipulation
import pandas as pd
#Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Remove warnings
import warnings
warnings.filterwarnings('ignore')
#Load the dataset
power=pd.read_csv(r"C:\Users\Lenovo\Downloads\content\Power Generation Data Analysis\PowerGeneration.csv")
# Print top 5 rows
power.head()
Date | Power Station | Monitored Cap.(MW) | Total Cap. Under Maintenace (MW) | Planned Maintanence (MW) | Forced Maintanence(MW) | Other Reasons (MW) | Programme (MW) | Actual (MW) | Excess(+) / Shortfall (-) (MW) | Deviation (MW) | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-09-2017 | Delhi | 2235.4 | 135.0 | 0.0 | 135.0 | 0.0 | 13.29 | 18.29 | 5.00 | 37.62 |
1 | 01-09-2017 | STPL | 1350.0 | 1350.0 | 0.0 | 1350.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
2 | 01-09-2017 | SPPL | 150.0 | 150.0 | 0.0 | 150.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
3 | 01-09-2017 | SPL | 3960.0 | 0.0 | 0.0 | 0.0 | 0.0 | 92.13 | 96.16 | 4.03 | 4.37 |
4 | 01-09-2017 | SKS | 600.0 | 300.0 | 0.0 | 300.0 | 0.0 | 6.84 | 7.18 | 0.34 | 4.97 |
# check for shape
power.shape
(334994, 11)
From above cell we see that the dataset is quite large it contains 334994 observations and 11 columns
#Check info of each colummn
power.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 334994 entries, 0 to 334993 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 334994 non-null object 1 Power Station 334994 non-null object 2 Monitored Cap.(MW) 334994 non-null float64 3 Total Cap. Under Maintenace (MW) 334994 non-null float64 4 Planned Maintanence (MW) 334994 non-null float64 5 Forced Maintanence(MW) 334994 non-null float64 6 Other Reasons (MW) 334994 non-null float64 7 Programme (MW) 334994 non-null float64 8 Actual (MW) 334994 non-null float64 9 Excess(+) / Shortfall (-) (MW) 334994 non-null float64 10 Deviation (MW) 334994 non-null float64 dtypes: float64(9), object(2) memory usage: 28.1+ MB
From above cell we see that there are 2 object column and 9 column contain float values
# Checking null values
power.isnull().sum()
Date 0 Power Station 0 Monitored Cap.(MW) 0 Total Cap. Under Maintenace (MW) 0 Planned Maintanence (MW) 0 Forced Maintanence(MW) 0 Other Reasons (MW) 0 Programme (MW) 0 Actual (MW) 0 Excess(+) / Shortfall (-) (MW) 0 Deviation (MW) 0 dtype: int64
From above cell we see that there are no missing values in our data
# check for duplicate
power.duplicated().sum()
142110
From above cell we see that the dataset contain 142110 duplicate observations out of 334994 observations, So we have to drop these duplicate rows
Here are some reasons why dropping duplicate values is common in data analysis:
Accuracy: Duplicate values can lead to inflated or inaccurate results, affecting statistical measures such as mean, median, and standard deviation.
Data integrity: Duplicates can create confusion and compromise the integrity of the dataset, leading to erroneous interpretations and conclusions.
Efficiency: Large datasets with duplicate values require more computational resources and time for analysis. Removing duplicates can streamline the analysis process and improve efficiency.
Redundancy: Duplicate values add redundancy to the dataset, taking up unnecessary storage space and potentially complicating data management.
power.drop_duplicates(inplace=True)
From above you noticed that the date
is stored in object format, let's convert it into datetime format.
power['Date']=pd.to_datetime(power['Date'])
power.Date.value_counts()
2017-01-09 192884 Name: Date, dtype: int64
power.shape
(192884, 11)
From above cells you noticed that the date is same for all the data or you can say that we have only one day data, hence we can drop the date column because it is not meaningfull now
power.drop(columns=['Date'],inplace=True)
# final data we have
power
Power Station | Monitored Cap.(MW) | Total Cap. Under Maintenace (MW) | Planned Maintanence (MW) | Forced Maintanence(MW) | Other Reasons (MW) | Programme (MW) | Actual (MW) | Excess(+) / Shortfall (-) (MW) | Deviation (MW) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Delhi | 2235.4 | 135.0 | 0.0 | 135.0 | 0.0 | 13.29 | 18.29 | 5.00 | 37.62 |
1 | STPL | 1350.0 | 1350.0 | 0.0 | 1350.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
2 | SPPL | 150.0 | 150.0 | 0.0 | 150.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 |
3 | SPL | 3960.0 | 0.0 | 0.0 | 0.0 | 0.0 | 92.13 | 96.16 | 4.03 | 4.37 |
4 | SKS | 600.0 | 300.0 | 0.0 | 300.0 | 0.0 | 6.84 | 7.18 | 0.34 | 4.97 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
334988 | TOR. POW. (SUGEN) | 1147.5 | 0.0 | 0.0 | 0.0 | 0.0 | 17.11 | 21.88 | 4.77 | 27.88 |
334989 | TATA PCL | 1430.0 | 500.0 | 0.0 | 500.0 | 0.0 | 16.50 | 16.98 | 0.48 | 2.91 |
334990 | TATA MAH. | 447.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.29 | 3.96 | -0.33 | -7.69 |
334991 | SVPPL | 63.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 1.22 | 1.22 | 0.00 |
334993 | ONGC | 726.6 | 0.0 | 0.0 | 0.0 | 0.0 | 11.78 | 14.56 | 2.78 | 23.60 |
192884 rows × 10 columns
shape=power.shape
print(f'The final dataset contains {shape[0]} observations and {shape[1]} columns')
The final dataset contains 192884 observations and 10 columns
Step -2 Data analysis¶
Let's ask some questions from our data
What is the average power generation across all power stations during the recorded period?¶
average_power_gen=power.groupby(['Power Station'])['Actual (MW)'].mean()
average_power_gen=average_power_gen.reset_index()
average_power_gen.sort_values(by='Actual (MW)',ascending=False,inplace=True)
average_power_gen
Power Station | Actual (MW) | |
---|---|---|
118 | NTPC Ltd. | 155.729818 |
103 | Maharashtra | 145.404000 |
40 | DVC | 108.029323 |
169 | Telangana | 97.712310 |
139 | Rajasthan | 92.398414 |
... | ... | ... |
23 | BSES AP | 0.000000 |
93 | LVTPL | 0.000000 |
92 | LVS POWER | 0.000000 |
90 | LBPL | 0.000000 |
31 | CPL | 0.000000 |
182 rows × 2 columns
# Top 10 power station according to average power generation
top_10_power_station=average_power_gen.head(10)
top_10_power_station
Power Station | Actual (MW) | |
---|---|---|
118 | NTPC Ltd. | 155.729818 |
103 | Maharashtra | 145.404000 |
40 | DVC | 108.029323 |
169 | Telangana | 97.712310 |
139 | Rajasthan | 92.398414 |
152 | SPL | 88.774070 |
172 | Uttar Pradesh | 84.987898 |
181 | West Bengal | 81.439698 |
14 | Andhra Pradesh. | 78.139310 |
168 | Tamil Nadu | 77.649947 |
ax=sns.barplot(x='Power Station',y='Actual (MW)',data=top_10_power_station)
for label in ax.containers:
ax.bar_label(label)
plt.xticks(rotation=45)
plt.title("Top 10 Power Station according to Power generation")
plt.show()
The bar plot demonstrates that NTPC Ltd. leads as the top power station in actual power generation, closely followed by Maharashtra, DVC, and Telangana, among others, solidifying their significant contributions to the power generation landscape.
Which power stations demonstrate the highest and lowest levels of excess or shortfall in power generation, and can we identify any underlying reasons for these discrepancies?¶
average_deviation = power.groupby('Power Station')['Deviation (MW)'].mean()
average_deviation=average_deviation.reset_index()
# Identify the power stations with the highest and lowest deviations
highest_deviation_stations = average_deviation.nlargest(5,'Deviation (MW)')
lowest_deviation_stations = average_deviation.nsmallest(5,'Deviation (MW)')
ax=sns.barplot(x='Power Station',y='Deviation (MW)',data=highest_deviation_stations)
for label in ax.containers:
ax.bar_label(label)
plt.title('Power stations with the highest levels of excess in power generation:')
plt.show()
Based on the above bar plot, it is evident that PPNPGCl
exhibits the highest deviation, followed by IEPL
in the second position, MEL
in the third position, GMR BHHPL
in the fourth position, and CLPINDIA
in the fifth position.
sns.barplot(x='Power Station',y='Deviation (MW)',data=lowest_deviation_stations)
plt.title('Power stations with the highest levels of shortfall in power generation:')
plt.show()
Based on the above bar plot, it is apparent that PVUNL
has the highest levels of shortfall in power generation, followed by NPGCPL
in the second position, NTPGPL
in the third position, NUPPL
in the fourth position, and GREL
in the fifth position.
How does the monitored capacity of the power stations correlate with their actual power generation, and are there any notable trends or patterns in this relationship?¶
correlation = power['Monitored Cap.(MW)'].corr(power['Actual (MW)'])
# Visualize the relationship between monitored capacity and actual power generation using a scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Monitored Cap.(MW)', y='Actual (MW)', data=power, color='skyblue')
plt.title('Relationship between Monitored Capacity and Actual Power Generation', fontsize=14)
plt.xlabel('Monitored Capacity (MW)', fontsize=12)
plt.ylabel('Actual Power Generation (MW)', fontsize=12)
plt.text(0.5, 0.9, f"Correlation: {correlation:.2f}", transform=plt.gca().transAxes, fontsize=12)
plt.show()
plt.figure(figsize=(8, 6))
sns.lineplot(x='Monitored Cap.(MW)', y='Actual (MW)', data=power, color='skyblue')
plt.title('Relationship between Monitored Capacity and Actual Power Generation', fontsize=14)
plt.xlabel('Monitored Capacity (MW)', fontsize=12)
plt.ylabel('Actual Power Generation (MW)', fontsize=12)
plt.text(0.5, 0.9, f"Correlation: {correlation:.2f}", transform=plt.gca().transAxes, fontsize=12)
plt.show()
The visual analysis confirms a compelling pattern: as actual power generation increases, the monitored capacity also demonstrates a corresponding increase. The high positive correlation coefficient of 0.93 emphasizes a strong linear relationship between the two variables, underscoring the significant impact of monitored capacity on actual power generation.
What is the overall trend in power generation for each power station, and how does it compare to the programmed power needed?¶
total_power_generation = power.groupby('Power Station')['Actual (MW)'].sum()
total_programmed_power = power.groupby('Power Station')['Programme (MW)'].sum()
plt.figure(figsize=(14, 6))
total_power_generation.plot(kind='line', label='Total Power Generation', marker='o')
total_programmed_power.plot(kind='line', label='Total Programmed Power', marker='o')
plt.title('Trends in Power Generation vs. Programmed Power', fontsize=16)
plt.xlabel('Power Station', fontsize=12)
plt.ylabel('Power (MW)', fontsize=12)
plt.legend()
plt.xticks(rotation=45)
plt.show()
The analysis of the line plot reveals a commendable trend, with the majority of the power stations successfully meeting their total programmed power targets. Notably, the 'PGPL' power station has exhibited exceptional performance by surpassing its programmed power generation, highlighting its robust operational efficiency and capacity for exceeding predefined goals.
Are there any significant correlations between the different types of maintenance and the deviations in power generation?¶
# Calculate the correlation coefficients
correlation_planned = power['Planned Maintanence (MW)'].corr(power['Deviation (MW)'])
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Planned Maintanence (MW)', y='Deviation (MW)', data=power, label=f'Correlation: {correlation_planned:.2f}'
,color='red')
plt.title('Correlation Between Planned Maintenance and Power Generation Deviations', fontsize=16)
plt.xlabel('Planned Maintenance (MW)', fontsize=12)
plt.ylabel('Deviation (MW)', fontsize=12)
plt.legend()
plt.show()
correlation_forced = power['Forced Maintanence(MW)'].corr(power['Deviation (MW)'])
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Forced Maintanence(MW)', y='Deviation (MW)', data=power, label=f'Correlation: {correlation_forced:.2f}')
plt.title('Correlation Between Force Maintenance and Power Generation Deviations', fontsize=16)
plt.xlabel('Forced Maintenance (MW)', fontsize=12)
plt.ylabel('Deviation (MW)', fontsize=12)
plt.legend()
plt.show()
1. Upon analysis, we observed that lower levels of planned and forced maintenance activities were associated with higher deviations in power generation. Conversely, an increase in both planned and forced maintenance levels corresponded to a reduction in deviation, ultimately leading to a deviation of 0.
2. Furthermore, the correlation coefficients revealed a weak negative correlation between planned maintenance and deviation (-0.08) and forced maintenance and deviation (-0.15), providing additional evidence of the inverse relationship between maintenance levels and deviations in power generation.
Recommendations and Conclusions -¶
Based on the analysis of the power generation dataset, the following recommendations and conclusions can be drawn:
Recommendations:
Implement Predictive Maintenance Strategies
: Given the significant impact of maintenance activities on power generation, it is advisable to adopt predictive maintenance techniques to minimize unexpected downtimes and optimize overall operational efficiency.Enhance Monitoring and Capacity Planning
: To ensure consistent power generation, it is crucial to continually monitor power stations' capacities and align them with the actual power generation requirements. Regular capacity assessments can help anticipate potential discrepancies and enable proactive capacity adjustments.Strengthen Maintenance Protocols
: Building robust maintenance protocols that effectively manage both planned and forced maintenance activities can help reduce power generation deviations. Implementing comprehensive maintenance schedules and strategies can minimize disruptions and ensure smoother operations.
Conclusions:
Operational Efficiency
: The analysis underscores the commendable operational efficiency demonstrated by several power stations in meeting their programmed power targets. Notably, the 'PGPL' power station's exceptional performance highlights the potential for surpassing set power generation goals through efficient operational practices.Maintenance-Deviation Relationship
: The inverse relationship between maintenance levels and deviations in power generation emphasizes the critical role of maintenance activities in ensuring consistent power production. As observed, maintaining optimal levels of planned and forced maintenance can significantly mitigate deviations and stabilize power generation outputs.Capacity-Generation Alignment
: The strong correlation between monitored capacity and actual power generation highlights the pivotal role of effective capacity planning in achieving consistent power generation levels. Aligning monitored capacities with actual power demands is crucial for maintaining a stable and reliable power supply.
By implementing these recommendations and acknowledging the insights gleaned from the analysis, power generation facilities can optimize their operations, enhance their maintenance strategies, and ensure consistent and reliable power generation, thus contributing to a more sustainable and efficient energy landscape.
Practice Questions -¶
- How do the different reasons for maintenance (planned, forced, and others) vary across the power stations, and are there any specific patterns or trends in the distribution of these maintenance reasons?
- Can we identify any seasonal fluctuations or patterns in the power generation data for specific power stations, and how do these seasonal variations impact overall power generation?
- What are the primary factors contributing to the excess or shortfall in power generation for each power station, and how do these factors compare in their influence on power generation discrepancies?
- How has the power generation trend evolved over the years for each power station, and what external factors might have contributed to any notable changes or shifts in these trends?
- Are there any notable disparities in the monitored capacity of the power stations, and how do these disparities impact the overall efficiency and productivity of the power generation facilities?