About Dataset

The dataset contains daily reports for all power stations (region-wise), including expected power generation, actual power generation, and deviation. Values are given in MU (a million units, designated MU, is one gigawatt-hour) and MW (megawatts).
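To relate the two units, recall that 1 MU = 1 GWh = 1,000 MWh. The sketch below converts a daily energy figure in MU into the average power in MW. The helper name `mu_to_average_mw` and the 24-hour reporting window are assumptions made for illustration, not something the dataset documents.

```python
HOURS_PER_DAY = 24  # assumption: each report covers one full day


def mu_to_average_mw(energy_mu: float, hours: float = HOURS_PER_DAY) -> float:
    """Convert energy in MU (1 MU = 1 GWh = 1000 MWh) to average power in MW."""
    energy_mwh = energy_mu * 1000
    return energy_mwh / hours


# Illustrative value only, not taken from the dataset.
print(mu_to_average_mw(24.0))
```

Energy (MU) and power (MW) are different quantities, so a conversion like this is only meaningful once the time window is fixed.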

Dataset Link - Kaggle link

Data Dictionary

Date: The date of the power generation data
Power Station: The names of the power stations for which the power generation data is recorded.
Monitored Capacity (MW): The monitored capacity of the power stations, denoted in megawatts (MW).
Total Capacity Under Maintenance (MW): The total capacity under maintenance for the power station, denoted in megawatts (MW).
Planned Maintenance (MW): The capacity under planned maintenance for the power station, denoted in megawatts (MW).
Forced Maintenance (MW): The capacity under forced maintenance for the power station, denoted in megawatts (MW).
Other Reasons (MW): The capacity under maintenance due to other reasons for the power station, denoted in megawatts (MW).
Programme (MW): The estimated or programmed power needed to be generated by the power station, denoted in megawatts (MW).
Actual (MW): The actual power generated by the power station, denoted in megawatts (MW).
Excess(+) / Shortfall (-) (MW): The difference between the actual and expected power generation. A positive value indicates overproduction, and a negative value indicates underproduction, denoted in megawatts (MW).
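Deviation (MW): The deviation of the actual power generated from the programmed power needed. Despite the (MW) label, the sample rows in this notebook match the excess or shortfall expressed as a percentage of the programmed value (for Delhi, 5.00 / 13.29 ≈ 37.62%).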
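The sample rows printed in this notebook make it possible to check how the last two columns relate to Programme and Actual. The sketch below is a minimal check on five of those rows. It assumes that Deviation is Excess divided by Programme, times 100, and that it is reported as 0 when Programme is 0. This is inferred from the printed values, not documented by the dataset.

```python
import numpy as np
import pandas as pd

# Rows copied from the table outputs shown in this notebook.
sample = pd.DataFrame(
    {
        "Power Station": ["Delhi", "SPL", "SKS", "TATA MAH.", "SVPPL"],
        "Programme": [13.29, 92.13, 6.84, 4.29, 0.00],
        "Actual": [18.29, 96.16, 7.18, 3.96, 1.22],
        "Excess": [5.00, 4.03, 0.34, -0.33, 1.22],
        "Deviation": [37.62, 4.37, 4.97, -7.69, 0.00],
    }
)

# Excess / shortfall is the plain difference between actual and programme.
computed_excess = (sample["Actual"] - sample["Programme"]).round(2)

# Assumption: deviation = excess / programme * 100, reported as 0 when
# the programme is 0 (as in the SVPPL row).
programme = sample["Programme"].replace(0, np.nan)
computed_deviation = (sample["Excess"] / programme * 100).fillna(0.0)

print(np.allclose(computed_excess, sample["Excess"]))
print(np.allclose(computed_deviation, sample["Deviation"], atol=0.006))
```

The tolerance of 0.006 allows for the two-decimal rounding in the printed tables.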

Installing dependencies

👉 Skip this step if the packages are already installed

1. !pip install numpy
2. !pip install pandas
3. !pip install matplotlib
4. !pip install seaborn

Step 1: Data Preprocessing and Cleaning

Importing Required Libraries

In [1]:

# Numerical operations
import numpy as np

# Data manipulation
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:

# Load the dataset
power=pd.read_csv(r"C:\Users\Lenovo\Downloads\content\Power Generation Data Analysis\PowerGeneration.csv")

# Print top 5 rows
power.head()

Out[2]:

	Date	Power Station	Monitored Cap.(MW)	Total Cap. Under Maintenace (MW)	Planned Maintanence (MW)	Forced Maintanence(MW)	Other Reasons (MW)	Programme (MW)	Actual (MW)	Excess(+) / Shortfall (-) (MW)	Deviation (MW)
0	01-09-2017	Delhi	2235.4	135.0	0.0	135.0	0.0	13.29	18.29	5.00	37.62
1	01-09-2017	STPL	1350.0	1350.0	0.0	1350.0	0.0	0.00	0.00	0.00	0.00
2	01-09-2017	SPPL	150.0	150.0	0.0	150.0	0.0	0.00	0.00	0.00	0.00
3	01-09-2017	SPL	3960.0	0.0	0.0	0.0	0.0	92.13	96.16	4.03	4.37
4	01-09-2017	SKS	600.0	300.0	0.0	300.0	0.0	6.84	7.18	0.34	4.97

In [3]:

# check for shape
power.shape

Out[3]:

(334994, 11)

The output above shows that the dataset is quite large: it contains 334994 observations and 11 columns.

In [4]:

# Check info of each column
power.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 334994 entries, 0 to 334993
Data columns (total 11 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   Date                              334994 non-null  object 
 1   Power Station                     334994 non-null  object 
 2   Monitored Cap.(MW)                334994 non-null  float64
 3   Total Cap. Under Maintenace (MW)  334994 non-null  float64
 4   Planned Maintanence (MW)          334994 non-null  float64
 5   Forced Maintanence(MW)            334994 non-null  float64
 6   Other Reasons (MW)                334994 non-null  float64
 7   Programme (MW)                    334994 non-null  float64
 8   Actual (MW)                       334994 non-null  float64
 9   Excess(+) / Shortfall (-) (MW)    334994 non-null  float64
 10  Deviation (MW)                    334994 non-null  float64
dtypes: float64(9), object(2)
memory usage: 28.1+ MB

The output above shows that 2 columns have object dtype and 9 columns contain float values.

In [5]:

# Checking null values
power.isnull().sum()

Out[5]:

Date                                0
Power Station                       0
Monitored Cap.(MW)                  0
Total Cap. Under Maintenace (MW)    0
Planned Maintanence (MW)            0
Forced Maintanence(MW)              0
Other Reasons (MW)                  0
Programme (MW)                      0
Actual (MW)                         0
Excess(+) / Shortfall (-) (MW)      0
Deviation (MW)                      0
dtype: int64

The output above shows that there are no missing values in our data.

In [6]:

# Check for duplicate rows
power.duplicated().sum()

Out[6]:

142110

The output above shows that the dataset contains 142110 duplicate observations out of 334994, so we have to drop these duplicate rows.

Here are some reasons why dropping duplicate values is common in data analysis:

Accuracy: Duplicate values can lead to inflated or inaccurate results, affecting statistical measures such as mean, median, and standard deviation.

Data integrity: Duplicates can create confusion and compromise the integrity of the dataset, leading to erroneous interpretations and conclusions.

Efficiency: Large datasets with duplicate values require more computational resources and time for analysis. Removing duplicates can streamline the analysis process and improve efficiency.

Redundancy: Duplicate values add redundancy to the dataset, taking up unnecessary storage space and potentially complicating data management.
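The accuracy point can be illustrated with a small example. The sketch below uses made-up values (not rows from the dataset) to show how repeated rows pull the mean toward the duplicated value, and how `drop_duplicates` restores it.

```python
import pandas as pd

# Toy data (not from the dataset): the second row is repeated twice more.
toy = pd.DataFrame(
    {
        "Power Station": ["A", "B", "B", "B"],
        "Actual (MW)": [10.0, 40.0, 40.0, 40.0],
    }
)

n_duplicates = toy.duplicated().sum()                      # repeats of an earlier row
mean_with = toy["Actual (MW)"].mean()                      # mean with duplicates
mean_without = toy.drop_duplicates()["Actual (MW)"].mean() # mean after removal

print(n_duplicates, mean_with, mean_without)
```

Dropping is only appropriate when repeated rows are recording errors. If identical rows can legitimately occur, inspect them first, for example with `toy[toy.duplicated(keep=False)]`.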

In [7]:

power.drop_duplicates(inplace=True)

The info output above shows that the Date column is stored as object dtype, so let's convert it to datetime format.

In [8]:

# Note: no format is given, so pandas reads "01-09-2017" month-first (9 January 2017)
power['Date']=pd.to_datetime(power['Date'])
power.Date.value_counts()

Out[8]:

2017-01-09    192884
Name: Date, dtype: int64

In [9]:

power.shape

Out[9]:

(192884, 11)

The cells above show that every row has the same date, meaning we have only one day of data. The Date column therefore carries no information, and we can drop it.
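The raw string `01-09-2017` is ambiguous: it can mean 9 January or 1 September. The sketch below shows how the two readings differ. It is an assumption, not something the source states, that the file may use day-month-year order, so check the data provider's convention before choosing a format.

```python
import pandas as pd

raw = pd.Series(["01-09-2017"])

# Default parsing treats the first field as the month.
month_first = pd.to_datetime(raw)

# Assumption: the source uses day-month-year order.
day_first = pd.to_datetime(raw, format="%d-%m-%Y")

print(month_first.dt.date.iloc[0])
print(day_first.dt.date.iloc[0])
```

Passing an explicit `format` removes the ambiguity and makes the parsing choice visible to the reader.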

In [10]:

power.drop(columns=['Date'],inplace=True)

In [11]:

# final data we have
power

Out[11]:

	Power Station	Monitored Cap.(MW)	Total Cap. Under Maintenace (MW)	Planned Maintanence (MW)	Forced Maintanence(MW)	Other Reasons (MW)	Programme (MW)	Actual (MW)	Excess(+) / Shortfall (-) (MW)	Deviation (MW)
0	Delhi	2235.4	135.0	0.0	135.0	0.0	13.29	18.29	5.00	37.62
1	STPL	1350.0	1350.0	0.0	1350.0	0.0	0.00	0.00	0.00	0.00
2	SPPL	150.0	150.0	0.0	150.0	0.0	0.00	0.00	0.00	0.00
3	SPL	3960.0	0.0	0.0	0.0	0.0	92.13	96.16	4.03	4.37
4	SKS	600.0	300.0	0.0	300.0	0.0	6.84	7.18	0.34	4.97
...	...	...	...	...	...	...	...	...	...	...
334988	TOR. POW. (SUGEN)	1147.5	0.0	0.0	0.0	0.0	17.11	21.88	4.77	27.88
334989	TATA PCL	1430.0	500.0	0.0	500.0	0.0	16.50	16.98	0.48	2.91
334990	TATA MAH.	447.0	0.0	0.0	0.0	0.0	4.29	3.96	-0.33	-7.69
334991	SVPPL	63.0	0.0	0.0	0.0	0.0	0.00	1.22	1.22	0.00
334993	ONGC	726.6	0.0	0.0	0.0	0.0	11.78	14.56	2.78	23.60

192884 rows × 10 columns

In [14]:

shape=power.shape
print(f'The final dataset contains {shape[0]} observations and {shape[1]} columns')

The final dataset contains 192884 observations and 10 columns

Step 2: Data Analysis

Let's ask some questions of our data

What is the average power generation across all power stations during the recorded period?

In [29]:

average_power_gen=power.groupby(['Power Station'])['Actual (MW)'].mean()
average_power_gen=average_power_gen.reset_index()
average_power_gen.sort_values(by='Actual (MW)',ascending=False,inplace=True)
average_power_gen

Out[29]:

	Power Station	Actual (MW)
118	NTPC Ltd.	155.729818
103	Maharashtra	145.404000
40	DVC	108.029323
169	Telangana	97.712310
139	Rajasthan	92.398414
...	...	...
23	BSES AP	0.000000
93	LVTPL	0.000000
92	LVS POWER	0.000000
90	LBPL	0.000000
31	CPL	0.000000

182 rows × 2 columns

In [30]:

# Top 10 power stations by average power generation
top_10_power_station=average_power_gen.head(10)
top_10_power_station

Out[30]:

	Power Station	Actual (MW)
118	NTPC Ltd.	155.729818
103	Maharashtra	145.404000
40	DVC	108.029323
169	Telangana	97.712310
139	Rajasthan	92.398414
152	SPL	88.774070
172	Uttar Pradesh	84.987898
181	West Bengal	81.439698
14	Andhra Pradesh.	78.139310
168	Tamil Nadu	77.649947

In [36]:

ax=sns.barplot(x='Power Station',y='Actual (MW)',data=top_10_power_station)
for label in ax.containers:
    ax.bar_label(label)
plt.xticks(rotation=45)
plt.title("Top 10 Power Stations by Average Power Generation")
plt.show()

The bar plot shows that NTPC Ltd. has the highest average actual generation (about 155.7), followed by Maharashtra (145.4), DVC (108.0), and Telangana (97.7).

Which power stations demonstrate the highest and lowest levels of excess or shortfall in power generation, and can we identify any underlying reasons for these discrepancies?

In [53]:

average_deviation = power.groupby('Power Station')['Deviation (MW)'].mean()
average_deviation=average_deviation.reset_index()

# Identify the power stations with the highest and lowest deviations
highest_deviation_stations = average_deviation.nlargest(5,'Deviation (MW)')
lowest_deviation_stations = average_deviation.nsmallest(5,'Deviation (MW)')

In [52]:

ax=sns.barplot(x='Power Station',y='Deviation (MW)',data=highest_deviation_stations)
for label in ax.containers:
    ax.bar_label(label)
plt.title('Power stations with the highest levels of excess in power generation:')
plt.show()

Based on the bar plot above, PPNPGCl has the highest average deviation, followed by IEPL, MEL, GMR BHHPL, and CLPINDIA.

In [55]:

sns.barplot(x='Power Station',y='Deviation (MW)',data=lowest_deviation_stations)

plt.title('Power stations with the highest levels of shortfall in power generation:')
plt.show()

Based on the bar plot above, PVUNL has the largest average shortfall in power generation, followed by NPGCPL, NTPGPL, NUPPL, and GREL.

How does the monitored capacity of the power stations correlate with their actual power generation, and are there any notable trends or patterns in this relationship?

In [58]:

correlation = power['Monitored Cap.(MW)'].corr(power['Actual (MW)'])

# Visualize the relationship between monitored capacity and actual power generation using a scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Monitored Cap.(MW)', y='Actual (MW)', data=power, color='skyblue')
plt.title('Relationship between Monitored Capacity and Actual Power Generation', fontsize=14)
plt.xlabel('Monitored Capacity (MW)', fontsize=12)
plt.ylabel('Actual Power Generation (MW)', fontsize=12)
plt.text(0.5, 0.9, f"Correlation: {correlation:.2f}", transform=plt.gca().transAxes, fontsize=12)
plt.show()

In [59]:

plt.figure(figsize=(8, 6))
sns.lineplot(x='Monitored Cap.(MW)', y='Actual (MW)', data=power, color='skyblue')
plt.title('Relationship between Monitored Capacity and Actual Power Generation', fontsize=14)
plt.xlabel('Monitored Capacity (MW)', fontsize=12)
plt.ylabel('Actual Power Generation (MW)', fontsize=12)
plt.text(0.5, 0.9, f"Correlation: {correlation:.2f}", transform=plt.gca().transAxes, fontsize=12)
plt.show()

The scatter and line plots show a clear pattern: stations with higher monitored capacity tend to report higher actual generation. The Pearson correlation coefficient of 0.93 indicates a strong positive linear relationship between the two variables. Correlation alone does not establish causation, but the result is consistent with capacity being a key driver of how much a station generates.
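For reference, `Series.corr` computes the Pearson coefficient by default, which is the covariance of the two variables divided by the product of their standard deviations. The sketch below reproduces that calculation by hand on made-up values (not rows from the dataset) and confirms that the coefficient is symmetric in its two arguments.

```python
import numpy as np
import pandas as pd

# Toy values (not from the dataset).
capacity = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])
actual = pd.Series([2.0, 4.5, 6.0, 9.0, 10.5])

# Pearson r = cov(x, y) / (std(x) * std(y)), using sample statistics.
r_manual = capacity.cov(actual) / (capacity.std() * actual.std())
r_pandas = capacity.corr(actual)

print(np.isclose(r_manual, r_pandas))
print(np.isclose(capacity.corr(actual), actual.corr(capacity)))  # symmetric
```

Because the coefficient is symmetric, it cannot by itself say which variable drives the other.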

What is the overall trend in power generation for each power station, and how does it compare to the programmed power needed?

In [69]:

total_power_generation = power.groupby('Power Station')['Actual (MW)'].sum()
total_programmed_power = power.groupby('Power Station')['Programme (MW)'].sum()

In [72]:

plt.figure(figsize=(14, 6))
total_power_generation.plot(kind='line', label='Total Power Generation', marker='o')
total_programmed_power.plot(kind='line', label='Total Programmed Power', marker='o')
plt.title('Trends in Power Generation vs. Programmed Power', fontsize=16)
plt.xlabel('Power Station', fontsize=12)
plt.ylabel('Power (MW)', fontsize=12)
plt.legend()
plt.xticks(rotation=45)
plt.show()

The line plot shows that the majority of the power stations meet their total programmed power targets. Notably, the 'PGPL' power station surpasses its programmed generation. Because the x-axis lists stations sorted by name, the two lines compare the totals station by station and do not show a trend over time.
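Reading "most stations meet their target" off a line plot with many stations is difficult, so the comparison can also be expressed as a number. The sketch below uses made-up rows and placeholder station names. The `Achievement` column and the `share_meeting_target` variable are names chosen for illustration.

```python
import pandas as pd

# Toy data (not from the dataset); station names are placeholders.
toy = pd.DataFrame(
    {
        "Power Station": ["A", "A", "B", "B", "C"],
        "Programme (MW)": [10.0, 10.0, 5.0, 5.0, 8.0],
        "Actual (MW)": [11.0, 10.0, 4.0, 5.0, 8.0],
    }
)

totals = toy.groupby("Power Station")[["Programme (MW)", "Actual (MW)"]].sum()

# A ratio of 1 or more means the station generated at least what was programmed.
totals["Achievement"] = totals["Actual (MW)"] / totals["Programme (MW)"]
share_meeting_target = (totals["Achievement"] >= 1).mean()

print(totals)
print(f"Share of stations meeting their programme: {share_meeting_target:.0%}")
```

On the real data, stations with a total programme of 0 would need separate handling, since the ratio is undefined for them.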

Are there any significant correlations between the different types of maintenance and the deviations in power generation?

In [85]:

# Calculate the correlation coefficients
correlation_planned = power['Planned Maintanence (MW)'].corr(power['Deviation (MW)'])


plt.figure(figsize=(8, 6))
sns.scatterplot(x='Planned Maintanence (MW)', y='Deviation (MW)', data=power,
                label=f'Correlation: {correlation_planned:.2f}', color='red')

plt.title('Correlation Between Planned Maintenance and Power Generation Deviations', fontsize=16)
plt.xlabel('Planned Maintenance (MW)', fontsize=12)
plt.ylabel('Deviation (MW)', fontsize=12)
plt.legend()
plt.show()

In [84]:

correlation_forced = power['Forced Maintanence(MW)'].corr(power['Deviation (MW)'])

plt.figure(figsize=(8, 6))
sns.scatterplot(x='Forced Maintanence(MW)', y='Deviation (MW)', data=power, label=f'Correlation: {correlation_forced:.2f}')

plt.title('Correlation Between Forced Maintenance and Power Generation Deviations', fontsize=16)
plt.xlabel('Forced Maintenance (MW)', fontsize=12)
plt.ylabel('Deviation (MW)', fontsize=12)
plt.legend()
plt.show()

1. The scatter plots show that the largest deviations occur at low levels of planned and forced maintenance. As the capacity under planned or forced maintenance increases, the deviations shrink toward 0.

2. The correlation coefficients are weakly negative: -0.08 between planned maintenance and deviation, and -0.15 between forced maintenance and deviation. They point to an inverse relationship between maintenance levels and deviations, but one too weak to support strong conclusions on its own.
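One way to gauge how weak these correlations are is to square them: for a simple linear fit, r squared is the share of the variance in one variable that the other accounts for. The short sketch below applies this to the two reported coefficients. The dictionary keys are labels chosen for illustration.

```python
# Correlation coefficients reported in the analysis of the maintenance plots.
correlations = {"planned": -0.08, "forced": -0.15}

# r squared is the share of variance a simple linear fit would account for.
r_squared = {name: round(r ** 2, 4) for name, r in correlations.items()}

print(r_squared)
```

Both values come out below 3%, so a linear relationship with maintenance levels accounts for very little of the variation in deviation.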

Recommendations and Conclusions

Based on the analysis of the power generation dataset, the following recommendations and conclusions can be drawn:

Recommendations:

Implement Predictive Maintenance Strategies: Because maintenance takes capacity out of service, it is advisable to adopt predictive maintenance techniques to minimize unexpected downtime and optimize overall operational efficiency.
Enhance Monitoring and Capacity Planning: To ensure consistent power generation, it is crucial to continually monitor power stations' capacities and align them with the actual power generation requirements. Regular capacity assessments can help anticipate potential discrepancies and enable proactive capacity adjustments.
Strengthen Maintenance Protocols: Building robust maintenance protocols that effectively manage both planned and forced maintenance activities can help reduce power generation deviations. Implementing comprehensive maintenance schedules and strategies can minimize disruptions and ensure smoother operations.

Conclusions:

Operational Efficiency: The analysis underscores the commendable operational efficiency demonstrated by several power stations in meeting their programmed power targets. Notably, the 'PGPL' power station's exceptional performance highlights the potential for surpassing set power generation goals through efficient operational practices.
Maintenance-Deviation Relationship: The inverse relationship between maintenance levels and deviations in power generation, although weak (correlations of -0.08 and -0.15), suggests that maintenance activities play a role in consistent power production. Keeping planned and forced maintenance at appropriate levels may help mitigate deviations and stabilize power generation outputs.
Capacity-Generation Alignment: The strong correlation between monitored capacity and actual power generation highlights the pivotal role of effective capacity planning in achieving consistent power generation levels. Aligning monitored capacities with actual power demands is crucial for maintaining a stable and reliable power supply.

By implementing these recommendations and acknowledging the insights gleaned from the analysis, power generation facilities can optimize their operations, enhance their maintenance strategies, and ensure consistent and reliable power generation, thus contributing to a more sustainable and efficient energy landscape.

Practice Questions

How do the different reasons for maintenance (planned, forced, and others) vary across the power stations, and are there any specific patterns or trends in the distribution of these maintenance reasons?
Can we identify any seasonal fluctuations or patterns in the power generation data for specific power stations, and how do these seasonal variations impact overall power generation?
What are the primary factors contributing to the excess or shortfall in power generation for each power station, and how do these factors compare in their influence on power generation discrepancies?
How has the power generation trend evolved over the years for each power station, and what external factors might have contributed to any notable changes or shifts in these trends?
Are there any notable disparities in the monitored capacity of the power stations, and how do these disparities impact the overall efficiency and productivity of the power generation facilities?
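As a possible starting point for the first question, the sketch below averages the three maintenance columns per station. It runs on made-up rows with placeholder station names. The column names follow the spelling used in the dataset, and the real `power` DataFrame could be substituted for `toy`.

```python
import pandas as pd

maintenance_cols = [
    "Planned Maintanence (MW)",
    "Forced Maintanence(MW)",
    "Other Reasons (MW)",
]

# Toy data (not from the dataset); station names are placeholders.
toy = pd.DataFrame(
    {
        "Power Station": ["A", "A", "B"],
        "Planned Maintanence (MW)": [0.0, 100.0, 0.0],
        "Forced Maintanence(MW)": [50.0, 50.0, 300.0],
        "Other Reasons (MW)": [0.0, 0.0, 20.0],
    }
)

# Average capacity out of service per station, split by reason.
by_reason = toy.groupby("Power Station")[maintenance_cols].mean()
print(by_reason)
```

From here, a stacked bar chart of `by_reason` is one way to compare the mix of maintenance reasons across stations.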

Exploratory Data Analysis on Power Generation Data in Python