About Dataset¶
This dataset focuses on the black-white wage gap in the United States. It provides insights into the disparities in hourly wages between black and white workers, as well as different gender and subgroup breakdowns.
The data is derived from the Economic Policy Institute’s State of Working America Data Library, a reputable source for socio-economic research and analysis.
The dataset reports the black-white wage gap in the USA using two measures, the median and the average. It covers hourly wages for workers aged 16 and older, adjusted to 2022 dollars.
Dataset link: Kaggle link
Data Dictionary¶
- year: Year of data collection.
- white_median: Median hourly wage for white workers.
- white_average: Average hourly wage for white workers.
- black_median: Median hourly wage for black workers.
- black_average: Average hourly wage for black workers.
- white_men_median: Median hourly wage for white male workers.
- white_men_average: Average hourly wage for white male workers.
- black_men_median: Median hourly wage for black male workers.
- black_men_average: Average hourly wage for black male workers.
- white_women_median: Median hourly wage for white female workers.
- white_women_average: Average hourly wage for white female workers.
- black_women_median: Median hourly wage for black female workers.
- black_women_average: Average hourly wage for black female workers.
Installing Dependencies¶
👉 Skip this step if the packages are already installed.
1. !pip install numpy
2. !pip install pandas
3. !pip install matplotlib
4. !pip install seaborn
5. !pip install scipy
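Before running the install commands, it can help to see which packages are actually missing. The snippet below is a minimal sketch using only the standard library; the `REQUIRED` list and the `missing_packages` helper are illustrative names, not part of the original notebook. `scipy` is included because the notebook later imports `ttest_ind` from `scipy.stats`.

```python
from importlib.util import find_spec

# Packages this notebook imports (assumed list; scipy is needed for
# the later `from scipy.stats import ttest_ind` import).
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "scipy"]


def missing_packages(names):
    """Return the package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All dependencies are already installed.")
```

Only the packages printed by this check need a `!pip install` cell.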
Step 1: Data Preprocessing and Cleaning¶
Importing Required Libraries¶
# Numerical operations
import numpy as np
# Data manipulation
import pandas as pd
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Statistical tests
from scipy.stats import ttest_ind
# Load the dataset
black_white = pd.read_csv(r"C:\Users\Lenovo\Downloads\content\Black-White Wage Gap Data Analysis\black_white_wage_gap.csv")
# Print the first 5 rows
black_white.head()
| | year | white_median | white_average | black_median | black_average | white_men_median | white_men_average | black_men_median | black_men_average | white_women_median | white_women_average | black_women_median | black_women_average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 24.96 | 34.49 | 19.60 | 25.61 | 27.11 | 39.10 | 20.02 | 27.43 | 22.47 | 29.50 | 19.00 | 23.99 |
| 1 | 2021 | 25.40 | 34.50 | 19.45 | 25.40 | 27.76 | 38.78 | 20.08 | 26.88 | 22.76 | 29.90 | 18.85 | 24.13 |
| 2 | 2020 | 25.98 | 34.86 | 19.85 | 26.03 | 28.36 | 39.08 | 20.56 | 27.40 | 23.05 | 30.30 | 19.26 | 24.87 |
| 3 | 2019 | 24.39 | 32.79 | 18.45 | 24.09 | 27.39 | 36.84 | 19.31 | 25.18 | 22.01 | 28.41 | 18.08 | 23.17 |
| 4 | 2018 | 23.97 | 32.44 | 17.57 | 23.53 | 26.79 | 36.55 | 18.66 | 24.67 | 21.75 | 28.01 | 17.34 | 22.55 |
# Print last 5 rows
black_white.tail()
| | year | white_median | white_average | black_median | black_average | white_men_median | white_men_average | black_men_median | black_men_average | white_women_median | white_women_average | black_women_median | black_women_average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45 | 1977 | 20.00 | 23.38 | 16.23 | 18.93 | 24.94 | 27.66 | 18.70 | 20.84 | 15.33 | 17.57 | 14.06 | 16.91 |
| 46 | 1976 | 20.06 | 23.47 | 16.25 | 19.20 | 24.32 | 27.54 | 19.19 | 21.57 | 15.42 | 17.79 | 13.97 | 16.73 |
| 47 | 1975 | 19.96 | 23.30 | 16.15 | 18.46 | 24.68 | 27.37 | 19.15 | 20.60 | 15.32 | 17.45 | 13.41 | 16.14 |
| 48 | 1974 | 20.04 | 23.21 | 16.07 | 18.36 | 24.55 | 27.34 | 19.02 | 20.84 | 15.22 | 17.23 | 13.46 | 15.68 |
| 49 | 1973 | 20.53 | 23.72 | 15.96 | 18.61 | 24.98 | 27.93 | 19.29 | 21.09 | 15.36 | 17.57 | 13.38 | 15.83 |
black_white.shape
(50, 13)
# Check info of each column
black_white.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   year                 50 non-null     int64
 1   white_median         50 non-null     float64
 2   white_average        50 non-null     float64
 3   black_median         50 non-null     float64
 4   black_average        50 non-null     float64
 5   white_men_median     50 non-null     float64
 6   white_men_average    50 non-null     float64
 7   black_men_median     50 non-null     float64
 8   black_men_average    50 non-null     float64
 9   white_women_median   50 non-null     float64
 10  white_women_average  50 non-null     float64
 11  black_women_median   50 non-null     float64
 12  black_women_average  50 non-null     float64
dtypes: float64(12), int64(1)
memory usage: 5.2 KB
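The `info()` output can also be turned into an automated check, so that a renamed or non-numeric column is caught right after loading. The snippet below is a sketch under stated assumptions: `EXPECTED_COLUMNS` mirrors the 13 columns listed by `info()`, `check_schema` is a hypothetical helper, and two rows copied from the `head()` output stand in for the real CSV file.

```python
import io

import pandas as pd

# Column names as reported by black_white.info()
EXPECTED_COLUMNS = [
    "year", "white_median", "white_average", "black_median", "black_average",
    "white_men_median", "white_men_average",
    "black_men_median", "black_men_average",
    "white_women_median", "white_women_average",
    "black_women_median", "black_women_average",
]

# Two rows copied from the head() output; a stand-in for the real file
SAMPLE_CSV = ",".join(EXPECTED_COLUMNS) + """
2022,24.96,34.49,19.60,25.61,27.11,39.10,20.02,27.43,22.47,29.50,19.00,23.99
2021,25.40,34.50,19.45,25.40,27.76,38.78,20.08,26.88,22.76,29.90,18.85,24.13
"""


def check_schema(df):
    """Return a list of schema problems; an empty list means the frame looks fine."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    non_numeric = [c for c in df.columns
                   if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    return problems


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(check_schema(sample))
```

In the notebook itself, the same helper could be called as `check_schema(black_white)`.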
# Check for duplicate rows
black_white.duplicated().sum()
0
The cell above shows that the data contains no duplicate rows.
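Note that `duplicated()` only flags rows that match in every column. For yearly data, two further checks may be useful: whether any year appears more than once, and whether any value is missing. The sketch below uses a tiny stand-in frame; the repeated 2021 row and its values are deliberately invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Stand-in frame: the third row is an invented, deliberately faulty entry
# (repeated year, missing value) so that both checks have something to find.
demo = pd.DataFrame({
    "year": [2022, 2021, 2021],
    "white_median": [24.96, 25.40, 25.41],
    "black_median": [19.60, 19.45, None],
})

# Rows identical in every column (what the notebook cell checks)
full_row_dupes = int(demo.duplicated().sum())

# Years that occur more than once, even if the wage values differ
repeated_years = int(demo["year"].duplicated().sum())

# Count of missing values in each column
missing_per_column = demo.isna().sum()

print("Full-row duplicates:", full_row_dupes)
print("Repeated years:", repeated_years)
print(missing_per_column)
```

On the real data, replace `demo` with `black_white`.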
Step -2 Data analysis¶
Let's ask some questions of our data.
black_white['year'].unique()
array([2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973], dtype=int64)
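The array runs from 2022 down to 1973, which is 50 values. Rather than scanning it by eye, the coverage can be confirmed programmatically. This is a minimal sketch; `missing_years` is a hypothetical helper, and the `years` array is rebuilt here with `np.arange` to match the output above instead of being read from the file.

```python
import numpy as np


def missing_years(years):
    """Return the years between min and max that do not appear in `years`."""
    years = np.asarray(years, dtype=int)
    expected = np.arange(years.min(), years.max() + 1)
    return np.setdiff1d(expected, years)


# Same 50 values as the unique() output: 2022, 2021, ..., 1973
years = np.arange(2022, 1972, -1)

gaps = missing_years(years)
print("Number of years:", len(years))
print("Missing years:", gaps)
```

In the notebook, `missing_years(black_white['year'])` would run the same check on the loaded data.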
plt.figure(figsize=(12, 8))
# Overall median wages
plt.plot(black_white['year'], black_white['white_median'], label='White Median', linestyle='--', marker='o')
plt.plot(black_white['year'], black_white['black_median'], label='Black Median', linestyle='--', marker='o')
# Overall average wages
plt.plot(black_white['year'], black_white['white_average'], label='White Average', linestyle='-', marker='x')
plt.plot(black_white['year'], black_white['black_average'], label='Black Average', linestyle='-', marker='x')
plt.title('Overall Median and Average Wages Over Time')
plt.xlabel('Year')
plt.ylabel('Hourly Wage (2022 dollars)')
plt.legend()
plt.grid(True)
plt.show()
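The plot shows the four wage series, but the gap itself is easier to read as a number. One possible way to quantify it, sketched below, is the absolute difference (white minus black) and the ratio (black divided by white). The column names `median_gap`, `median_ratio`, `average_gap`, and `average_ratio` and the helper `add_gap_columns` are assumptions introduced for illustration; the two rows are copied from the 2022 and 1973 entries in the tables above.

```python
import pandas as pd

# First and last years of the dataset, copied from the head()/tail() output
rows = pd.DataFrame({
    "year": [2022, 1973],
    "white_median": [24.96, 20.53],
    "black_median": [19.60, 15.96],
    "white_average": [34.49, 23.72],
    "black_average": [25.61, 18.61],
})


def add_gap_columns(df, kind):
    """Add '<kind>_gap' (white - black) and '<kind>_ratio' (black / white)."""
    out = df.copy()
    out[f"{kind}_gap"] = out[f"white_{kind}"] - out[f"black_{kind}"]
    out[f"{kind}_ratio"] = out[f"black_{kind}"] / out[f"white_{kind}"]
    return out


gaps = add_gap_columns(add_gap_columns(rows, "median"), "average")
print(gaps[["year", "median_gap", "median_ratio",
            "average_gap", "average_ratio"]].round(3))
```

Applied to the full `black_white` frame, the resulting ratio columns could be plotted against `year` in the same way as the wage series.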