About Dataset¶
Compiled from the National Center for Education Statistics Annual Digest, specifically Table 330.20: Average undergraduate tuition and fees and room and board rates charged for full-time students in degree-granting postsecondary institutions, by control and level of institution and state or jurisdiction.
Dataset Link - Kaggle Link
Data Dictionary¶
- Year: The Digest year this information comes from
- State: The U.S. state
- Type: Type of university: Private, Public In-State, or Public Out-of-State. Private colleges charge the same for in-state and out-of-state students
- Length: Whether the college mainly offers 2-year (Associate's) or 4-year (Bachelor's) programs
- Expense: The expense being described: tuition/fees or on-campus living expenses (room/board)
- Value: The average cost of this particular expense, in USD ($)
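The dictionary above implies a fixed set of levels for the categorical columns, which can be checked before any analysis. The following is a minimal sketch, not part of the original notebook: the helper name `unexpected_levels` and the `ALLOWED` mapping are hypothetical, and the sample frame only reproduces a few rows of the kind this dataset contains. On the real data, pass the loaded DataFrame instead.

```python
import pandas as pd

# Hypothetical sample with the six columns from the data dictionary.
sample = pd.DataFrame(
    [
        (2013, "Alabama", "Private", "4-year", "Fees/Tuition", 13983),
        (2013, "Alabama", "Private", "4-year", "Room/Board", 8503),
        (2013, "Alabama", "Public In-State", "2-year", "Fees/Tuition", 4048),
    ],
    columns=["Year", "State", "Type", "Length", "Expense", "Value"],
)

# Assumed category levels, based on the data dictionary wording.
ALLOWED = {
    "Type": {"Private", "Public In-State", "Public Out-of-State"},
    "Length": {"2-year", "4-year"},
    "Expense": {"Fees/Tuition", "Room/Board"},
}


def unexpected_levels(df):
    """Return {column: set of values not listed in ALLOWED}."""
    problems = {}
    for col, allowed in ALLOWED.items():
        extra = set(df[col].unique()) - allowed
        if extra:
            problems[col] = extra
    return problems


# An empty dict means every categorical value matches the dictionary.
print(unexpected_levels(sample))
```

An empty result means the categorical columns contain only the documented levels; any other result points at typos or undocumented categories.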
Installing dependencies¶
👉 Skip this step if the packages are already installed
1. !pip install numpy
2. !pip install pandas
3. !pip install matplotlib
4. !pip install seaborn
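To decide whether the install commands above are needed at all, one option is to ask Python which packages it can already find. This is a small sketch using only the standard library; the `REQUIRED` list simply mirrors the four packages named above.

```python
import importlib.util

# Packages this notebook relies on (same names as the pip commands above).
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn"]

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Only the packages reported as missing then need a `pip install`.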
Step 1 - Data Preprocessing and Cleaning¶
Importing required libraries¶
# Numerical operations
import numpy as np
# Data manipulation
import pandas as pd
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Load the dataset
us_data=pd.read_csv(r"C:\Users\Lenovo\Downloads\content\US undergrad data\nces330_20.csv")
# Print top 5 rows
us_data.head()
| | Year | State | Type | Length | Expense | Value |
|---|---|---|---|---|---|---|
| 0 | 2013 | Alabama | Private | 4-year | Fees/Tuition | 13983 |
| 1 | 2013 | Alabama | Private | 4-year | Room/Board | 8503 |
| 2 | 2013 | Alabama | Public In-State | 2-year | Fees/Tuition | 4048 |
| 3 | 2013 | Alabama | Public In-State | 4-year | Fees/Tuition | 8073 |
| 4 | 2013 | Alabama | Public In-State | 4-year | Room/Board | 8473 |
# check for shape
us_data.shape
(3548, 6)
The output above shows that the dataset contains 3,548 rows (observations) and 6 columns.
# Check info of each column
us_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3548 entries, 0 to 3547
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Year     3548 non-null   int64
 1   State    3548 non-null   object
 2   Type     3548 non-null   object
 3   Length   3548 non-null   object
 4   Expense  3548 non-null   object
 5   Value    3548 non-null   int64
dtypes: int64(2), object(4)
memory usage: 166.4+ KB
The output above shows four object (string) columns and two integer columns.
# Checking null values
us_data.isnull().sum()
Year       0
State      0
Type       0
Length     0
Expense    0
Value      0
dtype: int64
The output above shows that the dataset has no missing values.
# Check for duplicate rows
us_data.duplicated().sum()
0
The output above shows that the dataset contains no duplicate rows.
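The shape, missing-value, and duplicate checks above can be bundled into one helper so they are easy to rerun after any cleaning step. This is a minimal sketch: `quality_report` is a hypothetical name, and the toy frame is made-up data that deliberately contains one missing value and one duplicate row so the counts are visible.

```python
import pandas as pd


def quality_report(df):
    """Summarise size, missing cells, and fully duplicated rows."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy data (not from the dataset): row 2 repeats row 1, row 3 lacks a State.
toy = pd.DataFrame(
    {
        "Year": [2013, 2013, 2013, 2014],
        "State": ["Alabama", "Alabama", "Alabama", None],
        "Value": [13983, 8503, 8503, 9000],
    }
)

report = quality_report(toy)
print(report)
```

On the real data, `quality_report(us_data)` should report zero missing values and zero duplicate rows, matching the outputs above.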
Step 2 - Data Analysis¶
Let's check the distribution of each column¶
us_data
| | Year | State | Type | Length | Expense | Value |
|---|---|---|---|---|---|---|
| 0 | 2013 | Alabama | Private | 4-year | Fees/Tuition | 13983 |
| 1 | 2013 | Alabama | Private | 4-year | Room/Board | 8503 |
| 2 | 2013 | Alabama | Public In-State | 2-year | Fees/Tuition | 4048 |
| 3 | 2013 | Alabama | Public In-State | 4-year | Fees/Tuition | 8073 |
| 4 | 2013 | Alabama | Public In-State | 4-year | Room/Board | 8473 |
| ... | ... | ... | ... | ... | ... | ... |
| 3543 | 2021 | Wyoming | Public In-State | 2-year | Fees/Tuition | 3987 |
| 3544 | 2021 | Wyoming | Public In-State | 4-year | Room/Board | 9799 |
| 3545 | 2021 | Wyoming | Public Out-of-State | 2-year | Fees/Tuition | 9820 |
| 3546 | 2021 | Wyoming | Public Out-of-State | 4-year | Fees/Tuition | 14710 |
| 3547 | 2021 | Wyoming | Public Out-of-State | 4-year | Room/Board | 9799 |

3548 rows × 6 columns
Year¶
ax = sns.countplot(x='Year', data=us_data)
# Annotate each bar with its record count
for container in ax.containers:
    ax.bar_label(container)
plt.title("Years available in the dataset")
plt.show()
The bar plot above shows that the data spans the years 2013 through 2021.
State¶
plt.figure(figsize=(10, 5))
ax = sns.countplot(x='State', data=us_data)
# Annotate each bar with its record count
for container in ax.containers:
    ax.bar_label(container)
plt.title("States available in the dataset")
plt.xticks(rotation=90)
plt.show()
The plot above shows that the data covers states from across the US, with a record count for each.
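A bar plot with this many categories is hard to read precisely, so counting the distinct states directly can confirm how complete the coverage is. This is a minimal sketch on a toy Series that stands in for `us_data['State']`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for us_data['State']; on the real data use that column instead.
states = pd.Series(
    ["Alabama", "Alabama", "Wyoming", "Wyoming", "Wyoming"], name="State"
)

# Number of distinct states, and how many records each one contributes.
n_states = states.nunique()
rows_per_state = states.value_counts()

print(f"{n_states} distinct states")
print(rows_per_state)
```

Comparing `n_states` on the real data with the expected number of states or jurisdictions shows whether any are missing.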
Type¶
plt.figure(figsize=(10, 5))
ax = sns.countplot(x='Type', data=us_data)
# Annotate each bar with its record count
for container in ax.containers:
    ax.bar_label(container)
plt.title("Types of universities in the data")
plt.xticks(rotation=45)
plt.show()
The Type column takes three values: "Private", "Public In-State", and "Public Out-of-State". The dataset contains 905 "Private" records, 1296 "Public In-State" records, and 1347 "Public Out-of-State" records. These are row counts (one row per combination of year, state, type, length, and expense), not numbers of universities, so they describe how the table is laid out rather than how many institutions of each kind exist. The smaller number of "Private" records is consistent with the data dictionary: private colleges charge the same for in-state and out-of-state students, so they have no residency split.
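As a quick arithmetic check, the three record counts reported for the Type column should add up to the 3548 rows of the dataset. The sketch below uses only the counts quoted in this notebook and also expresses each as a share of all rows.

```python
import pandas as pd

# Record counts per Type, as read from the bar plot.
type_counts = pd.Series(
    {"Private": 905, "Public In-State": 1296, "Public Out-of-State": 1347},
    name="records",
)

# The counts should sum to the number of rows reported by us_data.shape.
total = int(type_counts.sum())

# Share of all rows per Type, in percent, rounded to one decimal place.
shares = (type_counts / total * 100).round(1)

print(total)
print(shares)
```

On the loaded data, `us_data['Type'].value_counts(normalize=True)` gives the same shares without typing the counts by hand.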
Length¶
plt.figure(figsize=(10, 5))
ax = sns.countplot(x='Length', data=us_data)
# Annotate each bar with its record count
for container in ax.containers:
    ax.bar_label(container)
plt.title("Program lengths available")
plt.xticks(rotation=45)
plt.show()
The Length column takes two values: "4-year" and "2-year". The dataset contains 2672 records for 4-year programs and 876 records for 2-year programs.
Expense¶
plt.figure(figsize=(10, 5))
ax = sns.countplot(x='Expense', data=us_data)
# Annotate each bar with its record count
for container in ax.containers:
    ax.bar_label(container)
plt.title("Types of expense")
plt.xticks(rotation=45)
plt.show()
The bar plot shows the two expense types in the data, "Fees/Tuition" and "Room/Board". There are 2198 "Fees/Tuition" records and 1350 "Room/Board" records.
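The Length and Expense counts are easier to interpret together than separately. A cross-tabulation shows which combinations actually occur. This is a minimal sketch on the ten rows displayed earlier in this notebook (the head and tail of `us_data`); in that small sample the 2-year rows carry only "Fees/Tuition", and running the same call on the full data would show whether that pattern holds throughout.

```python
import pandas as pd

# The ten rows shown in the notebook's table output (not the full dataset).
sample = pd.DataFrame(
    [
        (2013, "Alabama", "Private", "4-year", "Fees/Tuition", 13983),
        (2013, "Alabama", "Private", "4-year", "Room/Board", 8503),
        (2013, "Alabama", "Public In-State", "2-year", "Fees/Tuition", 4048),
        (2013, "Alabama", "Public In-State", "4-year", "Fees/Tuition", 8073),
        (2013, "Alabama", "Public In-State", "4-year", "Room/Board", 8473),
        (2021, "Wyoming", "Public In-State", "2-year", "Fees/Tuition", 3987),
        (2021, "Wyoming", "Public In-State", "4-year", "Room/Board", 9799),
        (2021, "Wyoming", "Public Out-of-State", "2-year", "Fees/Tuition", 9820),
        (2021, "Wyoming", "Public Out-of-State", "4-year", "Fees/Tuition", 14710),
        (2021, "Wyoming", "Public Out-of-State", "4-year", "Room/Board", 9799),
    ],
    columns=["Year", "State", "Type", "Length", "Expense", "Value"],
)

# Rows are program lengths, columns are expense types, cells are record counts.
table = pd.crosstab(sample["Length"], sample["Expense"])
print(table)
```

On the real data the equivalent call is `pd.crosstab(us_data['Length'], us_data['Expense'])`.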
Value¶
# Histogram of all cost values with a kernel density estimate overlaid
sns.histplot(us_data['Value'], kde=True)
plt.title("Distribution of cost")
plt.show()
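The histogram pools tuition and room/board values into one distribution, so a per-expense summary can help separate the two. This is a minimal sketch on the ten rows displayed earlier in this notebook, not on the full dataset; the statistics chosen here are an assumption about what is useful, not part of the original analysis.

```python
import pandas as pd

# The ten rows shown in the notebook's table output (not the full dataset).
sample = pd.DataFrame(
    [
        (2013, "Alabama", "Private", "4-year", "Fees/Tuition", 13983),
        (2013, "Alabama", "Private", "4-year", "Room/Board", 8503),
        (2013, "Alabama", "Public In-State", "2-year", "Fees/Tuition", 4048),
        (2013, "Alabama", "Public In-State", "4-year", "Fees/Tuition", 8073),
        (2013, "Alabama", "Public In-State", "4-year", "Room/Board", 8473),
        (2021, "Wyoming", "Public In-State", "2-year", "Fees/Tuition", 3987),
        (2021, "Wyoming", "Public In-State", "4-year", "Room/Board", 9799),
        (2021, "Wyoming", "Public Out-of-State", "2-year", "Fees/Tuition", 9820),
        (2021, "Wyoming", "Public Out-of-State", "4-year", "Fees/Tuition", 14710),
        (2021, "Wyoming", "Public Out-of-State", "4-year", "Room/Board", 9799),
    ],
    columns=["Year", "State", "Type", "Length", "Expense", "Value"],
)

# Summary statistics of Value, computed separately for each expense type.
summary = sample.groupby("Expense")["Value"].agg(
    ["count", "mean", "median", "min", "max"]
)
print(summary)
```

The same `groupby` on `us_data` would show whether the shape of the pooled histogram comes from the two expense types having different ranges.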
Let's ask some questions of the data¶
How do the expenses differ between public and private universities in various states?¶
us_data
| | Year | State | Type | Length | Expense | Value |
|---|---|---|---|---|---|---|
| 0 | 2013 | Alabama | Private | 4-year | Fees/Tuition | 13983 |
| 1 | 2013 | Alabama | Private | 4-year | Room/Board | 8503 |
| 2 | 2013 | Alabama | Public In-State | 2-year | Fees/Tuition | 4048 |
| 3 | 2013 | Alabama | Public In-State | 4-year | Fees/Tuition | 8073 |
| 4 | 2013 | Alabama | Public In-State | 4-year | Room/Board | 8473 |
| ... | ... | ... | ... | ... | ... | ... |
| 3543 | 2021 | Wyoming | Public In-State | 2-year | Fees/Tuition | 3987 |
| 3544 | 2021 | Wyoming | Public In-State | 4-year | Room/Board | 9799 |
| 3545 | 2021 | Wyoming | Public Out-of-State | 2-year | Fees/Tuition | 9820 |
| 3546 | 2021 | Wyoming | Public Out-of-State | 4-year | Fees/Tuition | 14710 |
| 3547 | 2021 | Wyoming | Public Out-of-State | 4-year | Room/Board | 9799 |

3548 rows × 6 columns
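plt.figure(figsize=(14, 5))
# Keep only tuition rows; each bar is the mean Value over all years and program lengths.
# ci=None hides the error bars (seaborn 0.12+ prefers errorbar=None for the same effect).
sns.barplot(x='State', y='Value', hue='Type', data=us_data[us_data.Expense=='Fees/Tuition'], ci=None)
plt.title('Average Tuition and Fees by University Type in Each State')
plt.xticks(rotation=45)
plt.xlabel('State')
plt.ylabel('Average Tuition and Fees (USD)')
plt.show()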
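The grouped bar chart can be backed by a table of the same averages, which makes individual states easier to compare. This is a minimal sketch on the tuition rows among the ten records displayed earlier in this notebook, not on the full dataset. Like the bar plot, it averages over years and program lengths, so 2-year and 4-year tuition are mixed within each cell.

```python
import pandas as pd

# Tuition rows among the ten records shown in the notebook's table output.
tuition = pd.DataFrame(
    [
        ("Alabama", "Private", 13983),
        ("Alabama", "Public In-State", 4048),
        ("Alabama", "Public In-State", 8073),
        ("Wyoming", "Public In-State", 3987),
        ("Wyoming", "Public Out-of-State", 9820),
        ("Wyoming", "Public Out-of-State", 14710),
    ],
    columns=["State", "Type", "Value"],
)

# Mean tuition per state (rows) and university type (columns).
# Combinations absent from the sample appear as NaN.
pivot = tuition.pivot_table(
    index="State", columns="Type", values="Value", aggfunc="mean"
)
print(pivot)
```

On the real data, the input would be `us_data[us_data.Expense == 'Fees/Tuition']`, and adding `'Length'` to `index` would keep 2-year and 4-year averages apart.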