Over the past decade, I had the opportunity to work and serve both at the community college level and university level. I found the distribution of available resources and student demographics to be different. Naturally, as a researcher and professor, I was curious to find out the impact of disparities on students' performance outcome. Picking a metric to evaluate such hypothesis is not trivial. However, many scholars in the field of education believe that "student completion" (that is graduation rate) is a rather inclusive metric.
Moreover, the government of California changed the allocation of funding for colleges in 2019. Based on the new funding formula, the annual allocated funding for colleges is composed of 3 categories:
1.60% of the funding as a base allocation, which largely reflects enrollment.
2.20% of the funding is directly correlated with success measure (that includes completion rate)
3.20% of the funding allocation is based on the numbers of students receiving grants.
For more details visit California Community Colleges Student Centered Funding Formula.
College administrations have always been interested to monitor their students outcome by measuring their completion rate. However, now more than ever it seems crucial to measure this factor. Additionally, administrations need to predict the completion rates to account for it in their planning and budgets. Therefore, I am proposing the following project to create a predictive model to forecast the completion rate.
For the proposed project I am using a publicly available dataset gathered from nearly all degree-granting institutions in the US to compare the completion rate within community colleges. The dataset includes the students' completion based on various factors such as race, the state, gender and year.
The primary source of data for this work comes from College Scorecard by US Department of Eduction. College Scorecard provides data at the institution-level and data by field of study. The technical data documents provide in-depth information about these data. The data dictionary includes separate worksheets with institution-level and field of study dictionaries as well as cohort maps describing the timing aspects of each data element.
The link to the publicly available dataset is here.
The dataset contains 3GB of data ranging from 1997 to June 2020, gathering and reporting approximately 1900 features. Related features to this work were extracted and summarized below.
Here is the information on the data set:
Features | Description |
---|---|
UGDS | Total Student Enrollment |
UGDS_WHITE | Ratio of White Students |
UGDS_Black | Ratio of African-American Students |
UGDS_HISP | Ratio of Hispanic Students |
UGDS_OTH | Ratio of Other Students |
UGDS_MEN | Ratio of Male Students |
UGDS_WOMEN | Ratio of Female Students |
TUITIONFEE_IN | Tuition Fee for In-State Students |
TUITIONFEE_OUT | Tuition Fee for Out-Of-State-State Students |
AVGFACSAL | Average Salary of Faculties |
INC_PCT_LO | Ratio of Low-Income Families |
INC_PCT_M | Ratio of Middle-Income Families |
INC_PCT_H | Ratio of High-Income Families |
DEBT_MDN | Median Loan Debt Accumulated at the Institution |
FTFTPCTFLOAN | Share of Students Who Received Federal Loans |
College Completion examines data and trends at 3,800 degree-granting institutions in the United States that reported a first-time, full-time degree-seeking undergraduate cohort, had a total of at least 500 students at the undergraduate level , and awarded undergraduate degrees between 2010 and 2017.
Graduation data from the National Center for Education Statistics’ Integrated Postsecondary Education System is limited to tracking completions for groups of first-time, full-time degree-seeking students at the undergraduate level.
Until 2009, the NCES classified students in seven ways: White, non-Hispanic; Black, non-Hispanic; American Indian/Alaskan Native; Asian/Pacific Islander; unknown race or ethnicity; and nonresident. In addition to creating a stronger separation between race and ethnicity categories, two new race categories were created: Native Hawaiian or Other Pacific Islander (previously combined with Asian students) and students who belong to two or more races.
“Awards per 100 full-time undergraduate students” includes all undergraduate-level completions reported by the institution to the NCES: associate degrees, and certificate programs of less than four years in length. Full-time-equivalent undergraduates are estimated from the number of credit hours taken at the institution in an academic year. To account for changes in enrollment, the resulting metric is a seven-year average of data from 2011 to 2017.
First, let's import the required libraries: NumPy, Pandas, Matplotlib and Seaborn. Now let's take a look at data and calculate the key statistics on all numeric columns. I use pairplots to find any correlation between data or catch any trends in the beginning. Now I have a good glimpse of the data. Next, I would like to explore any relationship between graduation rate and race between males and females. Heatmap seems like a good choice and in fact it yields a beautiful graph. Moreover, I would like to find out the graduation rate nationwide and rank all states accordingly. Bar graphs seems the right choice. Moreover, I am interested in producing an interactive plot of the US with each state indicating the graduation rate using Chart Studio package.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import os
import warnings
warnings.filterwarnings('ignore')
College Scorecard provides tens of csv files from 1997 to June 2020. Let's open one of the more recent files.
datadir = os.getcwd() + "/rawdata"
COL_1819 = pd.read_csv(datadir + '/MERGED2018_19_PP.csv')
COL_1819.info()
There are 6806 entries (degree-granting institutions) for 2018-2019 academic year and 1986 columns in this one file.
Note that memory usage for this file is 103+ MB.
COL_1819.head()
There are missing values tagged with PrivacySupprressed
I would like to deal with missing values. So I perform a hard "na drop".
COL_1819_NoNA = COL_1819.dropna(axis=1)
COL_1819_NoNA.info()
It seems like only 18 columns survived the hard drop. Memory usage went down from 103+ MB to less than 1 MB.
Taking such drastic measures by dropping all missing values is not optimal. Therefore, I read the data documentation and only keep the columns that I need for my analysis.
I select the following variables from Student Body, Admission, Institution and Financial. I store all variables of interest in vars_interest
studentBodyVars = ['UGDS', 'UGDS_MEN', 'UGDS_WOMEN', 'UGDS_WHITE',
'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN']
admission = ['ADM_RATE_ALL', 'ACTCMMID', 'SATVRMID', 'SATMTMID', 'SATWRMID']
completion = ['C100_4', 'C100_L4', 'C150_4', 'C150_L4']
institution = ['AVGFACSAL', 'TUITIONFEE_OUT', 'TUITIONFEE_IN']
financial = ['FTFTPCTFLOAN', 'DEBT_MDN', 'INC_PCT_LO', 'INC_PCT_M1', 'INC_PCT_M2', 'INC_PCT_H1', 'INC_PCT_H2']
vars_interest = studentBodyVars + admission + completion + institution + financial # Include these vars
I prefer working with float numbers so I convert the columns to float. However, UNITID
is the Institution Unique ID so I convert that back to integer.
col_dtypes = dict(COL_1819_NoNA.dtypes.replace(np.dtype('int64'),np.dtype('float64'))) # make the dtypes floats
col_dtypes['UNITID'] = np.dtype('int64') # convert the UNITID back to int
I would want to apply the changes to my vars_interest
col_dtypes.update({a: np.dtype('float64') for a in vars_interest}) # make them floats
I would like to read the csv file again, however, this time I only select and import the columns in vars_interest
. Moreover, the missing values are called PrivacySupprressed
in the csv so I make sure Pandas knows about that as well. The import works much faster this way
COL_1819 = pd.read_csv(datadir + '/MERGED2018_19_PP.csv',na_values='PrivacySuppressed',
usecols=col_dtypes.keys())
COL_1819.info()
COL_1819.head()
The index is integer. I would like to change that to a set of (Institution, Year).
I write a function to open each csv from 1997, extract the columns in vars_interest
and update the index as we go.
def read_cs_data(year,col_dtypes, datadir):
"""read a CollegeScorecard dataframe"""
nextyr = str(int(year) + 1)[-2:]
filename = datadir + '/MERGED{}_{}_PP.csv'.format(year,nextyr)
col = pd.read_csv(filename,na_values='PrivacySuppressed',
usecols=col_dtypes.keys())
col['Year'] = pd.Period(str(int(year) + 1),freq='Y')
return col
col_all = pd.concat((read_cs_data(str(y),col_dtypes,datadir) for y in range(1997,2019)))
col_all = col_all.set_index(['UNITID','Year'])
col_all.info()
Lets get some key statistics from the dataset so far.
col_all.describe()
I would like to focus on community colleges for this particular work so I select a subset of data that corresponds to community colleges by searching for institutions with the highest granting degree being Associate using 'HIGHDEG' == 2
df = col_all[col_all['HIGHDEG']==2]
df.info()
let's look at the distribution of graduate rate by state. We use Seaborn distribution plot for illustration. "grad_rate_150 is the ultimate number of students who eventually graduated. We pick that as the indicator. We group our DataFrame by state, find the average "grad_150_rate for each state and and sort them descendingly.
df['C150_L4'].describe()
f, ax = plt.subplots(1, 1)
sns.distplot(df['C100_L4'], label="Within 100% of Expected Time")
sns.distplot(df['C150_L4'], label="Within 150% of Expected Time")
ax.legend()
f.set_size_inches(11.7, 8.27)
ax.set(xlabel='Graduation Rate', ylabel='common ylabel')
plt.show()
One way to visualize the performance of students nation-wide is plotting bar graphs for each state. Additionally, it will be useful to compare graduation within states by gender.
For this matter, we use pandas pivot table. We pick state_abbr as the index, and gender for our columns. The values populated in the columns are the average graduation rate within 150% expected time (i.e. grad_150_rate).
dfs = col_all.pivot_table(index="STABBR", values='C150_L4', aggfunc='mean', fill_value=0)
dfs=dfs.sort_values(by='C150_L4',ascending=False)
dfs.plot(kind='bar',figsize=(20,6))
plt.ylabel('Graduate Rate')
plt.xlabel('States')
plt.title('States Graduation Rate')
Choropleth Maps is a type of thematic map in which areas are shaded or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density. Along the same line of the previous graph, let's take a look at the distribution of graduation rates in the United States.
Note: Other than Pandas and NumPy, "Chart Studio" must be installed too.
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
Now we need to set up everything so that the figures show up in the notebook:
init_notebook_mode(connected=True)
Now we need to begin to build a data dictionary. Easiest way to do this is to use the dict() function of the general form:
#Building data dictionary
data = dict(type = 'choropleth',
locations = dfs.index,
locationmode = 'USA-states',
colorscale= 'Greens',
#text= ['text1','text2','text3'],
z=dfs['C150_L4'],
colorbar = {'title':'Graduate Rate'})
Then we create the layout nested dictionary:
#Building nested dictionary
layout = dict(geo = {'scope':'usa'})
Then we use:
go.Figure(data = [data],layout = layout) to set up the object that finally gets passed into iplot()
#Set up object to be passed into iplot()
choromap = go.Figure(data = [data],layout = layout)
#Plotting the interactive map of US with our graduation rate data
iplot(choromap)
The college completion rate (that is the graduation rate of students) from 3800 degree-granting institutions between 2003 to 2013 was studied. Racial and gender factors were considered to establish a measure of success (completion) for students. The result indicate the average completion rates for colleges is between 30 to 50 percent. California ranks top with 50% graduation rate while Alaska sits at the bottom with 28% completion. Moreover, Asian and White students have the highest completion while African-American students have the lowest completion. Female students graduate at higher rates among all races. Moreover, female students have higher graduation rate compared to male students in every state consistently. These results are useful for future studies that include graduation rates for colleges. Especially, for California colleges where the funding formula is directly dependent on the completion rates and college racial diversity. The next step of the way is to create a predictive model to forecast the completion rates for the coming academic years. The predictive model could potentially predict the racial diversity each year as well. However, to create a robust model more data is required. The required data must be more recent, reflecting the years past 2013 up to 2020. Data must include new features such as teacher-to-student ratio.
Next step presents the results of analysis that will be conducted in order to gauge the predictive capability of a machine learning algorithm to predict on-time graduation that takes into consideration students' learning and development. Picking such model is crucial for college administration planning and budgeting. Anderson et al. [3] trained a set of four binary classifiers to predict FTIC students’ graduation rates. Based on the F1 and AUC scores, SGD classifier and logistic regression performed the best, while linear SVM and decision tree performed slightly worse. Pang et al. [4] predicted students graduation with 72% accuracy and precision/recall around 62% using ensemble support vector machines algorithm. Kesumawati et al. [5] applied Naïve Bayes Classifier and SVM to forecast students graduation rates. Both algorithms yielded 69% accuracy in predictions. Raju in his dissertation [6] applied regression models (forward, backward and stepwise), neural network and decision tree. Neural networks had the best performance in predicting graduation rates, while regression methods were a close contender with 77% accuracy while decision tree performed the worst.
From the brief literature review performed above, it seems that either neural networks or regression models are potential candidates for our predictive model.
Predictions could benefit from engineering features before feeding them to a predictive model. The share of various races in student body is different. While white students constitute more than 50% of the student body, African-American 14% and Hispanic 13%, all other races, including non-resident and unknown races constitute 20% of the student body. So a new variable called UGDS_OTH
is created to include all other races.
Moreover, family income in broken down to middle 1 (INC_PCT_M1
) and middle 2 (INC_PCT_M2
). Both variables are combined into a new variable called INC_PCT_M
. Similarly, high income 1 INC_PCT_H1
and high income 2 INC_PCT_H2
are summed up in INC_PCT_H
.
df['UGDS_OTH'] = df['UGDS_ASIAN']+df['UGDS_AIAN']+df['UGDS_NHPI']+df['UGDS_2MOR']+df['UGDS_NRA']+df['UGDS_UNKN']
df['INC_PCT_M'] = df['INC_PCT_M1']+df['INC_PCT_M2']
df['INC_PCT_H'] = df['INC_PCT_H1']+df['INC_PCT_H2']
df.describe()
The original dataframe created, df
contain some features that are not useful in predictions such as institution name INSTNM
or institution unit ID and its variations OPEID
and OPEID6
. Some other features that thought to be useful in the first round of feature selection such as admission rate ADM_RATE_ALL
or SAT and ACT scores SATVRMID
and ACRCMMID
contain many missing values. These features are removed to increase model predictions. Finally, the features that are combined in the new variables must be removed too (e.g. UGDS_AIAN
or INC_PCT_M1
).
mldf = df.drop(columns=['INSTNM','ADM_RATE_ALL','SATVRMID','SATMTMID','SATWRMID','ACTCMMID','HCM2','CURROPER',
'C150_4','C100_4','OPEID','OPEID6','CITY','STABBR','ZIP','MAIN','NUMBRANCH',
'PREDDEG','HIGHDEG','CONTROL','ST_FIPS','REGION','ICLEVEL','OPEFLAG', 'UGDS_AIAN', 'UGDS_ASIAN',
'UGDS_NHPI','UGDS_2MOR','UGDS_NRA','UGDS_UNKN','INC_PCT_M1','INC_PCT_M2','INC_PCT_H1',
'INC_PCT_H2'], axis = 1)
mldf.info()
The average population for a community college is close to 6000 and average graduation rate close to 0.5. However, in rare cases there are some institutions with small population or very high graduation rates. These institution offer only certifications and offer only a few associate degree programs or none at all. Therefore, these institutions are removed from the dataframe using a simple query
. The query maintain observations with more than 500 students and graduation rates below 0.8.
dfn = mldf.query('C150_L4<0.8 & UGDS>500')
dfn.info()
The dataframe is reduced down to 22713 observations. There are however plenty of missing values in data. Most of the data is suppressed due to privacy issues. We need to perform a hard na drop on the remainder of the dataframe.
dfn.dropna(axis = 0, inplace = True)
Let's rearrange the dataframe and put family features side-by-side
dfn = dfn[['C100_L4','C150_L4','UGDS','UGDS_WHITE','UGDS_BLACK','UGDS_HISP','UGDS_OTH','UGDS_MEN','UGDS_WOMEN','TUITIONFEE_IN','TUITIONFEE_OUT','DEBT_MDN','AVGFACSAL','INC_PCT_LO','INC_PCT_M','INC_PCT_H','FTFTPCTFLOAN']]
dfn.describe()
Let's compute pairwise correlation of columns using dfn.corr(). Seaborn Heatmap shows the contrast of correlations beautifully. We use "RdBu"
color map to contrast positive correlations with negative or inverse correlations.
plt.figure(figsize=(18,10))
sns.set_context('paper',font_scale=1.5)
#cmap = sns.cubehelix_palette(light=1, as_cmap=True)
#sns.heatmap(bf.corr(), cmap = cmap, linecolor='white', linewidths = 3, annot = True, fmt='.1g')
sns.heatmap(dfn.corr(), cmap="RdBu", linecolor='white', linewidths = 3, annot = True, fmt='.1g')
The top 2 rows show the correlation of graduation rate (within 100 and 150 percent of expected time) with other columns. Graduation rate has strong inverse correlation with college population and and average faculty salary. Graduation rate has a positive correlation with tuition fee and average student debt. However, we can not make early judgments until later.
First let's make X
and y
matrices. X
contains all the features except graduation rates. y
contains graduation rates.
y = dfn[['C150_L4']]
X = dfn.drop(columns=['C150_L4', 'C100_L4'], axis = 1)
from sklearn.preprocessing import QuantileTransformer, MinMaxScaler
Scaling is necessary because features range from fractions of 1 to hundreds of thousands. Graduation rates are confined between 0 and 1. We used MinMaxScaler
to keep the range while scaling all features.
We store transformed X
and y
in scy
and scX
.
#scy = MinMaxScaler(feature_range=(0, 0.8)).fit_transform(y)
scX = MinMaxScaler().fit_transform(X)
We use 70% of data for training and set aside 30% for predictions.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scX, y, test_size=0.3)
Researchers in the field have applied different types of models to predict graduation rates. The accuracy of prediction are about 70%. In this study we use RandomForestRegressor
from sklearn.ensemble
.
from sklearn.ensemble import RandomForestRegressor
Additionally, we use GridSearchCV
from sklearn.model_selection
to find the best parameters for the model.
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(X_train, y_train)
rf_random.best_params_
param_grid = {
'bootstrap' : [False],
'max_depth' : [None, 93],
'n_estimators' : [100, 400, 1000, 1500],
'max_features' : [4, 8, 12],
'min_samples_split': [1, 2, 3],
'min_samples_leaf' : [1, 2]
}
grid = GridSearchCV(estimator = rf, param_grid = param_grid,
cv = 3, n_jobs = -1, verbose = 2)
grid.fit(X_train,y_train)
grid.best_params_
param_grid = {
'bootstrap' : [False],
'max_depth' : [None],
'n_estimators' : [1500],
'max_features' : [4],
'min_samples_split': [2],
'min_samples_leaf' : [1]
}
model = GridSearchCV(estimator = rf, param_grid = param_grid,
cv = 3, n_jobs = -1, verbose = 2)
model.fit(X_train,y_train)
#model.feature_importances_
Let's import mean_squared_error
, mean_absolute_error
, explained_variance_score
from sklearn.metrics
.
from sklearn.metrics import mean_squared_error,mean_absolute_error,explained_variance_score
y_pred = model.predict(X_test)
expVar = explained_variance_score(y_test, y_pred)
MSE = mean_squared_error(y_test,y_pred)
MAE = mean_absolute_error(y_test,y_pred)
"Exp Variance:{} MSE: {} MAE: {}".format(expVar, MSE, MAE)
The mean absolute error is ~ 0.07. In the context of graduation rates (between 0 to 1.0) it means that our model is predicting graduation with 7 percent point accuracy. The explained variance is at 0.73. There is still some variance in the model that cannot be resolved readily.
Let's plot test cases versus our prediction and look at the spread.
plt.scatter(y_test,y_pred)
plt.plot(y_test,y_test,'r')
The overall trends indicate consistent predictions. However, there are some outliers and some less accurate predictions.
The model seems to be predicting better up to 0.6 graduation rates and performs poorly for graduation rates above 0.6. One reasons the overwhelming majority of institutions report graduation rates around 0.6 or lower. There are some institutions that graduate students at higher rates that are rare.
Another way to look at the predictions vs test cases, is by getting their difference and plot the distributions plot.
errors = np.array(y_test) - y_pred
sns.distplot(errors)
The errors distribution plot is behaving normal with majority of the error hovering around zero.
X_baseline = np.asarray(X.mean()).reshape((15,1))
X_baseline
BASE CASE:
Graduation rate of base case
X_case =scMM.fit_transform(X_baseline)
model.predict(X_case.reshape((1,15)))
Graduation rate by reducing student debt
X_testing =[[5.90140064e+03],
[5.33810859e-01],
[1.38852707e-01],
[1.28654903e-01],
[1.28312902e-01],
[4.08781989e-01],
[5.91218027e-01],
[4.58957465e+03],
[8.20187433e+03],
[2500],
[6.05138691e+03],
[6.12463601e-01],
[2.83733083e-01],
[1.03803316e-01],
[3.63204819e-01]]
X_case =scMM.fit_transform(X_testing)
grid.predict(X_case.reshape((1,15))) #Debt from 5.7k to 2.7
Graduation rate by reducing increasing tutition fee
X_testing =[[5.90140064e+03],
[5.33810859e-01],
[1.38852707e-01],
[1.28654903e-01],
[1.28312902e-01],
[4.08781989e-01],
[5.91218027e-01],
[9000],
[16000],
[5.67319483e+03],
[6.05138691e+03],
[6.12463601e-01],
[2.83733083e-01],
[1.03803316e-01],
[3.63204819e-01]]
X_case =scMM.fit_transform(X_testing)
model.predict(X_case.reshape((1,15))) #Tuition from 4.5k to 10k