Over the past decade, I have had the opportunity to work and serve at both the community college and university levels. I found that the distribution of available resources and the student demographics differ between the two. Naturally, as a researcher and professor, I was curious about the impact of these disparities on students' performance outcomes. Picking a metric to evaluate such a hypothesis is not trivial. However, many scholars in the field of education believe that "student completion" (that is, graduation rate) is a rather inclusive metric.
Moreover, the government of California changed the funding formula for colleges in 2019. Based on the new funding formula, the annual budget granted to each college is composed of 3 categories:
1. 60% of the funding is directly correlated with student enrollment size.
2. 20% of the funding is directly correlated with a success measure (i.e. completion rate).
3. 20% of the funding is directly correlated with racial diversity.
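As a rough illustration of how the three components listed above add up, the sketch below splits a budget using those weights. The function name `split_budget` and the dollar amount are assumptions made up for this example; the real formula involves more detail than a flat split.

```python
# Hypothetical illustration of the 60/20/20 weights listed above.
FUNDING_WEIGHTS = {
    "enrollment": 0.60,  # share tied to enrollment size
    "success": 0.20,     # share tied to completion rate
    "diversity": 0.20,   # share tied to racial diversity
}


def split_budget(total_budget):
    """Return the portion of a budget tied to each category."""
    return {
        category: total_budget * weight
        for category, weight in FUNDING_WEIGHTS.items()
    }


shares = split_budget(10_000_000)  # assumed annual budget, for illustration only
print(shares)
```

Even in this simplified view, one fifth of the budget moves with the completion rate, which is why forecasting it matters for planning.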
College administrations have always been interested in monitoring student outcomes by measuring completion rates. However, now more than ever it seems crucial to measure this factor. Additionally, administrations need to predict completion rates to account for them in their planning and budgets. Therefore, I am proposing the following project to create a predictive model to forecast the completion rate.
For the proposed project I am using a publicly available dataset gathered from nearly all degree-granting institutions in the US to compare the completion rate within 4-year and 2-year colleges. The dataset includes the students' completion based on various factors such as race, the state, gender and year.
Here is the information on our data set:
# | Features | Description |
---|---|---|
0 | stateid | state FIPS code ('00' for United States) |
1 | state | state name |
2 | state_abbr | state abbreviation |
3 | control | Control of institution (Public, Private not-for-profit, Private for-profit) |
4 | level | Level of institution (4-year, 2-year) |
5 | year | year of data release |
6 | gender | gender of students ('B' = both genders; 'M' = male; 'F' = female) |
7 | race | race/ethnicity of students ('X' = all students; 'Ai' = American Indian; 'A' = Asian; 'B' = Black; 'H' = Hispanic; 'W' = White) |
8 | cohort | degree-seeking cohort type ('4y bach' = Bachelor's/equivalent-seeking cohort at 4-year institutions; '4y other' = Students seeking another type of degree or certificate at a 4-year institution; '2y all' = Degree-seeking students at 2-year institutions) |
9 | grad_cohort | Number of first-time, full-time, degree-seeking students in the cohort being tracked, minus any exclusions |
10 | grad_100 | Number of students who graduated within 100 percent of normal/expected time |
11 | grad_150 | Number of students who graduated within 150 percent of normal/expected time |
12 | grad_100_rate | Percentage of students who graduated within 100 percent of normal/expected time |
13 | grad_150_rate | Percentage of students who graduated within 150 percent of normal/expected time |
14 | grad_cohort_ct | Number of institutions with data included in the cohort |
College Completion examines data and trends at 3,800 degree-granting institutions in the United States (excluding territories) that reported a first-time, full-time degree-seeking undergraduate cohort, had a total of at least 100 students at the undergraduate level in 2013, and awarded undergraduate degrees between 2011 and 2013. It also includes colleges and universities that met the same criteria in 2010.
Graduation data from the National Center for Education Statistics’ Integrated Postsecondary Education Data System is limited to tracking completions for groups of first-time, full-time degree-seeking students at the undergraduate level.
Until 2009, the NCES classified students in seven ways: White, non-Hispanic; Black, non-Hispanic; American Indian/Alaskan Native; Asian/Pacific Islander; unknown race or ethnicity; and nonresident. In addition to creating a stronger separation between race and ethnicity categories, two new race categories were created: Native Hawaiian or Other Pacific Islander (previously combined with Asian students) and students who belong to two or more races.
“Awards per 100 full-time undergraduate students” includes all undergraduate-level completions reported by the institution to the NCES: bachelor’s degrees, associate degrees, and certificate programs of less than four years in length. Full-time-equivalent undergraduates are estimated from the number of credit hours taken at the institution in an academic year. To account for changes in enrollment, the resulting metric is a three-year average of data from 2011, 2012, and 2013.
First, I import the required libraries: NumPy, Pandas, Matplotlib and Seaborn. Then I import the dataset, check the data, and calculate the key statistics on all numeric columns. I use pairplots to find any correlations in the data and to catch any trends early on, which gives a good first glimpse of the data. Next, I explore the relationship between graduation rate and race for males and females. A heatmap seems like a good choice, and in fact it yields a beautiful graph. I would also like to find the graduation rate nationwide and rank all states accordingly; bar graphs seem the right choice for that. Finally, I am interested in producing an interactive map of the US, with each state shaded by its graduation rate, using the Chart Studio package.
# Importing NumPy and Pandas
import numpy as np
import pandas as pd
#Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
#Importing dataset into df
df = pd.read_csv('cc_state_sector_grads.csv')
Let's take a look at the overall structure of our dataset and familiarize ourselves with the features and some of the first observations.
df.head(10)
The first column is a generated index that we could get rid of at the time of import. We will deal with it later.
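One way to drop the generated index at import time is the `index_col` parameter of `pd.read_csv`. The sketch below uses a tiny in-memory stand-in for the real file (the rows are made up for illustration) and assumes the generated index is the first, unnamed column.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; the unnamed first column is the exported index.
sample_csv = io.StringIO(
    ",stateid,state,year,grad_cohort\n"
    "0,0,United States,2011,100\n"
    "1,1,Alabama,2011,20\n"
)

# index_col=0 makes pandas use the first column as the row index
# instead of loading it as an "Unnamed: 0" data column.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

Applied to the real file, the same idea would read `pd.read_csv('cc_state_sector_grads.csv', index_col=0)`.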
df.info()
There are about 85,000 entries (observations) containing both numeric and string values. There are more than 50,000 null values in grad_100, whereas there are no nulls in grad_150. By definition, some people graduate within 100% of the expected time and the rest graduate within 150% of the expected time. There are also more than 50,000 null values in grad_100_rate, which is expected. However, there are about 10,000 null values in grad_150_rate. This is because the rate is a division: in some cases there were 0 graduates out of a cohort of 0 students, so the division came out to be 0 by 0, i.e. NaN.
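The 0-by-0 effect is easy to reproduce on a toy frame. The sketch below assumes the rate is computed as 100 times the number of graduates divided by the cohort size; the rows are made up for illustration.

```python
import pandas as pd

# Toy rows: the second cohort is empty, so its rate is 0 / 0.
toy = pd.DataFrame({"grad_cohort": [40, 0], "grad_150": [10, 0]})

# Assumed definition of the rate: percentage of the cohort that graduated.
toy["grad_150_rate"] = 100 * toy["grad_150"] / toy["grad_cohort"]

# Number of missing values per column; only the rate column has one.
null_counts = toy.isnull().sum()
print(toy)
print(null_counts)
```

The count columns stay fully populated while the derived rate picks up a NaN, which matches the pattern seen in `df.info()`.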
Let's look at some key statistics of the features:
df.describe().transpose()
The years reported in this data set range from 2002 to 2013, with a relatively even spread across the years judging by the 25th, 50th and 75th percentiles.
The graduation cohort size varies from 0 to 891211 students. The maximum is this large because the statistics for the entire country are also included for each year in our data set. We confirm this by locating the max grad_cohort:
df.loc[df['grad_cohort'].idxmax()]
The maximum grad_cohort, grad_100 and grad_150 occur in 2013, the last year for which data is available.
The maximum grad_100_rate and grad_150_rate are looked up as well:
df.loc[df['grad_100_rate'].idxmax()]
The only student in the cohort graduated yielding a 100% graduation rate.
Every year, data is available from only a portion of the institutions in the US. Let's find out which year had the most data by number of institutions:
df.loc[df['grad_cohort_ct'].idxmax()]
It turns out that in 2011 data was gathered from more institutions than in any other year.
One of the most effective starting tools in data visualization is the pairs plot (also called a scatterplot matrix). A pairs plot allows us to see both distribution of single variables and relationships between two variables. Pair plots are a great method to identify trends for follow-up analysis and, fortunately, are easily implemented in Python!
Before we create the pairplot, let's take the "United States" rows out of the data. However, we bring the United States back in other parts of our analysis and use it as a baseline for comparison.
dfstate= df[df.stateid != 0]
sns.pairplot(dfstate)
Graduation cohort is strongly correlated with graduation rate, whether at 100% or 150% of the expected time. Graduation within 100% of the designated timeframe is also correlated with graduation within 150% of the designated timeframe. However, it should be noted that grad_150 is cumulative, i.e. the students who graduated within 100% of the time are also included among those who graduated within 150% of the time.
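The cumulative relationship can be checked directly. The sketch below uses made-up rows and a hypothetical `grad_late` column to separate students who finished on time from those who finished between 100% and 150% of the expected time.

```python
import pandas as pd

# Made-up counts for illustration only.
toy = pd.DataFrame({
    "grad_cohort": [100, 50, 10],
    "grad_100": [30, 10, 2],
    "grad_150": [55, 20, 2],
})

# If grad_150 is cumulative, it can never be smaller than grad_100.
is_cumulative = (toy["grad_150"] >= toy["grad_100"]).all()

# Hypothetical helper column: students who finished late,
# i.e. between 100% and 150% of the expected time.
toy["grad_late"] = toy["grad_150"] - toy["grad_100"]
print(is_cumulative)
print(toy)
```

On the real data, rows with a missing grad_100 would need to be dropped first (for example with `dropna`), because comparisons against NaN evaluate to False.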
Now let's look at the distribution of graduation rate by state. We use a Seaborn distribution plot for illustration. grad_150_rate reflects the share of students who eventually graduated, so we pick it as the indicator. We group our DataFrame by state, find the average grad_150_rate for each state, and sort the results in descending order.
Note: We leave "United States" in dataset as a baseline measure.
df.loc[df['gender']=='B'].groupby('state')['grad_150_rate'].mean().sort_values(ascending=False)
We would like to see the distribution of the graduation rate for both genders combined.
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df[df['gender']=='B']['grad_150_rate'], kde=True)
For the purpose of visualization, graduation within 150% of the expected time (which includes students graduating on time and those graduating up to 50% later) is used as the label, while race, gender and state of graduation are used as features. Note: Later on in the project, when applying machine learning techniques, all features are examined again. Categorical variables will be converted to dummy variables.
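As a preview of the dummy-variable step, here is a minimal sketch using `pd.get_dummies` on made-up rows. The choice of `drop_first=True` is an assumption for this example, not a decision already made for the project.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "gender": ["M", "F", "F"],
    "race": ["A", "W", "B"],
    "grad_150_rate": [48.0, 55.0, 35.0],
})

# One indicator column per category; drop_first=True drops the first
# category of each feature to avoid a redundant (collinear) column.
features = pd.get_dummies(toy[["gender", "race"]], drop_first=True)
print(features)
```

Here 'F' and 'A' become the reference categories, so the remaining columns are `gender_M`, `race_B` and `race_W`.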
#Selecting the desired features for visualization using pivot tables
dfpivot=df.pivot_table(values='grad_150_rate', index='gender', columns = 'race')
Let's compare the mean graduation rate (graduation percentage) across race and gender. A heatmap is a good candidate for this purpose.
plt.figure(figsize=(12,6))
sns.set_context('paper',font_scale=1.5)
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
sns.heatmap(dfpivot, cmap = cmap, linecolor='white', linewidths = 3)
Looking at the far-right column (all students), female students in general have a higher graduation rate. This trend is consistent across every race. Moreover, White and Asian students have comparatively higher graduation rates, and within both groups females graduate at a higher rate, consistent with the overall trend. African-American students have the lowest graduation rates: African-American males have the lowest rate overall, while Asian and White female students have the highest.
One way to visualize the performance of students nation-wide is plotting bar graphs for each state. Additionally, it will be useful to compare graduation within states by gender.
For this purpose, we use a pandas pivot table. We pick state_abbr as the index and gender for our columns. The values populated in the columns are the average graduation rate within 150% of the expected time (i.e. grad_150_rate).
#Using barplot to study graduation rate nationwide
dfs=df.pivot_table(index="state_abbr", columns="gender", values="grad_150_rate", fill_value=0)
dfs['state'] = dfs.index
dfs.head()
We use matplotlib for illustration.
dfs=dfs.sort_values(by='B',ascending=False)
dfs.plot(kind='bar',figsize=(20,6))
plt.ylabel('Graduate Rate')
plt.xlabel('States')
plt.title('States Graduation Rate Among Genders')
The chart is sorted in descending order, clearly showing California with the highest graduation rate in the States. The graduation rate in California for both genders combined is slightly higher than 50%, and it is higher for females than for males, which is consistent with our earlier findings. Wyoming, New Hampshire, Virginia and Vermont rank 2nd to 5th. The bottom six states are West Virginia, Hawaii, South Dakota, Michigan, Nevada and Alaska, with graduation rates around 30%. We left "United States" in our data set; the national average graduation rate is about 45%, with females again graduating at the higher rate.
Another insightful plot here would be the difference between female and male graduation rates by state.
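A minimal sketch of how that gap could be computed, using the same pivot-table approach on made-up rows (the `gap` name and the numbers are assumptions for illustration):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "state_abbr": ["CA", "CA", "AK", "AK"],
    "gender": ["F", "M", "F", "M"],
    "grad_150_rate": [54.0, 48.0, 30.0, 26.0],
})

# Average rate per state and gender, one column per gender.
by_state = toy.pivot_table(
    index="state_abbr", columns="gender", values="grad_150_rate"
)

# Female minus male graduation rate, largest gap first.
gap = (by_state["F"] - by_state["M"]).sort_values(ascending=False)
print(gap)
```

Calling `gap.plot(kind='bar')` on the result would then draw the gap per state in the same style as the chart above.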
A choropleth map is a type of thematic map in which areas are shaded or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density. Along the same lines as the previous graph, let's take a look at the distribution of graduation rates across the United States.
Note: Other than Pandas and NumPy, "Chart Studio" must be installed too.
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
Now we need to set up everything so that the figures show up in the notebook:
init_notebook_mode(connected=True)
Now we need to begin building a data dictionary. The easiest way to do this is to use the dict() function in the general form:
#Building data dictionary
data = dict(type = 'choropleth',
locations = dfs['state'],
locationmode = 'USA-states',
colorscale= 'Greens',
#text= ['text1','text2','text3'],
z=dfs['B'],
colorbar = {'title':'Graduate Rate'})
Then we create the layout nested dictionary:
#Building nested dictionary
layout = dict(geo = {'scope':'usa'})
Then we use go.Figure(data = [data],layout = layout) to set up the object that finally gets passed into iplot():
#Set up object to be passed into iplot()
choromap = go.Figure(data = [data],layout = layout)
#Plotting the interactive map of US with our graduation rate data
iplot(choromap)
The college completion rate (that is, the graduation rate of students) from 3800 degree-granting institutions between 2003 and 2013 was studied. Racial and gender factors were considered to establish a measure of success (completion) for students. The results indicate that the average completion rate for colleges is between 30 and 50 percent. California ranks at the top with a 50% graduation rate, while Alaska sits at the bottom with 28% completion. Moreover, Asian and White students have the highest completion rates, while African-American students have the lowest. Female students graduate at higher rates across all races, and they consistently have a higher graduation rate than male students in every state.
These results are useful for future studies that include graduation rates for colleges, especially for California colleges, where the funding formula depends directly on completion rates and racial diversity. The next step is to create a predictive model to forecast the completion rates for the coming academic years. The predictive model could potentially predict the racial diversity each year as well. However, creating a robust model requires more data. The required data must be more recent, reflecting the years after 2013 up to 2020, and must include new features such as teacher-to-student ratio.
The next step is an analysis to gauge the capability of a machine learning algorithm to predict on-time graduation, taking into consideration students' learning and development. Picking such a model is crucial for college administration planning and budgeting. Anderson et al. [3] trained a set of four binary classifiers to predict FTIC students’ graduation rates. Based on the F1 and AUC scores, the SGD classifier and logistic regression performed the best, while linear SVM and decision tree performed slightly worse. Pang et al. [4] predicted student graduation with 72% accuracy and precision/recall around 62% using an ensemble support vector machine algorithm. Kesumawati et al. [5] applied a Naïve Bayes classifier and SVM to forecast student graduation rates; both algorithms yielded 69% accuracy in predictions. Raju, in his dissertation [6], applied regression models (forward, backward and stepwise), a neural network and a decision tree. The neural network had the best performance in predicting graduation rates, regression methods were a close contender with 77% accuracy, and the decision tree performed the worst.
From the brief literature review performed above, it seems that either neural networks or regression models are potential candidates for our predictive model.
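To make the regression option concrete, here is a minimal baseline sketch: an ordinary least squares fit on dummy-encoded features using NumPy. It is not the model proposed for the project, only an illustration of the idea; the rows are made up and the feature choice (level and gender) is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not values from the dataset.
toy = pd.DataFrame({
    "level": ["4-year", "4-year", "2-year", "2-year", "4-year", "2-year"],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "grad_150_rate": [58.0, 52.0, 33.0, 27.0, 60.0, 29.0],
})

# Encode categorical features as 0/1 columns and add an intercept term.
X = pd.get_dummies(toy[["level", "gender"]], drop_first=True).astype(float)
X.insert(0, "intercept", 1.0)
y = toy["grad_150_rate"].to_numpy()

# Ordinary least squares: find coef minimizing ||X @ coef - y||^2.
coef, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
predicted = X.to_numpy() @ coef

print(dict(zip(X.columns, coef.round(2))))
```

In this toy fit the coefficient for `level_4-year` comes out positive and the one for `gender_M` negative, mirroring the patterns seen in the exploratory plots. A real model would need a train/test split and the additional features mentioned earlier.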