College Completion Rate in the United States

Background

Over the past decade, I had the opportunity to work and serve both at the community college level and university level. I found the distribution of available resources and student demographics to be different. Naturally, as a researcher and professor, I was curious to find out the impact of disparities on students' performance outcome. Picking a metric to evaluate such hypothesis is not trivial. However, many scholars in the field of education believe that "student completion" (that is graduation rate) is a rather inclusive metric.

Moreover, the government of California changed the funding formula for colleges in 2019. Based on the new funding formula, the annual budget granted to each college is composed of 3 categories:

1.60% of the funding is directly correlated with the students enrollment size.

2.20% of the funding is directly correlated with success measure (i.e. completion rate)

3.20% of the funding is directly correlated with the racial diversity.

College administrations have always been interested to monitor their students outcome by measuring their completion rate. However, now more than ever it seems crucial to measure this factor. Additionally, administrations need to predict the completion rates to account for it in their planning and budgets. Therefore, I am proposing the following project to create a predictive model to forecast the completion rate.

For the proposed project I am using a publicly available dataset gathered from nearly all degree-granting institutions in the US to compare the completion rate within 4-year and 2-year colleges. The dataset includes the students' completion based on various factors such as race, the state, gender and year.

The Data

Source

The link to the publicly available dataset and summary is provided by data.world [1]. These data were pulled from the College Completion microsite produced by The Chronicle of Higher Education with support from the Bill & Melinda Gates Foundation [2].

Data Overview

Here is the information on our data set:

Features Description
0 stateid state FIPS code ('00' for United States)
1 state state name
2 state_abbr state abbreviation
3 control Control of institution (Public, Private not-for-profit, Private for-profit)
4 level Level of institution (4-year, 2-year)
5 year year of data release
6 gender gender of students ('B' = both genders; 'M' = male; 'F' = female)
7 race race/ethnicity of students ('X' = all students; 'Ai' = American Indian; 'A' = Asian; 'B' = Black; 'H' = Hispanic; 'W' = White)
8 cohort degree-seeking cohort type ('4y bach' = Bachelor's/equivalent-seeking cohort at 4-year institutions; '4y other' = Students seeking another type of degree or certificate at a 4-year institution; '2y all' = Degree-seeking students at 2-year institutions)
9 grad_cohort Number of first-time, full-time, degree-seeking students in the cohort being tracked, minus any exclusions
10 grad_100 Number of students who graduated within 100 percent of normal/expected time
11 grad_150 Number of students who graduated within 150 percent of normal/expected time
12 grad_100_rate Percentage of students who graduated within 100 percent of normal/expected time
13 grad_150_rate Percentage of students who graduated within 150 percent of normal/expected time
14 grad_cohort_ct Number of institutions with data included in the cohort

The institutions

College Completion examines data and trends at 3,800 degree-granting institutions in the United States (excluding territories) that reported a first-time, full-time degree-seeking undergraduate cohort, had a total of at least 100 students at the undergraduate level in 2013, and awarded undergraduate degrees between 2011 and 2013. It also includes colleges and universities that met the same criteria in 2010.

Graduation rates

Graduation data from the National Center for Education Statistics’ Integrated Postsecondary Education System is limited to tracking completions for groups of first-time, full-time degree-seeking students at the undergraduate level.

Race and ethnicity

Until 2009, the NCES classified students in seven ways: White, non-Hispanic; Black, non-Hispanic; American Indian/Alaskan Native; Asian/Pacific Islander; unknown race or ethnicity; and nonresident. In addition to creating a stronger separation between race and ethnicity categories, two new race categories were created: Native Hawaiian or Other Pacific Islander (previously combined with Asian students) and students who belong to two or more races.

Efficiency measures

“Awards per 100 full-time undergraduate students” includes all undergraduate-level completions reported by the institution to the NCES: bachelor’s degrees, associate degrees, and certificate programs of less than four years in length. Full-time-equivalent undergraduates are estimated from the number of credit hours taken at the institution in an academic year. To account for changes in enrollment, the resulting metric is a three-year average of data from 2011, 2012, and 2013.

Exploratory Data Analysis

First, I import the required libraries: NumPy, Pandas, Matplotlib and Seaborn. Then I import the dataset. I check the data and calculate the key statistics on all numeric columns. I use pairplots to find any correlation between data or catch any trends in the beginning. Now I have a good glimpse of the data. Next, I would like to explore any relationship between graduation rate and race between males and females. Heatmap seems like a good choice and in fact it yields a beautiful graph. Moreover, I would like to find out the graduation rate nationwide and rank all states accordingly. Bar graphs seems the right choice. Moreover, I am interested in producing an interactive plot of the US with each state indicating the graduation rate using Chart Studio package.

Importing the required packages and libraries

In [1]:
# Importing NumPy and Pandas

import numpy as np
import pandas as pd
In [2]:
#Import Matplotlib and Seaborn

import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
#Importing dataset into df

df = pd.read_csv('cc_state_sector_grads.csv')

Performing initial explorations

Let's take a look at the overall structure of our dataset and familiarize us with the features and some of the first observations.

In [4]:
df.head(10)
Out[4]:
stateid state state_abbr control level year gender race cohort grad_cohort grad_100 grad_150 grad_100_rate grad_150_rate grad_cohort_ct
0 1 Alabama AL Private for-profit 4-year 2011 B A 4y bach 0 0.0 0 NaN NaN 9
1 1 Alabama AL Private for-profit 4-year 2011 B Ai 4y bach 1 0.0 0 0.0 0.0 9
2 1 Alabama AL Private for-profit 4-year 2011 B B 4y bach 51 2.0 3 3.9 5.9 9
3 1 Alabama AL Private for-profit 4-year 2011 B H 4y bach 1 0.0 0 0.0 0.0 9
4 1 Alabama AL Private for-profit 4-year 2011 B W 4y bach 66 15.0 18 22.7 27.3 9
5 1 Alabama AL Private for-profit 4-year 2011 B X 4y bach 209 39.0 49 18.7 23.4 9
6 1 Alabama AL Private for-profit 4-year 2011 F A 4y bach 0 0.0 0 NaN NaN 9
7 1 Alabama AL Private for-profit 4-year 2011 F Ai 4y bach 0 0.0 0 NaN NaN 9
8 1 Alabama AL Private for-profit 4-year 2011 F B 4y bach 26 0.0 0 0.0 0.0 9
9 1 Alabama AL Private for-profit 4-year 2011 F H 4y bach 0 0.0 0 NaN NaN 9

The first column is a generated index that we could get rid of at the time of import. We will deal with it later.

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84942 entries, 0 to 84941
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   stateid         84942 non-null  int64  
 1   state           84942 non-null  object 
 2   state_abbr      84942 non-null  object 
 3   control         84942 non-null  object 
 4   level           84942 non-null  object 
 5   year            84942 non-null  int64  
 6   gender          84942 non-null  object 
 7   race            84942 non-null  object 
 8   cohort          84942 non-null  object 
 9   grad_cohort     84942 non-null  int64  
 10  grad_100        31760 non-null  float64
 11  grad_150        84942 non-null  int64  
 12  grad_100_rate   30033 non-null  float64
 13  grad_150_rate   75623 non-null  float64
 14  grad_cohort_ct  84942 non-null  int64  
dtypes: float64(3), int64(5), object(7)
memory usage: 9.7+ MB

There are about 85,000 entries (observations) of numerics and strings. There are more than 50,000 null data in grad_100. However, there is no non-null in grad_150. By definition, some people graduate within 100% expected time and the rest graduated within 150% of the expected time. Moreover, there are also more than 50,000 null data in grad_100_rate which is expected. However, there rare about 10,000 null data in grad_150_rate. This is because the rate is division. In some colleges, there were 0 students who graduating so the division came out to be 0 by 0 i.e. NaN.

Let's look at some key statistics of thefeatures:

In [6]:
df.describe().transpose()
Out[6]:
count mean std min 25% 50% 75% max
stateid 84942.0 28.179699 15.901155 0.0 16.0 28.0 41.0 56.0
year 84942.0 2007.565162 3.489043 2002.0 2005.0 2008.0 2011.0 2013.0
grad_cohort 84942.0 2308.136658 19414.099744 0.0 6.0 64.0 561.0 891211.0
grad_100 31760.0 1346.772733 9678.350718 0.0 3.0 33.5 409.0 296553.0
grad_150 84942.0 1058.684703 9753.837421 0.0 2.0 20.0 207.0 513263.0
grad_100_rate 30033.0 28.254796 19.921761 0.0 13.1 25.8 41.4 100.0
grad_150_rate 75623.0 39.426019 24.626696 0.0 19.9 39.4 56.8 100.0
grad_cohort_ct 84942.0 23.215512 92.582082 1.0 3.0 7.0 16.0 1180.0

The year data is reported in this data set is between 2002 and 2013 with a relatively even scatter throughout the years considering 25, 50 and 75 percentile.

The number of graduates cohorts vary from 0 to 891211 students. However, the statistics for the entire country is also included for each year in our data set. We confirm this by locating the max grad_cohort:

In [7]:
df.loc[df['grad_cohort'].idxmax()]
Out[7]:
stateid                       0
state             United States
state_abbr                   US
control                  Public
level                    4-year
year                       2013
gender                        B
race                          X
cohort                  4y bach
grad_cohort              891211
grad_100                 296553
grad_150                 513263
grad_100_rate              33.3
grad_150_rate              57.6
grad_cohort_ct              632
Name: 84605, dtype: object

The maximum grad_cohort, grad_100 and grad_150 occurs in 2013, the last year data is available.

The maximum grad_100_rate and grad_150_rate is looked up as well:

In [8]:
df.loc[df['grad_100_rate'].idxmax()]
Out[8]:
stateid                            1
state                        Alabama
state_abbr                        AL
control           Private for-profit
level                         4-year
year                            2012
gender                             M
race                               H
cohort                       4y bach
grad_cohort                        1
grad_100                           1
grad_150                           1
grad_100_rate                    100
grad_150_rate                    100
grad_cohort_ct                     9
Name: 33, dtype: object

The only student in the cohort graduated yielding a 100% graduation rate.

Every year data is available from a portion of the institution in the US. Let's find out what year we had the most data by the number of institutions:

In [9]:
df.loc[df['grad_cohort_ct'].idxmax()]
Out[9]:
stateid                                0
state                      United States
state_abbr                            US
control           Private not-for-profit
level                             4-year
year                                2011
gender                                 B
race                                   A
cohort                           4y bach
grad_cohort                        22763
grad_100                           14178
grad_150                           17411
grad_100_rate                       62.3
grad_150_rate                       76.5
grad_cohort_ct                      1180
Name: 84510, dtype: object

It turns out in 2011 we gathered data from more institution than any other year.

Visualizing Data

Pairplots

One of the most effective starting tools in data visualization is the pairs plot (also called a scatterplot matrix). A pairs plot allows us to see both distribution of single variables and relationships between two variables. Pair plots are a great method to identify trends for follow-up analysis and, fortunately, are easily implemented in Python!

Before we create the pairplot, let's take out the "United States" from the data. However, we bring the United States back in other parts of our analysis and use as a baseline for comparison.

In [10]:
dfstate= df[df.stateid != 0]
In [11]:
sns.pairplot(dfstate)
Out[11]:
<seaborn.axisgrid.PairGrid at 0x130e504d788>

Graduation cohort is strongly correlated with graduation rate whether at 100% or 150% of the expected time. Graduation within 100% of designated timeframe is also correlated with graduation withing 150% of designated timeframe. However, it should be noted that grad_150 is cumulative i.e. the population graduated within 100% of time are also included in the population of graduation in 150% of time.

Graduation distribution plot

Now let's look at the distribution of graduate rate by state. We use Seaborn distribution plot for illustration. "grad_rate_150 is the ultimate number of students who eventually graduated. We pick that as the indicator. We group our DataFrame by state, find the average "grad_150_rate for each state and and sort them descendingly.

Note: We leave "United States" in dataset as a baseline measure.

In [12]:
df.loc[df['gender']=='B'].groupby('state')['grad_150_rate'].mean().sort_values(ascending=False)
Out[12]:
state
California              52.691754
Wyoming                 48.940278
New Hampshire           47.271849
Virginia                46.150584
Vermont                 45.457798
Florida                 45.425563
New Jersey              45.376957
Pennsylvania            45.324799
Iowa                    44.393402
Maryland                44.115090
Missouri                44.062399
Illinois                43.976384
Washington              43.574144
Massachusetts           43.525893
Connecticut             43.346286
Oregon                  43.311726
North Carolina          43.032522
Arizona                 41.932794
United States           41.686883
South Carolina          41.580651
Colorado                40.369474
Mississippi             40.203333
New York                40.090143
Alabama                 39.998225
Texas                   39.961774
Rhode Island            39.544056
Delaware                39.353911
Indiana                 39.253744
Nebraska                39.103604
Georgia                 38.884124
Tennessee               38.715120
Ohio                    38.665954
Idaho                   38.589452
Minnesota               38.352613
Maine                   37.400469
New Mexico              37.001044
Wisconsin               36.973194
Arkansas                36.684000
Utah                    36.547179
Kansas                  36.252427
Oklahoma                34.993053
Kentucky                34.238837
Louisiana               33.785952
Montana                 33.442265
North Dakota            32.493814
District of Columbia    32.437324
West Virginia           31.577106
Hawaii                  30.568791
South Dakota            29.738446
Michigan                29.551731
Nevada                  28.797430
Alaska                  27.550872
Name: grad_150_rate, dtype: float64

We would like to see the distribution of graduation for all genders.

In [13]:
sns.distplot(df[df['gender']=='B']['grad_150_rate'])
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x130e8769d08>

Deciding on the label and features

For the purpose of visualization, graduation at 150% of time which includes people graduating on time and people graduating 50% later as the label, while race and gender and state of graduatio is used as features for visualization Note: Later on in the project, when applying machine learning teachniques, all features are examined again. Categorical variables will be converted to dummy variables.

In [14]:
#Selecting the desired features for visualization using pivot tables

dfpivot=df.pivot_table(values='grad_150_rate', index='gender', columns = 'race')

Heatmap of graduation rate for race and gender

Let's compare the mean graduation rate (graduation percentage) across race and gender. Heatmap is good candidate for this purpose.

In [15]:
plt.figure(figsize=(12,6))
sns.set_context('paper',font_scale=1.5)
cmap = cmap = sns.cubehelix_palette(light=1, as_cmap=True)
sns.heatmap(dfpivot, cmap = cmap, linecolor='white', linewidths = 3)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x130ebe96748>

Looking at the far-right column, female students in general have a higher graduation rate. This trend is consistent throughout every race. Moreover, White and Asian students have a higher graduation rates comparatively. Within White and Asian students, females have higher graduation rate, consistent with the overall trends. African-American students have the lowest graduation rates. African-American males have the lowest graduation rate, while the Asian and White female students have the higher graduation rate.

Bar plot of graduation rate by state

One way to visualize the performance of students nation-wide is plotting bar graphs for each state. Additionally, it will be useful to compare graduation within states by gender.

For this matter, we use pandas pivot table. We pick state_abbr as the index, and gender for our columns. The values populated in the columns are the average graduation rate within 150% expected time (i.e. grad_150_rate).

In [16]:
#Using barplot to study graduation rate nationwide

dfs=df.pivot_table(index="state_abbr", columns="gender", values="grad_150_rate", fill_value=0)
dfs['state'] = dfs.index
dfs.head()
Out[16]:
gender B F M state
state_abbr
AK 27.550872 28.864263 24.873290 AK
AL 39.998225 41.750837 38.254661 AL
AR 36.684000 38.586147 32.468904 AR
AZ 41.932794 40.722316 41.468333 AZ
CA 52.691754 53.628799 50.298574 CA

We use matplotlib for illustration.

In [17]:
dfs=dfs.sort_values(by='B',ascending=False)
dfs.plot(kind='bar',figsize=(20,6))
plt.ylabel('Graduate Rate')
plt.xlabel('States')
plt.title('States Graduation Rate Among Genders')
Out[17]:
Text(0.5, 1.0, 'States Graduation Rate Among Genders')

The chart is drawn in a descending format, clearly showing California with the highest graduation rate int the States. The graduation rate in California for both genders is slightly higher than 50% while the graduation rate is higher for females, which is consistent with our findings before. Wyoming, New Hampshire, Virginia and Vermont are ranked 2 to 5. The bottom 5 states are West Virgina, Hawaii, South Dakota, Michigan, Nevada and Alaska, with graduation rates around 30%. We left "United States" in our data set. The graduation rates in the United States as an average is about 45% with females having the bigger share.

Another insightful plot here would be the difference in females and males graduation rates by state.

Creating an interactive map of nationwide graduation rate

Choropleth Maps is a type of thematic map in which areas are shaded or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density. Along the same line of the previous graph, let's take a look at the distribution of graduation rates in the United States.

Note: Other than Pandas and NumPy, "Chart Studio" must be installed too.

In [18]:
import chart_studio.plotly as py
import plotly.graph_objs as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

Now we need to set up everything so that the figures show up in the notebook:

In [19]:
init_notebook_mode(connected=True) 

Now we need to begin to build a data dictionary. Easiest way to do this is to use the dict() function of the general form:

In [20]:
#Building data dictionary

data = dict(type = 'choropleth',
            locations = dfs['state'],
            locationmode = 'USA-states',
            colorscale= 'Greens',
            #text= ['text1','text2','text3'],
            z=dfs['B'],
            colorbar = {'title':'Graduate Rate'})

Then we create the layout nested dictionary:

In [21]:
#Building nested dictionary

layout = dict(geo = {'scope':'usa'})

Then we use:

go.Figure(data = [data],layout = layout) to set up the object that finally gets passed into iplot()

In [22]:
#Set up object to be passed into iplot()

choromap = go.Figure(data = [data],layout = layout)
In [23]:
#Plotting the interactive map of US with our graduation rate data

iplot(choromap)

Conclusion and Future Work

The college completion rate (that is the graduation rate of students) from 3800 degree-granting institutions between 2003 to 2013 was studied. Racial and gender factors were considered to establish a measure of success (completion) for students. The result indicate the average completion rates for colleges is between 30 to 50 percent. California ranks top with 50% graduation rate while Alaska sits at the bottom with 28% completion. Moreover, Asian and White students have the highest completion while African-American students have the lowest completion. Female students graduate at higher rates among all races. Moreover, female students have higher graduation rate compared to male students in every state consistently. These results are useful for future studies that include graduation rates for colleges. Especially, for California colleges where the funding formula is directly dependent on the completion rates and college racial diversity. The next step of the way is to create a predictive model to forecast the completion rates for the coming academic years. The predictive model could potentially predict the racial diversity each year as well. However, to create a robust model more data is required. The required data must be more recent, reflecting the years past 2013 up to 2020. Data must include new features such as teacher-to-student ratio.

Data modeling, prediction and evaluation

Next step presents the results of analysis that will be conducted in order to gauge the predictive capability of a machine learning algorithm to predict on-time graduation that takes into consideration students' learning and development. Picking such model is crucial for college administration planning and budgeting. Anderson et al. [3] trained a set of four binary classifiers to predict FTIC students’ graduation rates. Based on the F1 and AUC scores, SGD classifier and logistic regression performed the best, while linear SVM and decision tree performed slightly worse. Pang et al. [4] predicted students graduation with 72% accuracy and precision/recall around 62% using ensemble support vector machines algorithm. Kesumawati et al. [5] applied Naïve Bayes Classifier and SVM to forecast students graduation rates. Both algorithms yielded 69% accuracy in predictions. Raju in his dissertation [6] applied regression models (forward, backward and stepwise), neural network and decision tree. Neural networks had the best performance in predicting graduation rates, while regression methods were a close contender with 77% accuracy while decision tree performed the worst.

From the brief literature review performed above, it seems that either neural networks or regression models are potential candidates for our predictive model.