1. College Completion Rate in the United States

1.1 Background

Over the past decade, I have had the opportunity to work and serve at both the community college and university levels. I found that the distribution of available resources and the student demographics differ considerably between the two. Naturally, as a researcher and professor, I was curious about the impact of these disparities on student performance outcomes. Picking a metric to evaluate such a hypothesis is not trivial; however, many scholars in the field of education consider "student completion" (that is, the graduation rate) a rather inclusive metric.

Moreover, the State of California changed how it allocates funding for its community colleges in 2019. Under the new funding formula, the annual allocation for each college comprises three categories:

1. 60% of the funding is a base allocation, which largely reflects enrollment.

2. 20% of the funding is directly tied to success measures (which include the completion rate).

3. 20% of the funding is based on the number of students receiving grants.

For more details visit California Community Colleges Student Centered Funding Formula.
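As a toy illustration of the 60/20/20 split described above (the total budget is hypothetical, and the actual formula computes each component from detailed per-student rates rather than fixed shares of a total):

# Sketch of the three SCFF components for a given total budget (illustrative only)
def scff_allocation(total_budget):
    base    = 0.60 * total_budget  # base allocation, largely enrollment-driven
    success = 0.20 * total_budget  # tied to success measures, incl. completion rate
    grants  = 0.20 * total_budget  # tied to the number of students receiving grants
    return base, success, grants

scff_allocation(100_000_000)  # -> (60000000.0, 20000000.0, 20000000.0)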

College administrations have always been interested in monitoring student outcomes by measuring their completion rate, but now more than ever this measure seems crucial. Administrations also need to predict completion rates so they can account for them in planning and budgeting. Therefore, I propose the following project: building a predictive model to forecast the completion rate.

For the proposed project I use a publicly available dataset covering nearly all degree-granting institutions in the US, with a focus on completion rates within community colleges. The dataset reports student completion broken down by factors such as race, state, gender, and year.

1.2 The Data

1.2.1 Source

The primary source of data for this work is the College Scorecard published by the US Department of Education. College Scorecard provides data at the institution level and by field of study. The technical data documents provide in-depth information about these data, and the data dictionary includes separate worksheets with institution-level and field-of-study dictionaries as well as cohort maps describing the timing of each data element.

The link to the publicly available dataset is here.

The dataset contains 3 GB of data ranging from 1997 to June 2020 and reports approximately 1,900 features per institution. The features relevant to this work were extracted and are summarized below.

1.3 Data Overview

Here is the information on the data set:

Feature         Description
UGDS            Total Undergraduate Enrollment
UGDS_WHITE      Ratio of White Students
UGDS_BLACK      Ratio of African-American Students
UGDS_HISP       Ratio of Hispanic Students
UGDS_OTH        Ratio of Other Students
UGDS_MEN        Ratio of Male Students
UGDS_WOMEN      Ratio of Female Students
TUITIONFEE_IN   Tuition Fee for In-State Students
TUITIONFEE_OUT  Tuition Fee for Out-of-State Students
AVGFACSAL       Average Faculty Salary
INC_PCT_LO      Ratio of Low-Income Families
INC_PCT_M       Ratio of Middle-Income Families
INC_PCT_H       Ratio of High-Income Families
DEBT_MDN        Median Loan Debt Accumulated at the Institution
FTFTPCTFLOAN    Share of Students Who Received Federal Loans

1.3.1 The institutions

College Completion examines data and trends at 3,800 degree-granting institutions in the United States that reported a first-time, full-time degree-seeking undergraduate cohort, had a total of at least 500 students at the undergraduate level, and awarded undergraduate degrees between 2010 and 2017.

1.3.2 Graduation rates

Graduation data from the National Center for Education Statistics' Integrated Postsecondary Education Data System (IPEDS) is limited to tracking completions for groups of first-time, full-time degree-seeking students at the undergraduate level.

1.3.3 Race and ethnicity

Until 2009, the NCES classified students in seven ways: White, non-Hispanic; Black, non-Hispanic; Hispanic; American Indian/Alaskan Native; Asian/Pacific Islander; unknown race or ethnicity; and nonresident. In addition to creating a stronger separation between race and ethnicity categories, two new race categories were later added: Native Hawaiian or Other Pacific Islander (previously combined with Asian students) and students who belong to two or more races.

1.3.4 Efficiency measures

“Awards per 100 full-time undergraduate students” includes all undergraduate-level completions reported by the institution to the NCES: associate degrees and certificate programs of less than four years in length. Full-time-equivalent undergraduates are estimated from the number of credit hours taken at the institution in an academic year. To account for changes in enrollment, the resulting metric is a seven-year average of data from 2011 to 2017.
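A rough sketch of the FTE estimate described above; the divisor of 30 credit hours per year is my assumption based on a common IPEDS convention for semester-based undergraduates, not something stated in the source:

# Full-time-equivalent undergraduates from total annual credit hours
# (full_time_load=30 is an assumed annual full-time credit load)
def fte_undergrad(annual_credit_hours, full_time_load=30):
    return annual_credit_hours / full_time_load

fte_undergrad(150_000)  # -> 5000.0 FTE students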

1.4 Exploratory Data Analysis

First, let's import the required libraries: NumPy, Pandas, Matplotlib, and Seaborn. We then take a look at the data and compute key statistics on all numeric columns. Early on, I use pair plots to spot correlations and trends and get a first feel for the data. Next, I explore the relationship between graduation rate, race, and gender; a heatmap is a good fit here and yields a readable graph. I also rank all states by graduation rate nationwide, for which bar graphs are the right choice, and finally I produce an interactive map of the US showing each state's graduation rate using the Chart Studio package.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import pickle 

import os

import warnings
warnings.filterwarnings('ignore')

College Scorecard provides dozens of CSV files covering 1997 to June 2020. Let's open one of the more recent files.

In [2]:
datadir = os.getcwd() + "/rawdata"
In [3]:
COL_1819 = pd.read_csv(datadir + '/MERGED2018_19_PP.csv')
In [4]:
COL_1819.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6806 entries, 0 to 6805
Columns: 1986 entries, UNITID to SCUGFFN_POOLED
dtypes: float64(1902), int64(14), object(70)
memory usage: 103.1+ MB

There are 6,806 entries (degree-granting institutions) for the 2018-2019 academic year and 1,986 columns in this one file.

Note that memory usage for this file is 103+ MB.
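When a raw file is this wide, it helps to peek at the header and a few rows before committing to a full load; a minimal sketch using the same file:

# Cheap preview: parse only the first few rows of the CSV
preview = pd.read_csv(datadir + '/MERGED2018_19_PP.csv', nrows=5)
preview.shape         # (5, 1986)
preview.columns[:10]  # first few column names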

In [5]:
COL_1819.head()
Out[5]:
UNITID OPEID OPEID6 INSTNM CITY STABBR ZIP ACCREDAGENCY INSTURL NPCURL ... OMENRUP_PARTTIME_POOLED_SUPP FTFTPCTPELL FTFTPCTFLOAN UG12MN G12MN SCUGFFN POOLYRS_FTFTAIDPCT FTFTPCTPELL_POOLED_SUPP FTFTPCTFLOAN_POOLED_SUPP SCUGFFN_POOLED
0 100654 100200 1002 Alabama A & M University Normal AL 35762 Southern Association of Colleges and Schools C... www.aamu.edu/ www.aamu.edu/admissions-aid/tuition-fees/net-p... ... 0.3193 0.7057 0.7143 5343.0 1165.0 1288.0 2.0 0.7083 0.7287 2698.0
1 100663 105200 1052 University of Alabama at Birmingham Birmingham AL 35294-0110 Southern Association of Colleges and Schools C... https://www.uab.edu https://uab.studentaidcalculator.com/survey.aspx ... 0.2475 0.3788 0.5350 14445.0 10498.0 2228.0 2.0 0.3891 0.5414 4176.0
2 100690 2503400 25034 Amridge University Montgomery AL 36117-3553 Southern Association of Colleges and Schools C... www.amridgeuniversity.edu www2.amridgeuniversity.edu:9091/ ... 0.2836 1.0000 1.0000 440.0 527.0 5.0 4.0 PrivacySuppressed PrivacySuppressed 24.0
3 100706 105500 1055 University of Alabama in Huntsville Huntsville AL 35899 Southern Association of Colleges and Schools C... www.uah.edu finaid.uah.edu/ ... 0.2496 0.2707 0.4556 8145.0 2443.0 1341.0 2.0 0.2378 0.4402 2544.0
4 100724 100500 1005 Alabama State University Montgomery AL 36104-0271 Southern Association of Colleges and Schools C... www.alasu.edu www.alasu.edu/cost-aid/tuition-costs/net-price... ... 0.2621 0.7792 0.7539 4732.0 642.0 951.0 2.0 0.7684 0.7464 2094.0

5 rows × 1986 columns

There are missing values tagged with PrivacySuppressed.

I would like to deal with the missing values, so as a first pass I perform a hard "NA drop", removing every column that contains any NaN.

In [6]:
COL_1819_NoNA = COL_1819.dropna(axis=1)
COL_1819_NoNA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6806 entries, 0 to 6805
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   UNITID     6806 non-null   int64 
 1   OPEID      6806 non-null   int64 
 2   OPEID6     6806 non-null   int64 
 3   INSTNM     6806 non-null   object
 4   CITY       6806 non-null   object
 5   STABBR     6806 non-null   object
 6   ZIP        6806 non-null   object
 7   HCM2       6806 non-null   int64 
 8   MAIN       6806 non-null   int64 
 9   NUMBRANCH  6806 non-null   int64 
 10  PREDDEG    6806 non-null   int64 
 11  HIGHDEG    6806 non-null   int64 
 12  CONTROL    6806 non-null   int64 
 13  ST_FIPS    6806 non-null   int64 
 14  REGION     6806 non-null   int64 
 15  CURROPER   6806 non-null   int64 
 16  ICLEVEL    6806 non-null   int64 
 17  OPEFLAG    6806 non-null   int64 
dtypes: int64(14), object(4)
memory usage: 957.2+ KB

It seems like only 18 columns survived the hard drop. Memory usage went down from 103+ MB to less than 1 MB.

Dropping every column that has missing values is too drastic to be optimal. Therefore, I read the data documentation and keep only the columns that I need for my analysis.

I select the following variables from the Student Body, Admission, Institution, and Financial categories, and store all variables of interest in vars_interest.

In [7]:
studentBodyVars = ['UGDS', 'UGDS_MEN', 'UGDS_WOMEN', 'UGDS_WHITE', 
                   'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN']
admission = ['ADM_RATE_ALL', 'ACTCMMID', 'SATVRMID', 'SATMTMID', 'SATWRMID']
completion = ['C100_4', 'C100_L4', 'C150_4', 'C150_L4']
institution = ['AVGFACSAL', 'TUITIONFEE_OUT', 'TUITIONFEE_IN']
financial = ['FTFTPCTFLOAN', 'DEBT_MDN', 'INC_PCT_LO', 'INC_PCT_M1', 'INC_PCT_M2', 'INC_PCT_H1', 'INC_PCT_H2']
vars_interest = studentBodyVars + admission + completion + institution + financial # Include these vars

I prefer working with floating-point numbers, so I convert the integer columns to float. However, UNITID is the institution's unique ID, so I convert that one back to integer.

In [8]:
col_dtypes = dict(COL_1819_NoNA.dtypes.replace(np.dtype('int64'),np.dtype('float64'))) # make the dtypes floats
In [9]:
col_dtypes['UNITID'] = np.dtype('int64') # convert the UNITID back to int

I also want these dtype changes applied to the columns in vars_interest:

In [10]:
col_dtypes.update({a: np.dtype('float64') for a in vars_interest}) # make them floats

I now read the CSV file again, but this time I select and import only the columns in col_dtypes (the surviving columns plus vars_interest). Moreover, missing values appear as PrivacySuppressed in the CSV, so I make sure Pandas knows about that as well. The import runs much faster this way.

In [11]:
COL_1819 = pd.read_csv(datadir + '/MERGED2018_19_PP.csv',na_values='PrivacySuppressed',
                            usecols=col_dtypes.keys())
COL_1819.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6806 entries, 0 to 6805
Data columns (total 49 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   UNITID          6806 non-null   int64  
 1   OPEID           6806 non-null   int64  
 2   OPEID6          6806 non-null   int64  
 3   INSTNM          6806 non-null   object 
 4   CITY            6806 non-null   object 
 5   STABBR          6806 non-null   object 
 6   ZIP             6806 non-null   object 
 7   HCM2            6806 non-null   int64  
 8   MAIN            6806 non-null   int64  
 9   NUMBRANCH       6806 non-null   int64  
 10  PREDDEG         6806 non-null   int64  
 11  HIGHDEG         6806 non-null   int64  
 12  CONTROL         6806 non-null   int64  
 13  ST_FIPS         6806 non-null   int64  
 14  REGION          6806 non-null   int64  
 15  ADM_RATE_ALL    2242 non-null   float64
 16  SATVRMID        1239 non-null   float64
 17  SATMTMID        1242 non-null   float64
 18  SATWRMID        0 non-null      float64
 19  ACTCMMID        1273 non-null   float64
 20  UGDS            6041 non-null   float64
 21  UGDS_WHITE      6041 non-null   float64
 22  UGDS_BLACK      6041 non-null   float64
 23  UGDS_HISP       6041 non-null   float64
 24  UGDS_ASIAN      6041 non-null   float64
 25  UGDS_AIAN       6041 non-null   float64
 26  UGDS_NHPI       6041 non-null   float64
 27  UGDS_2MOR       6041 non-null   float64
 28  UGDS_NRA        6041 non-null   float64
 29  UGDS_UNKN       6041 non-null   float64
 30  CURROPER        6806 non-null   int64  
 31  TUITIONFEE_IN   3865 non-null   float64
 32  TUITIONFEE_OUT  3621 non-null   float64
 33  AVGFACSAL       4012 non-null   float64
 34  C150_4          2300 non-null   float64
 35  C150_L4         3319 non-null   float64
 36  INC_PCT_LO      0 non-null      float64
 37  INC_PCT_M1      0 non-null      float64
 38  INC_PCT_M2      0 non-null      float64
 39  INC_PCT_H1      0 non-null      float64
 40  INC_PCT_H2      0 non-null      float64
 41  DEBT_MDN        0 non-null      float64
 42  C100_4          2051 non-null   float64
 43  C100_L4         3317 non-null   float64
 44  ICLEVEL         6806 non-null   int64  
 45  UGDS_MEN        6041 non-null   float64
 46  UGDS_WOMEN      6041 non-null   float64
 47  OPEFLAG         6806 non-null   int64  
 48  FTFTPCTFLOAN    5740 non-null   float64
dtypes: float64(31), int64(14), object(4)
memory usage: 2.5+ MB
In [12]:
COL_1819.head()
Out[12]:
UNITID OPEID OPEID6 INSTNM CITY STABBR ZIP HCM2 MAIN NUMBRANCH ... INC_PCT_H1 INC_PCT_H2 DEBT_MDN C100_4 C100_L4 ICLEVEL UGDS_MEN UGDS_WOMEN OPEFLAG FTFTPCTFLOAN
0 100654 100200 1002 Alabama A & M University Normal AL 35762 0 1 1 ... NaN NaN NaN 0.0556 NaN 1 0.4076 0.5924 1 0.7143
1 100663 105200 1052 University of Alabama at Birmingham Birmingham AL 35294-0110 0 1 1 ... NaN NaN NaN 0.3462 NaN 1 0.3930 0.6070 1 0.5350
2 100690 2503400 25034 Amridge University Montgomery AL 36117-3553 0 1 1 ... NaN NaN NaN 0.4000 NaN 1 0.3191 0.6809 1 1.0000
3 100706 105500 1055 University of Alabama in Huntsville Huntsville AL 35899 0 1 1 ... NaN NaN NaN 0.2195 NaN 1 0.5796 0.4204 1 0.4556
4 100724 100500 1005 Alabama State University Montgomery AL 36104-0271 0 1 1 ... NaN NaN NaN 0.0975 NaN 1 0.3710 0.6290 1 0.7539

5 rows × 49 columns

The index is a plain integer range. I would like to change it to a (Institution, Year) MultiIndex.

I write a function that opens each annual CSV from 1997 onward, extracts the columns of interest, and tags each row with its year so the index can be rebuilt as we go.

In [13]:
def read_cs_data(year,col_dtypes, datadir):
    """read a CollegeScorecard dataframe"""
    nextyr = str(int(year) + 1)[-2:]
    filename = datadir + '/MERGED{}_{}_PP.csv'.format(year,nextyr)
    col = pd.read_csv(filename,na_values='PrivacySuppressed',
                      usecols=col_dtypes.keys())
    col['Year'] = pd.Period(str(int(year) + 1),freq='Y')
    return col
In [14]:
col_all = pd.concat((read_cs_data(str(y),col_dtypes,datadir) for y in range(1997,2019)))
col_all = col_all.set_index(['UNITID','Year'])
In [15]:
col_all.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 156324 entries, (100636, Period('1998', 'A-DEC')) to (49146401, Period('2019', 'A-DEC'))
Data columns (total 48 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   OPEID           156324 non-null  object 
 1   OPEID6          156324 non-null  int64  
 2   INSTNM          156324 non-null  object 
 3   CITY            156324 non-null  object 
 4   STABBR          156324 non-null  object 
 5   ZIP             156324 non-null  object 
 6   HCM2            6806 non-null    float64
 7   MAIN            156324 non-null  int64  
 8   NUMBRANCH       156324 non-null  int64  
 9   PREDDEG         156324 non-null  int64  
 10  HIGHDEG         156324 non-null  int64  
 11  CONTROL         156295 non-null  float64
 12  ST_FIPS         156323 non-null  float64
 13  REGION          156323 non-null  float64
 14  ADM_RATE_ALL    49236 non-null   float64
 15  SATVRMID        22303 non-null   float64
 16  SATMTMID        22440 non-null   float64
 17  SATWRMID        7036 non-null    float64
 18  ACTCMMID        22325 non-null   float64
 19  UGDS            138256 non-null  float64
 20  UGDS_WHITE      74340 non-null   float64
 21  UGDS_BLACK      74340 non-null   float64
 22  UGDS_HISP       74340 non-null   float64
 23  UGDS_ASIAN      74340 non-null   float64
 24  UGDS_AIAN       74340 non-null   float64
 25  UGDS_NHPI       74340 non-null   float64
 26  UGDS_2MOR       74340 non-null   float64
 27  UGDS_NRA        138253 non-null  float64
 28  UGDS_UNKN       138253 non-null  float64
 29  CURROPER        6806 non-null    float64
 30  TUITIONFEE_IN   79194 non-null   float64
 31  TUITIONFEE_OUT  76907 non-null   float64
 32  AVGFACSAL       89554 non-null   float64
 33  C150_4          47185 non-null   float64
 34  C150_L4         80324 non-null   float64
 35  INC_PCT_LO      128111 non-null  float64
 36  INC_PCT_M1      113763 non-null  float64
 37  INC_PCT_M2      101961 non-null  float64
 38  INC_PCT_H1      87150 non-null   float64
 39  INC_PCT_H2      86156 non-null   float64
 40  DEBT_MDN        121680 non-null  float64
 41  C100_4          41189 non-null   float64
 42  C100_L4         37820 non-null   float64
 43  ICLEVEL         156295 non-null  float64
 44  UGDS_MEN        138253 non-null  float64
 45  UGDS_WOMEN      138253 non-null  float64
 46  OPEFLAG         156324 non-null  int64  
 47  FTFTPCTFLOAN    71054 non-null   float64
dtypes: float64(37), int64(6), object(5)
memory usage: 57.8+ MB

Let's get some key statistics from the dataset so far.

In [16]:
col_all.describe()
Out[16]:
OPEID6 HCM2 MAIN NUMBRANCH PREDDEG HIGHDEG CONTROL ST_FIPS REGION ADM_RATE_ALL ... INC_PCT_H1 INC_PCT_H2 DEBT_MDN C100_4 C100_L4 ICLEVEL UGDS_MEN UGDS_WOMEN OPEFLAG FTFTPCTFLOAN
count 156324.000000 6806.000000 156324.000000 156324.000000 156324.000000 156324.000000 156295.000000 156323.000000 156323.000000 49236.000000 ... 87150.000000 86156.000000 121680.000000 41189.000000 37820.000000 156295.000000 138253.000000 138253.000000 156324.000000 71054.000000
mean 14658.900425 0.008816 0.794536 4.175622 1.880396 2.210787 2.123069 29.217953 4.558203 0.694558 ... 0.108304 0.109647 7980.543331 0.322554 0.368407 1.861205 0.347235 0.646777 1.079610 0.559860
std 12582.664300 0.093484 0.404042 13.669866 1.000990 1.286281 0.838157 16.601570 2.188611 0.210672 ... 0.065278 0.107462 4595.143219 0.237243 0.296286 0.813569 0.237899 0.241641 0.818824 0.321981
min 1002.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 ... 0.003357 0.000915 310.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000
25% 3262.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 13.000000 3.000000 0.562400 ... 0.050814 0.028698 4500.000000 0.134900 0.099700 1.000000 0.139400 0.530500 1.000000 0.307700
50% 9863.000000 0.000000 1.000000 1.000000 2.000000 2.000000 2.000000 29.000000 5.000000 0.718900 ... 0.098608 0.068243 6625.000000 0.284800 0.300000 2.000000 0.364400 0.632200 1.000000 0.647100
75% 23482.000000 0.000000 1.000000 2.000000 3.000000 4.000000 3.000000 42.000000 6.000000 0.849500 ... 0.158234 0.159363 10108.000000 0.482200 0.617600 3.000000 0.466700 0.855700 1.000000 0.826100
max 99999.000000 1.000000 1.000000 143.000000 4.000000 4.000000 3.000000 78.000000 9.000000 1.000000 ... 0.476190 0.705882 46547.000000 1.000000 1.000000 3.000000 1.000000 1.000000 19.000000 1.000000

8 rows × 43 columns

For this particular work I would like to focus on community colleges, so I select the subset of institutions whose highest degree granted is an associate degree (HIGHDEG == 2).

In [17]:
df = col_all[col_all['HIGHDEG']==2].copy()  # copy so later feature engineering does not trigger SettingWithCopyWarning
In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 37034 entries, (100636, Period('1998', 'A-DEC')) to (492476, Period('2019', 'A-DEC'))
Data columns (total 48 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   OPEID           37034 non-null  object 
 1   OPEID6          37034 non-null  int64  
 2   INSTNM          37034 non-null  object 
 3   CITY            37034 non-null  object 
 4   STABBR          37034 non-null  object 
 5   ZIP             37034 non-null  object 
 6   HCM2            1379 non-null   float64
 7   MAIN            37034 non-null  int64  
 8   NUMBRANCH       37034 non-null  int64  
 9   PREDDEG         37034 non-null  int64  
 10  HIGHDEG         37034 non-null  int64  
 11  CONTROL         37034 non-null  float64
 12  ST_FIPS         37034 non-null  float64
 13  REGION          37034 non-null  float64
 14  ADM_RATE_ALL    5709 non-null   float64
 15  SATVRMID        458 non-null    float64
 16  SATMTMID        465 non-null    float64
 17  SATWRMID        81 non-null     float64
 18  ACTCMMID        520 non-null    float64
 19  UGDS            35321 non-null  float64
 20  UGDS_WHITE      18145 non-null  float64
 21  UGDS_BLACK      18145 non-null  float64
 22  UGDS_HISP       18145 non-null  float64
 23  UGDS_ASIAN      18145 non-null  float64
 24  UGDS_AIAN       18145 non-null  float64
 25  UGDS_NHPI       18145 non-null  float64
 26  UGDS_2MOR       18145 non-null  float64
 27  UGDS_NRA        35321 non-null  float64
 28  UGDS_UNKN       35321 non-null  float64
 29  CURROPER        1379 non-null   float64
 30  TUITIONFEE_IN   27561 non-null  float64
 31  TUITIONFEE_OUT  27565 non-null  float64
 32  AVGFACSAL       33999 non-null  float64
 33  C150_4          1133 non-null   float64
 34  C150_L4         33700 non-null  float64
 35  INC_PCT_LO      33795 non-null  float64
 36  INC_PCT_M1      32992 non-null  float64
 37  INC_PCT_M2      29896 non-null  float64
 38  INC_PCT_H1      24424 non-null  float64
 39  INC_PCT_H2      24294 non-null  float64
 40  DEBT_MDN        31455 non-null  float64
 41  C100_4          228 non-null    float64
 42  C100_L4         15300 non-null  float64
 43  ICLEVEL         37034 non-null  float64
 44  UGDS_MEN        35321 non-null  float64
 45  UGDS_WOMEN      35321 non-null  float64
 46  OPEFLAG         37034 non-null  int64  
 47  FTFTPCTFLOAN    17958 non-null  float64
dtypes: float64(37), int64(6), object(5)
memory usage: 13.8+ MB

1.4.1 Visualizing Data

1.4.1.1 Graduation distribution plot

Let's look at the distribution of graduation rates. We use the Seaborn distribution plot for illustration. C150_L4, the completion rate within 150% of the expected time at less-than-four-year institutions, captures the students who eventually graduate, so we pick it as our indicator. Later, we group the DataFrame by state, compute the average C150_L4 for each state, and sort the states in descending order.

In [19]:
df['C150_L4'].describe()
Out[19]:
count    33700.000000
mean         0.361617
std          0.230463
min          0.000000
25%          0.182500
50%          0.301000
75%          0.513000
max          1.000000
Name: C150_L4, dtype: float64
In [20]:
f, ax = plt.subplots(1, 1)

sns.distplot(df['C100_L4'], label="Within 100% of Expected Time")
sns.distplot(df['C150_L4'], label="Within 150% of Expected Time")

ax.legend()
f.set_size_inches(11.7, 8.27)
ax.set(xlabel='Graduation Rate', ylabel='Density')  # distplot's y-axis is a density
plt.show()

1.4.1.2 Bar plot of graduation rate by state

One way to visualize the performance of students nationwide is to plot a bar graph for each state. Comparing graduation within states by gender would also be useful, though the columns retained here describe the gender composition of the student body rather than gender-specific completion rates.

For this we use a pandas pivot table, picking STABBR as the index; the values populated are the average graduation rate within 150% of the expected time (C150_L4).

In [21]:
dfs = col_all.pivot_table(index="STABBR", values='C150_L4', aggfunc='mean', fill_value=0)
In [22]:
dfs=dfs.sort_values(by='C150_L4',ascending=False)
dfs.plot(kind='bar',figsize=(20,6))
plt.ylabel('Graduate Rate')
plt.xlabel('States')
plt.title('States Graduation Rate')
Out[22]:
Text(0.5, 1.0, 'States Graduation Rate')
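Note that this unweighted mean treats a 500-student college the same as a 30,000-student one. An enrollment-weighted alternative is sketched below (weighting by UGDS is my assumption, not part of the original analysis):

# Enrollment-weighted average graduation rate per state (sketch)
w = col_all.dropna(subset=['C150_L4', 'UGDS'])
w = w[w['UGDS'] > 0]  # avoid zero total weight within a state
weighted = w.groupby('STABBR').apply(
    lambda g: np.average(g['C150_L4'], weights=g['UGDS']))
weighted.sort_values(ascending=False).head()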

1.4.2 Creating an interactive map of nationwide graduation rate

A choropleth map is a type of thematic map in which areas are shaded or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density. In the same vein as the previous graph, let's look at the geographic distribution of graduation rates across the United States.

Note: in addition to Pandas and NumPy, the chart-studio package (and plotly) must be installed.

In [23]:
import chart_studio.plotly as py
import plotly.graph_objs as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

Now we need to set up everything so that the figures show up in the notebook:

In [24]:
init_notebook_mode(connected=True) 

Now we need to build the data dictionary. The easiest way to do this is with the dict() function, in the general form:

In [25]:
#Building data dictionary

data = dict(type = 'choropleth',
            locations = dfs.index,
            locationmode = 'USA-states',
            colorscale= 'Greens',
            #text= ['text1','text2','text3'],
            z=dfs['C150_L4'],
            colorbar = {'title':'Graduate Rate'})

Then we create the layout nested dictionary:

In [26]:
#Building nested dictionary

layout = dict(geo = {'scope':'usa'})

Then we call go.Figure(data = [data], layout = layout) to set up the figure object that finally gets passed into iplot():

In [27]:
#Set up object to be passed into iplot()

choromap = go.Figure(data = [data],layout = layout)
In [28]:
#Plotting the interactive map of US with our graduation rate data

iplot(choromap)

1.5 Modeling and Prediction

The college completion rate (that is, the graduation rate of students) at 3,800 degree-granting institutions between 2003 and 2013 was studied. Racial and gender factors were considered to establish a measure of student success (completion). The results indicate that average completion rates for colleges fall between 30 and 50 percent. California ranks at the top with a 50% graduation rate, while Alaska sits at the bottom with 28%. Asian and White students have the highest completion rates, while African-American students have the lowest. Female students graduate at higher rates across all races, and they outpace male students consistently in every state. These results are useful for future studies of college graduation rates, especially for California colleges, where the funding formula depends directly on completion rates and racial diversity. The next step is to create a predictive model to forecast completion rates for coming academic years; such a model could potentially predict racial diversity each year as well. However, a robust model requires more data: the data must be more recent, covering the years past 2013 up to 2020, and must include new features such as the teacher-to-student ratio.

1.5.1 Previous Work

The next step presents the results of an analysis conducted to gauge the capability of machine learning algorithms to predict on-time graduation while taking students' learning and development into consideration. Picking such a model is crucial for college administration planning and budgeting. Anderson et al. [3] trained a set of four binary classifiers to predict FTIC students' graduation. Based on the F1 and AUC scores, the SGD classifier and logistic regression performed best, while linear SVM and decision tree performed slightly worse. Pang et al. [4] predicted students' graduation with 72% accuracy and precision/recall around 62% using an ensemble support vector machine algorithm. Kesumawati et al. [5] applied a Naïve Bayes classifier and SVM to forecast student graduation rates; both algorithms yielded 69% accuracy. Raju, in his dissertation [6], applied regression models (forward, backward, and stepwise selection), a neural network, and a decision tree: the neural network performed best at predicting graduation rates, the regression methods were close contenders at 77% accuracy, and the decision tree performed worst.

From the brief literature review performed above, it seems that either neural networks or regression models are potential candidates for our predictive model.
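Before tuning anything heavier, it is worth establishing a cheap baseline to beat; a minimal sketch, assuming the scaled features scX and target y constructed in section 1.5.3:

# Linear-regression baseline: cross-validated R^2 as the floor to clear
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

baseline_scores = cross_val_score(LinearRegression(), scX, np.ravel(y),
                                  cv=5, scoring='r2')
baseline_scores.mean()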

1.5.2 Feature Engineering

1.5.2.1 Combining dependent features

Predictions can benefit from engineering features before feeding them to a predictive model. The shares of the various races in the student body are very uneven: white students constitute more than 50% of the student body, African-American students 14%, and Hispanic students 13%, while all remaining groups, including non-resident and unknown, together constitute about 20%. A new variable called UGDS_OTH is therefore created to combine all those other groups. Moreover, family income is broken down into middle 1 (INC_PCT_M1) and middle 2 (INC_PCT_M2); both variables are combined into a new variable called INC_PCT_M. Similarly, high income 1 (INC_PCT_H1) and high income 2 (INC_PCT_H2) are summed into INC_PCT_H.

In [29]:
df['UGDS_OTH'] = df['UGDS_ASIAN']+df['UGDS_AIAN']+df['UGDS_NHPI']+df['UGDS_2MOR']+df['UGDS_NRA']+df['UGDS_UNKN']
df['INC_PCT_M'] = df['INC_PCT_M1']+df['INC_PCT_M2']
df['INC_PCT_H'] = df['INC_PCT_H1']+df['INC_PCT_H2']
df.describe()
Out[29]:
OPEID6 HCM2 MAIN NUMBRANCH PREDDEG HIGHDEG CONTROL ST_FIPS REGION ADM_RATE_ALL ... C100_4 C100_L4 ICLEVEL UGDS_MEN UGDS_WOMEN OPEFLAG FTFTPCTFLOAN UGDS_OTH INC_PCT_M INC_PCT_H
count 37034.000000 1379.000000 37034.000000 37034.000000 37034.000000 37034.0 37034.000000 37034.000000 37034.000000 5709.000000 ... 228.000000 15300.000000 37034.000000 35321.000000 35321.000000 37034.000000 17958.000000 18145.000000 29726.000000 24257.000000
mean 11560.758276 0.002901 0.815440 3.071421 1.690528 2.0 1.710914 28.424475 4.761192 0.748007 ... 0.158577 0.251402 1.955473 0.388406 0.610490 1.033537 0.444072 0.127930 0.280360 0.119162
std 10550.043001 0.053799 0.387946 10.910958 0.462282 0.0 0.919153 16.435467 2.076091 0.211315 ... 0.187935 0.228020 0.206265 0.184164 0.184829 0.504519 0.340792 0.159417 0.067639 0.084220
min 1007.000000 0.000000 0.000000 1.000000 1.000000 2.0 1.000000 1.000000 0.000000 0.000000 ... 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.057143 0.005703
25% 3273.000000 0.000000 1.000000 1.000000 1.000000 2.0 1.000000 13.000000 3.000000 0.615100 ... 0.055800 0.078400 2.000000 0.289200 0.545800 1.000000 0.114800 0.039300 0.236381 0.059741
50% 7437.000000 0.000000 1.000000 1.000000 2.000000 2.0 1.000000 29.000000 5.000000 0.787500 ... 0.088050 0.166350 2.000000 0.387000 0.612700 1.000000 0.411850 0.080000 0.287616 0.098142
75% 21192.000000 0.000000 1.000000 2.000000 2.000000 2.0 3.000000 42.000000 6.000000 0.919600 ... 0.179150 0.364525 2.000000 0.453700 0.710000 1.000000 0.780400 0.151400 0.325899 0.156573
max 42743.000000 1.000000 1.000000 143.000000 2.000000 2.0 3.000000 72.000000 9.000000 1.000000 ... 1.000000 1.000000 2.000000 1.000000 1.000000 19.000000 1.000000 1.000100 0.685619 0.654676

8 rows × 46 columns

1.5.2.2 Getting rid of unimportant features

The dataframe df contains some features that are not useful for prediction, such as the institution name INSTNM or the institution IDs OPEID and OPEID6. Some other features that seemed useful in the first round of feature selection, such as the admission rate ADM_RATE_ALL or the SAT and ACT scores SATVRMID and ACTCMMID, contain many missing values. These features are removed to improve model performance. Finally, the features that were combined into the new variables must be removed too (e.g. UGDS_AIAN or INC_PCT_M1).

In [30]:
mldf = df.drop(columns=['INSTNM','ADM_RATE_ALL','SATVRMID','SATMTMID','SATWRMID','ACTCMMID','HCM2','CURROPER',
                        'C150_4','C100_4','OPEID','OPEID6','CITY','STABBR','ZIP','MAIN','NUMBRANCH',
                        'PREDDEG','HIGHDEG','CONTROL','ST_FIPS','REGION','ICLEVEL','OPEFLAG', 'UGDS_AIAN', 'UGDS_ASIAN',
                        'UGDS_NHPI','UGDS_2MOR','UGDS_NRA','UGDS_UNKN','INC_PCT_M1','INC_PCT_M2','INC_PCT_H1',
                        'INC_PCT_H2'], axis = 1)
mldf.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 37034 entries, (100636, Period('1998', 'A-DEC')) to (492476, Period('2019', 'A-DEC'))
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   UGDS            35321 non-null  float64
 1   UGDS_WHITE      18145 non-null  float64
 2   UGDS_BLACK      18145 non-null  float64
 3   UGDS_HISP       18145 non-null  float64
 4   TUITIONFEE_IN   27561 non-null  float64
 5   TUITIONFEE_OUT  27565 non-null  float64
 6   AVGFACSAL       33999 non-null  float64
 7   C150_L4         33700 non-null  float64
 8   INC_PCT_LO      33795 non-null  float64
 9   DEBT_MDN        31455 non-null  float64
 10  C100_L4         15300 non-null  float64
 11  UGDS_MEN        35321 non-null  float64
 12  UGDS_WOMEN      35321 non-null  float64
 13  FTFTPCTFLOAN    17958 non-null  float64
 14  UGDS_OTH        18145 non-null  float64
 15  INC_PCT_M       29726 non-null  float64
 16  INC_PCT_H       24257 non-null  float64
dtypes: float64(17)
memory usage: 5.0 MB

1.5.2.3 Eliminating outlier cases

The average population of a community college is close to 6,000, and the typical graduation rate sits well below 0.8 (the mean of C150_L4 is about 0.36). However, there are rare institutions with very small populations or very high graduation rates; these mostly offer certificates and few or no associate degree programs. Therefore, these institutions are removed from the dataframe with a simple query that keeps observations with more than 500 students and graduation rates below 0.8.

In [31]:
dfn = mldf.query('C150_L4<0.8 & UGDS>500')
In [32]:
dfn.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 22713 entries, (100760, Period('1998', 'A-DEC')) to (489201, Period('2019', 'A-DEC'))
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   UGDS            22713 non-null  float64
 1   UGDS_WHITE      11945 non-null  float64
 2   UGDS_BLACK      11945 non-null  float64
 3   UGDS_HISP       11945 non-null  float64
 4   TUITIONFEE_IN   18598 non-null  float64
 5   TUITIONFEE_OUT  18601 non-null  float64
 6   AVGFACSAL       22555 non-null  float64
 7   C150_L4         22713 non-null  float64
 8   INC_PCT_LO      20810 non-null  float64
 9   DEBT_MDN        19184 non-null  float64
 10  C100_L4         10764 non-null  float64
 11  UGDS_MEN        22713 non-null  float64
 12  UGDS_WOMEN      22713 non-null  float64
 13  FTFTPCTFLOAN    11943 non-null  float64
 14  UGDS_OTH        11945 non-null  float64
 15  INC_PCT_M       18923 non-null  float64
 16  INC_PCT_H       15831 non-null  float64
dtypes: float64(17)
memory usage: 3.1 MB

The dataframe is reduced to 22,713 observations. There are, however, still plenty of missing values, most of them suppressed for privacy reasons. We now perform a hard NA drop on the rows of the remaining dataframe.

In [33]:
dfn.dropna(axis = 0, inplace = True)

Let's rearrange the dataframe to put related features side by side; the row-wise drop leaves 6,557 complete observations.

In [34]:
dfn = dfn[['C100_L4','C150_L4','UGDS','UGDS_WHITE','UGDS_BLACK','UGDS_HISP','UGDS_OTH','UGDS_MEN','UGDS_WOMEN','TUITIONFEE_IN','TUITIONFEE_OUT','DEBT_MDN','AVGFACSAL','INC_PCT_LO','INC_PCT_M','INC_PCT_H','FTFTPCTFLOAN']]
dfn.describe()
Out[34]:
C100_L4 C150_L4 UGDS UGDS_WHITE UGDS_BLACK UGDS_HISP UGDS_OTH UGDS_MEN UGDS_WOMEN TUITIONFEE_IN TUITIONFEE_OUT DEBT_MDN AVGFACSAL INC_PCT_LO INC_PCT_M INC_PCT_H FTFTPCTFLOAN
count 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000 6557.000000
mean 0.152103 0.255245 5901.400641 0.533811 0.138853 0.128655 0.128313 0.408782 0.591218 4589.574653 8201.874333 5673.194830 6051.386915 0.612464 0.283733 0.103803 0.363205
std 0.133473 0.144274 6779.965465 0.277290 0.163871 0.159442 0.126030 0.103581 0.103581 4100.616341 3829.373367 2405.592652 1558.671142 0.102663 0.052160 0.065420 0.265619
min 0.000000 0.000000 501.000000 0.000000 0.000000 0.000000 0.000000 0.041800 0.011400 560.000000 658.000000 1634.500000 870.000000 0.197707 0.092247 0.005703 0.000000
25% 0.059300 0.149700 1547.000000 0.335700 0.026800 0.022200 0.053700 0.366700 0.545200 2440.000000 5358.000000 4000.000000 4945.000000 0.540812 0.254188 0.053779 0.154600
50% 0.112700 0.226200 3603.000000 0.593800 0.079600 0.061500 0.090400 0.411800 0.588200 3434.000000 7436.000000 5229.000000 5883.000000 0.616013 0.290870 0.092926 0.296800
75% 0.197600 0.318000 7679.000000 0.762100 0.190300 0.179500 0.159700 0.454800 0.633300 4464.000000 10358.000000 6332.000000 6981.000000 0.685677 0.319990 0.140475 0.539300
max 0.792600 0.797300 91112.000000 0.980700 0.978700 1.000000 1.000000 0.988600 0.958200 39750.000000 39750.000000 20000.000000 12613.000000 0.890584 0.452931 0.525550 1.000000

1.5.2.4 Studying the possible correlation among features

Let's compute the pairwise correlation of the columns using dfn.corr(). A Seaborn heatmap shows the contrasts clearly; we use the "RdBu" colormap to distinguish positive correlations from negative (inverse) ones.

In [35]:
plt.figure(figsize=(18,10))
sns.set_context('paper',font_scale=1.5)
#cmap = sns.cubehelix_palette(light=1, as_cmap=True)
#sns.heatmap(bf.corr(), cmap = cmap, linecolor='white', linewidths = 3, annot = True, fmt='.1g')
sns.heatmap(dfn.corr(),  cmap="RdBu", linecolor='white', linewidths = 3, annot = True, fmt='.1g')
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1effa68a348>

The top two rows show the correlation of the graduation rate (within 100 and 150 percent of expected time) with the other columns. Graduation rate has a strong inverse correlation with college population and average faculty salary, and a positive correlation with tuition fees and median student debt. However, correlations alone do not settle which features matter; we defer judgment until the modeling stage.
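To read exact numbers rather than eyeballing the heatmap, we can pull the target's correlation column directly; a one-line sketch:

# Correlations of C150_L4 with every other feature, sorted
dfn.corr()['C150_L4'].drop(['C150_L4', 'C100_L4']).sort_values()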

1.5.3 Scaling

First, let's make the X and y matrices. X contains all the features except the graduation rates; y contains the 150%-time graduation rate, C150_L4.

In [36]:
y = dfn[['C150_L4']]
X = dfn.drop(columns=['C150_L4', 'C100_L4'], axis = 1)
In [37]:
from sklearn.preprocessing import QuantileTransformer, MinMaxScaler

Scaling is necessary because feature values range from fractions of one to tens of thousands, while graduation rates are confined between 0 and 1. We use MinMaxScaler, which preserves the shape of each feature's distribution while mapping it onto a fixed range.

We store the transformed X in scX. The target y is already confined to [0, 1], so we leave it unscaled (the commented-out line shows the alternative).

In [357]:
#scy = MinMaxScaler(feature_range=(0, 0.8)).fit_transform(y)
scX = MinMaxScaler().fit_transform(X)

1.5.4 Fitting

We use 70% of the data for training and set aside 30% for testing.

In [40]:
from sklearn.model_selection import train_test_split
In [365]:
X_train, X_test, y_train, y_test = train_test_split(scX, y, test_size=0.3)

Researchers in the field have applied various models to predict graduation rates, with reported accuracies of about 70%. In this study we use RandomForestRegressor from sklearn.ensemble.

In [42]:
from sklearn.ensemble import RandomForestRegressor

Additionally, we use RandomizedSearchCV and GridSearchCV from sklearn.model_selection to find good hyperparameters for the model.

In [43]:
from sklearn.model_selection import GridSearchCV
In [270]:
from sklearn.model_selection import RandomizedSearchCV

# Base random-forest estimator for the hyperparameter searches
# (this definition was missing from the original listing)
rf = RandomForestRegressor()

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
In [271]:
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
In [272]:
rf_random.fit(X_train, y_train)
Fitting 3 folds for each of 100 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   26.0s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 12.0min finished
Out[272]:
RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   ccp_alpha=0.0,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   max_samples=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=100,
                                                   n_jobs=None, oob_score=Fals...
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=False, scoring=None, verbose=2)
In [273]:
rf_random.best_params_
Out[273]:
{'n_estimators': 400,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': None,
 'bootstrap': False}
Next, we refine the search with a small grid around the randomized-search optimum using GridSearchCV:

In [275]:
param_grid = {
              'bootstrap'        : [False],
              'max_depth'        : [None, 93],
              'n_estimators'     : [100, 400, 1000, 1500],
              'max_features'     : [4, 8, 12],
              'min_samples_split': [1, 2, 3],  # note: an int value of 1 is invalid in scikit-learn; those candidates fail and score NaN
              'min_samples_leaf' : [1, 2]   
             }
grid = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
In [276]:
grid.fit(X_train,y_train)
Fitting 3 folds for each of 144 candidates, totalling 432 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    8.4s
[Parallel(n_jobs=-1)]: Done 162 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 365 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done 432 out of 432 | elapsed: 10.3min finished
Out[276]:
GridSearchCV(cv=3, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'bootstrap': [False], 'max_depth': [None, 93],
                         'max_features': [4, 8, 12], 'min_samples_leaf': [1, 2],
                         'min_samples_split': [1, 2, 3],
                         'n_estimators': [100, 400, 1000, 1500]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)
In [277]:
grid.best_params_
Out[277]:
{'bootstrap': False,
 'max_depth': None,
 'max_features': 4,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 1500}
Finally, we lock in the best parameters as a single-candidate grid and refit the final model:

In [366]:
param_grid = {
              'bootstrap'        : [False],
              'max_depth'        : [None],
              'n_estimators'     : [1500],
              'max_features'     : [4],
              'min_samples_split': [2],
              'min_samples_leaf' : [1]   
             }
model = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
In [367]:
model.fit(X_train,y_train)
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   23.4s finished
Out[367]:
GridSearchCV(cv=3, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'bootstrap': [False], 'max_depth': [None],
                         'max_features': [4], 'min_samples_leaf': [1],
                         'min_samples_split': [2], 'n_estimators': [1500]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)
In [121]:
model.best_estimator_.feature_importances_  # GridSearchCV exposes the refit forest via best_estimator_
Out[121]:
array([0.12819509, 0.04312008, 0.04848407, 0.0357881 , 0.0358406 ,
       0.06127434, 0.05577367, 0.15924328, 0.06905436, 0.09580903,
       0.05623485, 0.0447625 , 0.05181128, 0.0375561 , 0.07705266])
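The raw array is hard to read on its own; pairing the importances with the column names of X makes it interpretable. A short sketch using the refit best estimator:

# Feature importances labeled with feature names, largest first
importances = pd.Series(model.best_estimator_.feature_importances_,
                        index=X.columns)
importances.sort_values(ascending=False)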

1.5.5 Predicting and Evaluating

Let's import mean_squared_error, mean_absolute_error, explained_variance_score from sklearn.metrics.

In [56]:
from sklearn.metrics import mean_squared_error,mean_absolute_error,explained_variance_score
In [463]:
y_pred = model.predict(X_test)
In [464]:
expVar = explained_variance_score(y_test, y_pred)
MSE = mean_squared_error(y_test,y_pred)
MAE = mean_absolute_error(y_test,y_pred)
"Exp Variance:{} MSE: {} MAE: {}".format(expVar, MSE, MAE)
Out[464]:
'Exp Variance:0.7487900951478811 MSE: 0.005449179862174175 MAE: 0.05084240704268286'

The mean absolute error is ~0.05; in the context of graduation rates (between 0 and 1.0), this means the model predicts graduation to within about 5 percentage points on average (RMSE = sqrt(0.00545) ≈ 0.074). The explained variance is about 0.75, so some variance in the data remains that the model cannot resolve readily.

Let's plot test cases versus our prediction and look at the spread.

In [407]:
plt.scatter(y_test,y_pred)
plt.plot(y_test,y_test,'r')
Out[407]:
[<matplotlib.lines.Line2D at 0x1ef8cdcdb08>]

The overall trend indicates consistent predictions, although there are some outliers and less accurate predictions.

The model predicts well up to graduation rates of about 0.6 and performs poorly above that. One reason is that the overwhelming majority of institutions report graduation rates around 0.6 or lower; institutions that graduate students at higher rates are rare, so the model sees few such examples during training.
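To check this impression numerically, we can bin the true graduation rates and compute the mean absolute error per band; a quick sketch:

# Mean absolute error within bands of the true graduation rate
y_true = np.ravel(y_test)
abs_err = np.abs(y_true - np.ravel(y_pred))
bands = pd.cut(y_true, bins=[0.0, 0.2, 0.4, 0.6, 0.8], include_lowest=True)
pd.Series(abs_err).groupby(bands).mean()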

Another way to compare the predictions with the test cases is to take their difference and plot its distribution.

In [408]:
errors = np.array(y_test) - y_pred
sns.distplot(errors)
Out[408]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ef8cf1ec48>

The error distribution looks roughly normal, with the majority of the errors centered around zero.

In [465]:
X_baseline = np.asarray(X.mean()).reshape((15,1))
X_baseline
Out[465]:
array([[5.90140064e+03],
       [5.33810859e-01],
       [1.38852707e-01],
       [1.28654903e-01],
       [1.28312902e-01],
       [4.08781989e-01],
       [5.91218027e-01],
       [4.58957465e+03],
       [8.20187433e+03],
       [5.67319483e+03],
       [6.05138691e+03],
       [6.12463601e-01],
       [2.83733083e-01],
       [1.03803316e-01],
       [3.63204819e-01]])

BASE CASE:

  • UGDS = X_baseline[0]
  • UGDS_WHITE = X_baseline[1]
  • UGDS_BLACK = X_baseline[2]
  • UGDS_HISP = X_baseline[3]
  • UGDS_OTH = X_baseline[4]
  • UGDS_MEN = X_baseline[5]
  • UGDS_WOMEN = X_baseline[6]
  • TUITIONFEE_IN = X_baseline[7]
  • TUITIONFEE_OUT = X_baseline[8]
  • DEBT_MDN = X_baseline[9]
  • AVGFACSAL = X_baseline[10]
  • INC_PCT_LO = X_baseline[11]
  • INC_PCT_M = X_baseline[12]
  • INC_PCT_H = X_baseline[13]
  • FTFTPCTFLOAN = X_baseline[14]

Graduation rate of base case

In [490]:
scMM = MinMaxScaler()  # scaler for the hand-built cases (this definition was missing from the original listing)
X_case = scMM.fit_transform(X_baseline)  # caveat: this rescales the 15 mean values against one another
model.predict(X_case.reshape((1,15)))
Out[490]:
array([0.36677497])
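Note that scMM.fit_transform(X_baseline) rescales the 15 mean values against one another rather than against the training data. A more conventional sketch, reusing a per-feature scaler fitted on X (variable names match the notebook's):

# Scale a single hand-built case with the same per-feature ranges as training
scaler = MinMaxScaler().fit(X)                    # per-feature min/max from the data
X_case = scaler.transform(X.mean().to_frame().T)  # baseline row of feature means
model.predict(X_case)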

Graduation rate by reducing student debt

In [491]:
X_testing =[[5.90140064e+03],
       [5.33810859e-01],
       [1.38852707e-01],
       [1.28654903e-01],
       [1.28312902e-01],
       [4.08781989e-01],
       [5.91218027e-01],
       [4.58957465e+03],
       [8.20187433e+03],
       [2500],
       [6.05138691e+03],
       [6.12463601e-01],
       [2.83733083e-01],
       [1.03803316e-01],
       [3.63204819e-01]]
X_case =scMM.fit_transform(X_testing)
grid.predict(X_case.reshape((1,15))) # DEBT_MDN reduced from ~5.7k to 2.5k
Out[491]:
array([0.456381])

Graduation rate after increasing tuition fees

In [489]:
X_testing =[[5.90140064e+03],
       [5.33810859e-01],
       [1.38852707e-01],
       [1.28654903e-01],
       [1.28312902e-01],
       [4.08781989e-01],
       [5.91218027e-01],
       [9000],
       [16000],
       [5.67319483e+03],
       [6.05138691e+03],
       [6.12463601e-01],
       [2.83733083e-01],
       [1.03803316e-01],
       [3.63204819e-01]]
X_case =scMM.fit_transform(X_testing)
model.predict(X_case.reshape((1,15))) # in-state tuition from ~4.6k to 9k, out-of-state from ~8.2k to 16k
Out[489]:
array([0.36678997])