Table of Contents¶

1. College Completion Rate in the United States

1. College Completion Rate in the United States¶

1.1 Background¶

Over the past decade, I had the opportunity to work and serve both at the community college level and university level. I found the distribution of available resources and student demographics to be different. Naturally, as a researcher and professor, I was curious to find out the impact of disparities on students' performance outcome. Picking a metric to evaluate such hypothesis is not trivial. However, many scholars in the field of education believe that "student completion" (that is graduation rate) is a rather inclusive metric.

Moreover, the government of California changed the allocation of funding for colleges in 2019. Based on the new funding formula, the annual allocated funding for colleges is composed of 3 categories:

1.60% of the funding as a base allocation, which largely reflects enrollment.

2.20% of the funding is directly correlated with success measure (that includes completion rate)

3.20% of the funding allocation is based on the numbers of students receiving grants.

For more details visit California Community Colleges Student Centered Funding Formula.

College administrations have always been interested to monitor their students outcome by measuring their completion rate. However, now more than ever it seems crucial to measure this factor. Additionally, administrations need to predict the completion rates to account for it in their planning and budgets. Therefore, I am proposing the following project to create a predictive model to forecast the completion rate.

For the proposed project I am using a publicly available dataset gathered from nearly all degree-granting institutions in the US to compare the completion rate within community colleges. The dataset includes the students' completion based on various factors such as race, the state, gender and year.

1.2 The Data¶

1.2.1 Source¶

The primary source of data for this work comes from College Scorecard by US Department of Eduction. College Scorecard provides data at the institution-level and data by field of study. The technical data documents provide in-depth information about these data. The data dictionary includes separate worksheets with institution-level and field of study dictionaries as well as cohort maps describing the timing aspects of each data element.

The link to the publicly available dataset is here.

The dataset contains 3GB of data ranging from 1997 to June 2020, gathering and reporting approximately 1900 features. Related features to this work were extracted and summarized below.

1.3 Data Overview¶

Here is the information on the data set:

Features	Description
UGDS	Total Student Enrollment
UGDS_WHITE	Ratio of White Students
UGDS_Black	Ratio of African-American Students
UGDS_HISP	Ratio of Hispanic Students
UGDS_OTH	Ratio of Other Students
UGDS_MEN	Ratio of Male Students
UGDS_WOMEN	Ratio of Female Students
TUITIONFEE_IN	Tuition Fee for In-State Students
TUITIONFEE_OUT	Tuition Fee for Out-Of-State-State Students
AVGFACSAL	Average Salary of Faculties
INC_PCT_LO	Ratio of Low-Income Families
INC_PCT_M	Ratio of Middle-Income Families
INC_PCT_H	Ratio of High-Income Families
DEBT_MDN	Median Loan Debt Accumulated at the Institution
FTFTPCTFLOAN	Share of Students Who Received Federal Loans

1.3.1 The institutions¶

College Completion examines data and trends at 3,800 degree-granting institutions in the United States that reported a first-time, full-time degree-seeking undergraduate cohort, had a total of at least 500 students at the undergraduate level , and awarded undergraduate degrees between 2010 and 2017.

1.3.2 Graduation rates¶

Graduation data from the National Center for Education Statistics’ Integrated Postsecondary Education System is limited to tracking completions for groups of first-time, full-time degree-seeking students at the undergraduate level.

1.3.3 Race and ethnicity¶

Until 2009, the NCES classified students in seven ways: White, non-Hispanic; Black, non-Hispanic; American Indian/Alaskan Native; Asian/Pacific Islander; unknown race or ethnicity; and nonresident. In addition to creating a stronger separation between race and ethnicity categories, two new race categories were created: Native Hawaiian or Other Pacific Islander (previously combined with Asian students) and students who belong to two or more races.

1.3.4 Efficiency measures¶

“Awards per 100 full-time undergraduate students” includes all undergraduate-level completions reported by the institution to the NCES: associate degrees, and certificate programs of less than four years in length. Full-time-equivalent undergraduates are estimated from the number of credit hours taken at the institution in an academic year. To account for changes in enrollment, the resulting metric is a seven-year average of data from 2011 to 2017.

1.4 Exploratory Data Analysis¶

First, let's import the required libraries: NumPy, Pandas, Matplotlib and Seaborn. Now let's take a look at data and calculate the key statistics on all numeric columns. I use pairplots to find any correlation between data or catch any trends in the beginning. Now I have a good glimpse of the data. Next, I would like to explore any relationship between graduation rate and race between males and females. Heatmap seems like a good choice and in fact it yields a beautiful graph. Moreover, I would like to find out the graduation rate nationwide and rank all states accordingly. Bar graphs seems the right choice. Moreover, I am interested in producing an interactive plot of the US with each state indicating the graduation rate using Chart Studio package.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import pickle 

import os

import warnings
warnings.filterwarnings('ignore')

College Scorecard provides tens of csv files from 1997 to June 2020. Let's open one of the more recent files.

datadir = os.getcwd() + "/rawdata"

COL_1819 = pd.read_csv(datadir + '/MERGED2018_19_PP.csv')

COL_1819.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6806 entries, 0 to 6805
Columns: 1986 entries, UNITID to SCUGFFN_POOLED
dtypes: float64(1902), int64(14), object(70)
memory usage: 103.1+ MB

There are 6806 entries (degree-granting institutions) for 2018-2019 academic year and 1986 columns in this one file.

Note that memory usage for this file is 103+ MB.

COL_1819.head()

There are missing values tagged with PrivacySupprressed

I would like to deal with missing values. So I perform a hard "na drop".

COL_1819_NoNA = COL_1819.dropna(axis=1)
COL_1819_NoNA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6806 entries, 0 to 6805
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   UNITID     6806 non-null   int64 
 1   OPEID      6806 non-null   int64 
 2   OPEID6     6806 non-null   int64 
 3   INSTNM     6806 non-null   object
 4   CITY       6806 non-null   object
 5   STABBR     6806 non-null   object
 6   ZIP        6806 non-null   object
 7   HCM2       6806 non-null   int64 
 8   MAIN       6806 non-null   int64 
 9   NUMBRANCH  6806 non-null   int64 
 10  PREDDEG    6806 non-null   int64 
 11  HIGHDEG    6806 non-null   int64 
 12  CONTROL    6806 non-null   int64 
 13  ST_FIPS    6806 non-null   int64 
 14  REGION     6806 non-null   int64 
 15  CURROPER   6806 non-null   int64 
 16  ICLEVEL    6806 non-null   int64 
 17  OPEFLAG    6806 non-null   int64 
dtypes: int64(14), object(4)
memory usage: 957.2+ KB

It seems like only 18 columns survived the hard drop. Memory usage went down from 103+ MB to less than 1 MB.

Taking such drastic measures by dropping all missing values is not optimal. Therefore, I read the data documentation and only keep the columns that I need for my analysis.

I select the following variables from Student Body, Admission, Institution and Financial. I store all variables of interest in vars_interest

studentBodyVars = ['UGDS', 'UGDS_MEN', 'UGDS_WOMEN', 'UGDS_WHITE', 
                   'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN']
admission = ['ADM_RATE_ALL', 'ACTCMMID', 'SATVRMID', 'SATMTMID', 'SATWRMID']
completion = ['C100_4', 'C100_L4', 'C150_4', 'C150_L4']
institution = ['AVGFACSAL', 'TUITIONFEE_OUT', 'TUITIONFEE_IN']
financial = ['FTFTPCTFLOAN', 'DEBT_MDN', 'INC_PCT_LO', 'INC_PCT_M1', 'INC_PCT_M2', 'INC_PCT_H1', 'INC_PCT_H2']
vars_interest = studentBodyVars + admission + completion + institution + financial # Include these vars

I prefer working with float numbers so I convert the columns to float. However, UNITID is the Institution Unique ID so I convert that back to integer.

col_dtypes = dict(COL_1819_NoNA.dtypes.replace(np.dtype('int64'),np.dtype('float64'))) # make the dtypes floats

col_dtypes['UNITID'] = np.dtype('int64') # convert the UNITID back to int

I would want to apply the changes to my vars_interest

col_dtypes.update({a: np.dtype('float64') for a in vars_interest}) # make them floats

I would like to read the csv file again, however, this time I only select and import the columns in vars_interest. Moreover, the missing values are called PrivacySupprressed in the csv so I make sure Pandas knows about that as well. The import works much faster this way

COL_1819 = pd.read_csv(datadir + '/MERGED2018_19_PP.csv',na_values='PrivacySuppressed',
                            usecols=col_dtypes.keys())
COL_1819.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6806 entries, 0 to 6805
Data columns (total 49 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   UNITID          6806 non-null   int64  
 1   OPEID           6806 non-null   int64  
 2   OPEID6          6806 non-null   int64  
 3   INSTNM          6806 non-null   object 
 4   CITY            6806 non-null   object 
 5   STABBR          6806 non-null   object 
 6   ZIP             6806 non-null   object 
 7   HCM2            6806 non-null   int64  
 8   MAIN            6806 non-null   int64  
 9   NUMBRANCH       6806 non-null   int64  
 10  PREDDEG         6806 non-null   int64  
 11  HIGHDEG         6806 non-null   int64  
 12  CONTROL         6806 non-null   int64  
 13  ST_FIPS         6806 non-null   int64  
 14  REGION          6806 non-null   int64  
 15  ADM_RATE_ALL    2242 non-null   float64
 16  SATVRMID        1239 non-null   float64
 17  SATMTMID        1242 non-null   float64
 18  SATWRMID        0 non-null      float64
 19  ACTCMMID        1273 non-null   float64
 20  UGDS            6041 non-null   float64
 21  UGDS_WHITE      6041 non-null   float64
 22  UGDS_BLACK      6041 non-null   float64
 23  UGDS_HISP       6041 non-null   float64
 24  UGDS_ASIAN      6041 non-null   float64
 25  UGDS_AIAN       6041 non-null   float64
 26  UGDS_NHPI       6041 non-null   float64
 27  UGDS_2MOR       6041 non-null   float64
 28  UGDS_NRA        6041 non-null   float64
 29  UGDS_UNKN       6041 non-null   float64
 30  CURROPER        6806 non-null   int64  
 31  TUITIONFEE_IN   3865 non-null   float64
 32  TUITIONFEE_OUT  3621 non-null   float64
 33  AVGFACSAL       4012 non-null   float64
 34  C150_4          2300 non-null   float64
 35  C150_L4         3319 non-null   float64
 36  INC_PCT_LO      0 non-null      float64
 37  INC_PCT_M1      0 non-null      float64
 38  INC_PCT_M2      0 non-null      float64
 39  INC_PCT_H1      0 non-null      float64
 40  INC_PCT_H2      0 non-null      float64
 41  DEBT_MDN        0 non-null      float64
 42  C100_4          2051 non-null   float64
 43  C100_L4         3317 non-null   float64
 44  ICLEVEL         6806 non-null   int64  
 45  UGDS_MEN        6041 non-null   float64
 46  UGDS_WOMEN      6041 non-null   float64
 47  OPEFLAG         6806 non-null   int64  
 48  FTFTPCTFLOAN    5740 non-null   float64
dtypes: float64(31), int64(14), object(4)
memory usage: 2.5+ MB

COL_1819.head()

The index is integer. I would like to change that to a set of (Institution, Year).

I write a function to open each csv from 1997, extract the columns in vars_interest and update the index as we go.

def read_cs_data(year,col_dtypes, datadir):
    """read a CollegeScorecard dataframe"""
    nextyr = str(int(year) + 1)[-2:]
    filename = datadir + '/MERGED{}_{}_PP.csv'.format(year,nextyr)
    col = pd.read_csv(filename,na_values='PrivacySuppressed',
                      usecols=col_dtypes.keys())
    col['Year'] = pd.Period(str(int(year) + 1),freq='Y')
    return col

col_all = pd.concat((read_cs_data(str(y),col_dtypes,datadir) for y in range(1997,2019)))
col_all = col_all.set_index(['UNITID','Year'])

col_all.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 156324 entries, (100636, Period('1998', 'A-DEC')) to (49146401, Period('2019', 'A-DEC'))
Data columns (total 48 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   OPEID           156324 non-null  object 
 1   OPEID6          156324 non-null  int64  
 2   INSTNM          156324 non-null  object 
 3   CITY            156324 non-null  object 
 4   STABBR          156324 non-null  object 
 5   ZIP             156324 non-null  object 
 6   HCM2            6806 non-null    float64
 7   MAIN            156324 non-null  int64  
 8   NUMBRANCH       156324 non-null  int64  
 9   PREDDEG         156324 non-null  int64  
 10  HIGHDEG         156324 non-null  int64  
 11  CONTROL         156295 non-null  float64
 12  ST_FIPS         156323 non-null  float64
 13  REGION          156323 non-null  float64
 14  ADM_RATE_ALL    49236 non-null   float64
 15  SATVRMID        22303 non-null   float64
 16  SATMTMID        22440 non-null   float64
 17  SATWRMID        7036 non-null    float64
 18  ACTCMMID        22325 non-null   float64
 19  UGDS            138256 non-null  float64
 20  UGDS_WHITE      74340 non-null   float64
 21  UGDS_BLACK      74340 non-null   float64
 22  UGDS_HISP       74340 non-null   float64
 23  UGDS_ASIAN      74340 non-null   float64
 24  UGDS_AIAN       74340 non-null   float64
 25  UGDS_NHPI       74340 non-null   float64
 26  UGDS_2MOR       74340 non-null   float64
 27  UGDS_NRA        138253 non-null  float64
 28  UGDS_UNKN       138253 non-null  float64
 29  CURROPER        6806 non-null    float64
 30  TUITIONFEE_IN   79194 non-null   float64
 31  TUITIONFEE_OUT  76907 non-null   float64
 32  AVGFACSAL       89554 non-null   float64
 33  C150_4          47185 non-null   float64
 34  C150_L4         80324 non-null   float64
 35  INC_PCT_LO      128111 non-null  float64
 36  INC_PCT_M1      113763 non-null  float64
 37  INC_PCT_M2      101961 non-null  float64
 38  INC_PCT_H1      87150 non-null   float64
 39  INC_PCT_H2      86156 non-null   float64
 40  DEBT_MDN        121680 non-null  float64
 41  C100_4          41189 non-null   float64
 42  C100_L4         37820 non-null   float64
 43  ICLEVEL         156295 non-null  float64
 44  UGDS_MEN        138253 non-null  float64
 45  UGDS_WOMEN      138253 non-null  float64
 46  OPEFLAG         156324 non-null  int64  
 47  FTFTPCTFLOAN    71054 non-null   float64
dtypes: float64(37), int64(6), object(5)
memory usage: 57.8+ MB

Lets get some key statistics from the dataset so far.

col_all.describe()

I would like to focus on community colleges for this particular work so I select a subset of data that corresponds to community colleges by searching for institutions with the highest granting degree being Associate using 'HIGHDEG' == 2

df = col_all[col_all['HIGHDEG']==2]

df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 37034 entries, (100636, Period('1998', 'A-DEC')) to (492476, Period('2019', 'A-DEC'))
Data columns (total 48 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   OPEID           37034 non-null  object 
 1   OPEID6          37034 non-null  int64  
 2   INSTNM          37034 non-null  object 
 3   CITY            37034 non-null  object 
 4   STABBR          37034 non-null  object 
 5   ZIP             37034 non-null  object 
 6   HCM2            1379 non-null   float64
 7   MAIN            37034 non-null  int64  
 8   NUMBRANCH       37034 non-null  int64  
 9   PREDDEG         37034 non-null  int64  
 10  HIGHDEG         37034 non-null  int64  
 11  CONTROL         37034 non-null  float64
 12  ST_FIPS         37034 non-null  float64
 13  REGION          37034 non-null  float64
 14  ADM_RATE_ALL    5709 non-null   float64
 15  SATVRMID        458 non-null    float64
 16  SATMTMID        465 non-null    float64
 17  SATWRMID        81 non-null     float64
 18  ACTCMMID        520 non-null    float64
 19  UGDS            35321 non-null  float64
 20  UGDS_WHITE      18145 non-null  float64
 21  UGDS_BLACK      18145 non-null  float64
 22  UGDS_HISP       18145 non-null  float64
 23  UGDS_ASIAN      18145 non-null  float64
 24  UGDS_AIAN       18145 non-null  float64
 25  UGDS_NHPI       18145 non-null  float64
 26  UGDS_2MOR       18145 non-null  float64
 27  UGDS_NRA        35321 non-null  float64
 28  UGDS_UNKN       35321 non-null  float64
 29  CURROPER        1379 non-null   float64
 30  TUITIONFEE_IN   27561 non-null  float64
 31  TUITIONFEE_OUT  27565 non-null  float64
 32  AVGFACSAL       33999 non-null  float64
 33  C150_4          1133 non-null   float64
 34  C150_L4         33700 non-null  float64
 35  INC_PCT_LO      33795 non-null  float64
 36  INC_PCT_M1      32992 non-null  float64
 37  INC_PCT_M2      29896 non-null  float64
 38  INC_PCT_H1      24424 non-null  float64
 39  INC_PCT_H2      24294 non-null  float64
 40  DEBT_MDN        31455 non-null  float64
 41  C100_4          228 non-null    float64
 42  C100_L4         15300 non-null  float64
 43  ICLEVEL         37034 non-null  float64
 44  UGDS_MEN        35321 non-null  float64
 45  UGDS_WOMEN      35321 non-null  float64
 46  OPEFLAG         37034 non-null  int64  
 47  FTFTPCTFLOAN    17958 non-null  float64
dtypes: float64(37), int64(6), object(5)
memory usage: 13.8+ MB

1.4.1 Visualizing Data¶

1.4.1.1 Graduation distribution plot¶

let's look at the distribution of graduate rate by state. We use Seaborn distribution plot for illustration. "grad_rate_150 is the ultimate number of students who eventually graduated. We pick that as the indicator. We group our DataFrame by state, find the average "grad_150_rate for each state and and sort them descendingly.

df['C150_L4'].describe()

count    33700.000000
mean         0.361617
std          0.230463
min          0.000000
25%          0.182500
50%          0.301000
75%          0.513000
max          1.000000
Name: C150_L4, dtype: float64

f, ax = plt.subplots(1, 1)

sns.distplot(df['C100_L4'], label="Within 100% of Expected Time")
sns.distplot(df['C150_L4'], label="Within 150% of Expected Time")

ax.legend()
f.set_size_inches(11.7, 8.27)
ax.set(xlabel='Graduation Rate', ylabel='common ylabel')
plt.show()

1.4.1.2 Bar plot of graduation rate by state¶

One way to visualize the performance of students nation-wide is plotting bar graphs for each state. Additionally, it will be useful to compare graduation within states by gender.

For this matter, we use pandas pivot table. We pick state_abbr as the index, and gender for our columns. The values populated in the columns are the average graduation rate within 150% expected time (i.e. grad_150_rate).

dfs = col_all.pivot_table(index="STABBR", values='C150_L4', aggfunc='mean', fill_value=0)

dfs=dfs.sort_values(by='C150_L4',ascending=False)
dfs.plot(kind='bar',figsize=(20,6))
plt.ylabel('Graduate Rate')
plt.xlabel('States')
plt.title('States Graduation Rate')

Text(0.5, 1.0, 'States Graduation Rate')

1.4.2 Creating an interactive map of nationwide graduation rate¶

Choropleth Maps is a type of thematic map in which areas are shaded or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density. Along the same line of the previous graph, let's take a look at the distribution of graduation rates in the United States.

Note: Other than Pandas and NumPy, "Chart Studio" must be installed too.

import chart_studio.plotly as py
import plotly.graph_objs as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

Now we need to set up everything so that the figures show up in the notebook:

init_notebook_mode(connected=True)

Now we need to begin to build a data dictionary. Easiest way to do this is to use the dict() function of the general form:

#Building data dictionary

data = dict(type = 'choropleth',
            locations = dfs.index,
            locationmode = 'USA-states',
            colorscale= 'Greens',
            #text= ['text1','text2','text3'],
            z=dfs['C150_L4'],
            colorbar = {'title':'Graduate Rate'})

Then we create the layout nested dictionary:

#Building nested dictionary

layout = dict(geo = {'scope':'usa'})

Then we use:

go.Figure(data = [data],layout = layout) to set up the object that finally gets passed into iplot()

#Set up object to be passed into iplot()

choromap = go.Figure(data = [data],layout = layout)

#Plotting the interactive map of US with our graduation rate data

iplot(choromap)

1.5 Modeling and Prediction¶

The college completion rate (that is the graduation rate of students) from 3800 degree-granting institutions between 2003 to 2013 was studied. Racial and gender factors were considered to establish a measure of success (completion) for students. The result indicate the average completion rates for colleges is between 30 to 50 percent. California ranks top with 50% graduation rate while Alaska sits at the bottom with 28% completion. Moreover, Asian and White students have the highest completion while African-American students have the lowest completion. Female students graduate at higher rates among all races. Moreover, female students have higher graduation rate compared to male students in every state consistently. These results are useful for future studies that include graduation rates for colleges. Especially, for California colleges where the funding formula is directly dependent on the completion rates and college racial diversity. The next step of the way is to create a predictive model to forecast the completion rates for the coming academic years. The predictive model could potentially predict the racial diversity each year as well. However, to create a robust model more data is required. The required data must be more recent, reflecting the years past 2013 up to 2020. Data must include new features such as teacher-to-student ratio.

1.5.1 Previous Work¶

Next step presents the results of analysis that will be conducted in order to gauge the predictive capability of a machine learning algorithm to predict on-time graduation that takes into consideration students' learning and development. Picking such model is crucial for college administration planning and budgeting. Anderson et al. [3] trained a set of four binary classifiers to predict FTIC students’ graduation rates. Based on the F1 and AUC scores, SGD classifier and logistic regression performed the best, while linear SVM and decision tree performed slightly worse. Pang et al. [4] predicted students graduation with 72% accuracy and precision/recall around 62% using ensemble support vector machines algorithm. Kesumawati et al. [5] applied Naïve Bayes Classifier and SVM to forecast students graduation rates. Both algorithms yielded 69% accuracy in predictions. Raju in his dissertation [6] applied regression models (forward, backward and stepwise), neural network and decision tree. Neural networks had the best performance in predicting graduation rates, while regression methods were a close contender with 77% accuracy while decision tree performed the worst.

From the brief literature review performed above, it seems that either neural networks or regression models are potential candidates for our predictive model.

1.5.2 Feature Engineering¶

1.5.2.1 Combining dependent features¶

Predictions could benefit from engineering features before feeding them to a predictive model. The share of various races in student body is different. While white students constitute more than 50% of the student body, African-American 14% and Hispanic 13%, all other races, including non-resident and unknown races constitute 20% of the student body. So a new variable called UGDS_OTH is created to include all other races. Moreover, family income in broken down to middle 1 (INC_PCT_M1) and middle 2 (INC_PCT_M2). Both variables are combined into a new variable called INC_PCT_M. Similarly, high income 1 INC_PCT_H1 and high income 2 INC_PCT_H2 are summed up in INC_PCT_H.

df['UGDS_OTH'] = df['UGDS_ASIAN']+df['UGDS_AIAN']+df['UGDS_NHPI']+df['UGDS_2MOR']+df['UGDS_NRA']+df['UGDS_UNKN']
df['INC_PCT_M'] = df['INC_PCT_M1']+df['INC_PCT_M2']
df['INC_PCT_H'] = df['INC_PCT_H1']+df['INC_PCT_H2']
df.describe()

1.5.2.2 Getting rid of unimportant features¶

The original dataframe created, df contain some features that are not useful in predictions such as institution name INSTNM or institution unit ID and its variations OPEID and OPEID6. Some other features that thought to be useful in the first round of feature selection such as admission rate ADM_RATE_ALL or SAT and ACT scores SATVRMID and ACRCMMID contain many missing values. These features are removed to increase model predictions. Finally, the features that are combined in the new variables must be removed too (e.g. UGDS_AIAN or INC_PCT_M1).

mldf = df.drop(columns=['INSTNM','ADM_RATE_ALL','SATVRMID','SATMTMID','SATWRMID','ACTCMMID','HCM2','CURROPER',
                        'C150_4','C100_4','OPEID','OPEID6','CITY','STABBR','ZIP','MAIN','NUMBRANCH',
                        'PREDDEG','HIGHDEG','CONTROL','ST_FIPS','REGION','ICLEVEL','OPEFLAG', 'UGDS_AIAN', 'UGDS_ASIAN',
                        'UGDS_NHPI','UGDS_2MOR','UGDS_NRA','UGDS_UNKN','INC_PCT_M1','INC_PCT_M2','INC_PCT_H1',
                        'INC_PCT_H2'], axis = 1)
mldf.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 37034 entries, (100636, Period('1998', 'A-DEC')) to (492476, Period('2019', 'A-DEC'))
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   UGDS            35321 non-null  float64
 1   UGDS_WHITE      18145 non-null  float64
 2   UGDS_BLACK      18145 non-null  float64
 3   UGDS_HISP       18145 non-null  float64
 4   TUITIONFEE_IN   27561 non-null  float64
 5   TUITIONFEE_OUT  27565 non-null  float64
 6   AVGFACSAL       33999 non-null  float64
 7   C150_L4         33700 non-null  float64
 8   INC_PCT_LO      33795 non-null  float64
 9   DEBT_MDN        31455 non-null  float64
 10  C100_L4         15300 non-null  float64
 11  UGDS_MEN        35321 non-null  float64
 12  UGDS_WOMEN      35321 non-null  float64
 13  FTFTPCTFLOAN    17958 non-null  float64
 14  UGDS_OTH        18145 non-null  float64
 15  INC_PCT_M       29726 non-null  float64
 16  INC_PCT_H       24257 non-null  float64
dtypes: float64(17)
memory usage: 5.0 MB

1.5.2.3 Eliminating outliers cases¶

The average population for a community college is close to 6000 and average graduation rate close to 0.5. However, in rare cases there are some institutions with small population or very high graduation rates. These institution offer only certifications and offer only a few associate degree programs or none at all. Therefore, these institutions are removed from the dataframe using a simple query. The query maintain observations with more than 500 students and graduation rates below 0.8.

dfn = mldf.query('C150_L4<0.8 & UGDS>500')

dfn.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 22713 entries, (100760, Period('1998', 'A-DEC')) to (489201, Period('2019', 'A-DEC'))
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   UGDS            22713 non-null  float64
 1   UGDS_WHITE      11945 non-null  float64
 2   UGDS_BLACK      11945 non-null  float64
 3   UGDS_HISP       11945 non-null  float64
 4   TUITIONFEE_IN   18598 non-null  float64
 5   TUITIONFEE_OUT  18601 non-null  float64
 6   AVGFACSAL       22555 non-null  float64
 7   C150_L4         22713 non-null  float64
 8   INC_PCT_LO      20810 non-null  float64
 9   DEBT_MDN        19184 non-null  float64
 10  C100_L4         10764 non-null  float64
 11  UGDS_MEN        22713 non-null  float64
 12  UGDS_WOMEN      22713 non-null  float64
 13  FTFTPCTFLOAN    11943 non-null  float64
 14  UGDS_OTH        11945 non-null  float64
 15  INC_PCT_M       18923 non-null  float64
 16  INC_PCT_H       15831 non-null  float64
dtypes: float64(17)
memory usage: 3.1 MB

The dataframe is reduced down to 22713 observations. There are however plenty of missing values in data. Most of the data is suppressed due to privacy issues. We need to perform a hard na drop on the remainder of the dataframe.

dfn.dropna(axis = 0, inplace = True)

Let's rearrange the dataframe and put family features side-by-side

dfn = dfn[['C100_L4','C150_L4','UGDS','UGDS_WHITE','UGDS_BLACK','UGDS_HISP','UGDS_OTH','UGDS_MEN','UGDS_WOMEN','TUITIONFEE_IN','TUITIONFEE_OUT','DEBT_MDN','AVGFACSAL','INC_PCT_LO','INC_PCT_M','INC_PCT_H','FTFTPCTFLOAN']]
dfn.describe()

1.5.2.4 Studying the possible correlation among features¶

Let's compute pairwise correlation of columns using dfn.corr(). Seaborn Heatmap shows the contrast of correlations beautifully. We use "RdBu" color map to contrast positive correlations with negative or inverse correlations.

plt.figure(figsize=(18,10))
sns.set_context('paper',font_scale=1.5)
#cmap = sns.cubehelix_palette(light=1, as_cmap=True)
#sns.heatmap(bf.corr(), cmap = cmap, linecolor='white', linewidths = 3, annot = True, fmt='.1g')
sns.heatmap(dfn.corr(),  cmap="RdBu", linecolor='white', linewidths = 3, annot = True, fmt='.1g')

<matplotlib.axes._subplots.AxesSubplot at 0x1effa68a348>

The top 2 rows show the correlation of graduation rate (within 100 and 150 percent of expected time) with other columns. Graduation rate has strong inverse correlation with college population and and average faculty salary. Graduation rate has a positive correlation with tuition fee and average student debt. However, we can not make early judgments until later.

1.5.3 Scaling¶

First let's make X and y matrices. X contains all the features except graduation rates. y contains graduation rates.

y = dfn[['C150_L4']]
X = dfn.drop(columns=['C150_L4', 'C100_L4'], axis = 1)

from sklearn.preprocessing import QuantileTransformer, MinMaxScaler

Scaling is necessary because features range from fractions of 1 to hundreds of thousands. Graduation rates are confined between 0 and 1. We used MinMaxScaler to keep the range while scaling all features.

We store transformed X and y in scy and scX.

#scy = MinMaxScaler(feature_range=(0, 0.8)).fit_transform(y)
scX = MinMaxScaler().fit_transform(X)

1.5.4 Fitting¶

We use 70% of data for training and set aside 30% for predictions.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scX, y, test_size=0.3)

Researchers in the field have applied different types of models to predict graduation rates. The accuracy of prediction are about 70%. In this study we use RandomForestRegressor from sklearn.ensemble.

from sklearn.ensemble import RandomForestRegressor

Additionally, we use GridSearchCV from sklearn.model_selection to find the best parameters for the model.

from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   26.0s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 12.0min finished

RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   ccp_alpha=0.0,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   max_samples=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=100,
                                                   n_jobs=None, oob_score=Fals...
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=False, scoring=None, verbose=2)

rf_random.best_params_

{'n_estimators': 400,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': None,
 'bootstrap': False}

param_grid = {
              'bootstrap'        : [False],
              'max_depth'        : [None, 93],
              'n_estimators'     : [100, 400, 1000, 1500],
              'max_features'     : [4, 8, 12],
              'min_samples_split': [1, 2, 3],
              'min_samples_leaf' : [1, 2]   
             }
grid = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

grid.fit(X_train,y_train)

Fitting 3 folds for each of 144 candidates, totalling 432 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    8.4s
[Parallel(n_jobs=-1)]: Done 162 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 365 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done 432 out of 432 | elapsed: 10.3min finished

GridSearchCV(cv=3, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'bootstrap': [False], 'max_depth': [None, 93],
                         'max_features': [4, 8, 12], 'min_samples_leaf': [1, 2],
                         'min_samples_split': [1, 2, 3],
                         'n_estimators': [100, 400, 1000, 1500]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)

grid.best_params_

{'bootstrap': False,
 'max_depth': None,
 'max_features': 4,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 1500}

param_grid = {
              'bootstrap'        : [False],
              'max_depth'        : [None],
              'n_estimators'     : [1500],
              'max_features'     : [4],
              'min_samples_split': [2],
              'min_samples_leaf' : [1]   
             }
model = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

model.fit(X_train,y_train)

Fitting 3 folds for each of 1 candidates, totalling 3 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   23.4s finished

GridSearchCV(cv=3, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'bootstrap': [False], 'max_depth': [None],
                         'max_features': [4], 'min_samples_leaf': [1],
                         'min_samples_split': [2], 'n_estimators': [1500]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)

#model.feature_importances_

array([0.12819509, 0.04312008, 0.04848407, 0.0357881 , 0.0358406 ,
       0.06127434, 0.05577367, 0.15924328, 0.06905436, 0.09580903,
       0.05623485, 0.0447625 , 0.05181128, 0.0375561 , 0.07705266])

1.5.5 Predicting and Evaluating¶

Let's import mean_squared_error, mean_absolute_error, explained_variance_score from sklearn.metrics.

from sklearn.metrics import mean_squared_error,mean_absolute_error,explained_variance_score

y_pred = model.predict(X_test)

expVar = explained_variance_score(y_test, y_pred)
MSE = mean_squared_error(y_test,y_pred)
MAE = mean_absolute_error(y_test,y_pred)
"Exp Variance:{} MSE: {} MAE: {}".format(expVar, MSE, MAE)

'Exp Variance:0.7487900951478811 MSE: 0.005449179862174175 MAE: 0.05084240704268286'

The mean absolute error is ~ 0.07. In the context of graduation rates (between 0 to 1.0) it means that our model is predicting graduation with 7 percent point accuracy. The explained variance is at 0.73. There is still some variance in the model that cannot be resolved readily.

Let's plot test cases versus our prediction and look at the spread.

plt.scatter(y_test,y_pred)
plt.plot(y_test,y_test,'r')

[<matplotlib.lines.Line2D at 0x1ef8cdcdb08>]

The overall trends indicate consistent predictions. However, there are some outliers and some less accurate predictions.

The model seems to be predicting better up to 0.6 graduation rates and performs poorly for graduation rates above 0.6. One reasons the overwhelming majority of institutions report graduation rates around 0.6 or lower. There are some institutions that graduate students at higher rates that are rare.

Another way to look at the predictions vs test cases, is by getting their difference and plot the distributions plot.

errors = np.array(y_test) - y_pred
sns.distplot(errors)

<matplotlib.axes._subplots.AxesSubplot at 0x1ef8cf1ec48>

The errors distribution plot is behaving normal with majority of the error hovering around zero.

X_baseline = np.asarray(X.mean()).reshape((15,1))
X_baseline

array([[5.90140064e+03],
       [5.33810859e-01],
       [1.38852707e-01],
       [1.28654903e-01],
       [1.28312902e-01],
       [4.08781989e-01],
       [5.91218027e-01],
       [4.58957465e+03],
       [8.20187433e+03],
       [5.67319483e+03],
       [6.05138691e+03],
       [6.12463601e-01],
       [2.83733083e-01],
       [1.03803316e-01],
       [3.63204819e-01]])

BASE CASE:

UGDS = X_baseline[0]
UGDS_WHITE = X_baseline[1]
UGDS_BLACK = X_baseline[2]
UGDS_HISP = X_baseline[3]
UGDS_OTH = X_baseline[4]
UGDS_MEN = X_baseline[5]
UGDS_WOMEN = X_baseline[6]
TUITIONFEE_IN = X_baseline[7]
TUITIONFEE_OUT = X_baseline[8]
DEBT_MDN = X_baseline[9]
AVGFACSAL = X_baseline[10]
INC_PCT_LO = X_baseline[11]
INC_PCT_M = X_baseline[12]
INC_PCT_H = X_baseline[13]
FTFTPCTFLOAN = X_baseline[14]

Graduation rate of base case

X_case =scMM.fit_transform(X_baseline)
model.predict(X_case.reshape((1,15)))

array([0.36677497])

Graduation rate by reducing student debt

X_testing =[[5.90140064e+03],
       [5.33810859e-01],
       [1.38852707e-01],
       [1.28654903e-01],
       [1.28312902e-01],
       [4.08781989e-01],
       [5.91218027e-01],
       [4.58957465e+03],
       [8.20187433e+03],
       [2500],
       [6.05138691e+03],
       [6.12463601e-01],
       [2.83733083e-01],
       [1.03803316e-01],
       [3.63204819e-01]]
X_case =scMM.fit_transform(X_testing)
grid.predict(X_case.reshape((1,15))) #Debt from 5.7k to 2.7

array([0.456381])

Graduation rate by reducing increasing tutition fee

X_testing =[[5.90140064e+03],
       [5.33810859e-01],
       [1.38852707e-01],
       [1.28654903e-01],
       [1.28312902e-01],
       [4.08781989e-01],
       [5.91218027e-01],
       [9000],
       [16000],
       [5.67319483e+03],
       [6.05138691e+03],
       [6.12463601e-01],
       [2.83733083e-01],
       [1.03803316e-01],
       [3.63204819e-01]]
X_case =scMM.fit_transform(X_testing)
model.predict(X_case.reshape((1,15))) #Tuition from 4.5k to 10k

array([0.36678997])

	UNITID	OPEID	OPEID6	INSTNM	CITY	STABBR	ZIP	ACCREDAGENCY	INSTURL	NPCURL	...	OMENRUP_PARTTIME_POOLED_SUPP	FTFTPCTPELL	FTFTPCTFLOAN	UG12MN	G12MN	SCUGFFN	POOLYRS_FTFTAIDPCT	FTFTPCTPELL_POOLED_SUPP	FTFTPCTFLOAN_POOLED_SUPP	SCUGFFN_POOLED
0	100654	100200	1002	Alabama A & M University	Normal	AL	35762	Southern Association of Colleges and Schools C...	www.aamu.edu/	www.aamu.edu/admissions-aid/tuition-fees/net-p...	...	0.3193	0.7057	0.7143	5343.0	1165.0	1288.0	2.0	0.7083	0.7287	2698.0
1	100663	105200	1052	University of Alabama at Birmingham	Birmingham	AL	35294-0110	Southern Association of Colleges and Schools C...	https://www.uab.edu	https://uab.studentaidcalculator.com/survey.aspx	...	0.2475	0.3788	0.5350	14445.0	10498.0	2228.0	2.0	0.3891	0.5414	4176.0
2	100690	2503400	25034	Amridge University	Montgomery	AL	36117-3553	Southern Association of Colleges and Schools C...	www.amridgeuniversity.edu	www2.amridgeuniversity.edu:9091/	...	0.2836	1.0000	1.0000	440.0	527.0	5.0	4.0	PrivacySuppressed	PrivacySuppressed	24.0
3	100706	105500	1055	University of Alabama in Huntsville	Huntsville	AL	35899	Southern Association of Colleges and Schools C...	www.uah.edu	finaid.uah.edu/	...	0.2496	0.2707	0.4556	8145.0	2443.0	1341.0	2.0	0.2378	0.4402	2544.0
4	100724	100500	1005	Alabama State University	Montgomery	AL	36104-0271	Southern Association of Colleges and Schools C...	www.alasu.edu	www.alasu.edu/cost-aid/tuition-costs/net-price...	...	0.2621	0.7792	0.7539	4732.0	642.0	951.0	2.0	0.7684	0.7464	2094.0

	UNITID	OPEID	OPEID6	INSTNM	CITY	STABBR	ZIP	MAIN	NUMBRANCH	...	INC_PCT_H1	INC_PCT_H2	DEBT_MDN	C100_4	C100_L4	ICLEVEL	UGDS_MEN	UGDS_WOMEN	OPEFLAG	FTFTPCTFLOAN
0	100654	100200	1002	Alabama A & M University	Normal	AL	35762	1	1	...	NaN	NaN	NaN	0.0556	NaN	1	0.4076	0.5924	1	0.7143
1	100663	105200	1052	University of Alabama at Birmingham	Birmingham	AL	35294-0110	1	1	...	NaN	NaN	NaN	0.3462	NaN	1	0.3930	0.6070	1	0.5350
2	100690	2503400	25034	Amridge University	Montgomery	AL	36117-3553	1	1	...	NaN	NaN	NaN	0.4000	NaN	1	0.3191	0.6809	1	1.0000
3	100706	105500	1055	University of Alabama in Huntsville	Huntsville	AL	35899	1	1	...	NaN	NaN	NaN	0.2195	NaN	1	0.5796	0.4204	1	0.4556
4	100724	100500	1005	Alabama State University	Montgomery	AL	36104-0271	1	1	...	NaN	NaN	NaN	0.0975	NaN	1	0.3710	0.6290	1	0.7539

	OPEID6	HCM2	MAIN	NUMBRANCH	PREDDEG	HIGHDEG	CONTROL	ST_FIPS	REGION	ADM_RATE_ALL	...	INC_PCT_H1	INC_PCT_H2	DEBT_MDN	C100_4	C100_L4	ICLEVEL	UGDS_MEN	UGDS_WOMEN	OPEFLAG	FTFTPCTFLOAN
count	156324.000000	6806.000000	156324.000000	156324.000000	156324.000000	156324.000000	156295.000000	156323.000000	156323.000000	49236.000000	...	87150.000000	86156.000000	121680.000000	41189.000000	37820.000000	156295.000000	138253.000000	138253.000000	156324.000000	71054.000000
mean	14658.900425	0.008816	0.794536	4.175622	1.880396	2.210787	2.123069	29.217953	4.558203	0.694558	...	0.108304	0.109647	7980.543331	0.322554	0.368407	1.861205	0.347235	0.646777	1.079610	0.559860
std	12582.664300	0.093484	0.404042	13.669866	1.000990	1.286281	0.838157	16.601570	2.188611	0.210672	...	0.065278	0.107462	4595.143219	0.237243	0.296286	0.813569	0.237899	0.241641	0.818824	0.321981
min	1002.000000	0.000000	0.000000	1.000000	0.000000	0.000000	1.000000	1.000000	0.000000	0.000000	...	0.003357	0.000915	310.000000	0.000000	0.000000	1.000000	0.000000	0.000000	1.000000	0.000000
25%	3262.000000	0.000000	1.000000	1.000000	1.000000	1.000000	1.000000	13.000000	3.000000	0.562400	...	0.050814	0.028698	4500.000000	0.134900	0.099700	1.000000	0.139400	0.530500	1.000000	0.307700
50%	9863.000000	0.000000	1.000000	1.000000	2.000000	2.000000	2.000000	29.000000	5.000000	0.718900	...	0.098608	0.068243	6625.000000	0.284800	0.300000	2.000000	0.364400	0.632200	1.000000	0.647100
75%	23482.000000	0.000000	1.000000	2.000000	3.000000	4.000000	3.000000	42.000000	6.000000	0.849500	...	0.158234	0.159363	10108.000000	0.482200	0.617600	3.000000	0.466700	0.855700	1.000000	0.826100
max	99999.000000	1.000000	1.000000	143.000000	4.000000	4.000000	3.000000	78.000000	9.000000	1.000000	...	0.476190	0.705882	46547.000000	1.000000	1.000000	3.000000	1.000000	1.000000	19.000000	1.000000

	OPEID6	HCM2	MAIN	NUMBRANCH	PREDDEG	HIGHDEG	CONTROL	ST_FIPS	REGION	ADM_RATE_ALL	...	C100_4	C100_L4	ICLEVEL	UGDS_MEN	UGDS_WOMEN	OPEFLAG	FTFTPCTFLOAN	UGDS_OTH	INC_PCT_M	INC_PCT_H
count	37034.000000	1379.000000	37034.000000	37034.000000	37034.000000	37034.0	37034.000000	37034.000000	37034.000000	5709.000000	...	228.000000	15300.000000	37034.000000	35321.000000	35321.000000	37034.000000	17958.000000	18145.000000	29726.000000	24257.000000
mean	11560.758276	0.002901	0.815440	3.071421	1.690528	2.0	1.710914	28.424475	4.761192	0.748007	...	0.158577	0.251402	1.955473	0.388406	0.610490	1.033537	0.444072	0.127930	0.280360	0.119162
std	10550.043001	0.053799	0.387946	10.910958	0.462282	0.0	0.919153	16.435467	2.076091	0.211315	...	0.187935	0.228020	0.206265	0.184164	0.184829	0.504519	0.340792	0.159417	0.067639	0.084220
min	1007.000000	0.000000	0.000000	1.000000	1.000000	2.0	1.000000	1.000000	0.000000	0.000000	...	0.000000	0.000000	1.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.057143	0.005703
25%	3273.000000	0.000000	1.000000	1.000000	1.000000	2.0	1.000000	13.000000	3.000000	0.615100	...	0.055800	0.078400	2.000000	0.289200	0.545800	1.000000	0.114800	0.039300	0.236381	0.059741
50%	7437.000000	0.000000	1.000000	1.000000	2.000000	2.0	1.000000	29.000000	5.000000	0.787500	...	0.088050	0.166350	2.000000	0.387000	0.612700	1.000000	0.411850	0.080000	0.287616	0.098142
75%	21192.000000	0.000000	1.000000	2.000000	2.000000	2.0	3.000000	42.000000	6.000000	0.919600	...	0.179150	0.364525	2.000000	0.453700	0.710000	1.000000	0.780400	0.151400	0.325899	0.156573
max	42743.000000	1.000000	1.000000	143.000000	2.000000	2.0	3.000000	72.000000	9.000000	1.000000	...	1.000000	1.000000	2.000000	1.000000	1.000000	19.000000	1.000000	1.000100	0.685619	0.654676

	C100_L4	C150_L4	UGDS	UGDS_WHITE	UGDS_BLACK	UGDS_HISP	UGDS_OTH	UGDS_MEN	UGDS_WOMEN	TUITIONFEE_IN	TUITIONFEE_OUT	DEBT_MDN	AVGFACSAL	INC_PCT_LO	INC_PCT_M	INC_PCT_H	FTFTPCTFLOAN
count	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000	6557.000000
mean	0.152103	0.255245	5901.400641	0.533811	0.138853	0.128655	0.128313	0.408782	0.591218	4589.574653	8201.874333	5673.194830	6051.386915	0.612464	0.283733	0.103803	0.363205
std	0.133473	0.144274	6779.965465	0.277290	0.163871	0.159442	0.126030	0.103581	0.103581	4100.616341	3829.373367	2405.592652	1558.671142	0.102663	0.052160	0.065420	0.265619
min	0.000000	0.000000	501.000000	0.000000	0.000000	0.000000	0.000000	0.041800	0.011400	560.000000	658.000000	1634.500000	870.000000	0.197707	0.092247	0.005703	0.000000
25%	0.059300	0.149700	1547.000000	0.335700	0.026800	0.022200	0.053700	0.366700	0.545200	2440.000000	5358.000000	4000.000000	4945.000000	0.540812	0.254188	0.053779	0.154600
50%	0.112700	0.226200	3603.000000	0.593800	0.079600	0.061500	0.090400	0.411800	0.588200	3434.000000	7436.000000	5229.000000	5883.000000	0.616013	0.290870	0.092926	0.296800
75%	0.197600	0.318000	7679.000000	0.762100	0.190300	0.179500	0.159700	0.454800	0.633300	4464.000000	10358.000000	6332.000000	6981.000000	0.685677	0.319990	0.140475	0.539300
max	0.792600	0.797300	91112.000000	0.980700	0.978700	1.000000	1.000000	0.988600	0.958200	39750.000000	39750.000000	20000.000000	12613.000000	0.890584	0.452931	0.525550	1.000000