A simple example of predictive analytics for Enrollment Managers using FREE tools.
With R, we can fit all sorts
of complex models in Enrollment Management, quickly, and for no cost. In
truth, data modeling can help uncover complex relationships at your
school that are not easily visible in our usual tables and charts.
However, predictive analytics is not the golden ticket to enrollment
success. You will need to understand not only what the model is telling
you, but also the risks associated with being incorrect. Lastly, once
you have your “model”, how do you actually use it? You will need to
think about how you incorporate your new model into your current
decision-making processes.
Why Write this Post?
Over the last few weeks, I feel like the discussion around “Predictive Analytics” within Enrollment Management has really picked up steam. There are a ton of great vendors out there, but the aim of this post is to show you how simple it can be to build predictive models internally, for the price of “on-the-house”. I don’t mean to imply that machine learning is easy by any stretch, but I do intend to highlight how quickly models can be built. If you know your data, and understand various techniques, model building isn’t the hard part. More than likely though, you will want to take some time to think about your data and the output you see. Not to mention, how you would actually operationalize your model so that it runs quietly behind the scenes.
My goal is to write a post on how we can do
predictive analytics in Enrollment Management using
R. In this
example, we will fit a model to predict if an applicant is admitted. In
full disclosure, I am going to avoid the technical details as much as
possible, although understanding how these models work is critically
important.
Previous Work and Discussion
Let the debate around predictive analytics begin! I am just kidding, but there has been quite a bit of press recently on the usage of predictive analytics within higher ed and Enrollment Management. Here are a few (self-edited, sometimes snarky) headlines.
- Colleges are using Big Data to predict which students will do well
- The Future of Predictive Analytics in Higher Ed
- Political Style Targeting
- FAFSA data
I do think it’s worth noting that
predictive analytics is actually not
a new concept. Technology is making it much easier to do, although the
underlying methodologies have been applied to higher ed for some time
now. Below are just a few journal articles.
- Enrollment Models Using Data Mining
- Data Mining: A Magic Technology for College Recruitment
- Differential Pricing in Undergraduate Education
I included the last link above because “pricing” is a pretty hot topic at the moment as well. On one hand, you have schools blocking College Abacus, which is basically Kayak for college pricing. On the other, institutions are required to report all sorts of data to the government through IPEDS, where it is displayed on a number of sites including the College Affordability and Transparency Center. My point? There is an academic argument for each side of the debate, whether it’s predictive analytics or transparency. Outside of the financial reporting of public companies, what other industry has to openly report its performance at this level of detail to the public? As such, the trends of our industry are forcing us to think differently about how we do things. Now that predictive analytics is here, we need to start to get comfortable with what it can do. More importantly though, we need to understand the risks associated with modeling our enrollment data.
As mentioned a few times above, I am going to use the open-sourced
statistical programming language,
R, to download and model our data.
Here is our workflow:
- Grab a dataset from the web
- Fit a predictive model (logistic regression)
- Assess the accuracy of the model
1) Let’s grab the data
If you are reading this post and are a regular
SPSS user, this next
step is pretty cool.
R allows us to grab data from the web. If you
were just using
SPSS, it would require that you scrape (or download)
the data, and then fire up the software to read in the external dataset.
That’s way too much effort! The code below grabs a very small admissions
dataset. If you are an analyst, you should check
out UCLA’s website. It’s a great resource for
analytical methods and code examples. Below, we will define the URL for
the dataset, and then use this value to read in the CSV file from the
web into a
data.frame object called df.
URL = "https://stats.idre.ucla.edu/stat/data/binary.csv"
df = read.csv(URL)
Let’s confirm that the data are in our R session.
dim(df)
[1] 400   4
summary(df)
     admit            gre           gpa            rank
 Min.   :0.000   Min.   :220   Min.   :2.26   Min.   :1.00
 1st Qu.:0.000   1st Qu.:520   1st Qu.:3.13   1st Qu.:2.00
 Median :0.000   Median :580   Median :3.40   Median :2.00
 Mean   :0.318   Mean   :588   Mean   :3.39   Mean   :2.48
 3rd Qu.:1.000   3rd Qu.:660   3rd Qu.:3.67   3rd Qu.:3.00
 Max.   :1.000   Max.   :800   Max.   :4.00   Max.   :4.00
dim(df) simply asks
R to print out the dimensions of our
dataset. In this case, we have 400 rows and 4 columns. The head
command prints the first few rows of the data, so we can see what we
have.
head(df)
  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
3     1 800 4.00    1
4     1 640 3.19    4
5     0 520 2.93    4
6     1 760 3.00    2
I should have done this by now, so let’s talk about the dataset. The
first variable, admit, is the variable we want to predict. In this
case, our variable represents a Yes/No decision. Yes is coded as a
1 and No is coded as a
0. This type of variable is prevalent in Enrollment
Management. To name a few …
- Does a suspect respond to our search campaign?
- Does a recruit apply?
- Do we retain a student?
- Will the student graduate in 4 years?
- Does the student pay a deposit?
- Does the student melt (between May and September)?
- Does the recruit open up the next email we send them?
Even if the variable doesn’t exist in a natural Yes/No state, we can
usually force our data into this format. The other 3 variables are our
predictor variables. We will be using gre, gpa, and
rank to predict the applicant’s admission status into graduate school. The
variable gre is numeric and on an 800 scale,
gpa is also numeric on
a 4.0 scale, and rank is categorical, with values ranging 1-4.
According to UCLA’s description of the dataset, rank reflects the prestige
of the applicant’s undergraduate institution, with 1 being the highest.
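As a quick illustration of forcing a variable into the Yes/No format, here is a minimal sketch. The `applicants` data.frame and its `status` values are made up for this example and are not part of the UCLA dataset.

```r
# Hypothetical applicant statuses; invented for illustration only
applicants <- data.frame(
  id = 1:5,
  status = c("Deposited", "Withdrawn", "Deposited", "Admitted", "Withdrawn"),
  stringsAsFactors = FALSE
)

# Recode the multi-level status into a 0/1 outcome:
# 1 = paid a deposit, 0 = did not
applicants$deposit <- as.integer(applicants$status == "Deposited")

# Because the outcome is coded 0/1, its mean is the deposit rate
deposit_rate <- mean(applicants$deposit)

print(applicants)
print(deposit_rate)
```

Once the outcome is in this shape, it can be used on the left-hand side of a model in the same way admit is used in this post.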
2) Fit a Model
Now let’s fit our predictive model.
R is really flexible. All I have
to do below is tell
R to fit a model where I am trying to predict
admit given every other variable in the dataset. Below, I indicate this
concept using the syntax
admit ~ .
yield_model = glm(admit ~ ., data = df, family = binomial())
summary(yield_model)

Call:
glm(formula = admit ~ ., family = binomial(), data = df)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.580  -0.885  -0.638   1.157   2.173

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.44955    1.13285   -3.05   0.0023 **
gre          0.00229    0.00109    2.10   0.0356 *
gpa          0.77701    0.32748    2.37   0.0177 *
rank        -0.56003    0.12714   -4.40  1.1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.44  on 396  degrees of freedom
AIC: 467.4

Number of Fisher Scoring iterations: 4
When we use the summary command above, we print out the “fit” of the
model. In the section called
Coefficients:, we get the estimated
weights, or effects, of each variable on the admission status.
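One common way to read these weights is to exponentiate them, which turns each estimate into an odds ratio. The sketch below simply retypes the estimates printed in the summary above rather than refitting the model; the names `estimates`, `odds_ratios`, and `gre_100` are mine.

```r
# Estimates copied from the Coefficients table above (not refit here)
estimates <- c(gre = 0.00229, gpa = 0.77701, rank = -0.56003)

# exp() converts a log-odds coefficient into an odds ratio: the
# multiplicative change in the odds of admission for a one-unit increase
odds_ratios <- exp(estimates)
print(round(odds_ratios, 3))

# A one-point GRE change is tiny, so a 100-point change is easier to read
gre_100 <- exp(100 * estimates[["gre"]])
print(round(gre_100, 3))
```

Values above 1 raise the odds of admission and values below 1 lower them, holding the other variables fixed. Here the odds ratio for gpa is roughly 2.2 per grade point, and the one for rank is roughly 0.57 per step, so a larger rank number lowers the odds.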
3) Assess the Model
Now that we have fit a model, let’s “score” our data. Imagine that you were using last year’s applicant pool to predict the admission status of this year’s class. In the code below, we are going to append the predicted probability of being admitted to each row. We can then use this score to assess how “accurate” our predicted value truly is.
df = transform(df, score = predict(yield_model, newdata = df, type = "response"))
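To show what the score actually is, the sketch below reproduces the calculation by hand for one applicant. It assumes the rounded estimates printed in the model summary and uses the second applicant shown by head(df); the names `b`, `applicant`, `log_odds`, and `score` are mine, and the result will differ slightly from predict() because of rounding.

```r
# Estimates copied from the model summary above (rounded as printed)
b <- c(intercept = -3.44955, gre = 0.00229, gpa = 0.77701, rank = -0.56003)

# Second applicant shown by head(df): gre = 660, gpa = 3.67, rank = 3
applicant <- c(gre = 660, gpa = 3.67, rank = 3)

# Linear predictor on the log-odds scale: intercept + sum(weight * value)
log_odds <- b[["intercept"]] + sum(b[names(applicant)] * applicant)

# plogis() is the inverse logit, 1 / (1 + exp(-x)), which maps
# log-odds onto a probability between 0 and 1
score <- plogis(log_odds)
print(round(score, 3))
```

For this applicant the hand calculation gives a probability of admission of roughly 0.30.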
When we used the
summary command earlier, we printed out some basic
stats on the variables in our dataset. Because
admit is coded as
0/1, the average of this variable is equivalent to the proportion of
admit = Yes in the dataset. In this case, 32% of the applicants were
admitted. This is important because our model will calibrate the scores
relative to this proportion. If our new data are wildly different, the
model will not perform that well. Let’s print out the distribution of predicted
scores.
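The point about calibration can be checked directly: for a logistic regression with an intercept, the average predicted score on the training data equals the observed rate. The sketch below uses simulated data (the variable names and the coefficients used to generate it are invented) so that it runs without downloading anything.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

# Simulated applicants; these values are invented, not the UCLA data
n <- 400
sim <- data.frame(gpa = round(runif(n, 2.2, 4.0), 2))
true_prob <- plogis(-3.5 + 0.8 * sim$gpa)
sim$admit <- rbinom(n, size = 1, prob = true_prob)

# Fit the same kind of model used in this post and score the training data
sim_model <- glm(admit ~ gpa, data = sim, family = binomial())
sim$score <- predict(sim_model, newdata = sim, type = "response")

# Distribution of the predicted scores
print(summary(sim$score))

# The mean score matches the observed admit rate (up to numerical tolerance)
print(c(mean_score = mean(sim$score), admit_rate = mean(sim$admit)))
```

This is why a model trained on a pool with one admit rate will tend to produce scores centered on that rate, even when it is applied to a pool that behaves differently.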
Now let’s look at the distribution of the scores based on the actual
admission status. If you do not already have the library
ggplot2 installed, simply use the command
install.packages("ggplot2") before
executing the code below.
library(ggplot2)
ggplot(df, aes(x = score, fill = factor(admit))) + geom_density(alpha = 0.3)
It’s nice to see that the peak for the predicted score on admitted students is
higher than for those that were rejected, but I am not thrilled by this
plot. At first glance, it looks like the model was not able to accurately
differentiate between admits and rejects. Below, we are going to use
the library ROCR for some other “goodness-of-fit” metrics. For
help on this package, see its documentation. I
highly recommend reviewing the
PowerPoint file that is included on the
package’s site.
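library(ROCR)
## pair each predicted score with the actual admit outcome
pred <- prediction(df$score, df$admit)
## true positive rate vs. false positive rate, i.e. an ROC curve
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize = TRUE, main = "ROC Curve")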
If you view this plot from left to right, ideally the line would have
spiked “early” in the chart. In general, you typically think of a
45-degree line, and the more “lift” above this line the better. (Strictly
speaking, a plot of true positive rate against false positive rate is an
ROC curve rather than a lift chart.) Finally,
I am going to compute a metric,
AUC, the area under this curve. The higher the number, the
better: 0.5 is no better than a coin flip, and 1.0 is a perfect separation
of admits and rejects.
auc = performance(pred, "auc")
auc@y.values[[1]]
[1] 0.6921
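To make the AUC less of a black box: it is the probability that a randomly chosen admit receives a higher score than a randomly chosen reject. The sketch below computes it in base R from ranks, on a tiny made-up set of scores. The function name `auc_by_rank` and the toy vectors are my own and are not part of ROCR.

```r
# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks
auc_by_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data, invented for illustration: two rejects (0) and two admits (1)
toy_score <- c(0.10, 0.40, 0.35, 0.80)
toy_admit <- c(0, 0, 1, 1)

# Of the 4 admit/reject pairs, 3 are ordered correctly, so the AUC is 3/4
toy_auc <- auc_by_rank(toy_score, toy_admit)
print(toy_auc)
```

Read this way, the 0.6921 reported for our model means that it ranks a random admit above a random reject roughly 69% of the time.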
Real quick: you may have noticed that I usually access our key
values using the
$ operator, but needed to use
@ above. This is
because the object returned from
performance is an
S4 class in
R.
The more you play around, the more you will see this object class appear from
time to time, but you can usually access your data using
$. From above, we
see that the
AUC for our model is 0.6921. In truth, the model doesn’t
fit that well. Intuitively, we can confirm this by binning our scores
into bands of width 0.1 and looking at the actual admit rate within each band.
library(plyr)
## add a new variable, band, which puts the score into 10 groups
df = transform(df, band = cut(score, breaks = seq(0, 1, 0.1), right = FALSE))
## create a summary table, by group, that looks at some summary stats for
## each band
ddply(df, .(band), summarise, applicants = length(admit), admits = sum(admit), admit_rate = mean(admit))
       band applicants admits admit_rate
1   [0,0.1)         15      1    0.06667
2 [0.1,0.2)         88     15    0.17045
3 [0.2,0.3)         84     22    0.26190
4 [0.3,0.4)        105     31    0.29524
5 [0.4,0.5)         59     29    0.49153
6 [0.5,0.6)         32     19    0.59375
7 [0.6,0.7)         15      9    0.60000
8 [0.7,0.8)          2      1    0.50000
For example, there were 2 applicants that had a predicted probability of admission between 70% and 80%. Of these 2 applicants, only 1 was admitted. In a perfect world, the higher the score, the larger the “true” admit rate would be. With only 2 applicants in the top band, though, that last rate is too noisy to read much into.
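Another way to read the table is to ask how many of the admits we would capture by focusing on the highest-scoring bands. The sketch below retypes the counts from the summary table above and accumulates them from the top band down. The column names `share_of_applicants` and `share_of_admits` are mine.

```r
# Counts copied from the band summary table above
bands <- data.frame(
  band = c("[0,0.1)", "[0.1,0.2)", "[0.2,0.3)", "[0.3,0.4)",
           "[0.4,0.5)", "[0.5,0.6)", "[0.6,0.7)", "[0.7,0.8)"),
  applicants = c(15, 88, 84, 105, 59, 32, 15, 2),
  admits = c(1, 15, 22, 31, 29, 19, 9, 1)
)

# Order from the highest-scoring band down, then accumulate
gains <- bands[rev(seq_len(nrow(bands))), ]
gains$share_of_applicants <- cumsum(gains$applicants) / sum(gains$applicants)
gains$share_of_admits <- cumsum(gains$admits) / sum(gains$admits)

print(gains, row.names = FALSE, digits = 3)
```

In this table, the bands with a score of 0.4 or higher hold 108 of the 400 applicants (27%) but 58 of the 127 admits (about 46%). That is better than picking applicants at random, but far from a clean separation.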
Hopefully this was a fairly gentle introduction to how quickly you can fit a predictive model for your EM team. Conceptually, it doesn’t have to be hard, although interpreting the results can be tricky. Regardless, you can explore what is possible for free with open-sourced statistical software. Hey, you might even have some fun writing code!