Learn more about the visitors that complete Goals on your Website


About the post

Just like in the previous entry, we will be using R to access our school’s Google Analytics data through their API. In this post, I want to highlight how we can figure out when a visitor to our website completes a goal on our site. In my case, I am interested in learning more about how, and when, prospective students (and/or parents) complete our information request form. This could be any goal on your site, but our recruit pool data tend to confirm that self-initiated actions are strong predictors of interest. This is why I tend to emphasize these actions over “soft-interest” conversions like a simple click-through on a random email.

Before we begin, I assume that you are relatively familiar with Google Analytics and what data are available, and that you have goals set up for your website. In my case, we told Google that one of our “goals” was the completion page of the web request form. I won’t talk about why goals are massively awesome things to have set up in GA, but if this concept is new to you, check out this link for an overview.


In the context of R, I am going to make one assumption. If you have been playing around with the rga package, you have probably figured out that it’s really helpful to save the connection object for later sessions. This prevents us from having to authenticate each time we want data. For help on the package, look here. After firing up R, let’s set up our environment and reconnect to the API for our undergraduate account. Below, I am using the where argument to reference the uga.rga file in my current directory. This file contains my saved credentials.

## load the R package we use to access Google Analytics
library(rga)

## not ideal (ssl.verifypeer = FALSE turns off SSL certificate checks), but a
## setting that we need to apply if using Windows
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL",
    "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))

## the token for GA
rga.open(instance = "ga", where = "uga.rga")

Who Converts? New or Returning Visitors?

Now that we are connected to the API, we can start to have some fun. Before going too crazy, let’s answer the basic question of who. Simply, of the people that convert, are they New or Returning visitors? We are going to count the visits by New and Returning visitors from January through November 2013.

start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = "ga:visitorType"
MET = "ga:visits"

## get the data
type = ga$getData("ga:XXXXXXXX", start.date, end.date, walk = TRUE, metrics = MET,
    dimensions = DIM, sort = "", filters = "", segment = "dynamic::ga:goal1Completions>=1",
    start = 1, max = 10000)

One thing real quick. I want to point out how we can define segments “on-the-fly” in the API. If you use the web reporting tool for GA, you can define Advanced Segments. These segments allow you to put your traffic into buckets. While you can access these using the API as well, we can also generate them programmatically by using dynamic::. This feature is pretty helpful in my opinion. Also, we were able to avoid sampled data by using the walk argument above, but it means that we now have to aggregate the data by visitorType.

type_summary = aggregate(visits ~ visitorType, data = type, FUN = "sum")
type_summary$pct = type_summary$visits/sum(type_summary$visits)

        visitorType visits    pct
1       New Visitor   xxxx 0.6085
2 Returning Visitor    xxx 0.3915

After printing out the data, we can see that about 61% of our information request form conversions were from New Visitors between January and November 2013.
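
Since the dynamic segment used above is only a string, it can be built in code when more than one goal is tracked. The following is a minimal sketch, not part of the rga package: goal_segment is a hypothetical helper name, and it assumes the other goals follow the same ga:goalXCompletions naming as goal 1.

```r
## hypothetical helper: build a dynamic segment string for a given goal number
## (assumes goals are exposed as ga:goal1Completions, ga:goal2Completions, ...)
goal_segment = function(goal, min_completions = 1) {
    paste0("dynamic::ga:goal", goal, "Completions>=", min_completions)
}

## segment strings for goals 1 through 3
segments = sapply(1:3, goal_segment)
print(segments)
```

Each string could then be passed as the segment argument of ga$getData in place of the literal used above.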

How long is the conversion cycle?

Now let’s dig a bit deeper and try to answer the question of when they convert. In this case, I am defining when as the number of visits it takes before someone completes the form. These data will be pulled into a data frame called basic.

## use http://ga-dev-tools.appspot.com/explorer/ to explore query strings
start.date = "2013-01-01"
end.date = "2013-11-30"
DIM = "ga:date,ga:visitCount"
MET = "ga:visits"

## get the data
basic = ga$getData("ga:XXXXXXXX", start.date, end.date, walk = TRUE, metrics = MET,
    dimensions = DIM, sort = "", filters = "", segment = "dynamic::ga:goal1Completions>=1",
    start = 1, max = 10000)

First, we should take a peek at what we pulled down to ensure that our dataset looks as expected.

class(basic)

[1] "data.frame"

dim(basic)

[1] 893   3

head(basic)

        date visitCount visits
1 2013-01-01          x      x
2 2013-01-01          x      x
3 2013-01-02          x      x
4 2013-01-02          x      x
5 2013-01-03          x      x
6 2013-01-03         xx      x

At a very high level, how many visits does it take to convert a suspect?

round(mean(basic$visitCount), 2)

[1] 5.25
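
One caveat about this number, offered as a sketch rather than a correction: each row of basic is one date/visitCount combination, so mean(basic$visitCount) averages over rows and does not account for how many visits each row represents. Assuming the column names shown above, a visits-weighted mean and median could be computed as follows. The toy data frame is made up for illustration and is not our real data.

```r
## toy data shaped like `basic` (values are made up for illustration)
toy = data.frame(
    visitCount = c(1, 2, 3, 40),
    visits     = c(60, 20, 10, 1)
)

## unweighted: every row counts once, regardless of its visits
unweighted = mean(toy$visitCount)

## weighted: each visitCount counts once per visit it represents
weighted = weighted.mean(toy$visitCount, w = toy$visits)

## median visitCount after expanding each row by its number of visits
med = median(rep(toy$visitCount, times = toy$visits))

print(c(unweighted = unweighted, weighted = weighted, median = med))
```

On skewed data like this, the unweighted mean sits far above both the weighted mean and the median, so it is worth checking which one a question calls for.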

We see that our info request conversions take between 5 and 6 visits on average. But wait, didn’t we just point out that 61% of our conversions were from New Visitors? Because averages are easily influenced by extreme values, we should visualize the distribution.

hist(basic$visitCount, main = "Distribution of Visits required to Convert",
    xlab = "# Visits", col = "red", breaks = 100)

[Figure: histogram of visits required to convert, before removing outliers]

Now things are starting to make sense. We have some very large values. Let’s standardize the data and remove these outliers.

## copy our data
basic2 = basic

## create a new variable that is the standardized value
## scale() returns a one-column matrix, so convert it to a plain numeric vector
basic2$z = as.numeric(scale(basic2$visitCount))

## keep only scaled values +/- 3 (in reality, only '+' values exist)
basic2 = subset(basic2, z >= -3 & z <= 3)

## re-plot the distribution
hist(basic2$visitCount, main = "Distribution of Visits required to Convert",
    xlab = "# Visits", col = "red", breaks = 100)

[Figure: histogram of visits required to convert, after removing outliers]

After removing very large values, our distribution starts to take shape. The chart confirms that the large majority are new visitors, but we can see that there are a decent number of conversions that happen well after the first visit. To me, these are the lurkers that we should attempt to learn more about in the future. Now, I am curious as to how many visits it takes after the first visit. Below, I am going to group (or bin) the data.

## cut our data into bands.  (0,1] = 1 visit, (1,2] = 2 visits, (7,14] =
## 8-14 visits
basic2 = transform(basic2, bins = cut(visitCount, breaks = c(0:7, 14, 21, 100)))

## put our data into a summary table using the plyr package
library(plyr)
visit_summary = ddply(basic2, .(bins), summarise, visits = sum(visits))
visit_summary = transform(visit_summary, pct_total = round(visits/sum(visits),
    3))

       bins visits pct_total
1     (0,1]   xxxx     0.609
2     (1,2]    xxx     0.187
3     (2,3]    xxx     0.069
4     (3,4]     xx     0.038
5     (4,5]     xx     0.026
6     (5,6]     xx     0.015
7     (6,7]     xx     0.012
8    (7,14]     xx     0.031
9   (14,21]     xx     0.007
10 (21,100]     xx     0.008
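
A running total makes the table above easier to read. The sketch below copies the pct_total column from that table (the raw visit counts are masked, so only the shares are used) and adds a cumulative share. The names cum_summary and cum_pct are mine, chosen for illustration.

```r
## pct_total values copied from the summary table above
pct_total = c(0.609, 0.187, 0.069, 0.038, 0.026, 0.015, 0.012, 0.031, 0.007,
    0.008)
bins = c("(0,1]", "(1,2]", "(2,3]", "(3,4]", "(4,5]", "(5,6]", "(6,7]",
    "(7,14]", "(14,21]", "(21,100]")

## running share of conversions by the time each bin is reached
cum_summary = data.frame(bins = bins, pct_total = pct_total,
    cum_pct = cumsum(pct_total))
print(cum_summary)
```

Because the shares were rounded to three decimals, the last cumulative value lands slightly above 1.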

We can see that the large majority of visitors who go on to request information (roughly 87%, adding up the first three bins) do so within the first 3 visits to our site. I know that this is a stretch, but to me this suggests that we only have about 3 chances to influence lurkers, or those that are window shopping our institution.

Just because I can’t help myself, one last cut of the data. I am going to manually classify our data into New/Returning visitors and explore whether the month affects who converts.

## extract the month from our date variable (which is stored as a date),
## using month() from the lubridate package
library(lubridate)
basic2 = transform(basic2, month = month(date, label = TRUE))

## manually classify visits as New/Returning
basic2 = transform(basic2, visit_type = ifelse(visitCount == 1, "New", "Returning"))

## summarize the data before we plot it
basic2_summ = ddply(basic2, .(month, visit_type), summarise, visits = sum(visits))

## plot the distributions for each month using the ggplot2 plotting library
library(ggplot2)
ggplot(basic2_summ, aes(x = month, y = visits, fill = factor(visit_type))) +
    geom_bar(position = "fill", stat = "identity")

[Figure: stacked bar chart of the share of converting visits by month, New vs. Returning]

Visually, I am not sure there is a strong pattern in our data. However, there might be some evidence to suggest that our conversions increasingly come from New Visits during the fall months; senior year if you are looking at this at the undergraduate level.
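
To put numbers behind a chart like this, the same shares can be tabulated directly. This is a minimal sketch on made-up counts: toy_summ stands in for basic2_summ and only mirrors its column names, so the values printed say nothing about our real traffic.

```r
## toy data shaped like `basic2_summ` (visit counts are made up)
toy_summ = data.frame(
    month      = factor(rep(c("Sep", "Oct", "Nov"), each = 2),
                        levels = c("Sep", "Oct", "Nov")),
    visit_type = rep(c("New", "Returning"), times = 3),
    visits     = c(50, 50, 60, 40, 70, 30)
)

## month-by-type table of visits, then row-wise shares (each month sums to 1)
counts = xtabs(visits ~ month + visit_type, data = toy_summ)
shares = round(prop.table(counts, margin = 1), 3)
print(shares)
```

Reading down the New column shows whether the share of New visits rises month over month, which is the pattern the stacked bars hint at.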


Above, I ran through some quick code to determine the number of visits it takes before a suspect will request more information from our institution. In addition, we were able to figure out if our conversions are coming from New or Returning visitors. Stepping back, you could have used the web reporting interface to answer a few of the questions above, but where is the fun in that? All kidding aside, this is only a fraction of what we could have done. For example, we could have isolated conversions with a visitCount > 1 and then studied how the traffic came to our site. In addition, we could also explore if we have longer conversion cycles based on visitor geography or even evaluate the conversion impact of mobile devices.

Brock Tibert
Lecturer (Information Systems), Analytics and Product Consultant