This week, I’m taking a break from our regular blog content to write about some of the nitty-gritty data science we do within RStudio. Specifically, I’m going to address a question I was asked soon after joining the blogging team:
Which of your blog articles received the most views in the first 15 days they were posted?
I have access to Google Analytics for the blog, and it’s a powerful tool for understanding the flow of visitors through our web site. Nevertheless, the web interface makes it very difficult to compare the 15- and 30-day windows of traffic that we need to evaluate blog posts. Given my need to report on the success of the blog to our stakeholders, this turned into a very tedious manual chore.
If you’ve been reading our blog over the past few months, you know we’ve been writing about how we like to use code-based data science to hide complexity and improve reproducibility. I decided that our tedious process for extracting Google Analytics data posed a great opportunity to practice what we preach and build a custom dashboard in R that will:
While it took a few weeks to get the first dashboard working, we now regularly use these R-based dashboards to measure our blog post effectiveness. However, the process of getting the Google Analytics API working was tricky enough that I thought others might find documentation of the process useful.
To achieve this goal, I’ll address each of these steps listed above in its own blog post over the coming weeks. I’ll provide both code and screen shots along the way as well.
Before we begin, I want to start with a few caveats:
I use the googleAnalyticsR package written by Mark Edmondson in my dashboard, but Google only officially documents Java, Python, and PHP interfaces.
With all that said, the dashboards that use this API provide insights into our blog use that would require a great deal of manual work to reproduce using the GA web interface.
While the finished dashboards use 16 different R packages, the essential ones I use are:
gargle. This package helps us set up our Google Analytics (henceforth abbreviated as GA) authorization and credentials.
googleAnalyticsR. This essential package allows us to download the raw visitor data using the Google Analytics API.
flexdashboard. This package allows us to present the results in a simple Web interface using R Markdown.
reactable. This package allows users of the dashboard to browse, search, reorder, and interact with the data presented.
All of these packages are available for download from CRAN using install.packages().
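Here is a minimal sketch of that installation step, assuming the four package names listed above. The helper name `missing_packages()` is made up for illustration and is not part of any package. It only reports which packages are not yet installed; the actual install call is left commented out so nothing is downloaded by accident.

```r
# Packages named in this post as essential for the dashboard.
required_pkgs <- c("gargle", "googleAnalyticsR", "flexdashboard", "reactable")

# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_pkgs)
print(to_install)

# Uncomment to install whatever is missing from CRAN:
# if (length(to_install) > 0) install.packages(to_install)
```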
You should also create an R project for your dashboard at this time. We will need a place to store our Google Analytics credentials, and having a project ready to store them will keep things organized.
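One way to keep the credentials organized is to give them their own folder inside the project. The sketch below assumes a folder named `secrets` and the two file names used by the code later in this post. The helper `check_secrets()` is hypothetical; it creates the folder if needed and reports which credential files are already in place.

```r
# Assumed layout: a "secrets" folder inside the project, holding the two
# JSON files that the code later in this post reads.
expected_files <- c("oauth-account-key.json", "service-account-key.json")

# Hypothetical helper: create the folder if needed and report which
# credential files are already present.
check_secrets <- function(project_dir = ".", secrets_dir = "secrets") {
  path <- file.path(project_dir, secrets_dir)
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  present <- file.exists(file.path(path, expected_files))
  names(present) <- expected_files
  present
}

# Demonstrated on a temporary directory so nothing in a real project changes.
demo_project <- tempfile("dashboard-")
print(check_secrets(demo_project))
```

Running `check_secrets()` from the project root again after downloading each key is a quick way to confirm the files landed where the later code expects them.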
IMPORTANT UPDATE: Mark Edmondson, the author of googleAnalyticsR, has created a new version of his package that eliminates the need for OAUTH credentials when running on a server. Once that update is available on CRAN, I’ll update this post to document the simpler process of only submitting service account credentials. In the meantime, all the code shown here works; it just does more authentication than is required.
I want to begin by talking about what I found to be one of the most challenging pieces of the entire project: Creating, authorizing, and applying Google Analytics credentials. It’s not hugely difficult, but it does have a lot of steps you must get right before you can get any data.
Here’s a high-level overview of what we’ll need to get the visitor data for our Google Analytics dashboard. To use the Google Analytics API, we need to present two types of credentials that represent:
For any of this to work, the author of the dashboard has to be an authorized user of Google Analytics. You can test this by going to the Google Analytics Home page (analytics.google.com). If you are an authorized user, you’ll see the web dashboard. If you aren’t, you’ll get an error message and will have to ask for access from your Google Analytics Administrator. Keep the contact information for your Google Analytics administrator handy; we’ll need that information again later.
We must perform six steps to download data using the GA API. We need to:
googleAnalyticsR is available on CRAN (see the Important Update notice at the beginning of this section)
I’ll walk you through each step individually in the following sections. For readers not interested in the gory details, you can skip ahead to the conclusion of this piece where I’ll recap what we got out of this process and what the next steps are.
Google has written a comprehensive document on how to do API authentication. Because we want to build a stand-alone dashboard, we’re going to use the service account option, which Google describes this way:
Service accounts are useful for automated, offline, or scheduled access to Google Analytics data for your own account. For example, to build a live dashboard of your own Google Analytics data and share it with other users.
This sounds like exactly what we want, so let’s use that option. It will take a few sub-steps, but they are fairly straightforward. Jenny Bryan has written a nice overview about how this process works as part of her gargle package; the description of service accounts is at the bottom of the page.
To create your service account, you should:
This completes the creation of our service account.
The googleAnalyticsR package requires the key to be in JSON format. Once you’ve selected that format, click Create.
At this point, you have the service key credentials you need to make requests. However, we still have a couple more steps to do before we can use the API.
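The downloaded key is an ordinary JSON file, and the service account email that has to be authorized with GA is stored inside it. Here is a minimal sketch of pulling that email out, assuming the jsonlite package is installed and that the key follows Google's usual service account layout with `type` and `client_email` fields. The helper name and the sample values are made up for illustration.

```r
library(jsonlite)

# Hypothetical helper: read a service account key and return its email.
service_account_email <- function(key_path) {
  key <- fromJSON(key_path)
  if (!identical(key$type, "service_account")) {
    stop("This file does not look like a service account key.")
  }
  key$client_email
}

# Demonstration with a fake key written to a temporary file; a real key
# also contains a private key and other fields not shown here.
fake_key <- tempfile(fileext = ".json")
writeLines(
  '{"type": "service_account", "client_email": "email@example.com"}',
  fake_key
)
print(service_account_email(fake_key))
```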
The fact you have a valid service key is not enough to start making requests. You still need to enable the API from the Google Dashboard. To do this you:
Sadly, the fact you have a valid service key is not enough to start making requests yet. We still need to authorize the user account with GA.
You now need to add the email associated with that key to the list of authenticated project users. To do this, we’re going to return to the Cloud Resource Manager pane at https://console.cloud.google.com/cloud-resource-manager.
Please note that for many Google Analytics configurations, only GA administrators may add new members to a project. If that is the case for you, you will not see the screens shown below. Instead, you must contact your GA administrator and ask them to add your service account email to the project with Viewer rights.
If you do have the appropriate permissions, however, perform the following 3 tasks:
While you may be questioning why you ever started this seemingly endless project at this point, fear not; we’re almost done. All that remains to do is to create and download the OAUTH credentials for your service key.
Now if you’re anything like me, you’re probably thinking “Wait a minute, I created a service key to bypass all this OAUTH complexity. Why do I need an OAUTH project file now?” I’m glad you asked; it’s because Google:
You don’t actually need a project client ID for debugging purposes because the googleAnalyticsR package has a default project associated with it. However, this project ID is shared among all programs using the package, and you may find your API calls denied because too many users are actively using the package. You can avoid this issue entirely by setting your own project client ID as shown below.
In my opinion, acquiring an OAuth 2.0 client ID for a service account is poorly documented on the Google API dashboard, in the Google documentation, and in the googleAnalyticsR package. I found this process difficult to reproduce for our test project even though I’d been through it for my own dashboards. With that said, it’s fairly straightforward if you start in the proper place as shown below:
While this multi-step process may have seemed like something out of Lord of the Rings, you should now have all the credentials and permissions you need to make API requests to Google Analytics. So let’s write code to fetch one day’s Google Analytics data for the rstudio.com site.
```r
library(googleAnalyticsR)
library(dplyr)
library(ggplot2)
library(lubridate)
library(reactable)
library(stringr)

## First, authenticate with our client OAUTH credentials from step 5 of the blog post.
googleAuthR::gar_set_client(json = "secrets/oauth-account-key.json")

## Now, provide the service account email and private key
ga_auth(email = "email@example.com",
        json_file = "secrets/service-account-key.json")

## At this point, we should be properly authenticated and ready to go. We can test this
## by getting a list of all the accounts that this test project has access to. Typically,
## this will be only one if you've created your own service key. If it isn't your only
## account, select the appropriate viewId from your list of accounts.
my_accounts <- ga_account_list()
my_id <- my_accounts$viewId  ## Modify this if you have more than one account

## Let's look at all the visitors to our site. This segment is one of several provided
## by Google Analytics by default.
all_users <- segment_ga4("AllTraffic", segment_id = "gaid::-1")

## Let's look at just one day.
ga_start_date <- today()
ga_end_date <- today()

## Make the request to GA
data_fetch <- google_analytics(my_id,
                               segments = all_users,
                               date_range = c(ga_start_date, ga_end_date),
                               metrics = c("pageviews"),
                               dimensions = c("landingPagePath"),
                               anti_sample = TRUE)

## Let's just create a table of the most viewed posts
most_viewed_posts <- data_fetch %>%
  mutate(Path = str_trunc(landingPagePath, width = 40)) %>%
  count(Path, wt = pageviews, sort = TRUE)

head(most_viewed_posts, n = 5)
```
Assuming you have the appropriate permissions, client ID, and service key, you should get a result that looks similar to this one I pulled from the rstudio.com web site.
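The request above covers a single day, while the question that motivated this post asks about a post's first 15 days. As a rough sketch of how that date range might be computed, the hypothetical helper below takes a publication date and returns the start and end of the window. It assumes the publication date itself counts as day 1, which may not match how your own reporting defines the window, and the example date is made up.

```r
# Hypothetical helper: the first `n_days` days of a post's life, counting
# the publication date itself as day 1.
post_window <- function(post_date, n_days = 15) {
  start <- as.Date(post_date)
  c(start, start + (n_days - 1))
}

# Example with a made-up publication date.
window_15 <- post_window("2021-01-04")
print(window_15)
```

The two dates returned could then stand in for `ga_start_date` and `ga_end_date` in the `google_analytics()` call, and `n_days = 30` would give the 30-day window.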
While many of the details of the Google Analytics API may seem elaborate and arcane, I want to emphasize some of the main ideas behind this process:
This post has focused entirely on getting authorized to download Google Analytics data. The next post will focus on how to create a flexdashboard for stakeholders to interact with the data. The last post in this series will show how to create windowed views of this data and publish a self-contained dashboard that can be used on demand from RStudio Connect.