Getting data with APIs

library(tidyverse)
library(httr)
library(xml2)
library(jsonlite)

We’ll use the httr package to help us interact with APIs. When it comes down to it, an API is basically a fancy URL. When I start playing with one, I will try building the URL for a query myself by hand.

For example, the Board Game Geek API. It took me a few false starts to realize what the documentation meant, but finally I was able to create a URL like this:

https://boardgamegeek.com/xmlapi2/thing?id=13&type=boardgame&comments=1

I am looking for ID number 13, which is the original version of Catan. I specified the type of thing I wanted was a boardgame, and set comments to be 1, to mean I wanted to see the reviews.

If I wanted to do this the hard way, I could manually change the parameters in that URL, right-click and download the data, and then import into R. But it’s easier to do it all from within R (or Python, or whatever you are using).

Here is that same query using the GET() function.

res <- GET("https://boardgamegeek.com/xmlapi2/thing?",
  query = list(type = "boardgame", id = 13, comments = 1)
)

Now I have the whole page in my R environment. Unfortunately, it is in XML format, which is not tidy. We can read it with read_xml() from the xml2 package.

res_xml <- read_xml(res)

This still isn’t tidy, but we can look at it with the list viewer. It appears that the reviews are in the tags.

comments <- res_xml %>%
  xml_find_all(".//comment") %>%
  xml_attr("value")

Okay, so we now have 100 comments! But probably, Catan has more than 100 reviews. How could we get the next 100?

res2 <- GET("https://boardgamegeek.com/xmlapi2/thing?", query = list(type = "boardgame", id = 13, comments = 1, page = 2))

page2 <- read_xml(res2) %>%
  xml_find_all(".//comment") %>%
  xml_attr("value")

Those look like different reviews!

Looping

There are very few places in R where a for() loop is appropriate. This might be one of them??

comments <- NULL
for (i in 1:30) {
  res <- GET("https://boardgamegeek.com/xmlapi2/thing?", query = list(type = "boardgame", id = 13, comments = 1, page = i))
  tmp <- read_xml(res) %>%
    xml_find_all(".//comment") %>%
    xml_attr("value")
  comments <- c(comments, tmp)
}

reviewdata <- tibble(text = comments, reviewID = 1:6000)

Maps

Of course, there are other (better?) approaches. The tidyverse way would be with a map. For a map, we need a function to apply. I made a kind of stupid one:

getreviews <- function(i) {
  res <- GET("https://boardgamegeek.com/xmlapi2/thing?",
    query = list(
      type = "boardgame",
      id = 13,
      comments = 1,
      page = i
    )
  )
  read_xml(res) %>%
    xml_find_all(".//comment") %>%
    xml_attr("value")
}

Then, we can map over the pages of results we want onto that function:

reviewdata2 <- tibble(page = 1:30) %>%
  mutate(comments = map(page, getreviews))

That’s way faster!

The problem is that now each page of 100 reviews is in one row of the dataset. So we need to unnest().

reviewdata2 <- reviewdata %>%
  unnest(cols = text)

Another API

Okay, let’s try another API. This one is the Recreation Information Database (RIDB). They have a nice API documentation page, which has a “try it out” button. If you try, for example, getting information about Mount Rushmore (RecAreaID 2837), you will probably get a 401 response.

That’s because this API requires the use of an API key. It’s not that hard to get one, you just need to sign up for an account and then “enable developer access” in your profile. Then you’ll get an API key, which you can use to authorize the documentation page and make sure you can get a 200 response.

It will show you the request URL, which you could try going to in your browser. https://ridb.recreation.gov/api/v1/recareas/2837 I get a 401 response because I’m not authenticated.

Okay, let’s see if we can programatically authenticate. Here’s a hard-coded version:

res <- GET(
  "https://ridb.recreation.gov/api/v1/recareas/2837",
  add_headers(
    accept = "application/json",
    apikey = "yourkeyhere"
  )
)

Or a little more programatic,

res <- GET(
  modify_url("https://ridb.recreation.gov/", path = c("api", "v1", "recareas", "2837")),
  add_headers(
    accept = "application/json",
    apikey = "yourkeyhere"
  )
)

Getting this code took me a ton of trial and error, the documentation of httr and the documentation of APIs usually give you 80% of what you need but the last 20% is guesswork.

I can tell it worked because of the status_code 200, but I can’t see my content. Let’s try to grab it,

rushmore <- fromJSON(rawToChar(res$content))
names(rushmore)