library(tidyverse)
library(httr)
library(xml2)
library(jsonlite)
Getting data with APIs
We’ll use the httr package to help us interact with APIs. When it comes down to it, an API is basically a fancy URL. When I start playing with one, I will try building the URL for a query myself by hand.
For example, take the Board Game Geek API. It took me a few false starts to work out what the documentation meant, but eventually I was able to create a URL like this:
https://boardgamegeek.com/xmlapi2/thing?id=13&type=boardgame&comments=1
I am looking for ID number 13, which is the original version of Catan. I specified that the type of thing I wanted was a boardgame, and set comments to 1 to indicate that I wanted to see the reviews.
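To see how the pieces of that URL fit together, here is a small sketch using parse_url() and modify_url() from httr. It never contacts the API; it only takes the string apart and rebuilds it.

```r
library(httr)

# Take the hand-built URL apart to see its pieces
url <- "https://boardgamegeek.com/xmlapi2/thing?id=13&type=boardgame&comments=1"
parts <- parse_url(url)

parts$hostname  # the server
parts$path      # the endpoint on that server
parts$query     # named list of parameters: id, type, comments

# Build the same query from a base URL plus a named list of parameters
rebuilt <- modify_url(
  "https://boardgamegeek.com/xmlapi2/thing",
  query = list(id = 13, type = "boardgame", comments = 1)
)
rebuilt
```

Keeping the parameters in a named list is what makes it easy to change one of them later without editing a string by hand.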
If I wanted to do this the hard way, I could manually change the parameters in that URL, right-click and download the data, and then import into R. But it’s easier to do it all from within R (or Python, or whatever you are using).
Here is that same query using the GET() function.
res <- GET("https://boardgamegeek.com/xmlapi2/thing?",
  query = list(type = "boardgame", id = 13, comments = 1)
)
Now I have the whole page in my R environment. Unfortunately, it is in XML format, which is not tidy. We can read it with read_xml() from the xml2 package.
res_xml <- read_xml(res)
This still isn’t tidy, but we can look at it with the list viewer. It appears that the reviews are in the value attribute of the comment nodes:
comments <- res_xml %>%
  xml_find_all(".//comment") %>%
  xml_attr("value")
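If the XPath step feels opaque, it can help to try it on a tiny document first. This is a minimal sketch on a made-up XML string shaped roughly like the response; the username and rating attributes and the surrounding element names are assumptions for illustration, not something confirmed above.

```r
library(xml2)

# A tiny made-up document; only <comment value="..."> is taken from the text above.
# The other element and attribute names are assumptions.
toy <- read_xml('<items>
  <item type="boardgame" id="13">
    <comments page="1">
      <comment username="player_a" rating="8" value="Great gateway game."/>
      <comment username="player_b" rating="6" value="Too much luck for me."/>
    </comments>
  </item>
</items>')

nodes <- xml_find_all(toy, ".//comment")  # every <comment> node, at any depth
values <- xml_attr(nodes, "value")        # the review text lives in an attribute
values
```

The ".//comment" pattern means "any comment node below this point", so it works without spelling out the full path through the document.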
Okay, so we now have 100 comments! But Catan probably has more than 100 reviews. How could we get the next 100?
res2 <- GET("https://boardgamegeek.com/xmlapi2/thing?", query = list(type = "boardgame", id = 13, comments = 1, page = 2))
page2 <- read_xml(res2) %>%
  xml_find_all(".//comment") %>%
  xml_attr("value")
Those look like different reviews!
Looping
There are very few places in R where a for() loop is appropriate. This might be one of them??
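comments <- NULL
for (i in 1:30) {
  # request page i of the comments
  res <- GET("https://boardgamegeek.com/xmlapi2/thing?", query = list(type = "boardgame", id = 13, comments = 1, page = i))
  tmp <- read_xml(res) %>%
    xml_find_all(".//comment") %>%
    xml_attr("value")
  # append this page's comments to the running vector
  comments <- c(comments, tmp)
}
# number the reviews; seq_along() avoids hard-coding how many came back
reviewdata <- tibble(text = comments, reviewID = seq_along(comments))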
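Growing a vector with c() inside a loop works fine at this size, but a common alternative is to fill a preallocated list and combine it once at the end. Here is a sketch of that pattern. To keep it runnable offline, fake_page() is a hypothetical stand-in for the GET() and read_xml() steps, and it assumes every page holds exactly 100 comments.

```r
# Hypothetical stand-in for one API call: returns 100 fake comments for page i.
# In the real loop this would be the GET() + read_xml() steps.
fake_page <- function(i) {
  paste0("comment ", (i - 1) * 100 + 1:100)
}

n_pages <- 30
pages <- vector("list", n_pages)  # one empty slot per page

for (i in seq_len(n_pages)) {
  pages[[i]] <- fake_page(i)
  # Sys.sleep(1)  # with a real API, a short pause between requests is polite
}

comments_all <- unlist(pages)  # flatten the list of pages into one vector
length(comments_all)
```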
Maps
Of course, there are other (better?) approaches. The tidyverse way would be with a map. For a map, we need a function to apply. I made a kind of stupid one:
getreviews <- function(i) {
  res <- GET("https://boardgamegeek.com/xmlapi2/thing?",
    query = list(
      type = "boardgame",
      id = 13,
      comments = 1,
      page = i
    )
  )
  read_xml(res) %>%
    xml_find_all(".//comment") %>%
    xml_attr("value")
}
Then we can map that function over the pages of results we want:
reviewdata2 <- tibble(page = 1:30) %>%
  mutate(comments = map(page, getreviews))
That’s way faster!
The problem is that now each page of 100 reviews is in one row of the dataset. So we need to unnest().
# expand the list-column so each comment gets its own row
reviewdata2 <- reviewdata2 %>%
  unnest(cols = comments)
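To see what the map and unnest steps do to the shape of the data without hitting the API, here is a toy version. fake_reviews() is a hypothetical stand-in for getreviews() that returns two made-up comments per page.

```r
library(tidyverse)

# Hypothetical stand-in for getreviews(): two fake comments per page
fake_reviews <- function(i) {
  paste0("page ", i, ", comment ", 1:2)
}

nested <- tibble(page = 1:3) %>%
  mutate(comments = map(page, fake_reviews))
nrow(nested)  # one row per page; comments is a list-column

flat <- nested %>%
  unnest(cols = comments)
nrow(flat)    # one row per comment; page is repeated alongside each one
```

A nice side effect of this approach is that the page number stays attached to every comment after unnesting.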
Another API
Okay, let’s try another API. This one is the Recreation Information Database (RIDB). They have a nice API documentation page, which has a “try it out” button. If you try, for example, getting information about Mount Rushmore (RecAreaID 2837), you will probably get a 401 response.
That’s because this API requires an API key. It’s not that hard to get one: you just need to sign up for an account and then “enable developer access” in your profile. Then you’ll get an API key, which you can use to authorize the documentation page and make sure you can get a 200 response.
It will show you the request URL, which you could try going to in your browser: https://ridb.recreation.gov/api/v1/recareas/2837. When I do that, I get a 401 response because I’m not authenticated.
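If the numbers 401 and 200 don’t mean much to you yet, httr can translate them. This is a small sketch using http_status(), which accepts a bare status code, so it runs without making any request.

```r
library(httr)

# http_status() turns a numeric status code into a description
unauthorized <- http_status(401)
ok <- http_status(200)

unauthorized$message
ok$message

# The category is handy if you want to branch on success vs. failure
unauthorized$category
ok$category
```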
Okay, let’s see if we can programmatically authenticate. Here’s a hard-coded version:
res <- GET(
  "https://ridb.recreation.gov/api/v1/recareas/2837",
  add_headers(
    accept = "application/json",
    apikey = "yourkeyhere"
  )
)
Or, a little more programmatically:
res <- GET(
  modify_url("https://ridb.recreation.gov/", path = c("api", "v1", "recareas", "2837")),
  add_headers(
    accept = "application/json",
    apikey = "yourkeyhere"
  )
)
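One thing I would change before sharing a script like this is the hard-coded key. A common pattern is to read it from an environment variable instead. In this sketch, RIDB_API_KEY is a hypothetical variable name I made up; you would set it yourself, for example in your .Renviron file. The actual request is commented out because it needs a real key.

```r
library(httr)

# RIDB_API_KEY is a hypothetical variable name, not something the API defines.
# Sys.getenv() returns "" when the variable is not set.
key <- Sys.getenv("RIDB_API_KEY")
if (key == "") {
  message("RIDB_API_KEY is not set; a request without a key would likely be rejected.")
}

# Build the endpoint from its path pieces, as in the call above
url <- modify_url(
  "https://ridb.recreation.gov/",
  path = c("api", "v1", "recareas", "2837")
)
url

# Not run here, since it needs a real key:
# res <- GET(url, add_headers(accept = "application/json", apikey = key))
# status_code(res)
```

Building the URL from a vector of path pieces makes it easy to swap in a different RecAreaID later.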
Getting this code working took me a ton of trial and error. The documentation for httr and the documentation for APIs usually give you 80% of what you need, but the last 20% is guesswork.
I can tell it worked because of the status_code 200, but I can’t see my content. Let’s try to grab it:
rushmore <- fromJSON(rawToChar(res$content))
names(rushmore)
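The reason for the rawToChar() step is that the response body arrives as raw bytes rather than text. Here is a sketch of the same two-step conversion on a made-up JSON string, so it runs without a key. The field names and values are placeholders I invented, not the real RIDB response.

```r
library(jsonlite)

# Made-up stand-in for res$content: a JSON string stored as raw bytes.
# Field names and values are placeholders, not the real RIDB response.
fake_content <- charToRaw(
  '{"RecAreaID": "2837", "RecAreaName": "Example Area", "Keywords": ["a", "b"]}'
)

class(fake_content)             # "raw": bytes, not text
txt <- rawToChar(fake_content)  # bytes -> character string
parsed <- fromJSON(txt)         # character string -> named R list

names(parsed)
parsed$RecAreaName
```

With a real response, content(res, as = "text") from httr is another way to get the body as a string before handing it to fromJSON().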