Freeing data from PDFs

There are two ways data can be stored in PDFs:

As text (you can tell it’s this way if you can highlight numbers/phrases with your mouse). If you have this kind of PDF, you can use Tabula to free the data.
As images (the whole page or center of the page highlights). If you have this kind of PDF, you need to use Optical Character Recognition (OCR), which we’re not planning to cover in this class. (Let me know if you run into a place where you want to do OCR and we can talk.)

I would like your help in freeing data about blood lead levels in Michigan from PDFs. For example: this pdf. We’ll organize our data freeing on this spreadsheet. Go to the spreadsheet and pick one table (or two if I’ve labeled them, for example, 3a and 3b) to free.

Getting data using Tabula

Use Tabula to extract the data for your table into a CSV file. Some of the PDFs have clear table numbers; others don’t (yay, real data!), so feel free to ask questions on Slack if things are unclear.

When you launch Tabula, it opens a browser window pointed at a local server, so you’ll see what looks like a webpage with a strange URL. That’s the app. Browse to the appropriate PDF, and make your selections.

It is likely that you will need to exclude the header rows, because they are untidily crossing multiple columns. So, just select the relevant data for your extract. Do a few quick eye checks to make sure the data looks right, matching between the extract and the PDF.

Getting ready to use git and GitHub

On Tuesday, we’ll talk about the process of putting your CSV files onto GitHub. GitHub is a place to share files and to track versions of things like code. It works closely with git, a command-line tool. I know some of you have used git and GitHub before, but if you haven’t, never fear! We’ll go through it in class.

You can use git and GitHub separately from R (as we’re going to do on Tuesday), but since it’s a common workflow to use them together, there is an awesome website by Jenny Bryan called Happy Git with R.

In preparation for Tuesday, please:

go to GitHub and sign up for an account. Jenny Bryan has some advice on GitHub usernames to help you think about that.
install git, if you don’t have it already. Again, Jenny has great instructions.
optional (but recommended): install a git client. Jenny has an explanation of git clients. She recommends a few choices, but the one I use and will demo on Tuesday is SourceTree so if you install that one, it will look the same when I do demos.
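
If you want to confirm that the install worked, one quick check (a minimal sketch, assuming you have a terminal or shell open) is to ask git for its version:

```shell
# Prints a line beginning with "git version" when git is installed and on your PATH.
git --version
```

If you see an error like "command not found" instead, git is probably not installed yet, or your shell cannot find it.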

Adding your files to GitHub

Git and GitHub can be confusing (even for “experts” like myself). For this assignment, I think these are the main steps you will need to take.

Fork my repo so you have a copy on your GitHub account. This means you have to click a button on the GitHub website.
Use the command-line tools or your git client (e.g., SourceTree) to clone the repo locally. Essentially, you’re downloading a version of the folder and all its contents.
After you’ve cloned the repo, I’m going to make some changes to it, and you’ll want to get those changes on your computer.
You will use command-line tools or your git client (e.g., SourceTree) to “fetch” those changes. This will probably require you to set a new “remote” (URL from GitHub). You can name your remote anything you want, like “AmeliasBLL” or “MainRepo” but by convention, people usually name this type of remote “upstream.”
Using your file explorer or Finder, move your file to the BLL folder.
Use command-line tools or your git client to stage the changes for commit.
Use command-line tools or your git client to commit the changes, including a “commit message” (a short description of what you did, like “added BLL_under6_zip_2016.csv”).
Use command-line tools or your git client to “push” your local changes to your GitHub repo. Go on the GitHub website to confirm the changes went there.
On the website for your fork of my repo, click the “submit pull request” button to submit a pull request to my repo.
Then, I can merge the changes together!
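
If you go the command-line route, the steps above might look roughly like the sketch below. So that it runs without a GitHub account, it fakes GitHub with two local "bare" repos: upstream.git stands in for my repo and fork.git stands in for your fork. Those names, the instructor and my-copy folders, and the placeholder CSV contents are all made up for illustration; in real life you would use the URLs from GitHub instead of local paths. It also adds a git merge after the fetch, because fetch only downloads the changes and does not apply them to your files.

```shell
set -eu

# Scratch folder so nothing here touches your real files.
work=$(mktemp -d)
cd "$work"

# --- Pretend GitHub (hypothetical stand-ins, not part of the real workflow) ---
git init -q instructor
cd instructor
git config user.name "Instructor"
git config user.email "instructor@example.com"
mkdir BLL
echo "Freed blood lead level tables go here" > BLL/README.txt
git add BLL
git commit -q -m "set up BLL folder"
branch=$(git rev-parse --abbrev-ref HEAD)    # main or master, depending on your git setup
cd "$work"
git clone -q --bare instructor upstream.git  # stands in for my repo on GitHub
git clone -q --bare upstream.git fork.git    # stands in for clicking "Fork"

# --- Clone your fork locally ---
git clone -q fork.git my-copy

# --- Meanwhile, I change something in my repo ---
cd "$work/instructor"
echo "See the data dictionary for variable names" >> BLL/README.txt
git commit -q -a -m "update README"
git push -q "$work/upstream.git" HEAD

# --- Set the "upstream" remote, fetch my changes, and merge them in ---
cd "$work/my-copy"
git config user.name "Student"
git config user.email "student@example.com"
git remote add upstream "$work/upstream.git"
git fetch -q upstream
git merge -q "upstream/$branch"

# --- Put your file in the BLL folder (placeholder header row only) ---
echo "county,number_tested" > BLL/BLL_under6_zip_2016.csv

# --- Stage, commit, and push to your fork ---
git add BLL/BLL_under6_zip_2016.csv
git commit -q -m "added BLL_under6_zip_2016.csv"
git push -q origin "$branch"

# Show the most recent commits so you can see both changes arrived.
git log --oneline -3
```

The pull request itself still happens on the GitHub website, so it is not part of this sketch.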

We’ll talk through all of this together.

Standardizing variable- and file-names

As a class, we are going to attempt to standardize the file names and variable names for our CSVs, so the data will be useful for other people who find it on GitHub. Think back to Data Organization in Spreadsheets. What were some of Broman and Woo’s suggestions?

We’re making a Data Dictionary to document our decisions, in this spreadsheet, which we can share on GitHub.

Cleaning up data

I would like you to do a few basic data cleaning tasks to neaten up your csv file. These are:

name the file to be consistent with this spreadsheet
rename variables to be consistent with our data dictionary
replace ad-hoc NA values with true NA values. This may vary by table, but a few examples I have seen:
- “**” should become NA
- “—” should become NA
- “-” should become NA
- etc
remove commas from columns that use them to format large numbers

You can use OpenRefine and/or R to do this data cleaning. I think some tasks are easier in one than the other. A few code snippets, in case they are useful:

OpenRefine

To rename variables in OpenRefine, click on the arrow in the column header, select Edit Column and then Rename this Column.

To replace ad-hoc NAs, find a cell with a weird value (like “**”), hover over it and select Edit. Then, type NA in the box and select Apply to All Identical Cells.

To remove commas from columns in OpenRefine, click the down arrow on the column header, select Edit Cells and then Transform. In the box that pops up, replace value with the following:

replaceChars(value, ",", "")

R

When you read data into R, you can specify things you want to be read as NA values, like this:

library(readr)
library(dplyr)
library(tidyr)
BLL_under6_county_2014 <- read_csv("~/SDS236/BLL_under6_county_2014.csv", col_names = FALSE, na = c("", "NA", "---", "**"))

Then, you don’t have to edit each column separately.

To rename variables in R, do something like this, where X1 and X2 are the old variable names:

BLL_under6_county_2014 <- BLL_under6_county_2014 %>%
  rename(county = X1, number_tested = X2)

Likely, if you used read_csv() to read in your data, there won’t be as many commas in numeric variables. But, some might slip through. Here are two approaches to fix that:

BLL_under6_county_2014 <- BLL_under6_county_2014 %>%
  mutate(X4 = parse_number(X4))

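library(stringr)  # str_replace_all() comes from stringr, which is not loaded above
BLL_under6_county_2014 <- BLL_under6_county_2014 %>%
  mutate(X4 = as.numeric(str_replace_all(X4, ",", "")))  # drop commas, then convert text to numbers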

Odds and ends

Since SourceTree didn’t work at first for some folks, they needed to initialize at the shell. (See Jenny’s explanation of the shell).

On my Mac, when I need to “walk” through my file structure to get to a particular directory, I use two shell commands: cd (for change directory) and ls (for list). I’ll often step one directory at a time: I’ll ls to see everything in my current directory, and then cd into one of the directories there. For example, from my home directory I might cd Documents to get into my documents folder, then ls to see all the files and directories there, then move again, say with cd Dissertation.
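
Here is that walk written out as commands, in case it helps to see it in one place. This is a minimal sketch: it first builds a small practice folder tree in a temporary location (the practice variable and notes.txt are made up for illustration), so you can run it without touching your real files.

```shell
set -eu

# Build a tiny practice tree: Documents/ containing Dissertation/ and notes.txt.
practice=$(mktemp -d)
mkdir -p "$practice/Documents/Dissertation"
touch "$practice/Documents/notes.txt"

cd "$practice"      # pretend this is your home directory
ls                  # lists Documents
cd Documents        # step into Documents
ls                  # lists Dissertation and notes.txt
cd Dissertation     # step one level further
pwd                 # prints the full path of where you ended up
```

On your own machine you would skip the setup lines and just alternate ls and cd until you reach the folder you want.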

To get things going with git, once you’re in the directory where you want your copy of the repo to live, type git clone [this is where you paste that url from GitHub]. This will start the process of downloading the files.

Sean Kross has written much more about navigating the command line, so you can read that if you want to get better. Also, I’m a Mac person, which means my shell is a Unix shell. If you use Windows, your shell commands might be different. This page purports to show equivalents, and says that while cd is the same on both systems, the ls equivalent on Windows is dir.