There are two ways data can be stored in pdfs:
I would like your help in freeing data about blood lead levels in Michigan from PDFs. For example: this pdf. We’ll organize our data freeing on this spreadsheet. Go to the spreadsheet and pick one table (or two if I’ve labeled them, for example, 3a and 3b) to free.
Use Tabula to extract the data for your table into a csv file. Some of the PDFs have clear table numbers, others don’t (yay, real data!) so feel free to ask questions on Slack if things are unclear.
When you launch Tabula, it will open a browser window on a local server, so you’ll see what looks like a webpage with a strange URL. That’s the app. Browse to the appropriate pdf, and make your selections.
It is likely that you will need to exclude the header rows, because they are untidily crossing multiple columns. So, just select the relevant data for your extract. Do a few quick eye checks to make sure the data looks right, matching between the extract and the PDF.
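If you would rather back up the eye check with a couple of numbers, here is a minimal sketch in R. The file contents below are made up and written to a temporary path so the snippet runs on its own; in practice you would point read_csv() at your own Tabula export and compare the counts against the PDF table.

```r
library(readr)

# Stand-in for a Tabula export: a tiny made-up file written to a temporary path
extract_path <- tempfile(fileext = ".csv")
write_lines(
  c("Alcona,120,3", "Alger,95,1", "Allegan,\"1,402\",12"),
  extract_path
)

# No header row, because the header was excluded from the Tabula selection
extract <- read_csv(extract_path, col_names = FALSE, show_col_types = FALSE)

# Things to compare against the PDF: row count, column count,
# and the first and last rows of the table
nrow(extract)
ncol(extract)
head(extract, 1)
tail(extract, 1)
```

If the row or column count is off, the selection in Tabula probably clipped a row or merged two columns.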
On Tuesday, we’ll talk about the process of putting the CSV files onto GitHub. GitHub is a place to share files, and to track versions of things like code. It works closely with git, a command-line tool. I know some of you have used git and GitHub before, but if you haven’t, never fear! We’ll go through it in class.
You can use git and GitHub separately from R (as we’re going to do on Tuesday), but since it’s a common workflow to use them together, there is an awesome website by Jenny Bryan called happy git with R.
In preparation for Tuesday, please:
go to GitHub and sign up for an account. Jenny Bryan has some advice on GitHub usernames to help you think about that.
install git, if you don’t have it already. Again, Jenny has great instructions.
optional (but recommended): install a git client. Jenny has an explanation of git clients. She recommends a few choices, but the one I use and will demo on Tuesday is SourceTree so if you install that one, it will look the same when I do demos.
Git and GitHub can be confusing (even for “experts” like myself). For this assignment, I think these are the main steps you will need to take.
We’ll talk through all of this together.
As a class, we are going to attempt to standardize the file names and variable names for our CSVs, so the data will be useful for other people who find it on GitHub. Think back to Data Organization in Spreadsheets. What were some of Broman and Woo’s suggestions?
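As one possible starting point, here is a sketch of a helper that turns messy column labels into lower-case names with underscores and no spaces. The function name to_snake() and the example labels are hypothetical, and the convention it applies is only an assumption; whatever we agree on in the Data Dictionary takes priority.

```r
library(stringr)

# Hypothetical helper: lower-case a label, turn every run of characters that
# is not a letter or digit into "_", then trim underscores from both ends
to_snake <- function(x) {
  x <- str_to_lower(str_trim(x))
  x <- str_replace_all(x, "[^a-z0-9]+", "_")
  str_replace_all(x, "^_+|_+$", "")
}

# Made-up labels of the kind that show up in PDF table headers
clean_names <- to_snake(c("Number Tested", "County", "% Elevated (BLL >= 5)"))
clean_names
```

The cleaned names can then be passed to rename() or assigned with names() once the class has settled on the final wording.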
We’re making a Data Dictionary to document our decisions, in this spreadsheet, which we can share on GitHub.
I would like you to do a few basic data cleaning tasks to neaten up your csv file. These are:
You can use OpenRefine and/or R to do this data cleaning. I think some tasks are easier in one than the other. A few code snippets, in case they are useful:
To rename variables in OpenRefine, click on the arrow in the column header, select Edit Column and then Rename this Column.
To replace ad-hoc NAs, find a cell with a weird value (like “**“), hover over it and select Edit. Then, type NA in the box and select Apply to All Identical Cells.
To remove commas from columns in OpenRefine, click the down arrow on the column header, select Edit Cells and then Transform. In the box that pops up, replace value with the following:
replaceChars(value, ",", "")
When you read data into R, you can specify things you want to be read as NA values, like this:
library(readr)
library(dplyr)
library(tidyr)
BLL_under6_county_2014 <- read_csv("~/SDS236/BLL_under6_county_2014.csv", col_names = FALSE, na = c("", "NA", "---", "**"))
Then, you don’t have to edit each column separately.
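To see what the na argument does, here is a minimal sketch that uses a small made-up table passed inline instead of a real file. The county names and counts are invented for illustration, and the I() wrapper for literal data and the show_col_types argument both assume readr 2.0 or later.

```r
library(readr)

# Hypothetical extract: "**" and "---" stand in for missing values
toy_csv <- I("Alcona,120\nAlger,**\nAllegan,---\n")

toy <- read_csv(
  toy_csv,
  col_names = FALSE,                 # no header row, so columns become X1, X2
  na = c("", "NA", "---", "**"),     # every one of these strings is read as NA
  show_col_types = FALSE
)

# How many values in the second column were read as NA
n_missing <- sum(is.na(toy$X2))
n_missing

# With the ad-hoc codes treated as NA, the column can be guessed as numeric
class(toy$X2)
```

Without the na argument, the "**" and "---" cells would be kept as text and the whole column would be read as character.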
To rename variables in R, do something like this, where X1 and X2 are the old variable names:
BLL_under6_county_2014 <- BLL_under6_county_2014 %>%
rename(county=X1, number_tested = X2)
Likely, if you used read_csv() to read in your data, there won’t be as many commas in numeric variables. But, some might slip through. Here are two approaches to fix that:
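# Approach 1: parse_number() (readr) drops the commas and returns a numeric column
BLL_under6_county_2014 <- BLL_under6_county_2014 %>%
  mutate(X4 = parse_number(X4))
# Approach 2: str_replace_all() (stringr) strips the commas but leaves the column as character
library(stringr)
BLL_under6_county_2014 <- BLL_under6_county_2014 %>%
  mutate(X4 = str_replace_all(X4, ",", ""))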
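The two approaches do not give quite the same result, which this small sketch tries to show on a made-up column. The values are invented; the point is that parse_number() returns numbers directly, while str_replace_all() only edits the text, so it needs an explicit as.numeric() afterwards if you want to do arithmetic.

```r
library(dplyr)
library(readr)
library(stringr)

# Hypothetical column of counts that kept their thousands separators
toy <- tibble(X4 = c("1,234", "56", "7,890"))

toy_clean <- toy %>%
  mutate(
    # parse_number() drops the commas and returns a numeric vector
    via_parse = parse_number(X4),
    # str_replace_all() returns text, so convert it explicitly
    via_replace = as.numeric(str_replace_all(X4, ",", ""))
  )

toy_clean
```

Either route works for a column like this one; parse_number() is shorter, while the stringr route makes each step visible.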
Since SourceTree didn’t work at first for some folks, they needed to initialize at the shell. (See Jenny’s explanation of the shell).
On my Mac, when I need to “walk” through my file structure to get to a particular directory, I use two shell commands: cd (for change directory) and ls (for list). I’ll often step one directory at a time: I’ll ls to see everything in my current directory, and then cd into one of the directories there. For example, from my home directory I might cd Documents to get into my documents folder, then ls to see all the files and directories there, then move again, say with cd Dissertation.
To get things going with git, once you’re in the directory where you want the clone to live, type git clone [this is where you paste that url from GitHub]. This will start the process of downloading the files.
Sean Kross has written much more about navigating the command line, so you can read that if you want more practice. Also, I’m a Mac person, which means my shell is a Unix shell. If you use Windows, your shell commands might be different. This page purports to show equivalencies, and says that while cd is the same on both systems, the ls equivalent on Windows is dir.