We are going to be looking at plots and summary statistics as part of the technique of Exploratory Data Analysis (EDA). EDA was named by statistician John Tukey in the 1960s, and continues to be exceedingly useful today. Essentially, Tukey was advocating getting familiar with data before beginning modeling, so you don’t run into errors that are easy to catch visually but hard to catch numerically.
To begin, we need to load packages to use in our session. We do this with the library() command.
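Here is a minimal sketch of the loading step. The exact package list is an assumption based on the tools this lab mentions (car for the data, ggplot2 for plots, mosaic for summary statistics, skimr for skimming); your own session may need a different set.

```r
# Load the packages used in this lab (assumed list; adjust as needed).
# Each package must already be installed, e.g. install.packages("car").
library(car)      # gives access to the Salaries data
library(ggplot2)  # plotting
library(mosaic)   # formula-based summary statistics
library(skimr)    # skim() overview of a data frame
```

If R reports that a package is missing, install it once with install.packages() and then run the library() call again.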
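Then, we need some data. For this lab I chose the Salaries dataset, which is available through the car package (in recent versions it lives in car’s companion package, carData, which is loaded along with car). This dataset is easy to load because it comes from within a package. Plus, it has to do with college professors!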
Since the data is already in R, we can access it with the data() command.
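A short sketch of that step, assuming the car package is installed:

```r
library(car)    # attaching car makes the Salaries data available

# Register Salaries in the workspace; it shows up in the Environment pane
data(Salaries)
```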
Look over at your Environment pane to see what happened. R loads this kind of dataset lazily, so it’s not going to read the data in until we actually do something with it.
Let’s start by looking at it.
Looking at the structure of our data can help.
Skimming is even more complete.
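The three ways of looking at the data mentioned above can be sketched as follows. The specific functions are assumptions about what the lab intends: head() for a first look, str() for structure, and skim() from the skimr package for skimming.

```r
library(car)
library(skimr)
data(Salaries)

head(Salaries)   # first six rows
str(Salaries)    # variable names, types, and a preview of the values
skim(Salaries)   # per-variable summaries, organized by variable type
```

Each call gives a progressively fuller picture: a few raw rows, then the type of every column, then summary statistics for every variable.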
Once we have an idea of what the data look like, we can make some plots.
There are three prominent graphics libraries in R, and a newer package that combines two of them:
graphics: often called base graphics, these are the drawing methods that come pre-installed with R. These graphics are the most commonly used, but often the least user-friendly (e.g. plot()).
lattice: a nice-looking and powerful graphics library that is particularly adept at making multivariate comparisons. lattice graphics use the formula syntax. Customizing lattice graphics often involves writing panel functions, which can be tricky but powerful (e.g. xyplot()).
ggplot2: a very popular graphing library created by Hadley Wickham, based on Leland Wilkinson’s “Grammar of Graphics”. Unlike lattice, ggplot2 uses an incremental philosophy towards building graphics (e.g. ggplot()).
ggformula: a newer hybrid graphics library. ggformula is based on ggplot2, but uses the formula syntax made popular by lattice and mosaic (e.g. gf_point()).
We will focus on ggplot2 in this class, although we will use some base graphics as well.
For one quantitative variable, you might want to produce a histogram to show the distribution.
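As a sketch, here is a histogram of the salary variable from Salaries. The choice of variable and the bins value are ours, for illustration only.

```r
library(car)
library(ggplot2)
data(Salaries)

# Histogram of salary; bins = 30 is an arbitrary illustrative choice
p_hist <- ggplot(data = Salaries) +
  geom_histogram(aes(x = salary), bins = 30)
p_hist
```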
You could also view a smoothed version of the distribution with a density plot. Try changing the geom_histogram() code so it makes a density plot.
If you have categorical data, a bar chart is more appropriate.
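For example, a bar chart of the rank variable might look like this. The choice of rank is an assumption; any categorical variable in Salaries would work.

```r
library(car)
library(ggplot2)
data(Salaries)

# One bar per faculty rank, with height equal to the number of rows
p_bar <- ggplot(data = Salaries) +
  geom_bar(aes(x = rank))
p_bar
```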
Notice that these plots use the same syntax:
ggplot(data=NameOfData) +
geom_[something](aes(x=VariableName))
geom is short for geometric object, and geoms include geom_histogram(), geom_density(), and geom_bar(). There are many other ways to write ggplot2 code, but we won’t think about those for now.
For two quantitative variables, a scatterplot is appropriate. That uses geom_point().
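A sketch of a scatterplot, using two quantitative variables from Salaries. The pairing of yrs.since.phd with salary is our choice for illustration.

```r
library(car)
library(ggplot2)
data(Salaries)

# Years since PhD on the x-axis, salary on the y-axis
p_scatter <- ggplot(data = Salaries) +
  geom_point(aes(x = yrs.since.phd, y = salary))
p_scatter
```

Note that a second variable only requires adding y = ... inside aes(); the rest of the template is unchanged.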
For one quantitative and one categorical variable, you can make side-by-side boxplots.
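One possible version, assuming we compare salary across rank:

```r
library(car)
library(ggplot2)
data(Salaries)

# One boxplot of salary for each faculty rank
p_box <- ggplot(data = Salaries) +
  geom_boxplot(aes(x = rank, y = salary))
p_box
```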
For two categorical variables, things get trickier. We might want to facet a plot so we could compare across groups.
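As a sketch of faceting, the code below draws a bar chart of rank in a separate panel for each level of sex. The choice of variables and the use of facet_wrap() (rather than facet_grid()) are assumptions made for illustration.

```r
library(car)
library(ggplot2)
data(Salaries)

# Bar chart of rank, split into one panel per level of sex
p_facet <- ggplot(data = Salaries) +
  geom_bar(aes(x = rank)) +
  facet_wrap(~sex)
p_facet
```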
Most R plotting functions can take many arguments that will modify their behavior. The book R Graphics Cookbook is a good resource, or you could take STAT 336 to learn more.
Again, there are several possible syntaxes for summary statistics. We’ll mostly use the mosaic syntax, which looks like this:
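A minimal sketch of the mosaic formula syntax, assuming the mosaic package is installed. The variable goes after a tilde, and the data frame is named in the data argument.

```r
library(car)
library(mosaic)
data(Salaries)

# Mean of the salary variable, using mosaic's formula interface
avg_salary <- mean(~salary, data = Salaries)
avg_salary
```

Other summary functions that mosaic extends, such as median(), follow the same pattern.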
How do you think we’d find the standard deviation of the salary variable?
Of course, you can see all this and more in the skim.
If you want to see summary statistics by group, you can add another variable:
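For instance, putting the grouping variable on the right-hand side of the formula gives one mean per group. Grouping by sex here is our choice for illustration.

```r
library(car)
library(mosaic)
data(Salaries)

# Mean salary computed separately for each level of sex
salary_by_sex <- mean(salary ~ sex, data = Salaries)
salary_by_sex
```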
Are salaries different between men and women?
Now that we’ve tried some things, let’s try “Rendering” our document. To render, click the Render button at the top of the document (it has a blue arrow icon next to it). For problem sets in this class, we will be rendering to PDF, but you can also render to HTML or Word. This document is a presentation, which is yet another option. Feel free to try Rendering to different formats!