Lecture 02: Exploratory Data Analysis

Professor McNamara

Software in this course

We’ll be using R and RStudio throughout the course to learn statistical concepts, analyze real data, and come to informed conclusions.

What is R?

R is a programming language specifically designed for statistical analysis. R is open-source, and is developed by a team of statisticians and programmers in both academia and industry. It is updated frequently and has become a de facto standard for statistical computing. In the data science realm, alternatives to R include Python with the pandas library, and Julia. In the statistics realm, alternatives include SAS, Stata, and SPSS.

What is RStudio?

RStudio is an Integrated Development Environment (IDE) for R. RStudio is also open-source software, and depends upon a valid R installation to function. RStudio is available as both a Desktop and a Cloud application. Before RStudio, people used R through the command line directly, or through graphical user interfaces like Rcmdr, but RStudio is so vastly superior that these alternatives have few users left. RStudio employees are important drivers of R innovation, and currently maintain packages like ggplot2, dplyr and tidyverse, among others.

What is Quarto?

Quarto is a format for composing relatively simple documents that combine code and text. Quarto relies on a version of markdown (a general-purpose authoring format), and provides functionality for processing code and including the output. We’ll be using Quarto with R code chunks, but it also works with Python, Julia, and other programming languages.

The RStudio interface

The panel in the upper right contains your environment as well as a history of the commands that you’ve previously entered. Any plots that you generate will show up in the panel in the lower right corner.

The panel on the left is where the action happens. It’s called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

Getting ready: RStudio project

It’s important to have a good file organization structure for this class. RStudio has a feature called “Projects”, which are basically special folders on your computer that RStudio knows a lot about. Let’s make a project for this course. In RStudio, go to

File ->

New Project ->

New Directory ->

New Project ->

Pick a directory name (maybe DASC240, all one word) ->

Pick where the directory will live (I highly suggest OneDrive) ->

Create Project

RStudio should restart and you should then see the name of your project in the upper right corner of RStudio.
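One quick way to confirm the project opened correctly is to check R's working directory from the console. This is a minimal sketch using base R only; the folder name you see will depend on what you chose above.

```r
# When a project is open, the working directory should be the project folder
getwd()

# List whatever files are currently in that folder (a brand-new project has very few)
list.files()
```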

Getting ready: RStudio settings

The default settings in RStudio are mostly pretty good! But, there are a couple I like to change. Let’s do that. Go to:

Tools ->

Global Options

On the General page, change “Save workspace to .RData on exit” to “Never”.

On the RMarkdown page, uncheck “Show output inline in RMarkdown documents”.

If you prefer a different color scheme, check out the Appearance page.

When you’re done setting things up, hit OK.

Trying out R

To get started, enter all commands at the R prompt (i.e. right after > on the console); you can either type them in manually or copy and paste them from this document.

At its simplest, R is just a big calculator. So, it can do arithmetic and apply functions.

2+2
100/10
log(100)
sqrt(100)
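A few more calculator-style commands, as a small sketch. Note that log() in R is the natural log, and the variable name x below is just an illustration, not something used elsewhere in the course.

```r
log(100)             # natural log (base e), so this is not 2
log10(100)           # base-10 log
log(100, base = 10)  # same thing, with the base given explicitly

# Results can be stored with the assignment arrow and reused
x <- sqrt(100)
x + 2
```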

R packages

R has a number of additional packages that help you perform specific tasks. For example, dplyr is an R package designed to simplify the process of data wrangling, and ggplot2 is for data visualization based on the Grammar of Graphics (a famous book).

In order to use a package, it must be installed (you only have to do this once) and loaded (you have to do this every time you start an R session).

Installing packages

To be able to do more advanced stuff, you need additional packages. I have a list of them on Canvas, which you can install by copy-pasting into the Console and hitting enter.

install.packages(c("agricolae","babynames", "car", "devtools", "fueleconomy", 
                    "GGally","Hmisc", "infer", "janitor", "leaps", "lmtest",
                    "Lock5Data","manipulate","mosaic","nycflights13", "skimr",
                    "Stat2Data", "tidyverse","usethis"))

A bunch of red text will scroll by, which means it’s working! Installing all the packages may take a few minutes; you’ll know the packages have finished installing when you see the R prompt (>) return in your console.
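If you are not sure whether the installation finished, here is one way to check which packages are still missing. This is only a sketch: missing_packages() is a hypothetical helper written for this example (it is not part of R or of the course materials), and the short list of names is just a subset of the list above.

```r
# Hypothetical helper: return the names in `wanted` that are not installed yet
missing_packages <- function(wanted) {
  setdiff(wanted, rownames(installed.packages()))
}

wanted <- c("tidyverse", "mosaic", "skimr", "car")
to_install <- missing_packages(wanted)
to_install  # an empty result means everything in `wanted` is installed

# Install only what is missing (uncomment to run; needs an internet connection)
# if (length(to_install) > 0) install.packages(to_install)
```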

Loading packages

The additional packages we installed will help you perform specific tasks. We’ve already installed them, which only needs to be done once. In order to use a package we have to load it. This needs to happen every time you start an R session. To load a package, you use the library() command,

library(ggplot2)

You might get a “message” when you load a package, but otherwise not much happens. But, then it’s ready for us to use! The ggplot2 package helps us make plots.
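If library() gives an error instead of a message, the package is probably not installed. A small sketch of checking first, using base R's requireNamespace(), which returns TRUE or FALSE rather than stopping:

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Once loaded, the package shows up on R's search path
  print("package:ggplot2" %in% search())
} else {
  message("ggplot2 is not installed yet; run install.packages(\"ggplot2\") first")
}
```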

Exploratory Data Analysis

We are going to be looking at plots and summary statistics as part of the technique of Exploratory Data Analysis (EDA). EDA was named by statistician John Tukey in the 1960s, and continues to be exceedingly useful today. Essentially, Tukey was advocating getting familiar with data before beginning modeling, so you don’t run into errors that are easy to catch visually but hard to catch numerically.

Packages for EDA

To begin, we need to load packages to use in our session. We do this with the library() command,

library(tidyverse)
library(mosaic)
library(skimr)
library(car)

Salaries dataset

Then, we need some data. For this lab I chose the Salaries dataset, which comes with the car package (it lives in carData, a companion package that car loads automatically). This dataset is easy to load because it comes from within a package. Plus, it has to do with college professors!

Since the data is already in R, we can access it with the data() command,

data(Salaries)

Look over at your Environment pane to see what happened in your environment. R uses lazy evaluation, so it’s not going to load the data in until we actually do something with it.

Starting to explore

Let’s start by looking at it.

str(Salaries)

Looking at the structure of our data can help.
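Besides str(), a few other base R commands give a quick first look. This sketch assumes the car package (and with it carData) is installed, as in the install step earlier.

```r
library(car)       # also attaches carData, where Salaries lives
data(Salaries)

dim(Salaries)      # number of rows and columns
names(Salaries)    # variable names
head(Salaries)     # first six rows
summary(Salaries)  # quick per-variable summary
```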

Skimming the dataset

Skimming with skim_without_charts(), from the skimr package, gives an even more complete summary.

skim_without_charts(Salaries)

Once we have an idea of what the data look like, we can make some plots.

Graphics libraries in R

There are three prominent graphics libraries in R, and a new package that combines two of them together:

  1. graphics: often called base graphics, these are the drawing methods that come pre-installed with R. These graphics are the most commonly-used, but often the least user-friendly. (e.g. plot())
  2. lattice: a nice-looking and powerful graphics library that is particularly adept at making multivariate comparisons. lattice graphics use the formula syntax. Customization of lattice graphics often involves writing panel.functions – which can be tricky, but powerful. (e.g. xyplot())
  3. ggplot2: a very popular graphing library created by Hadley Wickham, based on Leland Wilkinson’s “Grammar of Graphics” paradigm. Unlike lattice, ggplot2 uses an incremental philosophy towards building graphics. (e.g. ggplot())
  4. ggformula: there’s a new hybrid graphics library, ggformula. ggformula is based on ggplot2, but uses the formula syntax made popular by lattice and mosaic. (e.g. gf_point())

We will focus on ggplot2 in this class, although we will use some base graphics as well.

Single-variable plots in ggplot2

Histogram

For one quantitative variable, you might want to produce a histogram to show the distribution.

ggplot(data = Salaries) + geom_histogram(aes(x = salary))

Density plot

You could also view a smoothed version of the distribution with a density plot. Try changing the geom_histogram code so it makes a density.

ggplot(data = Salaries) + geom_histogram(aes(x = salary))

Bar chart

If you have categorical data, a barchart is more appropriate.

ggplot(data = Salaries) + geom_bar(aes(x = rank))

General ggplot2 syntax

Notice that these plots use the same syntax:

ggplot(data=NameOfData) + 
  geom_[something](aes(x=VariableName))

geom is short for geometric object, and geoms include geom_histogram(), geom_density() or geom_bar(). There are many other ways to write ggplot2 code, but we won’t think about those for now.
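To see the template at work on other variables, here is a sketch that fills in NameOfData and VariableName with two more columns from Salaries. The object names p_hist and p_bar are just for illustration.

```r
library(ggplot2)
data(Salaries, package = "carData")

# Quantitative variable (years of service) with a histogram
p_hist <- ggplot(data = Salaries) +
  geom_histogram(aes(x = yrs.service))

# Categorical variable (discipline) with a bar chart
p_bar <- ggplot(data = Salaries) +
  geom_bar(aes(x = discipline))

# Typing the name of a stored plot draws it
p_hist
p_bar
```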

Multiple-variable plots in ggplot2

Scatterplot

For two quantitative variables, a scatterplot is appropriate. That will use geom_point().

ggplot(Salaries) + geom_point(aes(x = yrs.since.phd, y = salary))

Side by side boxplots

For one quantitative and one categorical variable, you can make side-by-side boxplots.

ggplot(Salaries) + geom_boxplot(aes(x = sex, y = salary))

Faceted barchart

For two categorical variables, things get trickier. We might want to facet a plot so we could compare across groups.

ggplot(Salaries) + geom_bar(aes(x = sex)) + facet_wrap(vars(rank))
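Faceting is not the only option for two categorical variables. As one possible alternative (a sketch, not the approach used in the rest of these notes), you can map the second variable to fill and put the bars side by side:

```r
library(ggplot2)
data(Salaries, package = "carData")

# One cluster of bars per rank, one colored bar per sex within each cluster
p_dodge <- ggplot(Salaries) +
  geom_bar(aes(x = rank, fill = sex), position = "dodge")

p_dodge
```

Which version is easier to read depends on the comparison you care about: facets make it easy to compare sexes within a rank, while dodged bars keep everything on one shared axis.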

Bells and Whistles

Most R plotting functions can take many arguments that will modify their behavior. The book R Graphics Cookbook is a good resource, or you could take STAT 336 to learn more.
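As a small taste, here is a sketch of the salary histogram with a few extra arguments. The particular choices (30 bins, the colors, the label text) are arbitrary and only meant to show where such arguments go.

```r
library(ggplot2)
data(Salaries, package = "carData")

p_fancy <- ggplot(data = Salaries) +
  # bins sets the number of bars; fill and color set the bar and outline colors
  geom_histogram(aes(x = salary), bins = 30, fill = "steelblue", color = "white") +
  # labs() adds a title and axis labels
  labs(title = "Distribution of professor salaries",
       x = "Salary",
       y = "Number of professors")

p_fancy
```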

Summary statistics

Again, there are several possible syntaxes for summary statistics. We’ll mostly use the mosaic syntax, which looks like this:

mean(~salary, data = Salaries)
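Other summary functions follow the same formula pattern. A short sketch, assuming mosaic and carData are installed; favstats() is mosaic's function for several common statistics at once.

```r
library(mosaic)
data(Salaries, package = "carData")

median(~salary, data = Salaries)    # middle value of salary
favstats(~salary, data = Salaries)  # min, quartiles, max, mean, sd, n, and missing count
```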

How do you think we’d find the standard deviation of the salary variable?

The skim, again

Of course, you can see all this and more in the skim,

skim_without_charts(Salaries)

Summary statistics by group

If you want to see summary statistics by group, you can add another variable:

mean(salary~rank, data = Salaries)
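The same idea works with favstats(), and you can cross-check the group means with base R's tapply(). This is a sketch; the object names mosaic_means and base_means are only for illustration.

```r
library(mosaic)
data(Salaries, package = "carData")

# Several summary statistics for salary within each rank
favstats(salary ~ rank, data = Salaries)

# Group means two ways: mosaic's formula syntax and base R
mosaic_means <- mean(salary ~ rank, data = Salaries)
base_means <- tapply(Salaries$salary, Salaries$rank, base::mean)

mosaic_means
base_means
```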

Are salaries different between men and women?

Rendering

Now that we’ve tried some things, let’s try “Rendering” our document. To render, click the Render button at the top of the document (it has a blue arrow icon next to it). For problem sets in this class, we will be rendering to PDF, but you can also render to HTML or Word. This document is a presentation, which is yet another option. Feel free to try Rendering to different formats!