R, ggplot2, plotly

Amelia McNamara

12/4/2017

R in SDS 136

There are several places you may want to start:

Some background info

R has a bit of a learning curve but once you get used to it, you’re off to the races. Keep in mind you will get a ton of errors while you’re learning R — in fact, you’ll still deal with errors even if you’re a master user! Google is your friend — copy and paste the error into Google and see what others have written.

PRO TIP: R is VERY VERY VERY case-sensitive. A misplaced capital letter or double-quote where there should be a single-quote might just be the culprit to your error.

And don’t forget — save EVERYTHING! RStudio allows you to save your data, your scripts, your history, your entire RStudio workspace — save it all and save often.

Logging in to RStudio

I’ve had our server administrator make you all RStudio server accounts. To log in, go to rstudio.smith.edu and log in with your Smith username (mine is amcnamara) and the same password you use for Bannerweb, your email, etc.

RStudio

RStudio is a great interface to the R language. The interface has 4 panes.

You’ll do your coding in the Console (lower left, looks like Terminal/command line).

The Environment and History (upper right) panes keep track of everything you have loaded into your environment (mostly, data) and all the R commands you have ever typed. This is great because you can search through your past code to find commands you may have forgotten.

The Source pane is on the upper left and displays your data tables, as well as any documents you are reading or editing (maybe, you’re looking at this document there!).

The Files/Plots/Packages/Help/View pane on the lower right does a bunch of stuff — holds files, displays the graphs you’ve made, shows the packages you have installed, and more. Perhaps the most important tab of that lower right pane is Help.

More help

R has two help functions. Forget what an average or mean is? Try this.

?mean

or

help(mean)

Packages and syntax

R has “base” functionality that is built into the language, but most of the good stuff comes from user-written “packages” that extend the language. We’ll be focused on the “tidyverse” syntax here, mostly using ggplot2 and dplyr.

Because R is open source (and was written by statisticians…) it has some strange features compared to other programming languages. The most obvious is that its syntax is not consistent. Almost any task you want to do, there are three ways to accomplish. And packages can contribute to this nonsense by providing their own syntax. We’ll try to point to places where the code we are showing you is just one of many ways to do something, but know that when you start googling questions you may see things that don’t look familiar. Don’t be afraid!

Loading packages

To use a package, you have load it into your session by using the library() command. Lets start by loading some packages:

library(dplyr)

Data!

Later in this class, we will show you how to load in external data, but for now we are going to begin by playing with some data that comes with R. This will allow us to start playing quickly.

We’ll start with some data about midwest demographics. To get started, you can read the data in with the data() command, and see the first few rows with glimpse().

data(midwest)
glimpse(midwest)
## Observations: 437
## Variables: 28
## $ PID                  <int> 561, 562, 563, 564, 565, 566, 567, 568, 5...
## $ county               <chr> "ADAMS", "ALEXANDER", "BOND", "BOONE", "B...
## $ state                <chr> "IL", "IL", "IL", "IL", "IL", "IL", "IL",...
## $ area                 <dbl> 0.052, 0.014, 0.022, 0.017, 0.018, 0.050,...
## $ poptotal             <int> 66090, 10626, 14991, 30806, 5836, 35688, ...
## $ popdensity           <dbl> 1270.9615, 759.0000, 681.4091, 1812.1176,...
## $ popwhite             <int> 63917, 7054, 14477, 29344, 5264, 35157, 5...
## $ popblack             <int> 1702, 3496, 429, 127, 547, 50, 1, 111, 16...
## $ popamerindian        <int> 98, 19, 35, 46, 14, 65, 8, 30, 8, 331, 51...
## $ popasian             <int> 249, 48, 16, 150, 5, 195, 15, 61, 23, 803...
## $ popother             <int> 124, 9, 34, 1139, 6, 221, 0, 84, 6, 1596,...
## $ percwhite            <dbl> 96.71206, 66.38434, 96.57128, 95.25417, 9...
## $ percblack            <dbl> 2.57527614, 32.90043290, 2.86171703, 0.41...
## $ percamerindan        <dbl> 0.14828264, 0.17880670, 0.23347342, 0.149...
## $ percasian            <dbl> 0.37675897, 0.45172219, 0.10673071, 0.486...
## $ percother            <dbl> 0.18762294, 0.08469791, 0.22680275, 3.697...
## $ popadults            <int> 43298, 6724, 9669, 19272, 3979, 23444, 35...
## $ perchsd              <dbl> 75.10740, 59.72635, 69.33499, 75.47219, 6...
## $ percollege           <dbl> 19.63139, 11.24331, 17.03382, 17.27895, 1...
## $ percprof             <dbl> 4.355859, 2.870315, 4.488572, 4.197800, 3...
## $ poppovertyknown      <int> 63628, 10529, 14235, 30337, 4815, 35107, ...
## $ percpovertyknown     <dbl> 96.27478, 99.08714, 94.95697, 98.47757, 8...
## $ percbelowpoverty     <dbl> 13.151443, 32.244278, 12.068844, 7.209019...
## $ percchildbelowpovert <dbl> 18.011717, 45.826514, 14.036061, 11.17953...
## $ percadultpoverty     <dbl> 11.009776, 27.385647, 10.852090, 5.536013...
## $ percelderlypoverty   <dbl> 12.443812, 25.228976, 12.697410, 6.217047...
## $ inmetro              <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,...
## $ category             <chr> "AAR", "LHR", "AAR", "ALU", "AAR", "AAR",...

For a spreadsheet-like experience of your data, just click on its name in your Environment tab to see a preview of the data in the Source pane. Note that you can only look, you can’t “touch.” R is a programming language, so if you see values you want to have changed, or you need to add another variable based on an existing one you have to do so programmatically.

If you want to read about all the variables that are included in the dataset, you can run

?midwest

or

help(midwest)

First glance at the data

A few more handy functions that have to do with datasets are str(), which tells you about the structure of your data,

str(midwest)
## Classes 'tbl_df', 'tbl' and 'data.frame':    437 obs. of  28 variables:
##  $ PID                 : int  561 562 563 564 565 566 567 568 569 570 ...
##  $ county              : chr  "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
##  $ state               : chr  "IL" "IL" "IL" "IL" ...
##  $ area                : num  0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
##  $ poptotal            : int  66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
##  $ popdensity          : num  1271 759 681 1812 324 ...
##  $ popwhite            : int  63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
##  $ popblack            : int  1702 3496 429 127 547 50 1 111 16 16559 ...
##  $ popamerindian       : int  98 19 35 46 14 65 8 30 8 331 ...
##  $ popasian            : int  249 48 16 150 5 195 15 61 23 8033 ...
##  $ popother            : int  124 9 34 1139 6 221 0 84 6 1596 ...
##  $ percwhite           : num  96.7 66.4 96.6 95.3 90.2 ...
##  $ percblack           : num  2.575 32.9 2.862 0.412 9.373 ...
##  $ percamerindan       : num  0.148 0.179 0.233 0.149 0.24 ...
##  $ percasian           : num  0.3768 0.4517 0.1067 0.4869 0.0857 ...
##  $ percother           : num  0.1876 0.0847 0.2268 3.6973 0.1028 ...
##  $ popadults           : int  43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
##  $ perchsd             : num  75.1 59.7 69.3 75.5 68.9 ...
##  $ percollege          : num  19.6 11.2 17 17.3 14.5 ...
##  $ percprof            : num  4.36 2.87 4.49 4.2 3.37 ...
##  $ poppovertyknown     : int  63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
##  $ percpovertyknown    : num  96.3 99.1 95 98.5 82.5 ...
##  $ percbelowpoverty    : num  13.15 32.24 12.07 7.21 13.52 ...
##  $ percchildbelowpovert: num  18 45.8 14 11.2 13 ...
##  $ percadultpoverty    : num  11.01 27.39 10.85 5.54 11.14 ...
##  $ percelderlypoverty  : num  12.44 25.23 12.7 6.22 19.2 ...
##  $ inmetro             : int  0 0 0 1 0 0 0 0 0 1 ...
##  $ category            : chr  "AAR" "LHR" "AAR" "ALU" ...

and summary(), which provides a summary.

summary(midwest)
##       PID          county             state                area        
##  Min.   : 561   Length:437         Length:437         Min.   :0.00500  
##  1st Qu.: 670   Class :character   Class :character   1st Qu.:0.02400  
##  Median :1221   Mode  :character   Mode  :character   Median :0.03000  
##  Mean   :1437                                         Mean   :0.03317  
##  3rd Qu.:2059                                         3rd Qu.:0.03800  
##  Max.   :3052                                         Max.   :0.11000  
##     poptotal         popdensity          popwhite          popblack      
##  Min.   :   1701   Min.   :   85.05   Min.   :    416   Min.   :      0  
##  1st Qu.:  18840   1st Qu.:  622.41   1st Qu.:  18630   1st Qu.:     29  
##  Median :  35324   Median : 1156.21   Median :  34471   Median :    201  
##  Mean   :  96130   Mean   : 3097.74   Mean   :  81840   Mean   :  11024  
##  3rd Qu.:  75651   3rd Qu.: 2330.00   3rd Qu.:  72968   3rd Qu.:   1291  
##  Max.   :5105067   Max.   :88018.40   Max.   :3204947   Max.   :1317147  
##  popamerindian        popasian         popother        percwhite    
##  Min.   :    4.0   Min.   :     0   Min.   :     0   Min.   :10.69  
##  1st Qu.:   44.0   1st Qu.:    35   1st Qu.:    20   1st Qu.:94.89  
##  Median :   94.0   Median :   102   Median :    66   Median :98.03  
##  Mean   :  343.1   Mean   :  1310   Mean   :  1613   Mean   :95.56  
##  3rd Qu.:  288.0   3rd Qu.:   401   3rd Qu.:   345   3rd Qu.:99.07  
##  Max.   :10289.0   Max.   :188565   Max.   :384119   Max.   :99.82  
##    percblack       percamerindan        percasian        percother      
##  Min.   : 0.0000   Min.   : 0.05623   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 0.1157   1st Qu.: 0.15793   1st Qu.:0.1737   1st Qu.:0.09102  
##  Median : 0.5390   Median : 0.21502   Median :0.2972   Median :0.17844  
##  Mean   : 2.6763   Mean   : 0.79894   Mean   :0.4872   Mean   :0.47906  
##  3rd Qu.: 2.6014   3rd Qu.: 0.38362   3rd Qu.:0.5212   3rd Qu.:0.48050  
##  Max.   :40.2100   Max.   :89.17738   Max.   :5.0705   Max.   :7.52427  
##    popadults          perchsd        percollege        percprof      
##  Min.   :   1287   Min.   :46.91   Min.   : 7.336   Min.   : 0.5203  
##  1st Qu.:  12271   1st Qu.:71.33   1st Qu.:14.114   1st Qu.: 2.9980  
##  Median :  22188   Median :74.25   Median :16.798   Median : 3.8142  
##  Mean   :  60973   Mean   :73.97   Mean   :18.273   Mean   : 4.4473  
##  3rd Qu.:  47541   3rd Qu.:77.20   3rd Qu.:20.550   3rd Qu.: 4.9493  
##  Max.   :3291995   Max.   :88.90   Max.   :48.079   Max.   :20.7913  
##  poppovertyknown   percpovertyknown percbelowpoverty percchildbelowpovert
##  Min.   :   1696   Min.   :80.90    Min.   : 2.180   Min.   : 1.919      
##  1st Qu.:  18364   1st Qu.:96.89    1st Qu.: 9.199   1st Qu.:11.624      
##  Median :  33788   Median :98.17    Median :11.822   Median :15.270      
##  Mean   :  93642   Mean   :97.11    Mean   :12.511   Mean   :16.447      
##  3rd Qu.:  72840   3rd Qu.:98.60    3rd Qu.:15.133   3rd Qu.:20.352      
##  Max.   :5023523   Max.   :99.86    Max.   :48.691   Max.   :64.308      
##  percadultpoverty percelderlypoverty    inmetro         category        
##  Min.   : 1.938   Min.   : 3.547     Min.   :0.0000   Length:437        
##  1st Qu.: 7.668   1st Qu.: 8.912     1st Qu.:0.0000   Class :character  
##  Median :10.008   Median :10.869     Median :0.0000   Mode  :character  
##  Mean   :10.919   Mean   :11.389     Mean   :0.3432                     
##  3rd Qu.:13.182   3rd Qu.:13.412     3rd Qu.:1.0000                     
##  Max.   :43.312   Max.   :31.162     Max.   :1.0000

Names of variables

You can also get the list of variable names using the names() function,

names(midwest)
##  [1] "PID"                  "county"               "state"               
##  [4] "area"                 "poptotal"             "popdensity"          
##  [7] "popwhite"             "popblack"             "popamerindian"       
## [10] "popasian"             "popother"             "percwhite"           
## [13] "percblack"            "percamerindan"        "percasian"           
## [16] "percother"            "popadults"            "perchsd"             
## [19] "percollege"           "percprof"             "poppovertyknown"     
## [22] "percpovertyknown"     "percbelowpoverty"     "percchildbelowpovert"
## [25] "percadultpoverty"     "percelderlypoverty"   "inmetro"             
## [28] "category"

Tips and tricks

Data visualization in R

Of course, what we’re really interested in is making data visualizations!

There are many ways to make graphics in R.

We’ll focus on ggplot2, which implements the Grammar of Graphics we’ve talked about in this class.

Grammar of Graphics

Remember our retinal variables?

The grammar of graphics gives you a layered way to build up graphics, mapping data to aesthetics.

ggplot2

ggplot2 is an R package by Hadley Wickham that lets you make beautiful R graphics (relatively) easily.

To use it, we need to load the package,

library(ggplot2)

Diamonds data

To start, I’m going to use the diamonds data that comes with ggplot2,

str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

qplot()– the easy way out

qplot(x=carat, data=diamonds)

ggplot2 syntax

qplot(x=carat, data=diamonds)

As a function qplot basically works like this:

qplot([x_variable], [y_variable], data = [data_frame], geom = “[geom_type]”)

(Remember, use ?qplot() or help(qplot) if you need help.) The function makes assumptions on the best type of chart to use. In the above example, because you only passed one numeric variable it assumed a histogram is what is needed. We could explictly tell it to do that.

More qplot()

qplot(x=clarity, fill=cut, data=diamonds)

ggplot()

But, in order to really harness the power of ggplot2 you need to use the more general ggplot() command. The idea of the package is you can “layer” pieces on top of a plot to build it up over time.

You always need to use a ggplot() call to initialize the plot. I usually put my dataset in here, and at least some of my “aesthetics.” But, one of the things that can make ggplot2 tough to understand is that there are no hard and fast rules.

p1 <- ggplot(aes(x=clarity, fill=cut), data=diamonds)

If you try to show p1 at this point, you will get “Error: No layers in plot.” This is because we haven’t given it any geometric objects yet.

geoms

In order to get a plot to work, you need to use “geoms” (geometric objects) to specify the way you want your variables mapped to graphical parameters.

p1 + geom_bar()

geoms have options

p1 + geom_bar(position="dodge")

Lots of options

p1 + geom_bar(position="fill")

Two variables

p2 <- ggplot(aes(x=carat, y=depth), data=diamonds)
p2 + geom_point()

Same data, different geom

p2 + geom_bin2d()

Saving your work (or not)

Notice that I’m not saving these geom layers– I’m just running

p2 + [something]

to see what happens. But, I can save the new version to start building up my plot,

p2 <- p2 + geom_bin2d()

Better labels

p2 <- p2 + xlab("Carat") + ylab("Depth") + 
 guides(fill=guide_legend(title="Number of diamonds"))
p2

Different breaks

p2 + scale_fill_continuous(breaks=c(1500, 2500, 3500,4500))

Log scale

p2 + scale_fill_continuous(trans="log")

Log scale, different breaks

p2 + scale_fill_continuous(trans="log", breaks=c(1,10,100,1000,5000))

Faceting (wrapping)

p2 + facet_wrap(~color)

Faceting (grid)

p2 + facet_grid(color~clarity)

Many layers

baseplot <- ggplot() + 
  aes(x=citystate, y=Freq, fill = Response, order=Response) +
  facet_wrap(~year, nrow=3)+geom_bar(data = trial2$neg, stat = "identity") + 
  scale_fill_manual(breaks=c("Not at all satisfied", "2", "3", "4", 
                             "Extremely satisfied"), values=colorsB, 
                    name="Response") + 
  geom_bar(data = trial2$pos, stat = "identity") + coord_flip() + 
  ggtitle("Community satisfaction") + xlab("") +ylab("") + 
  scale_y_continuous(limits=c(-0.5, 1), 
                     breaks=seq(from=-0.5, to=0.75, by=0.25), 
                     labels=c("50%", "25%", "0", "25%", "50%", "75%")) + 
  theme(legend.text=element_text(size=14), 
        legend.title=element_text(size=16), 
        axis.text=element_text(size=14), 
        strip.text=element_text(size=14))
baseplot 

Colors

# install.packages("RColorBrewer")
library(RColorBrewer)
display.brewer.all(n=5, type="qual")

p1 + geom_bar(position="dodge") + scale_fill_brewer(palette="Dark2")

Saving your work

p1 <- p1 + geom_bar(position="dodge") + scale_fill_brewer(palette="Dark2")
ggsave(p1, filename="mycoolplot.jpg")

ATUS data

Now, I want you to try some of your new skills on the American Time Use Survey data.

Loading Data

Most of the time, you won’t be working with data sets available through packages and will need to import data in to R. RStudio makes this very simple. On the Environment pane (upper right corner), the Import Dataset icon provides a step-by-step process to importing files. Following those steps will not only import the data, but print out the code in the Console showing how it was done.

Loading data programmatically

library(readr)
atus <- read.csv("atus.csv")

To import this data set, we used the read_csv() function. The key part of this command is making sure that you direct the command to where the file is located on your computer. Check other functionality of ?read_csv if you have questions.

Tips for loading data

Note: The way file directories are referenced in R varies between a Mac and PC.

MAC file directories follow this protocol use “~/file_location”

PC versions would follow this: “C:_lcocation"

You can load it in the point-and-click way, or do it programmatically.

Challenge questions

Resources for ggplot2

Plotly API

So far, all the plots we have seen were static. There are many ways to make interactive plots in R, but one of the best is using the Plotly API. (This is where their developers spend a lot of time!)

To use Plotly in R, you need to install the Plotly package

install.packages("plotly")

Then, you can load it into your session

library(plotly)

ggplotly

If you are familiar with ggplot, a nice extension is ggplotly. You can take an existing ggplot object and wrap it in ggplotly() to get it to become a little interactive.

p3 <- ggplotly(p1)

Adding to your Plotly account online

If you want to see the plots online, you will need to set up your API credentials in R. Read about that here. You need to fill in your username and API key

Sys.setenv("plotly_username"="AmeliaMN") 
Sys.setenv("plotly_api_key"="your_api_key")

Sending to your profile

api_create(p3, filename = "barchart")

Even more Plotly

If you want to get fancier interactivity, you can use the more complex plotly graphing commands. They have a slightly different syntax.

p <- plot_ly(midwest, x = ~percollege, color = ~state, type = "box")

There is much more documentation about these functions online.

More R

Summary statistics

Another part of exploratory data analysis is finding summary statistics. We’ll be using dplyr to do “tidy” computations, so we’ll use the pipe to bring data into a summarize() command.

midwest %>% 
  summarize(mean(popblack))
## # A tibble: 1 x 1
##   `mean(popblack)`
##              <dbl>
## 1         11023.88

Other than mean() you can try median() min() max() sd() and many more.

Summary statistics broken down

If you have two variables and want to see their relationship, you can group_by() one and then summarize. For example,

midwest %>%
  group_by(inmetro) %>%
  summarize(mean(popblack))
## # A tibble: 2 x 2
##   inmetro `mean(popblack)`
##     <int>            <dbl>
## 1       0         503.8711
## 2       1       31152.1667
  1. Do metro areas have larger percentages of whites than rural areas? How do you know?

Categorical data

For categorical data, n() is useful,

midwest %>%
  group_by(inmetro) %>%
  summarize(n=n())
## # A tibble: 2 x 2
##   inmetro     n
##     <int> <int>
## 1       0   287
## 2       1   150

And, n() can be used to count two variables together,

midwest %>%
  group_by(inmetro, category) %>%
  summarize(n=n())
## # A tibble: 16 x 3
## # Groups:   inmetro [?]
##    inmetro category     n
##      <int>    <chr> <int>
##  1       0      AAR   193
##  2       0      AHR    16
##  3       0      ALR    11
##  4       0      HAR     6
##  5       0      HHR     1
##  6       0      HLR     2
##  7       0      LAR    30
##  8       0      LHR    28
##  9       1      AAU    77
## 10       1      AHU     1
## 11       1      ALU    20
## 12       1      HAU    20
## 13       1      HHU     1
## 14       1      HLU    26
## 15       1      LAU     3
## 16       1      LHU     2

Sorting and ordering

Maybe we’re working on a story about the county in the midwest that has the largest percentage of black people. We can use the arrange() function to see this.

midwest %>% 
  arrange(percblack)
## # A tibble: 437 x 28
##      PID      county state  area poptotal popdensity popwhite popblack
##    <int>       <chr> <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1  3020   MENOMINEE    WI 0.021     3890   185.2381      416        0
##  2   753       WHITE    IN 0.030    23265   775.5000    23127        2
##  3   600      JASPER    IL 0.029    10609   365.8276    10574        1
##  4  3041      TAYLOR    WI 0.057    18901   331.5965    18807        2
##  5  1256 MONTMORENCY    MI 0.033     8936   270.7879     8861        1
##  6   748      WARREN    IN 0.021     8176   389.3333     8140        1
##  7   717      MORGAN    IN 0.024    55920  2330.0000    55635        9
##  8  3006        IRON    WI 0.047     6153   130.9149     6121        1
##  9   646       SCOTT    IL 0.015     5644   376.2667     5634        1
## 10   567     CALHOUN    IL 0.017     5322   313.0588     5298        1
## # ... with 427 more rows, and 20 more variables: popamerindian <int>,
## #   popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>

Sorting in decreasing order

But, this is sorting in increasing order, and we want in decreasing order. We could look in the documentation for arrange() to see how to change this.

midwest %>% 
  arrange(desc(percblack))
## # A tibble: 437 x 28
##      PID    county state  area poptotal popdensity popwhite popblack
##    <int>     <chr> <chr> <dbl>    <int>      <dbl>    <int>    <int>
##  1  1278     WAYNE    MI 0.035  2111687 60333.9143  1212007   849109
##  2   562 ALEXANDER    IL 0.014    10626   759.0000     7054     3496
##  3   637   PULASKI    IL 0.011     7523   683.9091     5032     2466
##  4   642  ST CLAIR    IL 0.040   262852  6571.3000   187866    71275
##  5   576      COOK    IL 0.058  5105067 88018.3966  3204947  1317147
##  6  2026  CUYAHOGA    OH 0.026  1412140 54313.0769  1025756   350185
##  7   707      LAKE    IN 0.030   475594 15853.1333   334203   116688
##  8   711    MARION    IN 0.023   797159 34659.0870   615039   169654
##  9  2039  HAMILTON    OH 0.025   866228 34649.1200   672972   181145
## 10  3021 MILWAUKEE    WI 0.015   959275 63951.6667   718918   195470
## # ... with 427 more rows, and 20 more variables: popamerindian <int>,
## #   popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## #   percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>
  1. What county has the largest American Indian population (in absolute numbers)?

Looking at new data

library(readr)
docs <- read_csv("PartD_Providers.csv")

Let’s see what our data looks like to make sure it imported correctly. Remember, the str() command tells you the structure of your data.

str(docs)
## Classes 'tbl_df', 'tbl' and 'data.frame':    55918 obs. of  12 variables:
##  $ NPI            : int  1548260086 1659380202 1497775316 1164415824 1093735268 1194925560 1609897842 1144212291 1205949831 1548468655 ...
##  $ LAST_NAME      : chr  "ELSBREE" "ERBAY" "ANTIPORDA" "GONZALEZ" ...
##  $ FIRST_NAME     : chr  "SCOTT" "CELAL" "GLORIOSA" "CIRILO" ...
##  $ CITY           : chr  "LAKEWOOD RANCH" "GAINESVILLE" "JACKSONVILLE" "MIAMI" ...
##  $ STATE          : chr  "FL" "FL" "FL" "FL" ...
##  $ ZIP            : int  342408425 326053170 322216650 331351428 330103115 333084608 33525 323084573 334429473 35640 ...
##  $ SPECIALTY      : chr  "Internal Medicine" "Internal Medicine" "Internal Medicine" "Internal Medicine" ...
##  $ CLAIM_COUNT    : int  24512 15921 24682 47015 14656 18619 22684 27849 11075 6142 ...
##  $ BENE_COUNT     : int  1026 474 1340 2082 619 861 802 802 818 289 ...
##  $ BRAND_COUNT    : int  6466 4551 3832 9530 3653 4509 3821 6602 2720 1011 ...
##  $ LOWINCOME_COUNT: int  11024 13750 10634 25596 9502 11211 8522 21078 292 2664 ...
##  $ COST_SUM       : num  1484516 1013012 964914 957790 950820 ...

The data appears to have imported correctly for each field.

We can move on to some basic summary statistics.

summary(docs)
##       NPI             LAST_NAME          FIRST_NAME       
##  Min.   :1.003e+09   Length:55918       Length:55918      
##  1st Qu.:1.245e+09   Class :character   Class :character  
##  Median :1.498e+09   Mode  :character   Mode  :character  
##  Mean   :1.500e+09                                        
##  3rd Qu.:1.750e+09                                        
##  Max.   :1.993e+09                                        
##                                                           
##      CITY              STATE                ZIP           
##  Length:55918       Length:55918       Min.   :      601  
##  Class :character   Class :character   1st Qu.:104573360  
##  Mode  :character   Mode  :character   Median :338053019  
##                                        Mean   :408346829  
##                                        3rd Qu.:707602661  
##                                        Max.   :999015746  
##                                                           
##   SPECIALTY          CLAIM_COUNT       BENE_COUNT       BRAND_COUNT   
##  Length:55918       Min.   :    61   Min.   :   12.0   Min.   :    0  
##  Class :character   1st Qu.:  2339   1st Qu.:  185.0   1st Qu.:  524  
##  Mode  :character   Median :  4889   Median :  308.0   Median : 1079  
##                     Mean   :  6619   Mean   :  353.1   Mean   : 1509  
##                     3rd Qu.:  8841   3rd Qu.:  469.0   3rd Qu.: 1939  
##                     Max.   :120788   Max.   :24164.0   Max.   :50199  
##                                      NA's   :11        NA's   :1      
##  LOWINCOME_COUNT     COST_SUM       
##  Min.   :     0   Min.   :     530  
##  1st Qu.:   564   1st Qu.:  146413  
##  Median :  1405   Median :  298512  
##  Mean   :  2975   Mean   :  401606  
##  3rd Qu.:  3463   3rd Qu.:  531413  
##  Max.   :102445   Max.   :10102993  
##  NA's   :226

Sometimes, we want to use existing data to create new variables. Take the CLAIM_COUNT and BRAND_COUNT fields. We know the total number of claims each doctor has approved and how many of those claims were filled with brand name drugs. We need to divide BRAND_COUNT by CLAIM_COUNT. We can use mutate() to create a field in our data called BRAND_PCT with this ratio.

docs <- docs %>%
  mutate(BRAND_PCT = BRAND_COUNT/CLAIM_COUNT)

Checking our work

we can use glimpse() and str() to take a look at the data field we just created.

glimpse(docs)
## Observations: 55,918
## Variables: 13
## $ NPI             <int> 1548260086, 1659380202, 1497775316, 1164415824...
## $ LAST_NAME       <chr> "ELSBREE", "ERBAY", "ANTIPORDA", "GONZALEZ", "...
## $ FIRST_NAME      <chr> "SCOTT", "CELAL", "GLORIOSA", "CIRILO", "CARLO...
## $ CITY            <chr> "LAKEWOOD RANCH", "GAINESVILLE", "JACKSONVILLE...
## $ STATE           <chr> "FL", "FL", "FL", "FL", "FL", "FL", "FL", "FL"...
## $ ZIP             <int> 342408425, 326053170, 322216650, 331351428, 33...
## $ SPECIALTY       <chr> "Internal Medicine", "Internal Medicine", "Int...
## $ CLAIM_COUNT     <int> 24512, 15921, 24682, 47015, 14656, 18619, 2268...
## $ BENE_COUNT      <int> 1026, 474, 1340, 2082, 619, 861, 802, 802, 818...
## $ BRAND_COUNT     <int> 6466, 4551, 3832, 9530, 3653, 4509, 3821, 6602...
## $ LOWINCOME_COUNT <int> 11024, 13750, 10634, 25596, 9502, 11211, 8522,...
## $ COST_SUM        <dbl> 1484515.8, 1013012.5, 964914.3, 957790.5, 9508...
## $ BRAND_PCT       <dbl> 0.2637892, 0.2858489, 0.1552548, 0.2027013, 0....

str(docs)
## Classes 'tbl_df', 'tbl' and 'data.frame':    55918 obs. of  13 variables:
##  $ NPI            : int  1548260086 1659380202 1497775316 1164415824 1093735268 1194925560 1609897842 1144212291 1205949831 1548468655 ...
##  $ LAST_NAME      : chr  "ELSBREE" "ERBAY" "ANTIPORDA" "GONZALEZ" ...
##  $ FIRST_NAME     : chr  "SCOTT" "CELAL" "GLORIOSA" "CIRILO" ...
##  $ CITY           : chr  "LAKEWOOD RANCH" "GAINESVILLE" "JACKSONVILLE" "MIAMI" ...
##  $ STATE          : chr  "FL" "FL" "FL" "FL" ...
##  $ ZIP            : int  342408425 326053170 322216650 331351428 330103115 333084608 33525 323084573 334429473 35640 ...
##  $ SPECIALTY      : chr  "Internal Medicine" "Internal Medicine" "Internal Medicine" "Internal Medicine" ...
##  $ CLAIM_COUNT    : int  24512 15921 24682 47015 14656 18619 22684 27849 11075 6142 ...
##  $ BENE_COUNT     : int  1026 474 1340 2082 619 861 802 802 818 289 ...
##  $ BRAND_COUNT    : int  6466 4551 3832 9530 3653 4509 3821 6602 2720 1011 ...
##  $ LOWINCOME_COUNT: int  11024 13750 10634 25596 9502 11211 8522 21078 292 2664 ...
##  $ COST_SUM       : num  1484516 1013012 964914 957790 950820 ...
##  $ BRAND_PCT      : num  0.264 0.286 0.155 0.203 0.249 ...

Overwriting variables

While the ratio is correct, if you’d prefer to have it look more like a percentage by multiplying the ratio by 100. We can simply overwrite the existing data we just created.

docs <- docs %>%
  mutate(BRAND_PCT = (BRAND_COUNT/CLAIM_COUNT)*100)

glimpse(docs)
## Observations: 55,918
## Variables: 13
## $ NPI             <int> 1548260086, 1659380202, 1497775316, 1164415824...
## $ LAST_NAME       <chr> "ELSBREE", "ERBAY", "ANTIPORDA", "GONZALEZ", "...
## $ FIRST_NAME      <chr> "SCOTT", "CELAL", "GLORIOSA", "CIRILO", "CARLO...
## $ CITY            <chr> "LAKEWOOD RANCH", "GAINESVILLE", "JACKSONVILLE...
## $ STATE           <chr> "FL", "FL", "FL", "FL", "FL", "FL", "FL", "FL"...
## $ ZIP             <int> 342408425, 326053170, 322216650, 331351428, 33...
## $ SPECIALTY       <chr> "Internal Medicine", "Internal Medicine", "Int...
## $ CLAIM_COUNT     <int> 24512, 15921, 24682, 47015, 14656, 18619, 2268...
## $ BENE_COUNT      <int> 1026, 474, 1340, 2082, 619, 861, 802, 802, 818...
## $ BRAND_COUNT     <int> 6466, 4551, 3832, 9530, 3653, 4509, 3821, 6602...
## $ LOWINCOME_COUNT <int> 11024, 13750, 10634, 25596, 9502, 11211, 8522,...
## $ COST_SUM        <dbl> 1484515.8, 1013012.5, 964914.3, 957790.5, 9508...
## $ BRAND_PCT       <dbl> 26.37892, 28.58489, 15.52548, 20.27013, 24.924...

Filtering data with dplyr

Now this has been interesting, but in many cases we want to pull out specific pieces of our data. What if we’re only interested in doctors in Colorado. We can do this with the filter() function, which is part of the dplyr package. Remember, if this is our first time using dplyr this session we need to load it using library() first.

library(dplyr)

Filtering

This is how the filter function works. data %>% filter(variable you want to filter on). If we do this, it will simply print the results in the Console. We can also assign the results to a new dataframe. To find only doctors from Colorado:

colorado_docs <- docs %>%
  filter(STATE == "CO")

You don’t have to be limited by one variable. Earlier we looked at high-cost prescribers. Perhaps, we’d like to do that just for Colorado doctors. We can use the & to to look at all Colorado doctors who have

BOOLEANS

In R we can use two types of Boolean characters

AND represented by &

OR represented by |

& limits our filter because both criteria must be met. | broadens our filter. If either criteria is met the data is included.

Filtering using AND

The above code shows doctors just from Colorado who also exceeded $500,000 in costs. Below would give doctors who either are from Colorado or have a cost that exceeds $500,000 and could be from any state.

colorado_high <- docs %>%
  filter(STATE == "CO" & COST_SUM > 500000)

Now we can view the new data set we created in our Environment or order by using arrange() which you learned earlier

colorado_high %>% 
  arrange(desc(COST_SUM))
## # A tibble: 80 x 13
##           NPI  LAST_NAME FIRST_NAME       CITY STATE       ZIP
##         <int>      <chr>      <chr>      <chr> <chr>     <int>
##  1 1891724027  CHEBANOVA      ELENA     DENVER    CO 802472307
##  2 1770564437  ANISIMOVA      ELENA     DENVER    CO 802107009
##  3 1477550960     FIERER     ROBERT    BOULDER    CO 803057182
##  4 1164583324 MAHALINGAM     INDIRA BROOMFIELD    CO 800209541
##  5 1881622629   SCHWARTZ       MARK     PUEBLO    CO 810013754
##  6 1730166687   BERNTSEN       MARK    GREELEY    CO 806315114
##  7 1770526881       BLUM     JOSHUA     DENVER    CO 802053003
##  8 1295014710     ABUABA     ROMANO     AURORA    CO 800145476
##  9 1124134697      PINES      IRINA     DENVER    CO 802044507
## 10 1235134008    BORISOV       IGOR     DENVER    CO 802203920
## # ... with 70 more rows, and 7 more variables: SPECIALTY <chr>,
## #   CLAIM_COUNT <int>, BENE_COUNT <int>, BRAND_COUNT <int>,
## #   LOWINCOME_COUNT <int>, COST_SUM <dbl>, BRAND_PCT <dbl>

Grouping with dplyr

That’s helpful, but what if you want to get a sense of the average salary or total number of doctors in each state. In Excel you might use a PivotTable, in SQL you’d used a “Group By” query. In dplyr the syntax is very similar to SQL with the group_by() and summarise() functions.

You’ll do this in two steps. First, decide which variables to put in to the group_by function.

state_group <- docs %>% 
  group_by(STATE)

Then use what you just created to summarise our data and then view the result.

state_docs <- state_group %>%
  summarise(count = n(),
            median_cost = median(COST_SUM)
            )

state_docs
## # A tibble: 64 x 3
##    STATE count median_cost
##    <chr> <int>       <dbl>
##  1    AE     1    93738.55
##  2    AK    55   195571.65
##  3    AL   789   498793.89
##  4    AR   260   404949.92
##  5    AZ   946   266389.42
##  6    CA  6784   243460.89
##  7    CO   676   229384.89
##  8    CT  1024   306055.54
##  9    DC   192   130242.49
## 10    DE   157   349815.91
## # ... with 54 more rows

Summarizing

You can add as many summary statistics as you want.

state_docs <- state_group %>%
  summarise(count = n(),
            median_cost = median(COST_SUM),
            total_cost = sum(COST_SUM),
            high_cost = max(COST_SUM),
            sd_cost = sd(COST_SUM),
            median_scripts = sum(CLAIM_COUNT),
            total_scripts = sum(CLAIM_COUNT)
            )

Grouping by multiple variables

If you decided you wanted to group by more than one variable, just add it to the function, separating each variable by a comma. To group by city and state would be:

state_group <- docs %>%
  group_by(CITY, STATE)

Then, just re-run the summarise function.

state_docs <- state_group %>%
  summarise(count = n(),
            median_cost = median(COST_SUM),
            total_cost = sum(COST_SUM),
            high_cost = max(COST_SUM),
            sd_cost = sd(COST_SUM),
            median_scripts = sum(CLAIM_COUNT),
            total_scripts = sum(CLAIM_COUNT)
            )

Chaining multiple operations together

You can also accomplish the same thing in one “step” by chaining together multiple operations with the pipe.

state_docs <- docs %>%
  group_by(CITY, STATE) %>%
  summarise(count = n(),
            median_cost = median(COST_SUM),
            total_cost = sum(COST_SUM),
            high_cost = max(COST_SUM),
            sd_cost = sd(COST_SUM),
            median_scripts = sum(CLAIM_COUNT),
            total_scripts = sum(CLAIM_COUNT)
            )

Questions

Other syntaxes

We’ve shown you several R functions, but keep in mind. If you Google for an answer, you may see different ways to do things. The syntax for R’s base package requires users to use a $ to separate the data frame from the variable:

[data_frame]$[variable]

So for example, if you want to find the median of the CLAIM_COUNT in the colorado_high data frame:

median(colorado_high$CLAIM_COUNT)
## [1] 10267

Working with dates

# install.packages("lubridate")
# install.packages("dplyr")
library(lubridate)
library(dplyr)

Data with dates

The economics dataset also comes with ggplot2, and is time series data. The observations are months, and the variables are economic indicators like the number of unemployed people. Because this is data from a package, you can read about it by using the ?.

?economics
econ <- economics
str(econ)
## Classes 'tbl_df', 'tbl' and 'data.frame':    574 obs. of  6 variables:
##  $ date    : Date, format: "1967-07-01" "1967-08-01" ...
##  $ pce     : num  507 510 516 513 518 ...
##  $ pop     : int  198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ...
##  $ psavert : num  12.5 12.5 11.7 12.5 12.5 12.1 11.7 12.2 11.6 12.2 ...
##  $ uempmed : num  4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
##  $ unemploy: int  2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...
econ <- econ %>%
  mutate(date = ymd(date)) 
old <- econ %>%
  filter(year(date)<"1970")
old <- old %>%
  mutate(date = ymd(date))

p3 <- ggplot(data=old) + geom_line(aes(x=date, y=pce))
p3

For nicer date formatting

# install.packages("scales") 
library(scales)

p3 + 
  scale_x_date(breaks = date_breaks("3 months"), 
                   labels = date_format("%b %Y"))

p3 + 
  scale_x_date(breaks = date_breaks("6 months"), 
                   labels = date_format("%B %y"))

Base R plotting

Similarly, the base package comes with it’s own charting library. So we can create a histogram hist()

hist(colorado_high$CLAIM_COUNT)

or simply plot values plot(x, y):

plot(colorado_high$CLAIM_COUNT, colorado_high$BRAND_COUNT)

R as a calculator

2+4
## [1] 6

R follows standard mathematical order of operations (PEMDAS). So, this:

2+4*3^2
## [1] 38

gives you a different result than this:

(2+4)*3^2
## [1] 54