Amelia McNamara
12/4/2017
There are several places you may want to start:
R has a bit of a learning curve but once you get used to it, you’re off to the races. Keep in mind you will get a ton of errors while you’re learning R — in fact, you’ll still deal with errors even if you’re a master user! Google is your friend — copy and paste the error into Google and see what others have written.
PRO TIP: R is VERY VERY VERY case-sensitive. A misplaced capital letter or double-quote where there should be a single-quote might just be the culprit to your error.
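For example, a quick illustration you can try in the Console:
mean(c(1, 2, 3))  # works, returns 2
Mean(c(1, 2, 3))  # Error: could not find function "Mean"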
And don’t forget — save EVERYTHING! RStudio allows you to save your data, your scripts, your history, your entire RStudio workspace — save it all and save often.
I’ve had our server administrator make you all RStudio server accounts. To log in, go to rstudio.smith.edu and log in with your Smith username (mine is amcnamara) and the same password you use for Bannerweb, your email, etc.
RStudio is a great interface to the R language. The interface has 4 panes.
You’ll do your coding in the Console (lower left, looks like Terminal/command line).
The Environment and History (upper right) panes keep track of everything you have loaded into your environment (mostly, data) and all the R commands you have ever typed. This is great because you can search through your past code to find commands you may have forgotten.
The Source pane is on the upper left and displays your data tables, as well as any documents you are reading or editing (maybe, you’re looking at this document there!).
The Files/Plots/Packages/Help/View pane on the lower right does a bunch of stuff — holds files, displays the graphs you’ve made, shows the packages you have installed, and more. Perhaps the most important tab of that lower right pane is Help.
R has two help functions. Forget what an average or mean is? Try this.
?mean
or
help(mean)
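If you don’t know a function’s exact name, the related ?? operator searches all the help files for a phrase:
??average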
R has “base” functionality that is built into the language, but most of the good stuff comes from user-written “packages” that extend the language. We’ll be focused on the “tidyverse” syntax here, mostly using ggplot2 and dplyr.
Because R is open source (and was written by statisticians…) it has some strange features compared to other programming languages. The most obvious is that its syntax is not consistent. Almost any task you want to do, there are three ways to accomplish. And packages can contribute to this nonsense by providing their own syntax. We’ll try to point to places where the code we are showing you is just one of many ways to do something, but know that when you start googling questions you may see things that don’t look familiar. Don’t be afraid!
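For example, here are three ways to get the mean of a vector (a small illustration; any of them is fine):
x <- c(1, 2, 3, 4)
mean(x)               # the built-in function
sum(x) / length(x)    # the arithmetic by hand
summary(x)[["Mean"]]  # pulled out of summary()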
To use a package, you have to load it into your session with the library() command. Let’s start by loading some packages:
library(dplyr)
Later in this class, we will show you how to load in external data, but for now we are going to begin by playing with some data that comes with R. This will allow us to start playing quickly.
We’ll start with some data about midwest demographics. To get started, you can load the data with the data() command, and see the first few rows with glimpse().
data(midwest)
glimpse(midwest)
## Observations: 437
## Variables: 28
## $ PID <int> 561, 562, 563, 564, 565, 566, 567, 568, 5...
## $ county <chr> "ADAMS", "ALEXANDER", "BOND", "BOONE", "B...
## $ state <chr> "IL", "IL", "IL", "IL", "IL", "IL", "IL",...
## $ area <dbl> 0.052, 0.014, 0.022, 0.017, 0.018, 0.050,...
## $ poptotal <int> 66090, 10626, 14991, 30806, 5836, 35688, ...
## $ popdensity <dbl> 1270.9615, 759.0000, 681.4091, 1812.1176,...
## $ popwhite <int> 63917, 7054, 14477, 29344, 5264, 35157, 5...
## $ popblack <int> 1702, 3496, 429, 127, 547, 50, 1, 111, 16...
## $ popamerindian <int> 98, 19, 35, 46, 14, 65, 8, 30, 8, 331, 51...
## $ popasian <int> 249, 48, 16, 150, 5, 195, 15, 61, 23, 803...
## $ popother <int> 124, 9, 34, 1139, 6, 221, 0, 84, 6, 1596,...
## $ percwhite <dbl> 96.71206, 66.38434, 96.57128, 95.25417, 9...
## $ percblack <dbl> 2.57527614, 32.90043290, 2.86171703, 0.41...
## $ percamerindan <dbl> 0.14828264, 0.17880670, 0.23347342, 0.149...
## $ percasian <dbl> 0.37675897, 0.45172219, 0.10673071, 0.486...
## $ percother <dbl> 0.18762294, 0.08469791, 0.22680275, 3.697...
## $ popadults <int> 43298, 6724, 9669, 19272, 3979, 23444, 35...
## $ perchsd <dbl> 75.10740, 59.72635, 69.33499, 75.47219, 6...
## $ percollege <dbl> 19.63139, 11.24331, 17.03382, 17.27895, 1...
## $ percprof <dbl> 4.355859, 2.870315, 4.488572, 4.197800, 3...
## $ poppovertyknown <int> 63628, 10529, 14235, 30337, 4815, 35107, ...
## $ percpovertyknown <dbl> 96.27478, 99.08714, 94.95697, 98.47757, 8...
## $ percbelowpoverty <dbl> 13.151443, 32.244278, 12.068844, 7.209019...
## $ percchildbelowpovert <dbl> 18.011717, 45.826514, 14.036061, 11.17953...
## $ percadultpoverty <dbl> 11.009776, 27.385647, 10.852090, 5.536013...
## $ percelderlypoverty <dbl> 12.443812, 25.228976, 12.697410, 6.217047...
## $ inmetro <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,...
## $ category <chr> "AAR", "LHR", "AAR", "ALU", "AAR", "AAR",...
For a spreadsheet-like experience of your data, just click on its name in your Environment tab to see a preview of the data in the Source pane. Note that you can only look, you can’t “touch.” R is a programming language, so if you see values you want to change, or you need to add another variable based on an existing one, you have to do so programmatically.
If you want to read about all the variables that are included in the dataset, you can run
?midwest
or
help(midwest)
A few more handy functions that have to do with datasets are str(), which tells you about the structure of your data,
str(midwest)
## Classes 'tbl_df', 'tbl' and 'data.frame': 437 obs. of 28 variables:
## $ PID : int 561 562 563 564 565 566 567 568 569 570 ...
## $ county : chr "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
## $ state : chr "IL" "IL" "IL" "IL" ...
## $ area : num 0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
## $ poptotal : int 66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
## $ popdensity : num 1271 759 681 1812 324 ...
## $ popwhite : int 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
## $ popblack : int 1702 3496 429 127 547 50 1 111 16 16559 ...
## $ popamerindian : int 98 19 35 46 14 65 8 30 8 331 ...
## $ popasian : int 249 48 16 150 5 195 15 61 23 8033 ...
## $ popother : int 124 9 34 1139 6 221 0 84 6 1596 ...
## $ percwhite : num 96.7 66.4 96.6 95.3 90.2 ...
## $ percblack : num 2.575 32.9 2.862 0.412 9.373 ...
## $ percamerindan : num 0.148 0.179 0.233 0.149 0.24 ...
## $ percasian : num 0.3768 0.4517 0.1067 0.4869 0.0857 ...
## $ percother : num 0.1876 0.0847 0.2268 3.6973 0.1028 ...
## $ popadults : int 43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
## $ perchsd : num 75.1 59.7 69.3 75.5 68.9 ...
## $ percollege : num 19.6 11.2 17 17.3 14.5 ...
## $ percprof : num 4.36 2.87 4.49 4.2 3.37 ...
## $ poppovertyknown : int 63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
## $ percpovertyknown : num 96.3 99.1 95 98.5 82.5 ...
## $ percbelowpoverty : num 13.15 32.24 12.07 7.21 13.52 ...
## $ percchildbelowpovert: num 18 45.8 14 11.2 13 ...
## $ percadultpoverty : num 11.01 27.39 10.85 5.54 11.14 ...
## $ percelderlypoverty : num 12.44 25.23 12.7 6.22 19.2 ...
## $ inmetro : int 0 0 0 1 0 0 0 0 0 1 ...
## $ category : chr "AAR" "LHR" "AAR" "ALU" ...
and summary(), which provides a summary of each variable.
summary(midwest)
## PID county state area
## Min. : 561 Length:437 Length:437 Min. :0.00500
## 1st Qu.: 670 Class :character Class :character 1st Qu.:0.02400
## Median :1221 Mode :character Mode :character Median :0.03000
## Mean :1437 Mean :0.03317
## 3rd Qu.:2059 3rd Qu.:0.03800
## Max. :3052 Max. :0.11000
## poptotal popdensity popwhite popblack
## Min. : 1701 Min. : 85.05 Min. : 416 Min. : 0
## 1st Qu.: 18840 1st Qu.: 622.41 1st Qu.: 18630 1st Qu.: 29
## Median : 35324 Median : 1156.21 Median : 34471 Median : 201
## Mean : 96130 Mean : 3097.74 Mean : 81840 Mean : 11024
## 3rd Qu.: 75651 3rd Qu.: 2330.00 3rd Qu.: 72968 3rd Qu.: 1291
## Max. :5105067 Max. :88018.40 Max. :3204947 Max. :1317147
## popamerindian popasian popother percwhite
## Min. : 4.0 Min. : 0 Min. : 0 Min. :10.69
## 1st Qu.: 44.0 1st Qu.: 35 1st Qu.: 20 1st Qu.:94.89
## Median : 94.0 Median : 102 Median : 66 Median :98.03
## Mean : 343.1 Mean : 1310 Mean : 1613 Mean :95.56
## 3rd Qu.: 288.0 3rd Qu.: 401 3rd Qu.: 345 3rd Qu.:99.07
## Max. :10289.0 Max. :188565 Max. :384119 Max. :99.82
## percblack percamerindan percasian percother
## Min. : 0.0000 Min. : 0.05623 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.1157 1st Qu.: 0.15793 1st Qu.:0.1737 1st Qu.:0.09102
## Median : 0.5390 Median : 0.21502 Median :0.2972 Median :0.17844
## Mean : 2.6763 Mean : 0.79894 Mean :0.4872 Mean :0.47906
## 3rd Qu.: 2.6014 3rd Qu.: 0.38362 3rd Qu.:0.5212 3rd Qu.:0.48050
## Max. :40.2100 Max. :89.17738 Max. :5.0705 Max. :7.52427
## popadults perchsd percollege percprof
## Min. : 1287 Min. :46.91 Min. : 7.336 Min. : 0.5203
## 1st Qu.: 12271 1st Qu.:71.33 1st Qu.:14.114 1st Qu.: 2.9980
## Median : 22188 Median :74.25 Median :16.798 Median : 3.8142
## Mean : 60973 Mean :73.97 Mean :18.273 Mean : 4.4473
## 3rd Qu.: 47541 3rd Qu.:77.20 3rd Qu.:20.550 3rd Qu.: 4.9493
## Max. :3291995 Max. :88.90 Max. :48.079 Max. :20.7913
## poppovertyknown percpovertyknown percbelowpoverty percchildbelowpovert
## Min. : 1696 Min. :80.90 Min. : 2.180 Min. : 1.919
## 1st Qu.: 18364 1st Qu.:96.89 1st Qu.: 9.199 1st Qu.:11.624
## Median : 33788 Median :98.17 Median :11.822 Median :15.270
## Mean : 93642 Mean :97.11 Mean :12.511 Mean :16.447
## 3rd Qu.: 72840 3rd Qu.:98.60 3rd Qu.:15.133 3rd Qu.:20.352
## Max. :5023523 Max. :99.86 Max. :48.691 Max. :64.308
## percadultpoverty percelderlypoverty inmetro category
## Min. : 1.938 Min. : 3.547 Min. :0.0000 Length:437
## 1st Qu.: 7.668 1st Qu.: 8.912 1st Qu.:0.0000 Class :character
## Median :10.008 Median :10.869 Median :0.0000 Mode :character
## Mean :10.919 Mean :11.389 Mean :0.3432
## 3rd Qu.:13.182 3rd Qu.:13.412 3rd Qu.:1.0000
## Max. :43.312 Max. :31.162 Max. :1.0000
You can also get the list of variable names using the names() function,
names(midwest)
## [1] "PID" "county" "state"
## [4] "area" "poptotal" "popdensity"
## [7] "popwhite" "popblack" "popamerindian"
## [10] "popasian" "popother" "percwhite"
## [13] "percblack" "percamerindan" "percasian"
## [16] "percother" "popadults" "perchsd"
## [19] "percollege" "percprof" "poppovertyknown"
## [22] "percpovertyknown" "percbelowpoverty" "percchildbelowpovert"
## [25] "percadultpoverty" "percelderlypoverty" "inmetro"
## [28] "category"
RStudio has tab-completion, which can save you a lot of typing. If you type
mea
and hit tab, it will suggest mean() among other things. If you type mean( and hit tab, it will tell you the other arguments you can use for the mean() function.
One more thing to watch for: the Console prompt changes from the > at the beginning of the line (which means it is waiting for a new command) to the + at the beginning of the line (which means it is waiting for you to finish a command). To get out, hit the Escape key.
Of course, what we’re really interested in is making data visualizations!
There are many ways to make graphics in R.
We’ll focus on ggplot2, which implements the Grammar of Graphics we’ve talked about in this class.
Remember our retinal variables?
The grammar of graphics gives you a layered way to build up graphics, mapping data to aesthetics.
ggplot2 is an R package by Hadley Wickham that lets you make beautiful R graphics (relatively) easily.
To use it, we need to load the package,
library(ggplot2)
To start, I’m going to use the diamonds data that comes with ggplot2,
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
qplot(x=carat, data=diamonds)
As a function qplot basically works like this:
qplot([x_variable], [y_variable], data = [data_frame], geom = "[geom_type]")
(Remember, use ?qplot or help(qplot) if you need help.) The function makes assumptions about the best type of chart to use. In the above example, because you only passed one numeric variable, it assumed a histogram was what was needed. We could explicitly tell it to do that.
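A minimal sketch of the explicit version (same diamonds data as above):
qplot(x=carat, data=diamonds, geom="histogram")
If you instead pass a categorical variable and a fill aesthetic, qplot produces a stacked bar chart: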
qplot(x=clarity, fill=cut, data=diamonds)
But, in order to really harness the power of ggplot2 you need to use the more general ggplot() command. The idea of the package is that you can “layer” pieces on top of a plot to build it up over time.
You always need to use a ggplot() call to initialize the plot. I usually put my dataset in here, and at least some of my “aesthetics.” But, one of the things that can make ggplot2 tough to understand is that there are no hard and fast rules.
p1 <- ggplot(aes(x=clarity, fill=cut), data=diamonds)
If you try to show p1 at this point, you will get “Error: No layers in plot.” This is because we haven’t given it any geometric objects yet.
In order to get a plot to work, you need to use “geoms” (geometric objects) to specify the way you want your variables mapped to graphical parameters.
p1 + geom_bar()
p1 + geom_bar(position="dodge")
p1 + geom_bar(position="fill")
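Incidentally, the same dodged bar chart could be written with the aesthetics inside the geom instead; this is one of the “many ways to do something” mentioned earlier:
ggplot(data=diamonds) +
  geom_bar(aes(x=clarity, fill=cut), position="dodge")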
p2 <- ggplot(aes(x=carat, y=depth), data=diamonds)
p2 + geom_point()
p2 + geom_bin2d()
Notice that I’m not saving these geom layers; I’m just running p2 + [something] to see what happens. But, I can save the new version to start building up my plot,
p2 <- p2 + geom_bin2d()
p2 <- p2 + xlab("Carat") + ylab("Depth") +
guides(fill=guide_legend(title="Number of diamonds"))
p2
You can also adjust the fill scale, for example changing the legend breaks or putting the counts on a log scale:
p2 + scale_fill_continuous(breaks=c(1500, 2500, 3500,4500))
p2 + scale_fill_continuous(trans="log")
p2 + scale_fill_continuous(trans="log", breaks=c(1,10,100,1000,5000))
Facets split a plot into small multiples based on a categorical variable:
p2 + facet_wrap(~color)
p2 + facet_grid(color~clarity)
Here’s a more involved example of layering. It uses a trial2 dataset and a colorsB color vector that aren’t defined in this document, so it won’t run as-is, but it shows how many pieces you can stack together:
baseplot <- ggplot() +
aes(x=citystate, y=Freq, fill = Response, order=Response) +
facet_wrap(~year, nrow=3)+geom_bar(data = trial2$neg, stat = "identity") +
scale_fill_manual(breaks=c("Not at all satisfied", "2", "3", "4",
"Extremely satisfied"), values=colorsB,
name="Response") +
geom_bar(data = trial2$pos, stat = "identity") + coord_flip() +
ggtitle("Community satisfaction") + xlab("") +ylab("") +
scale_y_continuous(limits=c(-0.5, 1),
breaks=seq(from=-0.5, to=0.75, by=0.25),
labels=c("50%", "25%", "0", "25%", "50%", "75%")) +
theme(legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text=element_text(size=14),
strip.text=element_text(size=14))
baseplot
# install.packages("RColorBrewer")
library(RColorBrewer)
display.brewer.all(n=5, type="qual")
p1 + geom_bar(position="dodge") + scale_fill_brewer(palette="Dark2")
p1 <- p1 + geom_bar(position="dodge") + scale_fill_brewer(palette="Dark2")
To save a plot to a file, use ggsave():
ggsave(p1, filename="mycoolplot.jpg")
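You can also control the output size; ggsave() accepts width and height arguments (in inches by default):
ggsave(p1, filename="mycoolplot.pdf", width=6, height=4)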
Now, I want you to try some of your new skills on the American Time Use Survey data.
Most of the time, you won’t be working with data sets available through packages and will need to import data into R. RStudio makes this very simple. On the Environment pane (upper right corner), the Import Dataset icon provides a step-by-step process for importing files. Following those steps will not only import the data, but also print out the code in the Console showing how it was done.
library(readr)
atus <- read_csv("atus.csv")
To import this data set, we used the read_csv() function. The key part of this command is making sure that you direct it to where the file is located on your computer. Check ?read_csv if you have questions about its other functionality.
Note: The way file directories are referenced in R varies between a Mac and a PC. Mac file paths look like “~/file_location”, while PC paths look like “C:/file_location”.
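For example, with hypothetical file locations:
# Mac
atus <- read_csv("~/Downloads/atus.csv")
# PC (note that R uses forward slashes, even on Windows)
atus <- read_csv("C:/Users/yourname/Downloads/atus.csv")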
You can load it in the point-and-click way, or do it programmatically.
Make a stacked bar chart of the atus data, but change the color scheme.
Make a scatterplot of the atus data, but remove the grey-and-white background.
Make a plot that helps you determine which state has the most veterans.
Are most veterans married or not?
So far, all the plots we have seen were static. There are many ways to make interactive plots in R, but one of the best is using the Plotly API. (This is where their developers spend a lot of time!)
To use Plotly in R, you need to install the Plotly package
install.packages("plotly")
Then, you can load it into your session
library(plotly)
If you are familiar with ggplot, a nice extension is ggplotly. You can take an existing ggplot object and wrap it in ggplotly() to make it a little interactive.
p3 <- ggplotly(p1)
If you want to see the plots online, you will need to set up your API credentials in R; the Plotly documentation explains how. You need to fill in your username and API key:
Sys.setenv("plotly_username"="AmeliaMN")
Sys.setenv("plotly_api_key"="your_api_key")
api_create(p3, filename = "barchart")
If you want to get fancier interactivity, you can use the more complex plotly graphing commands. They have a slightly different syntax.
p <- plot_ly(midwest, x = ~percollege, color = ~state, type = "box")
There is much more documentation about these functions online.
Another part of exploratory data analysis is finding summary statistics. We’ll be using dplyr to do “tidy” computations, so we’ll use the pipe (%>%) to send data into a summarize() command.
midwest %>%
summarize(mean(popblack))
## # A tibble: 1 x 1
## `mean(popblack)`
## <dbl>
## 1 11023.88
Other than mean(), you can try median(), min(), max(), sd(), and many more.
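For example, you can compute several of these at once (the column names here are just illustrative):
midwest %>%
  summarize(median_popblack = median(popblack),
            min_popblack = min(popblack),
            max_popblack = max(popblack),
            sd_popblack = sd(popblack))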
If you have two variables and want to see their relationship, you can group_by() one and then summarize. For example,
midwest %>%
group_by(inmetro) %>%
summarize(mean(popblack))
## # A tibble: 2 x 2
## inmetro `mean(popblack)`
## <int> <dbl>
## 1 0 503.8711
## 2 1 31152.1667
For categorical data, n() is useful,
midwest %>%
group_by(inmetro) %>%
summarize(n=n())
## # A tibble: 2 x 2
## inmetro n
## <int> <int>
## 1 0 287
## 2 1 150
And, n() can be used to count two variables together,
midwest %>%
group_by(inmetro, category) %>%
summarize(n=n())
## # A tibble: 16 x 3
## # Groups: inmetro [?]
## inmetro category n
## <int> <chr> <int>
## 1 0 AAR 193
## 2 0 AHR 16
## 3 0 ALR 11
## 4 0 HAR 6
## 5 0 HHR 1
## 6 0 HLR 2
## 7 0 LAR 30
## 8 0 LHR 28
## 9 1 AAU 77
## 10 1 AHU 1
## 11 1 ALU 20
## 12 1 HAU 20
## 13 1 HHU 1
## 14 1 HLU 26
## 15 1 LAU 3
## 16 1 LHU 2
Maybe we’re working on a story about the county in the midwest that has the largest percentage of black people. We can use the arrange() function to see this.
midwest %>%
arrange(percblack)
## # A tibble: 437 x 28
## PID county state area poptotal popdensity popwhite popblack
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 3020 MENOMINEE WI 0.021 3890 185.2381 416 0
## 2 753 WHITE IN 0.030 23265 775.5000 23127 2
## 3 600 JASPER IL 0.029 10609 365.8276 10574 1
## 4 3041 TAYLOR WI 0.057 18901 331.5965 18807 2
## 5 1256 MONTMORENCY MI 0.033 8936 270.7879 8861 1
## 6 748 WARREN IN 0.021 8176 389.3333 8140 1
## 7 717 MORGAN IN 0.024 55920 2330.0000 55635 9
## 8 3006 IRON WI 0.047 6153 130.9149 6121 1
## 9 646 SCOTT IL 0.015 5644 376.2667 5634 1
## 10 567 CALHOUN IL 0.017 5322 313.0588 5298 1
## # ... with 427 more rows, and 20 more variables: popamerindian <int>,
## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>
But, this is sorting in increasing order, and we want decreasing order. We could look in the documentation for arrange() to see how to change this.
midwest %>%
arrange(desc(percblack))
## # A tibble: 437 x 28
## PID county state area poptotal popdensity popwhite popblack
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 1278 WAYNE MI 0.035 2111687 60333.9143 1212007 849109
## 2 562 ALEXANDER IL 0.014 10626 759.0000 7054 3496
## 3 637 PULASKI IL 0.011 7523 683.9091 5032 2466
## 4 642 ST CLAIR IL 0.040 262852 6571.3000 187866 71275
## 5 576 COOK IL 0.058 5105067 88018.3966 3204947 1317147
## 6 2026 CUYAHOGA OH 0.026 1412140 54313.0769 1025756 350185
## 7 707 LAKE IN 0.030 475594 15853.1333 334203 116688
## 8 711 MARION IN 0.023 797159 34659.0870 615039 169654
## 9 2039 HAMILTON OH 0.025 866228 34649.1200 672972 181145
## 10 3021 MILWAUKEE WI 0.015 959275 63951.6667 718918 195470
## # ... with 427 more rows, and 20 more variables: popamerindian <int>,
## # popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
## # percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>
Now let’s try these skills on some external data: a file of Medicare Part D prescribers.
library(readr)
docs <- read_csv("PartD_Providers.csv")
Let’s see what our data looks like to make sure it imported correctly. Remember, the str() command tells you the structure of your data.
str(docs)
## Classes 'tbl_df', 'tbl' and 'data.frame': 55918 obs. of 12 variables:
## $ NPI : int 1548260086 1659380202 1497775316 1164415824 1093735268 1194925560 1609897842 1144212291 1205949831 1548468655 ...
## $ LAST_NAME : chr "ELSBREE" "ERBAY" "ANTIPORDA" "GONZALEZ" ...
## $ FIRST_NAME : chr "SCOTT" "CELAL" "GLORIOSA" "CIRILO" ...
## $ CITY : chr "LAKEWOOD RANCH" "GAINESVILLE" "JACKSONVILLE" "MIAMI" ...
## $ STATE : chr "FL" "FL" "FL" "FL" ...
## $ ZIP : int 342408425 326053170 322216650 331351428 330103115 333084608 33525 323084573 334429473 35640 ...
## $ SPECIALTY : chr "Internal Medicine" "Internal Medicine" "Internal Medicine" "Internal Medicine" ...
## $ CLAIM_COUNT : int 24512 15921 24682 47015 14656 18619 22684 27849 11075 6142 ...
## $ BENE_COUNT : int 1026 474 1340 2082 619 861 802 802 818 289 ...
## $ BRAND_COUNT : int 6466 4551 3832 9530 3653 4509 3821 6602 2720 1011 ...
## $ LOWINCOME_COUNT: int 11024 13750 10634 25596 9502 11211 8522 21078 292 2664 ...
## $ COST_SUM : num 1484516 1013012 964914 957790 950820 ...
The data appears to have imported correctly for each field.
We can move on to some basic summary statistics.
summary(docs)
## NPI LAST_NAME FIRST_NAME
## Min. :1.003e+09 Length:55918 Length:55918
## 1st Qu.:1.245e+09 Class :character Class :character
## Median :1.498e+09 Mode :character Mode :character
## Mean :1.500e+09
## 3rd Qu.:1.750e+09
## Max. :1.993e+09
##
## CITY STATE ZIP
## Length:55918 Length:55918 Min. : 601
## Class :character Class :character 1st Qu.:104573360
## Mode :character Mode :character Median :338053019
## Mean :408346829
## 3rd Qu.:707602661
## Max. :999015746
##
## SPECIALTY CLAIM_COUNT BENE_COUNT BRAND_COUNT
## Length:55918 Min. : 61 Min. : 12.0 Min. : 0
## Class :character 1st Qu.: 2339 1st Qu.: 185.0 1st Qu.: 524
## Mode :character Median : 4889 Median : 308.0 Median : 1079
## Mean : 6619 Mean : 353.1 Mean : 1509
## 3rd Qu.: 8841 3rd Qu.: 469.0 3rd Qu.: 1939
## Max. :120788 Max. :24164.0 Max. :50199
## NA's :11 NA's :1
## LOWINCOME_COUNT COST_SUM
## Min. : 0 Min. : 530
## 1st Qu.: 564 1st Qu.: 146413
## Median : 1405 Median : 298512
## Mean : 2975 Mean : 401606
## 3rd Qu.: 3463 3rd Qu.: 531413
## Max. :102445 Max. :10102993
## NA's :226
Sometimes, we want to use existing data to create new variables. Take the CLAIM_COUNT and BRAND_COUNT fields. We know the total number of claims each doctor has approved and how many of those claims were filled with brand-name drugs. To get the share of brand-name claims, we divide BRAND_COUNT by CLAIM_COUNT. We can use mutate() to create a field in our data called BRAND_PCT with this ratio.
docs <- docs %>%
mutate(BRAND_PCT = BRAND_COUNT/CLAIM_COUNT)
We can use glimpse() and str() to take a look at the data field we just created.
glimpse(docs)
## Observations: 55,918
## Variables: 13
## $ NPI <int> 1548260086, 1659380202, 1497775316, 1164415824...
## $ LAST_NAME <chr> "ELSBREE", "ERBAY", "ANTIPORDA", "GONZALEZ", "...
## $ FIRST_NAME <chr> "SCOTT", "CELAL", "GLORIOSA", "CIRILO", "CARLO...
## $ CITY <chr> "LAKEWOOD RANCH", "GAINESVILLE", "JACKSONVILLE...
## $ STATE <chr> "FL", "FL", "FL", "FL", "FL", "FL", "FL", "FL"...
## $ ZIP <int> 342408425, 326053170, 322216650, 331351428, 33...
## $ SPECIALTY <chr> "Internal Medicine", "Internal Medicine", "Int...
## $ CLAIM_COUNT <int> 24512, 15921, 24682, 47015, 14656, 18619, 2268...
## $ BENE_COUNT <int> 1026, 474, 1340, 2082, 619, 861, 802, 802, 818...
## $ BRAND_COUNT <int> 6466, 4551, 3832, 9530, 3653, 4509, 3821, 6602...
## $ LOWINCOME_COUNT <int> 11024, 13750, 10634, 25596, 9502, 11211, 8522,...
## $ COST_SUM <dbl> 1484515.8, 1013012.5, 964914.3, 957790.5, 9508...
## $ BRAND_PCT <dbl> 0.2637892, 0.2858489, 0.1552548, 0.2027013, 0....
str(docs)
## Classes 'tbl_df', 'tbl' and 'data.frame': 55918 obs. of 13 variables:
## $ NPI : int 1548260086 1659380202 1497775316 1164415824 1093735268 1194925560 1609897842 1144212291 1205949831 1548468655 ...
## $ LAST_NAME : chr "ELSBREE" "ERBAY" "ANTIPORDA" "GONZALEZ" ...
## $ FIRST_NAME : chr "SCOTT" "CELAL" "GLORIOSA" "CIRILO" ...
## $ CITY : chr "LAKEWOOD RANCH" "GAINESVILLE" "JACKSONVILLE" "MIAMI" ...
## $ STATE : chr "FL" "FL" "FL" "FL" ...
## $ ZIP : int 342408425 326053170 322216650 331351428 330103115 333084608 33525 323084573 334429473 35640 ...
## $ SPECIALTY : chr "Internal Medicine" "Internal Medicine" "Internal Medicine" "Internal Medicine" ...
## $ CLAIM_COUNT : int 24512 15921 24682 47015 14656 18619 22684 27849 11075 6142 ...
## $ BENE_COUNT : int 1026 474 1340 2082 619 861 802 802 818 289 ...
## $ BRAND_COUNT : int 6466 4551 3832 9530 3653 4509 3821 6602 2720 1011 ...
## $ LOWINCOME_COUNT: int 11024 13750 10634 25596 9502 11211 8522 21078 292 2664 ...
## $ COST_SUM : num 1484516 1013012 964914 957790 950820 ...
## $ BRAND_PCT : num 0.264 0.286 0.155 0.203 0.249 ...
The ratio is correct, but you might prefer it to look more like a percentage by multiplying the ratio by 100. We can simply overwrite the field we just created.
docs <- docs %>%
mutate(BRAND_PCT = (BRAND_COUNT/CLAIM_COUNT)*100)
glimpse(docs)
## Observations: 55,918
## Variables: 13
## $ NPI <int> 1548260086, 1659380202, 1497775316, 1164415824...
## $ LAST_NAME <chr> "ELSBREE", "ERBAY", "ANTIPORDA", "GONZALEZ", "...
## $ FIRST_NAME <chr> "SCOTT", "CELAL", "GLORIOSA", "CIRILO", "CARLO...
## $ CITY <chr> "LAKEWOOD RANCH", "GAINESVILLE", "JACKSONVILLE...
## $ STATE <chr> "FL", "FL", "FL", "FL", "FL", "FL", "FL", "FL"...
## $ ZIP <int> 342408425, 326053170, 322216650, 331351428, 33...
## $ SPECIALTY <chr> "Internal Medicine", "Internal Medicine", "Int...
## $ CLAIM_COUNT <int> 24512, 15921, 24682, 47015, 14656, 18619, 2268...
## $ BENE_COUNT <int> 1026, 474, 1340, 2082, 619, 861, 802, 802, 818...
## $ BRAND_COUNT <int> 6466, 4551, 3832, 9530, 3653, 4509, 3821, 6602...
## $ LOWINCOME_COUNT <int> 11024, 13750, 10634, 25596, 9502, 11211, 8522,...
## $ COST_SUM <dbl> 1484515.8, 1013012.5, 964914.3, 957790.5, 9508...
## $ BRAND_PCT <dbl> 26.37892, 28.58489, 15.52548, 20.27013, 24.924...
Now this has been interesting, but in many cases we want to pull out specific pieces of our data. What if we’re only interested in doctors in Colorado? We can do this with the filter() function, which is part of the dplyr package. Remember, if this is our first time using dplyr this session, we need to load it using library() first.
library(dplyr)
This is how the filter function works: data %>% filter(logical condition). If we run just this, it will print the results in the Console. We can also assign the results to a new dataframe. To find only doctors from Colorado:
colorado_docs <- docs %>%
filter(STATE == "CO")
You don’t have to be limited to one variable. Earlier we looked at high-cost prescribers; perhaps we’d like to do that just for Colorado doctors. We can use & to look at all Colorado doctors who have exceeded $500,000 in costs.
In R we can use two types of Boolean operators:
AND, represented by &
OR, represented by |
& limits our filter, because both criteria must be met. | broadens our filter: if either criterion is met, the data is included.
The code below selects doctors from Colorado who also exceeded $500,000 in costs. Swapping the & for an | would instead give doctors who either are from Colorado or have costs exceeding $500,000, and could be from any state.
colorado_high <- docs %>%
filter(STATE == "CO" & COST_SUM > 500000)
Now we can view the new colorado_high data set in our Environment pane, or order it using arrange(), which you learned earlier:
colorado_high %>%
arrange(desc(COST_SUM))
## # A tibble: 80 x 13
## NPI LAST_NAME FIRST_NAME CITY STATE ZIP
## <int> <chr> <chr> <chr> <chr> <int>
## 1 1891724027 CHEBANOVA ELENA DENVER CO 802472307
## 2 1770564437 ANISIMOVA ELENA DENVER CO 802107009
## 3 1477550960 FIERER ROBERT BOULDER CO 803057182
## 4 1164583324 MAHALINGAM INDIRA BROOMFIELD CO 800209541
## 5 1881622629 SCHWARTZ MARK PUEBLO CO 810013754
## 6 1730166687 BERNTSEN MARK GREELEY CO 806315114
## 7 1770526881 BLUM JOSHUA DENVER CO 802053003
## 8 1295014710 ABUABA ROMANO AURORA CO 800145476
## 9 1124134697 PINES IRINA DENVER CO 802044507
## 10 1235134008 BORISOV IGOR DENVER CO 802203920
## # ... with 70 more rows, and 7 more variables: SPECIALTY <chr>,
## # CLAIM_COUNT <int>, BENE_COUNT <int>, BRAND_COUNT <int>,
## # LOWINCOME_COUNT <int>, COST_SUM <dbl>, BRAND_PCT <dbl>
That’s helpful, but what if you want to get a sense of the typical cost or the total number of doctors in each state? In Excel you might use a PivotTable; in SQL you’d use a “GROUP BY” query. In dplyr the syntax is very similar to SQL, with the group_by() and summarise() functions.
You’ll do this in two steps. First, decide which variables to put into the group_by() function.
state_group <- docs %>%
group_by(STATE)
Then use what you just created to summarise the data, and view the result.
state_docs <- state_group %>%
summarise(count = n(),
median_cost = median(COST_SUM)
)
state_docs
## # A tibble: 64 x 3
## STATE count median_cost
## <chr> <int> <dbl>
## 1 AE 1 93738.55
## 2 AK 55 195571.65
## 3 AL 789 498793.89
## 4 AR 260 404949.92
## 5 AZ 946 266389.42
## 6 CA 6784 243460.89
## 7 CO 676 229384.89
## 8 CT 1024 306055.54
## 9 DC 192 130242.49
## 10 DE 157 349815.91
## # ... with 54 more rows
You can add as many summary statistics as you want.
state_docs <- state_group %>%
summarise(count = n(),
median_cost = median(COST_SUM),
total_cost = sum(COST_SUM),
high_cost = max(COST_SUM),
sd_cost = sd(COST_SUM),
median_scripts = median(CLAIM_COUNT),
total_scripts = sum(CLAIM_COUNT)
)
If you decide you want to group by more than one variable, just add it to the function, separating each variable with a comma. To group by city and state:
state_group <- docs %>%
group_by(CITY, STATE)
Then, just re-run the summarise function.
state_docs <- state_group %>%
summarise(count = n(),
median_cost = median(COST_SUM),
total_cost = sum(COST_SUM),
high_cost = max(COST_SUM),
sd_cost = sd(COST_SUM),
median_scripts = median(CLAIM_COUNT),
total_scripts = sum(CLAIM_COUNT)
)
You can also accomplish the same thing in one “step” by chaining together multiple operations with the pipe.
state_docs <- docs %>%
group_by(CITY, STATE) %>%
summarise(count = n(),
median_cost = median(COST_SUM),
total_cost = sum(COST_SUM),
high_cost = max(COST_SUM),
sd_cost = sd(COST_SUM),
median_scripts = median(CLAIM_COUNT),
total_scripts = sum(CLAIM_COUNT)
)
We’ve shown you several R functions, but keep in mind: if you Google for an answer, you may see different ways to do things. The syntax for R’s base functions uses a $ to separate the data frame from the variable:
[data_frame]$[variable]
So for example, if you want to find the median of the CLAIM_COUNT in the colorado_high data frame:
median(colorado_high$CLAIM_COUNT)
## [1] 10267
# install.packages("lubridate")
# install.packages("dplyr")
library(lubridate)
library(dplyr)
The economics dataset also comes with ggplot2, and is time series data. The observations are months, and the variables are economic indicators like the number of unemployed people. Because this is data from a package, you can read about it using the ? help operator.
?economics
econ <- economics
str(econ)
## Classes 'tbl_df', 'tbl' and 'data.frame': 574 obs. of 6 variables:
## $ date : Date, format: "1967-07-01" "1967-08-01" ...
## $ pce : num 507 510 516 513 518 ...
## $ pop : int 198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ...
## $ psavert : num 12.5 12.5 11.7 12.5 12.5 12.1 11.7 12.2 11.6 12.2 ...
## $ uempmed : num 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
## $ unemploy: int 2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...
econ <- econ %>%
  mutate(date = ymd(date))    # make sure date is a proper Date
old <- econ %>%
  filter(year(date) < 1970)   # keep just the months before 1970
p3 <- ggplot(data=old) + geom_line(aes(x=date, y=pce))
p3
# install.packages("scales")
library(scales)
p3 +
scale_x_date(breaks = date_breaks("3 months"),
labels = date_format("%b %Y"))
p3 +
scale_x_date(breaks = date_breaks("6 months"),
labels = date_format("%B %y"))
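Here date_breaks() and date_format() come from the scales package: date_breaks() sets the tick spacing, and date_format() uses the standard strptime codes (%b is the abbreviated month name, %B the full name, %y and %Y two- and four-digit years). One more sketch along the same lines:
p3 +
  scale_x_date(breaks = date_breaks("1 year"),
               labels = date_format("%Y"))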
Similarly, the base package comes with its own charting functions. So we can create a histogram with hist():
hist(colorado_high$CLAIM_COUNT)
or simply plot values with plot(x, y):
plot(colorado_high$CLAIM_COUNT, colorado_high$BRAND_COUNT)
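Base graphics also take labeling arguments directly; for example:
plot(colorado_high$CLAIM_COUNT, colorado_high$BRAND_COUNT,
     xlab = "Total claims", ylab = "Brand-name claims",
     main = "Colorado high-cost prescribers")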
Finally, remember that R can also do arithmetic, like a calculator:
2+4
## [1] 6
R follows standard mathematical order of operations (PEMDAS). So, this:
2+4*3^2
## [1] 38
gives you a different result than this:
(2+4)*3^2
## [1] 54