Data Appendix Instructions

For the Data Appendix, you need to submit your data file(s) and a knitted HTML file showing the data being loaded into R.

Data file(s)

Your raw data file(s) should ideally be in CSV format (.csv). This means that the first row should be a comma-separated list of variable names, and the rest should be rows of data.
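
As a small illustration of that layout, the sketch below reads CSV text typed directly into R. The variable names and values are made up for this example and do not come from any real dataset.

```r
library(readr)

# Hypothetical CSV contents: a header row of variable names, then rows of data
csv_text <- "title,duration,color
Movie A,120,Color
Movie B,95,Black and White"

# A string containing newlines is treated by read_csv() as literal data, not a path
example <- read_csv(csv_text)

names(example)  # variable names taken from the first row
nrow(example)   # number of data rows (the header row is not counted)
```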

Excel data can easily be converted to .csv by using Save As or Export and choosing “comma separated values” as the format.

If your data is in another format, talk to your professor about converting the data or using a different R command for reading it in.

Making your R Markdown file

To start working with your data, complete three main steps:

1. Download the data (from the data source or website). Often you click a Download button or right-click and select Save As. Your data file should be named something like group-X-data.csv, so change the filename if necessary.
2. Upload the data to the RStudio file server. Once the data is on your computer, log in to RStudio and look for the Upload button on the Files tab (lower right corner). Click the button and then browse to your data file.
3. Load the data into R. Once the data file is on the server, you can read it into your R environment.

Loading the data

To get around file path issues, try the Import Dataset button on the Environment tab (top right corner) and navigate to the file on the server. Look at the preview. You may need to switch the radio button for Heading from No to Yes to get your variable names at the top of the columns. If the preview looks wrong and you can’t figure out how to fix it, ask your professor.

Once you’ve used the button, some code will appear in your Console. You can copy-paste this into the top of your R Markdown document to load the data there.

The Import Dataset button may use the function read.csv() by default. This is base R syntax, and is falling out of favor. A better function is read_csv(), from the tidyverse package readr, which returns the data as a tibble. You can leave the file path the way it is and just change the . to a _ in the function name, as long as you load readr first.

library(readr)
movies <- read_csv("~/SDS220/www/data/movies2.csv")
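
If read_csv() reports that the file does not exist, the path is usually the problem. Here is a minimal sketch of checking a path before reading it. The temporary file and its contents are stand-ins invented so the example runs anywhere; in your own document you would use the path to your uploaded file instead.

```r
library(readr)

# Stand-in for your uploaded data file (hypothetical contents)
data_path <- tempfile(fileext = ".csv")
writeLines(c("title,duration", "Movie A,120", "Movie B,95"), data_path)

# Confirm that R can see the file before trying to read it
file.exists(data_path)

# Read the file; read_csv() prints a message describing the column types it guessed
example <- read_csv(data_path)
```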

Initial data exploration

For the Data Appendix submission, you need to at least run str() on your dataset and look at the format of each variable.

str(movies)
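
To see what this kind of output looks like before running it on your own data, here is a sketch on a tiny made-up dataset. The data frame `example` and its values are hypothetical; glimpse() from dplyr is shown as one possible alternative to str(), not a requirement.

```r
library(dplyr)

# Hypothetical stand-in for your dataset
example <- tibble(
  title = c("Movie A", "Movie B", "Movie C"),
  duration = c(120, 95, NA),
  color = c("Color", "Black and White", "Color")
)

str(example)            # structure: one line per variable with its type
glimpse(example)        # similar information in a tidyverse-style layout
sapply(example, class)  # just the class of each variable
```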

You should also compute some initial summary statistics for your response variable and your most important explanatory variables.

For numeric variables, try something like:

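library(mosaic)  # provides favstats() and tally()
library(dplyr)   # provides %>% and summarize()

favstats(~duration, data = movies)

movies %>%
  summarize(mean = mean(duration), sd = sd(duration))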
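
If a numeric variable contains missing values, mean() and sd() return NA unless told to drop them. The sketch below shows the difference on a small invented dataset; the name `example` and its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical data with one missing duration
example <- tibble(duration = c(120, 95, NA, 150))

# Without na.rm = TRUE, a single NA makes both the mean and the sd NA
example %>%
  summarize(mean = mean(duration), sd = sd(duration))

# With na.rm = TRUE the missing value is dropped; also count how many were missing
stats <- example %>%
  summarize(
    mean = mean(duration, na.rm = TRUE),
    sd = sd(duration, na.rm = TRUE),
    n_missing = sum(is.na(duration))
  )
stats
```

Reporting the number of missing values alongside the summary makes it clear how many observations the statistics are actually based on.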

For categorical variables, perhaps:

tally(~color, data=movies)

movies %>%
  group_by(color) %>%
  summarize(n=n())
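
Two other ways of counting categories may be useful, especially when some values are missing. This is a sketch on invented data; `example` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical categorical variable with one missing value
example <- tibble(color = c("Color", "Color", "Black and White", NA))

# count() is shorthand for group_by() followed by summarize(n = n())
counts <- example %>%
  count(color)
counts

# Base R alternative; useNA = "ifany" lists missing values as their own category
table(example$color, useNA = "ifany")
```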

For each variable, consider whether the distribution makes sense.
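
One way to check whether a distribution makes sense is to look at its range and pull out values that seem implausible. The sketch below assumes durations measured in minutes; the data and the cutoffs of 1 and 600 are arbitrary choices made for illustration, so pick limits that suit your own variable.

```r
library(dplyr)

# Hypothetical durations in minutes; 0 and 5000 are implausible for a film
example <- tibble(
  title = c("Movie A", "Movie B", "Movie C", "Movie D"),
  duration = c(120, 0, 95, 5000)
)

range(example$duration)  # smallest and largest values

# Rows outside a plausible range (the cutoffs are assumptions, not rules)
suspicious <- example %>%
  filter(duration < 1 | duration > 600)
suspicious
```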

Most pressing data cleaning issues

Write a bulleted list of the issues you need to resolve before you can work with the data. Do you need to join files? Mutate your dataset to create a new variable? Clean up a variable that has 0s where it should have NAs? Write these out.
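
For reference, here is a rough sketch of what those three kinds of cleaning can look like with dplyr. Everything in it is hypothetical: the datasets `movies_small` and `ratings`, their columns, and the assumption that a duration of 0 means "not recorded". Your own data will need different steps.

```r
library(dplyr)

# Hypothetical datasets invented for this sketch
movies_small <- tibble(
  title = c("Movie A", "Movie B", "Movie C"),
  duration = c(120, 0, 95),  # assume 0 stands for "not recorded"
  budget = c(10, 20, 5),
  gross = c(30, 15, 5)
)
ratings <- tibble(
  title = c("Movie A", "Movie C"),
  rating = c(7.1, 6.4)
)

cleaned <- movies_small %>%
  mutate(
    duration = na_if(duration, 0),  # replace 0s with NA
    profit = gross - budget         # create a new variable
  ) %>%
  left_join(ratings, by = "title")  # join a second dataset on a shared key

cleaned
```

Because left_join() keeps every row of the first dataset, movies without a match in `ratings` get NA for `rating` instead of being dropped.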