For the Data Appendix, you need to submit your data file(s) and a knitted HTML file showing the data being loaded into R.
Your raw data file(s) should ideally be in CSV format (.csv). This means that the first row should be a comma-separated list of variable names, and the rest should be rows of data.
Excel data can easily be converted to .csv by using Save As or Export and choosing “comma separated values” as the format.
If your data is in another format, talk to your professor about converting the data or using a different R command for reading it in.
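To make the CSV layout concrete, here is a minimal sketch that writes a tiny CSV to a temporary file and reads it back. The file contents and the object name example are made up for illustration; your own file will have different variables.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("title,duration,color",
    "Movie A,120,Color",
    "Movie B,95,Black and White"),
  csv_path
)

# The first row becomes the variable names; the remaining rows become data
example <- read_csv(csv_path)
names(example)
nrow(example)
```

If names() shows something like X1, X2, X3 for your own file, the header row is probably missing or was not recognized.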
To get started looking at your data, you need to do three main steps:
group-X-data.csv, so change the filename if necessary.
To get around file path issues, try the Import Dataset button on the Environment tab (top right corner) and navigate to the file on the server. Look at the preview. You may need to switch the radio button for Heading from No to Yes to get your variable names at the top of the columns. If the preview looks funny and you can’t figure out how to make it better, ask your professor.
Once you’ve used the button, some code will appear in your Console. You can copy-paste this into the top of your R Markdown document to load the data there.
The Import Dataset button may use the function read.csv() by default. This is base R syntax, and is falling out of favor. A better function is read_csv(), from the tidyverse package readr. You can leave the filepath the way it is and just change the . to a _ in the function name.
library(readr)
movies <- read_csv("~/SDS220/www/data/movies2.csv")
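If you are curious what actually changes when you swap the function, the sketch below reads the same made-up file both ways. The temporary file and the object names base_version and readr_version are hypothetical; only read.csv() and read_csv() come from the text above.

```r
library(readr)

# A small, made-up CSV file for comparison purposes
csv_path <- tempfile(fileext = ".csv")
writeLines(c("title,duration", "Movie A,120", "Movie B,95"), csv_path)

base_version <- read.csv(csv_path)    # base R: returns a plain data.frame
readr_version <- read_csv(csv_path)   # readr: returns a tibble

class(base_version)
class(readr_version)
```

Both objects hold the same data. The tibble prints more compactly and reports the type of each column, which is handy when you are first getting to know a dataset.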
For the Data Appendix submission, you need to at least run str() on your dataset and look at the format of each variable.
str(movies)
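One thing to look for in the str() output is a variable stored in the wrong format. Here is a small sketch on a made-up data frame (toy and its columns are hypothetical) where a numeric variable was read in as text:

```r
# Hypothetical data frame standing in for your dataset
toy <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  duration = c(120, 95, 101),
  budget = c("1000000", "250000", "unknown"),  # numbers stored as text
  stringsAsFactors = FALSE
)

str(toy)

# One class per column; a character column that should be numeric needs cleaning
col_classes <- sapply(toy, class)
col_classes
```

Here budget shows up as character because one entry is the word "unknown". That is the kind of issue worth writing down.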
You should also try to compute some initial summary statistics for your response variable and your most important explanatory variables.
For numeric variables, something like
library(mosaic)  # favstats() and the formula version of tally() come from mosaic
favstats(~duration, data = movies)
or
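# %>% and summarize() come from dplyr
# na.rm = TRUE skips missing values, which would otherwise make the result NA
movies %>%
  summarize(mean = mean(duration, na.rm = TRUE), sd = sd(duration, na.rm = TRUE))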
For categorical variables, perhaps
tally(~color, data=movies)
or
movies %>%
group_by(color) %>%
summarize(n=n())
For each variable, consider whether the distribution makes sense.
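As a rough sketch of what "makes sense" might mean for a numeric variable, you could check the range, count missing values, and look at a quick histogram. The data below is invented for illustration (toy is a hypothetical data frame), and the checks use only base R:

```r
# Hypothetical durations in minutes, including a missing value and an implausible 0
toy <- data.frame(duration = c(120, 95, 0, NA, 101, 511))

# Minimum, quartiles, mean, maximum, and a count of NAs
summary(toy$duration)

# How many values are missing, and how many are suspicious zeros?
n_missing <- sum(is.na(toy$duration))
n_zero <- sum(toy$duration == 0, na.rm = TRUE)

# A quick histogram to eyeball the shape and spot outliers
hist(toy$duration, main = "Duration (toy data)", xlab = "Minutes")
```

A movie that runs 0 minutes is almost certainly a data entry placeholder, and a very long one deserves a second look before you trust it.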
Finally, write a bulleted list of the issues you need to deal with before you can work with the data. Do you need to join files? Mutate your dataset to create a new variable? Clean up a variable that has 0s where it needs NAs? Write these out.
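You do not have to fix these issues for the Data Appendix, but it may help to see roughly what the fixes look like. The sketch below uses dplyr on two made-up tables; movies_toy, ratings_toy, and every column in them are hypothetical, and your own cleaning steps will depend on your data.

```r
library(dplyr)

# Hypothetical tables; all names and values are invented for illustration
movies_toy <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  duration = c(120, 0, 95),   # 0 here stands in for "not recorded"
  budget = c(10, 4, 6),       # millions of dollars
  gross = c(25, 2, 6)
)
ratings_toy <- data.frame(
  title = c("Movie A", "Movie C"),
  rating = c(7.1, 6.4)
)

cleaned <- movies_toy %>%
  left_join(ratings_toy, by = "title") %>%  # join a second file on a shared key
  mutate(
    duration = na_if(duration, 0),          # turn placeholder 0s into NA
    profit = gross - budget                 # create a new variable
  )

cleaned
```

Note that left_join() keeps every row of the first table, so a movie with no match in the second table gets NA for rating rather than disappearing.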