Data Appendix

Published

January 1, 2023

The data appendix is a first foray into examining your data. It is designed to help you identify issues you may need to address with data cleaning, merging, or other solutions. It should be prepared as a qmd file, and you will need to submit three files on Canvas:

  • Your data file (or a zip folder containing multiple data files) in a CSV format (.csv), Excel file, or some other filetype that R can read.
  • The qmd file of your data appendix.
  • A rendered HTML or PDF file of your data appendix.

At the beginning of the data appendix, you will load your data. This will probably mean using the readr::read_csv() function or the readxl::read_excel() function. If you are not sure how to use these functions, try using the Import Dataset button in RStudio and then copying the resulting code.
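As a minimal sketch of the loading step, the code below writes a tiny made-up CSV to a temporary file and reads it back with readr::read_csv(). The file contents and the name my_data are placeholders for illustration only; in your appendix you would pass the path to your own data file instead.

```r
library(readr)

# Write a tiny made-up file so this sketch runs on its own.
# In your appendix, skip this step and use the path to your real data file.
example_path <- tempfile(fileext = ".csv")
writeLines(
  c("Airport,WaterTemp,Income",
    "BOS,18.5,3",
    "LAX,21.0,5",
    "SEA,NA,1"),
  example_path
)

# show_col_types = FALSE silences the column-type message
my_data <- read_csv(example_path, show_col_types = FALSE)
my_data
```

For an Excel file, readxl::read_excel() takes the file path in the same way.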

Then, run skimr::skim() or str() on your data to get an overview of every variable.
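Here is roughly what that looks like, using a small made-up data frame (my_data is a stand-in for your own data). The skim() call is wrapped in a check so the sketch still runs if the skimr package is not installed.

```r
# A small made-up data frame standing in for your data
my_data <- data.frame(
  Airport   = c("BOS", "LAX", "SEA"),
  WaterTemp = c(18.5, 21.0, NA),
  Income    = c(3L, 5L, 1L)
)

# Base R: one line per variable, showing its type and first few values
str(my_data)

# skimr: summary statistics and missing-value counts, grouped by variable type
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(my_data))
}
```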

Consider each of your variables in turn. Questions to ponder:

  1. Is each variable of the type you expect? For example, make sure things that should be numeric are stored as num or int variables, rather than chr (character strings). If you have variables that should be categorical, ensure they are chr or Factor variables, rather than num or int (recall the Income example from the data wrangling lab, where Income was integers 1-5, but those numbers were actually representing categories, not dollar amounts).
  2. Are the variable names readable, both by humans and computers? Do the names look like words that you can understand? Do they include spaces, periods, underscores, or other things that will make them hard to use in code? Are they helpful and contextual? For example, you would want Airport and WaterTemp, not apt and wt, and certainly not A and B as variable names.
  3. Consider the content of your variables.
       1. For categorical variables: look at the category names. Are they readable? Are they understandable phrases or cryptic codes? For example, do they use Male and Female or something like 1 and 2? A binary variable isFemale could be coded 0 for male, and 1 for female (and then is self-documenting). A variable sex coded 1 and 2 is just asking for trouble.
       2. For numeric variables: look at the min and max values for each of the variables. Do they make sense in context? For example, a variable like Bedrooms could be 0 for a studio apartment, but rarely do apartments have 0 Bathrooms, so that minimum would not make sense. Similarly, an Income variable that ranges from 1-5 probably doesn’t make sense, because incomes are usually much more than $1.
       3. For all variables: consider how many missing values there are. Is it just a few, or so many as to be concerning?
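If you want to check these questions in code rather than by eye, the base R sketch below is one way to do it. The homes data frame is made up, with a few deliberate problems that echo the examples above; none of it comes from a real dataset.

```r
# Made-up data with a few deliberate problems
homes <- data.frame(
  Bedrooms  = c(0, 2, 3, 1),
  Bathrooms = c(1, 0, 2, NA),     # a 0 here is suspicious
  Income    = c(1L, 5L, 3L, 2L),  # category codes stored as integers
  sex       = c(1, 2, 2, 1)       # cryptic codes
)

# Types: what class is each variable?
var_types <- sapply(homes, class)
var_types

# Names: are they readable and easy to type?
names(homes)

# Categorical content: tabulate the values of a coded variable
table(homes$sex, useNA = "ifany")

# Numeric content: min (row 1) and max (row 2) of each variable
numeric_ranges <- sapply(homes, range, na.rm = TRUE)
numeric_ranges

# Missing values: count of NAs per variable
missing_counts <- colSums(is.na(homes))
missing_counts
```

In this made-up example, the minimum of 0 for Bathrooms and the integer type of Income are the kinds of things you would note in your variable list.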

Make a list of all the variables in your dataset (or at least, all the ones you are interested in studying), commenting on as many of the above questions as makes sense.

After examining your variables, write up a list of your most pressing data wrangling, data cleaning, and data joining issues. For example, “we need to condense the political spectrum variable down to two categories so it is binary” or “we need to change the Income variable into a categorical variable and recode the levels.”
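The two example tasks above could be approached roughly as follows with dplyr. This is only a sketch: the survey data, the variable names, and the category labels are all made up, and how to treat a middle category such as "Moderate" is a judgment call you would need to make and explain for your own data.

```r
library(dplyr)

# Made-up survey data; names and labels are assumptions for illustration
survey <- tibble(
  Politics = c("Very liberal", "Liberal", "Moderate",
               "Conservative", "Very conservative"),
  Income   = c(1L, 5L, 3L, 2L, 4L)
)

survey_clean <- survey |>
  mutate(
    # Condense the political spectrum to two categories.
    # Here "Moderate" becomes NA; that is one choice among several.
    PoliticsBinary = case_when(
      Politics %in% c("Very liberal", "Liberal")           ~ "Liberal",
      Politics %in% c("Conservative", "Very conservative") ~ "Conservative",
      TRUE                                                 ~ NA_character_
    ),
    # Turn the integer codes 1-5 into a labelled categorical variable.
    # These labels are placeholders, not the real meaning of the codes.
    Income = factor(
      Income,
      levels = 1:5,
      labels = c("Lowest", "Low", "Middle", "High", "Highest")
    )
  )

survey_clean
```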

Then, do your best to accomplish those tasks. If you are able to do the data wrangling right off the bat, great! If not, show me the code you tried (you can add the #| error: true chunk option so that your document still renders despite the error) and explain what you were thinking as you wrote it. Identifying which dplyr “verbs” you need is a good first step.

At the end of the document, run skim() again, and look to see if your data issues have been solved.
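Alongside skim(), a compact before-and-after check can make it easier to see whether a specific issue was fixed. The sketch below uses made-up data and assumes, purely for illustration, that 2 was the code for female in the original sex variable.

```r
# Made-up "before" data with two known issues
before <- data.frame(
  Income = c(1L, 5L, NA, 2L),  # category codes stored as integers
  sex    = c(1, 2, 2, 1)       # cryptic codes
)

# Apply the fixes (assumes 2 meant female; check your own codebook)
after <- before
after$Income   <- factor(after$Income, levels = 1:5)
after$isFemale <- as.integer(after$sex == 2)
after$sex      <- NULL

# One row per variable: its class and how many values are missing
check <- data.frame(
  variable  = names(after),
  class     = sapply(after, class),
  n_missing = colSums(is.na(after)),
  row.names = NULL
)
check
```

Note that recoding does not make missing values go away: Income still has one NA after the fix, which is something to mention in your write-up.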