Project Instructions
As a major component of this class, you and your group will conduct a statistical study on a topic of your choice. This task will require you to write a project proposal, acquire and analyze relevant data, present your results orally to the class, and hand in a written report describing your study and its findings. Your project must involve fitting a regression model. The project is an opportunity to show off what you’ve learned about data analysis, visualization, and statistical inference. Successful completion of the project is required to pass the class.
Group formation
You will work in a group of three, of your choosing. Each group will be assigned a letter, and you will use Group-X, with X replaced by your group's letter, in all communications and file submissions from that point forward.
Assignment
You should pose a problem that you find interesting and which may be addressed (at least in part) through the analysis of data. Many interesting quantitative questions (and perhaps even more uninteresting ones) involve the relationships among several variables. Recent projects have considered the following questions:
What factors influence the number of Hospital Acquired Infections (HAIs) found in the United States?
What is the relationship between the price of real estate in Allegheny County and the size of the house, as measured in square feet, lot size, number of bedrooms and more?
Can we predict the price of Airbnb rentals?
What factors are associated with a “helpful” Yelp review?
What factors influence the incidence of tuberculosis in the U.S.?
How can we predict teen alcohol consumption?
You should pose the problem that you want to solve as precisely as possible at the outset. Next, identify the population you want to describe, and think about how you will obtain relevant data. What kind of model might be appropriate in this context? You should also make some guesses, a priori (before you analyze the data), about the results you expect to see.
Most of you will pose your own question and acquire data from the internet, but some of you may wish to analyze data that someone else (e.g., a professor or office at St. Thomas, or a magazine or newspaper) has collected for another purpose. Please consult me if you plan to do this.
Finding an appropriate dataset
Finding the right data to answer your particular question is part of your responsibility for this assignment. Public data sets are available from hundreds of different websites, on virtually any topic. You might not be able to find the exact data that you want, but you should be able to find data that is relevant to your topic. You may also need to refine your research question so that it can be more clearly addressed by the data that you found. Professor McNamara will provide you with a list of places to start looking for data, and you can consult with her if you need more help finding an appropriate dataset.
Keep the following in mind as you select your topic and dataset:
You need enough data to make meaningful inferences. There is no magic number of observations required for all projects, but aim for at least 200 observations, and make sure there are at least 20 observations in each category of each of your categorical variables (if you have any).
Most projects will involve a quantitative response variable, for which you can use multiple linear regression for your primary analyses.
If you want to use a response variable that is non-quantitative, it should be categorical with exactly two categories. This will allow you to use logistic regression to predict the response. We cover logistic regression near the end of the semester. We won’t do any modeling for categorical response variables with more than two categories.
In addition to your single response variable, you want to find a dataset with multiple variables you can use as explanatory variables. These variables can be quantitative or categorical, and categorical explanatory variables can have many categories (unlike a categorical response variable, which must have just two). Again, there is no magic number of variables for this project, but typically at least 10 variables in the dataset gives you a good place to start. As you work, you will likely narrow the number of variables you use.
The observations in your data should be reasonably independent of one another (the I in the LINE conditions). This means your observations should not be spatial units (e.g., countries, states, or counties), time periods (e.g., years, months, or days), or anything else obviously non-independent.
If you want to use observations that are time periods, you will need to use time series analysis, which is the final topic of the semester.
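The size guidelines in the list above can be checked with a few lines of R before you commit to a dataset. What follows is a minimal sketch, not part of the course materials: the helper `check_dataset()`, its arguments, and the example data frame are hypothetical names made up for illustration, and the default thresholds simply mirror the rough targets of 200 observations and 20 per category.

```r
# Hypothetical helper (an assumption for illustration): report whether a data
# frame meets the rough size targets described above.
check_dataset <- function(data, min_rows = 200, min_per_category = 20) {
  # Treat factor and character columns as categorical variables.
  is_categorical <- vapply(
    data,
    function(column) is.factor(column) || is.character(column),
    logical(1)
  )

  # For each categorical variable, keep only the categories that are too small.
  small_categories <- list()
  for (name in names(data)[is_categorical]) {
    counts <- table(data[[name]])
    too_small <- counts[counts < min_per_category]
    if (length(too_small) > 0) {
      small_categories[[name]] <- too_small
    }
  }

  list(
    n_rows = nrow(data),
    enough_rows = nrow(data) >= min_rows,
    small_categories = small_categories
  )
}

# Made-up example data: 210 rows, one quantitative and one categorical variable.
example_data <- data.frame(
  price = seq(100, by = 5, length.out = 210),
  room_type = rep(
    c("entire home", "private room", "shared room"),
    times = c(120, 80, 10)
  )
)

result <- check_dataset(example_data)
result$enough_rows        # TRUE: 210 rows meets the 200-row target
result$small_categories   # flags "shared room", which has only 10 rows
```

A flagged category does not automatically rule out a dataset. You might combine small categories or drop that variable, but it is worth knowing about before you write your proposal.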
Scaffolding
There are several assignments that lead up to the final technical report. In order, they are:
- group formation
- initial project proposal
- revised project proposal
- data appendix
- first draft of technical report
Technical report
Your technical report will be produced using an R Markdown file (.Rmd) that contains your R code, interspersed with explanations of what the code is doing and what it tells you about the problem. You will do two drafts of the technical report: a rough draft the week before we do presentations, and a final draft submitted at the end of the semester.
Content
You should not need to include all of the R code that you wrote throughout the process of working on this project. Rather, the technical report should contain the minimal set of R code that is necessary to understand your results and findings in full.
That being said, if you make a claim, it must be justified by an explicit calculation. A knowledgeable reviewer should be able to compile your .Rmd file without modification, and verify every statement that you have made. All of the R code necessary to produce your figures and tables must appear in the technical report. In short, the technical report should enable a reviewer to reproduce your work in full.
Tone
This document should be written for peer reviewers, who comprehend statistics at least as well as you do. You should aim for a level of complexity that is more statistically sophisticated than an article in the Science section of The New York Times, but less sophisticated than an academic journal. (Chance magazine might provide a good example.) For example, you may use terms that you will likely never see in the Times (e.g. bootstrap), but should not dwell on technical points with no obvious ramifications for the reader (e.g. reporting F-statistics). Your goal for this paper is to convince a statistically-minded reader (e.g. a student in this class, or a student from another school who has taken an introductory statistics class) that you have addressed an interesting research question in a meaningful way. Even a reader with no background in statistics should be able to read your paper and get the gist of it.
Format
Your technical report should follow this basic format:
Abstract: a short, one paragraph explanation of your project. The abstract should not consist of more than 5 or 6 sentences, but should relate what you studied and what you found. It need only convey a general sense of what you actually did. The purpose of the abstract is to give a prospective reader enough information to decide if they want to read the full paper.
Introduction: an overview of your project, including a literature review. In a few paragraphs, you should explain clearly and precisely what your research question is, why it is interesting, and what contribution you have made towards answering that question. As you explain your topic, cite any relevant sources. You should give an overview of the specifics of your model, but not the full details. Most readers never make it past the introduction, so this is your chance to hook the reader, and is in many ways the most important part of the paper!
Data: a brief description of your data set. Where did it come from? Please include a citation that would allow a reader to find the dataset for themselves. What variables are included? Where did they come from? What are the units of measurement? What is the population that was sampled? How was the sample collected? You should also include some basic univariate analysis. Much of this can be taken from the data appendix.
Results: an explanation of what your model tells us about the research question. You should interpret coefficients in context and explain their relevance. What does your model tell us that we didn’t already know before? You may want to include negative results, but be careful about how you interpret them. For example, you may want to say something along the lines of: “we found no evidence that explanatory variable \(x\) is associated with response variable \(y\),” or “explanatory variable \(x\) did not provide any additional explanatory power above what was already conveyed by explanatory variable \(z\).” On the other hand, you probably shouldn’t claim: “there is no relationship between \(x\) and \(y\).”
Conclusion: a summary of your findings and a discussion of their limitations. First, remind the reader of the question that you originally set out to answer, and summarize your findings. Second, discuss the limitations of your model, and what could be done to improve it. You might also want to do the same for your data. This is your last opportunity to clarify the scope of your findings before a journalist misinterprets them and makes wild extrapolations! Protect yourself by being clear about what is not implied by your research.
References: after the text, there should be a short works cited section, including the references you have made in the text. Aim for 3-5 scholarly citations, such as books or journal articles. More informal sources like blog posts and websites can be cited in addition to scholarly sources. Please make sure to include a citation to your data source, as well.
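As a rough illustration of the Results item above, the sketch below fits a multiple regression and pulls out coefficient estimates with 95% confidence intervals, which is usually the starting point for interpreting coefficients in context. It uses the built-in `mtcars` data only because it ships with R. That dataset, the choice of variables, and the wording of the interpretation are assumptions made for illustration, and with only 32 observations it would be too small for an actual project.

```r
# Illustration only: mtcars is a built-in R dataset, not a suggested project dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient estimates alongside 95% confidence intervals.
estimates <- cbind(estimate = coef(fit), confint(fit, level = 0.95))
round(estimates, 3)

# Interpretation in context (fill in the numbers from the table above):
# "Holding horsepower fixed, each additional 1000 lbs of weight is associated
#  with a change of <wt estimate> miles per gallon, on average."
wt_interval <- confint(fit, "wt", level = 0.95)

# A negative result should be phrased as "no evidence of an association",
# for example when the interval for a coefficient contains zero.
interval_contains_zero <- wt_interval[1] < 0 && wt_interval[2] > 0
interval_contains_zero
```

Reporting the interval rather than only the point estimate makes it easier to phrase conclusions with appropriate caution, in line with the advice about negative results.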
Additional Thoughts
The technical report is not simply a dump of all the R code you wrote during this project. Rather, it is a narrative, with technical details, that describes how you addressed your research question. You should not present tables or figures without a written explanation of the information that is supposed to be conveyed by that table or figure. Keep in mind the distinction between data and information. Data is just numbers, whereas information is the result of analyzing that data and digesting it into meaningful ideas that human beings can understand. Your technical report should allow a reviewer to follow your steps in converting data into information. There is no limit to the length of the technical report, but it should not be longer than it needs to be. You will not receive additional credit for simply describing your data ad infinitum. For example, simply displaying a table with the means and standard deviations of your variables is not meaningful. Writing a sentence that reiterates the content of the table (e.g. “the mean of variable \(x\) was 34.5 and the standard deviation was 2.8…”) is equally meaningless. What you should strive to do is interpret these values in context (e.g. “although variables \(x_1\) and \(x_2\) have similar means, the spread of \(x_1\) is much larger, suggesting…”).
You should present figures and tables in your technical report in context. These items should be understandable on their own – in the sense that they have understandable titles, axis labels, legends, and captions. Someone glancing through your technical report should be able to make sense of your figures and tables without having to read the entire report. That said, you should also include a discussion of what you want the reader to learn from your figures and tables.
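To make the point about self-explanatory figures concrete, here is a minimal base R sketch of a scatterplot with a title, axis labels that include units, and a legend. The built-in `mtcars` data, the colors, and the label wording are assumptions chosen for illustration. In an R Markdown file, a caption can be added with the knitr chunk option `fig.cap`.

```r
# Illustration only: built-in mtcars data; the labels are example wording.
plot(
  mtcars$wt, mtcars$mpg,
  main = "Heavier cars tend to have lower fuel efficiency",
  xlab = "Weight (1000 lbs)",
  ylab = "Fuel efficiency (miles per gallon)",
  col = ifelse(mtcars$am == 1, "darkorange", "steelblue"),  # am: 1 = manual, 0 = automatic
  pch = 19
)

# The legend explains the color coding so the figure can be read on its own.
legend(
  "topright",
  legend = c("Manual", "Automatic"),
  col = c("darkorange", "steelblue"),
  pch = 19,
  title = "Transmission"
)
```

A title that states the takeaway, rather than just naming the variables, helps a reader who is only skimming the figures.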
Your report should be submitted on Canvas as an R Markdown (.Rmd) file and the corresponding rendered output (.pdf) file.
Assessment Criteria
Your project will be evaluated based on the following criteria:
General: Is the topic original, interesting, and substantial – or is it trite, pedantic, and trivial? How much creativity, initiative, and ambition did the group demonstrate? Is the basic question driving the project worth investigating, or is it obviously answerable without a data-based study?
Design: Are the variables chosen appropriately and defined clearly, and is it clear how they were measured/observed? Can the effects of lurking variables be controlled for? Is there sufficient data to make meaningful conclusions?
Analysis: Are the chosen analyses appropriate for the variables/relationships under investigation, and are the assumptions underlying these analyses met? Do the analyses involve fitting and interpreting a multiple regression model? Are the analyses carried out correctly? Is there an effective mix of graphical, numerical, and inferential analyses? Did the group make appropriate conclusions from the analyses, and are these conclusions justified?
Communication: How effectively does the written report communicate the goals, procedures, and results of the study? Are the claims adequately supported? Are sources cited? How well is the report structured and organized? Does the writing style enhance what the group is trying to communicate? How well is the report edited? Are the statistical claims justified? Are text and analyses effectively interwoven in the technical report? Clear writing, correct spelling, and good grammar are important.