You will conduct a statistical study on a topic of your choice. This task will require you to write a project proposal, acquire and analyze relevant data, present your results orally to the class, and hand in a written report describing your study and its findings. Your project must involve fitting a multiple regression model. The project is an opportunity to show off what you’ve learned about data analysis, visualization, and statistical inference. It is a major component of the class, and successful completion is required to pass.

Big picture

You should pose a problem that you find interesting and which may be addressed (at least in part) through the analysis of data. Many interesting quantitative questions (and perhaps even more uninteresting ones) involve the relationships among several variables. Recent projects have considered the following questions:

  • What is the role of executive compensation in determining company performance?

  • What is the relationship between the value of a share of stock and financial characteristics of a company?

  • How is a state’s murder rate affected by its demographics and social characteristics?

  • How is the percentage of Massachusetts high school seniors going on to four-year colleges influenced by town and school characteristics?

  • What factors influence the incidence of tuberculosis in the U.S.?

  • How can we predict real estate prices in Northampton?

You should pose the problem that you want to solve as precisely as possible at the outset. Next, identify the population you want to describe, and think about how you will obtain relevant data. What kind of model might be appropriate in this context? You should also make a hypothesis, a priori (before you analyze the data), about the results you expect to see.

Most of you will pose your own question and acquire data from the Internet, but some may wish to analyze data that someone else (e.g., a professor or office at Smith, or a magazine or newspaper) has collected for another purpose. Please consult me if you plan to do this.

General Rules

Every group member must speak during the oral presentation. You may discuss your project with other students, but each group will have a different topic, so there is a limit to how much you can help each other. You may consult other sources for information about the non-statistical, substantive issues in your problem, but you must credit these sources in your report. Feel free to consult your professor about statistical questions.


Please see the project schedule for due dates.

All deliverables must be submitted electronically via Moodle by 11:55pm (five minutes before midnight) on the dates listed in the project schedule. Only one person from the group should submit the group’s product for each checkpoint (with the exception of the Group Dynamic, which is individual).

Group formation and roster

Form a group of 3 students from your section. Take the initiative to ask around and find a group to work with. Have one person send a group roster electronically by the date listed in the project schedule, with appropriate cc’s, using the message subject header MTH/SDS 220 Group Roster. Each group will be assigned a letter, and you will use Group-X with your particular letter in all communications from that point forward.

Initial and revised project proposal

You will need to submit an initial and a revised project proposal. For more information about these documents, see ProjectProposal.html.

Data Appendix

Technical Report

Your technical report will be an annotated R Markdown file (.Rmd) that contains your R code, interspersed with explanations of what the code is doing, and what it tells you about the problem.

Much more about the technical report can be found in ProjectTechnicalReport.html.

Places to Find Data

Finding the right data to answer your particular question is part of your responsibility for this assignment. Public data sets are available from hundreds of different websites, on virtually any topic. You might not be able to find the exact data that you want, but you should be able to find data that is relevant to your topic. You may also want to refine your research question so that it can be more clearly addressed by the data that you found. But be creative! Go find the data that you want!

Below is a list of places to get started, but this list should be considered grossly non-exhaustive:

Keep the following in mind as you select your topic and dataset:

  • You need to have enough data to make meaningful inferences. There is no magic number of individuals required for all projects, but aim for at least 200 individuals, and make sure there are at least 20 individuals in each category of each of your categorical variables (if you have any).

  • Most projects will measure a quantitative outcome, with at least two other variables included in the dataset (ideally, at least one of them quantitative). Most of you will use multiple linear regression for your primary analyses.
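
The two sample-size guidelines above can be checked mechanically before you commit to a dataset. The sketch below is one possible way to do that. It is written in plain Python purely for illustration (your own analysis will live in R Markdown), and the function name `check_sample_size`, the `towns` data, and the `region` variable are all hypothetical.

```python
from collections import Counter

MIN_ROWS = 200          # guideline from the text: at least 200 individuals
MIN_PER_CATEGORY = 20   # guideline from the text: at least 20 per category


def check_sample_size(rows, categorical_columns):
    """Return a list of warnings; an empty list means both guidelines are met.

    rows is a list of dicts (one per individual); categorical_columns names
    the keys that hold categorical variables.
    """
    warnings = []
    if len(rows) < MIN_ROWS:
        warnings.append(
            f"only {len(rows)} individuals (aim for at least {MIN_ROWS})"
        )
    for column in categorical_columns:
        # Count how many individuals fall in each level of this variable.
        counts = Counter(row[column] for row in rows)
        for level, n in sorted(counts.items()):
            if n < MIN_PER_CATEGORY:
                warnings.append(
                    f"{column}={level!r} has only {n} individuals "
                    f"(aim for at least {MIN_PER_CATEGORY})"
                )
    return warnings


# Made-up toy data: 250 towns with an invented 'region' variable.
towns = (
    [{"region": "east", "price": 300} for _ in range(150)]
    + [{"region": "west", "price": 250} for _ in range(90)]
    + [{"region": "island", "price": 500} for _ in range(10)]
)

for message in check_sample_size(towns, ["region"]):
    print(message)
```

With this toy data the overall count passes, but the `island` category is flagged because it has only 10 individuals. In a real project you might respond by merging sparse categories or by finding more data. The thresholds are guidelines, not hard rules.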

Once we respond to your initial proposal, you will revise it (perhaps starting with a different dataset), then submit a new proposal that addresses our feedback. Supply essentially the same information required for the initial proposal, but give a bit more detail.

Assessment Criteria

Your project will be evaluated based on the following criteria:

  • General: Is the topic original, interesting, and substantial – or is it trite, pedantic, and trivial? How much creativity, initiative, and ambition did the group demonstrate? Is the basic question driving the project worth investigating, or is it obviously answerable without a data-based study?

  • Design: Are the variables chosen appropriately and defined clearly, and is it clear how they were measured/observed? Can the effects of lurking variables be controlled for? Is there sufficient data to draw meaningful conclusions?

  • Analysis: Are the chosen analyses appropriate for the variables/relationships under investigation, and are the assumptions underlying these analyses met? Do the analyses involve fitting and interpreting a multiple regression model? Are the analyses carried out correctly? Is there an effective mix of graphical, numerical, and inferential analyses? Did the group draw appropriate conclusions from the analyses, and are these conclusions justified?

  • Technical Report: How effectively does the written report communicate the goals, procedures, and results of the study? Are the claims adequately supported? How well is the report structured and organized? Does the writing style enhance what the group is trying to communicate? How well is the report edited? Are the statistical claims justified? Are text and analyses effectively interwoven in the technical report? Clear writing, correct spelling, and good grammar are important.

  • Oral Presentation: How effectively does the oral presentation communicate the goals, procedures, and results of the study? Do the slides help to illustrate the points being made by the speaker without distracting the audience? Do the presenters seem to be well-rehearsed? Did they properly budget their time? Does each speaker appear confident in what she is saying? Are her arguments persuasive?

Undergraduate Statistics Project Competition

If you are interested, you can submit your project to the Undergraduate Statistics Project Competition, or USPROC. For this course, you would submit to the Undergraduate Statistics Class Project Competition (USCLAP), in the “first course in statistics” category. I’d be happy to chat with you about how strong a submission you might have.