\documentclass[10pt]{article}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{fancyhdr,url,hyperref}
\usepackage{graphicx,xspace}
\usepackage{subfigure}
\usepackage{tikz}
\usetikzlibrary{arrows,decorations.pathmorphing,backgrounds,positioning,fit,through}
\oddsidemargin 0in %0.5in
\topmargin 0in
\leftmargin 0in
\rightmargin 0in
\textheight 9in
\textwidth 6in %6in
%\headheight 0in
%\headsep 0in
%\footskip 0.5in
\newtheorem{thm}{Theorem}
\newtheorem{cor}[thm]{Corollary}
\newtheorem{obs}{Observation}
\newtheorem{lemma}{Lemma}
\newtheorem{claim}{Claim}
\newtheorem{definition}{Definition}
\newtheorem{question}{Question}
\newtheorem{answer}{Answer}
\newtheorem{problem}{Problem}
\newtheorem{solution}{Solution}
\newtheorem{conjecture}{Conjecture}
\pagestyle{fancy}
\lhead{\textsc{Prof. McNamara}}
\chead{\textsc{SDS/MTH 220: Lecture notes}}
\lfoot{}
\cfoot{}
%\cfoot{\thepage}
\rfoot{}
\renewcommand{\headrulewidth}{0.2pt}
\renewcommand{\footrulewidth}{0.0pt}
\newcommand{\ans}{\vspace{0.25in}}
\newcommand{\R}{{\sf R}\xspace}
\newcommand{\cmd}[1]{\texttt{#1}}
\newcommand{\Ex}{\mathbb{E}}
\rhead{\textsc{November 15, 2017}}
\begin{document}
\paragraph{Agenda}
\begin{enumerate}
\itemsep0em
\item USPROC competition
\item Difference of paired samples
\item Difference of two means
\end{enumerate}
\paragraph{Inference for a single mean}
Recall the conditions for using a $t$-based sampling distribution for a single mean:
\begin{enumerate}
\item The samples come from independent observations
\item The distribution of the original variable is approximately normal, or the sample size is large
\end{enumerate}
\paragraph{Gifted Children's Parents}
From last time:
<>=
require(mosaic)
require(openintro)
favstats(~motheriq, data = gifted)
favstats(~fatheriq, data = gifted)
@
\begin{enumerate}
\itemsep0.5in
\item Write the 90\% confidence interval for the mean IQ of the mothers. Do the same for the fathers. Do they overlap?
% \item Using the data above, write out a hypothesis test about whether the mothers of gifted children have higher IQs, on average, than the fathers. What do you conclude?
% \vspace{0.5in}
\end{enumerate}
\paragraph{Paired data}
Since in this data set, the IQ of both parents is recorded for all children, the IQ data is naturally paired. We can define a new variable, \texttt{diff}, to be the difference between the mother's IQ and the father's for each gifted child.
<>=
gifted <- gifted %>%
mutate(diff = motheriq - fatheriq)
favstats(~diff, data = gifted)
@
\begin{enumerate}
\itemsep0.5in
\item Find a 90\% confidence interval for the difference in IQ scores. Does it agree with what you found in the previous question?
\item Test the hypothesis that the mothers of gifted children have higher IQs, on average, than the fathers. Write out all of the steps. What do you conclude?
\begin{enumerate}
\itemsep0.5in
\item State the null and alternative hypotheses
\item Check that \texttt{diff} meets the conditions listed above
\item Compute the standard error of the mean ($SE_{\texttt{diff}}$) and the degrees of freedom
\item Compute the test statistic ($T$)
\item Compute the p-value and draw a conclusion [Use the table at the back of the book, or the \cmd{pt()} function in \R.]
\item Write a sentence that provides an interpretation of your result
\end{enumerate}
\vspace{0.5in}
\end{enumerate}
\paragraph{Difference of two means}
Often the data are \emph{not} naturally paired. In particular, we are often interested in comparing mean from two groups of unequal sizes. For example, the 11 children whose fathers had higher IQs than mothers had a lower average score on the skills test than the 25 children whose mothers had higher IQs than the fathers.
<>=
favstats(score ~ (diff > 0), data = gifted)
@
Now the samples are \emph{not} naturally paired. How do we know if the observed difference in means between these two groups is meaningful? Let $X$ be the random variable that gives the analytical skills test score for a gifted child whose father has a higher IQ than her mother, and let $Y$ be the random variable that gives the test score for a gifted child whose mother has a higher IQ. Then we need to understand the sampling distribution of the test statistic $D = \bar{X}- \bar{Y}$.
Just as we did with proportions, the standard error of the difference is a combination of the standard errors of the variables.
$$
SE_D = \sqrt{(SE_X)^2 + (SE_Y)^2}
$$
If both $X$ and $Y$ meet the conditions for a $t$-based sampling distribution, then $D$ will meet those conditions as well. We typically use $\min(n_1-1, n_2-1)$ for the degrees of freedom.
The hypothesis test for a difference of two means constructed in this manner is called the \emph{two-sample $t$-test}, and it is a commonly applied statistical technique.
\begin{enumerate}
\itemsep1in
\item Use the information above to conduct a two-sample $t$-test for a difference in mean test score between gifted children whose fathers have higher IQs vs. those whose mothers have higher IQs.
\end{enumerate}
\vspace{2in}
% \paragraph{ANOVA}
%
% We just developed a way to compare differences in means between \emph{two} groups. But what if we have more than two groups? Analysis of Variance (ANOVA) provides a mechanism for simultaneously assessing the differences between multiple groups.
%
% The HELP study was a clinical trial for adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care. We'll consider two variables:
%
% \begin{itemize}
% \itemsep0in
% \item \cmd{cesd}: Center for Epidemiologic Studies Depression measure at baseline (high scores indicate more depressive symptoms)
% \item \cmd{substance}: primary substance of abuse: alcohol, cocaine, or heroin
% \end{itemize}
%
% Are there important differences in the depression scores among patients depending on their drug of abuse?
%
% <>=
% favstats(cesd ~ substance, data = HELPrct)
% qplot(y = cesd, x = substance, data = HELPrct, geom = "boxplot")
% anova(aov(cesd ~ substance, data = HELPrct))
% @
%
% \begin{enumerate}
% \itemsep1.25in
% \item Write down the null and alternative hypotheses
% \item Check the conditions for ANOVA: is independence reasonable? Is normality reasonable? What about equal variance?
% \item Find the value of the test statistic ($F$) in the ANOVA table. Can you derive it from the other numbers in the table?
% \item Draw a picture of the sampling distribution of $F$. How many degrees of freedom do we have?
% \item Find the p-value. [You will need the function \cmd{pf()}.]
% \item What do you conclude? Write a sentence summarizing your findings.
% \vspace{0.75in}
% \end{enumerate}
% \newpage
%
% \paragraph{Solution to Warmup}
%
% We need to make three intervals:
%
% <>=
% require(mosaic)
% require(openintro)
% n <- nrow(gifted)
% # mothers
% mean_m <- mean(~motheriq, data = gifted)
% se_m <- sd(~motheriq, data = gifted) / sqrt(n)
% mean_m + qt(c(0.05, 0.95), df = (n - 1)) * se_m
% # fathers
% mean_f <- mean(~fatheriq, data = gifted)
% se_f <- sd(~fatheriq, data = gifted) / sqrt(n)
% mean_f + qt(c(0.05, 0.95), df = (n - 1)) * se_f
% # pairs
% gifted <- mutate(gifted, diff = motheriq - fatheriq)
% x_bar <- mean(~diff, data = gifted)
% se <- sd(~diff, data = gifted) / sqrt(n)
% 2 * pt(x_bar/se, df = (n-1), lower.tail = FALSE)
% @
%
%
% Recall that by definition, the standard error $SE_D$ is the standard deviation of the sampling distribution of the test statistic, $D$. Thus, \emph{since $D$ is not a mean},
%
% $$
% SE_{D} = \sigma_{D} = \sqrt{Var(D)} = \sqrt{Var(\bar{X} - \bar{Y})}
% $$
% If $X$ and $Y$ are independent, then we have:
% $$
% \sqrt{Var(\bar{X} - \bar{Y})} = \sqrt{Var(\bar{X}) + Var(\bar{Y})} = \sqrt{\sigma_{\bar{X}}^2 + \sigma_{\bar{Y}}^2} = \sqrt{(SE_{\bar{X}})^2 + (SE_{\bar{Y}})^2}
% $$
% We need to verify that both $X$ and $Y$ are distributed approximately normally. If they are, then $SE_{\bar{X}} = \sigma_X / \sqrt{n}$ and $SE_{\bar{Y}} = \sigma_Y / \sqrt{n}$. Thus,
% $$
% SE_{D} = \sqrt{ \left( \frac{\sigma_X}{\sqrt{n_X}} \right)^2 + \left( \frac{\sigma_Y}{\sqrt{n_Y}} \right)^2} = \sqrt{ \frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}
% $$
%
%
% \paragraph{Solution to two-sample t-test}
%
% <<>>=
% fv <- favstats(score ~ (diff > 0), data = gifted) %>%
% mutate(se = sd / sqrt(n))
% fv %>%
% summarize(obs = diff(mean),
% se = sqrt(sum(se^2)),
% df = min(n - 1)) %>%
% mutate(p_value = 2 * pt(obs/se, df = df, lower.tail = FALSE))
% @
%
% To match the output from \cmd{t.test()}, use the more complicated estimate of d.f.
%
% <<>>=
% fv %>%
% mutate(se_i = se^4 / (n - 1)) %>%
% summarize(obs = diff(mean),
% num = (sum(se^2))^2,
% denom = (sum(se_i / (n - 1)))) %>%
% mutate(df = num / denom,
% p_value = 2 * pt(obs/se, df = df, lower.tail = FALSE))
% t.test(score ~ (diff > 0), data = gifted)
% @
%
%
% \newpage
%
% \paragraph{One-way ANOVA}
%
% \begin{itemize}
% \item Key insight: One-way ANOVA is just regression with a quantitative response variable and a single categorical explanatory variable
% \item A two-sample t-test is just a special case of ANOVA where there are only two groups
% \item Consider the following formulations \emph{of the same model}:
% \begin{align*}
% y_{ij} &= \mu_i + \epsilon_{ij}, \text{ where } \epsilon_{ij} \sim N(0, \sigma) \\
% y_{ij} &= \mu + \alpha_i + \epsilon_{ij}, \text{ where } \epsilon_{ij} \sim N(0, \sigma) \\
% y_{ij} &= \mu_1 + \beta_i \cdot X_i + \epsilon_{ij}, \text{ where } \epsilon_{ij} \sim N(0, \sigma)
% \end{align*}
% for groups $i = 1,\ldots, I$ and individuals $j=1,\ldots,n_i$, with common standard deviation $\sigma$
% \item The $\mu_i$'s are the group means, $\mu$ is the grand mean, the $\alpha_i$'s are the group effects, and the $\beta_i$'s are the group effects relative to the \emph{reference group}.
% \item The models are the same, because the $\hat{y}_{ij}$'s are all the same.
% \item Let $X$ be the categorical variable that has a unique value for each of the $i = 1,\ldots,I$ groups. Furthermore, let $X_i$ be the indicator (binary) variable corresponding to the $i^{th}$ group.
% \item The third model is exactly what happens when you compute {\tt lm(y $\sim$ x)} in R, with {\tt x} being a categorical variable
% \item The pooled standard deviation $s_p$, a weighted average of the standard deviations of the groups, is an estimate of $\sigma$, the unknown common standard deviation. This equal to the residual standard error.
% \item The null hypothesis for one-way ANOVA is that $\beta_i = 0$ for all $i=1,\ldots,I-1$
% \item The sum of squares and degrees of freedom are partitioned similarly as
% $$
% SS_{Total} = SS_{Groups} + SS_{Residuals} \, , \qquad d.f._{Total} = d.f._{Groups} + d.f._{Residuals}
% $$
% \end{itemize}
\end{document}