--- title: 'Week 5 problem set: Worked solutions' subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525" author: "Aaron Shaw" date: "May 2, 2019" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## PC0 Load the dataset. As usual, you'll need to edit this block of code to point at the appropriate file location to make it work on your machine. ```{r} pop <- read.delim(file="~/Documents/Teaching/2019/stats/data/week_05/population.tsv") ``` ## PC1 ```{r} mean(pop$x, na.rm=T) ## Insert code for your group as needed: w3 <- read.csv(file="~/Documents/Teaching/2019/stats/data/week_03/group_03.csv", row.names=NULL) mean(w3$x, na.rm=T) ``` Given the structure of the full dataset, it's also easy to calculate all of the group means: ```{r} s.means <- tapply(pop$x, pop$group, mean, na.rm=T) s.means ``` We will discuss the relationship of the individual group means to population mean in class. Basically, we can think of each group as a sample, so the sample means are the *sampling distribution* of the population mean. ## PC2 I'll do this two ways. First, just plugging the values from the group sample into the formula for the standard error, I can then add/subtract twice the standard from the mean to find the 95% CI. ```{r} se <- sd(w3$x, na.rm=T) / sqrt(length(w3$x[!is.na(w3$x)])) mean(w3$x, na.rm=T)-(2*se) mean(w3$x, na.rm=T)+(2*se) ``` Now, I'll write a more general function to calculate confidence intervals. Note that I create an 'alpha' argument with a default value of 0.05. I then divide alpha by 2. Can you explain why this division step is necessary? ```{r} ci <- function (x, alpha=0.05) { x <- x[!is.na(x)] probs <- c(alpha/2, 1-alpha/2) critical.values <- qnorm(probs, mean=0, sd=1) se <- sd(x)/sqrt(length(x)) mean(x) + critical.values * se } ## Here I run the function on the group 3 data ci(w3$x) ``` Again, it's easy to estimate this for every group: ```{r} group.confints <- tapply(pop$x, pop$group, ci) group.confints ``` ## PC3 We'll discuss this one in class too. Since the samples are (random) samples, we should not be surprised that their individual group means are different from the population mean. We should also not be surprised that the 95% CI for the population mean estimated from at least one of the samples does *not* include the true population mean. Since our confidence interval is 95%, we would expect to be wrong about 1/20 times on average! ## PC4 Comparing the distributions (first for the single comparison) ```{r} hist(pop$x) hist(w3$x) summary(pop$x) sd(pop$x, na.rm=T) summary(w3$x) sd(w3$x, na.rm=T) ``` And here's code to do it for each group. A ggplot2 "faceted" histogram will look great and makes for concise code. ```{r} library(ggplot2) ggplot(pop, aes(x)) + geom_histogram() + facet_wrap(. ~ group) tapply(pop$x, pop$group, summary) tapply(pop$x, pop$group, sd, na.rm=T) ``` They all look a little bit different from each other and from the population distribution. We'll discuss these differences in class. Again, none of this should be shocking given the relationship of the samples to the population. ## PC5 ```{r} s.means <- tapply(pop$x, pop$group, mean, na.rm=T) s.means sd(s.means) ## My standard error from one of the groups above: se ``` We will discuss the relationship of these values in class. As mentioned earlier, the distribution of sample means drawn from the population is the *sampling distribution*. The standard error of the mean estimated from any of the individual groups/samples should be a good approximation of (but not necessarily equal to!) the standard deviation of the sampling distribution of the means. ## PC 6 ### (a) Since there's going to be some randomness in the next few commands, I'll set a seed value to ensure the results are consistent on other laptops. ```{r} set.seed(20190501) pop.unif <- sample(seq(0, 9), 10000, replace=TRUE) ``` ### (b) ```{r} mean(pop.unif) hist(pop.unif) ``` ### (c) ```{r} hist( sapply( rep(1, 100), function (x) { mean(sample(pop.unif, 2))} ) ) ``` ## (d) Same commands as (c) just changing the sample size ```{r} hist(sapply(rep(1, 100), function (x) { mean(sample(pop.unif, 10))})) hist(sapply(rep(1, 100), function (x) { mean(sample(pop.unif, 100))})) ``` ## PC7 We'll discuss this in class. Noteable things you might observe include that the sampling distribution of the means approaches normality as it gets larger in size whether the population we draw from is uniform, log-normal, or really just about any other distribution. This is an illustration of some aspects of the *central limit theorem*. It is also an illustration of the *t-distribution* (the basis for the t-tests that you learned about this week). # Statistical Questions ## SQ1 — 5.22 (a) We want to calculate the following: $$ \overline{x}_{diff} \pm t^*_{df}\frac{s_{diff}}{\sqrt{n}}$$ Plug in the corresponding values and the Z-score for a 95% confidence interval (1.98): $$-0.545 \pm 1.98 \times \frac{8.887}{\sqrt{200}} $$ I'll let R take it from here: ```{r} -0.545 - (1.98 * (8.887/sqrt(200))) -0.545 + (1.98 * (8.887/sqrt(200))) ``` (b) With 95% confidence, the average difference between reading and writing test scores falls between -1.79 and 0.70. (c) No, because the interval includes/crosses zero. ## SQ2 — 5.28 We want to test the following hypotheses: $$H_0~:~\mu_{0.99}=\mu_{1}$$ $$H_A~:~\mu_{0.99}\neq\mu_{1}$$ To do so, we can use a two-sample t-test to compare the two sample means. The conditions we'd like to satisfy are independence and normality. Re: independence, the samples are random and not terribly large (presumably less than 10% of all the diamonds of each carat rating on earth), so we should be good to go. Re: normality, visual inspection of the histograms presented in the textbook suggests that what skew may be present in either distribution is not extreme. Given that the conditions are met, here's how you could construct the test statistic $T$: $$T~=~\frac{Point~estimate~-~Null}{Standard~error_{difference}}$$ Plugging in formulas from chapter 5 of the textbook this looks like: $$T~=~ \frac{(\overline{x}_1-\overline{x}_2)-(0)}{\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}}$$ Now, plug in values from the table: $$T~=~ \frac{(44.51-56.81)-(0)}{\sqrt{\frac{13.32^2}{23}+\frac{16.13^2}{23}}}$$ Work that out and you should have $T~=~-2.82$. The degrees of freedom are estimated by the smaller of $n_1-1$ or $n_2-1$ (which are equal in this case), so $df~=~22$. Consulting the table of T-statistics from the back of the book, we find: $$p_{value}=P(T_{22} \gt 2.82) = 0.01$$ Assuming we're okay with a fale positive rate of $p\leq0.05$, this provides support for the alternative hypothesis and we can reject the null of no difference between the average standardized prices of 0.99 and 1 carat diamonds. ## SQ3 — 5.30 I want to calculate the following: $$(\overline{x_1}-\overline{x_2})~\pm~t^*_{df}\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}$$ Since the critical values for the distribution of the test statistic $T$ is available in R using the `qt()` function, I can use a similar procedure here as in my `ci()` function from earlier in the problem set to calculate the value of $t_{df}$ (in this case, $t_{22}$). Note that if you didn't use the `qt()` part of this you could have looked up the critical value for a two-tailed test in the t distribution table at the back of the textbook (or elsewhere). ```{r} diff.means <- 44.51-56.81 t.star <- qt(0.025, df=22) se <- sqrt( ((13.32^2)/23) + ((16.13^2)/23) ) diff.means - (t.star*se) diff.means + (t.star*se) ``` In words, I am 95% confident that the average difference between the standardized prices of 0.99 carat diamond and a 1 carat diamond falls between \$3.27 and \$21.33 (the 1 carat diamond costs more). ## SQ4 — 5.48 (a) Hypotheses: $H_0:$ The mean hours worked for the groups are all equal. $$\mu_{\lt~hs} = \mu_{hs} = \mu_{jc} = \mu_{ba} = \mu_{grad} $$ $H_A:$ The mean hours worked vary by education level. In other words, the means are not equal. (b) Conditions and assumptions necessary for unbiased ANOVA estimates include: Independent observations, normal(ish) distributions, and constant(ish) variance. The problem doesn't say much about the sample to help evaluate the independence of the observations, but it's definitely less than 10% of the population and is likely a fairly good approximation of a random sample (thereby satisfying the rule of thumb). From the boxplots the distributions all look fairly normal. The standard deviations are also similar. We'll assume that the conditions are met for the purposes of the test. (c) Working across the blanks in the table by row: * The degrees of freedom for degree $= 5-1 = 4$ * The Sum of Squares between degree levels $= 501.54 \times 4 = 2006.16$ * The F value $= Sum~Sq~degree / Mean~Sq~residuals = 501.54 / 229.12 = 2.189$ * The degrees of freedom for Residuals $= 1171-4 = 1167$ * Mean Square Residuals (Error) $= 267382/1167 = 229.12$ * Total degrees of freedom $=1172 - 1 = 1171$ * Total Sum of squares $=2006.16+267382 = 269388.16$ (d) According to the ANOVA results, we cannot reject the null hypothesis at a $p \leq 0.05$ level, suggesting that the mean hours worked may be equal across education levels. ## SQ5 — 5.52 (a) False. The ANOVA procedure does not evaluate the pairwise comparisons, but the overall variation across groups. (b) True, otherwise the F-value will not be large enough to reject the null hypothesis. (c) False. It is possible that none of the pairwise comparisons will be significantly different even if the ANOVA rejects the null. (d) Assuming this question is about the Bonferroni correction, False. The correction does not divide $\alpha$ by the number of groups, but rather the number of pairwise tests. In this case, 4 groups yields ${4}\choose{2} = 6$ pairs, meaning that the corrected value for $\alpha = 0.05/6 = 0.0083$. Other corrections exist even though they were not discussed in the book and they may choose other values for $\alpha$. ## SQ6 — Reinhart We'll discuss this one as a group. Personally, I find the focus on p-values somewhat thought-stopping and would prefer to see researchers and publications embrace reporting standards that yield more intuitive, flexible, and meaningful understanding. Confidence intervals seem to do this more effectively than p-values and, in that respect, I like confidence intervals quite a lot! # Empirical Paper Questions ## EQ1 — RQ4 questions (a) For RQ4, the units of analysis are the 109 respondents. The credibility index (a group of five survey items evaluating how credible the respondents perceived the blogs they read to be) was split at the mean and these high/low groups were the independent (grouping) variable in the ANOVA. The six perceived relationship management factors (survey items) were the dependent variables for the ANOVA. (b) The analysis tests whether the mean responses on the survey items composing each of the different relationship management factors were equal across the high/low blog credibility groups. The alternative hypotheses are whether there are differences between the groups for any of the perceived relationship management dimensions. (c) None of the ANOVA tests rejected the null hypothesis of no difference. In other words, there was no evidence that perceptions of relationship management dimensions varied across individuals perceiving blogs as low or high credibiliy. (d) It is (usually) a bit hard to say much from a null result! See the answer to (c) above. ## EQ5 — RQ5 questions (a) Again, the units are the 109 respondents and the partitioned (low/high) credibility index serves as the independent (grouping) variable. The crisis index is the dependent variable. (b) The ANOVA tests whether average assessments of perceived crisis in the organization in question were equal by whether participants perceived the blogs to be low/high credibility. The alternative hypotheses are whether there are differences between the groups for perceptions of the organization being in crisis. (c) The results of the ANOVA reject the null, suggesting support for the alternative hypothesis that participants reporting low credibility blogs reported different (higher) scores on the crisis index. (d) I find the reported differences compelling, but would like more information in order to determine more specific takeaways. For example, I would like to see descriptive statistics about the various measures to help evaluate whether they meet the assumptions for identifying the ANOVA. Survey indices like this are a bit slippery insofar as they can seem to yield results when the differences are really artifacts of the measurements and how they are coded. I am also a bit concerned that the questions seemed to ask about blog credibility in general rather than the specific credibility of the specific blogs read by the study participants? The presumed relationship between credibility and the assignment to the blogs in question is not confirmed empirically, meaning that the differences in perceptions of organizational crisis might be more related to baseline attitudes than to anything specific about the treatment conditions in the experiment. I would also like to know more about the conditional means and standard errors in order to evaluate whether the pairwise average perceptions of organizational crisis vary across perceived credibility levels. ## EQ6 — RQ6 questions (a) Analogous to RQ5 except that the (six) different dimensions of relationship management separated into high/low categories served as the independent (grouping) variables in the ANOVA. Perceptions of organizational crisis remained the dependent variable. (b) This set of ANOVAs test whether assessments of perceived organizational crisis were equal or varied depending on the prevalence of specific relationship management strategies. (c) The results of the ANOVAs are mixed, rejecting the null hypothesis of no difference for two of the relaionship management dimensions (task sharing and responsiveness/customer service), but failing to do so for the other four dimensions. For the two dimensions in question, higher prevalence of each strategy appeared to reduce the perception of crisis. (d) Again, the differences reported are compelling insofar as they indicate larger varation across the groups in terms of perceived crisis than would be expected in a world where no difference existed. That said, in addition to all the issues mentioned above for EQ5 part (d), this set of tests also raisesw issues of multiple comparisons. Not only is it useful to consider multiple comparisons in the context of a single ANOVA, but when running many ANOVAs across many grouping variables. If the authors really wanted to test whether the specific PR strategies mentioned here altered perceptions of organizational crisis across different types of blogs, they should/could have incorporated those variations into their experimental design much more directly. As such, the results remain suggestive that some of the observed relationships may exist, but can provide only weak support for even those that reject the null hypotheses in statistical tests.