Week 8 problem set: Worked solutions

Programming Challenges

w3 <- read.csv(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_03/group_08.csv"))

## Cleanup and recoding
w3$j <- as.logical(w3$j)
w3$l <- as.logical(w3$l)

w3$k <- factor(w3$k,
               levels=c(0,1,2,3),
               labels=c("none", "some", "lots", "all"))

summary(w3)

##        x                j               l              k     
##  Min.   :0.001657   Mode :logical   Mode :logical   none: 0  
##  1st Qu.:0.691977   FALSE:55        FALSE:48        some:31  
##  Median :2.059659   TRUE :45        TRUE :52        lots:39  
##  Mean   :2.400375                                   all :30  
##  3rd Qu.:3.534149                                            
##  Max.   :9.170684                                            
##  NA's   :5                                                   
##        y        
##  Min.   :-4.40  
##  1st Qu.: 3.02  
##  Median : 6.08  
##  Mean   : 7.47  
##  3rd Qu.:11.95  
##  Max.   :29.51  
##  NA's   :5

PC2

t.test(w3$x, w3$y)

## 
##  Welch Two Sample t-test
## 
## data:  w3$x and w3$y
## t = -7.0336, df = 111.6, p-value = 1.711e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.497250 -3.641157
## sample estimates:
## mean of x mean of y 
##  2.400375  7.469579

PC3

cor(w3$x, w3$y, use="complete.obs")

## [1] 0.9063767

PC4

f <- formula(y ~ x)
m <- lm(f, data=w3) # 'lm' stands for 'linear model', and is used for linear regression

summary(m)

## 
## Call:
## lm(formula = f, data = w3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3044 -2.2549  0.1348  2.2098  4.8179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.3898     0.4502   0.866    0.389    
## x             2.9494     0.1426  20.690   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.852 on 93 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8196 
## F-statistic: 428.1 on 1 and 93 DF,  p-value: < 2.2e-16

PC5

## Using base R graphing tools
plot(m$residuals)

hist(m$residuals)

plot(m$fitted.values, m$residuals)

qqnorm(m$residuals)

library(ggplot2)

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

library(ggfortify)

autoplot(m)

PC6

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

stargazer(m, type="text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                  y             
## -----------------------------------------------
## x                            2.949***          
##                               (0.143)          
##                                                
## Constant                       0.390           
##                               (0.450)          
##                                                
## -----------------------------------------------
## Observations                    95             
## R2                             0.822           
## Adjusted R2                    0.820           
## Residual Std. Error       2.852 (df = 93)      
## F Statistic           428.063*** (df = 1; 93)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

PC7

library(readstata13)
df <- read.dta13('~/Documents/Teaching/2019/stats/data/week_07/Halloween2012-2014-2015_PLOS.dta')
df <- df[complete.cases(df),] # Remove rows with NA

f1 <- formula(fruit ~ obama) # Create a formula for the regression

m1 <- lm(f1, data=df) # Create a model and store it as m1

summary(m1) # Display a summary of the regression model

## 
## Call:
## lm(formula = f1, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2748 -0.2748 -0.2368  0.7252  0.7632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.23681    0.01555  15.233   <2e-16 ***
## obama        0.03797    0.02578   1.473    0.141    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4333 on 1219 degrees of freedom
## Multiple R-squared:  0.001776,   Adjusted R-squared:  0.0009572 
## F-statistic: 2.169 on 1 and 1219 DF,  p-value: 0.1411

PC8

autoplot(m1)

hist(residuals(m1))

plot(df$obama, m1$residuals)

PC9

I’ll fit the models separately and then present them using Stargazer to facilitate readability:

m2012 <- lm(f1, data=df[df$year == 2012,]) # Create a regression model for just 2012
m2014 <- lm(f1, data=df[df$year == 2014,])
m2015 <- lm(f1, data=df[df$year == 2015,])

stargazer(m2012, m2014, m2015, column.labels = c("2012", "2014", "2015"), type="text")

## 
## ================================================================================
##                                         Dependent variable:                     
##                     ------------------------------------------------------------
##                                                fruit                            
##                            2012                2014                 2015        
##                             (1)                 (2)                 (3)         
## --------------------------------------------------------------------------------
## obama                      0.050               0.017               0.065*       
##                           (0.067)             (0.042)             (0.038)       
##                                                                                 
## Constant                 0.203***            0.219***             0.253***      
##                           (0.051)             (0.026)             (0.021)       
##                                                                                 
## --------------------------------------------------------------------------------
## Observations                164                 422                 635         
## R2                         0.003              0.0004               0.004        
## Adjusted R2               -0.003              -0.002               0.003        
## Residual Std. Error  0.424 (df = 162)    0.419 (df = 420)     0.445 (df = 633)  
## F Statistic         0.550 (df = 1; 162) 0.159 (df = 1; 420) 2.849* (df = 1; 633)
## ================================================================================
## Note:                                                *p<0.1; **p<0.05; ***p<0.01

Statistical Questions

SQ1—8.14

Assumptions and how they look from the plots: Linear relationship between explanatory and outcome variables: Some outliers in the plots of residuals against each of the explanatory variables (IQ and gender) suggests that the fit is less than ideal. It does not look like there’s a strong linear relationship from these plots. Normally distributed residuals: The quantile-quantile plot (top left) shows some systematic deviation from the normal distribution. Constant variability of residuals: There are some outliers (individual points with very large residual values) in the top right plot. There also seems to be some relationship between the residuals and gender (bottom plot). These features suggest that the assumption is not met. Independent residuals: None of the plots show obvious evidence of dependence between data points, but there are (again) some outliers that could be concerning.

Overall, the data do not seem to meet several of the assumptions necessary to identify the model due to some outliers and some unobserved relationships structuring the data. It would be risky to rely on this model to draw valid inferences about the relationships between the variables as a result.

SQ2—8.16

The damaged O-rings almost all occurred at the lowest launch-time temperature.
The model suggests that lower launch-time temperatures result in a higher probability of O-ring damage. The coefficient of the “Temperature” term is negative with a very small (proportionally speaking) standard error. It is statistically significant with a p-value near 0 (\(H_0:~\beta_{temp} = 0\), \(H_A:~\beta_{temp}\neq 0\)), indicating that the data provide evidence that the coefficient is likely different from 0. Exponentiating the coefficient (see the R-code below), we see that a one degree farenheit increase in launch-time temperature is associated with 81% as large odds of O-ring damage occurring.

exp(-.2162)

## [1] 0.8055742

The corresponding logistic model where \(\hat{p}_{damage}\) represents the probability of a damaged O-ring:
\[log(\frac{\hat{p}_{damage}}{1-\hat{p}_{damage}}) = 11.663 - 0.2162\times~Temperature\]
Given the high stakes in terms of human lives and vast costs involved, concerns about the relationship between O-rings and launch-time temperature seem more than justified from this data. The large negative association between temperature and O-ring damage suggest increased potential for failures at low launch temperatures. That said, several limitations of the data, modeling strategy, and estimates should be kept in mind. See my answer to part (c) of SQ3 below for more.

SQ3—8.18

Let’s do this in R. Remember that we’ll need to plug in the values for temperature and do some algebra with the natural logarithm parts of the formula (on the left hand side above) to find the predicted probability of O-ring damage:

my.fits <- function(x){
  p.hat <- exp(11.663-(0.2162*x) ) # Function to get the odds
  print(p.hat)
  return(p.hat / (1 + p.hat)) # Convert the odds into a probability
}

vals <- c(51,52, 53, 55) # Vector of values we want probabilities for

my.fits(vals)

## [1] 1.8904218 1.5228750 1.2267888 0.7961243

## [1] 0.6540297 0.6036268 0.5509228 0.4432456

I’ll use my little function above to build a data frame with the predicted values and plot everything in ggplot. Note that the question asks for a “smooth curve” fit to the dots. There are many ways to do this in ggplot. I demonstrate one here using geom_smooth() that fits a quadratic function (\(y = x^2\)) to the points. You might experiment with the different options for geom_smooth() or, for a simpler solution, just try geom_line() (with no arguments) instead.

vals <- seq(51, 71, by=2) # This creates a vector from 51 to 71, counting by twos 

probs <- my.fits(vals) # store the probabilities as another vector

##  [1] 1.89042184 1.22678877 0.79612426 0.51664464 0.33527640 0.21757754
##  [7] 0.14119689 0.09162968 0.05946306 0.03858854 0.02504202

pred <- data.frame(temp=vals, pred.probability=probs) # Change to a data frame for ggplot

ggplot(data=pred, aes(x=temp, y=pred.probability)) + 
  geom_point(color="purple") + # Plot the points
  geom_smooth(color="orange", # Add a smooth line
              method="glm", # Create a line fit to the data
              formula = y ~ poly(x, 2), # Using this formula
              se=FALSE) # Don't show standard errors

I have several concerns about this application of logistic regression to the problem and data at hand. First, this is a fairly small observational dataset modeling a relatively rare outcome. It seems like the model treats each O-ring as independent (the problem wording is not clear about the model in this respect) and, if we accept that assumption, ~11 damaged O-ring outcomes occurred out of almost 140 trials (less than 10%). This assumption of independence between the observations seems likely to be problematic since each O-ring in a given mission is probably more similar to the others on that mission that to O-rings on another mission. It is also possible that the O-ring production or installation procedures may have changed across the missions over time. Any such clustered or time-dependent structures lurking in the data could lead to unobserved bias in the estimates when we model each O-ring outcome as independent events. Furthermore, about 50% of the observed failures occurred in a single mission—the mission with the lowest observed launch-time temperature. The result is that one mission drives the model results disproportionately (it generates observations that exert “high leverage” on the model fit to the data). Without knowing ahead of time that temperature was a likely explanation (as compared against any of the other infinite details of that one mission), it’s hard to see how NASA analysts necessarily should have drawn this conclusion on the basis of evidence like this.

Empirical Questions

We will discuss these in class.