Programming Questions

Note that Jeremy created this solution set with some code that uses functions drawn from the tidyverse. Wherever possible, we have also provided solution code using base R functions or alternatives since the course has not really provided an introduction to the tidyverse.

PC0

First we import the data.

raw_df = read.csv("/tmp/owan03.csv") # Note that I saved the file as a CSV for importing to R
head(raw_df)
##   X1 X2 X3 X4
## 1 70 49 30 34
## 2 77 60 37 36
## 3 83 63 56 48
## 4 87 67 65 48
## 5 92 70 76 65
## 6 93 74 83 91

PC1

Let’s organize the data. We want this in “long” format where the survival times in weeks for each of the three dosage levels are all arranged in a single column and another variable labels the dosage level. There are (surprise!) many ways to do this. Here are two examples using a function called melt() from a library called reshape and another function called gather() from the tidyverse library. Note that you’ll need to install the respective library and load it before you can call either function!

library(reshape)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following objects are masked from 'package:reshape':
## 
##     expand, smiths
# Rename the columns
colnames(raw_df) <- c("None", "Low", "Medium","High")

# Here's a version using "melt"
d <- melt(raw_df, na.rm=TRUE)
## Using  as id variables
names(d) <- c("dose", "weeks_alive")

# "gather" similarly takes all of the columns from the dataframe and uses them as keys, with the value as the value from that # cell. This gives us a "long" dataframe
df <- gather(raw_df, dose, weeks_alive)
# We want to treat dose as a factor, so we can group by it easier. Specifying the levels helps with labels later on.
df$dose <- factor(df$dose, levels=c("None", "Low", "Medium", "High"))
# Finally, remove NAs
df <- df[complete.cases(df),]

# Uncomment this to see that the results are the same:
## df==d

PC2

Now we’re going to calculate summary statistics and create some visualizations

# First, using the tapply function
tapply(df$weeks_alive, df$dose, summary)
## $None
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   70.00   85.00   93.00   91.36  101.00  103.00 
## 
## $Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49.00   63.00   70.00   69.89   77.00   89.00 
## 
## $Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   58.25   79.50   71.50   89.25   97.00 
## 
## $High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34.00   45.00   56.50   65.25   92.75  102.00
# Alternative way to do this using tidyverse "group_by" function:
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:reshape':
## 
##     rename
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR'))
## # A tibble: 4 x 5
##   dose     min   max  mean   IQR
##   <fct>  <dbl> <dbl> <dbl> <dbl>
## 1 None      70   103  91.4  16  
## 2 Low       49    89  69.9  14  
## 3 Medium    30    97  71.5  31  
## 4 High      34   102  65.2  47.8

When it comes to visualizations, we definitely want to use ggplot. We have lots of options for what to do.

library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
# Histograms

h_plot <- ggplot(data=df, aes(x=weeks_alive, fill = dose)) + geom_histogram(position = 'dodge', bins = 5)
h_plot

# In this case, faceted histograms is probably better
h_facet <- ggplot(data=df, aes(x=weeks_alive)) + geom_histogram(bins = 5) + facet_grid(~dose)
h_facet

# Even easier to compare if we flip the coordinates (rotate the axes)
h_facet+coord_flip()

# Density plots. Note that setting a value <1 for "alpha" makes the fill more transparent
d_plot <- ggplot(data=df, aes(x=weeks_alive, fill = dose)) + geom_density(alpha = .2)
d_plot

# Boxplots
box_plot <- ggplot(data=df, aes(y=weeks_alive,  x = dose)) + geom_boxplot()
box_plot

# Jeremy's favorite - ridgeline plots

# install.packages('ggridges')
library(ggridges) 
## 
## Attaching package: 'ggridges'
## The following object is masked from 'package:ggplot2':
## 
##     scale_discrete_manual
ridge_plot <- ggplot(data=df, aes(x=weeks_alive, y = dose)) + geom_density_ridges(jittered_points = T, fill = 'orange')
ridge_plot
## Picking joint bandwidth of 10.5