Northwestern University

MTS 525

Note that Jeremy created this solution set with some code that uses functions drawn from the `tidyverse`

. Wherever possible, we have also provided solution code using base R functions or alternatives since the course has not really provided an introduction to the tidyverse.

First we import the data.

```
raw_df = read.csv("/tmp/owan03.csv") # Note that I saved the file as a CSV for importing to R
head(raw_df)
```

```
## X1 X2 X3 X4
## 1 70 49 30 34
## 2 77 60 37 36
## 3 83 63 56 48
## 4 87 67 65 48
## 5 92 70 76 65
## 6 93 74 83 91
```

Let’s organize the data. We want this in “long” format where the survival times in weeks for each of the three dosage levels are all arranged in a single column and another variable labels the dosage level. There are (surprise!) many ways to do this. Here are two examples using a function called `melt()`

from a library called `reshape`

and another function called `gather()`

from the `tidyverse`

library. Note that you’ll need to install the respective library and load it before you can call either function!

```
library(reshape)
library(tidyr)
```

```
##
## Attaching package: 'tidyr'
```

```
## The following objects are masked from 'package:reshape':
##
## expand, smiths
```

```
# Rename the columns
colnames(raw_df) <- c("None", "Low", "Medium","High")
# Here's a version using "melt"
d <- melt(raw_df, na.rm=TRUE)
```

`## Using as id variables`

```
names(d) <- c("dose", "weeks_alive")
# "gather" similarly takes all of the columns from the dataframe and uses them as keys, with the value as the value from that # cell. This gives us a "long" dataframe
df <- gather(raw_df, dose, weeks_alive)
# We want to treat dose as a factor, so we can group by it easier. Specifying the levels helps with labels later on.
df$dose <- factor(df$dose, levels=c("None", "Low", "Medium", "High"))
# Finally, remove NAs
df <- df[complete.cases(df),]
# Uncomment this to see that the results are the same:
## df==d
```

Now we’re going to calculate summary statistics and create some visualizations

```
# First, using the tapply function
tapply(df$weeks_alive, df$dose, summary)
```

```
## $None
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 70.00 85.00 93.00 91.36 101.00 103.00
##
## $Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.00 63.00 70.00 69.89 77.00 89.00
##
## $Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 58.25 79.50 71.50 89.25 97.00
##
## $High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.00 45.00 56.50 65.25 92.75 102.00
```

```
# Alternative way to do this using tidyverse "group_by" function:
library(dplyr)
```

```
##
## Attaching package: 'dplyr'
```

```
## The following object is masked from 'package:reshape':
##
## rename
```

```
## The following objects are masked from 'package:stats':
##
## filter, lag
```

```
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

`df %>% group_by(dose) %>% summarize_all(c('min','max','mean', 'IQR'))`

```
## # A tibble: 4 x 5
## dose min max mean IQR
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 None 70 103 91.4 16
## 2 Low 49 89 69.9 14
## 3 Medium 30 97 71.5 31
## 4 High 34 102 65.2 47.8
```

When it comes to visualizations, we definitely want to use ggplot. We have lots of options for what to do.

`library(ggplot2)`

```
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
```

```
# Histograms
h_plot <- ggplot(data=df, aes(x=weeks_alive, fill = dose)) + geom_histogram(position = 'dodge', bins = 5)
h_plot
```

```
# In this case, faceted histograms is probably better
h_facet <- ggplot(data=df, aes(x=weeks_alive)) + geom_histogram(bins = 5) + facet_grid(~dose)
h_facet
```

```
# Even easier to compare if we flip the coordinates (rotate the axes)
h_facet+coord_flip()
```

```
# Density plots. Note that setting a value <1 for "alpha" makes the fill more transparent
d_plot <- ggplot(data=df, aes(x=weeks_alive, fill = dose)) + geom_density(alpha = .2)
d_plot
```

```
# Boxplots
box_plot <- ggplot(data=df, aes(y=weeks_alive, x = dose)) + geom_boxplot()
box_plot
```

```
# Jeremy's favorite - ridgeline plots
# install.packages('ggridges')
library(ggridges)
```

```
##
## Attaching package: 'ggridges'
```

```
## The following object is masked from 'package:ggplot2':
##
## scale_discrete_manual
```

```
ridge_plot <- ggplot(data=df, aes(x=weeks_alive, y = dose)) + geom_density_ridges(jittered_points = T, fill = 'orange')
ridge_plot
```

`## Picking joint bandwidth of 10.5`