---
title: "Week 2 R lecture"
subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525"
author: "Aaron Shaw"
date: "April 4, 2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Adding comments to your code
Sorry, I realized that I forgot to explain this in last week's R lecture!
R treats the `#` character and anything after it on the same line as a comment, so it will not try to run that text as a command:
```{r}
2+2
# This is a comment. The next line is too:
# 2+2
```
## More advanced variable types
### Factors
These are for categorical data. You can create them with the `factor()` command or by running `as.factor()` on a character vector.
```{r}
cities <- factor(c("Chicago", "Detroit", "Milwaukee"))
summary(cities)
class(cities)
more.cities <- c("Oakland", "Seattle", "San Diego")
summary(more.cities)
class(more.cities)
more.cities <- as.factor(more.cities)
summary(more.cities)
class(more.cities)
```
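To see what a factor stores, it can help to look at its levels and the integer codes behind them. This is only a small sketch; the `sizes` vector is made up for illustration and is not part of the lecture data.

```{r}
# Hypothetical example vector (not part of the lecture data)
sizes <- factor(c("small", "large", "small", "medium"))
levels(sizes)      # the distinct categories, sorted alphabetically by default
as.integer(sizes)  # the integer codes R stores behind the labels
table(sizes)       # counts per category
```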
### Lists
Lists are a lot like vectors, but can contain any kind of object (e.g., other variables).
```{r}
cities.list <- list(cities, more.cities)
cities.list
```
We can name the items in the list just like we did for a vector:
```{r}
names(cities.list) <- c("midwest", "west")
# This works too:
cities.list <- list("midwest" = cities, "west" = more.cities)
```
You can index into the list much like you can with a vector, except that you use two sets of square brackets to pull out an individual item (a single set of brackets returns a smaller list instead of the item itself):
```{r}
cities.list[["midwest"]]
cities.list[[2]]
```
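Single square brackets also work on a list, but they hand back a smaller list rather than the item itself. The sketch below compares the two; `example.list` is a made-up list so that the chunk runs on its own.

```{r}
# Hypothetical list, built here so the example is self-contained
example.list <- list(midwest = c("Chicago", "Detroit"), west = c("Oakland", "Seattle"))
class(example.list["midwest"])    # single brackets: still a list
class(example.list[["midwest"]])  # double brackets: the character vector itself
example.list$west                 # `$` is shorthand for [["west"]]
```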
With a list you can also index recursively (down into the individual objects contained in the list). For example:
```{r}
cities.list[["west"]][2]
```
Some functions "just work" on lists. Others don't or produce weird output that you probably weren't expecting. You should be careful and check the output to see what happens:
```{r}
# summary works as you might hope:
summary(cities.list)
# table cross-tabulates the items in the list, which looks very weird here:
table(cities.list)
```
### Matrices
Matrices are a little less common for everyday use, but it's good to know they're there and that you can do matrix arithmetic to your heart's content. An example is below. Check out the help documentation for the `matrix()` function for more.
```{r}
m1 <- matrix(c(1:12), nrow=4, byrow=FALSE)
m1
m2 <- matrix(seq(2,24,2), nrow=4, byrow=FALSE)
m2
m1*m2 # element-wise multiplication (the matrix product uses %*% instead)
t(m2) # transposition
```
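Note that `*` multiplies matrices element by element. The matrix product uses the `%*%` operator and requires the number of columns in the first matrix to match the number of rows in the second. A minimal sketch with two made-up 2 x 2 matrices (`a` and `b` are hypothetical names):

```{r}
a <- matrix(1:4, nrow=2)   # filled by column: rows are (1, 3) and (2, 4)
b <- matrix(5:8, nrow=2)   # rows are (5, 7) and (6, 8)
a * b     # element-wise product
a %*% b   # matrix product (rows of a times columns of b)
```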
### Data frames
A data frame is a format for storing tabular data. Formally, it consists of a list of vectors of equal length.
For our purposes, data frames are the most important data structure (or type) in R. We will use them constantly. They have rows (usually units or observations) and columns (usually variables). There are also many functions designed to work especially (even exclusively) with data frames. Let's take a look at another built-in example dataset, `faithful` (note: read the help documentation on `faithful` to learn about the dataset!):
```{r}
faithful <- faithful # This makes the dataset visible in the "Environment" tab in RStudio.
dim(faithful) # often the first thing I do with any data frame
nrow(faithful)
names(faithful) ## try colnames(faithful) too
head(faithful) ## look at the first few rows of data
summary(faithful)
```
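The "list of vectors of equal length" definition can be checked directly. The short sketch below uses only base R functions on the built-in `faithful` data:

```{r}
is.list(faithful)                # a data frame is a list under the hood
length(faithful)                 # number of list items, i.e., columns
length(faithful[["eruptions"]])  # each column is a vector...
length(faithful[["waiting"]])    # ...and they all have the same length
```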
You can index into a data frame using numeric values or variable names. The notation uses square brackets again and follows the convention `[row, column]`; leaving either side of the comma empty selects every row or every column:
```{r}
faithful[1,1] # The item in the first row of the first column
faithful[,2] # all of the items in the second column
faithful[10:20, 2] # ranges work too
faithful[37, "eruptions"]
```
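A few more variations on the same bracket notation, as a sketch: an empty column slot returns every column, and a character vector selects several columns by name.

```{r}
faithful[1:3, ]                           # rows 1 to 3, every column
faithful[1:3, c("eruptions", "waiting")]  # rows 1 to 3, two columns chosen by name
dim(faithful[1:3, ])                      # the result is still a data frame: 3 rows, 2 columns
```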
It is very useful to work with column (variable) names in a data frame using the `$` symbol:
```{r}
faithful$eruptions
mean(faithful$waiting)
boxplot(faithful$waiting)
```
Data frames are very useful for bivariate analyses (e.g., plots and tables). The base R notation for a bivariate presentation usually uses the `~` character. If both of the variables in your bivariate comparison are within the same data frame you can use the `data=` argument. For example, here is a scatterplot of eruption time (Y axis) over waiting time (X axis):
```{r}
plot(eruptions ~ waiting, data=faithful)
```
Data frames can have an arbitrary number of columns (variables). Another built-in dataset used frequently in R documentation and examples is `mtcars` (read the help documentation! It contains a codebook that tells you about each variable). Let's look at that one next:
```{r}
mtcars <- mtcars
dim(mtcars)
head(mtcars)
```
There are many ways to create and modify data frames. Here is an example playing with the `mtcars` data. I use the `data.frame` command to build a new data frame from three vectors:
```{r}
my.mpg <- mtcars$mpg
my.cyl <- mtcars$cyl
my.disp <- mtcars$disp
df.small <- data.frame(my.mpg, my.cyl, my.disp)
class(df.small)
head(df.small)
# recode a value as missing
df.small[5,1] <- NA
# removing a column
df.small[,3] <- NULL
dim(df.small)
head(df.small)
```
Creating new variables, recoding, and transformations look very similar to working with vectors. Notice the `na.rm=TRUE` argument I am passing to the `mean` function in the first line here:
```{r}
df.small$mpg.big <- df.small$my.mpg > mean(df.small$my.mpg, na.rm=TRUE)
table(df.small$mpg.big)
df.small$mpg.l <- log1p(df.small$my.mpg) # notice: log1p(x) computes log(1 + x)
head(df.small$mpg.l)
## convert a number into a factor:
df.small$my.cyl.factor <- factor(df.small$my.cyl)
summary(df.small$my.cyl.factor)
```
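To see why `na.rm=TRUE` matters, compare the two calls in this small sketch. The vector `x` is made up for illustration:

```{r}
x <- c(10, NA, 30)    # hypothetical vector with one missing value
mean(x)               # NA: the missing value propagates
mean(x, na.rm=TRUE)   # 20: the mean of the two observed values
```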
Some special functions are particularly useful for working with data frames:
```{r}
is.na(df.small$my.mpg)
sum(is.na(df.small$my.mpg)) # sum() counts each TRUE as 1, so this is the number of missing values
complete.cases(df.small)
sum(complete.cases(df.small))
```
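The `sum()` calls work on logical values because R counts `TRUE` as 1 and `FALSE` as 0. A tiny sketch with a made-up logical vector (`flags` is a hypothetical name):

```{r}
flags <- c(TRUE, FALSE, TRUE, TRUE)  # hypothetical logical vector
sum(flags)    # 3: the number of TRUE values
mean(flags)   # 0.75: the proportion of TRUE values
```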
## "Apply" functions and beyond
R has some special functions to help apply operations over vectors, lists, etc. These can seem a little complicated at first, but they are super, super useful.
Most of the base R versions of these have "apply" in the name. There are also alternatives, some of them created by the same people who created ggplot2; you can read more about them in, for example, the Healy *Data Visualization* book. I will stick to the base R versions here. Please feel free to read about and use the alternatives!
Let's start with an example using the `mtcars` dataset again. The `sapply()` and `lapply()` functions both "apply" the second argument (a function) iteratively to the items (variables) in the first argument:
```{r}
sapply(mtcars, quantile)
lapply(mtcars, quantile) # Same numbers, different format/class (a list instead of a matrix)
```
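One way to see the difference between the two is to store the results and inspect them. This is a sketch; `s` and `l` are just names chosen for the example:

```{r}
s <- sapply(mtcars, quantile)
l <- lapply(mtcars, quantile)
is.matrix(s)    # sapply() simplified the results into a matrix
is.list(l)      # lapply() always returns a list
s[, "mpg"]      # a column of the matrix...
l[["mpg"]]      # ...holds the same numbers as the matching list item
```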
Experiment with that idea on your own a little bit before moving on. For example, can you find the mean of each variable using either `sapply` or `lapply`?
The `tapply` function allows you to apply functions conditionally. For example, below I find mean gas mileage by number of cylinders. The second argument (`mtcars$cyl`) provides an index into the first (`mtcars$mpg`) before the third argument (`mean`) is applied to each of the conditional subsets:
```{r}
tapply(mtcars$mpg, mtcars$cyl, mean)
```
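To make the "conditional subsets" idea concrete, the sketch below reproduces one entry of the `tapply()` result by subsetting by hand (`by.cyl` is a name chosen for the example):

```{r}
by.cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
by.cyl["4"]                          # the 4-cylinder entry of the tapply() result
mean(mtcars$mpg[mtcars$cyl == 4])    # same value, computed by subsetting by hand
```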
Try some other calculations using `tapply()`. Can you calculate the average engine displacement conditional on number of cylinders? What about the average miles per gallon conditional on whether the car has an automatic transmission?
Note that `apply()` works pretty smoothly with matrices, but it can be a bit complicated/surprising otherwise.
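Here is a minimal sketch of `apply()` on a matrix. The second argument says whether to work across rows (`1`) or columns (`2`); the matrix `m` is made up for illustration.

```{r}
m <- matrix(1:6, nrow=2)   # hypothetical 2 x 3 matrix; rows are (1, 3, 5) and (2, 4, 6)
apply(m, 1, sum)   # MARGIN = 1: one sum per row, 9 and 12
apply(m, 2, sum)   # MARGIN = 2: one sum per column, 3, 7, and 11
```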
## Some basic graphs with ggplot2
ggplot2 is what I like to use for plotting so I'll develop examples with it from here on out.
Make sure you've installed the package with `install.packages("ggplot2")` and load it with `library(ggplot2)`.
There is another built-in (automotive) dataset that comes along with the ggplot2 package called `mpg`. This dataset includes variables of several types and is used in much of the package documentation, so it's helpful to become familiar with it.
I'll develop a few simple examples below. For more, please take a look at the (extensive) [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/index.html). There are **many** options and arguments for most of these functions and many more functions to help you produce publication-ready graphics. Chapter 3 of the Healy book is also an extraordinary resource for getting started creating visualizations with ggplot2.
```{r}
library(ggplot2)
mpg <- mpg
# First thing, call ggplot() to start building up a plot
# aes() indicates which variables to use as "aesthetic" mappings
p <- ggplot(data=mpg, aes(manufacturer, hwy))
p + geom_boxplot()
# another relationship:
p <- ggplot(data=mpg, aes(fl, hwy))
p + geom_boxplot()
p + geom_violin()
```
Here's another that visualizes the relationship between miles per gallon (mpg) in the city vs. mpg on the highway:
```{r}
p <- ggplot(data=mpg, aes(cty, hwy))
p + geom_point()
# Multivariate graphical displays can get pretty wild
p + geom_point(aes(color=factor(class), shape=factor(cyl)))
```