This week, we’ll introduce some ways to do some the discrete math, generate distributions, and use a for-loop to run a small simulation.

1 Binomial and factorial functions

In Chapter 3 (and in the previous problem set), you needed to calculate some binomial choice arithmetic and/or factorials. They weren’t absolutely necessary for the problem set, but here are the corresponding functions in R.

Let’s say we want to calculate how many possible pairs you can draw from a population of ten individuals, a.k.a., \(10 \choose 2\) or, instead you wanted to calculate \(10!\)

choose(10,2)
## [1] 45
factorial(10)
## [1] 3628800

Note that factorial arithmetic can get quite large quite fast, so consider your processor/memory constraints and options before you try to calculate something truly large like \(365!\).

2 Distribution functions

R has a number of built-in functions to help you work with distributions in various ways that also started to come up in OpenIntro Chapter 3. I will introduce a couple of points about them here, but I also highly recommend you look at the relevant section of the Verzani Using R Introductory Statistics book (pp 222-229) for more on this (and, honestly, for more on most of the topics we’re covering in R).

The key to using R to analyze distributions is that R has a set of built-in distributions (e.g. uniform, normal, binomial, log-normal, etc.) and a set of functions (d, p, q, and r) that can be run for each distribution. In the example that follows, I’ll use a uniform distribuition (a distribution between any two values (min, max) where the values may occur with uniform probability) for my example below. Verzani has others for when you need them.

The d function gets you information about the density function of the distribution. The p function works with the cumulative probabilities. The q function gets you quantiles from the distribution. The r function allows you to generate random samples from the distribution. As you can see, the letters corresponding to each function almost make sense…<sigh>. They also each take specific arguments that can vary a bit depending on which kind of distribution you are using them with (as always, the help documentation and the internet can be helpful here).

Onwards to the example code, which uses the different functions to calculate information about a uniform distribution between 0 and 3 (take a moment to think about what that would look like in terms of raw data and/or a plot):

dunif(x=1, min=0, max=3) # What proportion of the area is the to the left of 1?
## [1] 0.3333333
punif(q=1, min=0, max=3) # Same as the prior example in this case.
## [1] 0.3333333
qunif(p=0.5, min=0, max=3) # 50th percentile
## [1] 1.5
runif(n=4, min=0, max=3) # Random values in [0,3]
## [1] 2.8583748 1.1226368 0.3599361 1.4233048

Look at the Verzani text for additional examples, including several that can solve binomial probability calculations (e.g., if you flip a fair coin 100 times, what are the odds of observing heads 60 or more times?).

3 For-loops and a quick simulation

Beyond proving invaluable for rapid calculations of solutions to problem set questions, the distribution functions are very, very useful for running simulations. We won’t really spend a lot of time on simulations in class, but I’ll give you an example here that can generalize to more complicated problems. I also use a programming technique we haven’t talked about yet called a for-loop to help repeat the sampling process multiple times. For-loops feature prominently in some programming tasks/languages, but I encourage you to minimize your use of them in R for reasons that are sort of beyond the scope of the course. That said, it’s still super important to learn how they work!

For my simulation let’s say that I want to repeatedly draw random samples from a distribution and examine the distribution of the resulting sample means (this example is going to feature prominently in Chapter 5 of OpenIntro). I’ll start by generating a vector of 10,000 random data points drawn from a log-normal distribution where the mean and standard deviation of the log-transformed values are 0 and 1 respectively:

d <- rlnorm(10000, meanlog=0, sdlog=1)

head(d)
## [1] 1.5344713 2.3165957 1.7490242 4.3605884 0.4123013 0.5559845
mean(d)
## [1] 1.650292
sd(d)
## [1] 2.099644
hist(d)