This week’s R tutorial focuses on the basics of correlations and linear regression. I’ll work with the mtcars
dataset that comes built-in with R.
data(mtcars)
Calculating correlation coefficients is straightforward with the cor() function:
with(mtcars, cor(mpg, hp))
## [1] -0.7761684
All you Prius drivers out there will be shocked to learn that miles per gallon is negatively correlated with horsepower.
The cor() function works with two variables or with many at once. The following call generates a correlation matrix for the whole dataset!
cor(mtcars)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
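Incidentally, a matrix this dense can be easier to scan if you round the entries first; round() works directly on the matrix:
round(cor(mtcars), 2)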
Note that if you are calculating correlations between variables that are not normally distributed, you should pass method="spearman" to cor(), which computes rank-based correlations instead (look it up online for more details).
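For example, here is a rank-based version of the mpg/horsepower correlation:
with(mtcars, cor(mpg, hp, method="spearman"))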
To calculate covariance, you use the cov() function:
with(mtcars, cov(mpg, hp))
## [1] -320.7321
While OpenIntro does not spend much time on either covariance or correlation, I highly recommend taking the time to review at least the corresponding Wikipedia articles (and ideally another statistics textbook) to understand what goes into each one. A key point to notice/understand is that the magnitude of a covariance depends on the scale of the variables involved. Correlation does not (every correlation coefficient falls between -1 and 1).
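A quick way to see that scale dependence for yourself (just an illustration; dividing hp by 100 converts horsepower into hundreds of horsepower):
with(mtcars, cov(mpg, hp / 100))  # cov(mpg, hp) / 100 = -3.207321
with(mtcars, cor(mpg, hp / 100))  # unchanged: -0.7761684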
Linear models are fit using the lm() command. As with aov(), the lm() function requires a formula as input and is usually presented with a call to summary(). You can enter the formula directly in the call to lm() or define it separately. For this example, I'll regress mpg on a single predictor, hp:
model1 <- lm(mpg ~ hp, data=mtcars)
summary(model1)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
Notice how much information the output of summary() gives you for a linear model! You have details about the residuals, the usual information about the coefficients (standard errors, t-values, etc.), little stars corresponding to conventional significance levels, \(R^2\) values, degrees of freedom, an F-statistic (remember those?), and a p-value for the overall model fit.
There’s even more under the hood. Try looking at all the different things in the model object R has created:
names(model1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
You can directly inspect the residuals using model1$residuals. This makes plotting and other diagnostic activities pretty straightforward:
summary(model1$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.7121 -2.1122 -0.8854 0.0000 1.5819 8.2360
More on that in a moment. In the meantime, you can also use the items generated by the call to summary():
names(summary(model1))
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
summary(model1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886054 1.6339210 18.421246 6.642736e-18
## hp -0.06822828 0.0101193 -6.742389 1.787835e-07
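The other named elements work the same way. For instance, you can pull the \(R^2\) or the residual standard error straight out of the summary object:
summary(model1)$r.squared  # the 0.6024 reported in the printed summary above
summary(model1)$sigma      # the residual standard error, 3.863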
There are also functions to help you do things with the model, such as predicting fitted values for new (unobserved) data. For example, if I found some new cars with horsepower ratings ranging from 90 to 125, what would this model predict for the corresponding mpg of each car?
new.data <- data.frame(hp=seq(90,125,5))
predict(model1, new.data, type="response")
## 1 2 3 4 5 6 7 8
## 23.95832 23.61717 23.27603 22.93489 22.59375 22.25261 21.91147 21.57033
A call to predict() can also provide standard errors around these predictions (which you could use, for example, to construct a 95% confidence interval around the model-predicted values):
predict(model1, new.data, type="response", se.fit = TRUE)
## $fit
## 1 2 3 4 5 6 7 8
## 23.95832 23.61717 23.27603 22.93489 22.59375 22.25261 21.91147 21.57033
##
## $se.fit
## 1 2 3 4 5 6 7 8
## 0.8918453 0.8601743 0.8303804 0.8026728 0.7772744 0.7544185 0.7343427 0.7172804
##
## $df
## [1] 30
##
## $residual.scale
## [1] 3.862962
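Here's a sketch of that confidence interval construction, using the t distribution with the 30 residual degrees of freedom reported above (these are intervals for the predicted mean mpg, not prediction intervals for individual cars):
preds <- predict(model1, new.data, type="response", se.fit=TRUE)
t.crit <- qt(0.975, df=preds$df)  # two-sided 95% critical value
cbind(lower = preds$fit - t.crit * preds$se.fit,
      upper = preds$fit + t.crit * preds$se.fit)
(predict() can also construct these directly if you pass it interval="confidence".)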
Linear model objects also have a built-in method for generating confidence intervals around the values of \(\beta\):
confint(model1, "hp", level=0.95) # Note that I provide the variable name in quotes
## 2.5 % 97.5 %
## hp -0.08889465 -0.0475619
Feeling old-fashioned? You can always calculate residuals or confidence intervals (or anything else) “by hand”:
# Residuals
mtcars$mpg - model1$fitted.values
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## -1.59374995 -1.59374995 -0.95363068 -1.19374995
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 0.54108812 -4.83489134 0.91706759 -1.46870730
## Merc 230 Merc 280 Merc 280C Merc 450SE
## -0.81717412 -2.50678234 -3.90678234 -1.41777049
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## -0.51777049 -2.61777049 -5.71206353 -5.02978075
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 0.29364342 6.80420581 3.84900992 8.23597754
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## -1.98071757 -4.36461883 -4.66461883 -0.08293241
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 1.04108812 1.70420581 2.10991276 8.01093488
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 3.71340487 1.54108812 7.75761261 -1.26197823
# 95% CI for the coefficient on horsepower
est <- model1$coefficients["hp"]
se <- summary(model1)$coefficients[2,2]
est + 1.96 * c(-1,1) * se
## [1] -0.08806211 -0.04839444
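By the way, the small mismatch between this interval and the confint() output above arises because 1.96 is the normal approximation: confint() uses the exact t critical value with 30 degrees of freedom. Swapping in qt() reproduces its result:
est + qt(c(0.025, 0.975), df=model1$df.residual) * se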
You can generate diagnostic plots of residuals in various ways:
hist(residuals(model1))
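A histogram is one quick check. A residuals-versus-fitted-values scatterplot is another common diagnostic (and calling plot(model1) will step you through R's standard set of diagnostic plots):
plot(model1$fitted.values, model1$residuals,
     xlab="Fitted values", ylab="Residuals")
abline(h=0, lty=2)  # residuals should scatter evenly around this line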