Lesson objectives
- Learn how to apply and interpret linear regression models for a variety of data
- Understand probability and the importance of presenting confidence intervals
- Learn the importance of visualizing your data when doing any analyses or statistics
Lesson outline
- Basic descriptive and inferential statistics (20 min)
- Generalized linear models (50-60 min)
  - Linear regression
  - ANOVA
  - Logistic regression
Theoretical models are powerful tools for explaining and understanding the world. However, they are limited in that the real world often doesn't perfectly fit these models. The real world is messy and noisy. We can't always blindly trust our data, as there are inherent biases and errors in it: measuring, collecting, recording, and inputting data are all possible sources of error and bias. We use statistics and probability to determine whether the way we understand and conceptualize the world (as models) matches reality (even with the error and bias).
One reason we use the statistical methods built into R, rather than writing up the formulas and equations ourselves, is that we can focus on answering questions without worrying about whether we are doing the math or computation wrong. This of course depends on the type of research questions you are interested in (e.g. for more theoretical questions, doing the math and computation yourself is probably a goal!) and on the type of data you are using/collecting. A lot of complexity has already been taken care of in the available R packages and functions. For example, the lm() function that we will use in this lesson uses the ordinary least squares (OLS) method, a common method for fitting linear models. That way, you can answer your research questions without worrying too much about the exact math involved, and instead worry about the specifics of your field (e.g. Are you measuring the right thing? Are you collecting the right data? Are you asking the right questions? Is there an ecological or biological aspect you are missing in your analysis?).
table(): The table() function displays counts of identical observations for either a single data vector or a dataframe.
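For example, we can count the observations in each genus. The call below is a sketch reconstructed from the output that follows, assuming the survey data frame used throughout this lesson has a genus column:
# count how many observations fall in each genus
table(survey$genus)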
##
## Ammodramus Ammospermophilus Amphispiza Baiomys
## 2 437 303 46
## Calamospiza Callipepla Campylorhynchus Chaetodipus
## 13 16 50 6029
## Cnemidophorus Crotalus Dipodomys Lizard
## 2 2 16167 4
## Neotoma Onychomys Perognathus Peromyscus
## 1252 3267 1629 2234
## Pipilo Pooecetes Reithrodontomys Rodent
## 52 8 2694 10
## Sceloporus Sigmodon Sparrow Spermophilus
## 6 233 4 249
## Sylvilagus Zonotrichia
## 75 2
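Passing two vectors cross-tabulates them. The two-way table below appears to be genus by taxa (an assumption based on the column names Bird, Rabbit, Reptile, and Rodent):
# cross-tabulate genus against taxa
table(survey$genus, survey$taxa)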
##
## Bird Rabbit Reptile Rodent
## Ammodramus 2 0 0 0
## Ammospermophilus 0 0 0 437
## Amphispiza 303 0 0 0
## Baiomys 0 0 0 46
## Calamospiza 13 0 0 0
## Callipepla 16 0 0 0
## Campylorhynchus 50 0 0 0
## Chaetodipus 0 0 0 6029
## Cnemidophorus 0 0 2 0
## Crotalus 0 0 2 0
## Dipodomys 0 0 0 16167
## Lizard 0 0 4 0
## Neotoma 0 0 0 1252
## Onychomys 0 0 0 3267
## Perognathus 0 0 0 1629
## Peromyscus 0 0 0 2234
## Pipilo 52 0 0 0
## Pooecetes 8 0 0 0
## Reithrodontomys 0 0 0 2694
## Rodent 0 0 0 10
## Sceloporus 0 0 6 0
## Sigmodon 0 0 0 233
## Sparrow 4 0 0 0
## Spermophilus 0 0 0 249
## Sylvilagus 0 75 0 0
## Zonotrichia 2 0 0 0
describe() from the psych package: describe() provides item name, item number, nvalid, mean, sd, median, mad (median absolute deviation from the median), min, max, skew, kurtosis, and se.
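A call like the one below reproduces this kind of output; which variable was summarized is an assumption on my part (the range of 2 to 70 suggests hindfoot_length):
# load psych for describe() and describeBy()
library(psych)
describe(survey$hindfoot_length)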
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 31438 29.29 9.56 32 28.87 8.9 2 70 68 0.32 -0.61
## se
## X1 0.05
describeBy() from the psych package: describeBy() is a simple way of generating summary statistics by a grouping variable.
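For example (again assuming hindfoot_length is the measured variable, with sex as the grouping variable):
# summary statistics of hindfoot length, split by sex
describeBy(survey$hindfoot_length, group = survey$sex)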
##
## Descriptive statistics by group
## group: F
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 14894 28.84 9.46 27 28.33 11.86 7 64 57 0.41 -0.52
## se
## X1 0.08
## --------------------------------------------------------
## group: M
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 16476 29.71 9.63 34 29.38 10.38 2 58 56 0.23 -0.67
## se
## X1 0.08
Compare a single set of scores (a sample) to a population value.
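The output below is a one-sample t-test of weight against a hypothesized population mean of 40 (both of which can be read from the output itself):
# test whether mean weight differs from a population value of 40
t.test(survey$weight, mu = 40)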
##
## One Sample t-test
##
## data: survey$weight
## t = 13.108, df = 32282, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 40
## 95 percent confidence interval:
## 42.27282 43.07203
## sample estimates:
## mean of x
## 42.67243
Compare two independent sets of scores to each other.
# one way to compare groups:
# t.test(object1, object2)
# or use sex as the grouping variable:
t.test(survey$weight ~ survey$sex) # note the (data ~ grouping variable) format
##
## Welch Two Sample t-test
##
## data: survey$weight by survey$sex
## t = -2.0226, df = 31751, p-value = 0.04312
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.62412287 -0.02552529
## sample estimates:
## mean in group F mean in group M
## 42.17055 42.99538
Compare two matched sets (e.g., pre- and post-) of scores to each other.
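There is no matched-pairs variable in the survey data, so here is a minimal sketch with hypothetical pre and post vectors:
# hypothetical scores for the same five subjects before and after treatment
pre <- c(12, 15, 11, 14, 13)
post <- c(14, 18, 12, 15, 16)
t.test(pre, post, paired = TRUE) # paired = TRUE pairs scores by position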
Bivariate correlation (\(r\)) gives us the strength and direction of the linear relationship between two variables. Say we find that the correlation (\(r\)) between hindfoot length and weight is \(r = .609\). This means that \(.609^2 = .371\) of the variance in \(y\) (hindfoot length) is common to the variance in \(x\) (weight). Alternatively, we can say that these two variables share 37.1% of their variance in common. In the case of the Pearson correlation, this will be true whether we consider weight or hindfoot length to be the dependent variable.
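The output below can be reproduced with cor.test(); the two variables are named in its data: line:
# Pearson correlation between weight and hindfoot length
cor.test(survey$weight, survey$hindfoot_length)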
##
## Pearson's product-moment correlation
##
## data: survey$weight and survey$hindfoot_length
## t = 164.3, df = 30736, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6778151 0.6897195
## sample estimates:
## cor
## 0.6838128
A version of GLM that uses continuous \(y\) values is called linear regression, which I’m going to focus on. The formula for linear regression (or GLM in general) is:
\[ Y = \alpha + X\beta + \varepsilon \]
Or, simplified with Latin letters:
\[ y = a + Xb + e \]
Where \(y\) is the outcome, \(a\) is the intercept (where the line crosses the vertical, i.e. y, axis of the graph), \(X\) is the predictor variable, \(b\) is the slope/coefficient, and \(e\) is the error term.
We construct these regression models for several reasons. Sometimes we want to infer how some variables (\(x\)) cause or influence another variable (\(y\)). Or maybe we know that \(y\) has a lot of error in its measurement or is difficult to measure, so we want to derive a formula to predict \(y\) from more accurately measured variables. In R we can run linear regression using either lm() or glm(). lm() assumes a Gaussian (normal) distribution of your data. When you use glm(), on the other hand, you can specify the data's distribution with the family = parameter. This allows you to construct a linear model when your data don't follow a Gaussian distribution.
Regression requires a predictor (independent) variable and a predicted (dependent) variable. Changing which variable is the predictor/predicted gives you a different regression line with a different regression equation. The function we are going to use today, lm(), uses the OLS method by default, as mentioned above. The least squares line is the line chosen by lm() to fit your data; the goal of OLS is to choose the line that minimizes prediction error, so with any other line, errors of prediction would be greater. Note that the best-fitting line given your data (i.e. the OLS solution) is not necessarily a good model, period. We must pay attention to fit statistics like \(R^2\), the proportion (%) of variance in the outcome explained by our predictor (i.e., the model), to determine how well our model is doing.
So how do we use these functions? In R, the dependent variable is separated from its predictors by a tilde (~). The formula to regress \(y\) on \(x\) is y ~ x:
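The exact call that produced the summary below is echoed in its Call: line:
# regress weight on sex
result <- lm(weight ~ sex, data = survey)
summary(result)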
##
## Call:
## lm(formula = weight ~ sex, data = survey)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.995 -22.171 -5.995 5.005 237.005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.1706 0.2951 142.917 <2e-16 ***
## sexM 0.8248 0.4074 2.024 0.0429 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.5 on 32180 degrees of freedom
## (2604 observations deleted due to missingness)
## Multiple R-squared: 0.0001273, Adjusted R-squared: 9.627e-05
## F-statistic: 4.098 on 1 and 32180 DF, p-value: 0.04293
# multiple predictors with interaction terms
result <- lm(weight ~ sex * hindfoot_length, data = survey)
summary(result)
##
## Call:
## lm(formula = weight ~ sex * hindfoot_length, data = survey)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.687 -12.919 -4.031 3.259 229.889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.042646 0.692255 -46.287 <2e-16 ***
## sexM -1.816600 0.959032 -1.894 0.0582 .
## hindfoot_length 2.558271 0.022894 111.746 <2e-16 ***
## sexM:hindfoot_length 0.003292 0.031264 0.105 0.9161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.02 on 30672 degrees of freedom
## (4110 observations deleted due to missingness)
## Multiple R-squared: 0.4678, Adjusted R-squared: 0.4678
## F-statistic: 8989 on 3 and 30672 DF, p-value: < 2.2e-16
# use + for main effects of the predictors only
result <- lm(weight ~ sex + hindfoot_length, data = survey)
summary(result)
##
## Call:
## lm(formula = weight ~ sex + hindfoot_length, data = survey)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.749 -12.907 -4.028 3.253 229.893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.09337 0.49714 -64.556 < 2e-16 ***
## sexM -1.72062 0.29787 -5.776 7.7e-09 ***
## hindfoot_length 2.56004 0.01559 164.202 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.02 on 30673 degrees of freedom
## (4110 observations deleted due to missingness)
## Multiple R-squared: 0.4678, Adjusted R-squared: 0.4678
## F-statistic: 1.348e+04 on 2 and 30673 DF, p-value: < 2.2e-16
ANOVA is simply a different way of evaluating explained variance in linear modelling; ANOVA is a special case of the linear model. You must always wrap the anova() function around an lm() fit.
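For example, modelling weight by sex, genus, and their interaction (matching the rows of the table below):
# fit the linear model, then ask for its ANOVA table
result <- lm(weight ~ sex * genus, data = survey)
anova(result)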
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 5461 5461 10.801 0.001016 **
## genus 9 26567752 2951972 5839.067 < 2.2e-16 ***
## sex:genus 9 48457 5384 10.650 < 2.2e-16 ***
## Residuals 32162 16259676 506
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, R's anova() uses sequential ("Type I") sums of squares by default. Anova() from the car package can give you "Type III" sums of squares instead. This matters when you have more than one predictor (e.g. taxa × sex). Asking for Type III sums of squares will match what you get from SPSS or SAS.
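A sketch of the car approach, reusing the sex-by-genus fit from above (note that Type III tests depend on the contrasts used when fitting the model):
# Type III sums of squares via the car package
library(car)
Anova(result, type = "III")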
## Anova Table (Type III tests)
##
## Response: weight
## Sum Sq Df F value Pr(>F)
## (Intercept) 2602 1 5.1464 0.0233 *
## sex 31 1 0.0621 0.8032
## genus 13368626 9 2938.1598 <2e-16 ***
## sex:genus 48457 9 10.6498 <2e-16 ***
## Residuals 16259676 32162
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R comes with a default pairwise t-test (pairwise.t.test()). However, the multcomp package offers more posthoc options than base R:
# multcomp provides glht() for testing general linear hypotheses
library(multcomp)
result <- lm(weight ~ genus, data = survey)
postHocs <- glht(result, linfct = mcp(genus = "Tukey"))
# the summary function gives the results of the multiple comparisons
summary(postHocs)
## Warning in RET$pfunction("adjusted", ...): Completion with error > abseps
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = weight ~ genus, data = survey)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Chaetodipus - Baiomys == 0 15.5793 3.3770 4.613 < 0.001
## Dipodomys - Baiomys == 0 47.2602 3.3689 14.028 < 0.001
## Neotoma - Baiomys == 0 150.6457 3.4291 43.931 < 0.001
## Onychomys - Baiomys == 0 17.8962 3.3881 5.282 < 0.001
## Perognathus - Baiomys == 0 -0.2225 3.4117 -0.065 1.00000
## Peromyscus - Baiomys == 0 12.8563 3.3987 3.783 0.00388
## Reithrodontomys - Baiomys == 0 2.0679 3.3928 0.610 0.99971
## Sigmodon - Baiomys == 0 58.6646 3.6879 15.907 < 0.001
## Spermophilus - Baiomys == 0 84.9000 16.3078 5.206 < 0.001
## Dipodomys - Chaetodipus == 0 31.6809 0.3464 91.464 < 0.001
## Neotoma - Chaetodipus == 0 135.0663 0.7275 185.667 < 0.001
## Onychomys - Chaetodipus == 0 2.3168 0.4995 4.638 < 0.001
## Perognathus - Chaetodipus == 0 -15.8019 0.6400 -24.689 < 0.001
## Peromyscus - Chaetodipus == 0 -2.7231 0.5671 -4.802 < 0.001
## Reithrodontomys - Chaetodipus == 0 -13.5114 0.5306 -25.465 < 0.001
## Sigmodon - Chaetodipus == 0 43.0852 1.5397 27.982 < 0.001
## Spermophilus - Chaetodipus == 0 69.3207 15.9598 4.343 < 0.001
## Neotoma - Dipodomys == 0 103.3854 0.6891 150.023 < 0.001
## Onychomys - Dipodomys == 0 -29.3640 0.4418 -66.458 < 0.001
## Perognathus - Dipodomys == 0 -47.4828 0.5961 -79.654 < 0.001
## Peromyscus - Dipodomys == 0 -34.4040 0.5170 -66.544 < 0.001
## Reithrodontomys - Dipodomys == 0 -45.1923 0.4767 -94.810 < 0.001
## Sigmodon - Dipodomys == 0 11.4044 1.5220 7.493 < 0.001
## Spermophilus - Dipodomys == 0 37.6398 15.9581 2.359 0.27129
## Onychomys - Neotoma == 0 -132.7495 0.7775 -170.746 < 0.001
## Perognathus - Neotoma == 0 -150.8682 0.8744 -172.538 < 0.001
## Peromyscus - Neotoma == 0 -137.7894 0.8225 -167.522 < 0.001
## Reithrodontomys - Neotoma == 0 -148.5777 0.7978 -186.241 < 0.001
## Sigmodon - Neotoma == 0 -91.9811 1.6510 -55.713 < 0.001
## Spermophilus - Neotoma == 0 -65.7457 15.9709 -4.117 < 0.001
## Perognathus - Onychomys == 0 -18.1187 0.6964 -26.019 < 0.001
## Peromyscus - Onychomys == 0 -5.0399 0.6300 -8.000 < 0.001
## Reithrodontomys - Onychomys == 0 -15.8282 0.5973 -26.500 < 0.001
## Sigmodon - Onychomys == 0 40.7684 1.5640 26.067 < 0.001
## Spermophilus - Onychomys == 0 67.0038 15.9622 4.198 < 0.001
## Peromyscus - Perognathus == 0 13.0788 0.7463 17.525 < 0.001
## Reithrodontomys - Perognathus == 0 2.2905 0.7190 3.186 0.03061
## Sigmodon - Perognathus == 0 58.8871 1.6144 36.477 < 0.001
## Spermophilus - Perognathus == 0 85.1225 15.9672 5.331 < 0.001
## Reithrodontomys - Peromyscus == 0 -10.7883 0.6549 -16.474 < 0.001
## Sigmodon - Peromyscus == 0 45.8083 1.5869 28.867 < 0.001
## Spermophilus - Peromyscus == 0 72.0437 15.9644 4.513 < 0.001
## Sigmodon - Reithrodontomys == 0 56.5966 1.5742 35.953 < 0.001
## Spermophilus - Reithrodontomys == 0 82.8321 15.9632 5.189 < 0.001
## Spermophilus - Sigmodon == 0 26.2354 16.0285 1.637 0.76570
##
## Chaetodipus - Baiomys == 0 ***
## Dipodomys - Baiomys == 0 ***
## Neotoma - Baiomys == 0 ***
## Onychomys - Baiomys == 0 ***
## Perognathus - Baiomys == 0
## Peromyscus - Baiomys == 0 **
## Reithrodontomys - Baiomys == 0
## Sigmodon - Baiomys == 0 ***
## Spermophilus - Baiomys == 0 ***
## Dipodomys - Chaetodipus == 0 ***
## Neotoma - Chaetodipus == 0 ***
## Onychomys - Chaetodipus == 0 ***
## Perognathus - Chaetodipus == 0 ***
## Peromyscus - Chaetodipus == 0 ***
## Reithrodontomys - Chaetodipus == 0 ***
## Sigmodon - Chaetodipus == 0 ***
## Spermophilus - Chaetodipus == 0 ***
## Neotoma - Dipodomys == 0 ***
## Onychomys - Dipodomys == 0 ***
## Perognathus - Dipodomys == 0 ***
## Peromyscus - Dipodomys == 0 ***
## Reithrodontomys - Dipodomys == 0 ***
## Sigmodon - Dipodomys == 0 ***
## Spermophilus - Dipodomys == 0
## Onychomys - Neotoma == 0 ***
## Perognathus - Neotoma == 0 ***
## Peromyscus - Neotoma == 0 ***
## Reithrodontomys - Neotoma == 0 ***
## Sigmodon - Neotoma == 0 ***
## Spermophilus - Neotoma == 0 ***
## Perognathus - Onychomys == 0 ***
## Peromyscus - Onychomys == 0 ***
## Reithrodontomys - Onychomys == 0 ***
## Sigmodon - Onychomys == 0 ***
## Spermophilus - Onychomys == 0 ***
## Peromyscus - Perognathus == 0 ***
## Reithrodontomys - Perognathus == 0 *
## Sigmodon - Perognathus == 0 ***
## Spermophilus - Perognathus == 0 ***
## Reithrodontomys - Peromyscus == 0 ***
## Sigmodon - Peromyscus == 0 ***
## Spermophilus - Peromyscus == 0 ***
## Sigmodon - Reithrodontomys == 0 ***
## Spermophilus - Reithrodontomys == 0 ***
## Spermophilus - Sigmodon == 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
The assumption of a normally distributed dependent variable is violated in logistic regression, where we want to predict a binary outcome. So you must use the glm() function rather than lm(). We only have one binary variable in our dataset: sex.
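The call that produced the output below is echoed in its Call: line; this assumes sex is stored as a factor so glm() can treat it as a binary outcome:
# predict sex from weight, hindfoot length, and their interaction
result <- glm(survey$sex ~ survey$weight * survey$hindfoot_length, family = binomial)
summary(result)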
##
## Call:
## glm(formula = survey$sex ~ survey$weight * survey$hindfoot_length,
## family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.472 -1.203 1.050 1.143 1.430
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.766e-01 5.416e-02 -3.261
## survey$weight -5.778e-03 1.266e-03 -4.565
## survey$hindfoot_length 1.332e-02 2.020e-03 6.596
## survey$weight:survey$hindfoot_length 8.781e-05 3.212e-05 2.734
## Pr(>|z|)
## (Intercept) 0.00111 **
## survey$weight 4.99e-06 ***
## survey$hindfoot_length 4.23e-11 ***
## survey$weight:survey$hindfoot_length 0.00626 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 42452 on 30675 degrees of freedom
## Residual deviance: 42341 on 30672 degrees of freedom
## (4110 observations deleted due to missingness)
## AIC: 42349
##
## Number of Fisher Scoring iterations: 3
This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.