## Lesson preamble

### Lesson objectives

• Learn how to apply and interpret linear regression for a variety of data
• Understand probability and the importance of presenting confidence intervals
• Learn the importance of visualizing your data when doing any analyses or statistics

### Lesson outline

• Basic descriptive and inferential statistics (20 min)
• Generalized linear models (50-60 min)
    • Linear regression
    • ANOVA
    • Logistic regression

## Setup

library(tidyverse)
# Have the following packages installed and loaded:
# install.packages("car")
# install.packages("psych")
# install.packages("multcomp")
library(car)
library(psych)
library(multcomp)

# uses a dplyr function to change all character vectors to factors
# (assumes the survey data frame has already been read in)
survey <- survey %>%
    mutate_if(is.character, as.factor)

## Statistics and probability

Theoretical models are powerful tools for explaining and understanding the world. However, they are limited in that the real world often doesn’t perfectly fit these models. The real world is messy and noisy. We can’t always blindly trust our data, as there are inherent biases and errors in it. Measuring it, collecting it, recording it, and inputting it are some of the possible sources of error and bias. We use statistics and probability to determine whether the way we understand and conceptualize the world (as models) matches reality (even with the error and bias).

A reason we use statistical methods in R, rather than writing up the formulas and equations ourselves, is that we can focus on answering questions without worrying about whether we are doing the math or computation wrong. This is of course dependent on the type of research questions you may be interested in (e.g. for more theoretical questions, doing the math and computation yourself is probably a goal!) and on the type of data you are using/collecting. There is a lot of complexity that has already been taken care of in the available R packages and functions. For example, the function lm() that we will use in this lesson uses the ordinary least squares (OLS) method, which is a common method for determining fit for linear models. That way, you can answer your research questions and not worry too much about the exact math involved, and instead worry about the specifics of your field (e.g. Are you measuring the right thing? Are you collecting the right data? Are you asking the right questions? Is there an ecological or biological aspect you are missing in your analysis?).
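For instance, here is a minimal sketch (using simulated data, not the survey data) showing that the OLS slope and intercept computed by lm() can be recovered by hand from covariance and variance:

```r
# simulate data with a known relationship
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

# hand-computed OLS estimates for a single predictor
b <- cov(x, y) / var(x)    # slope
a <- mean(y) - b * mean(x) # intercept

# lm() produces the same estimates (up to floating-point error)
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(a, b))
```

The point is not that you should compute these by hand, but that lm() is not a black box: it implements a formula you could check yourself.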

## Descriptive statistics

### Frequencies with table():

The table() function displays counts of each unique value in a single data vector, or cross-tabulates two vectors against each other.

#table
table(survey$genus)
## 
##       Ammodramus Ammospermophilus       Amphispiza          Baiomys 
##                2              437              303               46 
##      Calamospiza       Callipepla  Campylorhynchus      Chaetodipus 
##               13               16               50             6029 
##    Cnemidophorus         Crotalus        Dipodomys           Lizard 
##                2                2            16167                4 
##          Neotoma        Onychomys      Perognathus       Peromyscus 
##             1252             3267             1629             2234 
##           Pipilo        Pooecetes  Reithrodontomys           Rodent 
##               52                8             2694               10 
##       Sceloporus         Sigmodon          Sparrow     Spermophilus 
##                6              233                4              249 
##       Sylvilagus      Zonotrichia 
##               75                2 
#cross tabulate
#note: first variable is set as the rows
table(survey$genus, survey$taxa)
##                   
##                    Bird Rabbit Reptile Rodent
##   Ammodramus          2      0       0      0
##   Ammospermophilus    0      0       0    437
##   Amphispiza        303      0       0      0
##   Baiomys             0      0       0     46
##   Calamospiza        13      0       0      0
##   Callipepla         16      0       0      0
##   Campylorhynchus    50      0       0      0
##   Chaetodipus         0      0       0   6029
##   Cnemidophorus       0      0       2      0
##   Crotalus            0      0       2      0
##   Dipodomys           0      0       0  16167
##   Lizard              0      0       4      0
##   Neotoma             0      0       0   1252
##   Onychomys           0      0       0   3267
##   Perognathus         0      0       0   1629
##   Peromyscus          0      0       0   2234
##   Pipilo             52      0       0      0
##   Pooecetes           8      0       0      0
##   Reithrodontomys     0      0       0   2694
##   Rodent              0      0       0     10
##   Sceloporus          0      0       6      0
##   Sigmodon            0      0       0    233
##   Sparrow             4      0       0      0
##   Spermophilus        0      0       0    249
##   Sylvilagus          0     75       0      0
##   Zonotrichia         2      0       0      0

### describe() from the psych package:

describe() provides item name, item number, nvalid, mean, sd, median, mad (median absolute deviation from the median), min, max, skew, kurtosis, and se.

describe(survey$hindfoot_length)
##    vars     n  mean   sd median trimmed mad min max range skew kurtosis
## X1    1 31438 29.29 9.56     32   28.87 8.9   2  70    68 0.32    -0.61
##      se
## X1 0.05

### describeBy() from the psych package:

describeBy() is a simple way of generating summary statistics by a grouping variable.

describeBy(survey$hindfoot_length, survey$sex) #summary variable is the first argument, grouping variable is the second argument
##
##  Descriptive statistics by group
## group: F
##    vars     n  mean   sd median trimmed   mad min max range skew kurtosis
## X1    1 14894 28.84 9.46     27   28.33 11.86   7  64    57 0.41    -0.52
##      se
## X1 0.08
## --------------------------------------------------------
## group: M
##    vars     n  mean   sd median trimmed   mad min max range skew kurtosis
## X1    1 16476 29.71 9.63     34   29.38 10.38   2  58    56 0.23    -0.67
##      se
## X1 0.08
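Since the tidyverse is already loaded, a similar grouped summary can also be sketched with group_by() and summarise(); this is just an alternative, assuming the same hindfoot_length and sex columns:

```r
# grouped summary statistics with dplyr, one column per statistic
survey %>%
  group_by(sex) %>%
  summarise(
    n      = sum(!is.na(hindfoot_length)),
    mean   = mean(hindfoot_length, na.rm = TRUE),
    sd     = sd(hindfoot_length, na.rm = TRUE),
    median = median(hindfoot_length, na.rm = TRUE)
  )
```

The advantage of the dplyr version is that you choose exactly which statistics appear; describeBy() gives you a fixed, comprehensive set.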

## Inferential statistics

### One-sample t-test:

Compare a single set of scores (a sample) to a hypothesized population mean.

t.test(survey$weight, mu=40)
##
##  One Sample t-test
##
## data:  survey$weight
## t = 13.108, df = 32282, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 40
## 95 percent confidence interval:
##  42.27282 43.07203
## sample estimates:
## mean of x
##  42.67243

### Two sample t-test:

Compare two independent sets of scores to each other.

#one way to compare groups:
#t.test(object1, object2)
#or use sex as the grouping variable:
t.test(survey$weight~survey$sex) #note (data~grouping variable) format
##
##  Welch Two Sample t-test
##
## data:  survey$weight by survey$sex
## t = -2.0226, df = 31751, p-value = 0.04312
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.62412287 -0.02552529
## sample estimates:
## mean in group F mean in group M
##        42.17055        42.99538

### Paired samples t-test:

Compare two matched sets (e.g., pre- and post-) of scores to each other.

#t.test(object1, object2, paired=TRUE)
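As a runnable sketch with simulated data (the survey data has no matched pairs), a paired test might look like:

```r
# hypothetical pre/post measurements: each index is one subject measured twice
set.seed(42)
pre  <- rnorm(30, mean = 50, sd = 10)
post <- pre + rnorm(30, mean = 2, sd = 5) # small average increase built in

# paired = TRUE tests the mean of the within-subject differences against 0
t.test(pre, post, paired = TRUE)
```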

### Correlations:

Bivariate correlation ($$r$$) gives us the strength and direction of the (linear) relationship between two variables. Say we find that the correlation between hindfoot length and weight is $$r$$ = .609. This means that $$.609^2 = .371$$ of the variance in $$y$$ (hindfoot length) is common to the variance in $$x$$ (weight). Alternatively, we can say that these two variables share 37.1% of their variance in common. In the case of the Pearson correlation, this will be true whether we consider weight or hindfoot length to be the dependent variable.

cor.test(survey$weight, survey$hindfoot_length)
##
##  Pearson's product-moment correlation
##
## data:  survey$weight and survey$hindfoot_length
## t = 164.3, df = 30736, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6778151 0.6897195
## sample estimates:
##       cor
## 0.6838128
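As described above, squaring $$r$$ gives the shared variance, which is the same number reported as R-squared when we regress one variable on the other. A quick check on the survey data (use = "complete.obs" drops the same missing rows that lm() drops):

```r
# correlation between the two variables, ignoring missing values
r <- cor(survey$weight, survey$hindfoot_length, use = "complete.obs")
r^2

# R-squared from the corresponding simple regression is the same value
summary(lm(hindfoot_length ~ weight, data = survey))$r.squared
```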

### General linear model (i.e., with a normally-distributed dependent variable):

The version of the GLM with a continuous, normally-distributed $$y$$ is called linear regression, which is what I’m going to focus on. The formula for linear regression (or the GLM in general) is:

$Y = \alpha + X\beta + \varepsilon$

Or, a simplified, alphabetized version is:

$y = a + Xb + e$

Where $$a$$ is the intercept (where the line crosses the vertical, i.e. y, axis of the graph), $$X$$ is the predictor variable, $$b$$ is the slope/coefficient, and $$e$$ is the error term.

We construct these regression models for several reasons. Sometimes we want to infer how some variables ($$x$$) cause or influence another variable ($$y$$). Or maybe we know that $$y$$ has a lot of error in the measurement or is difficult to measure, so we want to derive a formula in order to predict $$y$$ based on more accurately measured variables. In R we can run linear regression using either lm() or glm(). lm() assumes a Gaussian (normal) distribution of your data. On the other hand, when you use glm(), you can specify the data’s distribution with the family = argument. This allows you to construct a linear model when your data don’t follow a Gaussian distribution.

Regression requires a predictor (independent) variable and a predicted (dependent) variable. Changing which variable is the predictor/predicted gives you a different regression line with a different regression equation. The function we are going to use today, lm(), uses the OLS method by default, as mentioned above. The least squares line is the line chosen by lm() to fit your data. The goal of OLS is to choose the line that minimizes prediction error; with any other line, the errors of prediction would be greater. Note that the best-fitting line given your data (i.e. OLS) is not necessarily a good model overall. We must pay attention to fit statistics like $$R^2$$, the proportion (%) of variance in the outcome explained by our predictor(s) (i.e., the model), to determine how well our model is doing.
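Here is a small sketch (simulated data) of what "minimizes prediction error" means in practice: the residual sum of squares at lm()'s coefficients is lower than at any perturbed line:

```r
# simulate data and fit the OLS line
set.seed(7)
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

# residual sum of squares for an arbitrary line y = a + b*x
rss <- function(a, b) sum((y - (a + b * x))^2)

rss(coef(fit)[1], coef(fit)[2])        # at the OLS estimates: the minimum
rss(coef(fit)[1], coef(fit)[2] + 0.1)  # nudging the slope always does worse
rss(coef(fit)[1] - 0.2, coef(fit)[2])  # so does nudging the intercept
```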

So how do we use these functions? In R, model formulas separate the dependent variable from its predictors with a tilde (~). The formula to regress $$y$$ on $$x$$ is y ~ x:

result <- lm(weight~sex, data=survey)
summary(result)
##
## Call:
## lm(formula = weight ~ sex, data = survey)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -38.995 -22.171  -5.995   5.005 237.005
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  42.1706     0.2951 142.917   <2e-16 ***
## sexM          0.8248     0.4074   2.024   0.0429 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.5 on 32180 degrees of freedom
##   (2604 observations deleted due to missingness)
## Multiple R-squared:  0.0001273,  Adjusted R-squared:  9.627e-05
## F-statistic: 4.098 on 1 and 32180 DF,  p-value: 0.04293
#multiple predictors with interaction terms
result <- lm(weight~sex*hindfoot_length, data=survey)
summary(result)
##
## Call:
## lm(formula = weight ~ sex * hindfoot_length, data = survey)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -96.687 -12.919  -4.031   3.259 229.889
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)          -32.042646   0.692255 -46.287   <2e-16 ***
## sexM                  -1.816600   0.959032  -1.894   0.0582 .
## hindfoot_length        2.558271   0.022894 111.746   <2e-16 ***
## sexM:hindfoot_length   0.003292   0.031264   0.105   0.9161
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.02 on 30672 degrees of freedom
##   (4110 observations deleted due to missingness)
## Multiple R-squared:  0.4678, Adjusted R-squared:  0.4678
## F-statistic:  8989 on 3 and 30672 DF,  p-value: < 2.2e-16
#use + for main effect of predictor only
result <- lm(weight~sex + hindfoot_length, data=survey)
summary(result)
##
## Call:
## lm(formula = weight ~ sex + hindfoot_length, data = survey)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -96.749 -12.907  -4.028   3.253 229.893
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -32.09337    0.49714 -64.556  < 2e-16 ***
## sexM             -1.72062    0.29787  -5.776  7.7e-09 ***
## hindfoot_length   2.56004    0.01559 164.202  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.02 on 30673 degrees of freedom
##   (4110 observations deleted due to missingness)
## Multiple R-squared:  0.4678, Adjusted R-squared:  0.4678
## F-statistic: 1.348e+04 on 2 and 30673 DF,  p-value: < 2.2e-16

### Analysis of Variance:

ANOVA is simply a different way of evaluating explained variance in linear modelling; it is a special case of the linear model. To run one, wrap the anova() function around a fitted lm() model.

anova(lm(weight~sex*genus, data=survey))
## Analysis of Variance Table
##
## Response: weight
##              Df   Sum Sq Mean Sq  F value    Pr(>F)
## sex           1     5461    5461   10.801  0.001016 **
## genus         9 26567752 2951972 5839.067 < 2.2e-16 ***
## sex:genus     9    48457    5384   10.650 < 2.2e-16 ***
## Residuals 32162 16259676     506
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, anova() uses sequential (Type I) sums of squares by default. Anova() from the car package can give you Type II or “Type III Sums of Squares”. This matters when you have more than one predictor (e.g. taxa x sex). Asking for Type III sums of squares will match what you get from SPSS or SAS.

result <- lm(weight~sex*genus, data=survey)
Anova(result, type=3)
## Anova Table (Type III tests)
##
## Response: weight
##               Sum Sq    Df   F value Pr(>F)
## (Intercept)     2602     1    5.1464 0.0233 *
## sex               31     1    0.0621 0.8032
## genus       13368626     9 2938.1598 <2e-16 ***
## sex:genus      48457     9   10.6498 <2e-16 ***
## Residuals   16259676 32162
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
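One caveat worth knowing (not shown in the output above): Type III tests are generally only meaningful when the factors use sum-to-zero contrasts rather than R's default treatment contrasts. A common recipe, sketched here, is to refit with contr.sum before calling Anova():

```r
# refit the same model with sum-to-zero contrasts on both factors,
# so the Type III main-effect tests are interpretable
result <- lm(weight ~ sex * genus, data = survey,
             contrasts = list(sex = contr.sum, genus = contr.sum))
Anova(result, type = 3)
```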

### Post hoc tests:

R comes with a default pairwise t-test (pairwise.t.test()). However, the multcomp package offers more post hoc options than base R:

result <- lm(weight~genus, data=survey)
postHocs<-glht(result, linfct = mcp(genus = "Tukey"))
#summary function gives results of multiple comparisons
summary(postHocs)
## Warning in RET$pfunction("adjusted", ...): Completion with error > abseps
##
##   Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = weight ~ genus, data = survey)
##
## Linear Hypotheses:
##                                      Estimate Std. Error  t value Pr(>|t|)    
## Chaetodipus - Baiomys == 0            15.5793     3.3770    4.613  < 0.001 ***
## Dipodomys - Baiomys == 0              47.2602     3.3689   14.028  < 0.001 ***
## Neotoma - Baiomys == 0               150.6457     3.4291   43.931  < 0.001 ***
## Onychomys - Baiomys == 0              17.8962     3.3881    5.282  < 0.001 ***
## Perognathus - Baiomys == 0            -0.2225     3.4117   -0.065  1.00000    
## Peromyscus - Baiomys == 0             12.8563     3.3987    3.783  0.00372 ** 
## Reithrodontomys - Baiomys == 0         2.0679     3.3928    0.610  0.99971    
## Sigmodon - Baiomys == 0               58.6646     3.6879   15.907  < 0.001 ***
## Spermophilus - Baiomys == 0           84.9000    16.3078    5.206  < 0.001 ***
## Dipodomys - Chaetodipus == 0          31.6809     0.3464   91.464  < 0.001 ***
## Neotoma - Chaetodipus == 0           135.0663     0.7275  185.667  < 0.001 ***
## Onychomys - Chaetodipus == 0           2.3168     0.4995    4.638  < 0.001 ***
## Perognathus - Chaetodipus == 0       -15.8019     0.6400  -24.689  < 0.001 ***
## Peromyscus - Chaetodipus == 0         -2.7231     0.5671   -4.802  < 0.001 ***
## Reithrodontomys - Chaetodipus == 0   -13.5114     0.5306  -25.465  < 0.001 ***
## Sigmodon - Chaetodipus == 0           43.0852     1.5397   27.982  < 0.001 ***
## Spermophilus - Chaetodipus == 0       69.3207    15.9598    4.343  < 0.001 ***
## Neotoma - Dipodomys == 0             103.3854     0.6891  150.023  < 0.001 ***
## Onychomys - Dipodomys == 0           -29.3640     0.4418  -66.458  < 0.001 ***
## Perognathus - Dipodomys == 0         -47.4828     0.5961  -79.654  < 0.001 ***
## Peromyscus - Dipodomys == 0          -34.4040     0.5170  -66.544  < 0.001 ***
## Reithrodontomys - Dipodomys == 0     -45.1923     0.4767  -94.810  < 0.001 ***
## Sigmodon - Dipodomys == 0             11.4044     1.5220    7.493  < 0.001 ***
## Spermophilus - Dipodomys == 0         37.6398    15.9581    2.359  0.27064    
## Onychomys - Neotoma == 0            -132.7495     0.7775 -170.746  < 0.001 ***
## Perognathus - Neotoma == 0          -150.8682     0.8744 -172.538  < 0.001 ***
## Peromyscus - Neotoma == 0           -137.7894     0.8225 -167.522  < 0.001 ***
## Reithrodontomys - Neotoma == 0      -148.5777     0.7978 -186.241  < 0.001 ***
## Sigmodon - Neotoma == 0              -91.9811     1.6510  -55.713  < 0.001 ***
## Spermophilus - Neotoma == 0          -65.7457    15.9709   -4.117  < 0.001 ***
## Perognathus - Onychomys == 0         -18.1187     0.6964  -26.019  < 0.001 ***
## Peromyscus - Onychomys == 0           -5.0399     0.6300   -8.000  < 0.001 ***
## Reithrodontomys - Onychomys == 0     -15.8282     0.5973  -26.500  < 0.001 ***
## Sigmodon - Onychomys == 0             40.7684     1.5640   26.067  < 0.001 ***
## Spermophilus - Onychomys == 0         67.0038    15.9622    4.198  < 0.001 ***
## Peromyscus - Perognathus == 0         13.0788     0.7463   17.525  < 0.001 ***
## Reithrodontomys - Perognathus == 0     2.2905     0.7190    3.186  0.03070 *  
## Sigmodon - Perognathus == 0           58.8871     1.6144   36.477  < 0.001 ***
## Spermophilus - Perognathus == 0       85.1225    15.9672    5.331  < 0.001 ***
## Reithrodontomys - Peromyscus == 0    -10.7883     0.6549  -16.474  < 0.001 ***
## Sigmodon - Peromyscus == 0            45.8083     1.5869   28.867  < 0.001 ***
## Spermophilus - Peromyscus == 0        72.0437    15.9644    4.513  < 0.001 ***
## Sigmodon - Reithrodontomys == 0       56.5966     1.5742   35.953  < 0.001 ***
## Spermophilus - Reithrodontomys == 0   82.8321    15.9632    5.189  < 0.001 ***
## Spermophilus - Sigmodon == 0          26.2354    16.0285    1.637  0.76582    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

### Logistic regression:

The assumption of a normally-distributed dependent variable is violated in logistic regression, where we want to predict a binary outcome. So you must use the glm() function rather than lm(). We only have one binary variable in our dataset: sex.

summary(glm(survey$sex ~ survey$weight*survey$hindfoot_length, family=binomial))
##
## Call:
## glm(formula = survey$sex ~ survey$weight * survey$hindfoot_length,
##     family = binomial)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -1.472  -1.203   1.050   1.143   1.430
##
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                          -1.766e-01  5.416e-02  -3.261
## survey$weight                        -5.778e-03  1.266e-03  -4.565
## survey$hindfoot_length                1.332e-02  2.020e-03   6.596
## survey$weight:survey$hindfoot_length  8.781e-05  3.212e-05   2.734
##                                      Pr(>|z|)
## (Intercept)                           0.00111 **
## survey$weight                        4.99e-06 ***
## survey$hindfoot_length               4.23e-11 ***
## survey$weight:survey$hindfoot_length  0.00626 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 42452  on 30675  degrees of freedom
## Residual deviance: 42341  on 30672  degrees of freedom
##   (4110 observations deleted due to missingness)
## AIC: 42349
##
## Number of Fisher Scoring iterations: 3
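The coefficients above are on the log-odds scale, which is hard to read directly; exponentiating them gives odds ratios, which are often easier to interpret. A sketch using the same model, rewritten with the data = argument:

```r
# refit the logistic regression, then convert coefficients to odds ratios
model <- glm(sex ~ weight * hindfoot_length, data = survey, family = binomial)

exp(coef(model))            # odds ratios (1 = no association)
exp(confint.default(model)) # Wald confidence intervals on the odds-ratio scale
```

An odds ratio above 1 means the odds of the outcome coded as the second factor level (here, M) increase with the predictor; below 1, they decrease.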