Lesson preamble

Lesson objectives

  • Learn how to apply and interpret linear regression for a variety of data
  • Understand probability and the importance of presenting confidence intervals
  • Learn the importance of visualizing your data when doing any analyses or statistics

Lesson outline

  • Basic descriptive and inferential statistics (20 min)
  • Generalized linear models (50-60 min)
    • Linear regression
    • ANOVA
    • Logistic regression

Statistics and probability

Theoretical models are powerful tools for explaining and understanding the world. However, they are limited in that the real world often doesn't perfectly fit these models. The real world is messy and noisy. We can't blindly trust our data either, as there are inherent biases and errors in it: measuring, collecting, recording, and entering the data are all possible sources of error and bias. We use statistics and probability to determine whether the way we understand and conceptualize the world (as models) matches reality, even in the presence of that error and bias.

A reason we use statistical methods in R, rather than writing out the formulas and equations ourselves, is that we can focus on answering questions without worrying about whether we are doing the math or computation wrong. This of course depends on the type of research question you are interested in (e.g. for more theoretical questions, doing the math and computation yourself is probably a goal!) and on the type of data you are using or collecting. A lot of complexity has already been taken care of in the available R packages and functions. For example, the lm() function that we will use in this lesson uses the ordinary least squares (OLS) method, a common method for fitting linear models. That way, you can answer your research questions without worrying too much about the exact math involved, and instead focus on the specifics of your field (e.g. Are you measuring the right thing? Are you collecting the right data? Are you asking the right questions? Is there an ecological or biological aspect you are missing in your analysis?).

Descriptive statistics

Frequencies with table():

The table() function counts how many times each unique value appears in a vector; given two vectors (or a data frame of factors), it produces a cross-tabulation of those counts.

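For example, counting how many records there are for each genus in the survey data might look like this (a sketch, assuming the survey data frame used throughout this lesson is already loaded):

# counts of each genus in the survey data
table(survey$genus)
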
## 
##       Ammodramus Ammospermophilus       Amphispiza          Baiomys 
##                2              437              303               46 
##      Calamospiza       Callipepla  Campylorhynchus      Chaetodipus 
##               13               16               50             6029 
##    Cnemidophorus         Crotalus        Dipodomys           Lizard 
##                2                2            16167                4 
##          Neotoma        Onychomys      Perognathus       Peromyscus 
##             1252             3267             1629             2234 
##           Pipilo        Pooecetes  Reithrodontomys           Rodent 
##               52                8             2694               10 
##       Sceloporus         Sigmodon          Sparrow     Spermophilus 
##                6              233                4              249 
##       Sylvilagus      Zonotrichia 
##               75                2
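
Passing two vectors gives a cross-tabulation. A sketch, assuming the second grouping column is taxa (the column referred to later in this lesson):

# cross-tabulate genus by taxa
table(survey$genus, survey$taxa)
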
##                   
##                     Bird Rabbit Reptile Rodent
##   Ammodramus           2      0       0      0
##   Ammospermophilus     0      0       0    437
##   Amphispiza         303      0       0      0
##   Baiomys              0      0       0     46
##   Calamospiza         13      0       0      0
##   Callipepla          16      0       0      0
##   Campylorhynchus     50      0       0      0
##   Chaetodipus          0      0       0   6029
##   Cnemidophorus        0      0       2      0
##   Crotalus             0      0       2      0
##   Dipodomys            0      0       0  16167
##   Lizard               0      0       4      0
##   Neotoma              0      0       0   1252
##   Onychomys            0      0       0   3267
##   Perognathus          0      0       0   1629
##   Peromyscus           0      0       0   2234
##   Pipilo              52      0       0      0
##   Pooecetes            8      0       0      0
##   Reithrodontomys      0      0       0   2694
##   Rodent               0      0       0     10
##   Sceloporus           0      0       6      0
##   Sigmodon             0      0       0    233
##   Sparrow              4      0       0      0
##   Spermophilus         0      0       0    249
##   Sylvilagus           0     75       0      0
##   Zonotrichia          2      0       0      0

describe() from the psych package:

describe() provides the variable name and number, n (number of valid cases), mean, sd, median, trimmed mean, mad (median absolute deviation from the median), min, max, range, skew, kurtosis, and se.

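A minimal sketch of the call (assuming the variable summarized in the output below is hindfoot_length from the survey data):

library(psych)

# summary statistics for a single numeric vector
describe(survey$hindfoot_length)
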
##    vars     n  mean   sd median trimmed mad min max range skew kurtosis
## X1    1 31438 29.29 9.56     32   28.87 8.9   2  70    68 0.32    -0.61
##      se
## X1 0.05

describeBy() from the psych package:

describeBy() is a simple way of generating summary statistics by a grouping variable.

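A sketch, assuming we summarize hindfoot_length separately for each sex:

# summary statistics split by a grouping variable
describeBy(survey$hindfoot_length, group = survey$sex)
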
## 
##  Descriptive statistics by group 
## group: F
##    vars     n  mean   sd median trimmed   mad min max range skew kurtosis
## X1    1 14894 28.84 9.46     27   28.33 11.86   7  64    57 0.41    -0.52
##      se
## X1 0.08
## -------------------------------------------------------- 
## group: M
##    vars     n  mean   sd median trimmed   mad min max range skew kurtosis
## X1    1 16476 29.71 9.63     34   29.38 10.38   2  58    56 0.23    -0.67
##      se
## X1 0.08

Inferential statistics

One-sample t-test:

Compare a single set of scores (a sample) against a hypothesized population value.

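The call below tests whether the mean weight differs from 40, the test value shown in the alternative hypothesis of the output:

# compare the sample mean of weight against a hypothesized value of 40
t.test(survey$weight, mu = 40)
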
## 
##  One Sample t-test
## 
## data:  survey$weight
## t = 13.108, df = 32282, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 40
## 95 percent confidence interval:
##  42.27282 43.07203
## sample estimates:
## mean of x 
##  42.67243

Two sample t-test:

Compare two independent sets of scores to each other.

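A sketch of the call, comparing mean weight between the two sexes:

# Welch two-sample t-test of weight by sex
t.test(survey$weight ~ survey$sex)
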
## 
##  Welch Two Sample t-test
## 
## data:  survey$weight by survey$sex
## t = -2.0226, df = 31751, p-value = 0.04312
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.62412287 -0.02552529
## sample estimates:
## mean in group F mean in group M 
##        42.17055        42.99538

Paired samples t-test:

Compare two matched sets of scores (e.g., pre- and post-treatment measurements on the same individuals) to each other.
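
The survey data don't contain matched measurements, so here is a purely hypothetical sketch (pre and post are made-up vectors of scores from the same individuals):

# hypothetical paired design: the same individuals measured before and after
pre  <- c(12, 15, 11, 14, 13)
post <- c(14, 16, 12, 15, 15)
t.test(pre, post, paired = TRUE)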

Correlations:

Bivariate correlation (\(r\)) gives us the strength and direction of the linear relationship between two variables. In the output below, the correlation between hindfoot length and weight is \(r = .684\). This means that \(.684^2 = .468\) of the variance in \(y\) (hindfoot length) is shared with the variance in \(x\) (weight). Alternatively, we can say that these two variables share 46.8% of their variance. In the case of the Pearson correlation, this is true whether we consider weight or hindfoot length to be the dependent variable.

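A sketch of the call that produces the output below:

# Pearson correlation between weight and hindfoot length
cor.test(survey$weight, survey$hindfoot_length)
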
## 
##  Pearson's product-moment correlation
## 
## data:  survey$weight and survey$hindfoot_length
## t = 164.3, df = 30736, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6778151 0.6897195
## sample estimates:
##       cor 
## 0.6838128

General linear model (i.e., with a normally-distributed dependent variable):

A version of GLM that uses continuous \(y\) values is called linear regression, which I’m going to focus on. The formula for linear regression (or GLM in general) is:

\[ Y = \alpha + X\beta + \varepsilon \]

Or, a simplified, alphabetized version is:

\[ y = a + Xb + e \]

Where \(y\) is the outcome (dependent) variable, \(a\) is the intercept (where the line crosses the vertical, i.e. y, axis of the graph), \(X\) is the predictor variable, \(b\) is the slope/coefficient, and \(e\) is the error term.

We construct these regression models for several reasons. Sometimes we want to infer how some variables (\(x\)) cause or influence another variable (\(y\)). Or maybe we know that \(y\) has a lot of measurement error or is difficult to measure, so we want to derive a formula to predict \(y\) from more accurately measured variables. In R we can run linear regression using either lm() or glm(). lm() assumes a Gaussian (normal) distribution for your data (more precisely, for the residuals). With glm(), on the other hand, you can specify the data's distribution with the family = argument. This allows you to construct a linear model when your data don't follow a Gaussian distribution.

Regression requires a predictor (independent) variable and a predicted (dependent) variable. Changing which variable is the predictor/predicted gives you a different regression line and a different regression equation. The function we are going to use today, lm(), uses the OLS method by default, as mentioned above. The least squares line is the line chosen by lm() to fit your data. The goal of OLS is to choose the line that minimizes prediction error; with any other line, the errors of prediction would be greater. Note that the best-fitting line given your data (i.e. the OLS fit) is not necessarily a good model overall. We must pay attention to fit statistics like \(R^2\), the proportion (%) of variance in the outcome explained by our predictors (i.e., our model), to determine how well the model is doing.

So how do we use these functions? In R model formulas, the dependent variable is separated from the predictors by a tilde (~). The formula to regress \(y\) on \(x\) is y ~ x:

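The three model summaries below come from calls like these (a sketch; the object names are placeholders): a model with sex as the only predictor, a model with sex, hindfoot_length, and their interaction, and a model with both predictors but no interaction.

# weight predicted by sex only
fit_sex <- lm(weight ~ sex, data = survey)
summary(fit_sex)

# weight predicted by sex, hindfoot length, and their interaction
fit_interaction <- lm(weight ~ sex * hindfoot_length, data = survey)
summary(fit_interaction)

# weight predicted by sex and hindfoot length (no interaction)
fit_additive <- lm(weight ~ sex + hindfoot_length, data = survey)
summary(fit_additive)
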
## 
## Call:
## lm(formula = weight ~ sex, data = survey)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.995 -22.171  -5.995   5.005 237.005 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.1706     0.2951 142.917   <2e-16 ***
## sexM          0.8248     0.4074   2.024   0.0429 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.5 on 32180 degrees of freedom
##   (2604 observations deleted due to missingness)
## Multiple R-squared:  0.0001273,  Adjusted R-squared:  9.627e-05 
## F-statistic: 4.098 on 1 and 32180 DF,  p-value: 0.04293
## 
## Call:
## lm(formula = weight ~ sex * hindfoot_length, data = survey)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.687 -12.919  -4.031   3.259 229.889 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -32.042646   0.692255 -46.287   <2e-16 ***
## sexM                  -1.816600   0.959032  -1.894   0.0582 .  
## hindfoot_length        2.558271   0.022894 111.746   <2e-16 ***
## sexM:hindfoot_length   0.003292   0.031264   0.105   0.9161    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.02 on 30672 degrees of freedom
##   (4110 observations deleted due to missingness)
## Multiple R-squared:  0.4678, Adjusted R-squared:  0.4678 
## F-statistic:  8989 on 3 and 30672 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = weight ~ sex + hindfoot_length, data = survey)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.749 -12.907  -4.028   3.253 229.893 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -32.09337    0.49714 -64.556  < 2e-16 ***
## sexM             -1.72062    0.29787  -5.776  7.7e-09 ***
## hindfoot_length   2.56004    0.01559 164.202  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.02 on 30673 degrees of freedom
##   (4110 observations deleted due to missingness)
## Multiple R-squared:  0.4678, Adjusted R-squared:  0.4678 
## F-statistic: 1.348e+04 on 2 and 30673 DF,  p-value: < 2.2e-16

Analysis of Variance:

ANOVA is simply a different way of evaluating explained variance in linear modelling; it is a special case of the linear model. To get an ANOVA table in R, you wrap the anova() function around an lm() fit.

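A sketch of the call producing the table below (the terms shown suggest a model with sex, genus, and their interaction):

# ANOVA table for a linear model of weight with sex, genus, and their interaction
anova(lm(weight ~ sex * genus, data = survey))
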
## Analysis of Variance Table
## 
## Response: weight
##              Df   Sum Sq Mean Sq  F value    Pr(>F)    
## sex           1     5461    5461   10.801  0.001016 ** 
## genus         9 26567752 2951972 5839.067 < 2.2e-16 ***
## sex:genus     9    48457    5384   10.650 < 2.2e-16 ***
## Residuals 32162 16259676     506                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, anova() in base R uses Type I (sequential) sums of squares by default. Anova() from the car package can give you Type II or Type III sums of squares. This matters when you have more than one predictor (e.g. taxa x sex). Asking for Type III sums of squares will match what you get from SPSS or SAS.

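A sketch using Anova() from the car package (note the capital A); type = 3 requests Type III sums of squares for the same sex-by-genus model assumed above:

library(car)

# Type III sums of squares for the sex-by-genus model
Anova(lm(weight ~ sex * genus, data = survey), type = 3)

Keep in mind that Type III tests depend on the contrast coding of your factors (sum-to-zero contrasts are usually recommended).
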
## Anova Table (Type III tests)
## 
## Response: weight
##               Sum Sq    Df   F value Pr(>F)    
## (Intercept)     2602     1    5.1464 0.0233 *  
## sex               31     1    0.0621 0.8032    
## genus       13368626     9 2938.1598 <2e-16 ***
## sex:genus      48457     9   10.6498 <2e-16 ***
## Residuals   16259676 32162                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post hoc tests:

R comes with a default pairwise t-test (pairwise.t.test()). However, the multcomp package offers more post hoc options than base R:

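A sketch using glht() from multcomp for Tukey-style all-pairwise comparisons among genera. The fitted model matches the Fit: line in the output, and genus is treated as a factor; note that only ten genera appear below, so the survey object may have been filtered beforehand.

library(multcomp)

# Tukey all-pairwise comparisons of mean weight among genera
fit_genus <- lm(weight ~ genus, data = survey)
summary(glht(fit_genus, linfct = mcp(genus = "Tukey")))
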
## Warning in RET$pfunction("adjusted", ...): Completion with error > abseps
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lm(formula = weight ~ genus, data = survey)
## 
## Linear Hypotheses:
##                                      Estimate Std. Error  t value Pr(>|t|)
## Chaetodipus - Baiomys == 0            15.5793     3.3770    4.613  < 0.001
## Dipodomys - Baiomys == 0              47.2602     3.3689   14.028  < 0.001
## Neotoma - Baiomys == 0               150.6457     3.4291   43.931  < 0.001
## Onychomys - Baiomys == 0              17.8962     3.3881    5.282  < 0.001
## Perognathus - Baiomys == 0            -0.2225     3.4117   -0.065  1.00000
## Peromyscus - Baiomys == 0             12.8563     3.3987    3.783  0.00404
## Reithrodontomys - Baiomys == 0         2.0679     3.3928    0.610  0.99971
## Sigmodon - Baiomys == 0               58.6646     3.6879   15.907  < 0.001
## Spermophilus - Baiomys == 0           84.9000    16.3078    5.206  < 0.001
## Dipodomys - Chaetodipus == 0          31.6809     0.3464   91.464  < 0.001
## Neotoma - Chaetodipus == 0           135.0663     0.7275  185.667  < 0.001
## Onychomys - Chaetodipus == 0           2.3168     0.4995    4.638  < 0.001
## Perognathus - Chaetodipus == 0       -15.8019     0.6400  -24.689  < 0.001
## Peromyscus - Chaetodipus == 0         -2.7231     0.5671   -4.802  < 0.001
## Reithrodontomys - Chaetodipus == 0   -13.5114     0.5306  -25.465  < 0.001
## Sigmodon - Chaetodipus == 0           43.0852     1.5397   27.982  < 0.001
## Spermophilus - Chaetodipus == 0       69.3207    15.9598    4.343  < 0.001
## Neotoma - Dipodomys == 0             103.3854     0.6891  150.023  < 0.001
## Onychomys - Dipodomys == 0           -29.3640     0.4418  -66.458  < 0.001
## Perognathus - Dipodomys == 0         -47.4828     0.5961  -79.654  < 0.001
## Peromyscus - Dipodomys == 0          -34.4040     0.5170  -66.544  < 0.001
## Reithrodontomys - Dipodomys == 0     -45.1923     0.4767  -94.810  < 0.001
## Sigmodon - Dipodomys == 0             11.4044     1.5220    7.493  < 0.001
## Spermophilus - Dipodomys == 0         37.6398    15.9581    2.359  0.27113
## Onychomys - Neotoma == 0            -132.7495     0.7775 -170.746  < 0.001
## Perognathus - Neotoma == 0          -150.8682     0.8744 -172.538  < 0.001
## Peromyscus - Neotoma == 0           -137.7894     0.8225 -167.522  < 0.001
## Reithrodontomys - Neotoma == 0      -148.5777     0.7978 -186.241  < 0.001
## Sigmodon - Neotoma == 0              -91.9811     1.6510  -55.713  < 0.001
## Spermophilus - Neotoma == 0          -65.7457    15.9709   -4.117  0.00102
## Perognathus - Onychomys == 0         -18.1187     0.6964  -26.019  < 0.001
## Peromyscus - Onychomys == 0           -5.0399     0.6300   -8.000  < 0.001
## Reithrodontomys - Onychomys == 0     -15.8282     0.5973  -26.500  < 0.001
## Sigmodon - Onychomys == 0             40.7684     1.5640   26.067  < 0.001
## Spermophilus - Onychomys == 0         67.0038    15.9622    4.198  < 0.001
## Peromyscus - Perognathus == 0         13.0788     0.7463   17.525  < 0.001
## Reithrodontomys - Perognathus == 0     2.2905     0.7190    3.186  0.03073
## Sigmodon - Perognathus == 0           58.8871     1.6144   36.477  < 0.001
## Spermophilus - Perognathus == 0       85.1225    15.9672    5.331  < 0.001
## Reithrodontomys - Peromyscus == 0    -10.7883     0.6549  -16.474  < 0.001
## Sigmodon - Peromyscus == 0            45.8083     1.5869   28.867  < 0.001
## Spermophilus - Peromyscus == 0        72.0437    15.9644    4.513  < 0.001
## Sigmodon - Reithrodontomys == 0       56.5966     1.5742   35.953  < 0.001
## Spermophilus - Reithrodontomys == 0   82.8321    15.9632    5.189  < 0.001
## Spermophilus - Sigmodon == 0          26.2354    16.0285    1.637  0.76598
##                                        
## Chaetodipus - Baiomys == 0          ***
## Dipodomys - Baiomys == 0            ***
## Neotoma - Baiomys == 0              ***
## Onychomys - Baiomys == 0            ***
## Perognathus - Baiomys == 0             
## Peromyscus - Baiomys == 0           ** 
## Reithrodontomys - Baiomys == 0         
## Sigmodon - Baiomys == 0             ***
## Spermophilus - Baiomys == 0         ***
## Dipodomys - Chaetodipus == 0        ***
## Neotoma - Chaetodipus == 0          ***
## Onychomys - Chaetodipus == 0        ***
## Perognathus - Chaetodipus == 0      ***
## Peromyscus - Chaetodipus == 0       ***
## Reithrodontomys - Chaetodipus == 0  ***
## Sigmodon - Chaetodipus == 0         ***
## Spermophilus - Chaetodipus == 0     ***
## Neotoma - Dipodomys == 0            ***
## Onychomys - Dipodomys == 0          ***
## Perognathus - Dipodomys == 0        ***
## Peromyscus - Dipodomys == 0         ***
## Reithrodontomys - Dipodomys == 0    ***
## Sigmodon - Dipodomys == 0           ***
## Spermophilus - Dipodomys == 0          
## Onychomys - Neotoma == 0            ***
## Perognathus - Neotoma == 0          ***
## Peromyscus - Neotoma == 0           ***
## Reithrodontomys - Neotoma == 0      ***
## Sigmodon - Neotoma == 0             ***
## Spermophilus - Neotoma == 0         ** 
## Perognathus - Onychomys == 0        ***
## Peromyscus - Onychomys == 0         ***
## Reithrodontomys - Onychomys == 0    ***
## Sigmodon - Onychomys == 0           ***
## Spermophilus - Onychomys == 0       ***
## Peromyscus - Perognathus == 0       ***
## Reithrodontomys - Perognathus == 0  *  
## Sigmodon - Perognathus == 0         ***
## Spermophilus - Perognathus == 0     ***
## Reithrodontomys - Peromyscus == 0   ***
## Sigmodon - Peromyscus == 0          ***
## Spermophilus - Peromyscus == 0      ***
## Sigmodon - Reithrodontomys == 0     ***
## Spermophilus - Reithrodontomys == 0 ***
## Spermophilus - Sigmodon == 0           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Logistic regression:

The assumption of a normally-distributed dependent variable is violated when we want to predict a binary outcome, which is what logistic regression is for. So you must use the glm() function with family = binomial rather than lm(). We only have one binary variable in our dataset: sex.

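The call matching the output below; family = binomial tells glm() to fit a logistic regression (sex must be a factor or a 0/1 variable):

# logistic regression: predict sex from weight, hindfoot length, and their interaction
summary(glm(survey$sex ~ survey$weight * survey$hindfoot_length,
            family = binomial))
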
## 
## Call:
## glm(formula = survey$sex ~ survey$weight * survey$hindfoot_length, 
##     family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.472  -1.203   1.050   1.143   1.430  
## 
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                          -1.766e-01  5.416e-02  -3.261
## survey$weight                        -5.778e-03  1.266e-03  -4.565
## survey$hindfoot_length                1.332e-02  2.020e-03   6.596
## survey$weight:survey$hindfoot_length  8.781e-05  3.212e-05   2.734
##                                      Pr(>|z|)    
## (Intercept)                           0.00111 ** 
## survey$weight                        4.99e-06 ***
## survey$hindfoot_length               4.23e-11 ***
## survey$weight:survey$hindfoot_length  0.00626 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 42452  on 30675  degrees of freedom
## Residual deviance: 42341  on 30672  degrees of freedom
##   (4110 observations deleted due to missingness)
## AIC: 42349
## 
## Number of Fisher Scoring iterations: 3

This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.