## Lesson preamble

### Lesson objectives

• Learn how to apply and interpret linear regression for a variety of data
• Understand probability and the importance of presenting confidence intervals
• Learn the importance of visualizing your data when doing any analyses or statistics

### Lesson outline

• Statistics and probability (5 min)
• Generalized linear models (40-50 min)
• Linear regression
• Logistic regression
• Confidence intervals and p-values (15-20 min)
• Importance of visualization (20 min)

### Setup

• install.packages("dplyr") # or tidyverse
• install.packages("tidyr") # or tidyverse
• install.packages("broom")
• install.packages("datasauRus")

library(tidyverse)
library(broom)
library(datasauRus)

## Statistics and probability

Theoretical models are powerful tools for explaining and understanding the world. However, they are limited in that the real-world often doesn’t perfectly fit these models. The real world is messy and noisy. We can’t always blindly trust our data as there are inherent biases and errors in it. Measuring it, collecting it, recording it, and inputting it are some of the possible sources of error and bias. We use statistics and probability to determine whether the way we understand and conceptualize the world (as models) matches reality (even with the error and bias).

A reason we use statistical methods in R compared to writing up the formulas and equations ourselves is that we can focus on answering questions without worrying about whether we are doing the math or computation wrong. This is of course dependent on the type of research questions you may be interested in (e.g. for more theoretical questions, doing the math and computation yourself is probably a goal!) and on the type of data you are using/collecting. There is a lot of complexity that has already been taken care of in the available R packages and functions. For example, the function glm that we will use in this lesson takes a maximum likelihood-based approach to estimation, which is a complex method for determining fit (like the least squares method of fitting you did in the previous lesson). That way, you can answer your research questions and not worry too much about the exact math involved and instead worry about the specifics of your field (e.g. Are you measuring the right thing? Are you collecting the right data? Are you asking the right questions? Is there an ecological or biological aspect you are missing in your analysis?)

## Generalized linear models

Generalized linear models (or GLM) is a family of statistical modelling techniques, that can also be used in place of several other techniques, e.g. correlation or analysis of variance (ANOVA), linear regression, logistic regression. GLM is a powerful technique that you can use on a wide range of data and research questions, and is the foundation for understanding how more advanced techniques work. This is the reason I will be covering GLM in depth. (For those who will go into graduate studies or into research in general, you will likely use mixed effects models. If you can understand and grasp GLM, it will be substantially easier for you to understand mixed effects modelling and other techniques.)

### Linear regression

A version of GLM that uses continuous $$y$$ values is called linear regression, which I’m going to focus on. The formula for linear regression (or GLM in general) is:

$Y = \alpha + X\beta + \varepsilon$

Or, a simplified, alphabetized version is:

$y = a + Xb + e$

Where $$a$$ is the intercept, $$X$$ is the variable (or as a matrix of all variables), $$b$$ is the slope/coefficient, and $$e$$ is the error term. In the case of multiple linear regression (more than one $$x$$), it is expanded to:

$y = a + x_1b_1 + x_2b_2 +...+ x_nb_n + e$

Where each $$x_n$$ is a variable (either continuous or categorical) in a data frame and $$b_n$$ is the variable coefficient/slope.

We construct these regression models for several reasons. Sometimes we want to infer how some variables ($$x$$) cause or influence another variable ($$y$$). Or maybe we know that $$y$$ has a lot of error in the measurement or is difficult to measure, so we want to derive a formula in order to predict $$y$$ based on more accurately measured variables. In R we can run linear regression either using lm or using glm. We’ll use glm since it can be expanded to other types of analyses. First, let’s load up a dataset.

portal <- read_csv("portal_data.csv")
## Parsed with column specification:
## cols(
##   record_id = col_integer(),
##   month = col_integer(),
##   day = col_integer(),
##   year = col_integer(),
##   plot_id = col_integer(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_integer(),
##   weight = col_integer(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character(),
##   plot_type = col_character()
## )

Now, let’s fit a model to the data. Let’s say we want to see the role that sex has on hindfoot length in various species in the portal dataset. (gaussian is stating that hindfoot_length is a continuous variable and that you assume the error terms have a Gaussian, or normal, distribution.)

fit1 <- glm(hindfoot_length ~ sex, data = portal, family = gaussian)
summary(fit1)
##
## Call:
## glm(formula = hindfoot_length ~ sex, family = gaussian, data = portal)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -27.710   -8.710    2.290    7.163   35.163
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.83678    0.07826 368.469  < 2e-16 ***
## sexM         0.87280    0.10799   8.082 6.58e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 91.22251)
##
##     Null deviance: 2867427  on 31369  degrees of freedom
## Residual deviance: 2861468  on 31368  degrees of freedom
##   (3416 observations deleted due to missingness)
## AIC: 230610
##
## Number of Fisher Scoring iterations: 2

There’s a lot of results and information contained within fit1, which summary() extracts and presents in a fairly nice format. However, a lot of this information we aren’t really interested in. So let’s extract only what we want, using the broom package. (get it, broom to clean up?)

tidy(fit1)
##          term   estimate  std.error  statistic      p.value
## 1 (Intercept) 28.8367799 0.07826099 368.469411 0.000000e+00
## 2        sexM  0.8727977 0.10798830   8.082336 6.580085e-16

Much nicer! Ok, so what does this actually mean? Let’s go back to the equation and convert the above into the formula:

$hindfoot\_length = Intercept + (sexM \times b)$

Let’s substitute in the numbers. The intercept, is well, the intercept in the data above. The $$b$$ is the estimate for sex.

$hindfoot\_length = 28.836 + (sexM \times 0.872)$

We can use this equation to estimate the value of hindfoot length for a specific sex. But first, let’s make sure this is working correctly. Let’s find out the mean hindfoot length by sex.

portal %>%
filter(!is.na(sex)) %>%
group_by(sex) %>%
summarize(mean_hfl = mean(hindfoot_length, na.rm = TRUE)) 
## # A tibble: 2 x 2
##   sex   mean_hfl
##   <chr>    <dbl>
## 1 F         28.8
## 2 M         29.7

Ok, let’s compare to the equation. The variable sexM is 1 for male and 0 for female. So if we want to calculate the hindfoot length of female, we have to set sexM as 0 in the equation. Let’s do it:

28.836 + (0 * 0.872)
## [1] 28.836

Same as above! And for males (equal to 1):

28.836 + (1 * 0.872)
## [1] 29.708

Good, it matches what we calculated. Why is this useful? Well, we can add more terms to the GLM equation. This allows us to determine how a variable influences $$y$$ when other variables are held constant and is known as multiple linear regression. Since hindfoot length is probably also dependent on weight, let’s add that to the model and put the values into the equation.

fit2 <- glm(hindfoot_length ~ sex + weight, data = portal)
tidy(fit2)
##          term   estimate   std.error statistic      p.value
## 1 (Intercept) 21.1572732 0.073787670 286.73182 0.000000e+00
## 2        sexM  0.7994193 0.079493966  10.05635 9.375428e-24
## 3      weight  0.1827353 0.001112868 164.20209 0.000000e+00

$hindfoot\_length = 21.157 + (sexM \times 0.799) + (weight \times 0.183)$

So if we wanted to know what the hindfoot length would be for a female of 50 weight:

21.157 + (0 * 0.799) + (50 * 0.183)
## [1] 30.307

Sometimes, depending on the research question, we want to be able to see the predicted $$y$$ based on a single variable after removing the other variables. To remove a variable you set it to zero, but a zero weight of an animal doesn’t make sense. It doesn’t exist! So, we can transform the data so that zero is possible. There are many ways to transform data, but one way that is often used is to mean center the data. Mean centering is subtracting each value by the mean. After mean centering, 0 is equal to the mean of the variable. So let’s do that to weight.

portal2 <- portal %>%
mutate(wt_center = weight - mean(weight, na.rm = TRUE))
# You can also use scale
# mutate(wt_center = scale(weight, scale = FALSE))

fit3 <- glm(hindfoot_length ~ sex + wt_center, data = portal2)
tidy(fit3)
##          term   estimate   std.error statistic      p.value
## 1 (Intercept) 28.9550335 0.057589207 502.78576 0.000000e+00
## 2        sexM  0.7994193 0.079493966  10.05635 9.375428e-24
## 3   wt_center  0.1827353 0.001112868 164.20209 0.000000e+00

$hindfoot\_length = 28.96 + (sexM \times 0.8) + (wt\_center \times 0.18)$

Much better. So if we wanted to compare males vs females who have a mean weight (set to 0):

# Female
28.96 + (0 * 0.8) + (0 * 0.18)
## [1] 28.96
# Male
28.96 + (1 * 0.8) + (0 * 0.18)
## [1] 29.76

Or if we wanted to see males with weight 10 g minus the mean vs 10 g plus the mean:

# Males 10 cm minus mean
28.96 + (1 * 0.8) + (-10 * 0.18)
## [1] 27.96
# Male 10 cm plus mean
28.96 + (1 * 0.8) + (10 * 0.18)
## [1] 31.56

#### Challenge

1. Create a new dataframe called challenge1 that keeps only taxa of “Rodent” (filter) and with a new column (mutate) called “Dipodomys” where if genus is equal to “Dipodomys” than the value is “Yes” and if not the value is “No” (hint: use ifelse). Then, create a model using glm called fit_challenge1 that has the terms hindfoot_length (as $$y$$), sex, wt_center, and Dipodomys. Extract the relevant information from the model using tidy. What is the hindfoot length when sex is female, weight is 5 above the mean, and Dipodomys is yes?
# Solution
challenge1 <- portal2 %>%
filter(taxa == "Rodent") %>%
mutate(Dipodomys = ifelse(genus == "Dipodomys", "Yes", "No"))

fit_challenge1 <- glm(hindfoot_length ~ sex + wt_center + Dipodomys, data = challenge1)
tidy(fit_challenge1)
##           term    estimate    std.error  statistic    p.value
## 1  (Intercept) 22.93750286 0.0288897314 793.967329 0.00000000
## 2         sexM -0.06471891 0.0334040826  -1.937455 0.05269891
## 3    wt_center  0.11647255 0.0004981872 233.792743 0.00000000
## 4 DipodomysYes 13.52878749 0.0356700466 379.275857 0.00000000
22.9 + (-0.06 * 0) + (0.12 * 5) + (13.5 * 1)
## [1] 37
1. Use the below code to create a new variable Disease with (random) values as either 0 for “Healthy” or 1 for “Diseased”. Then, write up a formula using glm to analyze the role that weight, sex, and hindfoot length (not mean centered) has on disease status. Because the $$y$$ is not continuous, you need to set family = binomial in glm. Run the code, check the summary (or tidy), put the numbers into an equation (as we did above), and try to interpret the results (where sex is female, weight is 50, and hindfoot length is 30).
# So random numbers is the same for everyone
set.seed(1002)
challenge2 <- portal %>%
# rbinom randomly creates values for binary data.
mutate(Disease = rbinom(nrow(.), 1, 0.25))
# Solution
fit_challenge2 <- glm(Disease ~ weight + sex + hindfoot_length,
data = challenge2, family = binomial)
summary(fit_challenge2)
##
## Call:
## glm(formula = Disease ~ weight + sex + hindfoot_length, family = binomial,
##     data = challenge2)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.7780  -0.7700  -0.7609   1.6406   1.6960
##
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -1.0351924  0.0467550 -22.141   <2e-16 ***
## weight          -0.0003031  0.0005085  -0.596    0.551
## sexM            -0.0338419  0.0263049  -1.287    0.198
## hindfoot_length -0.0004350  0.0018926  -0.230    0.818
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 34757  on 30675  degrees of freedom
## Residual deviance: 34754  on 30672  degrees of freedom
##   (4110 observations deleted due to missingness)
## AIC: 34762
##
## Number of Fisher Scoring iterations: 4
tidy(fit_challenge2)
##              term      estimate    std.error   statistic       p.value
## 1     (Intercept) -1.0351924290 0.0467549597 -22.1408046 1.279324e-108
## 2          weight -0.0003031094 0.0005085043  -0.5960803  5.511216e-01
## 3            sexM -0.0338419125 0.0263049452  -1.2865228  1.982607e-01
## 4 hindfoot_length -0.0004350449 0.0018926324  -0.2298623  8.181987e-01
-1.035 + (-0.0003 * 50) + (-0.0338 * 0) + (-0.0004 * 30)
## [1] -1.062

### Logistic regression

Logistic regression is a technique that has the $$y$$ as a binary or categorical variable (e.g. female vs male, mammal vs plant vs bacteria). We won’t be covering this in too much detail, but I will go over it to highlight the flexibility that GLM has in handling different types of data.

The reason why the result of the challenge two above was difficult to interpret was because there’s a bit more to logistic regression then with linear regression. It becomes clearer when we look at the formula:

$logit(p) = \ln\left(\frac{P(y=1)}{P(y=0)}\right) = a + Xb + e$

In this case, the $$y$$ is the logged odds of the event occurring (in the challenge the event was disease). In order to interpret the model results, we need to exponentiate both sides of the equation.

$\exp\left(\ln\left(\frac{p}{1 - p}\right)\right) = \exp(a + Xb + e)$ $\frac{p}{1 - p} = e^{a + Xb + e}$

where $$p$$ is the probability of the event (equal to 1, in this case of disease). So if we exponentiate the estimates, and solve for the formula, we can interpret the result as the odds of an event occurring.

exp(-1.035 + (-0.0003 * 50) + (-0.0338 * 0) + (-0.0004 * 30))
## [1] 0.3457636

In this case, for a female with a weight of 50 and hindfoot of 30, the probability of disease is 0.345, or 34%. Interpretation actually gets a bit more complicated than that depending on what you want to look at. So depending on your final project, if you need/want to use this, we can help walk you through this more.

## Confidence intervals, and p-values

So what’s the point of using glm? Not only does it create a model with coefficients and the magnitude (effect size), but it is also used to calculate how certain we are about the results. The main power of GLM comes from the ability to estimate model parameters and derive meaning from how certain $$x$$ variables (independent variables) influence the $$y$$ (dependent variable). We can use this model to predict what could happen to some value if we knew or could change the independent variables ($$x$$). But we also want to know whether this model reflects reality.

### Confidence intervals

From GLM we can determine the confidence (or rather, the uncertainty) we have about the beta ($$b$$) estimates. In the case of linear regression model, we can use tidy with the conf.int argument to calculate the confidence interval. In the default case, this calculates the 95% confidence interval, or rather we are 95% certain that the estimate lies in this range.

fit_wt <- glm(hindfoot_length ~ weight, data = portal)
tidy(fit_wt, conf.int = TRUE) %>%
select(term, estimate, conf.low, conf.high, p.value)
##          term   estimate   conf.low  conf.high p.value
## 1 (Intercept) 21.5621621 21.4422759 21.6820482       0
## 2      weight  0.1830001  0.1808171  0.1851831       0

In this case, the uncertainty of the estimate for hindfoot length is between 2.524 to 2.58, which is pretty narrow (that’s a good thing). This tells us that the estimate that hindfoot length influences weight is probably reflective of reality. That makes sense since longer hindfeet mean there is more space and amount of area for more flesh to attach to and that there is more bone, which adds weight. We can also be more strict about our uncertainty (counter-intuitively, a larger confidence interval is more strict).

# Confidence interval of 99%
tidy(fit_wt, conf.int = TRUE, conf.level = 0.99) %>%
select(term, estimate, conf.low, conf.high, p.value)
##          term   estimate   conf.low  conf.high p.value
## 1 (Intercept) 21.5621621 21.4046050 21.7197192       0
## 2      weight  0.1830001  0.1801311  0.1858691       0

We can also expand the model to include more terms.

fit_wtsex <- glm(hindfoot_length ~ weight + sex, data = portal)
tidy(fit_wtsex, conf.int = TRUE) %>%
select(term, estimate, conf.low, conf.high, p.value)
##          term   estimate   conf.low  conf.high      p.value
## 1 (Intercept) 21.1572732 21.0126521 21.3018944 0.000000e+00
## 2      weight  0.1827353  0.1805541  0.1849165 0.000000e+00
## 3        sexM  0.7994193  0.6436140  0.9552246 9.375428e-24

Great! In almost all cases, you should favour confidence intervals and estimation over the p-value. In some parts of science (especially biomedical research), p-values are ubiquitous. But, there is a major problem with p-values. What does the p-value mean?

### P-values are counter-intuitive

P-values are based on and used in hypothesis testing. Hypothesis testing is when you ask a question such as “Are males heavier than females in rodents?”. There is then the null hypothesis (“females and males weigh the same”) and the alternative hypothesis (“males are heavier than females”). Using statistical tests, we can calculate the p-value to give us an indication of which hypothesis is true. Traditionally, if the p-value is <0.05 you ‘reject’ the null hypothesis. This cut off is completely arbitrary.

However, the true meaning of the p-value is a bit counter-intuitive. The p-value is the probability of getting a result that is the same as or more extreme as what you are getting given the null hypothesis is true. There is nothing in that statement that says the alternate hypothesis is true… This is important because often researchers use the p-value to say that their hypothesis is true or that the p-value is the probability their hypothesis (and thus reality) is true. But that is not at all the case! This is why the use of p-values should be minimized as much as possible and to instead use estimation and confidence intervals.

The other thing with p-values is that they are ‘unstable’/‘unreliable’… meaning they can change with even just slight differences in the data. Let’s do a demonstration:

num_tests <- 100
sample_size <- 100
actual_difference <- 0
variability <- 1

p.values <- sapply(1:num_tests, function(x) {
x <- rnorm(sample_size, mean = 0,
sd = variability)
y <- rnorm(sample_size, mean = 0 + actual_difference,
sd = variability)
tidy(t.test(x, y))[['p.value']]
})

qplot(y = p.values) +
geom_hline(yintercept = 0.05)

As you can see, we know that x and y are the same (both have a mean of 0 and standard deviation of 1). And yet, there are still ~6 or so (below the line) tests that show a “significant” difference between x and y! Which makes sense, since a p-value of 0.05 equals a 5% chance you will see another similar result if the null hypothesis were true. Confusing eh?

#### Challenge

1. Play around with the above code and try this different conditions. What do you notice?
• sample_size to 10
• sample_size to 500
• num_tests to 1000
• num_tests to 10
• actual_difference to 0.25
• actual_difference to 1
• variability to 0.5
• variability to 3

## Data visualization

### A key step in all analyses

Visualizing your data before, during, and after doing statistics or modelling your data is an incredibly important and sometimes overlooked aspect of data analysis. Sometimes statistics and modelling give you an answer that you sort of expect, but the actual data is telling a different story.

#### Challenge

There are several datasets found within the datasauRus package. This package is trying to illustrate a point. First, let’s look at the mean and standard deviation of the variables in each dataset:

datasaurus_dozen %>%
group_by(dataset) %>%
summarise(
mean_x = mean(x),
sd_x = sd(x),
mean_y = mean(y),
sd_y = sd(y)
)
## # A tibble: 13 x 5
##    dataset    mean_x  sd_x mean_y  sd_y
##    <chr>       <dbl> <dbl>  <dbl> <dbl>
##  1 away         54.3  16.8   47.8  26.9
##  2 bullseye     54.3  16.8   47.8  26.9
##  3 circle       54.3  16.8   47.8  26.9
##  4 dino         54.3  16.8   47.8  26.9
##  5 dots         54.3  16.8   47.8  26.9
##  6 h_lines      54.3  16.8   47.8  26.9
##  7 high_lines   54.3  16.8   47.8  26.9
##  8 slant_down   54.3  16.8   47.8  26.9
##  9 slant_up     54.3  16.8   47.8  26.9
## 10 star         54.3  16.8   47.8  26.9
## 11 v_lines      54.3  16.8   47.8  26.9
## 12 wide_lines   54.3  16.8   47.8  26.9
## 13 x_shape      54.3  16.8   47.8  26.9

They’re all basically the same… What about for linear regression? (lm is the same as glm with family gaussian.)

# Linear regression on each dataset
datasaurus_dozen %>%
group_by(dataset) %>%
do(tidy(lm(y ~ x, data = .))[2, ])
## # A tibble: 13 x 6
## # Groups:   dataset [13]
##    dataset    term  estimate std.error statistic p.value
##    <chr>      <chr>    <dbl>     <dbl>     <dbl>   <dbl>
##  1 away       x      -0.103      0.135    -0.760   0.448
##  2 bullseye   x      -0.110      0.135    -0.813   0.417
##  3 circle     x      -0.110      0.135    -0.811   0.419
##  4 dino       x      -0.104      0.136    -0.764   0.446
##  5 dots       x      -0.0969     0.135    -0.715   0.476
##  6 h_lines    x      -0.0992     0.136    -0.732   0.466
##  7 high_lines x      -0.110      0.135    -0.812   0.418
##  8 slant_down x      -0.111      0.135    -0.818   0.415
##  9 slant_up   x      -0.110      0.135    -0.814   0.417
## 10 star       x      -0.101      0.135    -0.746   0.457
## 11 v_lines    x      -0.112      0.135    -0.824   0.412
## 12 wide_lines x      -0.107      0.135    -0.789   0.431
## 13 x_shape    x      -0.105      0.135    -0.778   0.438

Same thing, all are more or less the same. But, there is something seriously wrong with every since one of the datasets. Pair up and explore the data visually to see what’s going on. Can you see it?

# Solution
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
geom_point() +
theme_void() +
theme(legend.position = "none") +
facet_wrap(~ dataset, ncol = 3)

### Other examples of using visualization to find a problem

From what you found in the challenge above, can you see why it’s dangerous to not visualize the data first? Sometimes just visualizing the data can give you greater insight into the data and meaning than any statistical or mathematical model can provide. (For a real-world example of this, check out this article on the discoveries physicists made when they created a visualization of a black hole based on the math when making the movie Interstellar).

But even the type of graph you use can hide data. Check out the figure below for an example of that. In general, it’s better to visualize the raw data points, but this quickly becomes a problem as you get more data.

Another example is known as Simpson’s Paradox. You all have already encountered Simpson’s Paradox in the challenges of lecture 4 and 5, where the average weight of all species decreased over time, but for each species weight remained constant. These types of problems are encountered all the time in data analysis and it is part of what makes science so hard, because data often doesn’t behave as we expect it to. The figure below illustrates Simpson’s Paradox.

ggplot(simpsons_paradox, aes(x = x, y = y, colour = dataset)) +
geom_point() +
geom_smooth(method = "lm") +
theme(legend.position = "none") +
facet_wrap( ~ dataset, ncol = 3)

And sometimes, by not looking at the raw data, it can lead to actual harm. For instance, Simpson’s paradox comes up in medical studies of drugs that could be lifesaving for a patient. If inappropriate conclusions are drawn because the data wasn’t completely explored, people could be harmed or could die. This is true in any scientific field, including ecology. Policies aimed at benefiting the environment could in fact be harming it because these types of things were not examined.