Assignment 5: Model selection and multivariate statistics (9 marks)

To submit this assignment, upload the full document on blackboard, including the original questions, your code, and the output. Submit your assignment as a knitted .pdf (preferred) or .html file.

In this exercise, we will once again use the data of Santangelo et al.

that you used in assignment 4. Let’s go ahead and load in the data. See assignment 4 if you need a refresher on the details of the experiment and the dataset. (5 marks total)

```r
library(tidyverse)
library(lme4)
library(lmerTest)


Santangelo_data <- "https://uoftcoders.github.io/rcourse/data/Santangelo_JEB_2018.csv"
download.file(Santangelo_data, "Santangelo_JEB_2018.csv")
Santangelo_data <- read_csv("Santangelo_JEB_2018.csv", 
                            col_names = TRUE)
glimpse(Santangelo_data)
```

```
## Observations: 792
## Variables: 11
## $ Genotype      <chr> "173-1", "173-1", "173-1", "173-1", "173-1", "173-1", "…
## $ Pollination   <chr> "Open", "Open", "Open", "Supp", "Supp", "Supp", "Open",…
## $ Herbivory     <chr> "Reduced", "Ambient", "Reduced", "Ambient", "Ambient", …
## $ HCN           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",…
## $ Flower.date   <dbl> 86, 23, 12, 25, 16, 38, 22, 61, 32, 25, 30, 33, 54, 23,…
## $ Avg.Bnr.Wdth  <dbl> 3.32, NA, 3.02, 3.14, 2.76, NA, 3.11, 3.07, 3.16, 3.16,…
## $ Avg.Bnr.Ht    <dbl> 6.50, NA, 7.02, 6.03, 5.20, NA, 5.69, 6.18, 6.04, 5.94,…
## $ Biomass.plant <dbl> 9.27, 8.34, 7.74, 15.76, 4.16, 7.75, 48.67, 10.32, 28.8…
## $ Num.flwrs     <dbl> 52.5, 67.0, 57.5, NA, NA, 47.5, 60.0, 58.5, 66.5, 64.5,…
## $ Total.Inf     <dbl> 4, 1, 21, 9, 0, 15, 41, 19, 17, 23, 129, 46, 35, 35, 22…
## $ Block         <chr> "E", "A", "E", "A", "A", "A", "D", "F", "C", "C", "A", …
```

```r
head(Santangelo_data)
```

```
## # A tibble: 6 x 11
##   Genotype Pollination Herbivory HCN   Flower.date Avg.Bnr.Wdth Avg.Bnr.Ht
##   <chr>    <chr>       <chr>     <chr>       <dbl>        <dbl>      <dbl>
## 1 173-1    Open        Reduced   Yes            86         3.32       6.5 
## 2 173-1    Open        Ambient   Yes            23        NA         NA   
## 3 173-1    Open        Reduced   Yes            12         3.02       7.02
## 4 173-1    Supp        Ambient   Yes            25         3.14       6.03
## 5 173-1    Supp        Ambient   Yes            16         2.76       5.2 
## 6 173-1    Supp        Ambient   Yes            38        NA         NA   
## # … with 4 more variables: Biomass.plant <dbl>, Num.flwrs <dbl>,
## #   Total.Inf <dbl>, Block <chr>
```


a. Model selection works best when there are no missing values in your
dataset. We will be identifying the best model that predict variation in
flowering time (`Flower.date`) across plants. Create a dataset that excludes
rows where there is missing flowering date data. (0.25 marks)

b. We want to know how HCN, herbivores, and pollinators influence flowering
date. We also think that the effect of herbivores and pollinators on
flowering data might depend on whether the plant is producing HCN. Create a
model that includes fixed-effects that test these predictions. Be sure to
account for variation due to `Genotype`, `Block` and the `Genotype` by
`Herbivory` interactions by including these terms as random effects. This
will be our saturated model. You can ignore `boundary (singular) fit`
warnings that may arise. (1 marks)

c. We will generate a reduced model from the saturated model in (a). Should
we use AIC or AIC~c~. Why? Show your calculation. (0.5 marks)

d. Using the approach described in lecture 11, optimize the random effect
structure of this model. Show the AIC/AIC~c~ output for each model of
varying random effect strucure. Provide a one sentence justification for
each random effect the model, justifying whether it is fixed (i.e., in every
model) or whether some models will drop this effect. Describe in one
sentence what the optimal random effect structure of the model is and why.
(0.5 marks)

e. Using the model with the optimal random-effect structure identified in
(c), find the optimal fixed-effect structure. Be sure to show all the models
and their AIC/AIC~c~ scores. (1.5 mark)

f. Based on the AIC/AIC~c~ output from (d), generate your final model with
both optimized fixed and random effects. Summarize the model and interpret
its output. Is there a significant effect of any treatment? If so, which
one(s) and in which direction. Make a statement about the significant
treatments' effects on flowering date. Use the model's output to support
your answer. You only need to interpret significant main effects here (i.e.
not interactions). (0.75 marks)

g. Do you think we were justified in interpreting a single model? Why or why
not? What alternative approach could we have used? (0.25 marks).

h. Use `dplyr` and `ggplot2` to plot the flowering date of plants by the
_main_ effect that showed a significant effect in the optimized model above.
The figure should show the mean plus and minus a single standard error of
the mean. Suggest one biological interpretation of the pattern you see in
the figure and in the model (i.e. why do you think this would happen). If
there are no significant effects in the model, simply write "There are no
significant effects!". (0.25 marks)

During the multivariate statistics lecture, we made use of vector community and malaria survey data collected by Mbogo et al. (2003) to disentangle the effects of vector abundance, species richness, and composition, on malaria prevalence (see path diagram in lecture 10 for a reminder of these relationships). In this exercise, we will complete the analysis of the strucutral equation model we began building in class. (1.5 marks total)

Here are some relevant snippets of code taken from the lecture notes to get you started on this exercise.

library(lavaan)

kenya.wide <- read.csv("kenya.wide.csv", header=TRUE, sep=",")

kenya.pca <- kenya.wide %>% 
  dplyr::select(arabiensis, gambiae, funestus, merus) %>% #choose relevant columns
  mutate_all(sqrt) %>% #this is the Hellinger transformation 
  prcomp(.) #pipe directly into baseR function for PCA

axes <- data.frame(kenya.pca$x)
kenya.wide <- bind_cols(kenya.wide, axes) 

kenya.wide$s.abun <- log(kenya.wide$total.abundance) 
kenya.wide$s.sr <- log(kenya.wide$SR) 
kenya.wide$s.pfpr <- log(kenya.wide$PfPR) 

sem02 <- '
# regressions
l.pfpr ~ a*l.sr + b*l.abun + c*PC2
# correlations
l.sr ~~ d*l.abun
PC2 ~~ e*l.sr
# defined parameters
indirect.abun := (a*d) #indirect effect of abundance via SR
indirect.abun2 := (d*e*c) #indirect effect of abundance via PC2
total.abun := b + (a*d) + (d*e*c) 
'

a. Complete the structural equation model by adding in calculations for the
indirect and total effects of species richness (SR) and composition (PC2) on
malaria prevalence (PfPR). (0.5 marks)

b. Evaluate the model, bootstrapping confidence intervals for path
coefficients with seed #778. Which predictor had the largest _direct_ effect
on malaria prevalence? How about _total_ effect? Briefly explain these
effects in plain english (1 sentence each). (1 mark)

In this exercise, we will be investigating the relationship between vector community structure and another commonly used metric of disease risk – entomological inoculation rate (EIR). EIR is a measure of the number of bites by infectious mosquitoes per person per unit time. We will be making use of the same data from Mbogo et al. (2003) as before, only this time, we will start with the long form data. (2.5 marks total)

kenya.long <- read.csv("kenya.long.csv", header=TRUE, sep=",")

This dataset consists of the same information as kenya.wide, with the addition of one new columns for “EIR”.

a. Convert this dataset to the wide format. Fill cells in the wide dataset
with the **relative abundance** of each species, and include the columns
"total abundance" and "EIR" in the final product. (Hint: use xxxx_join to
add the desired columns to the wide dataset after you spread it) (Hint2:
pivot_wider() may be easier to use than spread()) (1 marks)

b. Construct a series of linear models to investigate the relationship
between EIR and i) total mosquito abundance and ii) the abundance of each
species. Interpret the results of these models. (Hint: is EIR a simple
function of total mosquito abundance, or is there a particular species that
is contributing disproportionately to it?) (0.5 mark)

c. Investigate the influence of total abundance and community structure (use
the first two PC axes) on EIR with a strcutural equation model. Include only
direct effects only in this model, and pretend we have reason to believe
total abundance is associated with community composition.
    i. Briefly explain the correlation structure you have chosen for your
    predictors, total abundance, PC1, and PC2. (0.5 marks)
    ii. Evaluate the model. Are these results congruent with your findings
    from part (a)? (0.5 marks)

This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.