To submit this assignment, upload the full document on blackboard, including the original questions, your code, and the output. Submit your assignment as a knitted .pdf (preferred) or .html file.

1. In this exercise, we will once again use the data of Santangelo et al.
1. that you used in assignment 4. Let’s go ahead and load in the data. See assignment 4 if you need a refresher on the details of the experiment and the dataset. (5 marks total)
r
library(tidyverse)
library(lme4)
library(lmerTest)

Santangelo_data <- "https://uoftcoders.github.io/rcourse/data/Santangelo_JEB_2018.csv"
col_names = TRUE)
glimpse(Santangelo_data)



## Observations: 792
## Variables: 11
## $Genotype <chr> "173-1", "173-1", "173-1", "173-1", "173-1", "173-1", "… ##$ Pollination   <chr> "Open", "Open", "Open", "Supp", "Supp", "Supp", "Open",…
## $Herbivory <chr> "Reduced", "Ambient", "Reduced", "Ambient", "Ambient", … ##$ HCN           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",…
## $Flower.date <dbl> 86, 23, 12, 25, 16, 38, 22, 61, 32, 25, 30, 33, 54, 23,… ##$ Avg.Bnr.Wdth  <dbl> 3.32, NA, 3.02, 3.14, 2.76, NA, 3.11, 3.07, 3.16, 3.16,…
## $Avg.Bnr.Ht <dbl> 6.50, NA, 7.02, 6.03, 5.20, NA, 5.69, 6.18, 6.04, 5.94,… ##$ Biomass.plant <dbl> 9.27, 8.34, 7.74, 15.76, 4.16, 7.75, 48.67, 10.32, 28.8…
## $Num.flwrs <dbl> 52.5, 67.0, 57.5, NA, NA, 47.5, 60.0, 58.5, 66.5, 64.5,… ##$ Total.Inf     <dbl> 4, 1, 21, 9, 0, 15, 41, 19, 17, 23, 129, 46, 35, 35, 22…
## $Block <chr> "E", "A", "E", "A", "A", "A", "D", "F", "C", "C", "A", …  r head(Santangelo_data)   ## # A tibble: 6 x 11 ## Genotype Pollination Herbivory HCN Flower.date Avg.Bnr.Wdth Avg.Bnr.Ht ## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 173-1 Open Reduced Yes 86 3.32 6.5 ## 2 173-1 Open Ambient Yes 23 NA NA ## 3 173-1 Open Reduced Yes 12 3.02 7.02 ## 4 173-1 Supp Ambient Yes 25 3.14 6.03 ## 5 173-1 Supp Ambient Yes 16 2.76 5.2 ## 6 173-1 Supp Ambient Yes 38 NA NA ## # … with 4 more variables: Biomass.plant <dbl>, Num.flwrs <dbl>, ## # Total.Inf <dbl>, Block <chr>  a. Model selection works best when there are no missing values in your dataset. We will be identifying the best model that predict variation in flowering time (Flower.date) across plants. Create a dataset that excludes rows where there is missing flowering date data. (0.25 marks) b. We want to know how HCN, herbivores, and pollinators influence flowering date. We also think that the effect of herbivores and pollinators on flowering data might depend on whether the plant is producing HCN. Create a model that includes fixed-effects that test these predictions. Be sure to account for variation due to Genotype, Block and the Genotype by Herbivory interactions by including these terms as random effects. This will be our saturated model. You can ignore boundary (singular) fit warnings that may arise. (1 marks) c. We will generate a reduced model from the saturated model in (a). Should we use AIC or AIC~c~. Why? Show your calculation. (0.5 marks) d. Using the approach described in lecture 11, optimize the random effect structure of this model. Show the AIC/AIC~c~ output for each model of varying random effect strucure. Provide a one sentence justification for each random effect the model, justifying whether it is fixed (i.e., in every model) or whether some models will drop this effect. Describe in one sentence what the optimal random effect structure of the model is and why. (0.5 marks) e. Using the model with the optimal random-effect structure identified in (c), find the optimal fixed-effect structure. Be sure to show all the models and their AIC/AIC~c~ scores. (1.5 mark) f. Based on the AIC/AIC~c~ output from (d), generate your final model with both optimized fixed and random effects. Summarize the model and interpret its output. Is there a significant effect of any treatment? If so, which one(s) and in which direction. Make a statement about the significant treatments' effects on flowering date. Use the model's output to support your answer. You only need to interpret significant main effects here (i.e. not interactions). (0.75 marks) g. Do you think we were justified in interpreting a single model? Why or why not? What alternative approach could we have used? (0.25 marks). h. Use dplyr and ggplot2 to plot the flowering date of plants by the _main_ effect that showed a significant effect in the optimized model above. The figure should show the mean plus and minus a single standard error of the mean. Suggest one biological interpretation of the pattern you see in the figure and in the model (i.e. why do you think this would happen). If there are no significant effects in the model, simply write "There are no significant effects!". (0.25 marks) 1. During the multivariate statistics lecture, we made use of vector community and malaria survey data collected by Mbogo et al. (2003) to disentangle the effects of vector abundance, species richness, and composition, on malaria prevalence (see path diagram in lecture 10 for a reminder of these relationships). In this exercise, we will complete the analysis of the strucutral equation model we began building in class. (1.5 marks total) Here are some relevant snippets of code taken from the lecture notes to get you started on this exercise. library(lavaan) kenya.wide <- read.csv("kenya.wide.csv", header=TRUE, sep=",") kenya.pca <- kenya.wide %>% dplyr::select(arabiensis, gambiae, funestus, merus) %>% #choose relevant columns mutate_all(sqrt) %>% #this is the Hellinger transformation prcomp(.) #pipe directly into baseR function for PCA axes <- data.frame(kenya.pca$x)
kenya.wide <- bind_cols(kenya.wide, axes)

kenya.wide$s.abun <- log(kenya.wide$total.abundance)
kenya.wide$s.sr <- log(kenya.wide$SR)
kenya.wide$s.pfpr <- log(kenya.wide$PfPR)

sem02 <- '
# regressions
l.pfpr ~ a*l.sr + b*l.abun + c*PC2
# correlations
l.sr ~~ d*l.abun
PC2 ~~ e*l.sr
# defined parameters
indirect.abun := (a*d) #indirect effect of abundance via SR
indirect.abun2 := (d*e*c) #indirect effect of abundance via PC2
total.abun := b + (a*d) + (d*e*c)
'
a. Complete the structural equation model by adding in calculations for the
indirect and total effects of species richness (SR) and composition (PC2) on
malaria prevalence (PfPR). (0.5 marks)

b. Evaluate the model, bootstrapping confidence intervals for path
coefficients with seed #778. Which predictor had the largest _direct_ effect
on malaria prevalence? How about _total_ effect? Briefly explain these
effects in plain english (1 sentence each). (1 mark)
1. In this exercise, we will be investigating the relationship between vector community structure and another commonly used metric of disease risk – entomological inoculation rate (EIR). EIR is a measure of the number of bites by infectious mosquitoes per person per unit time. We will be making use of the same data from Mbogo et al. (2003) as before, only this time, we will start with the long form data. (2.5 marks total)
kenya.long <- read.csv("kenya.long.csv", header=TRUE, sep=",")

This dataset consists of the same information as kenya.wide, with the addition of one new columns for “EIR”.

a. Convert this dataset to the wide format. Fill cells in the wide dataset
with the **relative abundance** of each species, and include the columns
"total abundance" and "EIR" in the final product. (Hint: use xxxx_join to
add the desired columns to the wide dataset after you spread it) (Hint2:
pivot_wider() may be easier to use than spread()) (1 marks)

b. Construct a series of linear models to investigate the relationship
between EIR and i) total mosquito abundance and ii) the abundance of each
species. Interpret the results of these models. (Hint: is EIR a simple
function of total mosquito abundance, or is there a particular species that
is contributing disproportionately to it?) (0.5 mark)

c. Investigate the influence of total abundance and community structure (use
the first two PC axes) on EIR with a strcutural equation model. Include only
direct effects only in this model, and pretend we have reason to believe
total abundance is associated with community composition.
i. Briefly explain the correlation structure you have chosen for your
predictors, total abundance, PC1, and PC2. (0.5 marks)
ii. Evaluate the model. Are these results congruent with your findings
from part (a)? (0.5 marks)