Assignment 6: Spatial statistics, simulating data, and randomization tests (8 marks)

To submit this assignment, upload the full document to Blackboard, including the original questions, your code, and the output. Submit your assignment as a knitted .pdf (preferred) or .html file.

Spatial statistics (4 marks)

In this exercise, we will continue to use the vector community and malaria survey data collected by Mbogo et al. (2003) (i.e., kenya.wide.csv). We will inspect whether spatial autocorrelation affects our inference about the effects of vector abundance, species richness, and composition on malaria prevalence. We will also investigate which environmental factors underpin the distribution of mosquitoes, all in the context of space.
1. Compute Moran’s autocorrelation statistic to assess whether mosquito abundance, species richness, and malaria prevalence are each correlated in space. (1 mark)
2. Extract annual average temperature and rainfall data from WorldClim2 using the raster files provided in class. Create maps (see appendix to lecture notes) to show the variation in these climatic factors across the sites. (1 mark)
3. Investigate whether rainfall influences mosquito abundance across sites. Your complete analysis should include a formal test of whether the temperature and rainfall patterns across sites are correlated in space (construct a variogram and interpret it when necessary), regression models with the appropriate autocorrelation structure, and an interpretation of your model outputs. Feel free to make use of additional plots to help explain your findings. (2 marks)
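As a reminder of what Moran's statistic measures, here is a minimal base-R sketch that computes it by hand on hypothetical toy data (25 simulated sites on a grid, not the Kenya survey data). The inverse-distance weights, the helper name `morans_i`, and the one-sided permutation p-value are assumptions made for illustration only; a packaged implementation may use different weights or an analytical test.

```r
# Hypothetical toy data: 25 sites on a 5 x 5 grid (not the Kenya survey data)
set.seed(1)
sites <- expand.grid(x = 1:5, y = 1:5)
# Simulated response with a west-to-east trend plus random noise
sites$abundance <- 2 * sites$x + rnorm(nrow(sites), sd = 1)

# Inverse-distance spatial weights, with zeros on the diagonal
d <- as.matrix(dist(sites[, c("x", "y")]))
w <- 1 / d
diag(w) <- 0

# Moran's I: (n / sum of weights) * weighted cross-products / sum of squares
morans_i <- function(z, w) {
  dev <- z - mean(z)
  (length(z) / sum(w)) * sum(w * outer(dev, dev)) / sum(dev^2)
}

obs_i <- morans_i(sites$abundance, w)
expected_i <- -1 / (nrow(sites) - 1)  # expected value under no autocorrelation

# Permutation null: shuffle values across sites while keeping locations fixed
null_i <- replicate(999, morans_i(sample(sites$abundance), w))
# One-sided p-value for positive autocorrelation
p_value <- (sum(null_i >= obs_i) + 1) / (length(null_i) + 1)

c(observed = obs_i, expected = expected_i, p = p_value)
```

An observed value well above the expected value, with a small p-value, suggests that nearby sites have more similar values than would be expected by chance.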
Simulating data (2 marks)
1. Generate a sample from a gamma distribution by randomly drawing 30 points from a gamma distribution with shape parameter equal to 1.35 and rate parameter equal to 0.5. Plot this distribution. Set a seed of 42. (0.5 marks)
2. Plot the distribution of sample means obtained by generating 5000 gamma samples with the same parameters as in question 1. In other words, the distribution should be made up of 5000 means, each from a different simulated gamma sample. Set a seed of 43. What do you notice about this distribution when compared to the original distribution in question 1? Why would we expect this? (0.5 marks)
3. In this exercise you will simulate a multiple regression. Remember, multiple regression means that there is more than one explanatory (aka predictor, independent) variable for a given response variable. Multiple regression thus estimates a separate effect (i.e. beta) for each explanatory variable in the model, while holding the other variables constant. This exercise is only a slight extension of the model that we simulated in lecture. Simulate a model that satisfies the conditions below and show the model output using summary(). Set a seed of 44. (1 mark).
  1. x1 is an explanatory variable consisting of a sequence from 51 to 70 with 1-unit intervals between each value (i.e. 20 values total).
  2. x2 is an explanatory variable of length 20 randomly drawn from a normal distribution with mean equal to 62 and standard deviation equal to 2.7.
  3. x3 is an explanatory variable of length 20 randomly drawn from a gamma distribution with shape equal to 5 and rate equal to 0.5.
  4. The y-intercept is 22.
  5. The beta associated with x1 is 0.62.
  6. The beta associated with x2 is 0.047.
  7. The beta associated with x3 is 0.185.
  8. The error is drawn from a normal distribution with mean equal to 0 and standard deviation equal to 1.65.
  9. y is a linear combination of x1, x2 and x3. There are no interactions.
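For orientation, the single-predictor case can be sketched as follows. This is a minimal illustration, not the lecture code: the sample size, seed, intercept, slope, and error standard deviation are all hypothetical values chosen for the example. The idea is to fix the "true" parameters, build the response from them plus random error, and then check how well `lm()` recovers them.

```r
set.seed(101)                        # arbitrary seed for this illustration
n <- 50                              # hypothetical sample size
x <- seq(1, n)                       # fixed, evenly spaced predictor
intercept <- 5                       # hypothetical true intercept
beta <- 0.8                          # hypothetical true slope
error <- rnorm(n, mean = 0, sd = 2)  # random normal error

# The response is a linear function of x plus error
y <- intercept + beta * x + error

# Fit the model and compare the estimates to the true values above
fit <- lm(y ~ x)
summary(fit)
coef(fit)
```

Each additional explanatory variable adds one more `beta * variable` term to the line that builds `y` and one more term to the model formula.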
Randomization test (2 marks)

Run the code chunk below to load the data that will be used in this exercise.
```
library(tidyverse)

# Download the data into the working directory, then read it from the same path
data_url <- "https://uoftcoders.github.io/rcourse/data/Assign05_Question3.csv"
download.file(data_url, "Assign05_Question3.csv")
df <- read_csv("Assign05_Question3.csv",
               col_names = TRUE)
glimpse(df)
```
```
## Observations: 60
## Variables: 2
## $ x     <dbl> 6.426000, 5.120825, 5.525578, 5.067441, 4.877366, 5.581507, 5.3…
## $ group <chr> "One", "One", "One", "One", "One", "One", "One", "One", "One", …
```
1. Generate a histogram showing the null distribution of t-statistics between groups one and two from the df data frame that you just loaded. The null distribution should be based on 5,000 reshufflings of the data. Overlay the observed t-statistic onto this histogram as a dashed vertical line. Set a seed of 45. Hint: t-tests return list objects that can be indexed using $. (1 mark)
2. Perform a permutation test of whether the observed t-statistic between groups one and two is different from what would be expected by chance. Include a statement about whether there is a significant difference between groups based on your permutation test and be sure to include the P-value. How does this P-value compare to one obtained from a simple t-test? Why? (1 mark)
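The general logic of a randomization test can be sketched on hypothetical toy data. The example below is a minimal illustration under stated assumptions: it uses simulated values rather than the assignment data, and the difference in group means rather than a t-statistic as the test statistic. The names `toy` and `diff_in_means` and the number of permutations are made up for the example.

```r
set.seed(7)  # arbitrary seed for this illustration

# Hypothetical toy data (not the assignment data): two groups of 20 values
toy <- data.frame(
  value = c(rnorm(20, mean = 10, sd = 2), rnorm(20, mean = 12, sd = 2)),
  group = rep(c("A", "B"), each = 20)
)

# Test statistic: difference in group means (B minus A)
diff_in_means <- function(value, group) {
  mean(value[group == "B"]) - mean(value[group == "A"])
}

obs_diff <- diff_in_means(toy$value, toy$group)

# Null distribution: reshuffle the group labels and recompute the statistic
n_perm <- 2000
null_diffs <- replicate(n_perm,
                        diff_in_means(toy$value, sample(toy$group)))

# Two-sided p-value: proportion of shuffled statistics at least as extreme
p_value <- (sum(abs(null_diffs) >= abs(obs_diff)) + 1) / (n_perm + 1)

hist(null_diffs, main = "Null distribution", xlab = "Difference in means")
abline(v = obs_diff, lty = 2)  # observed statistic as a dashed line
p_value
```

Shuffling the labels breaks any real association between group and value, so the shuffled statistics show what the statistic looks like when the null hypothesis is true.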

This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.