To submit this assignment, upload the full document on Quercus, including the original questions, your code, and the output. Submit your assignment as a knitted .pdf (prefered) or .html file.

Part of being an effective scientist involves being able to solve problems you have not encountered before. This is certainly true of programming as well, where problems are typically solved by furious bouts of Googling, reading documentation, and trial and error of proposed solutions. In this assignment, like previous ones, you will be evaluated on your ability to solve data manipulation and analysis tasks. However, unlike previous assignments, some of the solutions to the problems will require more research and effort on your part. It may require the use of packages and techniques not explored in class, but all problems are solveable, often with only a few lines of code. By now, you should all have the terminology required to search for solutions to the problems below. As with all programming problems, there are many possible ways to get the answer to the problems below.

1. Simpson’s diversity index and bootstrapping (6 marks)

In lecture 12, we used data from the National Ecological Observatory Networkd on the abundance and percent cover of plant species across sites in the Harvard Forest from the year 2017. Run the code chunck below to read in the data if you do not already have it.

library(tidyverse)
neon_data <- "https://uoftcoders.github.io/rcourse/data/NEON_PlantPA_HARV_201707.csv"
col_names = TRUE)
1. Using the raw NEON data, create a matrix with plant species as rows and sites as columns. The cell values should represent the abundance of each species at a given site. (1 mark)

2. Write a function that takes a single numeric vector as an argument and returns Simpson’s Index of Diversity for the numeric vector. Test your function by computing Simpson’s diversity index on the test_vector in the code chunk below. Be sure to report the Simpson’s index for test_vector in your final assignment. As a reminder, I have included the formula for Simpson’s index below:

test_vector <- c(0, 1, 0, 5, 0, 1, 0, 4, 3, 0, 0, 0, 0, 1, 4, 0, 5, 0,
3, 0, 11, 2, 19, 0, 11, 0, 0, 0, 0, 0)

$D = 1 - \sum_{i = 1}^{s}(p_i)^2$

where s is the species richness (i.e. number of species) and p_i is th relative abundance of species i. A D value of 0 represents no diversity and a D value of 1 represent infinite diversity. (1 mark)

1. In lecture 12, we discussed how resampling techniques can be used to conduct hypothesis tests (i.e., permutation tests) by comparing an observed population parameter (e.g., mean, median) to a null distribution of that parameter generated by resampling your observed data without replacement many times. Resampling can additionally be used to generate confidence intervals around a population parameter (e.g., mean median, slope) or other statistic (e.g., C-score) using a technique known as bootstrapping. Bootstrapping allows us to estimate the true distribution of a statistic in cases where it is unknown, which we can use to estimate our uncertainty (e.g., Standard Error, Confidence Intervals) in a population parameter. Importantly, this applies regardless of the shape of the distribution, although some adjustments often have to be made for strongly skewed distributions. While there are a few different types of bootstrapped confidence intervals (e.g., bias-corrected and accelerated, t with bootstrap), in this exercise, you will write a function that calculates the simple 95% percentile bootstrapped confidence interval from a given numeric vector. Your function should have the following properties:

1. It should take in two arguments: A numeric vector and an integer representing the number of iterations
2. It should return a data frame with two columns: The lower and upper quantiles of the distribution, representing the 2.5 and 97.5 percentiles, respectively (hint: The quantile() function).

Write the function as described above. How does the bootstrap differ from a permutation test (note: you don’t need code here, just tell me in words). Set a seed of 42 and test your function on the test_vector from part (b). (1.5 marks)

1. Use your functions defined in (b) and (c) to estimate the Simpson’s diversity index and corresponding 95% bootstrapped confidence intervals around the Simpson’s index for each of the sites in your dataframe from (a). Use 1000 iterations for the bootstrapping procedure. Your answer should be a single dataframe with four columns: site, simpson, lower, and upper. Points are awarded for conciseness of the code. (Hint: The most concise answer will likely make use of the purrr package, which is part of the tidyverse). Set a seed of 43. (1.5 marks)

2. Using the dataframe from (d), plot the Simpson’s index (y-axis) for each of the sites (x-axis) as a single point surrounded by its lower and upper 95% CIs. Order the x-axis from lowest to highest Simpson’s index. (1 mark)

1. Recreating a figure

In assignment 3, we explored a dataset containing changes in yearly plant biomass in Abisko National Park, Sweden. In this question, we will use ggplot2 to reproduce a figure in the original paper (Olofsson et al, 2013; link).

You should still have the dataset from when you completed assignment 3, but if not, run the code chunk below to download it.

plant_biomass_url <- 'https://raw.githubusercontent.com/UofTCoders/rcourse/master/data/plant-biomass-preprocess.csv'
plant_biomass <- read_csv('plant-biomass-preprocess.csv')

Reproduce figure 4 from the paper using ggplot2. Pay close attention to the overall structure of the figure, the scale on the axes, and so on. The colors of the points do not need to be the exact same colors as those in the figure, but they should be sufficiently close. (Note: you can ignore the SEM points, and the ‘look’ of your axes does not have to match the figure exactly. The species names also not have to be included in the body of the plot (like in the paper) as long as they are visible in some form (1 mark)

1. Use the built-in R dataset iris, for this question. (2 marks)

1. Test the relationship between sepal length and sepal width for each species using linear models. Set sepal width as the response and sepal length as the predictor. Output a single data frame containing the beta estimate and p-value associated with the predictor for each of the three models, and also make sure to include a column called ‘species’ in the final data frame. Your final data frame should only have species, term, estimate, and p.value as columns. Do not include the intercepts. (Hint: You want to simultaneously perform the same model on different subsets of the data with only a few lines of code) (1 mark)

2. Use ggplot2 to plot a scatter plot of sepal width by sepal length, coloured by species. Plot the three linear fits as well, also coloured by species. Below your code, comment on how the estimate values from your linear models above correspond to the plotted fits. (1 mark)

2. The Canadian lynx population cycle (3.5 marks)

The Canadian lynx experiences large periodic changes in its population size over a timescale of several years. This is thought to be driven by oscillations in the population size of the snowshoe hare, the primary food source for the lynx. Read more about the lynx population cycle on this Northwest Territories website.

R has a built-in dataset called lynx which contains annual population measurements for the Canadian lynx as a time series.

1. Plot the abundance of lynx vs. time in years using either ggplot or qplot. Plot points and a connect them by a line . Create a time series that starts at 0 and ends at the total number of years in the dataset (total years $$= 1934-1821$$). By eye, estimate the time between peaks in the population. (0.5 marks)

2. Define a function called sine_model that takes 5 arguments: a vector of years for the x-axis and four parameters (amplitude, period, phase, and offset). Recall the general formula for a sine wave: $y = A \text{sin}(kx - b) + c$ where $$k = 2\pi / T$$, $$T$$ is the period or length of time between peaks, $$A$$ is the amplitude, $$b$$ is the phase, and $$c$$ is the offset. Using a value of $$A = c = 1700$$ for both the amplitude and offset and a value of $$b = 2.75$$ for the phase, plot the lynx data as before and add a sine curve using your guess of the timescale from part (a) for the period. Use a colour other than black to plot the sine wave. Note that the x axis must start at 0 in order for the offset of $$2.75$$ to match the data. (1 mark)

3. Use least-squares fitting to refine your estimate of the lynx cycle length. (1.5 marks)
• Create a numeric vector with a range of values for the period that span your guess from part (a).
• Write a function to calculate the Residual Sums of Squares (RSS) from a model fitted to the lynx data.
• This function should calculate the sum of the difference (residuals) between the lynx data and your prediction, then return the sum of the residuals squared.
• Apply this function over the numeric vector of period values you created. Essentially, you should be striving to obtain an RSS value for models fitted using all of the different period values in your numeric vector.
• Plot the sum of the residuals squared vs. the range of period values. By eye, what is the minimum of this curve?
• Identify the period value that provides the best fit to the lynx data. Models with lowest RSS fit the data best. What is your calculated length of the lynx population cycle?
4. Plot the lynx data again and plot your best fit curve on top. (0.5 marks)

3. In class, we worked with a simple one-locus haploid model of evolution and investigated how different forces of evolution (e.g., selection vs. drift) affect allele frequencies. In this challenge assignment, we will naturally upgrade to a diploid model to see how evolution work in this kind of system.

We will once again be considering the case of malaria (of course, what else does Amber think about anyway). In assignment 7, we (you) worked on ways to incorporate various facets of biological detail into a baseline, super simple, disease model. We will incorporate another level of detail in this exercise: host genotype. The trait that we will be working with is the sickle cell trait. You can read more about the sickle-cell trait on Wikipedia. Basically, this is a blood-cell-shape trait that is controlled at a single locus by two alleles: $$A$$ for wildtype “normal” blood shape, and $$S$$ for sickle-shaped blood cells. There are thus three possible host genotypes with respect to the sickle cell trait: homozygous “normal” blood ($$AA$$), heterozygous ($$AS$$), and homozygous sickle blood ($$SS$$). People with homozygous $$SS$$ suffer from debilitating illness (sickle-cell disease) and often result in mortality at a young age if untreated. Heterozygotes ($$AS$$), while still suffering minor illness, receives partial immunity against malaria. It was thus hypothesized that the deleterious $$S$$ allele is maintained in the population due to balancing selection – the strong selective pressure exerted by malaria gave heterozygotes an advantage over the homozygotes, thus saving the $$S$$ allele from being purged from the population.

We will be using mathematical models to help us test this hypothesis. A good start is to first extend the framework we used in lecture and think about how to represent allele frequency change in a diploid population.

In population genetics it is often easier to work with allele frequencies (proportion of a certain allele in a population) rather than changes to their absolute numbers. Here, we will use $$p$$ and $$q$$ to denote the proportions of the $$A$$ and $$S$$ alleles in the population, respectively. The frequencies of the three genotypes in the population are thus given by $$p^2$$, $$2pq$$, and $$q^2$$ for AA, AS, and SS individuals, respectively. $$p^2$$, $$2pq$$, and $$q^2$$ are known as Hardy-Weinberg proportions and they have the convenient property of summing to 1.

Applying selection to this population alters the frequency of each genotype by an amount that is proportional to their fitness. We will denote the fitness of each host genotype group as $$W_{AA}$$, $$W_{AS}$$, and $$W_{SS}$$. The frequencies of each genotype following selection (i.e., as weighted by its fitness) is therefore $$p^2W_{AA}$$, $$2pqW_{AS}$$, and $$q^2W_{SS}$$. Dividing these quantities by their sum (the mean fitness of the population, often denoted with $${\bar{W}}$$) allows us to retrieve the frequency of the three genotypes after selection (e.g., $$p^2 \frac{W_{AA}}{\bar{W}}$$). (2 marks)

1. Derive a recursion equation that describe the change in allele frequencies of $$A$$ and $$S$$ in one time step. (Hint: How do you go from genotype frequency to allele frequency?) (1 marks)

2. Simulate the trajetory of change of the A allele. Use the parameter set provided, and include a graph in your answer. Explain what you see, and refer to the parameter values provide in your explanation. (1 mark)

# Time
times <- 500
timevec <- seq(1, times, 1)

# Parameters
N <- 10000 # Total population size (assume constant)
WAA <- 0.5
WAS <- 0.7
WSS <- 0.2

# Intitial condition
p <- 0.5 # proportion of A allele