Assignment 3: dplyr and ggplot (8 marks)

To submit this assignment, upload the full document to Blackboard, including the original questions, your code, and the output. Submit your assignment as a knitted .pdf (preferred) or .html file.

Plotting (1 mark)

Run the block below to create a categorical version of the activ column. This will make R recognize that there are only two levels of activity (0 and 1), rather than a continuous range from 0 to 1, which will facilitate plotting.
```
library(tidyverse)
```
```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.4
## ✔ tidyr   1.0.2     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
```
```
beaver1 <- beaver1 %>%
    mutate(factor_activ = factor(activ))
```
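
To see why the conversion matters, here is a minimal sketch on a made-up vector (the name `activity` and its values are hypothetical and are not taken from beaver1). It only illustrates the difference between a numeric vector and a factor.

```r
# Toy vector standing in for an activity indicator (hypothetical values).
activity <- c(0, 1, 1, 0, 1)

# As a number, R sees a value that could lie anywhere on a continuous scale.
is.numeric(activity)

# As a factor, R stores the distinct values as a fixed set of levels.
activity_factor <- factor(activity)
levels(activity_factor)
table(activity_factor)
```

Plotting functions can then map each level to its own discrete color instead of a continuous gradient.
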
1. In the previous assignment, we saw that the beaver’s body temperature was the highest when the beaver was outside the retreat. However, we did not explore the distribution of temperatures for the active and inactive conditions. Create a histogram with the temperature on the x-axis and color the bins according to the activity variable. Hint: Use the fill parameter rather than color, and make sure you are using the correct activ column! (0.25 marks)
2. We already know that the beaver’s body temperature is correlated with whether it is outside the retreat or not. However, we did not control for the time of day. Maybe the beaver’s temperature is even better predicted by knowing what time of day it is. To answer this question satisfactorily, we should perform a regression analysis, but we can easily get a good overview by plotting the data. Make a scatter plot with the time of day on the x-axis and the body temperature on the y-axis. Color the scatter points according to the beaver’s activity level and separate the measurements into one plot per day. Hint: To separate measurements per day, you could use filter() and two chunks of code, but try the more efficient way of faceting into subplots, which we talked about in the lecture. (0.75 marks)
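
The two hints above (fill versus color, and faceting) can be sketched on a small made-up data frame. All column names and values below (`hour`, `value`, `group`, `day`) are hypothetical; this is an illustration of the general pattern, not the solution for beaver1.

```r
library(ggplot2)

# Hypothetical toy data: a measurement, a two-level group, and a day label.
toy <- data.frame(
    hour  = rep(c(6, 12, 18, 24), times = 4),
    value = c(1.0, 1.4, 1.9, 1.2, 1.1, 1.6, 2.1, 1.3,
              0.9, 1.5, 2.0, 1.1, 1.2, 1.7, 2.2, 1.4),
    group = factor(rep(c("a", "b"), each = 2, times = 4)),
    day   = rep(c("day 1", "day 2"), each = 8)
)

# `fill` colors the inside of histogram bars; `color` would only outline them.
hist_plot <- ggplot(toy, aes(x = value, fill = group)) +
    geom_histogram(bins = 5)

# facet_wrap() draws one panel per level of `day` from a single call.
scatter_plot <- ggplot(toy, aes(x = hour, y = value, color = group)) +
    geom_point() +
    facet_wrap(~ day)

hist_plot
scatter_plot
```

Faceting keeps all panels on shared axes by default, which makes the days directly comparable.
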

Read in and pre-process data (1.5 marks)

OK, that’s enough about beaver body temperatures. Now you will apply your data wrangling skills to the yearly change in plant biomass in the beautiful Abisko National Park in northern Sweden. We have preprocessed this data and made it available as a CSV file via this link. You can find the original data and a short readme on figshare and dryad. The original study¹ is available under an open access license. Reading through the readme on figshare and the study abstract will help you understand the data.
1. Read the data directly from the provided URL into a variable called plant_biomass and display the first six rows. (0.25 marks)
2. Convert the Latin column names into their common English names: lingonberry, bilberry, bog bilberry, dwarf birch, crowberry, and wavy hair grass. After this, display all column names. Hint: Search online to find out which Latin and English names pair up. There is a function in the dplyr cheat sheet that might help you rename these columns. Finally, check the tidyverse style guide to make sure your new column names are formatted correctly. (0.5 marks)
3. This is a wide data frame (species make up the column names). A long format is easier to analyze, so gather the species names into one column (species) and the measurement values into another column (biomass). Assign it to the variable plant_biomass to overwrite the previous data frame. Make sure you don’t lose any columns in the reshaping process! Hint: Make sure the output is correct before overwriting the old variable. (0.75 marks)
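
The renaming and reshaping steps can be sketched on a tiny made-up table. The column names (`Abc d`, `Efg h`, `item_one`, `item_two`, `site`) are hypothetical placeholders, not the names in the biomass data, and the sketch assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per measured item, plus id columns.
wide <- tibble(
    year    = c(2001, 2002),
    site    = c("s1", "s1"),
    `Abc d` = c(1.5, 2.0),
    `Efg h` = c(3.0, 4.5)
)

# rename() takes new_name = old_name pairs; backticks quote names with spaces.
renamed <- wide %>%
    rename(item_one = `Abc d`, item_two = `Efg h`)

# gather() stacks the listed columns into a key column and a value column.
# Columns not listed (year, site) are kept and repeated for each row.
long <- renamed %>%
    gather(key = "item", value = "amount", item_one, item_two)

long
```

Inspecting `long` before assigning it back to the original name is a cheap way to confirm that no column was dropped. Recent tidyr versions also provide pivot_longer() for the same kind of reshaping.
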

Data exploration (4.5 marks)

Now that our data is in a tidy format, we can start exploring it!
1. What is the average biomass in g/m^2 for all observations in the study? (0.25 marks)
2. How does the average biomass compare between the grazed control sites and those that were protected from herbivores? (0.25 marks)
3. Display a table of the average plant biomass for each year. (0.25 marks)
4. What is the mean plant biomass per year for the grazedcontrol and rodentexclosure groups? Spread these variables as separate columns in a table. (0.5 marks)
5. Compare the biomass for grazedcontrol with that of rodentexclosure graphically in a line plot. What could explain the big dip in biomass in 2012? Hint: The published study might be able to help with the second question… (0.5 marks)
6. How many distinct species are there? (0.25 marks)
7. Check whether there is an equal number of observations per species. (0.25 marks)
8. Compare the yearly change in mean biomass for each species in a line plot. (0.5 marks)
9. From the previous questions, we found that the biomass is higher in the sites with rodent exclosures (especially in recent years), and that the crowberry is the dominant species. Notice how the lines for rodentexclosure (refer back to question 5 above) and crowberry are of similar shape. Coincidence? Let’s find out! Use a faceted line plot to explore whether all plant species are impacted equally by grazing. (0.75 marks)
10. The habitat could also be affecting the biomass of different species. Explore graphically if this is the case. Hint: Think about how to change your dataset groupings to make this plot. (0.5 marks)
11. It looks like both habitat and treatment have an effect on most of the species! Let’s dissect the data further by visualizing the effect on each species of both the habitat and treatment by faceting the plot accordingly. Hint: This is a hard one! You may want to explore R’s documentation for ggplot’s facet_grid. (0.5 marks)
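
Several of the questions above follow the same pattern: group, summarize, optionally spread into columns, then plot. Here is a minimal sketch on made-up data; every column name and value (`condition`, `place`, `amount`) is hypothetical and only stands in for the real variables.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical long table with two grouping variables and one measurement.
toy <- tibble(
    year      = rep(c(2001, 2002), each = 4),
    condition = rep(c("c1", "c2"), times = 4),
    place     = rep(c("p1", "p1", "p2", "p2"), times = 2),
    amount    = c(1, 3, 2, 4, 2, 6, 3, 5)
)

# One mean per year and condition, then one column per condition.
yearly_means <- toy %>%
    group_by(year, condition) %>%
    summarize(mean_amount = mean(amount)) %>%
    ungroup() %>%
    spread(key = condition, value = mean_amount)

# facet_grid(rows ~ columns) lays panels out by two variables at once.
grid_plot <- ggplot(toy, aes(x = year, y = amount)) +
    geom_line() +
    facet_grid(place ~ condition)

yearly_means
grid_plot
```

Changing which variables appear in group_by() changes what each summarized value represents, which is usually the first thing to adjust when a plot needs a different breakdown.
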

Create a new column, squared_biomass, that represents the square of the biomass. Display the three largest squared_biomass observations in descending order. Only include the columns year, squared_biomass and species, and only observations between the years 2003 and 2008 from the forest habitat. Hint: Break this down into single criteria and add one at a time. You will be able to obtain the desired result with five operations. (1 mark)
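
As the hint suggests, a chain like this is easiest to build one step at a time. The sketch below shows the general shape on made-up data with different criteria; the names `place`, `item`, `amount`, and `doubled` are hypothetical and not part of the biomass data.

```r
library(dplyr)

# Hypothetical toy table.
toy <- tibble(
    year   = c(2001, 2002, 2003, 2004, 2005, 2006),
    place  = c("p1", "p2", "p1", "p1", "p2", "p1"),
    item   = c("a", "b", "a", "b", "a", "b"),
    amount = c(2, 5, 3, 1, 4, 6)
)

top_two <- toy %>%
    mutate(doubled = amount * 2) %>%        # add a derived column
    filter(place == "p1", year >= 2003) %>% # keep rows meeting both criteria
    select(year, doubled, item) %>%         # keep only the columns of interest
    arrange(desc(doubled)) %>%              # largest values first
    slice(1:2)                              # take the first two rows

top_two
```

Running the pipeline after each added step makes it easy to spot the point where a row or column goes missing.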

¹ Olofsson J, te Beest M, Ericson L (2013) Complex biotic interactions drive long-term vegetation dynamics in a subarctic ecosystem. Philosophical Transactions of the Royal Society B 368(1624): 20120486. https://dx.doi.org/10.1098/rstb.2012.0486

This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.