Lesson preamble

Lecture objectives

  • Learn about tidy data.
  • Transform data from the long to wide format.
  • Reproduce existing figures from raw data.
  • Understand which raw data is underlying a figure.
  • Understand which types of figures are suitable to create from raw data.
  • Explore scientific questions using dplyr and ggplot.

Lecture outline

  • Reshaping with gather and spread (25 min)
  • Exporting data (15 min)
  • Reproducing figures (50 min)

Setting up

Start by loading the required packages. Both ggplot2 and dplyr are included in the tidyverse package collection.

## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.7
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Load the data we saved in the previous lesson.

## Parsed with column specification:
## cols(
##   record_id = col_integer(),
##   month = col_integer(),
##   day = col_integer(),
##   year = col_integer(),
##   plot_id = col_integer(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_integer(),
##   weight = col_integer(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character(),
##   plot_type = col_character()
## )

Reshaping with gather and spread

dplyr is one part of a larger tidyverse that enables you to work with data in tidy data formats. tidyr enables a wide range of manipulations of the structure data itself. For example, the survey data presented here is almost in what we call a long format - every observation of every individual is its own row. This is an ideal format for data with a rich set of information per observation. It makes it difficult, however, to look at the relationships between measurements across plots. For example, what is the relationship between mean weights of different genera across the entire data set?

To answer that question, we’d want each plot to have a single row, with all of the measurements in a single plot having their own column. This is called a wide data format. For the surveys data as we have it right now, this is going to be one heck of a wide data frame! However, if we were to summarize data within plots and species, we might begin to have some relationships we’d want to examine.

Let’s see this in action. First, using dplyr, let’s create a data frame with the mean body weight of each genus by plot.

## # A tibble: 6 x 3
## # Groups:   genus [1]
##   genus   plot_id mean_weight
##   <chr>     <int>       <dbl>
## 1 Baiomys       1        7   
## 2 Baiomys       2        6   
## 3 Baiomys       3        8.61
## 4 Baiomys       5        7.75
## 5 Baiomys      18        9.5 
## 6 Baiomys      19        9.53

Long to Wide with spread

Now, to make this long data wide, we use spread from tidyr to spread out the different taxa into columns. spread takes three arguments: - the data, the key column (or column with identifying information), the values column (the one with the numbers/values). We’ll use a pipe so we can ignore the data argument.

## # A tibble: 6 x 11
##   plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus
##     <int>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1       1    7           22.2      60.2    156.      27.7        9.62
## 2       2    6           25.1      55.7    169.      26.9        6.95
## 3       3    8.61        24.6      52.0    158.      26.0        7.51
## 4       4   NA           23.0      57.5    164.      28.1        7.82
## 5       5    7.75        18.0      51.1    190.      27.0        8.66
## 6       6   NA           24.9      58.6    180.      25.9        7.81
## # ... with 4 more variables: Peromyscus <dbl>, Reithrodontomys <dbl>,
## #   Sigmodon <dbl>, Spermophilus <dbl>

Notice that some genera have NA values. That’s because some of those genera don’t have any record in that plot. Sometimes it is fine to leave those as NA. Sometimes we want to fill them as zeros, in which case we would add the argument fill=0.

## # A tibble: 6 x 11
##   plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus
##     <int>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1       1    7           22.2      60.2    156.      27.7        9.62
## 2       2    6           25.1      55.7    169.      26.9        6.95
## 3       3    8.61        24.6      52.0    158.      26.0        7.51
## 4       4    0           23.0      57.5    164.      28.1        7.82
## 5       5    7.75        18.0      51.1    190.      27.0        8.66
## 6       6    0           24.9      58.6    180.      25.9        7.81
## # ... with 4 more variables: Peromyscus <dbl>, Reithrodontomys <dbl>,
## #   Sigmodon <dbl>, Spermophilus <dbl>

We can now do things like plot the weight of Baiomys against Chaetodipus or examine their correlation.

##                    plot_id     Baiomys Chaetodipus    Dipodomys
## plot_id          1.0000000 -0.11041371   0.2945925 -0.165214440
## Baiomys         -0.1104137  1.00000000   0.3687496 -0.050774964
## Chaetodipus      0.2945925  0.36874958   1.0000000  0.325918149
## Dipodomys       -0.1652144 -0.05077496   0.3259181  1.000000000
## Neotoma         -0.5505067 -0.03268885  -0.2764929  0.121595440
## Onychomys       -0.4805452 -0.08689875  -0.3254973  0.093274704
## Perognathus      0.2468167  0.16572228   0.1799742  0.134225846
## Peromyscus      -0.3042821  0.10570094   0.2109187  0.219996783
## Reithrodontomys -0.2693462  0.34710430  -0.2258934 -0.142993836
## Sigmodon         0.2745083 -0.01421689   0.1971181 -0.092613012
## Spermophilus     0.1342651 -0.05423613  -0.3096137 -0.003720851
##                     Neotoma   Onychomys Perognathus Peromyscus
## plot_id         -0.55050668 -0.48054516   0.2468167 -0.3042821
## Baiomys         -0.03268885 -0.08689875   0.1657223  0.1057009
## Chaetodipus     -0.27649287 -0.32549730   0.1799742  0.2109187
## Dipodomys        0.12159544  0.09327470   0.1342258  0.2199968
## Neotoma          1.00000000  0.43466237  -0.2562367  0.3072292
## Onychomys        0.43466237  1.00000000  -0.1929882  0.1146881
## Perognathus     -0.25623672 -0.19298820   1.0000000  0.1813099
## Peromyscus       0.30722919  0.11468812   0.1813099  1.0000000
## Reithrodontomys  0.31713778 -0.03186637  -0.2796847  0.2924641
## Sigmodon         0.04665664 -0.19977238  -0.2493341  0.1562804
## Spermophilus    -0.14037233  0.37939827   0.1057038 -0.1593698
##                 Reithrodontomys    Sigmodon Spermophilus
## plot_id             -0.26934622  0.27450825  0.134265067
## Baiomys              0.34710430 -0.01421689 -0.054236130
## Chaetodipus         -0.22589344  0.19711806 -0.309613679
## Dipodomys           -0.14299384 -0.09261301 -0.003720851
## Neotoma              0.31713778  0.04665664 -0.140372334
## Onychomys           -0.03186637 -0.19977238  0.379398267
## Perognathus         -0.27968468 -0.24933409  0.105703800
## Peromyscus           0.29246412  0.15628042 -0.159369768
## Reithrodontomys      1.00000000  0.16463343 -0.217470366
## Sigmodon             0.16463343  1.00000000 -0.270401476
## Spermophilus        -0.21747037 -0.27040148  1.000000000

Wide to long with gather

What if we had the opposite problem, and wanted to go from a wide to long format? For that, we use gather to sweep up a set of columns into one key-value pair. We give it the arguments of a new key and value column name, and then we specify which columns we either want or do not want gathered up. So, to go backwards from surveys_gw_wide, and exclude plot_id from the gathering, we would do the following:

## # A tibble: 6 x 3
##   plot_id genus   mean_weight
##     <int> <chr>         <dbl>
## 1       1 Baiomys        7   
## 2       2 Baiomys        6   
## 3       3 Baiomys        8.61
## 4       4 Baiomys       NA   
## 5       5 Baiomys        7.75
## 6       6 Baiomys       NA

Note that now the NA genera are included in the long format. Going from wide to long to wide can be a useful way to balance out a dataset so every replicate has the same composition.

We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it’s easier to specify what to gather than what to leave alone. And if the columns are sequential, we don’t even need to list them all out - just use the : operator!

## # A tibble: 6 x 3
##   plot_id genus   mean_weight
##     <int> <chr>         <dbl>
## 1       1 Baiomys        7   
## 2       2 Baiomys        6   
## 3       3 Baiomys        8.61
## 4       4 Baiomys       NA   
## 5       5 Baiomys        7.75
## 6       6 Baiomys       NA

Challenge

  1. Make a wide data frame with year as columns, plot_id as rows, and the values are the number of genera per plot. You will need to summarize before reshaping, and use the function n_distinct to get the number of unique types of a genus. It’s a powerful function! See ?n_distinct for more.

  2. Now take that data frame, and make it long again, so each row is a unique plot_id - year combination.

  3. The surveys data set is not truly wide or long because there are two columns of measurement - hindfoot_length and weight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem. First, use gather to create a truly long dataset where we have a key column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You’ll need to specify which columns are being gathered.

  4. With this new truly long data set, calculate the average of each measurement in each year for each different plot_type. Then spread them into a wide data set with a column for hindfoot_length and weight. Hint: Remember, you only need to specify the key and value columns for spread.

## # A tibble: 6 x 27
## # Groups:   plot_id [6]
##   plot_id `1977` `1978` `1979` `1980` `1981` `1982` `1983` `1984` `1985`
##     <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
## 1       1      2      3      4      7      5      6      7      6      4
## 2       2      6      6      6      8      5      9      9      9      6
## 3       3      5      6      4      6      6      8     10     11      7
## 4       4      4      4      3      4      5      4      6      3      4
## 5       5      4      3      2      5      4      6      7      7      3
## 6       6      3      4      3      4      5      9      9      7      5
## # ... with 17 more variables: `1986` <int>, `1987` <int>, `1988` <int>,
## #   `1989` <int>, `1990` <int>, `1991` <int>, `1992` <int>, `1993` <int>,
## #   `1994` <int>, `1995` <int>, `1996` <int>, `1997` <int>, `1998` <int>,
## #   `1999` <int>, `2000` <int>, `2001` <int>, `2002` <int>
## # A tibble: 624 x 3
## # Groups:   plot_id [24]
##    plot_id year  n_genera
##      <int> <chr>    <int>
##  1       1 1977         2
##  2       2 1977         6
##  3       3 1977         5
##  4       4 1977         4
##  5       5 1977         4
##  6       6 1977         3
##  7       7 1977         3
##  8       8 1977         2
##  9       9 1977         3
## 10      10 1977         1
## # ... with 614 more rows
## # A tibble: 130 x 4
## # Groups:   year [26]
##     year plot_type                 hindfoot_length weight
##    <int> <chr>                               <dbl>  <dbl>
##  1  1977 Control                              36.1   50.4
##  2  1977 Long-term Krat Exclosure             33.7   34.8
##  3  1977 Rodent Exclosure                     39.1   48.2
##  4  1977 Short-term Krat Exclosure            35.8   41.3
##  5  1977 Spectab exclosure                    37.2   47.1
##  6  1978 Control                              38.1   70.8
##  7  1978 Long-term Krat Exclosure             22.6   35.9
##  8  1978 Rodent Exclosure                     37.8   67.3
##  9  1978 Short-term Krat Exclosure            36.9   63.8
## 10  1978 Spectab exclosure                    42.3   80.1
## # ... with 120 more rows

Exporting data

Now that you have learned how to use dplyr to extract information from or summarize your raw data, you may want to export these new datasets to share them with your collaborators or for archival.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), we are going to create a new folder, data-processed, in our working directory that will store this generated dataset. We don’t want to store manipulated datasets in the same directory as our raw data. It’s good practice to keep them separate. The raw data would ideally be put in a data-raw folder, which should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it from how it was when we downloaded or recorded it ourself. In contrast, our R code will create the contents of the data-processed directory, so even if the files it contains are deleted, we can always re-generate them.

Use the getwd() function to find out which is the current working directory.

## [1] "/home/travis/build/UofTCoders/rcourse"

Navigate to this directory in your file browser and create a folder called data-processed.

Alternatively, you could use R to create this directory.

We are going to prepare a cleaned up version of the data without NAs. Let’s start by removing observations for which the species_id is missing. Let’s also remove observations for which weight and the hindfoot_length are missing. This dataset should also only contain observations of animals for which the sex has been determined:

## [1] 34786    13
## [1] 30676    13
## [1] 30676    13
## [1] 30676    13

Now that our dataset is ready, we can save it as a CSV file in our data-processed folder.

Team Challenges

These are four exercises in exploratory data analyses, which will train you to think about data and to use the tools you have been learning about in this class to solve scientific questions and to reproduce figures from the literature. To solve these challenges, you will need to understand what data underlies a figure and how you need to manipulate it to recreate the figure.

Setting up

Start by loading the required packages. Both ggplot2 and dplyr are included in the tidyverse package collection.

2. Reproduce figure 3 from the paper

For this section, you will apply your data wrangling and plotting skills to reproduce a couple of figures from a study on the yearly change in biomass of plants in the Abisko national park in northern Sweden. This paper is publish under an open license and the figures can be accessed via this link.

You will be working with this dataset in assignment 3, so we have prepared the data for you in a format that is easier to work with. Download the data and read it into a dataframe called plant_biomass. Confirm that the data frame is 180 rows by 10 columns

Reproduce figure 3 from the paper. Focus on the overall message on the plot, i.e. two panels for different habitats of the plant biomass over the year in grazed controls vs rodent exclosures. You do not need to get the figure aesthetics to look exactly the same as in the published figures (i.e. no need for the exact colors and shapes, axis styles, or to include the small dots around the main lines).

3. Reproduce figure 4 from the paper

Use the plant_biomass data set and reproduce figure 4 from the paper.

Hints:

  • To get the right dimensions of the subplots, explore how to use the function facet_grid instead of facet_wrap.
  • Remember to search online for help, e.g. “How do I change the figure size in R markdown?”
  • You will also need to search online or in the R documentation to find out how you change the y-scale to be constant within a species, but not between species (as in the paper figure).

4. Make figure 3 perfect

Let’s try to bring figure 3 closer to the paper version. Use the R documentation and search online to find out how to:

  • Change the size of markers in your plots and adjust them appropriately.
  • Slightly adjust the thickness of the lines.
  • Change the colors of the lines to match those online.
  • Apply a ggplot theme to make the figure background white and the overall figure appearance more like the paper version.
  • Change the x and y label to match the paper figure.

Parts of this lesson material were taken and modified from Data Carpentry under their CC-BY copyright license. See their lesson page for the original source.


This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.