More dplyr and ggplot, tidying and exporting data

Lesson preamble

Lecture objectives

Understand how to combine dplyr and ggplot.

Understand and apply faceting in ggplot.

Learn about tidy data.

Transform data from the long to wide format.

Lecture outline

Split-apply-combine… plot! (30 min)

Faceting (10 min)

Reshaping with gather and spread (30 min)

Exporting data (15 min)

Setting up

Start by loading the required packages. Both ggplot2 and dplyr are included in the tidyverse package collection.

# Install if needed
# install.packages('tidyverse')
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.4
## ✔ tidyr   1.0.2     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Load the data we saved in the previous lesson, and prepare the filtered abundant species dataset from the end of lecture 4.

# Download if needed
# download.file("https://ndownloader.figshare.com/files/2292169", "portal_data.csv")
surveys <- read_csv('portal_data.csv')
surveys_abun_species <- surveys %>%
    filter(!is.na(hindfoot_length) & !is.na(weight)) %>%
    group_by(species) %>%
    mutate(n = n()) %>% # add count value to each row
    filter(n > 800) %>%
    select(-n)

## Parsed with column specification:
## cols(
##   record_id = col_double(),
##   month = col_double(),
##   day = col_double(),
##   year = col_double(),
##   plot_id = col_double(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_double(),
##   weight = col_double(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character(),
##   plot_type = col_character()
## )

Split-apply-combine… plot!

In this section, we will learn how to work with **dplyr** and **ggplot** together. Aided by the pipes (%>%), we can create a powerful data exploration workflow using these two packages.

Let’s calculate number of counts per year for each species. First we need to group the data and count records within each group:

surveys_abun_species %>%
    group_by(year, genus) %>%
    tally() %>%
    arrange(desc(n)) # Adding arrange just to compare with histogram

## # A tibble: 178 x 3
## # Groups:   year [26]
##     year genus           n
##    <dbl> <chr>       <int>
##  1  2002 Chaetodipus  1243
##  2  1983 Dipodomys     953
##  3  1985 Dipodomys     951
##  4  2000 Chaetodipus   913
##  5  1982 Dipodomys     896
##  6  1997 Dipodomys     830
##  7  2001 Chaetodipus   781
##  8  1981 Dipodomys     733
##  9  1987 Dipodomys     688
## 10  1980 Dipodomys     667
## # … with 168 more rows

We could assign this table to a variable, and then pass that variable to ggplot().

yearly_counts <- surveys_abun_species %>%
    group_by(year, species) %>%
    tally()

ggplot(yearly_counts, aes(x = n)) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Creating an intermediate variable would be preferable for time consuming calculations, because you would not want to do that operation every time you change the plot aesthetics.

If it is not a time consuming calculation or you would like the flexibility of changing the data summary and the plotting options in the same code chunk, you can pipe the output of your split-apply-combine operation to the plotting command:

surveys_abun_species %>%
    group_by(year, species) %>%
    tally() %>%
    ggplot(aes(x = n)) +
        geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can perform a quick check that the plot corresponds to the table by coloring the histogram by species:

surveys_abun_species %>%
    group_by(year, species) %>%
    tally() %>%
    ggplot(aes(x = n, fill = species)) + # `fill` is specific for histograms
        geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s explore how the number of each genus varies over time. Longitudinal data can be visualized as a line plot with years on the x axis and counts on the y axis:

surveys_abun_species %>%
    group_by(year, species) %>%
    tally() %>%
    ggplot(aes(x = year, y = n)) +
        geom_line()

Unfortunately, this does not work because we plotted data for all the species together. We need to tell ggplot to draw a line for each species by modifying the aesthetic function to include group = species:

surveys_abun_species %>%
    group_by(year, species) %>%
    tally() %>%
    ggplot(aes(x = year, y = n, group = species)) +
        geom_line()

We will be able to distinguish species in the plot if we add colors (using color also automatically groups the data):

surveys_abun_species %>%
    group_by(year, species) %>%
    tally() %>%
    ggplot(aes(x = year, y = n, color = species)) + # `color` groups automatically
        geom_line() # try adding `size = 2`

Faceting

ggplot has a special technique called faceting that allows the user to split one plot into multiple subplots based on a variable included in the dataset. We will use it to make a time series plot for each species:

surveys_abun_species %>%
    group_by(year, species) %>%
    tally() %>%
    ggplot(aes(x = year, y = n)) + #
        geom_line() +
        facet_wrap(~ species)

Now we would like to split the line in each plot by the sex of each individual measured. To do that we need to make counts in the data frame grouped by year, species, and sex:

surveys_abun_species %>%
    group_by(year, species, sex) %>%
    tally()

## # A tibble: 575 x 4
## # Groups:   year, species [275]
##     year species     sex       n
##    <dbl> <chr>       <chr> <int>
##  1  1977 eremicus    M         2
##  2  1977 flavus      F        14
##  3  1977 flavus      M         8
##  4  1977 leucogaster F         1
##  5  1977 leucogaster <NA>      1
##  6  1977 megalotis   F         1
##  7  1977 megalotis   M         1
##  8  1977 merriami    F        75
##  9  1977 merriami    M       106
## 10  1977 ordii       F        10
## # … with 565 more rows

We can make the faceted plot by splitting further by sex using color (within a single plot):

surveys_abun_species %>%
    group_by(year, species, sex) %>%
    tally() %>%
    ggplot(aes(x = year, y = n, color = sex)) +
        geom_line() +
        facet_wrap(~ species)

There are several observations where no sex was recorded. In this case, we are not really interested in the observations of unknown sex, so we can filter out those values.

surveys_abun_species %>%
    filter(!is.na(sex)) %>%
    group_by(year, species, sex) %>%
    tally() %>%
    ggplot(aes(x = year, y = n, color = sex)) +
        geom_line() +
        facet_wrap(~ species)

It is possible to specify exactly which colors to use, to avoid those that are hard to distinguish. We can also change the thickness of the lines, and adjust the x labels so that they don’t overlap.

color_names <- c('black', 'orange')

surveys_abun_species %>%
    filter(!is.na(sex)) %>%
    group_by(year, species, sex) %>%
    tally() %>%
    ggplot(aes(x = year, y = n, color = sex)) +
        geom_line(size = 1) +
        scale_color_manual(values = color_names) +
        facet_wrap(~ species) +
        theme_bw() +
        theme(text = element_text(size=12),
               axis.text.x = element_text(angle=30, hjust=1))

Challenge

Use the filtered data frame for all of these questions.

Remember the histogram colored according to each species? Starting from that code, how could we separate each species into its own subplot?

Which year was the average weight of the animals the highest?
Is the yearly trend the same for all species?

Reshaping with gather and spread

dplyr is one part of a larger tidyverse that enables you to work with data in tidy data formats. tidyr enables a wide range of manipulations of the structure data itself. For example, the survey data presented here is almost in what we call a long format - every observation of every individual is its own row. This is an ideal format for data with a rich set of information per observation. It makes it difficult, however, to look at the relationships between measurements across plots. For example, what is the relationship between mean weights of different genera across the entire data set?

To answer that question, we’d want each plot to have a single row, with all of the measurements in a single plot having their own column. This is called a wide data format. For the surveys data as we have it right now, this is going to be one heck of a wide data frame! However, if we were to summarize data within plots and species, we might begin to have some relationships we’d want to examine.

Let’s see this in action. First, using dplyr, let’s create a data frame with the mean body weight of each genus by plot.

surveys_gw <- surveys %>%
    filter(!is.na(weight)) %>%
    group_by(genus, plot_id) %>%
    summarize(mean_weight = mean(weight))

head(surveys_gw)

## # A tibble: 6 x 3
## # Groups:   genus [1]
##   genus   plot_id mean_weight
##   <chr>     <dbl>       <dbl>
## 1 Baiomys       1        7   
## 2 Baiomys       2        6   
## 3 Baiomys       3        8.61
## 4 Baiomys       5        7.75
## 5 Baiomys      18        9.5 
## 6 Baiomys      19        9.53

Long to Wide with `spread`

Now, to make this long data wide, we use spread from tidyr to spread out the different taxa into columns. spread takes three arguments: - the data, the key column (or column with identifying information), the values column (the one with the numbers/values). We’ll use a pipe so we can ignore the data argument.

surveys_gw_wide <- surveys_gw %>%
  spread(genus, mean_weight)

head(surveys_gw_wide)

## # A tibble: 6 x 11
##   plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
##     <dbl>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>      <dbl>
## 1       1    7           22.2      60.2    156.      27.7        9.62       22.2
## 2       2    6           25.1      55.7    169.      26.9        6.95       22.3
## 3       3    8.61        24.6      52.0    158.      26.0        7.51       21.4
## 4       4   NA           23.0      57.5    164.      28.1        7.82       22.6
## 5       5    7.75        18.0      51.1    190.      27.0        8.66       21.2
## 6       6   NA           24.9      58.6    180.      25.9        7.81       21.8
## # … with 3 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>,
## #   Spermophilus <dbl>

Notice that some genera have NA values. That’s because some of those genera don’t have any record in that plot. Sometimes it is fine to leave those as NA. Sometimes we want to fill them as zeros, in which case we would add the argument fill=0.

surveys_gw %>%
  spread(genus, mean_weight, fill = 0) %>%
  head

## # A tibble: 6 x 11
##   plot_id Baiomys Chaetodipus Dipodomys Neotoma Onychomys Perognathus Peromyscus
##     <dbl>   <dbl>       <dbl>     <dbl>   <dbl>     <dbl>       <dbl>      <dbl>
## 1       1    7           22.2      60.2    156.      27.7        9.62       22.2
## 2       2    6           25.1      55.7    169.      26.9        6.95       22.3
## 3       3    8.61        24.6      52.0    158.      26.0        7.51       21.4
## 4       4    0           23.0      57.5    164.      28.1        7.82       22.6
## 5       5    7.75        18.0      51.1    190.      27.0        8.66       21.2
## 6       6    0           24.9      58.6    180.      25.9        7.81       21.8
## # … with 3 more variables: Reithrodontomys <dbl>, Sigmodon <dbl>,
## #   Spermophilus <dbl>

We can now do things like plot the weight of Baiomys against Chaetodipus or examine their correlation.

surveys_gw %>%
  spread(genus, mean_weight, fill = 0) %>%
  cor(use = "pairwise.complete")

##                    plot_id     Baiomys Chaetodipus    Dipodomys     Neotoma
## plot_id          1.0000000 -0.11041371   0.2945925 -0.165214440 -0.55050668
## Baiomys         -0.1104137  1.00000000   0.3687496 -0.050774964 -0.03268885
## Chaetodipus      0.2945925  0.36874958   1.0000000  0.325918149 -0.27649287
## Dipodomys       -0.1652144 -0.05077496   0.3259181  1.000000000  0.12159544
## Neotoma         -0.5505067 -0.03268885  -0.2764929  0.121595440  1.00000000
## Onychomys       -0.4805452 -0.08689875  -0.3254973  0.093274704  0.43466237
## Perognathus      0.2468167  0.16572228   0.1799742  0.134225846 -0.25623672
## Peromyscus      -0.3042821  0.10570094   0.2109187  0.219996783  0.30722919
## Reithrodontomys -0.2693462  0.34710430  -0.2258934 -0.142993836  0.31713778
## Sigmodon         0.2745083 -0.01421689   0.1971181 -0.092613012  0.04665664
## Spermophilus     0.1342651 -0.05423613  -0.3096137 -0.003720851 -0.14037233
##                   Onychomys Perognathus Peromyscus Reithrodontomys    Sigmodon
## plot_id         -0.48054516   0.2468167 -0.3042821     -0.26934622  0.27450825
## Baiomys         -0.08689875   0.1657223  0.1057009      0.34710430 -0.01421689
## Chaetodipus     -0.32549730   0.1799742  0.2109187     -0.22589344  0.19711806
## Dipodomys        0.09327470   0.1342258  0.2199968     -0.14299384 -0.09261301
## Neotoma          0.43466237  -0.2562367  0.3072292      0.31713778  0.04665664
## Onychomys        1.00000000  -0.1929882  0.1146881     -0.03186637 -0.19977238
## Perognathus     -0.19298820   1.0000000  0.1813099     -0.27968468 -0.24933409
## Peromyscus       0.11468812   0.1813099  1.0000000      0.29246412  0.15628042
## Reithrodontomys -0.03186637  -0.2796847  0.2924641      1.00000000  0.16463343
## Sigmodon        -0.19977238  -0.2493341  0.1562804      0.16463343  1.00000000
## Spermophilus     0.37939827   0.1057038 -0.1593698     -0.21747037 -0.27040148
##                 Spermophilus
## plot_id          0.134265067
## Baiomys         -0.054236130
## Chaetodipus     -0.309613679
## Dipodomys       -0.003720851
## Neotoma         -0.140372334
## Onychomys        0.379398267
## Perognathus      0.105703800
## Peromyscus      -0.159369768
## Reithrodontomys -0.217470366
## Sigmodon        -0.270401476
## Spermophilus     1.000000000

Wide to long with `gather`

What if we had the opposite problem, and wanted to go from a wide to long format? For that, we use gather to sweep up a set of columns into one key-value pair. We give it the arguments of a new key and value column name, and then we specify which columns we either want or do not want gathered up. So, to go backwards from surveys_gw_wide, and exclude plot_id from the gathering, we would do the following:

surveys_gw_long <- surveys_gw_wide %>%
  gather(genus, mean_weight, -plot_id)

head(surveys_gw_long)

## # A tibble: 6 x 3
##   plot_id genus   mean_weight
##     <dbl> <chr>         <dbl>
## 1       1 Baiomys        7   
## 2       2 Baiomys        6   
## 3       3 Baiomys        8.61
## 4       4 Baiomys       NA   
## 5       5 Baiomys        7.75
## 6       6 Baiomys       NA

Note that now the NA genera are included in the long format. Going from wide to long to wide can be a useful way to balance out a dataset so every replicate has the same composition.

We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it’s easier to specify what to gather than what to leave alone. And if the columns are sequential, we don’t even need to list them all out - just use the : operator!

surveys_gw_wide %>%
  gather(genus, mean_weight, Baiomys:Spermophilus) %>%
  head()

## # A tibble: 6 x 3
##   plot_id genus   mean_weight
##     <dbl> <chr>         <dbl>
## 1       1 Baiomys        7   
## 2       2 Baiomys        6   
## 3       3 Baiomys        8.61
## 4       4 Baiomys       NA   
## 5       5 Baiomys        7.75
## 6       6 Baiomys       NA

Challenge

Make a wide data frame with year as columns, plot_id as rows, and the values are the number of genera per plot. You will need to summarize before reshaping, and use the function n_distinct to get the number of unique types of a genus. It’s a powerful function! See ?n_distinct for more.
Now take that data frame, and make it long again, so each row is a unique plot_id - year combination.
The surveys data set is not truly wide or long because there are two columns of measurement - hindfoot_length and weight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem. First, use gather to create a truly long dataset where we have a key column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You’ll need to specify which columns are being gathered.
With this new truly long data set, calculate the average of each measurement in each year for each different plot_type. Then spread them into a wide data set with a column for hindfoot_length and weight. Hint: Remember, you only need to specify the key and value columns for spread.

Exporting data

Now that you have learned how to use dplyr to extract information from or summarize your raw data, you may want to export these new datasets to share them with your collaborators or for archival.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), we are going to create a new folder, data-processed, in our working directory that will store this generated dataset. We don’t want to store manipulated datasets in the same directory as our raw data. It’s good practice to keep them separate. The raw data would ideally be put in a data-raw folder, which should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it from how it was when we downloaded or recorded it ourself. In contrast, our R code will create the contents of the data-processed directory, so even if the files it contains are deleted, we can always re-generate them.

Use the getwd() function to find out which is the current working directory.

getwd()

## [1] "/home/travis/build/UofTCoders/rcourse"

Navigate to this directory in your file browser and create a folder called data-processed.

Alternatively, you could use R to create this directory.

dir.create("data-processed")

# To suppress the warning, we could do
dir.create("data-processed", showWarnings = FALSE)

# Another alternative would be to use a conditional expression, which only
# creates the directory *if* it does not already exist. The syntax here is
# similar to the for loop we created in the second lecture.
if (!dir.exists('data-processed')) {
    dir.create("data-processed")
}

We are going to prepare a cleaned up version of the data without NAs. Let’s start by removing observations for which the species_id is missing. Let’s also remove observations for which weight and the hindfoot_length are missing. This dataset should also only contain observations of animals for which the sex has been determined:

surveys_complete <- surveys %>%
  filter(!is.na(species_id),       # remove missing species_id
         !is.na(weight),           # remove missing weight
         !is.na(hindfoot_length),  # remove missing hindfoot_length
         !is.na(sex))              # remove missing sex

# This expression is a succinct alternative to the above
surveys_complete_comcas <- surveys %>%
    filter(complete.cases(species_id, weight, hindfoot_length, sex))

# This is even briefer, but omits observations with NA in *any* column.
# There is no way to control which columns to use, but it is common to want
# to exclude all NAs, which in our case corresponds to the columns listed above.
surveys_complete_naomit <- na.omit(surveys)

# Compare the dimensions of the original and the cleaned data frame
dim(surveys)

## [1] 34786    13

dim(surveys_complete)

## [1] 30676    13

dim(surveys_complete_comcas)

## [1] 30676    13

dim(surveys_complete_naomit)

## [1] 30676    13

Now that our dataset is ready, we can save it as a CSV file in our data-processed folder.

write_csv(surveys_complete, path = "data-processed/surveys_complete.csv")

Parts of this lesson material were taken and modified from Data Carpentry under their CC-BY copyright license. See their lesson page for the original source.

This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.

More dplyr and ggplot, tidying and exporting data

Joel Östblom

Lesson preamble

Lecture objectives

Lecture outline

Setting up

Split-apply-combine… plot!

Faceting

Challenge

Reshaping with gather and spread

Long to Wide with spread

Wide to long with gather

Challenge

Exporting data

Long to Wide with `spread`

Wide to long with `gather`