Learning objectives
- Describe what a data frame is.
- Load external data from a .csv file into a data frame in R.
- Summarize the contents of a data frame in R.
- Understand the purpose of the `dplyr` package.
- Learn to use the data wrangling commands `select`, `filter`, `%>%`, and `mutate` from the `dplyr` package.
Lesson outline
- Data set background (10 min)
- What are data frames (15 min)
- R packages for data analyses (5 min)
- Data wrangling in dplyr (40 min)
Today, we will be working with real data from a longitudinal study of species abundance in the Chihuahuan desert ecosystem near Portal, Arizona, USA. This study includes observations of plants, ants, and rodents from 1977 to 2002, and has been used in over 100 publications. More information is available in the abstract of this paper from 2009. There are several datasets available related to this study, and we will be working with datasets that have been preprocessed by Data Carpentry to facilitate teaching. These are made available online as The Portal Project Teaching Database, both at the Data Carpentry website and on Figshare. Figshare is a great place to openly publish data, code, figures, and more, to make them available to other researchers and to communicate findings that are not part of a longer paper.
We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxa | e.g. rodent, reptile, bird, rabbit |
plot_type | type of plot |
To read the data into R, we are going to use a function called `read_csv`. This function is contained in an R-package called `readr`. R-packages are a bit like browser extensions; they are not essential, but can provide nifty functionality. We will go through R-packages in general, and which ones are good for data analyses, in detail later in this lecture. Now, let’s install `readr`:
Now we can use the `read_csv` function. One useful option that `read_csv` includes is the ability to read a CSV file directly from a URL, without downloading it in a separate step:
However, it is often a good idea to download the data first, so you have a copy stored locally on your computer in case you want to do some offline analyses, or the online version of the file changes or the file is taken down. You can either download the data manually or from within R:
download.file("https://ndownloader.figshare.com/files/2292169",
"portal_data.csv") # Saves this name in the current directory
The data is read in by specifying its local path.
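The code chunk for this step is not preserved here. In the lesson it was presumably a call along the lines of `surveys <- read_csv("portal_data.csv")`. The sketch below shows the same idea on a hypothetical three-row file written to a temporary location, so it runs without the download. The name `surveys_small` and the file contents are made up for illustration.

```r
library(readr)

# Hypothetical three-row stand-in for portal_data.csv, written to a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("record_id,species_id,weight",
             "1,NL,NA",
             "588,NL,218",
             "845,NL,204"),
           csv_path)

# Read the file by specifying its local path; the result is a tibble
surveys_small <- read_csv(csv_path)
surveys_small
```

`read_csv` guesses a type for each column and, by default, treats the text `NA` as a missing value.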
## Parsed with column specification:
## cols(
## record_id = col_double(),
## month = col_double(),
## day = col_double(),
## year = col_double(),
## plot_id = col_double(),
## species_id = col_character(),
## sex = col_character(),
## hindfoot_length = col_double(),
## weight = col_double(),
## genus = col_character(),
## species = col_character(),
## taxa = col_character(),
## plot_type = col_character()
## )
This statement produces some output regarding which data type it found in each column. If we want to check this in more detail, we can print the variable’s value: `surveys`.
## # A tibble: 34,786 x 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL <NA> NA NA
## 4 266 10 16 1977 2 NL <NA> NA NA
## 5 349 11 12 1977 2 NL <NA> NA NA
## 6 363 11 12 1977 2 NL <NA> NA NA
## 7 435 12 10 1977 2 NL <NA> NA NA
## 8 506 1 8 1978 2 NL <NA> NA NA
## 9 588 2 18 1978 2 NL M NA 218
## 10 661 3 11 1978 2 NL <NA> NA NA
## # … with 34,776 more rows, and 4 more variables: genus <chr>, species <chr>,
## # taxa <chr>, plot_type <chr>
In the online html-version of this lecture, you only see the first few rows of the data frame. Running the code chunk above in the R Notebook would display a nice tabular view of the data, which also includes pagination when there are many rows, and we can click the green arrow to view all the columns. Technically, this object is a `tibble` rather than a data frame, as indicated in the output. The reason for this is that `read_csv` automatically converts the data into a `tibble` when loading it. Since a `tibble` is just a data frame with some convenient extra functionality, we will use these words interchangeably from now on.
If we just want to glance at how the data frame looks, it is sufficient to display only the top (the first 6 lines) using the function `head()`:
## # A tibble: 6 x 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL <NA> NA NA
## 4 266 10 16 1977 2 NL <NA> NA NA
## 5 349 11 12 1977 2 NL <NA> NA NA
## 6 363 11 12 1977 2 NL <NA> NA NA
## # … with 4 more variables: genus <chr>, species <chr>, taxa <chr>,
## # plot_type <chr>
Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly it is generated by a function such as `read_csv()`; in other words, when importing spreadsheets from your hard drive (or the web).
A data frame is a representation of data in the format of a table where the columns are vectors that all have the same length. Because each column is a vector, all values within a column have the same type of data, as we discussed in the last class (e.g., characters, integers, factors). We can see this when inspecting the structure of a data frame with the function `str()`:
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 34786 obs. of 13 variables:
## $ record_id : num 1 72 224 266 349 363 435 506 588 661 ...
## $ month : num 7 8 9 10 11 11 12 1 2 3 ...
## $ day : num 16 19 13 16 12 12 10 8 18 11 ...
## $ year : num 1977 1977 1977 1977 1977 ...
## $ plot_id : num 2 2 2 2 2 2 2 2 2 2 ...
## $ species_id : chr "NL" "NL" "NL" "NL" ...
## $ sex : chr "M" "M" NA NA ...
## $ hindfoot_length: num 32 31 NA NA NA NA NA NA NA NA ...
## $ weight : num NA NA NA NA NA NA NA NA 218 NA ...
## $ genus : chr "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
## $ species : chr "albigula" "albigula" "albigula" "albigula" ...
## $ taxa : chr "Rodent" "Rodent" "Rodent" "Rodent" ...
## $ plot_type : chr "Control" "Control" "Control" "Control" ...
## - attr(*, "spec")=
## .. cols(
## .. record_id = col_double(),
## .. month = col_double(),
## .. day = col_double(),
## .. year = col_double(),
## .. plot_id = col_double(),
## .. species_id = col_character(),
## .. sex = col_character(),
## .. hindfoot_length = col_double(),
## .. weight = col_double(),
## .. genus = col_character(),
## .. species = col_character(),
## .. taxa = col_character(),
## .. plot_type = col_character()
## .. )
Integer refers to a whole number, such as 1, 2, 3, 4, etc. Numbers with decimals, such as 1.0, 2.4, 3.333, are referred to as floats; R stores these as doubles, which is why the output above labels numeric columns `num` or `<dbl>`. Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting. Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
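A minimal sketch of what “integers under the hood” means, using a made-up vector named `sex_factor` (not part of the lesson data):

```r
# Hypothetical categorical vector stored as a factor
sex_factor <- factor(c("M", "F", "F", "M"))

levels(sex_factor)        # the text labels, sorted alphabetically
as.integer(sex_factor)    # the integer codes that are actually stored
as.character(sex_factor)  # converts back to plain strings
```

Because the levels are sorted alphabetically by default, `"F"` is stored as 1 and `"M"` as 2, regardless of the order in which they first appear.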
`data.frame` objects
We already saw how the functions `head()` and `str()` can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!
- `dim(surveys)` - returns a vector with the number of rows in the first element and the number of columns as the second element (the dimensions of the object)
- `nrow(surveys)` - returns the number of rows
- `ncol(surveys)` - returns the number of columns
- `head(surveys)` - shows the first 6 rows
- `tail(surveys)` - shows the last 6 rows
- `names(surveys)` - returns the column names (synonym of `colnames()` for `data.frame` objects)
- `rownames(surveys)` - returns the row names
- `str(surveys)` - structure of the object and information about the class, length, and content of each column
- `summary(surveys)` - summary statistics for each column
Note: most of these functions are “generic”; they can be used on other types of objects besides `data.frame`.
Based on the output of `str(surveys)`, what can you say about `surveys`?
Our survey data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers. When indexing, base R data frames return a different format depending on how we index the data (i.e. either a vector or a data frame), but with enhanced data frames, `tibbles`, the returned object is almost always a data frame.
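The indexing calls that produced the output below are not preserved. As a sketch of the row-then-column convention, here is the same kind of indexing on a hypothetical three-row tibble named `toy`:

```r
library(tibble)

# Hypothetical toy tibble standing in for surveys
toy <- tibble(
  record_id  = c(1, 72, 224),
  species_id = c("NL", "NL", "DM"),
  sex        = c("M", "M", NA)
)

toy[1, 1]    # first row, first column
toy[1, 2]    # first row, second column
toy[, 1]     # all rows of the first column (still a tibble)
toy[1]       # first column, selected without a comma
toy[1:3, 3]  # rows 1 to 3 of the third column
toy[3, ]     # third row, all columns

# For comparison, a base data frame returns a plain vector here
as.data.frame(toy)[, 1]
```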
## # A tibble: 1 x 1
## record_id
## <dbl>
## 1 1
## # A tibble: 1 x 1
## species_id
## <chr>
## 1 NL
## # A tibble: 34,786 x 1
## record_id
## <dbl>
## 1 1
## 2 72
## 3 224
## 4 266
## 5 349
## 6 363
## 7 435
## 8 506
## 9 588
## 10 661
## # … with 34,776 more rows
## # A tibble: 34,786 x 1
## record_id
## <dbl>
## 1 1
## 2 72
## 3 224
## 4 266
## 5 349
## 6 363
## 7 435
## 8 506
## 9 588
## 10 661
## # … with 34,776 more rows
## # A tibble: 3 x 1
## sex
## <chr>
## 1 M
## 2 M
## 3 <NA>
## # A tibble: 1 x 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 224 9 13 1977 2 NL <NA> NA NA
## # … with 4 more variables: genus <chr>, species <chr>, taxa <chr>,
## # plot_type <chr>
## # A tibble: 6 x 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL <NA> NA NA
## 4 266 10 16 1977 2 NL <NA> NA NA
## 5 349 11 12 1977 2 NL <NA> NA NA
## 6 363 11 12 1977 2 NL <NA> NA NA
## # … with 4 more variables: genus <chr>, species <chr>, taxa <chr>,
## # plot_type <chr>
`:` is a special operator that creates numeric vectors of integers in increasing or decreasing order; test `1:10` and `10:1` for instance. This works similarly to `seq`, which we looked at earlier in class:
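Calls consistent with the four results printed below would be the following; the exact original chunk is not preserved, so treat this as a reconstruction.

```r
0:10                     # integers from 0 to 10
seq(0, 10)               # the same sequence built with seq()
0:10 == seq(0, 10)       # element-wise comparison
all(0:10 == seq(0, 10))  # TRUE only if every element matches
```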
## [1] 0 1 2 3 4 5 6 7 8 9 10
## [1] 0 1 2 3 4 5 6 7 8 9 10
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1] TRUE
You can also exclude certain parts of a data frame using the “`-`” sign:
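A small sketch of negative indexing on a hypothetical four-row tibble named `toy`; the calls on the full `surveys` data would have the same shape with different row numbers.

```r
library(tibble)

# Hypothetical toy tibble standing in for surveys
toy <- tibble(
  record_id  = c(1, 72, 224, 266),
  month      = c(7, 8, 9, 10),
  species_id = c("NL", "NL", "DM", "DM")
)

toy[, -1]       # every column except the first
toy[-c(3:4), ]  # every row except rows 3 to 4
```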
## # A tibble: 34,786 x 12
## month day year plot_id species_id sex hindfoot_length weight genus
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 7 16 1977 2 NL M 32 NA Neot…
## 2 8 19 1977 2 NL M 31 NA Neot…
## 3 9 13 1977 2 NL <NA> NA NA Neot…
## 4 10 16 1977 2 NL <NA> NA NA Neot…
## 5 11 12 1977 2 NL <NA> NA NA Neot…
## 6 11 12 1977 2 NL <NA> NA NA Neot…
## 7 12 10 1977 2 NL <NA> NA NA Neot…
## 8 1 8 1978 2 NL <NA> NA NA Neot…
## 9 2 18 1978 2 NL M NA 218 Neot…
## 10 3 11 1978 2 NL <NA> NA NA Neot…
## # … with 34,776 more rows, and 3 more variables: species <chr>, taxa <chr>,
## # plot_type <chr>
## # A tibble: 6 x 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL <NA> NA NA
## 4 266 10 16 1977 2 NL <NA> NA NA
## 5 349 11 12 1977 2 NL <NA> NA NA
## 6 363 11 12 1977 2 NL <NA> NA NA
## # … with 4 more variables: genus <chr>, species <chr>, taxa <chr>,
## # plot_type <chr>
As well as using numeric values to subset a `data.frame` (or `matrix`), columns can be called by name, using one of the following notations:
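As a sketch, two bracket notations that select a column by name, shown on a hypothetical tibble named `toy`:

```r
library(tibble)

# Hypothetical toy tibble standing in for surveys
toy <- tibble(
  record_id  = c(1, 72, 224),
  species_id = c("NL", "NL", "DM")
)

toy["species_id"]    # column by name, no comma
toy[, "species_id"]  # column by name, all rows
```

For a tibble, both calls return a one-column tibble.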
## # A tibble: 34,786 x 1
## species_id
## <chr>
## 1 NL
## 2 NL
## 3 NL
## 4 NL
## 5 NL
## 6 NL
## 7 NL
## 8 NL
## 9 NL
## 10 NL
## # … with 34,776 more rows
## # A tibble: 34,786 x 1
## species_id
## <chr>
## 1 NL
## 2 NL
## 3 NL
## 4 NL
## 5 NL
## 6 NL
## 7 NL
## 8 NL
## 9 NL
## 10 NL
## # … with 34,776 more rows
For our purposes, these notations are equivalent. RStudio knows about the columns in your data frame, so you can take advantage of the autocompletion feature to get the full and correct column name.
Another syntax that is often used to specify column names is `$`. In this case, the returned object is actually a vector. We will not go into detail about this, but since it is such common usage, it is good to be aware of it.
# We use `head()` since the output from vectors is not automatically cut off
# and we don't want to clutter the screen with all the `species_id` values
head(surveys$species_id) # Result is a vector
## [1] "NL" "NL" "NL" "NL" "NL" "NL"
- Create a `data.frame` (`surveys_200`) containing only the observations from row 200 of the `surveys` dataset.
- Notice how `nrow()` gave you the number of rows in a `data.frame`?
  - Check the last row with `tail()` to make sure it’s meeting expectations.
  - Extract the last row using `nrow()` instead of the row number.
  - Create a new object (`surveys_last`) from that last row.
- Use `nrow()` to extract the row that is in the middle of the data frame. Store the content of this row in an object named `surveys_middle`.
- Combine `nrow()` with the `-` notation above to reproduce the behavior of `head(surveys)`, keeping just the first through 6th rows of the surveys dataset.
There are certainly many tools built in to base R which can be used to understand data, but we are going to use a package called `dplyr`, which makes exploratory data analysis (EDA) particularly intuitive and effective.
First, let’s explain the concept of an R-package. What we have used so far is all part of base R (except `read_csv`), together with many more functions. Everything included in base R is available on any computer where R is installed, since these functions and operators are considered critical for using R, e.g. `c()`, `mean()`, `+`, `-`, etc. However, since R is an open language, it is easy to develop your own R-package that provides new functionality and submit it to the official repository for R-packages called CRAN (Comprehensive R Archive Network). CRAN has thousands of packages, and these cannot all be installed by default, because then the base R installation would be huge and most people would only be using a fraction of everything installed on their machine. It would be as if you downloaded the Firefox or Chrome browser and got all extensions and add-ons installed by default, or as if your phone came with every app ever made for it already installed when you bought it: quite impractical.
To install a package in R, we use the function `install.packages()`. In this case, the package `dplyr` is part of a bigger collection of packages called `tidyverse` (just like Microsoft Word is part of Microsoft Office), which also contains the `readr` package we installed in the beginning, alongside many more packages that make exploratory data analyses more intuitive and effective.
Now all the `dplyr` functions are available to us by prefacing them with `dplyr::`:
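The output below has the shape produced by `glimpse()`, so the missing call was presumably `dplyr::glimpse(surveys)`. A sketch of the prefix notation on a hypothetical data frame named `toy`, assuming `dplyr` is already installed:

```r
# install.packages("tidyverse")  # run once per machine; left commented out here

# Hypothetical toy data frame standing in for surveys
toy <- data.frame(
  record_id  = c(1, 72, 224),
  species_id = c("NL", "NL", "DM")
)

# The package prefix works without calling library(dplyr) first
dplyr::glimpse(toy)
```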
## Observations: 34,786
## Variables: 13
## $ record_id <dbl> 1, 72, 224, 266, 349, 363, 435, 506, 588, 661, 748, 8…
## $ month <dbl> 7, 8, 9, 10, 11, 11, 12, 1, 2, 3, 4, 5, 6, 8, 9, 10, …
## $ day <dbl> 16, 19, 13, 16, 12, 12, 10, 8, 18, 11, 8, 6, 9, 5, 4,…
## $ year <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1978, 1978,…
## $ plot_id <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ species_id <chr> "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL",…
## $ sex <chr> "M", "M", NA, NA, NA, NA, NA, NA, "M", NA, NA, "M", "…
## $ hindfoot_length <dbl> 32, 31, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, NA, 3…
## $ weight <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 218, NA, NA, 204, 200…
## $ genus <chr> "Neotoma", "Neotoma", "Neotoma", "Neotoma", "Neotoma"…
## $ species <chr> "albigula", "albigula", "albigula", "albigula", "albi…
## $ taxa <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent", "Ro…
## $ plot_type <chr> "Control", "Control", "Control", "Control", "Control"…
We will be using this package a lot, and it would be a little annoying to have to type `dplyr::` every time, so we will load it into our current environment. This needs to be done once for every new R session and makes all functions accessible without their package prefix, which is very convenient, as long as you are aware of which function you are using and don’t load a function with the same name from two different packages.
# We could also do `library(dplyr)`, but we need the rest of the
# tidyverse packages later, so we might as well import the entire collection.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.4
## ✔ tidyr 1.0.2 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Observations: 34,786
## Variables: 13
## $ record_id <dbl> 1, 72, 224, 266, 349, 363, 435, 506, 588, 661, 748, 8…
## $ month <dbl> 7, 8, 9, 10, 11, 11, 12, 1, 2, 3, 4, 5, 6, 8, 9, 10, …
## $ day <dbl> 16, 19, 13, 16, 12, 12, 10, 8, 18, 11, 8, 6, 9, 5, 4,…
## $ year <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1978, 1978,…
## $ plot_id <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ species_id <chr> "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL", "NL",…
## $ sex <chr> "M", "M", NA, NA, NA, NA, NA, NA, "M", NA, NA, "M", "…
## $ hindfoot_length <dbl> 32, 31, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32, NA, 3…
## $ weight <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 218, NA, NA, 204, 200…
## $ genus <chr> "Neotoma", "Neotoma", "Neotoma", "Neotoma", "Neotoma"…
## $ species <chr> "albigula", "albigula", "albigula", "albigula", "albi…
## $ taxa <chr> "Rodent", "Rodent", "Rodent", "Rodent", "Rodent", "Ro…
## $ plot_type <chr> "Control", "Control", "Control", "Control", "Control"…
Wrangling here is used in the sense of maneuvering, managing, controlling, and turning your data upside down and inside out to look at it from different angles in order to understand it. The package `dplyr` provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, and many common tasks are optimized by being written in a compiled language (C++), which means that many operations run much faster than similar tools in R. An additional feature is the ability to work directly with data stored in an external database, such as SQL-databases. The ability to work with databases is great because you are able to work with much bigger datasets (100s of GB) than your computer could normally handle. We will not talk in detail about this in class, but there are great resources online to learn more (e.g. this lecture from Data Carpentry).
We’re going to learn some of the most common `dplyr` functions: `select()`, `filter()`, `mutate()`, `group_by()`, and `summarise()`. To select columns of a data frame, use `select()`. The first argument to this function is the data frame (`surveys`), and the subsequent arguments are the columns to keep.
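A sketch of `select()` on a hypothetical four-row tibble named `surveys_toy`. The columns kept match those in the output below, so on the real data the call was presumably `select(surveys, plot_id, species_id, weight, year)`.

```r
library(dplyr)

# Hypothetical four-row stand-in for the surveys data
surveys_toy <- tibble(
  plot_id    = c(2, 2, 3, 5),
  species_id = c("NL", "NL", "DM", "PF"),
  sex        = c("M", "F", "M", NA),
  weight     = c(NA, 165, 41, 4),
  year       = c(1977, 1995, 1995, 2001)
)

# Data frame first, then the columns to keep, in the order they should appear
selected <- select(surveys_toy, plot_id, species_id, weight, year)
selected
```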
## # A tibble: 34,786 x 4
## plot_id species_id weight year
## <dbl> <chr> <dbl> <dbl>
## 1 2 NL NA 1977
## 2 2 NL NA 1977
## 3 2 NL NA 1977
## 4 2 NL NA 1977
## 5 2 NL NA 1977
## 6 2 NL NA 1977
## 7 2 NL NA 1977
## 8 2 NL NA 1978
## 9 2 NL 218 1978
## 10 2 NL NA 1978
## # … with 34,776 more rows
To choose rows based on specific criteria, use `filter()`:
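A sketch of `filter()` on a hypothetical tibble named `surveys_toy`; judging by the output below, the call on the real data kept the rows where `year` equals 1995.

```r
library(dplyr)

# Hypothetical four-row stand-in for the surveys data
surveys_toy <- tibble(
  plot_id    = c(2, 2, 3, 5),
  species_id = c("NL", "NL", "DM", "PF"),
  weight     = c(NA, 165, 41, 4),
  year       = c(1977, 1995, 1995, 2001)
)

# Keep only the rows for which the condition is TRUE
from_1995 <- filter(surveys_toy, year == 1995)
from_1995
```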
## # A tibble: 1,180 x 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 22314 6 7 1995 2 NL M 34 NA
## 2 22728 9 23 1995 2 NL F 32 165
## 3 22899 10 28 1995 2 NL F 32 171
## 4 23032 12 2 1995 2 NL F 33 NA
## 5 22003 1 11 1995 2 DM M 37 41
## 6 22042 2 4 1995 2 DM F 36 45
## 7 22044 2 4 1995 2 DM M 37 46
## 8 22105 3 4 1995 2 DM F 37 49
## 9 22109 3 4 1995 2 DM M 37 46
## 10 22168 4 1 1995 2 DM M 36 48
## # … with 1,170 more rows, and 4 more variables: genus <chr>, species <chr>,
## # taxa <chr>, plot_type <chr>
Note that to check for equality, R requires two equal signs (`==`). This is to prevent confusion with object assignment, since otherwise `year = 1995` might be interpreted as ‘set the `year` parameter to `1995`’, which is not what `filter` does!
Basic conditionals in R are broadly similar to how they’re already expressed mathematically:
## [1] TRUE
## [1] FALSE
However, there are a few idiosyncrasies to be mindful of for other conditionals:
## [1] TRUE
## [1] TRUE
## [1] FALSE
Finally, the %in%
operator is used to check for membership:
## [1] TRUE
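The expressions behind the results above are not preserved. The following are illustrative comparisons of the same kinds, not necessarily the ones used in the lesson:

```r
3 < 5   # less than: TRUE
3 > 5   # greater than: FALSE

5 >= 5  # greater than or equal to: TRUE
5 != 3  # not equal to: TRUE
5 == 3  # equal to (two equal signs): FALSE

3 %in% c(1, 3, 5)  # membership in a vector: TRUE
```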
All of the above conditionals are compatible with `filter`, with the key difference being that `filter` expects column names as part of conditional statements instead of individual numbers.
But what if you wanted to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes. With intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects:
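A sketch of the intermediate-steps approach on a hypothetical tibble; the object names `surveys_toy`, `surveys_selected`, and `surveys_1995` are made up for illustration.

```r
library(dplyr)

# Hypothetical four-row stand-in for the surveys data
surveys_toy <- tibble(
  plot_id    = c(2, 2, 3, 5),
  species_id = c("NL", "NL", "DM", "PF"),
  sex        = c("M", "F", "M", NA),
  weight     = c(NA, 165, 41, 4),
  year       = c(1977, 1995, 1995, 2001)
)

# Step 1: store the selected columns in a temporary object
surveys_selected <- select(surveys_toy, plot_id, species_id, weight, year)

# Step 2: use the temporary object as input to the next function
surveys_1995 <- filter(surveys_selected, year == 1995)
surveys_1995
```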
## # A tibble: 1,180 x 4
## plot_id species_id weight year
## <dbl> <chr> <dbl> <dbl>
## 1 2 NL NA 1995
## 2 2 NL 165 1995
## 3 2 NL 171 1995
## 4 2 NL NA 1995
## 5 2 DM 41 1995
## 6 2 DM 45 1995
## 7 2 DM 46 1995
## 8 2 DM 49 1995
## 9 2 DM 46 1995
## 10 2 DM 48 1995
## # … with 1,170 more rows
You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as things are evaluated from the inside out.
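The nested form of the same operation, again on a hypothetical tibble named `surveys_toy`:

```r
library(dplyr)

# Hypothetical four-row stand-in for the surveys data
surveys_toy <- tibble(
  plot_id    = c(2, 2, 3, 5),
  species_id = c("NL", "NL", "DM", "PF"),
  sex        = c("M", "F", "M", NA),
  weight     = c(NA, 165, 41, 4),
  year       = c(1977, 1995, 1995, 2001)
)

# select() runs first (inside), then filter() runs on its result (outside)
nested <- filter(select(surveys_toy, plot_id, species_id, weight, year),
                 year == 1995)
nested
```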
## # A tibble: 1,180 x 4
## plot_id species_id weight year
## <dbl> <chr> <dbl> <dbl>
## 1 2 NL NA 1995
## 2 2 NL 165 1995
## 3 2 NL 171 1995
## 4 2 NL NA 1995
## 5 2 DM 41 1995
## 6 2 DM 45 1995
## 7 2 DM 46 1995
## 8 2 DM 49 1995
## 9 2 DM 46 1995
## 10 2 DM 48 1995
## # … with 1,170 more rows
The last option, pipes, is a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like `%>%` and are made available via the `magrittr` package, which is also included in the `tidyverse`. If you use RStudio, you can type the pipe with Ctrl/Cmd + Shift + M.
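A sketch of the piped version with the passed object written explicitly as `.`, using a hypothetical tibble named `surveys_toy`:

```r
library(dplyr)

# Hypothetical four-row stand-in for the surveys data
surveys_toy <- tibble(
  plot_id    = c(2, 2, 3, 5),
  species_id = c("NL", "NL", "DM", "PF"),
  sex        = c("M", "F", "M", NA),
  weight     = c(NA, 165, 41, 4),
  year       = c(1977, 1995, 1995, 2001)
)

# The dot marks where the result of the previous step is placed
piped_dots <- surveys_toy %>%
  select(., plot_id, species_id, weight, year) %>%
  filter(., year == 1995)
piped_dots
```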
## # A tibble: 1,180 x 4
## plot_id species_id weight year
## <dbl> <chr> <dbl> <dbl>
## 1 2 NL NA 1995
## 2 2 NL 165 1995
## 3 2 NL 171 1995
## 4 2 NL NA 1995
## 5 2 DM 41 1995
## 6 2 DM 45 1995
## 7 2 DM 46 1995
## 8 2 DM 49 1995
## 9 2 DM 46 1995
## 10 2 DM 48 1995
## # … with 1,170 more rows
The `.` refers to the object that is passed from the previous line. In this example, the data frame `surveys` is passed to the `.` in the `select()` statement. Then, the modified data frame, which is the result of the `select()` operation, is passed to the `.` in the `filter()` statement. Put more simply: whatever was the result from the line above the current line will be used in the current line.
Since it gets a bit tedious to write out all the dots, the pipe allows for them to be omitted. By default, the pipe will pass its input to the first argument of the right hand side function; in `dplyr`, the first argument is always a data frame. The chunk below gives the same output as the one above:
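A sketch of the same pipeline with the dots omitted, on a hypothetical tibble named `surveys_toy`:

```r
library(dplyr)

# Hypothetical four-row stand-in for the surveys data
surveys_toy <- tibble(
  plot_id    = c(2, 2, 3, 5),
  species_id = c("NL", "NL", "DM", "PF"),
  sex        = c("M", "F", "M", NA),
  weight     = c(NA, 165, 41, 4),
  year       = c(1977, 1995, 1995, 2001)
)

# The left-hand side is passed as the first argument of each function
piped <- surveys_toy %>%
  select(plot_id, species_id, weight, year) %>%
  filter(year == 1995)
piped
```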
## # A tibble: 1,180 x 4
## plot_id species_id weight year
## <dbl> <chr> <dbl> <dbl>
## 1 2 NL NA 1995
## 2 2 NL 165 1995
## 3 2 NL 171 1995
## 4 2 NL NA 1995
## 5 2 DM 41 1995
## 6 2 DM 45 1995
## 7 2 DM 46 1995
## 8 2 DM 49 1995
## 9 2 DM 46 1995
## 10 2 DM 48 1995
## # … with 1,170 more rows
Another example:
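Based on the description that follows the output, this example filters on `weight` and then selects three columns. A sketch on a hypothetical tibble named `surveys_toy`:

```r
library(dplyr)

# Hypothetical four-row stand-in for the surveys data
surveys_toy <- tibble(
  species_id = c("NL", "NL", "DM", "PF"),
  sex        = c("M", "F", "M", NA),
  weight     = c(NA, 165, 41, 4),
  year       = c(1977, 1995, 1995, 2001)
)

# Keep light animals, then keep only three columns
light <- surveys_toy %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)
light
```

Rows where `weight` is `NA` are dropped as well, because `filter()` keeps only rows where the condition evaluates to `TRUE`.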
## # A tibble: 17 x 3
## species_id sex weight
## <chr> <chr> <dbl>
## 1 PF F 4
## 2 PF F 4
## 3 PF M 4
## 4 RM F 4
## 5 RM M 4
## 6 PF <NA> 4
## 7 PP M 4
## 8 RM M 4
## 9 RM M 4
## 10 RM M 4
## 11 PF M 4
## 12 PF F 4
## 13 RM M 4
## 14 RM M 4
## 15 RM F 4
## 16 RM M 4
## 17 RM M 4
In the above code, we use the pipe to send the `surveys` dataset first through `filter()` to keep rows where `weight` is less than 5, then through `select()` to keep only the `species_id`, `sex`, and `weight` columns. Since `%>%` takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include it as an argument to the `filter()` and `select()` functions anymore.
If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the `head()` of the data. (Pipes work with non-`dplyr` functions, too, as long as either the `dplyr` or `magrittr` package is loaded.)
## # A tibble: 6 x 3
## species_id sex weight
## <chr> <chr> <dbl>
## 1 PF F 4
## 2 PF F 4
## 3 PF M 4
## 4 RM F 4
## 5 RM M 4
## 6 PF <NA> 4
If we wanted to create a new object with this smaller version of the data, we could do so by assigning it a new name:
## # A tibble: 17 x 3
## species_id sex weight
## <chr> <chr> <dbl>
## 1 PF F 4
## 2 PF F 4
## 3 PF M 4
## 4 RM F 4
## 5 RM M 4
## 6 PF <NA> 4
## 7 PP M 4
## 8 RM M 4
## 9 RM M 4
## 10 RM M 4
## 11 PF M 4
## 12 PF F 4
## 13 RM M 4
## 14 RM M 4
## 15 RM F 4
## 16 RM M 4
## 17 RM M 4
Note that the final data frame is the leftmost part of this expression.
A single expression can also be used to filter for several criteria, either matching all criteria (`&`) or any criteria (`|`):
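A sketch of both operators on a hypothetical tibble named `toy`; the column values are chosen to resemble the two outputs below but are made up.

```r
library(dplyr)

# Hypothetical four-row stand-in for the surveys data
toy <- tibble(
  species = c("albigula", "clarki", "leucophrys", "merriami"),
  taxa    = c("Rodent", "Reptile", "Bird", "Rodent"),
  sex     = c("F", NA, NA, "M")
)

# Both conditions must hold
rodent_females <- toy %>%
  filter(taxa == "Rodent" & sex == "F") %>%
  select(sex, taxa)

# Either condition may hold
two_species <- toy %>%
  filter(species == "clarki" | species == "leucophrys") %>%
  select(species, taxa)

rodent_females
two_species
```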
## # A tibble: 15,690 x 2
## sex taxa
## <chr> <chr>
## 1 F Rodent
## 2 F Rodent
## 3 F Rodent
## 4 F Rodent
## 5 F Rodent
## 6 F Rodent
## 7 F Rodent
## 8 F Rodent
## 9 F Rodent
## 10 F Rodent
## # … with 15,680 more rows
## # A tibble: 3 x 2
## species taxa
## <chr> <chr>
## 1 leucophrys Bird
## 2 clarki Reptile
## 3 leucophrys Bird
Using pipes, subset the `surveys` data to include individuals collected before 1995 and retain only the columns `year`, `sex`, and `weight`.
Frequently, you’ll want to create new columns based on the values in existing columns. For instance, you might want to do unit conversions, or find the ratio of values in two columns. For this we’ll use `mutate()`.
To create a new column of weight in kg:
## # A tibble: 34,786 x 14
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL <NA> NA NA
## 4 266 10 16 1977 2 NL <NA> NA NA
## 5 349 11 12 1977 2 NL <NA> NA NA
## 6 363 11 12 1977 2 NL <NA> NA NA
## 7 435 12 10 1977 2 NL <NA> NA NA
## 8 506 1 8 1978 2 NL <NA> NA NA
## 9 588 2 18 1978 2 NL M NA 218
## 10 661 3 11 1978 2 NL <NA> NA NA
## # … with 34,776 more rows, and 5 more variables: genus <chr>, species <chr>,
## # taxa <chr>, plot_type <chr>, weight_kg <dbl>
You can also create a second new column based on the first new column within the same call of `mutate()`:
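A sketch of both `mutate()` steps on a hypothetical tibble named `toy`. The column names `weight_kg` and `weight_kg2` come from the output; the conversion from grams (dividing by 1000) follows from the column description, while doubling for `weight_kg2` is only an assumed example.

```r
library(dplyr)

# Hypothetical three-row stand-in for the surveys data
toy <- tibble(
  species_id = c("NL", "NL", "DM"),
  weight     = c(NA, 218, 41)  # grams
)

# One new column: weight in kilograms
toy %>%
  mutate(weight_kg = weight / 1000)

# A second new column can refer to the first within the same call
with_kg <- toy %>%
  mutate(weight_kg  = weight / 1000,
         weight_kg2 = weight_kg * 2)
with_kg
```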
## # A tibble: 34,786 x 15
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL <NA> NA NA
## 4 266 10 16 1977 2 NL <NA> NA NA
## 5 349 11 12 1977 2 NL <NA> NA NA
## 6 363 11 12 1977 2 NL <NA> NA NA
## 7 435 12 10 1977 2 NL <NA> NA NA
## 8 506 1 8 1978 2 NL <NA> NA NA
## 9 588 2 18 1978 2 NL M NA 218
## 10 661 3 11 1978 2 NL <NA> NA NA
## # … with 34,776 more rows, and 6 more variables: genus <chr>, species <chr>,
## # taxa <chr>, plot_type <chr>, weight_kg <dbl>, weight_kg2 <dbl>
The first few rows of the output are full of `NA`s, so if we wanted to remove those we could insert a `filter()` in the chain:
## # A tibble: 32,283 x 14
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 588 2 18 1978 2 NL M NA 218
## 2 845 5 6 1978 2 NL M 32 204
## 3 990 6 9 1978 2 NL M NA 200
## 4 1164 8 5 1978 2 NL M 34 199
## 5 1261 9 4 1978 2 NL M 32 197
## 6 1453 11 5 1978 2 NL M NA 218
## 7 1756 4 29 1979 2 NL M 33 166
## 8 1818 5 30 1979 2 NL M 32 184
## 9 1882 7 4 1979 2 NL M 32 206
## 10 2133 10 25 1979 2 NL F 33 274
## # … with 32,273 more rows, and 5 more variables: genus <chr>, species <chr>,
## # taxa <chr>, plot_type <chr>, weight_kg <dbl>
`is.na()` is a function that determines whether something is an `NA`. The `!` symbol negates the result, so we’re asking for everything that is not an `NA`.
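A sketch of a chain that drops missing weights before creating the new column, on a hypothetical tibble named `toy`:

```r
library(dplyr)

# Hypothetical three-row stand-in for the surveys data
toy <- tibble(
  species_id = c("NL", "NL", "DM"),
  weight     = c(NA, 218, 41)  # grams
)

is.na(toy$weight)   # TRUE where weight is missing
!is.na(toy$weight)  # TRUE where weight is present

# Remove rows with missing weight, then add the kilogram column
no_na <- toy %>%
  filter(!is.na(weight)) %>%
  mutate(weight_kg = weight / 1000)
no_na
```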
Create a new data frame from the `surveys` data that meets the following criteria: contains only the `species_id` column and a new column called `hindfoot_half` containing values that are half the `hindfoot_length` values. In this `hindfoot_half` column, there are no `NA`s and all values are less than 30.
Hint: think about how the commands should be ordered to produce this data frame!
This work is licensed under a Creative Commons Attribution 4.0 International License. See the licensing page for more details about copyright information.