This is a brief introduction to R, focussing on data wrangling using the dplyr and tidyr packages and generating reproducible documents using the knitr and rmarkdown packages.
You don’t need to read all of this for the session. It’s more of a resource and reference.
R is a statistical computing environment to analyze data and write programs. The strength of R comes from:
I should mention a quick caveat. While R is a general-purpose programming language, it works a bit differently from other languages such as Python (it was developed by statisticians after all!). As such, programming in R may not be as intuitive, powerful, or easy as it may be in Python (though it can be done), especially if you come from a computer science background. If your work involves a lot of programming, I would recommend Python as your main tool. However, it never hurts to learn more than one language, especially as R is great for data analysis and plotting.
Ok, firstly, I’ve made this session with some assumptions (see the slides.html file). Briefly, I’m assuming you want to use R for statistical analysis, plotting, and/or writing up reports. I’m using R Markdown to show how to write up documents with R code. Since getting the data into an analyzable form is often the hardest part of an analysis, I’m also using packages specific to that task.
While you can create functions in R, I won’t be going over them. A great resource for R functions is Hadley Wickham’s ‘Advanced R’ book.
An .Rmd or R Markdown file is a markdown file that contains R code chunks that can be processed to output the results of the R code into a generated .md file. This is an incredible (and recent) strength of using R, as this then allows you to create html, pdf, or Word doc files from the .md file using the rmarkdown package (which relies on pandoc).
At the top of each .Rmd file is the YAML front matter, which looks like:
---
title: "Introduction to R"
author: "Luke Johnston"
date: "July 23, 2015"
output:
  html_document:
    highlight: tango
    number_sections: yes
    theme: readable
    toc: yes
---
Note the starting and ending --- ‘tags’. These mark the start and end of the YAML block.
Markdown syntax for formatting is used in .Rmd files. Check out the R Markdown documentation for a quick tutorial on the syntax.
You’ll need to import your data into R to analyze it. I’m assuming the data is already cleaned and ready for analysis. If at any time you need help with a command, type ? followed by the command of interest (e.g. ?write.csv). R comes with many built-in datasets that you can practice on. The one I’m going to use is the swiss dataset.
write.csv(swiss, file = 'swiss.csv') # Export
ds <- read.csv('swiss.csv') # Import
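The X column you’ll see in the output further down appears because write.csv() saves the row names as an unnamed first column. As a minimal sketch (the temporary file and the ds_x and ds_rn names are just for illustration, they aren’t part of the lesson), you can read that column back in as row names instead:

```r
## Write the built-in swiss data to a temporary csv file
## (tmp, ds_x, and ds_rn are illustrative names only).
tmp <- tempfile(fileext = ".csv")
write.csv(swiss, file = tmp)

## Default import: the row names come back as a column called X
ds_x <- read.csv(tmp)
names(ds_x)[1]

## Use the first column as row names instead
ds_rn <- read.csv(tmp, row.names = 1)
head(rownames(ds_rn))
```

Whether you keep the names as a column or as row names is a matter of taste; the rest of this lesson keeps them in the X column.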
R has several very useful and easy tools for quickly viewing your data. head() shows the first few rows of a data.frame (a structure for storing data whose columns can hold numbers, integers, factors, strings, etc.). names() shows the column names. str() shows the structure, such as what the object is and what it contains. summary() shows quick summary statistics (mean, median, frequency) for each of your columns. class() is like str() but only shows the top-level class of the object: while str() would show the multiple columns a data.frame contains, class() would only show that the object is a “data.frame”.
head(ds)
## X Fertility Agriculture Examination Education Catholic
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Delemont 83.1 45.1 6 9 84.84
## 3 Franches-Mnt 92.5 39.7 5 5 93.40
## 4 Moutier 85.8 36.5 12 7 33.77
## 5 Neuveville 76.9 43.5 17 15 5.16
## 6 Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## 1 22.2
## 2 22.2
## 3 20.2
## 4 20.3
## 5 20.6
## 6 26.6
names(ds)
## [1] "X" "Fertility" "Agriculture"
## [4] "Examination" "Education" "Catholic"
## [7] "Infant.Mortality"
str(ds)
## 'data.frame': 47 obs. of 7 variables:
## $ X : Factor w/ 47 levels "Aigle","Aubonne",..: 8 9 12 26 28 34 5 13 15 38 ...
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
summary(ds)
## X Fertility Agriculture Examination
## Aigle : 1 Min. :35.00 Min. : 1.20 Min. : 3.00
## Aubonne : 1 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00
## Avenches: 1 Median :70.40 Median :54.10 Median :16.00
## Boudry : 1 Mean :70.14 Mean :50.66 Mean :16.49
## Broye : 1 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00
## Conthey : 1 Max. :92.50 Max. :89.70 Max. :37.00
## (Other) :41
## Education Catholic Infant.Mortality
## Min. : 1.00 Min. : 2.150 Min. :10.80
## 1st Qu.: 6.00 1st Qu.: 5.195 1st Qu.:18.15
## Median : 8.00 Median : 15.140 Median :20.00
## Mean :10.98 Mean : 41.144 Mean :19.94
## 3rd Qu.:12.00 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :53.00 Max. :100.000 Max. :26.60
##
class(ds)
## [1] "data.frame"
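A few other base R functions are also handy for a first look. This is only a small sketch, run on the built-in swiss data directly (so there is no X column here):

```r
dim(swiss)            ## number of rows and columns
tail(swiss, n = 3)    ## the last three rows
sapply(swiss, class)  ## the class of each column
```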
Data wrangling is a bit tedious in base R, so I’m using two packages designed to make this easier. Load packages by using the library() function. dplyr comes with a %>% pipe function (via the magrittr package), which works similarly to the Bash shell | pipe. The command on the right-hand side takes the output from the command on the left-hand side, just like how a plumbing pipe works for water.
The four lines of code below using tbl_df all do the same thing. The . object represents the output from the left-hand side of the pipe, but you don’t have to write it out, since %>% passes it as the first argument by default. tbl_df makes the object into a tbl class, making printing of the output nicer.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
tbl_df(ds)
## # A tibble: 47 x 7
## X Fertility Agriculture Examination Education Catholic
## <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Delemont 83.1 45.1 6 9 84.84
## 3 Franches-Mnt 92.5 39.7 5 5 93.40
## 4 Moutier 85.8 36.5 12 7 33.77
## 5 Neuveville 76.9 43.5 17 15 5.16
## 6 Porrentruy 76.1 35.3 9 7 90.57
## 7 Broye 83.8 70.2 16 7 92.85
## 8 Glane 92.4 67.8 14 8 97.16
## 9 Gruyere 82.4 53.3 12 7 97.67
## 10 Sarine 82.9 45.2 16 13 91.38
## # ... with 37 more rows, and 1 more variables: Infant.Mortality <dbl>
ds %>% tbl_df()
## # A tibble: 47 x 7
## X Fertility Agriculture Examination Education Catholic
## <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Delemont 83.1 45.1 6 9 84.84
## 3 Franches-Mnt 92.5 39.7 5 5 93.40
## 4 Moutier 85.8 36.5 12 7 33.77
## 5 Neuveville 76.9 43.5 17 15 5.16
## 6 Porrentruy 76.1 35.3 9 7 90.57
## 7 Broye 83.8 70.2 16 7 92.85
## 8 Glane 92.4 67.8 14 8 97.16
## 9 Gruyere 82.4 53.3 12 7 97.67
## 10 Sarine 82.9 45.2 16 13 91.38
## # ... with 37 more rows, and 1 more variables: Infant.Mortality <dbl>
ds %>% tbl_df
## # A tibble: 47 x 7
## X Fertility Agriculture Examination Education Catholic
## <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Delemont 83.1 45.1 6 9 84.84
## 3 Franches-Mnt 92.5 39.7 5 5 93.40
## 4 Moutier 85.8 36.5 12 7 33.77
## 5 Neuveville 76.9 43.5 17 15 5.16
## 6 Porrentruy 76.1 35.3 9 7 90.57
## 7 Broye 83.8 70.2 16 7 92.85
## 8 Glane 92.4 67.8 14 8 97.16
## 9 Gruyere 82.4 53.3 12 7 97.67
## 10 Sarine 82.9 45.2 16 13 91.38
## # ... with 37 more rows, and 1 more variables: Infant.Mortality <dbl>
ds %>% tbl_df(.)
## # A tibble: 47 x 7
## X Fertility Agriculture Examination Education Catholic
## <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Delemont 83.1 45.1 6 9 84.84
## 3 Franches-Mnt 92.5 39.7 5 5 93.40
## 4 Moutier 85.8 36.5 12 7 33.77
## 5 Neuveville 76.9 43.5 17 15 5.16
## 6 Porrentruy 76.1 35.3 9 7 90.57
## 7 Broye 83.8 70.2 16 7 92.85
## 8 Glane 92.4 67.8 14 8 97.16
## 9 Gruyere 82.4 53.3 12 7 97.67
## 10 Sarine 82.9 45.2 16 13 91.38
## # ... with 37 more rows, and 1 more variables: Infant.Mortality <dbl>
## Let's put it into a new object
ds2 <- tbl_df(ds)
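In newer versions of dplyr (1.0.0 and later), tbl_df() is deprecated in favour of as_tibble(). As a small sketch, assuming a recent dplyr is installed, the same pipe forms give identical results. The swiss_x, tbl_a, tbl_b, and tbl_c names are only for illustration, and the data is rebuilt from the built-in swiss dataset so the block runs on its own:

```r
library(dplyr)

## Stand-in for ds: move the row names into an X column
swiss_x <- data.frame(X = rownames(swiss), swiss, row.names = NULL)

tbl_a <- as_tibble(swiss_x)         ## plain function call
tbl_b <- swiss_x %>% as_tibble()    ## piped, with empty brackets
tbl_c <- swiss_x %>% as_tibble(.)   ## piped, with the . written out

identical(tbl_a, tbl_b)
identical(tbl_a, tbl_c)
```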
Again, these next lines are the same. select does as it says: it selects columns from the dataset.
select(ds2, Education, Catholic, Fertility)
## # A tibble: 47 x 3
## Education Catholic Fertility
## <int> <dbl> <dbl>
## 1 12 9.96 80.2
## 2 9 84.84 83.1
## 3 5 93.40 92.5
## 4 7 33.77 85.8
## 5 15 5.16 76.9
## 6 7 90.57 76.1
## 7 7 92.85 83.8
## 8 8 97.16 92.4
## 9 7 97.67 82.4
## 10 13 91.38 82.9
## # ... with 37 more rows
ds2 %>% select(Education, Catholic, Fertility)
## # A tibble: 47 x 3
## Education Catholic Fertility
## <int> <dbl> <dbl>
## 1 12 9.96 80.2
## 2 9 84.84 83.1
## 3 5 93.40 92.5
## 4 7 33.77 85.8
## 5 15 5.16 76.9
## 6 7 90.57 76.1
## 7 7 92.85 83.8
## 8 8 97.16 92.4
## 9 7 97.67 82.4
## 10 13 91.38 82.9
## # ... with 37 more rows
ds2 %>% select(., Education, Catholic, Fertility)
## # A tibble: 47 x 3
## Education Catholic Fertility
## <int> <dbl> <dbl>
## 1 12 9.96 80.2
## 2 9 84.84 83.1
## 3 5 93.40 92.5
## 4 7 33.77 85.8
## 5 15 5.16 76.9
## 6 7 90.57 76.1
## 7 7 92.85 83.8
## 8 8 97.16 92.4
## 9 7 97.67 82.4
## 10 13 91.38 82.9
## # ... with 37 more rows
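select also accepts column ranges and helper functions. Here is a quick sketch on a stand-in for ds2, rebuilt straight from swiss so the block is self-contained (the by_range and by_prefix names are only for illustration):

```r
library(dplyr)

## Stand-in for ds2, built from the built-in swiss data
ds2 <- as_tibble(data.frame(X = rownames(swiss), swiss, row.names = NULL))

by_range  <- ds2 %>% select(Examination:Catholic)  ## neighbouring columns
by_prefix <- ds2 %>% select(starts_with("E"))      ## names starting with E

names(by_range)
names(by_prefix)
```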
You can rename columns using either rename or select (the new name is on the left-hand side, so newname = oldname). However, with the select command only that column is kept, while rename keeps all columns.
ds2 %>% rename(County = X)
## # A tibble: 47 x 7
## County Fertility Agriculture Examination Education Catholic
## <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Delemont 83.1 45.1 6 9 84.84
## 3 Franches-Mnt 92.5 39.7 5 5 93.40
## 4 Moutier 85.8 36.5 12 7 33.77
## 5 Neuveville 76.9 43.5 17 15 5.16
## 6 Porrentruy 76.1 35.3 9 7 90.57
## 7 Broye 83.8 70.2 16 7 92.85
## 8 Glane 92.4 67.8 14 8 97.16
## 9 Gruyere 82.4 53.3 12 7 97.67
## 10 Sarine 82.9 45.2 16 13 91.38
## # ... with 37 more rows, and 1 more variables: Infant.Mortality <dbl>
ds2 %>% select(County = X)
## # A tibble: 47 x 1
## County
## <fctr>
## 1 Courtelary
## 2 Delemont
## 3 Franches-Mnt
## 4 Moutier
## 5 Neuveville
## 6 Porrentruy
## 7 Broye
## 8 Glane
## 9 Gruyere
## 10 Sarine
## # ... with 37 more rows
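To make the difference concrete, here is a small sketch that counts the columns each approach keeps. It also shows one way to rename with select while keeping everything, using the everything() helper. ds2 is rebuilt from swiss here, and the renamed, selected, and both names are only for illustration:

```r
library(dplyr)

## Stand-in for ds2, built from the built-in swiss data
ds2 <- as_tibble(data.frame(X = rownames(swiss), swiss, row.names = NULL))

renamed  <- ds2 %>% rename(County = X)                 ## all columns kept
selected <- ds2 %>% select(County = X)                 ## only County kept
both     <- ds2 %>% select(County = X, everything())   ## renamed, rest kept

ncol(renamed)   ## 7
ncol(selected)  ## 1
names(both)
```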
You can subset the dataset using filter. Note the double equal sign == for testing if ‘Examination’ is equal to 15. A single = is used for something else (assigning things to objects or to function arguments).
filter(ds2, Catholic < 20, Examination == 15)
## # A tibble: 3 x 7
## X Fertility Agriculture Examination Education Catholic
## <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Yverdon 65.4 49.5 15 8 6.10
## 3 Val de Ruz 77.6 37.6 15 7 4.97
## # ... with 1 more variables: Infant.Mortality <dbl>
ds2 %>% filter(Catholic < 20, Examination == 15)
## # A tibble: 3 x 7
## X Fertility Agriculture Examination Education Catholic
## <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Yverdon 65.4 49.5 15 8 6.10
## 3 Val de Ruz 77.6 37.6 15 7 4.97
## # ... with 1 more variables: Infant.Mortality <dbl>
ds2 %>% filter(., Catholic < 20, Examination == 15)
## # A tibble: 3 x 7
## X Fertility Agriculture Examination Education Catholic
## <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Yverdon 65.4 49.5 15 8 6.10
## 3 Val de Ruz 77.6 37.6 15 7 4.97
## # ... with 1 more variables: Infant.Mortality <dbl>
## For string/factor variables
ds2 %>% filter(X == 'Aigle')
## # A tibble: 1 x 7
## X Fertility Agriculture Examination Education Catholic
## <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 Aigle 64.1 62 21 12 8.52
## # ... with 1 more variables: Infant.Mortality <dbl>
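Conditions separated by commas in filter are combined with AND. As a short sketch (ds2 is rebuilt from swiss, and the and_comma, and_amp, and two_places names are only for illustration), the same subset can be written with &, and other logical operators such as | and %in% work too:

```r
library(dplyr)

## Stand-in for ds2, built from the built-in swiss data
ds2 <- as_tibble(data.frame(X = rownames(swiss), swiss, row.names = NULL))

## Commas mean AND, so these two give the same rows
and_comma <- ds2 %>% filter(Catholic < 20, Examination == 15)
and_amp   <- ds2 %>% filter(Catholic < 20 & Examination == 15)
identical(and_comma, and_amp)

## OR uses a single |
ds2 %>% filter(Catholic < 20 | Examination == 15)

## Match any of several values with %in%
two_places <- ds2 %>% filter(X %in% c('Aigle', 'Broye'))
two_places
```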
We can start chaining these commands together using the %>% command. There is no limit to how long a chain can be. arrange sorts the rows, here by the column Education in ascending order. mutate creates a new column.
ds2 %>%
    filter(Catholic > 20) %>%
    select(Education, Fertility)
## # A tibble: 21 x 2
## Education Fertility
## <int> <dbl>
## 1 9 83.1
## 2 5 92.5
## 3 7 85.8
## 4 7 76.1
## 5 7 83.8
## 6 8 92.4
## 7 7 82.4
## 8 13 82.9
## 9 6 87.1
## 10 2 68.3
## # ... with 11 more rows
ds2 %>%
    filter(Catholic > 20) %>%
    select(County = X, Education, Fertility, Agriculture) %>%
    arrange(Education) %>%
    mutate(infertile = ifelse(Fertility < 50, 'yes', 'no'),
           testing = 'Yes' ## Create a testing column to show how mutate works.
           )
## # A tibble: 21 x 6
## County Education Fertility Agriculture infertile testing
## <fctr> <int> <dbl> <dbl> <chr> <chr>
## 1 Echallens 2 68.3 72.6 no Yes
## 2 Conthey 2 75.5 85.9 no Yes
## 3 Herens 2 77.3 89.7 no Yes
## 4 Monthey 3 79.4 64.9 no Yes
## 5 Sierre 3 92.2 84.6 no Yes
## 6 Franches-Mnt 5 92.5 39.7 no Yes
## 7 Veveyse 6 87.1 64.5 no Yes
## 8 Entremont 6 69.3 84.9 no Yes
## 9 Martigwy 6 70.5 78.2 no Yes
## 10 Moutier 7 85.8 36.5 no Yes
## # ... with 11 more rows
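arrange can also sort in descending order and break ties with a second column. A minimal sketch, again on a stand-in ds2 rebuilt from swiss (the sorted name is only for illustration):

```r
library(dplyr)

## Stand-in for ds2, built from the built-in swiss data
ds2 <- as_tibble(data.frame(X = rownames(swiss), swiss, row.names = NULL))

sorted <- ds2 %>%
    select(County = X, Education, Fertility) %>%
    ## Highest Education first; ties ordered by ascending Fertility
    arrange(desc(Education), Fertility)

head(sorted, 3)
```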
To get the data into a nicer and more analyzable format, you can use the tidyr package. See what gather does in the code below. Then see what spread does. Note that you can remove a column by putting a minus sign (-) in front of a variable when you use select.
library(tidyr)
## Compare this:
ds2 %>%
    select(-Infant.Mortality) %>%
    rename(County = X)
## # A tibble: 47 x 6
## County Fertility Agriculture Examination Education Catholic
## <fctr> <dbl> <dbl> <int> <int> <dbl>
## 1 Courtelary 80.2 17.0 15 12 9.96
## 2 Delemont 83.1 45.1 6 9 84.84
## 3 Franches-Mnt 92.5 39.7 5 5 93.40
## 4 Moutier 85.8 36.5 12 7 33.77
## 5 Neuveville 76.9 43.5 17 15 5.16
## 6 Porrentruy 76.1 35.3 9 7 90.57
## 7 Broye 83.8 70.2 16 7 92.85
## 8 Glane 92.4 67.8 14 8 97.16
## 9 Gruyere 82.4 53.3 12 7 97.67
## 10 Sarine 82.9 45.2 16 13 91.38
## # ... with 37 more rows
## With this:
ds2 %>%
    select(-Infant.Mortality) %>%
    rename(County = X) %>%
    gather(Measure, Value, -County)
## # A tibble: 235 x 3
## County Measure Value
## <fctr> <chr> <dbl>
## 1 Courtelary Fertility 80.2
## 2 Delemont Fertility 83.1
## 3 Franches-Mnt Fertility 92.5
## 4 Moutier Fertility 85.8
## 5 Neuveville Fertility 76.9
## 6 Porrentruy Fertility 76.1
## 7 Broye Fertility 83.8
## 8 Glane Fertility 92.4
## 9 Gruyere Fertility 82.4
## 10 Sarine Fertility 82.9
## # ... with 225 more rows
## And back again:
ds2 %>%
    select(-Infant.Mortality) %>%
    rename(County = X) %>%
    gather(Measure, Value, -County) %>%
    spread(Measure, Value)
## # A tibble: 47 x 6
## County Agriculture Catholic Education Examination Fertility
## * <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aigle 62.0 8.52 12 21 64.1
## 2 Aubonne 67.5 2.27 7 14 66.9
## 3 Avenches 60.7 4.43 12 19 68.9
## 4 Boudry 38.4 5.62 12 26 70.4
## 5 Broye 70.2 92.85 7 16 83.8
## 6 Conthey 85.9 99.71 2 3 75.5
## 7 Cossonay 69.3 2.82 5 22 61.7
## 8 Courtelary 17.0 9.96 12 15 80.2
## 9 Delemont 45.1 84.84 9 6 83.1
## 10 Echallens 72.6 24.20 2 18 68.3
## # ... with 37 more rows
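In tidyr 1.0.0 and later, gather and spread still work but are superseded by pivot_longer and pivot_wider. Here is a rough equivalent of the round trip above, assuming a recent tidyr is installed. ds2 is rebuilt from swiss so the block runs on its own, and the long and wide names are only for illustration. The row order of the long data may differ from what gather gives.

```r
library(dplyr)
library(tidyr)

## Stand-in for ds2, built from the built-in swiss data
ds2 <- as_tibble(data.frame(X = rownames(swiss), swiss, row.names = NULL))

## Wide to long: one row per County and Measure
long <- ds2 %>%
    select(-Infant.Mortality) %>%
    rename(County = X) %>%
    pivot_longer(cols = -County, names_to = "Measure", values_to = "Value")

## Long back to wide: one column per Measure
wide <- long %>%
    pivot_wider(names_from = Measure, values_from = Value)

dim(long)
dim(wide)
```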
Combined with dplyr’s group_by and summarise, you can quickly summarise data or do further, more complicated analyses. group_by makes it so further analyses or operations work on the groups. summarise transforms the data to only contain the new variable(s) created, in this case the mean.
ds2 %>%
    select(-X) %>%
    gather(Measure, Value) %>%
    group_by(Measure) %>%
    summarise(mean = mean(Value))
## # A tibble: 6 x 2
## Measure mean
## <chr> <dbl>
## 1 Agriculture 50.65957
## 2 Catholic 41.14383
## 3 Education 10.97872
## 4 Examination 16.48936
## 5 Fertility 70.14255
## 6 Infant.Mortality 19.94255
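summarise can create several new variables at once. As a sketch on a stand-in ds2 rebuilt from swiss (the stats name is only for illustration), this adds the standard deviation and the number of rows per group next to the mean:

```r
library(dplyr)
library(tidyr)

## Stand-in for ds2, built from the built-in swiss data
ds2 <- as_tibble(data.frame(X = rownames(swiss), swiss, row.names = NULL))

stats <- ds2 %>%
    select(-X) %>%
    gather(Measure, Value) %>%
    group_by(Measure) %>%
    summarise(mean = mean(Value),   ## average per measure
              sd = sd(Value),       ## standard deviation per measure
              n = n())              ## number of rows per measure

stats
```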
You can extend this to create a table in the generated .md or .html file using the kable command (short for ‘knitr table’).
library(knitr)
ds2 %>%
    select(-X) %>%
    gather(Measure, Value) %>%
    group_by(Measure) %>%
    summarise(mean = mean(Value)) %>%
    kable()
| Measure | mean |
|---|---|
| Agriculture | 50.65957 |
| Catholic | 41.14383 |
| Education | 10.97872 |
| Examination | 16.48936 |
| Fertility | 70.14255 |
| Infant.Mortality | 19.94255 |
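kable has a few arguments for tidying up the table, such as digits for rounding and col.names for nicer headers. A minimal sketch, computing the means with base R’s colMeans so the block runs on its own (the means and tab names are only for illustration):

```r
library(knitr)

## Column means of the built-in swiss data as a two-column data.frame
means <- data.frame(Measure = names(swiss),
                    mean = unname(colMeans(swiss)))

## Round to one decimal place and capitalize the second header
tab <- kable(means, digits = 1, col.names = c("Measure", "Mean"))
tab
```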
Check out the documentation on knitr or R Markdown for R code chunk options. If you look at the raw .Rmd file for this document, you can see that the code chunk below uses eval = FALSE, which tells knitr not to run this code chunk.
These two commands generate either an html or an md file.
## into html
library(rmarkdown)
render('lesson.Rmd') ## or can use rmarkdown::render('lesson.Rmd')
## into md
library(knitr)
knit('lesson.Rmd') ## or can use knitr::knit('lesson.Rmd')
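If you want to try this without the lesson file, here is a self-contained sketch that writes a tiny .Rmd to a temporary location and knits it to .md. The file contents and the rmd, md, fence, and out names are made up for illustration. The second chunk uses eval = FALSE, so its code is shown in the output but never run.

```r
library(knitr)

## Temporary input and output files (illustrative only)
rmd <- tempfile(fileext = ".Rmd")
md  <- tempfile(fileext = ".md")

## Three backticks, built with strrep() so they are easy to spot
fence <- strrep("`", 3)

writeLines(c(
    "# Demo",
    "",
    paste0(fence, "{r}"),
    "1 + 1",
    fence,
    "",
    paste0(fence, "{r, eval = FALSE}"),
    "stop('this chunk is shown but never run')",
    fence
), rmd)

## Knit the .Rmd into .md and look at the result
knit(rmd, output = md, quiet = TRUE)
out <- readLines(md)
out
```

knit only needs the knitr package, while render additionally relies on pandoc to turn the .md into html, pdf, or Word.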
Make a table with the means of Agriculture, Examination, Education, and Infant.Mortality for each category of Fertility (hint: convert it into a factor by values >50 vs <50), when Catholic is less than 60 (hint: use dplyr commands + gather). Have the Fertility groups as two columns in the new table (hint: use spread + kable).