**Authors**: Luke Johnston**Research field**: Nutritional Sciences**Lesson content URL**: https://github.com/UofTCoders/studyGroup/tree/gh-pages/lessons/r/resampling

In statistics, we use samples to infer about the population given some set of conditions. We generally do not know or can ever know (without substantial cost and time) the entire characteristics of the population, so we use the sample to make guesses. Because the sample is not the entire population, the sample does not completely represent the population. That’s why there are standard deviations, standard errors, and confidence intervals with all statistics. If we took another sample, it would be slightly different from the original sample and would have slightly different statistics. Since standard errors of the statistics are calculated based on the sample, these estimates can be biased to the sample and have certain mathematical assumptions about the distribution. What resampling does is to take randomly drawn (sub)samples of the sample and calculate the statistic from that (sub)sample. Do this enough times and you can get a distribution of statistic values that can provide an empirical measure of the accuracy/precision of the test statistic, with less rigid assumptions. In this lesson, I’ll cover bootstrapping and permutation testing. The differences come down to essentially:

- Bootstrapping is best used to estimate confidence intervals of test statistics.
- Permutation testing is best used for testing hypotheses.

Bootstrapping is the process of resampling *with replacement* (*all* values in
the sample have an *equal* probability of being selected, including multiple
times, so a value could have a duplicate). Resample, calculate a statistic (e.g.
the mean), repeat this hundreds or thousands of times and you are able to
estimate a precise/accurate uncertainty of the mean (confidence interval) of the
data’s distribution. There are less assumptions about the underlying
distribution using bootstrap compared to calculating the standard error
directly.

```
x <- runif(10)
hist(x)
mean(x)
```

```
## [1] 0.5701154
```

```
resample <- sample(x, replace = TRUE)
mean(resample)
```

```
## [1] 0.6443582
```

```
library(dplyr)
```

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
## filter, lag
```

```
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

```
library(ggplot2)
```

```
# Resample 100 times, and find the mean of each
data_frame(num = 1:100) %>%
group_by(num) %>%
mutate(means = mean(sample(x, replace = TRUE))) %>%
ggplot(aes(x = means)) +
geom_freqpoly()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

```
# Resample a 5000 times, and find the mean of each
data_frame(num = 1:100) %>%
group_by(num) %>%
mutate(means = mean(sample(x, replace = TRUE))) %>%
ggplot(aes(x = means)) +
geom_freqpoly()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

There is a R package that does boostrapping, called `boot`

. The `boot`

function
needs a function that calculates the mean based on the resample of the data. It
takes two arguments, the values (`x`

) and the resample vector of the values
(`i`

).

```
library(boot)
# Create a function to take a resample of the values,
# and then calculate the mean
boot_mean <- function(original_vector, resample_vector) {
mean(original_vector[resample_vector])
}
# R is number of replications
mean_results <- boot(x, boot_mean, R = 2000)
# Load broom to get a tidy dataframe as output.
library(broom)
tidy(mean_results)
```

```
## statistic bias std.error
## 1 0.5701154 -0.004093434 0.08803179
```

```
# Calculate the confidence intervals
boot.ci(mean_results)
```

```
## Warning in boot.ci(mean_results): bootstrap variances needed for
## studentized intervals
```

```
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = mean_results)
##
## Intervals :
## Level Normal Basic
## 95% ( 0.4017, 0.7467 ) ( 0.4046, 0.7517 )
##
## Level Percentile BCa
## 95% ( 0.3885, 0.7356 ) ( 0.4044, 0.7445 )
## Calculations and Intervals on Original Scale
```

If you bootstrap for a statistic with two values, the individual rows are resampled. However, the rows are still paired together (no reshuffling of the order occurs).

```
boot_corr <- function(data, resample_vector) {
cor(data$x[resample_vector], data$y[resample_vector])
}
ds <- data.frame(y = runif(10), x = runif(10))
# Original correlation
cor(ds$x, ds$y)
```

```
## [1] -0.2155279
```

```
# Bootstrapped correlation
cor_results <- boot(example_data, boot_corr, R = 100)
```

```
## Error in NROW(data): object 'example_data' not found
```

```
tidy(cor_results)
```

```
## Error in tidy(cor_results): object 'cor_results' not found
```

```
# Plus confidence interval
boot.ci(cor_results)
```

```
## Error in boot.ci(cor_results): object 'cor_results' not found
```

Similar to bootstrapping, except permutation testing resamples *without
replacement* (meaning when a value is selected, it can not be selected again, so
no value can be a duplicate). This simply shuffles the values. In the case of a
univariate statistic (e.g. mean), this will not change anything. However, when
there are two or more variables, reshuffling one variable will change the test
statistic (e.g. correlation or regression). Usually this is done on the
response/outcome/y variable, and usually for tests that use discrete or
categorical (“yes”, “no”) variables.

```
ds <- data.frame(y = runif(10), x = runif(10))
ds
```

```
## y x
## 1 0.26248353 0.60789022
## 2 0.27608331 0.81612978
## 3 0.35928985 0.47067467
## 4 0.34308694 0.86670014
## 5 0.61019344 0.23596764
## 6 0.31907683 0.53597891
## 7 0.51808777 0.77629172
## 8 0.39792031 0.09387856
## 9 0.04529551 0.75595895
## 10 0.56524394 0.44996910
```

```
# Using spearman correlation to be consistent with the next example
cor(ds$y, ds$x, method = "spearman")
```

```
## [1] -0.5393939
```

```
# Resampled the y only (reshuffled the order)
ds$resample_y <- sample(ds$y)
ds
```

```
## y x resample_y
## 1 0.26248353 0.60789022 0.31907683
## 2 0.27608331 0.81612978 0.04529551
## 3 0.35928985 0.47067467 0.56524394
## 4 0.34308694 0.86670014 0.39792031
## 5 0.61019344 0.23596764 0.34308694
## 6 0.31907683 0.53597891 0.35928985
## 7 0.51808777 0.77629172 0.51808777
## 8 0.39792031 0.09387856 0.26248353
## 9 0.04529551 0.75595895 0.27608331
## 10 0.56524394 0.44996910 0.61019344
```

```
# Correlation again, with the resampled y
cor(ds$resample_y, ds$x, method = "spearman")
```

```
## [1] -0.07878788
```

```
# 1000 resamples of the correlation of each.
data_frame(num = 1:1000) %>%
group_by(num) %>%
mutate(corr = cor(sample(y), x, method = "spearman")) %>%
ggplot(aes(x = corr)) +
geom_freqpoly()
```

```
## Error in mutate_impl(.data, dots): Evaluation error: object 'y' not found.
```

There is a `coin`

package that does permutation testing for several tests.

```
library(coin)
```

```
## Loading required package: methods
```

```
## Loading required package: survival
```

```
##
## Attaching package: 'survival'
```

```
## The following object is masked from 'package:boot':
##
## aml
```

```
ds <- data.frame(y, x)
```

```
## Error in data.frame(y, x): object 'y' not found
```

```
spearman_test(y ~ x, data = ds)
```

```
##
## Asymptotic Spearman Correlation Test
##
## data: y by x
## Z = -1.6182, p-value = 0.1056
## alternative hypothesis: true rho is not equal to 0
```

There are many more tests found within the `coin`

package. Check them out in the vignette:

```
vignette("coin", package = "coin")
```