Practical N: Solutions

We use the following packages:

library(boot)
library(mice)
library(dplyr)
library(magrittr)
library(ggplot2)

I fix the random seed to 123

set.seed(123)

Use the following function to calculate the bootstrapped estimate for the correlation between age and weight on the boys data set. Use R=1000 bootstrap samples.

corfun <- function(data, i){
  data.sample <- data[i,]
  stat <- cor(data.sample$age, data.sample$wgt, use = "pairwise.complete.obs")
    return(stat)
}

To apply the corfun() function in the bootstrap function boot(), we run:

bootstr.cor <- 
  boys %>%
  boot(corfun, R = 1000)

Our object bootstr.cor contains all the bootstrap information about the correlation parameter

Explore the contents of the bootstr.cor object. For example, use function attributes() to see the listed dimensions that are within the bootstr.cor object and the class of the object.

attributes(bootstr.cor)

## $names
##  [1] "t0"        "t"         "R"         "data"      "seed"      "statistic"
##  [7] "sim"       "call"      "stype"     "strata"    "weights"  
## 
## $class
## [1] "boot"
## 
## $boot_type
## [1] "boot"

we see that the bootstr.cor object has class boot.

head(bootstr.cor$data)

##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south

The most interesting listed dimensions are the following:

The $t0 contains the original estimate for the data

bootstr.cor$t0

## [1] 0.9505762

The $t contains the individual estimate for every bootstrapped sample

head(bootstr.cor$t)

##           [,1]
## [1,] 0.9532544
## [2,] 0.9494370
## [3,] 0.9527242
## [4,] 0.9511808
## [5,] 0.9521278
## [6,] 0.9531920

The average bootstrapped estimate would then be

bootstr.cor$t %>% mean

## [1] 0.9507738

The $data has the original data set that has been served to function boot():

head(bootstr.cor$data)

##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south

Plot the histogram of the individual bootstrapped estimates for every bootstrapped sample (i.e. $t).

plotdata <- data.frame(t = bootstr.cor$t)

plotdata %>%
  ggplot(aes(t)) + 
  geom_histogram(bins = 20) + 
  geom_density(col = "orange") + 
  geom_vline(xintercept = bootstr.cor$t0, col = "orange", linetype = "dashed", linewidth = 1.5)

Add a column to boys that indicates whether boys are overweight by taking the boundary bmi > 25.

Remember that body mass index takes the ratio weight-over-height-squared into account and the validity of the interpretation does change for children vs. adults.

We can use function mutate() to assign values to a new column (variable) ovw for boys that have a bmi <= 25 (smaller or equal) or bmi > 25 (larger). The if_else() function can be directly applied to call if_else(condition, true, false).

boys2 <- boys %>%
    mutate(ovw = if_else(bmi <= 25, "Not Overweight", "Overweight"))

There are now 20 Overweight boys:

boys2 %$%
  table(ovw)

## ovw
## Not Overweight     Overweight 
##            707             20

Bootstrap a $X^2$-test that evaluates the distribution of overweight ovw boys over the regions reg. We can copy the previous function and replace the line for our test-statistic:

X2fun <- function(data, i){
  data.sample <- data[i,]
  tab <- data.sample %$%
    table(ovw, reg) 
  X2 <- suppressWarnings(chisq.test(tab))
    return(X2$statistic)
}

I use suppressWarnings() to suppress the warnings the $X^2$ test is going to give us, because some of the expected cell-frequencies are below 5. I only return the chi-squared estimate by asking for X2$statistic. This is because function boot expects an estimate as return, not a whole list of estimates.

bootstr.X2 <- 
  boys2 %>%
  boot(X2fun, R = 1000)

bootstr.X2

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = ., statistic = X2fun, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1* 3.650944 3.775064    4.895331

Plot the histogram of the individual bootstrapped estimates for every bootstrapped sample (i.e. $t).

plotdata <- data.frame(t = bootstr.X2$t)

plotdata %>%
  ggplot(aes(t)) + 
  geom_histogram(bins = 40) + 
  geom_vline(xintercept = bootstr.X2$t0, col = "orange", linetype = "dashed", linewidth = 1.5)

The sampling distribution of the estimate is quite skewed.

Do a bootstrap of the regression estimates of the following model:

lm(wgt ~ age + hgt + I(hgt^2))

The function that would give us the regression estimates for the parameters is e.g. :

lmfun <- function(data, i){
  data.sample <- data[i,]
  coef <- data.sample %$%
    lm(wgt ~ age + hgt + I(hgt^2)) %>%
    coef()
    return(coef)
}

and if we plug that function into boot() for R = 1000 bootstrap samples, we obtain

bootstr.lm <- 
  boys %>%
  boot(lmfun, R = 1000)

bootstr.lm

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = ., statistic = lmfun, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 23.853118370  1.453609e-02 1.9359294811
## t2*  0.710865240 -2.103321e-03 0.2403071014
## t3* -0.498229914  1.175075e-04 0.0380706182
## t4*  0.003722907 -9.715547e-07 0.0002249442

Again, the observed data regression estimate is:

bootstr.lm$t0

##  (Intercept)          age          hgt     I(hgt^2) 
## 23.853118370  0.710865240 -0.498229914  0.003722907

Create a histogram of the individual bootstrapped estimates There are two ways to do this.
Using the plot options of the class boot.
Construct the plots manually by extracting the regression coefficients from the boot class object.

Plot the distribution of the intercept by using the index argument (1 refers to the intercept and subsequent numbers to the remaining regression coefficients / slopes).

plot(bootstr.lm, index = 1)

Plot the distribution of the slope of wgt:

plot(bootstr.lm, index = 2)

Plot the distribution of the slope of hgt:

plot(bootstr.lm, index = 3)

Plot the distribution of the slope of hgt^2:

plot(bootstr.lm, index = 4)

Option 2: manually Look at the code to see if you understand what it does.

est <- bootstr.lm$t
colnames(est) <- (c("Intercept", "wgt", "hgt", "hgt^2"))

reshape2::melt(est) %>%
  ggplot(aes(value)) + facet_wrap(~Var2, scales = 'free_x') +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Calculate the confidence intervals for the intercept and slopes

We can use function boot.ci() to obtain the confidence intervals from our estimates. Remember that there we have 4 statistics: one intercept and three slopes. The bootstrap confidence interval of the intercept can be obtained with the argument index = 1 and we choose the basic interval:

boot.ci(bootstr.lm, type = "basic", index = 1)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstr.lm, type = "basic", index = 1)
## 
## Intervals : 
## Level      Basic         
## 95%   (19.73, 27.57 )  
## Calculations and Intervals on Original Scale

boot.ci(bootstr.lm, type = "basic", index = 2)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstr.lm, type = "basic", index = 2)
## 
## Intervals : 
## Level      Basic         
## 95%   ( 0.2376,  1.2161 )  
## Calculations and Intervals on Original Scale

boot.ci(bootstr.lm, type = "basic", index = 3)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstr.lm, type = "basic", index = 3)
## 
## Intervals : 
## Level      Basic         
## 95%   (-0.5709, -0.4195 )  
## Calculations and Intervals on Original Scale

boot.ci(bootstr.lm, type = "basic", index = 4)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstr.lm, type = "basic", index = 4)
## 
## Intervals : 
## Level      Basic         
## 95%   ( 0.0033,  0.0041 )  
## Calculations and Intervals on Original Scale

We can also use the quantiles() for the $t dimensions. For example, to obtain the cut-offs for .025 and .975 from the empirical distribution for the intercept, use:

quantile(est[, "Intercept"], c(0.025,0.975))

##     2.5%    97.5% 
## 20.15042 27.96690

or, for all estimates when using apply():

apply(est, 2, function(x) quantile(x, c(0.025,0.975)))

##       Intercept       wgt        hgt       hgt^2
## 2.5%   20.15042 0.2116248 -0.5767418 0.003300655
## 97.5%  27.96690 1.1784080 -0.4262788 0.004167523

The quantile() function yields a slightly different result from the percentile bootstrap estimates, because the quantile function uses a quantile algorithm. Read all about these algorithms in the help.

To calculate the percentile bootstrap CI by hand:

sort(est[, "Intercept"])[c(25, 976)]

## [1] 20.13762 27.97504

End of Practical

Practical N: Solutions

Gerko Vink, Laurence Frank

Statistical Programming in R