We use the following packages:
library(boot)
library(mice)
library(dplyr)
library(magrittr)
library(ggplot2)
I fix the random seed to 123
set.seed(123)
age
and
weight
on the boys
data set. Use
R=1000
bootstrap samples.corfun <- function(data, i){
data.sample <- data[i,]
stat <- cor(data.sample$age, data.sample$wgt, use = "pairwise.complete.obs")
return(stat)
}
To apply the corfun()
function in the bootstrap function
boot()
, we run:
bootstr.cor <-
boys %>%
boot(corfun, R = 1000)
Our object bootstr.cor
contains all the bootstrap
information about the correlation parameter
bootstr.cor
object.
For example, use function attributes()
to see the listed
dimensions that are within the bootstr.cor
object and the
class
of the object.attributes(bootstr.cor)
## $names
## [1] "t0" "t" "R" "data" "seed" "statistic"
## [7] "sim" "call" "stype" "strata" "weights"
##
## $class
## [1] "boot"
##
## $boot_type
## [1] "boot"
we see that the bootstr.cor
object has class
boot
.
head(bootstr.cor$data)
## age hgt wgt bmi hc gen phb tv reg
## 3 0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4 0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
The most interesting listed dimensions are the following:
$t0
contains the original estimate for the
databootstr.cor$t0
## [1] 0.9505762
$t
contains the individual estimate for every
bootstrapped samplehead(bootstr.cor$t)
## [,1]
## [1,] 0.9532544
## [2,] 0.9494370
## [3,] 0.9527242
## [4,] 0.9511808
## [5,] 0.9521278
## [6,] 0.9531920
The average bootstrapped estimate would then be
bootstr.cor$t %>% mean
## [1] 0.9507738
$data
has the original data set that has been
served to function boot()
:head(bootstr.cor$data)
## age hgt wgt bmi hc gen phb tv reg
## 3 0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4 0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
$t
).plotdata <- data.frame(t = bootstr.cor$t)
plotdata %>%
ggplot(aes(t)) +
geom_histogram(bins = 20) +
geom_density(col = "orange") +
geom_vline(xintercept = bootstr.cor$t0, col = "orange", linetype = "dashed", linewidth = 1.5)
bmi > 25
.Remember that body mass index takes the ratio weight-over-height-squared into account and the validity of the interpretation does change for children vs. adults.
We can use function mutate()
to assign values to a new
column (variable) ovw
for boys that have a
bmi <= 25
(smaller or equal) or bmi > 25
(larger). The if_else()
function can be directly applied to
call if_else(condition, true, false)
.
boys2 <- boys %>%
mutate(ovw = if_else(bmi <= 25, "Not Overweight", "Overweight"))
There are now 20 Overweight
boys:
boys2 %$%
table(ovw)
## ovw
## Not Overweight Overweight
## 707 20
ovw
boys over
the regions reg
. We can copy the previous
function and replace the line for our test-statistic:X2fun <- function(data, i){
data.sample <- data[i,]
tab <- data.sample %$%
table(ovw, reg)
X2 <- suppressWarnings(chisq.test(tab))
return(X2$statistic)
}
I use suppressWarnings()
to suppress the warnings the
\(X^2\) test is going to give us,
because some of the expected cell-frequencies are below 5. I only return
the chi-squared estimate by asking for X2$statistic
. This
is because function boot
expects an estimate as return, not
a whole list of estimates.
bootstr.X2 <-
boys2 %>%
boot(X2fun, R = 1000)
bootstr.X2
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = ., statistic = X2fun, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 3.650944 3.775064 4.895331
$t
).plotdata <- data.frame(t = bootstr.X2$t)
plotdata %>%
ggplot(aes(t)) +
geom_histogram(bins = 40) +
geom_vline(xintercept = bootstr.X2$t0, col = "orange", linetype = "dashed", linewidth = 1.5)
The sampling distribution of the estimate is quite skewed.
lm(wgt ~ age + hgt + I(hgt^2))
The function that would give us the regression estimates for the parameters is e.g. :
lmfun <- function(data, i){
data.sample <- data[i,]
coef <- data.sample %$%
lm(wgt ~ age + hgt + I(hgt^2)) %>%
coef()
return(coef)
}
and if we plug that function into boot()
for
R = 1000
bootstrap samples, we obtain
bootstr.lm <-
boys %>%
boot(lmfun, R = 1000)
bootstr.lm
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = ., statistic = lmfun, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 23.853118370 1.453609e-02 1.9359294811
## t2* 0.710865240 -2.103321e-03 0.2403071014
## t3* -0.498229914 1.175075e-04 0.0380706182
## t4* 0.003722907 -9.715547e-07 0.0002249442
Again, the observed data regression estimate is:
bootstr.lm$t0
## (Intercept) age hgt I(hgt^2)
## 23.853118370 0.710865240 -0.498229914 0.003722907
boot
.Plot the distribution of the intercept by using the
index
argument (1 refers to the intercept and subsequent
numbers to the remaining regression coefficients / slopes).
plot(bootstr.lm, index = 1)
Plot the distribution of the slope of
wgt
:
plot(bootstr.lm, index = 2)
Plot the distribution of the slope of hgt
:
plot(bootstr.lm, index = 3)
Plot the distribution of the slope of
hgt^2
:
plot(bootstr.lm, index = 4)
Option 2: manually Look at the code to see if you understand what it does.
est <- bootstr.lm$t
colnames(est) <- (c("Intercept", "wgt", "hgt", "hgt^2"))
reshape2::melt(est) %>%
ggplot(aes(value)) + facet_wrap(~Var2, scales = 'free_x') +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can use function boot.ci()
to obtain the confidence
intervals from our estimates. Remember that there we have 4 statistics:
one intercept and three slopes. The bootstrap confidence interval of the
intercept can be obtained with the argument index = 1
and
we choose the basic interval:
boot.ci(bootstr.lm, type = "basic", index = 1)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstr.lm, type = "basic", index = 1)
##
## Intervals :
## Level Basic
## 95% (19.73, 27.57 )
## Calculations and Intervals on Original Scale
boot.ci(bootstr.lm, type = "basic", index = 2)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstr.lm, type = "basic", index = 2)
##
## Intervals :
## Level Basic
## 95% ( 0.2376, 1.2161 )
## Calculations and Intervals on Original Scale
boot.ci(bootstr.lm, type = "basic", index = 3)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstr.lm, type = "basic", index = 3)
##
## Intervals :
## Level Basic
## 95% (-0.5709, -0.4195 )
## Calculations and Intervals on Original Scale
boot.ci(bootstr.lm, type = "basic", index = 4)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstr.lm, type = "basic", index = 4)
##
## Intervals :
## Level Basic
## 95% ( 0.0033, 0.0041 )
## Calculations and Intervals on Original Scale
We can also use the quantiles()
for the $t
dimensions. For example, to obtain the cut-offs for .025
and .975
from the empirical distribution for the intercept,
use:
quantile(est[, "Intercept"], c(0.025,0.975))
## 2.5% 97.5%
## 20.15042 27.96690
or, for all estimates when using apply()
:
apply(est, 2, function(x) quantile(x, c(0.025,0.975)))
## Intercept wgt hgt hgt^2
## 2.5% 20.15042 0.2116248 -0.5767418 0.003300655
## 97.5% 27.96690 1.1784080 -0.4262788 0.004167523
The quantile()
function yields a slightly different
result from the percentile bootstrap estimates, because the quantile
function uses a quantile algorithm. Read all about these algorithms in
the help.
To calculate the percentile bootstrap CI by hand:
sort(est[, "Intercept"])[c(25, 976)]
## [1] 20.13762 27.97504
End of Practical