It is always wise to fix the random seed. I often use a seed value of
123:
set.seed(123)
at the top of my document. This ensures that all calculations in my
documents are exactly reproducible, because the random number
generator in R takes its origin from the specified seed
value. Every subsequent random process advances the random generation by
one step. When we specify a seed value, we can replicate the exact chain
of random processes. This is an extremely useful tool.
When an R instance is started, there is initially no
seed. In that case, R will create one from the current time
and process ID. Hence, different sessions will give different results
when random numbers are involved. When you store a workspace and reopen
it, you will continue from the seed state stored within the workspace.
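You can observe this behaviour directly: R stores the current generator state in the object .Random.seed in the Global Environment, and that object only appears once the generator has been seeded or used. A minimal sketch (assuming a fresh R session):

```r
# In a fresh R session, .Random.seed does not exist yet
exists(".Random.seed")   # FALSE before any random draw or set.seed()

rnorm(1)                 # the first random draw creates the seed state
exists(".Random.seed")   # TRUE from now on; saved along with the workspace
```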
Fix the random seed to 123 with
set.seed(seed = 123)
or simply
set.seed(123)
The random number generator is now fixed to seed value 123.
We can now ensure that our data generation process is exactly
reproducible on every machine using R
around the world.
Draw 10 values from a standard normal distribution and store the result as draw1:
draw1 <- rnorm(n = 10)
The call rnorm(n = 10)
results in a draw of 10 values
from a standard normal distribution, a distribution with mean zero and
variance equal to 1. Drawing from a standard normal distribution is the
default in rnorm
because the default function arguments for
rnorm()
specify:
args(rnorm)
## function (n, mean = 0, sd = 1)
## NULL
We can see that by default, values are drawn with mean = 0
and sd = 1. Unless we specify other values for the mean and
standard deviation, the defaults will be used.
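We can verify that relying on the defaults is equivalent to spelling them out, by resetting the seed before each call:

```r
set.seed(123)
a <- rnorm(n = 5)                    # defaults used implicitly
set.seed(123)
b <- rnorm(n = 5, mean = 0, sd = 1)  # defaults spelled out
all.equal(a, b)                      # identical draws
## [1] TRUE
```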
Verify that the mean and standard deviation of draw1
indeed conform to the default values for
mean and sd of the function rnorm():
mean(draw1)
## [1] 0.07462564
sd(draw1)
## [1] 0.9537841
The mean and standard deviation are close, but not equal, to the default values. However, we have only drawn 10 values from a theoretical distribution. It is not statistically reasonable to expect that a single draw of 10 values will result in an unbiased estimate of the (infinitely large) population parameters.
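To see how the estimate stabilizes as the sample grows, we could compute the mean of increasingly large draws (a quick sketch; the exact numbers depend on the seed, but the means generally shrink toward the population value of 0):

```r
set.seed(123)
# sample means for draws of increasing size n
sapply(c(10, 100, 1000, 100000), function(n) mean(rnorm(n)))
```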
Now draw 1000 values from the same distribution and store them as draw2:
draw2 <- rnorm(n = 1000)
Compare the means and standard deviations of draw1
and draw2 with the population values:
means <- c(0, mean(draw1), mean(draw2))
sds <- c(1, sd(draw1), sd(draw2))
n <- c(Inf, 10, 1000)
result <- data.frame(n=n, mean=means, sd=sds)
row.names(result) <- c("population", "draw1", "draw2")
result
## n mean sd
## population Inf 0.00000000 1.0000000
## draw1 10 0.07462564 0.9537841
## draw2 1000 0.01459110 0.9957478
We have now created a data.frame
with the results.
Perhaps this data frame becomes easier to read with some rounding:
round(result, 3)
## n mean sd
## population Inf 0.000 1.000
## draw1 10 0.075 0.954
## draw2 1000 0.015 0.996
It is clear that the larger sample size will, in general, yield less
bias. However, the bias we obtain depends on the chosen random seed.
For example, if we repeat the above process with
set.seed(125)
, we obtain completely different results:
set.seed(125)
draw1b <- rnorm(10)
draw2b <- rnorm(1000)
means <- c(0, mean(draw1b), mean(draw2b))
sds <- c(1, sd(draw1b), sd(draw2b))
result <- data.frame(n=n, mean=means, sd=sds)
row.names(result) <- c("population", "draw1 (seed=125)", "draw2 (seed=125)")
round(result, 3)
## n mean sd
## population Inf 0.000 1.000
## draw1 (seed=125) 10 0.053 1.042
## draw2 (seed=125) 1000 -0.058 1.009
Now draw2
has a larger absolute bias from the mean,
despite its larger sample size. Note that I named the objects
draw1b
and draw2b
because otherwise I would
overwrite draw1
and draw2
. Overwriting these
objects would be inefficient, because we will use them later on
in this exercise. If we had accidentally overwritten these
objects, we could simply obtain them again by re-fixing the random seed
and running the exact same function calls. Also note that I do not
recreate object n
as it has not changed; I can simply
re-use the object n
that is already in our
Global Environment
.
The lesson so far is: when you fix your random seed, do not forget that your results are dependent on this random seed. The results you obtain may be a mere fluke: extremely rare, but obtained because you used that particular random seed. It would be nice to automate any random number process over many different seeds to obtain the sampling distribution of the estimates. This we will do on Thursday and Friday when we consider resampling techniques like cross-validation and bootstrapping.
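As a preview of that automation, here is a sketch with replicate(): repeat the 10-value draw many times and inspect the spread of the resulting means, which approximates the sampling distribution of the sample mean.

```r
set.seed(123)
# 1000 repetitions of "draw 10 values and take the mean"
sample_means <- replicate(1000, mean(rnorm(n = 10)))
mean(sample_means)  # close to the population mean of 0
sd(sample_means)    # close to 1/sqrt(10), the theoretical standard error
```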
In the previous exercise we changed the random seed to
125. Re-fix the random seed to 123 and generate
draw1 and draw2
again, but name them draw3
and draw4, respectively.
set.seed(123)
draw3 <- rnorm(n=10)
draw4 <- rnorm(n=1000)
Generate draw5: 22 values from a normal
distribution with mean = 8 and
variance = 144.
draw5 <- rnorm(22, 8, sqrt(144)) #square root of 144
Function rnorm()
takes the standard deviation \(\sigma\), not the variance \(\sigma^2\). Hence we take the square root
of the variance to obtain the standard deviation:
\[\sigma=\sqrt{\sigma^2} = \sqrt{144} = 12\]
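We can convince ourselves of this parameterization with a large draw: its standard deviation should be near 12 and its variance near 144 (approximate, seed-dependent values):

```r
set.seed(123)
x <- rnorm(100000, mean = 8, sd = 12)
c(mean = mean(x), sd = sd(x), var = var(x))  # roughly 8, 12 and 144
```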
Check if object draw1
is equal to
draw3
and if object draw2
is equal to
draw4.
all.equal(draw1, draw3)
## [1] TRUE
all.equal(draw2, draw4)
## [1] TRUE
The objects are indeed equal because they were generated in an identical
order after the same random seed was specified. In other words,
draw1
and draw3
were both generated in random step
1 after set.seed(123)
, and draw2
and
draw4
were both generated in step 2 after
set.seed(123)
. If we had reversed the order in which
these objects were generated, then the objects would not be equal. For
example:
set.seed(123)
draw4 <- rnorm(1000)
draw3 <- rnorm(10)
all.equal(draw1, draw3)
## [1] "Mean relative difference: 1.634686"
all.equal(draw2, draw4)
## [1] "Mean relative difference: 1.409047"
Because the objects are not equal, function all.equal()
reports the mean relative difference instead of TRUE.
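Note that all.equal() tests for near equality within a small numeric tolerance, which is usually what you want for floating-point numbers; identical() demands exact equality. For example:

```r
x <- 1
y <- 1 + 1e-9
all.equal(x, y)   # TRUE: difference falls within the default tolerance (~1.5e-8)
identical(x, y)   # FALSE: the stored values are not exactly the same
```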
Fun fact: the first 10 cases in draw1
and draw4
are equal:
cbind(draw1[1:10], draw4[1:10])
## [,1] [,2]
## [1,] -0.56047565 -0.56047565
## [2,] -0.23017749 -0.23017749
## [3,] 1.55870831 1.55870831
## [4,] 0.07050839 0.07050839
## [5,] 0.12928774 0.12928774
## [6,] 1.71506499 1.71506499
## [7,] 0.46091621 0.46091621
## [8,] -1.26506123 -1.26506123
## [9,] -0.68685285 -0.68685285
## [10,] -0.44566197 -0.44566197
and if we recreate draw5
, the results are also
equal.
draw5b <- rnorm(22, 8, sqrt(144))
all.equal(draw5, draw5b)
## [1] TRUE
This is because these values and objects follow the same steps after
set.seed(123)
. The first 10 cases in draw1
and
draw4
both stem from the first step after
set.seed(123)
, and generating draw5
and
draw5b
has been the third random step after
set.seed(123)
in both random chains.
Draw 1000 values from each of the following distributions and inspect the results with summary():
- a normal distribution with mean = 50 and sd = 20,
- a t-distribution with df = 11 degrees of freedom,
- a uniform distribution with minimum min = 5 and maximum max = 6,
- a binomial distribution with size = 1 trials and prob = .8 probability of success on each trial,
- an F-distribution with df1 = 1 and df2 = 2 degrees of freedom,
- a Poisson distribution with lambda = 5 as the mean.
First,
draw.norm <- rnorm(1000, mean = 50, sd = 20)
draw.t <- rt(1000, df = 11)
draw.unif <- runif(1000, min = 5, max = 6)
draw.binom <- rbinom(1000, size = 1, prob = .8)
draw.F <- rf(1000, df1 = 1, df2 = 2)
draw.pois <- rpois(1000, lambda = 5)
It may ease interpretation to create a data.frame
from
these results and call summary on the data frame.
data <- data.frame(normal = draw.norm,
t = draw.t,
uniform = draw.unif,
binomial = draw.binom,
Fdistr = draw.F,
poisson = draw.pois)
summary(data)
## normal t uniform binomial
## Min. :-10.96 Min. :-4.206696 Min. :5.000 Min. :0.000
## 1st Qu.: 36.94 1st Qu.:-0.697476 1st Qu.:5.247 1st Qu.:1.000
## Median : 50.73 Median : 0.006237 Median :5.490 Median :1.000
## Mean : 50.69 Mean :-0.041201 Mean :5.493 Mean :0.797
## 3rd Qu.: 64.83 3rd Qu.: 0.645770 3rd Qu.:5.726 3rd Qu.:1.000
## Max. :117.81 Max. : 3.539230 Max. :6.000 Max. :1.000
## Fdistr poisson
## Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.1274 1st Qu.: 3.000
## Median : 0.6643 Median : 5.000
## Mean : 6.0563 Mean : 4.973
## 3rd Qu.: 2.6684 3rd Qu.: 6.000
## Max. :937.0277 Max. :14.000
Plot histograms of these draws, using breaks = 25
for the number of breakpoints in the histogram:
hist(draw.norm, breaks = 25)
hist(draw.t, breaks = 25)
hist(draw.unif, breaks = 25)
hist(draw.binom, breaks = 25)
hist(draw.F, breaks = 25)
hist(draw.pois, breaks = 25)
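To view all six histograms at once, you could arrange them in a single plotting grid with par(mfrow = ...) (one of several options; packages like ggplot2 offer alternatives):

```r
par(mfrow = c(2, 3))  # 2 rows by 3 columns of plots
hist(draw.norm, breaks = 25)
hist(draw.t, breaks = 25)
hist(draw.unif, breaks = 25)
hist(draw.binom, breaks = 25)
hist(draw.F, breaks = 25)
hist(draw.pois, breaks = 25)
par(mfrow = c(1, 1))  # reset to a single plot per device
```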
Draw two correlated variables with n = 1000
from a
multivariate normal distribution, both with mean 5
and
variance 1, and make sure that they correlate \(\rho = .8\).
#you need function rmvnorm from package mvtnorm
sigma <- matrix(c(1, .8, .8, 1), nrow = 2, ncol = 2) #variance/covariance matrix
data <- mvtnorm::rmvnorm(n = 1000, sigma = sigma, mean = c(5, 5))
colMeans(data)
## [1] 4.995531 5.002532
var(data)
## [,1] [,2]
## [1,] 0.9574800 0.7646658
## [2,] 0.7646658 0.9700879
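The variance/covariance output above is close to sigma. Because both variances equal 1, the covariance approximately equals the correlation, but we can also check the correlation directly with cor():

```r
cor(data)                  # off-diagonal entries should be close to the requested .8
cor(data[, 1], data[, 2])  # the correlation between the two columns
```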
End of Practical