Generating Data

Statistical Programming with R

Topics

Drawing data from distributions
Sampling, random seed, pseudo-random number generator

Packages and functions that we use

library(mvtnorm)  # Multivariate Normal tools

table(): Build a contingency table out of counts
hist(): Draw a histogram
mvtnorm::rmvnorm(): Drawing values from a multivariate normal distribution
sample(): Probabilistic sampling with or without replacement
rchisq(): Probabilistic sampling from a \(\chi^2\) distribution
rpois(): Probabilistic sampling from a poisson distribution
set.seed(123): Fixing the Random Number Generator seed

Creating our own data

We saw that we could create vectors

c(1, 2, 3, 4, 3, 2, 1)

## [1] 1 2 3 4 3 2 1

c(1:4, 3:1)

## [1] 1 2 3 4 3 2 1

We could create matrices

matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15

Creating our own data

Create a vector from a matrix:

mat <- matrix(data=1:18, nrow=3, ncol=6)
mat

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   10   13   16
## [2,]    2    5    8   11   14   17
## [3,]    3    6    9   12   15   18

c(mat)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

But we can also draw a sample

Of the same size,

sample(x=mat)

##  [1] 15  7 16 11  4 13  9  6  8 10  3 12 17  5 18 14  2  1

# Besides x=mat, no other arguments are specified, therefore the function uses 
# the defaults:
# If size is not specified, size = number of elements in x
# replace: default value is FALSE

smaller size,

sample(mat, size = 5)

## [1] 15  5  9 18 16

or larger size

sample(mat, size = 50, replace = TRUE)

##  [1] 10 15 13  3 16 14 14 10 14  8 12  7 12 10 18  1 11  7  3  6 17  9 13  8  6
## [26]  8 14 15 15  2  1 16 12  1 17  8 11  4 11  1  4 11 11  1 10 17  4  6 13  7

We can do random sampling

hist(sample(mat, size = 50000, replace=TRUE), breaks = 0:18)

Or nonrandom sampling

probs <- c(rep(.01, 15), .1, .25, .50)
probs

##  [1] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
## [16] 0.10 0.25 0.50

hist(sample(mat, size = 50000, prob = probs, replace=TRUE), breaks = 0:18)

We can replicate individual samples

set.seed(123)
sample(mat, size = 50, replace = TRUE)

##  [1] 15 14  3 10 18 11  5 14  5  9  3  8  7 10  9  4 14 17 11  7 12 15 10 13  7
## [26]  9  9 10  7  6  2  5  8 12 13 18  1  6 15  9 15 16  6 11  8  7 16 17 18 17

set.seed(123)
sample(mat, size = 50, replace = TRUE)

##  [1] 15 14  3 10 18 11  5 14  5  9  3  8  7 10  9  4 14 17 11  7 12 15 10 13  7
## [26]  9  9 10  7  6  2  5  8 12 13 18  1  6 15  9 15 16  6 11  8  7 16 17 18 17

We can replicate a chain of samples

set.seed(123)
sample(mat, size = 5, replace = TRUE)

## [1] 15 14  3 10 18

sample(mat, size = 7, replace = TRUE)

## [1] 11  5 14  5  9  3  8

set.seed(123)
sample(mat, size = 5, replace = TRUE)

## [1] 15 14  3 10 18

sample(mat, size = 7, replace = TRUE)

## [1] 11  5 14  5  9  3  8

The random seed, pseudo-random numbers

The random seed is a number used to initialize the pseudo-random number generator

If replication is needed, pseudo-random number generators must be used

Pseudo-random number generators generate a sequence of numbers determined by an initial value
The properties of generated number sequences approximates the properties of random number sequences
But the sequence is predictable and therefore not truly random

The random seed, pseudo-random numbers

For truly random numbers, visit https://www.random.org/ (random numbers are generated based on atmosperic noise)

The initial value (the seed) itself does not need to be random.

The resulting process is (pseudo-)random because the seed value is not used to generate randomness
It merely forms the starting point of the algorithm.

Why fix the random seed

When an R instance is started, there is initially no seed. In that case, R will create one from the current time and process ID.

Hence, different sessions will give different results when random numbers are involved.
When you store a workspace and reopen it, you will continue from the seed specified within the workspace.

If we fix the random seed we can exactly replicate the random process

If the method has not changed: the results of the process will be identical when using the same seed.

Replications allow for verification
But beware: the process depends on the seed
- The results obtained could theoretically be extremely rare and would not have occurred with every other potential seed
- Run another seed before publishing your results

Random seeds

Drawing data

We can draw data from a standard normal distribution

hist(rnorm(1000, mean = 0, sd = 1))

Drawing data

We can draw data from a specific normal distribution

hist(rnorm(1000, mean = 50, sd = 15))

Drawing data: many values

hist(rnorm(100000, mean = 50, sd = 15))

Drawing data: other distribution

\(\chi^2\)-distribution with \(df = 1\) degrees of freedom

hist(rchisq(500, df = 1))

Drawing data: other distribution

\(\chi^2\)-distribution with \(df = 100\) degrees of freedom

hist(rchisq(500, df = 100))

Drawing data: other distribution

Poisson distribution with mean \(\lambda = 1\)

hist(rpois(5000, lambda = 1))

Drawing data: other distribution

Poisson distribution with mean \(\lambda = 10\)

hist(rpois(5000, lambda = 100))

Drawing data: other distribution

Binomial distribution: 1000 draws of 1 trial with \(P(\text{success}=.75)\)

run <- rbinom(1000, size = 1, prob = .75) 
table(run)

## run
##   0   1 
## 247 753

Drawing data: other distribution

Binomial distribution: 10.000 draws of 60 trials with \(P(\text{success}=.75)\)

run <- rbinom(10000, size = 60, prob = .75)
table(run)

## run
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47 
##    1    7   10   21   29   67  137  257  395  615  783  981 1096 1195 1136 1053 
##   48   49   50   51   52   53   54   55   56 
##  836  572  376  200  136   63   25    7    2

Drawing multivariate distributions

sigma <- matrix(c(30, 15, 15, 30), 2, 2)
sigma

##      [,1] [,2]
## [1,]   30   15
## [2,]   15   30

means <- c(97, 63)
data <- rmvnorm(n = 1000, mean = means, sigma = sigma)
colMeans(data)

## [1] 96.86905 62.77418

var(data)

##          [,1]     [,2]
## [1,] 26.98297 14.28588
## [2,] 14.28588 31.07523

Bivariate normal

plot(data)

Drawing multivariate distributions

sigma <- matrix(c(1, 0, 0, 1), 2, 2)
sigma

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

means <- c(0, 0)
data <- rmvnorm(n = 1000, mean = means, sigma = sigma)
colMeans(data)

## [1] -0.004211069  0.007496766

var(data)

##              [,1]         [,2]
## [1,]  0.869919122 -0.002239893
## [2,] -0.002239893  1.074986477

Bivariate normal

plot(data)

Histograms

par(mfrow = c(1, 2))
hist(data[, 1])
hist(data[, 2])

Practical

Practical B

Open the Practical B Solutions file to check the answers.
Open the Practical B Template file to write the code.