This morning we have learned the basics of programming in R:
- assign elements to objects with <- (alt/option -)
- work with RStudio and R Markdown
- run code
- organize your work with projects in RStudio
Statistical Programming with R
When RStudio and R start, only the base packages are activated: the basic installation with basic functionality. Use sessionInfo() to see which packages are active.
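As a minimal sketch, the object returned by sessionInfo() can also be queried from code. Which packages are attached depends on your own session, so no output is shown here.

```r
# sessionInfo() returns a list; basePkgs holds the attached base packages
info <- sessionInfo()
base_pkgs <- info$basePkgs
base_pkgs

# search() lists everything on the search path, including attached packages
search()
```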
Packages are like apps on your mobile phone.
The easiest way to install a package, e.g. mice, is to use:
install.packages("mice")
Alternatively, you can do it in RStudio through:
Tools -> Install Packages
For an overview of the packages you have installed, see the “Packages” tab in the output pane.
There are two ways to load a package in R:
library(mice)
and
require(mice)
When a package is not found (not installed):
- require() will produce a warning but will continue to run the rest of the code.
- library() will produce an error and stop running the rest of the code.
Everything that is published on the Comprehensive R Archive Network (CRAN) and is aimed at R users must be accompanied by a help file.
Look up a function in the search bar of the output pane, or in the console:
- help(sample) or ?sample (opens a help window)
- help(package = mice) for packages
When you type sample in the console or editor (Markdown code chunk), a pop-up window appears with help about the structure of the function.
To search by topic, type your search term in the search bar of the output pane, or type ?? followed by your search term in the console: ??anova returns a list of all help pages that contain the word ‘anova’.
Some packages have cheat sheets; in RStudio, see Help menu -> Cheat Sheets.
Google the search term(s) and add ‘R’ as keyword.
Helpful websites: http://www.stackoverflow.com and http://www.stackexchange.com
Functions are the building blocks of R
Functions are either built-in or user-defined (you can program your own functions).
To use a function, type the function name with parentheses: mean()
Typing the name of the function without the parentheses reveals the code of the function.
Every function in R has the following structure:
Image source: Garrett Grolemund, Hands-On Programming with R, 2.6
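The structure can also be read off a small user-defined function: a name, arguments (optionally with default values) and a body. The function roll() and its argument n_dice are hypothetical names chosen for this sketch.

```r
# name <- function(arguments) { body }
roll <- function(n_dice = 2) {
  # body: draw n_dice values from 1 to 6, with replacement
  dice <- sample(1:6, size = n_dice, replace = TRUE)
  # the value of the last expression is returned
  sum(dice)
}

roll()            # uses the default n_dice = 2
roll(n_dice = 3)  # overrides the default
roll              # without parentheses: prints the code of the function
```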
When you want to use a function in R, you need to know which information you need to provide to the function.
For example, the function sample().
Use args(<function name>) to obtain info about the arguments and the default values:
args(sample)
## function (x, size, replace = FALSE, prob = NULL)
## NULL
Or make use of the pop-up help and use the TAB key to cycle through the arguments.
Pressing F1 opens the help file of the function sample().
Now we can use the function to, for example, mimic the sampling of two dice.
dice <- sample(1:6, size=2, replace=TRUE)
dice
## [1] 3 6
- x represents the items to sample from (the range of possible items), in this case the numbers 1 to 6 (the pips of a single die).
- size is the number of items to choose, in this case 2.
- replace=TRUE means sampling with replacement.
Will the function work if we leave out the argument names and give only the values?
dice <- sample(1:6, 2, TRUE)
dice
## [1] 3 2
And if we change the order of the values?
dice <- sample(2, 1:6, TRUE)
## Error in sample.int(x, size, replace, prob): invalid 'size' argument
dice
## [1] 3 2
The assignment failed, so dice still holds its previous value. Changing the order is possible only when the arguments are named.
dice <- sample(size=2, x=1:6, replace=TRUE)
dice
## [1] 2 6
Recommendation: type out the arguments and their values. This prevents errors and increases the readability of your code.
A vector is an indexed set of values (for example, a sequence of numbers) and has one dimension. The simplest vector has 1 element.
c() combines values into a vector:
v1 <- c(3)
v1
## [1] 3
v2 <- c(1:12)
v2
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
Vectors can have the following atomic modes: integer, numeric (double), character, logical and complex.
Numeric (double):
v3 <- as.numeric(100:110)  # 100:110 on its own is stored as integer
v3
## [1] 100 101 102 103 104 105 106 107 108 109 110
Integer:
v4 <- c(1L:12L)
v4
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
Character:
v5 <- c(letters[21:26])
v5
## [1] "u" "v" "w" "x" "y" "z"
names <- c("Mike", "Anne", "George")
names
## [1] "Mike" "Anne" "George"
Logical:
v6 <- c(TRUE, FALSE)
v6
## [1] TRUE FALSE
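To check which mode a vector has, typeof() can be used. The sketch below uses small throwaway vectors rather than the objects defined above.

```r
typeof(1:3)             # "integer": the colon operator creates integers
typeof(c(1, 2, 3))      # "double": plain numbers are stored as doubles
typeof(c("a", "b"))     # "character"
typeof(c(TRUE, FALSE))  # "logical"
typeof(1i)              # "complex"
```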
With c()
vector <- c(25:30)
vector
## [1] 25 26 27 28 29 30
Simple replication with rep()
rep(1:2, 3)
## [1] 1 2 1 2 1 2
Or more complex:
rep(c("A", "B"), c(2, 3))
## [1] "A" "A" "B" "B" "B"
rep(c("A", "B"), each=3)
## [1] "A" "A" "A" "B" "B" "B"
Sequence of numbers with seq()
seq(from=2, to=10, by=2)
## [1] 2 4 6 8 10
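A few further variations, as a sketch: seq() can also be given the desired number of values via length.out, and rep() can combine each and times.

```r
seq(from = 0, to = 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
rep(1:2, each = 2, times = 2)
## [1] 1 1 2 2 1 1 2 2
seq_len(4)
## [1] 1 2 3 4
```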
matrix() creates a matrix with the specified dimensions, e.g. a column vector:
rvect <- matrix(data=vector, nrow=6, ncol=1)
rvect
##      [,1]
## [1,]   25
## [2,]   26
## [3,]   27
## [4,]   28
## [5,]   29
## [6,]   30
dim(rvect)
## [1] 6 1
A matrix:
(M1 <- matrix(v2, nrow=3, ncol=4))
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
dim(M1)
## [1] 3 4
(M2 <- matrix(v2, nrow=4, ncol=3))
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
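Note that matrix() fills column by column by default. As a sketch, the byrow argument switches to row-wise filling (m_col and m_row are names chosen for this example):

```r
m_col <- matrix(1:6, nrow = 2, ncol = 3)                # filled by column
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # filled by row
m_col[1, ]
## [1] 1 3 5
m_row[1, ]
## [1] 1 2 3
```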
Vectors and matrices can only hold one data type. Remember, matrices and vectors are numerical OR character objects. They can never contain both and still be used for numerical calculations.
vector
## [1] 25 26 27 28 29 30
v5
## [1] "u" "v" "w" "x" "y" "z"
(newvect <- c(vector, v5))
## [1] "25" "26" "27" "28" "29" "30" "u" "v" "w" "x" "y" "z"
M <- matrix(cbind(vector, v5), nrow=6, ncol=2)
M
##      [,1] [,2]
## [1,] "25" "u"
## [2,] "26" "v"
## [3,] "27" "w"
## [4,] "28" "x"
## [5,] "29" "y"
## [6,] "30" "z"
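What happens here is coercion: the numbers are silently converted to character strings. A short sketch of the consequence, using a new hypothetical vector mixed:

```r
mixed <- c(25, 26, "u")
typeof(mixed)
## [1] "character"

# Arithmetic on the character vector fails, even for the "numeric" elements
result <- tryCatch(mixed[1] + 1, error = function(e) "error")
result
## [1] "error"

# Converting back yields NA (with a warning) where no number can be recovered
suppressWarnings(as.numeric(mixed))
## [1] 25 26 NA
```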
Lists are flexible data structures: the elements in a list may be a combination of different data types (numeric, character) and dimensions.
L <- list(names, vector, M)
L
## [[1]]
## [1] "Mike" "Anne" "George"
##
## [[2]]
## [1] 25 26 27 28 29 30
##
## [[3]]
##      [,1] [,2]
## [1,] "25" "u"
## [2,] "26" "v"
## [3,] "27" "w"
## [4,] "28" "x"
## [5,] "29" "y"
## [6,] "30" "z"
Assign names to the elements of a list with names(). Notice the $ in the output.
names(L) <- c("Names", "Numbers", "Matrix")
L
## $Names
## [1] "Mike" "Anne" "George"
##
## $Numbers
## [1] 25 26 27 28 29 30
##
## $Matrix
##      [,1] [,2]
## [1,] "25" "u"
## [2,] "26" "v"
## [3,] "27" "w"
## [4,] "28" "x"
## [5,] "29" "y"
## [6,] "30" "z"
A data frame is the R representation of a rectangular data set where the rows are the observations and the columns the variables.
Data frames can contain both numerical and character column vectors at the same time, although never in the same column.
D <- data.frame("V1" = rnorm(5), "V2" = rnorm(5, mean = 5, sd = 2), "V3" = letters[1:5])
D
##           V1       V2 V3
## 1  0.1292877 4.108676  a
## 2  1.7150650 7.448164  b
## 3  0.4609162 5.719628  c
## 4 -1.2650612 5.801543  d
## 5 -0.6868529 5.221365  e
We ‘filled’ a data frame with two randomly generated sets from the normal distribution, where \(V1\) is standard normal and \(V2 \sim N(5, 2^2)\) (mean 5, standard deviation 2), and a character set.
You can name the rows of a data frame with row.names() (and the columns with names() or colnames()):
row.names(D) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
D
##               V1       V2 V3
## row 1  0.1292877 4.108676  a
## row 2  1.7150650 7.448164  b
## row 3  0.4609162 5.719628  c
## row 4 -1.2650612 5.801543  d
## row 5 -0.6868529 5.221365  e
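To inspect the structure of a data frame, a few base functions are useful. The sketch below builds its own small data frame df (a hypothetical name) with a fixed seed, so it does not depend on the random values shown above.

```r
set.seed(123)  # arbitrary seed, only to make the example reproducible
df <- data.frame(V1 = rnorm(5), V2 = rnorm(5, mean = 5, sd = 2), V3 = letters[1:5])
dim(df)            # number of rows and columns
## [1] 5 3
names(df)          # column names
## [1] "V1" "V2" "V3"
sapply(df, class)  # one class per column
str(df)            # compact overview of the whole object
```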
Factors are used to represent categorical data (ordered or unordered).
A factor is a vector with integers where each integer has a label.
Factors facilitate interpretation of results in statistical modeling: a variable with labels “male”, “female” is self-describing compared to a variable with values 1, 2.
Factors are very useful in statistical modeling (linear models, GLM) where they facilitate the dummy coding process of categorical variables.
Factor objects can be created with the factor() function.
x <- factor(c("male", "male", "female", "male", "female"))
x
## [1] male   male   female male   female
## Levels: female male
Obtain the summary of the factor:
summary(x)
## female   male
##      2      3
Factors are integer vectors where each integer has a label (levels):
typeof(x)
## [1] "integer"
attributes(x)
## $levels
## [1] "female" "male"
##
## $class
## [1] "factor"
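As a sketch of how numeric codes can be turned into a self-describing factor, the levels and labels arguments of factor() can be used. The codes below are made up for this example.

```r
codes <- c(1, 1, 2, 1, 2)
sex <- factor(codes, levels = c(1, 2), labels = c("male", "female"))
sex
## [1] male   male   female male   female
## Levels: male female
as.integer(sex)  # the underlying integer codes
## [1] 1 1 2 1 2
levels(sex)
## [1] "male"   "female"
```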
In the basic installation of R (“base R”) there are three ways to select elements from vectors, matrices, lists and data frames:
- []
- [[]]
- $
Square brackets [] are used to call single elements or entire rows and columns.
[a, b]: a refers to the row number(s), b refers to the column number(s).
M <- matrix(rnorm(12), nrow=3, ncol=4)
M
##            [,1]       [,2]       [,3]       [,4]
## [1,] -0.5558411 -1.9666172 -1.0678237 -0.7288912
## [2,]  1.7869131  0.7013559 -0.2179749 -0.6250393
## [3,]  0.4978505 -0.4727914 -1.0260044 -1.6866933
M[2, 3]
## [1] -0.2179749
Also for data frames:
D
##               V1       V2 V3
## row 1  0.1292877 4.108676  a
## row 2  1.7150650 7.448164  b
## row 3  0.4609162 5.719628  c
## row 4 -1.2650612 5.801543  d
## row 5 -0.6868529 5.221365  e
D[2, 3] # Select element "b"
## [1] "b"
D[2, ] # Select second row
##             V1       V2 V3
## row 2 1.715065 7.448164  b
D[, 1] # Select first column
## [1] 0.1292877 1.7150650 0.4609162 -1.2650612 -0.6868529
D[2:3, 2] # Select second and third row in second column
## [1] 7.448164 5.719628
D[1, c(2,3)] # Select elements in the first row, second and third column
##             V2 V3
## row 1 4.108676  a
D[ , -3] # Select all rows and leave out the third column.
##               V1       V2
## row 1  0.1292877 4.108676
## row 2  1.7150650 7.448164
## row 3  0.4609162 5.719628
## row 4 -1.2650612 5.801543
## row 5 -0.6868529 5.221365
D[2:3, -c(3)] # Select the second and third row minus the third column
##              V1       V2
## row 2 1.7150650 7.448164
## row 3 0.4609162 5.719628
The [[]] operator selects only one element:
L[[1]]
## [1] "Mike" "Anne" "George"
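The difference between [] and [[]] on a list can be sketched as follows, using a small hypothetical list lst: single brackets return a list, double brackets return the element itself.

```r
lst <- list(Names = c("Mike", "Anne"), Numbers = 25:30)
class(lst[1])        # a list of length one
## [1] "list"
class(lst[[1]])      # the character vector itself
## [1] "character"
lst[["Numbers"]][2]  # [[]] also accepts a name; then select within the element
## [1] 26
```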
Use $ to select elements with name labels in lists or data frames:
L$Names
## [1] "Mike" "Anne" "George"
Use $ to select a variable in a data frame:
D$V3
## [1] "a" "b" "c" "d" "e"
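For data frames, the same column can be reached in several equivalent ways. A sketch with a small hypothetical data frame df:

```r
df <- data.frame(V1 = c(0.5, 1.5, 2.5), V3 = c("a", "b", "c"))
identical(df$V3, df[["V3"]])
## [1] TRUE
identical(df$V3, df[, "V3"])
## [1] TRUE
# drop = FALSE keeps the result a data frame instead of a vector
class(df[, "V3", drop = FALSE])
## [1] "data.frame"
```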
Logical operators are signs that evaluate a statement, such as ==, <, >, <=, >=, | (OR) and & (AND). Typing ! before a logical expression negates it (takes its complement).
For example, if we would like to select elements of vector v that are larger than 6, we would type:
v <- c(1:12)
v
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
v[v > 6]
## [1] 7 8 9 10 11 12
v
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
v > 6
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
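The comparison v > 6 returns a logical vector with one TRUE or FALSE per element. When such a selection is applied to a matrix, the number of TRUE values may differ per column, so R returns a vector rather than a matrix. The TRUE and FALSE values serve as indicators to select the elements in v larger than 6.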
v[v > 6]
## [1] 7 8 9 10 11 12
Symbol | Meaning |
---|---|
! | logical not |
& | logical and |
\| | logical or |
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | logical equals |
!= | not equal |
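Combining these operators, as a sketch on the vector 1 to 12 (redefined here so the block is self-contained):

```r
v <- 1:12
v[v > 6 & v < 10]  # both conditions must hold
## [1] 7 8 9
v[v < 3 | v > 10]  # at least one condition must hold
## [1]  1  2 11 12
v[!(v > 6)]        # negation: elements not larger than 6
## [1] 1 2 3 4 5 6
which(v > 6)       # positions of the TRUE values
## [1]  7  8  9 10 11 12
```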
In R there are two types of numbers: integers and floating point numbers. Since computer memory is limited, numbers cannot be stored with infinite precision, so real numbers are represented as floating point numbers. Most decimal fractions cannot be represented exactly as floating point numbers.
(3 - 2.9)
## [1] 0.1
(3 - 2.9) <= 0.1
## [1] FALSE
Why does R tell us that 3 - 2.9 ≠ 0.1?
(3 - 2.9) - 0.1
## [1] 8.326673e-17
Let’s have a look at how the decimal fractions are actually represented as floating point numbers. You can see this by asking for a representation with 54 decimals.
sprintf("%.54f",3 - 2.9)
## [1] "0.100000000000000088817841970012523233890533447265625000"
sprintf("%.54f",0.1)
## [1] "0.100000000000000005551115123125782702118158340454101562"
The difference of 8.326673e-17 stems from the rounding error made when storing 2.9: it is smaller than the gap between adjacent floating point numbers near 2.9 (about 4.4e-16).
The machine epsilon in R, the smallest positive floating point number x such that 1 + x != 1, is 2.220446e-16:
(3 - 2.9) - 0.1
## [1] 8.326673e-17
.Machine$double.eps
## [1] 2.220446e-16
You can verify whether the absolute difference between two floating point numbers is smaller than the machine epsilon (2.220446e-16).
Or use the all.equal() function, which tests for near equality with a default tolerance of about 1.5e-8 (the square root of .Machine$double.eps).
abs((3 - 2.9) - 0.1) < .Machine$double.eps
## [1] TRUE
all.equal((3 - 2.9), 0.1)
## [1] TRUE
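In practice, a comparison with a tolerance is safer than ==. A minimal sketch; the tolerance 1e-8 is an arbitrary choice for this example:

```r
a <- 3 - 2.9
b <- 0.1
a == b                   # exact comparison fails
## [1] FALSE
abs(a - b) < 1e-8        # comparison with an explicit tolerance
## [1] TRUE
isTRUE(all.equal(a, b))  # all.equal() wrapped in isTRUE() is safe inside if()
## [1] TRUE
```

all.equal() returns a character description rather than FALSE when the values differ, which is why it is wrapped in isTRUE() here.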
Go to the course website and download the file “Practical B: template” (a Markdown file).
Save the file in the project folder you created for this course and, if necessary, open the R Project by clicking on the .Rproj file.
Do the exercises, if possible without looking at the answers in the file “Practical B: solutions”.
In any case, ask for help when you need it.