In this practical exercise we are going to play around with the different types of elements in R.
Go to the course website and download the file “Practical B: template” (a Markdown file).
Save the template file in the project folder you created for this
course, and, if necessary, open the R Project by clicking on the
.Rproj file.
Open the file “Practical B: Exercises”. Do the exercises, if possible without looking at the answers in the file “Practical B: solutions”.
In any case, ask for help whenever you feel you need it.
Create two vectors: one named vec1 with values 1 through 6 and one
named vec2 with letters A through F.
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c("A", "B", "C", "D", "E", "F")
To create a vector we used c(), which stands for ‘concatenation’. It
simply joins a series of numbers or letters into a single vector.
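As an aside, the same two vectors can be built more compactly. The sketch below uses the colon operator and the built-in constant LETTERS; the names vec1_alt and vec2_alt are hypothetical and are not part of the exercise.

```r
# Hypothetical names, so vec1 and vec2 from the exercise stay untouched
vec1_alt <- 1:6            # integer sequence 1, 2, ..., 6
vec2_alt <- LETTERS[1:6]   # first six upper-case letters

# Compare with the vectors built by c()
all(vec1_alt == c(1, 2, 3, 4, 5, 6))
identical(vec2_alt, c("A", "B", "C", "D", "E", "F"))

# 1:6 is stored as integer, whereas c(1, 2, ...) is stored as double
typeof(vec1_alt)
typeof(c(1, 2, 3, 4, 5, 6))
```

For the purposes of this practical the difference between integer and double storage does not matter; both are numeric.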
Create two matrices, one from vec1 and one from vec2. The dimensions
for both matrices are 3 rows by 2 columns. Find the function to create
a matrix by typing ?matrix. Notice that when you start typing ?matrix
in the code chunk, a pop-up window appears with information about the
function.
?matrix
mat1 <- matrix(vec1, nrow = 3, ncol = 2)
mat2 <- matrix(vec2, nrow = 3, ncol = 2)
To create a matrix we used matrix(). For a matrix we need to specify
the dimensions (in this case 3 rows and 2 columns), and the length of
the input (in this case vec1 or vec2) needs to match these dimensions.
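By default matrix() fills the matrix column by column. A minimal sketch of the difference with row-wise filling, using the byrow argument of matrix(); the object names mat_by_col and mat_by_row are chosen here for illustration only.

```r
vec1 <- c(1, 2, 3, 4, 5, 6)

# Default: values run down the first column, then down the second
mat_by_col <- matrix(vec1, nrow = 3, ncol = 2)

# byrow = TRUE: values run along the first row, then along the second, and so on
mat_by_row <- matrix(vec1, nrow = 3, ncol = 2, byrow = TRUE)

mat_by_col[1, ]   # first row holds 1 and 4
mat_by_row[1, ]   # first row holds 1 and 2
dim(mat_by_row)   # still 3 rows by 2 columns
```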
vec1
## [1] 1 2 3 4 5 6
vec2
## [1] "A" "B" "C" "D" "E" "F"
mat1
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
mat2
##      [,1] [,2]
## [1,] "A"  "D" 
## [2,] "B"  "E" 
## [3,] "C"  "F"
vec1 and mat1 contain numbers, whereas vec2 and mat2 contain characters.
Create a matrix from both vec1 and vec2 with 6 rows and 2 columns.
Inspect this matrix.
mat3 <- matrix(c(vec1, vec2), 6, 2)
mat3
##      [,1] [,2]
## [1,] "1"  "A" 
## [2,] "2"  "B" 
## [3,] "3"  "C" 
## [4,] "4"  "D" 
## [5,] "5"  "E" 
## [6,] "6"  "F"
or
mat3b <- cbind(vec1, vec2)
is.matrix(mat3b)
## [1] TRUE
mat3b
##      vec1 vec2
## [1,] "1"  "A" 
## [2,] "2"  "B" 
## [3,] "3"  "C" 
## [4,] "4"  "D" 
## [5,] "5"  "E" 
## [6,] "6"  "F"
If one or more elements in the matrix are characters, all other
elements are converted to characters as well. A matrix can hold only
one type of element: either all numeric or all character. Notice that
the second approach (the column bind approach from mat3b) returns a
matrix where the column names are already set to the names of the
bound objects.
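The conversion to characters can be checked directly. The following is a small sketch with hypothetical objects (mixed, num_mat, chr_mat) that are not part of the exercise.

```r
# Combining a number and a letter yields a character vector
mixed <- c(1, "A")
class(mixed)

# A purely numeric matrix versus a matrix with one character column
num_mat <- matrix(1:6, nrow = 3)
chr_mat <- cbind(1:3, c("A", "B", "C"))
typeof(num_mat)   # "integer"
typeof(chr_mat)   # "character": the numbers were converted too

# Arithmetic works on the numeric matrix only
num_mat * 2

# The numbers stored as characters can be converted back explicitly
as.numeric(chr_mat[, 1])
```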
To solve the problem of numbers represented as characters we can
create a dataframe. A dataframe is essentially a matrix in which each
column can have its own type, so numeric and character columns can be
combined. The use of a dataframe is often preferred over the
use of a matrix in R, except for purposes where pure
numerical calculations are done, such as in matrix algebra. However,
most datasets do contain character information and a dataframe would
normally be your preferred choice when working with your own collected
datasets in R.
Create a dataframe called dat3 where vec1 and vec2 are both columns.
Name the columns V1 and V2, respectively. Use the function
data.frame().
dat3 <- data.frame(V1 = vec1, V2 = vec2)
dat3
##   V1 V2
## 1  1  A
## 2  2  B
## 3  3  C
## 4  4  D
## 5  5  E
## 6  6  F
str() function. Try this function on dat3.
You can inspect the structure of a dataframe (or other R object) by
using the str() function or by clicking on the object in the
Environment tab in RStudio, which unfolds the properties of each
element.
Try both ways of inspecting the structure of dat3.
str(dat3)
## 'data.frame':    6 obs. of  2 variables:
##  $ V1: num  1 2 3 4 5 6
##  $ V2: chr  "A" "B" "C" "D" ...
Inspecting the structure of your data is vital, as you have probably
imported your data from some other source. If we, at a later stage,
start analyzing our data without the correct measurement level, we may
run into problems. One problem that often occurs is that categorical
variables (factors in R) are not coded as such.
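Besides str(), the class of every column can be listed in one call. A minimal sketch, rebuilding dat3 so the block runs on its own; stringsAsFactors = FALSE is set explicitly as an assumption, so that the result does not depend on the R version.

```r
dat3 <- data.frame(V1 = c(1, 2, 3, 4, 5, 6),
                   V2 = c("A", "B", "C", "D", "E", "F"),
                   stringsAsFactors = FALSE)

# Apply class() to each column and collect the results in a named vector
col_classes <- sapply(dat3, class)
col_classes

# A quick check for one specific column
is.factor(dat3$V2)
```

Here V2 is a character column and not a factor, which is exactly the kind of coding issue worth spotting before an analysis.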
dat3 that you have created in Question 4.
dat3[3, ] #3rd row
##   V1 V2
## 3  3  C
dat3[, 2] #2nd column
## [1] "A" "B" "C" "D" "E" "F"
dat3$V2   #also 2nd column
## [1] "A" "B" "C" "D" "E" "F"
dat3[3, 2] #intersection
## [1] "C"
The [3, 2] index is very useful in R. The first number (before the
comma) represents the row and the second number (after the comma)
represents the column. A vector has only one dimension, so only a
single index can be used. For example, vec1[3] would yield 3. Try it.
Columns can also be called by the $ sign, but only if a
name has been assigned. With dataframes assigning names happens
automatically.
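A few further indexing variants, shown as a sketch on a rebuilt copy of dat3 so that the block is self-contained:

```r
dat3 <- data.frame(V1 = c(1, 2, 3, 4, 5, 6),
                   V2 = c("A", "B", "C", "D", "E", "F"),
                   stringsAsFactors = FALSE)

dat3[3, "V2"]     # same cell as dat3[3, 2], with the column called by name
dat3[["V2"]]      # same column as dat3$V2
dat3[c(1, 3), ]   # several rows at once: rows 1 and 3
dat3[-1, ]        # a negative index drops a row: everything except row 1
```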
V1 in our dataframe dat3 is not coded correctly, but actually
represents grouping information about cities. Convert the variable to a
factor and add the labels Utrecht, New York, London, Singapore, Rome and
Cape Town.
dat3$V1 <- factor(dat3$V1, labels = c("Utrecht", "New York", "London", "Singapore", "Rome", "Cape Town"))
dat3
##          V1 V2
## 1   Utrecht  A
## 2  New York  B
## 3    London  C
## 4 Singapore  D
## 5      Rome  E
## 6 Cape Town  F
You can verify the changes with str() or by inspecting the object dat3
in the RStudio Environment tab.
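How a factor stores its information can be seen in a smaller, hypothetical example (the objects codes and city are assumptions for illustration): the labels are kept once as levels, and each observation only holds an integer code pointing to a level.

```r
# Numeric codes where each city occurs more than once
codes <- c(1, 2, 3, 2, 1)
city <- factor(codes, levels = c(1, 2, 3),
               labels = c("Utrecht", "New York", "London"))

levels(city)       # the labels, in the order of the levels
as.integer(city)   # the underlying integer codes
table(city)        # frequency per city
```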
boys.RData.
There are two ways to open workspaces that are available on the
internet. The first is to download the boys.RData file from the course
page and put it in the project folder. Then either run the code below
load("boys.RData")
or double-click the boys.RData file on your machine (right-click and
open with RStudio if it opens in R instead of RStudio by default).
Alternatively, you can import workspaces directly from the internet by
opening a connection and loading it:
con <- url("https://www.gerkovink.com/fundamentals/data/boys.RData")
load(con)
In the above code we store the connection in the object con and then
load the connection with load(con).
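To see what load() actually does without needing an internet connection, the sketch below saves a small hypothetical dataframe (demo_dat) to a temporary .RData file and loads it back. load() invisibly returns the names of the objects it restored, which is handy for checking what a workspace contains.

```r
# Write a small, hypothetical workspace to a temporary file
tmp_file <- tempfile(fileext = ".RData")
demo_dat <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
save(demo_dat, file = tmp_file)

# Remove the object, then restore it from the file
rm(demo_dat)
loaded_names <- load(tmp_file)

loaded_names         # names of the restored objects
exists("demo_dat")   # the object is available again
```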
boys dataset (it is from package mice, by the way) by typing boys in
the console, and subsequently by using the function View().
The output is not displayed here as the data set is simply too large.
Using View() is preferred for inspecting large datasets. View() opens
the dataset in a spreadsheet-like window (similar to MS Excel or SPSS).
If you View() your own datasets, you cannot edit their contents.
boys data set and inspect the first and final 6 cases in the data set.
To do it numerically, find out what the dimensions of the boys dataset are.
dim(boys)
## [1] 748   9
There are 748 cases on 9 variables. To select the first and last six cases, use
boys[1:6, ]
##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
boys[743:748, ]
##         age   hgt  wgt   bmi   hc  gen  phb tv   reg
## 7410 20.372 188.7 59.8 16.79 55.2 <NA> <NA> NA  west
## 7418 20.429 181.1 67.2 20.48 56.6 <NA> <NA> NA north
## 7444 20.761 189.1 88.0 24.60   NA <NA> <NA> NA  west
## 7447 20.780 193.5 75.4 20.13   NA <NA> <NA> NA  west
## 7451 20.813 189.0 78.0 21.83 59.9 <NA> <NA> NA north
## 7475 21.177 181.8 76.5 23.14   NA <NA> <NA> NA  east
or, more efficiently:
head(boys)
##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
tail(boys)
##         age   hgt  wgt   bmi   hc  gen  phb tv   reg
## 7410 20.372 188.7 59.8 16.79 55.2 <NA> <NA> NA  west
## 7418 20.429 181.1 67.2 20.48 56.6 <NA> <NA> NA north
## 7444 20.761 189.1 88.0 24.60   NA <NA> <NA> NA  west
## 7447 20.780 193.5 75.4 20.13   NA <NA> <NA> NA  west
## 7451 20.813 189.0 78.0 21.83 59.9 <NA> <NA> NA north
## 7475 21.177 181.8 76.5 23.14   NA <NA> <NA> NA  east
The functions head() and tail() are very useful. For example, from the
output of both functions we can observe that the data are very likely
sorted on age.
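Both functions take an argument n that sets the number of rows returned. A short sketch on a hypothetical dataframe named demo:

```r
# Hypothetical data: ten rows
demo <- data.frame(id = 1:10, value = seq(10, 100, by = 10))

head(demo, n = 3)    # first three rows
tail(demo, n = 2)    # last two rows
head(demo, n = -8)   # negative n: all rows except the last eight
```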
wgt) on the x-axis and the height variable (hgt) on the y-axis. How can
you achieve such a plot? Tip: use the args() function.
A. plot(boys$wgt, boys$hgt)
B. plot(boys$hgt, boys$wgt)
C. plot(x=boys$wgt, y=boys$hgt)
D. plot(y=boys$hgt, x=boys$wgt)
There are three correct answers: A, C and D.
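Why A, C and D are equivalent follows from how R matches arguments: unnamed arguments are matched by position, named arguments by name. The sketch below illustrates this with a hypothetical function describe_axes() that has the same first two arguments as plot().

```r
# Hypothetical helper with arguments x and y, in that order
describe_axes <- function(x, y) {
  paste0("x-axis: ", x, ", y-axis: ", y)
}

call_a <- describe_axes("wgt", "hgt")           # by position
call_b <- describe_axes("hgt", "wgt")           # by position, swapped
call_c <- describe_axes(x = "wgt", y = "hgt")   # by name
call_d <- describe_axes(y = "hgt", x = "wgt")   # by name, other order

identical(call_a, call_c) && identical(call_a, call_d)   # same result
identical(call_a, call_b)                                # different result
```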
# look at the arguments structure in the function
args(plot)
## function (x, y, ...) 
## NULL
# this reveals that the function plots the first value on the x-axis and the second value on the y-axis.
Make the plot using the correct code.
plot(x=boys$wgt, y=boys$hgt)
boys data are sorted based on age. Verify this.
To verify if the data are indeed sorted, we can run the following
command to test the complement of that statement. Remember that we can
always search the help for functions: e.g. we could have searched here
for ?sort and we would quickly have ended up at function
is.unsorted() as it tests whether an object is not
sorted.
is.unsorted(boys$age)
## [1] FALSE
which returns FALSE, indicating that boys’ age is indeed sorted (we
asked if it was unsorted!). To directly test if it is sorted, we could
have used
!is.unsorted(boys$age)
## [1] TRUE
which tests if the data are not unsorted. In other words, the values
TRUE and FALSE under is.unsorted() turn into FALSE and TRUE under
!is.unsorted(), respectively.
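The behaviour of is.unsorted() is easy to verify on a few short, made-up vectors. The strictly argument additionally treats ties as unsorted.

```r
is.unsorted(c(1, 2, 2, 3))                    # FALSE: values never decrease
is.unsorted(c(3, 1, 2))                       # TRUE: 3 comes before 1
is.unsorted(c(1, 2, 2, 3), strictly = TRUE)   # TRUE: the tie 2, 2 is not strictly increasing
!is.unsorted(c(1, 2, 2, 3))                   # TRUE: "is it sorted?"
```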
boys dataset with
str(). Use one or more functions to find distributional
summary information (at least information about the minimum, the
maximum, the mean and the median) for all of the variables. Give the
standard deviation for age and bmi.
Tip: make use of the help (?) and help search (??) functionality in R.
str(boys)
## 'data.frame':    748 obs. of  9 variables:
##  $ age: num  0.035 0.038 0.057 0.06 0.062 0.068 0.068 0.071 0.071 0.073 ...
##  $ hgt: num  50.1 53.5 50 54.5 57.5 55.5 52.5 53 55.1 54.5 ...
##  $ wgt: num  3.65 3.37 3.14 4.27 5.03 ...
##  $ bmi: num  14.5 11.8 12.6 14.4 15.2 ...
##  $ hc : num  33.7 35 35.2 36.7 37.3 37 34.9 35.8 36.8 38 ...
##  $ gen: Ord.factor w/ 5 levels "G1"<"G2"<"G3"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ phb: Ord.factor w/ 6 levels "P1"<"P2"<"P3"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ tv : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ reg: Factor w/ 5 levels "north","east",..: 4 4 4 4 4 4 4 3 3 2 ...
summary(boys) #summary info
##       age              hgt              wgt              bmi       
##  Min.   : 0.035   Min.   : 50.00   Min.   :  3.14   Min.   :11.77  
##  1st Qu.: 1.581   1st Qu.: 84.88   1st Qu.: 11.70   1st Qu.:15.90  
##  Median :10.505   Median :147.30   Median : 34.65   Median :17.45  
##  Mean   : 9.159   Mean   :132.15   Mean   : 37.15   Mean   :18.07  
##  3rd Qu.:15.267   3rd Qu.:175.22   3rd Qu.: 59.58   3rd Qu.:19.53  
##  Max.   :21.177   Max.   :198.00   Max.   :117.40   Max.   :31.74  
##                   NA's   :20       NA's   :4        NA's   :21     
##        hc          gen        phb            tv           reg     
##  Min.   :33.70   G1  : 56   P1  : 63   Min.   : 1.00   north: 81  
##  1st Qu.:48.12   G2  : 50   P2  : 40   1st Qu.: 4.00   east :161  
##  Median :53.00   G3  : 22   P3  : 19   Median :12.00   west :239  
##  Mean   :51.51   G4  : 42   P4  : 32   Mean   :11.89   south:191  
##  3rd Qu.:56.00   G5  : 75   P5  : 50   3rd Qu.:20.00   city : 73  
##  Max.   :65.00   NA's:503   P6  : 41   Max.   :25.00   NA's :  3  
##  NA's   :46                 NA's:503   NA's   :522
sd(boys$age) #standard deviation for age
## [1] 6.894052
sd(boys$bmi, na.rm = TRUE) #standard deviation for bmi
## [1] 3.053421
Note that bmi contains 21 missing values, as the summary information
shows. Therefore we need to use na.rm = TRUE to calculate the standard
deviation on the observed cases only.
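What happens without na.rm = TRUE can be seen on a tiny, made-up vector x:

```r
x <- c(1, 2, NA)

sum(is.na(x))         # number of missing values in x
sd(x)                 # NA: a single missing value makes the result missing
sd(x, na.rm = TRUE)   # standard deviation of the observed values 1 and 2
```

The same na.rm argument is available in related functions such as mean() and median().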
Logical values (TRUE vs FALSE) are a very powerful tool in R. For
example, we can select just the rows (respondents) in the data that
are aged 20 or older by putting a logical expression within the row
index of the dataset:
boys2 <- boys[boys$age >= 20, ]
nrow(boys2)
## [1] 12
or, alternatively, using subset(),
boys2.1 <- subset(boys, age >= 20)
nrow(boys2.1)
## [1] 12
boys3 <- boys[boys$age > 19 & boys$age < 19.5, ]
nrow(boys3)
## [1] 18
or, alternatively,
boys3.2 <- subset(boys, age > 19 & age < 19.5)
nrow(boys3.2)
## [1] 18
north?
mean(boys$age[boys$age < 15 & boys$reg != "north" ], na.rm = TRUE)
## [1] 6.044461
or, alternatively,
mean(subset(boys, age < 15 & reg != "north")$age, na.rm=TRUE)
## [1] 6.044461
The mean age is 6.0444609 years.
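The two approaches give the same result here because age has no missing values. They can differ when the condition evaluates to NA, as the following sketch with a hypothetical dataframe (demo) shows: bracket indexing keeps a row of NAs for every NA in the condition, whereas subset() drops those rows.

```r
# Hypothetical data with one missing age
demo <- data.frame(age = c(18, 21, NA, 25),
                   reg = c("north", "east", "west", "north"),
                   stringsAsFactors = FALSE)

demo$age >= 20                        # FALSE TRUE NA TRUE

by_index  <- demo[demo$age >= 20, ]   # the NA yields an extra all-NA row
by_subset <- subset(demo, age >= 20)  # NA is treated as FALSE
by_which  <- demo[which(demo$age >= 20), ]   # which() also drops the NA

nrow(by_index)    # 3
nrow(by_subset)   # 2
nrow(by_which)    # 2

# Logical values can be counted directly: TRUE counts as 1
sum(demo$age >= 20, na.rm = TRUE)
```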
Today we have learned the basics of R. Working in basic R
offers tremendous flexibility, but may also be inefficient when our aim
is some complex analysis, data operation or data manipulation. Doing
advanced operations in basic R may require lots and lots of
code. Tomorrow we will start using packages that allow us to do
complicated operations with just a few lines of code.
As you start using R in your own research, you will find
yourself in need of packages that are not part of the default
R installation. The beauty of R is that its
functionality is community-driven. People can add packages to
CRAN that other people can use and improve. Chances are
that a function and/or package has already been developed for the
analysis or operation you plan to carry out. If not, you are of course
welcome to fill the gap by submitting your own package.
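As a small illustration of working with add-on packages, the sketch below checks whether the mice package (the source of the boys data) is installed before using it. The object names has_mice and boys_data are assumptions for illustration; the installation step is left as a comment because it needs an internet connection.

```r
# install.packages("mice")   # one-time installation from CRAN

# requireNamespace() returns TRUE if the package is available, FALSE otherwise
has_mice <- requireNamespace("mice", quietly = TRUE)

if (has_mice) {
  boys_data <- mice::boys   # access the dataset without attaching the package
  dim(boys_data)
} else {
  message("Package 'mice' is not installed yet.")
}
```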
End of practical