Data manipulation and pipes

Statistical Programming with R

This lecture

Data manipulation
Basic analysis (correlation & t-test)
R model formula and classes
Pipes

New packages we use

library(MASS)     # for the cats data
library(dplyr)    # data manipulation
library(haven)    # in/exporting data
library(magrittr) # pipes

New functions

transform(): changing and adding columns
dplyr::filter(): row-wise selection (of cases)
dplyr::arrange(): order rows by values of a column or columns (low to high)
table(): frequency tables
class(): object class
levels(): levels of a factor
order(): data entries in increasing order
haven::read_sav(): import SPSS data
cor(): bivariate correlation
sample(): drawing a sample
t.test(): t-test

Data manipulation

The cats data

Description (see ?cats): The heart weights (in g) and body weights (in kg) of 144 adult cats (male and female) used for digitalis experiments. Digitalis is a medication used to treat heart failure and certain types of irregular heartbeat.

head(cats)

##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6

str(cats)

## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

The cats data

summary(cats)

##  Sex         Bwt             Hwt       
##  F:47   Min.   :2.000   Min.   : 6.30  
##  M:97   1st Qu.:2.300   1st Qu.: 8.95  
##         Median :2.700   Median :10.10  
##         Mean   :2.724   Mean   :10.63  
##         3rd Qu.:3.025   3rd Qu.:12.12  
##         Max.   :3.900   Max.   :20.50

How to get only Female cats?

# With dplyr::filter
fem.cats <- dplyr::filter(cats, Sex == "F")

dim(fem.cats)

## [1] 47  3

# With base R:
# fem.cats <- cats[cats$Sex == "F", ]

How to get only heavy cats?

heavy.cats <- dplyr::filter(cats, Bwt > 3)
dim(heavy.cats)

## [1] 36  3

head(heavy.cats)

##   Sex Bwt  Hwt
## 1   M 3.1  9.9
## 2   M 3.1 11.5
## 3   M 3.1 12.1
## 4   M 3.1 12.5
## 5   M 3.1 13.0
## 6   M 3.1 14.3

# With base R:
# heavy.cats2 <- cats[cats$Bwt > 3, ]
# or:
# heavy.cats3 <- subset(cats, Bwt > 3)

Select cats with specified range of body weights

dplyr::filter(cats, Bwt > 2.0 & Bwt < 2.2, Sex == "F")

##   Sex Bwt Hwt
## 1   F 2.1 7.2
## 2   F 2.1 7.3
## 3   F 2.1 7.6
## 4   F 2.1 8.1
## 5   F 2.1 8.2
## 6   F 2.1 8.3
## 7   F 2.1 8.5
## 8   F 2.1 8.7
## 9   F 2.1 9.8

Working with factors

class(cats$Sex)

## [1] "factor"

levels(cats$Sex)

## [1] "F" "M"

Working with factors

Changing the level names:

levels(cats$Sex) <- c("Female", "Male")

table(cats$Sex)

## 
## Female   Male 
##     47     97

head(cats)

##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.1 7.2
## 5 Female 2.1 7.3
## 6 Female 2.1 7.6

Sorting with `dplyr::arrange()`

dplyr::arrange()sorts the rows of a data frame according to the order of one or more of the columns.

# sort cats by body weight (Bwt) from low to high values
sorted.cats <- dplyr::arrange(cats, Bwt)
head(sorted.cats)

##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4   Male 2.0 6.5
## 5   Male 2.0 6.5
## 6 Female 2.1 7.2

Sorting with `dplyr::arrange()`

dplyr::arrange()sorts the rows of a data frame according to the order of one or more of the columns.

# Sort the cats by Sex first and within Sex (1=female) by body weight (low to high)
sorted.cats.grouped <- dplyr::arrange(cats, Sex, Bwt)
head(sorted.cats.grouped)

##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.1 7.2
## 5 Female 2.1 7.3
## 6 Female 2.1 7.6

Combining matrices or dataframes

# add column with id numbers
ID <- matrix(c(1:144), nrow=144, ncol=1)

# add variable name for ID variable and change names of other variables
D <- dplyr::bind_cols(ID, cats)
names(D) <- c("ID", "Sex", "Bodyweigt", "Hartweight")
head(D)

##   ID    Sex Bodyweigt Hartweight
## 1  1 Female       2.0        7.0
## 2  2 Female       2.0        7.4
## 3  3 Female       2.0        9.5
## 4  4 Female       2.1        7.2
## 5  5 Female       2.1        7.3
## 6  6 Female       2.1        7.6

Basic analysis

Correlation

cor(cats[, -1])

##           Bwt       Hwt
## Bwt 1.0000000 0.8041274
## Hwt 0.8041274 1.0000000

With [, -1] we exclude the first column (female/male)

Correlation

cor.test(cats$Bwt, cats$Hwt)

## 
##  Pearson's product-moment correlation
## 
## data:  cats$Bwt and cats$Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274

What do we conclude?

Correlation

plot(cats$Bwt, cats$Hwt)

T-test

Test the null hypothesis that the difference in mean heart weight between male and female cats is 0

t.test(formula = Hwt ~ Sex, data = cats)

## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680

T-test

plot(formula = Hwt ~ Sex, data = cats)

Linear regression

Remember the boys data from package mice.

If we want to regress BMI (bmi) on weight (wgt), we would use the lm() function and the model formula bmi ~ wgt.

lm(formula = bmi ~ wgt, data = boys)

## 
## Call:
## lm(formula = bmi ~ wgt, data = boys)
## 
## Coefficients:
## (Intercept)          wgt  
##     14.5401       0.0935

Scatterplot

plot(formula = bmi ~ wgt, data = boys)

Add a quadratic term

lm(formula = bmi ~ wgt + I(wgt^2), data = boys)

## 
## Call:
## lm(formula = bmi ~ wgt + I(wgt^2), data = boys)
## 
## Coefficients:
## (Intercept)          wgt     I(wgt^2)  
##   16.271466    -0.041715     0.001604

Why we need `I()`

Inside an R model formula, the +, :, and ^ operators are text objects and therefore behave differently than when used in calculations with numeric vectors. Using I() converts these operators to their numerical meaning.

lm(bmi ~ wgt + wgt^2, data = boys)

## 
## Call:
## lm(formula = bmi ~ wgt + wgt^2, data = boys)
## 
## Coefficients:
## (Intercept)          wgt  
##     14.5401       0.0935

lm(bmi ~ wgt + I(wgt^2), data = boys)

## 
## Call:
## lm(formula = bmi ~ wgt + I(wgt^2), data = boys)
## 
## Coefficients:
## (Intercept)          wgt     I(wgt^2)  
##   16.271466    -0.041715     0.001604

Output of the `lm()` class

It is ‘nicer’ to store the output from the function in an object. The convention for regression models is an object called fit.

fit <- lm(bmi ~ wgt + I(wgt^2), data = boys)

The object fit contains a lot more than just the regression weights. To inspect what is inside you can use

ls(fit)

##  [1] "assign"        "call"          "coefficients"  "df.residual"  
##  [5] "effects"       "fitted.values" "model"         "na.action"    
##  [9] "qr"            "rank"          "residuals"     "terms"        
## [13] "xlevels"

Inspecting what is inside `fit`

Another approach to inspecting the contents of fit is the function attributes()

attributes(fit)

## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "lm"

The benefit of using attributes() is that it directly tells you the class of the object.

Classes in R

class(fit)

## [1] "lm"

Classes are used for an object-oriented style of programming. This means that you can write a specific function that

has fixed requirements with respect to the input.
presents output or graphs in a predefined manner.

When a generic function fun is applied to an object with class attribute c("first", "second"), the system searches for a function called fun.first and, if it finds it, applies it to the object.

If no such function is found, a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used (if it exists). If there is no class attribute, the implicit class is tried, then the default method.

Classes example: plotting without class

plot(bmi ~ wgt, data = boys)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 1)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 2)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 4)

Why is plot different for class `"lm"`?

The function plot() is called, but not used. Instead, because the linear model has class "lm", R searches for the function plot.lm().

If function plot.lm() would not exist, R tries to apply function plot() (which would have failed in this case because plot requires x and y as input)

plot.lm() is created by John Maindonald and Martin Maechler. They thought it would be useful to have a standard plotting environment for objects with class "lm".

Since the elements that class "lm" returns are known, creating a generic function class is straightforward.

Pipes

This is a pipe:

boys <- 
  read_sav("boys.sav") %>%
  head()

It effectively replaces head(read_sav("boys.sav")).

Why are pipes useful?

Let’s assume that we want to load data, change a variable, filter cases and select columns. Without a pipe, this would look like

boys  <- read_sav("boys.sav")
boys2 <- transform(boys, hgt = hgt / 100)
boys3 <- filter(boys2, age > 15)
boys4 <- subset(boys3, select = c(hgt, wgt, bmi))

With the pipe:

boys <-
  read_sav("boys.sav") %>%
  transform(hgt = hgt/100) %>%
  filter(age > 15) %>%
  subset(select = c(hgt, wgt, bmi))

Benefit: a single object in memory, the steps are easy to follow and understand.

With pipes

Your code becomes more readable:

data operations are structured from left-to-right and not from in-to-out
nested function calls are avoided
local variables and copied objects are avoided
easy to add steps in the sequence

Keyboard shortcut:

Windows/Linux: ctrl + shift + m
Mac: cmd + shift + m

What do pipes do:

Pipes create functions without nesting:

f(x) becomes x %>% f()

set.seed(123)
mean(rnorm(10))

## [1] 0.07462564

set.seed(123)
rnorm(10) %>% 
  mean()

## [1] 0.07462564

Pipes create functions without nesting

f(x, y) becomes x %>% f(y)

cor(boys, use = "pairwise.complete.obs")

##           hgt       wgt       bmi
## hgt 1.0000000 0.6100784 0.1758781
## wgt 0.6100784 1.0000000 0.8841304
## bmi 0.1758781 0.8841304 1.0000000

boys %>% 
  cor(use = "pairwise.complete.obs")

##           hgt       wgt       bmi
## hgt 1.0000000 0.6100784 0.1758781
## wgt 0.6100784 1.0000000 0.8841304
## bmi 0.1758781 0.8841304 1.0000000

Pipes create functions without nesting

h(g(f(x))) becomes x %>% f %>% g %>% h

max(na.omit(subset(boys, select = wgt)))

## [1] 117.4

boys %>% 
  subset(select = wgt) %>% 
  na.omit() %>% 
  max()

## [1] 117.4

Useful: outlier filtering

nrow(cats)

## [1] 144

# select the observations that fall outside the +/- SD around the mean
cats.outl <- 
  cats %>% 
  dplyr::filter(Hwt > mean(Hwt) + 3 * sd(Hwt) | Hwt < mean(Hwt) - 3 * sd(Hwt))

nrow(cats.outl)

## [1] 1

# select the observations that fall within +/- 3 SD around the mean
cats.without.outl <- 
  cats %>% 
  dplyr::filter(Hwt < mean(Hwt) + 3 * sd(Hwt) & Hwt > mean(Hwt) - 3 * sd(Hwt))

nrow(cats.without.outl)

## [1] 143

More pipe stuff

Pipes and the R model formula `~`

Sometimes the data we want to use, are “piped” in the wrong argument, see e.g.:

cats %>% 
  lm(Hwt ~ Bwt)

This leads to an error:

Error in as.data.frame.default(data) :

cannot coerce class ‘“formula”’ to a data.frame

The %>% pipe operator is intended to work with functions where each result is forwarded on to the first argument of the next function. For the function lm() the first argument is the model formula, which is a text object and not the data frame.

Solution 1

We can use the . symbol to act as placeholder for the data:

cats %>% 
  lm(Hwt ~ Bwt, data = .)

## 
## Call:
## lm(formula = Hwt ~ Bwt, data = .)
## 
## Coefficients:
## (Intercept)          Bwt  
##     -0.3567       4.0341

Solution 2: the exposition `%$%` pipe

HTML5 Icon

Solution 2: the exposition `%$%` pipe

The exposition %$% pipe exposes the content(s) of the data frame (the variables) to the next function in the pipeline.

# Now the code works without need of the placeholder

cats %$% 
  lm(Hwt ~ Bwt)

## 
## Call:
## lm(formula = Hwt ~ Bwt)
## 
## Coefficients:
## (Intercept)          Bwt  
##     -0.3567       4.0341

Debugging pipelines

Sample 3 positions from the alphabet and show the position and the letter. If you don’t know what’s going on, run each statement separately!

set.seed(123)
1:26

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26

set.seed(123)
1:26 %>% 
  sample(3)

## [1] 15 19 14

set.seed(123)
1:26 %>%
  sample(3) %>%
  paste(., LETTERS[.])

## [1] "15 O" "19 S" "14 N"

Performing a t-test in a pipe

cats %$%
  t.test(Hwt ~ Sex)

## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680

is the same as

t.test(Hwt ~ Sex, data = cats)

Storing a t-test from a pipe

cats.test <- 
  cats %$%
  t.test(Bwt ~ Sex)

cats.test

## 
##  Welch Two Sample t-test
## 
## data:  Bwt by Sex
## t = -8.7095, df = 136.84, p-value = 8.831e-15
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -0.6631268 -0.4177242
## sample estimates:
## mean in group Female   mean in group Male 
##             2.359574             2.900000

This lecture

New packages we use

New functions

Data manipulation

The cats data

The cats data

How to get only Female cats?

How to get only heavy cats?

Select cats with specified range of body weights

Working with factors

Working with factors

Sorting with dplyr::arrange()

Sorting with dplyr::arrange()

Combining matrices or dataframes

Basic analysis

Correlation

Correlation

Correlation

T-test

T-test

Linear regression

Scatterplot

Add a quadratic term

Why we need I()

Output of the lm() class

Inspecting what is inside fit

Classes in R

Classes example: plotting without class

Classes example: plotting with class

Classes example: plotting with class

Classes example: plotting with class

Why is plot different for class "lm"?

Pipes

This is a pipe:

Why are pipes useful?

With pipes

What do pipes do:

Pipes create functions without nesting

Pipes create functions without nesting

Useful: outlier filtering

More pipe stuff

Pipes and the R model formula ~

Solution 1

Solution 2: the exposition %$% pipe

Solution 2: the exposition %$% pipe

Debugging pipelines

Performing a t-test in a pipe

Storing a t-test from a pipe

Practical

Sorting with `dplyr::arrange()`

Sorting with `dplyr::arrange()`

Why we need `I()`

Output of the `lm()` class

Inspecting what is inside `fit`

Why is plot different for class `"lm"`?

Pipes and the R model formula `~`

Solution 2: the exposition `%$%` pipe

Solution 2: the exposition `%$%` pipe