Data manipulation
Basic analysis (correlation & t-test)
R model formula and classes
Pipes
Statistical Programming with R
Data manipulation
Basic analysis (correlation & t-test)
R model formula and classes
Pipes
library(MASS) # for the cats data library(dplyr) # data manipulation library(haven) # in/exporting data library(magrittr) # pipes
transform()
: changing and adding columnsdplyr::filter()
: row-wise selection (of cases)dplyr::arrange()
: order rows by values of a column or columns (low to high)table()
: frequency tablesclass()
: object classlevels()
: levels of a factororder()
: data entries in increasing orderhaven::read_sav()
: import SPSS datacor()
: bivariate correlationsample()
: drawing a samplet.test()
: t-testDescription (see ?cats
): The heart weights (in g) and body weights (in kg) of 144 adult cats (male and female) used for digitalis experiments. Digitalis is a medication used to treat heart failure and certain types of irregular heartbeat.
head(cats)
## Sex Bwt Hwt ## 1 F 2.0 7.0 ## 2 F 2.0 7.4 ## 3 F 2.0 9.5 ## 4 F 2.1 7.2 ## 5 F 2.1 7.3 ## 6 F 2.1 7.6
str(cats)
## 'data.frame': 144 obs. of 3 variables: ## $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ... ## $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ... ## $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
summary(cats)
## Sex Bwt Hwt ## F:47 Min. :2.000 Min. : 6.30 ## M:97 1st Qu.:2.300 1st Qu.: 8.95 ## Median :2.700 Median :10.10 ## Mean :2.724 Mean :10.63 ## 3rd Qu.:3.025 3rd Qu.:12.12 ## Max. :3.900 Max. :20.50
# With dplyr::filter fem.cats <- dplyr::filter(cats, Sex == "F") dim(fem.cats)
## [1] 47 3
# With base R: # fem.cats <- cats[cats$Sex == "F", ]
heavy.cats <- dplyr::filter(cats, Bwt > 3) dim(heavy.cats)
## [1] 36 3
head(heavy.cats)
## Sex Bwt Hwt ## 1 M 3.1 9.9 ## 2 M 3.1 11.5 ## 3 M 3.1 12.1 ## 4 M 3.1 12.5 ## 5 M 3.1 13.0 ## 6 M 3.1 14.3
# With base R: # heavy.cats2 <- cats[cats$Bwt > 3, ] # or: # heavy.cats3 <- subset(cats, Bwt > 3)
dplyr::filter(cats, Bwt > 2.0 & Bwt < 2.2, Sex == "F")
## Sex Bwt Hwt ## 1 F 2.1 7.2 ## 2 F 2.1 7.3 ## 3 F 2.1 7.6 ## 4 F 2.1 8.1 ## 5 F 2.1 8.2 ## 6 F 2.1 8.3 ## 7 F 2.1 8.5 ## 8 F 2.1 8.7 ## 9 F 2.1 9.8
class(cats$Sex)
## [1] "factor"
levels(cats$Sex)
## [1] "F" "M"
Changing the level names:
levels(cats$Sex) <- c("Female", "Male") table(cats$Sex)
## ## Female Male ## 47 97
head(cats)
## Sex Bwt Hwt ## 1 Female 2.0 7.0 ## 2 Female 2.0 7.4 ## 3 Female 2.0 9.5 ## 4 Female 2.1 7.2 ## 5 Female 2.1 7.3 ## 6 Female 2.1 7.6
dplyr::arrange()
dplyr::arrange()
sorts the rows of a data frame according to the order of one or more of the columns.
# sort cats by body weight (Bwt) from low to high values sorted.cats <- dplyr::arrange(cats, Bwt) head(sorted.cats)
## Sex Bwt Hwt ## 1 Female 2.0 7.0 ## 2 Female 2.0 7.4 ## 3 Female 2.0 9.5 ## 4 Male 2.0 6.5 ## 5 Male 2.0 6.5 ## 6 Female 2.1 7.2
dplyr::arrange()
dplyr::arrange()
sorts the rows of a data frame according to the order of one or more of the columns.
# Sort the cats by Sex first and within Sex (1=female) by body weight (low to high) sorted.cats.grouped <- dplyr::arrange(cats, Sex, Bwt) head(sorted.cats.grouped)
## Sex Bwt Hwt ## 1 Female 2.0 7.0 ## 2 Female 2.0 7.4 ## 3 Female 2.0 9.5 ## 4 Female 2.1 7.2 ## 5 Female 2.1 7.3 ## 6 Female 2.1 7.6
# add column with id numbers ID <- matrix(c(1:144), nrow=144, ncol=1)
# add variable name for ID variable and change names of other variables D <- dplyr::bind_cols(ID, cats) names(D) <- c("ID", "Sex", "Bodyweigt", "Hartweight") head(D)
## ID Sex Bodyweigt Hartweight ## 1 1 Female 2.0 7.0 ## 2 2 Female 2.0 7.4 ## 3 3 Female 2.0 9.5 ## 4 4 Female 2.1 7.2 ## 5 5 Female 2.1 7.3 ## 6 6 Female 2.1 7.6
cor(cats[, -1])
## Bwt Hwt ## Bwt 1.0000000 0.8041274 ## Hwt 0.8041274 1.0000000
With [, -1]
we exclude the first column (female/male)
cor.test(cats$Bwt, cats$Hwt)
## ## Pearson's product-moment correlation ## ## data: cats$Bwt and cats$Hwt ## t = 16.119, df = 142, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.7375682 0.8552122 ## sample estimates: ## cor ## 0.8041274
What do we conclude?
plot(cats$Bwt, cats$Hwt)
Test the null hypothesis that the difference in mean heart weight between male and female cats is 0
t.test(formula = Hwt ~ Sex, data = cats)
## ## Welch Two Sample t-test ## ## data: Hwt by Sex ## t = -6.5179, df = 140.61, p-value = 1.186e-09 ## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0 ## 95 percent confidence interval: ## -2.763753 -1.477352 ## sample estimates: ## mean in group Female mean in group Male ## 9.202128 11.322680
plot(formula = Hwt ~ Sex, data = cats)
Remember the boys
data from package mice
.
If we want to regress BMI (bmi
) on weight (wgt
), we would use the lm()
function and the model formula bmi ~ wgt
.
lm(formula = bmi ~ wgt, data = boys)
## ## Call: ## lm(formula = bmi ~ wgt, data = boys) ## ## Coefficients: ## (Intercept) wgt ## 14.5401 0.0935
plot(formula = bmi ~ wgt, data = boys)
lm(formula = bmi ~ wgt + I(wgt^2), data = boys)
## ## Call: ## lm(formula = bmi ~ wgt + I(wgt^2), data = boys) ## ## Coefficients: ## (Intercept) wgt I(wgt^2) ## 16.271466 -0.041715 0.001604
I()
Inside an R model formula, the +
, :
, and ^
operators are text objects and therefore behave differently than when used in calculations with numeric vectors. Using I()
converts these operators to their numerical meaning.
lm(bmi ~ wgt + wgt^2, data = boys)
## ## Call: ## lm(formula = bmi ~ wgt + wgt^2, data = boys) ## ## Coefficients: ## (Intercept) wgt ## 14.5401 0.0935
lm(bmi ~ wgt + I(wgt^2), data = boys)
## ## Call: ## lm(formula = bmi ~ wgt + I(wgt^2), data = boys) ## ## Coefficients: ## (Intercept) wgt I(wgt^2) ## 16.271466 -0.041715 0.001604
lm()
classIt is ‘nicer’ to store the output from the function in an object. The convention for regression models is an object called fit
.
fit <- lm(bmi ~ wgt + I(wgt^2), data = boys)
The object fit
contains a lot more than just the regression weights. To inspect what is inside you can use
ls(fit)
## [1] "assign" "call" "coefficients" "df.residual" ## [5] "effects" "fitted.values" "model" "na.action" ## [9] "qr" "rank" "residuals" "terms" ## [13] "xlevels"
fit
Another approach to inspecting the contents of fit
is the function attributes()
attributes(fit)
## $names ## [1] "coefficients" "residuals" "effects" "rank" ## [5] "fitted.values" "assign" "qr" "df.residual" ## [9] "na.action" "xlevels" "call" "terms" ## [13] "model" ## ## $class ## [1] "lm"
The benefit of using attributes()
is that it directly tells you the class of the object.
class(fit)
## [1] "lm"
Classes are used for an object-oriented style of programming. This means that you can write a specific function that
When a generic function fun
is applied to an object with class attribute c("first", "second")
, the system searches for a function called fun.first
and, if it finds it, applies it to the object.
If no such function is found, a function called fun.second
is tried. If no class name produces a suitable function, the function fun.default
is used (if it exists). If there is no class attribute, the implicit class is tried, then the default method.
plot(bmi ~ wgt, data = boys)
plot(lm(bmi ~ wgt, data = boys), which = 1)
plot(lm(bmi ~ wgt, data = boys), which = 2)
plot(lm(bmi ~ wgt, data = boys), which = 4)
"lm"
?The function plot()
is called, but not used. Instead, because the linear model has class "lm"
, R
searches for the function plot.lm()
.
If function plot.lm()
would not exist, R
tries to apply function plot()
(which would have failed in this case because plot requires x
and y
as input)
plot.lm()
is created by John Maindonald and Martin Maechler. They thought it would be useful to have a standard plotting environment for objects with class "lm"
.
Since the elements that class "lm"
returns are known, creating a generic function class is straightforward.
boys <- read_sav("boys.sav") %>% head()
It effectively replaces head(read_sav("boys.sav"))
.
Let’s assume that we want to load data, change a variable, filter cases and select columns. Without a pipe, this would look like
boys <- read_sav("boys.sav") boys2 <- transform(boys, hgt = hgt / 100) boys3 <- filter(boys2, age > 15) boys4 <- subset(boys3, select = c(hgt, wgt, bmi))
With the pipe:
boys <- read_sav("boys.sav") %>% transform(hgt = hgt/100) %>% filter(age > 15) %>% subset(select = c(hgt, wgt, bmi))
Benefit: a single object in memory, the steps are easy to follow and understand.
Your code becomes more readable:
Keyboard shortcut:
ctrl + shift + m
cmd + shift + m
Pipes create functions without nesting:
f(x)
becomes x %>% f()
set.seed(123) mean(rnorm(10))
## [1] 0.07462564
set.seed(123) rnorm(10) %>% mean()
## [1] 0.07462564
f(x, y)
becomes x %>% f(y)
cor(boys, use = "pairwise.complete.obs")
## hgt wgt bmi ## hgt 1.0000000 0.6100784 0.1758781 ## wgt 0.6100784 1.0000000 0.8841304 ## bmi 0.1758781 0.8841304 1.0000000
boys %>% cor(use = "pairwise.complete.obs")
## hgt wgt bmi ## hgt 1.0000000 0.6100784 0.1758781 ## wgt 0.6100784 1.0000000 0.8841304 ## bmi 0.1758781 0.8841304 1.0000000
h(g(f(x)))
becomes x %>% f %>% g %>% h
max(na.omit(subset(boys, select = wgt)))
## [1] 117.4
boys %>% subset(select = wgt) %>% na.omit() %>% max()
## [1] 117.4
nrow(cats)
## [1] 144
# select the observations that fall outside the +/- SD around the mean cats.outl <- cats %>% dplyr::filter(Hwt > mean(Hwt) + 3 * sd(Hwt) | Hwt < mean(Hwt) - 3 * sd(Hwt)) nrow(cats.outl)
## [1] 1
# select the observations that fall within +/- 3 SD around the mean cats.without.outl <- cats %>% dplyr::filter(Hwt < mean(Hwt) + 3 * sd(Hwt) & Hwt > mean(Hwt) - 3 * sd(Hwt)) nrow(cats.without.outl)
## [1] 143
~
Sometimes the data we want to use, are “piped” in the wrong argument, see e.g.:
cats %>% lm(Hwt ~ Bwt)
This leads to an error:
Error in as.data.frame.default(data) :
cannot coerce class ‘“formula”’ to a data.frame
The %>%
pipe operator is intended to work with functions where each result is forwarded on to the first argument of the next function. For the function lm()
the first argument is the model formula, which is a text object and not the data frame.
We can use the .
symbol to act as placeholder for the data:
cats %>% lm(Hwt ~ Bwt, data = .)
## ## Call: ## lm(formula = Hwt ~ Bwt, data = .) ## ## Coefficients: ## (Intercept) Bwt ## -0.3567 4.0341
%$%
pipe%$%
pipeThe exposition %$%
pipe exposes the content(s) of the data frame (the variables) to the next function in the pipeline.
# Now the code works without need of the placeholder cats %$% lm(Hwt ~ Bwt)
## ## Call: ## lm(formula = Hwt ~ Bwt) ## ## Coefficients: ## (Intercept) Bwt ## -0.3567 4.0341
Sample 3 positions from the alphabet and show the position and the letter. If you don’t know what’s going on, run each statement separately!
set.seed(123) 1:26
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ## [26] 26
set.seed(123) 1:26 %>% sample(3)
## [1] 15 19 14
set.seed(123) 1:26 %>% sample(3) %>% paste(., LETTERS[.])
## [1] "15 O" "19 S" "14 N"
cats %$% t.test(Hwt ~ Sex)
## ## Welch Two Sample t-test ## ## data: Hwt by Sex ## t = -6.5179, df = 140.61, p-value = 1.186e-09 ## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0 ## 95 percent confidence interval: ## -2.763753 -1.477352 ## sample estimates: ## mean in group Female mean in group Male ## 9.202128 11.322680
is the same as
t.test(Hwt ~ Sex, data = cats)
cats.test <- cats %$% t.test(Bwt ~ Sex) cats.test
## ## Welch Two Sample t-test ## ## data: Bwt by Sex ## t = -8.7095, df = 136.84, p-value = 8.831e-15 ## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0 ## 95 percent confidence interval: ## -0.6631268 -0.4177242 ## sample estimates: ## mean in group Female mean in group Male ## 2.359574 2.900000