dplyr (tidyverse)
dplyr is a very useful package for data transformation and manipulation. See the cheat sheets in RStudio under Help -> Cheat Sheets.
Subsetting in base R is done with [], [[]] and $.
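As a minimal sketch of the three base R subsetting operators, the example below uses a small made-up data frame (the values are hypothetical and not the D used elsewhere in these slides):

```r
# Hypothetical data frame for illustration only
D <- data.frame(V1 = c(-0.5, 0.2, 1.5),
                V2 = c(8.4, 5.9, 2.5),
                V3 = c("a", "b", "c"))

row2    <- D[2, ]       # second row, all columns (still a data frame)
col_vec <- D[, "V1"]    # column V1 as a plain vector
col_df  <- D["V1"]      # column V1, kept as a one-column data frame
col_dbl <- D[["V1"]]    # column V1 as a plain vector
col_dol <- D$V1         # same result as D[["V1"]]
```

Single brackets keep the data frame structure when used with one index, while [[ ]] and $ extract the column itself.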
In dplyr, there are two important functions for subsetting:
select(): subset columns
filter(): subset rows
select()
Remove the second column with select():
# Remove second column:
require(plyr)
## Loading required package: plyr
dplyr::select(D, -2)
## V1 V3
## 1 -0.56047565 a
## 2 -0.23017749 b
## 3 1.55870831 c
## 4 0.07050839 d
## 5 0.12928774 e
# with base R: D[, -2]
select()
Another way to remove the second column is to name the columns you want to keep:
D
## V1 V2 V3
## 1 -0.56047565 8.430130 a
## 2 -0.23017749 5.921832 b
## 3 1.55870831 2.469878 c
## 4 0.07050839 3.626294 d
## 5 0.12928774 4.108676 e
dplyr::select(D, V1, V3)
## V1 V3
## 1 -0.56047565 a
## 2 -0.23017749 b
## 3 1.55870831 c
## 4 0.07050839 d
## 5 0.12928774 e
# base R: D[, c(1,3)]
filter()
D
## V1 V2 V3
## 1 -0.56047565 8.430130 a
## 2 -0.23017749 5.921832 b
## 3 1.55870831 2.469878 c
## 4 0.07050839 3.626294 d
## 5 0.12928774 4.108676 e
dplyr::filter(D, V1 < 0 & V2 > 1)
## V1 V2 V3
## 1 -0.5604756 8.430130 a
## 2 -0.2301775 5.921832 b
# with base R: D[D$V1 < 0 & D$V2 > 1, ]
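Row and column subsetting can also be combined in one step. The sketch below uses a made-up data frame (hypothetical values) and runs the dplyr version only if dplyr happens to be installed; the base R line works regardless.

```r
# Hypothetical data frame for illustration only
D <- data.frame(V1 = c(-0.5, -0.2, 1.5, 0.1),
                V2 = c(8.4, 0.9, 2.5, 3.6),
                V3 = c("a", "b", "c", "d"))

# Base R: rows where V1 < 0 and V2 > 1, keeping columns V1 and V3
base_result <- D[D$V1 < 0 & D$V2 > 1, c("V1", "V3")]
print(base_result)

# dplyr equivalent: filter the rows first, then select the columns
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_result <- dplyr::select(dplyr::filter(D, V1 < 0 & V2 > 1), V1, V3)
  print(dplyr_result)
}
```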
R workspace files: .RData
In Practical B you worked with the boys data, which were stored in the R file format .RData, an R workspace file.
Open the sleepdata.RData file with:
load("sleepdata.RData")
# Note: This code works if you have placed the sleepdata.RData file
# in the same project folder as your Rmd file. You do not have to
# specify a file path.
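To see how an .RData file is written and read back, here is a minimal round trip. The object and file name are hypothetical, and the file is written to a temporary directory so nothing in your project folder is touched.

```r
x <- c(1, 2, 3)                                 # hypothetical object to store
path <- file.path(tempdir(), "example.RData")   # hypothetical file name

save(x, file = path)   # write x to an R workspace file
rm(x)                  # remove x from the current session

loaded <- load(path)   # restores x; load() returns the names of the restored objects
print(loaded)
print(x)
```

Note that load() puts objects back under their original names, so it can silently overwrite an object with the same name in your session.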
R has many built-in data sets. The command data() gives a list of all built-in data sets (including the data in any non-base packages that are currently loaded).
Open a built-in data set as follows:
require(MASS) # load the package MASS that contains the mammals data.
## Loading required package: MASS
data(mammals) # load the mammals data
Text files (.txt) can be imported into R with:
read.table("mammalsleep.txt")
CSV (comma-separated values) files can be imported with:
read.csv("filename.csv", header=TRUE, sep=",") # for comma (,) separated files
read.csv2("filename.csv", header=TRUE, sep=";") # for semicolon (;) separated files
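Exporting works the same way in the other direction. The sketch below writes a small made-up data frame to a CSV file in a temporary directory with write.csv() and reads it back with read.csv(); the data and file name are hypothetical.

```r
# Hypothetical data for illustration only
D <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
path <- file.path(tempdir(), "example.csv")   # hypothetical file name

write.csv(D, file = path, row.names = FALSE)  # export without row names
D2 <- read.csv(path, header = TRUE)           # import again

same <- isTRUE(all.equal(D, D2))              # compare original and re-imported data
print(same)
```

Setting row.names = FALSE avoids an extra unnamed column of row numbers in the exported file.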
There are many packages that facilitate importing/exporting other data formats from statistical software:
read_spss() from package haven (haven also handles data formats from Stata and SAS)
package MplusAutomation
read.dta() from package foreign
sasxport.get() from package Hmisc
read.xlsx() from package openxlsx
read_excel() from package readxl
haven by Hadley Wickham provides wonderful functions to import and export many data types from software such as Stata, SAS and SPSS.
For a short guideline on importing multiple formats into R, see e.g. http://www.statmethods.net/input/importingdata.html.
Missing values in R
Calculations that involve missing values (NAs) return NA in R by default:
mean(c(1, 2, NA, 4, 5))
## [1] NA
There are two easy ways to perform “listwise deletion”:
mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
## [1] 3
mean(na.omit(c(1, 2, NA, 4, 5)))
## [1] 3
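The same idea carries over to data frames, where listwise deletion means dropping every row that contains at least one NA. A minimal sketch with a made-up data frame (hypothetical values):

```r
# Hypothetical data frame with missing values
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

complete_rows <- complete.cases(df)   # TRUE for rows without any NA
df_complete   <- na.omit(df)          # drops all rows that contain an NA
df_same       <- df[complete_rows, ]  # the same rows, selected by hand

mean_x <- mean(df$x, na.rm = TRUE)    # ignores the NA in x only
```

Here na.omit(df) keeps only the first row, whereas mean(df$x, na.rm = TRUE) still uses both observed values of x, so the two approaches can be based on different numbers of cases.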
R: the model formula
To model objects based on other objects, we use the ~ (tilde) operator to construct an R model formula, a type of language object.
For example, to regress body mass index (BMI) on weight with a linear regression model, use the model formula bmi ~ wgt with the linear model function lm():
require(mice)
lm(formula = bmi ~ wgt, data = boys)
##
## Call:
## lm(formula = bmi ~ wgt, data = boys)
##
## Coefficients:
## (Intercept) wgt
## 14.5401 0.0935
In this case, bmi ~ wgt means “regress bmi on wgt (weight)”.
The model formula can also be used to create a scatterplot with bmi on the y-axis and wgt on the x-axis:
plot(formula = bmi ~ wgt, data = boys)
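Because a formula is an object in its own right, it can be stored, inspected and reused. The sketch below uses the cars data set that ships with R (not the boys data from the slides) so that it runs without extra packages.

```r
f <- dist ~ speed            # store the formula; nothing is evaluated yet

f_class <- class(f)          # "formula"
f_vars  <- all.vars(f)       # the variable names used in the formula

fit <- lm(f, data = cars)    # reuse the stored formula in a model
coefs <- coef(fit)           # intercept and slope for speed
print(coefs)
```

The variables in the formula are only looked up when lm() evaluates it against the data argument.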
R by default saves (part of) the code history, and RStudio expands this functionality greatly.
It is often useful to look back at the code history.
There are multiple ways to access the code history.
If you simply get a message, without the words “Error” or “Warning”, it is only there to inform you. The code runs as expected, but you are made aware of possible unwanted effects.
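As a small illustration of an informative message, the hypothetical function below (not from any package) prints a note with message() and then carries on normally:

```r
# Hypothetical helper that informs the user about a conversion
to_number <- function(x) {
  message("Note: input is converted to numeric")
  as.numeric(x)
}

result <- to_number("3")   # prints the note, then returns the number 3
print(result)
```

The message does not interrupt anything: the function still returns its value and any code after it runs.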
The :: operator
Namespaces are an advanced topic and become important when you start to develop your own packages. A namespace provides a context for looking up the value of an object associated with a name.
In daily practice you will often encounter the consequences when two different packages use the same name for a function.
For example, the plyr package and the Hmisc package both have a function named summarize(), but the two functions do different things.
If you load plyr and then Hmisc, summarize() will refer to the Hmisc version.
If you load Hmisc and then plyr, summarize() will refer to the plyr version.
To avoid the confusion of not knowing which function is active, you can disambiguate the functions by using the :: operator: Hmisc::summarize() and plyr::summarize() will refer to the function in the specific package. Hence the package name before :: serves as a namespace by providing the context where to look for the function summarize().
Then the order in which you loaded the packages does not matter anymore.
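The same masking effect can be reproduced without installing anything. In this sketch a user-defined function (hypothetical, for illustration only) takes over the name sd, while stats::sd() still reaches the version from the stats package:

```r
# A user-defined function that masks the sd() function from package stats
sd <- function(x) "my own sd"

own <- sd(c(1, 2, 3))          # calls the user-defined version
pkg <- stats::sd(c(1, 2, 3))   # calls the version from package stats

print(own)
print(pkg)

rm(sd)                         # remove the masking function again
```

After rm(sd), a plain call to sd() finds the stats version again.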
When the message is preceded by “Warning”, your code will still run, but with caveats: it may not produce the results you expect.
z <- 1:5
y <- 1:6
z
## [1] 1 2 3 4 5
y
## [1] 1 2 3 4 5 6
z + y
## Warning in z + y: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 8 10 7
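The warning above comes from recycling: R repeats the shorter vector until it matches the longer one. When the longer length is an exact multiple of the shorter, this happens silently. A minimal sketch, which also shows one way to catch the warning with tryCatch():

```r
# 6 is a multiple of 3, so 1:3 is recycled twice without any warning
silent <- 1:6 + 1:3
print(silent)

# 6 is not a multiple of 5: catch the warning instead of printing it
caught <- tryCatch(1:5 + 1:6,
                   warning = function(w) conditionMessage(w))
print(caught)
```

The silent case is often the more dangerous one, because nothing tells you that recycling took place.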
Generally, when there is an error, the code will not run.
For example, suppose we want to load the package Hmisc with library(Hmisc), but it is not installed. We get an error and the code stops.
If you use require(Hmisc) instead, there will be a warning message and the rest of the code (if there is any) will be executed.
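The difference between the two can be sketched with a package name that is assumed not to exist on any system (the name below is hypothetical). tryCatch() is used only so that this example itself keeps running after the error.

```r
pkg <- "aPackageThatDoesNotExist"   # hypothetical name, assumed not installed

# library() signals an error, which is caught here
lib_result <- tryCatch(
  library(pkg, character.only = TRUE),
  error = function(e) "error"
)

# require() returns FALSE and gives a warning instead of stopping
req_result <- suppressWarnings(require(pkg, character.only = TRUE))

print(lib_result)
print(req_result)
```

Because require() returns TRUE or FALSE, its result can be checked in an if statement; without that check, later code that depends on the package will fail further down.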
Use comments (#) to clarify what you are doing
R-files
RStudio projects
RStudio Projects
Every time you start a new data analysis project, create a new RStudio Project.
Because you want your project to work. RStudio Projects create a convention that guarantees that the project can be moved around on your computer or onto other computers and will still “just work”.
Today you had perhaps your first experience with R and with coding.
You may have noticed that computers are not that smart: you have to give very precise instructions, without mistakes or ambiguity. Remember also the way computers store numerical values.
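A short illustration of why the storage of numerical values matters: most decimal fractions cannot be represented exactly in binary floating point, so exact comparisons can fail. A minimal sketch:

```r
exact_equal <- (0.1 + 0.2 == 0.3)                  # FALSE: tiny representation error
near_equal  <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a tolerance

print(exact_equal)
print(near_equal)
print(0.1 + 0.2, digits = 17)                      # shows the stored value in full
```

When comparing computed numbers, all.equal() (wrapped in isTRUE()) is usually a safer choice than ==.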
R-coding and tidyverse style guide
File names should end in .R and, of course, be meaningful.
GOOD: predict_ad_revenue.R
BAD: foo.R
Place spaces around all binary operators (=, +, -, <-, etc.).
Exception: Spaces around =’s are optional when passing parameters in a function call.
lm(age ~ bmi, data=boys)
or
lm(age ~ bmi, data = boys)
Do not place a space before a comma, but always place one after a comma.
GOOD:
tab.prior <- table(df[df$days.from.opt < 0, "campaign.id"])
total <- sum(x[, 1])
total <- sum(x[1, ])
Extra spacing (i.e., more than one space in a row) is okay if it improves alignment of equals signs or arrows (<-).
plot(x    = x.coord,
     y    = data.mat[, MakeColName(metric, ptiles[1], "roiOpt")],
     ylim = ylim,
     xlab = "dates",
     ylab = metric,
     main = (paste(metric, " for 3 samples ", sep = "")))
Do not place spaces around code in parentheses or square brackets.
Exception: Always place a space after a comma.
GOOD:
if (debug)
x[1, ]
BAD:
if ( debug ) # No spaces around debug
x[1,] # Needs a space after the comma
BAD:
# Needs spaces around '<'
tab.prior <- table(df[df$days.from.opt<0, "campaign.id"])
# Needs a space after the comma
tab.prior <- table(df[df$days.from.opt < 0,"campaign.id"])
# Needs a space before <- #Using alt/option - creates the correct spacing
tab.prior<- table(df[df$days.from.opt < 0, "campaign.id"])
# Needs spaces around <- #Using alt/option - creates the correct spacing
tab.prior<-table(df[df$days.from.opt < 0, "campaign.id"])
# Needs a space after the comma
total <- sum(x[,1])
# Needs a space after the comma, not before
total <- sum(x[ ,1])
Don’t use underscores ( _ ) or hyphens ( - ) in identifiers. Identifiers should be named according to the following conventions.
variable.name is preferred, variableName is accepted
GOOD: avg.clicks
OK: avgClicks
BAD: avg_Clicks
FunctionName
GOOD: CalculateAvgClicks
BAD: calculate_avg_clicks, calculateAvgClicks
kConstantName
The maximum line length is 80 characters.
# This is to demonstrate that at about eighty characters you would move off of the page
# Also, if you have a very wide function call
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
# it would be nicer to write it as
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt
          + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
# or
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg
          + bmi * hgt
          + bmi * wgt
          + wgt * hgt
          + wgt * hgt * bmi,
          data = boys)
When indenting your code, use two spaces. RStudio does this for you!
Never use tabs or a mix of tabs and spaces.
Exception: When a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis.
Use common sense and BE CONSISTENT.
The point of having style guidelines is to have a common vocabulary of coding.
If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers out of their rhythm when they go to read it. Try to avoid this.