- Data manipulation (base R, tidyverse)
- subsetting
- Importing data into R, exporting data
- Standard solves for missing data
- Modeling in R: the model formula
- Errors, warnings, messages
- Coding tips
- Google style guide
Statistical Programming with R
dplyr (tidyverse)
dplyr is a very useful package for data transformation and manipulation. See the cheat sheet in RStudio: Help -> Cheat Sheets.
Subsetting in base R with [], [[]] and $
In dplyr, two important functions for subsetting:
- select(): subset columns
- filter(): subset rows
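As a minimal sketch of the three base R operators, assuming a small made-up data frame D (its values are illustrative and are not the data shown on the other slides):

```r
# Hypothetical data frame for illustration only
D <- data.frame(V1 = c(-0.5, 0.2, 1.5),
                V2 = c(8.4, 5.9, 2.5),
                V3 = c("a", "b", "c"))

D[2, ]       # [ ] with a row index returns a data frame with one row
D[, "V1"]    # [ ] with a single column returns a vector by default
D["V1"]      # [ ] with one index keeps the data frame structure
D[["V1"]]    # [[ ]] extracts one column as a vector
D$V1         # $ does the same, by name
```

The difference between D["V1"] and D[["V1"]] matters whenever a function expects a vector rather than a data frame.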
select()
Remove second column with select():
# Remove second column:
require(plyr)
## Loading required package: plyr
dplyr::select(D, -2)
##            V1 V3
## 1 -0.56047565  a
## 2 -0.23017749  b
## 3  1.55870831  c
## 4  0.07050839  d
## 5  0.12928774  e
# with base R:
D[, -2]
select()
Other way to remove the second column
D
##            V1       V2 V3
## 1 -0.56047565 8.430130  a
## 2 -0.23017749 5.921832  b
## 3  1.55870831 2.469878  c
## 4  0.07050839 3.626294  d
## 5  0.12928774 4.108676  e
dplyr::select(D, V1, V3)
##            V1 V3
## 1 -0.56047565  a
## 2 -0.23017749  b
## 3  1.55870831  c
## 4  0.07050839  d
## 5  0.12928774  e
# base R:
D[, c(1, 3)]
filter()
D
##            V1       V2 V3
## 1 -0.56047565 8.430130  a
## 2 -0.23017749 5.921832  b
## 3  1.55870831 2.469878  c
## 4  0.07050839 3.626294  d
## 5  0.12928774 4.108676  e
dplyr::filter(D, V1 < 0 & V2 > 1)
##           V1       V2 V3
## 1 -0.5604756 8.430130  a
## 2 -0.2301775 5.921832  b
# with base R:
D[D$V1 < 0 & D$V2 > 1, ]
Importing data into R
.RData
In Practical B you worked with the boys data, which were stored in the R file format .RData, an R workspace file.
Open the sleepdata.RData file with:
load("sleepdata.RData")
# Note: This code works if you have placed the sleepdata.RData file
# in the same project folder as your Rmd file. You do not have to
# specify a file path.
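The reverse direction, exporting to an .RData file, can be sketched with save(). The object name scores and the temporary file are assumptions made only for this illustration:

```r
# Hypothetical object and temporary file, used only for this illustration
scores <- data.frame(id = 1:3, score = c(7.5, 8.0, 6.5))
path <- tempfile(fileext = ".RData")

save(scores, file = path)   # export: write the object to an .RData file
rm(scores)                  # remove it from the workspace
loaded <- load(path)        # import: restores the object under its original name

loaded                      # load() returns the names of the restored objects
head(scores)
```

Note that load() restores objects under the names they had when saved, so it can silently overwrite an existing object with the same name.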
R has many in-built data sets. The command data() will give a list of all in-built data sets (including the data sets in non-base packages that are currently loaded).
Open an in-built data set as follows:
require(MASS) # load the package MASS that contains the mammals data.
## Loading required package: MASS
data(mammals) # load the mammals data
Text files (.txt) can be imported into R with:
read.table("mammalsleep.txt")
CSV (comma separated values) files can be imported with:
read.csv("filename.csv", header = TRUE, sep = ",")   # for comma (,) separated files
read.csv2("filename.csv", header = TRUE, sep = ";")  # for semicolon (;) separated files
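Exporting works the same way in reverse. The following is a small round-trip sketch with write.csv() and read.csv(); the data frame animals and the temporary file are made up for illustration:

```r
# Hypothetical data and temporary file, used only for this illustration
animals <- data.frame(species = c("cat", "cow"), body = c(3.3, 465))
path <- tempfile(fileext = ".csv")

write.csv(animals, file = path, row.names = FALSE)  # export without row names
animals2 <- read.csv(path, header = TRUE)           # import the same file again

all.equal(animals, animals2)                        # compare original and re-import
```

Setting row.names = FALSE avoids writing an extra, unnamed first column that would otherwise be read back as data.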
There are many packages that facilitate importing/exporting other data formats from statistical software:
- read_spss() from package haven (but also other data formats from Stata and SAS)
- MplusAutomation
- read.dta() in foreign
- sasxport.get() from package Hmisc
- read.xlsx() from package openxlsx
- read_excel() from package readxl
haven by Hadley Wickham provides wonderful functions to import and export many data types from software such as Stata, SAS and SPSS.
For a short guideline to import multiple formats into R, see e.g. http://www.statmethods.net/input/importingdata.html.
Standard solves for missing data
By default, calculations that involve missing values (NAs) return NA in R:
mean(c(1, 2, NA, 4, 5))
## [1] NA
There are two easy ways to perform “listwise deletion”:
mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
## [1] 3
mean(na.omit(c(1, 2, NA, 4, 5)))
## [1] 3
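For a data frame, listwise deletion means dropping every row that has at least one missing value. A minimal sketch, assuming a made-up data frame d:

```r
# Hypothetical data frame with missing values, for illustration only
d <- data.frame(x = c(1, 2, NA, 4), y = c(10, NA, 30, 40))

is.na(d$x)                # logical vector flagging the missing entries
colSums(is.na(d))         # number of missing values per column
complete.cases(d)         # TRUE for rows without any missing value
d_complete <- na.omit(d)  # listwise deletion: drop every incomplete row
nrow(d_complete)
```

Checking colSums(is.na(d)) first shows how much data listwise deletion would discard.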
Modeling in R: the model formula
To model objects based on other objects, we use the ~ (tilde) operator to construct an R model formula, a type of language object.
For example, to model body mass index (BMI) on weight with a linear regression model, use the model formula bmi ~ wgt with the linear model function lm():
require(mice)
lm(formula = bmi ~ wgt, data = boys)
##
## Call:
## lm(formula = bmi ~ wgt, data = boys)
##
## Coefficients:
## (Intercept)          wgt
##     14.5401       0.0935
In this case, bmi ~ wgt means “regress bmi on wgt (weight)”.
The model formula can also be used to create a scatterplot with bmi on the y-axis and wgt on the x-axis.
plot(formula = bmi ~ wgt, data = boys)
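Because a formula is an object, it can be stored and inspected before it is passed to a modeling function. The sketch below uses the built-in cars data instead of boys, an assumption made so that it runs without extra packages:

```r
# The built-in cars data (speed and stopping distance) is used here only
# so that the example runs without the mice package
f <- dist ~ speed        # a formula can be stored like any other object
class(f)                 # "formula"
all.vars(f)              # names of the variables used in the formula

fit <- lm(f, data = cars)
coef(fit)                # intercept and slope of the fitted line
```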
R
by default saves (part of) the code history and RStudio
expands this functionality greatly.
Most often it may be useful to look back at the code history for various reasons.
There are multiple ways to access the code history.
If you simply get a message, without the words “Error” or “Warning”, it is a message to inform you. The code runs as expected, but you are simply made aware of possible unwanted effects:
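A message can be sketched with the base function message(). The helper half() below is hypothetical and exists only to illustrate the behaviour:

```r
# Hypothetical helper that informs the user without stopping or warning
half <- function(x) {
  message("Note: halving ", length(x), " value(s)")
  x / 2
}

res <- half(c(2, 4))                      # prints the note, then returns the result
res
quiet <- suppressMessages(half(c(2, 4)))  # same result, note is hidden
```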
The :: operator
Namespaces are an advanced topic and become important when you start to develop your own packages. Namespaces provide a context for looking up the value of an object associated with a name.
In daily practice you will often encounter the consequences when two different packages use the same name for a function.
For example, the plyr package and the Hmisc package both have a function named summarize(), but with different functionality.
If you load plyr then Hmisc, summarize() will refer to the Hmisc version. If you load Hmisc then plyr, summarize() will refer to the plyr version.
To avoid the confusion of not knowing which function is active, you can disambiguate the functions by using the :: operator:
Hmisc::summarize() and plyr::summarize() will refer to the function in the specific package. Hence the prefix before :: names the namespace, providing the context in which to look for the function summarize().
Then the order in which you loaded the packages does not matter anymore.
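The same masking effect can be reproduced without installing anything. In this sketch a user-defined sd() (a hypothetical function written only for this illustration) hides the version in the stats package, and :: retrieves the original:

```r
# Hypothetical: a user-defined function that masks stats::sd()
sd <- function(x) "this is not a standard deviation"

sd(c(1, 2, 3))          # finds the user-defined version first
stats::sd(c(1, 2, 3))   # explicitly asks for the version in package stats

rm(sd)                  # remove the mask; sd() refers to stats::sd() again
```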
When the message is preceded by “Warning”, your code will still run, but with some caveats: it may not produce the results you expect.
z <- 1:5
y <- 1:6
z
## [1] 1 2 3 4 5
y
## [1] 1 2 3 4 5 6
z + y
## Warning in z + y: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 8 10 7
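The warning comes from recycling the shorter vector. A short sketch of when recycling stays silent, and of one way to capture a warning with tryCatch() (the variable name msg is chosen here for illustration):

```r
# Recycling is silent when the longer length is a multiple of the shorter one
1:6 + 1:3                      # no warning: 1:3 is recycled exactly twice

# A warning can be caught and inspected with tryCatch()
msg <- tryCatch(1:5 + 1:6,
                warning = function(w) conditionMessage(w))
msg                            # the text of the warning as a character string
```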
Generally, when there is an error, the code will not run.
For example, we want to load the package Hmisc with library(Hmisc), but it is not installed. We will get the following Error and the code will not run:
If you use require(Hmisc) instead, there will be a warning message and the rest of the code (if there is any) will be executed:
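The difference between the two functions can be sketched as follows. The package name is hypothetical and is assumed not to be installed:

```r
# Hypothetical package name that is assumed not to be installed
pkg <- "aPackageThatIsNotInstalled"

# library() signals an error; tryCatch() turns it into a message string here
err <- tryCatch(library(pkg, character.only = TRUE),
                error = function(e) conditionMessage(e))
err

# require() returns FALSE (with a warning) instead of stopping
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok
```

Because require() only returns FALSE, later code that depends on the package will still fail, just further down the script.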
- Use comments (#) to clarify what you are doing
- R-files
- RStudio projects
RStudio Projects
Every time you start a new data analysis project, create a new RStudio Project.
Because you want your project to work:
RStudio Projects create a convention that guarantees that the project can be moved around on your computer or onto other computers and will still “just work”:
Today you had perhaps your first experience with R and with coding.
You may have noticed that computers are not that smart: you have to give very precise instructions, without mistakes or ambiguity. Remember also the way computers store numerical values.
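As a brief reminder of what the storage of numerical values implies, the following sketch shows that decimal fractions are stored only approximately, so exact comparisons can fail:

```r
0.1 + 0.2 == 0.3                    # FALSE: neither side is stored exactly
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: comparison with a small tolerance
sqrt(2)^2 == 2                      # FALSE for the same reason
print(0.1 + 0.2, digits = 17)       # shows more digits of the stored value
```

For this reason all.equal() is usually a safer choice than == when comparing non-integer numbers.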
R-coding and tidyverse style guide
File names should end in .R and, of course, be meaningful.
GOOD:
predict_ad_revenue.R
BAD:
foo.R
Place spaces around all binary operators (=, +, -, <-, etc.).
Exception: Spaces around =’s are optional when passing parameters in a function call.
lm(age ~ bmi, data=boys)
or
lm(age ~ bmi, data = boys)
Do not place a space before a comma, but always place one after a comma.
GOOD:
tab.prior <- table(df[df$days.from.opt < 0, "campaign.id"])
total <- sum(x[, 1])
total <- sum(x[1, ])
Extra spacing (i.e., more than one space in a row) is okay if it improves alignment of equals signs or arrows (<-).
plot(x    = x.coord,
     y    = data.mat[, MakeColName(metric, ptiles[1], "roiOpt")],
     ylim = ylim,
     xlab = "dates",
     ylab = metric,
     main = (paste(metric, " for 3 samples ", sep = "")))
Do not place spaces around code in parentheses or square brackets.
Exception: Always place a space after a comma.
GOOD:
if (debug)
x[1, ]
BAD:
if ( debug )  # No spaces around debug
x[1,]  # Needs a space after the comma
BAD:
# Needs spaces around '<'
tab.prior <- table(df[df$days.from.opt<0, "campaign.id"])
# Needs a space after the comma
tab.prior <- table(df[df$days.from.opt < 0,"campaign.id"])
# Needs a space before <-
# Using alt/option - creates the correct spacing
tab.prior<- table(df[df$days.from.opt < 0, "campaign.id"])
# Needs spaces around <-
# Using alt/option - creates the correct spacing
tab.prior<-table(df[df$days.from.opt < 0, "campaign.id"])
# Needs a space after the comma
total <- sum(x[,1])
# Needs a space after the comma, not before
total <- sum(x[ ,1])
Don’t use underscores ( _ ) or hyphens ( - ) in identifiers. Identifiers should be named according to the following conventions.
variable.name is preferred, variableName is accepted
GOOD: avg.clicks
OK: avgClicks
BAD: avg_Clicks
FunctionName
GOOD: CalculateAvgClicks
BAD: calculate_avg_clicks, calculateAvgClicks
kConstantName
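Put together, the three naming conventions might look like this. All names and values are hypothetical and chosen only to illustrate the pattern:

```r
# Hypothetical names, chosen only to illustrate the conventions above
kSecondsPerMinute <- 60                   # constant: kConstantName

CalculateAvgClicks <- function(clicks) {  # function: FunctionName
  mean(clicks, na.rm = TRUE)
}

avg.clicks <- CalculateAvgClicks(c(10, 12, NA, 14))  # variable: variable.name
avg.clicks
```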
The maximum line length is 80 characters.
# This is to demonstrate that at about eighty characters you would move off of the page
# Also, if you have a very wide function
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
# it would be nice to pose it as
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt +
            bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
# or
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg +
            bmi * hgt +
            bmi * wgt +
            wgt * hgt +
            wgt * hgt * bmi,
          data = boys)
When indenting your code, use two spaces. RStudio does this for you!
Never use tabs or a mix of tabs and spaces.
Exception: When a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis.
Use common sense and BE CONSISTENT.
The point of having style guidelines is to have a common vocabulary of coding.
If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers out of their rhythm when they go to read it. Try to avoid this.