Topics

Data manipulation (base R, tidyverse)
- subsetting
Importing data into R, exporting data
Standard solves for missing data
Modeling in R: the model formula
Errors, warnings, messages
Coding tips
Google style guide

Tidyverse

Data transformation with `dplyr` (tidyverse)

HTML5 Icon

dplyr is a very useful package for data transformation and manipulation. See cheat sheet in R Studio, Help -> cheat sheets

Subsetting in base R with [], [[]] and $

In dplyr, two important functions for subsetting:

select() : subset columns

filter() : subset rows

Subset columns with `select()`

Remove second column with select():

# Remove second column:
require(plyr)

## Loading required package: plyr

dplyr::select(D, -2)

##            V1 V3
## 1 -0.56047565  a
## 2 -0.23017749  b
## 3  1.55870831  c
## 4  0.07050839  d
## 5  0.12928774  e

# with base R: D[, -2]

Subset columns with `select()`

Other way to remove the second column

##            V1       V2 V3
## 1 -0.56047565 8.430130  a
## 2 -0.23017749 5.921832  b
## 3  1.55870831 2.469878  c
## 4  0.07050839 3.626294  d
## 5  0.12928774 4.108676  e

dplyr::select(D, V1, V3)

##            V1 V3
## 1 -0.56047565  a
## 2 -0.23017749  b
## 3  1.55870831  c
## 4  0.07050839  d
## 5  0.12928774  e

# base R: D[, c(1,3)]

Subset rows with `filter()`

##            V1       V2 V3
## 1 -0.56047565 8.430130  a
## 2 -0.23017749 5.921832  b
## 3  1.55870831 2.469878  c
## 4  0.07050839 3.626294  d
## 5  0.12928774 4.108676  e

dplyr::filter(D, V1 < 0 & V2 > 1)

##           V1       V2 V3
## 1 -0.5604756 8.430130  a
## 2 -0.2301775 5.921832  b

# with base R: D[D$V1 < 0 & D$V2 > 1, ]

Importing data into `R`

R data format and workspace: `.RData`

In Practical B you worked with the boys data which were stored in the R file format: .RData, an R workspace file.

A workspace contains all changes you made to your data and functions during a session.
Workspaces are compressed and require relatively little memory when stored. The compression is very efficient and beats reloading large data sets from raw text.

Open the sleepdata.Rdata file with:

load("sleepdata.RData")
# Note: This code works if you have placed the sleepdata.RData file 
# in the same project folder as your Rmd file. You do not have to 
# specify a file path.

Data sets in R (packages)

R has many in-built data sets. The command data() will give a list of all in-built data sets (also the data included in the non-base packages that are activated).

Open an in-built data set as follows:

require(MASS) # load the package MASS that contains the mammals data.

## Loading required package: MASS

data(mammals) # load the mammals data

Importing delimited data files

Text files (.txt) can be imported in to R with:

read.table("mammalsleep.txt")

CSV (comma seperated values) files can be imported with:

read.csv("filename.csv", header=TRUE, sep=",") # for comma (,) separated files 

read.csv2("filename.csv", header=TRUE, sep=";") # for semicolons (;) separated files

Read and write statistical data formats

There are many packages that facilitate importing/exporting other data formats from statistical software:

SPSS: the function read_spss from package haven (but also other data formats from Stata and SAS)
Mplus: package MplusAutomation
Stata: read.dta() in foreign
SAS: sasxport.get() from package Hmisc
MS Excel:
- function read.xlsx() from package openxlsx
- function read_excel() from package readxl

haven by Hadley Wickham provides wonderful functions to import and export many data types from software such as Stata, SAS and SPSS.

For a short guideline to import multiple formats into R, see e.g. http://www.statmethods.net/input/importingdata.html.

Standard solves for missing values

Dealing with missing values in `R`

Calculations based on missing values (NA’s) are not possible in R:

mean(c(1, 2, NA, 4, 5))

## [1] NA

There are two easy ways to perform “listwise deletion”:

mean(c(1, 2, NA, 4, 5), na.rm = TRUE)

## [1] 3

mean(na.omit(c(1, 2, NA, 4, 5)))

## [1] 3

Modeling in `R`

The model formula

To model objects based on other objects, we use the ~ (tilde) operator to construct an R model formula, a type of language object.

For example, to model body mass index (BMI) on weight with a linear regression model, use the model formula bmi ~ wgt with the linear model function lm():

require(mice)
lm(formula = bmi ~ wgt, data = boys)

## 
## Call:
## lm(formula = bmi ~ wgt, data = boys)
## 
## Coefficients:
## (Intercept)          wgt  
##     14.5401       0.0935

In this case, `bmi ~ wgt means “regress bmi on weight.

The model formula

The model formula can also be used to plot create a scatterplot with bmi on the y-axis and wgt on the x-axis.

plot(formula = bmi ~ wgt, data = boys)

More R functionality

History and why it is useful

R by default saves (part of) the code history and RStudio expands this functionality greatly.

Most often it may be useful to look back at the code history for various reasons.

There are multiple ways to access the code history.
1. Use arrow up in the console. This allows you to go back in time, one codeline by one. Extremely useful to go back to previous lines for minor alterations to the code.
2. Use the history tab in the environment pane. The complete project history can be found here and the history can be searched. This is particularly convenient when you know what code you are looking for.

Errors, warnings, messages

Messages

If you simply get a message, without the words “Error” or “Warning”, it is a message to inform you. The code runs as expected, but you are simply made aware of possible unwanted effects:

HTML5 Icon

Namespaces and the `::` operator

Namespaces is an advanced topic and becomes important when you start to develop your own packages. Namespaces provide a context for looking up the value of an object associated with a name.

In daily practice you will often encounter the consequences when two different packages use the same name for a function.

For example, the plyr package and the Hmisc package both have a function with the same name summarize()but not with the same functionality.

If you load plyr then Hmisc, summarize() will refer to the Hmisc version. If you load Hmisc then plyr, summarize() will refer to the plyr version.

To avoid the confusion of not knowing which function is active, you can disambiguate the functions by using the :: operator:

Hmisc::summarize() and plyr::summarize will refer to the function in the specific package. Hence :: serves as a namespace by providing the context where to look for the function summarize.

Then the order in which you loaded the packages does not matter anymore.

Warning

When the message is preceded by “Warning”: your code will still work, but with some caveats and will not produce the results you expect.

z <- 1:5 
y <- 1:6  
z

## [1] 1 2 3 4 5

## [1] 1 2 3 4 5 6

z + y

## Warning in z + y: longer object length is not a multiple of shorter object
## length

## [1]  2  4  6  8 10  7

Errors

Generally when there is an error, the code will not run.

For example, we want to load the package Hmisc but it is not installed. We will get the following Error and the code will not run:

HTML5 Icon

If you use require(Hmisc) instead, there will be a warning message and the rest of the code (if there is any) will be executed:

HTML5 Icon

Programming tips and organising your work

Some tips to learn to code:

keep your code tidy
use comments (text preceded by #) to clarify what you are doing
- If you look at your code again, one month from now: you will not know what you did –> unless you use comments
when working with functions, use the TAB key to quickly access the help for the function’s components
work with logically named R-files
- indicate the sequential nature of your work
work with RStudio projects

Use `RStudio Projects`

Every time you start a new data analysis project, create a new RStudio Project.

Because you want your project to work:

not only now, but also in a few years;
when the folder and file paths have changed;
when collaborators want to run your code on their computer.

RStudio Projects create a convention that guarantees that the project can be moved around on your computer or onto other computers and will still “just work”:

all code and outputs are stored in one set location;
relative file paths are created;
a clean R environment is created every time you open it;
every project can have its own version control system and history;
RStudio projects can relate to Git (or other online) repositories.

Some tips to learn to code:

Today you had perhaps your first experience with R and with coding.

You may have noticed that computers are not that smart: you have to give very precise instructions without mistakes and ambiguous meaning. Remember also the way computers store numerical values.

Practice, practice, practice …
Use the “copy, paste, tweak” approach: use code made by others (plenty available on the web) and tweak it to make it useful for your project.
Learning to code goes smoother when you are working on a particular data project that is important to you, like analyzing your own data. Practice what you learn in this course on your own data.
You become more organized in coding, as you focus on creating readable code. In the long run, this will result in you becoming a more efficient programmer. Remember: efficient code runs faster.

`R`-coding and tidy verse style guide

https://style.tidyverse.org/index.html

Naming conventions

File Names

File names should end in .R and, of course, be meaningful.

GOOD:

predict_ad_revenue.R

BAD:

foo.R

Spacing

Place spaces around all binary operators (=, +, -, <-, etc.).

Exception: Spaces around =’s are optional when passing parameters in a function call.

lm(age ~ bmi, data=boys)

lm(age ~ bmi, data = boys)

Spacing (continued)

Do not place a space before a comma, but always place one after a comma.

GOOD:

tab.prior <- table(df[df$days.from.opt < 0, "campaign.id"])
total <- sum(x[, 1])
total <- sum(x[1, ])

Extra spacing

Extra spacing (i.e., more than one space in a row) is okay if it improves alignment of equals signs or arrows (<-).

plot(x    = x.coord,
     y    = data.mat[, MakeColName(metric, ptiles[1], "roiOpt")],
     ylim = ylim,
     xlab = "dates",
     ylab = metric,
     main = (paste(metric, " for 3 samples ", sep = "")))

Do not place spaces around code in parentheses or square brackets.

Exception: Always place a space after a comma.

Extra spacing

GOOD:

if (debug)
x[1, ]

BAD:

if ( debug )  # No spaces around debug
x[1,]  # Needs a space after the comma

Spacing (continued)

BAD:

# Needs spaces around '<'
tab.prior <- table(df[df$days.from.opt<0, "campaign.id"])  
# Needs a space after the comma
tab.prior <- table(df[df$days.from.opt < 0,"campaign.id"])  
# Needs a space before <-  #Using alt/option - creates the correct spacing
tab.prior<- table(df[df$days.from.opt < 0, "campaign.id"]) 
# Needs spaces around <-  #Using alt/option - creates the correct spacing
tab.prior<-table(df[df$days.from.opt < 0, "campaign.id"])  
# Needs a space after the comma
total <- sum(x[,1])  
# Needs a space after the comma, not before 
total <- sum(x[ ,1])

Identifiers

Don’t use underscores ( _ ) or hyphens ( - ) in identifiers. Identifiers should be named according to the following conventions.

The preferred form for variable names is all lower case letters and words separated with dots (variable.name), but variableName is also accepted;
function names have initial capital letters and no dots (FunctionName);
constants are named like functions but with an initial k.

Identifiers (continued)

variable.name is preferred, variableName is accepted
GOOD: avg.clicks
OK: avgClicks
BAD: avg_Clicks
FunctionName
GOOD: CalculateAvgClicks
BAD: calculate_avg_clicks , calculateAvgClicks
kConstantName

Syntax

Line Length

The maximum line length is 80 characters.

# This is to demonstrate that at about eighty characters you would move off of the page

# Also, if you have a very wide function
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt + wgt * hgt + wgt * hgt * bmi, data = boys)

# it would be nice to pose it as
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt 
          + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
#or
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg 
          + bmi * hgt 
          + bmi * wgt
          + wgt * hgt 
          + wgt * hgt * bmi, 
          data = boys)

Indentation

When indenting your code, use two spaces. RStudio does this for you!

Never use tabs or a mix of tabs and spaces.

Exception: When a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis.

In general…

Use common sense and BE CONSISTENT.
The point of having style guidelines is to have a common vocabulary of coding
- so people can concentrate on what you are saying, rather than on how you are saying it.
If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers out of their rhythm when they go to read it. Try to avoid this.

R-functionality

Laurence Frank, Gerko Vink

Statistical Programming with R

Topics

Tidyverse

Data transformation with dplyr (tidyverse)

Subset columns with select()

Subset columns with select()

Subset rows with filter()

Importing data into R

R data format and workspace: .RData

Data sets in R (packages)

Importing delimited data files

Read and write statistical data formats

Standard solves for missing values

Dealing with missing values in R

Modeling in R

The model formula

The model formula

More R functionality

History and why it is useful

Errors, warnings, messages

Messages

Namespaces and the :: operator

Warning

Errors

Programming tips and organising your work

Some tips to learn to code:

Use RStudio Projects

Some tips to learn to code:

R-coding and tidy verse style guide

Naming conventions

File Names

Spacing

Spacing (continued)

Extra spacing

Extra spacing

Spacing (continued)

Identifiers

Identifiers (continued)

Syntax

Line Length

Indentation

In general…

Data transformation with `dplyr` (tidyverse)

Subset columns with `select()`

Subset columns with `select()`

Subset rows with `filter()`

Importing data into `R`

R data format and workspace: `.RData`

Dealing with missing values in `R`

Modeling in `R`

Namespaces and the `::` operator

Use `RStudio Projects`

`R`-coding and tidy verse style guide