library(MASS)
library(ggplot2)
library(dplyr)
library(gridExtra)

The Diamonds data

In this exercise, predict the price of diamonds from a continuous and categorical variable. The data set is diamonds of the package ggplot2, The data set is very large, so we will work with a random sample of 500 cases.

First have a look at the structure of the data set with: str(). What do you conclude? Which data types do you see?

str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
  1. Set the seed to 100 and—with the help of pipe operators—use the slice_sample() function to take a random sample form the original data set. Call the new data set: my_diamonds, Change the class of the variables cut, color and clarity from ordered factor to factor. Look again at the results with str() and summary().
set.seed(100)
my_diamonds <- dplyr::slice_sample(diamonds, n = 500, replace = TRUE) %>% 
  dplyr::mutate(cut = factor(cut, ordered = FALSE),
         color   = factor(color, ordered = FALSE),
         clarity = factor(clarity, ordered = FALSE))

str(my_diamonds)
## tibble [500 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:500] 1.26 0.7 0.36 2.1 1.21 2 0.4 1.02 0.61 0.55 ...
##  $ cut    : Factor w/ 5 levels "Fair","Good",..: 5 5 5 4 4 2 5 3 3 5 ...
##  $ color  : Factor w/ 7 levels "D","E","F","G",..: 4 1 3 7 1 2 3 2 4 4 ...
##  $ clarity: Factor w/ 8 levels "I1","SI2","SI1",..: 3 4 3 2 2 2 7 8 3 5 ...
##  $ depth  : num [1:500] 59.6 62.7 62 59.1 59.7 64.7 61.7 58.1 62 61.6 ...
##  $ table  : num [1:500] 57 57 56 58 58 57 57 59 58 55 ...
##  $ price  : int [1:500] 6738 3448 770 12494 4946 15393 1446 11420 1659 1786 ...
##  $ x      : num [1:500] 7.08 5.65 4.59 8.46 7.06 7.75 4.74 6.71 5.43 5.29 ...
##  $ y      : num [1:500] 7.04 5.67 4.54 8.4 6.96 7.86 4.7 6.75 5.48 5.32 ...
##  $ z      : num [1:500] 4.21 3.55 2.83 4.98 4.19 5.05 2.91 3.91 3.38 3.27 ...
summary(my_diamonds)
##      carat               cut      color      clarity        depth      
##  Min.   :0.2300   Fair     : 12   D: 53   SI1    :117   Min.   :56.20  
##  1st Qu.:0.4000   Good     : 41   E:102   VS2    :107   1st Qu.:61.10  
##  Median :0.7000   Very Good:120   F: 90   SI2    : 94   Median :61.90  
##  Mean   :0.8069   Premium  :123   G: 97   VS1    : 78   Mean   :61.76  
##  3rd Qu.:1.1000   Ideal    :204   H: 85   VVS2   : 51   3rd Qu.:62.50  
##  Max.   :2.5400                   I: 46   VVS1   : 38   Max.   :68.40  
##                                   J: 27   (Other): 15                  
##      table           price               x               y        
##  Min.   :52.80   Min.   :  354.0   Min.   :3.940   Min.   :3.900  
##  1st Qu.:56.00   1st Qu.:  972.8   1st Qu.:4.718   1st Qu.:4.718  
##  Median :57.00   Median : 2304.5   Median :5.655   Median :5.675  
##  Mean   :57.23   Mean   : 4075.1   Mean   :5.749   Mean   :5.751  
##  3rd Qu.:59.00   3rd Qu.: 5677.5   3rd Qu.:6.643   3rd Qu.:6.622  
##  Max.   :65.00   Max.   :18678.0   Max.   :8.820   Max.   :8.760  
##                                                                   
##        z        
##  Min.   :2.410  
##  1st Qu.:2.917  
##  Median :3.510  
##  Mean   :3.550  
##  3rd Qu.:4.090  
##  Max.   :5.550  
## 

Presentation ready summary tables

The package gtsummary provides a flexible way to create presentation-ready summary tables. Have a look at the webpage of this package to learn more about the possibilities. Install the package gtsummary and load it.

Make a gtsummary table of the my_diamonds data with the following code: my_diamonds %>% tbl_summary()

# install.packages("gtsummary")
require(gtsummary)

my_diamonds %>% tbl_summary()
Characteristic N = 5001
carat 0.70 (0.40, 1.10)
cut
Fair 12 (2.4%)
Good 41 (8.2%)
Very Good 120 (24%)
Premium 123 (25%)
Ideal 204 (41%)
color
D 53 (11%)
E 102 (20%)
F 90 (18%)
G 97 (19%)
H 85 (17%)
I 46 (9.2%)
J 27 (5.4%)
clarity
I1 2 (0.4%)
SI2 94 (19%)
SI1 117 (23%)
VS2 107 (21%)
VS1 78 (16%)
VVS2 51 (10%)
VVS1 38 (7.6%)
IF 13 (2.6%)
depth 61.90 (61.10, 62.50)
table 57.00 (56.00, 59.00)
price 2,304 (973, 5,678)
x 5.66 (4.72, 6.64)
y 5.68 (4.72, 6.62)
z 3.51 (2.92, 4.09)
1 Median (IQR); n (%)

Make a scatterplot with ggplot with the following aesthetics: x-axis = carat; y-axis = price; color = depth and add a meaningful main title to the plot, as well as axis lables. What do you conclude?

p <- ggplot(my_diamonds, aes(x=carat, y=price, color=depth)) +
      geom_point() +
      labs(x = "carat", y = "price", title = "Price of diamonds related to their carat and depth values") 
p


Numeric predictors

We start with a simple regression predicting the price of diamonds from the numeric predictor carat.

  1. Display a scatter plot of price against carat, and add a linear regression line (set the size of the points to 0.5).
ggplot(my_diamonds, aes(x = carat, y = price)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE) 
## `geom_smooth()` using formula = 'y ~ x'

  1. Describe the effect of carat on price.
  • carat is positively associated with price.
  1. Display and interpret the summary of the model.
fit <- lm(price ~ carat, my_diamonds)
fit %>% summary()
## 
## Call:
## lm(formula = price ~ carat, data = my_diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4097.8  -858.6   -73.8   534.8  7458.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2307.6      136.6  -16.90   <2e-16 ***
## carat         7910.2      145.0   54.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1577 on 498 degrees of freedom
## Multiple R-squared:  0.8567, Adjusted R-squared:  0.8564 
## F-statistic:  2978 on 1 and 498 DF,  p-value: < 2.2e-16
  1. Test the significance of the model against the intercept-only model.
anova(lm(price ~ 1, my_diamonds), fit)
## Analysis of Variance Table
## 
## Model 1: price ~ 1
## Model 2: price ~ carat
##   Res.Df        RSS Df Sum of Sq    F    Pr(>F)    
## 1    499 8643336276                                
## 2    498 1238309642  1 7.405e+09 2978 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Display and interpret a the residual plots of the model.
par(mfrow = c(2, 2))
plot(fit)

Categorical predictor

We now predict the price of diamonds from the categorical predictors cut. The variable cut in this data set has class ordered (the class for ordinal variables), but, for this practical, we want it to have class factor (the class for categorical variables).

  1. Display boxplots of price for the different levels of cut with ggplot. Interpret the results.
my_diamonds %>% 
  ggplot(aes(price, cut)) +
  geom_boxplot()

  1. Display and interpret the summary of the model with cut predicting price.
fit <- lm(price ~ cut, my_diamonds)
fit %>% summary()
## 
## Call:
## lm(formula = price ~ cut, data = my_diamonds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5132  -2599  -1478   1668  15326 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6266.2     1176.9   5.324 1.54e-07 ***
## cutGood       -1738.5     1338.1  -1.299   0.1945    
## cutVery Good  -2367.0     1234.3  -1.918   0.0557 .  
## cutPremium     -941.4     1233.0  -0.764   0.4455    
## cutIdeal      -3060.9     1211.0  -2.528   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4077 on 495 degrees of freedom
## Multiple R-squared:  0.04815,    Adjusted R-squared:  0.04045 
## F-statistic: 6.259 on 4 and 495 DF,  p-value: 6.416e-05
  1. What is the reference category of cut?
  • Fair
  1. Which category has the highest predicted price?
  • Fair
  1. Display and interpret the residual plots of the model.
par(mfrow = c(2, 2))
plot(fit)

Multiple regression

  1. Fit the main effects model with carat and cut as predictors of price, and display and interpret the summary of the regression results.
fit_main <- lm(price ~ carat + cut, my_diamonds) 
summary(fit_main)
## 
## Call:
## lm(formula = price ~ carat + cut, data = my_diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4189.6  -837.0   -49.8   538.7  7343.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4121.1      489.0  -8.427 3.91e-16 ***
## carat          8036.6      149.2  53.857  < 2e-16 ***
## cutGood        1531.5      514.6   2.976 0.003060 ** 
## cutVery Good   1650.6      477.2   3.459 0.000589 ***
## cutPremium     1713.1      473.4   3.619 0.000326 ***
## cutIdeal       1883.2      471.5   3.994 7.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1557 on 494 degrees of freedom
## Multiple R-squared:  0.8615, Adjusted R-squared:  0.8601 
## F-statistic: 614.5 on 5 and 494 DF,  p-value: < 2.2e-16
  1. Make a table of the output of the regression model with tbl_regression() (part of the gtsummary package) using the following code:
t1 <-
  lm(price ~ carat + cut, my_diamonds) %>% 
  tbl_regression()
t1
Characteristic Beta 95% CI1 p-value
carat 8,037 7,743, 8,330 <0.001
cut
Fair — —
Good 1,531 521, 2,542 0.003
Very Good 1,651 713, 2,588 <0.001
Premium 1,713 783, 2,643 <0.001
Ideal 1,883 957, 2,809 <0.001
1 CI = Confidence Interval
  1. Qualitatively compare the size and significance of the parameter estimates from this model to those from the simple regression models.
  • The effects are somewhat smaller in the multiple regression model and the significance levels are somewhat lower. The depth effect was significant in the simple regression model and not in the multiple regression model.
  1. Fit a the same model with the interaction between carat and cut included, and display and interpret the model summary.
fit_interact <- lm(price ~ carat * cut, my_diamonds) 
summary(fit_interact)
## 
## Call:
## lm(formula = price ~ carat * cut, data = my_diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4401.7  -807.9   -55.8   549.5  7063.3 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -3633.21    1131.22  -3.212  0.00141 ** 
## carat               7659.10     803.09   9.537  < 2e-16 ***
## cutGood             1037.11    1233.45   0.841  0.40086    
## cutVery Good        1408.20    1166.56   1.207  0.22796    
## cutPremium           859.38    1168.87   0.735  0.46256    
## cutIdeal            1464.59    1149.97   1.274  0.20341    
## carat:cutGood        384.82     936.86   0.411  0.68144    
## carat:cutVery Good    67.75     861.41   0.079  0.93735    
## carat:cutPremium     757.69     846.87   0.895  0.37139    
## carat:cutIdeal       275.17     843.95   0.326  0.74453    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1558 on 490 degrees of freedom
## Multiple R-squared:  0.8624, Adjusted R-squared:  0.8599 
## F-statistic: 341.3 on 9 and 490 DF,  p-value: < 2.2e-16
t2 <-
  lm(price ~ carat * cut, my_diamonds) %>% 
  tbl_regression()
t2
Characteristic Beta 95% CI1 p-value
carat 7,659 6,081, 9,237 <0.001
cut
Fair — —
Good 1,037 -1,386, 3,461 0.4
Very Good 1,408 -884, 3,700 0.2
Premium 859 -1,437, 3,156 0.5
Ideal 1,465 -795, 3,724 0.2
carat * cut
carat * Good 385 -1,456, 2,226 0.7
carat * Very Good 68 -1,625, 1,760 >0.9
carat * Premium 758 -906, 2,422 0.4
carat * Ideal 275 -1,383, 1,933 0.7
1 CI = Confidence Interval
  1. What do you notice about the significance of the effects?
  • Only the effect of carat is significant.
  1. Compare the fit of the model with main effects and the model with interaction effects with the anova function.
anova(fit_main, fit_interact)
## Analysis of Variance Table
## 
## Model 1: price ~ carat + cut
## Model 2: price ~ carat * cut
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1    494 1197262713                           
## 2    490 1189093338  4   8169375 0.8416 0.4992

Make a presentation-ready regression table of both models: the main effect model and the interaction effect model with the following code:

# presentation ready regression table

tbl_merge(
  list(t1, t2),
  tab_spanner = c("**Main effects model**", "**Interaction model**")
)
Characteristic Main effects model Interaction model
Beta 95% CI1 p-value Beta 95% CI1 p-value
carat 8,037 7,743, 8,330 <0.001 7,659 6,081, 9,237 <0.001
cut
Fair — — — —
Good 1,531 521, 2,542 0.003 1,037 -1,386, 3,461 0.4
Very Good 1,651 713, 2,588 <0.001 1,408 -884, 3,700 0.2
Premium 1,713 783, 2,643 <0.001 859 -1,437, 3,156 0.5
Ideal 1,883 957, 2,809 <0.001 1,465 -795, 3,724 0.2
carat * cut
carat * Good 385 -1,456, 2,226 0.7
carat * Very Good 68 -1,625, 1,760 >0.9
carat * Premium 758 -906, 2,422 0.4
carat * Ideal 275 -1,383, 1,933 0.7
1 CI = Confidence Interval
  1. Display a scatter plot with price on the x-axis, carat on the y-axis, with different colors for the points and linear regression lines for the levels of cut. Suppress the confidence bands. What does this plot tell you about the main effects and the interaction effect of carat and cut?
ggplot(my_diamonds, aes(carat, price, col = cut))+
  geom_point() +
  geom_smooth(method = "lm", se = F)
## `geom_smooth()` using formula = 'y ~ x'

  • The regression lines are separated, so there are some main effects.
  • They are almost parallel, so no interactions.

End of practical