library(MASS)
library(ggplot2)
library(dplyr)
library(gridExtra)
In this exercise, predict the price of diamonds from a continuous and
categorical variable. The data set is diamonds
of the
package ggplot2
, The data set is very large, so we will
work with a random sample of 500 cases.
First have a look at the structure of the data set with:
str()
. What do you conclude? Which data types do you
see?
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
slice_sample()
function to take a random sample form the
original data set. Call the new data set: my_diamonds
,
Change the class of the variables cut
, color
and clarity
from ordered factor
to
factor
. Look again at the results with str()
and summary()
.set.seed(100)
my_diamonds <- dplyr::slice_sample(diamonds, n = 500, replace = TRUE) %>%
dplyr::mutate(cut = factor(cut, ordered = FALSE),
color = factor(color, ordered = FALSE),
clarity = factor(clarity, ordered = FALSE))
str(my_diamonds)
## tibble [500 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:500] 1.26 0.7 0.36 2.1 1.21 2 0.4 1.02 0.61 0.55 ...
## $ cut : Factor w/ 5 levels "Fair","Good",..: 5 5 5 4 4 2 5 3 3 5 ...
## $ color : Factor w/ 7 levels "D","E","F","G",..: 4 1 3 7 1 2 3 2 4 4 ...
## $ clarity: Factor w/ 8 levels "I1","SI2","SI1",..: 3 4 3 2 2 2 7 8 3 5 ...
## $ depth : num [1:500] 59.6 62.7 62 59.1 59.7 64.7 61.7 58.1 62 61.6 ...
## $ table : num [1:500] 57 57 56 58 58 57 57 59 58 55 ...
## $ price : int [1:500] 6738 3448 770 12494 4946 15393 1446 11420 1659 1786 ...
## $ x : num [1:500] 7.08 5.65 4.59 8.46 7.06 7.75 4.74 6.71 5.43 5.29 ...
## $ y : num [1:500] 7.04 5.67 4.54 8.4 6.96 7.86 4.7 6.75 5.48 5.32 ...
## $ z : num [1:500] 4.21 3.55 2.83 4.98 4.19 5.05 2.91 3.91 3.38 3.27 ...
summary(my_diamonds)
## carat cut color clarity depth
## Min. :0.2300 Fair : 12 D: 53 SI1 :117 Min. :56.20
## 1st Qu.:0.4000 Good : 41 E:102 VS2 :107 1st Qu.:61.10
## Median :0.7000 Very Good:120 F: 90 SI2 : 94 Median :61.90
## Mean :0.8069 Premium :123 G: 97 VS1 : 78 Mean :61.76
## 3rd Qu.:1.1000 Ideal :204 H: 85 VVS2 : 51 3rd Qu.:62.50
## Max. :2.5400 I: 46 VVS1 : 38 Max. :68.40
## J: 27 (Other): 15
## table price x y
## Min. :52.80 Min. : 354.0 Min. :3.940 Min. :3.900
## 1st Qu.:56.00 1st Qu.: 972.8 1st Qu.:4.718 1st Qu.:4.718
## Median :57.00 Median : 2304.5 Median :5.655 Median :5.675
## Mean :57.23 Mean : 4075.1 Mean :5.749 Mean :5.751
## 3rd Qu.:59.00 3rd Qu.: 5677.5 3rd Qu.:6.643 3rd Qu.:6.622
## Max. :65.00 Max. :18678.0 Max. :8.820 Max. :8.760
##
## z
## Min. :2.410
## 1st Qu.:2.917
## Median :3.510
## Mean :3.550
## 3rd Qu.:4.090
## Max. :5.550
##
The package gtsummary
provides a flexible way to create presentation-ready summary tables.
Have a look at the webpage of this package to learn more about the
possibilities. Install the package gtsummary
and load
it.
Make a gtsummary
table of the my_diamonds
data with the following code:
my_diamonds %>% tbl_summary()
# install.packages("gtsummary")
require(gtsummary)
my_diamonds %>% tbl_summary()
Characteristic | N = 5001 |
---|---|
carat | 0.70 (0.40, 1.10) |
cut | |
Fair | 12 (2.4%) |
Good | 41 (8.2%) |
Very Good | 120 (24%) |
Premium | 123 (25%) |
Ideal | 204 (41%) |
color | |
D | 53 (11%) |
E | 102 (20%) |
F | 90 (18%) |
G | 97 (19%) |
H | 85 (17%) |
I | 46 (9.2%) |
J | 27 (5.4%) |
clarity | |
I1 | 2 (0.4%) |
SI2 | 94 (19%) |
SI1 | 117 (23%) |
VS2 | 107 (21%) |
VS1 | 78 (16%) |
VVS2 | 51 (10%) |
VVS1 | 38 (7.6%) |
IF | 13 (2.6%) |
depth | 61.90 (61.10, 62.50) |
table | 57.00 (56.00, 59.00) |
price | 2,304 (973, 5,678) |
x | 5.66 (4.72, 6.64) |
y | 5.68 (4.72, 6.62) |
z | 3.51 (2.92, 4.09) |
1 Median (IQR); n (%) |
Make a scatterplot with ggplot
with the following
aesthetics: x-axis = carat
; y-axis = price
;
color = depth
and add a meaningful main title to the plot,
as well as axis lables. What do you conclude?
p <- ggplot(my_diamonds, aes(x=carat, y=price, color=depth)) +
geom_point() +
labs(x = "carat", y = "price", title = "Price of diamonds related to their carat and depth values")
p
We start with a simple regression predicting the price
of diamonds from the numeric predictor carat
.
price
against
carat
, and add a linear regression line (set the size of
the points to 0.5).ggplot(my_diamonds, aes(x = carat, y = price)) +
geom_point(size = 0.5) +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
carat
on
price
.carat
is positively associated with
price
.fit <- lm(price ~ carat, my_diamonds)
fit %>% summary()
##
## Call:
## lm(formula = price ~ carat, data = my_diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4097.8 -858.6 -73.8 534.8 7458.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2307.6 136.6 -16.90 <2e-16 ***
## carat 7910.2 145.0 54.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1577 on 498 degrees of freedom
## Multiple R-squared: 0.8567, Adjusted R-squared: 0.8564
## F-statistic: 2978 on 1 and 498 DF, p-value: < 2.2e-16
anova(lm(price ~ 1, my_diamonds), fit)
## Analysis of Variance Table
##
## Model 1: price ~ 1
## Model 2: price ~ carat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 499 8643336276
## 2 498 1238309642 1 7.405e+09 2978 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2))
plot(fit)
We now predict the price
of diamonds from the
categorical predictors cut
. The variable cut
in this data set has class ordered
(the class for ordinal
variables), but, for this practical, we want it to have class
factor
(the class for categorical variables).
price
for the different levels of
cut
with ggplot
. Interpret the results.my_diamonds %>%
ggplot(aes(price, cut)) +
geom_boxplot()
cut
predicting price
.fit <- lm(price ~ cut, my_diamonds)
fit %>% summary()
##
## Call:
## lm(formula = price ~ cut, data = my_diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5132 -2599 -1478 1668 15326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6266.2 1176.9 5.324 1.54e-07 ***
## cutGood -1738.5 1338.1 -1.299 0.1945
## cutVery Good -2367.0 1234.3 -1.918 0.0557 .
## cutPremium -941.4 1233.0 -0.764 0.4455
## cutIdeal -3060.9 1211.0 -2.528 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4077 on 495 degrees of freedom
## Multiple R-squared: 0.04815, Adjusted R-squared: 0.04045
## F-statistic: 6.259 on 4 and 495 DF, p-value: 6.416e-05
cut
?par(mfrow = c(2, 2))
plot(fit)
carat
and
cut
as predictors of price
, and display and
interpret the summary of the regression results.fit_main <- lm(price ~ carat + cut, my_diamonds)
summary(fit_main)
##
## Call:
## lm(formula = price ~ carat + cut, data = my_diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4189.6 -837.0 -49.8 538.7 7343.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4121.1 489.0 -8.427 3.91e-16 ***
## carat 8036.6 149.2 53.857 < 2e-16 ***
## cutGood 1531.5 514.6 2.976 0.003060 **
## cutVery Good 1650.6 477.2 3.459 0.000589 ***
## cutPremium 1713.1 473.4 3.619 0.000326 ***
## cutIdeal 1883.2 471.5 3.994 7.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1557 on 494 degrees of freedom
## Multiple R-squared: 0.8615, Adjusted R-squared: 0.8601
## F-statistic: 614.5 on 5 and 494 DF, p-value: < 2.2e-16
tbl_regression()
(part of the gtsummary
package) using the following code:t1 <-
lm(price ~ carat + cut, my_diamonds) %>%
tbl_regression()
t1
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
carat | 8,037 | 7,743, 8,330 | <0.001 |
cut | |||
Fair | — | — | |
Good | 1,531 | 521, 2,542 | 0.003 |
Very Good | 1,651 | 713, 2,588 | <0.001 |
Premium | 1,713 | 783, 2,643 | <0.001 |
Ideal | 1,883 | 957, 2,809 | <0.001 |
1 CI = Confidence Interval |
depth
effect was significant in the simple regression model and not in the
multiple regression model.carat
and cut
included, and display and interpret the model
summary.fit_interact <- lm(price ~ carat * cut, my_diamonds)
summary(fit_interact)
##
## Call:
## lm(formula = price ~ carat * cut, data = my_diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4401.7 -807.9 -55.8 549.5 7063.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3633.21 1131.22 -3.212 0.00141 **
## carat 7659.10 803.09 9.537 < 2e-16 ***
## cutGood 1037.11 1233.45 0.841 0.40086
## cutVery Good 1408.20 1166.56 1.207 0.22796
## cutPremium 859.38 1168.87 0.735 0.46256
## cutIdeal 1464.59 1149.97 1.274 0.20341
## carat:cutGood 384.82 936.86 0.411 0.68144
## carat:cutVery Good 67.75 861.41 0.079 0.93735
## carat:cutPremium 757.69 846.87 0.895 0.37139
## carat:cutIdeal 275.17 843.95 0.326 0.74453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1558 on 490 degrees of freedom
## Multiple R-squared: 0.8624, Adjusted R-squared: 0.8599
## F-statistic: 341.3 on 9 and 490 DF, p-value: < 2.2e-16
t2 <-
lm(price ~ carat * cut, my_diamonds) %>%
tbl_regression()
t2
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
carat | 7,659 | 6,081, 9,237 | <0.001 |
cut | |||
Fair | — | — | |
Good | 1,037 | -1,386, 3,461 | 0.4 |
Very Good | 1,408 | -884, 3,700 | 0.2 |
Premium | 859 | -1,437, 3,156 | 0.5 |
Ideal | 1,465 | -795, 3,724 | 0.2 |
carat * cut | |||
carat * Good | 385 | -1,456, 2,226 | 0.7 |
carat * Very Good | 68 | -1,625, 1,760 | >0.9 |
carat * Premium | 758 | -906, 2,422 | 0.4 |
carat * Ideal | 275 | -1,383, 1,933 | 0.7 |
1 CI = Confidence Interval |
carat
is significant.anova
function.anova(fit_main, fit_interact)
## Analysis of Variance Table
##
## Model 1: price ~ carat + cut
## Model 2: price ~ carat * cut
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 494 1197262713
## 2 490 1189093338 4 8169375 0.8416 0.4992
Make a presentation-ready regression table of both models: the main effect model and the interaction effect model with the following code:
# presentation ready regression table
tbl_merge(
list(t1, t2),
tab_spanner = c("**Main effects model**", "**Interaction model**")
)
Characteristic | Main effects model | Interaction model | ||||
---|---|---|---|---|---|---|
Beta | 95% CI1 | p-value | Beta | 95% CI1 | p-value | |
carat | 8,037 | 7,743, 8,330 | <0.001 | 7,659 | 6,081, 9,237 | <0.001 |
cut | ||||||
Fair | — | — | — | — | ||
Good | 1,531 | 521, 2,542 | 0.003 | 1,037 | -1,386, 3,461 | 0.4 |
Very Good | 1,651 | 713, 2,588 | <0.001 | 1,408 | -884, 3,700 | 0.2 |
Premium | 1,713 | 783, 2,643 | <0.001 | 859 | -1,437, 3,156 | 0.5 |
Ideal | 1,883 | 957, 2,809 | <0.001 | 1,465 | -795, 3,724 | 0.2 |
carat * cut | ||||||
carat * Good | 385 | -1,456, 2,226 | 0.7 | |||
carat * Very Good | 68 | -1,625, 1,760 | >0.9 | |||
carat * Premium | 758 | -906, 2,422 | 0.4 | |||
carat * Ideal | 275 | -1,383, 1,933 | 0.7 | |||
1 CI = Confidence Interval |
price
on the x-axis,
carat
on the y-axis, with different colors for the points
and linear regression lines for the levels of cut
. Suppress
the confidence bands. What does this plot tell you about the main
effects and the interaction effect of carat
and
cut
?ggplot(my_diamonds, aes(carat, price, col = cut))+
geom_point() +
geom_smooth(method = "lm", se = F)
## `geom_smooth()` using formula = 'y ~ x'
End of practical