Answers for this lab must be submitted on Moodle before April 8th at 5pm.
Tip: In RMarkdown you can add the argument
cache = TRUE
to a block of code ({r, cache = TRUE}
) to save the result of the block. In this case, as long as the code remains the same, the calculation is not repeated every time the RMarkdown document is compiled. This function is especially useful for time-consuming operations, such as fitting a Bayesian model withbrm
.
We will use the gapminder dataset presented during the exercises on robust regression (lab 4). This data frame includes life expectancy (lifeExp), population (pop) and GDP per capita (gdpPercap) for 142 countries and 12 years (every 5 years between 1952 and 2007).
library(gapminder)
data(gapminder)
str(gapminder)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
As in lab 4, we first transform the predictors:
gdp_norm is the logarithm of gdpPercap, scaled to have a mean of 0 and a standard deviation of 1.
dyear is the number of years since 1952.
library(dplyr)
gapminder <- mutate(gapminder, gdp_norm = scale(log(gdpPercap)),
dyear = year - 1952)
In lab 4, we first performed a linear regression of
lifeExp
as a function of gdp_norm
,
dyear
and their interaction. For this first part, we will
estimate these same effects in a Bayesian context, by adding random
effects of the country on the intercept and the coefficients of
gdp_norm
and dyear
.
Notes:
The model formula in brm
follows the same syntax as
lmer
for the specification of fixed and random
effects.
Although it would be possible to add a random effect of country
on the gdp_norm:dyear
interaction, we omit it here in order
to reduce model computation time.
class = "sd"
), while the last line refers to the
standard deviation of the individual observations
(class = "sigma"
).gap_prior <- c(set_prior("", class = "Intercept"),
set_prior("", class = "b", coef = "gdp_norm"),
set_prior("", class = "b", coef = "dyear"),
set_prior("", class = "b", coef = "gdp_norm:dyear"),
set_prior("", class = "sd", coef = "Intercept", group = "country"),
set_prior("", class = "sd", coef = "gdp_norm", group = "country"),
set_prior("", class = "sd", coef = "dyear", group = "country"),
set_prior("", class = "sigma"))
It is recommended to choose normal distributions in all cases. For
“sigma” and “sd”, these distributions will be interpreted as half-normal
because it is implied that these parameters are \(\geq 0\). To choose the \(\mu\) and \(\sigma\) values for each normal
distribution, consider the interpretation of each parameter and in
particular the scales of the predictors gdp_norm
and
dyear
.
For the effect of the interaction, we can assume that it is not
stronger than the main effects of the two predictors, so
gdp_norm:year
can take the same prior distribution as the
smallest assumed effect between gdp_norm
and
year
.
As for the standard deviations of the random effects (“sd”), their prior distribution can have the same width as that of the corresponding coefficient “b”.
brm
. I suggest specifying
chains = 1, iter = 1500, warmup = 1000
to produce a single
Markov chain with 1000 warmup iterations and 500 sampling iterations.
Then visualize the predicted distribution of lifeExp
for
each iteration of the prior parameters.Due to the large number of estimated effects and the fact that we
impose only weak constraints on each prior distribution, extreme or even
impossible values (large positive and negative values) are to be
expected; the important thing is that the density is greater within a
realistic range of values. It may be useful to “zoom” into a part of the
ggplot
graph by adding
coord_cartesian(xlim = c(..., ...), ylim = c(..., ...))
with limits in \(x\) and \(y\).
Now fit the model with brm
. You can reduce the
number of Markov chains to 2 to save time, but keep the default values
for the number of iterations. (You can ignore the warning that the
effective sample size or ESS is small.) How can you assess the
convergence of the model?
Compare the magnitude of the coefficients of
gdp_norm
and dyear
to the magnitude of the
random effects of country on those coefficients. What do you learn from
that comparison?
In lab 4, we saw that a robust regression was preferable for this dataset. In order to allow more extreme residuals in a Bayesian context, we will replace the normal distribution for the residuals by a Student \(t\) distribution.
family = student
in brm
. This argument
indicates that the residuals normalized by sigma
, \((y - \hat{y})/\sigma\), follow a \(t\) distribution with \(\nu\) degrees of freedom.Keep the same prior distributions for all parameters of the model.
Let brm
choose a prior distribution for \(\nu\) (nu
). By calling the
prior_summary
function from the fitted model, can you
determine what this prior distribution is by default?
Describe the main differences between the parameter estimates from this model, compared with those of the model in part 1.
Apply predict
to the two models to obtain the mean,
standard deviation and 95% credibility interval for the posterior
prediction for each point in the data frame and attach these predictions
to the original data set with cbind
. Select a few countries
from the dataset and illustrate the observations, predictions of the two
models and their credibility intervals.