This assignment must be submitted before February 18th at 5pm on Moodle.
This exercise is based on the gapminder dataset from the package of the same name.
Jennifer Bryan (2017). gapminder: Data from Gapminder. R package version 0.3.0. https://CRAN.R-project.org/package=gapminder
This dataset includes the life expectancy (lifeExp), population (pop) and GDP per capita (gdpPercap) for 142 countries and 12 years (every 5 years between 1952 and 2007).
library(gapminder)
str(gapminder)
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
... + facet_wrap(~year)
.What general trends do you observe? Are there extreme values that could strongly influence a regression model? If so, try to identify these data in the table based on the position of the points in the graph.
lm
) to determine the effect of GDP per capita, year and their interaction on life expectancy. To help interpret the coefficients, perform the following transformations on the predictors:Take the logarithm of gdpPercap and standardize it with the function scale
. Reminder: scale(x)
subtracts each value of x
from its mean and divides by its standard deviation, so the resulting variable has a mean of 0 and a standard deviation of 1; it represents the number of standard deviations above or below the mean.
Replace year with the number of years since 1952.
Interpret the meaning of each of the coefficients in the model and then refer to the diagnostic graphs. Are the assumptions of the linear model met?
lmrob
from the robustbase package) and median regression (function rq
from the quantreg package, choosing only the median quantile). Explain how the estimates and standard errors of the coefficients differ between the three methods.Note: Use the showAlgo = FALSE
option when applying the summary
function to the output of lmrob
, to simplify the summary.
ggplot
you can use the geom_smooth
function with method = "lm"
for linear regression and method = "lmrob"
for robust regression. For median regression you can use geom_quantile
as seen in the notes.Based on your observation of the data in 1(a), would it be useful to model different quantiles of life expectancy based on the predictors? Justify your answer.
Perform a quantile regression with the same predictors as in 1(b), with the following quantiles: (0.1, 0.25, 0.5, 0.75, 0.9)
. Use the plot
function on the quantile regression summary and describe how the effect of the predictors varies between quantiles.
Superimpose the quantile regression lines on the graph of the data. Do the trends for each quantile appear to be affected by extreme values?
While this dataset is useful for illustrating the concepts of robust regression and quantile regression, it should be noted that this type of statistical analysis comparing variables measured at the national level has several limitations:
It cannot be assumed that the associations detected apply at a smaller scale (e.g., the relationship between life expectancy and income when comparing national averages is not necessarily the same as the relationship between life expectancy and income at the level of individuals living in each country).
Averages calculated in different countries are not independent observations, because environmental, social and economic conditions are correlated between nearby countries.
There are many factors that differentiate countries, so it is difficult to interpret an association as a causal link.
Many articles, particularly in the social sciences, have been published on the methods to be used to make this type of cross-country comparisons.