Model selection

1. Root biomass as a function of the environment

The environment.csv dataset (from Beckerman and Petchey’s textbook, Getting Started with R: An Introduction for Biologists) contains measurements of root biomass (in g/m\(^2\)) at 10 sites, along with the altitude (in m), temperature (in degrees C) and rainfall (in m) of each site.

enviro <- read.csv("environment.csv")
str(enviro)

## 'data.frame':    10 obs. of  5 variables:
##  $ site       : int  1 2 3 4 5 6 7 8 9 10
##  $ altitude   : int  13 160 100 205 45 84 349 509 399 30
##  $ temperature: int  24 18 17 15 20 21 14 11 13 19
##  $ rainfall   : num  0.01 0.5 0.6 1.1 0.09 0.2 1.2 0.6 0.8 0.5
##  $ biomass    : int  20 120 110 200 45 70 150 275 220 38

(a) Estimate the parameters of the model including the three predictors: biomass ~ altitude + temperature + rainfall. Does the inclusion of the three predictors in the same model cause problems? Justify your answer.
(b) Propose several alternative models for this dataset, including the null model (no predictors) and models with 1 or 2 predictors (without interactions). Avoid using highly correlated predictors in the same model. Create a table comparing these models according to their AICc.
(c) What is the best model for predicting root biomass at a new site similar to those sampled? Would it be useful to make average predictions from several models here? Justify your answer.
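The following is a minimal sketch of one way to approach these questions in base R, not a reference solution. It assumes the data frame can be transcribed from the str() output above (so it runs without the CSV file), computes AICc by hand instead of relying on a package, and uses an illustrative candidate set; the names `preds`, `vif`, `aicc` and `aicc_tab` are introduced here and do not come from the exercise.

```r
# Data transcribed from the str() output above (assumption: values are complete)
enviro <- data.frame(
  site        = 1:10,
  altitude    = c(13, 160, 100, 205, 45, 84, 349, 509, 399, 30),
  temperature = c(24, 18, 17, 15, 20, 21, 14, 11, 13, 19),
  rainfall    = c(0.01, 0.5, 0.6, 1.1, 0.09, 0.2, 1.2, 0.6, 0.8, 0.5),
  biomass     = c(20, 120, 110, 200, 45, 70, 150, 275, 220, 38)
)
preds <- c("altitude", "temperature", "rainfall")

# Pairwise correlations: large absolute values point to collinearity
pred_cor <- cor(enviro[, preds])
print(round(pred_cor, 2))

# Variance inflation factor of each predictor, computed by hand:
# regress it on the other predictors, then VIF = 1 / (1 - R^2)
vif <- sapply(preds, function(p) {
  f <- reformulate(setdiff(preds, p), response = p)
  1 / (1 - summary(lm(f, data = enviro))$r.squared)
})
print(round(vif, 1))

# Full model with the three predictors
full <- lm(biomass ~ altitude + temperature + rainfall, data = enviro)
print(summary(full)$coefficients)

# AICc = AIC + small-sample correction; K counts all estimated
# parameters (coefficients plus residual variance)
aicc <- function(mod) {
  K <- attr(logLik(mod), "df")
  n <- nobs(mod)
  AIC(mod) + 2 * K * (K + 1) / (n - K - 1)
}

# Illustrative candidate set: null model and single-predictor models
mods <- list(
  null = lm(biomass ~ 1, data = enviro),
  alt  = lm(biomass ~ altitude, data = enviro),
  temp = lm(biomass ~ temperature, data = enviro),
  rain = lm(biomass ~ rainfall, data = enviro)
)

aicc_tab <- data.frame(
  model = names(mods),
  K     = sapply(mods, function(m) attr(logLik(m), "df")),
  AICc  = sapply(mods, aicc),
  row.names = NULL
)
aicc_tab$delta  <- aicc_tab$AICc - min(aicc_tab$AICc)
aicc_tab$weight <- exp(-aicc_tab$delta / 2) / sum(exp(-aicc_tab$delta / 2))
aicc_tab <- aicc_tab[order(aicc_tab$AICc), ]
print(aicc_tab)
```

Two-predictor models can be added to `mods` in the same way once the correlation matrix shows which pairs of predictors are not strongly correlated.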

2. Predictions of the migration of bird species

The file migration.csv contains data from Rubolini et al. (2005) on 28 bird species that migrate between Europe and Africa.

migr <- read.csv("migration.csv")
str(migr)

## 'data.frame':    28 obs. of  14 variables:
##  $ speciesID : int  1 3 4 5 7 8 9 11 12 13 ...
##  $ species1  : chr  "Acrocephalus" "Acrocephalus" "Anthus" "Anthus" ...
##  $ species2  : chr  "arundinaceus" "scirpaceus" "campestris" "trivialis" ...
##  $ migDate   : num  33 38 32 27 35 30 31 30.8 30 28 ...
##  $ latBreed  : num  46 48 43.5 55.3 47.5 50.3 51 51.5 48.8 59 ...
##  $ latWntr   : num  -10.3 0 6 -10 -7.5 18.5 -15 7.5 -10 7.5 ...
##  $ sexDchrmt : num  0 0 0 0 4.3 2 2.3 7 17.3 16 ...
##  $ nestSite  : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ moult     : int  1 1 0 0 1 0 1 0 0 0 ...
##  $ mWngLn    : num  96.8 66.8 91.6 88.7 192.1 ...
##  $ fWngLn    : num  92.3 66 86.9 84.7 194.3 ...
##  $ numSpecies: int  641 546 140 3531 269 104 166 101 737 12837 ...
##  $ X         : num  -10.3 0 6 -10 -7.5 18.5 -15 7.5 -10 7.5 ...
##  $ Y         : num  33 38 32 27 35 30 31 30.8 30 28 ...

We are looking to predict the date of arrival in Europe (migDate, measured in days from April 1st) based on the following predictors:

Latitude of the breeding site in Europe (latBreed)
Latitude of the wintering site in Africa (latWntr). Note: Latitude is positive if north of the equator, negative if south.
Whether the species nests in existing cavities (nestSite, 0 = no, 1 = yes)
Whether the species moults at the wintering site (moult, 0 = no, 1 = yes)

In theory, birds are expected to arrive later if their breeding site is further north (due to climate and distance) and if they moult at the wintering site. Birds are expected to arrive earlier if their wintering grounds are at a higher latitude in Africa (less distance to travel) and if they nest in existing cavities.

(a) Check the fit of the complete linear model including the 4 predictors. Interpret the values obtained for each of the coefficients of these predictors (but not the intercept). Are these results consistent with those expected in theory?
(b) Using AICc, compare models including each of the following combinations of the 4 predictors:

latBreed
latWntr
latBreed + latWntr
latBreed + nestSite
latWntr + nestSite
latBreed + latWntr + nestSite
latBreed + nestSite + moult
latWntr + nestSite + moult
latBreed + latWntr + nestSite + moult (complete model)

How many models have a \(\Delta AICc \le 2\)? According to the Akaike weights, what is the probability that the best model is among those?
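For reference, the quantities involved in this question are usually defined as follows. These are the standard textbook definitions and are not taken from the exercise itself; the symbols are assumptions of this note: \(L_i\) is the maximized likelihood of model \(i\), \(K_i\) its number of estimated parameters (for a linear model, the coefficients plus the residual variance) and \(n\) the number of observations.

\[ AICc_i = -2 \log L_i + 2 K_i + \frac{2 K_i (K_i + 1)}{n - K_i - 1} \]

\[ \Delta_i = AICc_i - \min_j AICc_j \]

\[ w_i = \frac{\exp(-\Delta_i / 2)}{\sum_j \exp(-\Delta_j / 2)} \]

Under this reading, the probability that the best model is among a subset of the candidates is the sum of the weights \(w_i\) over that subset.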

(c) Load the dataset migr_test.csv, which contains data on 10 other species from the Rubolini et al. (2005) study.

migr_test <- read.csv("migr_test.csv")
str(migr_test)

## 'data.frame':    10 obs. of  14 variables:
##  $ speciesID : int  2 6 10 14 18 22 26 30 34 38
##  $ species1  : chr  "Acrocephalus" "Calandrella" "Delichon" "Hippolais" ...
##  $ species2  : chr  "schoenobaenus" "brachydactyla" "urbica" "icterina" ...
##  $ migDate   : num  35 27.5 29 39 31.2 28 35 27 22 22
##  $ latBreed  : num  57.5 39.5 48.5 56 54.5 49 45.5 56.5 48 44
##  $ latWntr   : num  -7.5 15.5 -15 -19 13 -7.5 -12 -9 11 16
##  $ sexDchrmt : num  0 0 0 0 0 9 19.3 0 5.7 2.3
##  $ nestSite  : int  0 0 0 0 0 0 0 0 0 1
##  $ moult     : int  1 0 1 1 1 0 1 1 0 1
##  $ mWngLn    : num  67.2 93.4 111.1 78.9 64.6 ...
##  $ fWngLn    : num  64.7 89.8 110 78 63.6 ...
##  $ numSpecies: int  2524 138 1624 10297 63 1163 1525 24767 2658 410
##  $ X         : num  -7.5 15.5 -15 -19 13 -7.5 -12 -9 11 16
##  $ Y         : num  35 27.5 29 39 31.2 28 35 27 22 22

Calculate the mean squared prediction error, i.e. the mean of (observation - prediction)\(^2\), for these 10 new observations according to (i) the best model identified in (b) and (ii) the weighted average prediction of all models.

Tip: To obtain a vector of the average predictions, choose the mod.avg.pred component of the object produced by the modavgPred function.
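The comparison between the two kinds of prediction can be sketched in base R as below. Because the migration files are not reproduced here, the sketch runs on simulated stand-in data: the variables `x1`, `x2`, `y`, the helper functions `sim` and `aicc`, and the three candidate models are all hypothetical. The weighted average is computed by hand from the Akaike weights; for linear models it should correspond to the mod.avg.pred vector mentioned in the tip, though that equivalence is assumed here rather than checked.

```r
set.seed(42)  # simulated stand-in data, NOT the migration dataset

# Hypothetical data generator with two predictors
sim <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  data.frame(x1 = x1, x2 = x2, y = 30 + 2 * x1 - x2 + rnorm(n, sd = 3))
}
train <- sim(28)  # plays the role of the training data
test  <- sim(10)  # plays the role of the new observations

# Illustrative candidate set
mods <- list(
  m1  = lm(y ~ x1, data = train),
  m2  = lm(y ~ x2, data = train),
  m12 = lm(y ~ x1 + x2, data = train)
)

# AICc by hand; K counts coefficients plus residual variance
aicc <- function(mod) {
  K <- attr(logLik(mod), "df")
  n <- nobs(mod)
  AIC(mod) + 2 * K * (K + 1) / (n - K - 1)
}
aicc_vals <- sapply(mods, aicc)
delta <- aicc_vals - min(aicc_vals)
w <- exp(-delta / 2) / sum(exp(-delta / 2))  # Akaike weights

# Matrix of test-set predictions, one column per model
preds <- sapply(mods, predict, newdata = test)

# (i) Mean squared error of the single best (lowest AICc) model
best <- names(which.min(aicc_vals))
mse_best <- mean((test$y - preds[, best])^2)

# (ii) Mean squared error of the Akaike-weighted average prediction
pred_avg <- as.vector(preds %*% w)
mse_avg <- mean((test$y - pred_avg)^2)

print(c(best = mse_best, averaged = mse_avg))
```

With the real data, `train` and `test` would be replaced by migr and migr_test, and the candidate set by the models compared with AICc earlier in the exercise.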


October 27, 2021
