This workshop provides an overview of tools available in R for the analysis of geolocated data. This type of data is becoming increasingly common in various fields (e.g. aerial photos, satellite imagery, census data, geolocated posts on social networks, etc.). There are two broad categories of geospatial data:

  • raster data represent variables defined at every point of a grid covering the full extent of the data (as in satellite images);
  • vector data associate variables (sometimes also called attributes) to discrete geometrical objects located in space (such as the position of cities and highways on a road map).

At first, using programming commands to process geographical data can seem less intuitive, compared with the graphical user interface of GIS software. Here are a few advantages of scripted analyses:

  • It is easy to repeat the analysis for new data by re-running the script.
  • It is easy for other researchers to reproduce the methods if they have access to the same programming language.
  • When using R specifically, the spatial data can be extracted and merged with other datasets for statistical analyses in a single programming environment.

Objectives

  • Become familiar with the main packages for processing and simple visualization of vector and raster data in R (the sf and stars packages, respectively).
  • Perform common data transformation operations using functions from those packages.
  • Create more complex static maps (with ggplot2) and interactive maps (with mapview).

Note on packages

The set of packages available for spatial analysis in R has evolved rapidly. A few years ago, the sp and raster packages were the main tools for vector and raster data processing, respectively. sf and stars are part of a recent initiative to overhaul R spatial tools (https://www.r-spatial.org/). The sf package represents spatial data frames with a standard format based on open-source geodatabases, and integrates well with popular R packages for data manipulation and visualization (such as dplyr and ggplot2). The stars package is newer but already shows a few improvements over raster. A previous version of this workshop used raster instead of stars.

Explore a vector dataset

All datasets used in this workshop can be found in the data folder. The mrc dataset contains information on Québec regional county municipalities (MRCs) in a ESRI shapefile format. Note that a single shapefile dataset is spread across multiple files, which share a name but differ in their file extension (mrc.dbf, mrc.prj, mrc.shp and mrc.shx).

To load this dataset in R, we call the st_read function from sf. (All sf package functions start with the prefix st_, standing for spatiotemporal.) The first argument to st_read is a path to the data; here it is sufficient to specify the .shp file name. The argument stringsAsFactors = FALSE prevents R from converting character variables to factors in the resulting data frame.

library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

mrc <- st_read("data/mrc.shp", stringsAsFactors = FALSE)
## Reading layer `mrc' from data source `Z:\cours\rgeo-csee2019\data\mrc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 104 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -79.7625 ymin: 44.99136 xmax: -56.93495 ymax: 62.58217
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs

The output text indicates the main properties of the loaded dataset, including the geometric type (MULTIPOLYGON), the spatial extent of the data (bbox) and the coordinate reference system (CRS) in use. That CRS is described in two formats: a EPSG code and a proj4string character string. We will discuss coordinate systems in the next section. For now, we note that “+proj=longlat” in the proj4string means that the data is in longitude and latitude coordinates (with units of decimal degrees).

The bounding box and CRS can be accessed separately using the st_bbox and st_crs functions:

st_bbox(mrc)
##      xmin      ymin      xmax      ymax 
## -79.76250  44.99136 -56.93495  62.58217
st_crs(mrc)
## Coordinate Reference System:
##   EPSG: 4269 
##   proj4string: "+proj=longlat +datum=NAD83 +no_defs"

Let us look at the first few rows of the data:

class(mrc)
## [1] "sf"         "data.frame"
head(mrc)
## Simple feature collection with 6 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -79.51776 ymin: 45.44911 xmax: -63.31941 ymax: 62.58217
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##                           mrc_name reg_id              reg_name pop2016
## 1                          Abitibi     08 Abitibi-Temiscamingue   24707
## 2                    Abitibi-Ouest     08 Abitibi-Temiscamingue   20582
## 3                            Acton     16            Monteregie   15623
## 4 Administration regionale Kativik     10        Nord-du-Quebec   13343
## 5                  Antoine-Labelle     15           Laurentides   35410
## 6                       Argenteuil     15           Laurentides   32477
##                         geometry
## 1 MULTIPOLYGON (((-78.64359 4...
## 2 MULTIPOLYGON (((-79.51776 4...
## 3 MULTIPOLYGON (((-72.70415 4...
## 4 MULTIPOLYGON (((-77.82293 5...
## 5 MULTIPOLYGON (((-75.52133 4...
## 6 MULTIPOLYGON (((-74.71004 4...

An sf object is a specialized data.frame where each line contains data associated with a geometry element (or simple feature, hence the package name), which is described in the geometry column. The most common feature types are:

  • POINT: Coordinates (x, y) of a point.
  • LINESTRING: Sequence of points connected by length segments.
  • POLYGON: Sequence of points creating a closed simple polygon.
  • MULTIPOINT, MULTILINESTRING ou MULTIPOLYGON: Dataset where each feature can be composed of multiple points, linestrings or polygons.

The plot function applied to an sf object creates a map for each field in the dataset.

plot(mrc)

To plot a single variable, you need to select the corresponding column. To show a map without data variables, you can select the geometry column. The axes = TRUE parameter tells R to show coordinate axes.

plot(mrc["geometry"], axes = TRUE)

You can select a subset of rows or columns of an sf object just like a regular data frame.

# Select the 5th row
mrc[5, ]
## Simple feature collection with 1 feature and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -76.14932 ymin: 45.95376 xmax: -74.34359 ymax: 47.76333
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##          mrc_name reg_id    reg_name pop2016
## 5 Antoine-Labelle     15 Laurentides   35410
##                         geometry
## 5 MULTIPOLYGON (((-75.52133 4...
# Select the MRC name and population columns for MRCs with a population over 200,000
mrc[mrc$pop2016 > 200000, c("mrc_name", "pop2016")] 
## Simple feature collection with 5 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -75.90834 ymin: 45.37075 xmax: -71.13162 ymax: 46.98186
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##     mrc_name pop2016                       geometry
## 23  Gatineau  278198 MULTIPOLYGON (((-75.89532 4...
## 47     Laval  425225 MULTIPOLYGON (((-73.74488 4...
## 69 Longueuil  417497 MULTIPOLYGON (((-73.44474 4...
## 82  Montreal 1959836 MULTIPOLYGON (((-73.49801 4...
## 89    Quebec  572755 MULTIPOLYGON (((-71.55293 4...

Note that the column containing the spatial information (geometry) is always retained, even if not explicitly selected. To discard that column and convert the sf object to a regular (non-spatial) data frame, you can use the function st_drop_geometry.

Exercise 1

Select the MRCs in the Bas-St-Laurent (reg_id: 01) and Gaspesie (reg_id: 11) regions, then display their 2016 population on a map.
Hint: The operator %in% can check if a variable has one value within a set, for example x %in% c(1, 3) returns TRUE if x is equal to 1 or 3.

Solution

Integration with the dplyr package

The data manipulation functions from the dplyr package work on sf objects as well. For example, we could rewrite the example above (select the name and population of MRCs with populations over 200,000) using filter and select.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

filter(mrc, pop2016 > 200000) %>%
    select(mrc_name, pop2016)
## Simple feature collection with 5 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -75.90834 ymin: 45.37075 xmax: -71.13162 ymax: 46.98186
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##    mrc_name pop2016                       geometry
## 1  Gatineau  278198 MULTIPOLYGON (((-75.89532 4...
## 2     Laval  425225 MULTIPOLYGON (((-73.74488 4...
## 3 Longueuil  417497 MULTIPOLYGON (((-73.44474 4...
## 4  Montreal 1959836 MULTIPOLYGON (((-73.49801 4...
## 5    Quebec  572755 MULTIPOLYGON (((-71.55293 4...

When performing a grouped summary, the individual features are also aggregated in a single feature by group. For example, let us aggregate the MRCs and their population by region.

regions <- group_by(mrc, reg_name) %>%
    summarize(pop2016 = sum(pop2016))
head(regions)
## Simple feature collection with 6 features and 2 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: -79.57933 ymin: 45.58943 xmax: -56.93495 ymax: 55.00006
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 3
##   reg_name        pop2016                                          geometry
##   <chr>             <dbl>                                   <GEOMETRY [°]>
## 1 Abitibi-Temisc~  147282 POLYGON ((-77.93147 47.27012, -77.93155 47.26624~
## 2 Bas-Saint-Laur~  197806 POLYGON ((-68.38319 47.88014, -68.38314 47.83307~
## 3 Capitale-Natio~  733898 POLYGON ((-70.69151 47.03311, -70.69453 47.02324~
## 4 Centre-du-Queb~  243354 POLYGON ((-72.09283 45.79699, -72.11938 45.77338~
## 5 Chaudiere-Appa~  421993 POLYGON ((-70.2784 46.05675, -70.2784 46.05608, ~
## 6 Cote-Nord         92712 MULTIPOLYGON (((-66.25978 55.00001, -66.25001 55~
plot(regions["pop2016"])