Techniques we’ll cover in this part:

  • how geospatial datasets ‘look’ in R and how they work
  • methods for plotting static maps (of polygon geometries)
  • how to resize/simplify polygon geometries
  • how to calculate the area of polygon geometries
  • how to calculate overlap between polygon geometries
  • a brief introduction to coordinate reference systems

In this section we’ll cover some of the fundamental concepts and operations when working with geospatial data in R. For now, we’ll be using non-linguistic data as examples, looking mainly at things like population statistics and other random things.

1 Working with spatial datasets

The sf package, which stands for ‘simple features’, is used for lots of spatial operations in R and is now the recommended way of performing GIS in R. The spData package contains lots of pre-built datasets for spatial analysis. We’ll also load in the tidyverse set of packages, which we use for general data processing/plotting in R. Remember to install all of these packages first if you haven’t already:

install.packages('sf')
install.packages('spData')
install.packages('tidyverse')

Now load them in:

library(sf)
library(spData)  
library(tidyverse)

Let’s look at the us_states dataset that comes with the spData package - use head() to print out the first 6 rows:

head(us_states)
## Simple feature collection with 6 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -114.8136 ymin: 24.55868 xmax: -71.78699 ymax: 42.04964
## Geodetic CRS:  NAD83
##   GEOID        NAME   REGION             AREA total_pop_10 total_pop_15
## 1    01     Alabama    South 133709.27 [km^2]      4712651      4830620
## 2    04     Arizona     West 295281.25 [km^2]      6246816      6641928
## 3    08    Colorado     West 269573.06 [km^2]      4887061      5278906
## 4    09 Connecticut Norteast  12976.59 [km^2]      3545837      3593222
## 5    12     Florida    South 151052.01 [km^2]     18511620     19645772
## 6    13     Georgia    South 152725.21 [km^2]      9468815     10006693
##                         geometry
## 1 MULTIPOLYGON (((-88.20006 3...
## 2 MULTIPOLYGON (((-114.7196 3...
## 3 MULTIPOLYGON (((-109.0501 4...
## 4 MULTIPOLYGON (((-73.48731 4...
## 5 MULTIPOLYGON (((-81.81169 2...
## 6 MULTIPOLYGON (((-85.60516 3...

Each row is a different US state, and it contains lots of different columns:

  • GEOID and NAME: unique identifiers/names for each state
  • REGION: groups the states into 4 categories (Norteast [sic], Midwest, West, South)
  • AREA: a measure of state area in km2
  • total_pop_10 and total_pop_15: population size in 2010 and 2015, respectively
  • geometry: this is where all the spatial ‘stuff’ is kept - more on this later

What kind of object type is this in R?

class(us_states)
## [1] "sf"         "data.frame"

As you can see, its class is a combination of ‘sf’ and ‘data.frame’, so while it has some special status containing spatial information, you can also treat it like a normal dataframe and apply the same kind of data processing/wrangling operations. For example, you can run the table() function on a column to tabulate how many observations are in each level of that variable

table(us_states$REGION)
## 
## Norteast  Midwest    South     West 
##        9       12       17       11

So what kind of special status does it have as a sf class?

It has things like a CRS (coordinate reference system - more on that later)

It also has a $geometry column, which contains all of the coordinates for the shapes. This is where all the important geographic information lies - it’s what gives points their location, and shapes their… shape.

2 Plotting spatial datasets

We can input the whole dataset into the plot() function, which will plot all the polygons and colour-code them by each column of metadata.

plot(us_states)

If you want to just plot a single set of the geometries with no colour-coding, you can directly reference the $geometry column:

plot(us_states$geometry)

Or you can plot one state/polygon at a time by subsetting the dataframe as you normally would in R. For example, here’s the 21st row/state in the dataset:

plot(us_states$geometry[[21]])

Exercise

Keep running the code below to plot a random US state each time and guess which it is (let’s test our geography knowledge!)

random_state <- us_states %>%
  sample_n(1)

plot(random_state$geometry)

random_state$NAME
## [1] "Florida"

You can also plot these using the regular ggplot syntax. You don’t even need to specify columns for the X/Y values - R is clever and knows that it’s a spatial dataframe with a geometry column for the shape. I like to use the theme_void() option when plotting maps since we don’t really need all the axes and gridlines etc.

ggplot(random_state) +
  geom_sf(fill = 'blue') +
  theme_void()

You can apply all the usual ggplot customisation options. For example, we can try to make a really garish-looking map in hot pink with a thick, lime green, dashed border.

ggplot(random_state) +
  geom_sf(fill = 'hotpink', colour = 'green', lwd = 5, lty = 'dashed') +
  theme_void()

Nice.

You can even do cool things like st_buffer(), which is basically like inflating the polygon like a balloon. Here’s what Florida looks like with st_buffer() set to 30000:

random_state %>%
  st_buffer(30000) %>%
  ggplot() + 
  geom_sf() +
  theme_void()

And here’s it set to 90000:

random_state %>%
  st_buffer(90000) %>%
  ggplot() + 
  geom_sf() +
  theme_void()

This is quite useful when plotting things like rivers, which can be quite thin and hard to see on maps. For example, spData comes with a built-in shapefile for the River Seine in France:

seine %>%
  plot()

Plotting with a buffer:

seine %>%
  st_buffer(5000) %>%
  plot()

Like with any type of dataset in R, you can also colour-code your ggplot by the values in some column of data. Let’s colour-code the states by one of the metadata columns, the total population in 2010:

ggplot(us_states) +
  geom_sf(aes(fill = total_pop_10)) +
  theme_void()

Exercise

It’s time for some interactive exercises now:

  1. Create a new column, which calculates population density in each year using the AREA and total_pop_10 and total_pop_15 columns
  2. Plot a new map colour-coded by population density
  3. Create a ‘population change’ column, which calculates the change in population between 2010 and 2015

Answers

Calculating population density and population change:

us_states <- us_states %>%
  mutate(pop_dens_10 = total_pop_10 / as.numeric(AREA),
         pop_dens_15 = total_pop_15 / as.numeric(AREA),
         pop_dens_change = pop_dens_15 - pop_dens_10)

Plotting by population density:

#plot by population density in 2010
ggplot(us_states) +
  geom_sf(aes(fill = pop_dens_10)) +
  theme_void()

#this isn't very informative because District of Colombia is a massive outlier
us_states %>%
  select(NAME, pop_dens_10) %>%
  arrange(desc(pop_dens_10))
## Simple feature collection with 49 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
## Geodetic CRS:  NAD83
## First 10 features:
##                    NAME pop_dens_10                       geometry
## 1  District of Columbia   3279.2769 MULTIPOLYGON (((-77.11976 3...
## 2            New Jersey    430.1923 MULTIPOLYGON (((-75.55945 3...
## 3          Rhode Island    385.1138 MULTIPOLYGON (((-71.19564 4...
## 4         Massachusetts    309.7496 MULTIPOLYGON (((-70.27553 4...
## 5           Connecticut    273.2488 MULTIPOLYGON (((-73.48731 4...
## 6              Maryland    212.1674 MULTIPOLYGON (((-79.47666 3...
## 7              Delaware    170.0704 MULTIPOLYGON (((-75.7886 39...
## 8              New York    151.1749 MULTIPOLYGON (((-74.0702 40...
## 9               Florida    122.5513 MULTIPOLYGON (((-81.81169 2...
## 10         Pennsylvania    107.5780 MULTIPOLYGON (((-80.51942 4...
#so either filter it out:
us_states %>%
  filter(NAME != 'District of Columbia') %>%
  ggplot() +
  geom_sf(aes(fill =  pop_dens_10))

#or log-transform the values:
ggplot(us_states) +
  geom_sf(aes(fill = log(pop_dens_10))) +
  theme_void()

Plotting by population change:

ggplot(us_states) +
  geom_sf(aes(fill = pop_dens_change)) +
  theme_void()

#again, not very useful because District of Colombia is a massive outlier
us_states %>%
  select(NAME, pop_dens_change) %>%
    arrange(desc(pop_dens_change))
## Simple feature collection with 49 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
## Geodetic CRS:  NAD83
## First 10 features:
##                    NAME pop_dens_change                       geometry
## 1  District of Columbia      353.986830 MULTIPOLYGON (((-77.11976 3...
## 2         Massachusetts       10.926915 MULTIPOLYGON (((-70.27553 4...
## 3            New Jersey        9.018396 MULTIPOLYGON (((-75.55945 3...
## 4              Maryland        8.719782 MULTIPOLYGON (((-79.47666 3...
## 5              Delaware        8.718132 MULTIPOLYGON (((-75.7886 39...
## 6               Florida        7.508354 MULTIPOLYGON (((-81.81169 2...
## 7        North Carolina        4.442798 MULTIPOLYGON (((-83.10023 3...
## 8            California        4.354330 MULTIPOLYGON (((-118.6034 3...
## 9              Virginia        3.936031 MULTIPOLYGON (((-75.66971 3...
## 10          Connecticut        3.651576 MULTIPOLYGON (((-73.48731 4...
#so either filter it out
us_states %>%
  filter(NAME != 'District of Columbia') %>%
  ggplot() +
  geom_sf(aes(fill = pop_dens_change)) +
  theme_void()

#or log-transform the values
us_states %>%
  ggplot() +
  geom_sf(aes(fill = log(pop_dens_change))) +
  theme_void()

Let’s fly over the Atlantic and look at the UK. As well as us_states, the spData package also contains a dataset called world - let’s take a look:

head(world)
## Simple feature collection with 6 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -18.28799 xmax: 180 ymax: 83.23324
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 11
##   iso_a2 name_long  continent region_un subregion type  area_km2     pop lifeExp
##   <chr>  <chr>      <chr>     <chr>     <chr>     <chr>    <dbl>   <dbl>   <dbl>
## 1 FJ     Fiji       Oceania   Oceania   Melanesia Sove…   1.93e4  8.86e5    70.0
## 2 TZ     Tanzania   Africa    Africa    Eastern … Sove…   9.33e5  5.22e7    64.2
## 3 EH     Western S… Africa    Africa    Northern… Inde…   9.63e4 NA         NA  
## 4 CA     Canada     North Am… Americas  Northern… Sove…   1.00e7  3.55e7    82.0
## 5 US     United St… North Am… Americas  Northern… Coun…   9.51e6  3.19e8    78.8
## 6 KZ     Kazakhstan Asia      Asia      Central … Sove…   2.73e6  1.73e7    71.6
## # ℹ 2 more variables: gdpPercap <dbl>, geom <MULTIPOLYGON [°]>

Let’s filter it down to just the UK and plot it - remember we can treat this as a normal dataframe still so we can use the filter() function as we would for any other dataframe.

world %>%
  filter(name_long == 'United Kingdom') %>% 
  ggplot() +
  geom_sf()

It looks like a child has drawn it! Clearly, the set of world polygons is fairly low resolution.

Luckily, I’ve come prepared with my own UK geospatial dataset. We call these shapefiles. We can load in external shapefiles by using the read_sf() function, similar to how you might read in a csv file using read_csv(). I’ve saved this shapefile inside a subfolder called shapefiles within the directory containing this script, and I recommend you do the same.

NOTE on files: to open up a shapefile, you just need its .shp and .shx files, but ideally you want the extra files too: the .dbf file contains info on labels, and the .prj file contains the CRS (again - more on that later). These should all be inside the same directory, with matching filenames, but when loading in the shapefile you only need to refer to the .shp file like so:

uk <- read_sf('shapefiles/uk-outline.shp')

Let’s plot it to see if the quality is better

ggplot(uk) +
  geom_sf()

This isn’t very interesting on its own, because it’s all just one polygon. Let’s load in a different shapefile, of England only, which is split by region instead.

uk.regions <- read_sf('shapefiles/uk-regions.shp')

As we can see, this splits England up by its 9 NUTS1 regions (NUTS stands for the Nomenclature of Territorial Units for Statistics), i.e. things like North East, Yorkshire and The Humber, West Midlands, East of England etc.

uk.regions
## Simple feature collection with 9 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 82668.52 ymin: 5336.966 xmax: 655653.8 ymax: 657536
## Projected CRS: OSGB36 / British National Grid
## # A tibble: 9 × 8
##   RGN22CD  RGN22NM  BNG_E  BNG_N   LONG   LAT GlobalID                  geometry
##   <chr>    <chr>    <dbl>  <dbl>  <dbl> <dbl> <chr>           <MULTIPOLYGON [m]>
## 1 E120000… North … 417314 600356 -1.73   55.3 5a23ed0… (((450254 525947.3, 4502…
## 2 E120000… North … 350014 506279 -2.77   54.4 86f3c3e… (((327866.9 373722.3, 32…
## 3 E120000… Yorksh… 446902 448736 -1.29   53.9 e98894d… (((495603.1 422511.5, 49…
## 4 E120000… East M… 477659 322635 -0.850  52.8 94937b2… (((552751.4 326659.3, 55…
## 5 E120000… West M… 386294 295477 -2.20   52.6 1b27107… (((402840.5 368845.6, 40…
## 6 E120000… East o… 571078 263235  0.504  52.2 ab4f9e4… (((570605.1 181382.2, 57…
## 7 E120000… London  517515 178392 -0.309  51.5 acffd85… (((516122.7 172421.3, 51…
## 8 E120000… South … 470062 172924 -0.993  51.5 0ac18c7… (((429038.4 84843.5, 429…
## 9 E120000… South … 285013 102567 -3.63   50.8 72d5f65… (((83962.84 5401.15, 839…

3 Simplifying spatial objects

We can plot this using plot() or ggplot() as we saw earlier, but a word of warning first: this takes a long time (at least on my ancient 2014 MacBook!) because the shapefile is so detailed and complex.

uk.regions %>%
  ggplot() +
  geom_sf()

We can quantify this using the object_size() function from the pryr package:

library(pryr)

object_size(uk.regions)
## 23.51 MB

23MB is pretty big! (for comparison, the size of the world shapefile we were plotting earlier is just 0.3MB, despite containing over 170 polygons/countries)

We can simplify the complexity of the UK region shapefile to reduce its file size (and therefore the time taken to plot it). For this, we need to load in the rmapshaper package and use the ms_simplify() function. We provide a proportion value (from 0 to 1) to the keep argument, which corresponds to the proportion of points we want to keep, i.e. the lower the value, the simpler the resulting shapefile. Polygons are essentially just sets of points joined up like in the dot-to-dot books we all had as children! Naturally, the fewer of these points you keep, the simpler and more ‘jaggedy’ the resulting shape will be.

Let’s try 0.02, keeping just 2% of the original points.

library(rmapshaper)

uk.regions <- uk.regions %>% 
  ms_simplify(keep = 0.02)

This is now just over 600KB:

object_size(uk.regions)
## 612.14 kB

This means it’s much quicker to plot, and quality is still absolutely fine - in fact it doesn’t even look any different! (note: RGN22NM is the name of the column containing the label for each region - talk about unintuitive column names…)

uk.regions %>%
  ggplot() +
  geom_sf(aes(fill = RGN22NM)) +
  theme_void()

Exercise

  1. How far can you push it? Reload the uk.regions shapefile in and try re-simplifying with different values of keep inside ms_simplify() to see just how small/simple you can make it without visibly losing significant quality. Or approach this from a more fun angle: how ridiculous can you make the UK shapefile look with more extreme simplification? (hint: you might want to go really small with the value of keep)

  2. There’s another method you can use to simplify spatial polygons, using st_simplify() instead and specifying a dTolerance argument, e.g. uk.small <- uk.regions %>% st_simplify(dTolerance = 500). Try this with different values of dTolerance (higher values = more simplified) - how does the result compare with the output of ms_simplify()?

Answers

#reload in to undo our previous simplification
uk.regions <- read_sf('shapefiles/uk-regions.shp')

#using ms_simplify, keeping just 0.01% of points (shrinks to 17kb)
uk.small.1 <- uk.regions %>% ms_simplify(keep = 0.0001)

#still looks decent!
ggplot(uk.small.1) +
  geom_sf()

#using ms_simplify, keeping just 0.0001% of points (shrinks to 14kb)
uk.small.2 <- uk.regions %>% ms_simplify(keep = 0.000001)

#not so decent...
ggplot(uk.small.2) +
  geom_sf()

#using st_simplify instead, setting tolerance to 500m (shrinks to 140kb)
uk.small.3 <- uk.regions %>% st_simplify(dTolerance = 500)

#dosn't look that different to the output of ms_simplify()
ggplot(uk.small.3) +
  geom_sf()

#using st_simplify instead, setting tolerance to 10,000m (shrinks to 18kb)
uk.small.4 <- uk.regions %>% st_simplify(dTolerance = 10000)

#notice how the borders between polygons aren't preserved here - gaps emerge! 
#(this wasn't an issue with ms_simplify)
ggplot(uk.small.4) +
  geom_sf()

Bonus: we can also use the smoothr package to smooth polygons and produce what I like to call the ‘melted ice sculpture’ effect!

#starting with uk.small.1, which is the one we shrunk using ms_simplify(keep = 0.0001)
uk.small.6 <- uk.small.1 %>% smoothr::smooth(method = "ksmooth", smoothness = 3)

ggplot(uk.small.6) +
  geom_sf()

Ok that’s enough messing aroud - before we carry on, let’s revert back to our original simplification of the uk.regions shapefile:

uk.regions <- read_sf('shapefiles/uk-regions.shp') %>%
  ms_simplify(keep = 0.02)

4 Calculating overlap between areas/polygons

One thing we might want to do is plot two separate sets of polygon geometries and calculate the degree of overlap between them.

Let’s use this new simplified version of the England regions shapefile alongside a shapefile denoting the Areas of Outstanding Natural Beauty (AONB) in England, downloaded from the government’s open data site. We’ll be asking which region of England is the most ‘beautiful’, and we’ll be answering this scientifically based on the relative proportion of each region’s area that’s classified as an ‘area of outstanding natural beauty’.

aonb <- read_sf('shapefiles/AONB.shp')

This is another large shapefile with unncessary detail:

object_size(aonb)
## 32.72 MB

So let’s simpify it to the same degree as we did for the England regions:

aonb <- aonb %>%
  ms_simplify(keep = 0.02)

And here’s what it looks like:

ggplot(aonb) +
  geom_sf()

Before we plot these two shapefiles over the top of each other, we need to check their coordinate reference systems. This topic is way beyond the scope of this workshop, but essentially the CRS is used to precisely locate points on the earth’s surface. It combines a specific projection, origin point, and unit of measurement (e.g. are the points specified in latitude/longitude, or easting/northing?). Every shapefile has a CRS, or if it doesn’t then R will only be able to plot the shapefile in an empty space.

There is a whole science behind this topic (it isn’t helped by the wonky shape of the Earth and how it’s not a perfect sphere) but the specifics aren’t important: all we need to know is that if we want to do any computation involving two different shapefiles (e.g. measuring overlap, plotting one over the other etc), then they must be projected in the same way with identical CRSs.

We can check this with the st_crs() function:

st_crs(uk.regions)
## Coordinate Reference System:
##   User input: OSGB36 / British National Grid 
##   wkt:
## PROJCRS["OSGB36 / British National Grid",
##     BASEGEOGCRS["OSGB36",
##         DATUM["Ordnance Survey of Great Britain 1936",
##             ELLIPSOID["Airy 1830",6377563.396,299.3249646,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4277]],
##     CONVERSION["British National Grid",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",49,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-2,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9996012717,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",400000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",-100000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["Engineering survey, topographic mapping."],
##         AREA["United Kingdom (UK) - offshore to boundary of UKCS within 49°45'N to 61°N and 9°W to 2°E; onshore Great Britain (England, Wales and Scotland). Isle of Man onshore."],
##         BBOX[49.75,-9,61.01,2.01]],
##     ID["EPSG",27700]]

That’s quite a lot of reader-unfriendly jargon!

Let’s run a comparison between the CRS of our two shapefiles:

st_crs(uk.regions) == st_crs(aonb)
## [1] TRUE

Great, they’re identical! We don’t have to mess around with anything then (spoiler alert: we’ll see how to change a CRS and reproject spatial data in the next part of this workshop).

Now that we know our two shapefiles have the same projection, we can plot them together by simply adding another geom_sf() layer and specifying the second shapefile:

ggplot(uk.regions) +
  geom_sf(aes(fill = RGN22NM)) +
  geom_sf(data = aonb, fill = 'red', alpha = 0.5) +
  theme_void()

Notice how some of the AONBs actually straddle two different regions (e.g. the North Pennines overlap with both the North West and the North East). This makes our life a bit more difficult, because it’s not just a simple case of working out which areas ‘belong’ to which regions and then calculating their total area. Instead, we’ll have to calculate how much of each region of England is covered by an AONB.

To do this, we need to first find the total area of each UK region. This is nice and easy: we can do this by applying the st_area() function to our shapefile, which returns a set of area values for each unique geometry in the shapefile (i.e. each region of the UK).

st_area(uk.regions)
## Units: [m^2]
## [1]  8599596685 14170875783 15410922881 15645545178 13003754980 19133442615
## [7]  1573905346 19090850368 23856297826

Let’s add this back to our shapefile in a new column called total_area:

uk.regions$total_area <- st_area(uk.regions)

Then find the area of overlap. At the moment, there are 34 rows (i.e. 34 unique geometries) in our AONB shapefile, one for each area of outstanding natural beauty in England. We want to collapse this down to a single multipolygon, i.e. a single row. We can do this by using the st_union() function - let’s do this, and save it a new shapefile called aonb.single:

aonb.single <- aonb %>% st_union()

Now we can use the st_intersection() function to only return the parts of uk.regions that are overlapped with our AONB shapefile

overlapped <- st_intersection(uk.regions, aonb.single)

Let’s plot this intersection and colour-code it by region:

ggplot(overlapped) + 
  geom_sf(aes(fill = RGN22NM)) +
  theme_void()

On the surface it might just look like the original AONB shapefile we loaded in, so you might be thinking what was the point in everything we just did? Importantly, this is actually the UK region shapefile but only those parts that are overlapped with the AONB shapefile, i.e. it’s still split by region with 9 rows/geometries:

overlapped
## Simple feature collection with 9 features and 8 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: 85816.02 ymin: 7101.81 xmax: 653767.1 ymax: 650535.1
## Projected CRS: OSGB36 / British National Grid
## # A tibble: 9 × 9
##   RGN22CD   RGN22NM                BNG_E  BNG_N   LONG   LAT GlobalID total_area
## * <chr>     <chr>                  <dbl>  <dbl>  <dbl> <dbl> <chr>         [m^2]
## 1 E12000001 North East            417314 600356 -1.73   55.3 5a23ed0…    8.60e 9
## 2 E12000002 North West            350014 506279 -2.77   54.4 86f3c3e…    1.42e10
## 3 E12000003 Yorkshire and The Hu… 446902 448736 -1.29   53.9 e98894d…    1.54e10
## 4 E12000004 East Midlands         477659 322635 -0.850  52.8 94937b2…    1.56e10
## 5 E12000005 West Midlands         386294 295477 -2.20   52.6 1b27107…    1.30e10
## 6 E12000006 East of England       571078 263235  0.504  52.2 ab4f9e4…    1.91e10
## 7 E12000007 London                517515 178392 -0.309  51.5 acffd85…    1.57e 9
## 8 E12000008 South East            470062 172924 -0.993  51.5 0ac18c7…    1.91e10
## 9 E12000009 South West            285013 102567 -3.63   50.8 72d5f65…    2.39e10
## # ℹ 1 more variable: geometry <GEOMETRY [m]>

This means we can easily calculate the area of this overlapped shapefile using the same method as before, i.e. using the st_area() function. Let’s do this and save it back to the shapefile in a column simply called beaut_area:

overlapped$beaut_area <- st_area(overlapped)

We now have two area measures for each English region: its total area in the total_area column (in m2) and how much of it is an area of outstanding natural beauty in the beaut_area column (also measured in m2).

Let’s do a few things here to finish off: take our overlapped shapefile, add a column called beaut_prop which is the proportion of each region that is an AONB (simply divide beaut_area by total_area), then drop the geometry column to turn it into a simple dataframe, select only the columns containing the region name (RGN22NM) and this proportion (beaut_prop), then sort by these values:

overlapped %>%
  mutate(beaut_prop = beaut_area / total_area) %>%
  st_drop_geometry() %>%
  select(RGN22NM, beaut_prop) %>%
  arrange(beaut_prop)
## # A tibble: 9 × 2
##   RGN22NM                  beaut_prop
##   <chr>                           [1]
## 1 London                      0.00184
## 2 East Midlands               0.0332 
## 3 East of England             0.0549 
## 4 Yorkshire and The Humber    0.0599 
## 5 West Midlands               0.0983 
## 6 North West                  0.110  
## 7 North East                  0.165  
## 8 South East                  0.259  
## 9 South West                  0.294

There you have it: the South West is officially the most beautiful region, with almost 30% of its area classified as an area of outstanding natural beauty! Yorkshire and The Humber is lagging behind at almost 6%, but spare a thought for poor Londoners where the value is just 0.18%!

They do say beauty is in the eye of the beholder though…

Exercise

Now it’s your turn: search the web for some shapefiles - you’d be surprised at how many there are! Try downloading them, messing around with them, plotting them etc (remember to resize them first if they’re large and you don’t want to risk waiting an age for R to do its thing)

Here are some random examples:

Example

Here’s one I prepared earlier, plotting canals and inland rivers in the North West (in blue) alongside the locations of Japanese knotweed (in red), all overlaid on a map of UK postcode areas:

uk.areas <- read_sf("shapefiles/uk-areas.shp") %>%
  ms_simplify(keep = 0.1)

knotweed <- read_sf("shapefiles/NBN_Atlas_occurrences_-_Japanese_Knotweed_(View_Layer).shp")

uk.water <- read_sf("~/Downloads/oprvrs_essh_gb/data/WatercourseLink.shp")

uk.areas %>%
  ggplot() +
  geom_sf() +
  geom_sf(data = filter(uk.water, form %in% c('canal', 'inlandRiver')), 
          colour = 'blue', alpha = 0.4) +
  geom_sf(data = knotweed, colour = 'red', alpha = 0.3, size = 0.5) +
  coord_sf(xlim = c(-3, -2), ylim = c(53.2, 53.7)) +
  theme_void()