Techniques we’ll cover in this part:

  • how to create an sf ‘spatial’ object from a ‘normal’ dataset
  • a lesson in setting CRSs!
  • how to plot points overlaid on polygon geometries
  • how to plot interactive maps
  • how to perform point-in-polygon aggregation
  • how to generate ‘buffer zones’ [coming soon]

In this section we’ll be working with geotagged Twitter data from a corpus I compiled back in 2015–16. These tweets were scraped in real time at various points during this period, and each tweet is tagged with the latitude/longitude coordinates of where the user was when it was sent (limited to the British Isles).

To make things more interesting, I’ve searched through the 16,000,000-tweet corpus and filtered down to just those tweets containing the hashtags #LFC, #EFC, #MUFC and #MCFC, i.e. Liverpool and Everton football clubs (based in Merseyside), and Manchester United and Manchester City football clubs (based in Manchester), respectively. All four clubs are based in the North West of England, but we’ll be looking at whether that’s reflected in where people across the UK tweet about them, as an exercise in working with a different type of spatial geometry: points, rather than polygons (and we’ll see how points can interact with polygons).

We’ll be working with the sf and tidyverse packages again - you should already have these loaded in from the previous section but you’ll have to load them in using library() again if you’re working in a new R session:

library(sf)
library(tidyverse)

1 Creating shapefiles and working with point geometries

Load in the CSV of tweets using read_csv() (if you haven’t downloaded it yet, remember all the datasets are available here).

tweets <- read_csv("data/football_tweets.csv")

Taking a look at the first handful of rows shows that we have the following columns:

  • id: a unique identifier for each tweet in the corpus
  • date: the date of the tweet (in DD/MM/YYYY format)
  • tweet: the content of the tweet (usernames have been removed to preserve anonymity)
  • lat: the latitude of where the tweet was sent
  • long: the longitude of where the tweet was sent
  • region: the region where the tweet was sent (geocoded based on the lat/long)

head(tweets)
## # A tibble: 6 × 6
##   id         date       tweet                                  lat   long region
##   <chr>      <chr>      <chr>                                <dbl>  <dbl> <chr> 
## 1 UK00000439 20/01/2015 "\"@user struggling to break them d…  56.0 -3.17  sco   
## 2 UK00001568 20/01/2015 "This 'Can' business is annoying @u…  51.0 -0.931 se    
## 3 UK00003907 20/01/2015 "You can see why his name is Hazard…  51.5 -1.62  sw    
## 4 UK00004381 20/01/2015 "A goal is coming here for #LFC #LF…  53.2 -2.89  nw    
## 5 UK00004994 20/01/2015 "Chelsea players trying to ref the …  53.7 -1.36  yo    
## 6 UK00005452 20/01/2015 "Breaking news ! Chelsea drop big t…  52.0 -0.220 eoe

Notice that this is just a regular dataframe: we read it in using read_csv(), not read_sf() like we use for shapefiles. It does contain geographic information in the lat and long columns, but this is currently stored as simple numerical values.

We can plot the tweets in their current form using geom_point() and specifying the lat and long columns as x/y coordinates, i.e. as if it was just a scatterplot:

ggplot(tweets, aes(x = long, y = lat)) +
  geom_point()

It certainly looks like the UK (albeit somewhat stretched)! But if we want to do any kind of geospatial analysis, we’ll need to turn this into a proper sf object.

We can do this using the st_as_sf() function, which converts different object types into an sf object. Inside this function, we need to specify which columns contain the location information, which in our case are the long and lat columns.

tweets <- tweets %>%
  st_as_sf(coords = c("long", "lat"))

How does the dataset look now?

tweets
## Simple feature collection with 9328 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -7.668945 ymin: 50.08477 xmax: 1.746847 ymax: 58.58217
## CRS:           NA
## # A tibble: 9,328 × 5
##    id         date       tweet                  region             geometry
##    <chr>      <chr>      <chr>                  <chr>               <POINT>
##  1 UK00000439 20/01/2015 "\"@user struggling t… sco    (-3.167583 55.96051)
##  2 UK00001568 20/01/2015 "This 'Can' business … se     (-0.930736 51.00709)
##  3 UK00003907 20/01/2015 "You can see why his … sw      (-1.61818 51.47993)
##  4 UK00004381 20/01/2015 "A goal is coming her… nw     (-2.894173 53.19671)
##  5 UK00004994 20/01/2015 "Chelsea players tryi… yo     (-1.360851 53.67186)
##  6 UK00005452 20/01/2015 "Breaking news ! Chel… eoe    (-0.220439 51.97679)
##  7 UK00006165 20/01/2015 "\"Sterling is wasted… sw     (-1.810184 51.59554)
##  8 UK00006573 20/01/2015 "\"Moreno told to pla… eoe    (-0.451013 52.13996)
##  9 UK00006777 20/01/2015 "Hate to say it but #… lon    (-0.074378 51.38822)
## 10 UK00007044 20/01/2015 "Fucking get in!!!!! … sco    (-2.117146 57.18896)
## # ℹ 9,318 more rows

Good - we have a geometry column, which is correctly set to the ‘POINT’ type.

Let’s try to plot these tweets over the top of a map of the UK. This time, we’ll be using a map of the UK split by postcode area (the first letter(s) of the postcode, e.g. the YO part of YO10 5DD) rather than region - there are 124 postcode areas in the UK.
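
As a quick aside: if your own data ever contains full postcodes rather than areas, you can derive the postcode area by extracting the leading letters with str_extract() from stringr (loaded as part of the tidyverse). A minimal sketch, with a made-up postcode purely for illustration:

#extract the leading letters of a postcode, e.g. "YO" from "YO10 5DD"
str_extract("YO10 5DD", "^[A-Z]+")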

Let’s read it in from an external shapefile named uk-areas.shp, and save it to an R object called uk.areas:

uk.areas <- read_sf("shapefiles/uk-areas.shp")

I’m going to simplify this using the ms_simplify() function as we saw earlier, just to speed things up on my laptop, but you don’t necessarily need to do this (it’s only 6MB so not too bad).

library(rmapshaper)

uk.areas <- uk.areas %>%
  ms_simplify(keep = 0.1)

This shapefile doesn’t contain any extra columns of metadata, just the name of the postcode area and its shapefile geometry:

uk.areas
## Simple feature collection with 124 features and 1 field
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -8.650302 ymin: 49.16292 xmax: 1.767277 ymax: 60.86019
## Geodetic CRS:  WGS 84
## # A tibble: 124 × 2
##    pc_area                                                              geometry
##  * <chr>                                                          <GEOMETRY [°]>
##  1 AB      POLYGON ((-2.210914 56.8858, -2.212617 56.89376, -2.202608 56.9, -2.…
##  2 AL      POLYGON ((-0.4033521 51.73366, -0.3957628 51.73226, -0.3872464 51.72…
##  3 B       POLYGON ((-2.140863 52.34402, -2.141172 52.33586, -2.144486 52.33301…
##  4 BA      POLYGON ((-1.991551 51.22716, -2.026649 51.22941, -2.046019 51.22318…
##  5 BB      POLYGON ((-2.2401 53.66015, -2.222977 53.66227, -2.218685 53.66572, …
##  6 BD      POLYGON ((-2.070899 53.81301, -2.057832 53.80932, -2.052169 53.80983…
##  7 BH      MULTIPOLYGON (((-2.020586 50.92601, -2.031324 50.92534, -2.042842 50…
##  8 BL      POLYGON ((-2.524395 53.52653, -2.518426 53.52962, -2.513953 53.52907…
##  9 BN      POLYGON ((-0.5913331 50.79032, -0.571183 50.79532, -0.5595023 50.797…
## 10 BR      POLYGON ((0.1775631 51.36047, 0.1705663 51.36796, 0.1722546 51.37331…
## # ℹ 114 more rows

Let’s plot this map of the UK with our new tweets sf object overlaid:

ggplot(uk.areas) +
  geom_sf() +
  geom_sf(data = tweets)

Oh! If you’ve tried the above, you should have received an error message - this is because although we’ve created a geometry column for our tweets dataset, we haven’t actually set a coordinate reference system (CRS). Remember this is the thing that allows R to locate things on the globe so it’s pretty important!
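
We can confirm this by inspecting the CRS directly - for our tweets object it currently comes back empty:

#this returns NA, because no CRS has been set yet
st_crs(tweets)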

We can set a CRS by using the st_crs() function and specifying an EPSG code. According to our old friend Wikipedia, EPSG codes come from a “public registry of geodetic datums, spatial reference systems, Earth ellipsoids, coordinate transformations and related units of measurement, originated by a member of the European Petroleum Survey Group (EPSG) in 1985”. Good to know!
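
If you’re ever curious what a given EPSG code actually corresponds to, you can pass it straight to st_crs() to print out its full definition - for example, 27700 (which we’re about to use) is the British National Grid:

#print the full definition of EPSG 27700 (the British National Grid)
st_crs(27700)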

Let’s try the CRS based on EPSG 27700 - this is the CRS of the uk.regions shapefile we were working with in Part 1 of this workshop.

st_crs(tweets) <- 27700

Let’s try that plot again now that both our shapefiles have a CRS:

ggplot(uk.areas) +
  geom_sf() +
  geom_sf(data = tweets, colour = 'red', alpha = 0.6)

That doesn’t look right, unless all ~10,000 tweets were genuinely sent from a small boat about 60 miles off the coast of the Isles of Scilly in the middle of the Celtic Sea (unlikely).

What’s gone wrong? Well, we don’t just need to set any CRS for our tweets dataset: we need it to be the same CRS as our map of UK postcode areas, otherwise the two won’t align at all (it should also be appropriate to the units we’re working with, which in this case are degrees of latitude and longitude).

What’s the CRS of our map of UK postcode areas?

st_crs(uk.areas)
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]

It’s currently set to WGS 84, which is one of the most widely used coordinate reference systems and is based on geodetic latitude and longitude measurements. This CRS has the EPSG code 4326, so we can set it manually (using st_crs(tweets) <- 4326) or we can just set our tweets shapefile to take the exact CRS that the uk.areas shapefile already has:

st_crs(tweets) <- st_crs(uk.areas)
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
## that

Notice that we get a warning message here because we’ve tried to use st_crs() on tweets when it already had a CRS (the EPSG 27700 code we incorrectly set earlier). When there was no CRS, using st_crs() was fine, but here we actually want to reproject the geometries from one CRS to another, and we have to use st_transform() for that:

tweets <- tweets %>%
  st_transform(4326)

#or:

tweets <- tweets %>%
  st_transform(st_crs(uk.areas))

Now let’s try that plot again:

ggplot(uk.areas) +
  geom_sf() +
  geom_sf(data = tweets, colour = 'red', alpha = 0.6)

Much better! Now let’s colour-code the points based on which of the four football clubs is mentioned - just like with any dataframe, we can create a new column (we’ll call it club) to code this, using a combination of mutate() (to create a new column), case_when() (to conditionally code this column based on something else in the dataset) and str_detect() (to search for the hashtags of interest in the content of each tweet, from the tweet column).

#since str_detect() is case sensitive, we temporarily convert the tweet to 
#lowercase using tolower() and just search for lowercase hashtags
tweets <- tweets %>%
  mutate(club = case_when(
    str_detect(tolower(tweet), '#lfc') ~ 'liverpool',
    str_detect(tolower(tweet), '#efc') ~ 'everton',
    str_detect(tolower(tweet), '#mufc') ~ 'man_united',
    str_detect(tolower(tweet), '#mcfc') ~ 'man_city'
  ))

Let’s look at the total number of mentions - presented without comment (🔴🏆)

table(tweets$club)
## 
##    everton  liverpool   man_city man_united 
##        850       4385       1151       2942

What about their geographical distribution? Let’s colour-code the map from before based on this new column - we’ll do this using a custom palette that we save to an object called club_colours, which reflects the colours of each club’s kit:

club_colours <- c('everton' = '#2C39E2',
                  'liverpool' = '#FF775A', 
                  'man_city' = '#B4DAEC',
                  'man_united' = '#AF3015')

ggplot(uk.areas) +
  geom_sf() +
  geom_sf(data = tweets, aes(colour = club)) +
  scale_colour_manual(values = club_colours) +
  theme_void()

It’s pretty overcrowded, which makes it difficult to see any trends.
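
One quick way to cut through the clutter is to facet the map by club using facet_wrap(), so that each club gets its own panel - layers whose data don’t contain the faceting variable (like our uk.areas polygons here) are simply drawn in every panel:

ggplot(uk.areas) +
  geom_sf() +
  geom_sf(data = tweets, aes(colour = club), size = 0.5) +
  scale_colour_manual(values = club_colours) +
  facet_wrap(~club) +
  theme_void()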

What about just plotting the North West? We can ‘zoom in’ by using coord_sf() to specify xlim and ylim ranges for our X and Y axes respectively:

ggplot(uk.areas) +
  geom_sf() +
  geom_sf(data = tweets, aes(colour = club)) +
  scale_colour_manual(values = club_colours) +
  coord_sf(xlim = c(-3.5, -1.5), ylim = c(53, 54)) +
  theme_void()

Exercise

Try to improve the look of the map by messing with different aesthetics/options - not only can you customise the geom_sf() plotting function in the same way as you would any other geom_ layer, you can of course mess around with the wider ggplot theme in different ways too.

Here’s my best piece of art, which plots the land in green and the sea in blue, and removes the panel gridlines to make it look more like an actual map and less like a normal plot. I’ve also made the points smaller and semi-transparent:

ggplot(uk.areas) +
  geom_sf(fill = '#9EAF96', colour = 'white') +
  geom_sf(data = tweets, aes(colour = club),
          size = 0.5, alpha = 0.6) +
  scale_colour_manual(values = club_colours) +
  coord_sf(xlim = c(-3.5, -1.5), ylim = c(53, 54)) +
  theme_void() +
  theme(panel.background = element_rect(fill = '#D3E6EE', colour = NA),
        panel.grid.major = element_line(colour = "transparent"),
        panel.grid.minor = element_line(colour = "transparent"))

2 Point-in-polygon aggregation

Another thing we can do is make choropleth maps instead, plotting polygon geometries (e.g. postcode areas) colour-coded based on some value - i.e. the type we made in Part 1 of this workshop with the US population statistics.

To do this, we need to perform point-in-polygon aggregation, i.e. for each point, work out which postcode area geometry it lies within. We can do this using the st_join() function.

If you get an error message along the lines of ‘Loop 0 is not valid’ and mentioning duplicated vertices, don’t worry! This can crop up with certain shapefiles if they contain ‘topologically invalid’ geometries, which is an occasional consequence of the simplification process (and seems to be the case here if you used ms_simplify(keep = 0.1) on the uk.areas shapefile as I did). You can either run the analysis on the original uk.areas shapefile with no simplification, or run the mysterious st_make_valid() function on it, which applies some magic under the hood to make the geometries ‘valid’.

uk.areas <- st_make_valid(uk.areas)

tweets <- tweets %>%
  st_join(uk.areas, join = st_within)

As we can see below, this has worked and we now have a pc_area column denoting the postcode area that each tweet falls inside.

tweets
## Simple feature collection with 9328 features and 6 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -7.668945 ymin: 50.08477 xmax: 1.746847 ymax: 58.58217
## Geodetic CRS:  WGS 84
## # A tibble: 9,328 × 7
##    id         date       tweet    region             geometry club      pc_area
##  * <chr>      <chr>      <chr>    <chr>           <POINT [°]> <chr>     <chr>  
##  1 UK00000439 20/01/2015 "\"@use… sco    (-3.167583 55.96051) liverpool EH     
##  2 UK00001568 20/01/2015 "This '… se     (-0.930736 51.00709) liverpool GU     
##  3 UK00003907 20/01/2015 "You ca… sw      (-1.61818 51.47993) liverpool SN     
##  4 UK00004381 20/01/2015 "A goal… nw     (-2.894173 53.19671) liverpool CH     
##  5 UK00004994 20/01/2015 "Chelse… yo     (-1.360851 53.67186) liverpool WF     
##  6 UK00005452 20/01/2015 "Breaki… eoe    (-0.220439 51.97679) liverpool SG     
##  7 UK00006165 20/01/2015 "\"Ster… sw     (-1.810184 51.59554) liverpool SN     
##  8 UK00006573 20/01/2015 "\"More… eoe    (-0.451013 52.13996) liverpool MK     
##  9 UK00006777 20/01/2015 "Hate t… lon    (-0.074378 51.38822) liverpool CR     
## 10 UK00007044 20/01/2015 "Fuckin… sco    (-2.117146 57.18896) liverpool AB     
## # ℹ 9,318 more rows

Now we can filter our tweets (and shapefile) down to certain postcode areas, in this case those in the North West of England:

nw.postcodes <- c('L', 'M', 'FY', 'PR', 'BB', 'WN', 'BL', 
                  'OL', 'HX', 'HD', 'WA', 'CH', 'CW', 'SK')

uk.areas %>%
  filter(pc_area %in% nw.postcodes) %>%
  ggplot() +
    geom_sf() +
    geom_sf(data = filter(tweets, pc_area %in% nw.postcodes), aes(colour = club)) +
    scale_colour_manual(values = club_colours) +
    theme_void()

And more importantly, we can also perform our usual descriptive statistics, e.g. to count the number of observations that fall inside each postcode area:

#use st_drop_geometry() because we only want to make a 
#basic dataframe of counts here - we don't need any geometries
counts <- tweets %>%
  st_drop_geometry() %>%
  filter(pc_area %in% c('L', 'M')) %>%
  group_by(pc_area) %>%
  count(club)

What do the raw numbers look like? How many times is each football club tweeted about in the L and M postcode areas?

counts
## # A tibble: 8 × 3
## # Groups:   pc_area [2]
##   pc_area club           n
##   <chr>   <chr>      <int>
## 1 L       everton      258
## 2 L       liverpool    696
## 3 L       man_city      29
## 4 L       man_united    36
## 5 M       everton        8
## 6 M       liverpool     53
## 7 M       man_city     179
## 8 M       man_united   374

We can visualise this in the form of a stacked bar chart:

ggplot(counts) +
  geom_col(aes(x = pc_area, y = n, fill = club), 
           position = "fill") +
  scale_fill_manual(values = club_colours)

Looks like there’s a strong regional pattern here, as we might expect: Liverpool is the most tweeted-about club in the L postcode area, followed by Everton, and in the M postcode area Manchester United are the most tweeted-about club followed by Manchester City. Science!
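
If you want the exact proportions behind those stacked bars, a quick way is to compute them directly - counts is still grouped by pc_area, so sum(n) is calculated within each postcode area:

#proportion of tweets about each club within each postcode area
counts %>%
  mutate(prop = n / sum(n))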

3 Making interactive maps

Another cool thing you can do is plot the maps in an interactive way using the mapview package (which is really just a wrapper for the leaflet package - here’s a good tutorial on how to make interactive ‘leaflet’ maps). As you can see, this brings with it lots of features/benefits:

  • click and drag with the mouse to move around the map
  • zoom in and out
  • click on individual points to see columns of metadata from the shapefile
  • customise the ‘basemap’

Remember to install the package before loading it in with library():

install.packages('mapview')
library(mapview)

All we have to do is input our tweets shapefile into the mapview() plotting function - it has a different set of arguments compared to ggplot when it comes to specifying map aesthetics though. Use zcol to specify a column to colour-code the geometries by (whether points or polygons), use col.regions to specify a custom colour palette, and cex for point size:

mapview(tweets, zcol = 'club', col.regions = club_colours, cex = 3)

You can save an interactive mapview object using the mapshot() function. Let’s run the same code as before to generate our interactive map, but actually save it to an R object called map_to_save that we can input into the mapshot() function:

map_to_save <- mapview(tweets, zcol = 'club', col.regions = club_colours, cex = 3)

mapshot(map_to_save, url = "tweet_map.html")
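
If you’d rather have a static snapshot (e.g. for a slide deck), mapshot() also takes a file argument - note that this relies on the webshot package behind the scenes, which you may need to install separately:

#save a static image of the map instead of an interactive html file
mapshot(map_to_save, file = "tweet_map.png")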

Exercise

Try playing around with the customisation options for mapview() - don’t forget you can click on each point to actually read the content of the tweet (I apologise in advance for any colourful language - football fans can get very angry on social media sometimes).
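
For example, you can swap out the basemap using the map.types argument, which accepts the name of any leaflet provider tile - ‘CartoDB.Positron’ below is just one option:

mapview(tweets, zcol = 'club', col.regions = club_colours, cex = 3,
        map.types = 'CartoDB.Positron')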

If you’re feeling particularly ambitious, you could try performing some basic sentiment analysis following this tutorial I gave a few years ago (shameless plug), and then e.g. plotting an interactive map colour-coded by sentiment to see if any regional patterns emerge.
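
To give a flavour of what that might look like, here’s a minimal sketch using the tidytext package and its built-in ‘bing’ sentiment lexicon - this is just one of many possible approaches (and not necessarily the method from that tutorial), so treat it as a starting point:

library(tidytext)

#score each tweet as (number of positive words) - (number of negative words)
sentiment_scores <- tweets %>%
  st_drop_geometry() %>%   #plain dataframe - no geometries needed for the text step
  select(id, tweet) %>%
  unnest_tokens(word, tweet) %>%   #one row per word
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)

#join the scores back onto the sf object and plot an interactive map
tweets %>%
  left_join(sentiment_scores, by = "id") %>%
  mapview(zcol = "score")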