Overview

One of the major advantages of using Twitter as a source of corpus data is that some users have ‘geotagging’ enabled, meaning that each tweet they send is tagged with latitude and longitude coordinates. This makes Twitter a really useful tool for geospatial analysis.

Unfortunately, it’s estimated that only 1-2% of Twitter users have geotagging enabled on their account, which means it can take quite a while to create a corpus of geocoded tweets.

In this section, we’ll explore different ways of visualising geolocation data.

1 Searching for geocoded tweets

Let’s explore the extent to which we can use geotagged Twitter data to explore regional differences in language use. A well-known example of dialectal variation in British English concerns the word used for the final or ‘main’ meal of the day - broadly speaking, northerners say tea while southerners say dinner (check out this map for empirical evidence!).

The following code will collect the latest 10,000 tweets containing either the word dinner or tea:

meal.tweets <- search_tweets("dinner OR tea", n = 10000, lang = "en", include_rts = FALSE)

If you can’t access the Twitter API, download a copy of the data here: meal-tweets.Rdata

In this case, we’ll want to colour-code each tweet depending on whether it contains dinner or tea. We can do this using a combination of the following commands:

mutate() adds a new column to (or edits an existing column in) our data set
case_when() fills the new type column with values, coding each row of data depending on the output of str_detect()
str_detect() searches within each tweet for the word of interest

Try and make sense of the code below. It takes the meal.tweets object we just created, converts the text column to lowercase, adds a column called type, and then uses case_when() to determine which values go in this new column: if text (i.e. the tweet itself) contains tea, we put tea in the new column; otherwise, if it contains dinner, we put dinner in the new column. The conditions are checked in order, so a tweet containing both words is coded as tea.

meal.tweets <- meal.tweets %>%
  mutate(text = tolower(text)) %>%
  mutate(type = case_when(
    str_detect(text, 'tea') ~ 'tea',
    str_detect(text, 'dinner') ~ 'dinner'
  ))
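One caveat with the pattern above: str_detect(text, 'tea') also matches words that merely contain those letters, such as team or steak. Below is a minimal sketch of a stricter version, assuming that whole-word matching (using the regex word boundary \\b) is what you want. The toy.tweets data and the extra 'both' label are made up for illustration and are not part of the workshop data set.

```r
library(dplyr)
library(stringr)

# Toy data standing in for meal.tweets (hypothetical examples)
toy.tweets <- tibble(text = c(
  "Fish and chips for tea",
  "Steak for dinner tonight",
  "Our team won!",
  "Tea now, dinner later"
))

toy.tweets <- toy.tweets %>%
  mutate(text = tolower(text)) %>%
  mutate(type = case_when(
    # \\b marks a word boundary, so 'team' and 'steak' no longer count as 'tea'
    str_detect(text, "\\btea\\b") & str_detect(text, "\\bdinner\\b") ~ "both",
    str_detect(text, "\\btea\\b") ~ "tea",
    str_detect(text, "\\bdinner\\b") ~ "dinner"
  ))

toy.tweets
```

Rows that match none of the conditions (like the team tweet) get NA in the type column, which makes them easy to filter out later.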

We also need to run the following line of code to extract latitude and longitude coordinates into a plottable format - this will create two new columns lat and lng that we can use for our Y and X axes, respectively:

meal.tweets <- lat_lng(meal.tweets)

If a tweet doesn’t contain workable latitude/longitude coordinates, it will just have NA in the new lat and lng columns - let’s find out how many tweets in our dataset are like this:

meal.tweets %>%
  filter(is.na(lat) & is.na(lng)) %>%
  nrow()

## [1] 9572
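If you’d rather see this as a proportion, the same check can be written with summarise(). This is a minimal sketch on made-up data; toy.coords and the column names n_missing and prop_missing are illustrative only.

```r
library(dplyr)

# Toy data standing in for the output of lat_lng() (hypothetical values)
toy.coords <- tibble(
  lat = c(53.48, NA, NA, 51.51),
  lng = c(-2.24, NA, NA, -0.13)
)

missing.summary <- toy.coords %>%
  summarise(
    # a tweet is unusable if either coordinate is missing
    n_missing = sum(is.na(lat) | is.na(lng)),
    prop_missing = mean(is.na(lat) | is.na(lng))
  )

missing.summary
```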

Over 9,500 of our 10,000 tweets have no geolocation data! There’s a second problem, too: we didn’t restrict our search to a particular region, and as this is a dialectal variable in British English, we should further filter our data down to just those tweets sent from the UK. This is a little trickier, as we only have continuous latitude/longitude coordinates. Luckily, I’ve done the hard work for you and calculated the lat/long limits of the UK 😉

meal.tweets.uk <- meal.tweets %>%
  filter(lat > 49.82 & lat < 59.47 & lng > -10.85 & lng < 2.02)

nrow(meal.tweets.uk)

## [1] 114

This leaves us with 114 tweets. Better than nothing I suppose…
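If you want to reuse this kind of filter for other regions, one option is to store the limits in a named list and wrap the filter in a small function. The sketch below does this on made-up data; uk.box, filter_to_box() and toy.points are hypothetical names introduced here, and the limits are the ones used above. Bear in mind that a rectangular box is only an approximation of the UK: it also takes in Ireland, for instance.

```r
library(dplyr)

# Bounding box using the limits from the filter above
uk.box <- list(lat_min = 49.82, lat_max = 59.47, lng_min = -10.85, lng_max = 2.02)

# Hypothetical helper: keep rows whose coordinates fall strictly inside a box
filter_to_box <- function(data, box) {
  data %>%
    filter(lat > box$lat_min, lat < box$lat_max,
           lng > box$lng_min, lng < box$lng_max)
}

# Toy data with approximate city coordinates (illustrative only)
toy.points <- tibble(
  place = c("Manchester", "London", "Paris", "New York", "no coordinates"),
  lat = c(53.48, 51.51, 48.86, 40.71, NA),
  lng = c(-2.24, -0.13, 2.35, -74.01, NA)
)

toy.uk <- filter_to_box(toy.points, uk.box)
toy.uk$place
```

Note that filter() silently drops rows where the condition is NA, so tweets without coordinates disappear at this step too.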

2 Plotting a static map

Now that we’ve got our data, it’s time to plot a map. The ggplot2 package has a useful function called map_data() which allows us to plot world maps, but you’ll need to install the maps package first (you don’t need to load it using library(), it’ll be loaded automatically):

install.packages("maps")

Now let’s save the world map data to an object called world:

world <- map_data("world")

It’s easy to get map data for individual regions too. The world object contains a column called region, which we can use for filtering (to see what the region names are, you can run unique(world$region)):

uk <- world %>%
  filter(region == "UK")

Tangent: map_data and geoms

At this point it’s a good idea to look at the actual content of this map data:

head(uk)

##        long      lat group order region     subregion
## 1 -1.065576 50.69024   570 40057     UK Isle of Wight
## 2 -1.149365 50.65571   570 40058     UK Isle of Wight
## 3 -1.175830 50.61523   570 40059     UK Isle of Wight
## 4 -1.196094 50.59922   570 40060     UK Isle of Wight
## 5 -1.251465 50.58882   570 40061     UK Isle of Wight
## 6 -1.306299 50.58853   570 40062     UK Isle of Wight

As you can see, it’s literally made up of almost 1000 (or, in the case of the world map, almost 100000) individual latitude/longitude coordinates. When we use it to plot a world map, we’re actually plotting all of these individual points and then joining them together.

This map data also provides a good example of how different types of ggplot geometric layers work.

geom_point(): plots individual points

ggplot(uk) +
  geom_point(aes(x = long, y = lat)) +
  theme_void()

geom_path(): plots the path between each point

ggplot(uk) +
  geom_path(aes(x = long, y = lat, group = group)) +
  theme_void()

geom_polygon(): plots the path as above, but filled in

ggplot(uk) +
  geom_polygon(aes(x = long, y = lat, group = group)) +
  theme_void()

Note that the code for geom_path() and geom_polygon() includes an extra argument inside aes() - a reference to the group variable. This is necessary in the case of these two geometric layers, at least when plotting something like the UK with lots of individual islands; it tells R to treat each ‘group’ of points (i.e. each island) separately, and not to connect them together.

If we leave out the group argument, we get maps that look like this:

You might notice they’re not quite right…
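To see the problem for yourself, you can drop group from aes() and plot the result. A minimal sketch, assuming the maps package is installed; the object names no.group.path and no.group.polygon are just for illustration.

```r
library(ggplot2)  # map_data() also needs the maps package to be installed
library(dplyr)

uk <- map_data("world") %>%
  filter(region == "UK")

# Without group, the last point of one island is joined to the first point of the next
no.group.path <- ggplot(uk) +
  geom_path(aes(x = long, y = lat)) +
  theme_void()

# The same problem with geom_polygon(): all islands are treated as one big shape
no.group.polygon <- ggplot(uk) +
  geom_polygon(aes(x = long, y = lat)) +
  theme_void()

no.group.path
no.group.polygon
```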

Anyway - back to mapping the dinner~tea variation. We’ll be using geom_polygon() to plot our maps in this workshop.

The following chunk of code should plot the UK map using geom_polygon() and then plot our tweets on top of this as individual points using geom_point(). Importantly, we need to set the colour of the points within aes(), so that they are colour-coded based on the value in the type column we made earlier (i.e. whether it’s a tea tweet or a dinner tweet):

uk %>%
  ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group)) +
  geom_point(data = meal.tweets.uk, aes(x = lng, y = lat, colour = type)) +
  theme_void()
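One optional refinement: by default, ggplot stretches the axes to fill the plotting window, which can squash or stretch the outline of the map. A minimal sketch, assuming you want an approximately correct aspect ratio, is to add coord_quickmap() from ggplot2. The toy.tweets.uk data and the grey fill are made up for illustration; swap in meal.tweets.uk to use the real data.

```r
library(ggplot2)  # map_data() also needs the maps package to be installed
library(dplyr)

uk <- map_data("world") %>%
  filter(region == "UK")

# Toy data standing in for meal.tweets.uk (hypothetical points)
toy.tweets.uk <- tibble(
  lat = c(53.48, 51.51),
  lng = c(-2.24, -0.13),
  type = c("tea", "dinner")
)

meal.map <- ggplot(uk) +
  geom_polygon(aes(x = long, y = lat, group = group), fill = "grey80") +
  geom_point(data = toy.tweets.uk, aes(x = lng, y = lat, colour = type)) +
  coord_quickmap() +  # keeps the aspect ratio roughly right for this latitude
  theme_void()

meal.map
```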

Looks about right! We don’t see a perfect north~south divide (can you think of why this might be?) but you certainly see a cluster of tea tweets around Manchester and a cluster of dinner tweets in London.

3 Plotting interactive maps

An alternative method to plot geospatial data is to use the leaflet package, which allows us to produce interactive maps. This has a number of benefits:

you can move around the map and zoom in
you can customise background map tiles, with plenty to choose from depending on what style you want (see this link for a list of options)
you can include pop-up boxes which display extra information when you click on each point (useful for metadata)

Start off by installing and loading the package:

install.packages("leaflet")

library(leaflet)

The code below is enough to plot a basic map with no colour-coding using leaflet. It involves three commands:

leaflet() to initialise the map
addProviderTiles() to add the background (you can change ‘CartoDB.Positron’ to any of the options listed here)
addCircleMarkers() to add the individual points. We can specify things like the size of the points (using radius), and also add content into the pop-up window which appears upon clicking an individual point. In this case, we’ll make it display the content of each tweet, which is held in the text column.

leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(data = meal.tweets.uk, 
                   radius = 2, 
                   popup = ~text)

Of course, we’ll also want to colour-code our points based on our linguistic variable. Colour-coding using leaflet is a little more complicated than with ggplot. You’ll first need to create a colour palette where you list all the possible values and the colour you’d like to assign to them. In the example below, we’ll set tea to red and dinner to blue, and assign this to an object called palette. Then you simply have to specify color = ~palette(type) when generating the map, and also include an addLegend() function to make the map more readable:

palette <- colorFactor(c("blue", "red"), domain = c("dinner", "tea"))

leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(data = meal.tweets.uk, 
                   radius = 2, 
                   popup = ~text,
                   color = ~palette(type)) %>%
  addLegend(pal = palette, 
            values = c("tea", "dinner"))
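If you’re unsure which colour has been assigned to which value, you can call the palette function directly before plotting. The sketch below also illustrates a possible gotcha: going by the documentation for the ordered argument of colorFactor(), a character domain is sorted alphabetically unless ordered = TRUE, so the colours follow alphabetical order rather than the order in which you list the values. The name palette.reversed is hypothetical.

```r
library(leaflet)

palette <- colorFactor(c("blue", "red"), domain = c("dinner", "tea"))

# Calling the palette function returns the colour assigned to each value
palette(c("dinner", "tea"))

# Listing the domain in a different order should not change the mapping,
# because the values are sorted before colours are assigned
palette.reversed <- colorFactor(c("blue", "red"), domain = c("tea", "dinner"))
palette.reversed(c("dinner", "tea"))
```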

Exercise

Check the content of the tweets by clicking on a few - how many of these tweets are false positives resulting from homonymy (i.e. people talking about a cup of tea, or using dinner to refer to the midday meal)?
Try changing addProviderTiles() to a different set of map tiles listed here - if you’re looking for something a bit more interesting, try Thunderforest.SpinalMap…
Try conducting a Twitter search of your own using search_tweets() and then plotting the tweets with geolocation data using either ggplot or leaflet

Working with Twitter data in R

Part 3: Working with geocoded tweets

George Bailey
University of York
