Overview
One of the major advantages of using Twitter as a source of corpus data is that some users have ‘geotagging’ enabled, meaning that each tweet they send is tagged with latitude and longitude coordinates. This makes Twitter a really useful tool for geospatial analysis.
Unfortunately, it’s estimated that only 1-2% of Twitter users have geotagging enabled on their account, which means it can take quite a while to create a corpus of geocoded tweets.
In this section, we’ll explore different ways of visualising geolocation data.
1 Searching for geocoded tweets
Let’s see how far we can use geotagged Twitter data to explore regional differences in language use. A well-known example of dialectal variation in British English concerns the word used for the final or ‘main’ meal of the day - broadly speaking, northerners say tea while southerners say dinner (check out this map for empirical evidence!).
The following code will collect the latest 10,000 tweets containing either the word dinner or tea:
If you can’t access the Twitter API, download a copy of the data here: meal-tweets.Rdata
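For reference, the collection step might look something like the sketch below. It assumes the rtweet package (the source of the search_tweets() function used in the exercise at the end) and an authenticated Twitter API token. The wrapper name collect_meal_tweets, the query string and the include_rts setting are all assumptions made for illustration, not necessarily what was used to build meal-tweets.Rdata.

```r
# Hypothetical wrapper around rtweet::search_tweets(); calling it needs
# the rtweet package and valid Twitter API credentials.
collect_meal_tweets <- function(n = 10000) {
  rtweet::search_tweets(
    q = "dinner OR tea",   # assumed query: tweets containing either word
    n = n,                 # number of tweets to request
    include_rts = FALSE    # assumption: skip retweets to avoid duplicates
  )
}

# meal.tweets <- collect_meal_tweets()   # uncomment once authenticated
```

Wrapping the call in a function means the chunk can be run safely without credentials; nothing is sent to the API until the function is actually called.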
In this case, we’ll want to colour-code each tweet depending on whether it contains dinner or tea. We can do this using a combination of the following commands:
mutate(): adds a new column to (or edits an existing column in) our data set
case_when(): fills the new type column with values, coding each row of data depending on the output of str_detect()
str_detect(): searches within each tweet for the word of interest
Try and make sense of the code below. It takes the meal.tweets object we just created, converts the text column to lowercase, adds a column called type, and then uses case_when() to determine which values go in this new column: if text (i.e. the tweet itself) contains tea, we put tea in the new column; otherwise, if it contains dinner, we put dinner in the new column. Because case_when() stops at the first condition that matches, a tweet containing both words is coded as tea.
meal.tweets <- meal.tweets %>%
  mutate(text = tolower(text)) %>%
  mutate(type = case_when(
    str_detect(text, 'tea') ~ 'tea',
    str_detect(text, 'dinner') ~ 'dinner'
  ))
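One thing to be aware of is that str_detect() matches substrings, so a pattern like 'tea' also matches words such as steak or team. The sketch below is an optional refinement rather than part of the original workflow: it uses a small invented data set (toy.tweets, with made-up tweets) to compare the plain pattern with one wrapped in \\b word boundaries.

```r
library(dplyr)
library(stringr)

# Toy tweets, invented for illustration
toy.tweets <- tibble(text = c(
  "Steak for dinner tonight",
  "Time for tea with the team",
  "Making a cup of tea",
  "Lunch was great"
))

toy.tweets <- toy.tweets %>%
  mutate(text = tolower(text)) %>%
  mutate(
    # plain substring match: 'steak' counts as a hit for 'tea'
    type = case_when(
      str_detect(text, "tea") ~ "tea",
      str_detect(text, "dinner") ~ "dinner"
    ),
    # \\b marks a word boundary, so only the whole word matches
    type.strict = case_when(
      str_detect(text, "\\btea\\b") ~ "tea",
      str_detect(text, "\\bdinner\\b") ~ "dinner"
    )
  )

toy.tweets
```

The first toy tweet is coded as tea by the plain pattern but as dinner by the stricter one, and the last tweet gets NA in both columns because neither word appears in it.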
We also need to run the following line of code to extract latitude and longitude coordinates into a plottable format - this will create two new columns, lat and lng, that we can use for our Y and X axes, respectively:
If a tweet doesn’t contain workable latitude/longitude coordinates, it will just have NA in the new lat and lng columns - let’s find out how many tweets in our dataset are like this:
## [1] 9572
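A count like this can be obtained by summing the missing values in the lat column. The sketch below shows the idea on a small invented data frame (toy.tweets, with made-up coordinates) rather than the real meal.tweets object, so the numbers will not match the output above.

```r
# Toy data standing in for meal.tweets; coordinates are invented
toy.tweets <- data.frame(
  text = c("tea time", "dinner out", "more tea", "sunday dinner"),
  lat  = c(53.48, NA, NA, 51.51),
  lng  = c(-2.24, NA, NA, -0.13)
)

# is.na() flags missing values; sum() counts the TRUEs
n.missing <- sum(is.na(toy.tweets$lat))
n.missing

# mean() on the same logical vector gives the proportion missing
mean(is.na(toy.tweets$lat))
```

Swapping toy.tweets for meal.tweets should give the count for the real data, assuming the lat and lng columns have already been created.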
Since we didn’t restrict our search to a particular region, over 9500 of our 10000 tweets have no geolocation data! On top of that, since this is a dialectal variable in British English, we should filter the remaining data down to just the tweets sent from the UK. This is a little trickier, as we only have continuous latitude/longitude coordinates. Luckily, I’ve done the hard work for you and calculated the lat/long limits of the UK 😉
meal.tweets.uk <- meal.tweets %>%
  filter(lat > 49.82 & lat < 59.47 & lng > -10.85 & lng < 2.02)
nrow(meal.tweets.uk)
## [1] 114
This leaves us with 114 tweets. Better than nothing I suppose…
2 Plotting a static map
Now that we’ve got our data, it’s time to plot a map. The ggplot2 package has a useful function called map_data() which allows us to plot world maps, but you’ll need to install the maps package first (you don’t need to load it using library(), it’ll be loaded automatically):
Now let’s save the world map data to an object called world:
It’s easy to get map data for individual regions too. The world object contains a column called region, which we can use for filtering (to see what the region names are, you can run unique(world$region)).
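A minimal sketch of these steps is given below. It assumes the maps package has already been installed (e.g. with install.packages("maps")) and that the UK subset is stored in an object called uk, which is the name used by the plotting code later on; the exact filtering call is an assumption.

```r
library(ggplot2)  # provides map_data(); the maps package must be installed
library(dplyr)

# Outline coordinates for every country in the world
world <- map_data("world")

# Keep only the rows belonging to the UK
uk <- world %>%
  filter(region == "UK")

# Inspect the first few rows
head(uk)
```

Filtering on an exact region name avoids accidentally picking up other regions whose names happen to start with the same letters.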
Tangent: map_data and geoms
At this point it’s a good idea to look at the actual content of this map data:
## long lat group order region subregion
## 1 -1.065576 50.69024 570 40057 UK Isle of Wight
## 2 -1.149365 50.65571 570 40058 UK Isle of Wight
## 3 -1.175830 50.61523 570 40059 UK Isle of Wight
## 4 -1.196094 50.59922 570 40060 UK Isle of Wight
## 5 -1.251465 50.58882 570 40061 UK Isle of Wight
## 6 -1.306299 50.58853 570 40062 UK Isle of Wight
As you can see, it’s literally made up of almost 1000 (or, in the case of the world map, almost 100000) individual latitude/longitude coordinates. When we use it to plot a world map, we’re actually plotting all of these individual points and then joining them together.
This map data also provides a good example of how different types of ggplot geometric layers work.
geom_point(): plots individual points
geom_path(): plots the path between each point
geom_polygon(): plots the path as above, but filled in
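The three layers can be compared with something like the sketch below, which rebuilds the UK map data so that it runs on its own. The object names p.point, p.path and p.polygon are made up for illustration, and the maps package needs to be installed.

```r
library(ggplot2)

# UK outline coordinates
uk <- subset(map_data("world"), region == "UK")

# Individual points only
p.point <- ggplot(uk) +
  geom_point(aes(x = long, y = lat))

# Lines joining the points, drawn separately for each group (island)
p.path <- ggplot(uk) +
  geom_path(aes(x = long, y = lat, group = group))

# Same outlines, but filled in
p.polygon <- ggplot(uk) +
  geom_polygon(aes(x = long, y = lat, group = group))

# Print any of the three objects to draw it
p.point
```

Storing each plot in its own object makes it easy to print them one after another and see how the same coordinates are drawn by each layer.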
Note that the code for geom_path() and geom_polygon() includes an extra argument inside aes() - a reference to the group variable. This is necessary in the case of these two geometric layers, at least when plotting something like the UK with lots of individual islands; it tells R to treat each ‘group’ of points (i.e. each island) separately, and not to connect them together.
If we leave out the group argument, we get maps that look like this:
You might notice they’re not quite right…
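To reproduce the problem yourself, a sketch along the following lines should do it (the object name p.nogroup is made up, and the maps package needs to be installed). With no group aesthetic, ggplot treats every row as part of one continuous line, so the end of one island’s outline gets joined to the start of the next.

```r
library(ggplot2)

# UK outline coordinates
uk <- subset(map_data("world"), region == "UK")

# How many separate outlines (groups) does the UK data contain?
length(unique(uk$group))

# No group aesthetic: all points are joined in a single path
p.nogroup <- ggplot(uk) +
  geom_path(aes(x = long, y = lat))

p.nogroup
```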
Anyway - back to mapping the dinner~tea variation. We’ll be using geom_polygon() to plot our maps in this workshop.
Importantly, we need to set the colour
of the points within aes()
, based on whatever value is in the type
column. The following chunk of code should plot the world map using geom_polygon()
and then plot our tweets on top of this as individual points using geom_point()
. Note how we’ve also specified for the points to be colour-coded based on the value in the type
column we made earlier (i.e. whether it’s a tea tweet or a dinner tweet)
uk %>%
  ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group)) +
  geom_point(data = meal.tweets.uk, aes(x = lng, y = lat, colour = type)) +
  theme_void()
Looks about right! We don’t see a perfect north~south divide (can you think of why this might be?) but you certainly see a cluster of tea tweets around Manchester and a cluster of dinner tweets in London.
3 Plotting interactive maps
An alternative method to plot geospatial data is to use the leaflet package, which allows us to produce interactive maps. This has a number of benefits:
- you can move around the map and zoom in
- you can customise background map tiles, with plenty to choose from depending on what style you want (see this link for a list of options)
- you can include pop-up boxes which display extra information when you click on each point (useful for metadata)
Start off by installing and loading the package:
The code below is enough to plot a basic map with no colour-coding using leaflet. It involves three commands:
- leaflet() to initialise the map
- addProviderTiles() to add the background (you can change ‘CartoDB.Positron’ to any of the options listed here)
- addCircleMarkers() to add the individual points. We can specify things like the size of the points (using radius), and also add content into the pop-up window which appears upon clicking an individual point. In this case, we’ll make it display the content of each tweet, which is held in the text column.
leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(data = meal.tweets.uk,
                   radius = 2,
                   popup = ~text)
Of course, we’ll also want to colour-code our points based on our linguistic variable. Colour-coding using leaflet is a little more complicated than with ggplot. You’ll first need to create a colour palette where you list all the possible values and the colour you’d like to assign to them. In the example below, we’ll set tea to red and dinner to blue, and assign this to an object called palette. Then you simply have to specify color = ~palette(type) when generating the map, and also include an addLegend() function to make the map more readable:
palette <- colorFactor(c("blue", "red"), domain = c("dinner", "tea"))

leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(data = meal.tweets.uk,
                   radius = 2,
                   popup = ~text,
                   color = ~palette(type)) %>%
  addLegend(pal = palette,
            values = c("tea", "dinner"))
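It can help to see what colorFactor() actually returns. The object it creates is itself a function: give it a vector of values and it hands back one colour code per value. The short sketch below tries this on a few hand-typed values (the cols object is made up for illustration).

```r
library(leaflet)

# Map each category to a colour
palette <- colorFactor(c("blue", "red"), domain = c("dinner", "tea"))

# Calling the palette on some values returns one colour code per value
cols <- palette(c("dinner", "tea", "tea"))
cols
```

This is what color = ~palette(type) does behind the scenes: it runs the palette function over the type column and uses the resulting colours for the points.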
Exercise
- Check the content of the tweets by clicking on a few - how many of these tweets are false positives resulting from homonymy (i.e. people talking about a cup of tea, or using dinner to refer to the midday meal)?
- Try changing addProviderTiles() to a different set of map tiles listed here - if you’re looking for something a bit more interesting, try Thunderforest.SpinalMap…
- Try conducting a Twitter search of your own using search_tweets() and then plotting the tweets with geolocation data using either ggplot or leaflet