Overview
Before we do any analysis, we’ll need to collect some data. There are a number of ways of ‘scraping’ Twitter for tweets, and some are better than others depending on your workflow and what kind of information you’d like to collect. R packages such as rtweet
are great because you can collect the data and then analyse it all within the same program, but there are other options such as the stand-alone FireAnt program that provides a nice user-friendly way of collecting data. In this section, we will cover both of these methods.
1 Mining tweets in R
The first thing we need to do is install and then load the rtweet
R package. You only need to install the package once: after this, you can simply load it into the workspace using the library()
function.
You’ll also need to install the httpuv
package in order to authenticate your R session and gain permission to mine Twitter data.
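For example (you only need the install.packages() lines the first time; after that, just load the packages at the start of each session):

install.packages("rtweet")
install.packages("httpuv")

library(rtweet)
library(httpuv)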
1.1 Searching by tweet content
There are a few options when it comes to searching for tweets, but perhaps the most basic (and most useful!) is to search for a particular term. You can do this by using the search_tweets()
function to collect tweets from the past week that include a particular word or phrase (the search is restricted to tweets from the last week, unfortunately).
As with any R function, you can type in ?search_tweets
to bring up its help vignette, which will explain how it works and what each argument means. If you’re ever unsure about what an R function does, or what arguments are necessary and what they mean, always check the help vignette first.
Let’s do a Game of Thrones search (potential spoilers ahead!). In the example below, the code will return a maximum of 100 tweets (specified by the n argument) containing the word ‘Lannister’, and write the output of this search to a dataframe called lannister.tweets. The include_rts argument allows us to exclude retweets from this search, and lang allows us to filter by language (e.g. ‘en’ for English, ‘de’ for German etc.).
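A call along these lines will do it (here ‘en’ is just an example value for the lang argument):

lannister.tweets <- search_tweets(q = "Lannister", n = 100,
  include_rts = FALSE, lang = "en")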
If you can’t access the Twitter API, download a copy of the data here: lannister-tweets.Rdata
It’s important to note that Twitter limits your queries to around 18,000 tweets every 15 minutes. If you get a warning message about exceeding the limit, you’ll have to wait around 15 minutes until it resets and then you can search again.
We can also specify more than one search term at a time, if we separate the search terms with ‘OR’. This time, let’s search for tweets containing the word Tyrion or the word Lannister. Let’s also increase n
to 300:
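Something like the following (the dataframe name tyrion.tweets is just a suggestion, matching the file linked below):

tyrion.tweets <- search_tweets(q = "Tyrion OR Lannister", n = 300,
  include_rts = FALSE, lang = "en")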
If you can’t access the Twitter API, download a copy of the data here: tyrion-tweets.Rdata
Now that we’ve collected some data, let’s see how many rows and columns our dataframe has:
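Assuming the search above was saved as tyrion.tweets, we can check its dimensions with dim():

dim(tyrion.tweets)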
## [1] 300 90
It includes 300 rows (as we might expect), and 90 columns. 90 columns is a lot! If we run the colnames()
function on our dataframe, we can see what each column is named, revealing all the metadata that we collect alongside the content of each tweet itself.
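Again assuming our dataframe is called tyrion.tweets:

colnames(tyrion.tweets)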
## [1] "user_id" "status_id"
## [3] "created_at" "screen_name"
## [5] "text" "source"
## [7] "display_text_width" "reply_to_status_id"
## [9] "reply_to_user_id" "reply_to_screen_name"
## [11] "is_quote" "is_retweet"
## [13] "favorite_count" "retweet_count"
## [15] "quote_count" "reply_count"
## [17] "hashtags" "symbols"
## [19] "urls_url" "urls_t.co"
## [21] "urls_expanded_url" "media_url"
## [23] "media_t.co" "media_expanded_url"
## [25] "media_type" "ext_media_url"
## [27] "ext_media_t.co" "ext_media_expanded_url"
## [29] "ext_media_type" "mentions_user_id"
## [31] "mentions_screen_name" "lang"
## [33] "quoted_status_id" "quoted_text"
## [35] "quoted_created_at" "quoted_source"
## [37] "quoted_favorite_count" "quoted_retweet_count"
## [39] "quoted_user_id" "quoted_screen_name"
## [41] "quoted_name" "quoted_followers_count"
## [43] "quoted_friends_count" "quoted_statuses_count"
## [45] "quoted_location" "quoted_description"
## [47] "quoted_verified" "retweet_status_id"
## [49] "retweet_text" "retweet_created_at"
## [51] "retweet_source" "retweet_favorite_count"
## [53] "retweet_retweet_count" "retweet_user_id"
## [55] "retweet_screen_name" "retweet_name"
## [57] "retweet_followers_count" "retweet_friends_count"
## [59] "retweet_statuses_count" "retweet_location"
## [61] "retweet_description" "retweet_verified"
## [63] "place_url" "place_name"
## [65] "place_full_name" "place_type"
## [67] "country" "country_code"
## [69] "geo_coords" "coords_coords"
## [71] "bbox_coords" "status_url"
## [73] "name" "location"
## [75] "description" "url"
## [77] "protected" "followers_count"
## [79] "friends_count" "listed_count"
## [81] "statuses_count" "favourites_count"
## [83] "account_created_at" "verified"
## [85] "profile_url" "profile_expanded_url"
## [87] "account_lang" "profile_banner_url"
## [89] "profile_background_url" "profile_image_url"
Not only do we get the content of the tweet (in text), we also get information on:
- who sent it (screen_name)
- the time/date it was sent (created_at)
- the exact latitude/longitude coordinates from where the tweet was sent, if the account has geotagging enabled (geo_coords)
- many, many other things!
Exercise
- Take a look at the data set. For each column, try and work out what kind of information it contains, and whether or not it could be useful for any analysis.
- Try running the code again but for a different search term. Be careful not to run it too many times with a high n argument, otherwise you might go over the limit (remember: 18,000 tweets every 15 mins).
- Read the help vignette for the search_tweets() function (you can do this by typing ?search_tweets). Focus in particular on the description of the q argument: what’s the difference between searching for Tyrion Lannister, Tyrion OR Lannister, Tyrion AND Lannister, and “Tyrion Lannister”?
1.2 Searching by user
You can also collect tweets from an individual account, such as @realDonaldTrump (if you can stomach it). Note that this is restricted to the most recent 3,200 tweets from a single account (even if you change the n
argument to something even higher, like 10,000).
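A call like the following will do it; get_timeline() is the rtweet function for pulling a user’s timeline, and n = 150 here is just an example value (anything up to 3,200 works):

trump.tweets <- get_timeline("realDonaldTrump", n = 150)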
If you can’t access the Twitter API, download a copy of the data here: trump-tweets.Rdata
1.2.1 Comparing popularity of users
One fun (and perhaps useful!) thing to do is to compare the popularity of tweets from multiple users. We’ve already got Donald Trump’s 150 most recent tweets saved in the trump.tweets
dataframe, but now let’s do the same for @BarackObama:
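For instance, keeping the same n so the two timelines are comparable:

obama.tweets <- get_timeline("BarackObama", n = 150)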
If you can’t access the Twitter API, download a copy of the data here: obama-tweets.Rdata
Now we can combine these two dataframes together using rbind() - note that this requires both dataframes to have the same column names in the same order. In this case, both dataframes are just the direct output of get_timeline(), so they are identical in structure.
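combined.tweets <- rbind(trump.tweets, obama.tweets)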
Now that we’ve got our combined.tweets dataframe, let’s plot the correlation between the number of times each tweet was ‘favourited’ (in the favorite_count column) and ‘retweeted’ (in the retweet_count column) - and we should colour-code each tweet based on its author. We can use the ggplot2 package for our plotting, which is part of the tidyverse - a really neat way of structuring our code. Install the tidyverse set of packages, if you haven’t already, and then load it in using library():
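install.packages("tidyverse")
library(tidyverse)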
The following chunk of code says:
take the combined.tweets dataframe, and input this (using the %>% symbol) to the ggplot() function. Set the x-axis values to favorite_count, the y-axis values to retweet_count, and colour-code our data based on the screen_name column. Then plot this data as a scatterplot using geom_point().
combined.tweets %>%
ggplot(aes(x = favorite_count, y = retweet_count, colour = screen_name)) +
geom_point()
Looks like all of the most popular tweets belong to Obama 🥳 However, you might have noticed that the data is quite skewed, with some particularly high values resulting in most of the data being compressed in the bottom-left corner. We can fix this by applying a logarithmic transformation to the x and y values, which expands the lower values and compresses the higher values, and makes our figure more readable:
combined.tweets %>%
ggplot(aes(x = log(favorite_count), y = log(retweet_count), colour = screen_name)) +
geom_point()
Using the tidyverse packages, we can easily summarise the data using some basic descriptive statistics. Let’s say we want to calculate the mean/median number of times each account’s tweets were ‘favourited’, as well as the highest number of favourites a single tweet received. We can do this using group_by(), which temporarily splits our dataset based on each unique value in a specified variable (in this case screen_name), and summarise(), which allows us to perform some basic summary statistics (such as mean(), median(), and max()).
combined.tweets %>%
group_by(screen_name) %>%
summarise(mean(favorite_count), median(favorite_count), max(favorite_count))
## # A tibble: 2 x 4
## screen_name `mean(favorite_count… `median(favorite_cou… `max(favorite_coun…
## <chr> <dbl> <dbl> <int>
## 1 BarackObama 240585. 150051 1397785
## 2 realDonaldTru… 59192. 57042. 167229
Exercise
- Try using the get_timelines() function for two different accounts (maybe a celebrity’s, or even your own!) and conduct a similar comparison.
- Now look for a relationship between favorite_count or retweet_count and some other variable contained within the dataset, e.g. is_quote (does the tweet quote an existing tweet?), or display_text_width (the number of characters in the tweet). Note that since the former is a categorical variable - not a continuous one - a scatterplot wouldn’t be appropriate. Consider using geom_boxplot() instead of geom_point() for this kind of figure.
1.3 Collect tweets in real-time
Another option for data collection is to stream tweets as they’re sent, in real-time.
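In its simplest form, stream_tweets() with no search term will sample live tweets for a set number of seconds, for example:

stream_tweets(timeout = 30)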
It doesn’t normally make much sense to run a completely unrestricted search like this, because it’s collecting tweets from all over the world, written in any possible language, written about any possible topic.
It might make more sense to restrict this search geographically. For example, the following code will collect live tweets sent from Manchester over the next minute. If you leave this running for a long time, you can build a corpus of tweets sent from a particular location:
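Something like the following should work once you have a Google Maps key set up (see the note below); lookup_coords() converts a place name into a bounding box that stream_tweets() can use:

# stream live tweets from Manchester for 60 seconds (requires a Google Maps API key)
manchester.stream <- stream_tweets(q = lookup_coords("manchester, uk"),
  timeout = 60)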
However, a recent update to the Twitter/Google Maps API means that you now have to register for a valid Google Maps API key in order to perform these geographically-restricted searches. We won’t be covering this process in this workshop, but you can read up on how to do it here.
1.4 Saving data
If you want to keep the data you’ve scraped for future analysis, make sure you export it from R. To save your dataframe (assuming you’ve already made a folder called ‘data’ inside your current working directory):
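For example, to save the trump.tweets dataframe (the filename is up to you):

save(trump.tweets, file = "data/trump-tweets.Rdata")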
Then you can always load it in again at a later date, using:
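load("data/trump-tweets.Rdata")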
2 Mining tweets using FireAnt
If you plan on collecting a lot of data, it might be worthwhile using the FireAnt software instead. FireAnt is a useful piece of software developed by Laurence Anthony, free to download from his website. It provides a graphical user interface (GUI) to access Twitter’s Streaming API, which allows for a user-friendly way to collect tweets sent in real-time. Because you’re collecting tweets as they’re sent, instead of searching back for existing tweets, the limit I mentioned earlier (of 18,000 tweets every 15 minutes) doesn’t apply. You can simply leave the software running for as long as you want (hours, days, weeks) and by the end of it you’ll have a lot of data.
In my experience, if you want to collect a lot of geocoded data from a particular region, using FireAnt is the best option
The data will be saved in .JSON format, which can be read into R as follows:
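One option is rtweet’s parse_stream() function, which reads a file of streamed tweets into a dataframe (replace the filename with wherever FireAnt saved your data):

fireant.tweets <- parse_stream("my-fireant-data.json")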
3 Textual analysis
Now that we’ve run through various ways of collecting tweets, let’s run over some basic analysis you can conduct. We’ll be using the tidyverse
package again, which should already be installed and loaded, but we also need the tidytext
package to conduct some textual analysis:
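install.packages("tidytext")
library(tidytext)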
3.1 Word frequency
We can look at the content of these tweets in more detail by calculating the most frequent words.
First off, let’s convert our database of Donald Trump tweets so that each word of each tweet is on its own line. We can use select()
to just take two columns (the ID, and the content of each tweet), and then unnest_tokens()
to transform our dataframe into a one-word-per-line format.
Thanks to the tidyverse
packages, we can connect these commands together using the %>%
symbol - this is referred to as a ‘pipe chain’, and it basically means:
take the thing that comes before ‘the pipe’, and input this into the thing that comes after ‘the pipe’
In other words, the code shown below:
- takes trump.tweets and inputs it into the select() command, which selects just the status_id and text columns, discarding the rest…
- …these are then input into unnest_tokens(), which takes the content of text, splits it into a one-word-per-line format, and puts this in a column called word…
- …and saves all of this into a dataframe called trump.words
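Putting that together, the pipeline looks something like this:

trump.words <- trump.tweets %>%
  select(status_id, text) %>%
  unnest_tokens(output = word, input = text)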
If we were to execute these commands in a ‘non-tidy’ way, it would look something like this instead:
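A rough equivalent, with the function calls nested inside one another:

trump.words <- unnest_tokens(select(trump.tweets, status_id, text),
  output = word, input = text)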
The first, ‘tidy’ method of structuring our code is much more intuitive, without the need to use nested brackets that can sometimes be difficult to interpret. Throughout this workshop we’ll be ‘piping’ commands together in this ‘tidy’ way.
Let’s use head()
to look at the first 6 lines, just to make sure our code has worked:
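head(trump.words)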
## # A tibble: 6 x 2
## status_id word
## <chr> <chr>
## 1 1170782374843564034 looking
## 2 1170782374843564034 forward
## 3 1170782374843564034 to
## 4 1170782374843564034 being
## 5 1170782374843564034 in
## 6 1170782374843564034 the
Looks good! Each line of trump.words
contains a single word from each tweet, with a column corresponding to the tweet ID containing that word. Now we’ve got each word of each tweet on its own line, we can simply count the occurrence of each word and plot the most frequent ones.
All of the following commands are piped together using %>% and their output is saved into a new object called trump.count:
- count() will count how many times each word appears
- head() will give you the first x number of rows of this data (in the example below the first 30 lines)
- mutate() allows us to change the word column - which we will do using the reorder() command to make sure we plot the most frequent words at the top
trump.count <- trump.words %>%
count(word, sort = TRUE) %>%
head(30) %>%
mutate(word = reorder(word, n))
Now we can plot the word frequency using ggplot(). To do this, we take the dataframe trump.count, pipe it into the ggplot() function, inside of which we specify the columns for our x and y axes. Then we need to specify what kind of graph we’d like, in this case geom_col() will plot a bar chart. The final two lines aren’t mandatory, but coord_flip() will flip the axes round and theme_minimal() changes the ggplot theme to make the figure look a bit cleaner.
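Putting that together gives something like:

trump.count %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  theme_minimal()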
Cool! But there’s a problem here: obviously the most frequent words are things like the, to, and, of etc., which aren’t particularly interesting. To remove these, we can use what’s called a stop list, which is a list of highly frequent words you want to exclude from the analysis. Luckily, the tidytext package we installed and loaded earlier already provides one of these, called stop_words. The following code adds a few Twitter-specific items to this list, such as hyperlinks (‘https’) and acronyms (‘rt’, i.e. ‘retweet’) that we obviously aren’t interested in.
new_items <- c("https", "t.co", "amp", "rt")
stop_words_new <- stop_words %>%
pull(word) %>%
append(new_items)
Now we can remake the trump.count
object, but with the addition of a new line that excludes certain words from our dataset:
- filter() can be used to filter out certain rows of data depending on specific criteria that you set
- %in% is a logical operator - it checks to see if the object that comes before it appears in the object that comes after it, and returns either TRUE or FALSE
- ! is a negator, which means it reverses whatever comes after it

Taken together, the second line of code below essentially says: only include rows of data where the word is not in the updated list of stop words.
trump.count <- trump.words %>%
filter(!word %in% stop_words_new) %>%
count(word, sort = TRUE) %>%
head(30) %>%
mutate(word = reorder(word, n))
Now when we re-make the same plot, it shouldn’t include any of the uninteresting function words:
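The plotting code itself is unchanged, just run on the re-made trump.count:

trump.count %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  theme_minimal()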
Unsurprisingly, Donald Trump tweets a lot about ‘fake news’, and even ‘Trump’ himself…
Exercise
- Now try the same method of analysing word frequency for a different account. You should already have a set of tweets from a different account that you collected earlier using get_timelines() - if not, do it now!
- Let’s also make the word frequency plot more colourful. If you want to make all the bars red, for example, you can specify fill = 'red' inside the geom_col() command. Try it out.
- We can take it one step further and colour-code the bars based on a variable/column in our dataset. To do this, you just specify the column name for the fill command (without quotation marks), but when you do this you also have to wrap aes() around it. So to colour everything red, it’s geom_col(fill = 'red'), but to colour-code based on the word frequency, it’s geom_col(aes(fill = n)), which makes reference to the n column in the dataframe.
3.2 n-grams
So far we’ve just looked at the frequency of individual words, but of course in language context is very important. For this reason, it’s quite common to investigate frequent collocations of words instead - or n-grams. Let’s test it out by looking at bigrams from Trump’s tweets - i.e. which two words appear together most often?
Since n-gram analysis requires more data than individual word frequency analysis, you might want to first re-run the get_timeline() function from before to collect more tweets from the @realDonaldTrump account:
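For example (3,200 is the maximum the API will return for a single account):

trump.tweets <- get_timeline("realDonaldTrump", n = 3200)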
If you can’t access the Twitter API, download a copy of the data here: trump-tweets-big.Rdata
We can use the unnest_tokens() function like we did before to get a one-token-per-line format, but this time we include the token = "ngrams" and n = 2 arguments, which tell R to tokenise into bigrams instead of individual words (if we wanted trigrams, we would change n to 3).
trump.ngrams <- trump.tweets %>%
select(status_id, text) %>%
unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2, collapse = FALSE)
Take a look at the first 10 rows to make sure it’s worked:
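head(trump.ngrams, 10)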
## # A tibble: 10 x 2
## status_id bigram
## <chr> <chr>
## 1 1171120177196544000 we have
## 2 1171120177196544000 have been
## 3 1171120177196544000 been serving
## 4 1171120177196544000 serving as
## 5 1171120177196544000 as policemen
## 6 1171120177196544000 policemen in
## 7 1171120177196544000 in afghanistan
## 8 1171120177196544000 afghanistan and
## 9 1171120177196544000 and that
## 10 1171120177196544000 that was
Next up, we need to count the number of occurrences of each bigram. We can do this using count(), as we did for individual lexical frequency earlier, but before we do that we should separate() each bigram into its constituent words and get rid of any that are in our stop list (stop_words_new) - we can do this using filter().
trump.ngrams.count <- trump.ngrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words_new & !word2 %in% stop_words_new) %>%
count(word1, word2, sort = TRUE)
Now let’s plot them! In the following code, we:
- take the trump.ngrams.count object we’ve just made…
- input this to mutate(), where we use paste() to rejoin the two words into a single string and save it in a column called bigram…
- use mutate() again to reorder the bigram column (arranged in descending order by frequency)…
- use head(10) to take the first 10 rows (i.e. the 10 most frequent bigrams)…
- input this to ggplot(), where we plot each bigram along the x-axis and the frequency itself - i.e. n - along the y-axis…
- use geom_col() to plot this as a bar chart…
- and finally use coord_flip() to flip the x and y axes around (it makes for a nicer-looking plot in this case)
trump.ngrams.count %>%
mutate(bigram = paste(word1, word2)) %>%
mutate(bigram = reorder(bigram, n)) %>%
head(10) %>%
ggplot(aes(x = bigram, y = n)) +
geom_col() +
coord_flip()
However, it’s more common (and more exciting!) to plot this kind of data as an ngram network instead, where words are clustered by their collocation frequency. To do this, we need to first install and load two new packages:
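The network code below uses graph_from_data_frame() from igraph and the plotting functions from ggraph, so those are the two packages we need:

install.packages("igraph")
install.packages("ggraph")

library(igraph)
library(ggraph)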
We can now plot a bigram network using the code below. It looks a little scary at first, and in all honesty you don’t strictly need to know what each bit does:
- filter() is straightforward: we’re only plotting bigrams that appear more than 3 times in the dataset
- we’ve specified layout = "fr" inside the ggraph() function - this tells R to use the force-directed algorithm developed by Fruchterman and Reingold when positioning the nodes (check the ?layout_tbl_graph_igraph help vignette for a list of other options you could use instead)
- geom_edge_link() refers to the links between nodes - we’ve set edge_alpha = n so that more frequent bigrams are plotted with darker connecting lines, and we’ve also specified that we want to plot arrowheads using arrow so that we know which word comes first in each bigram
- geom_node_point() refers to the nodes/words themselves - here you can specify their colour/size (you can find a list of colour names here)
- geom_node_text() refers to the labels for each node/word - by setting label = name we’re telling R to plot the word label for each node (it wouldn’t be a very informative graph otherwise!) and the vjust and hjust arguments allow us to nudge the position of the labels a tiny bit so that they’re not completely overlapping with the nodes themselves
- theme_void() changes the ggplot theme to one with a white background (instead of the default grey)
trump.ngrams.count %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = arrow(type = "closed", length = unit(.1, "inches")),
end_cap = circle(.07, "inches")) +
geom_node_point(color = "skyblue", size = 3) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1, size = 3) +
theme_void()
Exercise
Try using the same methods as above to plot a bigram network of either:
- a single user’s tweets (from get_timelines())
- a certain topic (from search_tweets())
3.3 Time-series
We can also investigate how frequently certain words occur over time using the ts_plot() function to plot a time-series of our data. Let’s try this out with our tweets from @realDonaldTrump, searching for the phrase ‘fake news’. First off, we’ll have to make a new dataframe which includes only those tweets containing this phrase. We can do this using a combination of filter() and str_detect(). We also use mutate() and tolower() to convert the content of each tweet to lower case before performing our search using str_detect():
fake.news <- trump.tweets %>%
mutate(text = tolower(text)) %>%
filter(str_detect(text, "fake news"))
Now that we’ve got this dataset, we can feed it to ts_plot(), along with an argument for how fine-grained we’d like the time dimension to be, e.g. minutes, hours, days, weeks etc. Let’s try plotting the frequency per day (strictly speaking you only need the first line of code, but labs() is good for adding labels to your plot):
ts_plot(fake.news, "days") +
labs(title = "'fake news' tweets per day (from @realDonaldTrump)", x = "Time/day", y = "Number of tweets") +
theme_minimal()
Exercise
Can you think of a popular topic that might show temporal patterns (i.e. an increase or decrease over time)? Try it out! Collect some data using either search_tweets() or get_timelines(), then use ts_plot() to plot the frequency over time.