Overview
Sentiment analysis is one small part of the wider field of natural language processing, where we try to automatically identify and quantify the emotion expressed in a text. It has a number of real-world applications, ranging from politics to consumer research. For our purposes, we can ask the question: can we take a tweet and automatically quantify how positive or negative it is?
There are different ways of conducting sentiment analysis, and some methods are much more sophisticated than others. In this workshop, we’ll take a simple approach where the overall sentiment of a string of text directly reflects the sentiment of the individual words, without considering the syntactic relationship between those words. This can be problematic in some examples, which we’ll discuss in more detail later on.
1 Sentiment dictionaries
We can do some basic sentiment analysis without the need to install any new packages: the tidytext package you were introduced to earlier in Part 2 of this workshop already contains everything we need, including a sentiment dictionary!
A sentiment dictionary is just a list of words with a corresponding sentiment classification - we’ll be looking at two possible dictionaries today, which operationalise sentiment in different ways.
1.1 The ‘Bing’ lexicon
Bing Liu’s widely-used sentiment lexicon is accessible through tidytext. Let’s read it into R and save it to an object called sent:
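One way to do this is with tidytext’s get_sentiments() function:

library(tidytext)

# read in the Bing lexicon as a tibble with 'word' and 'sentiment' columns
sent <- get_sentiments("bing")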
In this dictionary, sentiment is measured as a binary variable - words are either classified as positive or negative. Let’s look at a random sample of the positive words:
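For example, filtering for the positive words and sampling five at random (this assumes dplyr is still loaded from Part 2; sample_n() returns a different sample each time):

sent %>%
  filter(sentiment == 'positive') %>%   # keep only the positive words
  sample_n(5)                           # five random rows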
## # A tibble: 5 x 2
## word sentiment
## <chr> <chr>
## 1 delicacy positive
## 2 eagerly positive
## 3 beckoned positive
## 4 enchantingly positive
## 5 best positive
And now the same for some negative words (the same code, but filtering for sentiment == 'negative'):
## # A tibble: 5 x 2
## word sentiment
## <chr> <chr>
## 1 deficiencies negative
## 2 smelled negative
## 3 breakup negative
## 4 confusing negative
## 5 taxing negative
1.2 The ‘AFINN’ lexicon
An alternative to the Bing lexicon is the AFINN lexicon, developed by Finn Årup Nielsen and released in 2011. In this dictionary, words are not classified in a binary fashion but are instead assigned a numerical value reflecting how strongly positive or negative they are, ranging from -5 (the most negative) to +5 (the most positive).
Once again, let’s read it into R and name it sent (you’ll have to install the textdata package first to access the AFINN lexicon):
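A sketch of that step (get_sentiments() may also prompt you to download the lexicon the first time you use it):

# install.packages("textdata")   # run this once if you haven't already
sent <- get_sentiments("afinn")  # a tibble with 'word' and 'value' columns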
Now let’s take a random sample to get an idea of how words are evaluated - does it fit your intuitions?
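For example:

sent %>% sample_n(5)   # five random rows from the AFINN lexicon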
## # A tibble: 5 x 2
## word value
## <chr> <dbl>
## 1 violates -2
## 2 lunatics -3
## 3 mistaking -2
## 4 vicious -2
## 5 relaxed 2
2 Calculating sentiment
To conduct sentiment analysis on a given string of text, we need to look up each word in the sentiment lexicon and assign it the appropriate value. This involves two things:
- using unnest_tokens() to transform our data into a one-word-per-line format
- then using left_join(), which will join this data with the sentiment lexicon, i.e. for any words that are in the lexicon, it will attach the corresponding sentiment value in a separate column
Tangent: left_join()
left_join() is a very useful function, so it’s important to understand how it works. Take the following example dataset of Lord of the Rings characters, which contains the names of individual characters (in character) and information about their type/race (in type).
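One way to build this toy dataframe (the name characters is just for illustration):

characters <- data.frame(
  character = c("Gandalf", "Frodo", "Legolas", "Bilbo",
                "Arwen", "Saruman", "Pippin", "Gimli"),
  type = c("wizard", "hobbit", "elf", "hobbit",
           "elf", "wizard", "hobbit", "dwarf")
)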
## character type
## 1 Gandalf wizard
## 2 Frodo hobbit
## 3 Legolas elf
## 4 Bilbo hobbit
## 5 Arwen elf
## 6 Saruman wizard
## 7 Pippin hobbit
## 8 Gimli dwarf
Now imagine we have another, separate dataframe, containing descriptive information about the different character types (e.g. wizards are magical, Hobbits have hairy feet etc.).
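Again, built here as a toy dataframe (the name features is just for illustration):

features <- data.frame(
  type = c("wizard", "hobbit", "elf"),
  feature = c("magical", "hairy feet", "pointy ears")
)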
## type feature
## 1 wizard magical
## 2 hobbit hairy feet
## 3 elf pointy ears
If we want to join these two dataframes together, we can use left_join(), specifying the two dataframes as arguments. It will notice that both dataframes have a column called type, and look for matches between the values. Wherever it finds a match in the second dataframe, it copies the extra columns over to the first. Note that for any types with no match (e.g. dwarf), it will just use NA:
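With the illustrative dataframe names from above, the call is simply:

left_join(characters, features)   # matches on the shared 'type' column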
## character type feature
## 1 Gandalf wizard magical
## 2 Frodo hobbit hairy feet
## 3 Legolas elf pointy ears
## 4 Bilbo hobbit hairy feet
## 5 Arwen elf pointy ears
## 6 Saruman wizard magical
## 7 Pippin hobbit hairy feet
## 8 Gimli dwarf <NA>
If for some reason the joining column has a different name in the two dataframes, you will have to specify the mapping as an extra argument. For example, let’s say the type column in the second dataframe is actually called race instead; the join command would be left_join(dataframe1, dataframe2, by = c("type" = "race")).
left_join() is just one of a number of join commands in R - e.g. right_join(), full_join(), inner_join() etc. We’ll only be using left_join() today, but remember you can always check the help page by typing ?left_join into the console if you want to know how they’re all different.
2.1 Applying sentiment analysis to Twitter
Now let’s apply these methods of sentiment analysis to some Twitter data. Maybe something topical. It can only mean one thing…
If you can’t access the Twitter API, download a copy of the data here: brexit-tweets.Rdata
Since Brexit is something people (rightly) have very strong feelings about, it’s the perfect case study for testing out our methods of sentiment analysis. Let’s take a quick look at what some people are saying (remember you can use sample_n() to take a random sample of rows from a dataframe):
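For example, assuming the tweets are stored in the brexit.tweets dataframe used in the code further down:

brexit.tweets %>%
  sample_n(5) %>%            # five random tweets
  select(status_id, text)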
| status_id | text |
|---|---|
| tweet_1001 | This brexit bollocks is still going ? Will beer prices increase? |
| tweet_3195 | Longing for the day Politics isn’t about Brexit, just for a day at least |
| tweet_3957 | the Tories started this nightmare & I will NEVER vote for them again #brexit |
| tweet_5832 | This brexit bullshit is boring now. 🙃 |
| tweet_5889 | Brexit is a scam and you’ve been had. Revoke the fuck out of it. Now. |
Ok, now that we’ve got our dataframe of tweets, the next step is to use unnest_tokens() to take each tweet and convert it to one word per line (as we did in Part 2), then assign the sentiment values to each word using left_join() with the sentiment dictionary stored in sent:
brexit.words <- brexit.tweets %>%
  select(status_id, text) %>%   # keep only the columns we need
  unnest_tokens(word, text) %>% # split each tweet into one word per row
  left_join(sent)               # attach sentiment values (joins by 'word')
We should now have one word per line, with the associated sentiment value in the value column. Let’s take a look at an example tweet to make sure it’s worked:
## # A tibble: 9 x 3
## status_id word value
## <chr> <chr> <dbl>
## 1 tweet_3296 can NA
## 2 tweet_3296 a NA
## 3 tweet_3296 competent 2
## 4 tweet_3296 soul NA
## 5 tweet_3296 stop -1
## 6 tweet_3296 this NA
## 7 tweet_3296 brexit NA
## 8 tweet_3296 madness -3
## 9 tweet_3296 please 1
You might disagree with some of the assigned values, but the important thing is that it’s worked!
Our next step is to produce a single value for each tweet, corresponding to how positive/negative it is. At this point, we have a couple of options:
- for each tweet, we can average over each word’s sentiment value (which would give us a value between -5 and +5 for each tweet)
- alternatively, we can add up all of the individual sentiment values for each tweet to give us a total overall sentiment
Let’s try both. We’ll use group_by() to group each tweet’s words back together, filter out all the words without a sentiment value using filter() and !is.na(), and finally summarise(), mean() and sum() to produce an average value and total value for each tweet:
brexit.values <- brexit.words %>%
  group_by(status_id) %>%   # treat each tweet as a group
  filter(!is.na(value)) %>% # drop words not found in the lexicon
  summarise(average.value = mean(value), total.value = sum(value))
Let’s take a look at our new brexit.values dataframe:
## # A tibble: 5 x 3
## status_id average.value total.value
## <chr> <dbl> <dbl>
## 1 tweet_5441 0.333 1
## 2 tweet_6722 -2 -4
## 3 tweet_6422 -2.33 -7
## 4 tweet_4205 -2 -2
## 5 tweet_6095 -1.25 -5
So far so good - but it would be useful to actually see the content of each tweet alongside our sentiment values. This is really easy - we’ve got a status_id column both in this dataframe and the original brexit.tweets dataframe, so we can perform a simple left_join() between them.
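A sketch of that join, saving the result as brexit.final (the name the plotting code below assumes); left_join() will match rows on the shared status_id column:

brexit.final <- brexit.values %>%
  left_join(brexit.tweets %>% select(status_id, text))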
Just to make sure we’re getting sensible results, let’s take a look at what’s been classified as the most negative tweet in the dataset:
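One way to pull it out, assuming we rank tweets by total.value (ranking by average.value would be equally reasonable); swap in desc(total.value) to get the most positive tweet instead:

brexit.final %>%
  arrange(total.value) %>% # most negative total first
  slice(1) %>%
  select(text)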
| text |
|---|
| Brexit is a puking death fuck-up-pisser collapsing, pissing the broken pisser as dishonourable as the bigotry made of extremist-fucking death that implodes, and fucking a thousand unbelievably despicable sarcophagus-juice-fuckers that shits |
Yep, seems pretty accurate! And what about the most positive tweet?
| text |
|---|
| Thank you, Brexit! You lovely, lovely people, @user !!!! Thank you, @user !!!!I hope you are all very, very proud…..@user #RevokeA50Now |
I guess automated sentiment analysis isn’t sensitive to sarcasm…
2.2 Visualisation
Now that we’ve got a single dataframe with two measures of sentiment score for each tweet, it’s time to plot the data!
Let’s try plotting a histogram to visualise the distribution of sentiment values. Notice how:
- inside geom_histogram(), we’ve changed the number of bins (essentially how fine-grained you’d like the x-axis to be), and set the fill and colour arguments to specify the colour of the bars (and their outline)
- we’ve also added a type of ‘geom’ that hasn’t been covered yet: geom_vline() draws a vertical line at whatever position on the x-axis you specify in xintercept - you can use lwd to change its width, and lty to change the type of line
brexit.final %>%
  ggplot(aes(x = average.value)) +
  geom_histogram(bins = 20, fill = 'grey', colour = 'black') +
  geom_vline(xintercept = 0, lwd = 1, lty = 'dashed') +
  theme_minimal()
There is an unsurprising skew towards the negative side of the scale - interesting!
For comparison, I also collected 10,000 tweets containing the words puppy or puppies and generated an equivalent dataset of sentiment scores. I won’t reproduce all of the code here, since I just used the same method of sentiment analysis we’ve just worked through, but you can download the dataset below (or even better: perform the search yourself using search_tweets() to see if you can replicate the results with a more recent set of tweets!).
Download a copy of the data here: puppy-tweets.Rdata
In the code below, we add an identifying column type to distinguish the two sets of tweets, before combining them together into a dataframe called combined.sentiment:
brexit.final$type <- 'brexit'
puppy.final$type <- 'puppy'
combined.sentiment <- rbind(brexit.final, puppy.final)
And now let’s compare their sentiment scores:
combined.sentiment %>%
  ggplot(aes(x = type, y = average.value, colour = type)) +
  geom_boxplot() +
  geom_hline(yintercept = 0, lty = 'dashed') +
  theme_minimal()
Who would’ve thought - puppies are more popular than Brexit…
Another thing we can do is plot a word cloud of the positive and negative words present in the datasets. It’s easier to use a specific package for this rather than using ggplot as we’ve done so far. Let’s go ahead and install the wordcloud package, then load it into the workspace:
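Note that install.packages() only needs to be run once per machine, while library() is needed every session:

# install.packages("wordcloud")   # run once if not already installed
library(wordcloud)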
Since we’re not using ggplot, the syntax is a little different. To plot a word cloud of the most frequent positive words, we use filter() to only return those words with a positive sentiment value, then count() to calculate the frequency of each unique word, and then we make a call to the wordcloud() function itself:
brexit.words %>%
  filter(value > 0) %>% # keep only the positive words
  count(word) %>%       # frequency of each unique word
  with(wordcloud(word, n, max.words = 100, colors = "green4"))
We can do the same for the negative words, i.e. those where value < 0 (we’re also excluding the word no here, as it’s just so much more frequent than the rest and it isn’t really interesting):
brexit.words %>%
  filter(value < 0 & word != 'no') %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, colors = "red"))
Exercise
Pick a topic of your choice and do another Twitter search, then follow the same instructions as above to perform sentiment analysis and compare the results to another topic to see which prompts more positive/negative discussion on Twitter.
Discussion
What are the limitations of conducting sentiment analysis in this way? Take the following tweet as an example, which would be classified as being quite positive with an overall sentiment score of 2:
## # A tibble: 8 x 3
## status_id word value
## <chr> <chr> <dbl>
## 1 tweet_6087 brexit NA
## 2 tweet_6087 is NA
## 3 tweet_6087 never NA
## 4 tweet_6087 going NA
## 5 tweet_6087 to NA
## 6 tweet_6087 be NA
## 7 tweet_6087 a NA
## 8 tweet_6087 success 2
How could you improve this method of sentiment analysis? Think about how things like negation, intensifiers, or mitigators can affect the overall emotion of a sentence, for example in the following:
- He’s a comedian but he isn’t funny
- The film was really good (cf. The film was good)
- Her performance was kind of impressive