Lexicon-Based Sentiment Analysis Using R | by Okan Bulut | Feb, 2024

For the sake of simplicity, we will focus on the first wave of the pandemic (March 2020 through June 2020). The transcripts of all media briefings were publicly available on the Government of Alberta's COVID-19 pandemic website (https://www.alberta.ca/covid). This dataset comes with an open data license that allows the public to access and use the information, including for commercial purposes. After importing these transcripts into R, I turned all of the text into lowercase and then applied word tokenization using the tidytext and tokenizers packages. Word tokenization split the sentences in the media briefings into individual words for each entry (i.e., day of media briefings). Next, I applied lemmatization to the tokens to resolve each word into its canonical form using the textstem package. Finally, I removed common stopwords, such as "my," "for," "that," and "with," using the stopwords package. The final dataset is available here. A minimal sketch of these preprocessing steps is shown below; afterwards, let's import the final data into R and review its content.
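For reference, the pipeline just described might look roughly like the following sketch. The briefings object here (with a date column and a raw text column) is a hypothetical stand-in for the imported transcripts, not part of the shared dataset:

library("dplyr")
library("tidytext")   # unnest_tokens()
library("textstem")   # lemmatize_words()
library("stopwords")  # stopwords()

# briefings: hypothetical data frame with columns `date` and `text`
wave1_alberta <- briefings %>%
  mutate(text = tolower(text)) %>%              # lowercase the transcripts
  unnest_tokens(word, text) %>%                 # word tokenization (uses tokenizers)
  mutate(word = lemmatize_words(word)) %>%      # resolve tokens to canonical forms
  filter(!word %in% stopwords("en")) %>%        # remove common stopwords
  mutate(month = months(as.Date(date)))         # add a month label for grouping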

load("wave1_alberta.RData")

head(wave1_alberta, 10)

A preview of the dataset (Image by author)

The dataset has three columns:

  • month (the month of the media briefing)
  • date (the specific date of the media briefing), and
  • word (words or tokens used in the media briefing)

Descriptive Analysis

Now, we can calculate some descriptive statistics to better understand the content of our dataset. We will begin by finding the top five words (based on their frequency) for each month.

library("dplyr")

wave1_alberta %>%
group_by(month) %>%
depend(phrase, type = TRUE) %>%
slice_head(n = 5) %>%
as.information.body()

Top 5 words by month (Image by author)

The output shows that words such as health, continue, and test were commonly used in the media briefings during this four-month period. We can also expand our list to the most common 10 words and view the results visually:

library("tidytext")
library("ggplot2")

wave1_alberta %>%
# Group by month
group_by(month) %>%
depend(phrase, type = TRUE) %>%
# Discover the highest 10 phrases
slice_head(n = 10) %>%
ungroup() %>%
# Order the phrases by their frequency inside every month
mutate(phrase = reorder_within(phrase, n, month)) %>%
# Create a bar graph
ggplot(aes(x = n, y = phrase, fill = month)) +
geom_col() +
scale_y_reordered() +
facet_wrap(~ month, scales = "free_y") +
labs(x = "Frequency", y = NULL) +
theme(legend.place = "none",
axis.textual content.x = element_text(measurement = 11),
axis.textual content.y = element_text(measurement = 11),
strip.background = element_blank(),
strip.textual content = element_text(color = "black", face = "daring", measurement = 13))

Most common words based on frequency (Image by author)

Since some words are common across all four months, the plot above may not necessarily show us the important words that are unique to each month. To find such important words, we can use Term Frequency-Inverse Document Frequency (TF-IDF), a widely used technique in NLP for measuring how important a term is within a document relative to a collection of documents (for more detailed information about TF-IDF, check out my previous blog post). In our example, we will treat the media briefings for each month as a document and calculate TF-IDF for the tokens (i.e., words) within each document. The first part of the R code below creates a new dataset, wave1_tf_idf, by calculating TF-IDF for all tokens and selecting the tokens with the highest TF-IDF values within each month. Next, we use this dataset to create a bar plot with the TF-IDF values to view the common words unique to each month.
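As a quick reminder of what this metric computes (a standard formulation, which to my knowledge matches tidytext's bind_tf_idf() with term frequency as a within-document proportion and a natural-log IDF):

$$\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\frac{N}{N_t}$$

where $n_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents (here, four months), and $N_t$ is the number of documents in which $t$ appears. A term used often in one month but rarely in the other months therefore receives a high TF-IDF value for that month.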

# Calculate TF-IDF for the words for each month
wave1_tf_idf <- wave1_alberta %>%
  count(month, word, sort = TRUE) %>%
  bind_tf_idf(word, month, n) %>%
  arrange(month, -tf_idf) %>%
  group_by(month) %>%
  top_n(10) %>%
  ungroup()

# Visualize the results
wave1_tf_idf %>%
  mutate(word = reorder_within(word, tf_idf, month)) %>%
  ggplot(aes(word, tf_idf, fill = month)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ month, scales = "free", ncol = 2) +
  scale_x_reordered() +
  coord_flip() +
  theme(strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 13),
        axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11)) +
  labs(x = NULL, y = "TF-IDF")

Most common words based on TF-IDF (Image by author)

These results are more informative because the tokens shown in the figure reflect unique topics discussed each month. For example, in March 2020, the media briefings were mostly about limiting travel, returning from crowded conferences, and COVID-19 cases on cruise ships. In June 2020, the focus of the media briefings shifted towards mask requirements, people protesting pandemic-related restrictions, and so on.

Before we switch back to the sentiment analysis, let's take a look at another descriptive variable: the length of each media briefing. This will show us whether the media briefings became longer or shorter over time.

wave1_alberta %>%
  # Save "day" as a separate variable
  mutate(day = substr(date, 9, 10)) %>%
  group_by(month, day) %>%
  # Count the number of words
  summarize(n = n()) %>%
  ggplot(aes(day, n, color = month, shape = month, group = month)) +
  geom_point(size = 2) +
  geom_line() +
  labs(x = "Days", y = "Number of Words") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, size = 11),
        strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 11),
        axis.text.y = element_text(size = 11)) +
  ylim(0, 800) +
  facet_wrap(~ month, scales = "free_x")

Number of words in the media briefings by day (Image by author)

The figure above shows that the length of the media briefings varied considerably over time. Especially in March and May, there are larger fluctuations (i.e., very long or short briefings), whereas, in June, the daily media briefings are quite similar in terms of length.

Sentiment Analysis with tidytext

After exploring the dataset descriptively, we are ready to begin the sentiment analysis. In the first part, we will use the tidytext package to perform sentiment analysis and compute sentiment scores. We will first import the lexicons into R and then merge them with our dataset. Using the Bing lexicon, we need to find the difference between the number of positive and negative words to produce a sentiment score (i.e., sentiment = the number of positive words - the number of negative words).
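As a toy illustration of this counting scheme (the tokens below are hypothetical picks, not from the briefings; only words that appear in the Bing lexicon survive the join, each labeled positive or negative):

# Hypothetical tokens; the daily score would then be (# positive) - (# negative)
tibble(word = c("outbreak", "death", "support", "recovery")) %>%
  inner_join(get_sentiments("bing"), by = "word")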

# Of the three lexicons, Bing is already available in the tidytext package;
# for AFINN and NRC, install the textdata package by uncommenting the next line
# install.packages("textdata")
get_sentiments("bing")
get_sentiments("afinn")
get_sentiments("nrc")

# We will need the spread function from tidyr
library("tidyr")

# Sentiment scores with bing (based on frequency)
wave1_alberta %>%
  mutate(day = substr(date, 9, 10)) %>%
  group_by(month, day) %>%
  inner_join(get_sentiments("bing")) %>%
  count(month, day, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(day, sentiment, fill = month)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Days", y = "Sentiment Score") +
  ylim(-50, 50) +
  theme(legend.position = "none", axis.text.x = element_text(angle = 90)) +
  facet_wrap(~ month, ncol = 2, scales = "free_x") +
  theme(strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 11),
        axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11))

Sentiment scores based on the Bing lexicon (Image by author)

The figure above shows that the sentiments delivered in the media briefings were generally negative, which is not necessarily surprising since the media briefings were all about how many people passed away, hospitalization rates, potential outbreaks, etc. On certain days (e.g., March 24, 2020 and May 4, 2020), the media briefings were particularly more negative in terms of sentiments.

Next, we will use the AFINN lexicon. Unlike Bing, which labels words as positive or negative, AFINN assigns a numerical weight to each word. The sign of the weight indicates the polarity of sentiments (i.e., positive or negative), while the value indicates the intensity of sentiments. Now, let's see if these weighted values produce different sentiment scores.
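Before doing so, we can peek at the weights themselves for a few illustrative words (picked here only as examples; their exact values come from the AFINN-111 list):

# Look up AFINN weights (integers from -5 to +5) for a handful of words
get_sentiments("afinn") %>%
  filter(word %in% c("outbreak", "death", "support", "protect"))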

wave1_alberta %>%
  mutate(day = substr(date, 9, 10)) %>%
  group_by(month, day) %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(month, day) %>%
  summarize(sentiment = sum(value),
            type = ifelse(sentiment >= 0, "positive", "negative")) %>%
  ggplot(aes(day, sentiment, fill = type)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Days", y = "Sentiment Score") +
  ylim(-100, 100) +
  facet_wrap(~ month, ncol = 2, scales = "free_x") +
  theme(legend.position = "none",
        strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 11),
        axis.text.x = element_text(size = 11, angle = 90),
        axis.text.y = element_text(size = 11))

Sentiment scores based on the AFINN lexicon (Image by author)

The results based on the AFINN lexicon seem to be quite different! Once we take the "weight" of the tokens into account, most media briefings turn out to be positive (see the green bars), although there are still some days with negative sentiments (see the red bars). The two analyses we have done so far have yielded very different results for two reasons. First, as I mentioned above, the Bing lexicon focuses on the polarity of the words but ignores the intensity of the words (dislike and hate are considered negative words with equal intensity). Unlike the Bing lexicon, the AFINN lexicon takes the intensity into account, which affects the calculation of the sentiment scores. Second, the Bing lexicon (6,786 words) is considerably larger than the AFINN lexicon (2,477 words). Therefore, it is likely that some tokens in the media briefings are included in the Bing lexicon, but not in the AFINN lexicon. Disregarding these tokens might have affected the results.
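To check the second point empirically, one quick sketch (assuming the lexicons loaded earlier) is to compute the share of tokens in our dataset that each lexicon recognizes:

# Proportion of tokens in the corpus covered by each lexicon
wave1_alberta %>%
  summarize(bing  = mean(word %in% get_sentiments("bing")$word),
            afinn = mean(word %in% get_sentiments("afinn")$word))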

The final lexicon we are going to try using the tidytext package is NRC. As I mentioned earlier, this lexicon uses Plutchik's psychoevolutionary theory to label the tokens based on basic emotions such as anger, fear, and anticipation. We are going to count the number of words or tokens associated with each emotion and then visualize the results.
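Before visualizing, it may help to list the categories themselves; the NRC lexicon pairs the two polarity labels (positive and negative) with eight basic emotions:

# Number of words the NRC lexicon assigns to each category
get_sentiments("nrc") %>%
  count(sentiment, sort = TRUE)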

wave1_alberta %>%
  mutate(day = substr(date, 9, 10)) %>%
  group_by(month, day) %>%
  inner_join(get_sentiments("nrc")) %>%
  count(month, day, sentiment) %>%
  group_by(month, sentiment) %>%
  summarize(n_total = sum(n)) %>%
  ggplot(aes(n_total, sentiment, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Frequency", y = "") +
  xlim(0, 2000) +
  facet_wrap(~ month, ncol = 2, scales = "free_x") +
  theme(strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 11),
        axis.text.x = element_text(size = 11),
        axis.text.y = element_text(size = 11))

Sentiment scores based on the NRC lexicon (Image by author)

The figure shows that the media briefings were mostly positive each month. Dr. Hinshaw used many words associated with "trust", "anticipation", and "fear". Overall, the pattern of these emotions seems to remain very similar over time, indicating the consistency of the media briefings in terms of the type and intensity of the emotions delivered.

Another package for lexicon-based sentiment analysis is sentimentr (Rinker, 2021). Unlike the tidytext package, this package takes valence shifters (e.g., negation) into account, which can easily flip the polarity of a sentence with one word. For example, the sentence "I'm not sad" is actually positive, but if we analyze it word by word, the sentence may seem to have a negative sentiment due to the words "not" and "sad". Similarly, "I hardly like this book" is a negative sentence, but the analysis of individual words, "hardly" and "like", may yield a positive sentiment score. The sentimentr package addresses the limitations around sentiment detection with valence shifters (see the package author Tyler Rinker's GitHub page for further details on sentimentr: https://github.com/trinker/sentimentr).
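A quick way to see the valence shifters in action is to score the two example sentences above with sentimentr's sentence-level sentiment() function:

library("sentimentr")

# Negators ("not") and de-amplifiers ("hardly") adjust the polarity score
sentiment(c("I am not sad.", "I hardly like this book."))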

To benefit from the sentimentr package, we need the actual sentences in the media briefings rather than the individual tokens. Therefore, I had to create an untokenized version of the dataset, which is available here. We will first import this dataset into R, get individual sentences for each media briefing using the get_sentences() function, and then calculate sentiment scores by day and month via sentiment_by().

library("sentimentr")
library("magrittr")

load("wave1_alberta_sentence.RData")

# Calculate sentiment scores by day and month
wave1_sentimentr <- wave1_alberta_sentence %>%
mutate(day = substr(date, 9, 10)) %>%
get_sentences() %$%
sentiment_by(textual content, listing(month, day))

# View the dataset
head(wave1_sentimentr, 10)

A preview of the dataset (Image by author)

In the dataset we created, "ave_sentiment" is the average sentiment score for each day in March, April, May, and June (i.e., days where a media briefing was made). Using this dataset, we can visualize the sentiment scores.

wave1_sentimentr %>%
  group_by(month, day) %>%
  ggplot(aes(day, ave_sentiment, fill = ave_sentiment)) +
  scale_fill_gradient(low = "red", high = "blue") +
  geom_col(show.legend = FALSE) +
  labs(x = "Days", y = "Sentiment Score") +
  ylim(-0.1, 0.3) +
  facet_wrap(~ month, ncol = 2, scales = "free_x") +
  theme(legend.position = "none",
        strip.background = element_blank(),
        strip.text = element_text(colour = "black", face = "bold", size = 11),
        axis.text.x = element_text(size = 11, angle = 90),
        axis.text.y = element_text(size = 11))

Sentiment scores based on sentimentr (Image by author)

In the figure above, the blue bars represent highly positive sentiment scores, while the red bars depict comparatively lower sentiment scores. The patterns observed in the sentiment scores generated by sentimentr closely resemble those derived from the AFINN lexicon. Notably, this analysis is based on the original media briefings rather than individual tokens alone, with consideration given to valence shifters in the computation of sentiment scores. The convergence between the sentiment patterns identified by sentimentr and those from AFINN is not entirely unexpected. Both approaches incorporate similar weighting systems and mechanisms that account for word intensity. This alignment reinforces our confidence in the initial findings obtained through AFINN, validating the consistency and reliability of our analyses with sentimentr.
