Wordclouds

TLDR

Wordclouds give a neat visual summary of a text and can readily be produced in R. This is a simple example based on a recent conference paper.


Summarising the content of a conference paper

There is an R package dedicated to creating wordclouds, so I’ve started by loading this, along with the tidyverse (for standard data manipulation) and tidytext (for some help processing the contents of the paper).

library(wordcloud); library(tidyverse); library(tidytext)

The wordcloud package draws the words that appear in a given text, with the size of each word proportional to its frequency in that text. R can read text from a local file, as shown below, or from a website.

# We can read a text file using 'readLines' and we can select a file interactively using 'file.choose'
# Both of these are Base R functions
paper <- readLines(file.choose())

The ‘paper’ variable is currently a character vector with one element per line, as we can see when viewing one of its elements:

print(paper[2])
## [1] "Application of MCMC Sampling to Account for Variability and Dependency"
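readLines works on any connection, not just a file path, so the website case mentioned earlier can be handled with the same call. The sketch below uses a textConnection as an offline stand-in; the three lines of text and the address in the comment are made up for illustration and are not the paper used in this post.

```r
# Hypothetical three-line document standing in for a file or a web page
demo_text <- c("Title of the paper",
               "First line of the abstract",
               "Second line of the abstract")

# readLines accepts a connection; for a website this could instead be
# url("https://example.com/paper.txt") (a placeholder address)
con <- textConnection(demo_text)
demo_lines <- readLines(con)
close(con)

# The result is a character vector with one element per line
length(demo_lines)
demo_lines[1]
```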

tidytext helps get this into a friendlier format, with one word per row, allowing us to count the occurrence of each word.

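# One row per word (unnest_tokens lower-cases by default), dropping purely numeric tokens
paper_tbl <- as_tibble(paper) %>% 
  tidytext::unnest_tokens(word, value) %>% 
  dplyr::filter(is.na(suppressWarnings(as.numeric(word)))) %>% 
  count(word)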
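To see what each step of that pipeline does, here is a minimal sketch on two made-up lines of text (demo_lines is an assumption for illustration, not the paper itself). It tokenises the lines into lower-case words, drops the token that is a number, and counts what is left.

```r
library(dplyr)
library(tidytext)

# Two hypothetical lines of text
demo_lines <- c("Bayesian models of fatigue", "Fatigue models in 2020")

demo_counts <- tibble::tibble(value = demo_lines) %>%
  unnest_tokens(word, value) %>%                               # one lower-case word per row
  filter(is.na(suppressWarnings(as.numeric(word)))) %>%        # keep only non-numeric tokens
  count(word)                                                  # one row per distinct word

demo_counts
```

Both ‘Fatigue’ and ‘fatigue’ end up in the same row because of the lower-casing, and ‘2020’ is removed by the numeric filter.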

Since I expected words like ‘the’ and ‘of’ to feature a lot in the text, I wanted to be able to remove them. I initially used dplyr to set up a variable that would allow me to filter out shorter words, based on some threshold…

minLength <- 4

paper_tbl <- paper_tbl %>% 
  mutate(check = case_when(nchar(word) < minLength ~ 0,
                           nchar(word) >= minLength ~ 1))
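For completeness, a length-based filter can also be written in a single step. This is a minimal sketch on made-up counts (demo_tbl and min_length are assumptions for illustration), and it shows the main weakness of the approach: a short but meaningful word such as ‘fit’ is dropped along with ‘the’ and ‘of’.

```r
library(dplyr)

# Hypothetical word counts, not taken from the paper
demo_tbl <- tibble::tibble(word = c("the", "of", "model", "data", "fit"),
                           n = c(120L, 95L, 58L, 24L, 9L))
min_length <- 4

# Keep only words with at least min_length characters
long_words <- demo_tbl %>%
  filter(nchar(word) >= min_length)

long_words
```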

But then I learnt about stopwords, and used the list that tidytext conveniently provides to remove them from the data.

# Drop every row whose word appears in the stopword list
paper_tbl <- paper_tbl %>% 
  anti_join(tidytext::get_stopwords(language = 'en', source = 'stopwords-iso'), by = 'word')
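As a small illustration of how the anti_join works, the sketch below applies it to made-up counts (demo_tbl is an assumption for illustration). It uses the default stopword list returned by get_stopwords(), which is the English ‘snowball’ lexicon, rather than the larger ‘stopwords-iso’ list used above.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts, not taken from the paper
demo_tbl <- tibble::tibble(word = c("the", "of", "model", "fit"),
                           n = c(120L, 95L, 58L, 9L))

# A tibble with a 'word' column (and a 'lexicon' column naming the source)
stop_tbl <- get_stopwords()

# anti_join keeps the rows of demo_tbl that have no match in stop_tbl
kept <- anti_join(demo_tbl, stop_tbl, by = "word")

kept
```

Unlike a length threshold, this keeps a short word such as ‘fit’ while still removing ‘the’ and ‘of’.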

Before sending this directly into the wordcloud function, we can review the current state of the data, either as a table…

paper_tbl %>% 
  arrange(desc(n)) %>% 
  head(n = 10)
## # A tibble: 10 x 2
##    word           n
##    <chr>      <int>
##  1 model         58
##  2 models        27
##  3 fatigue       25
##  4 data          24
##  5 parameters    24
##  6 bayesian      22
##  7 posterior     18
##  8 crack         17
##  9 growth        14
## 10 priors        12

…or as a simple plot (in either case I’m only interested in the most frequent words for now) …

ggplot(paper_tbl %>% 
         arrange(desc(n)) %>% 
         dplyr::filter(n >= 12))+
  geom_col(mapping = aes(x = word, y = n))+
  theme_minimal()+ theme(axis.text.x = element_text(angle = 90), 
                         axis.title.x = element_blank())+
  labs(y = 'count')
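Note that ggplot2 orders a character x axis alphabetically, so arrange() has no effect on the order of the bars. If bars sorted by count are preferred, one option is to reorder the words inside aes(). A minimal sketch on made-up counts (demo_tbl is an assumption for illustration):

```r
library(ggplot2)

# Hypothetical word counts, not taken from the paper
demo_tbl <- data.frame(word = c("bayesian", "data", "model"),
                       n = c(22, 24, 58))

# reorder(word, -n) makes 'word' a factor whose levels are sorted by decreasing n
p <- ggplot(demo_tbl) +
  geom_col(mapping = aes(x = reorder(word, -n), y = n)) +
  theme_minimal() +
  labs(x = NULL, y = "count")

p
```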

One thing that is apparent from the above summaries is that we have not dealt with plurals in the data, i.e. ‘model’ and ‘models’ are treated as two different words, each with its own count. I’ve not found a neat way to combine these, but a manual solution based on regular expressions (with functions such as grepl) would be simple, though not very elegant. I decided to leave plurals as they are.
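For anyone who does want to merge them, here is a rough sketch of one such manual approach, on made-up counts (demo_tbl is an assumption for illustration). It strips a trailing ‘s’ only when the shorter form is itself a word in the table, then sums the counts. It will not catch plurals ending in ‘-es’ or ‘-ies’, so treat it as a starting point rather than a proper stemmer.

```r
library(dplyr)

# Hypothetical word counts, not taken from the paper
demo_tbl <- tibble::tibble(word = c("model", "models", "crack", "priors", "analysis"),
                           n = c(58L, 27L, 17L, 12L, 5L))

merged_tbl <- demo_tbl %>%
  mutate(stem = sub("s$", "", word),                              # candidate singular form
         word = if_else(stem %in% demo_tbl$word, stem, word)) %>% # use it only if it exists
  group_by(word) %>%
  summarise(n = sum(n), .groups = "drop")

merged_tbl
```

Here ‘model’ and ‘models’ are combined into a single row, while ‘priors’ and ‘analysis’ are left alone because ‘prior’ and ‘analysi’ do not appear in the table.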

Finally, time to ask the wordcloud function to read and plot our data. There are some useful arguments to experiment with here:

  • min.freq and max.words set boundaries for how populated the wordcloud will be
  • random.order will put the largest word in the middle if set to FALSE
  • rot.per is the fraction of words that will be rotated in the graphic

The words are placed with some randomness, so for a repeatable graphic we need to set a seed value first.

set.seed(1008) 

wordcloud(words = paper_tbl$word, freq = paper_tbl$n, 
          min.freq = 4, max.words = 100, random.order = FALSE, rot.per = 0.25,
          colors = brewer.pal(n = 8, name = 'Paired'))
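The effect of the seed can be checked in isolation. This minimal base R sketch assumes the wordcloud layout draws on R's usual random number generator, which is what would make set.seed effective above; the variable names are made up for illustration.

```r
# Resetting the seed reproduces the same sequence of random numbers
set.seed(1008)
first_draw <- runif(3)

set.seed(1008)
second_draw <- runif(3)

identical(first_draw, second_draw)  # TRUE
```

Calling set.seed immediately before wordcloud therefore gives the same arrangement each time the code is run.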

If you’re not familiar with the colour palettes, the line below will ask R to display them for you:

RColorBrewer::display.brewer.all()

More information on the packages introduced here can be found on CRAN, where both are available.

CEng, PhD Candidate

My research interests include multi-level Bayesian modelling (for partial pooling of information) and its decision theoretic applications, such as quantification of the expected value of information. I also work in football (soccer) analytics.
