...

Preserve Phrases with N-grams

Learn to use the tm package to create n-grams from a document.

N-grams are phrases

Tokenization can be adjusted to respect lines, sentences, and paragraphs as well as words. But what about phrases? For example, “Frankenstein’s monster” and “philosopher’s stone” are both phrases characteristic of Mary Shelley’s writing. Neither would be captured by the tokenization strategies we’ve discussed so far. Extracting them requires a strategy called n-grams: sequences of n consecutive tokens.
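Before building the full pipeline, a minimal sketch shows what n-grams look like. It uses ngrams() from the NLP package, which tm loads as a dependency; the tokens vector here is a made-up example.

library(NLP)
tokens <- c("the", "philosophers", "stone", "was", "hidden")
# ngrams() returns a list of token vectors; paste() joins each
# into a readable two-word phrase
vapply(ngrams(tokens, n = 2), paste, "", collapse = " ")
# "the philosophers" "philosophers stone" "stone was" "was hidden"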

Most frequent phrases

For our work with Mary Shelley, it’ll be helpful to know a list of the most frequent phrases. The following code produces this list:

library(tm, quietly = TRUE)
library(readtext, quietly = TRUE)
# This removes Project Gutenberg header and tail -----------
shelleyText <- readtext("data/mws_*.txt")
shelleyText <- iconv(shelleyText$text, "UTF-8", sub = '')
# *** START OF THE PROJECT GUTENBERG EBOOK ??? ***
# useful text is between these two lines
# *** END OF THE PROJECT GUTENBERG EBOOK ??? ***
fromHere <- regexpr(pattern = ' \\*{3}\n', text = shelleyText)
toHere <- regexpr(pattern = '\\*{3} END', text = shelleyText)
for (index in seq_along(shelleyText)) {
  # Keep only the text between the START and END markers;
  # match.length is indexed per file so each offset is correct
  shelleyText[index] <- substr(shelleyText[index],
    start = fromHere[index] + attr(fromHere, which = "match.length")[index],
    stop = toHere[index])
}
# This displays leading n-grams ------------------------
shelleyText |>
  removePunctuation() |>
  removeWords(stopwords('english')) |>
  removeWords(c("I")) |>   # stopwords() is lowercase, so "I" must be removed separately
  removeNumbers() |>
  stripWhitespace() |>
  Boost_tokenizer() |>     # split the cleaned text into word tokens
  ngrams(n = 3) |>         # group tokens into overlapping three-word runs
  vapply(paste, "", collapse = " ") |>  # flatten each n-gram into a single string
  table() |>
  sort(decreasing = TRUE) |>
  head(n = 10)

This results in a list of the ten most frequent trigrams. These might be useful in our search for forums Mary Shelley would be wise to use for promotion.
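The same pipeline adapts to other phrase lengths. Assuming the cleaned shelleyText from above is still in scope, changing n = 3 to n = 2 yields the most frequent bigrams instead:

shelleyText |>
  removePunctuation() |>
  removeWords(stopwords('english')) |>
  removeWords(c("I")) |>
  removeNumbers() |>
  stripWhitespace() |>
  Boost_tokenizer() |>
  ngrams(n = 2) |>   # two-word phrases instead of three
  vapply(paste, "", collapse = " ") |>
  table() |>
  sort(decreasing = TRUE) |>
  head(n = 10)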

There is a lot to unpack in the full pipeline above, but there is also a lot to learn. It builds on ...
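One step worth a closer look is the header-stripping loop. The toy string below is a hypothetical stand-in for one Gutenberg file; it shows how regexpr() and its match.length attribute locate the slice boundaries:

txt <- "HEADER ***\nuseful text\n*** END"
from <- regexpr(pattern = ' \\*{3}\n', text = txt)   # start of the opening marker
to <- regexpr(pattern = '\\*{3} END', text = txt)    # start of the closing marker
# regexpr() gives the match position; adding its "match.length"
# attribute skips past the marker to where the useful text begins
substr(txt, start = from + attr(from, which = "match.length"), stop = to)
# "useful text\n*" -- the stray marker character is swept up
# later by removePunctuation()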