...

Preserve Phrases with N-grams

Learn to use the tm package to create n-grams from a document.

N-grams are phrases

Tokenization can be adjusted to respect lines, sentences, and paragraphs as well as words. But what about phrases? For example, “Frankenstein’s monster” and “philosopher’s stone” are both phrases characteristic of Mary Shelley’s writing. Neither would be captured by the tokenization strategies we’ve discussed so far. Extracting them requires a strategy called n-grams: sequences of n consecutive tokens.
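Before building the full pipeline, a minimal sketch shows what n-grams look like. It uses ngrams() from the NLP package, which tm loads as a dependency; the tokens vector here is a made-up example.

library(NLP)
tokens <- c("the", "philosophers", "stone", "was", "hidden")
# ngrams() returns a list of token vectors; paste() joins each
# into a readable two-word phrase
vapply(ngrams(tokens, n = 2), paste, "", collapse = " ")
# "the philosophers" "philosophers stone" "stone was" "was hidden"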

Most frequent phrases

For our work with Mary Shelley, it’ll be helpful to know a list of the most frequent phrases. The following code produces this list:

library(tm, quietly = TRUE)
library(readtext, quietly = TRUE)
# This removes Project Gutenberg header and tail -----------
shelleyText <- readtext("data/mws_*.txt")
shelleyText <- iconv(shelleyText$text, "UTF-8", sub = '')
# *** START OF THE PROJECT GUTENBERG EBOOK ??? ***
# useful text is between these two lines
# *** END OF THE PROJECT GUTENBERG EBOOK ??? ***
fromHere <- regexpr(pattern = ' \\*{3}\n', text = shelleyText)
toHere <- regexpr(pattern = '\\*{3} END', text = shelleyText)
for (index in seq_along(shelleyText)) {
  # Keep only the text between the START and END markers;
  # match.length is indexed per file so each offset is correct
  shelleyText[index] <- substr(shelleyText[index],
    start = fromHere[index] + attr(fromHere, which = "match.length")[index],
    stop = toHere[index])
}
# This displays leading n-grams ------------------------
shelleyText |>
  removePunctuation() |>
  removeWords(stopwords('english')) |>
  removeWords(c("I")) |>   # stopwords() is lowercase, so "I" must be removed separately
  removeNumbers() |>
  stripWhitespace() |>
  Boost_tokenizer() |>     # split the cleaned text into word tokens
  ngrams(n = 3) |>         # group tokens into overlapping three-word runs
  vapply(paste, "", collapse = " ") |>  # flatten each n-gram into a single string
  table() |>
  sort(decreasing = TRUE) |>
  head(n = 10)

This results in a list of the ten most frequent trigrams. These might be useful in our search for forums Mary Shelley would be wise to use for promotion.
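The same pipeline adapts to other phrase lengths. Assuming the cleaned shelleyText from above is still in scope, changing n = 3 to n = 2 yields the most frequent bigrams instead:

shelleyText |>
  removePunctuation() |>
  removeWords(stopwords('english')) |>
  removeWords(c("I")) |>
  removeNumbers() |>
  stripWhitespace() |>
  Boost_tokenizer() |>
  ngrams(n = 2) |>   # two-word phrases instead of three
  vapply(paste, "", collapse = " ") |>
  table() |>
  sort(decreasing = TRUE) |>
  head(n = 10)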

There is a lot to unpack in the full pipeline above, but there is also a lot to learn. It builds on ...
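One step worth a closer look is the header-stripping loop. The toy string below is a hypothetical stand-in for one Gutenberg file; it shows how regexpr() and its match.length attribute locate the slice boundaries:

txt <- "HEADER ***\nuseful text\n*** END"
from <- regexpr(pattern = ' \\*{3}\n', text = txt)   # start of the opening marker
to <- regexpr(pattern = '\\*{3} END', text = txt)    # start of the closing marker
# regexpr() gives the match position; adding its "match.length"
# attribute skips past the marker to where the useful text begins
substr(txt, start = from + attr(from, which = "match.length"), stop = to)
# "useful text\n*" -- the stray marker character is swept up
# later by removePunctuation()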