Preserve Phrases with N-grams

Learn to use the tm package to create n-grams from a document.

N-grams are phrases

Tokenization can be adjusted to respect lines, sentences, and paragraphs as well as words. But what about phrases? For example, “Frankenstein’s monster” and “philosopher’s stone” are both phrases characteristic of Mary Shelley’s writing. Neither of them would be captured as a unit by the tokenization strategies we’ve discussed so far. Instead, they require a strategy called n-grams. An n-gram is a sequence of n consecutive tokens, so a bigram (n = 2) captures two-word phrases like these.
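To make the idea concrete, here is a minimal sketch of bigrams built from a handful of tokens. It assumes the tm package is installed (loading it also attaches NLP, which supplies ngrams()); the token vector is an illustrative stand-in, not text from the lesson's corpus.

```r
library(tm)  # attaches NLP, which supplies ngrams()

# Illustrative tokens, not taken from the lesson's corpus
tokens <- c("the", "philosopher's", "stone", "and", "the", "elixir")

# ngrams() returns a list of length-2 character vectors;
# paste() joins each pair into a single phrase
bigrams <- vapply(ngrams(tokens, 2L), paste, character(1), collapse = " ")
print(bigrams)
```

Each bigram overlaps its neighbor by one word, so a sequence of six tokens yields five bigrams.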

Most frequent phrases

For our work with Mary Shelley, it’ll be helpful to have a list of her most frequent phrases. The following code produces this list:
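A minimal sketch of one way to count phrases with tm, under stated assumptions: the two sentences in docs are stand-ins for the Shelley corpus, and BigramTokenizer is a hypothetical helper name following the pattern commonly used with tm, not necessarily the lesson's own code.

```r
library(tm)  # also attaches NLP, which supplies ngrams() and words()

# Stand-in text (assumption); in practice this would be the Shelley corpus
docs <- c(
  "the philosopher's stone and the elixir of life",
  "he sought the philosopher's stone and the elixir of life"
)

# VCorpus is used because custom tokenizers are applied per document
corpus <- VCorpus(VectorSource(docs))

# Hypothetical helper: split a document into two-word phrases
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

# Build a term-document matrix whose "terms" are bigrams
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Sum counts across documents and sort from most to least frequent
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(head(freq, 10))
```

Changing 2L to 3L in the helper would count three-word phrases instead. For a large corpus, converting the full matrix with as.matrix() can use a lot of memory, so this approach is best suited to small examples.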
