Analyzing Textual Comparisons with Document-Term Matrices
Learn the significance of document-term matrices in text mining.
We'll cover the following...
Why use document-term matrices?
The following code lists the ten most frequent tokens and their frequencies:
# This displays leading n-grams
# ------------------------
shelleyText |>
  removePunctuation() |>
  removeWords(stopwords('english')) |>
  removeWords(c("I")) |>
  removeNumbers() |>
  stripWhitespace() |>
  Boost_tokenizer() |>
  vapply(paste, "", collapse = " ") |>
  table() |>
  sort(decreasing = TRUE) |>
  head(n = 10)
Line 3: We use the pipe (|>) operator to pass the shelleyText data through a series of text processing functions.
Line 9: This step performs tokenization, breaking the text into individual words, or tokens.
Line 10: Here, the function pastes (concatenates) the words within each n-gram into a single string, separated by a space.
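Because shelleyText is loaded elsewhere in the lesson, here's a minimal, self-contained sketch of the same tokenize, count, and sort idea using only base R; the sample sentence and the two-word stop-word list are assumptions for illustration:

```r
# Sample text standing in for shelleyText (an assumption for illustration)
sampleText <- "The monster saw the lake and the monster wept."

tokens <- sampleText |>
  tolower() |>
  gsub(pattern = "[[:punct:]]", replacement = "") |>  # strip punctuation
  strsplit(split = "\\s+") |>                         # split on whitespace
  unlist()

# Drop stop words (a toy two-word list here)
tokens <- tokens[!tokens %in% c("the", "and")]

# Count each token and list the most frequent ones
tokens |>
  table() |>
  sort(decreasing = TRUE) |>
  head(n = 3)
```

After stop-word removal, "monster" is counted twice and tops the frequency table, which is the same shape of result the pipeline above produces for the full Shelley text.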
It's important to note that this code doesn't make use of a DTM. Here's similar code that creates one:
library(tm, quietly = TRUE)

docDir <- DirSource(directory = "data",
                    pattern = "mws_.+txt")
newCorpus <- Corpus(docDir)

DTmatrix <- DocumentTermMatrix(newCorpus,
                               control = list(tolower = TRUE,
                                              stopwords = TRUE,
                                              stripWhiteSpace = TRUE,
                                              removePunctuation = TRUE,
                                              removeNumbers = TRUE,
                                              tokenize = "Boost"))
inspect(DTmatrix)
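Conceptually, the resulting DTM is just a matrix with one row per document, one column per term, and term counts in the cells. Here's a hand-rolled sketch of that structure using only base R; the two toy token vectors are assumptions standing in for the tokenized mws_*.txt files:

```r
# Toy tokenized documents (assumptions standing in for the mws_*.txt files)
docs <- list(doc1 = c("monster", "lake", "monster"),
             doc2 = c("lake", "ship", "ice"))

# The vocabulary: every distinct term across all documents
terms <- sort(unique(unlist(docs)))

# One row per document, one column per term, cells hold term counts
dtm <- t(vapply(docs,
                function(tokens) as.numeric(table(factor(tokens, levels = terms))),
                numeric(length(terms))))
colnames(dtm) <- terms
dtm
```

This mirrors what inspect(DTmatrix) prints: doc1's row counts "monster" twice, and terms absent from a document get zero. The tm package builds the same structure directly from the corpus, adding sparse storage and the preprocessing steps passed through the control list.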