...


Analyzing Textual Comparisons with Document-Term Matrices

Learn the significance of document-term matrices in text mining.

Why use document-term matrices?

The following code lists the tokens and their frequencies:

# Display the most frequent tokens ------------------------
shelleyText |>
  removePunctuation() |>
  removeWords(stopwords("english")) |>
  removeWords(c("I")) |>
  removeNumbers() |>
  stripWhitespace() |>
  Boost_tokenizer() |>
  vapply(paste, "", collapse = " ") |>
  table() |>
  sort(decreasing = TRUE) |>
  head(n = 10)
  • Line 2: We use the pipe (|>) operator to pass the shelleyText data through a series of text-cleaning functions.

  • Line 8: This step performs the tokenization, with Boost_tokenizer() breaking the cleaned text into individual words, or tokens.

  • Line 9: Here, vapply() pastes (concatenates) the words within each token into a single string, separated by a space.
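To see the cleaning-and-counting steps in action without the full text, here is a minimal, self-contained sketch that applies the same pipeline to an inline sample string (sampleText below is a made-up stand-in for shelleyText, which is loaded earlier in the lesson):

```r
library(tm, quietly = TRUE)

# A stand-in for shelleyText, purely for illustration
sampleText <- "The ice was here, the ice was there, the ice was all around."

sampleText |>
  removePunctuation() |>
  removeWords(stopwords("english")) |>
  stripWhitespace() |>
  Boost_tokenizer() |>
  table() |>
  sort(decreasing = TRUE) |>
  head(n = 3)
```

Because the cleaning steps strip punctuation and common stop words before counting, the token ice should dominate the resulting frequency table.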

It’s important to note that this code doesn’t make use of a DTM. Here’s similar code that creates one:

library(tm, quietly = TRUE)
docDir <- DirSource(directory = "data", pattern = "mws_.+txt")
newCorpus <- Corpus(docDir)
DTmatrix <- DocumentTermMatrix(newCorpus,
                               control = list(tolower = TRUE,
                                              stopwords = TRUE,
                                              stripWhitespace = TRUE,
                                              removePunctuation = TRUE,
                                              removeNumbers = TRUE,
                                              tokenize = "Boost"))
inspect(DTmatrix)
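If the mws_*.txt files aren’t at hand, the same construction can be sketched against a tiny in-memory corpus (the two sentences below are made up for illustration). Once the DTM exists, tm helpers such as findFreqTerms() and as.matrix() expose its contents:

```r
library(tm, quietly = TRUE)

# Toy two-document corpus standing in for the files in "data"
docs <- c("The monster fled across the frozen ice.",
          "The creature spoke of misery upon the ice.")
toyCorpus <- Corpus(VectorSource(docs))

toyDTM <- DocumentTermMatrix(toyCorpus,
                             control = list(tolower = TRUE,
                                            stopwords = TRUE,
                                            removePunctuation = TRUE,
                                            removeNumbers = TRUE))

inspect(toyDTM)                      # rows = documents, columns = terms
findFreqTerms(toyDTM, lowfreq = 2)   # terms appearing at least twice
as.matrix(toyDTM)                    # dense matrix of term counts
```

Here "ice" is the only term shared by both toy documents, so it is the only term findFreqTerms() reports at a minimum frequency of two.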

    ...