Using a Suitable Corpus Class

Learn about the different types of corpora in the tm package and plug-in packages for efficient text mining and NLP analysis in R.

Let’s take a closer look at the corpus classes provided by the tm package and its plug-in packages.

Corpus

Corpus is a convenient alias to create either a SimpleCorpus or a VCorpus, depending on the arguments provided. For example, SimpleCorpus can’t contain XML, so if we were to use Corpus with XML, Corpus would create a VCorpus. Here is an example of Corpus:

library(tm, quietly = TRUE)
docDir <- DirSource(directory = "data",
                    pattern = "mws_.+txt")
newCorpus <- Corpus(docDir)
# show structure of the new corpus
str(newCorpus)

This is a simple example. At the top of the structure listing, the line showing the object’s classes identifies it as a SimpleCorpus. If the source had been anything other than DataframeSource, DirSource, or VectorSource, Corpus would have created a VCorpus instead.
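The class selection described above can be checked without any files on disk. The following is a minimal sketch, assuming tm version 0.7 or later; the sample sentences and the names texts, autoCorpus, and explicitVCorpus are made up for illustration and are not part of the course data.

```r
library(tm, quietly = TRUE)

# Hypothetical in-memory documents (illustration only)
texts <- c("It was a dreary night of November.",
           "The moon gazed on my midnight labours.")

# Corpus() with a VectorSource is expected to produce a SimpleCorpus
autoCorpus <- Corpus(VectorSource(texts))
class(autoCorpus)

# Calling VCorpus() directly requests the VCorpus class explicitly
explicitVCorpus <- VCorpus(VectorSource(texts))
class(explicitVCorpus)
```

Comparing the two class() results shows which constructor Corpus chose for this source, and calling VCorpus directly is one way to get a VCorpus regardless of the source type.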

Here is the Corpus command with all arguments defined:

newVCorpus <- Corpus(
  x = DirSource(directory = "data",
                pattern = "mws_.+txt"),
  readerControl = list(reader = readDataframe,
                       language = "en")
)
  • x is a source object. ...
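To experiment with readerControl without the data directory, the same argument can be passed to VCorpus with an in-memory source. This is only a sketch under stated assumptions: the sample texts and the names sampleTexts and memCorpus are invented here, and readPlain (tm’s plain-text reader) is used because the documents are plain character strings.

```r
library(tm, quietly = TRUE)

# Hypothetical in-memory documents (illustration only)
sampleTexts <- c("The first short document.",
                 "The second short document.")

# readerControl names the reader function and the document language
memCorpus <- VCorpus(
  x = VectorSource(sampleTexts),
  readerControl = list(reader = readPlain,
                       language = "en")
)

# The language given in readerControl is stored in each document's metadata
meta(memCorpus[[1]], "language")
```

Inspecting the metadata of a single document is a quick way to confirm that the readerControl settings were applied.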
