Using a Suitable Corpus Class
Learn about the different types of corpora in the tm package and plug-in packages for efficient text mining and NLP analysis in R.
Let’s take a closer look at the corpus classes that ship with the tm package and those provided by plug-in packages.
Corpus
Corpus is a convenient alias that creates either a SimpleCorpus or a VCorpus, depending on the arguments provided. For example, a SimpleCorpus can’t contain XML, so if we were to use Corpus with XML, Corpus would create a VCorpus. Here is an example of Corpus:
library(tm, quietly = TRUE)

docDir <- DirSource(directory = "data",
                    pattern = "mws_.+txt")
newCorpus <- Corpus(docDir)

# show structure of the new corpus
str(newCorpus)
This is a simple example. The first line of the str() output lists the classes of the object, and here it shows SimpleCorpus. If the source had been anything other than a DataframeSource, DirSource, or VectorSource, this would have been a VCorpus.
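To make the dispatch rule concrete, here is a minimal sketch that builds one corpus from a VectorSource and one from a URISource, then checks the class of each. The two sentences in docs are made-up sample text, and the sketch assumes that the loremipsum.txt sample file is present in the installed tm package and that a file:// URI built with sprintf() resolves on your platform (on Windows the path may need adjusting).

```r
library(tm, quietly = TRUE)

# A VectorSource is one of the sources that SimpleCorpus supports.
# The sentences below are invented sample text, not course data.
docs <- c("It was a dark and stormy night.",
          "The rain fell in torrents.")
fromVector <- Corpus(VectorSource(docs))

# A URISource is not on that list, so Corpus() should fall back to a VCorpus.
# Assumption: this sample file ships with the installed tm package.
sampleFile <- system.file("texts", "loremipsum.txt", package = "tm")
fromURI <- Corpus(URISource(sprintf("file://%s", sampleFile)))

# class() reports the corpus type more compactly than str()
class(fromVector)
class(fromURI)
```

Calling class() or inherits(obj, "SimpleCorpus") is a quick way to confirm which corpus type Corpus chose before running transformations that behave differently on the two classes.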
Here is the Corpus command with all arguments defined:
newVCorpus <- Corpus(x = DirSource(directory = "data",
                                   pattern = "mws_.+txt"),
                     readerControl = list(reader = readDataframe,
                                          language = "en"))
x is a Source object. ...