Using the File Folder as Corpus
Learn about using files and folders as SimpleCorpus.
The documentation for tm is nearly 60 pages long and immediately dives into the mechanics of NLP. Rather than trying to understand the entire depth of this package in one go, let’s break it down into understandable and related components. The tm package can be broken down into these main topics:
Corpora and sources
Metadata
Preprocessing: Cleaning, stopwords, and stemming
Tokenizing: Words, n-grams, weighting
Statistics: Term frequency
Visualization
In this lesson, we’ll use Frankenstein as a base for our project. Our first task is to import text into a corpus.
VCorpus and SimpleCorpus
Natural language processing and text mining are done on a collection of documents, and this collection is called a corpus. The creation of a corpus is the first step to natural language processing with tm. Documents are imported into a corpus with the corpus family of commands. The different corpus commands produce different types of corpus.
There are two main versions:
VCorpus (volatile corpus)
SimpleCorpus (similar to VCorpus)
Here’s how to create a VCorpus:
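As a minimal sketch, assuming the source text is stored as plain-text files in a single folder, a VCorpus can be built by pointing DirSource() at that folder. The folder name, file names, and placeholder text below are hypothetical stand-ins for the Frankenstein files, which are not shown here. The same source is also passed to SimpleCorpus() for comparison.

```r
library(tm)

# Hypothetical demo folder: a temporary directory stands in for the
# folder that would hold the Frankenstein text files.
corpus_dir <- file.path(tempdir(), "frankenstein_demo")
dir.create(corpus_dir, showWarnings = FALSE)

# Placeholder files (assumed names and contents, for illustration only).
writeLines("Placeholder text for chapter one.",
           file.path(corpus_dir, "chapter1.txt"))
writeLines("Placeholder text for chapter two.",
           file.path(corpus_dir, "chapter2.txt"))

# DirSource() lists the matching files in the folder;
# VCorpus() reads each file into its own document.
frank_vcorpus <- VCorpus(DirSource(corpus_dir, pattern = "\\.txt$"))

# SimpleCorpus() accepts the same kind of source.
frank_simple <- SimpleCorpus(DirSource(corpus_dir, pattern = "\\.txt$"))

# Printing a corpus shows a short summary, including the document count.
print(frank_vcorpus)
print(frank_simple)

# Individual documents are accessed with [[ ]], and content() returns their text.
content(frank_vcorpus[[1]])
```

Each file in the folder becomes one document, so the number of documents in the corpus matches the number of files that DirSource() picked up.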