Using a Suitable Source Type
Learn about different types of sources.
Introduction to sources
The tm package can import several types of documents with special functions called sources. The tm package comes with a set of sources for general-purpose work, but a developer can add additional sources through plug-ins. In this lesson, we’ll look at the sources included with tm.
The tm package provides getSources() to produce a list of available sources. Run the following code to list the available sources in this copy of tm:
library(tm, quietly = TRUE)

getSources()
Line 3: getSources() provides a list of sources.
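Because getSources() returns the source names as a character vector, a script can check that a source exists before relying on it. The following is a minimal sketch of that idea; the variable names availableSources and wantedSource are illustrative and are not part of tm.

```r
library(tm, quietly = TRUE)

# getSources() returns the names of the available sources as a character vector
availableSources <- getSources()

# Illustrative name: the source this script intends to use
wantedSource <- "DataframeSource"

if (wantedSource %in% availableSources) {
  message(wantedSource, " is available")
} else {
  stop(wantedSource, " is not available in this copy of tm")
}
```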
Let’s explore each of these sources in depth.
DataframeSource
A DataframeSource wraps a data.frame in which each row represents a document. The first column must be named “doc_id” and contain a unique string to identify the document, possibly a file name. The second column must be named “text” and contain the document’s contents. The following code creates a DataframeSource and then creates a corpus from that source:
library(tm, quietly = TRUE)
library(readtext)

DataDirectory <- "data/"
fileList <- dir(path = DataDirectory, pattern = "^mws_.+\\.txt$")

# readtext returns a data.frame
aDataframe <- readtext(paste0(DataDirectory, fileList))

# This code confirms the doc_id is unique --------
if (nrow(aDataframe) == length(unique(aDataframe$doc_id))) {
  message("doc_id is unique")
} else {
  stop("doc_id is not unique")
}

aCorpus <- Corpus(DataframeSource(aDataframe))
summary(aCorpus)
Line 4: This line sets the DataDirectory variable to the string “data/”, the directory where the text files are located.
Line 5: This line creates a character vector, fileList, containing the names of the files in DataDirectory that match the specified pattern. In this case, it looks for files that start with mws_ and end with .txt (such as mws_1.txt or mws_2.txt).
Line 8: This line uses the readtext() function from the readtext package to read the text content of the files specified in fileList. The readtext() function returns a data.frame with two columns: doc_id (the identifier of the document) and text (the content of the text file). The paste0() function concatenates DataDirectory with the file names to form the complete paths to the files.
Line 11: This line checks whether the number of rows in the aDataframe data frame is equal to the number of unique doc_id values. The nrow() function returns the number of rows, while length(unique(aDataframe$doc_id)) returns the number of unique doc_id values. ...
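The column requirements for a DataframeSource can also be tried without any files on disk. The sketch below builds a small data frame by hand and turns it into a corpus. The document IDs and texts are made-up illustration data, and exampleDataframe and exampleCorpus are hypothetical names. It also uses anyDuplicated() as one possible alternative way to express the uniqueness check; it is not the check used in the lesson's code.

```r
library(tm, quietly = TRUE)

# Made-up illustration data: doc_id is the first column, text is the second
exampleDataframe <- data.frame(
  doc_id = c("doc_1", "doc_2", "doc_3"),
  text = c("The first document.", "The second document.", "The third document."),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every doc_id is unique
if (anyDuplicated(exampleDataframe$doc_id) != 0) {
  stop("doc_id is not unique")
}

exampleCorpus <- Corpus(DataframeSource(exampleDataframe))

length(exampleCorpus)        # number of documents in the corpus
content(exampleCorpus[[1]])  # text of the first document
```

Each row of the data frame becomes one document, so the corpus should contain as many documents as the data frame has rows.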