...

Using a Suitable Source Type

Learn about different types of sources.

Introduction to sources

The tm package imports documents through special functions called sources. It ships with a set of general-purpose sources, and developers can add more through plug-ins. In this lesson, we’ll look at the sources included with tm.

The tm package provides getSources() to list the available sources. Run the following code to see the sources in this copy of tm:

library(tm, quietly = TRUE)

getSources()
  • Line 3: getSources() returns the names of the available sources.
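
Because getSources() returns a plain character vector, its result can also be tested in code before relying on a particular source. The following is a minimal sketch, assuming the tm package is installed; the variable name availableSources is illustrative only:

```r
library(tm, quietly = TRUE)

# getSources() returns a character vector of source names
availableSources <- getSources()

# Check whether the source we plan to use appears in that vector
if ("DataframeSource" %in% availableSources) {
  message("DataframeSource is available")
} else {
  message("DataframeSource is not available in this copy of tm")
}
```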

Let’s explore each of these sources in depth.

DataframeSource

A DataframeSource reads documents from a data.frame in which each row represents one document. The first column must be named “doc_id” and contain a unique string that identifies the document, such as a file name. The second column must be named “text” and contain the document’s contents. Any additional columns are kept as document-level metadata. The following code creates a DataframeSource and then builds a corpus from it:

library(tm, quietly = TRUE)
library(readtext)

DataDirectory <- "data/"
fileList <- dir(path = DataDirectory, pattern = "^mws_.+\\.txt$")

# readtext returns a data.frame
aDataframe <- readtext(paste0(DataDirectory, fileList))

# This code confirms the doc_id is unique --------
if (nrow(aDataframe) == length(unique(aDataframe$doc_id))) {
  message("doc_id is unique")
} else {
  stop("doc_id is not unique")
}
aCorpus <- Corpus(DataframeSource(aDataframe))
summary(aCorpus)
  • Line 4: This line sets the DataDirectory variable to the string “data/”. It specifies the directory where the text files are located.

  • Line 5: This line creates a character vector fileList containing the names of files in DataDirectory that match the specified pattern. In this case, it looks for files that start with mws_ and end with .txt (such as mws_1.txt or mws_2.txt).

  • Line 8: This line uses the readtext() function from the readtext package to read the contents of the files listed in fileList. The paste0() function concatenates DataDirectory with each file name to form the complete file paths. The readtext() function returns a data.frame with two columns:

    • doc_id: the identifier of the document.

    • text: the contents of the text file.

  • Line 11: This line checks whether the number of rows in the aDataframe data frame is equal to the number of unique doc_id values. The nrow() function returns the number of rows, while length(unique(aDataframe$doc_id)) returns the number of unique doc_id values. ...
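
The column requirements and the uniqueness check described above can also be tried without any files on disk. The following is a minimal sketch, assuming the tm package is installed; the data frame docs and its contents are hypothetical, and anyDuplicated() is used here as an alternative way to write the same uniqueness test:

```r
library(tm, quietly = TRUE)

# Hypothetical two-document data frame: doc_id first, text second
docs <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c("This is the first document.", "This is the second document."),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every doc_id is unique
if (anyDuplicated(docs$doc_id) == 0) {
  message("doc_id is unique")
} else {
  stop("doc_id is not unique")
}

# Build a corpus from the in-memory data frame
smallCorpus <- Corpus(DataframeSource(docs))
length(smallCorpus)  # number of documents in the corpus
```

Building the data frame by hand makes it easy to see what happens when a requirement is broken, for example by repeating a doc_id value or renaming the text column.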