Using a Suitable Source Type
Learn about different types of sources.
Introduction to sources
The tm package can import several types of documents with special functions called sources. It comes with a set of sources for general-purpose work, and a developer can add more through plug-ins. In this lesson, we’ll look at the sources included with tm.
The tm package provides getSources() to list the available sources. Run the following code to list the sources available in this copy of tm:
library(tm, quietly = TRUE)

getSources()
Line 3: getSources() provides a list of sources.
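As a small illustration of using that list programmatically, the sketch below stores the result and checks whether a particular source name is present. The variable name availableSources is introduced here for illustration only, and the sketch assumes getSources() returns the source names as a character vector.

```r
library(tm, quietly = TRUE)

# Store the names of the sources shipped with this copy of tm
# (availableSources is an illustrative name, not part of tm)
availableSources <- getSources()
print(availableSources)

# Check whether the source discussed next is available
"DataframeSource" %in% availableSources
```

A check like this can be handy at the top of a script that depends on a specific source.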
Let’s explore each of these sources in depth.
DataframeSource
A DataframeSource is a data.frame where each row represents a document. The first column must be named “doc_id” and contain a unique string that identifies the document, such as a file name. The second column must be named “text” and contain the document’s contents. The following code creates a DataframeSource and then creates a corpus from that source:
library(tm, quietly = TRUE)
library(readtext)

DataDirectory <- "data/"
fileList <- dir(path = DataDirectory, pattern = "mws_.+txt")

# readtext returns a data.frame
aDataframe <- readtext(paste0(DataDirectory, fileList))

# This code confirms the doc_id is unique --------
if (nrow(aDataframe) == length(unique(aDataframe$doc_id))) {
  message("doc_id is unique")
} else {
  stop("doc_id is not unique")
}

aCorpus <- Corpus(DataframeSource(aDataframe))
summary(aCorpus)
Line 4: This line sets the DataDirectory variable to the string “data/”. It specifies the directory where the text files are located.
Line 5: This line creates a character vector fileList containing the names of files in DataDirectory that match the specified pattern. In this case, it looks for files that start with mws_ and end with .txt (such as mws_1.txt or mws_2.txt).
Line 8: This line uses the readtext() function from the readtext package to read the text content of the files specified in fileList. The readtext() function returns a data.frame with two columns: text (the content of the text file) and doc_id (the identifier of the document). The paste0() function concatenates DataDirectory with the file names to form the complete paths to the files.
Line 11: This line checks whether the number of rows in the aDataframe data frame is equal to the number of unique doc_id values. The nrow() function returns the number of rows, while length(unique(aDataframe$doc_id)) returns the number of unique doc_id values. ...
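To try DataframeSource without any files on disk, here is a minimal sketch that builds the data frame by hand. The data frame docs, its two rows, and the name smallCorpus are made up for illustration. The uniqueness check uses base R's anyDuplicated(), which returns 0 when there are no duplicates, as a compact alternative to comparing nrow() with the number of unique values.

```r
library(tm, quietly = TRUE)

# Hypothetical two-document data frame. The column names follow the
# DataframeSource convention: doc_id first, text second.
docs <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c("It was a dark and stormy night.",
           "The rain fell in torrents."),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every doc_id is unique
stopifnot(anyDuplicated(docs$doc_id) == 0)

# Each row of the data frame becomes one document in the corpus
smallCorpus <- Corpus(DataframeSource(docs))
length(smallCorpus)
```

Because each row becomes one document, the length of the corpus should equal the number of rows in the data frame.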