Text Statistics with quanteda

Explore statistical text analysis in R with the quanteda.textstats package.

Statistical analysis of the text

The quanteda.textstats package is an R package that computes statistical measures and summaries from text data. It lets users generate descriptive statistics about a text corpus, supporting both linguistic and quantitative analysis. Its main functions include:

  • textstat_collocation: Finds words or phrases that appear together.

  • textstat_entropy: Describes the diversity and unpredictability of word usage. A higher entropy might indicate a more varied and diverse vocabulary, while a lower entropy might indicate more focused and repetitive language usage.

  • textstat_frequency: Tabulates how often each word or phrase occurs, along with its rank and the number of documents it appears in, optionally by group.

  • textstat_keyness: Scores how strongly words or phrases are associated with a target set of documents compared with a reference set.

  • textstat_lexdiv: Calculates the lexical diversity of documents.

  • textstat_readability: Calculates the readability of a document.

  • textstat_simil: Computes similarities between documents or between words and phrases (the companion function textstat_dist computes distances).

  • textstat_summary: Generates a summary of text statistics for a document.
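As a rough illustration of how a few of these functions are called, here is a minimal sketch. It assumes the quanteda and quanteda.textstats packages are installed. The two sample documents and all variable names are invented for this example.

```r
library(quanteda)
library(quanteda.textstats)

# Invented sample documents, for illustration only
docs <- c(
  d1 = "The quick brown fox jumps over the lazy dog.",
  d2 = "The dog barks. The dog runs. The dog sleeps."
)

# Tokenize, drop punctuation, and build a document-feature matrix
toks <- tokens(docs, remove_punct = TRUE)
dfmat <- dfm(toks)

# Most frequent features across both documents
freq <- textstat_frequency(dfmat, n = 5)

# Lexical diversity per document (type-token ratio)
ttr <- textstat_lexdiv(dfmat, measure = "TTR")

# Entropy of word usage per document
ent <- textstat_entropy(dfmat, margin = "documents")

print(freq)
print(ttr)
print(ent)
```

In this toy corpus the second document repeats the same few words, so it should show a lower type-token ratio and lower entropy than the first.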

Readability of the document

As an example of using quanteda.textstats, let’s look at readability. Readability scores measure how easily a person can understand a written text. Two common readability scores are the Flesch Reading Ease and the Coleman-Liau Index.

  • Flesch Reading Ease: The Flesch Reading Ease score is a numerical value indicating how easy or difficult a text is to read. It’s based on the average sentence length and the average number of syllables per word in the text. Higher Flesch Reading Ease scores (between 60 and 100) suggest that the text is relatively easy to read, with shorter sentences and words that are simple to understand.

  • Coleman-Liau Index: The Coleman-Liau Index is another readability score that estimates the grade level required to understand a text. It uses the average number of letters and sentences per 100 words to calculate the index. For example, a Coleman-Liau Index of 8.0 suggests that an eighth-grader should be able to understand the text.
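To make the two definitions above concrete, the sketch below applies the commonly published formulas for both scores to raw counts in base R. The helper names and the counts are hypothetical and chosen only for illustration. Values returned by quanteda.textstats may differ slightly, since the package does its own counting and offers several variants of each measure.

```r
# Hypothetical helpers applying the standard published formulas to raw counts.

# Flesch Reading Ease: penalizes long sentences and many syllables per word
flesch_reading_ease <- function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
}

# Coleman-Liau Index: uses letters and sentences per 100 words
coleman_liau_index <- function(n_letters, n_words, n_sentences) {
  letters_per_100 <- 100 * n_letters / n_words
  sentences_per_100 <- 100 * n_sentences / n_words
  0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8
}

# Invented counts: 100 words, 5 sentences, 140 syllables, 450 letters
fre <- flesch_reading_ease(n_words = 100, n_sentences = 5, n_syllables = 140)
cli <- coleman_liau_index(n_letters = 450, n_words = 100, n_sentences = 5)

print(fre)
print(cli)
```

With these invented counts, the Flesch Reading Ease comes out at about 68.1 (relatively easy) and the Coleman-Liau Index at about 9.2 (roughly ninth-grade level).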

The quanteda.textstats package includes almost 50 different ways to calculate readability.
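The following is a minimal sketch of how the two scores discussed above could be computed with textstat_readability, assuming quanteda and quanteda.textstats are installed. The sample texts and variable names are invented for this example. The measure names "Flesch" and "Coleman.Liau.grade" are the package's labels for these two scores.

```r
library(quanteda)
library(quanteda.textstats)

# Invented sample texts, for illustration only
texts <- c(
  simple = "The cat sat on the mat. It was a sunny day. The cat was happy.",
  dense = "Notwithstanding considerable methodological heterogeneity, the investigators ultimately established statistically significant associations."
)

# Request both readability measures in a single call
scores <- textstat_readability(
  texts,
  measure = c("Flesch", "Coleman.Liau.grade")
)

# One row per document, one column per requested measure
print(scores)
```

The result is a data frame, so the scores can be filtered, sorted, or joined to other document metadata like any other table. The short, plain sentences of the first text should receive a higher Flesch score than the long, polysyllabic sentence of the second.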
