Analyzers in Elasticsearch

Learn about analyzers in Elasticsearch.

Analyzer

An analyzer is the process, or sequence of processes, that performs a series of operations on text, such as breaking it into individual words, lowercasing it, or removing common words. These operations prepare text data for indexing and searching.
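As a quick illustration, the request below runs Elasticsearch's built-in standard analyzer on a sample sentence via the _analyze API; the sample text is arbitrary:

```json
POST _analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown Foxes!"
}
```

The response lists four tokens, the, quick, brown, and foxes: the text has been split on word boundaries, lowercased, and stripped of punctuation.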

An analyzer in Elasticsearch comprises three main components:

  • Character filters
  • Tokenizer
  • Token filters

When an analyzer receives text data, it first preprocesses it with zero or more character filters. It then passes the result to exactly one tokenizer, which converts the text into individual tokens (words). After tokenization, the analyzer runs zero or more token filters, which can modify tokens (e.g., lowercasing), delete tokens (e.g., removing stopwords), or add tokens (e.g., injecting synonyms).
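This pipeline is easy to observe because the _analyze API also accepts the three components directly. Below is a minimal sketch; the sample text is arbitrary, and the html_strip, standard, lowercase, and stop components are all built in:

```json
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<b>The Quick Brown Foxes</b>"
}
```

The character filter strips the <b> tags, the tokenizer splits the remaining text into The, Quick, Brown, and Foxes, and the token filters lowercase each token and drop the stopword the, leaving quick, brown, and foxes.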

Elasticsearch provides built-in analyzers: predefined combinations of character filters, a tokenizer, and token filters that can be used out of the box without creating or configuring anything. Alternatively, Elasticsearch lets us create our own custom analyzer, as sketched below, from the appropriate combination of:

  • Zero or more character filters
  • A tokenizer
  • Zero or more token filters
[Figure: Analyzer workflow]
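A custom analyzer is declared under an index's analysis settings. The request below is a minimal sketch of one such combination; the index name my_index and the analyzer name my_custom_analyzer are placeholders:

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```

Once the index exists, the analyzer can be referenced by name in a field mapping or tested directly with POST my_index/_analyze.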

Character filters

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. It is used to modify or clean up the input text, for example by removing special characters, converting case, or replacing specific characters or character sequences.

One example of using a character filter is to convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Latin equivalents (0123456789) or to strip HTML elements such as the <b> tag from the stream.
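The digit conversion can be reproduced with the built-in mapping character filter, whose mappings parameter lists character-level substitution rules. The sketch below pairs it with the keyword tokenizer so the whole text comes back as a single token; the sample text is arbitrary:

```json
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0", "١ => 1", "٢ => 2", "٣ => 3", "٤ => 4",
        "٥ => 5", "٦ => 6", "٧ => 7", "٨ => 8", "٩ => 9"
      ]
    }
  ],
  "text": "My license plate is ٢٥٠١٥"
}
```

The single token returned reads My license plate is 25015. The second use case is covered by the built-in html_strip character filter, which removes tags such as <b> before tokenization.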

...