...

Custom Analyzers: Token Filters

Explore the most commonly used built-in token filters.

Overview

Token filters are analysis components that process the tokens generated by a tokenizer. They are responsible for adding, removing, or altering tokens in the token stream. Elasticsearch offers a wide range of token filters, which can be broadly categorized into three types:

  • Normalization filters

  • Stemming filters

  • Miscellaneous filters

Normalization filters

Normalization is the process of transforming words into a standard form, such as removing diacritics from characters or converting all text to lowercase, so that searches return consistent, accurate results regardless of differences in word form.

A normalization token filter standardizes text data to improve the quality of search and analysis. It performs tasks such as converting all characters to lowercase, removing diacritics, or replacing non-ASCII characters with their ASCII equivalents.
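
As a quick illustration, the _analyze API lets us run sample text through normalization filters without creating an index. The following request is a minimal sketch that chains the built-in lowercase and asciifolding filters; the sample text is an illustrative assumption:

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Déjà VU"
}
```

The response contains the tokens "deja" and "vu": the lowercase filter folds the case, and the asciifolding filter then strips the diacritics.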

Commonly used normalization filters

Here is a list of common normalization filters used in Elasticsearch:

  • Lowercase and uppercase token filters: They change the token text to lowercase or uppercase. For example, the lowercase filter can change THE Lazy DoG to the lazy dog.

  • ASCII folding token filter: This token filter converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block to their ASCII equivalents, whenever such equivalents exist. For example, the filter will replace "à" with "a".

  • Keep types token filter: It keeps or removes tokens of a specific type. For example, we can use this filter to change 3 quick foxes to quick foxes by keeping only <ALPHANUM> (alphanumeric) tokens.

  • Keyword marker token filter: It marks terms as keywords so that stemming filters later in the analysis chain do not modify them. For example, suppose we analyze the text jumping dogs and want to stem every token except dogs. To achieve this, we can place a keyword marker token filter, configured to match the dogs token and mark it as a keyword, before the stemmer token filter (see the sketch after this list).

  • Trim token filter: It removes the leading and trailing whitespace from tokens. For example, it converts the token " Elasticsearch " to "Elasticsearch".
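
To make the keyword marker behavior concrete, the _analyze API can run text through an inline filter chain without creating an index. The following request is a minimal sketch; the sample text, the keywords list, and the reliance on the stemmer filter's default English stemming are illustrative assumptions:

```
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": ["dogs"]
    },
    "stemmer"
  ],
  "text": "jumping dogs"
}
```

The response contains the tokens "jump" and "dogs": the keyword marker filter runs first and flags dogs as a keyword, so the stemmer that follows skips it while still reducing "jumping" to "jump".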

Example

The following request tests a custom analyzer, which uses a whitespace ...