Custom Analyzer: Tokenizers

Explore the most commonly used built-in tokenizers.

Overview

A tokenizer is an essential component of the analyzer that receives a stream of characters as input, breaks it down into individual tokens (usually individual words), and outputs a stream of tokens.

Elasticsearch provides several built-in tokenizers that can be used for different types of text analysis. Here are some of the most commonly used ones:

  • Standard tokenizer

  • Whitespace tokenizer

  • Keyword tokenizer

  • Character group tokenizer

  • N-gram tokenizer

  • Edge n-gram tokenizer

  • Path hierarchy tokenizer

Standard tokenizer

This is one of the most commonly used tokenizers in Elasticsearch. The standard tokenizer splits the input text into individual tokens whenever it encounters a non-letter character, such as whitespace or punctuation, and discards most punctuation. Under the hood, it finds word boundaries using the Unicode Text Segmentation algorithm.

For example, a standard tokenizer will break down the text "QUICK Brown-Foxes" into the following tokens:

["QUICK", "Brown", "Foxes"]

Whitespace tokenizer

The whitespace tokenizer breaks down text into terms whenever it encounters a whitespace character.

For example, the whitespace tokenizer will break down the text "QUICK Brown-Foxes" into the following tokens:

["QUICK", "Brown-foxes"]

Keyword tokenizer

The keyword tokenizer accepts the input text and outputs the exact text as a single term. The keyword tokenizer is useful for fields that require exact matches, such as IDs, zip codes, or product codes.

For example, the keyword tokenizer will receive the text "QUICK Brown-Foxes" and produce it as a single token as follows:

["QUICK Brown-Foxes"]

Character group tokenizer

The character group tokenizer splits the text into tokens whenever it encounters a character that belongs to a specified set of characters.

The character group tokenizer can be created by defining a custom tokenizer and setting the following parameters:

  • tokenize_on_chars: The list of characters on which the text is split; whenever one of these characters is encountered, a new token is started.
...