Custom Analyzer: Tokenizers

Explore the most commonly used built-in tokenizers.

Overview

A tokenizer is an essential component of the analyzer that receives a stream of characters as input, breaks it down into individual tokens (usually individual words), and outputs a stream of tokens.

Elasticsearch provides several built-in tokenizers that can be used for different types of text analysis. Here are some of the most commonly used ones:

  • Standard tokenizer

  • Whitespace tokenizer

  • Keyword tokenizer

  • Character group tokenizer

  • N-gram tokenizer

  • Edge n-gram tokenizer

  • Path hierarchy tokenizer

Standard tokenizer

This is one of the most commonly used tokenizers in Elasticsearch. The standard tokenizer splits the input text into individual tokens whenever it encounters a non-letter character, such as whitespace or punctuation, and discards most punctuation. Under the hood, it finds word boundaries using the Unicode Text Segmentation algorithm.

For example, a standard tokenizer will break down the text "QUICK Brown-Foxes" into the following tokens:

["QUICK", "Brown", "Foxes"]

Whitespace tokenizer

The whitespace tokenizer breaks down text into terms whenever it encounters a whitespace character.

For example, the whitespace tokenizer will break down the text "QUICK Brown-Foxes" into the following tokens:

["QUICK", "Brown-foxes"]

Keyword tokenizer

The keyword tokenizer accepts the input text and outputs the exact text as a single term. The keyword tokenizer is useful for fields that require exact matches, such as IDs, zip codes, or product codes.

For example, the keyword tokenizer will receive the text "QUICK Brown-Foxes" and produce it as a single token as follows:

["QUICK Brown-Foxes"]

Character group tokenizer

The character group tokenizer splits the text into tokens whenever it encounters a character that belongs to a specified set of characters.

The character group tokenizer can be created by defining a custom tokenizer and setting the following parameters:

  • tokenize_on_chars: The list of characters on which the text is split; whenever one of these characters is encountered, a new token is started.
...