Tokenization

Tokenization is the act of taking a sentence or phrase and splitting it into smaller units of language, called tokens. It is the first step of document indexing in the Meilisearch engine, and is a critical factor in the quality of search results. Breaking sentences into smaller chunks requires understanding where one word ends and another begins, making tokenization a highly complex and language-dependent task. Meilisearch’s solution to this problem is a modular tokenizer that follows different processes, called pipelines, based on the language it detects. This allows Meilisearch to function in several different languages with zero setup.

Deep dive: The Meilisearch tokenizer

When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called the tokenizer. The tokenizer is responsible for splitting each field by writing system (for example, Latin alphabet, Chinese hanzi). It then applies the corresponding pipeline to each part of each document field. We can break down the tokenization process like so:

1. Crawl the document(s), splitting each field by script.
2. Go back over the documents part-by-part, running the corresponding tokenization pipeline, if it exists.

Pipelines include many language-specific operations. Currently, we have a number of pipelines, including a default pipeline for languages that use whitespace to separate words, and dedicated pipelines for Chinese, Japanese, Hebrew, Thai, and Khmer. For more details, check out the tokenizer contribution guide.
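The two-step process described above can be illustrated with a small sketch. This is not Meilisearch's actual tokenizer code: the function names (`script_of`, `split_by_script`, `tokenize`), the `PIPELINES` registry, and the character-by-character stand-in for Chinese segmentation are all hypothetical, and script detection here is only a rough guess based on Unicode character names. It assumes that parts without a dedicated pipeline fall back to a whitespace-style default.

```python
import re
import unicodedata


def script_of(ch: str) -> str:
    """Rough script guess from the Unicode character name (an approximation)."""
    if not ch.isalpha():
        # Spaces, digits, punctuation, and combining marks belong to no script here.
        return "Common"
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "Han"
    return name.split(" ")[0].title() if name else "Unknown"


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Step 1: split a field into consecutive parts that share a script."""
    runs: list[list[str]] = []  # each entry is [script, text]
    for ch in text:
        script = script_of(ch)
        if runs and (script == "Common" or script == runs[-1][0]):
            runs[-1][1] += ch       # same script, or neutral character: extend
        elif runs and runs[-1][0] == "Common":
            runs[-1][0] = script    # leading neutral characters join the first real script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, part) for script, part in runs]


def whitespace_pipeline(part: str) -> list[str]:
    """Hypothetical default pipeline: lowercase and split on non-word characters."""
    return [word.lower() for word in re.findall(r"\w+", part)]


def per_character_pipeline(part: str) -> list[str]:
    """Naive placeholder for a dedicated pipeline; real segmenters use dictionaries."""
    return [ch for ch in part if ch.isalpha()]


# Hypothetical registry mapping a detected script to its dedicated pipeline.
PIPELINES = {"Han": per_character_pipeline}


def tokenize(text: str) -> list[str]:
    """Step 2: run the matching pipeline over each part, else the default one."""
    tokens: list[str] = []
    for script, part in split_by_script(text):
        pipeline = PIPELINES.get(script, whitespace_pipeline)
        tokens.extend(pipeline(part))
    return tokens


print(tokenize("Hello 世界"))  # → ['hello', '世', '界']
```

Keeping the pipelines in a registry keyed by script mirrors the modular design described above: supporting another language means adding one entry, while the splitting step stays unchanged.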
