# Language

> Meilisearch is compatible with datasets in any language. It features optimized tokenization for many language families and supports multilingual semantic search through embedding models.

Keyword search in Meilisearch relies on its tokenizer, [Charabia](https://github.com/meilisearch/charabia), which provides optimized segmentation and normalization for a wide range of languages and scripts.

## Supported languages

The following table lists all languages and scripts with dedicated tokenization support in Charabia:

| Language / Script                                               | Segmentation                            | Normalization                                             |
| --------------------------------------------------------------- | --------------------------------------- | --------------------------------------------------------- |
| **Latin** (English, French, Spanish, Italian, Portuguese, etc.) | CamelCase segmentation                  | Decomposition, lowercase, nonspacing-marks removal        |
| **German**                                                      | CamelCase + compound word decomposition | Same as Latin                                             |
| **Swedish**                                                     | Same as Latin                           | Decomposition, lowercase, specialized normalization       |
| **Greek**                                                       | Default                                 | Decomposition, lowercase, final sigma handling            |
| **Cyrillic / Georgian** (Russian, Ukrainian, Bulgarian, etc.)   | Default                                 | Decomposition, lowercase                                  |
| **Armenian**                                                    | Default                                 | Decomposition, lowercase                                  |
| **Arabic**                                                      | Article (ال) segmentation               | Decomposition, digit conversion, nonspacing-marks removal |
| **Persian**                                                     | Specialized segmentation                | Decomposition, normalization                              |
| **Hebrew**                                                      | Default                                 | Decomposition, nonspacing-marks removal                   |
| **Turkish**                                                     | Default                                 | Specialized case folding (dotted/dotless i)               |
| **Chinese (CMN)**                                               | jieba-based dictionary segmentation     | Decomposition, kvariant conversion                        |
| **Japanese**                                                    | lindera IPA dictionary segmentation     | Decomposition                                             |
| **Korean**                                                      | lindera KO dictionary segmentation      | Decomposition                                             |
| **Thai**                                                        | Dictionary-based segmentation           | Decomposition, nonspacing-marks removal                   |
| **Khmer**                                                       | Dictionary-based segmentation           | Decomposition                                             |

Languages not listed above still work with Meilisearch. Any language that separates words with whitespace is handled by the default pipeline, the same one used for Latin scripts. For unlisted languages that do not put spaces between words, keyword results may be less relevant because word boundaries cannot be detected as reliably.
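To make the Latin row of the table more concrete, here is a minimal Python sketch of CamelCase segmentation followed by decomposition, nonspacing-mark removal, and lowercasing. It only illustrates the idea with Python's standard library. Charabia itself is written in Rust and its real pipeline does more than this, and the function names below are hypothetical.

```python
import re
import unicodedata

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def segment(text: str) -> list[str]:
    """Split text into words, then split each word on CamelCase boundaries."""
    # Compose first so accented letters are single code points matched by \w.
    composed = unicodedata.normalize("NFC", text)
    tokens: list[str] = []
    for word in re.findall(r"\w+", composed):
        tokens.extend(_CAMEL_BOUNDARY.split(word))
    return tokens


def normalize(token: str) -> str:
    """Decompose, drop nonspacing marks (category Mn), then lowercase."""
    decomposed = unicodedata.normalize("NFKD", token)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.lower()


def tokenize(text: str) -> list[str]:
    """Illustrative Latin pipeline: segmentation, then normalization."""
    return [normalize(token) for token in segment(text)]


if __name__ == "__main__":
    print(tokenize("Crème brûlée MeiliSearch"))
    # ['creme', 'brulee', 'meili', 'search']
```

With this kind of normalization, a query for `creme` can match a document containing `Crème`, which is the practical effect of the "Normalization" column.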

We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please [open an issue in the Meilisearch repository](https://github.com/meilisearch/meilisearch/issues/new/choose).

[Read more about our tokenizer](/capabilities/indexing/advanced/tokenization)

## Multilingual hybrid search

Meilisearch's keyword-based search relies on Charabia for tokenization, but [hybrid search](/capabilities/hybrid_search/getting_started) and [semantic search](/capabilities/hybrid_search/overview) use embedding models that can handle languages independently of the tokenizer.

Many embedding providers offer multilingual models that work out of the box across dozens of languages, and in some cases more than 100:

| Provider                                                               | Multilingual model                                            | Dimensions                     |
| ---------------------------------------------------------------------- | ------------------------------------------------------------- | ------------------------------ |
| [Cohere](/capabilities/hybrid_search/how_to/configure_cohere_embedder) | `embed-v4.0`                                                  | 256, 512, 1,024, or 1,536      |
| [Cohere](/capabilities/hybrid_search/how_to/configure_cohere_embedder) | `embed-multilingual-v3.0`                                     | 1,024                          |
| [Voyage AI](/capabilities/hybrid_search/providers/voyage)              | `voyage-4`                                                    | 256, 512, 1,024, or 2,048      |
| [Jina](/capabilities/hybrid_search/providers/jina)                     | `jina-embeddings-v4`                                          | 128, 256, 512, 1,024, or 2,048 |
| [AWS Bedrock](/capabilities/hybrid_search/providers/bedrock)           | `cohere.embed-v4:0`                                           | 256, 512, 1,024, or 1,536      |
| [Hugging Face](/capabilities/hybrid_search/providers/huggingface)      | `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 384                            |
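As a rough sketch of how one of these models might be wired in, the snippet below builds (but does not send) a settings request that registers the Hugging Face model from the table as an embedder. The embedder name `multilingual`, the `title` and `description` fields in the document template, and the helper name are assumptions for illustration. Check the embedder settings reference for the options your Meilisearch version supports.

```python
import json
import urllib.request


def build_embedder_settings_request(
    base_url: str, index_uid: str, api_key: str
) -> urllib.request.Request:
    """Build a PATCH /indexes/{index_uid}/settings request (not sent here)."""
    settings = {
        "embedders": {
            # "multilingual" is a hypothetical embedder name chosen for this example.
            "multilingual": {
                "source": "huggingFace",
                "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
                # Assumes documents have `title` and `description` fields.
                "documentTemplate": "{{doc.title}}. {{doc.description}}",
            }
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/indexes/{index_uid}/settings",
        data=json.dumps(settings).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


if __name__ == "__main__":
    request = build_embedder_settings_request(
        "http://localhost:7700", "products", "MASTER_KEY"
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```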

Using a multilingual embedding model allows you to:

* **Search across languages**: a query in English can match documents written in French, German, or Japanese.
* **Simplify multilingual indexing**: instead of creating one index per language, a single index with a multilingual embedder can serve multiple languages.
* **Complement keyword search**: combine Charabia's keyword tokenization with semantic embeddings in hybrid search for the best of both approaches.

For multilingual datasets, consider using [hybrid search](/capabilities/hybrid_search/getting_started) with a multilingual embedder alongside [localized attributes](/reference/api/settings/get-localizedattributes) for keyword matching. This gives you accurate tokenization per language for keyword search and cross-language understanding for semantic search.
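The combination described above could look roughly like the following sketch, which builds a `localizedAttributes` settings payload and a hybrid search request body. The `*_ja` and `*_fr` field suffixes, the embedder name `multilingual` (assumed to be configured already), and the helper names are hypothetical. Adapt the patterns and locale codes to your own schema.

```python
def localized_attribute_settings() -> dict:
    """Settings payload telling the tokenizer which language each field uses."""
    return {
        "localizedAttributes": [
            # Assumes fields such as `title_ja` and `title_fr` exist in the documents.
            {"attributePatterns": ["*_ja"], "locales": ["jpn"]},
            {"attributePatterns": ["*_fr"], "locales": ["fra"]},
        ]
    }


def hybrid_search_body(query: str, locale: str, semantic_ratio: float = 0.5) -> dict:
    """Search body mixing keyword and semantic results for a query in `locale`."""
    if not 0.0 <= semantic_ratio <= 1.0:
        raise ValueError("semantic_ratio must be between 0 and 1")
    return {
        "q": query,
        # Hint for keyword tokenization of the query itself.
        "locales": [locale],
        "hybrid": {
            "embedder": "multilingual",  # hypothetical embedder name
            "semanticRatio": semantic_ratio,
        },
    }


if __name__ == "__main__":
    print(localized_attribute_settings())
    print(hybrid_search_body("ramen", "jpn", semantic_ratio=0.7))
```

The first payload would be sent to the index settings route and the second to the index search route. A higher `semanticRatio` weights the embedder more heavily, while a lower one favors keyword matches.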

<Note>
  For guidance on structuring multilingual datasets, see [Handling multilingual datasets](/capabilities/indexing/how_to/handle_multilingual_data).
</Note>

## Improving our language support

While we have employees from all over the world at Meilisearch, we don't speak every language. We rely almost entirely on feedback from external contributors to understand how our engine is performing across different languages.

If you'd like to request optimized support for a language, please upvote the related [discussion in our product repository](https://github.com/meilisearch/product/discussions?discussions_q=label%3Ascope%3Atokenizer+) or [open a new one](https://github.com/meilisearch/product/discussions/new?category=feedback-feature-proposal) if it doesn't exist.

If you'd like to help by developing a tokenizer pipeline yourself: first of all, thank you! We recommend that you take a look at the [tokenizer contribution guide](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md) before making a PR.

## FAQ

### What do you mean when you say Meilisearch offers *optimized* support for a language?

Optimized support for a language means Meilisearch has implemented internal processes specifically tailored to parsing that language, leading to more relevant results. This includes specialized segmentation (how text is split into words) and normalization (how characters are standardized for matching).
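Turkish casing is a small example of why language-specific handling matters. The Python sketch below shows the problem and one naive workaround. It is not how Charabia implements its Turkish normalizer, and `turkish_lower` is a hypothetical helper.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for dotted and dotless i."""
    # In Turkish, "İ" (U+0130) lowercases to "i" and "I" lowercases to "ı" (U+0131).
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


if __name__ == "__main__":
    # Generic lowercasing maps "I" to "i", which is wrong for Turkish words.
    print("ISPARTA".lower())         # isparta
    print(turkish_lower("ISPARTA"))  # ısparta

    # Generic lowercasing turns "İ" into "i" plus a combining dot (two code points).
    print(len("\u0130".lower()))     # 2
    print(turkish_lower("\u0130STANBUL"))  # istanbul
```

Without this kind of handling, a lowercase Turkish query may fail to match the same word written in uppercase in a document.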

### My language does not use whitespace to separate words. Can I still use Meilisearch?

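Yes. If your language appears in the table above (for example Chinese, Japanese, Thai, or Khmer), Charabia segments it with a dedicated dictionary. For unlisted languages, keyword results may be less relevant than for optimized ones. In either case, you can use [hybrid search](/capabilities/hybrid_search/getting_started) with a multilingual embedding model to get strong semantic results regardless of tokenization support.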

### My language does not use the Roman alphabet. Can I still use Meilisearch?

Yes. Charabia supports many non-Latin scripts including Cyrillic, Greek, Arabic, Hebrew, Armenian, Thai, Chinese, Japanese, and Korean. Multilingual embedding models also work across all writing systems.

### Does Meilisearch plan to support additional languages in the future?

Yes, we definitely do. The more feedback we get from native speakers, the easier it is for us to understand how to improve performance for those languages. Similarly, the more requests we get to improve support for a specific language, the more likely we are to devote resources to that project.
