Meilisearch is multilingual and works with datasets in any language. Its tokenizer, Charabia, provides optimized segmentation and normalization for a wide range of languages and scripts.

Supported languages

The following table lists all languages and scripts with dedicated tokenization support in Charabia:
| Language / Script | Segmentation | Normalization |
|---|---|---|
| Latin (English, French, Spanish, Italian, Portuguese, etc.) | CamelCase segmentation | Decomposition, lowercase, nonspacing-marks removal |
| German | CamelCase + compound word decomposition | Same as Latin |
| Swedish | Default | Specialized normalization; decomposition, lowercase |
| Greek | Default | Decomposition, lowercase, final sigma handling |
| Cyrillic / Georgian (Russian, Ukrainian, Bulgarian, etc.) | Default | Decomposition, lowercase |
| Armenian | Default | Decomposition, lowercase |
| Arabic | Article (ال) segmentation | Decomposition, digit conversion, nonspacing-marks removal |
| Persian | Specialized segmentation | Decomposition, normalization |
| Hebrew | Default | Decomposition, nonspacing-marks removal |
| Turkish | Default | Specialized case folding (dotted/dotless i) |
| Chinese (CMN) | jieba-based dictionary segmentation | Decomposition, kvariant conversion |
| Japanese | lindera IPA dictionary segmentation | Decomposition |
| Korean | lindera KO dictionary segmentation | Decomposition |
| Thai | Dictionary-based segmentation | Decomposition, nonspacing-marks removal |
| Khmer | Dictionary-based segmentation | Decomposition |
Languages not listed above still work with Meilisearch. Any language that uses whitespace to separate words benefits from the default Latin pipeline; results may be less relevant for unlisted languages that do not use spaces between words.

We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or in the way your documents are processed, please open an issue in the Meilisearch repository.

Multilingual embedding models

Meilisearch's keyword-based search relies on Charabia for tokenization, but hybrid search and semantic search use embedding models that handle languages independently of the tokenizer. Many embedding providers offer multilingual models that work across 100+ languages out of the box:
| Provider | Multilingual model | Dimensions |
|---|---|---|
| Cohere | embed-multilingual-v3.0 | 1024 |
| Cohere | embed-multilingual-light-v3.0 | 384 |
| Voyage AI | voyage-multilingual-2 | 1024 |
| AWS Bedrock | cohere.embed-multilingual-v3 | 1024 |
| Hugging Face | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 384 |
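As an illustration, such a model can be declared through an index's embedder settings. The sketch below builds a payload using the Hugging Face model from the table; the embedder name `multilingual`, the index name `movies`, and the document template are illustrative assumptions, so check the embedder settings reference for your Meilisearch version before using it.

```python
import json

# Hypothetical embedder configuration using the multilingual
# sentence-transformers model listed above (384 dimensions).
embedder_settings = {
    "multilingual": {
        "source": "huggingFace",
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        # Template describing which document fields get embedded (assumed).
        "documentTemplate": "A movie titled '{{doc.title}}'",
    }
}

# This payload would be sent to the embedder settings endpoint, e.g.:
#   PATCH http://localhost:7700/indexes/movies/settings/embedders
print(json.dumps(embedder_settings, indent=2))
```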
Using a multilingual embedding model allows you to:
  • Search across languages: a query in English can match documents written in French, German, or Japanese.
  • Simplify multilingual indexing: instead of creating one index per language, a single index with a multilingual embedder can serve multiple languages.
  • Complement keyword search: combine Charabia’s keyword tokenization with semantic embeddings in hybrid search for the best of both approaches.
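To combine the two approaches, a search request can include a `hybrid` object naming a configured embedder together with a `semanticRatio` that blends keyword scoring (0.0) with semantic scoring (1.0). A minimal sketch of such a payload, assuming an embedder named `multilingual` and an equal 0.5 blend:

```python
import json

# Hypothetical hybrid search request: semanticRatio 0.5 weighs
# keyword and semantic relevance equally.
search_request = {
    "q": "space adventure",
    "hybrid": {
        "embedder": "multilingual",  # name of a configured embedder (assumed)
        "semanticRatio": 0.5,
    },
}

# Would be POSTed to e.g. http://localhost:7700/indexes/movies/search
print(json.dumps(search_request))
```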
For multilingual datasets, consider using hybrid search with a multilingual embedder alongside localized attributes for keyword matching. This gives you accurate tokenization per language for keyword search and cross-language understanding for semantic search.
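For the keyword side of that setup, localized attributes map attribute name patterns to specific languages so the tokenizer can pick the right pipeline per field. The sketch below shows one possible shape of that setting; the field-name suffixes and ISO-639-3 locale codes are illustrative assumptions:

```python
import json

# Hypothetical localized-attributes setting: fields matching each
# pattern are tokenized with the listed language's pipeline.
localized_attributes = [
    {"attributePatterns": ["*_en"], "locales": ["eng"]},
    {"attributePatterns": ["*_ja"], "locales": ["jpn"]},
    {"attributePatterns": ["*_fr"], "locales": ["fra"]},
]

# Would be sent to the index's localized-attributes settings endpoint.
print(json.dumps(localized_attributes))
```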
For guidance on structuring multilingual datasets, see Handling multilingual datasets.

Improving our language support

While Meilisearch has employees from all over the world, we don't speak every language. We rely almost entirely on feedback from external contributors to understand how our engine performs across different languages.

If you'd like to request optimized support for a language, please upvote the related discussion in our product repository, or open a new one if it doesn't exist.

If you'd like to help by developing a tokenizer pipeline yourself: first of all, thank you! We recommend reading the tokenizer contribution guide before making a PR.

FAQ

What do you mean when you say Meilisearch offers optimized support for a language?

Optimized support for a language means Meilisearch has implemented internal processes specifically tailored to parsing that language, leading to more relevant results. This includes specialized segmentation (how text is split into words) and normalization (how characters are standardized for matching).
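As a rough illustration (not Charabia's actual code), the Latin pipeline's "decomposition, lowercase, nonspacing-marks removal" steps can be sketched with Python's `unicodedata` module:

```python
import unicodedata

def normalize_latin(text: str) -> str:
    """Sketch of the Latin normalization steps described above."""
    decomposed = unicodedata.normalize("NFKD", text)  # decomposition
    lowered = decomposed.lower()                      # lowercase
    # Strip nonspacing (combining) marks such as accents.
    return "".join(c for c in lowered if not unicodedata.combining(c))

print(normalize_latin("Crème Brûlée"))  # -> creme brulee
```

After normalization, a query for "creme" matches a document containing "Crème", which is why these steps improve relevance for accented Latin-script languages.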

My language does not use whitespace to separate words. Can I still use Meilisearch?

Yes. Keyword search results may be less relevant than for languages with optimized segmentation. However, you can use hybrid search with a multilingual embedding model to get strong semantic results regardless of tokenization support.

My language does not use the Roman alphabet. Can I still use Meilisearch?

Yes. Charabia supports many non-Latin scripts including Cyrillic, Greek, Arabic, Hebrew, Armenian, Thai, Chinese, Japanese, and Korean. Multilingual embedding models also work across all writing systems.

Does Meilisearch plan to support additional languages in the future?

Yes, we definitely do. The more feedback we get from native speakers, the easier it is for us to understand how to improve performance for those languages. Similarly, the more requests we get to improve support for a specific language, the more likely we are to devote resources to that project.