String
String is the primary type for indexing data in Meilisearch. It lets you create the content in which to search. Strings are processed as detailed below.
String tokenization is the process of splitting a string into a list of individual terms called tokens. A string is passed to a tokenizer and is then broken into separate string tokens. A token is a word.
Tokenization
Tokenization relies on two main processes to identify words and separate them into tokens: separators and dictionaries.
Separators
Separators are characters that indicate where one word ends and another word begins. In languages using the Latin alphabet, for example, words are usually delimited by white space. In Japanese, word boundaries are more commonly indicated in other ways, such as appending particles like に and で to the end of a word.
There are two kinds of separators in Meilisearch: soft and hard. Hard separators signal a significant context switch such as a new sentence or paragraph. Soft separators only delimit one word from another but do not imply a major change of subject.
The list below presents some of the most common separators in languages using the Latin alphabet:
- Soft spaces (distance: 1): whitespaces, quotes, '-' | '_' | '\'' | ':' | '/' | '\\' | '@' | '"' | '+' | '~' | '=' | '^' | '*' | '#'
- Hard spaces (distance: 8): '.' | ';' | ',' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '|'
Dictionaries
For the tokenization process, dictionaries are lists of groups of characters which should be considered as a single term. Dictionaries are particularly useful when identifying words in languages like Japanese, where words are not always marked by separator tokens. Meilisearch comes with a number of general-use dictionaries for its officially supported languages. When working with documents containing many domain-specific terms, such as legal documents or academic papers, providing a custom dictionary may improve search result relevancy.
Distance
Distance plays an essential role in determining whether documents are relevant, since one of the ranking rules is the proximity rule. The proximity rule sorts the results by increasing distance between matched query terms. Thus, two words separated by a soft space are closer, and considered more relevant, than two words separated by a hard space.
After the tokenizing process, each word is indexed and stored in the global dictionary of the corresponding index.
Examples
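To demonstrate how a string is split by separators, suppose the input string contains the words Bruce, Willis, Vin, and Diesel, in that order, with a soft space between Bruce and Willis, a hard space between Willis and Vin, and a soft space between Vin and Diesel.
The distance between Bruce and Willis is equal to 1. The distance between Vin and Diesel is also 1. However, the distance between Willis and Vin is equal to 8. The same calculations apply to Bruce and Diesel (10), Bruce and Vin (9), and Willis and Diesel (9).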
Let’s see another example. Given two documents, 001 and 002, and a search for Bruce Willis, 002 will be the first document returned, and 001 will be the second one. This happens because the proximity distance between Bruce and Willis is equal to 2 in document 002, whereas the distance between Bruce and Willis is equal to 8 in document 001, since the full-stop character . is a hard space.
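The distances in the two examples above can be reproduced with a short Python sketch. It is a simplified model and not Meilisearch's actual tokenizer: it assumes that a run of consecutive separators costs 8 if it contains any hard separator and 1 otherwise. The sample strings are hypothetical inputs chosen only because they match the distances quoted in the text.

```python
# Separator sets copied from the list above; whitespace also counts as soft.
SOFT = set("-_':/\\@\"+~=^*#")
HARD = set(".;,!?()[]{}|")
SOFT_DISTANCE, HARD_DISTANCE = 1, 8


def word_positions(text):
    """Return (word, position) pairs; positions grow by the separator cost."""
    positions = []
    position = 0
    word = ""
    gap = 0  # cost of the separator run seen since the previous word
    for ch in text + " ":  # the trailing space flushes the last word
        if ch.isspace() or ch in SOFT or ch in HARD:
            if word:
                positions.append((word, position))
                word = ""
            # Assumption: one hard separator makes the whole run "hard".
            gap = max(gap, HARD_DISTANCE if ch in HARD else SOFT_DISTANCE)
        else:
            if not word:  # first character of a new word
                if positions:
                    position += gap
                gap = 0
            word += ch
    return positions


def distance(text, first, second):
    """Distance between two words of `text` (last occurrence of each)."""
    lookup = dict(word_positions(text))
    return abs(lookup[second] - lookup[first])


sample = "Bruce Willis,Vin Diesel"  # hypothetical input string
print(distance(sample, "Bruce", "Willis"))  # 1 (soft space)
print(distance(sample, "Willis", "Vin"))    # 8 (hard space)
print(distance(sample, "Bruce", "Diesel"))  # 10

# Hypothetical stand-ins for documents 001 and 002.
print(distance("Bruce.Willis", "Bruce", "Willis"))        # 8
print(distance("Bruce super Willis", "Bruce", "Willis"))  # 2
```

Sorting documents by this value in increasing order puts the second stand-in document ahead of the first, which is the ordering the proximity rule produces in the example.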
Numeric
A numeric type (integer, float) is converted to a human-readable decimal number string representation. Numeric types can be searched as they are converted to strings.
You can add custom ranking rules to create an ascending or descending sorting rule on a given attribute that has a numeric value in the documents.
You can also create filters. The >, >=, <, <=, and TO relational operators apply only to numerical values.
Boolean
A Boolean value, which is either true or false, is received and converted to lowercase human-readable text (true and false). Booleans can be searched as they are converted to strings.
null
The null type can be pushed into Meilisearch, but it won’t be taken into account for indexing.
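The conversions described for numeric, Boolean, and null values can be summarized in a small Python sketch. The function name `to_indexed_text` and the sample document are hypothetical, and the exact string Meilisearch produces for a given number may differ from Python's `str()`; this only illustrates the rules stated above.

```python
import json
from typing import Optional, Union

Scalar = Union[int, float, bool, str, None]


def to_indexed_text(value: Scalar) -> Optional[str]:
    """Return the searchable text for a scalar, or None if it is not indexed."""
    if value is None:
        return None  # null is accepted but ignored for indexing
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)  # decimal string representation
    return value  # strings are indexed as they are


# Hypothetical document, parsed from JSON as it would be sent to the engine.
document = json.loads('{"title": "Example", "rating": 4.5, "available": true, "note": null}')
for field, value in document.items():
    print(field, "->", to_indexed_text(value))
```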
Array
An array is an ordered list of values. These values can be of any type: number, string, boolean, object, or even other arrays. Meilisearch flattens arrays and concatenates them into strings. Non-string values are converted as described in this article’s previous sections.
Objects
When a document field contains an object, Meilisearch flattens it and brings the object’s keys and values to the root level of the document itself. Keep in mind that the flattened objects represented in this section are an intermediary snapshot of internal processes. When searching, the returned document will keep its original structure.
Flattening applies at any nesting depth. For example, a patient_name key may contain an object, and so may an address key containing home and work addresses, each of which is itself an object.
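The flattening of nested objects can be sketched in Python as follows. This is an illustration under stated assumptions, not Meilisearch's internal code: it assumes flattened keys are built by joining nested keys with a dot, and every value and sub-key in the sample document (`forename`, `surname`, `city`) is hypothetical.

```python
def flatten(obj, prefix=""):
    """Bring nested keys to the root level, joining key paths with dots."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into nested objects
        else:
            flat[path] = value
    return flat


# Hypothetical document with objects nested two levels deep.
document = {
    "id": 0,
    "patient_name": {"forename": "Ada", "surname": "Lovelace"},
    "address": {
        "home": {"city": "Paris"},
        "work": {"city": "Lyon"},
    },
}

for key, value in flatten(document).items():
    print(key, "=", value)
```

The original `document` is left untouched, mirroring the point that returned documents keep their original structure.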
Nested document querying and subdocuments
Meilisearch has no concept of subdocuments and cannot perform nested document querying. When an appointments array of objects is flattened, the relationship between an appointment’s date and doctor is lost. As a result, a query can return both patient 0 and patient 1: Meilisearch cannot return only patients who had an appointment with Jester Lavorre on 2022-01-01. Instead, it returns patients who had an appointment with Jester Lavorre, and patients who had an appointment on 2022-01-01.
The best way to work around this limitation is reformatting your data. The above example could be fixed by merging appointment data into a new appointmentsMerged field so the relationship between appointment and doctor remains intact.
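The workaround can be sketched in Python before the documents are sent for indexing. The patient records, the `date` and `doctor` key names, the flattened key names, and the "date doctor" string layout are all assumptions made for illustration. The membership checks are a stand-in for a search, not Meilisearch's matching logic.

```python
def flatten_appointments(patient):
    """Mimic flattening: one list per key, so date/doctor pairing is lost."""
    return {
        "appointments.date": [a["date"] for a in patient["appointments"]],
        "appointments.doctor": [a["doctor"] for a in patient["appointments"]],
    }


def merge_appointments(patient):
    """Replace `appointments` with strings that keep each date and doctor together."""
    merged = dict(patient)
    appointments = merged.pop("appointments")
    merged["appointmentsMerged"] = [f"{a['date']} {a['doctor']}" for a in appointments]
    return merged


# Hypothetical dataset: only patient 0 saw Jester Lavorre on 2022-01-01.
patients = [
    {"id": 0, "appointments": [{"date": "2022-01-01", "doctor": "Jester Lavorre"}]},
    {"id": 1, "appointments": [
        {"date": "2022-01-01", "doctor": "Doctor B"},
        {"date": "2022-01-02", "doctor": "Jester Lavorre"},
    ]},
]

# On flattened data both patients match, because the pairing is gone.
loose = [
    p["id"] for p in patients
    if "2022-01-01" in flatten_appointments(p)["appointments.date"]
    and "Jester Lavorre" in flatten_appointments(p)["appointments.doctor"]
]

# On merged data only the patient with that exact appointment matches.
strict = [
    p["id"] for p in patients
    if "2022-01-01 Jester Lavorre" in merge_appointments(p)["appointmentsMerged"]
]

print(loose)   # [0, 1]
print(strict)  # [0]
```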
Possible tokenization issues
Even if it behaves exactly as expected, the tokenization process may lead to counterintuitive results in some cases. For example, full stops (.) within a string will be considered as hard spaces, and 10,3 will be broken into two strings, 10 and 3, instead of being processed as a numeric type.
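Both effects can be seen with a minimal separator-based splitter in Python. It only splits on whitespace and the separator characters listed earlier in this article. It does not model dictionaries or any other step of the real tokenizer, and the helper name `tokens` is hypothetical.

```python
import re

# Soft and hard separators from the list earlier in this article.
SEPARATORS = "-_':/\\@\"+~=^*#" + ".;,!?()[]{}|"

# One or more separators (or whitespace) in a row mark a token boundary.
SPLIT_PATTERN = re.compile(
    "[" + "".join(re.escape(c) for c in SEPARATORS) + r"\s]+"
)


def tokens(text):
    """Split `text` on separators and drop empty fragments."""
    return [part for part in SPLIT_PATTERN.split(text) if part]


print(tokens("10,3"))   # ['10', '3']: the comma splits the number
print(tokens("a.b.c"))  # ['a', 'b', 'c']: full stops split the string
```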