Data types
This article explains how Meilisearch handles the different types of data in your dataset.
The behavior described here concerns only Meilisearch's internal processes and can be helpful in understanding how the tokenizer works. Document fields remain unchanged for most practical purposes not related to Meilisearch's inner workings.
String
String is the primary type for indexing data in Meilisearch. Strings provide the content that is searched. They are processed as detailed below.
String tokenization is the process of splitting a string into a list of individual terms that are called tokens.
A string is passed to a tokenizer and is then broken into separate string tokens. A token is a word.
Tokenization
Tokenization relies on two main mechanisms to identify words and separate them into tokens: separators and dictionaries.
Separators
Separators are characters that indicate where one word ends and another word begins. In languages using the Latin alphabet, for example, words are usually delimited by white space. In Japanese, word boundaries are more commonly indicated in other ways, such as appending particles like に and で to the end of a word.
There are two kinds of separators in Meilisearch: soft and hard. Hard separators signal a significant context switch such as a new sentence or paragraph. Soft separators only delimit one word from another but do not imply a major change of subject.
The list below presents some of the most common separators in languages using the Latin alphabet:
- Soft spaces (distance: 1): whitespaces, quotes,
'-' | '_' | '\'' | ':' | '/' | '\\' | '@' | '"' | '+' | '~' | '=' | '^' | '*' | '#'
- Hard spaces (distance: 8):
'.' | ';' | ',' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '|'
For more separators, including those used in other writing systems like Cyrillic and Thai, consult this exhaustive list.
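As a rough illustration of the separator rules above, the sketch below labels each piece of a string as a word, a soft separator, or a hard separator. It is not Meilisearch's tokenizer: the function name tokenize and the two sets are assumptions made for this example, and they cover only the Latin-alphabet separators listed above.

```python
# Illustrative only: a partial copy of the separators listed above,
# not Meilisearch's full separator list.
SOFT = set(" \t\n-_':/\\@\"+~=^*#")
HARD = set(".;,!?()[]{}|")


def tokenize(text):
    """Return (kind, value) pairs where kind is 'word', 'soft', or 'hard'."""
    tokens = []
    word = ""
    for ch in text:
        if ch in SOFT or ch in HARD:
            if word:  # a separator closes the word in progress
                tokens.append(("word", word))
                word = ""
            tokens.append(("hard" if ch in HARD else "soft", ch))
        else:
            word += ch
    if word:  # flush the last word
        tokens.append(("word", word))
    return tokens


print(tokenize("Bruce Willis,Vin Diesel"))
```

The comma is reported as a hard separator and the spaces as soft ones, which is the distinction the proximity rule relies on.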
Dictionaries
For the tokenization process, dictionaries are lists of groups of characters which should be treated as a single term. Dictionaries are particularly useful when identifying words in languages like Japanese, where words are not always marked by separator tokens.
Meilisearch comes with a number of general-use dictionaries for its officially supported languages. When working with documents containing many domain-specific terms, such as legal documents or academic papers, providing a custom dictionary may improve search result relevancy.
Distance
Distance plays an essential role in determining whether documents are relevant since one of the ranking rules is the proximity rule. The proximity rule sorts the results by increasing distance between matched query terms. As a result, two words separated by a soft space are closer, and thus considered more relevant, than two words separated by a hard space.
After the tokenizing process, each word is indexed and stored in the global dictionary of the corresponding index.
Examples
To demonstrate how a string is split by space, let's say you have the following string as an input:
"Bruce Willis,Vin Diesel"
In the example above, the distance between Bruce and Willis is equal to 1. The distance between Vin and Diesel is also 1. However, the distance between Willis and Vin is equal to 8. The same calculations apply to Bruce and Diesel (10), Bruce and Vin (9), and Willis and Diesel (9).
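The distances quoted above can be reproduced with a small sketch that assigns each word a position, adding 1 for a soft space and 8 for a hard space. The names word_positions and distance are hypothetical, the separator sets are a partial copy of the list in this article, and the rule that the largest separator in a run wins is an assumption of this sketch rather than documented behavior.

```python
# Separator costs follow the list in this article; both sets are partial.
SOFT = set(" -_':/\\@\"+~=^*#")
HARD = set(".;,!?()[]{}|")


def word_positions(text):
    """Map each word to a position, adding 1 per soft gap and 8 per hard gap."""
    positions = {}
    next_pos = 0  # position the next word will receive
    gap = 0       # cost of the separator run since the previous word
    word = ""
    for ch in text + " ":  # the trailing space flushes the final word
        cost = 8 if ch in HARD else 1 if ch in SOFT else 0
        if cost == 0:
            if not word:  # first character of a new word: apply the pending gap
                next_pos += gap
                gap = 0
            word += ch
        else:
            if word:
                positions[word] = next_pos  # repeated words keep their last position
                word = ""
            gap = max(gap, cost)  # assumption: the largest separator in a run wins
    return positions


def distance(text, first, second):
    positions = word_positions(text)
    return abs(positions[first] - positions[second])


text = "Bruce Willis,Vin Diesel"
print(word_positions(text))             # {'Bruce': 0, 'Willis': 1, 'Vin': 9, 'Diesel': 10}
print(distance(text, "Willis", "Vin"))  # 8
```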
Let's see another example. Given two documents:
[
{
"movie_id": "001",
"description": "Bruce.Willis"
},
{
"movie_id": "002",
"description": "Bruce super Willis"
}
]
When making a query on Bruce Willis, 002 will be the first document returned, and 001 will be the second one. This happens because the proximity distance between Bruce and Willis is equal to 2 in document 002, whereas the distance between Bruce and Willis is equal to 8 in document 001, since the full-stop character . is a hard space.
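The ordering in this example can be checked with a short sketch that computes the distance between the two query terms in each description and sorts by it. This only imitates the proximity rule in isolation: positions and proximity are hypothetical helpers, the regular expression covers just the Latin-alphabet separators listed earlier, and the real ranking pipeline applies other rules as well.

```python
import re

HARD = set(".;,!?()[]{}|")
# One or more soft or hard separators; the capturing group keeps them in the split.
SEPARATOR_RUN = re.compile(r"([\s\-_':/\\@\"+~=^*#.;,!?()\[\]{}|]+)")


def positions(text):
    """Map each word to its position: +1 after a soft run, +8 after a hard run."""
    pos, out = 0, {}
    # With a capturing group, re.split alternates: word, separators, word, ...
    for i, part in enumerate(SEPARATOR_RUN.split(text)):
        if i % 2 == 0:
            if part:
                out[part] = pos
        else:
            pos += 8 if HARD & set(part) else 1
    return out


def proximity(doc, first, second):
    p = positions(doc["description"])
    return abs(p[first] - p[second])


documents = [
    {"movie_id": "001", "description": "Bruce.Willis"},
    {"movie_id": "002", "description": "Bruce super Willis"},
]

ranked = sorted(documents, key=lambda d: proximity(d, "Bruce", "Willis"))
print([d["movie_id"] for d in ranked])  # ['002', '001']
```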
Numeric
A numeric type (integer, float) is converted to a human-readable decimal number string representation. Numeric types can be searched as they are converted to strings.
You can add custom ranking rules to create an ascending or descending sorting rule on a given attribute that has a numeric value in the documents.
You can also create filters. The >, >=, <, <=, and TO relational operators apply only to numerical values.
Boolean
A Boolean value, which is either true or false, is received and converted to a lowercase human-readable text (true and false). Booleans can be searched as they are converted to strings.
null
The null type can be pushed into Meilisearch but it won't be taken into account for indexing.
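The scalar conversions described in the Numeric, Boolean, and null sections can be approximated as follows. This is a sketch under assumptions: searchable_text is a hypothetical helper, and Python's json.dumps stands in for Meilisearch's own number formatting, which may differ in detail.

```python
import json


def searchable_text(value):
    """Approximate the text that would be indexed for a scalar field value."""
    if value is None:
        return None  # null is accepted but not indexed
    if isinstance(value, (bool, int, float)):
        # json.dumps yields lowercase true/false and decimal number strings
        return json.dumps(value)
    return str(value)


for value in (42, 3.5, True, False, None, "Bruce"):
    print(repr(value), "->", repr(searchable_text(value)))
```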
Array
An array is an ordered list of values. These values can be of any type: number, string, boolean, object, or even other arrays.
Meilisearch flattens arrays and concatenates them into strings. Non-string values are converted as described in this article's previous sections.
Example
The following input:
[
[
"Bruce Willis",
"Vin Diesel"
],
"Kung Fu Panda"
]
Will be processed as if all elements were arranged at the same level:
"Bruce Willis. Vin Diesel. Kung Fu Panda."
Once the above array has been flattened, it will be parsed exactly as explained in the string example.
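A minimal sketch of this flattening step is shown below. The helper names flatten_array and array_to_text are hypothetical, and joining elements with a full stop and a space is inferred from the example output above rather than from Meilisearch's source. The sketch handles scalars and nested arrays only, not objects.

```python
import json


def flatten_array(values):
    """Recursively bring every element of nested lists to a single level."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(flatten_array(value))
        elif value is not None:  # null is not indexed
            # non-string scalars are turned into text, e.g. 7 -> "7", True -> "true"
            flat.append(value if isinstance(value, str) else json.dumps(value))
    return flat


def array_to_text(values):
    """Join flattened elements so each one ends with a hard separator."""
    return " ".join(item + "." for item in flatten_array(values))


print(array_to_text([["Bruce Willis", "Vin Diesel"], "Kung Fu Panda"]))
# Bruce Willis. Vin Diesel. Kung Fu Panda.
```

Because each element ends with a full stop, words from different elements end up at least one hard space apart.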
Objects
When a document field contains an object, Meilisearch flattens it and brings the object's keys and values to the root level of the document itself.
Keep in mind that the flattened objects represented here are an intermediary snapshot of internal processes. When searching, the returned document will keep its original structure.
In the example below, the patient_name key contains an object:
{
"id": 0,
"patient_name": {
"forename": "Imogen",
"surname": "Temult"
}
}
During indexing, Meilisearch uses dot notation to eliminate nested fields:
{
"id": 0,
"patient_name.forename": "Imogen",
"patient_name.surname": "Temult"
}
Using dot notation, no information is lost when flattening nested objects, regardless of nesting depth.
Imagine that the example document above includes an additional object, address, containing home and work addresses, each of which is itself an object. After flattening, the document would look like this:
{
"id": 0,
"patient_name.forename": "Imogen",
"patient_name.surname": "Temult",
"address.home.street": "Largo Isarco, 2",
"address.home.postcode": "20139",
"address.home.city": "Milano",
"address.work.street": "Ca' Corner Della Regina, 2215",
"address.work.postcode": "30135",
"address.work.city": "Venezia"
}
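Dot-notation flattening of nested objects can be sketched with a short recursive function. The name flatten is hypothetical and the code only mirrors the intermediary snapshot shown above; it does not handle arrays of objects.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single level using dot-separated keys."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse, carrying the path so far
        else:
            flat[path] = value
    return flat


document = {
    "id": 0,
    "patient_name": {"forename": "Imogen", "surname": "Temult"},
    "address": {
        "home": {"street": "Largo Isarco, 2", "postcode": "20139", "city": "Milano"},
        "work": {"street": "Ca' Corner Della Regina, 2215", "postcode": "30135", "city": "Venezia"},
    },
}

for key, value in flatten(document).items():
    print(key, "=", value)
```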
Meilisearch's internal flattening process also eliminates nesting in arrays of objects. In this case, values are grouped by key. Consider the following document:
{
"id": 0,
"patient_name": "Imogen Temult",
"appointments": [
{
"date": "2022-01-01",
"doctor": "Jester Lavorre",
"ward": "psychiatry"
},
{
"date": "2019-01-01",
"doctor": "Dorian Storm"
}
]
}
After flattening, it would look like this:
{
"id": 0,
"patient_name": "Imogen Temult",
"appointments.date": [
"2022-01-01",
"2019-01-01"
],
"appointments.doctor": [
"Jester Lavorre",
"Dorian Storm"
],
"appointments.ward": [
"psychiatry"
]
}
Once all objects inside a document have been flattened, Meilisearch will continue processing it as described in the previous sections. For example, arrays will be flattened, and numeric and boolean values will be turned into strings.
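The grouping of values by key for arrays of objects can be sketched as below. flatten_document is a hypothetical helper that reproduces the flattened appointments example; edge cases such as arrays nested inside those objects are handled naively.

```python
def flatten_document(doc):
    """Flatten objects with dot notation; group values from arrays of objects by key."""
    flat = {}

    def visit(value, path, in_array):
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key, in_array)
        elif isinstance(value, list) and any(isinstance(v, dict) for v in value):
            for item in value:  # every object in the array shares the same path
                visit(item, path, True)
        elif in_array:
            flat.setdefault(path, []).append(value)  # collect values key by key
        else:
            flat[path] = value

    visit(doc, "", False)
    return flat


patient = {
    "id": 0,
    "patient_name": "Imogen Temult",
    "appointments": [
        {"date": "2022-01-01", "doctor": "Jester Lavorre", "ward": "psychiatry"},
        {"date": "2019-01-01", "doctor": "Dorian Storm"},
    ],
}

print(flatten_document(patient))
```

Note that appointments.ward ends up with a single value, so nothing in the flattened form records which appointment it belonged to.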
Nested document querying and subdocuments
Meilisearch has no concept of subdocuments and cannot perform nested document querying. In the previous example, the relationship between an appointment's date and doctor is lost when flattening the appointments array:
…
"appointments.date": [
"2022-01-01",
"2019-01-01"
],
"appointments.doctor": [
"Jester Lavorre",
"Dorian Storm"
],
…
This may lead to unexpected behavior during search. The following dataset shows two patients and their respective appointments:
[
{
"id": 0,
"patient_name": "Imogen Temult",
"appointments": [
{
"date": "2022-01-01",
"doctor": "Jester Lavorre"
}
]
},
{
"id": 1,
"patient_name": "Caleb Widowgast",
"appointments": [
{
"date": "2022-01-01",
"doctor": "Dorian Storm"
},
{
"date": "2023-01-01",
"doctor": "Jester Lavorre"
}
]
}
]
The following query returns patients 0 and 1:
curl \
-X POST 'http://localhost:7700/indexes/movie_ratings/search' \
-H 'Content-Type: application/json' \
--data-binary '{
"q": "",
"filter": "(appointments.date = 2022-01-01 AND appointments.doctor = \"Jester Lavorre\")"
}'
Meilisearch is unable to return only patients who had an appointment with Jester Lavorre on 2022-01-01. Instead, it returns every patient who had an appointment with Jester Lavorre and an appointment on 2022-01-01, even if these were two different appointments.
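The behavior can be reproduced outside Meilisearch by evaluating the same two conditions against the flattened form of each document. The matches helper is hypothetical and simply tests whether a value appears in a flattened field. It is a simplified model of filter evaluation, not the engine's implementation.

```python
# Flattened form of the two patient documents, written out by hand.
flattened = [
    {
        "id": 0,
        "appointments.date": ["2022-01-01"],
        "appointments.doctor": ["Jester Lavorre"],
    },
    {
        "id": 1,
        "appointments.date": ["2022-01-01", "2023-01-01"],
        "appointments.doctor": ["Dorian Storm", "Jester Lavorre"],
    },
]


def matches(doc, field, value):
    """True if the flattened field contains the value anywhere."""
    return value in doc.get(field, [])


hits = [
    doc["id"]
    for doc in flattened
    if matches(doc, "appointments.date", "2022-01-01")
    and matches(doc, "appointments.doctor", "Jester Lavorre")
]
print(hits)  # [0, 1]
```

Patient 1 matches because each condition is satisfied by a different appointment.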
The best way to work around this limitation is reformatting your data. The above example could be fixed by merging appointment data in a new appointmentsMerged field so the relationship between an appointment's date and doctor remains intact:
[
{
"id": 0,
"patient_name": "Imogen Temult",
"appointmentsMerged": [
"2022-01-01 Jester Lavorre"
]
},
{
"id": 1,
"patient_name": "Caleb Widowgast",
"appointmentsMerged": [
"2023-01-01 Jester Lavorre",
"2022-01-01 Dorian Storm"
]
}
]
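One way to produce the reformatted dataset before sending it to Meilisearch is a small preprocessing step such as the one below. merge_appointments is a hypothetical helper, and the date-then-doctor string format simply follows the example above.

```python
def merge_appointments(patient):
    """Replace the appointments array with one 'date doctor' string per appointment."""
    merged = {k: v for k, v in patient.items() if k != "appointments"}
    merged["appointmentsMerged"] = [
        f"{a['date']} {a['doctor']}" for a in patient.get("appointments", [])
    ]
    return merged


patients = [
    {
        "id": 0,
        "patient_name": "Imogen Temult",
        "appointments": [{"date": "2022-01-01", "doctor": "Jester Lavorre"}],
    },
    {
        "id": 1,
        "patient_name": "Caleb Widowgast",
        "appointments": [
            {"date": "2022-01-01", "doctor": "Dorian Storm"},
            {"date": "2023-01-01", "doctor": "Jester Lavorre"},
        ],
    },
]

reformatted = [merge_appointments(p) for p in patients]

# Each string keeps date and doctor together, so an exact match is unambiguous.
wanted = "2022-01-01 Jester Lavorre"
print([p["id"] for p in reformatted if wanted in p["appointmentsMerged"]])  # [0]
```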
Possible tokenization issues
Even if it behaves exactly as expected, the tokenization process may lead to counterintuitive results in some cases, such as:
"S.O.S"
"George R. R. Martin"
10,3
For the first two strings, the full stops . will be considered as hard spaces.
10,3 will be broken into two strings, 10 and 3, instead of being processed as a numeric type.
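These cases can be seen by splitting the three inputs on the Latin-alphabet separators listed earlier in this article. The words helper is hypothetical and much simpler than the real tokenizer. It only shows where the word boundaries fall.

```python
import re

# Any run of characters that is neither a soft nor a hard separator.
WORD = re.compile(r"[^\s\-_':/\\@\"+~=^*#.;,!?()\[\]{}|]+")


def words(text):
    """Return the words left after removing soft and hard separators."""
    return WORD.findall(text)


print(words("S.O.S"))                # ['S', 'O', 'S']
print(words("George R. R. Martin"))  # ['George', 'R', 'R', 'Martin']
print(words("10,3"))                 # ['10', '3']
```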