How to Make a Search Engine in Python: Step-by-Step Tutorial
Learn how to easily make a search engine in Python in this detailed step-by-step tutorial.

You can make a search engine in Python using a combination of data structures, algorithms, and libraries to index, rank, and retrieve information based on your search query input.
A Python search engine is built around these key steps:
- Collect and preprocess data
- Create and index documents
- Add a search system
- Rank results
A search engine built with Python suits both small and large enterprises that opt for open-source solutions offering scalability and flexibility.
Because these search engines are easy to customize, they can be applied in many areas, such as e-commerce, research, marketplaces, enterprise search, and more.
Let’s examine in more detail the different steps to building a simple search engine from scratch with Python.
1. Collect and preprocess data
Data collection can be done in several ways. You may need to web scrape content from HTML webpages with packages such as Beautiful Soup, or directly connect your script to Google Sheets using gspread.
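If your content lives in a Google Sheet, a minimal gspread sketch could look like the following (the credentials file and the 'Articles' spreadsheet name are placeholders, not values defined by this tutorial):
import gspread

# Authenticate with a Google service account (the file path is a placeholder)
gc = gspread.service_account(filename="service_account.json")

# Open the spreadsheet and read the first worksheet as a list of dicts
sheet = gc.open("Articles").sheet1
rows = sheet.get_all_records()

for row in rows[:5]:
    print(row)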
Perhaps you already have a database and only need to use a PostgreSQL Python connector to access it and query the data.
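In that case, a minimal sketch with the psycopg2 connector could look like this (the connection details and the documents table are placeholder assumptions):
import psycopg2

# Connect to a local PostgreSQL instance (credentials are placeholders)
conn = psycopg2.connect(
    host="localhost",
    dbname="mydb",
    user="postgres",
    password="secret",
)

# Query the rows you want to index
with conn.cursor() as cur:
    cur.execute("SELECT id, title, body FROM documents LIMIT 100;")
    rows = cur.fetchall()

conn.close()
print(f"Fetched {len(rows)} rows")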
The Python code for data collection can vary a lot depending on your needs, but here is an example of how to use Beautiful Soup for web scraping. First, you need to install the package:
pip install beautifulsoup4
Here’s the code example:
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract data (adjust the selectors to the structure of the target page)
for item in soup.select('.item', limit=5):
    title = item.find('h2').text
    link = item.find('a')['href']
    print(f"{title}: {link}")
Once the data is collected, it needs to be preprocessed. This step can occur before and after document indexation. Several preprocessing trials may be needed to ensure the dataset is correctly indexed and ready for optimal retrieval.
For instance, text sources that contain emojis, emails, and source links can be cleaned beforehand to avoid adding unnecessary information to the system.
In paragraphs, punctuation and stop words can be removed, and the sentences can be converted to lowercase.
With Python, several packages can be utilised for data parsing, cleaning, and preprocessing. Let’s take a look at the NLTK library and use it to remove emojis, emails, and punctuation:
First, install the package:
pip install nltk
Now you can try the following script:
import re
import string

import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer data on first run (newer NLTK versions may ask for 'punkt_tab')
nltk.download('punkt')

# Sample text with punctuation, emojis, and emails
text = """
Hello! 😊 This is a test test@example.com. Can you remove this? 👍
Also, check info@example.org!
"""

def clean_text(text):
    # Step 1: Remove emails
    text = re.sub(r'\S+@\S+', '', text)

    # Step 2: Remove emojis and symbols
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # Emoticons
        "\U0001F300-\U0001F5FF"  # Symbols & pictographs
        "\U0001F680-\U0001F6FF"  # Transport & map symbols
        "\U0001F700-\U0001F77F"  # Alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "\U000024C2-\U0001F251"
        "]+",
        flags=re.UNICODE
    )
    text = emoji_pattern.sub('', text)

    # Step 3: Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Step 4: Tokenize and rejoin (optional, removes extra whitespace)
    tokens = word_tokenize(text)
    cleaned_text = ' '.join(tokens)

    return cleaned_text

# Clean the text and make it lowercase
cleaned_text = clean_text(text).lower()

print("Original Text:", text)
print("\nCleaned Text:", cleaned_text)
In some cases, Natural Language Processing (NLP) could be required. Take, for instance, a list of companies like "Impossible Foods" and "Impossible Foods Co." Both refer to the same company, so you can use NLP to vectorize the names and merge them into a single canonical term based on their cosine similarity.
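As an illustration, here is a minimal sketch using scikit-learn's TfidfVectorizer with character n-grams; the company names and the 0.8 threshold are purely illustrative:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical company names; "Impossible Foods" appears under two spellings
company_names = ["Impossible Foods", "Impossible Foods Co.", "Beyond Meat"]

# Vectorize with character n-grams so small spelling differences still overlap
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectors = vectorizer.fit_transform(company_names)

# Pairwise cosine similarity between every pair of names
similarities = cosine_similarity(vectors)

# Map each name to the first earlier name it closely matches (0.8 is an arbitrary threshold)
threshold = 0.8
canonical = {}
for i, name in enumerate(company_names):
    canonical[name] = next(
        (company_names[j] for j in range(i) if similarities[i, j] >= threshold),
        name,
    )

print(canonical)
# Expected: both "Impossible Foods" variants map to the same canonical name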
Sometimes, excessive preprocessing can be a problem, resulting in information loss. Therefore, the best approach is to start with simple steps.
Once the documents are indexed, they can be updated with new preprocessed information.
2. Create and index documents
Documents are units of information (e.g., text, JSON, images, or structured/unstructured data) processed and stored in an index.
This step involves gathering the data sources you want to index in your vector database and converting them to documents.
For instance, if your input is in JSON format, you can use the LangChain Python framework to convert it directly into a list of documents.
First, you need to install the LangChain community package (which provides the document loaders) along with jq, which JSONLoader uses to parse the jq_schema:
pip install langchain-community jq
Then, import the JSONLoader class and run the following script:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./my_data.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
The output should be a list of documents that looks something like this (the source metadata will contain the path to your JSON file):
[Document(page_content='Bye!', metadata={'source': '/path/to/my_data.json', 'seq_num': 1}), Document(page_content='Hello', metadata={'source': '/path/to/my_data.json', 'seq_num': 2}), Document(page_content='See you later', metadata={'source': '/path/to/my_data.json', 'seq_num': 3})]
Once the documents are created, they can be added to a vector database such as Chroma.
Meilisearch’s Python SDK simplifies the process by eliminating the need to convert your source data into documents or search for a database solution. You can directly add the JSON or CSV files to an index on Meilisearch’s vector database.
Like LangChain, you first need to install the Meilisearch package on your machine:
pip install meilisearch
Then, instantiate a client pointing at your Meilisearch instance and create an index with this simple command:
import meilisearch

# Connect to your Meilisearch instance (the URL and API key are placeholders)
client = meilisearch.Client('http://localhost:7700', 'masterKey')

client.create_index('books', {'primaryKey': 'id'})
To add the documents to the index, you can use the JSON format like this:
client.index('books').add_documents([{
    'id': 287947,
    'title': 'Super Gut',
    'author': 'Dr. William Davis',
}])
With the same package, you can also update the documents, apply filters, and delete them by simply changing the function:
# apply filter
client.index('books').update_filterable_attributes(['author'])

# update documents
client.index('books').update_documents(<list_of_documents>)

# delete documents
client.index('books').delete_all_documents()
With Meilisearch's Python SDK, multiple indexes can be added, and all functions are easy to follow and implement.
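For example, a second index can be created and populated the same way (the 'authors' index and its document below are purely illustrative):
# Create a second index with its own primary key
client.create_index('authors', {'primaryKey': 'id'})

# Add documents to it, just like for the 'books' index
client.index('authors').add_documents([
    {'id': 1, 'name': 'Dr. William Davis'},
])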
Some examples are available on our GitHub repository; alternatively, you can also refer to the API documentation for more information.
3. Add a search system
If you’re using the LangChain approach with a custom vector database, you must embed your documents with Deep Learning (DL) algorithms. This creates a vector representation of your data, allowing for vector search, hybrid search, semantic search, and more.
Several embedding models are available on Hugging Face and through the OpenAI API.
Let’s, for instance, use the OpenAI embedding model with LangChain and Chroma as a vector database. You first need to install these packages:
pip install langchain-chroma
pip install langchain-openai
Export your OpenAI API key (as the OPENAI_API_KEY environment variable) and add the following:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# `documents` is the list of Document objects created earlier
db = Chroma.from_documents(documents, OpenAIEmbeddings())
The command above embeds the documents with the OpenAIEmbeddings class and creates an index in the Chroma database. Now you can query the db instance:
query = "Find a book about Nutrition" docs = db.similarity_search(query) print(docs[0].page_content)
All the above steps can be collapsed into one with Meilisearch’s Python SDK.
There’s no need to add embeddings or to find a package for your vector database. All you need is to search directly on the index previously created with the following function:
client.index('books').search('Find a book about Nutrition')
And it doesn’t stop there: you can add filters like this:
client.index('books').search('nutrition', {
    'filter': ['author = "Dr. William Davis"']
})
Or create a faceted search:
client.index('books').search('nutrition', {
    'facets': ['author']
})
You can experiment with the API using other search options, such as specifying the number of documents to retrieve, querying by locale, or implementing hybrid search.
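For example, limiting the number of returned documents is just another search parameter (the value 5 here is arbitrary):
# Return at most 5 matching documents
client.index('books').search('nutrition', {
    'limit': 5
})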
4. Rank results
Ranking results can involve complex Machine Learning (ML) algorithms, but these are usually built into the Python libraries, so the good news is that you don’t need to create them from scratch.
For instance, Chroma uses an Approximate Nearest Neighbor (ANN) algorithm called Hierarchical Navigable Small World (HNSW) to find similar documents.
If you want to retrieve the documents together with their similarity scores and order them, you can run the following:
results = db.similarity_search_with_score(query="Find a book about Nutrition")

for doc, score in results:
    print(f"{score}: {doc.page_content}")
However, this approach offers fairly limited control over ranking. A more straightforward option is to use Meilisearch ranking rules.
By default, these are the ranking rules that can be tweaked:
- "words”: Sorts results by decreasing number of matched query terms
- "typo": Sorts results by increasing number of typos
- "proximity": Sorts results by increasing distance between matched query terms
- "attribute": Sorts results based on the attribute ranking order
- "sort": Sorts results based on parameters decided at query time
- "exactness": Sorts results based on the similarity of the matched words with the query words.
We can already see that the ranking mechanism can go beyond simple similarity. To rank the results, all you need to do is change the order of these rules according to your needs:
client.index('books').update_ranking_rules([
    'typo',
    'words',
    'sort',
    'proximity',
    'attribute',
    'exactness',
    'release_date:asc',
    'rank:desc'
])
You can now search with a limited number of results (limit), and their relevance will follow the updated ranking order.
This function is much simpler to implement and takes into account many other rules. Meilisearch streamlines the ranking process without the need to explore multiple libraries or create ranking algorithms from scratch.
Can I make a search engine in Python for free?
You can use Python frameworks, such as LangChain, paired with open-source vector databases like Chroma. However, this strategy has drawbacks, such as limited ranking mechanisms, and requires additional steps and preprocessing.
For ease of implementation, more customization, and fast document retrieval, the best approach is to use self-hosted Meilisearch. You can run it on your own machine or on a VPS, which comes at a price.
You can also have free access to Meilisearch Cloud with a 14-day trial.
What are the best open-source search engines for Python?
The best open-source search engines have comprehensive documentation and a large community of developers who share their issues and achievements.
Open-source search engine tools should also be easy to set up and provide examples for the community. These are the three main open-source platforms that support Python:
Meilisearch
Meilisearch is an open-source, lightning-fast search engine designed for developers and enterprises seeking to embed intuitive, scalable search experiences into applications through its RESTful API.
Focusing on simplicity and performance, it provides advanced features such as typo tolerance and faceted search.
The documentation is clear, easy to follow, and has examples. There’s a Discord group for developers to share their work or find solutions, and a well-structured GitHub repository.
Qdrant
Qdrant is an open-source vector database and vector search engine built in Rust. It efficiently handles similarity searches on high-dimensional vectors, making it ideal for tasks like recommendation systems, semantic search, and anomaly detection.
Qdrant’s RESTful API supports multiple languages, including Python. The documentation is vast and can get overwhelming when you need to find the right steps to build a Python search engine. However, it also provides code examples, a GitHub repository, and a Discord community.
Elasticsearch
Elasticsearch is an open-source, distributed search and analytics engine with a scalable data store and vector database for various use cases.
The Python client for Elasticsearch is well documented and provides the right tutorials to start building a search engine seamlessly.
They have a GitHub repository where you can find examples and more information about the Python SDK, as well as an issue tracker to report problems.
Elasticsearch also provides a Python DSL module that aims to help with writing and running queries against Elasticsearch in a more convenient and idiomatic way.
What programming languages besides Python are used to build search engines?
Python is not the only programming language that allows you to build AI-powered search engines. Some common programming languages you can use are:
- JavaScript: Learn how to build a search engine in JavaScript.
- PHP: Learn how to build a search engine in PHP.
- Golang: Learn how to build a search engine with Golang.
High-performance Python search engines with Meilisearch
While setting up the Python search engine, we realised the importance of having a unified package that comes with its own vector database, seamlessly embeds the documents, and provides tools for easily filtering and ranking results.
Using multiple libraries and Python frameworks can be overwhelming, leading to more problems than solutions. The frameworks can be limited in what you are allowed to do, which may force you to switch to another one or build from scratch, both of which can be time-consuming and resource-intensive.
Clarity, good documentation, and ease of use are the keys to building high-performing Python search engines. You shouldn’t have to master several frameworks and libraries, or dig for information in countless forums and YouTube videos.