How to Make a Search Engine in Python: Step-by-Step Tutorial
Learn how to easily make a search engine in Python in this detailed step-by-step tutorial.

You can make a search engine in Python using a combination of data structures, algorithms, and libraries to index, rank, and retrieve information based on your search query input.
A Python search engine is built around these key steps:
- Collect and preprocess data
- Create and index documents
- Add a search system
- Rank results
A search engine built with Python suits both small and large enterprises that opt for open-source solutions offering scalability and flexibility.
Because these search engines are easy to customize, they can be applied in many areas, such as e-commerce, research, marketplaces, enterprise search, and more.
Let’s examine in more detail the different steps to building a simple search engine from scratch with Python.
1. Collect and preprocess data
Data collection can be done in several ways. You may need to web scrape content from HTML webpages with packages such as Beautiful Soup, or directly connect your script to Google Sheets using gspread.
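If your content lives in a Google Sheet, a minimal gspread sketch could look like the following (the credentials file and the 'Articles' spreadsheet name are placeholders, not values defined by this tutorial):
import gspread

# Authenticate with a Google service account (the file path is a placeholder)
gc = gspread.service_account(filename="service_account.json")

# Open the spreadsheet and read the first worksheet as a list of dicts
sheet = gc.open("Articles").sheet1
rows = sheet.get_all_records()

for row in rows[:5]:
    print(row)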
Perhaps you already have a database and only need to use a PostgreSQL Python connector to access it and query the data.
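In that case, a minimal sketch with the psycopg2 connector could look like this (the connection details and the documents table are placeholder assumptions):
import psycopg2

# Connect to a local PostgreSQL instance (credentials are placeholders)
conn = psycopg2.connect(
    host="localhost",
    dbname="mydb",
    user="postgres",
    password="secret",
)

# Query the rows you want to index
with conn.cursor() as cur:
    cur.execute("SELECT id, title, body FROM documents LIMIT 100;")
    rows = cur.fetchall()

conn.close()
print(f"Fetched {len(rows)} rows")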
The Python code for data collection can vary a lot depending on your needs, but here is an example of how to use Beautiful Soup for web scraping. First, you need to install the package:
pip install beautifulsoup4
Here’s the code example:
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract data (adjust the selectors to the structure of the target page)
for item in soup.select('.item', limit=5):
    title = item.find('h2').text
    link = item.find('a')['href']
    print(f"{title}: {link}")
Once the data is collected, it needs to be preprocessed. This step can occur before and after document indexation. Several preprocessing trials may be needed to ensure the dataset is correctly indexed and ready for optimal retrieval.
For instance, text sources that contain emojis, emails, and source links can be cleaned beforehand to avoid adding unnecessary information to the system.
In paragraphs, punctuation and stop words can be removed, and the sentences can be converted to lowercase.
With Python, several packages can be utilised for data parsing, cleaning, and preprocessing. Let’s take a look at the NLTK library and use it to remove emojis, emails, and punctuation:
First, install the package:
pip install nltk
Now you can try the following script:
import re
import string

import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer data on first run (newer NLTK versions may ask for 'punkt_tab')
nltk.download('punkt')

# Sample text with punctuation, emojis, and emails
text = """
Hello! 😊 This is a test test@example.com. Can you remove this? 👍
Also, check info@example.org!
"""

def clean_text(text):
    # Step 1: Remove emails
    text = re.sub(r'\S+@\S+', '', text)

    # Step 2: Remove emojis and symbols
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # Emoticons
        "\U0001F300-\U0001F5FF"  # Symbols & pictographs
        "\U0001F680-\U0001F6FF"  # Transport & map symbols
        "\U0001F700-\U0001F77F"  # Alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "\U000024C2-\U0001F251"
        "]+",
        flags=re.UNICODE
    )
    text = emoji_pattern.sub('', text)

    # Step 3: Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Step 4: Tokenize and rejoin (optional, removes extra whitespace)
    tokens = word_tokenize(text)
    cleaned_text = ' '.join(tokens)

    return cleaned_text

# Clean the text and make it lowercase
cleaned_text = clean_text(text).lower()

print("Original Text:", text)
print("\nCleaned Text:", cleaned_text)
In some cases, Natural Language Processing (NLP) could be required. Take, for instance, a list of companies like "Impossible Foods" and "Impossible Foods Co." Both refer to the same company, so you can use NLP to vectorize the names and merge them into a single canonical term based on their cosine similarity.
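As an illustration, here is a minimal sketch using scikit-learn's TfidfVectorizer with character n-grams; the company names and the 0.8 threshold are purely illustrative:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical company names; "Impossible Foods" appears under two spellings
company_names = ["Impossible Foods", "Impossible Foods Co.", "Beyond Meat"]

# Vectorize with character n-grams so small spelling differences still overlap
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectors = vectorizer.fit_transform(company_names)

# Pairwise cosine similarity between every pair of names
similarities = cosine_similarity(vectors)

# Map each name to the first earlier name it closely matches (0.8 is an arbitrary threshold)
threshold = 0.8
canonical = {}
for i, name in enumerate(company_names):
    canonical[name] = next(
        (company_names[j] for j in range(i) if similarities[i, j] >= threshold),
        name,
    )

print(canonical)
# Expected: both "Impossible Foods" variants map to the same canonical name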
Sometimes, excessive preprocessing can be a problem, resulting in information loss. Therefore, the best approach is to start with simple steps.
Once the documents are indexed, they can be updated with new preprocessed information.
2. Create and index documents
Documents are units of information (e.g., text, JSON, images, or structured/unstructured data) processed and stored in an index.
This step involves gathering the data sources you want to index in your vector database and converting them to documents.
For instance, if your input is in JSON format, you can use the LangChain Python framework to convert it directly into a list of documents.
First, you need to install the LangChain community package (which provides the document loaders) along with jq, which JSONLoader uses to parse the jq_schema:
pip install langchain-community jq
Then, import the JSONLoader class and run the following script:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./my_data.json',
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
The output should be a list of documents that looks something like this (the source metadata will contain the path to your JSON file):
[Document(page_content='Bye!', metadata={'source': '/path/to/my_data.json', 'seq_num': 1}), Document(page_content='Hello', metadata={'source': '/path/to/my_data.json', 'seq_num': 2}), Document(page_content='See you later', metadata={'source': '/path/to/my_data.json', 'seq_num': 3})]
Once the documents are created, they can be added to a vector database such as Chroma.
Meilisearch’s Python SDK simplifies the process by eliminating the need to convert your source data into documents or search for a database solution. You can directly add the JSON or CSV files to an index on Meilisearch’s vector database.
Like LangChain, you first need to install the Meilisearch package on your machine:
pip install meilisearch
Then, instantiate a client pointing at your Meilisearch instance and create an index with this simple command:
import meilisearch

# Connect to your Meilisearch instance (the URL and API key are placeholders)
client = meilisearch.Client('http://localhost:7700', 'masterKey')

client.create_index('books', {'primaryKey': 'id'})
To add the documents to the index, you can use the JSON format like this:
client.index('books').add_documents([{
    'id': 287947,
    'title': 'Super Gut',
    'author': 'Dr. William Davis',
}])
With the same package, you can also update the documents, apply filters, and delete them by simply changing the function:
# apply filter
client.index('books').update_filterable_attributes(['author'])

# update documents
client.index('books').update_documents(<list_of_documents>)

# delete documents
client.index('books').delete_all_documents()
With Meilisearch's Python SDK, multiple indexes can be added, and all functions are easy to follow and implement.
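For example, a second index can be created and populated the same way (the 'authors' index and its document below are purely illustrative):
# Create a second index with its own primary key
client.create_index('authors', {'primaryKey': 'id'})

# Add documents to it, just like for the 'books' index
client.index('authors').add_documents([
    {'id': 1, 'name': 'Dr. William Davis'},
])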
Some examples are available on our GitHub repository; alternatively, you can also refer to the API documentation for more information.
3. Add a search system
If you’re using the LangChain approach with a custom vector database, you must embed your documents with Deep Learning (DL) algorithms. This creates a vector representation of your data, allowing for vector search, hybrid search, semantic search, and more.
Several embedding models are available on Hugging Face and through the OpenAI API.
Let’s, for instance, use the OpenAI embedding model with LangChain and Chroma as a vector database. You first need to install these packages:
pip install langchain-chroma
pip install langchain-openai
Export your OpenAI API key (as the OPENAI_API_KEY environment variable) and add the following:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# `documents` is the list of Document objects created earlier
db = Chroma.from_documents(documents, OpenAIEmbeddings())
The command above embeds the documents with the OpenAIEmbeddings class and creates an index in the Chroma database. Now you can query the db instance:
query = "Find a book about Nutrition" docs = db.similarity_search(query) print(docs[0].page_content)
All the above steps can be collapsed into one with Meilisearch’s Python SDK.
There’s no need to add embeddings or to find a package for your vector database. All you need is to search directly on the index previously created with the following function:
client.index('books').search('Find a book about Nutrition')
And it doesn’t stop there: you can add filters like this:
client.index('books').search('nutrition', {
    'filter': ['author = "Dr. William Davis"']
})
Or create a faceted search:
client.index('books').search('nutrition', {
    'facets': ['author']
})
You can experiment with the API using other search options, such as specifying the number of documents to retrieve, querying by locale, or implementing hybrid search.
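For example, limiting the number of returned documents is just another search parameter (the value 5 here is arbitrary):
# Return at most 5 matching documents
client.index('books').search('nutrition', {
    'limit': 5
})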
4. Rank results
Ranking results can involve complex Machine Learning (ML) algorithms, but these are usually built into the Python libraries, so the good news is that you don’t need to create them from scratch.
For instance, Chroma uses an Approximate Nearest Neighbor (ANN) algorithm called Hierarchical Navigable Small World (HNSW) to find similar documents.
If you want to retrieve the documents together with their similarity scores and order them, you can run the following:
results = db.similarity_search_with_score(query="Find a book about Nutrition")

for doc, score in results:
    print(f"{score}: {doc.page_content}")
However, this approach offers fairly limited control over ranking. A more straightforward option is to use Meilisearch ranking rules.
By default, these are the ranking rules that can be tweaked:
- "words”: Sorts results by decreasing number of matched query terms
- "typo": Sorts results by increasing number of typos
- "proximity": Sorts results by increasing distance between matched query terms
- "attribute": Sorts results based on the attribute ranking order
- "sort": Sorts results based on parameters decided at query time
- "exactness": Sorts results based on the similarity of the matched words with the query words.
We can already see that the ranking mechanism can go beyond simple similarity. To rank the results, all you need to do is change the order of these rules according to your needs:
client.index('books').update_ranking_rules([
    'typo',
    'words',
    'sort',
    'proximity',
    'attribute',
    'exactness',
    'release_date:asc',
    'rank:desc'
])
You can now search with a limited number of results (limit), and their relevance will follow the updated ranking order.
This function is much simpler to implement and takes into account many other rules. Meilisearch streamlines the ranking process without the need to explore multiple libraries or create ranking algorithms from scratch.
Can I make a search engine in Python for free?
You can use Python frameworks, such as LangChain, paired with open-source vector databases like Chroma. However, this strategy has drawbacks, such as limited ranking mechanisms, and requires additional steps and preprocessing.
For ease of implementation, more customization, and fast document retrieval, the best approach is to use self-hosted Meilisearch. You can run it on your own machine or on a VPS, which comes at a price.
You can also have free access to Meilisearch Cloud with a 14-day trial.
What are the best open-source search engines for Python?
The best open-source search engines have comprehensive documentation and a large community of developers who share their issues and achievements.
Open-source search engine tools should also be easy to set up and provide examples for the community. These are the three main open-source platforms that support Python:
Meilisearch
Meilisearch is an open-source, lightning-fast search engine designed for developers and enterprises seeking to embed intuitive, scalable search experiences into applications through its RESTful API.
Focusing on simplicity and performance, it provides advanced features such as typo tolerance and faceted search.
The documentation is clear, easy to follow, and has examples. There’s a Discord group for developers to share their work or find solutions, and a well-structured GitHub repository.
Qdrant
Qdrant is an open-source vector database and vector search engine built in Rust. It efficiently handles similarity searches on high-dimensional vectors, making it ideal for tasks like recommendation systems, semantic search, and anomaly detection.
Qdrant’s RESTful API supports multiple languages, including Python. The documentation is vast and can get overwhelming when you need to find the right steps to build a Python search engine. However, it also provides code examples, a GitHub repository, and a Discord community.
Elasticsearch
Elasticsearch is an open-source, distributed search and analytics engine with a scalable data store and vector database for various use cases.
The Python client for Elasticsearch is well documented and provides the right tutorials to start building a search engine seamlessly.
They have a GitHub repository where you can find examples and more information about the Python SDK, as well as an issue tracker to report problems.
Elasticsearch also provides a Python DSL module that aims to help with writing and running queries against Elasticsearch in a more convenient and idiomatic way.
What programming languages besides Python are used to build search engines?
Python is not the only programming language that allows you to build AI-powered search engines. Some common programming languages you can use are:
- JavaScript: Learn how to build a search engine in JavaScript.
- PHP: Learn how to build a search engine in PHP.
- Golang: Learn how to build a search engine with Golang.
High-performance Python search engines with Meilisearch
While setting up the Python search engine, we realised the importance of having a unified package that comes with its own vector database, seamlessly embeds the documents, and provides tools for easily filtering and ranking results.
Using multiple libraries and Python frameworks can be overwhelming, leading to more problems than solutions. The frameworks can be limited in what you are allowed to do, which may force you to switch to another one or build from scratch, both of which can be time-consuming and resource-intensive.
Clarity, good documentation, and ease of use are the keys to building high-performing Python search engines. You shouldn’t have to master several frameworks and libraries, or dig for information in countless forums and YouTube videos.