May 23, 2026

Building a Search Engine (Part 2): TF-IDF

Last time, we implemented the TF algorithm for search engines. A very simple algorithm and very, very intuitive. It's so easy to understand. The higher the frequency of individual words that make up a search query in a particular document, the higher the chances the answer to that query is contained in that document.

However, in this simple definition lies one of the biggest problems with the TF algorithm. In English, and naturally every language, there are words that have almost no meaning to the context of the topic of discussion but yet come up frequently. Words like "the", "and", "or", "in" add no contextual meaning to the subject, but yet we have to use them a lot. So naturally every document contains these words and we end up counting them too. If a user's query contains "the", we might end up returning a document that has almost zero relevance to what they were actually looking for.

We needed a way to penalize and reduce the influence of those words. And that's exactly what the TF-IDF algorithm does: penalize words that are too common.

What is TF-IDF?

TF-IDF has two parts:

TF is still Term Frequency. It's exactly the same as we knew before.
IDF is the part that does the penalizing for us. IDF stands for Inverse Document Frequency and is used to measure the importance of words across a corpus. (A corpus just means a collection of documents.)

Let's explain in simple terms before we dive into the code.

The Box Analogy

Let's assume there are 3 boxes. Those boxes contain different sets of items. Your job is to find a way to uniquely identify each box, think of it as branding each box based on what it contains. Naturally, you want the brand to be distinct.

So if you see a pencil across every box, you don't brand any box as "The Abode of Pencils". Your goal is to look for items that majorly appear in only that one box. So if only Box A contains Nike Shoes, you can brand it "Need a Nike? Look no further" (just me playing games here lol). But you get the point.

You want each box to be known for the rarest items it contains, items that are almost never seen in any other box. You still have an inventory of each box, but you've ensured that the rarest items are more prominent. So for Box A, instead of listing items by the frequency at which they appear, you now list them by how rare they are relative to the other boxes. Box A's inventory would almost definitely show Nike at the top.

If you understood that, you perfectly understand IDF conceptually. It's definitely more complicated and slightly tougher to implement in actual code, but that's the system of thinking behind it.

The IDF Formula

IDF = ln(total number of documents in the corpus / number of documents the word appears in)

Because of situations where the ratio equals 1 and we'd be left with ln(1) = 0, we always add 1 to the result. TF-IDF is simply the term frequency multiplied by the IDF.

How does this do the job?

Let's think of the word "the". It would appear in every document in the corpus. So if we have 100 documents and "the" appears in 90 of them:


ratio = 100 / 90 = 1.11
ln(1.11) = 0.104
IDF("the") = 0.104 + 1 = 1.104

So the TF-IDF of "the" in a document where it appears 40 times is 40 × 1.104 = 44.16. That's almost the same as raw TF. Depending on the values of other words, it will likely rank lower, especially if there are rarer words around.

Now let's take a very rare word that mostly shows up in one document, say, "octopus":


ratio = 100 / 9 = 11.11
ln(11.11) = 2.47
IDF("octopus") = 2.47 + 1 = 3.47

So the TF-IDF of "octopus" in every document will be its term frequency multiplied by nearly 3.5x. It shoots up the inventory. It becomes much more important.

The major goal of TF-IDF is to punish filler words while giving more weight to rare ones, and as we can see, it effectively does exactly that.

Let's Implement It

For clarity and better search, I replaced the documents from Part 1 with new ones that have clear, distinct topics. This makes our search queries way more satisfying to look at.

Setting Up the Documents and Preprocessing


import re
from collections import Counter, defaultdict
import numpy as np

doc_A = """
Machine learning is a subset of artificial intelligence that allows computers 
to learn from data without being explicitly programmed. Algorithms improve 
automatically through experience. Common machine learning techniques include 
neural networks, decision trees, and regression.

Machine learning powers recommendations on Netflix and fraud detection in 
banking. Deep learning, a branch of machine learning, uses neural networks 
with many layers to solve complex problems like image recognition and 
natural language processing.
"""

doc_B = """
Football is a team sport played between two teams of eleven players each. 
The goal is to score by getting the ball into the opposing team's goal. 
Players are not allowed to use their hands except the goalkeeper. Football 
is the most popular sport in the world, with the FIFA World Cup being the 
most watched sporting event globally.

Football players train daily to improve their speed, strength and teamwork. 
Famous football clubs like Real Madrid, Barcelona and Manchester United 
compete in leagues across the world. The football World Cup takes place 
every four years and brings together teams from every nation.
"""

doc_C = """
Photosynthesis is the process by which plants convert sunlight into food. 
Using sunlight, water and carbon dioxide, plants produce glucose and release 
oxygen. Photosynthesis occurs in the chloroplasts of plant cells. Without 
photosynthesis, most life on earth could not survive.

Plants rely on photosynthesis to grow and produce energy. The rate of 
photosynthesis increases with more sunlight and carbon dioxide. Aquatic 
plants and algae also perform photosynthesis, producing much of the 
oxygen found in earth's oceans and atmosphere.
"""


# Strip punctuation and symbols, lowercase everything, and split into tokens
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # keep only letters and spaces
    tokens = text.split()
    return tokens

# Same as preprocess but returns a set, used for IDF calculation
# Sets give us unique words per document and make lookups much faster
def create_set(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = set(text.split())
    return tokens

# Count the frequency of each word in a token list using Counter
# Counter is faster and cleaner than manually iterating with a dict
def frequency_generator(word_list):
    return Counter(word_list)

Computing Term Frequency (TF)


documents = [doc_A, doc_B, doc_C]
doc_names = ["doc_A", "doc_B", "doc_C"]

# Build a TF dict for each document: { "doc_A": Counter({word: count, ...}), ... }
tf = {}
for i in range(3):
    tf[doc_names[i]] = frequency_generator(preprocess(documents[i]))

Computing Inverse Document Frequency (IDF)


# Build a set of unique words for each document
word_set = [create_set(doc) for doc in documents]

# Count how many documents each word appears in
# e.g. word_freq["the"] = 3 means "the" appears in all 3 documents
word_freq = Counter()
for document_set in word_set:
    for word in document_set:
        word_freq[word] += 1

# Apply the IDF formula: ln(total_docs / doc_freq) + 1
# Adding 1 prevents a zero IDF when a word appears in every document
idf = {}
N = len(word_set)  # total number of documents in the corpus
for document_set in word_set:
    for word in document_set:
        idf[word] = np.log(N / word_freq[word]) + 1

Computing TF-IDF


# Multiply TF by IDF for every word in every document
# Higher TF-IDF = word is both frequent in this doc AND rare across the corpus
tf_idf = defaultdict(dict)
for i in range(len(word_set)):
    for word in word_set[i]:
        tf_idf[doc_names[i]][word] = tf[doc_names[i]][word] * idf[word]

Building the Search Function


def search_query(search_word):
    # Clean the query the same way we cleaned the documents
    words = re.sub(r'[^a-zA-Z\s]', '', search_word).lower().split()
    
    # Start every document's score at 0
    scores = {k: 0 for k in tf_idf}
    
    # For each word in the query, add its TF-IDF score to the matching document
    for word in words:
        for doc, word_scores in tf_idf.items():
            if word in word_scores:
                scores[doc] += word_scores[word]
    
    # Return the document with the highest cumulative TF-IDF score
    best_doc = max(scores, key=scores.get)
    return (best_doc, scores[best_doc])

Running Some Queries

As you can now see, we have successfully implemented TF-IDF. Let's run some queries.

Query 1: "What is machine learning?"

If we built a strong search system, it should return doc_A, the document that focuses on machine learning.


print(search_query("What is machine learning?"))
# Output: ('doc_A', <score>)

As expected, it returned doc_A.

Query 2: "How many players are in a football team?"

This should return doc_B.


print(search_query("How many players are in a football team?"))
# Output: ('doc_B', <score>)

And bam, it did. You see how simple and intuitive that was? Very, very easy to understand if you take the time to look at it. The code is a little tougher to implement, but if you understand how the system works, implementing it becomes much easier.

What's Next?

I would have loved to say that this is what most systems use today, but an even better algorithm was created that covers some of the issues we've encountered here. We'll work on that in the next part of this series.

I sincerely hope you had a nice time reading this. Should have been released way earlier but I was so busy this week. Till next time, fellas.