Building a Search Engine (Part 1)
I have always been intrigued by search engines and how they work. My first foray into search was when I implemented a vector search database to add RAG (Retrieval Augmented Generation) into an AI app I built last year. Basically, vector search is semantic search. Semantic means you search based on the meaning of words. It's why you can search for money and Google can bring out search pages that talk about the US dollar. But semantic search is a more recent feature of search popularized thanks to AI systems. The earliest search systems used algorithms that were more focused on just the words themselves. So, in my spare time, I set out to understand three of those algorithms and implement them for myself. In this piece, I'll implement one of them and explain it. I'll write on implementing the others as well. It took a lot of digging and searching for me to get resources on them and I hope someone who just wants to dig into search systems can read this piece and have a good grasp of how search works. To demonstrate the algorithm, I'll be using three text files from the 20 NewsGroup Dataset.
TF(Term Frequency)
Term frequency is the simplest of the lot in my opinion. Simplest to explain and to implement.
Imagine having 3 bags. Each of those bags contains different items that you have. They can contain anything from balls to pens, pencils, anything. Let's assume they are all small items and each box is pretty big. However, you know you will need items from those boxes occasionally and to do that, you would have to dig through each box every time to look for the item you need. That's painfully slow and doing it each time is definitely worse.
What if you simply created a list of each box and the items it contains with the number of each item?
So your list would look like:
Box A = Blue Pens: 10, Pencils: 4, Orange Crayons: 4, …
Box B = Red Crayons: 4, Eraser: 20, Highlighter: 30
Box C = Sticky notes: 3, Black Pens: 4, Jotters: 5, …
So when you need a black pen, you just look through your list and immediately know that Box C contains black pens. You just check Box C and you have your Black Pens.
Simple and pretty intuitive. That's exactly what the TF algorithm does. The boxes are documents and the items in them are the words in them. The code for it is pretty simple in fact. Let's implement one in Python over some files.
import re
from collections import Counter
doc_A ='''
"Galactic Pot-Healer"
A fallible alien deity summons a group of Earth craftsmen and women to a
remote planet to raise a giant cathedral from beneath the oceans. When the
deity begins to demand faith from the earthers, pot-healer Joe Fernwright is
unable to comply. A polished, ironic and amusing novel.
"A Maze of Death"
Noteworthy for its description of a technology-based religion.
"VALIS"
The schizophrenic hero searches for the hidden mysteries of Gnostic
Christianity after reality is fired into his brain by a pink laser beam of
unknown but possibly divine origin. He is accompanied by his dogmatic and
dismissively atheist friend and assorted other odd characters.
"The Divine Invasion"
God invades Earth by making a young woman pregnant as she returns from
another star system. Unfortunately she is terminally ill, and must be
assisted by a dead man whose brain is wired to 24-hour easy listening music.
'''
doc_B = '''
If the most Frequently Asked Question in rec.motorcycles is "What is the
DoD?", then the second most Frequently Asked Question must be "How do I
get a DoD number?" That is as simple as asking the Keeper of the List
(KotL, accept no substitue Keepers) for a number. If you're feeling
creative, and your favorite number hasn't been taken already, you can
make a request, subject to KotL approval. (Warning, non-numeric, non-
base-10 number requests are likely to earn a flame from the KotL. Not
that you won't get it, but you _will_ pay for it.)
Oh, and just one little, tiny suggestion. Ask the KotL in e-mail. You'll
just be playing the lightning rod for flames if you post to the whole
net, and you'll look like a clueless newbie too.
By now you're probably asking "So who's the KotL already?". Well, as
John Sloan notes below, that's about the only real "secret" left around
here, but a few (un)subtle hints can be divulged. First, it is not myself,
nor anyone mentioned by name in this posting (maybe :-), though John was
the original KotL. Second, in keeping with the true spirit of Unix, the
KotL's first name is only two letters long, and can be spelled entirely
with hexadecimal characters. (2.5, the KotL shares his name with a line-
oriented text utility.) Third, he has occasionally been seen posting
messages bestowing new DoD numbers (mostly to boneheads with "weenie
mailers"). Fourth, there is reason to suspect the KotL of being a
Dead-Head.
'''
doc_C = '''
1) What is RIPEM?
RIPEM is a program which performs Privacy Enhanced Mail (PEM) using
the cryptographic techniques of RSA and DES. It allows your
electronic mail to have the properties of authentication (i.e. who
sent it can be confirmed) and privacy (i.e. nobody can read it except
the intended recipient.)
RIPEM was written primarily by Mark Riordan <mrr@scss3.cl.msu.edu>.
Most of the code is in the public domain, except for the RSA routines,
which are a library called RSAREF licensed from RSA Data Security Inc.
2) How can I get RIPEM?
RIPEM contains the library of cryptographic routines RSAREF, which is
considered munitions and thus is export-restricted from distribution
to people who are not citizens or permanent residents of the U.S. or
Canada. Therefore, the following request is quoted from the README
file:
#Please do not export the cryptographic code in this distribution
#outside of the USA or Canada. This is a personal request from me,
#the author of RIPEM, and a condition of your use of RIPEM.
Note that RSAREF is not in the public domain, and a license for it is
included with the distribution. You should read it before using
RIPEM.
The best way to get it is to ask a friend for a copy, since this will
reduce the load on those sites that do carry it (not to mention the
humans that run them.) Naturally this requires that you trust the
Friend.
'''
# to remove the signs and symbols, we'll create a regex function
def preprocess(text):
text = text.lower() # get all texts to lowercase.
text = re.sub(r'[^a-z\s]', '', text) #we only need the words for most search so we'll filter out everything else
tokens = text.split()
return tokens
word_frequency_dict ={}
#function for counting the words and getting the frequency of each term per document
def frequncy_generator(word_list):
word_frequency= Counter(word_list) #we could count using get but counter is lightning faster and simpler
return word_frequency
document_list ={}
inter_=[doc_A, doc_B, doc_C]
dict_names = ["doc_A", "doc_B", "doc_C"]
for i in range(3):
document_list [f"{dict_names[i]}"] = frequncy_generator(preprocess(inter_[i]))
word_list is now our list of words in each document and their frequency. Now we can search over it.
How does our search function work? We just simply break down our query into individual words, then look for the document that contains the cumulative highest count of the words in our query.
So if we say "Does God exist?", we look for the document that contains the highest total of the individual words "does", "god" and "exist".
#creating a searching function
def search_query(search_word):
words=re.sub(r'[^a-zA-Z\s]', '', search_word) #since we cleaned our document, we need to clean queries as well
words = words.lower()
clean_text = words.split()
answer_num = {k: 0 for k in document_list} # initialize once, before looping
for i in clean_text:
for k, v in document_list.items():
if i in v:
answer_num[k] = answer_num[k] + v[i]
max_key = max(answer_num, key=answer_num.get)
return (max_key, max(answer_num.values())) #return the name of the doc and the frequency
query = search_query("Where can i buy computers?")
print(query)
oluwaferanmioyelude@Mac Search-Engine % /Users/oluwaferanmioyelude/Documents/Search-Engine/.venv/bin/python /Users/oluwaferanmioyelude/Documents/Search-Engi
ne/test.py
('doc_B', 4)
Let's change things up a bit and search for something more meaningful.
query = search_query("Does God exist?")
print(query)
oluwaferanmioyelude@Mac Search-Engine % /Users/oluwaferanmioyelude/Documents/Search-Engine/.venv/bin/python /Users/oluwaferanmioyelude/Documents/Search-Engi
ne/test.py
('doc_A', 1)
For context, I extracted doc_B from the Newsgroup file called "rec.motorcycles.text", doc_C from "sci.crypt" and doc_A from "alt.atheism.text". So it's not exactly perfect, but it's a start. This simple algorithm was how early versions of search engines worked.
It was pretty easy and it does a tidy job, but if you examine it more closely, it has some downsides. The first and major one that the next algorithm I'll work on sought to work through is that filler words like "the" and "and" have pretty high frequencies despite being almost useless to the context. They appear a lot and so they have a really high frequency. So if the user searches for a query that contains those words, the search would be messed up and might not work well. This was pretty much the main issue with the TF algorithm, and the next algorithm was developed to address that particular issue. The second part of this will cover that and will be out soon.
I hope you enjoyed reading this and learnt as much as I learnt when I was working on it. Feel free to copy the code and play around with it. Till next time!