Can I make my own custom search engine from scratch?

The title is intriguing. As a student of computer science I had always wondered how a search engine works: how does it learn about a newly commissioned web resource, how does it know whether a website is up or down, and why does a website take time to show up in search results? There are many such questions, and the answer to all of them lies in the search engine itself. This is the first article in a series on search engines. It will cover the basics and the modules of a search engine; later articles will cover development of those modules with code and results, and integration of the modules into one framework.

We all use search engines daily:

Ok Google, when is my cab going to arrive? When is my birthday?

Alexa, will it rain today?

Search engines like Google and DuckDuckGo do the same things but differ ideologically. So let's get down to the basics of search engines and get our hands dirty coding a basic one.


Basics

A search engine is a complex piece of software that does many things, such as finding the answer to the question the user asked and calculating which websites appear in the first 10 results. To explain this, we will first understand the different components of a search engine:

  • crawling
  • parsing
  • indexing
These three processes team up to make a search engine and keep it up and running. Let's dive into these components one by one.


Crawling



Crawling is the process that feeds the search engine with the required data. The crawler first visits the page over the internet, downloads the information, and saves it to the search engine's database. Some of the information a crawler looks for includes:

  • title of the website
  • url
  • keywords
  • meta information
  • domain information
  • javascript
  • CSS
  • headings
  • other links in the website
The software which does the crawling is termed a spider or crawler. When a crawler plans to visit a website, it first checks for a file named robots.txt on the server where the website is hosted. This file contains rules for the crawler: it tells the crawler what to crawl and what not to crawl. After the crawler has read the rules, it crawls the index page of the website and then jumps to the next website. Crawlers can't scan an entire website in one go, so they mostly stop after scanning the first page.
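To make this concrete, here is a minimal crawling sketch. It is only an illustration, assuming Python 3 with the third-party requests library; the URL and the crawler name are placeholders I made up, not part of any real search engine.

# Illustrative sketch only: check robots.txt, then download a page.
import urllib.robotparser
from urllib.parse import urlparse

import requests

def crawl(url, user_agent="MyCrawler/0.1"):
    # Step 1: read robots.txt on the host before fetching anything.
    parsed = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    if not robots.can_fetch(user_agent, url):
        print(f"robots.txt disallows crawling {url}")
        return None

    # Step 2: download the index page; the raw HTML is handed to the parser later.
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    html = crawl("https://example.com/")
    if html:
        print(f"Downloaded {len(html)} characters")

A real crawler would also queue the links it finds on each page and revisit sites periodically, but the robots.txt check and the download step above are the core of the process described here.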
Now let us understand the next process, i.e., parsing.

Parsing

Parsing is the process in which the webpage is scanned for the content inside it.
So how do we do it?
We do it by extracting features from the webpage and storing them against a handle, i.e., the web URL. Some of the features of a website are:

  • website domain name
  • website title
  • frequently used tokens(words)
  • headings on the page
  • links to other websites
  • links to other pages of the site
  • text marked as bold
  • thumbnails of images in website
These features are stored in a database. After storing, indexing is done based on the indexing criteria, which is discussed in the next section.
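Here is a minimal parsing sketch that extracts a few of these features. Again, this is only an illustration, assuming Python 3 with the third-party beautifulsoup4 library; the dictionary keys are my own illustrative names, not a fixed schema.

# Illustrative sketch only: extract a handful of features from raw HTML.
from urllib.parse import urlparse

from bs4 import BeautifulSoup

def parse(url, html):
    # Store the extracted features against the URL, which acts as the handle.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,                                         # the handle
        "domain": urlparse(url).netloc,                     # website domain name
        "title": soup.title.string if soup.title else "",   # website title
        "headings": [h.get_text(strip=True)
                     for h in soup.find_all(["h1", "h2", "h3"])],
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        "bold_text": [b.get_text(strip=True)
                      for b in soup.find_all(["b", "strong"])],
        "tokens": soup.get_text(separator=" ").lower().split(),  # words for indexing
    }

The crawler's output (the raw HTML) is the input here, and the returned dictionary is what would be written to the database before indexing.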

Indexing

Indexing is the process of processing the data of the crawled websites and arranging it in an order so that we can retrieve the required information later. It is described in Figure 1 below.

[Figure 1: indexing pipeline: text acquisition → document store → text transformation → index creation]

In Figure 1, text acquisition is done in the parsing phase and the data is stored in a document store; text transformation then cleans the text, which is pushed onward for index creation. Index creation comes last.
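To close the loop, here is a minimal inverted-index sketch in plain Python. It is only an illustration of the idea, not the design any particular search engine uses, and the document data is made up for the example.

# Illustrative sketch only: build an inverted index and answer a simple query.
from collections import defaultdict

def build_index(documents):
    # documents: mapping of URL -> list of tokens produced by the parser.
    index = defaultdict(set)              # token -> set of URLs containing it
    for url, tokens in documents.items():
        for token in tokens:
            # Text transformation: lowercase and strip simple punctuation.
            token = token.lower().strip(".,!?")
            if token:
                index[token].add(url)
    return index

def search(index, query):
    # Return URLs containing every query term (a simple AND query).
    results = [index.get(term.lower(), set()) for term in query.split()]
    return set.intersection(*results) if results else set()

if __name__ == "__main__":
    docs = {
        "https://example.com/a": "search engines crawl the web".split(),
        "https://example.com/b": "crawlers feed data to the index".split(),
    }
    index = build_index(docs)
    print(search(index, "crawl web"))     # prints {'https://example.com/a'}

A real index would also record term frequencies and positions so results can be ranked, which is what decides which websites come in the first 10 results.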
I hope you have understood the basics of a search engine.
In my next article I’ll explain each module in detail.



Gaurav Parashar (Faculty)

Computer Science Department

KIET Group of Institutions

