Can I make my own custom search engine from scratch?

The title is intriguing. As a student of computer science I had always wondered how a search engine works: how does it learn about a newly commissioned web resource, how does it know whether a website is up or down, and why does a website take time to show up in search results? There are many such questions, and the answer to all of them lies in the search engine itself. This is the first article in a series on search engines. It will cover the basics and the modules of a search engine; later articles will cover development of those modules with code and results, and integration of the modules into one framework.

We all use search engines daily:

Ok Google, when is my cab going to arrive? When is my birthday?

Alexa, will it rain today?

Search engines like Google and DuckDuckGo do the same things but differ ideologically. So let's get down to the basics of search engines and get our hands dirty coding a basic one.


Basics

A search engine is a complex piece of software that does many things, such as finding the answer to the question the user asked and calculating which websites appear in the first 10 results. To explain this, we will first understand the different components of a search engine:

  • crawling
  • parsing
  • indexing
These three processes team up to make a search engine and keep it up and running. Let's dive into these components one by one.


Crawling



Crawling is the process that feeds the search engine with the required data. The crawler first visits the page over the internet, downloads the information, and saves it to the search engine's database. Some of the information a crawler looks for includes:

  • title of the website
  • url
  • keywords
  • meta information
  • domain information
  • javascript
  • CSS
  • headings
  • other links in the website
The software which does the crawling is termed a spider or crawler. When a crawler plans to visit a website, it first checks for a file named robots.txt on the server where the website is hosted. This file contains rules for the crawler: it tells the crawler what to crawl and what not to crawl. After the crawler has read the rules, it crawls the index page of the website and then jumps to the next website. Crawlers can't scan an entire website in one go, so they mostly stop after scanning the first page.
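To make this concrete, here is a minimal crawling sketch. It is only an illustration, assuming Python 3 with the third-party requests library; the URL and the crawler name are placeholders I made up, not part of any real search engine.

# Illustrative sketch only: check robots.txt, then download a page.
import urllib.robotparser
from urllib.parse import urlparse

import requests

def crawl(url, user_agent="MyCrawler/0.1"):
    # Step 1: read robots.txt on the host before fetching anything.
    parsed = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    if not robots.can_fetch(user_agent, url):
        print(f"robots.txt disallows crawling {url}")
        return None

    # Step 2: download the index page; the raw HTML is handed to the parser later.
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    html = crawl("https://example.com/")
    if html:
        print(f"Downloaded {len(html)} characters")

A real crawler would also queue the links it finds on each page and revisit sites periodically, but the robots.txt check and the download step above are the core of the process described here.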
Now let us understand the next process, i.e., parsing.

Parsing

Parsing is the process in which the webpage is scanned for the content inside it.
So how do we do it?
We do it by extracting features from the webpage and storing them against a handle, i.e., the web URL. Some of the features of a website are:

  • website domain name
  • website title
  • frequently used tokens(words)
  • headings on the page
  • links to other websites
  • links to other pages of the site
  • text marked as bold
  • thumbnails of images in website
These features are stored in a database. After storing, indexing is done based on the indexing criteria, which is discussed in the next section.
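Here is a minimal parsing sketch that extracts a few of these features. Again, this is only an illustration, assuming Python 3 with the third-party beautifulsoup4 library; the dictionary keys are my own illustrative names, not a fixed schema.

# Illustrative sketch only: extract a handful of features from raw HTML.
from urllib.parse import urlparse

from bs4 import BeautifulSoup

def parse(url, html):
    # Store the extracted features against the URL, which acts as the handle.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,                                         # the handle
        "domain": urlparse(url).netloc,                     # website domain name
        "title": soup.title.string if soup.title else "",   # website title
        "headings": [h.get_text(strip=True)
                     for h in soup.find_all(["h1", "h2", "h3"])],
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        "bold_text": [b.get_text(strip=True)
                      for b in soup.find_all(["b", "strong"])],
        "tokens": soup.get_text(separator=" ").lower().split(),  # words for indexing
    }

The crawler's output (the raw HTML) is the input here, and the returned dictionary is what would be written to the database before indexing.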

Indexing

Indexing is the process of processing the data of the crawled websites and arranging it in an order so that we can retrieve the required information later. It is described in Figure 1 below.

[Figure 1: indexing pipeline: text acquisition → document store → text transformation → index creation]

In Figure 1, text acquisition is done in the parsing phase and the data is stored in a document store; text transformation then cleans the text, which is pushed onward for index creation. Index creation comes last.
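To close the loop, here is a minimal inverted-index sketch in plain Python. It is only an illustration of the idea, not the design any particular search engine uses, and the document data is made up for the example.

# Illustrative sketch only: build an inverted index and answer a simple query.
from collections import defaultdict

def build_index(documents):
    # documents: mapping of URL -> list of tokens produced by the parser.
    index = defaultdict(set)              # token -> set of URLs containing it
    for url, tokens in documents.items():
        for token in tokens:
            # Text transformation: lowercase and strip simple punctuation.
            token = token.lower().strip(".,!?")
            if token:
                index[token].add(url)
    return index

def search(index, query):
    # Return URLs containing every query term (a simple AND query).
    results = [index.get(term.lower(), set()) for term in query.split()]
    return set.intersection(*results) if results else set()

if __name__ == "__main__":
    docs = {
        "https://example.com/a": "search engines crawl the web".split(),
        "https://example.com/b": "crawlers feed data to the index".split(),
    }
    index = build_index(docs)
    print(search(index, "crawl web"))     # prints {'https://example.com/a'}

A real index would also record term frequencies and positions so results can be ranked, which is what decides which websites come in the first 10 results.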
I hope you have understood the basics of a search engine.
In my next article I’ll explain each module in detail.



Gaurav Parashar (Faculty)

Computer Science Department

KIET Group of Institutions

