Can I make my own search engine from scratch-2

Design a Crawler – 1

I wrote this article to help readers understand how to design a crawler that performs its basic task: fetching documents from the web. It covers the first phase of a search engine, the crawler. In my last article, I gave an overview of the search engine and how it works. Here I will describe the functions of the crawler and how it works.


Crawler Introduction

A web crawler works through the content of a web page in order to reach all the other web pages linked from it. It is also known as a spider or a bot. The crawler scans the content of the site being crawled and collects information such as the website's domain, URLs, and links. It takes the first page of the site as a seed page, and the links on that page direct the crawler to the other pages.

Here is the algorithm that a web crawler follows:


Algorithm Basic-Crawler
Input: URL of the seed page
Output: links stored in storage

  1. Input the URL of the seed page.
  2. Parse the seed page, extract all the hyperlinks, and store them in a list.
  3. Fetch a link from the list.
  4. If the link has already been visited, skip it and fetch the next link. Otherwise, mark it as visited.
  5. Follow the link and store any non-visited links found on that page in the storage.
  6. Repeat from step 3 until all the links have been visited.
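For readers who already know some Python, here is a minimal sketch of the steps above. It does not touch the network: the dictionary TOY_WEB, the page names in it, and the helper extract_links() are all made up for illustration and stand in for real fetching and parsing. Treat it as one possible reading of the algorithm, not a finished crawler.

```python
from collections import deque

# Hypothetical stand-in for the web: each page name maps to the links on it.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}


def extract_links(url):
    """Pretend to fetch and parse a page; returns the links found on it."""
    return TOY_WEB.get(url, [])


def basic_crawler(seed_url):
    visited = [seed_url]                        # step 1: start from the seed
    to_visit = deque(extract_links(seed_url))   # step 2: links on the seed page
    while to_visit:                             # step 6: repeat until empty
        link = to_visit.popleft()               # step 3: fetch a link
        if link in visited:                     # step 4: skip visited links
            continue
        visited.append(link)                    # step 4: mark as visited
        to_visit.extend(extract_links(link))    # step 5: follow it, queue its links
    return visited


print(basic_crawler("seed"))  # ['seed', 'a', 'b', 'c']
```

A list is used for visited so that the result keeps the order in which pages were discovered; for a large crawl a set would make the "already visited" check faster.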
Now, let me get you started with Python and the programming basics needed to develop a basic crawler. There are some prerequisites:
  1. A Python interpreter.
  2. Some basic programming skills
  3. Loads of enthusiasm
Python code can be run either online or offline. I use both. Beginners, however, should not get bogged down in installation at the start and waste time on it. Once you have learnt something, you can move from online to offline. So for this tutorial, we will stick to an online interpreter such as https://onlinegdb.com. There we can write code and save it, without the hassle of installation.
I will first walk you through the different features of the portal so that your journey is smooth and hassle-free.
Figure 1


Features of onlinegdb.com:

  1. You can log in and save all your work online, and access it from anywhere at any time. For this, click on Sign Up.
  2. From the top right corner, select Python 3. This sets the Python environment.
  3. Now we are all set to write our first Python code:
  1. In this article we will try to understand the following:
    1. Output
    2. Input
    3. Variables
    4. A basic data structure, the list

  2. First, we will see how the print() function works.

Type the following code in the window:

print("Hello World!!")

Output is: Hello World!!

The print() function prints whatever we pass to it, whether a variable or a constant.

  3. Next, we will see how the input() function works.

Type the following code in the window:

input("Enter a number: ")

When you run the code, the prompt appears and the program waits for you to type a value, say 6.

Output is: Enter a number: _

  4. Let us now understand how to store what the input() function has read from the keyboard.

For this we need to define a variable. Type the following lines in the window:

var_int = input("Enter a number: ")

print(var_int)

Output: Enter a number: 6

6

When you enter 6, the value gets stored in the variable var_int, and print(var_int) prints the value 6 on the screen.
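One detail worth knowing at this point: in Python 3, input() always returns text (a string), even when you type digits. The small sketch below shows the difference. To keep it runnable without a keyboard, the variable typed is an assumption that simply holds what input() would have returned for 6.

```python
# Assumption: typed holds what input("Enter a number: ") returns when you enter 6.
typed = "6"

number = int(typed)      # convert the text "6" to the integer 6

print(typed + typed)     # 66  (joining two pieces of text)
print(number + number)   # 12  (adding two integers)
```

If you want to do arithmetic on what the user typed, wrap the call as int(input("Enter a number: ")).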

Figure 2




As can be seen in Figure 2, the code is written in the top portion of the screen and the output is shown in the bottom part. Also observe the top right corner of the screen: we have set the environment to Python 3.

To run the code, press the “Run” button, as seen in Figure 3.

 

  5. Now we need to understand variables. Variables are placeholders for the values used while the code runs. These values can change depending on the code written.

var_int = 12

var_str = "Hello World!!"

var_float = 12.342

The most interesting fact about a variable in Python is that it automatically takes the data type of the value stored in it. Therefore, we do not need to explicitly declare the data type of the variable.
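You can check this yourself with the built-in type() function. The short sketch below reuses the three variables from above and then stores a different kind of value in var_int to show that the type follows the value.

```python
var_int = 12
var_str = "Hello World!!"
var_float = 12.342

print(type(var_int).__name__)    # int
print(type(var_str).__name__)    # str
print(type(var_float).__name__)  # float

# Storing text in the same variable changes its type.
var_int = "twelve"
print(type(var_int).__name__)    # str
```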

  6. The last thing we have to understand here is the list data structure.

To declare a list, write:

L1 = [1, 2, 3, 4]

L2 = [12, 12.34, "Hello World!!"]

As seen from the example above, a list can hold data of the same type or data of different types.

To print the contents of the list we can simply write:

print(L1)

Output: [1, 2, 3, 4]
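The crawler algorithm keeps its links in a list, so a few more list operations are worth trying now. The sketch below adds an item, reads one by position, counts the items, and checks whether a value is already present. The page names are made up for illustration.

```python
links = ["page1.html", "page2.html"]   # made-up link names

links.append("page3.html")             # add an item to the end of the list

print(links)                   # ['page1.html', 'page2.html', 'page3.html']
print(links[0])                # page1.html  (positions start at 0)
print(len(links))              # 3
print("page2.html" in links)   # True
print("page9.html" in links)   # False
```

The in check is what lets a crawler decide whether a link has already been visited.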

Now you have understood the basics of Python. In my next article, I will explain the algorithm mentioned above and write code based on it.

Until then, happy coding.




Gaurav Parashar (Faculty)

Computer Science Department

KIET Group of Institutions
