Tuesday 1 October 2013

How does Google Search work?

Google is undoubtedly the most widely used search engine. It was founded by Larry Page and Sergey Brin. It is worth knowing the basic working and methodology of the Google search engine, so I have explained things in very simple words. Read carefully.
Overview:
Okay, let's assume you want to design a little search engine that searches for the requested keywords in a few websites (say 5 websites). What would our approach be? First of all, we would store the contents, that is the web pages, of those 5 websites in our database. Then we would build an index of the important parts of these web pages, such as titles, headings, meta tags, etc. Then we would make a simple search box where users can enter a search query or keyword. The user's query would be processed to match it against the keywords in the index, and the results would be returned accordingly. We would return to the user a list of links to the actual websites, ordered by preference using some ranking algorithm. I hope the basic overview of how a search engine works is clear to you. Now read on for more detail.
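The overview above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the URLs and page contents in `PAGES` are made up, and the "some algorithm" used for preference is assumed here to be a simple score in which a match in the title counts twice as much as a match in a heading.

```python
import re

# Hypothetical stand-ins for the stored pages of our few websites.
PAGES = {
    "http://site-a.test/": {"title": "Python tutorial",
                            "headings": ["Learn Python basics"]},
    "http://site-b.test/": {"title": "Cooking recipes",
                            "headings": ["Python shaped cake"]},
    "http://site-c.test/": {"title": "Web search basics",
                            "headings": ["How crawlers work"]},
}

def words(text):
    """Lowercase the text and split it into simple alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each keyword to {url: score}; title words weigh 2, heading words 1."""
    index = {}
    for url, page in pages.items():
        weighted = [(page["title"], 2)] + [(h, 1) for h in page["headings"]]
        for text, weight in weighted:
            for word in words(text):
                scores = index.setdefault(word, {})
                scores[url] = scores.get(url, 0) + weight
    return index

def search(index, query):
    """Return the matching URLs, best score first (ties broken by URL)."""
    totals = {}
    for word in words(query):
        for url, score in index.get(word, {}).items():
            totals[url] = totals.get(url, 0) + score
    return sorted(totals, key=lambda url: (-totals[url], url))

index = build_index(PAGES)
results = search(index, "python")
```

Here `results` lists site-a before site-b, because site-a mentions the keyword in both its title and a heading.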
A web search engine basically works in the following three parts.
1. Web Crawling
2. Indexing
3. Query processing or searching
1. The first step in the working of a search engine is web crawling. A web crawler, or web spider, is a piece of software that travels across the World Wide Web, downloading and saving web pages. A web crawler is fed the URLs of websites and proceeds from there, downloading and saving the web pages associated with those websites. Want to get a feel for a web crawler? Feed one the links of some websites and it will start downloading the web pages, images, etc. associated with those websites. The name of Google's web crawler is Googlebot.
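To make the crawling step concrete, here is a minimal sketch of a breadth-first crawler. It is not how Googlebot is implemented. The `fetch` argument is a hypothetical callable (URL in, HTML text or `None` out), and `FAKE_WEB` is a made-up in-memory "web", so the example runs without any network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch, max_pages=50):
    """Breadth-first crawl; returns {url: html} for every page saved.

    `fetch` is a hypothetical callable: url -> HTML string, or None on failure.
    """
    saved = {}
    queue = deque(seed_urls)
    seen = set(seed_urls)
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        saved[url] = html  # "download and save" the page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return saved

# A tiny made-up "web" standing in for real sites.
FAKE_WEB = {
    "http://site.test/": '<a href="/about">About</a> <a href="/news#top">News</a>',
    "http://site.test/about": '<a href="/">Home</a>',
    "http://site.test/news": "<p>No links here</p>",
}

pages = crawl(["http://site.test/"], FAKE_WEB.get)
```

Starting from one seed URL, the crawler discovers and saves all three pages. For real websites, `fetch` could be a small wrapper around `urllib.request.urlopen`.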
Want to see the copies of web pages saved in Google's database? (Actually, not exactly.)
Let's take any website as an example, say www.wikipedia.org.
Do this:
Go to Google and search for 'wikipedia'. Hopefully you will get this link at the top. Click on the 'cached' link.
OR
Directly search for 'cache:wikipedia.org'.
Then read the lines at the top of the page you get and things should be clear to you.
2. After Googlebot has saved the pages, it submits them to the Google indexer. Indexing means extracting words from titles, headings, meta tags, etc. The indexed pages are stored in the Google index database. The contents of the index database are similar to the index at the back of a book. Google ignores common or insignificant words such as as, for, the, is, or, on (called stop words), which appear in nearly every web page. Indexing is done basically to improve the speed of searching.
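The book-index comparison can be illustrated with what is usually called an inverted index: a mapping from each word to the pages that contain it. The sketch below is an assumption about the general technique, not Google's actual indexer. The stop-word list is just the handful of examples named above, and the three documents are made up.

```python
import re

# Stop words taken from the examples in the text; a real list would be longer.
STOP_WORDS = {"as", "for", "the", "is", "or", "on"}

def build_inverted_index(documents):
    """Map each non-stop word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                index.setdefault(word, set()).add(doc_id)
    return index

# Made-up page texts keyed by a page id.
documents = {
    1: "The history of the web",
    2: "Web crawling for search engines",
    3: "Indexing is the key to fast search",
}

inverted = build_inverted_index(documents)
web_pages = inverted["web"]        # pages containing "web"
search_pages = inverted["search"]  # pages containing "search"
```

Looking a word up is now a single dictionary access instead of a scan through every saved page, which is where the speed-up comes from.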
3. The third part is query processing or searching. It includes the search box where we enter the search query or keyword we are looking for. When the user enters a search query, Google matches the entered keywords against the pages saved in the index database and returns the actual links of the web pages from which those pages were retrieved. Priority is obviously given to the best-matching results. Google uses a patented algorithm called PageRank that helps rank the web pages that match a given search string.
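The idea behind PageRank is that a page is important if important pages link to it. Below is a simplified, textbook-style sketch of that idea using repeated iteration. It is not Google's production ranking. The link graph is made up, and the damping factor of 0.85 and the fixed 50 iterations are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}; ranks sum to 1."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline rank.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks

# Hypothetical link graph: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
best_first = sorted(ranks, key=ranks.get, reverse=True)
```

In this graph C comes out on top because three pages link to it, while D, which nobody links to, comes last.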
The above three steps are followed not only by Google Search but by most web search engines. Of course there are many variations, but the methodology is the same.
What is robots.txt?
Web administrators may not want web crawlers or web spiders to fetch every page or file of a website and show the links in search results. robots.txt is a simple text file, placed in the top-level directory of the website, that contains the paths web administrators do not want web crawlers to fetch. The first step of a well-behaved web crawler is to check the contents of robots.txt.
Example of the contents of robots.txt:
User-agent: *    # for the web crawlers of all search engines
Disallow: /directory_name/file_name    # a specific file in a particular directory
Disallow: /directory_name/    # all files in a particular directory
You can view the robots.txt of a website (if it exists). Example: http://www.microsoft.com/robots.txt
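As a rough sketch of how a crawler might honour these rules, Python's standard library ships a robots.txt parser in `urllib.robotparser`. The rules, the `example.com` URLs, and the crawler name `MyCrawler` below are all made up for illustration; the rules are parsed from a string so that nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, similar to the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory_name/file_name
Disallow: /private_dir/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent="MyCrawler"):
    """Return True if the parsed rules let this crawler fetch the URL."""
    return parser.can_fetch(user_agent, url)

blocked_file = allowed("http://example.com/directory_name/file_name")
other_file = allowed("http://example.com/directory_name/other_file")
blocked_dir = allowed("http://example.com/private_dir/page.html")
home_page = allowed("http://example.com/index.html")
```

A crawler would call a check like `allowed(url)` before downloading each page and skip any URL for which it returns `False`.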
For more information, visit http://www.google.co.in/intl/en/insidesearch/howsearchworks/thestory/
