
Wednesday 25 January 2012

Website Fetcher


                                                  

Description of the project:

 The Website Fetcher is a multithreaded Windows application that downloads and stores Web pages, identified by their Uniform Resource Identifiers (URIs), for a Web search engine. Roughly, a crawler starts by placing an initial set of URLs in a queue, where all URLs to be retrieved are kept and prioritized. From this queue, the crawler takes a URL (in some order), downloads the page, extracts any URLs found in the downloaded page, and puts the new URLs in the queue. This process repeats until the crawler decides to stop. The collected pages are later used by other applications, such as a Web search engine or a Web cache.
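The loop described above can be sketched in a few lines. This is a minimal illustration in Python, not the project's C# code: the names crawl, extract_links, http_fetch and max_pages are assumptions made for the example, the queue is plain first-in-first-out rather than prioritized, and the stop condition is simply a page limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (fragments removed) found in the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def http_fetch(url):
    """Download a page over HTTP; assumes the body is text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch=http_fetch, max_pages=100):
    """Crawl outward from the seed URLs; returns {url: html}."""
    queue = deque(seeds)   # URLs waiting to be retrieved
    seen = set(seeds)      # URLs already queued, to avoid repeats
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing the download function in as a parameter keeps the loop itself independent of the network, so it can be tried against a small in-memory set of pages before pointing it at real sites.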

As the Web grows, it becomes more difficult to retrieve the whole Web, or even a significant portion of it, using a single process. Therefore, many search engines run multiple processes in parallel to perform the above task, so that the download rate is maximized. We refer to this type of fetcher as a parallel crawler. This type of application is often used in search engines, where URLs relevant to a query must be collected and indexed by priority.
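A rough sketch of the parallel idea follows, again in Python for illustration only. It uses worker threads sharing one queue rather than separate processes, and the names parallel_crawl, workers and max_pages are assumptions made for the example. The fetch and extract_links functions are passed in by the caller.

```python
import threading
from queue import Queue


def parallel_crawl(seeds, fetch, extract_links, workers=4, max_pages=100):
    """Crawl with several worker threads sharing one URL queue."""
    frontier = Queue()        # thread-safe queue of URLs to retrieve
    seen = set(seeds)         # URLs already queued
    pages = {}                # url -> downloaded content
    lock = threading.Lock()   # protects seen and pages
    for url in seeds:
        frontier.put(url)

    def worker():
        while True:
            url = frontier.get()
            try:
                if url is None:
                    return  # sentinel: no more work
                html = fetch(url)
                links = extract_links(url, html)
                with lock:
                    if len(pages) < max_pages:
                        pages[url] = html
                        for link in links:
                            if link not in seen:
                                seen.add(link)
                                frontier.put(link)
            except Exception:
                pass  # a failed download should not stop the worker
            finally:
                frontier.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(workers)]
    for t in threads:
        t.start()
    frontier.join()           # wait until every queued URL is processed
    for _ in threads:
        frontier.put(None)    # tell each worker to exit
    for t in threads:
        t.join()
    return pages
```

Downloads run outside the lock, so slow pages do not block the other workers; only the short bookkeeping step is serialized.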

This application is a .NET-based fetcher, similar in purpose to Googlebot, Google’s crawler. It is intended as a backend processing component for a search engine. The results (URI data) gathered by the Website Fetcher are passed to an indexer, which indexes the page data so that search queries return results faster.
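To show why an index makes queries faster, here is a minimal inverted-index sketch in Python. It is not the project's indexer: build_index and search are hypothetical names, and the pages are assumed to be plain text already stripped of markup.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

A query then looks up each word directly instead of scanning every stored page.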

Modules:
Crawler Views
·         Threads view.
·         Requests view.
Configurator
·         MIME types.
·         Output Connections.
·         Advanced settings.

Multithreaded Downloader

Software requirements:
o   Microsoft .Net framework 2.0
o   Microsoft C#.Net language
o   Microsoft Windows 2000 SP4 or higher
o   Microsoft Visual Studio 2005 IDE
o   Microsoft ASP.Net, HTML
o   AJAX Toolkit
o   Microsoft SQL Server 2000 and above
o   XML
