Description of the project:
The Website Fetcher is a multithreaded Windows application that downloads and stores Web pages, identified by their Uniform Resource Identifiers (URIs), for a Web search engine. Roughly, a crawler starts by placing an initial set of URLs in a queue, where all URLs to be retrieved are kept and prioritized. From this queue, the crawler takes a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue. This process is repeated until the crawler decides to stop. The collected pages are later used by other applications, such as a Web search engine or a Web cache.
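The loop described above can be sketched in a few lines. This is only an illustration in Python, not the project's C# code: the names `crawl`, `LinkExtractor`, `FAKE_WEB` and the `max_pages` stop condition are hypothetical, and the priority used here (link depth from the initial URLs) is an assumption, since the description does not say how URLs are prioritized. The `fetch` function is passed in so the sketch runs against an in-memory set of pages instead of the network.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Crawl from seed_urls; fetch(url) returns HTML text or None."""
    # Assumed priority policy: lower link depth is retrieved first.
    frontier = [(0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)   # URLs already queued, to avoid re-queuing
    pages = {}              # url -> downloaded HTML
    while frontier and len(pages) < max_pages:
        depth, url = heapq.heappop(frontier)
        html = fetch(url)
        if html is None:    # download failed or page unknown
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                heapq.heappush(frontier, (depth + 1, absolute))
    return pages


# Tiny in-memory "web" (hypothetical) so the sketch runs offline.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the third page does here.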
As the size of the Web grows, it becomes more difficult to retrieve the whole Web, or even a significant portion of it, using a single process. Therefore, many search engines run multiple processes in parallel to perform this task, so that the download rate is maximized. We refer to this type of fetcher as a parallel crawler. This type of application is often used in search engines, which need to collect all the URLs relevant to a query and index them by priority.
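A rough sketch of the parallel idea follows, again in Python for illustration only. It uses threads sharing one queue rather than separate processes, which is an assumption made to match the "multithreaded" description of the application; `parallel_crawl`, `fetch_links`, `FAKE_LINKS` and `num_workers` are hypothetical names, and `fetch_links` stands in for "download the page and extract its URLs".

```python
import queue
import threading


def parallel_crawl(seed_urls, fetch_links, num_workers=4):
    """fetch_links(url) returns the list of URLs found on that page."""
    frontier = queue.Queue()
    seen = set(seed_urls)          # URLs already queued
    lock = threading.Lock()        # guards seen and visited
    visited = []

    for url in seed_urls:
        frontier.put(url)

    def worker():
        while True:
            url = frontier.get()
            if url is None:        # sentinel: no more work
                frontier.task_done()
                return
            try:
                links = fetch_links(url)
                new_links = []
                with lock:
                    visited.append(url)
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            new_links.append(link)
                # Queue new URLs before marking this one done, so
                # frontier.join() cannot return while work remains.
                for link in new_links:
                    frontier.put(link)
            finally:
                frontier.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    frontier.join()                # wait until every queued URL is processed
    for _ in threads:
        frontier.put(None)         # tell each worker to exit
    for t in threads:
        t.join()
    return visited


# Hypothetical link graph standing in for real downloads.
FAKE_LINKS = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}

visited = parallel_crawl(["a"], lambda url: FAKE_LINKS.get(url, []))
print(sorted(visited))
```

The order in which pages are visited depends on thread scheduling, so only the set of visited URLs is stable from run to run.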
This application is a .Net-based fetcher very similar to Googlebot, Google’s crawler. It serves as a backend processing component for a search engine. The results (URI data) gathered by the Website Fetcher are passed to an indexer, which indexes the page data so that search queries return results faster.
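The indexer itself is not described here, so the following is only an assumption about how such a component commonly works: an inverted index that maps each word to the pages containing it, so a query becomes a set lookup instead of a scan over every page. The names `build_index`, `search` and the sample `docs` are hypothetical, and the sketch is in Python rather than the project's C#.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """pages maps url -> page text; returns word -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the urls that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical fetched pages, already reduced to plain text.
docs = {
    "http://example.test/a": "Parallel crawler design",
    "http://example.test/b": "Crawler queue and priority",
    "http://example.test/c": "Search engine indexer",
}

index = build_index(docs)
print(sorted(search(index, "crawler")))
```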
Modules:
Crawler Views
· Threads view.
· Requests view.
Configurator
· MIME types.
· Output Connections.
· Advanced settings.
Multithreaded Downloader
Software requirements:
o Microsoft .Net framework 2.0
o Microsoft C#.Net language
o Microsoft Windows 2000 SP4 or higher
o Microsoft Visual Studio 2005 IDE
Software requirements:
o Microsoft .Net framework 2.0.
o Microsoft ASP.Net, HTML.
o AJAX Tool kit.
o Microsoft C#.Net language.
o Microsoft SQL Server 2000 and above.
o XML.