
A Web Crawler

Objectives

In this assignment, you will program a Web crawler. A Web crawler is a program that reads web pages and indexes the words they contain. The crawler starts from a single page or a set of pages and follows the links it finds in them. In your program, we will set aside the indexing part and focus on the crawler itself. The objectives are to:

  • Create a crawler
  • Collect page addresses
  • Collect e-mail addresses
  • Understand the URL class
  • Understand the Java HTML parser

Summary of the Work

Each group will have to:

  1. Write a single-threaded crawler.
  2. Extend it to a multithreaded crawler.

Reading Assignments

  • Read Chapters 7 and 8 on URL and HTML processing in the book Java Network Programming by Elliotte Rusty Harold, pp. 184-274;
  • Read the description of the Become system here and test it here

Programming

Below are the main parts you will implement. Start with the single-threaded search and then extend it to a multithreaded program. Each part lists, in no particular order, the points that your program must fulfill.

Search

  1. Your program will start from a single URL address, for example:
    http://www.cs.lth.se/EDA095/ or http://www.cs.lth.se/~pierre/.
  2. Your crawler will analyze the current Web page beginning with the start address, collect all the valid URL addresses, and continue with the addresses the page points to. The program will run until it has no more addresses to collect or has reached a limit, say 1,000 links.
  3. A breadth-first search is a possible strategy to explore the Web; see the figure below and the sketch that follows it.
    • The crawler reads the start page and finds three addresses, URL1, URL2, and URL3, which it appends to the end of a queue.
    • The crawler takes the address at the head of the queue (URL1), reads this page, and finds two addresses that it appends to the end of the queue.
    • The crawler takes the next address in the queue (URL2), reads it, and so on.

[Figure: Breadth-first search]
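As a starting point, here is a minimal sketch of this breadth-first loop, assuming the start address http://www.cs.lth.se/EDA095/ and a limit of 1,000 links. The extractLinks method is only a placeholder for the parsing described in the next part.

import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class Crawler {
    private static final int MAX_LINKS = 1000;     // stop after this many pages

    public static void main(String[] args) throws Exception {
        Queue<URL> remaining = new ArrayDeque<>(); // URLs still to visit
        Set<URL> traversed = new HashSet<>();      // URLs already visited

        remaining.add(new URL("http://www.cs.lth.se/EDA095/")); // start address

        while (!remaining.isEmpty() && traversed.size() < MAX_LINKS) {
            URL current = remaining.poll();        // take the first address in the queue
            if (!traversed.add(current)) {
                continue;                          // already visited: skip it
            }
            for (URL link : extractLinks(current)) {
                if (!traversed.contains(link)) {
                    remaining.add(link);           // append new addresses to the end of the queue
                }
            }
        }
        System.out.println("Collected " + traversed.size() + " URLs");
    }

    // Placeholder: the real extraction is described in "Parsing the pages"
    static Set<URL> extractLinks(URL page) {
        return new HashSet<>();
    }
}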

Parsing the pages

  1. The crawler only needs to read pages that contain text; you can use the MIME type or the URL suffix to determine this (see the first sketch after this list).
  2. The crawler should not visit the same Web page twice.
  3. The crawler will extract URL addresses from the HTML web pages using the Java built-in HTML parser. You can reuse and modify the program LinkGetter.java seen in the lecture (go to the end of the slides); a sketch based on the same parser follows this list. You will consider at least these three types of links:
    1. <a href="http://www.cs.lth.se/~pierre"></a>
      i.e. the href attribute that is in the a element (both relative and absolute addresses).
    2. <a href="mailto:Pierre.Nugues@cs.lth.se">
      i.e. as above, but you must apply a specific processing to mailto links (see below).
    3. <frame src="myframe.html">
      i.e. the src attribute that is in the frame element (this corresponds to pages that use frames).
  4. To build a URL from a string, you can use the constructor: URL(String)
    To build a URL from a base URL and a string relative to it, you can use the constructor: URL(URL, String)
    Consider the address http://www.cs.lth.se/~pierre/ and the relative address undervisning.html. You can build the absolute URL corresponding to the second address http://www.cs.lth.se/~pierre/undervisning.html using:
    URL(new URL("http://www.cs.lth.se/~pierre/"), "undervisning.html")
  5. The crawler must store all the URLs from the HTML web pages and all the mailto addresses in two separate lists. When the crawler has finished collecting them, it will print them or save them to a file:
    List of URLs:
    http://www.cs.lth.se/~pierre/
    http://www.cs.lth.se/~pierre/undervisning.html
    http://www.cs.lth.se/~pierre/research.html
    http://www.cs.lth.se/~pierre/publi.html
    http://www.cs.lth.se/~pierre/index_a.html
    http://www.cs.lth.se/~pierre/index_f.html
    ...
    List of addresses:
    mailto:Pierre.Nugues@cs.lth.se
    mailto:Peter.Moller@cs.lth.se
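A minimal sketch of the content check in point 1, assuming the page is fetched through java.net.URLConnection; the suffix test is only a rough guess and the method name is illustrative:

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class ContentCheck {
    // Returns true if the URL looks like a text page worth parsing
    static boolean isTextPage(URL url) {
        try {
            URLConnection connection = url.openConnection();
            String contentType = connection.getContentType(); // e.g. "text/html; charset=UTF-8"
            if (contentType != null) {
                return contentType.startsWith("text/");
            }
        } catch (IOException e) {
            return false;                                      // unreachable page: skip it
        }
        // Fallback: guess from the URL suffix when no MIME type is available
        String path = url.getPath();
        return path.isEmpty() || path.endsWith("/")
                || path.endsWith(".html") || path.endsWith(".htm");
    }
}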
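The sketch below shows one way to use the built-in parser (javax.swing.text.html) for point 3: it collects href and frame src attributes, puts mailto links in a separate list, and resolves relative addresses with the URL(URL, String) constructor. It is not the LinkGetter.java program from the lecture, only an illustration of the same idea; the class and field names are made up.

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final URL base;                                // base URL for resolving relative links
    final Set<String> pageUrls = new LinkedHashSet<>();    // list of URLs (no duplicates)
    final List<String> mailAddresses = new ArrayList<>();  // list of mailto addresses

    LinkCollector(URL base) {
        this.base = base;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int pos) {
        if (tag == HTML.Tag.A) {
            addLink((String) attributes.getAttribute(HTML.Attribute.HREF));
        } else if (tag == HTML.Tag.FRAME) {
            addLink((String) attributes.getAttribute(HTML.Attribute.SRC));
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attributes, int pos) {
        handleStartTag(tag, attributes, pos);   // empty elements such as <frame> are reported here
    }

    private void addLink(String value) {
        if (value == null) {
            return;
        }
        if (value.startsWith("mailto:")) {
            mailAddresses.add(value);           // mailto addresses go to the second list
        } else {
            try {
                // URL(URL, String) handles both relative and absolute addresses
                pageUrls.add(new URL(base, value).toString());
            } catch (MalformedURLException e) {
                // invalid address: ignore it
            }
        }
    }

    public static void main(String[] args) throws Exception {
        URL start = new URL("http://www.cs.lth.se/~pierre/");
        LinkCollector collector = new LinkCollector(start);
        try (Reader reader = new InputStreamReader(start.openStream())) {
            new ParserDelegator().parse(reader, collector, true);
        }
        System.out.println("List of URLs:");
        for (String url : collector.pageUrls) {
            System.out.println(url);
        }
        System.out.println("List of addresses:");
        for (String address : collector.mailAddresses) {
            System.out.println(address);
        }
    }
}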

Multithreading the crawler

Instead of using one single thread, you will extend your program to run several crawlers simultaneously. On his machine, Pierre Nugues could collect 2,000 URLs in 11:30 minutes with a one-thread program and in 3:30 minutes with 10 threads. Can you explain why?

Here are some points, in no particular order, that you can consider when making your crawler multithreaded. They are only a suggestion; a sketch of this design follows the list:

  1. The program can consist of three classes: Main that initializes the data and creates the objects, Processor that extends the Thread class, and Spider that processes the data.
  2. Main creates and starts 10 Processor threads.
  3. Spider uses two Vector lists: traversedURLs and remainingURLs.
  4. Main initializes the remainingURLs list with one element (the start address).
  5. Each thread is a loop that takes the first unprocessed URL from remainingURLs and analyzes it.
  6. You must write methods so that the threads can access the lists and append the results to the end of the traversedURLs list.
  7. Access to the lists and processing of data must be thread safe.
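A minimal sketch of this design, under the assumptions above. The parseLinks method is a placeholder for the parsing code of the single-threaded version, and the stop condition is deliberately simple: a thread that finds the queue momentarily empty just exits, which a more careful implementation would handle by waiting.

import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Spider owns the shared lists and gives the threads synchronized access to them
class Spider {
    private final Vector<URL> remainingURLs = new Vector<>();
    private final Vector<URL> traversedURLs = new Vector<>();
    private static final int MAX_LINKS = 2000;

    Spider(URL start) {
        remainingURLs.add(start);       // Main initializes the list with the start address
    }

    // Take the first unprocessed URL, or null when the limit is reached or nothing is left
    synchronized URL nextURL() {
        if (traversedURLs.size() >= MAX_LINKS || remainingURLs.isEmpty()) {
            return null;
        }
        URL url = remainingURLs.remove(0);
        traversedURLs.add(url);
        return url;
    }

    // Append newly found links to the end of the list, skipping addresses already seen
    synchronized void addURLs(List<URL> links) {
        for (URL link : links) {
            if (!traversedURLs.contains(link) && !remainingURLs.contains(link)) {
                remainingURLs.add(link);
            }
        }
    }
}

// Each Processor thread loops: take a URL from the Spider, parse it, hand back the links found
class Processor extends Thread {
    private final Spider spider;

    Processor(Spider spider) {
        this.spider = spider;
    }

    @Override
    public void run() {
        URL url;
        while ((url = spider.nextURL()) != null) {
            spider.addURLs(parseLinks(url));
        }
    }

    // Placeholder: reuse the parsing code from the single-threaded crawler here
    private static List<URL> parseLinks(URL url) {
        return new ArrayList<>();
    }
}

public class Main {
    public static void main(String[] args) throws Exception {
        Spider spider = new Spider(new URL("http://www.cs.lth.se/~pierre/"));
        Processor[] threads = new Processor[10];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Processor(spider);
            threads[i].start();
        }
        for (Processor thread : threads) {
            thread.join();    // wait until every crawler thread has finished
        }
    }
}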