
Datavetenskap

Lunds Tekniska Högskola


A Distributed Crawler Using Remote Method Invocation

Objectives

In this assignment, you will build a distributed system. You will refactor the program you developed in the fourth laboratory into an RMI application consisting of a set of servers. It will enable you to collect Web page addresses from a pool of machines.

Although the application idea is drawn from a real-world example, the focus here is on RMI programming. The program you will build is therefore a very simplified version of the real application.

Summary of the Work

Each group will have to:

  1. Write an RMI server.
  2. Write an RMI client.

Reading Assignments

  • Read Chapter 18 on RMI from the book Java Network Programming by Elliotte Rusty Harold, pp. 610-640;
  • Read again the description of the Become system to understand the architecture.

RMI Programming

You will build a distributed crawler using the multithreaded code you have produced in the last laboratory and RMI techniques. Your architecture will typically consist of one client and two or more servers. The servers will act as crawlers and the client will start them. The client will periodically collect the URL lists -- the traversed web pages and the web pages that have not yet been explored -- from all the servers.
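As a starting point, the remote interface of a crawler server might be sketched as follows. The method names startCrawling(), fetch(), and fetchAndSet() come from the points below; the interface name, the signatures, and the helper class are only assumptions, not a prescribed design:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

// Sketch of a possible remote interface for a crawler server.
// The method names follow the assignment text; the exact
// signatures are assumptions and only one way to do it.
interface CrawlerService extends Remote {
    // Start the crawler threads on this server.
    void startCrawling() throws RemoteException;

    // Return the two URL lists: index 0 = traversed URLs,
    // index 1 = URLs remaining to explore.
    List<List<String>> fetch() throws RemoteException;

    // Return the two current lists and, in the same operation,
    // replace them with the new lists built by the client.
    List<List<String>> fetchAndSet(List<String> newTraversed,
                                   List<String> newToExplore)
            throws RemoteException;
}

// Hypothetical helper to build the service name URL of the form
// rmi://host:port/service used when binding and looking up servers.
final class ServiceNames {
    static String url(String host, int port, String service) {
        return "rmi://" + host + ":" + port + "/" + service;
    }
}
```

The client would look up each server with Naming.lookup on such a URL and then invoke the remote methods as if they were local calls.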

Here are some points you may consider; they are only suggestions:

  1. Encapsulate your crawler as an RMI server. You will typically use two or three servers in this laboratory, but possibly more if you want.
  2. The rmiregistry uses port 1099 by default and must have access to your classes. This means that you either start rmiregistry in the directory where your classes are, or set CLASSPATH to the path where your classes are.
  3. As two or more students may start rmiregistry on the same machine, it is preferable not to use the default port. To start rmiregistry on a different port, use the command rmiregistry port and include the port in the service name URL, rmi://host:port/service (see pages 621 and 622 of the textbook).
  4. Write an RMI client that starts your servers using a startCrawling() method.
  5. The client will interact with the servers using the fetch() and fetchAndSet() methods. The first method fetches the list of traversed URLs as well as the list of URLs left to explore. The second method fetches the two lists and, in the same operation, replaces them with two new lists: a new traversed list, which could be the merged traversed lists of all the servers, and a new list of URLs to explore.
  6. The client will build new URL lists from those it has collected from the servers:
    • It will merge all the visited links from all the servers into a single list.
    • It will merge the lists of all the links remaining to be explored and split the result into as many lists as there are servers. As a first idea, you can build lists of equal size.
    • Finally, for each server, the client will replace the server's lists with those built by the client: the single merged list of visited links and that server's share of the remaining links.
  7. Depending on the servers' speed, the client may have to wait a couple of minutes before it starts polling the servers again. You can use Thread.sleep(long millis).
  8. As an optional exercise, you can build the lists according to domains. If you have three servers, you can build three lists: one for se domains, one for com domains, and a third for all remaining domains. Try to balance the lists; you can also imagine other types of processing.
  9. Optionally, you can also implement methods to suspend, resume, and reset all the threads, and to restart the servers.
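The list processing of points 5 and 6 is independent of RMI and can be sketched as plain list operations. The class and method names below are only suggestions, and round-robin splitting is just one way to obtain lists of roughly equal size:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Client-side list processing: merge the lists collected from the
// servers and split the remaining URLs into one list per server.
// A sketch of points 5 and 6; names and strategy are assumptions.
final class ListBuilder {

    // Merge the visited lists from all servers into a single list,
    // dropping duplicates while keeping insertion order.
    static List<String> mergeVisited(List<List<String>> perServer) {
        LinkedHashSet<String> merged = new LinkedHashSet<>();
        for (List<String> list : perServer) {
            merged.addAll(list);
        }
        return new ArrayList<>(merged);
    }

    // Split the merged list of URLs left to explore into n lists of
    // roughly equal size, one per server, by round-robin assignment.
    static List<List<String>> split(List<String> toExplore, int n) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < toExplore.size(); i++) {
            parts.get(i % n).add(toExplore.get(i));
        }
        return parts;
    }
}
```

After each polling round, the client would call fetchAndSet with the merged visited list and server i's part of the split list, then sleep before polling again.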

Finally, estimate how much time and how many machines you would need to collect one billion addresses.
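As a hint for this estimate, a back-of-the-envelope calculation could divide the target by the aggregate crawl rate. The per-server rate below is purely an assumption; you should measure your own crawler instead:

```java
// Back-of-the-envelope estimate: days needed to collect totalUrls
// addresses with a given number of servers. The crawl rate passed
// in is an assumption; measure your own crawler's actual rate.
final class Estimate {
    static double daysNeeded(long totalUrls, int servers,
                             double urlsPerServerPerSecond) {
        double urlsPerSecond = servers * urlsPerServerPerSecond;
        return totalUrls / urlsPerSecond / 86_400.0; // seconds per day
    }
}
```

For example, under the (hypothetical) assumption of 10 servers each collecting 100 URLs per second, one billion addresses would take daysNeeded(1_000_000_000L, 10, 100.0), about 11.6 days.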