lu.se

Datavetenskap

Lunds Tekniska Högskola



Assignment #1: Networking Tools and PDF Downloader

Objectives

The objectives of the exercises are to:

  • Use and understand basic networking tools;
  • Understand the URL class;
  • Understand the structure of a client-server application;
  • Write a network client.

Summary of the Work

Each group will have to:

  • Write a network client that reads a web page;
  • Implement a program that analyzes the web page, extracts its links, and downloads the documents they point to. In this exercise, you will use the URL class to access the server files;
  • Comment briefly on the results.

In the text below, the pointers to the Java classes and methods refer to the Java version available on the student computers: 1.7.0. You are free to use earlier versions.

Reading Assignments

Read the following chapters from the book Java Network Programming by Elliotte Rusty Harold:

  • Chapter 1, on basic network concepts,
  • Chapter 2, on streams,
  • Chapter 4, on Internet addresses,
  • Chapter 5, on URLs.

Using Networking Tools

Looking up Machines Using nslookup

The Internet protocol IPv4 identifies machines using a 4-byte address. Machines can also have names organized in hierarchical domains. A collection of distributed databases, the Domain Name System (DNS), provides the name-to-address mapping. The nslookup command is an interactive tool that finds an address from a name and vice versa.

  1. Run nslookup and find the address of www.lth.se.
  2. Find the name corresponding to the IP address 130.235.35.100.
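
The same two lookups can also be done from Java with the InetAddress class (Chapter 4 of the reading list). The sketch below is only an illustration: the class name LookupSketch and the helper fromBytes are made up for this example, and the lookups in main need network access and a working DNS resolver.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class LookupSketch {

    // Builds an InetAddress from the four bytes of an IPv4 address.
    // No DNS query is made here. (Hypothetical helper, named for this sketch only.)
    static InetAddress fromBytes(int b1, int b2, int b3, int b4) throws UnknownHostException {
        return InetAddress.getByAddress(new byte[] {(byte) b1, (byte) b2, (byte) b3, (byte) b4});
    }

    public static void main(String[] args) throws UnknownHostException {
        // Forward lookup: name to address.
        InetAddress byName = InetAddress.getByName("www.lth.se");
        System.out.println(byName.getHostAddress());

        // Reverse lookup: address to name. If no name is found,
        // getHostName() returns the address in textual form.
        InetAddress byAddress = fromBytes(130, 235, 35, 100);
        System.out.println(byAddress.getHostName());
    }
}
```

The printed values depend on the DNS data at the time you run the program, so compare them with what nslookup returns.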

Looking up Routes Using traceroute

The Internet interconnects thousands of networks: the Lund campus network to the Swedish university computer network SUNET, SUNET to Telia, etc. Data find their way through these networks via routers: devices that link one network to another and forward the data toward its destination. In this exercise, you will trace the sequence of routers from Lund to the University of Colorado.

  1. Run traceroute and find the routers between your machine and www.colorado.edu.
  2. Several networks are involved, among them the Lund University network, SUNET (Swedish university network), NORDUnet (Nordic university network), Internet2, and finally the University of Colorado. Identify their respective IP addresses in the router sequence.
  3. Identify these routers visually on the network maps. Use SUNET, NORDUnet (click on the map for the details), and Internet2 network operations center. You may also use the whois tool.
  4. You may want to try traceroute on www.tu-berlin.de and have a look at the Géant map.

Looking up Ports Using netstat

The netstat command shows information on the network status. It comes with a dozen options. Here we examine the interface, routing, and all socket options.

  1. The interface corresponds to the network hardware attachment, Ethernet most of the time on the campus. A connected machine has at least two device entries. The first one is the loop-back interface, which prevents self-addressed data from going out on the network; the second one is a real piece of equipment. Run netstat -i. What are the names of these interfaces on your machine?
  2. A routing table contains the final destinations together with their first intermediate gateway. As for the loop-back interface, a machine designates itself using a specific name: localhost. Run netstat -r. Use the options n and a. Where do all your packets go when they are not destined for the local network? What is the localhost address?
  3. Most network communications involve a local and a remote party. Both parties have IP addresses. Machines usually provide many services: telnet, ftp, chat, etc. A distinct port number identifies each TCP/UDP service. Hence, a communication is identified by four numbers: <remote_addr, remote_port, local_addr, local_port>. Run netstat -a. What are the transport protocols on your machine? List some services, their IP addresses, and their TCP/UDP ports.
  4. Finite-state machines are convenient devices to model data transmission. They consist of a set of states linked by arcs. When the protocol is in a state, an event can trigger a transition to another state and the machine puts a symbol out. A TCP event corresponds to the reception of a packet and an output corresponds to a transmission. Have a look at a TCP state machine at http://www.texample.net/tikz/examples/tcp-state-machine/ and identify the server and client path. Run netstat -a again and identify the state of some connections.
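
To make the idea of states, events, and outputs concrete, here is a minimal Java sketch of a finite-state machine that covers only the connection-opening part of TCP. It is a simplification, not a TCP implementation: the class TcpFsmSketch, the event names, and the output strings are invented for this example, and the open events stand for application calls rather than received packets. The state names follow the usual TCP diagram; netstat may spell some of them slightly differently depending on the operating system.

```java
import java.util.EnumMap;
import java.util.Map;

final class TcpFsmSketch {

    enum State { CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED }

    enum Event { PASSIVE_OPEN, ACTIVE_OPEN, RECV_SYN, RECV_SYN_ACK, RECV_ACK }

    // One arc of the machine: the state reached and the segment sent ("" if none).
    static final class Arc {
        final State next;
        final String output;

        Arc(State next, String output) {
            this.next = next;
            this.output = output;
        }
    }

    private static final Map<State, Map<Event, Arc>> ARCS =
            new EnumMap<State, Map<Event, Arc>>(State.class);

    private static void arc(State from, Event on, State to, String output) {
        Map<Event, Arc> outgoing = ARCS.get(from);
        if (outgoing == null) {
            outgoing = new EnumMap<Event, Arc>(Event.class);
            ARCS.put(from, outgoing);
        }
        outgoing.put(on, new Arc(to, output));
    }

    static {
        // Client path: active open, then the reply to the server's SYN+ACK.
        arc(State.CLOSED, Event.ACTIVE_OPEN, State.SYN_SENT, "SYN");
        arc(State.SYN_SENT, Event.RECV_SYN_ACK, State.ESTABLISHED, "ACK");
        // Server path: passive open, then the three-way handshake.
        arc(State.CLOSED, Event.PASSIVE_OPEN, State.LISTEN, "");
        arc(State.LISTEN, Event.RECV_SYN, State.SYN_RCVD, "SYN+ACK");
        arc(State.SYN_RCVD, Event.RECV_ACK, State.ESTABLISHED, "");
    }

    // Follows one arc; rejects events that have no arc in this reduced model.
    static Arc step(State current, Event event) {
        Map<Event, Arc> outgoing = ARCS.get(current);
        Arc arc = (outgoing == null) ? null : outgoing.get(event);
        if (arc == null) {
            throw new IllegalStateException("No transition from " + current + " on " + event);
        }
        return arc;
    }

    // Applies a sequence of events and returns the final state.
    static State run(State start, Event... events) {
        State current = start;
        for (Event event : events) {
            current = step(current, event).next;
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(run(State.CLOSED, Event.ACTIVE_OPEN, Event.RECV_SYN_ACK));
        System.out.println(run(State.CLOSED, Event.PASSIVE_OPEN, Event.RECV_SYN, Event.RECV_ACK));
    }
}
```

Both paths in main end in the ESTABLISHED state, which is the state netstat -a typically reports for an open connection. The closing part of the diagram is left out on purpose.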

Programming

Reading a Web Page

DownThemAll! is a Firefox add-on that reads a web page, parses it, and extracts its links to images, videos, and other kinds of documents. The DownThemAll! interface lets the user select the objects to download and then downloads them. The figures below show a web page and the links this add-on has extracted.
[Figure: web page image]
[Figure: links extracted by the add-on]

The objective of your program is to create a download tool that reads a web page address as input and downloads all the PDF files it contains.

  1. Read the URL class from the Java API documentation.
  2. Write a Java program that takes a web page address as input and downloads the corresponding page.
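
As a starting point, reading a page through the URL class could look like the sketch below. The class name PageReaderSketch and the method readPage are hypothetical, and the code assumes the page is encoded in UTF-8, which a real page may not be.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;

final class PageReaderSketch {

    // Reads the whole resource behind the URL into a string, line by line.
    // Assumption: the content is text encoded in UTF-8.
    static String readPage(URL url) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), Charset.forName("UTF-8")))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // The web page address is expected as the first command-line argument.
        System.out.print(readPage(new URL(args[0])));
    }
}
```

The try-with-resources statement closes the stream even if reading fails. Since openStream() works for any protocol the URL class supports, the method can be tried on a local file: URL before pointing it at a web server.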

Analyzing the Page

You will now design a method that extracts the hyperlinks from an HTML web page.

HTML is a markup language that describes the format of a web page. It consists of start and end tags in angle brackets that tell the web browser how to display the page. To render a text in italics, the phrase “in italics” is surrounded by two tags in the HTML source text:

  1. The start tag <i> and
  2. The end tag </i>

Hyperlinks use the tag <a> to mark a phrase as a link and the attribute href within the start tag to give the address of this link. The code below:

<a href="http://fileadmin.cs.lth.se/cs/Education/EDA095/2014/kursprogram2014.pdf">Klicka här!</a>

tells the browser that clicking on the phrase "Klicka här!" leads to the PDF document kursprogram2014.pdf, whose URL is: http://fileadmin.cs.lth.se/cs/Education/EDA095/2014/kursprogram2014.pdf

NOTE: there may be other attributes besides href in an <a> tag, for example <a href="http://fileadmin.cs.lth.se/cs/Education/EDA095/2014/kursprogram2014.pdf" title="open link in a new page">Klicka här!</a>

  1. Write string-processing code that extracts the href attribute from an <a> tag. To carry this out, you can either:
    • Use regular expressions (recommended). See the documentation on them here: and here.
    • Use string processing functions from the String class.
    Should you use regular expressions, you can test them online with this site: regex101.com
  2. Write a method in your program that, given a web page, extracts all the links corresponding to a PDF document and returns them in a list: List<URL>.
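
One possible shape for the regular-expression approach is sketched below. It is not a full HTML parser: the pattern assumes the href value is enclosed in single or double quotes, and it would also match an attribute whose name merely ends in href. The class PdfLinkSketch, the method extractPdfLinks, and the choice to recognize PDF links by a path ending in .pdf are assumptions made for this example.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PdfLinkSketch {

    // Matches the start of an <a ...> tag and captures the quoted href value.
    // Other attributes may come before href inside the tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the links whose path ends in ".pdf"; relative links are
    // resolved against the address of the page they were found on.
    static List<URL> extractPdfLinks(URL base, String html) {
        List<URL> links = new ArrayList<URL>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            try {
                URL link = new URL(base, matcher.group(1));
                if (link.getPath().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                    links.add(link);
                }
            } catch (MalformedURLException e) {
                // Skip values that are not valid URLs, e.g. javascript: links.
            }
        }
        return links;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://cs.lth.se/edaf65/foerelaesningar");
        String html = "<a href=\"http://fileadmin.cs.lth.se/cs/Education/EDA095/2014/"
                + "kursprogram2014.pdf\" title=\"open link in a new page\">Klicka här!</a>";
        System.out.println(extractPdfLinks(base, html));
    }
}
```

Passing the page address as base lets the two-argument URL constructor turn relative links into absolute ones, so the returned list can be handed directly to the download step.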

Downloading the Links

Finish your program so that it downloads all the PDF documents from the links you have collected.

Test your program on a few pages and check that it correctly downloads all the PDF documents these pages contain. Notably, try your program on this page: http://cs.lth.se/edaf65/foerelaesningar
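
The download step can be sketched along the following lines. The names PdfDownloadSketch, fileNameOf, download, and downloadAll are hypothetical, and the sketch assumes that the last segment of the URL path is a usable file name; it does not decode escaped characters such as %20 or handle two links that share the same file name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

final class PdfDownloadSketch {

    // Takes the part of the URL path after the last '/' as the file name.
    static String fileNameOf(URL url) {
        String path = url.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Copies the bytes behind the URL into targetDir and returns the new file.
    // An existing file with the same name is overwritten.
    static Path download(URL url, Path targetDir) throws IOException {
        Path target = targetDir.resolve(fileNameOf(url));
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    // Downloads every link; a failure on one link does not stop the others.
    static void downloadAll(List<URL> links, Path targetDir) {
        for (URL link : links) {
            try {
                System.out.println("Saved " + download(link, targetDir));
            } catch (IOException e) {
                System.err.println("Failed " + link + ": " + e.getMessage());
            }
        }
    }
}
```

PDF files are binary, so the bytes are copied unchanged from the input stream instead of being read as text lines.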