Networking Tools and PDF Downloader
Objectives
The objectives of the exercises are to:
- Use and understand basic networking tools;
- Understand the URL class;
- Understand the structure of a client-server application;
- Write a network client.
Summary of the Work
Each group will have to:
- Write a network client that reads a web page;
- Implement a program that analyzes the web page, extracts its links, and downloads documents from this web page. In this exercise, you will use the URL class to access the server files.
- Comment briefly on the results.
In the text below, the pointers to the Java classes and methods refer to the Java version available on the student computers: 1.7.0. You are free to use earlier versions.
Reading Assignments
Read the following chapters from the book Java Network Programming by Elliotte Rusty Harold:
- Chapter 1, on basic network concepts,
- Chapter 2, on streams,
- Chapter 4, on Internet addresses,
- Chapter 5, on URLs.
Using Networking Tools
Looking up Machines Using nslookup
The Internet protocol IPv4 identifies machines using a 4-byte address. Machines can also have names organized in hierarchical domains. A collection of distributed databases, the Domain Name System (DNS), provides the name-to-address mapping. The nslookup command is an interactive tool that finds an address from a name and vice versa.
- Run nslookup and find the address of www.lth.se.
- Find the name corresponding to the IP address 130.235.35.100.
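The same two lookups can also be made from Java with the InetAddress class (Internet addresses are the topic of Chapter 4 in the reading list). The sketch below is only an illustration; the class name Lookup and the methods addressOf and nameOf are made up for this example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class Lookup {

    // Forward lookup: host name (or literal address) to textual IP address.
    public static String addressOf(String host) throws UnknownHostException {
        return InetAddress.getByName(host).getHostAddress();
    }

    // Reverse lookup: IP address to host name. If no name can be found,
    // getCanonicalHostName() returns the textual address itself.
    public static String nameOf(String address) throws UnknownHostException {
        return InetAddress.getByName(address).getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        String host = args.length > 0 ? args[0] : "www.lth.se";
        String address = addressOf(host);
        System.out.println(host + " -> " + address);
        System.out.println(address + " -> " + nameOf(address));
    }
}
```

Run without arguments, the program looks up www.lth.se and then tries the reverse lookup on the address it found; the output depends on the DNS servers your machine uses.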
Looking up Routes Using traceroute
The Internet interconnects thousands of networks: the Lund campus network to the Swedish university computer network SUNET, SUNET to Telia, etc. Data find their way through the network with the help of routers: devices that link one network to another and forward the data toward its destination. In this exercise, you will trace the sequence of routers from Lund to the University of Colorado.
- Run traceroute and find the routers between your machine and www.colorado.edu.
- Several networks are involved, among them the Lund University network, SUNET (Swedish university network), NORDUnet (Nordic university network), Internet2, and finally the University of Colorado. Identify their respective IP addresses in the router sequence.
- Identify these routers visually on the network maps. Use the maps from SUNET, NORDUnet (click on the map for the details), and the Internet2 network operations center. You may also use the whois tool.
- You may want to try traceroute on www.tu-berlin.de and have a look at the Géant map.
Looking up Ports Using netstat
The netstat command shows information on the network status. It comes with a dozen options. Here we examine the interface, routing, and all socket options.
- The interface corresponds to the network hardware attachment, which is Ethernet most of the time on the campus. A connected machine has at least two device entries. The first one is the loop-back interface, which prevents self-addressed data from going out, and the second one is a real piece of equipment. Run netstat -i. What are the names of the interfaces on your machine?
- A routing table contains the final destinations together with their first intermediate gateway. As for the loop-back interface, a machine designates itself using a specific name: localhost. Run netstat -r. Use the options n and a. Where do all your packets go when they are not destined for the local network? What is the localhost address?
- Most network communications involve a local and a remote party. Both parties have IP addresses. Machines usually provide many services: telnet, ftp, chat, etc. A distinct port number identifies each TCP/UDP service. Hence, a communication is identified by four numbers: <remote_addr, remote_port, local_addr, local_port>. Run netstat -a. What are the transport protocols on your machine? List some services, their IP address, and their TCP/UDP ports.
- Finite-state machines are convenient devices to model data transmission. They consist of a set of states linked by arcs. When the protocol is in a state, an event can trigger a transition to another state and the machine puts a symbol out. A TCP event corresponds to the reception of a packet and an output corresponds to a transmission. Have a look at a TCP state machine at http://www.texample.net/tikz/examples/tcp-state-machine/ and identify the server and client path. Run netstat -a again and identify the state of some connections.
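To connect the netstat -a output with the programming part, the four numbers that identify a connection can also be read from a connected Java socket. The sketch below is a minimal illustration; the class name ConnectionInfo, its methods, and the choice of port 80 are assumptions made for this example.

```java
import java.io.IOException;
import java.net.Socket;

public class ConnectionInfo {

    // Formats the four numbers that identify a connection.
    public static String format(String remoteAddr, int remotePort,
                                String localAddr, int localPort) {
        return "<" + remoteAddr + ", " + remotePort + ", "
                + localAddr + ", " + localPort + ">";
    }

    // Reads the four numbers from a connected socket.
    public static String describe(Socket socket) {
        return format(socket.getInetAddress().getHostAddress(),
                      socket.getPort(),
                      socket.getLocalAddress().getHostAddress(),
                      socket.getLocalPort());
    }

    public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "www.lth.se";
        // Port 80 (HTTP) is an assumption; any open TCP port would do.
        try (Socket socket = new Socket(host, 80)) {
            System.out.println(describe(socket));
        }
    }
}
```

While the socket is open, the same connection should appear in the netstat -a listing, normally in the ESTABLISHED state.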
Programming
Reading a Web Page
DownThemAll! is a Firefox add-on that reads a web page, parses it, and extracts its links to images, videos, and other kinds of documents. The DownThemAll! interface lets users select the objects they want to download and then downloads them.
The figures below show a web page and the links this add-on has extracted.
The objective of your program is to create a download tool that reads a web page address as input and downloads all the PDF files it contains.
- Read the URL class from the Java API documentation.
- Write a Java program that takes a web page address as input and downloads the corresponding page.
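One possible starting point for this step is sketched below: it opens a stream on the URL and collects the characters into a string. The class name PageReader and the method readPage are made up for this example, and the code assumes that the page is encoded in UTF-8, which is not true of every server.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class PageReader {

    // Downloads the document at the given address and returns it as a string.
    // Assumption: the page is encoded in UTF-8.
    public static String readPage(String address) throws IOException {
        URL url = new URL(address);
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                page.append(buffer, 0, n);
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java PageReader <web page address>");
            return;
        }
        System.out.println(readPage(args[0]));
    }
}
```

The try-with-resources statement closes the stream even if reading fails. A malformed address makes the URL constructor throw a MalformedURLException, which is a subclass of IOException.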
Analyzing the Page
You will now design a method that extracts the hyperlinks from an HTML web page.
HTML is a markup language that describes the format of a web page. It consists of start and end tags in angle brackets that tell the web browser how to display the page. To render a text in italics, the phrase “in italics” is surrounded by two tags in the HTML source text:
- The start tag <i> and
- The end tag </i>
Hyperlinks use the tag <a> to mark a phrase as a link and the attribute href within the start tag to give the address of this link. The code below:
<a href="http://fileadmin.cs.lth.se/cs/Education/EDA095/2014/kursprogram2014.pdf">Klicka här!</a>
tells the browser that clicking on the phrase "Klicka här!" will take you to the PDF document kursprogram2014.pdf, whose URL is: http://fileadmin.cs.lth.se/cs/Education/EDA095/2014/kursprogram2014.pdf
Note that an <a> tag may contain other attributes besides href, for example <a href="http://fileadmin.cs.lth.se/cs/Education/EDA095/2014/kursprogram2014.pdf" title="open link in a new page">Klicka här!</a>
- Write string-processing code that extracts the value of the href attribute from an <a> tag. To carry this out, you can either:
- Use regular expressions (recommended); see the documentation of the java.util.regex package.
- Use string processing functions from the String class.
- Write a method in your program that, given a web page, extracts all the links to PDF documents and returns them in a list: List<URL>.
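The regular-expression approach could look roughly like the sketch below. It is one possible solution outline, not a reference implementation: the class name LinkExtractor, the method extractPdfLinks, and the base parameter (the address of the page, used to resolve relative links) are assumptions of this example. The pattern only handles quoted href values and is not a full HTML parser.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Matches the href attribute of an <a> start tag, in single or double
    // quotes. Other attributes (title, class, ...) may precede or follow it.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s(?:[^>]*?\\s)?href\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the links of the page whose path ends in ".pdf".
    public static List<URL> extractPdfLinks(String html, URL base) {
        List<URL> links = new ArrayList<URL>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Resolve relative links against the address of the page.
                URL link = new URL(base, m.group(2).trim());
                if (link.getPath().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                    links.add(link);
                }
            } catch (MalformedURLException e) {
                // Skip links that the URL class cannot parse.
            }
        }
        return links;
    }

    public static void main(String[] args) throws MalformedURLException {
        // Hypothetical page address and content, for demonstration only.
        URL base = new URL("http://example.com/course/index.html");
        String html = "<a href=\"slides/lecture1.pdf\" title=\"slides\">Lecture 1</a>"
                + " <a href=\"index.html\">Home</a>";
        for (URL link : extractPdfLinks(html, base)) {
            System.out.println(link);
        }
    }
}
```

The test on the path rather than on the whole address means that a link such as report.pdf?version=2 is still recognized. Comparing the links as strings is safer than calling URL.equals, which may trigger a DNS lookup.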
Downloading the Links
Finish your program so that it downloads all the PDF documents from the links you have collected.
Test your program on a few pages and check that it correctly downloads all the PDF documents these pages contain. Notably, try your program on this page: http://cs.lth.se/edaf65/foerelaesningar
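The download step itself can be sketched as follows, assuming Java 7 and its java.nio.file package. The class name PdfDownloader, the methods fileNameOf and download, and the fallback file name are made up for this example. The sketch takes the local file name from the last segment of the URL path without decoding escapes such as %20, and it overwrites an existing file of the same name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class PdfDownloader {

    // Derives a local file name from the last segment of the URL path.
    public static String fileNameOf(URL url) {
        String path = url.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        // The fallback name is an assumption of this sketch.
        return name.isEmpty() ? "download.pdf" : name;
    }

    // Copies the document at the given URL into the target directory.
    public static Path download(URL url, Path directory) throws IOException {
        Path target = directory.resolve(fileNameOf(url));
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) {
        Path directory = Paths.get(".");
        for (String address : args) {
            try {
                System.out.println("Saved " + download(new URL(address), directory));
            } catch (IOException e) {
                // Report the failure and continue with the next document.
                System.err.println("Could not download " + address + ": " + e.getMessage());
            }
        }
    }
}
```

In a complete program, the loop in main would run over the List<URL> collected from the page instead of over the command-line arguments.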