Assignment 5: CS 200 Data Structures and Algorithms (Fall 2014)

CS 200, Fall 2014: Assignment 5
Graphs and Document Retrieval with Authority

Due Wednesday 12/10/14 by noon

The goal of the assignment is to learn to represent and use graphs. You will be extracting hrefs from simplified web pages and constructing a graph of these hyperlinks. From the graph, you will augment retrieval slightly and output the graph in a form that allows graphical display.

Extracting Hyperlinks

You need to extract the names of the files being linked to. This will require revisiting how you extract terms because now you need to actually read some parts that are between angle brackets! Consider it a good opportunity to review regular expressions prior to the final...

To expedite this, you can assume that the format of a link is pretty rigid, and is shown as a grammar rule where uppercase indicates the non-terminals and the rest are terminal symbols:

LINK -> <a href="http://FILENAME">

FILENAME is a string of characters for legal filenames; so filenames can include characters you were stripping out before such as -_%...(see example files). [Note: as before, these characters should still be stripped from any text not found to be part of FILENAME.] All of the format above should be thrown away; you need to extract FILENAME. I've also allowed for extra whitespace between the "a" and the "href". This does not need to be legal html -- so </a> or any printable tag is not required after FILENAME.
Note: anything not in this strict format is not to be considered as a link and previous rules apply for parsing. Filenames should be converted to lowercase to expedite matching.

Note: the links may refer to filenames that are not actually being read in. Those filenames should still be vertices in the graph.

Representing Hyperlinks

You should convert the observed links into a graph. Each filename read in and each link extracted should become vertices in the graph. Each link should become a directed edge between the file in which it appears (use its filename) and the extracted link.

You can use either of the graph implementations described in class (either Adjacency Matrix or Adjacency List) and whatever data structures underlying them that you wish. Name the class Graph to expedite grading.

Notes:

Your calling code (in WebPages) will always refer to vertices by their filenames.
Recall that a graph is a set of vertices; each filename should only be added once as a vertex.
Vertices should be kept in alphabetical order by URL. Edges should be ordered by when they are added. If you modify the text code, you can get this automatically. Alphabetical order becomes important if there are ties in which page is best.
In addition to adding vertices and edges, you will need two new methods in the Graph class that are called from WebPages:

public int inDegree(String filename)

which returns the number of edges incoming to a particular filename.

public void writeDotFile(String outputFile)
which creates a file that specifies the graph and can be printed by the dot program. The outputFile should have the extension ".dot" added to it. Please see: Graphviz for details on how this is accomplished. You might also want to read up on dot file format. Graphviz is installed on the department Linux machines. To view a dot file, you use the dotty command. Follow the format shown in the output example. The first line should state "digraph program 5" ( to indicate that it is a directed graph) and order the edges such that the from vertices are in alphabetical order. Edges take the form of FROM-VERTEX -> TO-VERTEX; (where FROM-VERTEX and TO-VERTEX are the labels for the vertices and should be in "").

Augmenting Retrieval

This is a simple change: simply multiply each TFIDF similarity by the indegree of the file. The idea is that nodes with lots of incoming links must be popular and should be viewed as "authorities". Thus, the authority level elevates the similarity. The file with the highest new similarity is the best.

Functionality Compared to PA4

Again, much of the functionality of PA4 should remain. The main method should cause the pages to be read in, should prune the stopwords as directed by the input file and then should process the quries as before. Finally, it should print the graph in an output file.

The input format will be the same as PA4 except for the addition of a line at the front which indicates the name of the output of the graph. Note: it still incudes the initial size for the hash table whether you chose to use the hash table or not.

The output format will be the same except for the creation of the output file for the graph. Of course, the similarity values could differ given the authority computation. Similarity matching should follow the same guidelines as in PA4: similarity values of 0 lead to "not found" messages.

You can decide for yourself which data structure to use for the term index. You need no longer print out its contents.

WebPages will need a new method to print the graph. It will also need to modify how it parses the files and will need to include a Graph object for the links.

Otherwise the rest of the code should not need to be changed.

Examples

Using test5.txt should produce this output and this graph file. The files in test5.txt are simple5.txt, simple5a.txt, simple5b.txt.