CS 200, Fall 2014: Assignment 5
Graphs and Document Retrieval with Authority
Due Wednesday 12/10/14 by noon
The goal of the assignment is to learn to represent and use
graphs. You will be extracting hrefs from simplified web pages and constructing
a graph of these hyperlinks. From the graph, you will augment
retrieval slightly and output the graph in a form that allows
graphical display.
Extracting Hyperlinks
You need to extract the names of the files being linked to. This will
require revisiting how you extract terms because now you need to
actually read some parts that are between angle brackets! Consider it
a good opportunity to review regular expressions prior to the
final...
To
expedite this, you can assume that the format of a link is pretty
rigid, and is shown as a grammar rule where uppercase indicates the
non-terminals and the rest are terminal symbols:
LINK -> <a href="http://FILENAME">
FILENAME is a string of characters for legal filenames; so filenames
can include characters you were stripping out before such as
-_%...(see example files). [Note: as before, these characters should still be stripped
from any text not found to be part of FILENAME.] All of the format above should be thrown
away; you need to extract FILENAME. I've also allowed for extra
whitespace between the "a" and the "href". This does not need to be
legal html -- so </a> or any printable tag is not required after FILENAME.
Note: anything not in this strict format is not to be
considered as a link and previous rules apply for parsing. Filenames
should be converted to lowercase to expedite matching.
Note: the links may refer to filenames that are not actually being
read in. Those filenames should still be vertices in the graph.
Representing Hyperlinks
You should convert the observed links into a graph. Each filename read
in and each link extracted should become vertices in the graph. Each
link should become a directed edge between the file in which it appears (use
its filename) and the extracted link.
You can use either of the graph implementations described in class
(either Adjacency Matrix or Adjacency List) and whatever data
structures underlying them that you wish. Name the class Graph to
expedite grading.
Notes:
- Your calling code (in WebPages) will always refer to vertices by
their filenames.
- Recall that a graph is a set of vertices; each filename should
only be added once as a vertex.
- Vertices should be kept in alphabetical order by URL. Edges should be
ordered by when they are added. If you modify
the text code, you can get this automatically. Alphabetical order becomes
important if there are ties in which page is best.
- In addition to adding vertices and edges, you will need two
new methods in the Graph class that are called from WebPages:
- public int inDegree(String filename)
- which returns the
number of edges incoming to a particular filename.
- public void writeDotFile(String outputFile)
- which
creates a file that specifies the graph and can be printed by the
dot program. The outputFile should have the extension ".dot" added
to it. Please see: Graphviz for details on how this is
accomplished. You might also want to read up on dot file
format. Graphviz is installed on the department Linux machines. To
view a dot file, you use the dotty command. Follow the format
shown in the output example. The first line should state "digraph
program 5" ( to indicate that it is a directed graph) and order
the edges such that the from vertices are in alphabetical
order. Edges take the form of FROM-VERTEX -> TO-VERTEX; (where
FROM-VERTEX and TO-VERTEX are the labels for the vertices and
should be in "").
Augmenting Retrieval
This is a simple change: simply multiply each TFIDF similarity by the
indegree of the file. The idea is that nodes with lots of incoming
links must be popular and should be viewed as "authorities". Thus, the
authority level elevates the similarity. The file with the highest new
similarity is the best.
Functionality Compared to PA4
Again, much of the functionality of PA4 should remain. The main
method should cause the pages to be read in, should prune the
stopwords as directed by the input file and then should process the
quries as before. Finally, it should print the
graph in an output file.
The input format will be the same as PA4 except for the addition of
a line at the front which indicates the name of the output of the
graph. Note: it still incudes the initial size for the hash table
whether you chose to use the hash table or not.
The output format will be the same except for the creation of the
output file for the graph. Of course, the similarity values could
differ given the authority computation. Similarity matching should
follow the same guidelines as in PA4: similarity values of 0 lead to
"not found" messages.
You can decide for yourself which data structure to use for the term
index. You need no longer print out its contents.
WebPages will need a new method to print the graph.
It will also need to modify how it parses the files and will
need to include a Graph object for the links.
Otherwise the rest of the code should not need to be changed.
Examples
Using test5.txt
should produce this
output and this graph file.
The files in test5.txt are
simple5.txt,
simple5a.txt,
simple5b.txt.