CS 200, Fall 2014: Assignment 3
More Java, More Functionality and Binary Search Trees
Due Monday 10/20/14 by 12:00 PM
The purpose of this assignment is to continue to extend your program's capabilities and to practice working with a new data structure: Binary Search Trees. We will be adding functionality needed to match queries to documents: TFIDF computation. We also will be asking you to make more decisions about how to implement your code. For example, the specification requires only one new class, but you might want to create others to support what you need.
Creating New Class: BST
A central advantage of well designed OO code is that it is easy to change the underlying data structures used to implement an object. So, you will need to replace the ArrayList implementation of termIndex with a binary search tree implementation. Both provide similar capabilities for storing and searching.
Each node in the BST will keep the same data as the ArrayList did, i.e., word, overall frequency and a linked list of document names with their frequencies. The new BST class will need the following methods:
- public BST()
- which initializes an instance variable called "root" as null and an instance variable called "count" as 0.
- public int size()
- which returns the number of unique words in the document (i.e., count).
- public void add(String documentName, String word)
- which adds a new Term or increments frequencies if the term already exists in the BST.
- public Term get(String word, Boolean printDepth)
- which
returns the Term object for the word. If printDepth is
true, then get should keep track of how deep in the tree it
finds word and print out the value at the end in the form " At depth 1" (At is preceded by 2 spaces). If the word is not found, it should print the deepest level that it checked.
Because you are not implementing pruneStopWords, you do not need a delete method.
BST class should be available only to the Webpages class and no other class.
You should also create a BSTIterator that supports an inorder
traversal of the binary search tree. The next method should return the
data (type Term).
Update the methods in your other classes as necessary to support the new implementation.
Sorting
To simplify the code, we will not be supporting the
pruneStopWords method for this assignment.
New Functionality: Calculating TFIDF
At this stage, we are ready to start calculating the primary metric used in information retrieval to match queries to documents (i.e., to find a document on a topic of interest to the user). In information retrieval, documents are represented as TF-IDF (Term Frequency - Inverse Document Frequency) vectors. Each document d is converted into a vector in a multi-dimensional space, with one dimension for each term t found in the document set. This vector includes a dimension for each word and is computed for each document.
The components of such a vector are computed using the following well-known formula:
TFIDF(d,t) = TF(d,t) * log (D / DF(t))
where TF(d,t) is the number of occurrences of term t in
document d ("term frequency"), and the logarithm is referred to
as the "inverse document frequency" of term t. D is the
total number of available documents, and DF(t) is the number of
documents containing term t. The log here is the natural log
(Math.log() in Java). TFIDF weights should be of type double,
but should be printed out with only 2 digits after the ".". Hint: If you store D and DF as ints, then you may need to cast them to float to get the equation to work properly.
You should have access to all of these values in your current data
structures. TF should be the frequency count kept with the
document name in the linked list of Occurrences for each Term. DF is the length of the linked list for the Term. D is the total number of documents that have been read in, which Webpages should be counting.
To keep it simple for now, you need only calculate the TFIDF value for each document for a given term when executing the whichPages method from Webpages.
Input/Output
Input format is identical to Prog2.
Output is changed somewhat:
- No longer print out the termIndex twice. Print it out only once.
- Invocation of whichPages from your Main should cause it to print out the level as described in the BST method get.
- When the results of whichPages are printed, the name of the document should be followed by its TFIDF value for that term.
To better show the TFIDF computation, the example requires four files: test3, simple3.txt, example2.txt and data3.txt. Running on these produces this output.
Main Method
The main method in the top level class, called PA3, is in charge of
reading the input file and processing the directives in there. It is
nearly identical to that of PA2 and should:
- Create the Webpages object
- Add the new pages
- Print out the words in the same format as PA1 and PA2
- Run whichPages method on Webpages object for each of the words in
the input file:
- It should print the level at which the word is found (as described for the BST get method)
- If the word has been found, it should print "word in pages:
page1, page2..." as shown in the output example.
- If the word is not found, it should print "word not found".
Summary of Changes to Existing Program
- termIndex should be implemented as a binary search tree.
- Some minor changes to output to reduce length (only print termIndex once) and to show depth at which terms are found in the BST.
- Addition of class: BST.
- Addition of code to calculate TFIDF for the terms when Webpages whichPages method is executed.