Assignment 3: CS 200 Data Structures and Algorithms (Fall 2014)

CS 200, Fall 2014: Assignment 3
More Java, More Functionality and Binary Search Trees

Due Monday 10/20/14 by 12:00 PM

The purpose of this assignment is to continue to extend your program's capabilities and to practice working with a new data structure: Binary Search Trees. We will be adding functionality needed to match queries to documents: TFIDF computation. We also will be asking you to make more decisions about how to implement your code. For example, the specification requires only one new class, but you might want to create others to support what you need.

Creating New Class: `BST`

A central advantage of well designed OO code is that it is easy to change the underlying data structures used to implement an object. So, you will need to replace the ArrayList implementation of termIndex with a binary search tree implementation. Both provide similar capabilities for storing and searching. Each node in the BST will keep the same data as the ArrayList did, i.e., word, overall frequency and a linked list of document names with their frequencies. The new BST class will need the following methods:

public BST(): which initializes an instance variable called "root" as null and an instance variable called "count" as 0.
public int size(): which returns the number of unique words in the document (i.e., count).
public void add(String documentName, String word): which adds a new Term or increments frequencies if the term already exists in the BST.
public Term get(String word, Boolean printDepth): which returns the Term object for the word. If printDepth is true, then get should keep track of how deep in the tree it finds word and print out the value at the end in the form " At depth 1" (At is preceded by 2 spaces). If the word is not found, it should print the deepest level that it checked.

Because you are not implementing pruneStopWords, you do not need a delete method.

BST class should be available only to the Webpages class and no other class.

You should also create a BSTIterator that supports an inorder traversal of the binary search tree. The next method should return the data (type Term).

Update the methods in your other classes as necessary to support the new implementation.

Sorting

To simplify the code, we will not be supporting the pruneStopWords method for this assignment.

New Functionality: Calculating TFIDF

At this stage, we are ready to start calculating the primary metric used in information retrieval to match queries to documents (i.e., to find a document on a topic of interest to the user). In information retrieval, documents are represented as TF-IDF (Term Frequency - Inverse Document Frequency) vectors. Each document d is converted into a vector in a multi-dimensional space, with one dimension for each term t found in the document set. This vector includes a dimension for each word and is computed for each document.

The components of such a vector are computed using the following well-known formula:
TFIDF(d,t) = TF(d,t) * log (D / DF(t))
where TF(d,t) is the number of occurrences of term t in document d ("term frequency"), and the logarithm is referred to as the "inverse document frequency" of term t. D is the total number of available documents, and DF(t) is the number of documents containing term t. The log here is the natural log (Math.log() in Java). TFIDF weights should be of type double, but should be printed out with only 2 digits after the ".". Hint: If you store D and DF as ints, then you may need to cast them to float to get the equation to work properly.

You should have access to all of these values in your current data structures. TF should be the frequency count kept with the document name in the linked list of Occurrences for each Term. DF is the length of the linked list for the Term. D is the total number of documents that have been read in, which Webpages should be counting.

To keep it simple for now, you need only calculate the TFIDF value for each document for a given term when executing the whichPages method from Webpages.

Input/Output

Input format is identical to Prog2.

Output is changed somewhat:

No longer print out the termIndex twice. Print it out only once.
Invocation of whichPages from your Main should cause it to print out the level as described in the BST method get.
When the results of whichPages are printed, the name of the document should be followed by its TFIDF value for that term.

To better show the TFIDF computation, the example requires four files: test3, simple3.txt, example2.txt and data3.txt. Running on these produces this output.

Main Method

The main method in the top level class, called PA3, is in charge of reading the input file and processing the directives in there. It is nearly identical to that of PA2 and should:

Create the Webpages object
Add the new pages
Print out the words in the same format as PA1 and PA2
Run whichPages method on Webpages object for each of the words in the input file:

It should print the level at which the word is found (as described for the BST get method)
If the word has been found, it should print "word in pages: page1, page2..." as shown in the output example.
If the word is not found, it should print "word not found".

Summary of Changes to Existing Program

termIndex should be implemented as a binary search tree.
Some minor changes to output to reduce length (only print termIndex once) and to show depth at which terms are found in the BST.
Addition of class: BST.
Addition of code to calculate TFIDF for the terms when Webpages whichPages method is executed.

CS 200, Fall 2014: Assignment 3 More Java, More Functionality and Binary Search Trees