
Things to take note of while working on this assignment:

For this programming assignment, you will be working with a single file of size ~1.8 GB containing all the necessary data. The file holds sentences from each document, one per line, in the form:

AUTHOR_NAME<===>SENTENCE
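For illustration only, a mapper might split each line on the literal <===> delimiter shown above; the class name and key/value choices below are placeholders, not part of the assignment spec:

    // Sketch of a mapper that splits each input line on the literal
    // "<===>" delimiter. Names and output types are illustrative only.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class AuthorSentenceMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split only on the first occurrence of the delimiter.
            String[] parts = value.toString().split("<===>", 2);
            if (parts.length == 2) {
                // Emit (author, sentence); later jobs tokenize the sentence.
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }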

Please find a sample dataset here. You may test your code on this sample dataset on your local cluster.

While coding, you will have to clean the text of non-alphanumeric characters yourself, as you did in the previous assignment (a sketch of one possible cleanup step follows below).
The path to the large dataset on the shared cluster is /data/books.txt.
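As a minimal sketch of that cleanup step (the helper name is made up; adjust the rules to match whatever you used in the previous assignment):

    // Illustrative helper: lowercase, replace runs of non-alphanumeric
    // characters with a single space, then split into tokens.
    public static String[] cleanAndTokenize(String sentence) {
        String cleaned = sentence.toLowerCase()
                                 .replaceAll("[^a-z0-9]+", " ")
                                 .trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }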

When you finally run your map-reduce jobs on the large dataset on the shared cluster to get the TF/IDF values, please divide your jobs so that you do not run any job on the shared cluster unless it actually needs data from the shared cluster. To reduce your use of the shared cluster, once your first few map-reduce jobs have produced all the intermediate data you need for the remaining steps, extract that data to your nobackup directory using the hadoop get command. You may then put it on your local cluster and proceed with the rest of your map-reduce jobs.
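For example, the copy-out and copy-in steps might look like the following ("hadoop get" refers to hadoop fs -get; all paths below are placeholders, so substitute your own):

    # On the shared cluster: copy intermediate output from HDFS to nobackup.
    hadoop fs -get /user/<you>/intermediate ~/nobackup/intermediate

    # On your local cluster: load the saved data back into HDFS.
    hadoop fs -put ~/nobackup/intermediate /user/<you>/intermediate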

Also, when copying intermediate map-reduce outputs off the shared cluster for use on your local cluster, please store them in your nobackup directory, not on your local filesystem. The nobackup directory gives you much more storage space to work with; storing too much data on your local filesystem will prompt the network admins to lock your account.
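For reference, and assuming the assignment uses the standard definition (confirm against the full assignment description), the quantity the jobs compute for a term t in a document d, out of N documents, is:

    % Standard TF/IDF (an assumption -- check the assignment spec):
    %   tf(t, d) = number of occurrences of term t in document d
    %   df(t)    = number of documents containing term t
    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}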

