Things to note while working on this assignment:

For this Programming Assignment, you will work with a single file of about 1.2 GB that contains all the necessary data. The file holds the sentences from each document in the form:


Please find a sample dataset here. You may test your code on this sample dataset on your local cluster.

Your code must clean the text itself by removing non-alphanumeric characters.
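
One possible way to do this cleanup is sketched below in Python. It is only an illustration, not the required approach: the function names clean_line and tokenize are hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits, that removed characters should become spaces so that neighbouring words do not merge, and that lowercasing the tokens is acceptable. Adjust these choices to whatever the assignment actually requires.

```python
import re
from typing import List

# Assumption: only ASCII letters, digits, and whitespace are kept.
_NON_ALNUM = re.compile(r"[^A-Za-z0-9\s]+")
_WHITESPACE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    """Replace each run of non-alphanumeric characters with a space,
    then collapse repeated whitespace and trim the ends."""
    without_symbols = _NON_ALNUM.sub(" ", line)
    return _WHITESPACE.sub(" ", without_symbols).strip()


def tokenize(line: str) -> List[str]:
    """Clean a line and split it into lowercase tokens.
    Lowercasing is an assumption, not something the assignment states."""
    return clean_line(line).lower().split()
```

Because each line is cleaned independently, a function like this can be applied per record inside whatever framework the cluster uses, without loading the whole file into memory.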

The path to the large dataset is here

