Lab 17 - Literature Statistics Digital Humanities

Introduction

This assignment explores a field called Digital Humanities, in which the liberal arts and computer science merge and computer science takes the exploration of the humanities to the next level. For example, machine learning is used to help rebuild frescoes when the damage is too great, or to survey archaeological sites to figure out the best spots to dig. Within literature, it is applied to writing style, to determine whether a text provokes the intended response. Here at CSU, a researcher is looking at how to word emergency messages to firefighters in forest fire situations so that they get the information they need, at the right time, with minimal time spent. He is using machine learning to analyze the different messages. The link above goes into more examples.

For your assignment, you will develop a ‘fingerprint’ for different writers by analyzing how they use words in their writing. While this assignment covers only the first stage of such an analysis, it is still a critical stage.

What You Will Learn

Switch Statements
Using asserts in tests
Basic array access
Using multiple files
Using class level variables

Step 0 - Getting Started

For the first step, it is important to look through all the files. As a reminder, you can see the various files by clicking the down arrow next to the file name in zyBooks. You will notice that three of the files are read-only. They are support files for the project, but they are essential. They are as follows.

Main.java

This file is the main driver for the program. If you look through it, you will notice that it processes the files given to the program on the command line. What does that mean?

It means that in zybooks, you will see:

Run command
java Main Additional Arguments

Whatever you place in the Additional Arguments slot gets handed to the program as part of the String[] args parameter. In this case, we are going to type in the names of the files for it to read. The main program then reads the files and asks the class you are creating, LitStats.java, to collect data about those files.
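To make the String[] args hand-off concrete, here is a minimal sketch. The class ArgsSketch and its method describeArgs are hypothetical names used only for illustration; they are not part of the provided Main.java, which does considerably more. The sketch only shows that each space-separated item typed into Additional Arguments arrives as one element of the array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the real Main.java is provided for you.
class ArgsSketch {

    // Builds one message per command-line argument, in the order typed.
    static List<String> describeArgs(String[] args) {
        List<String> messages = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            messages.add("args[" + i + "] = " + args[i]);
        }
        return messages;
    }

    public static void main(String[] args) {
        // "java ArgsSketch raven.txt hp.txt" would print two lines,
        // one for each file name.
        for (String message : describeArgs(args)) {
            System.out.println(message);
        }
    }
}
```

If no arguments are typed, args is an empty array (not null), so the loop simply does nothing.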

What files are available?

The following files are available for you to work with.

sonnet116.txt - Shakespeare’s Sonnet 116
sonnet134.txt - Shakespeare’s Sonnet 135
gatsby.txt - the entire Great Gatsby
raven.txt - Poe’s Raven (Nevermore)
stranger.txt - the entire Stranger in a Strange Land
hp.txt - the first chapter of Harry Potter and the Sorcerer’s Stone, The Boy Who Lived
if.txt - the poem If

If you wish to analyze the Raven, you would modify the additional arguments as follows:

Run command
java Main raven.txt

If you wanted to look at Raven and Harry Potter it would be:

Run command
java Main raven.txt hp.txt

In the zybooks terminal you can also type

run raven.txt hp.txt

WordDictionary.java

This class is read-only, as you do not have to modify anything in the file. It reads in a file called dictionary.txt that is based on the alt12dicts - the same dictionaries used in many auto spell-checkers such as the one in Word. WordDictionary takes the first word seen and associates it with its part of speech, which then allows easy lookup of that word. The important sections of this file for you to know are the constants at the top of the file.

public static final String ADJECTIVE = "A"; // adjective
public static final String CONJUNCTION = "C"; // conjunction
public static final String PRONOUN = "P"; // pronoun
public static final String VERB = "V"; // verb
public static final String NOUN = "N"; // noun
public static final String INTERJECTION = "I";
public static final String SPOKEN = "S"; // examples: kinda, gonna
public static final String UNKNOWN = "xxUNKNOWNNxx";

You will end up accessing these constants by typing WordDictionary. followed by the constant name, for example:

case WordDictionary.ADJECTIVE:
  //do something 
  break;
case WordDictionary.VERB:

The other important part of WordDictionary is the method getWordType. A copy of it is here.

/**
 * Returns the type of the word based on the word passed it.
 * @param word word to check
 * @return Returns the part of speech. It will either match one of the ones listed in 
 *         the constant variables or UNKNOWN
 */
public String getWordType(String word) {
    if(dict.containsKey(word)) {
        return dict.get(word);
    }
    return UNKNOWN;
}

When called, this returns the part of speech of the passed-in word (which is expected to be lowercase). The part of speech will match one of the constants above. To call this method, you will use code similar to the following (maybe not exactly).

String wordType = dictionary.getWordType(word);

The dictionary is the instance of the WordDictionary object that is created in LitStats.java.
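The lookup can be tried out in isolation with a small stand-in. DictionarySketch below is a hypothetical class, not the provided WordDictionary: it copies the shape of getWordType but uses two hand-picked entries instead of loading dictionary.txt. It is only meant to show the fall-back to UNKNOWN and why the word should already be lowercase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for WordDictionary; the real class loads dictionary.txt.
class DictionarySketch {
    static final String NOUN = "N";
    static final String VERB = "V";
    static final String UNKNOWN = "xxUNKNOWNNxx";

    private final Map<String, String> dict = new HashMap<>();

    DictionarySketch() {
        // Two entries invented for this example only.
        dict.put("raven", NOUN);
        dict.put("read", VERB);
    }

    // Same shape as getWordType: look the word up, fall back to UNKNOWN.
    String getWordType(String word) {
        if (dict.containsKey(word)) {
            return dict.get(word);
        }
        return UNKNOWN;
    }

    public static void main(String[] args) {
        DictionarySketch dictionary = new DictionarySketch();
        System.out.println(dictionary.getWordType("raven")); // N
        System.out.println(dictionary.getWordType("Raven")); // xxUNKNOWNNxx
    }
}
```

The map keys are compared exactly, so in this sketch "Raven" with a capital R is not found even though "raven" is.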

FileReader.java

This is a file worth looking at later. It reads in the contents of a text file using the Scanner and File objects in Java. The only class that interacts with FileReader.java is the Main class, but it is important to understand how it works.

The hasNext() method checks to see if the file has more lines, and returns true or false depending on that (it also checks to make sure the file is loaded).

The getNext() method grabs the next line in the file, and then breaks that line into an array of Strings (String[]), split on spaces. This is critical, as in a future assignment you may be writing something similar, but splitting on commas.
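Here is a rough sketch of the splitting idea. It assumes the split is done with String.split on a single space; the provided FileReader.java may do it differently, so treat SplitSketch and its method names as hypothetical.

```java
// Hypothetical demo class; check FileReader.java for how the real split is done.
class SplitSketch {

    // Breaks a line into pieces wherever a single space appears.
    static String[] bySpaces(String line) {
        return line.split(" ");
    }

    // The same idea, but breaking on commas instead.
    static String[] byCommas(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        String[] words = bySpaces("Once upon a midnight dreary");
        System.out.println(words.length); // 5
        System.out.println(words[3]);     // midnight
    }
}
```

One thing to watch with this approach: two spaces in a row produce an empty String in the middle of the array, which is one more reason the code that processes the words should be prepared for empty strings.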

In the Main.java file, you will see it loading the files and then grabbing each line. It will then pass that array to your code to analyze.

LitStatsTests

This file contains the tests you will use as you write the methods associated with them. You will notice that a fair number of the tests are commented out. When the time comes, you will not only uncomment them, but you will also add your own asserts! It is worth looking through.

LitStats

This is the main file you will be editing. You will want to go to it now, and put your name and email in the header comments.

Step 1 - Adding Counters and Resetting Counters

For every work of literature you are analyzing, you will want to count the total number of lines, the total (valid) words seen, and the total times each part of speech appears. That is a lot of counters, and sometimes, you will want to reset the count.

Class Level Variables

To start, you will want to add the variables. The names need to exactly match the following, as we will be accessing the variables directly in grading. Their scope cannot be private (we just left off any scope - so it would use the default scope).

The variables need to be near the top of the class, right below the comment that says

// the following variables help you keep track of the stats in the file/input

nounCounter
verbCounter
interjectionCounter
conjunctionCounter
adjectiveCounter
pronounCounter
spokenCounter
unknownCounter
wordCount
lineCounter

Every one of the above variables will need to be declared as an int and initialized to zero.

Something to think about: should they be static? Does it make sense for every book in the world to share the exact same number of nouns? Not really. As such, it doesn’t make sense to make them static, and since nearly every method in this class accesses those variables, most of the methods will also be non-static. Reminder: non-static means leaving off the static keyword when declaring.

Writing resetStats()

You will then want to go to the comment that talks about resetting the counters to zero, and write a public void method called resetStats with no parameters.

This method will access all the counters you just wrote, and set them to zero. Yes, it really is just writing every variable name, and setting them to 0, but sometimes you need to reset the stats between different books. It is also used to teach about how to access class level variables in a class.
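The static question and the reset pattern can be seen together in a cut-down sketch. CounterSketch is a hypothetical class with only two of the counters; your LitStats needs all ten, in the location described above.

```java
// Hypothetical cut-down class showing only two of the ten counters.
class CounterSketch {
    // Default scope (no keyword) and non-static: every object gets its own copy.
    int nounCounter = 0;
    int verbCounter = 0;

    // Sets every counter in this object back to zero.
    public void resetStats() {
        nounCounter = 0;
        verbCounter = 0;
    }

    public static void main(String[] args) {
        CounterSketch raven = new CounterSketch();
        CounterSketch gatsby = new CounterSketch();
        raven.nounCounter = 12;
        gatsby.nounCounter = 40;

        raven.resetStats();
        System.out.println(raven.nounCounter);  // 0
        System.out.println(gatsby.nounCounter); // 40
    }
}
```

Because the counters are not static, resetting one object leaves the other untouched. Had they been static, both names would refer to the same single counter.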

Testing resetStats()

Go to LitStatsTest.java and find the testResetStats() method. You will see a number of lines commented out in that method. Uncomment them. If you run your program now, it will fail if your reset method is wrong. If your method is correct, nothing much will happen, but the program won’t crash.

YOU should add your own tests here with an assert, as we are not testing every variable!

Step 2 - cleanWord(String word)

It is very common to trim words in large bodies of text (think web pages), and there is a lot of research into how to handle “stemming”. For example, is boats the same as boat? In this case we are going to take the simplest approach, and only remove spaces, special characters, and numbers, while forcing the word to be lowercase. This may increase our unknown word count considerably (as apples won’t be found), but that is fine for this assignment.

For cleanWord, you cannot assume anything about the String being passed in, other than it is a String - including the fact that it may be an empty string.

You have already written a similar method in the Run Length Encoding labs as part of your warm-ups.

Writing cleanWord

The method cleanWord takes in a String as the parameter and returns the “cleaned” word. You will want to make the method a public method that returns a String, as we use it in our test programs.

If you think back to the Run Length Encoding labs, there were a number of methods with a very similar format.

In this method, you will loop through the String that is passed in, looking at every character. If the character is a letter (hint Character.isLetter(word.charAt(i))), you will keep it. If it is anything else, you will ignore it.

Once you have all the letters in the word, you can call toLowerCase() on the String, and return the clean word. (hint: return rtn.toLowerCase())
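One possible shape for the loop is sketched below. It is a sketch under a few assumptions, not the required solution: it uses a StringBuilder to collect the letters (building up a plain String works too), and it is written as a static method in a hypothetical class CleanWordSketch so it can be run on its own, whereas your version belongs in LitStats.

```java
// Hypothetical demo class; your cleanWord belongs in LitStats.
class CleanWordSketch {

    // Keeps only the letters of the word, then lowercases the result.
    public static String cleanWord(String word) {
        StringBuilder rtn = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.isLetter(c)) {
                rtn.append(c); // digits, spaces, and punctuation are skipped
            }
        }
        return rtn.toString().toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(cleanWord("Nevermore."));       // nevermore
        System.out.println(cleanWord("1234").isEmpty());   // true
    }
}
```

An empty String never enters the loop, and a String of only digits never passes the isLetter check, so both come back as an empty String rather than causing an error.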

Testing cleanWord

Go to LitStatsTest and find the method testCleanWord(). You will want to uncomment the tests. To help with incremental development, it may actually make sense to stub out the cleanWord method first and uncomment the tests, even though you know they will fail.

You can also put in your own asserts to try other cases for words, as we definitely have not tried every common case.

Step 3 - percent(double) and percent(double, double)

For this step, you will write two methods. Both will be called percent, and both will return a String. The method that takes in a single double will call the more specific method that takes in two doubles. Having two methods with the same name but different parameters is method overloading!

The percent method will help with printing out and formatting later, and is part of the process of keeping your code DRY.

Here are some examples of input and output:

percent(10, 20); // should return 50.00%
percent(2, 10); // should return 20.00%
wordCount = 3; // to help with overloading below
percent(10); // should return 333.33%
percent(1); // should return 33.33%

Writing percent(double)

This method will call your more specific method percent(double, double), but what makes sense for the “default” case? Most stats will want the percentage of times a type of word shows up, which means taking the counter divided by the total number of words seen in the piece of literature. As such, it makes sense for your percent to use wordCount as the default second argument.

The code would look as follows:

public String percent(double x) {
    return percent(x, wordCount);
}

Make sure you find the comments that reference it before putting in that line of code.

Writing percent(double, double)

This method will be called percent, and take in two doubles as parameters. You can name them what you want. We named them x and y, but admittedly numerator and denominator may make more sense. You will return a String using String.format.

First, write your method signature. Second, you will want to divide x by y, multiply the result by 100, and store that in a double variable. Why multiply by 100? Let’s say you take 10/10 - that would mean x makes up 100% of y. However, 10/10 is 1, so multiplying it by 100 causes the shift necessary for the percent. The formula would be the following

(x/y) * 100

Make sure to use parentheses to keep the order of operations clear.

Third, and most important, you will use String.format to return a String that meets the following conditions:

will only have two decimal places - hint %.2f
will have the percent sign attached to it - hint %%
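Putting the two hints together, the overloaded pair might look roughly like the sketch below. PercentSketch is a hypothetical class with wordCount fixed at 3 to mirror the examples earlier in this step; in LitStats, wordCount is the counter you declared in Step 1.

```java
// Hypothetical demo class; your percent methods belong in LitStats.
class PercentSketch {
    int wordCount = 3; // fixed here only to mirror the examples above

    // Default case: percent of the total word count.
    public String percent(double x) {
        return percent(x, wordCount);
    }

    // Specific case: x as a percentage of y, two decimal places, with a % sign.
    public String percent(double x, double y) {
        double value = (x / y) * 100;
        // %.2f formats to two decimals; %% prints one literal percent sign.
        return String.format("%.2f%%", value);
    }

    public static void main(String[] args) {
        PercentSketch stats = new PercentSketch();
        System.out.println(stats.percent(10, 20));
        System.out.println(stats.percent(1));
    }
}
```

With the usual English-style number formatting, the two lines printed are 50.00% and 33.33%. Note that String.format follows the default locale of the machine, so a system configured with a decimal comma would show 50,00% instead.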

Testing Percent

Go to LitStatsTests and find the testPercent() method. Uncomment the lines that are commented out, and run the tests. You will also want to add your own asserts, as we test other things beyond the ones in the testPercent() method.

Step 4 - processLine(String[] words)

processLine(String[]) is the heart of LitStats. It goes through a line passed in as an array of Strings, determining if a word is a verb, noun, adverb, etc. If it finds one of these parts of speech, it will increment the counters - so you can calculate the stats of the file.

You can follow these steps, after finding the method stub (it was already created for you).

Increment your lineCounter variable by one. - We do this as the first thing, because if this method is called - a line that isn’t empty has been found.
Build a loop that looks at every item in your array (hint, for or for-each)
Inside of your loop: call cleanWord on the word in the array. The following code will help with that

String word = cleanWord(words[i]); // assuming standard for loop

You will then want to check to see if word.isEmpty() and continue to the top of the loop, if it isEmpty. This saves you from processing an empty string (which can exist if the word was all numbers!)
If the word isn’t empty, you will want to increment your wordCount by one and get your word type using the dictionary object.

String wordType = dictionary.getWordType(word);

With the returned wordType, you will then want to check to see if it matches one of the cases in the WordDictionary file. If it matches one of those, increment the corresponding counter. For example, inside your switch statement it could look like

switch(wordType) {
  case WordDictionary.ADJECTIVE:
     adjectiveCounter++;
     break;
  // case...
}

If the word is not found, the default case, then increment the unknownCounter.
Make sure to write a case for each of the different parts of speech.

There is nothing beyond the loop at the end of the method.
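The steps above can be sketched end to end in a reduced form. ProcessLineSketch is a hypothetical class: it handles only two parts of speech, and it swaps the real WordDictionary for a two-entry map invented for this example. Your processLine needs a case for every constant in WordDictionary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reduced version: two parts of speech and a made-up dictionary.
class ProcessLineSketch {
    static final String NOUN = "N";
    static final String VERB = "V";
    static final String UNKNOWN = "xxUNKNOWNNxx";

    // Stand-in for the WordDictionary object used in LitStats.
    private final Map<String, String> dictionary = new HashMap<>();

    int nounCounter = 0;
    int verbCounter = 0;
    int unknownCounter = 0;
    int wordCount = 0;
    int lineCounter = 0;

    ProcessLineSketch() {
        dictionary.put("raven", NOUN);
        dictionary.put("quoth", VERB);
    }

    // Keeps only letters and lowercases them (see Step 2).
    String cleanWord(String word) {
        StringBuilder rtn = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            if (Character.isLetter(word.charAt(i))) {
                rtn.append(word.charAt(i));
            }
        }
        return rtn.toString().toLowerCase();
    }

    public void processLine(String[] words) {
        lineCounter++; // one call means one line was seen
        for (int i = 0; i < words.length; i++) {
            String word = cleanWord(words[i]);
            if (word.isEmpty()) {
                continue; // nothing left after cleaning, so do not count it
            }
            wordCount++;
            String wordType = dictionary.getOrDefault(word, UNKNOWN);
            switch (wordType) {
                case NOUN:
                    nounCounter++;
                    break;
                case VERB:
                    verbCounter++;
                    break;
                default:
                    unknownCounter++;
                    break;
            }
        }
    }

    public static void main(String[] args) {
        ProcessLineSketch stats = new ProcessLineSketch();
        stats.processLine(new String[] {"Quoth", "the", "Raven", "1845", "\"Nevermore.\""});
        System.out.println(stats.wordCount);      // 4
        System.out.println(stats.unknownCounter); // 2
    }
}
```

In the example line, "1845" cleans down to an empty String and is skipped, so only four words are counted. "the" and "nevermore" are not in the made-up dictionary, so they land in the default case.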

Testing processLine

Go to LitStatsTests.java and find testProcessLine(). Uncomment the lines in that method. You will also notice testing this method involves generating arrays. You should try to build your own fixed length arrays to pass into the method. We also aren’t looking at the number of words or lines - you should probably write an assert that looks for that!

Step 5 - printStats(String title, String author)

printStats will print out the stats in the following format.

Stranger In A Strange Land
by Robert Heinlein
Adjective/Adverb Average: 38.54%
Conjunction/Preposition Average: 8.60%
Interjection Average: 1.21%
Noun Average: 20.58%
Pronoun Average: 2.36%
Spoken Average: 0.00%
Verb Average: 2.50%
Unknown Average: 26.21%
Words per line: 13.18

The first line is the title of the literature passed in, and the “by” line is the author. You will want to follow this format exactly.

Writing printStats

The method is a series of printlns or printfs (your call). For example

System.out.println(title);

is going to be your first line in the method, but your second line may be

System.out.printf("by %s%n", author);

Your third line may be

System.out.printf("Adjective/Adverb Average: %s%n", percent(adjectiveCounter));

Go ahead and work through each line now. For the words per line, it doesn’t make sense to use the percent method. However, you will still want to format it to 2 decimal places. It would be (double)wordCount / lineCounter with a printf statement.
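The cast in the words-per-line calculation matters more than it looks. The sketch below uses a hypothetical class WordsPerLineSketch with the two counters passed in as parameters so it can run on its own. It compares casting before the division with casting after it.

```java
// Hypothetical demo class comparing where the (double) cast is placed.
class WordsPerLineSketch {

    // Cast first: the division is done with doubles, so the fraction survives.
    static String castFirst(int wordCount, int lineCounter) {
        return String.format("%.2f", (double) wordCount / lineCounter);
    }

    // Cast last: the int division throws away the fraction before the cast.
    static String castLast(int wordCount, int lineCounter) {
        return String.format("%.2f", (double) (wordCount / lineCounter));
    }

    public static void main(String[] args) {
        System.out.println(castFirst(7, 2));
        System.out.println(castLast(7, 2));
    }
}
```

With English-style number formatting, 7 words over 2 lines comes out as 3.50 with the cast first and 3.00 with the cast last. In printStats the same %.2f would go inside your printf, after the "Words per line: " label.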

Testing printStats

There is not a test in LitStatsTest for printStats as it prints out to the screen. Instead, you will want to simply run the program, passing in various program arguments that are the names of different literature. For example, if you wanted to read the stats for Stranger, you would type the following in the ZyBooks console

run stranger.txt

Look above to see other files you can use. You can even type all of them. We test all the files in our test cases; you should probably also do that before you submit for grading. To prevent any confusion, Spoken Average really is 0.00% in all cases. That is not because the average is invalid; spoken words are just not that common, so we lose the value by only going two decimal places deep.

Keep trying different files, and have fun looking at the comparisons. You will find certain authors have more in common than you realized at first.

Last but not least

Please remember to submit through Canvas! There isn’t a large window on this lab, so you don’t want to accidentally forget to submit to Canvas.