P1: CS160 Review

Due Monday, June 22, 12:00 PM via Checkin
100 points

1. Objective

The objective of this assignment is to refresh your knowledge of how to solve programming problems involving loops, arrays, methods, and files. You will also learn/review the use of extremely useful methods for string and token processing.

2. Problem

Analysis of social media data is a major topic of research these days both in industry and academia. In this assignment you will perform a simple analysis of tweets. Your task is to read a file containing tweets and find the mostly commonly occurring word in those tweets. Since words like "the", "in", etc, occur often and are not very informative, it is common to ignore these very common words, which are called stop-words.

3. Tasks

Create a program called P1.java.
Add the following methods to the program.
- public String [] readTweets(String fileName)
  - The fileName given as an argument to the method will contain the name of a file that contains a collection of tweets. The file has three columns that are separated by tabs ('\t'). The first column contains the user ID, the second contains the date and time the post was made, and the third is the post itself. Here's an example file. Assume that the file exists and follows the required format.
  - The method should return an array containing only the tweets themselves, i.e. the third column.
  - There will be at least one tweet in the file. Note that you do not know the number of tweets in advance. You can read the file once to find out how many tweets exist and use that number to allocate the return array. Then read the file once again to retrieve and store the tweets. There are other ways to do this as well (using Java API).
- public String [] readStopWords(String fileName)
  - The fileName given as an argument to this method refers to a file that contains a list of stop-words, one word per line without any leading/trailing spaces. Here's an example stop-word list. Assume that the file exists and follows the required format. There will be at least one stop-word in the file.
  - The method should return an array with the stop-words, excluding any trailing or leading whitespace. The stop-words need to be stored in the array in the same order they appear in the file.
  - Note that you do not know the number of stop-words in advance.
- public String mostCommonWord(String [] tweets)
  - This method takes an array of tweets provided to it as an argument.
  - This method returns the most common word in the array of tweets. Your method should be case-insensitive; for example, the word "HeLp" and "help" should count as occurrences of the same word. The return value should be all lower case. The most common word could be one of the stop-words.
  - When the most common word is non-unique (i.e. there are two words or more that qualify as most common), just report one of them. We will not use such test cases.
  - Note that you will need to process each tweet in the array, and each word in every tweet (hint: nested loops). Another hint is to maintain a String array of new words that you process from the tweets and an int array of the same size to store how many times the word appears in the tweets. Each element of the count array (ith value) stores the number of times the corresponding word (ith word) appears in the tweets. When you process a word from a tweet, check if the word already exists in the word array. If so, increment the count by one, else store the new word in the array and initialize the count for that word.
- public String mostCommonWordExcludingStopWords(String [] tweets, String [] stopWords)
  - The first argument to this method is an array containing the tweets, and the second is an array of strings containing the list of stop-words.
  - The method returns the most common word that is not a stop-word in the array of tweets provided as input. Your method should be case-insensitive, and the return value should be all lower case.
  - When the most common word is non-unique (i.e. there are two words or more that qualify as most common), just report one of them. We will not use such test cases.
Caution: None of the above methods should be declared static. In fact, compilation of your program will fail if the method signatures in your program are different than above.
Feel free to use private methods that are called by the above public methods.

4. Detailed instructions and tips

Here are some steps to get you started on file processing, tweet processing, and testing.

4.1 File processing

To read and process a file use the following pattern of using Scanner:

    lineScanner = new Scanner(file);  // file is an instance of class File
    while (lineScanner.hasNextLine()) {
        String line = lineScanner.nextLine();
        // process the line
    }

In a stop-words file, each line contains a stop-word. In a tweets file, each line contains three parts. You will need to extract the third part (i.e., the actual tweet) by using the fact that tabs are used to separate the three parts. You can use a StringTokenizer or the split method defined in the class String.

Using a StringTokenizer: Suppose lineBeingProcessed is a String.

    import java.util.StringTokenizer; // must be at the beginning of your class
//...
//...
    StringTokenizer st = new StringTokenizer(lineBeingProcessed,"\t");
    while(st.hasMoreTokens()){
        String element = st.nextToken();
        // do something with that token
        // call nextToken two more times. The third call returns the tweet.
    }

Using the split method in String: Suppose lineBeingProcessed is a String.

    String[] parts = new String[3];
    parts = lineBeingProcessed.split("\t");
    // parts[2] contains the tweet

4.2 Processing tweets

Your methods need to ignore punctuation symbols and spaces. Assume that the possible punctuation symbols are contained in this set: {".", "-", ",", "!", "?", "*"}. Tweets will often contain the hashtag symbol ("#") and the "@" symbol. Do not ignore them. Leave those as part of the word to which they are attached, since they are not used as regular words. To parse a tweet contained in a variable t ignoring punctuation, use the Scanner in the following way:

    Scanner s = new Scanner(t).useDelimiter("[ *-,!?.]+");
    while(s.hasNext()) {
        String word = s.next();
        // process the word
    }

The argument of useDelimiter requires some explanation. It is what's called in Computer Science a regular expression: We are using as a delimiter one or more occurrences of the characters within the square brackets (that's what the + is doing).

Your solution should be case-insensitive, "tail" and "Tail" both count as occurrences of the same word. This is easy to achieve using a String's toLowerCase() method.

4.3 Preliminary and final testing

To help you develop your program, here is a tweets file, and a stop-words file. Preliminary testing using the auto-grading system will be performed using these files. You get results of preliminary testing almost instantaneously using the oneline submission system). For final grading of your program we will use different files (both filename and content), so do not hardcode any parameters into your program. We encourage you to construct your own tweet files that test various scenarios (e.g. that you successfully ignore case).

Keep in mind that we will not test your main method. The methods you implement will be tested directly and separately. However, you should write your main method to test the methods that you write. A barebones main can include something like:

    public static void main(String[] args) {
        P1 p1 = new P1();
        String [] tweets = p1.readTweets("tweets.txt");
        String [] stopWords = p1.readStopWords("stopWords.txt");
        System.out.println(p1.mostCommonWord(tweets));
        System.out.println(p1.mostCommonWordExcludingStopWords(tweets, stopWords));
    }

You should also need to manually determine the most common word in order to check the output of your program.

During preliminary testing, your score equals the number of test cases passed. H owever, during final testing, certain test cases may be weighted more than other s. More difficult methods will be worth more points.

5. Submission

Submit the file P1.java via the online checkin system. This system performs preliminary testing of your program on the same data files to the ones provided above. Final grading will be performed on a different set of files (both filenames and contents may differ).

6. Acknowledgements

The twitter dataset we provided to you is a subset of a much larger dataset which is available here.