CS 163/4: Java Programming (CS 1)
Computer Science

Practical Three - STEM University Demographics

Introduction

In machine learning and data science, data wrangling is essential. What is data wrangling? It is taking data and putting it into a format that is easy to work with, structuring it so the problem becomes easier to manage. Machine learning is about using data to make predictions and react to them, so you can see why wrangling that data into a manageable format is critical.

You will learn about:

  • More String.format
  • Loops - for and for-each
  • Objects (makes heavy use of objects)
  • Arrays

Scenario

You are working for Institutional Research, a group that analyzes data across the university to help departments make funding decisions and plan strategic initiatives. In particular, you have been provided with enrollment data: a list of every student officially majoring in an Engineering or Natural Sciences major at the university, along with each student’s self-identified gender.

The file provided in zyBooks (click download) contains the following columns, with the data separated by commas (called a CSV, or comma-separated values, file).

  • PRIMARY_COLLEGE - The college code (NS, EG)
  • PRIMARY_COLLEGE_DESC - The name of the college (unused for this assignment)
  • PRIMARY_DEPARTMENT_DESC - The name of the department in the college (unused)
  • PRIMARY_MAJOR_DESC - The primary major of the student
  • TERM - The term in which the data was collected
  • GENDER - The self-identified gender of the student
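As a minimal sketch of how one row of such a file can be broken into its columns with String.split: the class name, the index constants, and the sample row below are made up for illustration, and the indices assume the columns appear in the order listed above and that no field contains a comma.

```java
class CsvLineDemo {
    // Column positions, assuming the order listed above (hypothetical constant names).
    static final int COLLEGE = 0;
    static final int MAJOR = 3;
    static final int TERM = 4;
    static final int GENDER = 5;

    // Splits one line on commas; assumes no field contains a comma itself.
    static String[] splitLine(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // Made-up row for illustration only; real rows in the file may look different.
        String line = "NS,Natural Sciences,Computer Science,Computer Science,202110,Female";
        String[] columns = splitLine(line);
        System.out.println(columns[COLLEGE] + " / " + columns[MAJOR] + " / "
                + columns[TERM] + " / " + columns[GENDER]);
    }
}
```

Note that String.split drops trailing empty fields, so a row that ends in a comma yields fewer elements than there are columns; checking the array length before indexing is a cheap safeguard.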

There are 52,304 lines in the file! It is essential to build a program to help you calculate the percentage of Male vs. Female vs. Non-binary identifying students, and then present that information in an easy-to-read manner.

Requirements

  • The program will take in two inputs via System.in
    • The name of the file to load. Defaults to STEM_Diversity_Data.csv if no file is given
    • The identifier of the term. Defaults to 202110 (Spring 2021)
      • For reference, university terms are written as the four-digit year, then the starting month, then a trailing 0, so 202090 is Fall 2020, 202110 is Spring 2021, and 201960 is Summer 2019. You can look through the file to see the various terms that are included.
    • The prompts for the inputs are
      • Enter a file to load (STEM_Diversity_Data.csv):
      • Term (202110):
  • The program will output a table for both Natural Sciences and Engineering to System.out.
    • There is no need to sort the entries within the College (challenge bonus: sort them!)
    • The program can immediately exit after the table is printed
    • If an invalid file or term is given, the program can just exit with an error message.
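One way to handle the two prompts and their defaults is sketched below. This is only an illustration: the class and method names are hypothetical and not part of the required javadoc, and the final print line is there for demonstration, not as part of the expected output.

```java
import java.util.Scanner;

class InputDefaults {
    // Prints the prompt, reads one line, and falls back to the default when the line is blank.
    static String ask(Scanner in, String prompt, String defaultValue) {
        System.out.print(prompt);
        String line = in.hasNextLine() ? in.nextLine().trim() : "";
        return line.isEmpty() ? defaultValue : line;
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String file = ask(in, "Enter a file to load (STEM_Diversity_Data.csv): ",
                "STEM_Diversity_Data.csv");
        String term = ask(in, "Term (202110): ", "202110");
        // Demonstration only; the real program would load the file here instead.
        System.out.println("Loading " + file + " for term " + term);
    }
}
```

Passing the Scanner in as a parameter, rather than creating it inside the method, makes the method easy to test: a test can hand it a Scanner built from a String instead of System.in.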

Here are some sample input and outputs of the program running:

 Enter a file to load (STEM_Diversity_Data.csv): 
 Term (202110): 
 Natural Sciences:        Major     Male       Female
                     Psychology     23.70%     76.30%
                        Zoology     17.61%     82.39%
                   Data Science     70.27%     29.73%
             Biological Science     27.71%     72.29%
               Computer Science     83.61%     16.39%
                     Statistics     65.08%     34.92%
                      Chemistry     44.53%     55.47%
                   Biochemistry     40.07%     59.93%
                    Mathematics     54.67%     45.33%
   Applied Computing Technology     85.07%     14.93%
                        Physics     80.56%     19.44%
               Natural Sciences     41.67%     58.33%

 Engineering:             Major     Male       Female
 Biomedical Engineering with ME     53.08%     46.92%
 Biomedical Engineering with EE     66.67%     33.33%
         Mechanical Engineering     87.55%     12.45%
 Chemical & Biological Engineer     63.49%     36.51%
         Electrical Engineering     88.89%     11.11%
           Computer Engineering     91.53%      8.47%
              Civil Engineering     75.98%     24.02%
      Environmental Engineering     53.26%     46.74%
 Biomedical Engineering with CB     42.47%     57.53%
     Engrg Sci and Intl Studies     42.86%     57.14%
            Engineering Science     71.43%     28.57%
 Biomedical Engineering with EL     66.67%     33.33%
        Engineering Open Option    100.00%      0.00%

In the example above, the person just hit return at each prompt without typing anything else. In the one below, the client types in a file name and a different term.

 Enter a file to load (STEM_Diversity_Data.csv): STEM_Diversity_Data.csv
 Term (202110): 201990
 Natural Sciences:        Major     Male       Female
                     Psychology     24.49%     75.51%
                        Zoology     19.77%     80.23%
             Biological Science     29.86%     70.14%
               Computer Science     86.75%     13.25%
                     Statistics     62.07%     37.93%
                   Biochemistry     42.36%     57.64%
                      Chemistry     47.88%     52.12%
                    Mathematics     58.26%     41.74%
   Applied Computing Technology     81.51%     18.49%
                        Physics     81.63%     18.37%
                   Data Science     75.56%     24.44%
               Natural Sciences     50.00%     50.00%

 Engineering:             Major     Male       Female
         Electrical Engineering     88.72%     11.28%
 Biomedical Engineering with EE     61.11%     38.89%
         Mechanical Engineering     86.78%     13.22%
 Biomedical Engineering with ME     55.16%     44.84%
              Civil Engineering     74.89%     25.11%
 Chemical & Biological Engineer     64.41%     35.59%
           Computer Engineering     92.41%      7.59%
        Engineering Open Option     80.30%     19.70%
 Biomedical Engineering with CB     43.13%     56.88%
      Environmental Engineering     48.60%     51.40%
            Engineering Science     78.38%     21.62%
     Engrg Sci and Intl Studies     28.57%     71.43%
 Biomedical Engineering with EL    100.00%      0.00% 

The order of your answers may vary, but the percentages should not. Our tests will randomly take lines out to test them individually.
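A table row like the ones in the samples can be produced with String.format. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, and the column widths (31 for the major, 10 for each number) were chosen to resemble the sample output, so check the javadoc and the tests for the exact format required.

```java
import java.util.Locale;

class RowFormatDemo {
    // Right-aligns the major in 31 columns and each percentage in 10 columns,
    // each followed by a literal percent sign (written as %% in the format string).
    // The widths are assumptions picked to resemble the sample output.
    static String formatRow(String major, double malePercent, double femalePercent) {
        // Locale.US keeps the decimal separator a period regardless of system settings.
        return String.format(Locale.US, "%31s%10.2f%%%10.2f%%",
                major, malePercent, femalePercent);
    }

    public static void main(String[] args) {
        System.out.println(formatRow("Psychology", 23.70, 76.30));
        System.out.println(formatRow("Engineering Open Option", 100.0, 0.0));
    }
}
```

The .2 in %10.2f rounds each value to two decimal places, and the width in front of it pads on the left, which is what keeps values such as 100.00% and 0.00% lined up in the same column.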

Why only Male vs. Female?
Wait, didn’t we mention that Non-Binary was an option? It is an option at the university, after many years of hard work from faculty (some from Computer Science) to get additional ways to identify listed on the admissions paperwork. However, that does not mean it is selected frequently, and many students don’t know they can change it in Ramweb (Pride Resource Center Howto). As such, the demographic data tends to match what we get from high schools and other information the university receives. To complicate matters further, this option only became available a couple of years ago, and there are legal ramifications for how it can be updated. We left it out of the table because, with rounding, it came up as 0% for every major (it actually isn’t). It is also an important reminder that behind all the data there are people, and people are more than a statistic. Good data analysis includes learning about the narrative behind the data and validating the data collected.

Coding Specifications

For grading purposes, we are asking you to follow this format. Furthermore, the format emphasizes the design of the program, focusing on storing the data in objects and then printing by grabbing the data from the objects. This is a very standard approach. Also, notice that while you created 3 files from scratch in Practical 2, in this practical you will create 5.

A full breakdown of the required methods can be found in the javadoc. Make sure to map out how the files interact, and work with a TA early.

Five classes will be made and submitted:

  • Main.java - the main driver of the program, but does a bit more work than past mains.
  • CSVReader - Helps you read a Comma Separated Value file by using a Scanner to grab each line, and String.split to break up the lines into String arrays.
  • Data - Uses CSVReader to read the data, and keeps track of the CollegeDemographics as an array of two elements. Prints out table when requested.
  • CollegeDemographics - Keeps track of the various MajorDemographics (majors) in a single College. Builds a String value of the table to print out unique to each college.
  • MajorDemographics - Keeps the total gender counts for a major, so percents can be easily calculated when asked.
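To illustrate the idea of keeping counts in an object so that percentages can be calculated on request, here is a minimal sketch. It is not the required MajorDemographics API: the class name and method names are hypothetical, and it assumes each percentage is taken over the male plus female total, which is consistent with the sample rows summing to 100% but is not confirmed here. Follow the javadoc for the real method signatures.

```java
class GenderCounts {
    private final String major;
    private int male;
    private int female;

    GenderCounts(String major) {
        this.major = major;
    }

    String getMajor() { return major; }

    void addMale() { male++; }

    void addFemale() { female++; }

    // Percentage of male students out of male + female (an assumption about the denominator).
    double malePercent() {
        int total = male + female;
        // 100.0 forces floating-point math; 100 * male / total would truncate to an int.
        return total == 0 ? 0.0 : 100.0 * male / total;
    }

    double femalePercent() {
        int total = male + female;
        return total == 0 ? 0.0 : 100.0 * female / total;
    }

    public static void main(String[] args) {
        GenderCounts counts = new GenderCounts("Physics");
        counts.addMale();
        counts.addMale();
        counts.addMale();
        counts.addFemale();
        System.out.println(counts.getMajor() + ": " + counts.malePercent()
                + " / " + counts.femalePercent());
    }
}
```

Storing raw counts rather than percentages means a new row can be added at any time without recomputing anything; the division happens only when a percentage is asked for.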

As in Practical 2, you will also be asked to write tests in separate main methods for your classes. It is always important to test thoroughly before submitting.

Where to Start?

Here are some hints on how to start:

  1. Write down on paper what you want to do!
    • Do you understand the specifications? If not, ask on MS Teams!
    • Draw out the program flow. Can you picture how the program will work?
      The picture doesn’t have to be clear, but seeing a picture can help you see how classes interact.
    • Sometimes it helps to develop empty classes, with comments on what to do
      in your own words in each class (you can even turn them in, to get your “does it compile” point)
  2. Look at the problems, divide and conquer.
    • What do you know how to do?
    • What is ‘self-contained’, meaning you can write it without dependence on the other classes?
      • MajorDemographics - does not rely on the other classes to work, so could be a starting place. You can write, and test to make sure it works before moving on.
      • CSVReader - does not rely on any of the other classes to work. You could start with reading a CSV file (write your own, make it only 3 lines!) and printing out the contents of that file to the screen. You now know how to read a file, which is a major component of this assignment.
      • With those two classes done, you can then build from there!
  3. A quick reminder: in most IDEs (IntelliJ and Eclipse), relative file paths are resolved from the project root directory (not src), so you should place your CSV file in the project root.
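The "write your own three-line CSV" warm-up suggested above could look something like the sketch below. It reads from a String so it runs without any file on disk; the class name, method name, and sample rows are made up for illustration, and it assumes the first line is a header and that TERM is the fifth column (index 4).

```java
import java.util.Scanner;

class TinyCsvDemo {
    // Counts the data rows whose TERM column (index 4, an assumption) matches the given term.
    static int countRowsForTerm(Scanner in, String term) {
        int count = 0;
        if (in.hasNextLine()) {
            in.nextLine(); // skip the header line, assuming the file starts with one
        }
        while (in.hasNextLine()) {
            String[] columns = in.nextLine().split(",");
            if (columns.length > 4 && columns[4].equals(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only.
        String tiny = "PRIMARY_COLLEGE,PRIMARY_COLLEGE_DESC,PRIMARY_DEPARTMENT_DESC,"
                + "PRIMARY_MAJOR_DESC,TERM,GENDER\n"
                + "NS,Natural Sciences,Physics,Physics,202110,Male\n"
                + "EG,Engineering,Civil Engineering,Civil Engineering,201990,Female\n";
        // For a real file, new Scanner(new java.io.File(fileName)) works the same way,
        // but it can throw FileNotFoundException, which must be caught or declared.
        Scanner in = new Scanner(tiny);
        System.out.println(countRowsForTerm(in, "202110"));
    }
}
```

Once this works on a tiny input, pointing the same loop at the full file is a small change, which is the point of starting with the self-contained pieces.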

The biggest problem when starting with code is getting lost in a “where to start mental loop”. Take your time to think it through, but don’t take too long. You should start writing code, even if the code is simply to help you figure out how to do something (read a file for example, or get client input). That type of “warm up” will then get your brain working on how to work on the entire problem. Basically, don’t stare at the problem and do nothing. It doesn’t hurt to try, fail, try again.

Going Further
This type of program is very common when working with Artificial Intelligence / Machine Learning or Data Science. While you aren’t actually implementing any of the cool Machine Learning algorithms those fields use (take CS 345, CS 445, or CS 440 for that!), you are working on an essential step: getting the data into a usable format. Data often has to be cleaned and organized before it can be analyzed. If you are interested in Artificial Intelligence, look at the computer science concentration that focuses on it, as it is one of the few concentrations that modifies your math requirements (Neural Network Backpropagation is an example of calculus used in CS), while Data Science majors spend even more time on the math side of the algorithms. Furthermore, those who minor in computer science have access to the ML and AI classes; CS 345 in particular is a great course for all minors and a great supplement to any major.
