Computer Science Department

Big Data
Assignments and Term Project
CS535
Spring 2020
Home Syllabus Schedule Assignments Grading Policy Course Policy Code of Conduct Canvas
Programming Assignment 1

Title: Hyperlink-Induced Topic Search (HITS) over Wikipedia Articles using Apache Spark

Due: February 18, Tuesday 5:00PM

Submission: via Canvas, Individual submission

Objectives

The goal of this programming assignment is to enable you to gain experience in:
(1) Installing and using analytics tools such as HDFS and Apache Spark
(2) Generating a root set based on link information
(3) Implementing iterative algorithms to estimate the Hub and Authority scores of Wikipedia articles using Wikipedia dump data
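
As a rough illustration of objectives (2) and (3), the following is a minimal single-machine sketch in plain Python, not Spark code and not the assignment's reference solution. The function names base_set and hits, the rule that the root set is every page whose title contains the query string, and the choice to normalize scores so they sum to 1 are all assumptions made for this example; the complete description linked below is the authority on what is actually required.

```python
from collections import defaultdict


def base_set(titles, edges, query):
    """Return the set of page ids in the base set for a query.

    titles: dict mapping page id -> title (hypothetical input format).
    edges:  list of (source_id, target_id) link pairs.
    Assumption: the root set is every page whose title contains the query
    (case-insensitive); the base set adds every page that links to, or is
    linked from, a root page.
    """
    root = {pid for pid, title in titles.items() if query.lower() in title.lower()}
    base = set(root)
    for src, dst in edges:
        if src in root or dst in root:
            base.update((src, dst))
    return base


def hits(edges, iterations=20):
    """Estimate hub and authority scores with a fixed number of iterations.

    Assumption: scores are normalized to sum to 1 after each update.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking in.
        auth = {n: sum(hub[m] for m in in_links[n]) for n in nodes}
        total = sum(auth.values()) or 1.0
        auth = {n: v / total for n, v in auth.items()}
        # Hub update: sum of authority scores of pages linked to.
        hub = {n: sum(auth[m] for m in out_links[n]) for n in nodes}
        total = sum(hub.values()) or 1.0
        hub = {n: v / total for n, v in hub.items()}
    return hub, auth
```

In a Spark version, the two dictionary comprehensions inside the loop would typically become joins between a links dataset and a scores dataset followed by a per-key sum, but that mapping is left to the assignment.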

Complete description (Last Updated - 24th Jan): [Link]

* Dataset
Dataset from the CS servers [Links Sorted] [Titles]
Original dataset source and instruction [Link]

Walkthrough of installing your cluster:

(1) Accessing CS Linux machines Remote Setup Guide [Download]

(2) Installation guide for Apache Hadoop (Last Updated - 24th Jan) [Download]

(3) Installation guide for Apache Spark [Download]

Nodes and port range assignment : [Link]

Helpful Infospace Video Links -

1. Hadoop Environment Setup : [ Link ]
2. Changing configuration files for Hadoop : [ Link ]
3. Setting up a Spark Cluster : [ Link ]
4. Running jobs using spark shell for debugging : [ Link ]
5. Compiling and creating jar using SBT and submitting job on Spark standalone cluster : [ Link ]
 
Programming Assignment 2

Title: Detecting the Most Popular Topics from Live Twitter Message Streams using the Lossy Counting Algorithm with Apache Storm

Due: March 10, Tuesday 5:00PM

Submission: via Canvas, Individual submission 

Objectives

The goal of this programming assignment is to enable you to gain experience in:
1. Implementing approximate on-line algorithms using a real-time streaming data processing framework
2. Understanding and implementing parallelism over a real-time streaming data processing framework
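
To make the first objective concrete, here is a minimal sketch of the Lossy Counting algorithm in plain Python, outside of Storm. The class name LossyCounter, its method names, and the way the error bound epsilon and support threshold are passed in are assumptions made for this example; how the counter is split across bolts and what the topology must emit are defined in the complete description linked below.

```python
import math


class LossyCounter:
    """Approximate frequency counts over a stream with error bound epsilon."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # A new item may have been missed in all earlier buckets.
            self.entries[item] = [1, bucket - 1]
        # At each bucket boundary, prune entries that cannot be frequent.
        if self.n % self.width == 0:
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support):
        """Return items whose count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {
            key: value[0]
            for key, value in self.entries.items()
            if value[0] >= threshold
        }
```

Each hashtag extracted from a message would be passed to add, and frequent would be called whenever the current list of popular topics is needed. The reported counts can undercount the true frequency by at most epsilon times the number of items seen.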

Complete description (Last Updated - 24th Feb 20): [Link]

Helpful Infospace Video Links -

1. Installing Apache Zookeeper and Storm : [ Link ]
2. Launching Storm Daemons under Supervision : [ Link ]
3. Running jobs on IntelliJ and Cluster : [ Link ]


 
Term Project Deliverables

Important links for Project
Writing distributed applications using PyTorch: [ Link ]
Recording Presentations using OBS Instruction guide: [ Link ]
Editing Videos using iMovie: [ Link ]

Deliverable 0: Team information and Topic
Due: January 28, Tuesday 5:00PM via Canvas
Submit the list of your team members and your tentative titles (you can submit up to 3 tentative titles).


Deliverable 1: Term Project Proposal
Due: March 31, Tuesday 5:00PM via Canvas
Description:   [Link]
Presentation: Rapid Fire Presentation, April 6 in class (canceled)

 

Deliverable 2: Term Project Report and Software
Due: May 5, 5:00PM via Canvas
Description: [Link]

Please submit your (1) report and (2) code.

 

Deliverable 3: Final Presentation
Description: [Link]
Presentation: May 6 5:00PM via Canvas  

