Computer Science Department

Big Data:
Assignments and Term Project
Spring 2019
Home Syllabus Schedule Assignments Grading Policy Course Policy Code of Conduct Canvas
Programming Assignment 1

Title: Hyperlink-Induced Topic Search (HITS) over Wikipedia Articles using Apache Spark

Due: February 26, Tuesday 5:00PM

Submission: Via Canvas, Team submission 


The goal of this programming assignment is to enable you to gain experience in:
(1) Installing and using analytics tools such as HDFS and Apache Spark
(2) Generating a root set based on link information
(3) Implementing iterative algorithms to estimate Hub and Authority scores of Wikipedia articles using Wikipedia dump data
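Goals (2) and (3) can be sketched in plain Python before moving to Spark. This is a minimal single-machine illustration, not the required implementation; the complete description is authoritative. Assumptions made only for this sketch: `titles` maps a page id to its title, `links` maps a page id to the list of page ids it links to, the root set is chosen by a case-insensitive substring match on the title, and scores are normalized by their Euclidean norm after each update. The function names `root_set`, `base_set`, and `hits` are hypothetical.

```python
import math


def root_set(titles, topic):
    """Pages whose title contains the topic (assumed matching rule)."""
    topic = topic.lower()
    return {pid for pid, title in titles.items() if topic in title.lower()}


def base_set(root, links):
    """Root set plus every page it links to or is linked from."""
    base = set(root)
    for src, targets in links.items():
        if src in root:
            base.update(targets)          # outgoing links of a root page
        elif any(dst in root for dst in targets):
            base.add(src)                 # page that links into the root set
    return base


def _normalize(scores):
    """Scale scores to unit Euclidean norm (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}


def hits(links, pages, iterations=20):
    """Iteratively estimate hub and authority scores over `pages`."""
    # Keep only edges whose endpoints are both inside the page set.
    edges = [(s, d) for s, targets in links.items() if s in pages
             for d in targets if d in pages]
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for s, d in edges:
            new_auth[d] += hub[s]
        auth = _normalize(new_auth)
        # Hub: sum of authority scores of pages it links to.
        new_hub = {p: 0.0 for p in pages}
        for s, d in edges:
            new_hub[s] += auth[d]
        hub = _normalize(new_hub)
    return hub, auth


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_titles = {1: "Apache_Spark", 2: "Apache_Storm", 3: "Colorado", 4: "Big_data"}
    toy_links = {1: [4], 2: [4], 3: [1], 4: []}
    pages = base_set(root_set(toy_titles, "apache"), toy_links)
    hub, auth = hits(toy_links, pages)
    print(sorted(auth.items(), key=lambda kv: -kv[1]))
```

In a Spark version, the two update loops would typically become joins and aggregations over an edge collection, but that mapping depends on the requirements in the complete description.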

(Complete description)

* Dataset
Dataset from the CS servers [links-simple-sorted.zip][titles-sorted.zip]
Original dataset source and instructions [Link]

Walkthrough of installing your cluster:

(1) Installation guide for Apache Hadoop [Download]

(2) Installation guide for Apache Spark [Download]

Node and port range assignments: Please see the announcement on Canvas.

Programming Assignment 2

Title: Detecting the Most Popular Topics from Live Twitter Message Streams using the Lossy Counting Algorithm with Apache Storm

Due: March 29, Friday 5:00PM (adjusted)

Submission: via Canvas, Team submission 

(Complete Description)
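The Lossy Counting algorithm named in the title of Programming Assignment 2 can be sketched on a single machine as follows. This is a hedged illustration of the textbook algorithm, not the Storm topology the assignment asks for. The class name `LossyCounter`, its method names, and the sample hashtags are hypothetical. `epsilon` is the error bound and `support` is the query threshold, both as fractions of the stream length.

```python
import math


class LossyCounter:
    """Approximate frequency counts over a stream with error at most epsilon * n."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # bucket width
        self.n = 0                            # items seen so far
        self.entries = {}                     # item -> [count, max undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)   # current bucket id
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # A new item may have been missed in up to bucket - 1 earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def frequent(self, support):
        """Items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


if __name__ == "__main__":
    counter = LossyCounter(epsilon=0.2)
    # Invented sample stream of hashtags.
    for tag in ["#spark", "#a", "#spark", "#b", "#spark",
                "#spark", "#c", "#spark", "#d", "#spark"]:
        counter.add(tag)
    print(counter.frequent(support=0.5))
```

A smaller `epsilon` gives tighter counts at the cost of keeping more entries in memory between bucket boundaries.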

Term Project Deliverables

Deliverable 0: Team information and Topic
Due: February 1, Friday 5:00PM via Canvas
Submit the list of your team members and your tentative titles (you can submit up to 3 tentative titles).

Deliverable 1: Term Project Proposal
Due: March 12, Tuesday 5:00PM via Canvas
Description:   [Link]
Presentation: Rapid Fire Presentation March 13 in class
Schedule: TBA


Deliverable 2: Term Project Report and Software
Due: May 3, 5:00PM via Canvas

Please submit your (1) report and (2) code.


Deliverable 3: Final Presentation
Presentation: May 6 and May 8 in class [Link]
