Computer Science Department

Introduction to Big Data: assignments CS 435
Spring 2018
| Home | Syllabus | Schedule | Assignments | Grading Policy | Course Policy | Code of Conduct | Canvas |

Programming Assignments
Programming Assignment 0: Feb. 6 By 5:00PM via Canvas
Programming Assignment 1: Feb. 21 By 5:00PM via Canvas
Programming Assignment 2: Mar. 22 By 5:00PM via Canvas
Programming Assignment 3: Apr. 18 By 5:00PM via Canvas

Term Project Phases
Term Project Phase 0: January 30 By 5:00PM via Canvas (cs435@cs.colostate.edu)
Term Project Phase 1: March 29 By 5:00PM via Canvas
Term Project Phase 2 (Software/Report): April 26 By 5:00PM via Canvas
Term Project Phase 3: No submission required. Schedule your demo and presentation

CS435 Piazza page [Link]



Programming Assignment 0:
Creating your own Hadoop cluster and a word count example

Due: Feb. 6 2018 By 5:00 PM

Submission: via Canvas, Individual submission

Objectives:
The goal of this programming assignment is to enable you to gain experience in
--Installing and configuring Hadoop
--Gaining familiarity with the basic features of the Hadoop Distributed File System (HDFS)
--Running a simple MapReduce example
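
As a rough illustration of the word count flow (this is not the course's provided code, and it uses no Hadoop API), the plain-Python sketch below imitates the map, shuffle, and reduce steps in a single process. All function names are hypothetical, and splitting on whitespace with lowercasing is an assumption about tokenization.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

On a real cluster the mapper and reducer run on different nodes and the shuffle is handled by Hadoop; only the division of work between the two functions carries over from this sketch.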

Full description: [Link]

Wiki page for this homework: [Link]

Hadoop Installation Guide: [Link]

Map-Reduce Code for Word Count: [Link]

Node and Port Assignment: [Link]

Hadoop Configuration Jar: [Link]

Hadoop installation/configuration in CSB130 (Recitation 1): [Video] [Slide]

Hadoop installation/configuration in CSB130 (Recitation 2): [Video] [Slide]

Hadoop installation/configuration in CSB130 (Recitation 3): [Video] [Slide]

Test file for DEMO: [Link]

For technical questions, please contact the GTA.

 

Programming Assignment 1: Creating N-gram Profile for a Wikipedia Corpus

Due: February 21, 2018 By 5:00 PM

Full description: [Link]
Submission: via Canvas, Individual submission

Complete Dataset: [Link]

Sample File: [Link]

Introduction to Programming Assignment 1 (Recitation 4): [Video] [Slide]

Discussion on Programming Assignment 1 (Recitation 5): [Video] [Slide]
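
For orientation only, an N-gram profile can be thought of as a frequency table of every run of N consecutive tokens. The sketch below assumes whitespace tokenization and lowercasing; the exact profile format required by the assignment is defined in the full description, not here, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_profile(text, n):
    """Count how often each n-gram occurs in the text (assumed tokenization: split on whitespace)."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    print(ngram_profile("to be or not to be", 2).most_common(3))
```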

Programming Assignment 2: Document Summarization using TF/IDF Scores

Due: March 22, 2018 By 5:00 PM

Full description: [Link]
Submission: via Canvas, Individual submission

Introduction to Programming Assignment 2 (Recitation 6): [Video] [Slide]

PA2 Demo File 1 [Link]

PA2 Demo File 2 [Link]
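
As a minimal sketch of the idea behind TF/IDF scoring, the code below uses one common textbook variant: raw term count multiplied by log10(N / document frequency). This choice of formula is an assumption; the assignment's full description may specify different TF or IDF definitions, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a document id to its list of tokens; returns {doc_id: {term: score}}."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term at least once.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        scores[doc_id] = {
            term: count * math.log10(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return scores


if __name__ == "__main__":
    corpus = {"d1": ["spark", "hadoop", "spark"], "d2": ["hadoop", "hdfs"]}
    print(tf_idf(corpus))
```

A term that appears in every document gets a score of zero under this variant, which is why such terms contribute nothing when ranking sentences for a summary.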



Programming Assignment 3: Estimating PageRank Values of Wikipedia Articles using Apache Spark

Due: April 18, 2018 By 5:00 PM

Full description: [Link]
Submission: via Canvas, Individual submission

Links Dataset: [Link]

Title Dataset: [Link]

Introduction to Apache Spark (Recitation 10): [Video] [Slide]

Setting up Spark Cluster (Recitation 11): [Video] [Slide]

Discussion on PA3 (Recitation 12): [Video] [Slide]
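
To make the PageRank estimation concrete, here is a small single-machine sketch of the iterative update in plain Python. It does not use Spark, and the damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions rather than requirements taken from the assignment.

```python
def pagerank(links, damping=0.85, iterations=20):
    """links maps each page to the list of pages it links to.

    Every page, including link targets, must appear as a key.
    """
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        # Every page starts the round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in links}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * ranks[page] / len(outlinks)
                for target in outlinks:
                    new_ranks[target] += share
            else:
                # Assumed policy: a page with no outgoing links spreads its rank evenly.
                share = damping * ranks[page] / n
                for target in links:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 4))
```

In a Spark version the inner loop would typically become a join between the links and the current ranks followed by a per-target sum, but the arithmetic per iteration is the same.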

Term Project

The objectives of the term project are:
- Performing large-scale data analytics
- Using technologies typically used in modern data centers
- Interpreting your results to extract insight from the data

Phase 0: Term Project Team Assignment

Due: January 30, 2018 By 5:00PM
Requirement: 3 or 4 team members only
Submission: via Canvas

Phase 1: Term Project Proposal

Due: March 29, 2018 By 5:00 PM via Canvas
Submission: via Canvas, Team submission

Full description of the term project proposal: [Link]

Phase 2: Term Project Submission (Software and Report)

Due: April 26, 2018 By 5:00 PM via Canvas

Submission: via Canvas, Team submission

Description: [Link]

Phase 3: Presentations and Software Demonstration

Phase 3 includes a team presentation in class and a software demonstration to the instructor. Your presentation will be 10 minutes (including 2 minutes for Q&A and preparation). There will be 3-4 judges in the class.

Your presentation should cover:

Slide 1. Title

Slide 2. Problem description: Describe your problem and goal

Slide 3. Description of your data: The characteristics of your data (e.g. why is it challenging?)

Slide 4 (1 or 2 slides). Your approaches (Methodology)
- algorithm (e.g. a description of linear regression and how you applied it to your problem)
- programming paradigm (e.g. your MapReduce design)
- system and services

Slide 5. Discussion of your analysis
- What did you find from the results of your analysis?
- Do you think that it is accurate?
- How was the performance of your analysis? (if applicable)
- Did you find any challenges during your project? Please share those.

Slide 6. Conclusion
- Summary of your project
- Lessons learned from your project

You can provide a real-time demo if you think it will highlight your software.

Your team will also provide a short demo to the instructor.

Presentation schedule
TBA

 
