Computer Science Department

Introduction to Big Data: assignments CS 435
Spring 2016
| Home | Syllabus | Schedule | Assignments | Grading Policy | Course Policy | Code of Conduct | Canvas |

 

Programming Assignments
Programming Assignment 0: Feb. 5 By 5:00PM via Canvas
Programming Assignment 1: Feb. 21 By 5:00PM via Canvas
Programming Assignment 2: Mar. 23 By 5:00PM via Canvas
Programming Assignment 3: Apr. 18 By 5:00PM via Canvas

Term Project Phases
Term Project Phase 0: January 30 By 5:00PM via email (cs435@cs.colostate.edu)
Term Project Phase 1: March 30 By 5:00PM via Canvas
Term Project Phase 2 (Software/Report/Demo): April 27 By 5:00PM via Canvas
Term Project Phase 3: No submission required. Schedule your demo.

CS435 Wiki for Help [Link]



Programming Assignment 0:
Creating your own Hadoop cluster and a word count example

Due: Feb. 5 2017 By 5:00 PM

Submission: via Canvas, Individual submission

Objectives:
The goal of this programming assignment is to enable you to gain experience in
--Installing and configuring Hadoop
--Gaining familiarity with basic features of the Hadoop distributed file system
--Running a simple MapReduce example
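
As an illustration of the map and reduce steps behind a word count, here is a minimal local sketch in plain Python. It is not the course's Hadoop code (that is linked below), and the function names `map_words`, `reduce_counts`, and `word_count`, as well as the tokenization rule, are assumptions made for this example only.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    # Assumption: a "word" is a run of lowercase letters, digits, or apostrophes.
    for word in re.findall(r"[a-z0-9']+", line.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "The lazy dog"]))
```

In a real Hadoop job the framework groups the emitted pairs by key between the two steps; here a single dictionary stands in for that grouping.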

Full description: Click Here

Wiki page for this homework: Click Here

Hadoop Installation Guide: Click Here

Map-Reduce Code for Word Count: Click Here

Hadoop installation/configuration in CSB130: Click Here

For technical questions, please contact the GTA.

Help session 1: Jan 27 2017 [Video]

Help session 2: Feb 03 2017 [PDF|Video]
The links to the recorded session will be provided 3 hours after the session.

 

Programming Assignment 1: Creating an N-gram profile for the English literature corpus

Due: February 21, 2017 By 5:00 PM

Submission: via Canvas, Individual submission

Objectives:
The goal of this programming assignment is to enable you to gain experience in:
--Basic features of the Hadoop distributed file system and MapReduce
--Creating N-gram profiles using Hadoop MapReduce
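
To make the idea of an N-gram profile concrete, the following is a small local sketch in plain Python. It only counts N-grams in a single string; the exact profile format required by the assignment is defined in the full description, and the names `tokenize`, `ngrams`, and `ngram_profile` and the tokenization rule are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_profile(text, n):
    """Count how often each n-gram occurs in the text."""
    return Counter(ngrams(tokenize(text), n))


if __name__ == "__main__":
    profile = ngram_profile("to be or not to be", 2)
    for gram, count in profile.most_common():
        print(" ".join(gram), count)
```

In a MapReduce setting, the mapper would emit each N-gram as a key and the reducer would sum the counts, in the same way as a word count.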

Full description: Click Here

Wiki page for this homework: Click Here

Hadoop Installation Guide: Click Here

Download 1GB dataset: Click Here

Download test (7MB) dataset: Click Here

Sample Outputs: [1A]|[1B]

For technical questions, please contact the GTA.


Help session 3: Feb 17 2017 [Video]


 


 
Programming Assignment 2:
Content Based Authorship Detection using TF/IDF Scores and Cosine Similarity

 

Due: March 23, 2017 By 5:00 PM

Submission: via Canvas, Individual submission

Objectives:

The goal of this programming assignment is to enable you to gain experience in:

  • Creating an authorship identification system based on the similarity of the author’s attributes
  • Calculating TF/IDF scores using MapReduce
  • Calculating Cosine Distance using MapReduce
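
The two calculations named above can be sketched locally in plain Python as follows. This is an illustration under stated assumptions, not the assignment's specification: it uses raw term counts for TF and the natural logarithm of (number of documents / document frequency) for IDF, whereas the full description may define different variants. The names `tf_idf` and `cosine_similarity` are hypothetical. Cosine distance is one minus the similarity computed here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Build a sparse TF-IDF vector per document.

    docs maps a document id to its list of tokens.
    Assumed weighting: raw term count * ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        vectors[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    vectors = tf_idf({"a": ["x", "y"], "b": ["x", "z"], "c": ["y", "y"]})
    print(cosine_similarity(vectors["a"], vectors["c"]))
```

A term that appears in every document receives a weight of zero under this IDF definition, so it contributes nothing to the similarity.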

Full description: Click Here

Download 1GB dataset: Click Here

Download test (7MB) dataset: Click Here

Wiki page for this homework: Click Here

Help session 4: Mar 03 2017 [Video]


 
Programming Assignment 3:
Staging and Analytics of Wikipedia Edit History Using Hadoop, HBase, and the Hive Query Language

 

Due: April 18, 2017 By 5:00 PM

Submission: via Canvas, Individual submission

Objectives:

The objectives of this programming assignment are to enable you to gain experience with:

  • Automated data staging using Hadoop MapReduce
  • Data Import from HDFS into HBase Table
  • Performing interactive analytics using the Hive Query Language

Full description: Click Here

HBase and Hive installation Guide: Click Here

Wiki page for this homework: Click Here

Help session 5: Apr 07 2017 [PDF|Video]


 
Term Project

Phase 0: Term Project Team Assignment

Due: January 30, 2017 By 5:00PM
Requirement: 2 or 3 team members only
Submission: via email (cs435@cs.colostate.edu)

 

Phase 1: Term Project Proposal

Due: March 30 2017 By 5:00 PM via Canvas
Submission: via Canvas, Team submission


The objectives of the term project are:

  1. Performing a large-scale analysis
  2. Using technologies typically used in modern data centers
  3. Interpreting your results to extract insight from the data

Full description of the term project proposal: [Link]

 

Phase 2: Term Project Submission (Software and Report)

Due: April 27 2017 By 5:00 PM via Canvas

Submission: via Canvas, Team submission

Description:

Your final report should be more than 3,000 words and at most 5,000 words. Your final submission should include the following information:

0. Title


1. Problem description: Describe your problem and goal


2. Description of your data: The characteristics of your data (e.g. why is it challenging?)


3. Your approaches (Methodology)
- Description of the algorithm you used (e.g., a description of linear regression and how you applied it to your problem)
- Description of your software (e.g., your MapReduce design)
- System architecture



4. Discussion of your analysis
- What did you find from the results of your analysis?
- Evaluation of your approach (Accuracy, latency, etc.)
- Did you encounter any challenges during your project? Please explain them here.

 

5. Your contributions
Please specify the contributions of each team member here.

Phase 3: Presentations and Software Demonstration

Phase 3 includes a team presentation in class and a software demonstration to the instructor. Your presentation will be 10 minutes long (including 2 minutes for Q&A and preparation). There will be 3-4 judges in the class.

Your presentation should cover:

Slide 1. Title


Slide 2. Problem description: Describe your problem and goal


Slide 3. Description of your data: The characteristics of your data (e.g. why is it challenging?)


Slide 4 (1 or 2 slides). Your approaches (Methodology)
- Algorithm (e.g., a description of linear regression and how you applied it to your problem)
- Programming paradigm (e.g., your MapReduce design)
- System and services


Slide 5. Discussion of your analysis
- What did you find from the results of your analysis?
- Do you think that it is accurate?
- How did your analysis perform? (if applicable)
- Did you encounter any challenges during your project? Please share them.

Slide 6. Conclusion
- Summary of your project
- Lessons learned from your project

You may also provide a real-time demo if you think it will highlight your software.

Your team will also provide a short demo to the instructor.

Presentation schedule
TBA


 
