Computer Science Department

Big Data: Schedule

CS535

Fall 2017

Note that this schedule may change during the semester. Please check it every week.
Part 0. Introduction to Big Data
Week 1. (8/22, 8/24)

Topics
What is Big Data?
Course Introduction
A Paradigm for Big Data
- Lambda architecture
- Data Models



Readings
Keshav's "How to read a paper"[Link]
"How to Read and Understand a Scientific Paper: A Step-by-Step Guide for Non-Scientists"[Link]

Lecture Notes
8/22: Download
8/24: Download

Notes
8/25 Restricted drop deadline
9/6 Last day to add or drop most courses on RAMweb
CSU Academic Calendar 2017 ~ 2018 [Link]
Term project planning (team and topic area) due on 8/28 5:00PM via email

Part 1. Batch Computing Models for Big Data Analytics
Week 2. (8/29, 8/31)

Topics
Distributed Model for Scalable Batch Computing
- MapReduce

Readings
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI '04: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, Vol. 6 [Link]

Lecture Notes
8/29: Download
8/31: Download


Notes

9/6 Last day to add or drop most courses on RAMweb
CSU Academic Calendar 2017 ~ 2018 [Link]

Term project planning (team and topic area) due on 8/28 5:00PM via email


 
Week 3. (9/5, 9/7)

Topics
Advanced Batch Computing at Scale (1): Link Analysis

Readings

Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman, "Mining of Massive Datasets", Cambridge University Press, Chapter 5. [Link]
Kleinberg, Jon. "Authoritative sources in a hyperlinked environment". Journal of the ACM 46 (5): 604–632. [Link]

Lecture Notes
9/5: Download
9/7: Download



 
Week 4. (9/12, 9/14)

Topics
In-Memory Cluster Computing: Apache Spark 1

Readings
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", USENIX NSDI 2012 [Link]

Lecture Notes
9/12: Download
9/14: Download


Notes

 
Week 5. (9/19, 9/21)

Topics
In-Memory Cluster Computing: Apache Spark 2
Distributed File Systems: Google File System and Colossus

Readings
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", USENIX NSDI 2012 [Link]

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, "The Google File System", Proceedings of SOSP 2003: 29-43 [Link]

C.K.P. Clarke, Reed-Solomon Error Correction [Link]

Lecture Notes
9/19: Download
9/21: Download

Notes


 
Week 6. (9/26, 9/28)

Topics
Distributed File Systems: Continued

Readings
Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, and Andrew Y. Ng, Map-Reduce for Machine Learning on Multicore, NIPS 2006: 281-288, [Link]

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (August 2009), 30-37. [Link]

Lecture Notes
9/26: Download
9/28: Download

Notes
Programming Assignment 1 due on 9/27 5:00PM via Canvas

 

Week 7. (10/3, 10/5)

Topics
Advanced Batch Computing at Scale with Analytics Algorithms using Spark
- Predicting Forest Cover with Random Forest
- Recommendation system with Audioscrobbler dataset
- Validation: bootstrapping, jackknife, and cross-validation
- Understanding Wikipedia with Latent Semantic Analysis



Readings

Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, and Andrew Y. Ng, Map-Reduce for Machine Learning on Multicore, NIPS 2006: 281-288, [Link]


Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (August 2009), 30-37. [Link]

 

 

Lecture Notes
10/3: Download
10/5: Download


Notes

Term project proposal due on Oct. 11 (5:00PM) via Canvas

Part 2. Scalable Frameworks for Real-time Big Data Analytics
Week 8. (10/10, 10/12)
 

Topics
Advanced Batch Computing at Scale with Analytics Algorithms using Spark (continued)

Readings
Greg Linden, Brent Smith, and Jeremy York, "Amazon.com Recommendations: Item-to-Item Collaborative Filtering", IEEE Internet Computing, 2003 [Link]

Lecture Notes
10/10: Download
10/12: Canceled


Notes


 
Week 9. (10/17, 10/19)

Topics
Student Presentation (Term project proposal)
Schedule:

[10/17, Tuesday]
Team Boxelder
Team Chokecherry
Team Engelmann Spruce
Team Gambel Oak
Team Limber Pine
Team Lodgepole Pine

[10/19, Thursday]
Team Narrowleaf Cottonwood
Team Peachleaf Willow
Team Pinon Pine
Team Quaking Aspen
Team Rocky Mountain Juniper
Team Rocky Mountain Maple

 
Week 10. (10/24, 10/26)

Topics
Framework for Real-time data stream analytics: Apache Storm (1)


Readings
Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy, "Storm@twitter", Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, June 22-27, 2014, Snowbird, Utah [Link]

Zhiyuan Cheng, James Caverlee, and Kyumin Lee, "You are where you Tweet: A Content-Based Approach to Geo-locating Twitter Users", ACM CIKM 2010 [Link]


Lecture Notes
10/24: Download
10/26: Download

Notes


 
Week 11. (10/31, 11/2)

Topics
Framework for Real-time data stream analytics: Apache Storm (2)
Scalable NoSQL storage systems (1)
- Distributed Hash Tables and Apache/Facebook Cassandra


Readings
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications", Proc. 2001 SIGCOMM, Mar. 2001, pp. 149-160 [Link]

Avinash Lakshman, Prashant Malik, "Cassandra: A Decentralized Structured Storage System", ACM SIGOPS Operating Systems Review, Vol. 44(2), April 2010, pp. 35-40 [Link]


Lecture Notes
10/31: Download
11/2: Download


Notes
Programming Assignment 2 due on 11/6 5:00PM via Canvas

 
Week 12. (11/7, 11/9)

Topics

Scalable NoSQL storage systems (2)
- Distributed Hash Tables and Apache/Facebook Cassandra (continued)

Readings
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications", Proc. 2001 SIGCOMM, Mar. 2001, pp. 149-160 [Link]

Avinash Lakshman, Prashant Malik, "Cassandra: A Decentralized Structured Storage System", ACM SIGOPS Operating Systems Review, Vol. 44(2), April 2010, pp. 35-40 [Link]


Lecture Notes

11/7: Download
11/9: Download


Notes



Week 13. (11/14, 11/16)

Topics
Midterm Exam (11/16 in class)


Lecture Notes
11/14: Download

Notes


Week 14. (11/21, 11/23)

Fall Recess: No class


Week 15. (11/28, 11/30)

Topics
Frameworks for Graph Data Analytics
- Pregel
- GraphX


Readings

Grzegorz Malewicz et al., "Pregel: A System for Large-Scale Graph Processing", SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135-146 [Link]

Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, Sanjeev Kumar, "f4: Facebook’s Warm BLOB Storage System", OSDI 2014 [Link]

Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel, "Finding a needle in Haystack: Facebook's photo storage", OSDI 2010 [Link]

Reynold S. Xin, Joseph E. Gonzalez, Michael J. Franklin, Ion Stoica, "GraphX: A Resilient Distributed Graph System on Spark", Proceedings of the First International Workshop on Graph Data Management Experience and Systems (GRADES 2013), June 23, 2013, New York, New York, USA [Link]


Lecture Notes
11/28: Download
11/30: Download

Notes
Term project: Final report due on 12/4

 
Week 16. (12/5,12/7) --> (12/13 ~ 15) Final week

Topics
Term Project Presentation Sessions I and II

Notes
Presentation schedule: TBA
Demonstration schedule: TBA

