Description
Welcome to Big Data Systems, EECS-4415, for winter term 2021. Materials, instructions, and notices for the class will accumulate at the eClass (“Moodle”) Portal, and via links to here under the EECS Website as the term progresses.
Marks accumulate on ePost.
Essentials
lecture time: 11:30–13:00 Tuesdays & Thursdays (via Zoom)
place: virtual
instructor: Parke Godfrey
office hours: Mo 17:30–18:30 & Tu 13:00–14:00
An Overview of the Online Class
This class is online, but interactive. So how things are run necessarily differs from how we might do things if we were meeting in person. Online makes some things harder, as we all know; but it makes other things easier! We will be looking to take advantage of our online forum. For this to work, we shall need everyone to be engaged.
How will this class be conducted?
- readings
- Readings from the required textbook are assigned for each topic.
- Students should read the assigned readings in advance of the associated lecture to benefit the most.
- The textbook is closely aligned with what we cover in the course, and we are careful to be consistent with the textbook’s style and terminology.
- Articles — seminal academic papers — will be assigned as additional reading paired with some topics.
- lectures
- We meet as a class via Zoom according to our lecture schedule.
- Part of the lecture time will be conducted in a “flipped classroom” style and is intended to be fairly interactive. Our lecture meetings are the heart of the course; students are expected to attend.
- Each lecture period will include some lecturing with slides.
- Examples will be covered and hands-on walk-throughs done.
- Problems will be pitched to students, and then solutions worked out.
- The lecture Zoom sessions will also be recorded, and the lecture videos posted afterward.
- assignments
- There are six assignments spaced through the course.
- Each is directly tied to the topics presented beforehand.
- quizzes
- There will be four small quizzes, held every other Wednesday.
- midterm test
- There will be a midterm test scheduled in the middle of the term.
- final exam
- A final exam will be scheduled in the exam period.
See the syllabus for the details.
Additional materials will accumulate here as the course progresses.
Lecture Notes
Will be added throughout the term.
- Introduction
- Zen and the Art of Tool Maintenance
- MapReduce Introduction
- MapReduce Architecture [pdf]
- Data Flow & Spark
(thanks to Jure Leskovec, Intro, MapReduce & Spark, CS246: Mining Massive Data Sets)
- Link Analysis [pdf]
- Data Streams
- Analysis of Large Graphs
⋮
Lecture Recordings
Will be added throughout the term.
- Tuesday 12 January 2021 [mp4]
- Thursday 14 January 2021 [mp4]
- Tuesday 19 January 2021 [mp4]
- Thursday 21 January 2021 [mp4]
- Tuesday 26 January 2021 [mp4]
- Thursday 28 January 2021 [mp4]
- Tuesday 2 February 2021 [mp4]
- Thursday 4 February 2021 [mp4]
- Tuesday 9 February 2021 [mp4]
- Thursday 11 February 2021 [mp4]
- Tuesday 23 February 2021 [mp4]
- Thursday 25 February 2021 [mp4]
- Tuesday 2 March 2021 [mp4]
- Tuesday 9 March 2021 [mp4]
- Thursday 11 March 2021 [mp4]
- Tuesday 16 March 2021 [mp4]
- Thursday 18 March 2021 [mp4]
- Tuesday 23 March 2021 [mp4]
- Thursday 25 March 2021 [mp4]
- Tuesday 30 March 2021 [mp4]
- Thursday 1 April 2021 [mp4]
- Tuesday 6 April 2021 [mp4]
- Thursday 8 April 2021 [mp4]
⋮
Readings
Read the textbook chapters listed day by day in the Schedule. (Do the reading before that day.)
The list of assigned articles will accumulate here.
- Gray, J. & Compton, M.
A call to arms.
Queue. 3(3): 30–38, 2005 April 1.
GC2006-CallToArms.pdf
- What is the semi-structured data challenge?
- What is the idea of column store?
- Dean, J. & Ghemawat, S.
MapReduce: simplified data processing on large clusters.
Communications of the ACM.
51(1): 107–113, 2008 January 1.
DG2004-MapReduce.pdf
- What does locality mean in the context of the paper?
- What is the granularity of fault tolerance provided by MapReduce as introduced by the paper?
- Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., & Czajkowski, G.
Pregel: a system for large-scale graph processing.
ACM SIGMOD International Conference on Management of Data.
pp. 135–146, 2010 June 6.
MAB+2010-Pregel.pdf
- What is the “think like a vertex” mode of programming?
- What does it mean that edges are not first-class citizens in the Pregel model?
- Brewer, Eric.
CAP twelve years later: How the ‘rules’ have changed.
IEEE Computer.
45(2): 23–29, February 2012.
Brewer2012-CAP12YearsLater.pdf
- What is eventual consistency?
- Web apps which can work offline (HTML5) favour which, availability or consistency? Explain briefly.
- Lewis-Kraus, Gideon.
The Great A.I. Awakening.
The New York Times Magazine.
2016 December 14.
LewisKraus2016-Awakening.pdf
original at NYT Magazine
- On what grounds were neural networks considered a folly?
- What was the big difference between the approach in “the cat paper” and previous image-recognition networks?
- staff.
Blockchains: The Great Chain of being sure about Things.
The Economist.
2015 October 31.
TheEconomist2015-Blockchains.pdf
original at The Economist
- How does the puzzle stage add to bitcoin’s security?
- Who is the originator of blockchain?
- Gessert, F., Wingerath, W., Friedrich, S., & Ritter, N.
NoSQL Database Systems: A Survey and Decision Guidance.
Computer Science-Research and Development.
32(3–4): 353–365, 2017.
GWFR2016-NOSQL.pdf
- What is an advantage of and what is a disadvantage of hash sharding?
- Name disadvantages of SSD compared with HDD.
- Castaldo, J.
What really happened at Target Canada: The retailer’s last days.
Maclean’s.
2016 January 21.
Castaldo2016-TargetCanada.pdf
A cached copy at Facebook of the video, “How to go bankrupt the Target Canada way (in thirteen easy steps).”
- What company did Target Canada go to for its inventory management? What is that company known for?
- Why did Target Canada think they could do the integration in two years, whereas it had taken other retail chains much longer?
- Ching, A., Edunov, S., Kabiljo, M., Logothetis, D., & Muthukrishnan, S.
One trillion edges: Graph processing at Facebook-scale.
Proceedings of the VLDB Endowment.
8(12): 1804–1815, 2015 August 1.
CEK+2015-trillion_edges.pdf
- Is Facebook’s graph data stored in a vertex-centric way as would be assumed for input into Pregel?
- What did not scale with the original Giraph (that is, before Facebook’s updates to Giraph as enumerated in the paper) for Facebook with respect to aggregators?
- Hill, K.
Your Face Is Not Your Own.
The New York Times Magazine.
2021 March 18.
Hill2021-Face.pdf
(local PDF)
original at NYT Magazine
- Where did Clearview AI obtain its face data from?
- For what applications would low accuracy be a concern?
- Harford, T.
Big Data: Are we making a big mistake?
Financial Times.
2014 March 28.
Harford2014-Mistake.pdf
(local PDF)
original at Financial Times
- What is the multiple-comparisons problem?
- What bias exists with Boston’s Street Bump app?
Assignments
- Analysis: TopTen
- due before midnight Friday 22 January.
- marked Assn #1’s returned (pdf)
- MapReduce: Frequency
- due before midnight Friday 5 February.
- marked Assn #2’s returned (pdf)
- Stream: Scan
- due before midnight Monday 1 March.
- marked Assn #3’s returned (pdf)
- NoSQL: XQuery
- due before midnight Monday 22 March.
- marked Assn #4’s returned (pdf)
- Graph: Communities
- due before midnight Monday 12 April.
- marked Assn #5’s returned (pdf)