Distributed Data Management SoSe 2015

Lecture (2V+1Ü, 4 ECTS-LP) "Distributed Data Management" (Module Description), Course Number INF-24-53-V-7

  • Level:  Master
  • Language: English


This course addresses fundamental concepts of distributed data management, with an emphasis on novel approaches and paradigms for managing Big Data. It combines system issues and hands-on experience (e.g., with Hadoop/HDFS) with fundamental algorithms and techniques (such as consistent hashing or Bloom filters).

  • Big Data, Cloud Computing
  • MapReduce (Hadoop, HDFS, …)
  • Various algorithms on top of MR (such as PageRank, Min-Hashing)
  • NoSQL Stores (MongoDB, Amazon Dynamo, Riak, ...)
  • (State Machine) Replication, Paxos
  • (Eventual) Consistency Models
  • Synopses: Bloom filter, count-min sketch, KMV, ...
  • Distributed Data Stream Processing: STREAM, Storm, ...
  • Gossip protocols, consistent hashing

 Time and Location

  • Lecture:
    • KIS entry
    • Thursday, 15:30-17:00.
    • Room 42-110 (with one exception, which will be announced in the lecture; see also the KIS entry)
    • Begin: 23.04.2015
  • Exercise:
    • KIS entry
    • Tuesday, 15:30-17:00.
    • Room 46-210 (with at least one exception, which will be announced in the lecture)
    • Begin: 28.04.2015


Date News             


Students who still need to take the DDM exam can use the following doodle poll to pick a time slot. Slots are assigned first-come, first-served. Select only one slot, specify your name, and make a note of the time and date you picked. You also need to register at the examination office.


For scheduling re-exams, please contact Prof. Michel directly.


We have allocated a few additional slots for oral exams in the week of August 10-14 for students whose exceptional circumstances conflict with the standard slots already posted. If you think that your case is exceptional, please email Heike Neu, stating your conflicts, by the end of this week (i.e., end of June 28). We will decide on an individual basis whether to assign a new time slot.


Exam in German? If you would like your oral exam to be held in German, please send an email to Heike Neu (neu@cs.....) stating that.


Exam registration is now enabled. If you are already qualified or still have the chance to qualify for the exam, use this doodle form to pick a slot for your DDM exam. Select only one slot and specify your name. Slots are assigned first-come, first-served. Write down the time and date of your slot before you click the save button. You also need to register at the examination office. This registration link is only for the DDM lecture; there is a separate one for the IRDM lecture on its website. Registration is possible until the end of June 2015.


As already announced in the lecture, we will have ORAL EXAMS around the end of July / beginning of August 2015. Registration instructions will follow soon.


To clear up any confusion about when to place a mark for an assignment that consists of multiple parts: Please mark an assignment as "done" if and only if you have worked out some solution for each of its parts.


Some useful links for getting started with Hadoop: Cloudera, Apache. You can also follow the book Hadoop: The Definitive Guide by Tom White.

14.04.2015  The regulations for the course have been updated. Instead of five exercise sheets, there will be six. You can read the details in the Regulations section.

A further note on the Amazon AWS grant we received: To make use of your share of the grant for Amazon's Web Services, you need an account there, which requires a credit card. Read more about Amazon AWS here. If you do not have a credit card or do not want to use it for this purpose, no worries: Amazon AWS is not required to solve the mandatory exercises.


The regulations for qualifying for the final exam are posted. Please read them carefully.

08.04.2015 To participate in this lecture, and specifically in the exercises, you need to register for this lecture in the KIS tool; see the KIS link above.
19.03.2015 We obtained a generous grant from Amazon to use their Web services (AWS) in this lecture. More information will follow in the first lecture.


 Please read carefully.

Students need to successfully participate in the exercise sessions, according to the regulations below, in order to get admitted to the final exam. 

  • There will be 6 exercise sheets.
  • Each sheet consists of 3 assignments, which makes 18 assignments in total.
  • Each assignment is equivalent to one point.
  • A student needs to reach a total of at least 13 points throughout the semester to qualify for the final exam.
  • Solutions to exercise sheets do not have to be handed in.
  • Instead, at the beginning of each exercise session, the teaching assistant (TA) asks each student to mark on a sheet the assignments that he or she solved and can present.
  • Then, for each assignment, the TA selects students among the ones that placed a mark for this respective assignment, to present the solution.
  • This selection is made solely at the discretion of the TA and does not require any justification.
  • If the presented solution is correct, at least for the most part, the student retains the full point.
  • Else, if the presented solution is wrong but it is apparent that the student has spent time on solving it, zero points are given for the assignment.
  • Else, it should be obvious that the mark has been placed in a dishonest attempt to obtain a point without proper engagement with the assignment, in which case the entire sheet is assessed with zero points.
  • For students who were not called to present, so that the above cases do not apply, each mark placed translates to one point.
  • This assessment, i.e., which of the three cases a student's performance falls into, is, again, made solely at the discretion of the TA.
  • In addition to these 18 obligatory assignments, there might be optional assignments, to which the above regulations do not apply.
    Nonetheless, we would be happy to see lively participation in presenting and discussing their solutions.
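
The point rules above can be sketched in a few lines of Python. This is only an illustration of one reading of the regulations, not an official grading tool: the names Outcome, sheet_points, and qualifies are made up for this example, and the TA's discretionary judgment is simply taken as input.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for what happened to a marked assignment."""
    NOT_CALLED = "marked, but not called to present"
    CORRECT = "presented, correct at least for the most part"
    WRONG_BUT_HONEST = "presented, wrong, but effort is apparent"
    DISHONEST = "presented, mark judged to be dishonest"


NUM_SHEETS = 6
ASSIGNMENTS_PER_SHEET = 3
POINTS_TO_QUALIFY = 13  # out of NUM_SHEETS * ASSIGNMENTS_PER_SHEET = 18


def sheet_points(marks):
    """Points for one sheet.

    marks maps an assignment number to the Outcome for every assignment
    the student marked; unmarked assignments are simply absent.
    """
    # A single dishonest mark sets the entire sheet to zero points.
    if any(outcome is Outcome.DISHONEST for outcome in marks.values()):
        return 0
    # Correct presentations and uncalled marks earn one point each;
    # a wrong but honest presentation earns zero for that assignment.
    return sum(
        1
        for outcome in marks.values()
        if outcome in (Outcome.CORRECT, Outcome.NOT_CALLED)
    )


def qualifies(sheets):
    """True if the points over all sheets reach the qualification threshold."""
    return sum(sheet_points(marks) for marks in sheets) >= POINTS_TO_QUALIFY
```

For example, a student who marks all three assignments on five sheets and is either not called or presents correctly each time collects 15 points and qualifies, even with nothing marked on the sixth sheet.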




Slides will be made available roughly 24 hours before the lecture. Note that they are then still subject to change. The final version is uploaded after the lecture.

  • Lecture 1: Introduction and MapReduce pdf
  • Lecture 2: MapReduce, SQL (Joins) in MR, Hadoop pdf
  • Lecture 3: Hadoop, Secondary Sort, n-gram computation in MR pdf
  • Lecture 4: PageRank in MR, Min-Hashing pdf
  • Lecture 5: Pig and Hive, NoSQL, (State Machine) Replication, Paxos. pdf
  • Lecture 6: Logical Clocks, Concurrency Control, CAP Theorem, BASE, Eventual Consistency, Vector Clocks. pdf
  • Lecture 7: Vector Clocks, Consistency Models, Consistent Hashing. pdf
  • Lecture 8: Replication, Synchronization via Rumor Spreading / Anti Entropy, Merkle Trees. pdf
  • Lecture 9: Data Streams: CM sketch, FM sketch, KMV sketch pdf
  • Lecture 10: Sliding Windows, STREAM, CQL, Storm pdf
  • Lecture 11: Guest Lecture on Cloud Computing pdf


If you find potential mistakes in the script, please contact us. The script is kept up to date accordingly.

  • Lecture 4, slide 17: corrected illustration (to match the calculation)
  • Lecture 6, slide 9: should say: ..... x -> y => C(x) < C(y)
  • Lecture 3, slide 47: Added note on custom grouper.
  • Lecture 4, slide 3 and 12: Added reference to (online) book.
  • Lecture 7, slide 58: Added a note on how the formal definition of consistent hashing differs from the way keys are assigned to nodes in Chord and in this lecture.
  • Lecture 7, slide 64: Changed routing table lookup for key 54 in node p51 (i.e., p51+2, not p51+4)   (but same result in this example).
  • Ex. Sheet 1, Assignment 3 b: Changed hour to date in the emit statement of the mapper, and the stopping condition in the for loop of the reducer.
  • Ex. Sheet 5, Assignment 1 a: Added g and p to the list of events concurrent to event l. 
  • Ex. Sheet 6, Assignment 1 b: Switched the solution for Client 3 and 4.


Exercise Sheets

Hint: See above for some links to get started with Hadoop.


(c) AG DBIS, TU Kaiserslautern, 2015