Mining Massive Datasets [Summer Semester 2026]
Code
IMMD
Name
Mining Massive Datasets
Credit points (LP)
6
Duration
one semester
Frequency of offering
at least every 4th semester
Format
Lecture 2 SWS + Exercise course 2 SWS
Workload
180 h, of which:
60 h lecture
15 h preparation for exam
105 h self-study and working on assignments (optionally in groups)
Applicability
M.Sc. Data and Computer Science
M.Sc. Scientific Computing
Language
English
Lecturer
Artur Andrzejak
Examination scheme
Learning objectives
The students
- know selected approaches and programming paradigms of parallel data processing,
- know how to use tools for parallel data processing (including Apache Hadoop and Spark),
- are familiar with application domains of big data analysis,
- know methods of parallel pre-processing of data,
- know methods such as classification, regression, and clustering, and their parallel implementations,
- understand how parallel algorithms scale.
Course content
This module covers the following topics:
- Programming paradigms for parallel and distributed data processing, especially the MapReduce and Spark programming models
- Use of tools such as Apache Spark, Hadoop, Pig, Hive, and possibly other frameworks for parallel and distributed data processing
- Applications of parallel data analysis, for example clustering, recommendation, similarity search, and mining of data streams
- Techniques for parallel pre-processing of data
- Fundamentals of analysis techniques such as classification, regression, and clustering, and evaluation of their results
- Parallel algorithms for data analysis and their implementations
- Theory and practice of scalability and tuning of frameworks
Prerequisites
Recommended: knowledge of Java/Python and of elementary probability theory and statistics. The module IBD can be taken as a complement or extension.
Award of credit points and final module grade
The module is completed with a graded examination. The final grade of the module is determined by the grade of the examination. Details of this examination and the requirements for the award of credit points will be given by the lecturer at the beginning of the course.
Recommended literature
Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press, Version 2.1, 2014 (http://www.mmds.org/)
Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009 (http://statweb.stanford.edu/~tibs/ElemStatLearn/)
Ron Bekkerman, Misha Bilenko, John Langford, Scaling Up Machine Learning, Cambridge University Press, 2012
Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, third edition, Morgan Kaufmann, 2012
Books from O'Reilly Data Science Starter Kit, 2014 (http://shop.oreilly.com/category/get/data-science-kit.do)