Portfolio_Modulhandbuch

[IMMD] - [de] - [Mining Massive Datasets]

Mining Massive Datasets [Summer Semester 2026]
Code IMMD	Name Mining Massive Datasets
Credit points (LP) 6	Duration one semester	Offered at least every 4th semester
Format Lecture 2 SWS + exercise course 2 SWS (SWS = weekly contact hours per semester)	Workload 180 h; thereof 60 h lecture, 15 h preparation for exam, 105 h self-study and working on assignments (optionally in groups)	Applicability M.Sc. Data and Computer Science, M.Sc. Scientific Computing
Language English	Lecturer Artur Andrzejak	Examination scheme
Learning objectives	The students
- know selected approaches and programming paradigms of parallel data processing,
- know how to use tools for parallel data processing (among others Apache Hadoop and Spark),
- are familiar with application domains of big data analysis,
- know methods of parallel pre-processing of data,
- know methods such as classification, regression, and clustering, and their parallel implementations,
- know about the scaling of parallel algorithms.
Course content	This module covers the following topics:
- Programming paradigms for parallel-distributed data processing, especially the Map-Reduce and Spark programming models
- Usage of tools like Apache Spark, Hadoop, Pig, Hive, and possibly other frameworks for parallel-distributed data processing
- Application cases in parallel data analysis, for example clustering, recommendation, search for similar objects, mining of data streams
- Techniques for parallel pre-processing of data
- Fundamentals of analysis techniques such as classification, regression, and clustering, and evaluation of the results
- Parallel algorithms for data analysis and their implementations
- Theory and practice of scalability and tuning of frameworks
Prerequisites	Recommended: knowledge of Java/Python and of elementary probability theory and statistics. Module IBD can be taken as a complement or extension.
Award of credit points and final module grade	The module is completed with a graded examination. The final grade of the module is determined by the grade of the examination. Details of this examination as well as the requirements for the award of credit points will be given by the lecturer at the beginning of the course.
Useful literature
- Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press, Version 2.1, 2014 (http://www.mmds.org/)
- Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009 (http://statweb.stanford.edu/~tibs/ElemStatLearn/)
- Ron Bekkerman, Misha Bilenko, John Langford, Scaling Up Machine Learning, Cambridge University Press, 2012
- Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 3rd edition, 2012
- Books from the O'Reilly Data Science Starter Kit, 2014 (http://shop.oreilly.com/category/get/data-science-kit.do)