Mining Massive Datasets [Summer 2022]
Code
IMMD
Name
Mining Massive Datasets
Credit points (LP)
6
Duration
one semester
Offered
at least every fourth semester
Format
Lecture 2 SWS + Exercise course 2 SWS
Workload
180 h, of which:
60 h lecture
15 h preparation for exam
105 h self-study and working on assignments (optionally in groups)
Applicability
M.Sc. Angewandte Informatik
M.Sc. Data and Computer Science
M.Sc. Scientific Computing
Language
English
Lecturer
Artur Andrzejak
Examination scheme
Learning objectives
* Knowledge of selected approaches and programming paradigms of parallel data processing
* Knowledge of how to use tools for parallel data processing (among others, Apache Hadoop and Spark)
* Familiarity with application domains of big data analysis
* Knowledge of methods for parallel pre-processing of data
* Knowledge of methods such as classification, regression, and clustering, and of their parallel implementations
* Knowledge of the scaling behavior of parallel algorithms
Course content
This module covers the following topics:
* programming paradigms for parallel-distributed data processing, especially Map-Reduce and Spark programming models
* usage of tools like Apache Spark, Hadoop, Pig, Hive, and possibly other frameworks for parallel-distributed data processing
* application cases in parallel data analysis, for example clustering, recommendation, search for similar objects, mining of data streams
* techniques for parallel pre-processing of data
* fundamentals of analysis techniques such as classification, regression, clustering and evaluation of the results
* parallel algorithms for data analysis and their implementations
* theory and practice of scalability and tuning of frameworks
Prerequisites
Recommended: knowledge of Java/Python and of elementary probability theory and statistics. Module IBD can be taken as a complement or extension.
Award of credit points and final module grade
The module is completed with a graded exam. The grade of this exam is the grade for the module. Details of the exam and the requirements for the award of credit points will be announced by the lecturer at the beginning of the course.
Useful literature
* Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press, Version 2.1, 2014 (http://www.mmds.org/)
* Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009 (http://statweb.stanford.edu/~tibs/ElemStatLearn/)
* Ron Bekkerman, Misha Bilenko, John Langford, Scaling Up Machine Learning, Cambridge University Press, 2012
* Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, third edition, 2012
* Books from O'Reilly Data Science Starter Kit, 2014 (http://shop.oreilly.com/category/get/data-science-kit.do)