[IMMD] - [de] - [Mining Massive Datasets]

Mining Massive Datasets [Summer 2021]
6 LP (credit points)
one semester
at least every 4th semester
Lecture 2 SWS + Exercise course 2 SWS
180 h, of which:
60 h lecture
15 h preparation for exam
105 h self-study and working on assignments (optionally in groups)
B.Sc. Angewandte Informatik,
M.Sc. Angewandte Informatik,
M.Sc. Scientific Computing
Learning objectives
* Knowledge of selected approaches and programming paradigms of parallel data processing
* Knowledge of how to use tools for parallel data processing (including Apache Hadoop and Spark)
* Familiarity with application domains of big data analysis
* Knowledge of methods for parallel preprocessing of data
* Knowledge of methods such as classification, regression, and clustering, and of their parallel implementations
* Knowledge of how parallel algorithms scale
Content
This module covers the following topics:
* programming paradigms for parallel and distributed data processing, especially the MapReduce and Spark programming models
* use of tools such as Apache Spark, Hadoop, Pig, Hive, and possibly other frameworks for parallel and distributed data processing
* applications of parallel data analysis, for example clustering, recommendation, search for similar objects, and mining of data streams
* techniques for parallel preprocessing of data
* fundamentals of analysis techniques such as classification, regression, and clustering, and evaluation of the results
* parallel algorithms for data analysis and their implementations
* theory and practice of scalability and tuning of frameworks
Prerequisites
Recommended: knowledge of Java/Python and of elementary probability theory and statistics. The module IBD can be taken as a complement or extension.
Written exam
Literature
* Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press, Version 2.1, 2014 (http://www.mmds.org/)
* Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009 (http://statweb.stanford.edu/~tibs/ElemStatLearn/)
* Ron Bekkerman, Misha Bilenko, John Langford, Scaling Up Machine Learning, Cambridge University Press, 2012
* Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 3rd edition, 2012
* Books from O'Reilly Data Science Starter Kit, 2014 (http://shop.oreilly.com/category/get/data-science-kit.do)