[IMMD] - [2024Winter] - [en] - [Mining Massive Datasets]


Mining Massive Datasets [2024 SoSe]
Code: IMMD
Name: Mining Massive Datasets
CP: 6
Duration: one semester
Offered: at least every 4th semester
Format: Lecture 2 SWS + Exercise course 2 SWS
Workload: 180 h; thereof
* 60 h lecture
* 15 h preparation for exam
* 105 h self-study and working on assignments (optionally in groups)
Availability:
* M.Sc. Angewandte Informatik
* M.Sc. Data and Computer Science
* M.Sc. Scientific Computing
Language: English
Lecturer(s): Artur Andrzejak
Examination scheme
Learning objectives
* Knowledge of selected approaches and programming paradigms of parallel data processing
* Knowledge of how to use tools for parallel data processing (among others, Apache Hadoop and Spark)
* Familiarity with application domains of big data analysis
* Knowledge of methods of parallel pre-processing of data
* Knowledge of methods like classification, regression, clustering and their parallel implementations
* Knowledge of scaling of parallel algorithms
Learning content
This module covers the following topics:
* programming paradigms for parallel-distributed data processing, especially Map-Reduce and Spark programming models
* usage of tools like Apache Spark, Hadoop, Pig, Hive, and possibly other frameworks for parallel-distributed data processing
* application cases in parallel data analysis, for example clustering, recommendation, search for similar objects, mining of data streams
* techniques for parallel pre-processing of data
* fundamentals of analysis techniques such as classification, regression, clustering and evaluation of the results
* parallel algorithms for data analysis and their implementations
* theory and practice of scalability and tuning of frameworks
Requirements for participation
Recommended: knowledge of Java/Python and of elementary probability theory / statistics. The module IBD can be taken as a complement or extension.
Requirements for the assignment of credits and final grade
The module is completed with a graded examination. The final grade of the module is determined by the grade of the examination. Details of this examination and the requirements for the assignment of credits will be given by the lecturer at the beginning of the course.
Useful literature
* Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press, Version 2.1, 2014 (http://www.mmds.org/)
* Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009 (http://statweb.stanford.edu/~tibs/ElemStatLearn/)
* Ron Bekkerman, Misha Bilenko, John Langford, Scaling Up Machine Learning, Cambridge University Press, 2012
* Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, third edition, 2012
* Books from O'Reilly Data Science Starter Kit, 2014 (http://shop.oreilly.com/category/get/data-science-kit.do)