Mining Massive Datasets [2024 SoSe] | ||
---|---|---|
Code IMMD |
Name Mining Massive Datasets |
|
Credit points (LP) 6 |
Duration one semester |
Offered at least every 4th semester |
Format Lecture 2 SWS + exercise course 2 SWS (SWS = weekly contact hours per semester) |
Workload 180 h, of which 60 h lecture, 15 h exam preparation, and 105 h self-study and work on assignments (optionally in groups) |
Applicability M.Sc. Angewandte Informatik, M.Sc. Data and Computer Science, M.Sc. Scientific Computing |
Language English |
Lecturer Artur Andrzejak |
Examination scheme |
Learning objectives | * Knowledge of selected approaches and programming paradigms for parallel data processing * Knowledge of how to use tools for parallel data processing (among others, Apache Hadoop and Spark) * Familiarity with application domains of big data analysis * Knowledge of methods for parallel pre-processing of data * Knowledge of methods such as classification, regression, and clustering, and of their parallel implementations * Knowledge of the scaling behavior of parallel algorithms |
|
Course content | This module covers the following topics: * programming paradigms for parallel and distributed data processing, especially the Map-Reduce and Spark programming models * use of tools such as Apache Spark, Hadoop, Pig, Hive, and possibly other frameworks for parallel and distributed data processing * application cases in parallel data analysis, for example clustering, recommendation, search for similar objects, and mining of data streams * techniques for parallel pre-processing of data * fundamentals of analysis techniques such as classification, regression, and clustering, and evaluation of their results * parallel algorithms for data analysis and their implementations * theory and practice of scalability and tuning of frameworks |
|
Prerequisites |
Recommended: knowledge of Java/Python and of elementary probability theory and statistics; the module IBD can be taken as a complement or extension. | |
Award of credit points and final module grade | The module is completed with a graded examination. The final grade of the module is determined by the grade of this examination. Details of the examination and the requirements for the award of credit points will be given by the lecturer at the beginning of the course. | |
Useful literature | * Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press, Version 2.1, 2014 (http://www.mmds.org/) * Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009 (http://statweb.stanford.edu/~tibs/ElemStatLearn/) * Ron Bekkerman, Misha Bilenko, John Langford, Scaling Up Machine Learning, Cambridge University Press, 2012 * Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, third edition, 2012 * Books from the O'Reilly Data Science Starter Kit, 2014 (http://shop.oreilly.com/category/get/data-science-kit.do) |