Data Science for Text Analytics [Winter Semester 2022/23]
Code
IDSTA
Name
Data Science for Text Analytics
Credit points (LP)
6
Duration
one semester
Offered
every second winter semester
Format
Lecture 2 SWS + Exercise 2 SWS (SWS: weekly contact hours)
Workload
180 h, of which:
60 h lecture
120 h self-study and working on assignments/projects (optionally in groups)
Applicability
B.Sc. Informatik
B.Sc. Angewandte Informatik
Not open to students who already took the lecture ITA in the winter semester 2020/21.
Language
English
Lecturer
Michael Gertz
Examination scheme
Learning objectives
Students
- can implement and apply different text analytics methods using open source NLP and machine learning frameworks
- can describe different document and text representation models and can compute and analyze characteristic parameters of these models
- know the concepts and techniques underlying Information Retrieval (IR) systems and search engines
- know how to determine, apply, and interpret use-case specific document similarity measures and underlying ranking concepts
- know the concepts and techniques underlying basic text classification and clustering approaches, such as Naïve Bayes and Logistic Regression
- understand the principles of evaluating results of text analytics components and tasks
- can implement a full-stack text analytics pipeline, from the backend IR component to the frontend UI component
- are aware of ethical issues arising from applying text analytics in different domains
- are able to apply standard software engineering practices
Course contents
- Text analytics in the context of data science
- Open source text analytics frameworks (e.g., spaCy, gensim)
- Open source Information Retrieval (IR) systems and search engines (e.g., Elasticsearch, OpenSearch)
- Components of text analytics pipelines (including tokenization, stemming, PoS tagging)
- Document and text representation models (including TF-IDF, n-grams, and embeddings)
- Document and text similarity metrics (e.g., BM25)
- Text classification and clustering approaches (e.g., Naïve Bayes, logistic regression, kNN)
- Techniques for information extraction
- Approaches, techniques, and corpora for benchmarking text analytics tasks
- Ethical and legal aspects of text analytics methods
- Text analytics project management
Prerequisites
Recommended: solid knowledge of basic calculus, statistics, and linear algebra; good Python programming skills
Award of credit points and final module grade
Assignments (40%) and programming project (60%). There are about 4-6 assignments focusing on the material covered in class at a conceptual and formal level. In the group project, 3-4 students develop a prototypical text analytics framework using an open source search engine, including design, evaluation, and a written report. The project documentation and the code must be submitted at the end of classes (GitLab), clearly indicating the contributions made by each group member. Both the assignments and the project must be graded at least satisfactory (4.0) to pass the class.
Recommended reading
The following textbook and texts are useful but not required.
- Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft)
In addition, several papers covering topics discussed in class will be provided during the course.