 # Curriculum (Winter Term 2020/21, Online)

Each lecture is split across two days and is held on Thursdays (13:30 - 17:00) and Fridays (09:00 - 12:30).

## Day 0 (22. and 23.09.2020): R basic course (separate booking required)

Many practical tutorials in the data science certificate program require basic knowledge of the statistical programming language R. If you have no basic knowledge of R or want to refresh it, we highly recommend that you learn R through self-study before the certificate program starts. Alternatively, you can book a separate R basics course at Essential Data Science Training, an official LMU spin-off. For more information, visit the R basics course registration page.

## Lecture 1 (01. and 02.10.2020): First steps in Data Analysis, Casalicchio

The aim of this course is to give an overview of different data analysis methods, including data visualization, summary statistics and data aggregation, which allow data scientists to gain insights into the data.

At the end of the course, participants should have acquired the ability to apply the learned methods to their respective fields of work and their own data, as well as the ability to present results in an easily understandable way.

• Measurement Scales (Nominal, Ordinal, Interval, Ratio)
• Descriptive Statistics
• Univariate Data Analysis
• Multivariate Data Analysis
• Use Case of Exploratory Data Analysis (here, the examples are based on the statistical programming language R)

## Lecture 2 (08. and 09.10.2020): Statistical Foundations, Kauermann

Statistical reasoning is an essential step in data analytics. Statistics makes it possible to quantify the information in data and to distinguish random variation from relevant or significant effects. This is based on probability statements and statistical reasoning. It also includes quantifying information, be it with confidence intervals or with Bayesian reasoning. The principal ideas are extended to regression models, which are presented in a general form that allows for arbitrary data formats. This lecture provides the basic foundation in statistical reasoning and inference in data science.

• Principles of Statistical Reasoning
• Bayesian Statistics/Statistical Tests
• Multiple testing/Model Selection
• Linear Regressions
• Generalized Regression / Quantile Regression
• Use Case of Exploratory Data Analysis and Regression (here, the examples are based on the statistical programming language R)

## Lecture 3 (15. and 16.10.2020): Causality and Causation & Visualization, Kauermann, Wiedemann

Data today are seldom perfect: they contain missing or erroneous entries, and one ultimately needs to ask whether the data at hand allow the question posed to be answered. This step is often overlooked, but it is particularly important for drawing the right conclusions from the data. For instance, recorded sales data usually do not allow price elasticity to be estimated. This half day provides an introduction to statistical concepts for dealing with deficient, missing and/or erroneous data. It also touches on the core ideas of causality. Moreover, the principle of bootstrapping as a resampling method is motivated.

Causality and Causation, Kauermann

• Bootstrapping
• Principles of Causation
• Error in Variables
• Missing Data

Visualization, Wiedemann

• Introduction and Background
• Visualization techniques
• Virtual Reality and Mixed Reality

Visualization is a powerful tool for analysing, exploring and understanding complex data sets. This part focuses on the theory behind data visualization and introduces different concepts for various data.

After an introduction to the basics of the topic, the different types of visualization and their advantages/disadvantages will be discussed. This is followed by an overview of modern visualization technologies as well as application examples.

## Lecture 4 (22. and 23.10.2020): Data Management, Kröger

The first part of this lecture provides an introduction to state-of-the-art techniques in data management, particularly relational databases (SQL) and data warehousing, together with a brief overview of technologies for big data beyond SQL. Participants will gain both theoretical knowledge and hands-on experience in these topics.

• Relational Databases (SQL)
• Data Warehouses and BI

## Lecture 5 (29. and 30.10.2020): Predictive Modelling 1, Bischl

Supervised machine learning, in particular by means of non-linear, non-parametric methods, has become a central part of modern data analysis in order to uncover complex patterns and relationships in data. During this training lecture, participants are introduced to decision trees and ensemble techniques like random forests and gradient boosting, as these methods offer a very attractive trade-off between complexity, predictive power and interpretability. Proper model evaluation through resampling techniques (e.g. cross-validation) is a central topic. The lecture concludes with a session on feature selection and a practical case study, including hints on data preprocessing and best practices. All methods are introduced through practical examples and demos in R, so that participants can directly apply them.

• Intro to ML (Machine Learning)
• Trees and Forests
• Resampling and model evaluation
• Variable selection
• Use case study of ML (here, the examples are based on the statistical programming language R)

## Lecture 6 (05. and 06.11.2020): Predictive Modelling 2 & Deep Learning, Thomas

This second lecture on supervised learning focuses on more advanced topics. During the first half of the lecture, participants are introduced to the main concepts of modern deep learning techniques, including their optimization, convolutional neural networks for image data and autoencoders. Practical examples will use either the keras or mxnet toolbox via R.

During the second half of the lecture, imbalanced data situations, preprocessing and pipeline configuration will be tackled. All methods are introduced through practical examples and demos in R, so that participants can directly apply them.

• Deep Learning 1: Intro and network structure
• Deep Learning 2: Optimization and demos
• Deep Learning 3: CNNs and Autoencoders
• Imbalanced data and ROC (Receiver Operating Characteristic)
• Preprocessing and feature generation
• Parameter tuning and pipeline configuration
• Use Case Study of Deep Learning with discussion (here, the examples are based on the statistical programming language R)

## Lecture 7 (12. and 13.11.2020): Unsupervised Methods, Kröger

Unsupervised learning methods, including frequent pattern mining, clustering, and anomaly/outlier detection, are essential tools for applications where no information on the data is known a priori. This training lecture gives an introduction to mining frequent itemsets and association rules as well as to finding clusters and outliers without prior knowledge such as training data. Participants will get an overview of different models and algorithms, with a discussion of strengths and weaknesses and best practices. All methods are introduced through practical examples and demos using an open-source data mining framework, so that participants can experience them in action and experiment with different parameterizations.

• Introduction & Frequent Itemset Mining
• Frequent Itemset Mining
• Clustering
• Outlier Detection
• Evaluation of Unsupervised Methods

## Lecture 8 (19. and 20.11.2020): Tools and Concepts for Large Data Sets, Kranzlmüller, gentschen Felde

This lecture focuses on tools and concepts for handling and working with large data sets. During the first half of the lecture, participants are introduced to the main concepts of designing data-intensive applications, with a special focus on parallel and high-performance computing (HPC). As cloud computing becomes more and more relevant in data science, basic concepts and storage models for cloud computing are emphasized. The theoretical introduction to Hadoop and Flink/Spark leads into the second part of the day, during which practical assignments deepen the understanding and use of both MapReduce with Hadoop and the capabilities of Flink/Spark.

• Introduction: Designing Data-Intensive Applications
• HPC (High Performance Computing) and parallel computing
• Cloud Computing

## Lecture 9 (26. and 27.11.2020): Data Privacy, Security & Visualization, gentschen Felde, Wiedemann

This lecture comprises two main topics and is thus split into two separate parts. The first half focuses on data privacy and security. After the basic terms of information security have been defined, three technical sessions follow. First, the fundamentals of cryptography are reviewed; then a more in-depth consideration of anonymization and pseudonymization of data sets is followed by a special focus on homomorphic encryption.

Data Privacy and Security, gentschen Felde

• Cryptography
• Anonymization, Pseudonymization
• Homomorphic Encryption

The second part of the lecture focuses on interactive visualization. To that end, several R libraries will be introduced and their basic usage explained. This part will be a hands-on introduction featuring code examples and in-course exercises. As a second tool, the widely used open-source visualization software ParaView will be presented.

Visualization, Wiedemann

• Plotly
• visNetwork
• networkD3
• ParaView