# Curriculum (Summer Term 2024, In-person Course)

*As we are constantly improving our material, we reserve the right to make minor changes to the topics described below.*

## Optional pre-course: R basic course (separate booking required)

For many practical sessions in the Data Science Certificate Program, you will need basic knowledge of the statistical programming language R (in some practical sessions, we may supplement the R code with additional Python code; however, our primary emphasis remains on using R in all practical code examples). If you lack basic proficiency in R or want to refresh your skills, we therefore strongly advise either learning R through self-study or enrolling in a dedicated R course, e.g. at Essential Data Science Training, an LMU spin-off that offers a specialized R course before the beginning of every program.

## Lecture 1: First Steps in Data Analysis (Dr. Casalicchio)

The aim of this course is to give an overview of different data analysis methods, including data visualization, summary statistics, and the aggregation of data, which allow data scientists to gain insights into their data.

At the end of the course, participants should have acquired the ability to apply the learned methods to their respective fields of work and their own data, as well as the ability to present results in an easily understandable way.

- Measurement Scales (Nominal, Ordinal, Interval, Ratio)
- Descriptive Statistics
- Univariate Data Analysis
- Multivariate Data Analysis
- Use Case of Exploratory Data Analysis (here, the examples are based on the statistical programming language R)
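
The course's practical examples use R; as a language-agnostic illustration of the descriptive statistics covered here, a minimal Python sketch (the sales figures are invented for illustration):

```python
import statistics

# Hypothetical sample: monthly sales figures (illustrative data only)
sales = [12, 15, 11, 19, 14, 22, 13, 17, 16, 18]

print("mean:  ", statistics.mean(sales))    # arithmetic mean
print("median:", statistics.median(sales))  # robust to outliers
print("stdev: ", statistics.stdev(sales))   # sample standard deviation
print("range: ", max(sales) - min(sales))   # simple measure of spread
```

Comparing mean and median is a first check for skewness: for this sample both are close (15.7 vs. 15.5), suggesting a roughly symmetric distribution.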

## Lecture 2: Statistical Foundations (Prof. Kauermann)

Uncertainty is omnipresent in machine learning and data analysis. Statistical reasoning is therefore essential: it allows us to quantify uncertainty and to distinguish information from random variation. This is based on probability statements and tools of statistical inference, which include classical concepts such as confidence intervals and prediction intervals as well as Bayesian reasoning. Moreover, resampling methods allow us to draw inferences based on simulations. This day provides the basic foundation in statistical reasoning and inference in Data Science. We also touch on questions concerning causality and experimental design.
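
As a small taste of the resampling methods mentioned above, a minimal bootstrap sketch in Python (the sample is simulated for illustration; the course itself works in R):

```python
import random
import statistics

random.seed(1)

# Hypothetical sample of 30 measurements (simulated for illustration)
sample = [random.gauss(100, 15) for _ in range(30)]

# Bootstrap: resample with replacement and recompute the mean many times
boot_means = []
for _ in range(2000):
    resample = random.choices(sample, k=len(sample))
    boot_means.append(statistics.mean(resample))

boot_means.sort()
# 95% percentile interval: cut off 2.5% of the bootstrap means on each side
lower = boot_means[int(0.025 * len(boot_means))]
upper = boot_means[int(0.975 * len(boot_means))]
print(f"95% bootstrap CI for the mean: [{lower:.1f}, {upper:.1f}]")
```

The interval quantifies how much the sample mean would vary under repeated sampling, without assuming a particular distribution.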

## Lecture 3: Statistical Modeling (Prof. Kauermann)

Regression can be considered the workhorse of statistical modelling. We introduce the main ideas, extend them to more complex regression setups with discrete-valued response variables and multi-level models, and demonstrate the power of modern statistical regression models with a case study. The course also touches on topics concerning data deficiencies, which occur due to missing data, biased data, or omitted (input) variables.
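
To make the starting point concrete, a minimal sketch of simple linear regression via the least-squares closed form, in Python (noise-free invented data, so the fit recovers the line exactly; the course's own examples use R):

```python
# Simple linear regression y = a + b*x via least squares (closed form).
# Illustrative data lying exactly on y = 2 + 3x, so the fit is exact.
x = [1, 2, 3, 4, 5]
y = [5, 8, 11, 14, 17]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope: covariance of x and y divided by the variance of x
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
# Intercept: the fitted line passes through the point of means
a = mean_y - b * mean_x
print(f"intercept a = {a:.1f}, slope b = {b:.1f}")  # → a = 2.0, b = 3.0
```

The extensions covered in the lecture (discrete responses, multi-level structure) generalize exactly this idea of fitting parameters to minimize a loss.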

## Lecture 4: Large-Scale Data Science and AI (Prof. Kranzlmüller)

This lecture explains ways to handle large-scale data sets that exceed the capabilities of modern desktop systems. First, the lecture provides an introduction to supercomputing and cloud computing, discussing their architectural characteristics and providing examples. This includes the main concepts of designing data-intensive applications based on parallel and high-performance computing (HPC) using GPU accelerators, as well as the basics of compute and storage models for cloud computing. The theoretical basis is explained with practical examples of using data science tools on such infrastructures.

- Introduction: Large-scale processing of data sets on supercomputers and clouds
- Concepts for HPC (High Performance Computing) and GPU (Graphics Processing Units) Computing
- Models for Cloud Computing and Storage on Data Center Infrastructures
- Practical examples using tools such as Hadoop, Flink/Spark
- A tour through the Leibniz Supercomputing Centre (LRZ)

## Lecture 5: Unsupervised Methods (Dr. Casalicchio)

While supervised machine learning focuses on creating accurate predictions for a specific target variable, unsupervised machine learning emphasizes the discovery of structures and patterns in data (without any a priori information about a target variable). Principal Component Analysis (PCA) and Cluster Analysis are well-known techniques in the field of unsupervised machine learning and are extensively covered in this course. Participants will get an overview of different models and algorithms with a discussion of strengths and weaknesses and best practices.

- Cluster Analysis (e.g., hierarchical cluster analysis and partitioning cluster algorithms such as k-Means, k-Median, and k-Medoids)
- Metrics for evaluating cluster algorithms (or indices for cluster validation)
- Characteristics, comparison, as well as advantages and disadvantages of different clustering methods
- Dimensionality reduction using Principal Component Analysis (PCA)
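
To illustrate the alternating assignment/update idea behind k-Means, a minimal pure-Python sketch on invented 2-D data (the course's practical sessions use R, and real applications would use a library implementation with k-means++ initialization):

```python
import random

random.seed(0)

def kmeans(points, k, iters=20):
    """Minimal k-means (Lloyd's algorithm) for 2-D points."""
    # Naive init: k points spread evenly through the list
    # (k-means++ is the better choice in practice)
    centers = points[::len(points) // k][:k]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

# Two well-separated blobs around (0, 0) and (5, 5) (illustrative data)
blob_a = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(50)]
centers, clusters = kmeans(blob_a + blob_b, k=2)
print("centers:", centers)
```

On this data the two centers converge near the true blob means; cluster-validation indices (as covered above) are needed when the structure is less obvious.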

## Lecture 6: Supervised Machine Learning 1 (Prof. Bischl)

Supervised machine learning, in particular by means of non-linear, non-parametric methods, has become a central part of modern data analysis in order to uncover complex patterns and relationships in data. During this training lecture, participants are introduced to simple machine learning algorithms and basic concepts such as proper model evaluation through resampling techniques (e.g. cross-validation). The lecture concludes with a practical case study, including hints on best practices. All methods are introduced through practical examples and code demos so that participants can directly apply them.

- Intro to supervised machine learning (ML) and overview of general ML tasks
- Clarification of fundamental terms such as loss function, empirical risk minimization, overfitting, hyperparameters, training, and test data, etc.
- Simple ML algorithms such as K-NN
- Overview of important evaluation metrics for regression and classification and their characteristics
- Resampling and model evaluation
- Benchmarking and comparing ML algorithms
- Simple use case study to practice the learned concepts
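
The combination of a simple learner (K-NN) with resampling-based evaluation can be sketched in a few lines of Python (invented toy data; the course's code demos use R and proper ML tooling):

```python
import random

random.seed(0)

def knn_predict(train, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda t: (t[0] - query[0]) ** 2
                                           + (t[1] - query[1]) ** 2)[:k]
    labels = [lab for _, _, lab in neighbours]
    return max(set(labels), key=labels.count)

# Two illustrative classes centred at (0, 0) and (3, 3)
data = [(random.gauss(0, 1), random.gauss(0, 1), "A") for _ in range(40)] + \
       [(random.gauss(3, 1), random.gauss(3, 1), "B") for _ in range(40)]
random.shuffle(data)

# 5-fold cross-validation: every observation is in the test set exactly once
folds = [data[i::5] for i in range(5)]
accs = []
for i in range(5):
    test = folds[i]
    train = [p for j, f in enumerate(folds) if j != i for p in f]
    correct = sum(knn_predict(train, (x, y)) == lab for x, y, lab in test)
    accs.append(correct / len(test))
print("CV accuracy:", sum(accs) / len(accs))
```

Averaging over folds gives a less optimistic performance estimate than evaluating on the training data, which is the core point of resampling.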

## Lecture 7: Supervised Machine Learning 2 (Prof. Bischl)

This second lecture on supervised learning focuses on tree-based algorithms and introduces the concept of hyperparameter optimization (tuning).

- Introduction to decision trees, random forests, and gradient boosting
- Hyperparameter optimization (random search and grid search)
- Nested cross-validation for optimal model selection
- Pitfalls and practical tips in model evaluation and selection
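
The grid-search idea can be sketched with a deliberately simple tunable model, here a one-parameter ridge-style slope fit in Python (all data and the candidate grid are invented; real tuning would use the tree-based learners and nested cross-validation discussed above):

```python
import random

random.seed(0)

# Illustrative data: y = 2x + noise, split into training and validation sets
xs = [random.uniform(0, 10) for _ in range(100)]
ys = [2 * x + random.gauss(0, 1) for x in xs]
x_tr, y_tr = xs[:70], ys[:70]
x_va, y_va = xs[70:], ys[70:]

def fit_ridge(x, y, lam):
    """Slope of y = b*x with an L2 penalty lam (1-D closed form)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi ** 2 for xi in x) + lam)

def mse(x, y, b):
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) / len(x)

# Grid search: evaluate every candidate value and keep the one with the
# lowest validation error (in practice combined with cross-validation)
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: mse(x_va, y_va, fit_ridge(x_tr, y_tr, lam)))
print("best lambda:", best_lam)
```

Random search follows the same evaluate-and-compare pattern but samples candidate configurations instead of enumerating a fixed grid, which scales better to many hyperparameters.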

## Lecture 8: Supervised Machine Learning 3 (Prof. Bischl / Dr. Casalicchio)

This third lecture on supervised learning focuses on further important and advanced topics.

- Model-agnostic interpretability
- Machine learning pipelines (simple preprocessing and handling of imbalanced classes)
- Outlook: Automated machine learning (AutoML)
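
Permutation importance is one example of a model-agnostic interpretability technique; a minimal Python sketch with an invented "model" whose true dependence on each feature is known (so the result can be checked):

```python
import random

random.seed(0)

# Hypothetical model: depends strongly on feature 0, not at all on feature 1
def model(x):
    return 3 * x[0] + 0 * x[1]

# Illustrative data with a known target
X = [[random.uniform(0, 1), random.uniform(0, 1)] for _ in range(200)]
y = [3 * x[0] for x in X]

def mse(X, y):
    return sum((model(x) - yi) ** 2 for x, yi in zip(X, y)) / len(y)

base_error = mse(X, y)  # zero here, since the "model" matches the target exactly

# Permutation importance: shuffle one feature at a time and measure how much
# the error grows; model-agnostic, since only predictions are needed
increases = []
for j in range(2):
    col = [x[j] for x in X]
    random.shuffle(col)
    X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, col)]
    increases.append(mse(X_perm, y) - base_error)
    print(f"feature {j}: error increase = {increases[j]:.3f}")
```

Permuting the irrelevant feature leaves the error unchanged, while permuting the relevant one degrades it, which is exactly the signal such methods exploit.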

## Lecture 9: Introduction to Deep Learning (Prof. Rügamer)

This lecture aims to provide a basic theoretical and practical understanding of neural networks. First, we cover the necessary background on traditional artificial neural networks, backpropagation, online learning, and regularization. Then we explain special methods used in deep learning, like dropout. We also cover further advanced topics like convolutional layers, recurrent neural networks, and autoencoders. Practical applications in relevant deep learning frameworks accompany the theoretical background and further help to understand the discussed concepts.
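
The forward pass, backpropagation, and online learning mentioned above can be sketched end to end for a tiny network in plain Python (architecture, data, and learning rate are invented for illustration; real work uses a deep learning framework):

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Tiny fully connected net: 2 inputs -> 3 hidden sigmoid units -> 1 output
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b1 = [0.0] * 3
W2 = [random.uniform(-1, 1) for _ in range(3)]
b2 = 0.0

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

def forward(x):
    h = [sigmoid(sum(W1[i][j] * x[j] for j in range(2)) + b1[i]) for i in range(3)]
    out = sigmoid(sum(W2[i] * h[i] for i in range(3)) + b2)
    return h, out

def total_loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data)

loss_before = total_loss()
lr = 0.5
for epoch in range(5000):
    for x, t in data:          # online learning: update after every example
        h, out = forward(x)
        # Backward pass: chain rule for the squared-error loss
        d_out = (out - t) * out * (1 - out)
        for i in range(3):
            d_h = d_out * W2[i] * h[i] * (1 - h[i])  # gradient at hidden unit i
            W2[i] -= lr * d_out * h[i]
            for j in range(2):
                W1[i][j] -= lr * d_h * x[j]
            b1[i] -= lr * d_h
        b2 -= lr * d_out

loss_after = total_loss()
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

With this setup the loss drops sharply as the hidden layer learns a non-linear representation of XOR; a single linear unit could not fit it at all.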