Project in Computer Science 1 - Machine Learning Algorithms and Applications

General information

TISS: (link)
contact: Max Thiessen
everything important will be announced in TUWEL.

Format

In this course, you will experience the role of a typical machine learning researcher. You will:

Familiarise yourself with a specific area of machine learning
Understand and summarise research papers
Work on a concrete problem/task in machine learning

There are three types of projects:

Theoretical: Formulate proofs, find novel bounds, show the complexity of an algorithm
Algorithmic: Implement an algorithm, perform experiments, achieve good results on data
Applied: Build an application that at its core uses machine learning

Working on the project consists of the following steps:

1. Problem statement

Choose a project topic or come up with your own, original idea. Then, write a short problem statement, including your research question, motivation, related work, methodology, and expected results.

2. Working phase

Independently work on your project. Additionally, you will have discussion meetings with your supervisor if needed.

3. Final submission

Write a final report to present the results of your work.

Projects (Tentative)

We are happy to supervise machine learning related projects that are connected to our research interests. Examples are:

Empirically understanding the role of depth in GNNs (click to expand)

Type: Coding with Python and Pytorch. Simple algorithms to generate graphs and implement the 1-WL test (or use a library for it).

Question: How deep do GNNs need to be? (on average)

Context: It has been shown that GNNs, in terms of classifying graphs up to isomorphism, are only as expressive as the 1-dimensional Weisfeiler-Lehman test [1]. However, Babai and Kucera [2] showed that 1-WL is enough to classify almost all graphs (generated under a uniform distribution). Furthermore, they show that only two colour refinement steps of WL are enough to classify almost all graphs.

We can draw a rather indirect connection between the depth of GNN and the number of colour-refinement steps, whereby the depth of the GNN and the number of refinement steps needed in 1-WL to classify graphs is the same. Given this analogy between GNN depth and colour-refinement steps, are GNN with depth two enough for classifying almost all graphs?

Goal: In this project, we would like to investigate the role of depth in GNNs empirically. The project would consist of the following sub-goals:

How well do GNNs of depth two perform in classifying graphs generated from a uniform distribution in comparison to GNNs of larger depth?
How much does this performance degrade under non-uniformity? i.e. when graphs are generated from an erdos-reyni, stochastic block model etc.

Suggested approach: Dataset generation: We suggest starting with a simple random-graph model that generates graphs uniformly at random (Erdos-Reyni, with edge probability equal to 0.5), implementing 1-WL to classify them based on isomorphism, and cataloguing the colour refinements needed. The distribution of the refinement steps should ideally be centred around two. It would be interesting to see the variance and spread of the required refinement steps around two.

GNN Training/Testing: We should train GNNs of varying depths and compare their performances on the dataset generated in the previous step.

Additional Step: Most real-world graphs do not follow an Erdos-Reyni distribution. As a first step, we can generate graphs from a Stochastic Block Model (SBM) like distribution and test the gains achieved by deeper GNNs on SBM compared to gains achieved on Erdos-Reyni.

Related work:

[1] Morris et. al. Weisfeiler and Leman go neural: higher-order graph neural networks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.
[2] Babai and Kucera. Canonical labelling of graphs in linear average time.

Algorithm Representation Learning from CLRS with Pytorch (click to expand)

Type: Implement and train graph neural networks with PyTorch (Geometric).

Recommended Skills: Experience with machine learning using Python. Ideally: experience with a deep learning framework (PyTorch, TensorFlow, …).

Task: Implement GNNs in PyTorch Geometric and train them to perform algorithms from the CLRS dataset

Context: In algorithm representation learning neural networks are trained to perform classical algorithms, for example sorting or shortest-path algorithms. This allows the algorithm to be adapted to given data. Furthermore, the resulting algorithm is differentiable which means it can be used inside another neural network or as a loss function (for an example, see Differentiably Sorting Pictures of Numbers with Algorithm Representation Learning). It is common to use graph neural networks (GNNs) as the neural network that learns to execute the algorithm.

The CLRS benchmark is a dataset of classical algorithms together with example inputs and outputs of those algorithms. The authors train GNNs on this dataset and show that they are capable of executing the algorithms. However, their implementation makes use of Jax instead of the more common PyTorch Geometric. Since most of GNN research is done with PyTorch Geometric it would be useful to have a working PyTorch Geometric implemention of a GNN trained on at least one task from CLRS.

Suggested approach: Implement or find an existing implementation of an MPNN or a graph attention network (GAT). Implement the training of this GNN on at least one task from CLRS. Evaluate the performance of the GNN against the results of Veličković et al..

Related work:

Veličković et al., The CLRS Algorithmic Reasoning Benchmark, ICML, 2022

Advisor: Fabian

Convexity in real-world graphs (click to expand)

Question: Do node labelled real-world graphs have (almost) geodesically convex classes?

Suggested approach: Check for node-labelled graphs in benchmark datasets (e.g., http://snap.stanford.edu/data/), whether the classes (vertex sets corresponding with same node label) are convex. Try to find a more realisitic notion of convexity which, for example, allows outliers and disconnected regions.

Related work:

Ignacio M. Pelayo “Geodesic Convexity in Graphs” (Springer 2013) (chapter 1, 2.1, 2.2, and 2.10) (Researchgate link)
Eike Stadtländer, Tamás Horváth, and Stefan Wrobel. “Learning weakly convex sets in metric spaces.” (ECMLPKDD 2021)
Florian Seiffarth, Tamás Horváth, and Stefan Wrobel “Maximal Closed Set and Half-Space Separations in Finite Closure Systems.” (ECMLPKDD 2019)
Maximilian Thiessen and Thomas Gärtner. “Active Learning of Convex Halfspaces on Graphs.” (NeurIPS 2021).

Context: While classical notions of convexity in Euclidean space are largely studied and used by the machine learning community, related notions in discrete spaces (e.g., graphs) have been mostly overlooked. A vertex set S is in a graph is geodesically convex, if all shortest paths joining two vertices in S stay in S. Recently, we (and other groups) have started to use the notion of convexity in graphs to achieve guarantees for machine learning problems on graphs. Currently, these results are mainly of theoretical interest, but the next step is to evaluate and test the developed methods on real-world graphs. For that, we have to find reasonable datasets and application areas that fit our convexity-based assumptions. Your goal is to identify graphs where the labelled subgraphs are (close to) convex.

Debiasing graph-based models (GNNs) (click to expand)

Type: Research methods for debiasing (graph-based) models, implement and train GNNs, apply and evaluate debiasing methods

Recommended Skills: Experience with deep learning framework such as PyTorch (Geometric)

Task: Conduct an initial research related to debiasing methods for deep neural networks (and graph-based models) and experiment in translating them to graph-based models.

Context: Bias in ML models is a very important part of ML research which tries to identify and reduce implicit biases such as gender or racial bias. While there has been a plethora of different algorithms and methods to debias deep neural networks (e.g., debiasing Recommender Systems, Large Language Models), there is still plenty of room to explore, especially for graph-based models such as GNNs. With the rising popularity of using GNNs for a variety of different tasks, we want to investige methods suitable to debias those models.

Suggested approach: Find suitable data sets, i.e., data sets with additional meta information such as gender or age that can be represented as a graph (e.g., Knowledge Graphs of recommendation tasks such as MovieLens, LFM2b); implement and train a GNN; evaluate performance of the task and metrics related to bias; apply debiasing methods and evaluate (e.g., adversarial training)

Related work:

Chen et al., Bias and Debias in Recommender System: A Survey and Future Directions, ACM Transactions on Information Systems, Vol. 41, No. 3
Ganhör et al., Unlearning Protected User Attributes in Recommendations with Adversarial Training, SIGIR 2022
RecBole as a framework for Recommender Systems and related data sets [https://recbole.io/]

Advisor: David

Debiasing deep models using Siamese Neural Networks (click to expand)

Type: Implement and train deep neural networks

Recommended Skills: Experience with deep learning using PyTorch

Task: Extend research on debiasing recommender systems (see Ganhör et al. below) by experimenting with the use of Siamese Neural Networks

Context: Bias in ML models is a very important part of ML research which tries to identify and reduce implicit biases such as gender or racial bias. Recent research has shown good results using a variety of different approaches (e.g., adversarial training). However, many of those approaches are difficult to fine tune and come with a loss in performance. Using a simpler training loop using Siamese Neural Networks, we hope for a method that is easy to tune and preserves the models performance.

Suggested approach: Implement recommender systems based on VAE architectures and different data sets including additional meta information such as gender (e.g. LFM2b, MovieLens), extend the implementation to a Siamese Neural Network architecture,

Related work:

Ganhör et al., Unlearning Protected User Attributes in Recommendations with Adversarial Training, SIGIR 2022
Koch et al., Siamese neural networks for one-shot image recognition, ICML deep learning workshop 2015

Advisor: David

Differentiably Sorting Pictures of Numbers with Algorithm Representation Learning (click to expand)

Type: Mostly coding with (ideally) pytorch.

Recommended Skills: Experience with machine learning using Python. Ideally: experience with implementing and training neural networks with a deep learning framework (PyTorch, TensorFlow, …).

Question: Is differentiable sorting with algorithm representation learning better than sorting networks?

Context: Differentiable sorting allows gradient computation of a sorting function which allows the training on datasets where only the order of labels is known. As an example consider the following task: we want to train a CNN to classify the number a picture depicts. During training we are given a batch of pictures of numbers (for example images of 123, 7, 90) and their sorted values (7, 90, 123). As we do not know which number belongs to which picture, there is no direct way to train the CNN. However, with differentiable sorting we can have the CNN predict the numbers, then we differentiably sort the numbers to compute a loss and train the CNN.

Sorting networks have been shown to work well for this task. However, we are interested in seeing if algorithm representation learning (ARL) can be used for this task. For this, we propose to train a GNN to execute a sorting algorithm which can then be used for differentiable sorting.

Goal: Train a GNN to perform differentiable sorting and evaluate them on the MNIST prediction task from Monotonic Differentiable Sorting Networks.

Suggested approach: We suggest building a GNN with pytorch geometric (or using an existing implementation) and training it on data from the CLRS dataset. For the CNN the implementation fron Petersen (code) can be used.

Related work:

Chapter 3 of: Petersen, Learning with Differentiable Algorithms, arXiv, 2022
Petersen et al., Monotonic Differentiable Sorting Networks, ICLR, 2022
Veličković et al., The CLRS Algorithmic Reasoning Benchmark, ICML, 2022

Advisor: Fabian

Disentangled Representations (click to expand)

Type: Implement and train different algorithms designed to compute disentangled representations (PyTorch)

Recommended Skills: Experience with deep learning using PyTorch

Task: Implement different state-of-the-art algorithms for computing disentangled representations and benchmark them on a variety of different data sets, preferrably (including) data sets outside the Computer Vision domain.

Context: Many state-of-the-art deep learning models in various domains such as Language Models and Recommender Systems thrive on semantincally rich hidden representations. Those representations, however, are most of the time highly entangled, i.e., a change of a single factor such as gender, colour, shape, age, … will change all the dimensions of the representation. Disentangling individual dimensions or vectors from eachother might offer benefits such as better explainability of the predictions, new insights in what the network is learning, controllability over what information flows through the network, …

Suggested approach: Find suitable data sets, i.e., data sets that come with additional meta information such as gender and age (e.g., LFM2b, MovieLens, BIOS); implement and train a variety of different algorithms suitable for the given task (e.g., beta-VAE, FactorVAE); evaluate performance of the task and metrics related to disentanglement

Related work:

Eastwood and Williams, A Framework for the Quantitative Evaluation of Disentangled Representations, ICLR 2018
Kim and Mnih, Disentangling by Factorising, PMLR 2018
Burgess et al., Understanding disentangling in beta-VAE

Advisor: David

Double-descent in graph neural networks (click to expand)

Type: Mostly implementing graph neural networks with (ideally) PyTorch and visualization of results

Recommended Skills: Experience with machine learning in Python

Question: Can we observe the double-descent phenomenon for over-paramerized graph neural networks?

Context: The so-called double-descent phenomenon describes the observation that for over-parameterized machine learning models (i.e., models which have many more parameters than, e.g., training samples) it can happen that the test error, after its initial decrease, increases again and eventually decreases again (hence, coined “double descent”). This behavior is usually amplified with noisy labels. The double-descent phenomenon is of general interest, because it means that the training error goes to zero, and yet our model does not overfit. This behavior has been shown to apply to a variety of different learning algorithms, with a focus on deep neural networks. To observe such behavior, we usually choose a certain depth (i.e., the number of layers in the neural network) and vary the width of said layers. However, there is currently little work on whether this behavior can also be observed for graph neural networks.

Suggested approach: Choose one or two GNN architectures (e.g. GIN, GCN), where you fix the depth and vary the width. First, use some synthetic data where you can vary the label noise. Then, use some graph classification benchmark datasets. Investigate whether you can observe the double-descent behavior, i.e., test error decreases, increases, and decreases again.

Related work:

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off, Proceedings of the National Academy of Sciences 116 (2019)
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever, Deep double descent: Where bigger models and more data hurt, Journal of Statistical Mechanics: Theory and Experiment 2021 (2021)

Advisor: Tamara

Counting Graph Substructures with GNNs (click to expand)

Type: Implement and train graph neural networks with PyTorch (Geometric).

Recommended Skills: Experience with machine learning using Python. Ideally: experience with a deep learning framework (PyTorch, TensorFlow, …).

Task: Implement a transformation to count subgraph patterns in a given graph dataset. Train and evaluate the performance of GNNs in counting different type of subgraphs.

Context: It is proven that message passing graph neural networks (MPNNs) can only precisely count subgraph structures that are star shaped. We are interested in investigating this experimentally. Thus, we want to train MPNNs to count patterns such as cycles or cliques.

Suggested approach: Find an implementation to count subgraph patterns and use it to augment graph datasets with pattern counts. Implement or find an implementation of a GNN and train it to predict the pattern counts.

Related work:

Chen et al., Can Graph Neural Networks Count Substructures?, NeurIPS, 2020
Liu et al., Neural Subgraph Isomorphism Counting, KDD, 2020

Advisor: Fabian

Empirical validation of generalization bounds for graph neural networks (click to expand)

Type: Mostly implementing graph neural networks with (ideally) PyTorch, some going through related work and collecting interesting generalization bounds

Recommended Skills: (Some) experience with machine learning in Python. Familiarity with mathematical proofs is helpful to understand the related work, but not strictly required for this project

Question: How (empirically) tight are generalization bounds for graph neural networks?

Context: In machine learning theory, we are interested in giving formal guarantees about the behavior of machine learning algorithms. One type of guarantee is about generalization, where we may want to provide an (upper) bound on the generalization error, i.e., how well does a model generalize to yet unseen data? Recently, there has been a number of theoretical contributions for generalization bounds for graph neural networks. It would be interesting to assess (1) how practically useful they are, and (2) how they compare to each other.

Suggested approach: Consult related work and choose two to three different generalization bounds for graph neural networks (this can be done in discussion with your supervisor). Assess them empirically on a number of graph benchmark datasets for at least one GNN architecture.

Related work:

Garg, Vikas, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. International Conference on Machine Learning. PMLR, 2020.
Renjie Liao, Raquel Urtasun, and Richard Zemel. A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks. In International Conference on Learning Representations (2021).

Advisor: Tamara

Extending Graph Embeddings with Graph Properties (click to expand)

Type: Researching interesting graph properties. Implement and train graph neural networks with PyTorch (Geometric).

Recommended Skills: Experience with machine learning using Python. Ideally: experience with a deep learning framework (PyTorch, TensorFlow, …).

Task: Find interesting graph properties (for example: connectedness), combine them with graph neural networks (GNNs) and evaluate whether this improves the prediction quality of the GNN.

Context: It is known that message passing graph neural networks (MPNNs) have limitations in the kind of functions they can express. We propose that by extending the embedding computed by an MPNN by graph properties we can make the resulting embedding more expressive and improved prediction quality. In this project we are interested in investigating whether this approach is promising and finding good graph properties.

Suggested approach: Implement or find implementations of one or two basic GNNs (for example MPNN, GCN or GAT). Look for interesting graph properties and find (or implement) methods to compute them. Combine these properties with the embedding of the GNNs and see whether this improves prediction performance over “just” the GNN.

Related work:

Xu et al., How Powerful are Graph Neural Networks?, ICLR, 2019
Barceló et al., Graph Neural Networks with Local Graph Parameters, NeurIPS, 2021
Bouritsas et al., Improving Graph Neural Network Expressivity via Subgraph Isomorphism Counting, arXiv, 2021

Advisor: Max and Fabian

Robustness vs. privacy in GNNs (click to expand)

Type: Implementation and training of GNNs with PyTorch.

Question: Are GNNs which are robust against adversarial attacks less private? Are private GNNs less robust?

Suggested Approach: Implementation/adaptation of robust GNNs and evaluation of the privacy/utility trade-off on benchmark datasets against a number of privacy attacks.

Related Work:

Song, Liwei, Reza Shokri, and Prateek Mittal. “Privacy risks of securing machine learning models against adversarial examples.” Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. 2019.
Ghazi, Badih, et al. “Robust and private learning of halfspaces.” International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
Dai, Enyan, et al. “A comprehensive survey on trustworthy graph neural networks: Privacy, robustness, fairness, and explainability.” arXiv (2022).

Context: A tension between differential privacy and robustness (against adversarial examples) has been empirically observed [Song et al., 2019] and theoretically shown [Ghazi et al., 2021]. Do these result extend to structured data? Focusing on a centralised setting: are more robust algorithms less private (or vice versa)?

Advisor: Patrick

Graph-based active learning (click to expand)

Task: Implement and design an efficient (python) framework to compare graph-based active learning algorithms.

Suggested Approach: The goal is to build a framework to compare different active learning algorithms more easily. Adding new active learning algorithms, new datasets, or new evaluation strategies should be as simple as possible. Data I/O, the active learning algorithms, the prediction algorithms, and the evaluations should be encapsulated and exchangeable (think of scikit-learn).

Related Work:

Zhu, et al. “Combining active learning and semi-supervised learning using gaussian fields and harmonic functions.” ICML 2003 workshop. Matlab implementation
Ma, et al. “Σ-Optimality for Active Learning on Gaussian Random Fields.” NIPS 2013. Matlab implementation of their and related algorithms
graph cuts (Blum and Chawla. “Learning from labeled and unlabeled data using graph mincuts.” ICML 2001. Example implementation
Dasarathy, et al. “S2: an efficient graph based active learning algorithm with application to nonparametric classification.” COLT 2015
Cesa-Bianchi, et al. “Active learning on trees and graphs”, COLT 2013
Burr Settles, “Active Learning”, Morgan & Claypool 2012.

Context: Even though implementations of graph-based active learning algorithms are often available, it is tedious to evaluate and compare them on real-world graphs. Additionally, most existing active learning libraries like modAL, alipy and libact focus on heuristic (e.g., uncertainty based) methods, whereas we are interested in graph-based methods with theoretical guarantees. Because of that, our goal is to design an easy-to-use python framework that enables efficient evaluations and comparisons of graph-based methods. In the long run, it might be possible to integrate these implementations in popular machine learning libraries like scikit-learn (https://scikit-learn.org) or vowpal wabbit (https://vowpalwabbit.org/).

Finding Graphs where Message Passing Neural Networks Fail (click to expand)

Type: Understanding a theory focused paper and implementing a graph algorithm

Recommended Skills: A good understanding of graph algorithms and some experience with theoretical computer science. Experience with implementing graph algorithms.

Task: Implement an algorithm to find pairs of non-isomorphic that cannot be distinguished by message passing graph neural networks or the Weisfeiler-Leman graph isormoprhism test

Context: It is known that message passing graph neural networks (MPNNs) have limitations in the kind of functions they can express. Formally, they are only as good in distinguish isomorphic graphs as the Weisfeiler-Leman graph isomorphism test (WL). Thus, any pair of graphs that cannot be distinguished by WL cannot be distinguished by any MPNN i.e., any MPNN will return the same output for the two graphs. Arvind et al. propose an efficient algorithm to determine whether a given graph can be distinguished from all other graphs by WL. Such graphs are called amenable. We propose using this algorithm to generate a dataset of pairs of graphs that are indistinguishable by WL and MPNNs. This dataset could serve as a benchmark to evaluate graph neural networks that are more expressive than the MPNN in distinguishing graphs.

Suggested approach: Implement the algorithm from Corollary 11 of Arvind et al.. This algorithm allows us to search for non-amenable graphs i.e., graphs for which a non-isomorphic graph exists that cannot be distinguished via WL. Implement a brute-force algorithm to find this non-isomorphic graph or (optionally) design a more efficient algorithm to find these graphs.

Related work:

Arvind et al., On the Power of Color Refinement, International Symposium on Fundamentals of Computation Theory, 2015
Xu et al., How Powerful are Graph Neural Networks?, ICLR, 2019

Advisor: Fabian

Privacy attacks on GNNs (click to expand)

Type: Implementation and training of GNNs with PyTorch.

Question: What is the privacy protection performance of differentially private GNNs against various privacy attacks?

Suggested Approach: Implementation/adaptation of existing differentially private GNNs and evaluation of the privacy/utility trade-off on benchmark datasets against a number of privacy attacks.

Related Work:

Dai, Enyan, et al. “A comprehensive survey on trustworthy graph neural networks: Privacy, robustness, fairness, and explainability.” arXiv (2022).
Sajadmanesh, Sina, and Daniel Gatica-Perez. “Locally private graph neural networks.” Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021.

Context: Similarly to deep learning algorithms trained on images and text, GNNs rely on large amounts of (possibly sensitive) data. Differential privacy is one of the possible approaches to protect sensitive data, which is of particular importance when the models are trained in a centralised way. In particular, there exist differentially private techniques that aim at preserving the privacy of features, labels, or edges of the graph, mainly focusing on protection against membership inference attacks. Nevertheless, privacy protection performance against various other privacy attacks if under-investigated.

Advisor: Patrick

Kernel Methods for Large Scale Graph Learning Problems (click to expand)

Type:

Coding and evaluation in python, likely using sklearn [4] and grakel [5].

Question:

How well do SVMs and graph kernels still do on large scale problems?

Context: Recent large scale graph level learning problems give you a few million training graphs and a learning problem such as molecule property prediction. In recent years, such tasks are typically solved by large (message passing) graph neural networks (GNN) with millions of parameters [1]. In fact, most top scoring competitors in recent challenges are in fact ensembles of GNNs. These methods require expensive hardware (e.g. 4 NVIDIA A100 for 10k€ a piece) to train in reasonable time.

One argument that is often mentioned as a drawback of SVM/SVR learning is its typical at-least-quadratic scaling behavior with the number of training examples. In particular, kernel methods are assumed to require a full Gram matrix of the input data, which is infeasible for datasets with millions of graphs.

However, multiple graph kernels have been proposed that capture different graph properties that have been shown to achieve good predictive performance on small to medium sized graph datasets see e.g. [3] [4].

Suggested approach:

If ensembling of models is an integral part of recent practical approaches using GNNs, why don’t we allow it for kernel methods, as well?

Split the training data in reasonable size chunks (e.g. 2000 graphs each) on which kernel methods are empirically fast
Train SVM/SVR on each chunk independently, using multiple graph kernels.
Ensemble the trained models using a voting scheme or a simple model such as linear regression to obtain a global model.
Optimize the approach such that a leaderboard position can be achieved, where you can write ‘Hardware: My Laptop’

Related work:

[1] https://ogb.stanford.edu/docs/lsc/pcqm4mv2
[2] Nils M. Kriege, Fredrik D. Johansson, Christopher Morris: A survey on graph kernels. Appl. Netw. Sci. 5(1): 6 (2020)
[3] Karsten M. Borgwardt, M. Elisabetta Ghisu, Felipe Llinares-López, Leslie O’Bray, Bastian Rieck: Graph Kernels: State-of-the-Art and Future Challenges. Found. Trends Mach. Learn. 13(5-6) (2020)
[4] https://scikit-learn.org/
[5] https://ysig.github.io/GraKeL/

Teaching in ML - Designing a Massive Open Online Course (click to expand)

Type: Implementation of exercises and projects for students in Python (Jupyter Notebook)

Recommended Skills: Knowledge of basic ML algorithms such as Regression, Decision Trees, SVM, Clustering algorithms, programming with Python

Task: Support our project team designing a Massive Open Online Course (MOOC) for Bachelor students which introduces ML. Part of this project is to design and implement exercises and projects as well as methods to evaluate them automatically.

Context: In collaboration with JKU Linz, our ML team is tasked to design a MOOC for introducing the concepts of AI and ML to Bachelor students. As this project involves many different tasks, we are looking for students to support us with the above described task.

Suggested approach: Description of individual tasks and implementation using Jupyter Notebooks, designing a (guided) code skeleton in Jupyter Notebooks for implementing different types of algorithms for the projects. An interesting direction for the automatic evaluation of projects might also be to train a model that evaluates the quality and correctness of Python code for ML projects.

Advisor: David

Reproducibility of ML papers (click to expand)

Type: Experimental

Recommended Skills: Some programming experience, ideally some experience with standard machine learning libraries such as PyTorch.

Task: Reproduce the results of a machine learning paper.

Context: Reproducibility, i.e., obtaining similar output using the same data, is a critical quality in research. Unfortunately, this is often not the case.

Suggested approach: Choose a paper from one of the following venues: NeurIPS, ICML, ICLR, ECML, JMLR, MLJ. Please do no just re-use existing code! (i.e., re-implement in a different programming language/library, or implement methods from multiple papers to perform a comparison.)

Related work:

NeurIPS reproducibility program (Pineau, Joelle, et al. “Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program).” JMLR, 2021)
quantifying reproducibility (Raff. “A step toward quantifying independently reproducible machine learning research.” NeurIPS 2019)
https://reproducedpapers.org/

Advisor: Depends on paper

Robust Machine Learning (click to expand)

Question: How can we find robust hypotheses that attackers cannot easily manipulate?

Suggested Approach: Approximate Tukey center of hypotheses via iterated Radon points.

Related Works:

Michael Kamp, Mario Boley, Olana Missura, and Thomas Gärtner. Effective Parallelisation for Machine Learning. Accepted for publication in Advances in Neural Information Processing Systems 30, 2017.
Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. Bayes and Tukey Meet at the Center Point. COLT 2004: Learning Theory.

Context: The more (distributed) machine learning algorithms and/or the models learned by them are to be applied in critical applications where lives depend on predictions, the more trustworthy they need to be. For that, (distributed) learning algorithms need to be robust to manipulation of (training) data and communication channels. They also need to guarantee that their models have well-calibrated confidence. The Tukey center is a high-dimensional generalisation of the median and shares similar stability properties that will be exploited in this project.

SAR Despeckling Improvement by Using Active Learning (click to expand)

Type: Implementation heavy research

Question: How Active learning can improve SAR despeckling problems?

Suggested approach: This project aims to implement actively SAR despeckling and evaluate its results by related metrics (Peak Signal-to-Noise Ratio and Equivalent Number of Looks). Here (https://github.com/grip-unina/SAR-CNN) is the implementation of SAR-CNN, which can be used for SAR image despeckling as a base model.

Related work:

Chierchia, et al. “ SAR image despeckling through convolutional neural networks “, IGARSS 2017
Burr Settles. “ Active Learning Literature Survey. 2009

Context: Synthetic Aperture Radar (SAR) is an active microwave imaging system that provides high-resolution images day and night under all weather conditions. It has been widely used in many practical applications, such as environment, crop monitoring, and disaster detection. However, interpreting SAR images is a challenging task, both for human observers and for automatic tools aiming at extracting useful information. Indeed, they are corrupted by speckle. Although commonly referred to as noise, speckle is a physical phenomenon that is caused by the coherent sum of the contributions from different elementary scatterers within the same resolution cell, which the radar cannot resolve. Speckle reduction is a crucial step in SAR image processing, and using best-suited machine learning algorithms to derive useful information from these data is essential.

SAR Despeckling Using Diffusion Models (click to expand)

Type: Implementation heavy research

Question: Can we use diffusion models for SAR despeckling?

Suggested approach: This project aims to implement SAR despeckling methods based on diffusion models and evaluate its results by related metrics (Peak Signal-to-Noise Ratio and Equivalent Number of Looks).

Related work: Perera, et al. “SAR Despeckling using a Denoising Diffusion Probabilistic Model”, arXiv:2206.04514 Fracastoro, et al. “Deep Learning Methods for Synthetic Aperture Radar Image Despeckling.” IEEE Geoscience and Remote Sensing Magazine. 2021

Scalable Interactive Analysis (click to expand)

Question: How can constraints be incorporated into meaningful low-dimensional embeddings of large data sets?

Suggested Approach: Approximate knowledge-constrained PCA via the power method.

Related Works:

Dino Oglic, Daniel Paurat, and Thomas Gärtner. Interactive Knowledge-Based Kernel PCA. Proceedings of ECML PKDD, 2014. Springer.
Daniel Paurat and Thomas Gärtner. InVis: A Tool for Interactive Visual Data Analysis. Proceedings of ECML PKDD, 2013. Springer.

Context: Traditionally, embedding techniques focus on finding one fixed embedding, which depends on some technical parameters and emphasises a singular aspects of the data. In contrast, interactive data exploration aims at allowing the user to explore the structure of a dataset by observing a projection dynamically controlled by the user in an intuitive way. Ultimately it provides a way to search and find an embedding, emphasising aspects that the user desires to highlight. Current approaches often do not allow for real-time manipulation of large data sets.

Your own idea!

Describe the scientific merit and novelty of your idea. It is very important to narrow down the rough topic to a tentative research question and approach of interest to us. The research question should not have been answered previously and the answer needs to be verifiable. To answer the question, typically one has to:

implement some machine learning algorithms and apply them to dataset,
implement an interesting application that uses machine learning, or
prove some properties of a learning algorithm.

If you choose your own topic, please briefly describe your project following this structure (check our suggested topics to get an idea):

Research question
Suggested Approach
Related Work
Context

General resources (freely available books and lecture notes)

Understanding machine learning: from theory to algorithms. Shai Shalev-Shwartz and Shai Ben-David (pdf)
Foundations of machine learning. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar (pdf)
Foundations of data science. Avrim Blum, John Hopcroft, and Ravindran Kannan (pdf)
Mathematics for machine learning. Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (pdf)
Mining of massive datasets. Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman (pdf)
Reinforcement learning: an introduction. Richard Sutton and Andrew Barto (pdf)
Research Methods in Machine Learning. Tom Dietterich (pdf)