Project in Computer Science 1 - Machine Learning Algorithms and Applications

General information

TISS: (link)
contact: Tamara Drucks
meeting link: https://tuwien.zoom.us/my/tamaradrucks
physical meeting room: FB0210, Erzherzog-Johann-Platz 1
everything important will be announced in TUWEL/TISS.

Format

In this course, you will experience the role of a typical machine learning researcher. You will:

Familiarise yourself with a specific area of machine learning
Understand and summarise research papers
Work on a concrete problem/task in machine learning

There are three types of projects:

Theoretical: Formulate proofs, find novel bounds, show the complexity of an algorithm
Algorithmic: Implement an algorithm, perform experiments, achieve good results on data
Applied: Build an application that at its core uses machine learning

Working on the project consists of the following steps:

1. Problem statement

Choose a project topic or come up with your own, original idea. Then, write a short problem statement, including your research question, motivation, related work, methodology, and expected results.

2. Working phase

Independently work on your project. Additionally, you will have discussion meetings with your supervisor if needed.

3. Final submission

Write a final report to present the results of your work.

Projects (Tentative)

We are happy to supervise machine learning related projects that are connected to our research interests. Examples are:

Analysis of (Bacteriophage) Cocktails (click to expand)

This is an interdisciplinary project jointly with the Department of Genetics and Genome Biology at the University of Leicester and the School of Life Sciences at the University of Nottingham,

Question: How can we find the most effective (Bacteriophage) Cocktail?

Suggested Approach: Learning with structured data, structured output prediction, constructive machine learning.

Related Works:

Dino Oglic, Steven A. Oatley, Simon J. F. Macdonald, Thomas Mcinally, Roman Garnett, Jonathan D. Hirst, and Thomas Gärtner. Active search for computer-aided drug design. Molecular Informatics, 2018.
Dino Oglic, Roman Garnett, and Thomas Gärtner. Active search in intensionally specified structured spaces. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Context: Antimicrobial resistance is a growing problem in many types of bacteria which cause disease (pathogens) in animals and humans. There is an urgent need to find alternatives to antibiotics which are more sustainable. Bacteriophages are viruses which infect and kill bacteria. They are quite specific, only affecting the targeted bacterial species while leaving other bacteria, which may be beneficial, unharmed. Unlike other viruses, phages do not infect the cells of animals or humans and can be found widely in the environment. We will use machine learning to predict which phage(cocktail)s will infect Salmonella under different conditions.

Convexity in real-world graphs (click to expand)

Question: Do node labelled real-world graphs have (almost) geodesically convex classes?

Suggested approach: Check for node-labelled graphs in benchmark datasets (e.g., http://snap.stanford.edu/data/), whether the classes (vertex sets corresponding with same node label) are convex. Try to find a more realisitic notion of convexity which, for example, allows outliers and disconnected regions.

Related work:

Ignacio M. Pelayo “Geodesic Convexity in Graphs” (Springer 2013) (chapter 1, 2.1, 2.2, and 2.10) (Researchgate link)
Eike Stadtländer, Tamás Horváth, and Stefan Wrobel. “Learning weakly convex sets in metric spaces.” (ECMLPKDD 2021)
Florian Seiffarth, Tamás Horváth, and Stefan Wrobel “Maximal Closed Set and Half-Space Separations in Finite Closure Systems.” (ECMLPKDD 2019)
Maximilian Thiessen and Thomas Gärtner. “Active Learning of Convex Halfspaces on Graphs.” (NeurIPS 2021).

Context: While classical notions of convexity in Euclidean space are largely studied and used by the machine learning community, related notions in discrete spaces (e.g., graphs) have been mostly overlooked. A vertex set S is in a graph is geodesically convex, if all shortest paths joining two vertices in S stay in S. Recently, we (and other groups) have started to use the notion of convexity in graphs to achieve guarantees for machine learning problems on graphs. Currently, these results are mainly of theoretical interest, but the next step is to evaluate and test the developed methods on real-world graphs. For that, we have to find reasonable datasets and application areas that fit our convexity-based assumptions. Your goal is to identify graphs where the labelled subgraphs are (close to) convex.

Cross-target Learning (click to expand)

Question: How can we use data from related tasks for effective learning? (across multiple biological targets and/or across languages)

Suggested Approach: Embedding them in the same space.

Related Works:

Dino Oglic, Daniel Paurat, and Thomas Gärtner. Interactive Knowledge-Based Kernel PCA. Proceedings of ECML PKDD, 2014.
Sven Giesselbach, Katrin Ullrich, Michael Kamp, Daniel Paurat, and Thomas Gärtner. Corresponding Projections for Orphan Screening. https://arxiv.org/abs/1812.00058

Context: In many applications of machine learning one has to start making decisions without having access to lablelled training data specific to the task at hand. One approach in that case is to transfer knowledge from similar tasks. In this project we will explore approaches that jointly embed tasks and hypotheses in the same space to find a good starting hypothesis for a novel (``orphan’‘) task.

Harnessing graph structures to make GNNs more efficient (click to expand)

Type: Theory based research with (ideally) some coding

Question: Can we develop methods to use graph structures (such as paths or cycles) to simplify graph neural networks that are known to be inefficient (e.g.: k-GNNs or CWNs)?

Suggested approach: This is a very open / free form project. We suggest you start by understanding how MPNNs and k-GNNs work and then see if it is necessary to apply k-GNNs to subgraphs that consist of acyclic parts of the graph. Other GNNs to consider investigating: CW Networks (CWNs) and Equivariant Subgraph Aggregation Networks (ESAN). If the time allows, theoretical findings should be backed up by experiments.

Recommended Skills: Background in proving theoretical properties of algorithms and a good understanding of graphs. Maybe also some experience with machine learning using Python and training neural networks with a deep learning framework (PyTorch, TensorFlow, Keras, …)

Related work:

Expressiveness of MPNNs: Xu et al., How Powerful are Graph Neural Networks?, ICLR 2019
k-GNNs: Morris et al., Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks, AAAI 2019
CW Networks: Bodnar et al., Weisfeiler and Lehman Go Cellular: CW Networks, NeurIPS 2021
Equivariant Subgraph Aggregation Networks: Bevilacqua et al., Equivariant Subgraph Aggregation Networks, ICLR 2022

Context: It is known that message passing graph neural networks (MPNNs) have limitations in the kind of functions they can express. While there are GNNs that improve upon MPNNs such as k-GNNs, they are generally highly inefficient. We are interested in learning if it is possible to make them more efficient by exploiting graph structures. For example, it is probably not necessary to apply the more advanced GNNs to all parts of the graph.

Extracting Emotional Status from Biosignals (click to expand)

This is an interdisciplinary project jointly with the Institute for Computer Technology (ICT) at TU Wien.

Question: How well can we predict emotions from raw biosignals?

Suggested Approach: Kernel methods in Kreı̆n Space.

Related Works:

Dino Oglic and Thomas Gärtner. Learning in Reproducing Kernel Kreı̆n Spaces. In Proceedings of the 35th International Conference on Machine learning, 2018.
David Pollreisz and Nima TaheriNejad. A Simple Algorithm for Emotion Recognition, Using Physiological Signals’of a Smart Watch. 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2017.

Context: You will be given a set of raw biosignals (PPG, EDA, temperature, and acceleration) collected by a smartwatch as well as certain features (heart rate, respiratory rate, …) that can be extracted from these signals. These data were collected while subjects were exposed to emotional stimuli. They then labeled these data with the emotions they felt and its respective strength. The objective is to develop a suitable ML algorithm that can accurately and reliably tag the biosignals (using the raw data and/or extracted features) with those emotions or a subset of those emotions.

Graph-based active learning (click to expand)

Task: Implement and design an efficient (python) framework to compare graph-based active learning algorithms.

Suggested Approach: The goal is to build a framework to compare different active learning algorithms more easily. Adding new active learning algorithms, new datasets, or new evaluation strategies should be as simple as possible. Data I/O, the active learning algorithms, the prediction algorithms, and the evaluations should be encapsulated and exchangeable (think of scikit-learn).

Related Work:

Zhu, et al. “Combining active learning and semi-supervised learning using gaussian fields and harmonic functions.” ICML 2003 workshop. Matlab implementation
Ma, et al. “Σ-Optimality for Active Learning on Gaussian Random Fields.” NIPS 2013. Matlab implementation of their and related algorithms
graph cuts (Blum and Chawla. “Learning from labeled and unlabeled data using graph mincuts.” ICML 2001. Example implementation
Dasarathy, et al. “S2: an efficient graph based active learning algorithm with application to nonparametric classification.” COLT 2015
Cesa-Bianchi, et al. “Active learning on trees and graphs”, COLT 2013
Burr Settles, “Active Learning”, Morgan & Claypool 2012.

Context: Even though implementations of graph-based active learning algorithms are often available, it is tedious to evaluate and compare them on real-world graphs. Additionally, most existing active learning libraries like modAL, alipy and libact focus on heuristic (e.g., uncertainty based) methods, whereas we are interested in graph-based methods with theoretical guarantees. Because of that, our goal is to design an easy-to-use python framework that enables efficient evaluations and comparisons of graph-based methods. In the long run, it might be possible to integrate these implementations in popular machine learning libraries like scikit-learn (https://scikit-learn.org) or vowpal wabbit (https://vowpalwabbit.org/).

GNNs for learning on Planar Graphs (click to expand)

Type: Implementation heavy research

Question: Consider planar graphs dataset and interesting tasks for the GNNs to solve. How do GNNs perform on these tasks?

Suggested approach: Planar graphs dataset are available on the internet (example https://users.cecs.anu.edu.au/~bdm/data/graphs.html). For problems to solve, we recommend algorithmic problems with already existing implementations such as path finding, computing treewidth or solving dominating set. Make sure you use problems of varying difficulties, for example shortest path finding is significantly easier than solving dominating set. Finally, for GNNs you need to use 3-GNNs, in addition we recommend an MPNN and a GAT. You do not need to implement the GNNs yourself.

Recommended Skills: Some experience with machine learning using Python. Ideally some experience with implementing and training neural networks with a deep learning framework (PyTorch, TensorFlow, Keras, …)

Related work:

3-GNNs (they are a special type of k-GNNs): Morris et al., Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks, AAAI 2019
Proof that 3-GNNs can express all functions on planar graphs (you do not need to read it):
For solving algorithmic problems with GNNs: Velickovic (FIX NAME) et al., The CLRS Algorithmic Reasoning Benchmark, ICML 2022, arXiv

Context: Many GNNs have limitiations on the kind of functions they can represent. However, if we restrict the graphs we are interested on this is not necessarily the case. It has been recently shown that 3-GNNs can express all functions on planar graphs. We are interested how planar graphs influence the ability of GNNs to solve problems on them.

Using randomness to increase the predictive powers of GNNs (click to expand)

Type: Implementation heavy research

Question: How can random vertex features be used to improve the performance of GNNs on real-life datasets?

Suggested approach: Implement or find previous implementations of a benchmark evaluations for an MPNN (ideally GIN) and possibly other GNNs. We would like to emphasize that we do not expect you to implement the benchmarks and GNNs from scratch, ideally you adapt previous implementations to save time. Afterwards, implement graph transformations that add random features to vertices and compare the impact of random vertex features on the benchmark results. Develop techniques to make better use of the random features, for example: averaging over multiple predictions with different random features, a separate GNN that compresses the random features which feed into the main GNN, random features that depend on structural properties of the graph, or anything you come up with and find interesting.

Related work:

Abboud et al., The Surprising Power of Graph Neural Networks with Random Node Initialization, IJCAI 2021, arXiv
Sato et al., Random Features Strengthen Graph Neural Networks, SDM 2021, arXiv
Dasoulas et al., Coloring graph neural networks for node disambiguation, IJCAI 2020, arXiv

Implementations of GNNs with benchmarks:

https://github.com/snap-stanford/ogb/tree/master/examples - Implementations of GIN and GCN together with benchmarks for the ogb datasets
https://github.com/beabevi/ESAN - Implementation of another GNN called ESAN together with benchmarks)

Other resources:

Benchmark datasets, data loaders, and evaluators for graph machine learning - [GitHub link]
Equivariant Subgraph Aggregation Networks (ICLR 2022 Spotlight) - [GitHub link]

Context: It has been shown that random vertex features allow message passing graph neural networks (MPNNs) to express more functions and distinguish more graphs. However, this does not necessarily mean that random vertex features improve the predictive performance in more realistic settings. The goal of this project is to investigate whether random vertex features can be used in practice or whether they remain of purely theoretical interest.

Robust Machine Learning (click to expand)

Question: How can we find robust hypotheses that attackers cannot easily manipulate?

Suggested Approach: Approximate Tukey center of hypotheses via iterated Radon points.

Related Works:

Michael Kamp, Mario Boley, Olana Missura, and Thomas Gärtner. Effective Parallelisation for Machine Learning. Accepted for publication in Advances in Neural Information Processing Systems 30, 2017.
Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. Bayes and Tukey Meet at the Center Point. COLT 2004: Learning Theory.

Context: The more (distributed) machine learning algorithms and/or the models learned by them are to be applied in critical applications where lives depend on predictions, the more trustworthy they need to be. For that, (distributed) learning algorithms need to be robust to manipulation of (training) data and communication channels. They also need to guarantee that their models have well-calibrated confidence. The Tukey center is a high-dimensional generalisation of the median and shares similar stability properties that will be exploited in this project.

Scalable Interactive Analysis (click to expand)

Question: How can constraints be incorporated into meaningful low-dimensional embeddings of large data sets?

Suggested Approach: Approximate knowledge-constrained PCA via the power method.

Related Works:

Dino Oglic, Daniel Paurat, and Thomas Gärtner. Interactive Knowledge-Based Kernel PCA. Proceedings of ECML PKDD, 2014. Springer.
Daniel Paurat and Thomas Gärtner. InVis: A Tool for Interactive Visual Data Analysis. Proceedings of ECML PKDD, 2013. Springer.

Context: Traditionally, embedding techniques focus on finding one fixed embedding, which depends on some technical parameters and emphasises a singular aspects of the data. In contrast, interactive data exploration aims at allowing the user to explore the structure of a dataset by observing a projection dynamically controlled by the user in an intuitive way. Ultimately it provides a way to search and find an embedding, emphasising aspects that the user desires to highlight. Current approaches often do not allow for real-time manipulation of large data sets.

Predictive maintenance for car chassis components (click to expand)

This is an interdisciplinary project with CARIAD, an automotive software company which is part of the Volkswagen Group.

Question: To which extent is it possible to perform predictive maintenance using vehicle component data and prior knowledge?

Suggested Approach: Performing literature review on predictive maintenance and comparing different modelling approaches.

Related Work:

Rune Prytza, Sławomir Nowaczyk, Thorsteinn Rögnvaldsson and Stefan Byttner. Predicting the need for vehicle compressor repairs using maintenance records and logged vehicle data. Engineering Applications of Artificial Intelligence 2015.

Context: Predictive maintenance is a proactive strategy to determine the performance of an asset (e.g., a specific component in a truck) and predict when maintenance is needed before breakdowns. The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS) which generates pressurised air that is utilised in various functions in a truck, such as breaking and gear changes. The data consists of a subset of all available data, selected by experts. The positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS.

Entity classification in crypto ecosystems (click to expand)

This is an interdisciplinary project jointly with the Complexity Science Hub Vienna

Question: How can we use graph representation learning for entity classification in crypto ecosystems?

Suggested Approach: The goal of this project should be to classify entities in the Ethereum relationships between Ethereum addresses, which can be represented in a multigraph. The paper of Weber [1] should be the guildine and reference point how the project could look like.

Related Works:

Weber, Mark, et al. “Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics.” arXiv preprint arXiv:1908.02591 (2019).
Lin, Dan, et al. “Modeling and understanding ethereum transaction records via a complex network approach.” IEEE Transactions on Circuits and Systems II: Express Briefs 67.11 (2020): 2737-2741.
Hamilton, William L. “Graph representation learning.” Synthesis Lectures on Artifical Intelligence and Machine Learning 14.3 (2020): 1-159.

Recommended requirements:

Crypto Asset Analytics course (192.080) or at least an advances understanding of cryptocurrencies;
Machine learning in Python;
Network science basic knowledge;

Your own idea!

Describe the scientific merit and novelty of your idea. It is very important to narrow down the rough topic to a tentative research question and approach of interest to us. The research question should not have been answered previously and the answer needs to be verifiable. To answer the question, typically one has to:

implement some machine learning algorithms and apply them to dataset,
implement an interesting application that uses machine learning, or
prove some properties of a learning algorithm.

If you choose your own topic, please briefly describe your project following this structure (check our suggested topics to get an idea):

Research question
Suggested Approach
Related Work
Context

General resources (freely available books and lecture notes)

Understanding machine learning: from theory to algorithms. Shai Shalev-Shwartz and Shai Ben-David (pdf)
Foundations of machine learning. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar (pdf)
Foundations of data science. Avrim Blum, John Hopcroft, and Ravindran Kannan (pdf)
Mathematics for machine learning. Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (pdf)
Mining of massive datasets. Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman (pdf)
Reinforcement learning: an introduction. Richard Sutton and Andrew Barto (pdf)
Research Methods in Machine Learning. Tom Dietterich (pdf)