In this course, you will experience the role of a typical machine learning researcher. You will:
There are three types of projects:
Working on the project consists of the following steps:
Choose a project topic or come up with your own original idea. Then write a short problem statement, including your research question, motivation, related work, methodology, and expected results.
Work on your project and present your work to colleagues and our group. Additionally, you will have regular discussion meetings with your supervisor.
Write a final report (short or full paper) to present your project, and design a poster. Your poster will be exhibited to your colleagues and our group during a poster session (snacks included, if Covid allows it!) at the end of the semester.
Here you can find a list of potential projects to choose from:
We are happy to supervise machine learning related projects that are connected to our research interests. Examples are:
This is an interdisciplinary project jointly with the Department of Genetics and Genome Biology at the University of Leicester and the School of Life Sciences at the University of Nottingham.
Question: How can we find the most effective (Bacteriophage) Cocktail?
Suggested Approach: Learning with structured data, structured output prediction, constructive machine learning.
Related Works:
Context: Antimicrobial resistance is a growing problem in many types of bacteria which cause disease (pathogens) in animals and humans. There is an urgent need to find more sustainable alternatives to antibiotics. Bacteriophages are viruses which infect and kill bacteria. They are quite specific, affecting only the targeted bacterial species while leaving other, potentially beneficial bacteria unharmed. Unlike other viruses, phages do not infect the cells of animals or humans and can be found widely in the environment. We will use machine learning to predict which phages (or phage cocktails) will infect Salmonella under different conditions.
Question: Do node labelled real-world graphs have (almost) geodesically convex classes?
Suggested approach: For node-labelled graphs in benchmark datasets (e.g., http://snap.stanford.edu/data/), check whether the classes (vertex sets whose vertices share the same node label) are convex. Try to find a more realistic notion of convexity which, for example, allows outliers and disconnected regions.
Related work:
Context: While classical notions of convexity in Euclidean space are widely studied and used by the machine learning community, related notions in discrete spaces (e.g., graphs) have been mostly overlooked. A vertex set S in a graph is geodesically convex if all shortest paths joining two vertices of S stay within S. Recently, we (and other groups) have started to use the notion of convexity in graphs to achieve guarantees for machine learning problems on graphs. Currently, these results are mainly of theoretical interest; the next step is to evaluate and test the developed methods on real-world graphs. For that, we have to find reasonable datasets and application areas that fit our convexity-based assumptions. Your goal is to identify graphs in which the labelled subgraphs are (close to) convex.
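To give an idea of the kind of check involved, here is a minimal sketch that tests geodesic convexity of a vertex set, assuming an unweighted graph given as an adjacency list (function names are illustrative; a real evaluation on SNAP-scale graphs would need a more efficient implementation):

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def is_geodesically_convex(adj, S):
    """True iff every vertex on a shortest path between two vertices
    of S is itself in S. A vertex w lies on some shortest u-v path
    iff d(u, w) + d(w, v) == d(u, v)."""
    S = set(S)
    dist = {u: bfs_distances(adj, u) for u in S}
    for u in S:
        for v in S:
            if u == v or dist[u].get(v) is None:
                continue  # same vertex or different components
            for w in adj:
                if w in S:
                    continue
                du, dv = dist[u].get(w), dist[v].get(w)
                if du is not None and dv is not None and du + dv == dist[u][v]:
                    return False  # w is on a shortest u-v path but not in S
    return True

# Path graph 0-1-2-3: {0, 2} is not convex (1 lies between them),
# while {2, 3} is convex.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

A relaxed notion of convexity, as suggested above, could for instance count the violating vertices instead of returning at the first one.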
Question: How can we use data from related tasks for effective learning? (across multiple biological targets and/or across languages)
Suggested Approach: Embedding tasks and hypotheses in the same space.
Related Works:
Context: In many applications of machine learning, one has to start making decisions without access to labelled training data specific to the task at hand. One approach in that case is to transfer knowledge from similar tasks. In this project we will explore approaches that jointly embed tasks and hypotheses in the same space to find a good starting hypothesis for a novel ("orphan") task.
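One simple way to realise this idea, sketched here under the assumption that we can observe a task-by-hypothesis performance matrix (the setup and all names are illustrative, not the project's fixed method), is to embed both tasks and hypotheses via a truncated SVD and warm-start a new task from its nearest embedded neighbour:

```python
import numpy as np

def embed_tasks_and_hypotheses(P, k=2):
    """P[i, j] = performance of hypothesis j on task i (hypothetical data).
    Returns k-dimensional embeddings of tasks and hypotheses."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    task_emb = U[:, :k] * s[:k]   # tasks as points in the shared space
    hyp_emb = Vt[:k, :].T         # hypotheses as points in the same space
    return task_emb, hyp_emb

def warm_start(P, task_emb, new_task_emb):
    """Start the orphan task from the best-known hypothesis of its
    nearest neighbour among the embedded tasks."""
    nearest = int(np.argmin(np.linalg.norm(task_emb - new_task_emb, axis=1)))
    return int(np.argmax(P[nearest]))
```

In practice the orphan task's embedding would itself have to be estimated, e.g. from a few early observations; that estimation step is the interesting part of the project.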
This is an interdisciplinary project jointly with the Institute for Computer Technology (ICT) at TU Wien.
Question: How well can we predict emotions from raw biosignals?
Suggested Approach: Kernel methods in Kreĭn space.
Related Works:
Context: You will be given a set of raw biosignals (PPG, EDA, temperature, and acceleration) collected by a smartwatch, as well as certain features (heart rate, respiratory rate, …) that can be extracted from these signals. These data were collected while subjects were exposed to emotional stimuli. The subjects then labeled these data with the emotions they felt and their respective strengths. The objective is to develop a suitable ML algorithm that can accurately and reliably tag the biosignals (using the raw data and/or extracted features) with those emotions or a subset of them.
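Kernels that compare such signals are often indefinite, which is where Kreĭn-space methods come in. A common preprocessing trick in that literature, sketched here as one possible starting point rather than the project's prescribed method, is to modify the spectrum of the kernel matrix (e.g., flipping negative eigenvalues) before applying a standard kernel learner:

```python
import numpy as np

def flip_spectrum(K):
    """Make an indefinite kernel matrix positive semidefinite by
    replacing its eigenvalues with their absolute values."""
    w, V = np.linalg.eigh((K + K.T) / 2)  # symmetrise for numerical safety
    return V @ np.diag(np.abs(w)) @ V.T
```

A genuinely Kreĭn-space method would instead keep the negative part and work with the indefinite inner product directly; comparing both routes on the biosignal data could be part of the project.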
Task: Design and implement an efficient Python framework to compare graph-based active learning algorithms.
Suggested Approach: The goal is to build a framework to compare different active learning algorithms more easily. Adding new active learning algorithms, new datasets, or new evaluation strategies should be as simple as possible. Data I/O, the active learning algorithms, the prediction algorithms, and the evaluations should be encapsulated and exchangeable (think of scikit-learn).
Related Work:
Context: Even though implementations of graph-based active learning algorithms are often available, it is tedious to evaluate and compare them on real-world graphs. Additionally, most existing active learning libraries like modAL, alipy, and libact focus on heuristic (e.g., uncertainty-based) methods, whereas we are interested in graph-based methods with theoretical guarantees. Because of that, our goal is to design an easy-to-use Python framework that enables efficient evaluations and comparisons of graph-based methods. In the long run, it might be possible to integrate these implementations in popular machine learning libraries like scikit-learn (https://scikit-learn.org) or vowpal wabbit (https://vowpalwabbit.org/).
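The intended encapsulation could look roughly like the following sketch, where query strategies, predictors, and the evaluation loop sit behind small exchangeable interfaces (all class and function names here are illustrative, not a fixed API):

```python
import random

class QueryStrategy:
    """Interface: pick which vertices to query next."""
    def select(self, graph, labelled, budget):
        raise NotImplementedError

class RandomStrategy(QueryStrategy):
    """Simplest baseline: query uniformly at random."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
    def select(self, graph, labelled, budget):
        pool = [v for v in graph if v not in labelled]
        return self.rng.sample(pool, min(budget, len(pool)))

def majority_predict(graph, labelled):
    """Baseline predictor: unlabelled vertices get the majority queried label."""
    if not labelled:
        return {v: None for v in graph}
    values = list(labelled.values())
    majority = max(set(values), key=values.count)
    return {v: labelled.get(v, majority) for v in graph}

def run_experiment(graph, labels, strategy, predict, budget):
    """Generic evaluation loop: query labels, predict, score accuracy."""
    labelled = {v: labels[v] for v in strategy.select(graph, {}, budget)}
    predictions = predict(graph, labelled)
    return sum(predictions[v] == labels[v] for v in graph) / len(graph)
```

New algorithms, datasets, or evaluation strategies would then plug in by implementing one of these small interfaces, in the spirit of scikit-learn's estimator conventions.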
Question: How can we find robust hypotheses that attackers cannot easily manipulate?
Suggested Approach: Approximate Tukey center of hypotheses via iterated Radon points.
Related Works:
Context: The more (distributed) machine learning algorithms, and the models they learn, are applied in critical settings where lives depend on predictions, the more trustworthy they need to be. For that, (distributed) learning algorithms need to be robust to manipulation of (training) data and communication channels. They also need to guarantee that their models have well-calibrated confidence. The Tukey center is a high-dimensional generalisation of the median and shares similar stability properties, which will be exploited in this project.
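The building block of the suggested approach is the Radon point: any d+2 points in R^d can be split into two parts whose convex hulls intersect, and a point in that intersection is a Radon point; iterating this construction over batches of hypotheses approximates the Tukey center. A minimal sketch of one Radon-point computation (illustrative, not the project's final implementation):

```python
import numpy as np

def radon_point(points):
    """Radon point of d+2 points in R^d: solve sum(l_i * x_i) = 0 and
    sum(l_i) = 0 for a nonzero l, split by sign of l, and return a
    point in the intersection of the two parts' convex hulls."""
    n, d = points.shape
    assert n == d + 2, "Radon's theorem needs exactly d + 2 points"
    A = np.vstack([points.T, np.ones(n)])   # (d+1) x (d+2) system
    l = np.linalg.svd(A)[2][-1]             # null-space vector via SVD
    pos = l > 0
    return points[pos].T @ l[pos] / l[pos].sum()

# The diagonals of a square intersect in its centre, so the Radon
# point of the four corners is the centre.
square = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
```

The robustness intuition: an attacker controlling a few input hypotheses can move the Radon point far less than the mean, similar to the median in one dimension.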
Question: How can constraints be incorporated into meaningful low-dimensional embeddings of large data sets?
Suggested Approach: Approximate knowledge-constrained PCA via the power method.
Related Works:
Context: Traditionally, embedding techniques focus on finding one fixed embedding, which depends on some technical parameters and emphasises a single aspect of the data. In contrast, interactive data exploration aims at allowing the user to explore the structure of a dataset by observing a projection dynamically controlled by the user in an intuitive way. Ultimately, it provides a way to search for and find an embedding that emphasises the aspects the user wishes to highlight. Current approaches often do not allow real-time manipulation of large data sets.
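The power method underlying the suggested approach is cheap enough to rerun as the user interacts. One reading of "knowledge-constrained PCA", sketched here with an illustrative constraint (projecting out a direction the user wants suppressed; the concrete constraint mechanism is part of the project):

```python
import numpy as np

def power_method(C, iters=200, seed=0):
    """Top eigenvector of a symmetric PSD matrix C by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=C.shape[0])
    for _ in range(iters):
        v = C @ v
        v /= np.linalg.norm(v)
    return v

def constrained_top_direction(X, forbidden=None, iters=200):
    """Top principal direction of data X, optionally constrained to be
    orthogonal to a user-given 'forbidden' direction."""
    C = np.cov(X, rowvar=False)
    if forbidden is not None:
        u = forbidden / np.linalg.norm(forbidden)
        P = np.eye(len(u)) - np.outer(u, u)
        C = P @ C @ P  # remove the constrained direction from the covariance
    return power_method(C, iters)
```

Because each power-method step is a single matrix-vector product, updates can be interleaved with user interaction, which is what makes real-time control of large data sets plausible.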
This is an interdisciplinary project with CARIAD, an automotive software company which is part of the Volkswagen Group.
Question: To which extent is it possible to perform predictive maintenance using vehicle component data and prior knowledge?
Suggested Approach: Performing literature review on predictive maintenance and comparing different modelling approaches.
Related Work:
Context: Predictive maintenance is a proactive strategy to determine the performance of an asset (e.g., a specific component in a truck) and predict when maintenance is needed before breakdowns occur. The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurised air that is utilised in various functions in a truck, such as braking and gear changes. The data consists of a subset of all available data, selected by experts. The positive class consists of component failures for a specific component of the APS. The negative class consists of trucks with failures for components not related to the APS.
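A key modelling aspect of this kind of data is that the two error types are far from equally expensive: a missed failure (truck breaks down on the road) costs much more than a false alarm (unnecessary workshop check). In the original public release of the Scania APS data, the costs were reportedly 10 per false alarm and 500 per missed failure; the sketch below uses these as illustrative defaults and shows how a cost-optimal decision threshold follows from them:

```python
def total_cost(y_true, y_pred, fp_cost=10, fn_cost=500):
    """Asymmetric misclassification cost (illustrative default costs)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp_cost * fp + fn_cost * fn

def cost_optimal_threshold(prob_pos, fp_cost=10, fn_cost=500):
    """Predict 'failure' whenever the expected cost of missing one
    exceeds the expected cost of a false alarm: p * fn_cost > (1 - p) * fp_cost."""
    return [1 if p * fn_cost > (1 - p) * fp_cost else 0 for p in prob_pos]
```

With these costs the optimal threshold sits near p ≈ 0.02 rather than 0.5, so any comparison of modelling approaches should evaluate cost, not accuracy.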
Question: Can machine learning methods be used to design more efficient algorithms/heuristics for combinatorial problems on graphs?
Suggested approach: Participate in the PACE challenge using a data-driven/learning approach.
Related work:
Context: In recent years, machine learning has increasingly been used for optimization problems on graphs. The motivation is that problem instances in practice are often very similar to each other, i.e., share related combinatorial properties. Learning approaches can identify these properties and thereby yield practically efficient algorithms for such instances. For example, one can directly use machine learning methods to find a (close to optimal) heuristic solution to the given problem, provided a large dataset of problem-instance/solution pairs (e.g., a graph and its solution). Alternatively, one can train a machine learning model to set appropriate (hyper-)parameters of a solver for the problem (e.g., decide how to branch in a branch-and-bound approach). While the primary goal of the PACE challenge seems to be to develop novel efficient fixed-parameter tractable algorithms, in this project you will tackle it with a learning-based approach.
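One common pattern is to let a learned scoring function guide a greedy heuristic. The sketch below illustrates this for vertex cover (chosen only as a familiar example; the actual PACE problem may differ), with vertex degree as a stand-in for a trained model's score:

```python
def greedy_vertex_cover(adj, score=None):
    """Greedy cover: repeatedly pick the highest-scoring vertex that is
    incident to an uncovered edge. A trained model would replace the
    default degree-based score."""
    if score is None:
        score = lambda v, adj: len(adj[v])  # stand-in for a learned score
    uncovered = {frozenset((u, v)) for u in adj for v in adj[u]}
    cover = set()
    while uncovered:
        candidates = {v for e in uncovered for v in e}
        best = max(candidates, key=lambda v: score(v, adj))
        cover.add(best)
        uncovered = {e for e in uncovered if best not in e}
    return cover
```

The learning question is then what features and training signal (e.g., instance/solution pairs from past challenge data) produce a score that beats the plain degree heuristic on the target instance distribution.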
Task: Participate in multiple Kaggle competitions and try different learning approaches. Note: if the deadline is too close, you may not be able to submit to the competition, but you can still keep working on the project for this course.
Suggested approach: Tackle 1-3 Kaggle challenges (depending on the scope of the challenges). We recommend taking at least one of the “ongoing challenges”. Additionally, tackle at least one from the “suggested further challenges” with at least two different approaches. Preferably, at least one of your challenges deals with non-tabular data, such as text, time series, graphs, and so on.
Ongoing challenges:
Suggested further challenges:
Question: When given user sessions, purchase data and content data about items, can you accurately predict which fashion item will be bought at the end of the session?
Suggested approach: Participate in the RecSys Challenge 2022.
Related work: Past winners for the RecSys Challenge.
Context: In the fashion domain, items churn at a very high rate and using content data is essential. In this challenge, a portion of the test purchases we evaluate against will be newer items with little or no past interaction data. However, these candidate items will have label data and can be recommended successfully if we can identify which labels a user session has a preference for. This accurately reflects the real world, where items can be previewed and enriched with content data before they go live, but must be recommended accurately from the moment they do.
Dataset and start date: Sign up and access the dataset, starting 7 March 2022.
Further competitions and challenges:
More competitions and challenges will be added soon!
Describe the scientific merit and novelty of your idea. It is very important to narrow down the rough topic to a tentative research question and approach of interest to us. The research question should not have been answered previously and the answer needs to be verifiable. To answer the question, typically one has to:
If you choose your own topic, please briefly describe your project following this structure (check our suggested topics to get an idea):