Projects

Danish Foundation Models (since 2025)

The aim of the Danish Foundation Models project is to develop and evaluate large multilingual and multimodal foundation models that excel at Danish – trained only on permissible Danish data, with the full models (open-weight) and their source code (open-source) made available. The project is a collaborative effort between the University of Southern Denmark, Aarhus University, the University of Copenhagen, and the Alexandra Institute.

Project website

Interpretability of Language Models (since 2023)

Large language models have shown impressive results and have been adopted rapidly and widely. However, we do not know how the trained models decide what text to generate. This project therefore investigates the interpretability of language models through structural probing, behavioral probing, and mechanistic interpretability – and by bringing theories and experimental paradigms from psycholinguistics into machine learning.

Paper: Deep neural networks and humans both benefit from compositional language structure. Nature Communications 15:10816.
Paper: Learning and communication pressures in neural networks: Lessons from emergent communication. Language Development Research 5(1).
Paper: Morphology Matters: Probing the Cross-linguistic Morphological Generalization Abilities of Large Language Models through a Wug Test. Published at the CMCL workshop at ACL 2024.
Poster at the Highlights in the Language Sciences conference, July 8–11, 2024
Poster at the workshop on Using Artificial Neural Networks for Studying Human Language Learning and Processing, June 10–12, 2024
Extended abstract on Testing the Linguistic Niche Hypothesis in Large Language Models with a Multilingual Wug Test published in the Proceedings of Evolang XV (pp. 91–94)
Podium presentation by Anh Dang at Evolang 2024
Talk at the Protolang 8 conference, Rome, September 27–28, 2023. Abstract published in the Protolang 8 book of abstracts

Machine Communication and the Emergence of Language (2022–2024)

Imagine you put artificial neural network agents into a box and give them a task that can only be solved with communication. How do neural network agents learn to communicate? What communication protocols emerge? What are the parallels to human communication?

Machine Communication and the Emergence of Language. Talk given at the MPI Proudly Presents series, July 4, 2024, Max Planck Institute for Psycholinguistics, Nijmegen, NL.
Invited talk at the workshop on Using Artificial Neural Networks for Studying Human Language Learning and Processing, June 10–12, 2024
Machine Learning and the Evolution of Language workshop at the Joint Conference on Language Evolution
Emergent Communication workshop paper at ICLR

What Matters for Text Classification (2021–2024)

Automatically categorizing text documents is a popular research topic, with numerous new approaches published each year. But do they really offer an advantage over earlier approaches? This question triggered this line of research, and the answer is worrisome. For instance, many recently proposed methods such as TextGCN are outperformed by a simple multilayer perceptron on a bag-of-words representation – a decade-old technique enhanced with today’s best practices.

Even in the era of language models, the results on text classification are sometimes counter-intuitive. For example, fine-tuned small models typically outperform large language models.
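To make the bag-of-words baseline mentioned above concrete, here is a minimal sketch of a multilayer perceptron on term counts, written in plain Python. It is not the setup used in the papers: the toy texts, labels, hidden size, learning rate, and number of epochs are all assumptions chosen for illustration, and none of the "best practices" referred to above are included.

```python
import math
import random

# Toy two-class dataset; texts and labels are made up for illustration.
train = [
    ("graph neural network training", 0),
    ("neural network optimization", 0),
    ("deep network training", 0),
    ("library catalogue metadata", 1),
    ("citation metadata catalogue", 1),
    ("library citation index", 1),
]

vocab = sorted({w for text, _ in train for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(text):
    """Map a text to a vector of term counts; unknown words are ignored."""
    vec = [0.0] * len(vocab)
    for w in text.split():
        if w in index:
            vec[index[w]] += 1.0
    return vec

# Hypothetical hyperparameters: one hidden layer with 16 units.
random.seed(0)
n_in, n_hidden, n_out = len(vocab), 16, 2
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [0.0] * n_out

def forward(x):
    """One hidden ReLU layer followed by a softmax output layer."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    z = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return h, [v / s for v in e]

def train_step(x, y, lr=0.1):
    """One stochastic gradient step on the cross-entropy loss; returns the loss
    measured before the update."""
    h, p = forward(x)
    dz = [p[k] - (1.0 if k == y else 0.0) for k in range(n_out)]
    # Backpropagate into the hidden layer before W2 is modified.
    dh = [sum(dz[k] * W2[k][j] for k in range(n_out)) if h[j] > 0 else 0.0
          for j in range(n_hidden)]
    for k in range(n_out):
        for j in range(n_hidden):
            W2[k][j] -= lr * dz[k] * h[j]
        b2[k] -= lr * dz[k]
    for j in range(n_hidden):
        for i in range(n_in):
            W1[j][i] -= lr * dh[j] * x[i]
        b1[j] -= lr * dh[j]
    return -math.log(p[y])

def run_epoch():
    """Train on every example once and return the mean loss seen during the pass."""
    return sum(train_step(bag_of_words(t), y) for t, y in train) / len(train)

def predict(text):
    _, p = forward(bag_of_words(text))
    return p.index(max(p))

first_loss = run_epoch()
for _ in range(200):
    last_loss = run_epoch()

print(f"mean loss: first epoch {first_loss:.3f}, last epoch {last_loss:.3f}")
print([predict(text) for text, _ in train])
```

The point of the sketch is how little machinery the baseline needs: a count vector, one hidden layer, and a softmax output.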

Lifelong Graph Representation Learning (2019–2024)

Graph neural networks have quickly risen to be the standard technique for machine learning on graph-structured data. Yet graph neural networks are usually only applied to static snapshots of graphs, while real-world graphs (social media, publication networks) are continually evolving. Evolving graphs come with challenges that are rarely reflected in the graph representation learning literature, such as dealing with new classes in node classification. We pursue a lifelong learning approach for graph neural networks on evolving graphs and investigate incremental training, out-of-distribution detection, and issues caused by an imbalanced class distribution.

Analyzing the Scientific, Economic and Societal Impact of Research Activities and Research Networks (Q-AKTIV) (2019–2022, funded by BMBF)

The aim of Q-AKTIV is to improve the methods for forecasting dynamics and interactions between research, technology development, and innovation. The network analysis methods will be based on recent developments in deep learning. In addition to the emergence of new knowledge areas and networks, we focus on the convergence processes of established sectors. The development and evaluation of the new methods initially take place in the field of life sciences, which is characterized by high market dynamics. The additional application in economics enables a systematic comparison of the dynamics between scientific disciplines. The new methods will be used to predict the impact of existing research and network structures on the dynamics of knowledge and technologies as well as the future relevance of topics and actors. The result of Q-AKTIV is an implemented and evaluated instrument for the strategic analysis and prognosis of the dynamics in science and innovation. This complements today’s primarily qualitative approaches to early strategic planning and increases the decision-making ability of research institutions, policy makers, and industries. Beyond the analysis of dynamics, valuable indicators for R&D performance measurement can also be derived, e.g., the registration of patents based on scientific publications, the economic development of the companies involved, and the outreach of research activities. The practice partner brings in the necessary experience in the field of business valuation and strategy development and ensures practical testing of the toolkit.

Representation Learning for Texts and Graphs (2017–2022)

My PhD project: a meta-project bringing together word2mat, textclf, aaerec, and lifelong-graph-learning. PhD thesis

Word Matrices for Text Representation Learning (word2mat) (2017–2022)

The idea of this project was to embed each word as a matrix as an alternative to vectors used in word embeddings. By using matrix embeddings instead of vector embeddings, we can use matrix multiplication as a composition function to form a representation for phrases and sentences. While word matrices alone did not exceed the performance of word vectors, a combination of word matrices and word vectors turned out to be beneficial. Later, we showed that pre-trained language models can be distilled into such a purely embedding-based model, giving benefits in efficiency while keeping reasonable accuracy.
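The composition idea described above can be illustrated with a small sketch. This is not the word2mat implementation: the 2x2 matrices and 2-d vectors below are made-up values, and the function names are hypothetical. It only shows why matrix multiplication is sensitive to word order while vector addition is not.

```python
# Toy illustration: composing word matrices by matrix multiplication.
# All embeddings below are made-up values, not trained word2mat parameters.

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def compose_matrices(words, embeddings):
    """Compose a phrase by multiplying its word matrices from left to right."""
    n = len(next(iter(embeddings.values())))
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    for word in words:
        result = matmul(result, embeddings[word])
    return result

def compose_vectors(words, embeddings):
    """Baseline: compose a phrase by summing its word vectors."""
    dim = len(next(iter(embeddings.values())))
    return [sum(embeddings[w][d] for w in words) for d in range(dim)]

# Hypothetical matrix embeddings (one 2x2 matrix per word).
word_matrices = {
    "dog":   [[1.0, 2.0], [0.0, 1.0]],
    "bites": [[0.0, 1.0], [1.0, 0.0]],
    "man":   [[2.0, 0.0], [1.0, 1.0]],
}
# Hypothetical vector embeddings for comparison.
word_vectors = {
    "dog":   [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man":   [1.0, 1.0],
}

s1 = ["dog", "bites", "man"]
s2 = ["man", "bites", "dog"]

# Vector addition is commutative, so both word orders get the same representation.
print(compose_vectors(s1, word_vectors) == compose_vectors(s2, word_vectors))      # True
# Matrix multiplication is not commutative, so the two word orders differ.
print(compose_matrices(s1, word_matrices) == compose_matrices(s2, word_matrices))  # False
```

In this toy setting, "dog bites man" and "man bites dog" are indistinguishable under vector addition but receive different representations under matrix multiplication.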

Autoencoders for Document-based Recommendations (aaerec) (2017–2022)

The aim of this project was to build a document-level citation recommendation system that could, for example, make users aware of missing references. What sets this project apart from other recommender systems is that we do not use any user data or user profile but operate only on the contents of the current draft. The main research question was whether models could be enhanced by using textual side information, such as the title of the draft, which we confirmed for a wide range of autoencoder-based recommendation models. Interestingly, we found that the choice of the best model depends on the semantics of item co-occurrence. When item co-occurrence implies relatedness (as in citations), looking at other items is far more useful than looking at the text. In contrast, when item co-occurrence implies diversity, such as in subject labels from professional subject indexers, the text is more useful.

Linked Open Citation Database (LOC-DB) (2017–2020, funded by DFG)

The LOC-DB project will develop ready-to-use tools and processes based on linked-data technology that make it possible for a single library to meaningfully contribute to an open, distributed infrastructure for the cataloguing of citations. The project aims to prove that, by widely automating cataloguing processes, it is possible to add a substantial benefit to academic search tools by regularly capturing citation relations. These data will be made available in the semantic web to make future reuse possible. Moreover, we document the effort as well as the number and quality of the data in a well-founded cost-benefit analysis. The project will use well-known methods of information extraction and adapt them to work for arbitrary layouts of reference lists in electronic and print media. The obtained raw data will be aligned and linked with existing metadata sources. Moreover, it will be shown how these data can be integrated into library catalogues. The system will be deployable for productive use by a single library, but in principle it will also scale to use in a network.

conference paper on citation recommendation
main paper on the project as a whole
demo of the LOC-DB project outcome
collection of code for the LOC-DB project
Second Linked Open Citation Database (LOC-DB) workshop, 2018, Mannheim, Germany.
First Linked Open Citation Database (LOC-DB) workshop, 2017, Mannheim, Germany.
project website (de)

Word Embeddings for Information Retrieval (vec4ir) (2016–2017)

The key idea was to use word embeddings for similarity scoring in information retrieval. The two main take-aways:

It is important to retain the crisp matching operation (before similarity scoring), even when using word embeddings.
A combination of the classic information retrieval method TF-IDF and word embeddings led to the best results.
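The two take-aways above can be sketched as a two-step retrieval function. This is a minimal illustration, not the vec4ir code: the toy corpus, the hand-made 2-d embeddings, the choice of "shares at least one query term" as the matching rule, the IDF-weighted centroid, and the mixing weight `alpha` are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy corpus and hand-made 2-d word embeddings; all values are hypothetical.
docs = {
    "d1": "neural networks for text classification",
    "d2": "deep learning for image recognition",
    "d3": "cooking recipes for pasta",
}
embeddings = {
    "neural": [0.9, 0.1], "networks": [0.8, 0.2], "deep": [0.9, 0.2],
    "learning": [0.8, 0.3], "text": [0.6, 0.4], "classification": [0.7, 0.3],
    "image": [0.5, 0.5], "recognition": [0.6, 0.5], "cooking": [0.1, 0.9],
    "recipes": [0.1, 0.8], "pasta": [0.2, 0.9], "for": [0.5, 0.5],
}
tokenized = {doc_id: text.split() for doc_id, text in docs.items()}

def idf(term):
    """Smoothed inverse document frequency over the toy corpus."""
    df = sum(term in tokens for tokens in tokenized.values())
    return math.log((1 + len(tokenized)) / (1 + df)) + 1.0

def centroid(tokens):
    """IDF-weighted average of the word vectors of all known tokens."""
    known = [t for t in tokens if t in embeddings]
    total = sum(idf(t) for t in known)
    if total == 0:
        return [0.0, 0.0]
    return [sum(idf(t) * embeddings[t][d] for t in known) / total for d in range(2)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def tfidf_score(query_tokens, doc_tokens):
    """Sum of tf * idf over the query terms that occur in the document."""
    tf = Counter(doc_tokens)
    return sum(tf[t] * idf(t) for t in query_tokens)

def retrieve(query, alpha=0.5):
    """Rank documents; alpha weights TF-IDF against embedding similarity."""
    query_tokens = query.split()
    # Step 1: crisp matching. Keep only documents that share a term with the query.
    matched = [d for d, toks in tokenized.items() if set(query_tokens) & set(toks)]
    if not matched:
        return []
    # Step 2: similarity scoring, applied to the matched documents only.
    q_vec = centroid(query_tokens)
    tfidf = {d: tfidf_score(query_tokens, tokenized[d]) for d in matched}
    max_tfidf = max(tfidf.values())  # > 0 because every match shares a query term
    scores = {
        d: alpha * tfidf[d] / max_tfidf
           + (1 - alpha) * cosine(q_vec, centroid(tokenized[d]))
        for d in matched
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

for doc_id, score in retrieve("deep neural networks"):
    print(doc_id, round(score, 3))
```

Because scoring happens only after matching, a document without any query term (here `d3`) can never enter the ranking, however close its embedding is to the query.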

TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovation (MOVING) (2016–2019, EU funding, Grant Agreement Number 693092)

I worked on this EU Horizon 2020 project as a research assistant between 2016 and 2017, contributing to deliverables 3.1, 3.2, and 3.3, as well as to various conference and workshop papers.

Extreme Multi-label Text Classification (Quadflor) (2015–2018)

This project originated from my Master’s project (2015–2016), where we developed a pipeline for extreme multi-label text classification, i.e., classification with a very large number of possible classes. We found that a multi-layer perceptron beats the state-of-the-art kNN approach by more than 30%. Moreover, we compared using either the full text or only the title of a research paper as a basis for classification. The result was that the full text is only marginally better than the title. My team member Florian Mai investigated the trade-off between full text and title in his Master’s thesis, finding that the greater availability of title data compensates for the additional information in full-text articles.

code (bought by ZBW for production usage)
K-CAP 2017 paper (outcome of the Master’s project)
JCDL 2018 paper (outcome of Florian’s Master’s thesis)