Projects
Interpretability of Language Models
Since 2023
Large language models have shown impressive results and seen rapid, widespread adoption. However, we do not know how trained models arrive at their decisions about what text to generate. This project therefore investigates the interpretability of language models through structural probing, behavioral probing, and mechanistic interpretability, and by bringing theories and experimental paradigms from psycholinguistics into machine learning.
- Paper: What makes a language easy to deep-learn? (in final revision, to appear soon). preprint code
- Learning pressures in neural networks and large language models (in revision) preprint
- Paper: Morphology Matters: Probing the Cross-linguistic Morphological Generalization Abilities of Large Language Models through a Wug Test published in the CMCL workshop at the ACL 2024 conference.
- Poster at the Highlights in the Language Sciences conference, July 8–11, 2024
- Poster at the workshop on Using Artificial Neural Networks for Studying Human Language Learning and Processing, June 10–12, 2024
- Extended abstract on Testing the Linguistic Niche Hypothesis in Large Language Models with a Multilingual Wug Test published in the Proceedings of Evolang XV (pp. 91–94)
- Podium presentation by Anh Dang at Evolang 2024
- Talk at the Protolang 8 conference, Rome, September 27–28, 2023. Abstract published in the Protolang 8 book of abstracts
Machine Communication and the Emergence of Language
Since 2022
Imagine you put artificial neural network agents into a box and give them a task that can only be solved with communication. How do neural network agents learn to communicate? What communication protocols emerge? What are the parallels to human communication?
- Machine Communication and the Emergence of Language. Talk given at the MPI Proudly Presents series, July 4, 2024, Max Planck Institute for Psycholinguistics, Nijmegen, NL.
- Invited talk at the workshop on Using Artificial Neural Networks for Studying Human Language Learning and Processing, June 10–12, 2024
- Machine Learning and the Evolution of Language workshop at the Joint Conference on Language Evolution
- Emergent Communication workshop paper at ICLR
What Matters for Text Classification (textclf)
Since 2021
Automatically categorizing text documents is a popular research topic, with numerous new approaches published each year. But do they really give an advantage over earlier approaches? This question triggered this line of research, and the answer is worrisome. For instance, many recently proposed methods such as TextGCN are outperformed by a simple multilayer perceptron on a bag-of-words representation – a decade-old technique enhanced with today’s best practices.
Even in the era of language models, the results on text classification are sometimes counter-intuitive. For example, fine-tuned small models typically outperform large language models.
- Simplifying hierarchical text classification
- preprint of a comparative survey paper
- ACL 2022 paper
- code for multi-label classification
- code for single-label classification
- extra code for single-label classification with different methods
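The baseline mentioned above, a multilayer perceptron on a bag-of-words representation, can be sketched roughly as follows. This is a minimal illustration, not the code from the papers: the vocabulary, the class names, the hidden size, and the random (untrained) weights are all assumptions made for the example, and training is omitted.

```python
import math
import random
from collections import Counter

# Hypothetical vocabulary and label set, chosen only for this illustration.
VOCAB = ["goal", "match", "team", "election", "vote", "party"]
CLASSES = ["sports", "politics"]

def bag_of_words(text):
    """Map a text to a vector of term counts over the fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def init_layer(n_in, n_out, rng):
    """Randomly initialised dense layer: (weight rows, bias vector)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, layer):
    """Affine map W x + b."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, hidden, output):
    """One hidden layer with ReLU, followed by a softmax over the classes."""
    h = [max(0.0, v) for v in linear(x, hidden)]
    return softmax(linear(h, output))

rng = random.Random(0)
hidden = init_layer(len(VOCAB), 8, rng)   # hidden size 8 is an arbitrary choice
output = init_layer(8, len(CLASSES), rng)

probs = mlp_forward(bag_of_words("The team scored a late goal"), hidden, output)
print(dict(zip(CLASSES, probs)))  # untrained, so the probabilities are arbitrary
```

The model sees only which vocabulary terms occur and how often; word order and document structure are discarded before the first layer.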
Lifelong Graph Representation Learning (LGL)
Since 2019
Graph neural networks have quickly risen to be the standard technique for machine learning on graph-structured data. Yet graph neural networks are usually only applied to static snapshots of graphs, while real-world graphs (social media, publication networks) are continually evolving. Evolving graphs come with challenges that are rarely reflected in the graph representation learning literature, such as dealing with new classes in node classification. We pursue a lifelong learning approach for graph neural networks on evolving graphs and investigate incremental training, out-of-distribution detection, and issues caused by an imbalanced class distribution.
- CoLLAs 2024 conference paper on zero-shot learning in graphs or on arXiv
- Code for CoLLAs 2024 paper
- IJCNN 2023 conference paper on new class detection in graphs
- Neural Networks journal paper, 2023
- IJCNN 2021 conference paper
- ICLR 2019 workshop paper
- Code for 2023 Neural Networks paper
- Code for IJCNN 2023 paper
- Code for ICLR 2019 workshop paper
- related project: Q-AKTIV
Analyzing the Scientific, Economic and Societal Impact of Research Activities and Research Networks (Q-AKTIV)
2019–2022, funded by the Federal Ministry of Education and Research (BMBF)
The aim of Q-AKTIV is to improve the methods for forecasting dynamics and interactions between research, technology development, and innovation. The network analysis methods will be based on recent developments in deep learning. In addition to the emergence of new knowledge areas and networks, we focus on the convergence processes of established sectors. The development and evaluation of the new methods initially take place in the field of life sciences, which is characterized by high market dynamics. The additional application in economics enables a systematic comparison of the dynamics between the disciplines of science. The new methods will be used to predict the impact of existing research and network structures on the dynamics of knowledge and technologies as well as the future relevance of topics and actors. The result of Q-AKTIV is an implemented and evaluated instrument for the strategic analysis and prognosis of the dynamics in science and innovation. This complements today’s primarily qualitative approaches to early strategic planning and increases the decision-making ability of research institutions, policy makers, and industries. Beyond the analysis of dynamics, valuable indicators for R&D performance measurement can also be derived, e.g., the registration of patents based on scientific publications, the economic development of the companies involved, and the outreach of research activities. The practice partner brings in the necessary experience in the field of business valuation and strategy development and ensures practical testing of the toolkit.
- Scientometrics journal paper on the lack of interdisciplinarity in the scientific response to COVID-19, 2024
- TEM journal paper evaluating the developed methods, 2023
- project website
- COVID-19++ dataset
- COVID-19++ dataset paper.
- conference paper on learning concept representations
- follow-up work building on concept representation learning methods
- Python package for learning concept representations and analyzing network dynamics
- workshop paper on data enrichment 2
- workshop paper on data enrichment 1
- tools for harvesting raw data
- interview about open science tools for digital collaboration (en)
- interview about open science tools for digital collaboration (de)
- journal paper (German)
- press item (German)
- press release (ZB MED, German)
- funding announcement (German)
- final project report (German)
Representation Learning for Texts and Graphs
2017–2022
My PhD project. A meta-project bringing together word2mat, textclf, aaerec, and LGL. PhD thesis
Word Matrices for Text Representation Learning (word2mat)
2017–2022
The idea of this project was to embed each word as a matrix as an alternative to vectors used in word embeddings. By using matrix embeddings instead of vector embeddings, we can use matrix multiplication as a composition function to form a representation for phrases and sentences. While word matrices alone did not exceed the performance of word vectors, a combination of word matrices and word vectors turned out to be beneficial. Later, we showed that pre-trained language models can be distilled into such a purely embedding-based model, giving benefits in efficiency while keeping reasonable accuracy.
- IJCNN paper on distilling the knowledge from a pre-trained language model into matrix embedding models
- extended abstract in the best-of data science track at INFORMATIK 2019
- ICLR paper on self-supervised learning of word matrices
- code
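To make the composition idea concrete, here is a small sketch of representing a phrase as the ordered product of its word matrices. The 2×2 matrices below are hand-picked, hypothetical values for illustration only, not learned embeddings, and the helper names are likewise assumptions.

```python
from functools import reduce

Matrix = list[list[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def identity(n: int) -> Matrix:
    """n x n identity matrix, the neutral element of matrix multiplication."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Hypothetical 2x2 word matrices (illustrative values, not trained).
WORD_MATRICES: dict[str, Matrix] = {
    "not": [[0.0, 1.0], [1.0, 0.0]],
    "good": [[2.0, 0.0], [0.0, 1.0]],
    "very": [[1.0, 1.0], [0.0, 1.0]],
}

def compose(words: list[str]) -> Matrix:
    """Represent a phrase as the left-to-right product of its word matrices."""
    return reduce(matmul, (WORD_MATRICES[w] for w in words), identity(2))

print(compose(["not", "good"]))  # [[0.0, 1.0], [2.0, 0.0]]
print(compose(["good", "not"]))  # [[0.0, 2.0], [1.0, 0.0]]
```

Because matrix multiplication is not commutative, the two word orders yield different phrase representations, whereas summing or averaging word vectors would give the same result for both.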
Autoencoders for Document-based Recommendations (aaerec)
2017–2022
The aim of this project was to build a document-level citation recommendation system that could, for example, make users aware of missing references. What distinguishes this project from other recommender systems is that we do not use any user data or user profile but operate only on the contents of the current draft. The main research question was whether models could be enhanced by using textual side information, such as the title of the draft, which we confirmed for a wide range of autoencoder-based recommendation models. Interestingly, we found that the choice of the best model depends on the semantics of item co-occurrence. When item co-occurrence implies relatedness (as in citations), looking at other items is far more useful than looking at the text. In contrast, when item co-occurrence implies diversity, such as in subject labels from professional subject indexers, the text is more useful.
- journal paper on citation and subject label recommendation
- conference paper on citation and subject label recommendation
- RecSys challenge workshop paper on music recommendation for automatic playlist continuation
- codebase for all three papers
Linked Open Citation Database (LOC-DB)
2017–2020, funded by Deutsche Forschungsgemeinschaft (DFG)
The LOC-DB project will develop ready-to-use tools and processes based on linked-data technology that make it possible for a single library to meaningfully contribute to an open, distributed infrastructure for the cataloguing of citations. The project aims to prove that, by widely automating cataloguing processes, it is possible to add a substantial benefit to academic search tools by regularly capturing citation relations. These data will be made available in the semantic web to make future reuse possible. Moreover, we document the effort, number, and quality of the data in a well-founded cost-benefit analysis. The project will use well-known methods of information extraction and adapt them to work for arbitrary layouts of reference lists in electronic and print media. The obtained raw data will be aligned and linked with existing metadata sources. Moreover, it will be shown how these data can be integrated into library catalogues. The system will be deployable for productive use by a single library, but in principle it will also scale to use in a network.
- conference paper on citation recommendation
- main paper on the project as a whole
- demo of the LOC-DB project outcome
- collection of code for the LOC-DB project
- Second Linked Open Citation Database (LOC-DB) workshop, 2018, Mannheim, Germany.
- First Linked Open Citation Database (LOC-DB) workshop, 2017, Mannheim, Germany.
- project website (de)
Word Embeddings for Information Retrieval (vec4ir)
2016–2017
The key idea was to use word embeddings for similarity scoring in information retrieval. The two main take-aways:
- It is important to retain the crisp matching operation (before similarity scoring), even when using word embeddings.
- A combination of the classic information retrieval method TF-IDF and word embeddings led to the best results.
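The two take-aways can be illustrated with a rough sketch: documents are first filtered by crisp term matching, and only the matching documents are then ranked by embedding similarity. Everything here is an assumption made for illustration: the toy corpus, the 2-d embeddings, the "at least one shared term" matching rule, and the use of IDF-weighted embedding sums as one possible way to combine TF-IDF with word embeddings. It is not the vec4ir implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus and 2-d word embeddings, for illustration only.
DOCS = {
    "d1": "neural networks for text retrieval".split(),
    "d2": "retrieval of library documents".split(),
    "d3": "cooking pasta at home".split(),
}
EMB = {
    "neural": [0.9, 0.1], "networks": [0.8, 0.2], "text": [0.5, 0.5],
    "retrieval": [0.3, 0.9], "library": [0.2, 0.8], "documents": [0.4, 0.7],
    "cooking": [-0.9, 0.1], "pasta": [-0.8, 0.2], "home": [-0.5, 0.3],
    "search": [0.3, 0.8],
}

def idf(term):
    """Smoothed inverse document frequency over the toy corpus."""
    df = sum(term in doc for doc in DOCS.values())
    return math.log((1 + len(DOCS)) / (1 + df)) + 1.0

def weighted_sum(tokens):
    """TF-IDF-weighted sum of the embeddings of all known tokens.

    The sum is not normalised because cosine similarity ignores scale.
    """
    total = [0.0, 0.0]
    for term, tf in Counter(tokens).items():
        if term in EMB:
            weight = tf * idf(term)
            total = [t + weight * e for t, e in zip(total, EMB[term])]
    return total

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    norm = math.hypot(*u) * math.hypot(*v)
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def search(query):
    """Crisp matching first, then ranking by embedding similarity."""
    q = query.split()
    # Crisp matching: keep only documents sharing at least one query term.
    matched = [d for d, tokens in DOCS.items() if set(q) & set(tokens)]
    q_vec = weighted_sum(q)
    return sorted(matched, key=lambda d: cosine(q_vec, weighted_sum(DOCS[d])), reverse=True)

print(search("retrieval search"))  # ['d2', 'd1']
print(search("search"))            # []
```

In this toy setup the query "search" returns nothing even though its embedding is close to "retrieval": the crisp matching step decides which documents are candidates at all, and the embeddings only influence their order.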
TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovation (MOVING)
2016–2019, EU funding, Grant Agreement Number 693092
I engaged in this EU Horizon 2020 project as a research assistant between 2016 and 2017, leading to contributions to the deliverables 3.1, 3.2 and 3.3, as well as various conference and workshop papers:
- Deliverable 3.3
- Deliverable 3.2
- Deliverable 3.1
- ICADL 2018 paper on information retrieval on titles vs. full-text
- DEXA 2018 workshop paper on response suggestion
- code for response suggestion
- Project website
Extreme Multi-label Text Classification (Quadflor)
2015–2018
This project originated from my Master’s project (2015–2016), where we developed a pipeline for extreme (i.e., many possible classes) multi-label text classification. We found that a multi-layer perceptron beats the state-of-the-art kNN approach by more than 30%. Moreover, we compared using either the full text or only the title of a research paper as a basis for classification. The result was that the full text is only marginally better than the title. My team member Florian Mai investigated the trade-off between full text and title in his Master’s thesis, finding that the increased availability of title data compensates for the additional information in full-text articles.
- code (bought by ZBW for production usage)
- K-CAP 2017 paper (outcome of the Master’s project)
- JCDL 2018 paper (outcome of Florian’s Master’s thesis)
Contact: lukas 'at' lpag.de
Design: Adapted from Diane Mounter.
Privacy: No personal data, no cookies.