Projects | Lukas Paul Achatius Galke

Projects

Interpretability of Language Models

Since 2023

Large language models have shown impressive results and have been adopted rapidly and widely. However, we do not know how the trained models arrive at their decisions about what text to generate. This project therefore investigates the interpretability of language models through structural probing, behavioral probing, and mechanistic interpretability, and by bringing theories and experimental paradigms from psycholinguistics into machine learning.

Machine Communication and the Emergence of Language

Since 2022

Imagine you put artificial neural network agents into a box and give them a task that can only be solved with communication. How do neural network agents learn to communicate? What communication protocols emerge? What are the parallels to human communication?

What Matters for Text Classification (textclf)

Since 2021

Automatically categorizing text documents is a popular research topic, with numerous new approaches published each year. But do they really offer an advantage over earlier approaches? This question triggered this line of research, and the answer we found is worrisome. For instance, many recently proposed methods such as TextGCN are outperformed by a simple multilayer perceptron on a bag-of-words representation – a decade-old technique enhanced with today’s best practices.

Even in the era of language models, the results on text classification are sometimes counter-intuitive. For instance, fine-tuned small models typically outperform large language models.

Lifelong Graph Representation Learning (LGL)

Since 2019

Graph neural networks have quickly risen to be the standard technique for machine learning on graph-structured data. Yet graph neural networks are usually only applied to static snapshots of graphs, while real-world graphs (social media, publication networks) are continually evolving. Evolving graphs come with challenges that are rarely reflected in the graph representation learning literature, such as dealing with new classes in node classification. We pursue a lifelong learning approach for graph neural networks on evolving graphs and investigate incremental training, out-of-distribution detection, and issues caused by an imbalanced class distribution.

Analyzing the Scientific, Economic and Societal Impact of Research Activities and Research Networks (Q-AKTIV)

2019–2022, funded by the Federal Ministry of Education and Research (BMBF)

The aim of Q-AKTIV is to improve the methods for forecasting dynamics and interactions between research, technology development, and innovation. The network analysis methods will be based on recent developments in deep learning. In addition to the emergence of new knowledge areas and networks, we focus on the convergence processes of established sectors. The development and evaluation of the new methods initially takes place in the field of life sciences, which is characterized by high market dynamics. The additional application in economics enables a systematic comparison of the dynamics between scientific disciplines. The new methods will be used to predict the impact of existing research and network structures on the dynamics of knowledge and technologies, as well as the future relevance of topics and actors. The result of Q-AKTIV is an implemented and evaluated instrument for the strategic analysis and prognosis of the dynamics in science and innovation. This complements today’s primarily qualitative approaches to early strategic planning and increases the decision-making ability of research institutions, policy makers, and industries. Beyond the analysis of dynamics, valuable indicators for R&D performance measurement can also be derived, e.g., the registration of patents based on scientific publications, the economic development of the companies involved, and the outreach of research activities. The practice partner brings in the necessary experience in the field of business valuation and strategy development and ensures practical testing of the toolkit.

Representation Learning for Texts and Graphs

2017–2022

My PhD project. A meta-project bringing together word2mat, textclf, aaerec, and LGL. PhD thesis

Word Matrices for Text Representation Learning (word2mat)

2017–2022

The idea of this project was to embed each word as a matrix as an alternative to vectors used in word embeddings. By using matrix embeddings instead of vector embeddings, we can use matrix multiplication as a composition function to form a representation for phrases and sentences. While word matrices alone did not exceed the performance of word vectors, a combination of word matrices and word vectors turned out to be beneficial. Later, we showed that pre-trained language models can be distilled into such a purely embedding-based model, giving benefits in efficiency while keeping reasonable accuracy.
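To illustrate the composition idea, here is a minimal sketch in plain Python. The 2×2 matrices and the three-word vocabulary are made up for illustration and are not trained word2mat embeddings; the function names are hypothetical as well. The only point is that a product of word matrices depends on word order, whereas a sum does not.

```python
from functools import reduce

# Hypothetical 2x2 word matrices (illustrative values, not trained embeddings).
WORD_MATRICES = {
    "dog":   [[1.0, 2.0], [0.0, 1.0]],
    "bites": [[1.0, 0.0], [3.0, 1.0]],
    "man":   [[2.0, 1.0], [1.0, 1.0]],
}


def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    """Return the n x n identity matrix (the neutral element of matmul)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def compose(words):
    """Represent a phrase as the ordered product of its word matrices."""
    return reduce(matmul, (WORD_MATRICES[w] for w in words), identity(2))


def add_compose(words):
    """Order-insensitive baseline: element-wise sum of the same matrices."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    for w in words:
        m = WORD_MATRICES[w]
        total = [[total[i][j] + m[i][j] for j in range(2)] for i in range(2)]
    return total


forward = compose(["dog", "bites", "man"])
backward = compose(["man", "bites", "dog"])
print(forward != backward)  # the product distinguishes the two word orders
print(add_compose(["dog", "bites", "man"]) == add_compose(["man", "bites", "dog"]))
```

Because matrix multiplication is associative but not commutative, the phrase representation can be built left to right while still encoding word order, which a plain sum or average cannot do.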

Autoencoders for Document-based Recommendations (aaerec)

2017–2022

The aim of this project was to build a document-level citation recommendation system that could, for example, make users aware of missing references. What distinguishes this project from other recommender systems is that we do not use any user data or user profile but operate only on the contents of the current draft. The main research question was whether models could be enhanced by using textual side information, such as the title of the draft, which we confirmed for a wide range of autoencoder-based recommendation models. Interestingly, we found that the choice of the best model depends on the semantics of item co-occurrence. When item co-occurrence implies relatedness (as in citations), looking at other items is far more useful than looking at the text. In contrast, when item co-occurrence implies diversity, such as in subject labels from professional subject indexers, the text is more useful.

Linked Open Citation Database (LOC-DB)

2017–2020, funded by Deutsche Forschungsgemeinschaft (DFG)

The LOC-DB project will develop ready-to-use tools and processes based on linked-data technology that make it possible for a single library to meaningfully contribute to an open, distributed infrastructure for the cataloguing of citations. The project aims to prove that, by widely automating cataloguing processes, it is possible to add a substantial benefit to academic search tools by regularly capturing citation relations. These data will be made available in the semantic web to make future reuse possible. Moreover, we document the effort involved and the quantity and quality of the data in a well-founded cost-benefit analysis. The project will use well-known methods of information extraction and adapt them to work for arbitrary layouts of reference lists in electronic and print media. The obtained raw data will be aligned and linked with existing metadata sources. Moreover, it will be shown how these data can be integrated into library catalogues. The system will be deployable for productive use by a single library, but in principle it will also scale to use in a network.

Word Embeddings for Information Retrieval (vec4ir)

2016–2017

The key idea was to use word embeddings for similarity scoring in information retrieval. The two main take-aways:

  1. It is important to retain the crisp matching operation (before similarity scoring), even when using word embeddings.
  2. A combination of the classic information retrieval method TF-IDF and word embeddings led to the best results.
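The two takeaways can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the vec4ir implementation: the 2-dimensional embeddings, the three documents, and all function names are hypothetical, and IDF-weighted centroids with cosine similarity are just one possible way to combine term weighting with word embeddings.

```python
import math

# Hypothetical 2-d word vectors; a real system would load pre-trained embeddings.
EMBEDDINGS = {
    "neural": (0.9, 0.1), "network": (0.8, 0.2), "deep": (0.85, 0.15),
    "learning": (0.7, 0.3), "cooking": (0.1, 0.9), "recipes": (0.05, 0.95),
}

# Hypothetical toy collection.
DOCS = {
    "d1": "deep neural network learning",
    "d2": "cooking recipes",
    "d3": "network cooking",
}


def idf(term):
    """Smoothed inverse document frequency over the toy collection."""
    df = sum(term in text.split() for text in DOCS.values())
    return math.log((1 + len(DOCS)) / (1 + df)) + 1.0


def weighted_centroid(tokens):
    """IDF-weighted average of the embeddings of all known tokens."""
    pairs = [(idf(t), EMBEDDINGS[t]) for t in tokens if t in EMBEDDINGS]
    if not pairs:
        return None
    total = sum(w for w, _ in pairs)
    dim = len(pairs[0][1])
    return tuple(sum(w * v[i] for w, v in pairs) / total for i in range(dim))


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query):
    """Crisp matching first, then ranking by embedding similarity."""
    q_tokens = set(query.lower().split())
    # Step 1 (crisp matching): keep only documents sharing a term with the query.
    matched = {d: text.split() for d, text in DOCS.items()
               if q_tokens & set(text.split())}
    q_vec = weighted_centroid(q_tokens)
    if q_vec is None:
        return sorted(matched)
    # Step 2 (similarity scoring): rank the matched documents only.
    scored = []
    for doc_id, tokens in matched.items():
        d_vec = weighted_centroid(tokens)
        score = cosine(q_vec, d_vec) if d_vec is not None else 0.0
        scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


print(retrieve("neural network"))
```

In this sketch, the matching step keeps a document such as `d2` out of the result list for the query "neural network" regardless of how its embedding compares to the query, since it shares no term with it.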

TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovation (MOVING)

2016–2019, EU funding, Grant Agreement Number 693092

I worked on this EU Horizon 2020 project as a research assistant between 2016 and 2017, contributing to deliverables 3.1, 3.2, and 3.3, as well as to various conference and workshop papers.

Extreme Multi-label Text Classification (Quadflor)

2015–2018

This project originated from my Master’s project (2015–2016), where we developed a pipeline for extreme (i.e., with many possible classes) multi-label text classification. We found that a multi-layer perceptron beats the state-of-the-art kNN approach by more than 30%. Moreover, we compared using either the full text or only the title of a research paper as a basis for classification. The result was that the full text is only marginally better than the title. My team member Florian Mai investigated the trade-off between full text and title in his Master’s thesis, finding that the greater availability of title data compensates for the additional information in full-text articles.

Contact: lukas 'at' lpag.de