Using Titles vs. Full-text as Source for Automated Semantic Document Annotation

Published in Knowledge Capture Conference, 2017

Recommended citation: Lukas Galke, Florian Mai, Alan Schelten, Dennis Brunsch, Ansgar Scherp, "Using Titles vs. Full-text as Source for Automated Semantic Document Annotation." Knowledge Capture Conference, 2017.

Access paper here

Download author copy here

The paper was accepted as a short paper. A longer version is available on arXiv.

Code is available here

Abstract: We conduct the first systematic comparison of automated semantic annotation based on either the full-text or only on the title metadata of documents. Apart from the prominent text classification baselines kNN and SVM, we also compare recent techniques of Learning to Rank and neural networks and revisit the traditional methods logistic regression, Rocchio, and Naive Bayes. Across three of our four datasets, the performance of the classifications using only titles reaches over 90% of the quality compared to the performance when using the full-text.
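The "over 90% of the quality" figure in the abstract is a ratio between two scores: the title-only score divided by the full-text score on the same dataset. The following is a minimal sketch of that arithmetic, assuming a sample-averaged F1 over label sets as the quality measure (the abstract does not name the metric, so this choice is an assumption). The function names, labels, and predictions are hypothetical toy values, not data or code from the paper.

```python
from typing import List, Set


def sample_f1(true: Set[str], pred: Set[str]) -> float:
    """F1 between the gold and predicted label sets of a single document."""
    if not true and not pred:
        return 1.0  # nothing to predict and nothing predicted
    overlap = len(true & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(true)
    return 2 * precision * recall / (precision + recall)


def mean_sample_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Average the per-document F1 over all documents."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(sample_f1(t, p) for t, p in zip(gold, predicted)) / len(gold)


def relative_quality(title_score: float, fulltext_score: float) -> float:
    """Title-only score expressed as a fraction of the full-text score."""
    return title_score / fulltext_score


# Hypothetical toy annotations for three documents (illustration only).
gold = [{"economics", "trade"}, {"health"}, {"politics", "law"}]
fulltext_pred = [{"economics", "trade"}, {"health"}, {"politics"}]
title_pred = [{"economics"}, {"health"}, {"politics"}]

fulltext_f1 = mean_sample_f1(gold, fulltext_pred)
title_f1 = mean_sample_f1(gold, title_pred)
ratio = relative_quality(title_f1, fulltext_f1)
print(f"title-only reaches {ratio:.1%} of full-text quality")
```

In this toy case the title-only predictions reach 87.5% of the full-text score, which would fall short of the 90% mark reported for three of the four datasets.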

    author = {Galke, Lukas and Mai, Florian and Schelten, Alan and Brunsch, Dennis and Scherp, Ansgar},
    title = {Using Titles vs. Full-Text as Source for Automated Semantic Document Annotation},
    year = {2017},
    isbn = {9781450355537},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3148011.3148039},
    doi = {10.1145/3148011.3148039},
    booktitle = {Proceedings of the Knowledge Capture Conference},
    articleno = {20},
    numpages = {4},
    keywords = {document analysis, semantic annotation, Multi-label classification},
    location = {Austin, TX, USA},
    series = {K-CAP 2017}