hmBERT: Historical Multilingual Language Models for Named Entity Recognition

Abstract

Compared to standard Named Entity Recognition (NER), identifying persons, locations, and organizations in historical texts is a major challenge. To obtain machine-readable corpora, historical text is usually scanned and processed with Optical Character Recognition (OCR), so the resulting corpora contain recognition errors. In addition, entities such as locations or organizations can change over time, which poses another challenge. Overall, historical texts have several peculiarities that set them apart from modern texts, and large labeled corpora for training a neural tagger are hardly available for this domain. In this work, we tackle NER for historical German, English, French, Swedish, and Finnish by training large historical language models. We circumvent the need for large amounts of labeled data by using unlabeled data to pretrain a language model. We propose hmBERT, a historical multilingual BERT-based language model, and release the model in several versions of different sizes. Furthermore, we evaluate the capability of hmBERT on downstream NER as part of the HIPE-2022 shared task and provide a detailed analysis and insights. For the Multilingual Classical Commentary coarse-grained NER challenge, our tagger HISTeria outperforms the other teams' models for two out of three languages.

Authors
  • Schweter, Stefan
  • März, Luisa
  • Schmid, Katharina
  • Çano, Erion
Shortfacts
Category
Paper in Conference Proceedings or in Workshop Proceedings (Speech)
Event Title
Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum
Divisions
Data Mining and Machine Learning
Subjects
Artificial Intelligence
Language Processing
Event Location
Bologna, Italy
Event Type
Workshop
Event Dates
5 - 8 September, 2022
Series Name
3180
Page Range
pp. 1109-1129
Date
5 September 2022