Data Centric Domain Adaptation for Historical Text with OCR Errors

Abstract

We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we take a data-centric approach to domain adaptation: we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and we address OCR errors by injecting synthetic OCR errors into the source domain. We propose a general approach to imitate OCR errors in arbitrary input data. Both our cross-domain and our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
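
The idea of injecting synthetic OCR errors into clean source-domain text can be illustrated with a minimal sketch. This is not the paper's method, which the abstract does not specify. It assumes a simple character-substitution scheme, and the confusion table `CONFUSIONS`, the function name `inject_ocr_noise`, and the `error_rate` parameter are all hypothetical names introduced here for illustration. Noise is applied per token so that token-level NER labels stay aligned.

```python
import random

# Hypothetical confusion table of visually similar characters.
# These pairs are illustrative assumptions, not taken from the paper.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "i": ["l", "1"],
    "o": ["0", "c"],
    "s": ["f"],
    "m": ["rn"],
    "u": ["n"],
}


def inject_ocr_noise(tokens, error_rate=0.1, seed=None):
    """Return a noisy copy of `tokens` with random character substitutions.

    The number of tokens is unchanged, so a parallel list of NER labels
    can be reused as-is for the noisy training data.
    """
    rng = random.Random(seed)
    noisy = []
    for token in tokens:
        chars = []
        for ch in token:
            # Replace a character only if it has known confusions
            # and the random draw falls below the error rate.
            if ch in CONFUSIONS and rng.random() < error_rate:
                chars.append(rng.choice(CONFUSIONS[ch]))
            else:
                chars.append(ch)
        noisy.append("".join(chars))
    return noisy


if __name__ == "__main__":
    sentence = ["Hinrich", "lives", "in", "Lausanne"]
    labels = ["B-PER", "O", "O", "B-LOC"]
    for tok, lab in zip(inject_ocr_noise(sentence, error_rate=0.3, seed=1), labels):
        print(tok, lab)
```

A more realistic variant would estimate the substitution probabilities from aligned OCR output and ground truth rather than from a hand-written table.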

Authors
  • März, Luisa
  • Schweter, Stefan
  • Poerner, Nina
  • Roth, Benjamin
  • Schütze, Hinrich
Shortfacts
Category
Paper in Conference Proceedings or in Workshop Proceedings (Paper)
Event Title
16th International Conference on Document Analysis and Recognition ICDAR 2021
Divisions
Data Mining and Machine Learning
Subjects
Artificial Intelligence
Language Processing
Computer Science in Relation to People and Society
Event Location
Lausanne, Switzerland
Event Type
Conference
Event Dates
September 5-10, 2021
Series Name
Document Analysis and Recognition – ICDAR 2021
ISSN/ISBN
978-3-030-86331-9
Page Range
pp. 748-761
Date
5 September 2021