Accuracy of Slovak Language Lemmatization and MSD Tagging – MorphoDiTa and spaCy
Abstract
The Slovak language, as a “typical” Slavic language, belongs to the group of moderately inflected languages, with three or four genders and two grammatical numbers, all
interacting with the inflections in somewhat complicated and unpredictable ways. The
inflections are realized primarily by suffixes, but with many irregularities: one suffix
encodes several grammatical categories at once, and the same suffix often marks unrelated features in other words. This is typical of an inflectional language and makes Slovak poorly suited to heuristic
analysis. Because of these properties, lemmatization is often an indispensable step in
all kinds of text processing (starting with full-text search), and full morphosyntactic
analysis or description (MSD) is the core of corpus linguistic research. Given the central
importance of lemmatization and MSD tagging in Slovak corpus linguistics, it is necessary to
understand their limitations and the accuracy that can be achieved. Since modern approaches aim
to utilize deep learning and large language models, we evaluate the accuracy of lemmatization and MSD tagging in several common usage scenarios by comparing the state-of-the-art
“classical” lemmatizer and MSD tagger MorphoDiTa, which is based on a perceptron, with spaCy,
which uses a multilingual BERT language model.