Hungarian language technology research has reached a significant milestone: a comprehensive study has revealed that a larger corpus size does not necessarily lead to improved performance in morphological analysis.
In their study, Andrea Dömötör, Balázs Indig, and Dávid Márk Nemeskey conducted a detailed analysis of three Hungarian-language corpora of varying sizes: the ELTE DH gold standard corpus (496,060 tokens), NYTK-NerKor (1,017,340 tokens), and the Szeged Treebank (1,362,505 tokens). Their findings are presented in the paper entitled Does Size Matter? A Comparative Evaluation of Morphologically Annotated Corpora ("A méret a lényeg? Morfológiailag annotált korpuszok összehasonlító kiértékelése").

The research yielded several surprising insights. Notably, the performance of the HuSpaCy analyser plateaued at around half a million tokens, with no substantial gains observed from further increases in corpus size. Even more strikingly, the PurePos analyser demonstrated strong performance with markedly smaller data: it achieved a lemmatisation accuracy of 93.8% using a test corpus of just 120,000 tokens.

Perhaps the most counterintuitive result was that combining corpora not only failed to enhance performance but in some cases degraded it. When NerKor and the Szeged Treebank were used in tandem, lemmatisation accuracy dropped to 91.8%, compared to 98.2% and 98.7% respectively when used independently.
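The lemmatisation accuracy figures above are computed as the share of tokens whose predicted lemma matches the gold-standard annotation. A minimal sketch of that metric, using made-up token/lemma pairs rather than data from the study:

```python
# Lemmatisation accuracy: fraction of tokens where the analyser's lemma
# matches the gold annotation. The lemma sequences below are illustrative
# placeholders, not taken from the corpora discussed in the paper.

def lemma_accuracy(gold, predicted):
    """Return the fraction of positions where predicted lemma == gold lemma."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token-for-token")
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)

# Hypothetical aligned lemma sequences for a five-token sentence.
gold_lemmas = ["a", "kutya", "ugat", "a", "kert"]
pred_lemmas = ["a", "kutya", "ugat", "az", "kert"]

print(f"{lemma_accuracy(gold_lemmas, pred_lemmas):.1%}")  # → 80.0%
```

In practice the predicted lemmas would come from an analyser such as HuSpaCy or PurePos run over the test corpus; the scoring itself reduces to this token-level comparison.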
The key takeaway from the research is that training models for morphological annotation does not necessarily require vast corpora; annotation consistency proves far more critical. According to the study, as few as 120,000 tokens of consistently annotated data may suffice to achieve reliable results—an insight that could significantly reshape future corpus-building strategies.
Sources:
1. https://rgai.inf.u-szeged.hu/sites/rgai.inf.u-szeged.hu/files/mszny2025%20%281%29.pdf#page=226.11