Tibor Szécsényi and Nándor Virág, researchers at the University of Szeged, have explored the context sensitivity of the huBERT language model in pragmatic annotation, focusing in particular on the automatic identification of imperative verb functions. Their study, conducted on the MedCollect corpus—a dataset of health-related misinformation—investigates how both the length and position of contextual input influence the model’s annotation accuracy.
The researchers tested the model using input sequences of four different token lengths (64, 128, 256, and 512), and found that longer contextual input led to modest improvements in annotation reliability: the F1 score increased from 0.80 to 0.83. Crucially, the study reveals that huBERT does not process context uniformly. It relies more heavily on the preceding context (approximately 25–30 words before the target token) than on the subsequent context, which appears to be useful only within a span of 15–20 words. The model performed exceptionally well in recognising the most frequently occurring categories, including subjunctive verb forms not used in imperative contexts (e.g. “I’m tired of having to wear a mask all the time”) and direct commands (e.g. “Everyone should wear a mask!”), achieving accuracy scores of up to 0.9 in these cases.
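The asymmetric window described above (roughly 25–30 words of useful preceding context versus 15–20 following) can be illustrated with a minimal sketch. This is not the study's code; the function name and window sizes are illustrative, with the defaults set to the upper bounds the study reports.

```python
def asymmetric_context(tokens, target_idx, before=30, after=20):
    """Return (left context, target token, right context) around
    tokens[target_idx], keeping at most `before` tokens on the left
    and `after` tokens on the right -- mirroring the asymmetric span
    the study reports (~25-30 words before, ~15-20 after)."""
    start = max(0, target_idx - before)
    end = min(len(tokens), target_idx + 1 + after)
    return tokens[start:target_idx], tokens[target_idx], tokens[target_idx + 1:end]

# Illustrative use on a synthetic, whitespace-tokenized sequence:
tokens = [f"w{i}" for i in range(100)]
left, target, right = asymmetric_context(tokens, 50)
# left holds 30 tokens, right holds 20; target is "w50"
```

In a real annotation pipeline the tokens would come from the model's own tokenizer, and the window would be capped by the input length (64–512 tokens) rather than by word counts alone.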
A key takeaway from the research is that although huBERT incorporates contextual information in pragmatic annotation, its effective contextual window is significantly narrower than that used by human annotators. The study, "A pragmatikai annotáció kontextusfüggősége nagy nyelvi modell esetében: Felszólító alakok funkcióinak annotálása huBert modellel" (Context Sensitivity of Pragmatic Annotation with a Large Language Model: Annotating Imperative Verb Functions with the huBERT Model), also demonstrated that the model's performance (F1 = 0.83) is on par with that of human annotators (F1 = 0.830), suggesting that automatic annotation may be a viable practical option, especially for identifying high-frequency linguistic functions.
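For readers less familiar with the F1 figures quoted throughout, F1 is simply the harmonic mean of precision and recall. A minimal sketch (the precision/recall inputs below are illustrative, not values reported by the study):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative only: precision 0.85 and recall 0.81 yield an F1 near 0.83
score = f1(0.85, 0.81)
```

Because the harmonic mean penalises imbalance, a high F1 such as the 0.83 reported here requires the annotator to be simultaneously precise and thorough, not merely one or the other.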