EuroBERT: A Next-Generation Multilingual Encoder Model Family for Language Technology


EuroBERT, the newly developed multilingual encoder model family, marks a significant advancement in modern language technology. It enables more efficient processing of 15 European and global languages, handling sequences of up to 8,192 tokens. Officially introduced on 10 March 2025, the EuroBERT family was trained on a dataset of 5 trillion tokens and is available in three sizes (210M, 610M, and 2.1B parameters). The model combines a bidirectional encoder architecture with the latest innovations from decoder models, achieving performance that significantly surpasses earlier multilingual systems.

EuroBERT outperforms comparable technologies in a wide range of language processing tasks. According to research findings, the model family ranked first in 10 out of 18 evaluated language benchmarks. Its document retrieval accuracy is particularly noteworthy: in the MIRACL benchmark, it achieved a hit rate of 92.9%, meaning it successfully retrieved nearly all relevant documents in response to search queries. It also reached 66.1% accuracy in the MLDR benchmark and 95.8% in the Wikipedia search test — figures that are 5–15% higher than those of the current leading models. According to lead researcher Nicolas Boizard, the EuroBERT models consistently outperform alternatives in multilingual retrieval, classification, and regression tasks, and also excel in coding and mathematical reasoning.
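The retrieval figures above (e.g. the MIRACL hit rate) measure how often the model's embeddings rank a relevant document first for a query. As a minimal illustrative sketch of that kind of metric, the toy vectors below stand in for actual EuroBERT sentence embeddings, which would require downloading the model; only the scoring logic is shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hit_rate_at_1(queries, docs, relevant):
    """Fraction of queries whose top-ranked document (by cosine
    similarity) is the labelled relevant one -- the style of metric
    behind retrieval benchmarks such as MIRACL."""
    hits = 0
    for q, rel in zip(queries, relevant):
        best = max(range(len(docs)), key=lambda i: cosine(q, docs[i]))
        hits += best == rel
    return hits / len(queries)

# Toy stand-ins for encoder embeddings: 3 queries, 4 documents.
queries = [[1.0, 0.1, 0.0],
           [0.0, 1.0, 0.1],
           [0.1, 0.0, 1.0]]
docs = [[0.9, 0.2, 0.1],   # relevant to query 0
        [0.1, 1.0, 0.0],   # relevant to query 1
        [0.0, 0.2, 0.9],   # relevant to query 2
        [0.5, 0.5, 0.5]]   # distractor
print(hit_rate_at_1(queries, docs, relevant=[0, 1, 2]))  # → 1.0
```

In a real evaluation, the query and document vectors would come from encoding text with EuroBERT, and the corpus would contain millions of documents rather than four.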

One of EuroBERT’s distinctive capabilities is its efficient handling of long texts. The largest version, with 2.1 billion parameters, showed only a 2% performance drop when processing sequences of 8,192 tokens (approximately 15–20 pages) compared to shorter texts, whereas competing models such as XLM-RoBERTa lose 40–50% of their performance on the same test. The research also shows that including code and mathematical content in the training data improves retrieval capabilities by 15%, while parallel language corpora raise both classification and retrieval accuracy by 8–10%. EuroBERT’s impact is amplified by its release as an open-source model family, with all intermediate training checkpoints made publicly available to developers and researchers.

Sources:

1. Introducing EuroBERT: A High-Performance Multilingual Encoder Model (blog post by the EuroBERT team on Hugging Face)
2. EuroBERT: Scaling Multilingual Encoders for European Languages (arXiv paper introducing the family of multilingual encoders covering European and widely spoken global languages)
3. EuroBERT: Advanced Multilingual AI Model Breaks New Ground in European Language Processing