Benchmarking is the practice of evaluating artificial intelligence models on a standard suite of tasks under controlled conditions. In the context of large language models (LLMs), benchmarks provide a common yardstick for measuring capabilities such as factual knowledge, reasoning, and conversational coherence. They emerged in response to the proliferation of new models, which makes systematic comparison necessary: more than 700,000 LLMs are hosted on public platforms, and developers compete to demonstrate superior performance. Benchmarks are therefore both a research tool and a social signal: high scores on a widely recognised benchmark can confer prestige and influence. However, they are imperfect proxies for real‑world utility and can distort research incentives (McIntosh et al. 2025).
Standardised benchmark datasets
- MMLU: Measuring Massive Multitask Language Understanding (MMLU) tests general knowledge with multiple‑choice exams spanning 57 subjects across the humanities, sciences and professional domains. Tasks include elementary mathematics, US history, computer science and law, and require models to apply world knowledge and problem‑solving ability. MMLU was created to bridge the gap between the diverse information in pre‑training corpora and the narrower scope of earlier benchmarks (CAIS 2024). A minimal sketch of this kind of multiple‑choice scoring appears after this list.
- Humanity’s Last Exam: HLE is a comprehensive benchmark comprising over 2,500 expert‑curated questions across more than 100 university‑level disciplines, including law, history, philosophy, economics, mathematics and the natural sciences. It features a mix of multiple‑choice and short-answer questions—many requiring multimodal inputs—and is explicitly designed to resist web search and shallow pattern matching. In contrast to earlier benchmarks, which focus heavily on STEM or conversational tasks, HLE prioritises humanities and domain‑expert reasoning. State-of-the-art models achieve less than 25 % accuracy, with substantial calibration and reasoning failures. This benchmark highlights the persistent performance gap between LLMs and expert human judgment, particularly in complex academic domains (Phan et al. 2025).
- BIG‑bench: The Beyond the Imitation Game benchmark contains 204 tasks contributed by more than 450 authors. Tasks span linguistics, mathematics, common‑sense reasoning, biology, physics, social bias and software development. The benchmark investigates whether performance scales with model size and finds that larger models improve on many tasks, yet still perform poorly relative to human raters. It also observes that social bias increases with model scale but can be mitigated through prompting, indicating that benchmarks can reveal bias‑related risks (Srivastava et al. 2025).
- HELM: The Holistic Evaluation of Language Models (HELM) aims to cover a broad range of use cases and metrics. It measures accuracy, calibration, robustness, fairness, bias, toxicity and efficiency across 16 core scenarios and conducts targeted evaluations on 26 additional scenarios. By releasing all prompts and completions, HELM encourages transparency and facilitates reproducibility. Its multi‑metric approach highlights trade‑offs—models may excel in accuracy yet fail in robustness or fairness (Liang et al. 2022).
- TruthfulQA: This dataset tests whether generative models produce truthful answers when questions contain false premises. It comprises 817 questions across 38 categories such as health, law, finance and politics. Because the questions target common misconceptions, human respondents answer truthfully 94 % of the time, but the best model studied answered only 58 % of questions truthfully. Larger models tended to be less truthful because they reproduced false statements memorised during training (Rosati 2024).
- MT‑Bench: Developed to evaluate multi‑turn conversation and instruction following, MT‑Bench contains 80 open‑ended questions in eight categories: writing, role‑play, extraction, reasoning, mathematics, coding, STEM knowledge and humanities/social sciences. By using dialogue rather than multiple choice, it probes a model’s ability to sustain coherent interactions and follow instructions over several turns. It also demonstrates the use of LLMs as judges, revealing biases such as position bias, where the order of answers influences the judgment (Zheng et al. 2023); a position‑swap consistency check is sketched after this list.
- ARC and HellaSwag: The AI2 Reasoning Challenge (ARC) contains 7,787 grade‑school science questions partitioned into an easy set and a challenge set, with the latter consisting of questions that simple retrieval methods fail to answer. It tests a model’s ability to perform elementary scientific reasoning. HellaSwag is a commonsense natural‑language inference dataset built via adversarial filtering: the wrong endings are machine‑generated to fool models while striking human readers as obviously implausible. While humans exceed 95 % accuracy, models at the time of its release scored below 48 %, revealing a significant gap in commonsense reasoning (AI2 ARC Dataset 2024).
- SuperGLUE: Built as the successor to the original GLUE benchmark, SuperGLUE integrates more difficult language‑understanding tasks, including reading comprehension, recognising textual entailment, choice of plausible alternatives and the Winograd Schema Challenge. Each task is scored with its own metric, such as F1, Matthews correlation or accuracy: the CommitmentBank (CB) task uses average F1 and accuracy, COPA uses accuracy, and Multi‑Sentence Reading Comprehension (MultiRC) is evaluated via F1a and exact match. This diversity of tasks and metrics forces models to handle coreference resolution, disambiguation and commonsense reasoning (Wang et al. 2019); computing such metrics is sketched after this list.
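As a concrete illustration of the multiple‑choice format used by MMLU and similar benchmarks, the following minimal sketch loads the cais/mmlu dataset cited above via the Hugging Face `datasets` library and computes plain accuracy. The field names ("question", "choices", "answer") follow the dataset card; `ask_model` is a hypothetical placeholder for whatever model call is being evaluated, not part of any library.

```python
from datasets import load_dataset

LETTERS = "ABCD"

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder: return the model's answer letter for a prompt."""
    raise NotImplementedError("plug in your own LLM call here")

def format_question(example: dict) -> str:
    # "question", "choices" and "answer" are the field names on the cais/mmlu card
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(example["choices"]))
    return f"{example['question']}\n{options}\nAnswer with a single letter (A-D)."

def mmlu_accuracy(split: str = "test", limit: int = 200) -> float:
    data = load_dataset("cais/mmlu", "all", split=split)
    examples = list(data.select(range(min(limit, len(data)))))
    correct = sum(
        ask_model(format_question(ex)).strip().upper()[:1] == LETTERS[ex["answer"]]
        for ex in examples
    )
    return correct / len(examples)
```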
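The position bias noted for MT‑Bench’s LLM‑as‑a‑judge setup can be probed by judging the same pair of answers in both orders and keeping only verdicts that survive the swap. The sketch below is a simplified illustration of that idea rather than the exact procedure from the cited paper; `judge` is a hypothetical stand‑in for a call to a judge model.

```python
JUDGE_PROMPT = (
    "Question: {question}\n\n"
    "Assistant A:\n{a}\n\n"
    "Assistant B:\n{b}\n\n"
    "Which answer is better? Reply with exactly 'A', 'B' or 'tie'."
)

def judge(prompt: str) -> str:
    """Hypothetical placeholder: ask a judge model and return 'A', 'B' or 'tie'."""
    raise NotImplementedError("plug in your judge model here")

def consistent_verdict(question: str, ans1: str, ans2: str) -> str | None:
    """Judge twice with the answer order swapped; keep only verdicts that agree."""
    first = judge(JUDGE_PROMPT.format(question=question, a=ans1, b=ans2))
    second = judge(JUDGE_PROMPT.format(question=question, a=ans2, b=ans1))
    # Map the swapped-order verdict back to the original labelling.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == unswapped else None  # None signals a position-biased pair
```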
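The per‑task metrics mentioned for SuperGLUE can be computed with standard library functions. A minimal sketch using scikit‑learn, assuming gold and predicted labels are available as parallel integer lists; it shows the call shape only, not an official SuperGLUE scoring script.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

def score_task(gold: list[int], pred: list[int]) -> dict[str, float]:
    """Per-task metrics in the style of the SuperGLUE table."""
    return {
        "accuracy": accuracy_score(gold, pred),             # e.g. COPA
        "macro_f1": f1_score(gold, pred, average="macro"),  # e.g. CB (reported with accuracy)
        "matthews_corr": matthews_corrcoef(gold, pred),     # used for diagnostic-style tasks
    }

# Toy labels only, to show the call shape:
print(score_task([0, 1, 1, 0, 2], [0, 1, 0, 0, 2]))
```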
Limitations of current benchmarks
Despite their utility, benchmarks have notable limitations. First, many widely used benchmarks rely on multiple‑choice questions (e.g., MMLU, BIG‑bench, ARC), which simplify evaluation but do not fully capture the generative abilities of LLMs or the complexity of real‑world tasks (Huang et al. 2024). Second, benchmarks often assume a single correct answer, ignoring context or nuance and neglecting tasks that require open‑ended creativity or reasoning (McIntosh et al. 2025). Third, fine‑tuning models directly on benchmark datasets can inflate scores without improving general capability; when benchmark data leaks into or is reused during training, the result is overfitting rather than genuine progress. The survey of 23 benchmarks by McIntosh and colleagues identified numerous inadequacies, including embedded biases, difficulty in measuring genuine reasoning and adaptability, sensitivity to prompt engineering, and neglect of cultural and ideological norms. They argue that current exam‑style benchmarks often fail to capture the subtleties of real‑world applications and may lead models to optimise for superficial metrics (McIntosh et al. 2025).
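The training‑data reuse problem can be probed mechanically. The sketch below shows one common style of contamination check, flagging benchmark items whose word n‑grams also occur in a training corpus; it is an illustrative heuristic rather than a method from the cited survey, and the eight‑token window is an arbitrary choice.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Whitespace-tokenised word n-grams of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_items: list[str],
                      training_docs: list[str],
                      n: int = 8) -> list[int]:
    """Indices of benchmark items sharing at least one n-gram with the training data."""
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if ngrams(item, n) & train_ngrams]
```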
Another limitation is domain and linguistic coverage. Most benchmarks are English‑centric, leaving other languages and cultural contexts under‑represented (McIntosh et al. 2025). TruthfulQA demonstrates that larger models can become less truthful because they mirror false statements from training data (Rosati 2024). Additionally, MT‑Bench’s use of LLMs as judges reveals position bias—the order of answers can influence evaluation results (Zheng et al. 2023). HELM emphasises that metrics such as fairness and toxicity must be evaluated alongside accuracy, but the current metrics themselves might not fully capture harm or bias (Liang et al. 2022).
Benchmark scores reduce complex behaviours to simple metrics like accuracy or F1, but higher performance does not always mean deeper competence. A model might excel on MMLU or SuperGLUE by exploiting shortcuts or memorising patterns, without genuine understanding. Newer benchmarks like HELM and MT‑Bench highlight the need to consider robustness, fairness, and human preference, though these too are affected by prompt design and evaluator bias. Fine-tuning on benchmark datasets can inflate scores and reinforce embedded cultural assumptions, leading to overfitting and diminished generalisability. This leaderboard-driven focus risks sidelining transparency, safety, and real-world utility. Benchmarks are essential, but they must be used critically and complemented with broader, more dynamic forms of evaluation.
References:
1. AllenAI via Hugging Face. 2024. AI2 ARC Dataset. Available at: https://huggingface.co/datasets/allenai/ai2_arc
2. CAIS. 2024. MMLU Dataset. Available at: https://huggingface.co/datasets/cais/mmlu
3. Huang, Hui, et al. 2024. On the Limitations of Fine-Tuned Judge Models for LLM Evaluation. arXiv preprint arXiv:2403.02839. Available at: https://arxiv.org/abs/2403.02839
4. Liang, Percy, Rishi Bommasani, Tony Lee, et al. 2022. Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110. Available at: https://arxiv.org/abs/2211.09110
5. McIntosh, Timothy R., Teo Susnjak, Nalin Arachchilage, Tong Liu, Dan Xu, and Paul Watters. 2025. Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence. IEEE Transactions on Artificial Intelligence.
6. Phan, Long, Alice Gatti, Ziwen Han, et al. 2025. Humanity's Last Exam. arXiv preprint arXiv:2501.14249. Available at: https://arxiv.org/abs/2501.14249
7. Rosati, Domenic. 2024. TruthfulQA Dataset. Available at: https://huggingface.co/datasets/domenicrosati/TruthfulQA
8. Srivastava, Aarohi, Abhinav Rastogi, Abhishek Rao, et al. 2025. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Transactions on Machine Learning Research (ICLR 2025 Journal Track). Available at: https://openreview.net/forum?id=uyTL5Bvosj
9. Wang, Alex, et al. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. Advances in Neural Information Processing Systems. Available at: https://arxiv.org/abs/1905.00537
10. Zheng, Lianmin, Siyuan Zhuang, Zhuohan Li, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Matei Zaharia. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36: 46595–46623.