LEXam: The First Legal Benchmark for AI Models

Source: pixabay - advogado aguilar

LEXam, published on the Social Science Research Network (SSRN) platform, is the first comprehensive benchmark built specifically to measure the legal reasoning abilities of AI models, drawing on 4,886 authentic questions from 340 law exams across 116 university courses. The testing system covers exam material in English and German across multiple jurisdictions and spans a wide range of legal domains, including criminal law, constitutional law, contract law, and tort law, providing an in-depth assessment of how well AI models can be applied in the legal field.
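To get a feel for the corpus, a short script along the lines of the sketch below could tally the questions by course, language, and legal domain once the data has been downloaded from the project repository. The file name and field names (`course`, `language`, `domain`) are hypothetical placeholders used for illustration, not the official LEXam schema.

```python
import json
from collections import Counter

def summarise_corpus(path: str) -> None:
    """Print simple counts over a JSONL export of the exam questions.

    Each line is assumed to hold one question as a JSON object with
    'course', 'language', and 'domain' fields (hypothetical names).
    """
    by_language = Counter()
    by_domain = Counter()
    courses = set()

    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            by_language[record["language"]] += 1
            by_domain[record["domain"]] += 1
            courses.add(record["course"])

    print(f"{sum(by_language.values())} questions from {len(courses)} courses")
    print("By language:", dict(by_language))
    print("Top domains:", by_domain.most_common(5))

if __name__ == "__main__":
    summarise_corpus("lexam_questions.jsonl")  # hypothetical export file
```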

The LEXam benchmark has been made publicly available on the GitHub platform, where the researchers provide detailed documentation of the testing methodology and a breakdown of the exam questions. The tests measured the performance of the GPT-4o, Claude Opus, Gemini 1.5 Pro, and Llama 3 70B models: GPT-4o achieved the best result at 76.8% accuracy, followed by Claude Opus at 75.2%, Gemini 1.5 Pro at 69.3%, and Llama 3 70B at 65.5%. According to The Moonlight's analysis, the exam questions span varying difficulty levels: 36% are categorised as easy, 32% as medium, and 32% as difficult, enabling a versatile assessment of AI models' legal reasoning capabilities.
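Accuracy figures like these are straightforward to recompute once model outputs are available. The sketch below shows one way to break accuracy down by difficulty level from a results file; the file name and field names (`difficulty`, `gold`, `prediction`) are hypothetical illustrations, not the evaluation tooling shipped in the LEXam repository.

```python
import json
from collections import defaultdict

def accuracy_by_difficulty(path: str) -> dict:
    """Compute overall and per-difficulty accuracy from a JSONL results file.

    Each line is assumed to look like:
    {"question_id": "q1", "difficulty": "easy", "gold": "B", "prediction": "B"}
    (hypothetical field names, not the official LEXam schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            level = record["difficulty"]
            for key in (level, "overall"):
                total[key] += 1
                if record["prediction"].strip().upper() == record["gold"].strip().upper():
                    correct[key] += 1
    return {key: correct[key] / total[key] for key in total}

if __name__ == "__main__":
    for level, acc in accuracy_by_difficulty("gpt4o_lexam_results.jsonl").items():
        print(f"{level}: {acc:.1%}")
```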

The significance of LEXam lies in its being the first comprehensive benchmark focused specifically on the legal domain, making the performance of different AI models objectively comparable in this specialised field. Built from 340 exams, the testing system examines not only the models' legal knowledge but also their reasoning abilities. This matters both for legal professionals who may use AI assistants in their work and for the companies developing these models, which can use the results to improve their systems' performance in the legal field in a targeted way.

Sources:

1. SSRN – LEXam: Benchmarking Legal Reasoning on 340 Law Exams. A large-scale benchmark of 4,886 law exam questions from 116 courses, designed to evaluate long-form legal reasoning in English and German using LLMs.
2. GitHub – LEXam: Benchmarking Legal Reasoning on 340 Law Exams. Official GitHub repository for the LEXam benchmark project, featuring code, datasets, and evaluation tools to assess long-form legal reasoning using LLMs across multiple jurisdictions.
3. The Moonlight – Review: LEXam: Benchmarking Legal Reasoning on 340 Law Exams. A concise summary of the LEXam benchmark, featuring 4,886 questions from 340 law exams used to assess LLMs' ability to perform long-form legal reasoning across multiple domains.