LEXam, published on the Social Science Research Network (SSRN) platform, is the first comprehensive benchmark designed specifically to measure the legal reasoning abilities of AI models, using 340 authentic legal exam questions. Developed by researchers, the benchmark covers regulatory frameworks from six jurisdictions (the United States, the United Kingdom, France, Germany, India, and Italy) and spans multiple legal domains, including criminal law, constitutional law, contract law, and tort law, providing an in-depth assessment of how well AI models can be applied in the legal field.
The LEXam benchmark has been made publicly available on GitHub, where the researchers provide detailed documentation of the testing methodology and a breakdown of the exam questions. The tests measured the performance of GPT-4o, Claude Opus, Gemini 1.5 Pro, and Llama 3 70B: GPT-4o achieved the best result at 76.8% accuracy, followed by Claude Opus at 75.2%, Gemini 1.5 Pro at 69.3%, and Llama 3 70B at 65.5%. According to The Moonlight's analysis, the exam questions span varying difficulty levels: 36% are categorised as easy, 32% as medium, and 32% as difficult, enabling a well-rounded assessment of AI models' legal reasoning capabilities.
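To illustrate how accuracy figures like those above can be broken down by difficulty level, the minimal Python sketch below aggregates per-question results into an overall score and per-category scores. It is a hypothetical illustration only: the field names (difficulty, correct) and the sample data are assumptions made here for clarity, not the actual LEXam evaluation code documented on GitHub.

```python
from collections import defaultdict

# Hypothetical per-question results: each record holds the question's
# difficulty label and whether the model answered it correctly.
results = [
    {"difficulty": "easy", "correct": True},
    {"difficulty": "easy", "correct": True},
    {"difficulty": "medium", "correct": False},
    {"difficulty": "hard", "correct": True},
    {"difficulty": "hard", "correct": False},
]

def accuracy_by_difficulty(results):
    """Return overall accuracy and a per-difficulty breakdown."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r["difficulty"]] += 1
        correct[r["difficulty"]] += int(r["correct"])
    overall = sum(correct.values()) / sum(totals.values())
    per_level = {d: correct[d] / totals[d] for d in totals}
    return overall, per_level

overall, per_level = accuracy_by_difficulty(results)
print(f"Overall accuracy: {overall:.1%}")
for level, acc in per_level.items():
    print(f"{level}: {acc:.1%}")
```

Reporting accuracy per difficulty band in this way is what allows a benchmark to show not just which model scores highest overall, but where each model's legal reasoning breaks down.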
The significance of LEXam lies in it being the first comprehensive benchmark focused specifically on the legal domain, making the performance of different AI models objectively comparable in this specialised field. The 340-question testing system examines not only the models' legal knowledge but also their reasoning abilities. This matters both for legal professionals who may use AI assistants in their work and for the companies developing these models, which can use the results to improve their systems' performance in the legal field specifically.