benchmarks

OpenAI Released Two Open-Weight GPT Models Under Apache 2.0 License

On August 5, 2025, OpenAI released two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, under the Apache 2.0 license, allowing researchers to freely access, modify, and distribute them. The move, an industry milestone, responds to growing demand for open, high-performance models that make AI development more transparent. The two

by poltextLAB AI journalist

Anthropic Unveils Claude Opus 4.1 Model with Enhanced Coding Capabilities

On August 5, 2025, Anthropic released Claude Opus 4.1, featuring significant improvements in coding, agentic, and reasoning capabilities, with particular advances in handling complex real-world programming tasks and multi-step problems. The updated model delivers 38% better performance on coding tasks, and 27% enhanced reasoning capabilities on HumanEval, MMLU, and

by poltextLAB AI journalist

Large Language Models Are Proficient in Solving and Creating Emotional Intelligence Tests

AI Outperforms Average Humans in Tests Measuring Emotional Capabilities

A recent study led by researchers from the Universities of Geneva and Bern has revealed that six leading Large Language Models (LLMs), including ChatGPT, significantly outperformed humans on five standard emotional intelligence tests, achieving an average accuracy of 82% compared

by poltextLAB AI journalist

LEXam: The First Legal Benchmark for AI Models

LEXam, published on the Social Science Research Network (SSRN) platform, is the first comprehensive benchmark specifically measuring legal reasoning abilities of AI models using 340 authentic legal exam questions. Developed by researchers, the testing system covers regulatory frameworks from six different jurisdictions (United States, United Kingdom, France, Germany, India, and

by poltextLAB AI journalist

Google's Latest Gemma 3n Model Enhances Mobile AI Application Efficiency Through Innovative Solutions

Officially released on June 26, 2025, Gemma 3n includes significant developments specifically targeting on-device AI operation. The multimodal model natively supports image, audio, video, and text inputs and is available in two sizes: E2B (5 billion parameters) and E4B (8 billion parameters), operating with just 2GB and 3GB of memory

by poltextLAB AI journalist

Mistral AI Unveils Its First Reasoning Model, 10x Faster Than Competitors

French AI lab Mistral AI officially announced Magistral on June 10, 2025, its first family of reasoning models capable of step-by-step thinking, available in two variants: the open-source 24-billion-parameter Magistral Small and the enterprise-focused Magistral Medium. Magistral Medium scored 73.6% accuracy on the AIME2024 mathematics benchmark, rising to 90%

by poltextLAB AI journalist

Chinese Startup Introduced New DeepSeek-R1-0528 Model Approaching Market Leaders with 87.5% Accuracy

Chinese startup DeepSeek announced DeepSeek-R1-0528 on 28 May 2025, delivering significant performance improvements in complex reasoning tasks and achieving near-parity capabilities with paid models OpenAI o3 and Google Gemini 2.5 Pro. The update increased accuracy on the AIME 2025 test from 70% to 87.5%, whilst improving coding performance

by poltextLAB AI journalist