AIREVOLUTION

Google's Latest Gemma 3n Model Enhances Mobile AI Application Efficiency Through Innovative Solutions

Officially released on June 26, 2025, Gemma 3n includes significant developments specifically targeting on-device AI operation. The multimodal model natively supports image, audio, video, and text inputs and is available in two sizes: E2B (5 billion parameters) and E4B (8 billion parameters), operating with just 2GB and 3GB of memory

by poltextLAB AI journalist • Jul 11, 2025

Mistral reasoning model benchmarks

Mistral AI Unveils Its First Reasoning Model, 10x Faster Than Competitors

French AI lab Mistral AI officially announced Magistral on June 10, 2025, its first family of reasoning models capable of step-by-step thinking, available in two variants: the open-source 24-billion-parameter Magistral Small and the enterprise-focused Magistral Medium. Magistral Medium scored 73.6% accuracy on the AIME2024 mathematics benchmark, rising to 90%

by poltextLAB AI journalist • Jun 27, 2025

DeepSeek China benchmarks

Chinese Startup Introduced New DeepSeek-R1-0528 Model Approaching Market Leaders with 87.5% Accuracy

Chinese startup DeepSeek announced DeepSeek-R1-0528 on 28 May 2025, delivering significant performance improvements in complex reasoning tasks and approaching parity with the paid models OpenAI o3 and Google Gemini 2.5 Pro. The update increased accuracy on the AIME 2025 test from 70% to 87.5%, whilst improving coding performance

by poltextLAB AI journalist • Jun 6, 2025

Anthropic Claude benchmarks

Anthropic's New Claude 4 Model Leads Software Engineering Benchmarks

Anthropic introduced its new Claude 4 models, Claude Opus 4 and Claude Sonnet 4, on 22 May 2025, setting new standards for coding, advanced reasoning and AI agents. According to Anthropic, Claude Opus 4 is the world's best coding model, achieving 72.5% on the SWE-bench benchmark and 43.2% on

by poltextLAB AI journalist • May 27, 2025

AI-safety benchmarks GenAI

The SpeechMap Free Speech Eval Reveals AI Responses to Controversial Topics

A pseudonymous developer unveiled the SpeechMap "free speech eval" on 16 April 2025, measuring how different AI models, including OpenAI's ChatGPT and xAI's Grok, respond to sensitive and controversial topics. The benchmarking tool compares 78 different AI models across 492 question themes, analysing over

by poltextLAB AI journalist • May 7, 2025

OpenAI hallucination benchmarks

OpenAI's New Reasoning AI Models Hallucinate More Frequently

OpenAI's o3 and o4-mini models, released in April 2025, have significantly higher hallucination rates than their predecessors: according to the company's own tests, o3 hallucinated 33% of the time and o4-mini 48% of the time on the PersonQA evaluation. This development marks a surprising turn, as previous models

by poltextLAB AI journalist • May 5, 2025

research results LLM benchmarks

Large Language Models in Maths Olympiads: Impressive Results or Just a Bluff?

Recent advancements in the mathematical capabilities of large language models (LLMs) have sparked interest, yet detailed human evaluations from the 2025 USAMO (USA Mathematical Olympiad) reveal that current models fall significantly short in generating rigorous mathematical proofs. While benchmarks like MathArena paint a positive picture of LLM performance on the

by poltextLAB AI journalist • May 2, 2025