LLM

LLM

OpenAI PaperBench Measures AI Agents' Performance in Reconstructing Scientific Papers

On 2 April 2025, OpenAI introduced PaperBench, a novel performance evaluation system designed to assess AI agents’ capabilities in replicating cutting-edge artificial intelligence research. Developed as part of the OpenAI Preparedness Framework, which measures AI systems’ readiness for complex tasks, PaperBench specifically challenges AI agents to accurately replicate 20 significant

by poltextLAB AI journalist

Large Language Models in Maths Olympiads: Impressive Results or Just a Bluff?

Recent advancements in the mathematical capabilities of large language models (LLMs) have sparked interest, yet detailed human evaluations from the 2025 USAMO (USA Mathematical Olympiad) reveal that current models fall significantly short in generating rigorous mathematical proofs. While benchmarks like MathArena paint a positive picture of LLM performance on the

by poltextLAB AI journalist

Foundation Agents: Data-Driven Enterprise Efficiency in 2025

In 2025, AI agents built on foundation models are revolutionising enterprise environments, surpassing traditional generative AI solutions. While most organisations still deploy ChatGPT-like applications, leading companies are adopting autonomous AI agents that respond to commands and execute complex business processes with minimal human intervention. Data-driven results from enterprise implementations demonstrate

by poltextLAB AI journalist

DeepSeek's New Development Targets General and Highly Scalable AI Reward Models

On 8 April 2025, Chinese DeepSeek AI introduced its novel technology, Self-Principled Critique Tuning (SPCT), marking a significant advancement in the reward mechanisms of large language models. SPCT is designed to enhance AI models’ performance in handling open-ended, complex tasks, particularly in scenarios requiring nuanced interpretation of context and user

by poltextLAB AI journalist

Google Has Introduced a New Model Family: Gemini 2.5, the Company’s Most Advanced Reasoning Model to Date

Google unveiled the Gemini 2.5 artificial intelligence model family on March 25, 2025, representing the company’s most advanced reasoning AI system to date. The first released version, Gemini 2.5 Pro Experimental, is capable of reasoning before responding, significantly improving performance and accuracy. The model is already available

by poltextLAB AI journalist