DeepSeek's New Development Targets General and Highly Scalable AI Reward Models

On 8 April 2025, the Chinese AI company DeepSeek introduced a novel technique, Self-Principled Critique Tuning (SPCT), marking a significant advancement in the reward mechanisms used to train and evaluate large language models. SPCT is designed to improve AI models' performance on open-ended, complex tasks, particularly in scenarios requiring nuanced interpretation of context and user needs.

The core concept of SPCT is that the reward model does not merely evaluate responses against predefined rules; instead, it generates its own principles and evaluation criteria and uses them to produce detailed critiques of each response. DeepSeek applied this approach to the Gemma-2-27B model, creating DeepSeek-GRM-27B. The new model not only outperformed its base model but also proved competitive against significantly larger models with up to 671 billion parameters.
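
To make the idea concrete, here is a minimal Python sketch of how such a generative reward model might be driven: it first asks the model to write its own principles for the query, then asks for a critique and score of each candidate response. The `llm` callable, the prompt wording, and the "Score:" output format are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re
from typing import Callable, List, Tuple

def spct_score(
    llm: Callable[[str], str],   # any text-in/text-out model call (assumed interface)
    query: str,
    responses: List[str],
) -> Tuple[str, List[float]]:
    """One SPCT-style pass: generate principles, then critique and score each response."""
    # Step 1: the reward model writes its own evaluation principles for this query.
    principles = llm(
        "List the principles that matter most when judging answers to the query below.\n"
        f"Query: {query}"
    )

    # Step 2: it critiques each candidate against those principles and ends with a score.
    scores = []
    for response in responses:
        critique = llm(
            f"Principles:\n{principles}\n\nQuery:\n{query}\n\nResponse:\n{response}\n\n"
            "Critique the response against the principles, then end with 'Score: <1-10>'."
        )
        match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", critique)
        scores.append(float(match.group(1)) if match else 0.0)
    return principles, scores
```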

Researchers reported substantial performance gains as more samples were drawn during evaluation: with 32 samples, the 27B model surpassed far larger models. This suggests that smarter feedback mechanisms may matter more than simply increasing model size. A distinctive feature of SPCT is that it improves a model at inference time by spending additional compute on sampling rather than by enlarging the model itself, an approach that is more cost-effective while offering greater adaptability and scalability. DeepSeek plans to release SPCT-based models as open source, though no specific release date has been announced.
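
As a rough illustration of that inference-time scaling idea, the sketch below samples several independent critique passes and aggregates their scores. Here `score_once` stands for a single SPCT-style evaluation (such as the sketch above), and summing the scores is a simplified stand-in for the more elaborate voting schemes the researchers describe; none of these names come from DeepSeek's code.

```python
from typing import Callable, List

def scaled_rewards(
    score_once: Callable[[str, List[str]], List[float]],  # one sampled critique pass
    query: str,
    responses: List[str],
    num_samples: int = 32,   # the sample budget highlighted in the article
) -> List[float]:
    """Improve reward quality by sampling many critiques and aggregating their scores."""
    totals = [0.0] * len(responses)
    for _ in range(num_samples):
        # Each pass regenerates principles and critiques, so individual scores vary.
        for i, score in enumerate(score_once(query, responses)):
            totals[i] += score
    # A simple vote: the response with the highest aggregated score is preferred.
    return totals
```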

Sources:

1. DeepSeek unveils new technique for smarter, scalable AI reward models. Reward models holding back AI? DeepSeek's SPCT creates self-guiding critiques, promising more scalable intelligence for enterprise LLMs.

2. DeepSeek is developing self-improving AI models. Here's how it works. DeepSeek and China's Tsinghua University say they have found a way that could make AI models more intelligent and efficient.

3. Inference-Time Scaling for Generalist Reward Modeling (arXiv).