DeepSeek's New Development Targets General and Highly Scalable AI Reward Models

On 8 April 2025, the Chinese AI company DeepSeek introduced a novel technique, Self-Principled Critique Tuning (SPCT), marking a significant advancement in the reward mechanisms used to train and evaluate large language models. SPCT is designed to improve AI models' performance on open-ended, complex tasks, particularly in scenarios requiring nuanced interpretation of context and user needs.

The core concept of SPCT is that the reward model does not merely evaluate responses against predefined rules; instead, it generates its own principles and evaluation criteria and uses them to produce detailed critiques of each response. DeepSeek applied this approach to the Gemma-2-27B model, creating DeepSeek-GRM-27B. The new model not only outperformed its base model but also proved competitive against significantly larger models with up to 671 billion parameters.
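
To make the idea concrete, here is a minimal Python sketch of how such a generative reward model might be driven: it first asks the model to write its own principles for the query, then asks for a critique and score of each candidate response. The `llm` callable, the prompt wording, and the "Score:" output format are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re
from typing import Callable, List, Tuple

def spct_score(
    llm: Callable[[str], str],   # any text-in/text-out model call (assumed interface)
    query: str,
    responses: List[str],
) -> Tuple[str, List[float]]:
    """One SPCT-style pass: generate principles, then critique and score each response."""
    # Step 1: the reward model writes its own evaluation principles for this query.
    principles = llm(
        "List the principles that matter most when judging answers to the query below.\n"
        f"Query: {query}"
    )

    # Step 2: it critiques each candidate against those principles and ends with a score.
    scores = []
    for response in responses:
        critique = llm(
            f"Principles:\n{principles}\n\nQuery:\n{query}\n\nResponse:\n{response}\n\n"
            "Critique the response against the principles, then end with 'Score: <1-10>'."
        )
        match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", critique)
        scores.append(float(match.group(1)) if match else 0.0)
    return principles, scores
```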

Researchers reported substantial performance gains as more samples were drawn during evaluation: with 32 samples, the 27B model surpassed far larger models. This suggests that smarter feedback mechanisms may matter more than simply increasing model size. A distinctive feature of SPCT is that it improves a model at inference time by spending additional compute on sampling rather than by enlarging the model itself, an approach that is more cost-effective while offering greater adaptability and scalability. DeepSeek plans to release SPCT-based models as open source, though no specific release date has been announced.
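
As a rough illustration of that inference-time scaling idea, the sketch below samples several independent critique passes and aggregates their scores. Here `score_once` stands for a single SPCT-style evaluation (such as the sketch above), and summing the scores is a simplified stand-in for the more elaborate voting schemes the researchers describe; none of these names come from DeepSeek's code.

```python
from typing import Callable, List

def scaled_rewards(
    score_once: Callable[[str, List[str]], List[float]],  # one sampled critique pass
    query: str,
    responses: List[str],
    num_samples: int = 32,   # the sample budget highlighted in the article
) -> List[float]:
    """Improve reward quality by sampling many critiques and aggregating their scores."""
    totals = [0.0] * len(responses)
    for _ in range(num_samples):
        # Each pass regenerates principles and critiques, so individual scores vary.
        for i, score in enumerate(score_once(query, responses)):
            totals[i] += score
    # A simple vote: the response with the highest aggregated score is preferred.
    return totals
```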

Sources:

1. DeepSeek unveils new technique for smarter, scalable AI reward models. Reward models holding back AI? DeepSeek's SPCT creates self-guiding critiques, promising more scalable intelligence for enterprise LLMs.

2. DeepSeek is developing self-improving AI models. Here's how it works. DeepSeek and China's Tsinghua University say they have found a way that could make AI models more intelligent and efficient.

3. Inference-Time Scaling for Generalist Reward Modeling (arXiv).