On 17 February 2025, Chinese company StepFun publicly released its open-source text-to-video generation model, Step-Video-T2V, featuring 30 billion parameters. Positioned as a direct competitor to OpenAI's Sora, the model interprets bilingual (English and Chinese) text prompts and can generate videos of up to 204 frames at 544×992 resolution. According to its developers, it delivers significantly improved motion dynamics and visual coherence over existing video generation tools.
Compared to OpenAI's Sora, Step-Video-T2V offers potentially greater capacity: while Sora's exact parameter count remains undisclosed, with outside estimates ranging from 33 million to 3 billion, the Chinese model has 30 billion parameters. The architecture of Step-Video-T2V comprises three core components: a deep-compression video VAE that drastically shrinks the latent representation the model must process; two bilingual text encoders that interpret prompts in both English and Chinese; and a DiT (Diffusion Transformer) backbone trained to denoise those latents into high-quality, temporally coherent video.
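To make the role of the compression stage concrete, the sketch below computes the size of the latent grid the DiT would denoise, assuming the 16×16 spatial and 8× temporal compression ratios reported for the model's video VAE. The function name, channel count, and exact shape convention are illustrative assumptions, not the real API.

```python
def latent_shape(frames, height, width,
                 t_ratio=8, s_ratio=16, channels=16):
    """Size of the compressed latent grid the diffusion transformer
    operates on, given assumed VAE compression ratios.
    (channels=16 is an illustrative choice, not a confirmed value.)"""
    return (frames // t_ratio, channels, height // s_ratio, width // s_ratio)

# A full 204-frame, 544x992 generation collapses to a far smaller grid:
shape = latent_shape(204, 544, 992)   # (25, 16, 34, 62)

# Spatio-temporal positions shrink by roughly t_ratio * s_ratio**2 = 2048x,
# which is what makes attention over an entire video clip tractable.
reduction = 8 * 16 * 16
```

This compression is the reason a 30-billion-parameter transformer can attend over a whole clip at once rather than frame by frame.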
To further enhance visual fidelity, the developers have implemented a custom Video-DPO (Direct Preference Optimisation) technique, which improves the realism of motion, reduces visual artefacts, and enhances overall output quality. In comparative benchmarks, Step-Video-T2V delivered outstanding performance across multiple categories, particularly in rendering sports scenes and dynamic motion.
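StepFun has not published a formula in this article, but Video-DPO builds on the standard Direct Preference Optimisation objective, which trains the model to assign higher likelihood to the human-preferred sample of a pair than a frozen reference model does. The sketch below shows that generic DPO loss; applying it to a video diffusion model (e.g. via noise-prediction errors instead of exact log-likelihoods) involves further adaptations not shown here.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO objective on one (preferred, rejected) pair.

    margin > 0 means the trained model favours the preferred sample
    more strongly than the frozen reference model does.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(beta * margin)), written in a numerically stable form
    return math.log1p(math.exp(-beta * margin))
```

Minimising this loss pushes the policy to widen the preference margin, which in the video setting translates into fewer artefacts and more realistic motion in the samples annotators prefer.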
The model was jointly announced by StepFun and the automotive division of Geely Holding Group, with both the source code and model weights made publicly available to the developer community. This move aligns with a broader trend in China’s tech sector—initiated in January by DeepSeek—towards open-sourcing AI models to encourage collaborative development and innovation.