¹GigaAI
²CASIA
*Equal contribution.
†Corresponding author.
VLA-R1 is a reasoning-enhanced vision–language–action model that enables step-by-step reasoning and robust action execution across diverse tasks and domains.
(1) An RL-based reasoning optimization scheme that strengthens step-by-step reasoning and execution;
(2) VLA-CoT-13K, a high-quality dataset with explicit chain-of-thought (CoT) supervision (see the example record after this list);
(3) Comprehensive evaluations across benchmarks, simulation, and real-world deployment demonstrating the superiority of VLA-R1.
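To make the CoT supervision concrete, here is a minimal sketch of what a single VLA-CoT-13K record might look like. The field names, coordinate conventions, and values are illustrative assumptions; the source does not specify the released schema.

```python
# Hypothetical sketch of one VLA-CoT-13K training record.
# Field names and structure are assumptions, not the released schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VLACoTRecord:
    image_path: str                        # RGB observation of the scene
    instruction: str                       # natural-language task command
    chain_of_thought: List[str]            # step-by-step reasoning trace
    trajectory: List[Tuple[float, float]]  # 2D waypoints in normalized image coords
    affordance_box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)

record = VLACoTRecord(
    image_path="scenes/table_042.png",
    instruction="Pick up the red mug and place it on the tray.",
    chain_of_thought=[
        "The red mug is on the left side of the table.",
        "The tray is to its right; the path between them is clear.",
        "Grasp the mug by the handle, lift, move right, lower onto the tray.",
    ],
    trajectory=[(0.21, 0.55), (0.35, 0.40), (0.62, 0.38)],
    affordance_box=(0.17, 0.48, 0.27, 0.63),
)
```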
Overall architecture of VLA-R1. The system operates in two stages. Stage I applies supervised fine-tuning (SFT) augmented with Chain-of-Thought (CoT) supervision to endow the multimodal model with foundational reasoning, enabling it to generate trajectories or affordance regions conditioned on images and instructions; a downstream control stack then converts these outputs into joint-level robot commands. Stage II introduces reinforcement learning with verifiable rewards, optimized via Group Relative Policy Optimization (GRPO), to further refine both reasoning quality and action execution, yielding more robust cross-task generalization and resilience in complex scenarios.
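As a rough illustration of Stage II, the sketch below shows the group-relative advantage computation at the core of GRPO, paired with a hypothetical verifiable reward that scores a predicted waypoint trajectory against a reference. The reward design and all names here are assumptions for illustration, not the paper's implementation.

```python
# Minimal GRPO-style sketch with a verifiable reward; the reward design
# (exponential of mean waypoint distance) is an assumption for illustration.
import numpy as np

def trajectory_reward(pred: np.ndarray, ref: np.ndarray) -> float:
    """Verifiable reward: higher when predicted waypoints lie near the reference."""
    dist = np.linalg.norm(pred - ref, axis=-1).mean()
    return float(np.exp(-dist))  # in (0, 1]; 1.0 for a perfect match

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO normalizes each sampled response's reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# For one prompt, sample a group of responses from the current policy,
# score each with the verifiable reward, and weight each response's
# log-likelihood gradient by its group-relative advantage.
ref = np.array([[0.21, 0.55], [0.35, 0.40], [0.62, 0.38]])
group = [ref + np.random.normal(0.0, s, ref.shape) for s in (0.01, 0.05, 0.20, 0.50)]
rewards = np.array([trajectory_reward(p, ref) for p in group])
advantages = group_relative_advantages(rewards)
print(rewards.round(3), advantages.round(3))
```

Because advantages are normalized within each sampled group, GRPO needs no learned value critic: responses that verify better than their group peers are reinforced, and the rest are suppressed.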
@article{song2025maniplvm,
  title={{ManipLVM-R1}: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models},
  author={Song, Zirui and Ouyang, Guangxian and Li, Mingzhe and Ji, Yuheng and Wang, Chenxi and Xu, Zixiang and Zhang, Zeyu and Zhang, Xiaoqing and Jiang, Qian and Chen, Zhenhao and others},
  journal={arXiv preprint arXiv:2505.16517},
  year={2025}
}