¹GigaAI
²CASIA
³Tsinghua University
*Equal contribution.
†Corresponding author.
VLA-R1 is a reasoning-enhanced vision–language–action model that enables step-by-step reasoning and robust action execution across diverse tasks and domains.
(1) An RL-based reasoning optimization scheme that strengthens step-by-step reasoning and execution;
(2) VLA-CoT-13K, a high-quality dataset with explicit chain-of-thought supervision (an illustrative sample format is sketched after this list);
(3) Comprehensive benchmark, simulation, and real-world evaluations demonstrating the superiority of VLA-R1.
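To illustrate what explicit chain-of-thought supervision for a VLA model might look like, here is a minimal, hypothetical sample record in the spirit of VLA-CoT-13K. The field names, file path, reasoning text, and waypoint values are assumptions for illustration only, not the dataset's actual schema.

```python
# Hypothetical VLA-CoT-13K-style sample; all fields and values are illustrative
# assumptions, not the dataset's real schema.
sample = {
    "image": "scenes/kitchen_0042.png",            # visual observation (hypothetical path)
    "instruction": "Place the mug on the tray.",   # language command
    "cot": (
        "The mug is on the left counter and the tray is near the sink. "
        "First move above the mug, grasp it, then move over the tray and release."
    ),                                              # explicit chain-of-thought supervision
    "target": {
        "type": "trajectory",                       # or "affordance" for region outputs
        "points": [[0.21, 0.55], [0.34, 0.48], [0.62, 0.40]],  # normalized 2D waypoints
    },
}
```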
Overall architecture of VLA-R1. The system operates in two stages. Stage I applies supervised fine-tuning (SFT) augmented with Chain-of-Thought (CoT) supervision to endow the multimodal model with foundational reasoning, enabling it to generate trajectories or affordance regions conditioned on images and instructions; a downstream control stack then converts these outputs into joint-level robot commands. Stage II introduces reinforcement learning with verifiable rewards, optimized via GRPO, to further refine both reasoning quality and action execution, yielding more robust cross-task generalization and resilience in complex scenarios.
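To make Stage II more concrete, below is a minimal sketch of how group-relative advantages could be computed from verifiable rewards in a GRPO-style update. The reward checker (`verifiable_reward`), its error threshold, and the group size are assumptions for illustration, not necessarily the paper's exact reward design.

```python
# Minimal sketch of GRPO-style advantage computation from verifiable rewards,
# assuming a hypothetical trajectory checker as the reward; VLA-R1's actual
# reward functions (e.g., for affordance regions) may differ.
import numpy as np

def verifiable_reward(pred_traj, gt_traj, threshold=0.05):
    """Hypothetical checker: 1.0 if mean point-wise error is under the threshold."""
    err = np.mean(np.linalg.norm(pred_traj - gt_traj, axis=-1))
    return float(err < threshold)

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize rewards within one prompt's sample group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: G = 4 sampled trajectories for one image-instruction pair.
gt = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]])
samples = [gt + np.random.normal(scale=s, size=gt.shape) for s in (0.01, 0.02, 0.1, 0.2)]
rewards = [verifiable_reward(p, gt) for p in samples]
advantages = grpo_advantages(rewards)
print(rewards, advantages)  # samples that pass the checker receive higher advantages
```

In a full training loop, these advantages would typically weight a clipped policy-gradient objective over the sampled token sequences, often with a KL penalty toward the Stage I SFT model as the reference policy.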
@article{ye2025vlar1,
title={VLA-R1: Enhancing Reasoning in Vision-Language-Action Models},
author={Ye, Angen and Zhang, Zeyu and Wang, Boyuan and Wang, Xiaofeng and Zhang, Dapeng and Zhu, Zheng},
journal={arXiv preprint arXiv:2510.01623},
year={2025}
}