¹GigaAI
²CASIA
*Equal contribution.
†Corresponding author.
VLA-R1 is a reasoning-enhanced vision–language–action model that enables step-by-step reasoning and robust action execution across diverse tasks and domains.
(1) An RL-based reasoning optimization scheme that strengthens step-by-step reasoning and execution;
(2) VLA-CoT-13K, a high-quality dataset with explicit chain-of-thought (CoT) supervision (see the example record after this list);
(3) Comprehensive evaluations across benchmarks, simulation, and real-world deployment demonstrating the superiority of VLA-R1.
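To make the CoT supervision concrete, here is a minimal sketch of what a single VLA-CoT-13K record might look like. The field names, coordinate conventions, and values are illustrative assumptions; the source does not specify the released schema.

```python
# Hypothetical sketch of one VLA-CoT-13K training record.
# Field names and structure are assumptions, not the released schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VLACoTRecord:
    image_path: str                        # RGB observation of the scene
    instruction: str                       # natural-language task command
    chain_of_thought: List[str]            # step-by-step reasoning trace
    trajectory: List[Tuple[float, float]]  # 2D waypoints in normalized image coords
    affordance_box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)

record = VLACoTRecord(
    image_path="scenes/table_042.png",
    instruction="Pick up the red mug and place it on the tray.",
    chain_of_thought=[
        "The red mug is on the left side of the table.",
        "The tray is to its right; the path between them is clear.",
        "Grasp the mug by the handle, lift, move right, lower onto the tray.",
    ],
    trajectory=[(0.21, 0.55), (0.35, 0.40), (0.62, 0.38)],
    affordance_box=(0.17, 0.48, 0.27, 0.63),
)
```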
Overall architecture of VLA-R1. The system operates in two stages. Stage I applies supervised fine-tuning (SFT) augmented with Chain-of-Thought (CoT) supervision to endow the multimodal model with foundational reasoning, enabling it to generate trajectories or affordance regions conditioned on images and instructions; a downstream control stack then converts these outputs into joint-level robot commands. Stage II introduces reinforcement learning with verifiable rewards, optimized via Group Relative Policy Optimization (GRPO), to further refine both reasoning quality and action execution, yielding more robust cross-task generalization and resilience in complex scenarios.
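As a rough illustration of Stage II, the sketch below shows the group-relative advantage computation at the core of GRPO, paired with a hypothetical verifiable reward that scores a predicted waypoint trajectory against a reference. The reward design and all names here are assumptions for illustration, not the paper's implementation.

```python
# Minimal GRPO-style sketch with a verifiable reward; the reward design
# (exponential of mean waypoint distance) is an assumption for illustration.
import numpy as np

def trajectory_reward(pred: np.ndarray, ref: np.ndarray) -> float:
    """Verifiable reward: higher when predicted waypoints lie near the reference."""
    dist = np.linalg.norm(pred - ref, axis=-1).mean()
    return float(np.exp(-dist))  # in (0, 1]; 1.0 for a perfect match

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO normalizes each sampled response's reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# For one prompt, sample a group of responses from the current policy,
# score each with the verifiable reward, and weight each response's
# log-likelihood gradient by its group-relative advantage.
ref = np.array([[0.21, 0.55], [0.35, 0.40], [0.62, 0.38]])
group = [ref + np.random.normal(0.0, s, ref.shape) for s in (0.01, 0.05, 0.20, 0.50)]
rewards = np.array([trajectory_reward(p, ref) for p in group])
advantages = group_relative_advantages(rewards)
print(rewards.round(3), advantages.round(3))
```

Because advantages are normalized within each sampled group, GRPO needs no learned value critic: responses that verify better than their group peers are reinforced, and the rest are suppressed.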
@article{song2025maniplvm,
  title={{ManipLVM-R1}: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models},
  author={Song, Zirui and Ouyang, Guangxian and Li, Mingzhe and Ji, Yuheng and Wang, Chenxi and Xu, Zixiang and Zhang, Zeyu and Zhang, Xiaoqing and Jiang, Qian and Chen, Zhenhao and others},
  journal={arXiv preprint arXiv:2505.16517},
  year={2025}
}