VLA-R1

Enhancing Reasoning in Vision-Language-Action Models


Angen Ye1,2*  Zeyu Zhang1*  Boyuan Wang1,2  Xiaofeng Wang1,3  Dapeng Zhang2  Zheng Zhu1†

1GigaAI     2CASIA     3Tsinghua University
*Equal contribution.     †Corresponding author.

TL;DR


VLA-R1 is a reasoning-enhanced vision–language–action model that enables step-by-step reasoning and robust action execution across diverse tasks and domains.
(1) An RL-based reasoning optimization scheme that strengthens step-by-step reasoning and execution;
(2) VLA-CoT-13K, a high-quality dataset with explicit chain-of-thought supervision;
(3) Comprehensive evaluations on benchmarks, in simulation, and in the real world demonstrating the superiority of VLA-R1.


Method


Overall architecture of VLA-R1. The system operates in two stages. Stage I applies supervised fine-tuning (SFT) augmented with Chain-of-Thought (CoT) supervision to endow the multimodal model with foundational reasoning, enabling it to generate trajectories or affordance regions conditioned on images and instructions; a downstream control stack then converts these outputs into joint-level robot commands. Stage II introduces reinforcement learning with verifiable rewards (GRPO) to further refine both reasoning quality and action execution, yielding more robust cross-task generalization and resilience in complex scenarios.
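
As a rough illustration of the Stage II objective, the sketch below shows how verifiable rewards could be folded into GRPO-style group-relative advantages. It is a minimal sketch, not the authors' implementation: the reward terms (a CoT format check plus a distance-based trajectory-accuracy term), their weights, and the toy waypoints are all illustrative assumptions.

# Minimal GRPO-with-verifiable-rewards sketch (illustrative, not VLA-R1's exact rewards).
import numpy as np

def verifiable_reward(pred_traj, gt_traj, has_valid_cot):
    # Assumed reward: a format bonus for a well-formed chain-of-thought plus a
    # trajectory-accuracy term that decays with the mean L2 error to the ground truth.
    format_reward = 1.0 if has_valid_cot else 0.0
    l2 = np.linalg.norm(np.asarray(pred_traj) - np.asarray(gt_traj), axis=-1).mean()
    accuracy_reward = np.exp(-l2)
    return 0.2 * format_reward + 0.8 * accuracy_reward  # weights are assumptions

def group_relative_advantages(rewards, eps=1e-6):
    # GRPO's core step: normalize each rollout's reward against its group's statistics,
    # so no learned value function is needed.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

if __name__ == "__main__":
    gt = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]])  # toy 2D ground-truth waypoints
    group = [gt + np.random.normal(scale=s, size=gt.shape) for s in (0.01, 0.05, 0.2, 0.5)]
    rewards = [verifiable_reward(p, gt, has_valid_cot=True) for p in group]
    print("advantages:", group_relative_advantages(rewards))

In GRPO these group-relative advantages feed a clipped policy-gradient update with a KL penalty toward the SFT model; the clipping, KL, and sampling details are omitted here for brevity.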

Visualization


Real-World Results

Pick up the yellow bowl. Move the yellow bowl to the white basket.
Grasp the plush toy on the table.
Move the bread into the microwave.
Put the bread into the bread box.
Pick the yellow bowl.
Grip the yellow bowl. Grip the red bowl. Grip the blue bowl. Grip the white bowl. Grip the green bowl.
Grasp the green bowl.
Transfer the bananas into the brown basket.


Simulated Results

Pick up the green bowl.
Pick the black box.
Grab the orange box.
Move the orange box to the empty spot on the table-top.

Citation


@article{ye2025vlar1,
  title={VLA-R1: Enhancing Reasoning in Vision-Language-Action Models},
  author={Ye, Angen and Zhang, Zeyu and Wang, Boyuan and Wang, Xiaofeng and Zhang, Dapeng and Zhu, Zheng},
  journal={arXiv preprint arXiv:2510.01623},
  year={2025}
}