VLA-R1

Enhancing Reasoning in Vision-Language-Action Models


Angen Ye12*  Zeyu Zhang1*  Boyuan Wang12  Xiaofeng Wang1  Dapeng Zhang2  Zheng Zhu1†

1GigaAI     2CASIA
*Equal contribution.     †Corresponding author.

TL;DR


VLA-R1 is a reasoning-enhanced vision–language–action model that enables step-by-step reasoning and robust action execution across diverse tasks and domains. Its key contributions are:
(1) An RL-based reasoning optimization scheme that strengthens both step-by-step reasoning and action execution;
(2) VLA-CoT-13K, a high-quality dataset with explicit chain-of-thought supervision (an illustrative sample format is sketched below);
(3) Comprehensive benchmark, simulation, and real-world evaluations demonstrating the superiority of VLA-R1.
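
To make the chain-of-thought supervision in (2) concrete, the sketch below shows one plausible shape of a VLA-CoT-13K sample; the field names and value formats are illustrative assumptions, not the released schema.

# Hypothetical VLA-CoT-13K-style sample; field names and formats are assumptions.
sample = {
    "image": "scene_000123.png",                 # RGB observation of the workspace
    "instruction": "Pick up the yellow bowl.",   # natural-language task
    "cot": "The yellow bowl sits on the left of the table; approach from above, "
           "close the gripper around the rim, then lift.",   # explicit reasoning trace
    "target": {
        "type": "trajectory",                    # or "affordance" for region outputs
        "points": [[0.42, 0.31], [0.40, 0.28], [0.38, 0.22]],  # normalized waypoints
    },
}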


Method


Overall architecture of VLA-R1. The system operates in two stages. Stage I applies supervised fine-tuning (SFT) augmented with Chain-of-Thought (CoT) supervision to endow the multimodal model with foundational reasoning, enabling it to generate trajectories or affordance regions conditioned on images and instructions; a downstream control stack then converts these outputs into joint-level robot commands. Stage II introduces reinforcement learning with verifiable rewards, optimized via Group Relative Policy Optimization (GRPO), to further refine both reasoning quality and action execution, yielding more robust cross-task generalization and resilience in complex scenarios.
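
To illustrate the Stage II objective, the sketch below scores a group of candidate trajectories with a verifiable reward and converts the scores into group-relative advantages in the style of GRPO; the reward terms, weights, and group size are illustrative assumptions rather than the exact recipe used by VLA-R1.

import numpy as np

def verifiable_reward(pred_traj, gt_traj, has_valid_format=True):
    # Higher is better: mean waypoint error plus a simple format-check bonus.
    # The 0.5 bonus weight is an assumption for illustration.
    pred, gt = np.asarray(pred_traj), np.asarray(gt_traj)
    dist = np.linalg.norm(pred - gt, axis=-1).mean()
    return float(-dist + (0.5 if has_valid_format else 0.0))

def group_relative_advantages(rewards, eps=1e-6):
    # GRPO normalizes each reward against its own sampling group.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: G = 4 candidate trajectories sampled for the same image + instruction.
gt = [[0.40, 0.30], [0.38, 0.25], [0.35, 0.20]]
candidates = [
    [[0.41, 0.31], [0.39, 0.26], [0.36, 0.21]],   # close to ground truth
    [[0.50, 0.40], [0.45, 0.35], [0.40, 0.30]],   # moderate error
    [[0.10, 0.90], [0.20, 0.80], [0.30, 0.70]],   # far off
    [[0.40, 0.30], [0.38, 0.25], [0.35, 0.20]],   # exact match
]
rewards = [verifiable_reward(c, gt) for c in candidates]
advantages = group_relative_advantages(rewards)
print(advantages)  # positive for better-than-group-average candidates

Because each sampled output is judged only against the other outputs in its own group, GRPO needs no learned value baseline: better-than-average candidates receive positive advantages and are reinforced during the policy update.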

Visualization


Real-World Results

Pick up the yellow bowl. Move the yellow bowl to the white basket.
Grasp the plush toy on the table.
Move the bread into the microwave.
Put the bread into the bread box.
Pick the yellow bowl.
Grip the yellow bowl. Grip the red bowl. Grip the blue bowl. Grip the white bowl. Grip the green bowl.
Grasp the green bowl.
Transfer the bananas into the brown basket.


Simulated Results

Pick up the green bowl.
Pick the black box.
Grab the orange box.
Move the orange box to the empty spot on the table-top.

Citation


@article{song2025maniplvm,
  title={ManipLVM-R1: Reinforcement learning for reasoning in embodied manipulation with large vision-language models},
  author={Song, Zirui and Ouyang, Guangxian and Li, Mingzhe and Ji, Yuheng and Wang, Chenxi and Xu, Zixiang and Zhang, Zeyu and Zhang, Xiaoqing and Jiang, Qian and Chen, Zhenhao and others},
  journal={arXiv preprint arXiv:2505.16517},
  year={2025}
}