Continual Learning VLA Models via Reinforcement Fine-Tuning

Yuan Liu1,2 †, Haoran Li1,3,4 ✉, Shuai Tian1,3, Yuxing Qin1,3, Yuhui Chen1,3, Yupeng Zheng1,3,
Yongzhen Huang2, Dongbin Zhao1,3,4
1SKL-MAIS, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China
2School of Artificial Intelligence, Beijing Normal University, Beijing, China
3School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
4Beijing Academy of Artificial Intelligence, Beijing, China
†Work done during an internship at CASIA    ✉Corresponding author

Abstract

Pretrained on large-scale and diverse datasets, Vision-Language-Action (VLA) models demonstrate strong generalization and adaptability as general-purpose robotic policies. However, Supervised Fine-Tuning (SFT), the primary mechanism for adapting VLAs to downstream domains, requires substantial amounts of task-specific data and is prone to catastrophic forgetting.

To address these limitations, we propose LifeLong-RFT, a simple yet effective Reinforcement Fine-Tuning (RFT) strategy for VLA models that requires neither online environmental feedback nor pretrained reward models. By integrating chunking-level on-policy reinforcement learning with the proposed Multi-Dimensional Process Reward (MDPR) mechanism, LifeLong-RFT quantifies the heterogeneous contributions of intermediate action chunks across three dimensions to facilitate policy optimization.

Specifically, (1) the Quantized Action Consistency Reward (QACR) ensures accurate action prediction within the discrete action space; (2) the Continuous Trajectory Alignment Reward (CTAR) aligns decoded continuous action chunks with reference trajectories to ensure precise control; (3) the Format Compliance Reward (FCR) guarantees the structural validity of outputs.
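The three reward terms can be illustrated with a minimal sketch. It is not the paper's implementation: the `<action>` tag format, the 256-bin uniform discretization, the exponential distance kernel, the equal weights, and all function names below are assumptions chosen only to make the idea concrete.

```python
import math
import re

# Hypothetical output format: integer action tokens wrapped in <action> tags.
ACTION_PATTERN = re.compile(r"<action>\s*(\d+(?:\s+\d+)*)\s*</action>")
NUM_BINS = 256  # assumed number of discretization bins per action dimension


def parse_tokens(output_text):
    """Return the list of integer action tokens, or None if the format is invalid."""
    match = ACTION_PATTERN.fullmatch(output_text.strip())
    if match is None:
        return None
    return [int(tok) for tok in match.group(1).split()]


def format_reward(output_text, expected_len):
    """FCR-style term: 1.0 only if the output parses, has the expected
    number of tokens, and every token is a valid bin index."""
    tokens = parse_tokens(output_text)
    if tokens is None or len(tokens) != expected_len:
        return 0.0
    return 1.0 if all(0 <= t < NUM_BINS for t in tokens) else 0.0


def quantized_reward(pred_tokens, ref_tokens):
    """QACR-style term: fraction of discrete tokens matching the reference."""
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / len(ref_tokens)


def decode(tokens, low=-1.0, high=1.0):
    """Map bin indices to bin centers in [low, high] (assumed uniform binning)."""
    width = (high - low) / NUM_BINS
    return [low + (t + 0.5) * width for t in tokens]


def trajectory_reward(pred_tokens, ref_actions):
    """CTAR-style term: exp(-mean absolute error) between the decoded
    continuous actions and the reference trajectory (assumed kernel)."""
    pred_actions = decode(pred_tokens)
    mae = sum(abs(p - r) for p, r in zip(pred_actions, ref_actions)) / len(ref_actions)
    return math.exp(-mae)


def chunk_reward(output_text, ref_tokens, ref_actions, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms for one action chunk.
    An output that fails the format check receives zero reward."""
    fcr = format_reward(output_text, len(ref_tokens))
    if fcr == 0.0:
        return 0.0
    tokens = parse_tokens(output_text)
    w_q, w_c, w_f = weights
    return (w_q * quantized_reward(tokens, ref_tokens)
            + w_c * trajectory_reward(tokens, ref_actions)
            + w_f * fcr)
```

In this sketch, a chunk that misses one token by a single bin loses part of the discrete term but keeps almost all of the continuous term, which shows how the two signals can complement each other. Every reward here is computed from reference data alone, consistent with the claim that no online environment or learned reward model is needed.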

Comprehensive experiments across SimplerEnv, LIBERO, and real-world tasks demonstrate that LifeLong-RFT exhibits strong performance in multi-task learning. Furthermore, for continual learning on the LIBERO benchmark, our method achieves a 22% gain in average success rate over SFT, while effectively adapting to new tasks using only 20% of the training data. Overall, our method provides a promising post-training paradigm for VLAs.

Method

Overview of the proposed LifeLong-RFT. This strategy integrates the chunking-level on-policy reinforcement learning algorithm with the Multi-Dimensional Process Reward mechanism to facilitate policy optimization.

Real-World Experiments

Real-World Multi-Task Results
Multi-task learning performance on real-world tasks.
Real-World Continual Results
Continual learning performance on real-world tasks.

Adaptation Efficiency

Adaptation Efficiency
Adaptation efficiency on representative new tasks.

Cite our paper

@article{liu2026towards,
  title={Towards long-lived robots: Continual learning {VLA} models via reinforcement fine-tuning},
  author={Liu, Yuan and Li, Haoran and Tian, Shuai and Qin, Yuxing and Chen, Yuhui and Zheng, Yupeng and Huang, Yongzhen and Zhao, Dongbin},
  journal={arXiv preprint arXiv:2602.10503},
  year={2026}
}