Generalist robot policies overfit severely when finetuned to a target task with limited demonstrations: not only do they fail to generalize to simple variations of the target task, they also lose their generalist capabilities. We find that simply interpolating the weights of the pretrained and finetuned models retains generalist abilities and, further, transfers that generalist knowledge to the target task, enabling generalization to more complex variations of it. We introduce RETAIN, a simple method based on this insight.
In this task, the robot must precisely grab a plate and insert it vertically into the grooves of a dish rack.
In in-distribution evaluation (evaluation on the exact task observed in the finetuning dataset), standard SFT succeeds as expected. However, when we evaluate the policy on out-of-distribution scenarios (variations of the same semantic task; see examples below), SFT policies fail to generalize while RETAIN policies succeed.
This task involves grabbing an eraser and wiping text off a whiteboard. Here too, SFT policies fail to generalize to OOD variations of the task, while RETAIN policies succeed.
In addition to achieving robust generalization to OOD variations of the target task, RETAIN also preserves the generalist capabilities of the pretrained model. In the following generalist evaluations, we evaluate the RETAIN policy on tasks from the pretraining distribution.
Generalist robot policies have strong capabilities, but still require adaptation for new downstream tasks. Naive finetuning suffers from overfitting: it fails to preserve the pretrained model's generality and cannot robustly generalize beyond the narrow conditions present in the limited finetuning dataset. This creates a critical need for methods that leverage broad pretrained competencies to learn generalized skills.
As an example, we finetune a VLA on 3 different LIBERO tasks with standard SFT, training all parameters of the model on the target task. While ID performance improves with gradient updates, generalist performance drastically degrades as we finetune for longer. This shows that naive finetuning overfits the model to the exact dataset distribution and suffers from catastrophic forgetting of pretraining abilities. Moreover, there is a large gap between ID and OOD performance, suggesting that, because of overfitting, the model fails to generalize even to small variations of the target task.
Model Merging
$$\tilde{\theta} = (1 - \alpha) \cdot \theta_{pre} + \alpha \cdot \theta_{ft}$$
We can combine the pretrained model \(\theta_{pre}\) and the finetuned model \(\theta_{ft}\) directly in weight space with linear interpolation, where \(\theta_{pre}\) represents the pretrained model parameters, \(\theta_{ft}\) represents the parameters after task-specific finetuning, and \(\alpha \in [0, 1]\) controls the balance between generalization and task specialization. Intuitively, the merged model \(\tilde{\theta}\) combines the generalization capabilities of the pretrained model with the task-specific adaptation of the finetuned model to robustly learn a more general version of the target task.
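A minimal sketch of this interpolation, assuming PyTorch-style state dicts, is shown below; the checkpoint filenames and the choice \(\alpha = 0.5\) are illustrative placeholders rather than the exact RETAIN implementation.

```python
import torch

def merge_state_dicts(pre_sd, ft_sd, alpha):
    """Linearly interpolate two checkpoints: (1 - alpha) * pretrained + alpha * finetuned."""
    merged = {}
    for name, pre_param in pre_sd.items():
        ft_param = ft_sd[name]
        if torch.is_floating_point(pre_param):
            merged[name] = (1.0 - alpha) * pre_param + alpha * ft_param
        else:
            # Non-float entries (e.g. integer step counters) are taken from the finetuned model.
            merged[name] = ft_param
    return merged

# Hypothetical usage: file names and alpha are placeholders.
pre_sd = torch.load("pretrained_vla.pt", map_location="cpu")
ft_sd = torch.load("finetuned_vla.pt", map_location="cpu")
merged_sd = merge_state_dicts(pre_sd, ft_sd, alpha=0.5)
torch.save(merged_sd, "retain_merged_vla.pt")
```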
Co-Finetuning
We consider two finetuning settings: task-finetuning (task-FT), where we finetune the entire model on a target task dataset \(\mathcal{D}_{\tau}\), and co-finetuning (co-FT), where we finetune on a mix of the target task dataset \(\mathcal{D}_{\tau}\) and the pretraining dataset \(\mathcal{D}_{pre}\). We find that when we have access to the pretraining data, model merging with co-finetuning (RETAIN-co-FT) outperforms model merging in the task-FT setting (RETAIN-task-FT).
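As an illustration of the co-finetuning data pipeline, the hedged sketch below mixes samples from \(\mathcal{D}_{\tau}\) and \(\mathcal{D}_{pre}\) at a fixed ratio; the 50/50 mixing ratio and the in-memory datasets are assumptions made for the example, not the paper's exact sampling scheme.

```python
import random

def co_finetuning_stream(task_dataset, pretrain_dataset, task_fraction=0.5, seed=0):
    """Yield an endless stream of training examples drawn from a mixture of
    the target-task dataset D_tau and the pretraining dataset D_pre."""
    rng = random.Random(seed)
    while True:
        if rng.random() < task_fraction:
            yield rng.choice(task_dataset)      # sample from D_tau
        else:
            yield rng.choice(pretrain_dataset)  # sample from D_pre
```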
Continual Task Adaptation
We can use RETAIN to sequentially add tasks to a pretrained checkpoint by iteratively merging finetuned weights into the base model and continuing to finetune from the merged checkpoint:

$$\tilde{\theta}_k = (1 - \alpha) \cdot \tilde{\theta}_{k-1} + \alpha \cdot \theta_{ft,k}, \qquad \tilde{\theta}_0 = \theta_{pre},$$

where \(\theta_{ft,k}\) denotes the weights obtained by finetuning the previous merged checkpoint \(\tilde{\theta}_{k-1}\) on the \(k\)-th task.
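The loop below sketches this sequential procedure, reusing the merge_state_dicts helper from the earlier sketch; the finetune argument stands in for a full training run and is a hypothetical placeholder.

```python
def sequential_retain(theta_pre, task_datasets, alpha, finetune):
    """Add tasks one at a time: finetune from the current merged checkpoint,
    then merge the result back into it (theta_0 is the pretrained model)."""
    theta_merged = theta_pre
    for task_data in task_datasets:
        theta_ft = finetune(theta_merged, task_data)                     # finetune on the new task
        theta_merged = merge_state_dicts(theta_merged, theta_ft, alpha)  # merge back into the checkpoint
    return theta_merged
```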
On two DROID real-robot tasks, RETAIN consistently achieves superior out-of-distribution and generalist performance compared to the baseline methods, while being comparable in in-distribution performance.
RETAIN helps OOD generalization much more when the pretrained model is trained on more data, and therefore has more generalist capabilities for "merging".
We sequentially apply RETAIN to learn the DROID plates and whiteboard tasks in sequence. Compared to the strongest baseline of sequential co-finetuning on both tasks' datasets, sequential RETAIN is able to maintain its robust generalization to out-of-distribution scenarios for both tasks.
Vision-Language-Action models consist of a vision encoder (v), language model backbone (l), and often an action expert (a). We study the effect of merging different parameter groups on performance:
$$ \begin{pmatrix} \tilde{\theta}_v \\ \tilde{\theta}_l \\ \tilde{\theta}_a \end{pmatrix} = \left[1-\begin{pmatrix} \alpha_v \\ \alpha_l \\ \alpha_a \end{pmatrix}\right] \cdot \begin{pmatrix} \theta_{pre,v} \\ \theta_{pre,l} \\ \theta_{pre,a} \end{pmatrix} + \begin{pmatrix} \alpha_v \\ \alpha_l \\ \alpha_a \end{pmatrix} \cdot \begin{pmatrix} \theta_{ft,v} \\ \theta_{ft,l} \\ \theta_{ft,a} \end{pmatrix} $$
After conducting a grid search over the coefficients \(\alpha_v\), \(\alpha_l\), and \(\alpha_a\) in simulation, we find that the largest variation in OOD performance is driven by \(\alpha_l\), as seen in the color gradient of the cube plot below. We also find that the best performance for a given \(\alpha_l\) is achieved by setting \(\alpha_v = \alpha_a = 1\), as seen in the second plot below.
These results suggest that during model merging, it may suffice to only merge the parameters of the language model backbone (\(\alpha_l < 1\), \(\alpha_v = \alpha_a = 1\)). To validate this hypothesis, we compare the OOD performance of merging all parameters with RETAIN to only merging language model parameters on 3 LIBERO tasks. The results are shown in the bar chart below.
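A component-wise variant of the merge, corresponding to the setting \(\alpha_l < 1\) with \(\alpha_v = \alpha_a = 1\), might look like the sketch below; the parameter-name prefixes used to identify the vision encoder, language backbone, and action expert are assumptions about the checkpoint layout, not the actual model's naming.

```python
import torch

def merge_by_component(pre_sd, ft_sd, alphas):
    """Interpolate each parameter group with its own coefficient; groups with
    alpha = 1 are taken entirely from the finetuned model."""
    merged = {}
    for name, pre_param in pre_sd.items():
        # Assumed naming convention for the three parameter groups.
        if name.startswith("vision_encoder."):
            alpha = alphas["v"]
        elif name.startswith("language_model."):
            alpha = alphas["l"]
        else:
            alpha = alphas["a"]
        if torch.is_floating_point(pre_param):
            merged[name] = (1.0 - alpha) * pre_param + alpha * ft_sd[name]
        else:
            merged[name] = ft_sd[name]  # copy non-float buffers from the finetuned model
    return merged

# Merge only the language backbone (alpha_l < 1) while keeping the vision
# encoder and action expert fully finetuned (alpha_v = alpha_a = 1).
# pre_sd and ft_sd are the state dicts loaded in the earlier merging sketch.
merged_sd = merge_by_component(pre_sd, ft_sd, alphas={"v": 1.0, "l": 0.5, "a": 1.0})
```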
@article{yadav2025retain,
author = {Yajat Yadav and Zhiyuan Zhou and Andrew Wagenmaker and Karl Pertsch and Sergey Levine},
title = {Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging},
journal = {arXiv preprint arXiv:2512.08333},
year = {2025},
url = {https://arxiv.org/abs/2512.08333},
}