Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging

UC Berkeley, * Core Contributors

TL;DR

Generalist robot policies overfit severely when finetuned to a target task with limited demonstrations: not only do they fail to generalize to simple variations of the target task, they also lose their generalist capabilities. We find that simply interpolating the weights of the pretrained and finetuned models retains generalist abilities and, further, transfers generalist knowledge to the target task, enabling generalization to more complex variations of it. Based on this observation, we introduce a simple method called RETAIN.

RETAIN Enables Robust Finetuning

In practice, when we finetune a policy, we don't simply want it to work only in the setting where we collected the finetuning demonstration, but for it to complete the demonstrated task in a variety of contexts or scenes. Therefore, we evaluate the performance of the finetuned policy in three settings:
  • Target task in-distribution (ID): measures policy performance on the exact task observed in the finetuning dataset.
  • Target task out-of-distribution (OOD): measures the performance on the target task, in scenarios not observed in the finetuning dataset, such as changes in object instances, backgrounds, lighting conditions and camera angles. This measures the robustness of the finetuned policy.
  • Generalist tasks: measures policy performance on tasks other than the target task, but for which we would expect the generalist policy to perform reasonably. This measures how well the finetuned policy retains generalist capabilities from the pretrained model.
We find that RETAIN achieves robust and generalizable finetuning of vision-language-action (VLA) robot policies. The two example tasks below show how RETAIN compares to SFT (supervised finetuning of all parameters on the target task) across the three evaluation settings.

Example Task 1: Placing Plates in a Dish Rack

In this task, the robot must precisely grab a plate and insert it vertically into the grooves of a dish rack.

In-Distribution: SFT succeeds

In the in-distribution evaluation (the exact task setup observed in the finetuning dataset), standard SFT succeeds as expected. However, when we evaluate on out-of-distribution scenarios (variations of the same semantic task; see examples below), SFT policies fail to generalize while RETAIN policies succeed.

  • Out-of-Distribution Scenario 1: SFT fails; RETAIN succeeds
  • Out-of-Distribution Scenario 2: SFT fails; RETAIN succeeds
  • Out-of-Distribution Scenario 3: SFT fails; RETAIN succeeds

Example Task 2: Wiping Whiteboard

This task involves grabbing an eraser and wiping text off a whiteboard. As with the previous task, SFT policies fail to generalize to OOD variations of the task, while RETAIN policies succeed.

  • In-Distribution: SFT succeeds
  • Out-of-Distribution Scenario 1: SFT fails; RETAIN succeeds
  • Out-of-Distribution Scenario 2: SFT fails; RETAIN succeeds
  • Out-of-Distribution Scenario 3: SFT fails; RETAIN succeeds

RETAIN Preserves Generalist Capabilities

In addition to achieving robust generalization to OOD variations of the target task, RETAIN also preserves the generalist capabilities of the pretrained model. In the following generalist evaluations, we evaluate the RETAIN policy on tasks from the pretraining distribution.

put the spoon in the dishrack
put the marker in the cup
wipe the table
close the drawer
put the tape in the purple bowl
put the plate on the table
put the black sponge in the blue bowl
put the stapler on the notebook
put the watermelon in the purple bowl
put the red bottle in the black bowl

Motivation

Generalist robot policies have strong capabilities, but still require adaptation for new downstream tasks. Naive finetuning approaches overfit: they fail to preserve the pretrained model's generality and cannot robustly generalize beyond the narrow conditions present in the limited finetuning dataset. This creates a critical need for methods that leverage the broad competencies of the pretrained model to learn generalizable skills.

Naive Finetuning Suffers from Overfitting

As an example, we finetune a VLA on 3 different LIBERO tasks with standard SFT, training all parameters of the model on the target task. While ID performance improves with more gradient updates, generalist performance degrades drastically as we finetune for longer: naive finetuning overfits the model to the exact dataset distribution and suffers from catastrophic forgetting of pretraining abilities. Moreover, there is a large gap between ID and OOD performance, showing that the overfit model fails to generalize even to small variations of the target task.

Overfitting to specific environments

A Simple Solution: RETAIN

robust finetuning of VLA robot policies

Model Merging

$$\tilde{\theta} = (1 - \alpha) \cdot \theta_{pre} + \alpha \cdot \theta_{ft}$$

We combine the pretrained model \(\theta_{pre}\) and the finetuned model \(\theta_{ft}\) directly in weight space via linear interpolation, where \(\theta_{pre}\) denotes the pretrained parameters, \(\theta_{ft}\) denotes the parameters after task-specific finetuning, and \(\alpha \in [0, 1]\) controls the balance between generalization and task specialization. Intuitively, the merged model \(\tilde{\theta}\) combines the generalization capabilities of the pretrained model with the task-specific adaptation of the finetuned model, allowing it to robustly learn a more general version of the target task.
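
As a concrete illustration, here is a minimal sketch of this merge over PyTorch-style state dicts; the function name, checkpoint paths, and default \(\alpha\) are our own placeholders, not the released implementation.

    import torch

    def merge_weights(pretrained_state, finetuned_state, alpha=0.5):
        """Linear interpolation of two checkpoints: (1 - alpha) * pretrained + alpha * finetuned.

        alpha = 0 recovers the pretrained model, alpha = 1 the finetuned model.
        Both state dicts must come from the same architecture.
        """
        merged = {}
        for name, pre_param in pretrained_state.items():
            ft_param = finetuned_state[name]
            if torch.is_floating_point(pre_param):
                merged[name] = (1.0 - alpha) * pre_param + alpha * ft_param
            else:
                # Non-float entries (e.g. integer buffers) are taken from the finetuned model.
                merged[name] = ft_param.clone()
        return merged

    # Hypothetical usage; the checkpoint filenames are placeholders.
    # pre = torch.load("vla_pretrained.pt")
    # ft = torch.load("vla_task_finetuned.pt")
    # merged = merge_weights(pre, ft, alpha=0.5)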

Co-Finetuning

We consider two finetuning settings: task-finetuning (task-FT), where we finetune the entire model on a target task dataset \(\mathcal{D}_{\tau}\), and co-finetuning (co-FT), where we finetune on a mixture of the target task dataset \(\mathcal{D}_{\tau}\) and the pretraining dataset \(\mathcal{D}_{pre}\). We find that when we have access to the pretraining data, model merging with co-finetuning (RETAIN-co-FT) outperforms model merging in the task-FT setting (RETAIN-task-FT).
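
For intuition, below is a minimal sketch of how a co-finetuning batch stream might interleave the two datasets; the 50/50 mixing ratio and the generator interface are illustrative assumptions, not the paper's training recipe.

    import random
    from itertools import cycle

    def cofinetune_batches(target_batches, pretraining_batches, target_fraction=0.5, seed=0):
        """Yield batches, drawing from the target-task data with probability
        `target_fraction` and from the pretraining data otherwise."""
        rng = random.Random(seed)
        target_iter = cycle(target_batches)
        pretrain_iter = cycle(pretraining_batches)
        while True:
            if rng.random() < target_fraction:
                yield next(target_iter)
            else:
                yield next(pretrain_iter)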

Continual Task Adaptation

We can use RETAIN to sequentially add tasks to a pretrained checkpoint by iteratively merging finetuned weights into the base model and continuing to finetune from the merged checkpoint. This is done by the following formula:

Continual task adaptation
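
As a sketch of the recursion this describes (our notation, which may differ from the figure above): starting from \(\tilde{\theta}_0 = \theta_{pre}\), for each new task \(k\) we finetune from the current merged checkpoint and merge back with coefficient \(\alpha\),

$$\theta_{ft,k} = \mathrm{Finetune}\left(\tilde{\theta}_{k-1},\, \mathcal{D}_{\tau_k}\right), \qquad \tilde{\theta}_k = (1 - \alpha) \cdot \tilde{\theta}_{k-1} + \alpha \cdot \theta_{ft,k}$$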

Results

RETAIN Performance on DROID

On two DROID real-robot tasks, RETAIN consistently achieves superior out-of-distribution and generalist performance compared to the baseline methods, while being comparable in in-distribution performance.

Main DROID results comparison

RETAIN Scales with Pretraining Data

RETAIN improves OOD generalization substantially more when the pretrained model is trained on more data, and therefore has stronger generalist capabilities to merge in.

RETAIN scaling with pretraining data

RETAIN Enables Continual Learning of Tasks

We sequentially apply RETAIN to learn the DROID plates and whiteboard tasks in sequence. Compared to the strongest baseline of sequential co-finetuning on both tasks' datasets, sequential RETAIN is able to maintain its robust generalization to out-of-distribution scenarios for both tasks.

Sequential application of RETAIN

Language Parameters Matter Most for Merging

Vision-Language-Action models consist of a vision encoder (v), language model backbone (l), and often an action expert (a). We study the effect of merging different parameter groups on performance:

$$ \begin{pmatrix} \tilde{\theta}_v \\ \tilde{\theta}_l \\ \tilde{\theta}_a \end{pmatrix} = \left[1-\begin{pmatrix} \alpha_v \\ \alpha_l \\ \alpha_a \end{pmatrix}\right] \cdot \begin{pmatrix} \theta_{pre,v} \\ \theta_{pre,l} \\ \theta_{pre,a} \end{pmatrix} + \begin{pmatrix} \alpha_v \\ \alpha_l \\ \alpha_a \end{pmatrix} \cdot \begin{pmatrix} \theta_{ft,v} \\ \theta_{ft,l} \\ \theta_{ft,a} \end{pmatrix} $$

After conducting a grid search over the coefficients \(\alpha_v\), \(\alpha_l\), and \(\alpha_a\) in simulation, we find that the largest variation in OOD performance is achieved when \(\alpha_l\) is varied, as seen in the color gradient of the plot cube below. We also find that the best performance for a given \(\alpha_l\) is achieved by setting \(\alpha_v = \alpha_a = 1\), as seen in the second plot below.

These results suggest that during model merging, it may suffice to only merge the parameters of the language model backbone (\(\alpha_l < 1\), \(\alpha_v = \alpha_a = 1\)). To validate this hypothesis, we compare the OOD performance of merging all parameters with RETAIN to only merging language model parameters on 3 LIBERO tasks. The results are shown in the bar chart below.

Plots: vision encoder merging, language model backbone merging, and action expert merging.
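
A minimal sketch of language-only merging following these findings is shown below; it assumes the language-backbone parameters can be identified by a name prefix (the prefix used here is a placeholder and depends on the VLA codebase).

    import torch

    def merge_language_backbone_only(pretrained_state, finetuned_state, alpha_l=0.5,
                                     lm_prefix="language_model."):
        """Interpolate only the language-backbone parameters (alpha_l < 1) and keep
        the finetuned vision encoder and action expert unchanged (alpha_v = alpha_a = 1)."""
        merged = {}
        for name, pre_param in pretrained_state.items():
            ft_param = finetuned_state[name]
            if name.startswith(lm_prefix) and torch.is_floating_point(pre_param):
                merged[name] = (1.0 - alpha_l) * pre_param + alpha_l * ft_param
            else:
                merged[name] = ft_param.clone()
        return merged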

BibTeX

      @article{yadav2025retain,
        author  = {Yajat Yadav and Zhiyuan Zhou and Andrew Wagenmaker and Karl Pertsch and Sergey Levine},
        title   = {Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging},
        journal = {arXiv preprint arXiv:2512.08333},
        year    = {2025},
        url     = {https://arxiv.org/abs/2512.08333},
      }