VT-Refine: Learning Bimanual Assembly with
Visuo-Tactile Feedback via Simulation Fine-Tuning

Binghao Huang1*, Jie Xu2, Iretiayo Akinola2, Wei Yang2, Balakumar Sundaralingam2, Rowland O'Flaherty2, Dieter Fox2, Xiaolong Wang2,3, Arsalan Mousavian2, Yu-Wei Chao2†, Yunzhu Li1†

1 Columbia University   2 NVIDIA   3 University of California, San Diego
* Work partially done during an internship   † Equal advising


Our Visuo-Tactile Diffusion Policy can be Fine-Tuned in Simulation Using RL!

Left: Fine-Tuned Policy Rollout in the Real World                 Right: Fine-Tuned Policy Rollout in Simulation

Abstract

Humans excel at bimanual assembly tasks by adapting to rich tactile feedback—a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning.

Tactile Sensor Hardware and Simulation

Left: Tactile Signals in the Real World                                 Right: Tactile Signals in Simulation
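Below is a minimal sketch of how such a piezoresistive pad, which reports normal force only, can be emulated for thousands of environments in a single GPU batch. The taxel resolution, contact stiffness, and noise scale are illustrative assumptions, not our calibrated sensor model.

# Toy tactile "renderer": converts per-taxel penetration depths into
# normal-force readings with a linear stiffness model, mimicking a
# piezoresistive pad that senses normal force only. GRID, STIFFNESS,
# and the noise scale are assumptions for illustration.
import torch

NUM_ENVS = 4096                      # parallel environments on one GPU
GRID = (16, 16)                      # assumed taxel resolution per pad
STIFFNESS = 2.0e3                    # assumed contact stiffness (N/m)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def render_tactile(penetration: torch.Tensor) -> torch.Tensor:
    """penetration: (NUM_ENVS, H, W) taxel penetration depths in meters
    (<= 0 means no contact). Returns per-taxel normal force in newtons,
    clamped at zero because the sensor cannot report tension."""
    force = STIFFNESS * penetration.clamp(min=0.0)
    force = force + 0.01 * torch.randn_like(force)   # sensor-noise stand-in
    return force.clamp(min=0.0)

# Example: random penetrations, roughly half in contact, one batch.
penetration = 1e-3 * torch.rand(NUM_ENVS, *GRID, device=DEVICE) - 5e-4
tactile = render_tactile(penetration)                # (4096, 16, 16)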



Capability Experiments

Nut-and-Bolt Assembly

Trial 1

Trial 2

Bayonet Joint Assembly

Trial 1

Trial 2

Peg-in-Hole Insertion

Trial 1 (Table-Top Bimanual Setup)

Trial 2 (Semi-Humanoid Setup)



Method

Visuo-Tactile Points Representation
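A minimal sketch of this representation, assuming the depth camera yields a scene point cloud and forward kinematics places each taxel in the world frame; the per-point feature layout (xyz plus a modality flag and normal force) is an illustrative choice rather than the policy's exact encoding.

# Fuse camera points and tactile taxels into one point set with
# per-point features: (x, y, z, modality flag, normal force).
import torch

def fuse_points(cam_points: torch.Tensor,
                taxel_positions: torch.Tensor,
                taxel_forces: torch.Tensor) -> torch.Tensor:
    """cam_points: (N, 3) depth-camera point cloud.
    taxel_positions: (M, 3) taxel centers in the world frame.
    taxel_forces: (M,) normal force per taxel.
    Returns (N + M, 5) fused visuo-tactile points."""
    vis = torch.cat([cam_points,
                     cam_points.new_zeros(cam_points.shape[0], 2)],  # flag=0, force=0
                    dim=1)
    tac = torch.cat([taxel_positions,
                     taxel_positions.new_ones(taxel_positions.shape[0], 1),  # flag=1
                     taxel_forces.unsqueeze(1)],
                    dim=1)
    return torch.cat([vis, tac], dim=0)

points = fuse_points(torch.rand(1024, 3), torch.rand(256, 3), torch.rand(256))
# `points` then conditions the policy through a point-cloud encoder.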

Stage 1: We collect real-world human demonstrations with synchronized visual and tactile modalities and pre-train a diffusion policy (a toy training-step sketch follows below).
Stage 2: We reproduce the same sensory modalities in simulation and fine-tune the pre-trained diffusion policy using policy-gradient-based RL (see the toy loop under Large-Scale Simulation Fine-Tuning below).
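Stage 1 in toy form: a standard diffusion-policy training step that learns to denoise demonstrated action chunks conditioned on the fused observation features. The MLP denoiser, horizon, and noise schedule are placeholder assumptions; the real model is larger and conditioned on the visuo-tactile points above.

# One diffusion-policy (DDPM-style) behavior-cloning step: corrupt a
# demonstrated action chunk with noise at a random timestep, then train
# the network to predict that noise. Sizes are illustrative.
import torch
import torch.nn as nn

ACT_DIM, HORIZON, OBS_DIM, T = 14, 16, 512, 100    # assumed sizes

denoiser = nn.Sequential(                           # stands in for the real U-Net
    nn.Linear(ACT_DIM * HORIZON + OBS_DIM + 1, 512), nn.ReLU(),
    nn.Linear(512, ACT_DIM * HORIZON))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)

def train_step(actions: torch.Tensor, obs_feat: torch.Tensor) -> torch.Tensor:
    """actions: (B, HORIZON*ACT_DIM) demo chunks; obs_feat: (B, OBS_DIM)."""
    t = torch.randint(0, T, (actions.shape[0],))
    eps = torch.randn_like(actions)
    a_bar = alpha_bar[t].unsqueeze(1)
    noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * eps   # forward diffusion
    inp = torch.cat([noisy, obs_feat, t.float().unsqueeze(1) / T], dim=1)
    loss = (denoiser(inp) - eps).pow(2).mean()                  # predict the noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

loss = train_step(torch.randn(64, ACT_DIM * HORIZON), torch.randn(64, OBS_DIM))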


Policy Robustness: Comparison with Baselines

Comparison of Before and After Fine-Tuning

Fine-Tune w/ Tactile (Ours): Successful In-hand Adjustment

Pre-Train w/ Tactile: Failed to Re-Align the Socket due to Occlusion

Fine-Tune w/ Tactile (Ours): Successful In-hand Adjustment

Pre-Train w/ Tactile: Failed to Recover from Unexpected Contact Dynamics

Comparison of w/ and w/o Tactile

Fine-Tune w/ Tactile (Ours): Accurate Insertion

Fine-Tune w/o Tactile: Incorrect Insertion Direction

Fine-Tune w/ Tactile (Ours): Accurate Insertion

Fine-Tune w/o Tactile: Failed Re-Adjustment Attempt



Policy Rollout in the Real World



Large-Scale Simulation Fine-Tuning

Table-Top Bimanual Setup

Semi-Humanoid Setup
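As a hedged illustration of the scale involved, the toy loop below applies a REINFORCE-style policy-gradient update across thousands of batched environments at once. The one-step surrogate reward, Gaussian policy head, and hyperparameters are placeholders; the actual system fine-tunes the pre-trained diffusion policy inside the physics simulator with its policy-gradient objective.

# Toy policy-gradient fine-tuning over NUM_ENVS batched environments.
# The quadratic surrogate reward and small Gaussian policy stand in
# for the simulator and the pre-trained diffusion policy.
import torch
import torch.nn as nn

NUM_ENVS, OBS_DIM, ACT_DIM = 4096, 32, 14          # assumed sizes

policy = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(),
                       nn.Linear(256, ACT_DIM))    # load pre-trained weights here
log_std = nn.Parameter(torch.zeros(ACT_DIM))
opt = torch.optim.Adam(list(policy.parameters()) + [log_std], lr=3e-4)

def reward(obs, act):
    # Surrogate: reward actions that cancel the observed alignment error.
    return -(act + obs[:, :ACT_DIM]).pow(2).mean(dim=1)

for it in range(200):
    obs = torch.randn(NUM_ENVS, OBS_DIM)           # batched "sim" states
    dist = torch.distributions.Normal(policy(obs), log_std.exp())
    act = dist.sample()
    logp = dist.log_prob(act).sum(dim=1)
    adv = reward(obs, act)
    adv = adv - adv.mean()                         # baseline subtraction
    loss = -(logp * adv.detach()).mean()           # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()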


BibTeX

@inproceedings{huang2025vtrefine,
  title={VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning},
  author={Huang, Binghao and Xu, Jie and Akinola, Iretiayo and Yang, Wei and Sundaralingam, Balakumar and O'Flaherty, Rowland and Fox, Dieter and Wang, Xiaolong and Mousavian, Arsalan and Chao, Yu-Wei and Li, Yunzhu},
  booktitle={RSS 2025 Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond},
  year={2025}
}
