VT-Refine: Learning Bimanual Assembly with
Visuo-Tactile Feedback via Simulation Fine-Tuning

Binghao Huang1*, Jie Xu2, Iretiayo Akinola2, Wei Yang2, Balakumar Sundaralingam2, Rowland O'Flaherty2, Dieter Fox2, Xiaolong Wang2,3, Arsalan Mousavian2, Yu-Wei Chao2†, Yunzhu Li1†

1 Columbia University   2 NVIDIA   3 University of California, San Diego
* Work partially done during an internship   † Equal advising


Our Visuo-Tactile Diffusion Policy can be Fine-Tuned in Simulation Using RL!

Left: Fine-Tuned Policy Rollout in the Real World                 Right: Fine-Tuned Policy Rollout in Simulation

Abstract

Humans excel at bimanual assembly tasks by adapting to rich tactile feedback—a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning.

Tactile Sensor Hardware and Simulation

Left: Tactile Signals in the Real World                                 Right: Tactile Signals in Simulation
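Below is a minimal sketch of how such a piezoresistive pad, which reports normal force only, can be emulated for thousands of environments in a single GPU batch. The taxel resolution, contact stiffness, and noise scale are illustrative assumptions, not our calibrated sensor model.

# Toy tactile "renderer": converts per-taxel penetration depths into
# normal-force readings with a linear stiffness model, mimicking a
# piezoresistive pad that senses normal force only. GRID, STIFFNESS,
# and the noise scale are assumptions for illustration.
import torch

NUM_ENVS = 4096                      # parallel environments on one GPU
GRID = (16, 16)                      # assumed taxel resolution per pad
STIFFNESS = 2.0e3                    # assumed contact stiffness (N/m)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def render_tactile(penetration: torch.Tensor) -> torch.Tensor:
    """penetration: (NUM_ENVS, H, W) taxel penetration depths in meters
    (<= 0 means no contact). Returns per-taxel normal force in newtons,
    clamped at zero because the sensor cannot report tension."""
    force = STIFFNESS * penetration.clamp(min=0.0)
    force = force + 0.01 * torch.randn_like(force)   # sensor-noise stand-in
    return force.clamp(min=0.0)

# Example: random penetrations, roughly half in contact, one batch.
penetration = 1e-3 * torch.rand(NUM_ENVS, *GRID, device=DEVICE) - 5e-4
tactile = render_tactile(penetration)                # (4096, 16, 16)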



Capability Experiments

Nut-and-Bolt Assembly

Trial 1

Trial 2

Bayonet Joint Assembly

Trial 1

Trial 2

Peg-in-Hole Insertion

Trial 1 (Table-Top Bimanual Setup)

Trial 2 (Semi-Humanoid Setup)



Method

Visuo-Tactile Points Representation
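A minimal sketch of this representation, assuming the depth camera yields a scene point cloud and forward kinematics places each taxel in the world frame; the per-point feature layout (xyz plus a modality flag and normal force) is an illustrative choice rather than the policy's exact encoding.

# Fuse camera points and tactile taxels into one point set with
# per-point features: (x, y, z, modality flag, normal force).
import torch

def fuse_points(cam_points: torch.Tensor,
                taxel_positions: torch.Tensor,
                taxel_forces: torch.Tensor) -> torch.Tensor:
    """cam_points: (N, 3) depth-camera point cloud.
    taxel_positions: (M, 3) taxel centers in the world frame.
    taxel_forces: (M,) normal force per taxel.
    Returns (N + M, 5) fused visuo-tactile points."""
    vis = torch.cat([cam_points,
                     cam_points.new_zeros(cam_points.shape[0], 2)],  # flag=0, force=0
                    dim=1)
    tac = torch.cat([taxel_positions,
                     taxel_positions.new_ones(taxel_positions.shape[0], 1),  # flag=1
                     taxel_forces.unsqueeze(1)],
                    dim=1)
    return torch.cat([vis, tac], dim=0)

points = fuse_points(torch.rand(1024, 3), torch.rand(256, 3), torch.rand(256))
# `points` then conditions the policy through a point-cloud encoder.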

Stage 1: We collect real-world human demonstrations with synchronized visual and tactile modalities and pre-train a diffusion policy (a toy training-step sketch follows below).
Stage 2: We reproduce the same sensory modalities in simulation and fine-tune the pre-trained diffusion policy using policy-gradient-based RL (see the toy loop under Large-Scale Simulation Fine-Tuning below).
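Stage 1 in toy form: a standard diffusion-policy training step that learns to denoise demonstrated action chunks conditioned on the fused observation features. The MLP denoiser, horizon, and noise schedule are placeholder assumptions; the real model is larger and conditioned on the visuo-tactile points above.

# One diffusion-policy (DDPM-style) behavior-cloning step: corrupt a
# demonstrated action chunk with noise at a random timestep, then train
# the network to predict that noise. Sizes are illustrative.
import torch
import torch.nn as nn

ACT_DIM, HORIZON, OBS_DIM, T = 14, 16, 512, 100    # assumed sizes

denoiser = nn.Sequential(                           # stands in for the real U-Net
    nn.Linear(ACT_DIM * HORIZON + OBS_DIM + 1, 512), nn.ReLU(),
    nn.Linear(512, ACT_DIM * HORIZON))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)

def train_step(actions: torch.Tensor, obs_feat: torch.Tensor) -> torch.Tensor:
    """actions: (B, HORIZON*ACT_DIM) demo chunks; obs_feat: (B, OBS_DIM)."""
    t = torch.randint(0, T, (actions.shape[0],))
    eps = torch.randn_like(actions)
    a_bar = alpha_bar[t].unsqueeze(1)
    noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * eps   # forward diffusion
    inp = torch.cat([noisy, obs_feat, t.float().unsqueeze(1) / T], dim=1)
    loss = (denoiser(inp) - eps).pow(2).mean()                  # predict the noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

loss = train_step(torch.randn(64, ACT_DIM * HORIZON), torch.randn(64, OBS_DIM))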


Policy Robustness: Comparison with Baselines

Comparison of Before and After Fine-Tuning

Fine-Tune w/ Tactile (Ours): Successful In-hand Adjustment

Pre-Train w/ Tactile: Failed to Re-Align the Socket due to Occlusion

Fine-Tune w/ Tactile (Ours): Successful In-hand Adjustment

Pre-Train w/ Tactile: Failed to Recover from Unexpected Contact Dynamics

Comparison of w/ and w/o Tactile

Fine-Tune w/ Tactile (Ours): Accurate Insertion

Fine-Tune w/o Tactile: Incorrect Insertion Direction

Fine-Tune w/ Tactile (Ours): Accurate Insertion

Fine-Tune w/o Tactile: Failed Re-Adjustment Attempt



Policy Rollout in the Real World



Large-Scale Simulation Fine-Tuning

Table-Top Bimanual Setup

Semi-Humanoid Setup
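As a hedged illustration of the scale involved, the toy loop below applies a REINFORCE-style policy-gradient update across thousands of batched environments at once. The one-step surrogate reward, Gaussian policy head, and hyperparameters are placeholders; the actual system fine-tunes the pre-trained diffusion policy inside the physics simulator with its policy-gradient objective.

# Toy policy-gradient fine-tuning over NUM_ENVS batched environments.
# The quadratic surrogate reward and small Gaussian policy stand in
# for the simulator and the pre-trained diffusion policy.
import torch
import torch.nn as nn

NUM_ENVS, OBS_DIM, ACT_DIM = 4096, 32, 14          # assumed sizes

policy = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(),
                       nn.Linear(256, ACT_DIM))    # load pre-trained weights here
log_std = nn.Parameter(torch.zeros(ACT_DIM))
opt = torch.optim.Adam(list(policy.parameters()) + [log_std], lr=3e-4)

def reward(obs, act):
    # Surrogate: reward actions that cancel the observed alignment error.
    return -(act + obs[:, :ACT_DIM]).pow(2).mean(dim=1)

for it in range(200):
    obs = torch.randn(NUM_ENVS, OBS_DIM)           # batched "sim" states
    dist = torch.distributions.Normal(policy(obs), log_std.exp())
    act = dist.sample()
    logp = dist.log_prob(act).sum(dim=1)
    adv = reward(obs, act)
    adv = adv - adv.mean()                         # baseline subtraction
    loss = -(logp * adv.detach()).mean()           # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()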


BibTeX

@inproceedings{huang2025vtrefine,
  title={VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning},
  author={Huang, Binghao and Xu, Jie and Akinola, Iretiayo and Yang, Wei and Sundaralingam, Balakumar and O'Flaherty, Rowland and Fox, Dieter and Wang, Xiaolong and Mousavian, Arsalan and Chao, Yu-Wei and Li, Yunzhu},
  booktitle={RSS 2025 Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond},
  year={2025}
}
