Tactile and visual perception are both crucial for humans to perform fine-grained interactions with their
environment. Developing similar multi-modal sensing capabilities for robots can significantly enhance and
expand their manipulation skills. This paper introduces 3D-ViTac, a multi-modal sensing and
learning system designed for dexterous bimanual manipulation. Our system features tactile sensors equipped
with dense sensing units, each covering an area of 3 mm². These sensors are low-cost and flexible,
providing detailed and extensive coverage of physical contacts, effectively complementing visual
information.
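As a rough illustration of how such a dense tactile pad could be exposed to downstream learning, the following sketch converts a grid of per-taxel pressure readings into 3D contact points using an assumed pad pose. The 16×16 resolution, the 3 mm taxel pitch, the threshold, and the `pad_pose` transform are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

# Illustrative pad geometry (assumed): a 16x16 grid of taxels,
# each ~3 mm^2, laid out on a plane in the pad's local frame.
N_ROWS, N_COLS = 16, 16
PITCH_M = 0.003  # 3 mm spacing between taxel centers (assumption)

def taxels_to_points(pressure_grid: np.ndarray,
                     pad_pose: np.ndarray,
                     threshold: float = 0.05) -> np.ndarray:
    """Convert a (16, 16) pressure grid into 3D contact points.

    pad_pose is a 4x4 homogeneous transform from the pad's local
    frame to the robot/world frame (assumed to come from the
    gripper's forward kinematics).
    Returns an (M, 4) array: xyz position + pressure per active taxel.
    """
    rows, cols = np.nonzero(pressure_grid > threshold)
    # Taxel centers in the pad's local frame (z = 0 plane).
    local = np.stack([
        cols * PITCH_M,
        rows * PITCH_M,
        np.zeros_like(rows, dtype=float),
        np.ones_like(rows, dtype=float),
    ], axis=-1)
    world = (pad_pose @ local.T).T[:, :3]
    pressures = pressure_grid[rows, cols][:, None]
    return np.concatenate([world, pressures], axis=-1)
```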
To integrate tactile and visual data, we fuse them into a unified 3D representation space that preserves
their 3D structures and spatial relationships. The multi-modal representation can then be coupled with
diffusion policies for imitation learning. Through concrete hardware experiments, we demonstrate that even
low-cost robots can perform precise manipulations and significantly outperform vision-only policies,
particularly in safely interacting with fragile items and in executing long-horizon tasks that involve
in-hand manipulation.
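To make the fusion step concrete, here is a minimal sketch of how tactile contact points and a camera point cloud might be merged into a single 3D observation before being passed to a policy. The per-point pressure channel, the modality flag, the fixed observation size, and the `diffusion_policy` call are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fuse_observation(visual_points: np.ndarray,
                     tactile_points: np.ndarray,
                     n_points: int = 1024) -> np.ndarray:
    """Fuse visual and tactile points into one 3D observation.

    visual_points:  (Nv, 3) xyz from the depth camera.
    tactile_points: (Nt, 4) xyz + pressure from the tactile pads.
    Returns (n_points, 5): xyz, pressure (0 for visual points),
    and a modality flag (0 = visual, 1 = tactile). The feature
    layout here is an assumed example.
    """
    vis = np.concatenate([
        visual_points,
        np.zeros((len(visual_points), 1)),   # no pressure reading
        np.zeros((len(visual_points), 1)),   # modality flag 0
    ], axis=-1)
    tac = np.concatenate([
        tactile_points,
        np.ones((len(tactile_points), 1)),   # modality flag 1
    ], axis=-1)
    fused = np.concatenate([vis, tac], axis=0)
    # Randomly downsample (or pad with repeats) to a fixed size for batching.
    idx = np.random.choice(len(fused), size=n_points,
                           replace=len(fused) < n_points)
    return fused[idx]

# Hypothetical usage:
# obs = fuse_observation(camera_cloud, taxels_to_points(grid, pose))
# action = diffusion_policy(obs)  # placeholder policy interface
```

Because both modalities are expressed as points in the same robot frame, their spatial relationships are preserved in the fused observation, which is the property the unified 3D representation relies on.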