Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning

Albert Wilcox, Mohamed Ghanem, Masoud Moghani, Pierre Barroso, Benjamin Joffe, Animesh Garg
Georgia Institute of Technology · Georgia Tech Research Institute · University of Toronto

TL;DR: Adapt3R is a 3D perception encoder that, when combined with any of a variety of IL algorithms, enables zero-shot transfer to unseen embodiments and camera viewpoints.

Abstract

Imitation learning (IL) is a compelling method for training robots to complete a wide variety of manipulation tasks. While many modern IL algorithms use RGB observations by default, several recent works have shown that 3D scene representations lifted from calibrated RGBD cameras can be useful for completing more complicated tasks, generalizing between camera viewpoints, generalizing to new instances of objects, and learning in low-data regimes. However, these works generally focus on 3D keyframe prediction, utilize the scene information in a way that is highly specific to one action decoder, or omit semantic information important to task success. In this work, we argue that these 3D scene representations are useful for a variety of IL algorithms, even those originally designed to work with 2D inputs. To that end, we introduce Adaptive 3D Scene Representation (Adapt3R), a general-purpose 3D observation encoder. Adapt3R uses a novel architecture to synthesize data from one or more depth cameras into a single vector that can then be used as conditioning for a variety of IL algorithms. We show that, when combined with SOTA multitask IL algorithms, Adapt3R maintains their multitask learning capacity while enabling zero-shot transfer to novel embodiments and camera poses.

Adapt3R is a unified method for extracting scene representations from RGBD inputs for imitation learning, designed to work well with a variety of state-of-the-art IL algorithms. It first lifts pre-trained foundation model features computed on the RGBD inputs into a 3D scene representation. Then, after a carefully designed point cloud processing step, it uses attention pooling to compress the point cloud into a single vector z, which can be used as conditioning for a policy in an end-to-end learning setup.
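To make the pipeline concrete, below is a minimal PyTorch sketch of the two core steps described above: back-projecting calibrated RGBD pixels into a world-frame point cloud, and attention-pooling per-point features into a single conditioning vector z. The shapes, module layout, and names (lift_rgbd_to_points, AttentionPool) are illustrative assumptions rather than the authors' implementation; in the full method the per-point features come from a pre-trained 2D foundation model backbone and pass through additional point cloud processing before pooling.

import torch
import torch.nn as nn


def lift_rgbd_to_points(depth, intrinsics, extrinsics):
    """Back-project a depth map (B, H, W) into world-frame 3D points
    (B, H*W, 3) using pinhole intrinsics K (B, 3, 3) and camera-to-world
    extrinsics T (B, 4, 4). Shapes and conventions are assumptions."""
    B, H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, device=depth.device),
        torch.arange(W, device=depth.device),
        indexing="ij",
    )
    # Homogeneous pixel coordinates (u, v, 1) for every pixel.
    uv1 = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # (H, W, 3)
    # Camera-frame rays K^{-1} [u, v, 1]^T, scaled by the measured depth.
    rays = torch.einsum("bij,hwj->bhwi", torch.inverse(intrinsics), uv1)
    pts_cam = rays * depth.unsqueeze(-1)                            # (B, H, W, 3)
    # Transform to the world frame with the calibrated extrinsics.
    ones = torch.ones_like(depth).unsqueeze(-1)
    pts_h = torch.cat([pts_cam, ones], dim=-1)                      # (B, H, W, 4)
    pts_world = torch.einsum("bij,bhwj->bhwi", extrinsics, pts_h)[..., :3]
    return pts_world.reshape(B, H * W, 3)


class AttentionPool(nn.Module):
    """Compress a per-point feature cloud (B, N, D) into a single
    conditioning vector z (B, z_dim) using a learned query token."""

    def __init__(self, feat_dim: int, z_dim: int, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(feat_dim, z_dim)

    def forward(self, point_feats):
        q = self.query.expand(point_feats.size(0), -1, -1)    # (B, 1, D)
        pooled, _ = self.attn(q, point_feats, point_feats)    # (B, 1, D)
        return self.proj(pooled.squeeze(1))                   # (B, z_dim)


# Example usage with hypothetical per-point features (B, N, 128):
#   feats = backbone_features_at_points(...)           # assumed, not shown
#   z = AttentionPool(feat_dim=128, z_dim=64)(feats)   # (B, 64)

The resulting z can then be fed as the observation embedding to any IL policy head trained end-to-end, which is what lets Adapt3R pair with algorithms originally designed for 2D inputs.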


Real Robot Videos

We train a language-conditioned multitask Adapt3R policy to complete 6 tasks on a real UR5 robot.


Unseen Viewpoint

Adapt3R enables zero-shot transfer to a novel camera viewpoint that observes the scene from a completely different angle.



LIBERO-90 Videos

We train a multitask Adapt3R policy to complete 90 tasks from the LIBERO-90 benchmark.



Novel Embodiment

After training only with the Franka Panda robot, Adapt3R policies transfer zero-shot to novel embodiments.


Kinova3


UR5e


Kuka IIWA



Novel Camera Pose

Adapt3R enables zero-shot transfer to new camera poses. Here we show small, medium, and large changes in camera pose.


Small Camera Change


Medium Camera Change


Large Camera Change

BibTeX

@misc{wilcox2025adapt3radaptive3dscene,
    title={Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning}, 
    author={Albert Wilcox and Mohamed Ghanem and Masoud Moghani and Pierre Barroso and Benjamin Joffe and Animesh Garg},
    year={2025},
    eprint={2503.04877},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2503.04877}}