Authors:
(1) Ruohan Zhang, Department of Computer Science, Stanford University, Institute for Human-Centered AI (HAI), Stanford University & Equally contributed; [email protected];
(2) Sharon Lee, Department of Computer Science, Stanford University & Equally contributed; [email protected];
(3) Minjune Hwang, Department of Computer Science, Stanford University & Equally contributed; [email protected];
(4) Ayano Hiranaka, Department of Mechanical Engineering, Stanford University & Equally contributed; [email protected];
(5) Chen Wang, Department of Computer Science, Stanford University;
(6) Wensi Ai, Department of Computer Science, Stanford University;
(7) Jin Jie Ryan Tan, Department of Computer Science, Stanford University;
(8) Shreya Gupta, Department of Computer Science, Stanford University;
(9) Yilun Hao, Department of Computer Science, Stanford University;
(10) Ruohan Gao, Department of Computer Science, Stanford University;
(11) Anthony Norcia, Department of Psychology, Stanford University
(12) Li Fei-Fei, 1Department of Computer Science, Stanford University & Institute for Human-Centered AI (HAI), Stanford University;
(13) Jiajun Wu, Department of Computer Science, Stanford University & Institute for Human-Centered AI (HAI), Stanford University.
Table of Links
Brain-Robot Interface (BRI): Background
Conclusion, Limitations, and Ethical Concerns
Appendix 1: Questions and Answers about NOIR
Appendix 2: Comparison between Different Brain Recording Devices
Appendix 5: Experimental Procedure
Appendix 6: Decoding Algorithms Details
Appendix 7: Robot Learning Algorithm Details
Appendix 7: Robot Learning Algorithm Details
Object and skill learning details
We utilize pre-trained R3M as the feature extractor. Our training procedure aims to learn a latent representation of an input image for inferring the correct object-skill pair in the given scene. The feature embedding model is a fully-connected neural network that further encodes the outputs of the foundation model. Model parameters and training hyperparameters are summarized in Table 7. Collecting human data using a BRI system is expensive. To enable few-shot learning, the feature embedding model is trained using a triplet loss [80], which operates on three input vectors: anchor, positive (with the same label as the anchor), and negative (with a different label). Triplet loss pulls inputs with the same labels together by penalizing their distance in the latent space, as well as pushes inputs with different labels away. The loss function is defined as:
where a is the anchor vector, p the positive vector, n the negative vector, f the model, and α = 1 is our separation margin.
Generalization test set. We test our algorithm in the following generalization settings:
Position and pose. For position generalization, we randomize the initial positions of all objects in the scene with fixed orientation and collect 20 different trajectories. For the pose generalization, we randomize both the initial positions and orientations of all objects.
Context. The context-level generalization refers to placing the target object in different environments, defined by different backgrounds, object orientations, and the inclusion of different objects in the scene. We collect 20 different trajectories with these variations.
Instance. The instance generalization aims to assess the model’s capability to generalize across different types of objects present in the scene. For our target task (MakePasta), we collect 20 trajectories with 20 different kinds of pasta with different shapes, sizes, and colors.
Latent representation visualization. To understand the separability of latent representations generated by our object-skill learning model, we visualize the 1024-dimensional final image representations using the t-SNE data visualization technique [81]. Results for the MakePasta pose generalization test set are shown in Figure 7. The model can well separate each of the different stages of the task, allowing us to retrieve the correct object-skill pair for an unseen image.
One-shot parameter learning details
Design choices. We empirically found that using DINOv2’s ViT-B model, alongside a 75x100 feature map and a 3x3 sliding window, with cosine similarity as the distance metric, resulted in the best performance for our image resolutions.
Generalization test set We test the generalization ability of our algorithm on 1008 unique training and test pairs, encompassing four types of generalizations including 8 position trials, 8 orientation trials, 32 context trials, and 960 instance trials.
Position and orientation The position and orientation generalizations, shown in Fig. 8 and Fig. 9 respectively, are tested in isolation, e.g. when the position is varied, the orientation is kept the same.
Context The context-level generalization, shown in Fig. 10, refers to placing the target object in different environments, e.g. the training image might show the target object in the kitchen while the test image shows the target object in a workspace. Here, we allow for position and orientation to vary as well.
Instance To test our algorithm’s capability of instance-level generalization, shown in Fig. 11, we collected a set of four different object categories, each containing five unique object instances. Our object categories consist of mug, pen, bottle, and medicine bottle, whereas the bottle and medicine bottle categories consist of images from both the top-down and side views. We test all permutations within each object category including train and test pairs with different camera views. Here, we allow for position and orientation to vary as well.
Test set for comparing our method against baselines We test our method against baselines on 1080 unique training and test pairs, encompassing four types of generalizations including 8 position trials, 8 orientation trials, 32 context trials, 960 instance trials, 48 trials where we vary all four generalizations simultaneously, and 24 trials from the SetTable task.
Position, orientation, context, and instance simultaneously. Finally, we test our algorithm’s ability to generalize when all four variables differ between the training and test image, shown in Fig. 12. Here, the only object category we use is a mug.
This paper is available on arxiv under CC 4.0 license.