Decoding Human Intentions: NOIR's Performance and Learning Algorithms

14 Feb 2024

Authors:

(1) Ruohan Zhang, Department of Computer Science, Stanford University & Institute for Human-Centered AI (HAI), Stanford University (equal contribution);

(2) Sharon Lee, Department of Computer Science, Stanford University (equal contribution);

(3) Minjune Hwang, Department of Computer Science, Stanford University (equal contribution);

(4) Ayano Hiranaka, Department of Mechanical Engineering, Stanford University (equal contribution);

(5) Chen Wang, Department of Computer Science, Stanford University;

(6) Wensi Ai, Department of Computer Science, Stanford University;

(7) Jin Jie Ryan Tan, Department of Computer Science, Stanford University;

(8) Shreya Gupta, Department of Computer Science, Stanford University;

(9) Yilun Hao, Department of Computer Science, Stanford University;

(10) Ruohan Gao, Department of Computer Science, Stanford University;

(11) Anthony Norcia, Department of Psychology, Stanford University;

(12) Li Fei-Fei, Department of Computer Science, Stanford University & Institute for Human-Centered AI (HAI), Stanford University;

(13) Jiajun Wu, Department of Computer Science, Stanford University & Institute for Human-Centered AI (HAI), Stanford University.

Abstract & Introduction

Brain-Robot Interface (BRI): Background

The NOIR System

Experiments

Results

Conclusion, Limitations, and Ethical Concerns

Acknowledgments & References

Appendix 1: Questions and Answers about NOIR

Appendix 2: Comparison between Different Brain Recording Devices

Appendix 3: System Setup

Appendix 4: Task Definitions

Appendix 5: Experimental Procedure

Appendix 6: Decoding Algorithms Details

Appendix 7: Robot Learning Algorithm Details

5 Results

We seek to provide answers to the following questions through extensive evaluation: 1) Is NOIR truly general-purpose, in that it allows all of our human subjects to accomplish the diverse set of everyday tasks we have proposed? 2) Does our decoding pipeline provide accurate decoding results? 3) Does our proposed robot learning and intention prediction algorithm improve NOIR’s efficiency?

System performance. Table 1 summarizes the performance based on two metrics: the number of attempts until success and the time to complete the task in successful trials. When the participant reached an unrecoverable state during task execution, we reset the environment and the participant reattempted the task from the beginning. Task horizons (number of primitive skills executed) are included as a reference. Although these tasks are long-horizon and challenging, NOIR shows very encouraging results: on average, tasks can be completed with only 1.83 attempts. Task failures are caused by human errors in skill and parameter selection, i.e., the users pick the wrong skills or parameters, which leads to non-recoverable states that require manual resets. Decoding errors and robot execution errors are avoided thanks to our safety mechanism with confirmation and interruption. Although our primitive skill library is limited, human users find novel uses of these skills to solve tasks creatively. Hence we observe emerging capabilities such as extrinsic dexterity. For example, in task CleanBook (Fig. 1.6), Franka’s Pick skill is not designed to grasp a book from the table, but users learn to push the book towards the edge of the table and grasp it from the side. In CutBanana (Fig. 1.12), users utilize the Push skill to cut. The average task completion time is 20.29 minutes. Note that the time humans spend on decision-making and decoding is relatively long (80% of total time), partially due to the safety mechanism. Later, we will show that our proposed robot learning algorithms can address this issue effectively.

Table 1: NOIR system performance. Task horizon is the average number of primitive skills executed. # attempts indicates the average number of attempts until the first success (1 means success on the first attempt). Time indicates the task completion time in successful trials. Human time is the percentage of the total time spent by human users, which includes decision-making time and decoding time. With only a few attempts, all users can accomplish these challenging tasks.

Decoding accuracy. A key to our system’s success is the accuracy in decoding brain signals. Table 2 summarizes the decoding accuracy at different stages. We find that CCA on SSVEP produces a high accuracy of 81.2%, meaning that object selection is mostly accurate. As for CSP + QDA on MI for parameter selection, the 2-way classification model performs at 73.9% accuracy, which is consistent with current literature [36]. The 4-way skill-selection classification models perform at about 42.2% accuracy. Though this may not seem high, it is competitive considering inconsistencies attributed to long task duration (hence the discrepancy between calibration and task-time accuracies). Our calibration time is only 10 minutes, which is several orders of magnitude shorter than typical MI calibration and training sessions [21]. More calibration would provide more data for training more robust classifiers and would allow human users to practice more, which typically yields stronger brain signals. Overall, the decoding accuracy is satisfactory, and with the safety mechanism, there has been no instance of task failure caused by incorrect decoding.
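To put the two MI accuracies quoted above in context, they can be compared against the accuracy of uniform random guessing. The short sketch below does this arithmetic in Python. It assumes balanced classes, so that chance level is 1/k for a k-way classifier; the paragraph does not state the class balance, and the function and variable names are illustrative only.

```python
# Accuracies quoted in the paragraph above; the class counts follow the
# "2-way" and "4-way" descriptions. Balanced classes are an ASSUMPTION.
reported = {
    "MI parameter selection (2-way)": (0.739, 2),
    "MI skill selection (4-way)": (0.422, 4),
}


def chance_level(num_classes: int) -> float:
    """Accuracy of uniform random guessing over balanced classes."""
    return 1.0 / num_classes


def summarize(results):
    """Return (name, accuracy, chance, margin, ratio) for each decoder."""
    rows = []
    for name, (accuracy, num_classes) in results.items():
        chance = chance_level(num_classes)
        rows.append((name, accuracy, chance, accuracy - chance, accuracy / chance))
    return rows


for name, acc, chance, margin, ratio in summarize(reported):
    print(f"{name}: accuracy={acc:.3f}, chance={chance:.3f}, "
          f"margin={margin:+.3f}, ratio={ratio:.2f}x")
```

Under this assumption, the 4-way skill decoder sits about 17 percentage points above its 25% chance level, and the 2-way decoder about 24 points above 50%.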

Object and skill selection results. We then answer the third question: Does our proposed robot learning algorithm improve NOIR’s efficiency? First, we evaluate object and skill selection learning. We collect a dataset offline with 15 training samples for each object-skill pair in the MakePasta task. Given an image, a prediction is considered correct only if both the object and the skill are predicted correctly. Results are shown in Table 3. While a simple image classification model using ResNet [72] achieves an average accuracy of 0.31, our method with a pre-trained ResNet backbone achieves a significantly higher accuracy of 0.73, highlighting the importance of contrastive learning and retrieval-based learning. Using R3M as the feature extractor further improves the performance to 0.94. The generalization ability of the algorithm is tested on the same MakePasta task. For instance-level generalization, 20 different types of pasta are used; for context generalization, we randomly select and place 20 task-irrelevant objects in the background. These results are also shown in Table 3. In all variations, our model achieves accuracy over 93%, meaning that the human can skip the skill and object selection 93% of the time, significantly reducing their time and effort. We further test our algorithm during actual task execution (Fig. 5). A human user completes the task with and without object-skill prediction two times each. With object and skill learning, the average time required for each object-skill selection is reduced by 60%, from 45.7 to 18.1 seconds. More details about the experiments and visualizations of the learned representation can be found in Appendix 7.1.
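The evaluation rule above (a prediction counts only if both object and skill are right) and the idea of retrieval-based prediction can be illustrated with a small sketch. This is not the authors' implementation: it assumes image features are already available as plain vectors, uses nearest-neighbor retrieval under cosine similarity as a stand-in for the retrieval step, and all embeddings, object names, and label pairs below are hypothetical toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_object_skill(query, memory):
    """Return the (object, skill) label of the stored embedding closest to `query`."""
    best = max(memory, key=lambda item: cosine_similarity(query, item[0]))
    return best[1]


def joint_accuracy(predictions, targets):
    """A prediction is correct only if both the object and the skill match."""
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


# Hypothetical 3-D embeddings standing in for image features.
memory = [
    ([1.0, 0.1, 0.0], ("pasta", "Pick")),
    ([0.0, 1.0, 0.2], ("pot", "Place")),
    ([0.1, 0.0, 1.0], ("salt", "Shake")),
]
queries = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.9, 0.0, 0.3]]
targets = [("pasta", "Pick"), ("pot", "Place"), ("salt", "Shake")]

predictions = [retrieve_object_skill(q, memory) for q in queries]
accuracy = joint_accuracy(predictions, targets)
print(predictions, accuracy)
```

In this toy data the third query lies closest to the "pasta" embedding, so it is retrieved with the wrong object-skill pair and the joint accuracy is 2/3. The example is meant to show why the quality of the feature extractor matters for a retrieval-style predictor.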

Table 2: Decoding accuracy at different stages of the experiment.

One-shot parameter learning results. First, using our pre-collected dataset (see Appendix 7.2), we compare our algorithm against multiple baselines. The MSE values of the predictions are shown in Table 4. Random sample shows the average error when randomly predicting points in the 2D space. Sample on objects randomly predicts a point on objects rather than on the background; the object masks here are detected with the Segment Anything Model (SAM) [73]. For Pixel similarity, we employ the cosine similarity and sliding window techniques used in our algorithm, but on raw images without using DINOv2 features. Our algorithm drastically outperforms all of the baselines. Second, our one-shot learning method demonstrates robust generalization capability, as tested on the respective dataset; Table 4 presents the results. The low prediction error means that users spend much less effort in controlling the cursor to move to the desired position. We further demonstrate the effectiveness of the parameter learning algorithm in actual task execution for SetTable, quantified in terms of saved human effort in controlling the cursor movement (Fig. 5). Without learning, the cursor starts at the chosen object or the center of the screen. With learning, the predicted result is used as the starting location for cursor control, which leads to a considerable decrease in cursor movement, with the mean distance reduced by 41%. These findings highlight the potential of parameter learning in improving efficiency and reducing human effort. More results and visualizations can be found in Appendix 7.2.
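The combination of cosine similarity and a sliding window mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the feature maps are tiny hand-built grids rather than DINOv2 features, the reference patch is a 3 x 3 window around the point chosen in a single training image, and every function name here is hypothetical. The squared error at the end mirrors the per-sample term behind the MSE metric reported in Table 4.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def flatten_window(feature_map, top, left, size):
    """Concatenate the feature vectors inside a size x size window."""
    return [v
            for r in range(top, top + size)
            for c in range(left, left + size)
            for v in feature_map[r][c]]


def predict_point(reference_patch, feature_map, size):
    """Slide a window over `feature_map` and return the (row, col) center of
    the window most similar to `reference_patch` under cosine similarity."""
    rows, cols = len(feature_map), len(feature_map[0])
    best_score, best_center = -2.0, None
    for top in range(rows - size + 1):
        for left in range(cols - size + 1):
            window = flatten_window(feature_map, top, left, size)
            score = cosine_similarity(reference_patch, window)
            if score > best_score:
                best_score = score
                best_center = (top + size // 2, left + size // 2)
    return best_center


def squared_error(pred, target):
    """Squared Euclidean distance between predicted and true 2D points."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def make_map(rows, cols, center):
    """Toy feature map: background cells plus a 3 x 3 'object' around `center`."""
    grid = [[[1.0, 0.0] for _ in range(cols)] for _ in range(rows)]
    cr, cc = center
    for r in range(cr - 1, cr + 2):
        for c in range(cc - 1, cc + 2):
            grid[r][c] = [0.0, 1.0]
    grid[cr][cc] = [1.0, 1.0]  # distinctive cell marking the chosen point
    return grid


train_map = make_map(5, 5, center=(1, 1))   # single demonstration
test_map = make_map(5, 5, center=(3, 2))    # object moved in the new scene
reference = flatten_window(train_map, 0, 0, 3)
predicted = predict_point(reference, test_map, 3)
error = squared_error(predicted, (3, 2))
print(predicted, error)
```

In this toy scene the object moves from (1, 1) to (3, 2), and the best-matching window is centered on the new location, so the squared error is zero. With real images, the choice of features determines how well such matching survives changes in appearance, which is the point of the Pixel similarity baseline.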

Table 4: One-shot parameter learning results. Our method is highly accurate and generalizes well.

Figure 5: Left: Object and skill selection learning reduces the decoding time by 60%. Right: Parameter learning decreases cursor movement distance by 41%.

This paper is available on arXiv under CC 4.0 license.