The NOIR System: Building a General Purpose BRI System

Written by escholar | Published 2024/02/14
Tech Story Tags: robotics | human-robot-interaction | noir | bri-system | brain-robot-interface | neural-signal-operated-robots | adaptive-algorithms | eeg-decoding-system

TL;DR: Discover the NOIR system, which advances brain-robot interaction (BRI) by addressing challenges in EEG decoding, robot skill augmentation, and efficient collaboration. Learn how NOIR integrates a modular brain decoding pipeline, parameterized primitive skills, and few-shot adaptation algorithms to create a versatile and efficient BRI system that enhances human-robot interaction across a variety of tasks.

Authors:

(1) Ruohan Zhang, Department of Computer Science, Stanford University, Institute for Human-Centered AI (HAI), Stanford University & Equally contributed; [email protected];

(2) Sharon Lee, Department of Computer Science, Stanford University & Equally contributed; [email protected];

(3) Minjune Hwang, Department of Computer Science, Stanford University & Equally contributed; [email protected];

(4) Ayano Hiranaka, Department of Mechanical Engineering, Stanford University & Equally contributed; [email protected];

(5) Chen Wang, Department of Computer Science, Stanford University;

(6) Wensi Ai, Department of Computer Science, Stanford University;

(7) Jin Jie Ryan Tan, Department of Computer Science, Stanford University;

(8) Shreya Gupta, Department of Computer Science, Stanford University;

(9) Yilun Hao, Department of Computer Science, Stanford University;

(10) Ruohan Gao, Department of Computer Science, Stanford University;

(11) Anthony Norcia, Department of Psychology, Stanford University;

(12) Li Fei-Fei, Department of Computer Science, Stanford University & Institute for Human-Centered AI (HAI), Stanford University;

(13) Jiajun Wu, Department of Computer Science, Stanford University & Institute for Human-Centered AI (HAI), Stanford University.

Table of Links

Abstract & Introduction

Brain-Robot Interface (BRI): Background

The NOIR System

Experiments

Results

Conclusion, Limitations, and Ethical Concerns

Acknowledgments & References

Appendix 1: Questions and Answers about NOIR

Appendix 2: Comparison between Different Brain Recording Devices

Appendix 3: System Setup

Appendix 4: Task Definitions

Appendix 5: Experimental Procedure

Appendix 6: Decoding Algorithms Details

Appendix 7: Robot Learning Algorithm Details

3 The NOIR System

The challenges we try to tackle are: 1) How do we build a general-purpose BRI system that works for a variety of tasks? 2) How do we decode relevant communication signals from human brains? 3) How do we make robots more intelligent and adaptive for more efficient collaboration? An overview of our system is shown in Fig. 2. Humans act as planning agents to perceive, plan, and communicate behavioral goals to the robot, while robots use pre-defined primitive skills to achieve these goals.

The overarching goal of building a general-purpose BRI system is achieved by synergistically integrating two designs. First, we propose a novel modular brain decoding pipeline for human intentions, in which the human's intended goal is decomposed into three components: what, how, and where (Sec. 3.1). Second, we equip the robots with a library of parameterized primitive skills to accomplish human-specified goals (Sec. 3.2). This design enables humans and robots to collaborate on a variety of challenging, long-horizon everyday tasks. Finally, a key feature of NOIR is that robots can act more efficiently and adapt to individual users; to achieve this, we adopt few-shot imitation learning from humans (Sec. 3.3).

3.1 The brain: A modular decoding pipeline

We hypothesize that the key to building a general-purpose EEG decoding system is modularization. Decoding complete behavioral goals (e.g., in the form of natural language) is only feasible with expensive devices like fMRI, and even then requires many hours of training data for each individual [31]. As shown in Fig. 3, we decompose human intention into three components: (a) What object to manipulate; (b) How to interact with the object; (c) Where to interact. Decoding these specific user intents from EEG signals is challenging but can be done with steady-state visually evoked potentials and motor imagery, as introduced in Sec. 2. For brevity, details of the decoding algorithms are in Appendix 6.

Selecting objects with steady-state visually evoked potential (SSVEP). Upon showing the task setup on a screen, we first infer the user's intended object. We make the objects on the screen flicker at different frequencies (Fig. 3a), which, when the user focuses on one of them, evokes an SSVEP [26]. By identifying which frequency is strongest in the EEG data, we can infer the frequency of the flickering visual stimulus, and hence the object the user is focusing on. We apply modern computer vision techniques to circumvent the need to physically attach LED lights [27, 28]. Specifically, we use the foundation model OWL-ViT [32] to detect and track objects; it takes in an image and object descriptions and outputs object segmentation masks. By overlaying each mask at a different flickering frequency (6 Hz, 7.5 Hz, 8.57 Hz, and 10 Hz [33, 34]) and having the user focus on the desired object for 10 seconds, we are able to identify the attended object.
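
To make the flicker mechanism concrete, below is a minimal sketch of how detected object masks could be toggled at their assigned frequencies on each video frame. The 60 Hz refresh rate, the square-wave duty cycle, and all function names are illustrative assumptions, not details from the paper.

```python
import numpy as np

FLICKER_HZ = [6.0, 7.5, 8.57, 10.0]   # stimulation frequencies from the text
REFRESH_HZ = 60.0                     # assumed display refresh rate

def mask_visible(freq_hz, frame_idx, refresh_hz=REFRESH_HZ):
    """Return True if the mask overlay should be drawn on this video frame."""
    t = frame_idx / refresh_hz
    # Square wave with 50% duty cycle at the object's flicker frequency.
    return np.sin(2 * np.pi * freq_hz * t) >= 0

def overlay_flicker(image, masks, frame_idx):
    """Brighten each object's mask only on frames where its flicker is 'on'."""
    out = image.copy()
    for mask, freq in zip(masks, FLICKER_HZ):
        if mask_visible(freq, frame_idx):
            out[mask] = np.clip(out[mask].astype(np.int16) + 80, 0, 255)
    return out
```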

We use only the signals from the visual cortex (Appendix 6) and preprocess the data with a notch filter. We then use Canonical Correlation Analysis (CCA) for classification [35]. For each of our frequencies, we create a Canonical Reference Signal (CRS), a set of sine and cosine waves at that frequency and its harmonics. We then use CCA to find the frequency whose CRS has the highest correlation with the EEG signal, and identify the object that was made to flicker at that frequency.
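
The following sketch illustrates this CCA step under common SSVEP-decoding assumptions: for each candidate frequency we build a reference of sine/cosine waves at that frequency and its harmonics, then pick the frequency whose canonical correlation with the visual-cortex EEG window is highest. The number of harmonics and the function names are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def reference_signals(freq, n_samples, fs, n_harmonics=2):
    """Canonical Reference Signal: sin/cos waves at the target frequency and its harmonics."""
    t = np.arange(n_samples) / fs
    refs = []
    for h in range(1, n_harmonics + 1):
        refs.append(np.sin(2 * np.pi * h * freq * t))
        refs.append(np.cos(2 * np.pi * h * freq * t))
    return np.stack(refs, axis=1)                 # (n_samples, 2 * n_harmonics)

def classify_ssvep(eeg, fs, candidate_freqs=(6.0, 7.5, 8.57, 10.0)):
    """Return the flicker frequency whose CRS correlates best with the EEG.

    eeg: array of shape (n_samples, n_channels), visual-cortex channels only.
    """
    scores = []
    for f in candidate_freqs:
        ref = reference_signals(f, eeg.shape[0], fs)
        cca = CCA(n_components=1)
        x_c, y_c = cca.fit_transform(eeg, ref)
        scores.append(np.corrcoef(x_c[:, 0], y_c[:, 0])[0, 1])
    return candidate_freqs[int(np.argmax(scores))]
```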

Selecting skill and parameters with motor imagery (MI). The user then chooses a skill and its parameters. We frame this as a k-way (k ≤ 4) MI classification problem, in which we aim to decode which of the k pre-decided actions the user is imagining. Unlike SSVEP, a small amount of calibration data (about 10 minutes) is required due to the distinct nature of each user's MI signals. The four classes are Left Hand, Right Hand, Legs, and Rest; the class names describe the body parts that users imagine using to execute certain skills (e.g., pushing a pedal with the feet). Upon presenting the list of k skill options, we record a 5-second EEG signal and classify it with a model trained on the calibration data. The user then guides a cursor on the screen to the appropriate location for executing the skill; for example, to move the cursor leftward along the x axis, the user is prompted to imagine moving their left hand. We record another five seconds of data and use a 2-way classifier. This process is repeated for the x, y, and z axes.

For decoding, we use only EEG channels around the brain areas related to motor imagery (Appendix 6). The data is band-pass-filtered between 8Hz and 30Hz to include µ-band and β-band frequency ranges correlated with MI activity [36]. The classification algorithm is based on the common spatial pattern (CSP) [37–40] algorithm and quadratic discriminant analysis (QDA). Due to its simplicity, CSP+QDA is explainable and amenable to small training datasets. Contour maps of electrode contributions to the top few CSP-space principal components are shown in the middle row of Fig. 3. There are distinct concentrations around the right and left motor areas, as well as the visual cortex (which correlates with the Rest class).
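
A minimal sketch of such a CSP+QDA pipeline is shown below, assuming MNE-Python for filtering and CSP and scikit-learn for QDA; the number of CSP components and the exact preprocessing are assumptions rather than the paper's configuration.

```python
import numpy as np
from mne.decoding import CSP
from mne.filter import filter_data
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.pipeline import Pipeline

def bandpass_mi(epochs, fs, l_freq=8.0, h_freq=30.0):
    """Band-pass filter epochs (n_trials, n_channels, n_samples) to the mu/beta bands."""
    return filter_data(epochs.astype(np.float64), sfreq=fs, l_freq=l_freq, h_freq=h_freq)

def make_mi_classifier(n_csp=4):
    """CSP feature extraction followed by QDA, as in the decoding description above."""
    return Pipeline([
        ("csp", CSP(n_components=n_csp, log=True)),   # spatial filters + log-variance features
        ("qda", QuadraticDiscriminantAnalysis()),
    ])

# Assumed usage on calibration data: X has shape (n_trials, n_channels, n_samples),
# y holds labels in {Left Hand, Right Hand, Legs, Rest}.
# clf = make_mi_classifier()
# clf.fit(bandpass_mi(X_train, fs), y_train)
# pred = clf.predict(bandpass_mi(X_test, fs))
```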

Confirming or interrupting with muscle tension. Safety is critical in BRI due to noisy decoding. We follow common practice and collect the electrical signals generated by facial muscle tension (electromyography, or EMG). This signal appears when users frown or clench their jaw, indicating a negative response. The signal is strong, with near-perfect decoding accuracy, so we use it to confirm or reject object, skill, or parameter selections. With a pre-determined threshold value obtained during the calibration stage, we can reliably detect muscle tension from 500-ms windows.
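
A minimal sketch of this thresholding step is shown below, assuming an RMS amplitude statistic over each 500-ms window; the exact statistic and channel selection are assumptions, not details from the paper.

```python
import numpy as np

def detect_muscle_tension(window, threshold):
    """Return True if facial-EMG amplitude in a 500-ms window exceeds the
    calibration threshold, i.e. the user frowned or clenched their jaw."""
    rms = np.sqrt(np.mean(np.square(window)))   # root-mean-square amplitude of the window
    return rms > threshold
```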

3.2 The robot: Parameterized primitive skills

Our robots must be able to solve a diverse set of manipulation tasks under human guidance, which we achieve by equipping them with a set of parameterized primitive skills. These skills can be combined and reused across tasks, and they are intuitive to humans. Since skill-augmented robots have shown promising results in solving long-horizon tasks, we follow recent works in robotics with parameterized skills [41–52] and augment the action space of our robots with a set of primitive skills and their parameters. Neither the human nor the agent requires knowledge of the underlying control mechanism for these skills, so the skills can be implemented with any method as long as they are robust and adaptable across tasks.
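
As a minimal illustration of this design, a skill library can be represented as a mapping from skill names to parameter specifications and opaque controller calls; the skill names and fields below are illustrative, not NOIR's actual list (see Appendix 3 for that).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class Skill:
    name: str
    param_names: Tuple[str, ...]      # e.g. a 3D interaction point chosen by the user
    execute: Callable[..., None]      # underlying controller call, opaque to human and agent

# Hypothetical library; each entry's execute would wrap the real controller code.
SKILL_LIBRARY: Dict[str, Skill] = {
    "reach": Skill("reach", ("x", "y", "z"), execute=lambda x, y, z: None),
    "pick":  Skill("pick",  ("x", "y", "z"), execute=lambda x, y, z: None),
    "place": Skill("place", ("x", "y", "z"), execute=lambda x, y, z: None),
}
```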

We use two robots in our experiments: a Franka Emika Panda arm for tabletop manipulation tasks, and a PAL Tiago robot for mobile manipulation tasks (see Appendix for hardware details). Skills for the Franka robot use the operational space pose controller (OSC) [53] from the Deoxys API [54]. For example, trajectories for the Reaching skill are generated by numerical 3D trajectory interpolation conditioned on the robot's current end-effector 6D pose and the target pose; the OSC then drives the robot through the waypoints along the trajectory in order. The Tiago robot's navigation skill is implemented using the ROS MoveBase package, while all other skills are implemented using the MoveIt motion-planning framework [55]. A complete list of skills for both robots is in Appendix 3. Later, we show that humans and robots can work together using these skills to solve all of our tasks.
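
Below is a minimal sketch of the waypoint generation described for the Reaching skill: linear interpolation between the current and target end-effector positions, with each waypoint sent to the controller in order. Orientation interpolation (e.g., quaternion slerp) is omitted, and the controller interface is hypothetical.

```python
import numpy as np

def reach_waypoints(current_pos, target_pos, n_waypoints=20):
    """Linearly interpolate 3D waypoints from the current position to the target."""
    current_pos, target_pos = np.asarray(current_pos), np.asarray(target_pos)
    alphas = np.linspace(0.0, 1.0, n_waypoints)
    return [(1 - a) * current_pos + a * target_pos for a in alphas]

def execute_reach(controller, current_pos, target_pos):
    """Command the operational-space controller through each waypoint in order."""
    for wp in reach_waypoints(current_pos, target_pos):
        controller.goto(position=wp)    # hypothetical OSC interface
```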

3.3 Leveraging robot learning for efficient BRI

The modular decoding pipeline and the primitive skill library lay the foundation for NOIR, but the efficiency of such a system can be further improved. During collaboration, the robot should learn the user's object, skill, and parameter selection preferences so that, in future trials, it can predict the user's intended goals and act more autonomously, reducing the effort required for decoding. Learning and generalization are required since the location, pose, arrangement, and instance of the objects can differ from trial to trial. Meanwhile, the learning algorithms should be sample-efficient, since human data is expensive to collect.

Retrieval-based few-shot object and skill selection. In NOIR, human effort can be reduced if the robot learns to propose appropriate object-skill selections for a given state in the task. Inspired by retrieval-based imitation learning [56–58], our method learns a latent state representation from observed states; given a new state observation, it finds the most similar state in the latent space and the corresponding action. Our method is shown in Fig. 4. During task execution, we record data points consisting of images and the object-skill pairs selected by the human. The images are first encoded by a pre-trained R3M model [59] to extract features useful for robot manipulation tasks, and are then passed through several trainable, fully connected layers. These layers are trained with contrastive learning using a triplet loss [60], which encourages images with the same object-skill label to be embedded closer together in the latent space. The learned image embeddings and object-skill labels are stored in memory. At test time, the model retrieves the nearest data point in the latent space and suggests the associated object-skill pair to the human. Details of the algorithm can be found in Appendix 7.1.
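
The sketch below illustrates the retrieval scheme under stated assumptions: frozen R3M features pass through a small trainable head optimized with a triplet loss, and at test time the nearest stored embedding's object-skill label is suggested. The layer sizes, margin, feature dimensionality, and distance metric are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Trainable fully connected layers on top of frozen R3M features."""
    def __init__(self, in_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))

    def forward(self, feats):
        return nn.functional.normalize(self.net(feats), dim=-1)

triplet = nn.TripletMarginLoss(margin=0.5)

def train_step(head, optimizer, anchor, positive, negative):
    """Pull together embeddings that share an object-skill label, push apart the rest."""
    loss = triplet(head(anchor), head(positive), head(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def retrieve(head, memory_feats, memory_labels, query_feat):
    """Suggest the object-skill label of the nearest stored state in latent space."""
    with torch.no_grad():
        dists = torch.cdist(head(query_feat[None]), head(memory_feats))
    return memory_labels[int(dists.argmin())]
```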

One-shot skill parameter learning. Parameter selection requires considerable human effort, as it involves precise cursor manipulation through MI. To reduce this effort, we propose a learning algorithm that predicts parameters given an object-skill pair, providing an initial point for cursor control. If a user has once successfully pinpointed the precise key point for picking up a mug by its handle, does this parameter need to be specified again in the future? Recent advances in foundation models such as DINOv2 [61] allow us to find semantically corresponding key points, eliminating the need for parameter re-specification. Compared to previous works, our algorithm is one-shot [62–66] and predicts specific 2D points instead of semantic segments [67, 68]. As shown in Fig. 4, given a training image (360 × 240) and a parameter choice (x, y), we predict the semantically corresponding point in test images, in which the positions, orientations, instances of the target object, and contexts may vary. We use a pre-trained DINOv2 model to obtain semantic features [61]: we feed both the training and test images into the model and obtain 768-channel patch-token features arranged as a 75 × 100 pixel-wise feature map. We then extract a 3 × 3 patch of features centered around the provided training parameter and search for the best-matching feature location in the test image, using cosine similarity as the matching metric. Details of this algorithm can be found in Appendix 7.2.
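
A minimal sketch of the correspondence search is given below. Dense feature extraction is abstracted away (the (D, H, W) feature maps stand in for DINOv2 patch features), the training point (x, y) is assumed to be already mapped into feature-map coordinates, and averaging the 3 × 3 neighbourhood of query features is an assumption about how the patch is aggregated.

```python
import torch
import torch.nn.functional as F

def predict_parameter(train_feats, test_feats, train_xy):
    """Find the test-image location whose feature best matches a 3x3 patch of
    features around the training parameter (x, y), using cosine similarity.

    train_feats, test_feats: dense feature maps of shape (D, H, W).
    """
    D, H, W = train_feats.shape
    x, y = train_xy
    # Average the 3x3 neighbourhood of feature vectors around the training point.
    patch = train_feats[:, max(y - 1, 0):y + 2, max(x - 1, 0):x + 2].reshape(D, -1).mean(dim=1)
    query = F.normalize(patch, dim=0)
    test = F.normalize(test_feats.reshape(D, -1), dim=0)   # unit-norm feature per location
    sims = query @ test                                     # cosine similarity, shape (H*W,)
    idx = int(sims.argmax())
    return (idx % W, idx // W)                              # predicted (x, y) in feature coords
```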

This paper is available on arXiv under a CC 4.0 license.

