The widespread success of foundation models like ChatGPT, Gemini, DeepSeek, and Claude demonstrates the effectiveness of scaling data collection to address many vision and language tasks. Robotics has also benefited from learning from large amounts of high-quality data, but robots still cannot reliably perform many of the tasks that people do every day without thinking. The performance gap between robot systems and other forms of artificial intelligence stems from the fact that robots are physically embodied.
Philosophically, having a physical body changes the kinds of data that an embodied agent has access to. For example, a person’s physical embodiment provides them with a limited field of view and prevents them from being in multiple places at once. Access to data is further shaped by a person’s flexibility, strength, known motor skills, age, and numerous other embodied social and physical factors.
The limitations of embodiment have direct implications for the production of large-scale datasets. People are generally socially motivated to share data on the internet that is interesting. Interesting data accrues interaction and social capital, which satisfy the basic human psychological needs of autonomy, belonging, and competence. The quality of “being interesting” can be thought of as data that has low probability for the population in general, but higher probability for the person sharing it. If there were no social impetus to post interesting data, the internet would likely look wildly different.
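One rough way to make “being interesting” concrete (our own shorthand here, not a formal definition from the literature) is as the gap between how probable a piece of content is under the sharer’s own experience and how probable it is for the population at large:

$$\text{interestingness}(x) \;\approx\; \log p_{\text{sharer}}(x) - \log p_{\text{population}}(x)$$

Under this view, routine daily activities score near zero, regardless of how useful they would be as training data.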
One of the key advantages of robots is their capacity to automate the minutiae of day-to-day life that people are generally unmotivated to perform. Because these dull tasks are not reflective of activities generally posted on the internet, there is a large distribution shift between the data that we have in vast quantities online (“interesting”) and the data we need to train functional robot policies (“uninteresting”). While this sounds like a difficult problem, keep in mind that uninteresting data are only uninteresting because everyone is constantly producing them. The only thing we need to do is actually collect them.
The key insight of this blog post is that people naturally emit data through many embodied channels, but our existing interfaces do not (yet) effectively capture the data from these embodied channels.
Learning from Intrinsically Rewarding Interface Interactions
Data collected through digital interfaces are compelling because they already exist in a format we can directly use for robot learning. Numerous works have shown great benefits of using digital interfaces to collect robot demonstrations, pairwise preferences, trajectory rankings, or text/language specifications. These interfaces work well when users know how they want the robot to act. In practice, users interacting with novel technology do not yet know how they want robots to act, and engage in exploratory search processes to understand both robot capability (i.e., the limits of its embodiment) and their own preferences for robot behaviors. We examined how users perform exploratory search when designing signals for a robot that navigates around the room to help them find misplaced items.
To encourage this process, we created RoSiD, a signal design interface that promotes exploratory search by allowing users to review several robot behaviors at once. An example of a user performing exploratory search with RoSiD is shown below. In our first study, 25 non-expert users selected visual, auditory, and kinetic robot behaviors for four different robot signals according to their own preferences.
We found that users’ explorations communicated their internal decision-making processes: users examined relevant behaviors in more detail by playing them on the physical robot, and ignored behaviors that were irrelevant to their goals. Not only does this interaction give us a new and scalable data source for learning robot representations, users also described the exploratory search process as fun (citing parallels to the video game franchise “The Sims”).
To learn representations from this new datasource, we formulated a contrastive objective using a triplet loss between the behaviors that users chose to explore in detail and the behaviors that users chose to ignore. We call this formulation Contrastive Learning from Exploratory Actions (CLEA), illustrated below.
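A minimal sketch of such a triplet objective, in PyTorch, is below. The encoder architecture, input features, and the way explored behaviors are paired within a batch are illustrative assumptions, not the exact formulation from the paper.

```python
# Sketch of a CLEA-style triplet objective over exploratory actions.
# Assumption: behaviors a user played on the robot ("explored") should embed
# close to each other and far from behaviors the user skipped ("ignored").
import torch
import torch.nn as nn


class BehaviorEncoder(nn.Module):
    """Maps a behavior feature vector (visual, auditory, or kinetic) to an embedding."""

    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)


def clea_triplet_loss(encoder, explored, ignored, margin=0.5):
    """explored, ignored: (B, D) feature batches logged from users' exploration sessions."""
    anchor = encoder(explored)
    # Pair each explored behavior with another explored behavior in the batch
    # (in practice, positives would come from the same user's session).
    positive = encoder(torch.roll(explored, shifts=1, dims=0))
    negative = encoder(ignored)
    return nn.TripletMarginLoss(margin=margin)(anchor, positive, negative)
```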
We learned representations for each of the three modalities of robot behaviors (visual, auditory, and kinetic). To evaluate whether these representations are useful in other downstream preference learning tasks, we recruited a new set of 40 non-expert users to engage in a traditional behavior ranking task, where users ordered robot behaviors from their least favorite to their most favorite. We found that CLEA representations generalized to new users and learned users’ preferences with 80% less data than self-supervised methods required.
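As an illustration of this downstream step, here is a hedged sketch that fits a linear preference score over frozen behavior embeddings from pairwise comparisons (a Bradley-Terry-style model); the actual method and hyperparameters in our evaluation may differ.

```python
# Sketch: learn a user's preferences as a linear score over frozen embeddings.
import torch


def fit_preferences(embeddings, comparisons, lr=0.05, steps=200):
    """embeddings: (N, E) frozen behavior embeddings (e.g., from CLEA).
    comparisons: list of (i, j) pairs meaning behavior i was ranked above behavior j."""
    w = torch.zeros(embeddings.shape[1], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    i_idx = torch.tensor([i for i, _ in comparisons])
    j_idx = torch.tensor([j for _, j in comparisons])
    for _ in range(steps):
        opt.zero_grad()
        scores = embeddings @ w
        # Bradley-Terry: maximize P(i preferred over j) = sigmoid(score_i - score_j)
        loss = -torch.nn.functional.logsigmoid(scores[i_idx] - scores[j_idx]).mean()
        loss.backward()
        opt.step()
    return w.detach()  # rank unseen behaviors by embeddings @ w
```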
When robots are widely deployed in homes, CLEA has the potential to let robots learn behavior representations from the preferences of large user populations, because exploratory actions are easy to collect at scale. These representations can significantly speed up adaptation to new user preferences and environments.
Read more in these papers:
- The RoSiD Tool: Empowering Users to Design Multimodal Signals for Human-Robot Collaboration
Nathan Dennler, David Delgado, Daniel Zeng, Stefanos Nikolaidis, Maja Matarić
2023 IFRR International Symposium on Experimental Robotics (ISER)
- Contrastive Learning from Exploratory Actions: Leveraging Natural Interactions for Preference Elicitation
Nathan Dennler, Stefanos Nikolaidis, Maja Matarić
2025 IEEE/ACM International Conference on Human-Robot Interaction (HRI) (Best Paper Finalist)
Learning from Physically Embodied Data
While screen-based interfaces can be used to collect natural interaction data from users, robots’ embodiment also allows us to create entirely new interfaces to learn from people, particularly through physical interactions with 3D objects. Learning how users move has wide implications for sports and exercise training, performing manipulation tasks in human environments, and evaluating physical therapy. Often, the goal of learning from embodied physical interaction with robots is to collect data from humans in order to teach robots. However, these communication channels are bidirectional, so we can also use robots to collect embodied information about humans.
This is particularly relevant in neurorehabilitation. The goal of neurorehabilitation is to restore mobility and independence to people who have experienced a neurological injury such as a stroke or spinal cord injury. Ideally, the exercises practiced within physical therapy sessions transfer to daily life. In practice, there is often a gap between conscious limb use in physical therapy sessions and automatic limb use in daily life. This gap is known as nonuse.
Nonuse is highly important for clinical evaluations, but it is difficult to quantify. The existing clinical standard, the Actual Amount of Use Test (AAUT), assesses arm use in fourteen daily tasks that typically require bimanual manipulation. The test is separated into two parts: the covert assessment and the conscious assessment. In the covert assessment, participants are recorded while they perform the fourteen bimanual tasks. In the conscious assessment, the purpose of the test is revealed to the participant, and they perform the fourteen tasks again, focusing on using both of their arms as much as possible. Because of this structure, the AAUT can only measure nonuse once: after participants realize they are being tested, subsequent administrations are invalid.
Robots offer several benefits for scalable physical testing because they can collect precise spatial and interaction information through physical channels. To evaluate this idea, we created the Bimanual Arm Reaching Test with a Robot (BARTR). In this test, a Socially Assistive Robot (SAR) instructs the user on how to perform the test, and a Physically Assistive Robot (PAR) moves a sensorized button around the workspace in front of the user. The test has two phases: in the first, users reach toward the button with either hand; in the second, they use only their more affected side. In both phases, the robot records the time to press the button and which side the user reached with.
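As a purely illustrative example of the kind of record each reach can produce, a trial might be logged with a structure like the one below; the field names are hypothetical, not the schema used in the study.

```python
# Hypothetical per-trial record for BARTR-style reaching data.
from dataclasses import dataclass
from typing import Literal


@dataclass
class ReachTrial:
    target_xyz: tuple[float, float, float]          # button position in the workspace
    phase: Literal["free_choice", "affected_only"]  # phase 1: either hand; phase 2: affected side only
    side_used: Literal["left", "right"]             # which arm pressed the button
    time_to_press_s: float                          # time from cue to button press
```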
From these physical interaction data, we formulated a metric to quantify arm nonuse, building on prior research on the relationship between arm choice and arm nonuse. We found that this metric was (A) valid, showing high correlation with the AAUT clinical assessment; (B) reliable, producing similar values for the same person across three repeated tests; and (C) simple to use, with an average score of ‘A’ on the System Usability Scale.
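The published metric is more nuanced than this, but a simplified proxy built on the hypothetical ReachTrial records above conveys the core idea: nonuse is high when targets that the affected arm demonstrably can reach (constrained phase) are nevertheless reached with the other arm when either hand is allowed (free-choice phase).

```python
# Simplified, illustrative nonuse proxy; not the metric reported in the paper.
import math


def nonuse_proxy(trials: list[ReachTrial], affected_side: str = "right") -> float:
    free = [t for t in trials if t.phase == "free_choice"]
    constrained = [t for t in trials if t.phase == "affected_only"]
    if not free or not constrained:
        raise ValueError("need trials from both phases")

    def nearest_free_trial(target):
        return min(free, key=lambda t: math.dist(t.target_xyz, target))

    # Targets the affected arm can reach (shown in the constrained phase) but was
    # not chosen for in the free-choice phase count toward nonuse.
    unused = sum(
        nearest_free_trial(c.target_xyz).side_used != affected_side
        for c in constrained
    )
    return unused / len(constrained)
```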
These benefits of the BARTR interaction demonstrate the efficacy of using robots to generate embodied data through physical interaction, rather than viewing robots merely as a data sink.
Read more in these papers:
- A Metric for Characterizing the Arm Nonuse Workspace in Poststroke Individuals Using a Robot Arm
Nathaniel Dennler, Amelia Cain, Erica De Guzman, Claudia Chiu, Carolee J. Winstein, Stefanos Nikolaidis, Maja J. Matarić
Science Robotics, 8(84), eadf7723
- Modeling Personalized Difficulty of Rehabilitation Exercises Using Causal Trees
Nathan Dennler, Zhonghao Shi, Uksang Yoo, Stefanos Nikolaidis, Maja Matarić
2025 IEEE/RAS-EMBS International Conference on Rehabilitation Robotics (ICORR)
Learning from Socially Embodied Data
While the most straightforward application of robotics is in physical manipulation and interaction, embodied knowledge is also created and transferred through social interaction. For example, the communication of information between people happens across several embodied social channels such as gaze, facial expression, and posture. Even language, which emerged from interpersonal needs, has meaning grounded in both the physical world and inferred social context. As a consequence, many people naturally and automatically express preferences, evaluations, and internal states through these embodied social channels.
We were interested in using these natural social communication channels to learn preferences about robots’ social actions in the context of a rehabilitation game for youth with cerebral palsy. In this game, youth communicate with the robot through a thumbs-up or thumbs-down gesture, and the robot selects among encouraging, rewarding, or clarifying social actions to facilitate voluntary participation in the game.
The key idea of this work was to learn users’ engagement dynamics during interaction with the robot. We obtained labels of users’ engagement states from annotators based on the gaze, facial expression, and posture of the user (estimating this from foundation models was not sufficiently accurate at the time). We learned individual models of users’ engagement dynamics, $T_{\text{user}}(s_t, a_t) \rightarrow p(s_{t+1})$, throughout the games users played.
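A minimal sketch of such a per-user dynamics model, assuming discrete engagement states and robot social actions and a simple count-based estimator (the actual state representation and learning method are described in the paper), might look like this:

```python
# Sketch: count-based per-user engagement dynamics T_user(s, a) -> p(s').
import numpy as np


class EngagementDynamics:
    def __init__(self, n_states: int, n_actions: int, smoothing: float = 1.0):
        # counts[s, a, s'] tallies observed transitions for one user
        self.counts = np.full((n_states, n_actions, n_states), smoothing)

    def update(self, s: int, a: int, s_next: int) -> None:
        self.counts[s, a, s_next] += 1

    def transition_probs(self, s: int, a: int) -> np.ndarray:
        """Estimate p(s' | s, a) for this user."""
        c = self.counts[s, a]
        return c / c.sum()

    def best_action(self, s: int, engagement_value: np.ndarray) -> int:
        """Pick the social action with the highest expected engagement value."""
        expected = [self.transition_probs(s, a) @ engagement_value
                    for a in range(self.counts.shape[1])]
        return int(np.argmax(expected))
```

The selection step hints at how such a model can be used online: the robot favors the encouraging, rewarding, or clarifying action that its model of this particular user predicts will keep them engaged.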
We found that we could implicitly learn users’ preferences from these engagement dynamics, and that using these models of users has the potential to increase positive outcomes in therapeutic interventions. Future work could similarly extend this technique to robots that are engaged in other forms of long-term interactions with humans to learn social preferences.
Read more in this paper:
- Personalizing User Engagement Dynamics in a Non-verbal Communication Game for Cerebral Palsy
Nathaniel Dennler, Catherine Yunis, Jonathan Realmuto, Terence Sanger, Stefanos Nikolaidis, Maja Matarić
2021 IEEE International Conference on Robot & Human Interactive Communication (RO-MAN)
Concluding Remarks
Embodied interfaces that capture the data users naturally emit have the potential to bring the benefits of internet-scale data to robotics. In the same way that mice and keyboards enabled people to create and share high-quality data on the internet, natural interfaces that facilitate communication in the physical world can lead to intuitive, data-generating interactions between users and robots.