The widespread success of foundation models like ChatGPT, Gemini, DeepSeek, and Claude demonstrates the effectiveness of scaling data collection to address many vision and language tasks. Robotics has also benefited from learning from large amounts of high-quality data, but robots still cannot reliably perform many of the tasks that people do every day without thinking. The performance gap between robot systems and other forms of artificial intelligence stems from the fact that robots are physically embodied.
Philosophically, having a physical body changes the kinds of data that an embodied agent has access to. For example, a person’s physical embodiment provides them with a limited field of view and prevents them from being in multiple places at once. Access to data is further shaped by a person’s flexibility, strength, known motor skills, age, and numerous other embodied social and physical factors.
The limitations of embodiment have direct implications for the production of large-scale datasets. People are generally socially motivated to share data on the internet that is interesting. Interesting data accrues interaction and social capital, which satisfy the basic human psychological needs of autonomy, belonging, and competence. The quality of “being interesting” can be thought of as data that has low probability for the population in general, but higher probability for the person sharing it. If there were no social impetus to post interesting data, the internet would likely look wildly different.
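One rough way to make “being interesting” concrete (our own shorthand here, not a formal definition from the literature) is as the gap between how probable a piece of content is under the sharer’s own experience and how probable it is for the population at large:

$$\text{interestingness}(x) \;\approx\; \log p_{\text{sharer}}(x) - \log p_{\text{population}}(x)$$

Under this view, routine daily activities score near zero, regardless of how useful they would be as training data.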
One of the key advantages of robots is their capacity to automate the minutiae of day-to-day life that people are generally unmotivated to perform. Because these dull tasks are not reflective of activities generally posted on the internet, there is a large distribution shift between the data that we have in vast quantities online (“interesting”) and the data we need to train functional robot policies (“uninteresting”). While this sounds like a difficult problem, keep in mind that uninteresting data are only uninteresting because everyone is constantly producing them. The only thing we need to do is actually collect them.
The key insight of this blog post is that people naturally emit data through many embodied channels, but our existing interfaces do not (yet) effectively capture the data from these embodied channels.
Learning from Intrinsically Rewarding Interface Interactions
Data collected through digital interfaces are compelling because they already exist in a format we can directly use for robot learning. Numerous works have shown great benefits of using digital interfaces to collect robot demonstrations, pairwise preferences, trajectory rankings, or text/language specifications. These interfaces work well when users know how they want the robot to act. In practice, users interacting with novel technology do not yet know how they want robots to act, and engage in exploratory search processes to understand both robot capability (i.e., the limits of its embodiment) and their own preferences for robot behaviors. We examined how users perform exploratory search when designing signals for a robot that navigates around the room to help them find misplaced items.
To encourage this process, we created RoSiD, a signal design interface that promotes exploratory search by allowing users to review several robot behaviors at once. An example of a user performing exploratory search with RoSiD is shown below. In our first study, 25 non-expert users selected visual, auditory, and kinetic robot behaviors for four different robot signals according to their own preferences.
We found that users’ explorations communicated their internal decision-making processes: users examined relevant behaviors in more detail by playing them on the physical robot, and ignored behaviors that were irrelevant to their goals. Not only does this interaction give us a new and scalable data source for learning robot representations, users also described the exploratory search process as fun (citing parallels to the video game franchise “The Sims”).
To learn representations from this new datasource, we formulated a contrastive objective using a triplet loss between the behaviors that users chose to explore in detail and the behaviors that users chose to ignore. We call this formulation Contrastive Learning from Exploratory Actions (CLEA), illustrated below.
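A minimal sketch of such a triplet objective, in PyTorch, is below. The encoder architecture, input features, and the way explored behaviors are paired within a batch are illustrative assumptions, not the exact formulation from the paper.

```python
# Sketch of a CLEA-style triplet objective over exploratory actions.
# Assumption: behaviors a user played on the robot ("explored") should embed
# close to each other and far from behaviors the user skipped ("ignored").
import torch
import torch.nn as nn


class BehaviorEncoder(nn.Module):
    """Maps a behavior feature vector (visual, auditory, or kinetic) to an embedding."""

    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)


def clea_triplet_loss(encoder, explored, ignored, margin=0.5):
    """explored, ignored: (B, D) feature batches logged from users' exploration sessions."""
    anchor = encoder(explored)
    # Pair each explored behavior with another explored behavior in the batch
    # (in practice, positives would come from the same user's session).
    positive = encoder(torch.roll(explored, shifts=1, dims=0))
    negative = encoder(ignored)
    return nn.TripletMarginLoss(margin=margin)(anchor, positive, negative)
```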
We learned representations for each of the three modalities of robot behaviors (visual, auditory, and kinetic). To evaluate whether these representations are useful in other downstream preference learning tasks, we recruited a new set of 40 non-expert users to engage in a traditional behavior ranking task, where users ordered robot behaviors from their least favorite to their most favorite. We found that CLEA representations generalized to new users and learned users’ preferences with 80% less data than self-supervised methods required.
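As an illustration of this downstream step, here is a hedged sketch that fits a linear preference score over frozen behavior embeddings from pairwise comparisons (a Bradley-Terry-style model); the actual method and hyperparameters in our evaluation may differ.

```python
# Sketch: learn a user's preferences as a linear score over frozen embeddings.
import torch


def fit_preferences(embeddings, comparisons, lr=0.05, steps=200):
    """embeddings: (N, E) frozen behavior embeddings (e.g., from CLEA).
    comparisons: list of (i, j) pairs meaning behavior i was ranked above behavior j."""
    w = torch.zeros(embeddings.shape[1], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    i_idx = torch.tensor([i for i, _ in comparisons])
    j_idx = torch.tensor([j for _, j in comparisons])
    for _ in range(steps):
        opt.zero_grad()
        scores = embeddings @ w
        # Bradley-Terry: maximize P(i preferred over j) = sigmoid(score_i - score_j)
        loss = -torch.nn.functional.logsigmoid(scores[i_idx] - scores[j_idx]).mean()
        loss.backward()
        opt.step()
    return w.detach()  # rank unseen behaviors by embeddings @ w
```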
When robots are widely deployed in homes, CLEA has the potential to let robots learn behavior representations from the preferences of large user populations, because exploratory actions are easy to collect at scale. These representations can significantly speed up adaptation to new user preferences and environments.
Read more in these papers:
- The RoSiD Tool: Empowering Users to Design Multimodal Signals for Human-Robot Collaboration
Nathan Dennler, David Delgado, Daniel Zeng, Stefanos Nikolaidis, Maja Matarić
2023 IFRR International Symposium on Experimental Robotics (ISER)
- Contrastive Learning from Exploratory Actions: Leveraging Natural Interactions for Preference Elicitation
Nathan Dennler, Stefanos Nikolaidis, Maja Matarić
2025 IEEE/ACM International Conference on Human-Robot Interaction (HRI) (Best Paper Finalist)
Learning from Physically Embodied Data
While screen-based interfaces can be used to collect natural interaction data from users, robots’ embodiment also allows us to create entirely new interfaces to learn from people, particularly through physical interactions with 3D objects. Learning how users move has wide implications for sports and exercise training, performing manipulation tasks in human environments, and evaluating physical therapy. Often, the goal of learning from embodied physical interaction with robots is to collect data from humans in order to teach robots. However, these communication channels are bidirectional, so we can also use robots to collect embodied information about humans.
This is particularly relevant in neurorehabilitation. The goal of neurorehabilitation is to restore mobility and independence to people who have experienced a neurological injury such as a stroke or spinal cord injury. Ideally, the exercises practiced within physical therapy sessions transfer to daily life. In practice, there is often a gap between conscious limb use in physical therapy sessions and automatic limb use in daily life. This gap is known as nonuse.
Nonuse is highly important for clinical evaluations, but it is difficult to quantify. The existing clinical standard, the Actual Amount of Use Test (AAUT), assesses arm use in fourteen daily tasks that typically require bimanual manipulation. The test is separated into two parts: the covert assessment and the conscious assessment. In the covert assessment, participants are recorded while they perform the fourteen bimanual tasks. In the conscious assessment, the purpose of the test is revealed to the participant, and they perform the fourteen tasks again, focusing on using both of their arms as much as possible. Because of this structure, the AAUT can only measure nonuse once: after participants realize they are being tested, subsequent administrations are invalid.
Robots offer several benefits for scalable physical testing because they can collect precise spatial and interaction information through physical channels. To evaluate this idea, we created the Bimanual Arm Reaching Test with a Robot (BARTR). In this test, a Socially Assistive Robot (SAR) instructs the user on how to perform the test, and a Physically Assistive Robot (PAR) moves a sensorized button around the workspace in front of the user. The test has two phases: in the first, users reach toward the button with either hand; in the second, they use only their more affected side. In both phases, the robot records the time to press the button and which side the user reached with.
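As a purely illustrative example of the kind of record each reach can produce, a trial might be logged with a structure like the one below; the field names are hypothetical, not the schema used in the study.

```python
# Hypothetical per-trial record for BARTR-style reaching data.
from dataclasses import dataclass
from typing import Literal


@dataclass
class ReachTrial:
    target_xyz: tuple[float, float, float]          # button position in the workspace
    phase: Literal["free_choice", "affected_only"]  # phase 1: either hand; phase 2: affected side only
    side_used: Literal["left", "right"]             # which arm pressed the button
    time_to_press_s: float                          # time from cue to button press
```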
From these physical interaction data, we formulated a metric to quantify arm nonuse, building on prior research on the relationship between arm choice and arm nonuse. We found that this metric was (A) valid, showing high correlation with the AAUT clinical assessment; (B) reliable, producing similar values for the same person across three repeated tests; and (C) simple to use, with an average score of ‘A’ on the System Usability Scale.
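The published metric is more nuanced than this, but a simplified proxy built on the hypothetical ReachTrial records above conveys the core idea: nonuse is high when targets that the affected arm demonstrably can reach (constrained phase) are nevertheless reached with the other arm when either hand is allowed (free-choice phase).

```python
# Simplified, illustrative nonuse proxy; not the metric reported in the paper.
import math


def nonuse_proxy(trials: list[ReachTrial], affected_side: str = "right") -> float:
    free = [t for t in trials if t.phase == "free_choice"]
    constrained = [t for t in trials if t.phase == "affected_only"]
    if not free or not constrained:
        raise ValueError("need trials from both phases")

    def nearest_free_trial(target):
        return min(free, key=lambda t: math.dist(t.target_xyz, target))

    # Targets the affected arm can reach (shown in the constrained phase) but was
    # not chosen for in the free-choice phase count toward nonuse.
    unused = sum(
        nearest_free_trial(c.target_xyz).side_used != affected_side
        for c in constrained
    )
    return unused / len(constrained)
```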
These benefits of the BARTR interaction demonstrate the efficacy of using robots to generate embodied data through physical interaction, rather than viewing robots merely as a data sink.
Read more in these papers:
- A Metric for Characterizing the Arm Nonuse Workspace in Poststroke Individuals Using a Robot Arm
Nathaniel Dennler, Amelia Cain, Erica De Guzman, Claudia Chiu, Carolee J. Winstein, Stefanos Nikolaidis, Maja J. Matarić
Science Robotics, 8(84), eadf7723
- Modeling Personalized Difficulty of Rehabilitation Exercises Using Causal Trees
Nathan Dennler, Zhonghao Shi, Uksang Yoo, Stefanos Nikolaidis, Maja Matarić
2025 IEEE/RAS-EMBS International Conference on Rehabilitation Robotics (ICORR)
Learning from Socially Embodied Data
While the most straightforward application of robotics is in physical manipulation and interaction, embodied knowledge is also created and transferred through social interaction. For example, the communication of information between people happens across several embodied social channels such as gaze, facial expression, and posture. Even language, which emerged from interpersonal needs, has meaning grounded in both the physical world and inferred social context. As a consequence, many people naturally and automatically express preferences, evaluations, and internal states through these embodied social channels.
We were interested in using these natural social communication channels to learn preferences about robots’ social actions in the context of a rehabilitation game for youth with cerebral palsy. In this game, youth communicate with the robot through a thumbs-up or thumbs-down gesture, and the robot selects among encouraging, rewarding, or clarifying social actions to facilitate voluntary participation in the game.
The key idea of this work was to learn users’ engagement dynamics during interaction with the robot. We obtained labels of users’ engagement states from annotators based on the gaze, facial expression, and posture of the user (estimating this from foundation models was not sufficiently accurate at the time). We learned individual models of users’ engagement dynamics, $T_{\text{user}}(s_t, a_t) \rightarrow p(s_{t+1})$, throughout the games users played.
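A minimal sketch of such a per-user dynamics model, assuming discrete engagement states and robot social actions and a simple count-based estimator (the actual state representation and learning method are described in the paper), might look like this:

```python
# Sketch: count-based per-user engagement dynamics T_user(s, a) -> p(s').
import numpy as np


class EngagementDynamics:
    def __init__(self, n_states: int, n_actions: int, smoothing: float = 1.0):
        # counts[s, a, s'] tallies observed transitions for one user
        self.counts = np.full((n_states, n_actions, n_states), smoothing)

    def update(self, s: int, a: int, s_next: int) -> None:
        self.counts[s, a, s_next] += 1

    def transition_probs(self, s: int, a: int) -> np.ndarray:
        """Estimate p(s' | s, a) for this user."""
        c = self.counts[s, a]
        return c / c.sum()

    def best_action(self, s: int, engagement_value: np.ndarray) -> int:
        """Pick the social action with the highest expected engagement value."""
        expected = [self.transition_probs(s, a) @ engagement_value
                    for a in range(self.counts.shape[1])]
        return int(np.argmax(expected))
```

The selection step hints at how such a model can be used online: the robot favors the encouraging, rewarding, or clarifying action that its model of this particular user predicts will keep them engaged.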
We found that we could implicitly learn users’ preferences from these engagement dynamics, and that using these models of users has the potential to increase positive outcomes in therapeutic interventions. Future work could similarly extend this technique to robots that are engaged in other forms of long-term interactions with humans to learn social preferences.
Read more in this paper:
- Personalizing User Engagement Dynamics in a Non-verbal Communication Game for Cerebral Palsy
Nathaniel Dennler, Catherine Yunis, Jonathan Realmuto, Terence Sanger, Stefanos Nikolaidis, Maja Matarić
2021 IEEE International Conference on Robot & Human Interactive Communication (RO-MAN)
Concluding Remarks
Embodied interfaces that capture the data users naturally emit have the potential to bring the benefits of internet-scale data to robotics. In the same way that mice and keyboards enabled people to create and share high-quality data on the internet, natural interfaces that facilitate communication in the physical world can lead to intuitive, data-generating interactions between users and robots.