
A USC RASC blog post by Jesse Zhang and Abrar Anwar.
Introduction
In The Automatic Motorist, a robot chauffeur takes a newlywed couple from Earth’s dirt roads to Saturn’s rocky rings—all while evading police. This 1911 (!) short film depicted a generalist robot before the word “robot” even existed.
Until recently, robotics research focused on specialists, task-specific policies that master one domain. Today, the trend is toward generalists, often vision-language-action (VLA) models trained on thousands of hours of robot data. To offset the high cost of collecting real-world demonstrations, researchers augment VLA training with human videos, cross-domain or simulation data, and alternative objectives such as affordances and object detection.
But these sources rarely provide detailed information about the exact scene or precise low-level actions needed for unseen tasks. For that, we need scene- and robot-specific real-world data. Here, we argue for a renewed focus on specialists—robots that collect and learn from their own real-world experience—while still leveraging the knowledge gained from the generalist boom.
The Need for Real-World Data
As explained in Sergey Levine’s recent Sporks of AGI blog post, real robot data is “indispensable if we are to truly build robotic foundation model that can generalize.” Real robot data is important in part because collecting it allows us to align the robot’s deployment conditions with its pre-trained knowledge. When we don’t, even state-of-the-art VLAs struggle to achieve reliable performance, as minor environment or task changes can push the robot out of distribution.
The challenge: teaching robots new tasks still requires collecting large amounts of data—a costly and slow process. We believe this specialization must ultimately happen autonomously: robots should improve themselves during deployment without waiting for constant human feedback. Scaling human labeling, demonstrations, or resets simply cannot keep pace with the diversity of real-world environments.
Real-world reinforcement learning (RL) is one promising way to enable this autonomy. RL fine-tuning lets a generalist robot specialize to its deployment domain, but it introduces hurdles of its own that have so far kept it from widespread deployment.
Specifically, how can we:
- Learn reward functions from data when the real world lacks the privileged information that RL in simulation commonly relies on?
- Effectively leverage offline data when tackling new tasks that weren’t in the original dataset?
- Evaluate performance across multiple tasks in a principled way—how do we know where a policy works and where it fails?
- Handle the reset problem when resets are costly or simply unavailable in real-world settings?
In the rest of this blog post, we explore some of our work aimed at answering these questions, especially now that large datasets and large pre-trained models have advanced the training of generalist robot policies.
Real-World Reward Functions
If we want to use RL in the real world, we need reward functions. In simulation, rewards typically come from privileged state information, but that information is not available on a real robot; to adapt robots in the real world with RL, we must learn image-based reward functions. Learned rewards can help robots specialize by improving performance on known tasks (e.g., from 20% to 100% success) or by adapting to new task variations.
Previous approaches based on embedding distances (RoboCLIP, LIV, Rank2Reward, VLC) or generative models (GVL) did not generalize well enough for real-world policy learning: these models were not trained on enough robotics data and therefore usually require human demonstrations for each intended task. Requiring human demos for every new task undercuts a robot’s ability to be autonomous.
We detail our method, ReWiND, which helps robots specialize more autonomously by learning from real-world interaction.

Specifically, we train a specialist language-conditioned video reward model that predicts dense task progress from a small set (5 per task) of language-labeled robot demos, co-training it with a subset of Open-X data to enhance generalization. However, reward models trained only on successful demonstrations do not know how to reward a policy when it is failing. Thus, we introduce a video rewinding mechanism (gif below) that teaches ReWiND to provide useful negative reward feedback when the policy is making mistakes.
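To make the rewinding idea concrete, here is a minimal Python sketch (the array shapes, helper name, and linear progress labels are our illustrative assumptions, not the exact ReWiND implementation): reversing the tail of a successful demo produces frames whose progress labels decrease, giving the reward model examples of what “undoing” the task looks like.

```python
import numpy as np

def rewind_augment(frames: np.ndarray, rewind_len: int, rng: np.random.Generator):
    """Create a rewound clip with decreasing progress labels.

    frames: (T, H, W, C) array from a successful demo; rewind_len: how many
    frames to play back in reverse at the end. Progress rises along the forward
    segment and then falls as the clip rewinds, so the reward model also sees
    what "undoing" the task looks like.
    """
    T = len(frames)
    cut = int(rng.integers(rewind_len, T))          # end of the forward segment
    forward = frames[:cut]                          # normal, increasing progress
    rewound = frames[cut - rewind_len:cut][::-1]    # same frames, reversed

    forward_progress = np.linspace(0.0, cut / T, cut)
    rewound_progress = forward_progress[-rewind_len:][::-1]   # decreasing labels

    clip = np.concatenate([forward, rewound], axis=0)
    labels = np.concatenate([forward_progress, rewound_progress], axis=0)
    return clip, labels

# Usage: augment each demo clip before training the reward model.
# clip, labels = rewind_augment(demo_frames, rewind_len=8,
#                               rng=np.random.default_rng(0))
```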

To learn a new task, we perform online RL directly in the real world. We first pre-train a policy on the demos, with rewards coming from the ReWiND reward function. Then, for the new task, we perform offline-to-online RL with dense rewards again provided by ReWiND, conditioned only on the new task instruction and videos of the policy attempting the task. With this recipe, we improve robot policies by 5x in only 1 hour of online interaction (see video and bar chart below).
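For a rough picture of how the learned reward plugs into the online phase, here is a hedged sketch (the `reward_model`, `policy`, `env`, and buffer interfaces are hypothetical stand-ins, not the actual ReWiND code): roll out the current policy, label the episode with dense rewards from the video reward model conditioned on the language instruction, and run off-policy updates on a mix of offline demos and fresh online data.

```python
def sample_mixed(offline_buffer, online_buffer, batch_size):
    """Hypothetical helper: draw half the batch from demos, half from online data."""
    half = batch_size // 2
    return offline_buffer.sample(half) + online_buffer.sample(batch_size - half)

def online_finetune(policy, reward_model, env, instruction,
                    offline_buffer, online_buffer,
                    num_episodes=50, updates_per_episode=100):
    """Offline-to-online RL with a learned, language-conditioned video reward.

    Hypothetical interfaces: reward_model(frames, text) returns one dense
    reward per step, env.step returns no task reward (the real world provides
    none), and policy.update(batch) is any off-policy RL update.
    """
    for _ in range(num_episodes):
        obs, done = env.reset(), False
        frames, transitions = [], []
        while not done:
            action = policy.act(obs)
            next_obs, done, info = env.step(action)    # no environment reward
            frames.append(info["image"])                # camera frame for the reward model
            transitions.append((obs, action, next_obs, done))
            obs = next_obs

        # Label the whole episode with dense rewards, conditioned only on
        # the new task's language instruction and the rollout video.
        rewards = reward_model(frames, instruction)
        for (o, a, o2, d), r in zip(transitions, rewards):
            online_buffer.add(o, a, r, o2, d)

        # Off-policy updates on a mixture of offline demos and online data.
        for _ in range(updates_per_episode):
            policy.update(sample_mixed(offline_buffer, online_buffer, batch_size=256))
    return policy
```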

We are presenting this paper as an oral at CoRL 2025.
Leveraging Actions from Offline Data for New Tasks
A key challenge in specialization is: how can we effectively use actions from offline data when addressing tasks not included in the original dataset? While great approaches exist (like Pi’s FAST Tokenizer), we’ve developed an RL-focused solution:
Our CoRL 2024 work EXTRACT enables efficient specialization by extracting discrete, parameterized skills from offline data. These reusable skills transform low-level control into function calls (like rotate(x, y)), simplifying the learning problem.

EXTRACT works in three stages:
- Offline Skill Extraction: We use vision-language models to cluster visual difference embeddings, identifying discrete high-level behaviors from the offline data.
- Offline Skill Learning: We train a skill decoder that maps from skill IDs and continuous arguments to variable-length action sequences, along with priors for skill selection and argument generation.
- Online Skill-Based RL: For new tasks, we train a policy to select discrete skills and their continuous arguments, guided by our learned priors to efficiently explore the action space (a rough sketch of this skill interface follows the list).
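To make the skill interface concrete, here is a hedged PyTorch-style sketch of stages 2 and 3 (the module names, sizes, and fixed-length skill horizon are illustrative assumptions; EXTRACT’s actual skills are variable-length and its architectures differ): a decoder maps a discrete skill ID plus continuous arguments to a short action sequence, so the online policy only has to choose the ID and the arguments.

```python
import torch
import torch.nn as nn

NUM_SKILLS, ARG_DIM, ACTION_DIM, HORIZON = 8, 4, 7, 10   # illustrative sizes

class SkillDecoder(nn.Module):
    """Maps a (skill ID, continuous arguments) pair to a short action sequence."""
    def __init__(self):
        super().__init__()
        self.skill_embed = nn.Embedding(NUM_SKILLS, 32)
        self.net = nn.Sequential(
            nn.Linear(32 + ARG_DIM, 256), nn.ReLU(),
            nn.Linear(256, HORIZON * ACTION_DIM),
        )

    def forward(self, skill_id, args):
        z = torch.cat([self.skill_embed(skill_id), args], dim=-1)
        return self.net(z).view(-1, HORIZON, ACTION_DIM)

class SkillPolicy(nn.Module):
    """Online RL policy: given a state, pick a discrete skill and its arguments."""
    def __init__(self, state_dim=39):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU())
        self.skill_logits = nn.Linear(256, NUM_SKILLS)    # discrete skill choice
        self.arg_head = nn.Linear(256, ARG_DIM)           # continuous arguments

    def forward(self, state):
        h = self.trunk(state)
        skill_id = torch.distributions.Categorical(logits=self.skill_logits(h)).sample()
        return skill_id, self.arg_head(h)

# One "step" at the skill level: the policy picks something like rotate(x, y),
# and the decoder expands it into HORIZON low-level actions for the robot.
# policy, decoder = SkillPolicy(), SkillDecoder()
# skill_id, args = policy(state)         # state: (B, state_dim) float tensor
# actions = decoder(skill_id, args)      # (B, HORIZON, ACTION_DIM)
```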
By structuring the RL problem around these transferable skills, EXTRACT achieves significant sample efficiency gains—10x better than prior skill-based methods in the Franka Kitchen environment. Below we see an example of real-world online fine-tuning in FurnitureBench.
In the future, a robot that has seen pre-training data of assembling tables can extract a parameterized “insert and screw in” skill and adapt it to assembling chairs via real-world RL, rather than re-learning it from scratch.
Evaluation
When adapting generalist policies to new tasks through real-world RL, we also need a reliable way to evaluate performance. Especially in multi-task settings, it’s not enough to just track how well we’re doing on a single goal. We need to maintain estimates of performance across a suite of tasks, even unseen ones. This is critical both for debugging failures and for deciding when to stop fine-tuning or reallocate data collection efforts. But in practice, exhaustive evaluation across tasks is infeasible.
In another CoRL 2025 work, we took a first step toward scalable evaluation by framing it as a probabilistic matrix completion problem. After each human evaluation of a trained policy, we update a surrogate model f_surrogate(π_i, T_j) that uses language and policy priors to predict the performance distribution of unseen policy-task pairs. Following the active testing literature, we then use f_surrogate to select the next policy-task pair for a human to evaluate, choosing the pair that maximizes information gain so as to reduce the total number of evaluations needed. Though this is not yet fully autonomous, it is a step toward reducing human involvement in robot evaluation.
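As a simplified illustration of the selection step, here is a hedged sketch (the surrogate interface and the variance-based acquisition rule are our own stand-ins; the paper’s surrogate and information-gain criterion are more involved): after each human evaluation, the surrogate updates its predictions over the policy-task matrix, and we query the unevaluated cell whose outcome is most uncertain.

```python
import numpy as np

def select_next_pair(var, evaluated):
    """Pick the unevaluated (policy, task) cell with the highest predictive variance.

    var: (num_policies, num_tasks) array of surrogate uncertainty about the
    success rate of policy i on task j; evaluated: boolean mask of cells a
    human has already scored. Highest variance is a simple proxy for the
    information-gain criterion used in active testing.
    """
    masked_var = np.where(evaluated, -np.inf, var)
    return np.unravel_index(np.argmax(masked_var), masked_var.shape)

def evaluation_loop(surrogate, human_eval, num_rounds):
    """Alternate surrogate updates with human evaluations of the selected pair."""
    for _ in range(num_rounds):
        _, var = surrogate.predict()                   # hypothetical surrogate interface
        i, j = select_next_pair(var, surrogate.evaluated_mask())
        score = human_eval(policy_idx=i, task_idx=j)   # human runs the rollout(s)
        surrogate.update(i, j, score)                  # condition on the new result
    return surrogate.predict()                         # final estimates for every cell
```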

If we want real-world RL to scale, selecting which task to actively improve becomes just as important as collecting data. While our work focused on estimating performance across policies and tasks, there’s a broader need in the community: how do we continually track performance as policies evolve in the real world?
What’s next? Resets, VLA fine-tuning, Exploration, etc.
Real-world RL for policy specialization is hard, and we’re only scratching the surface. The reset problem remains an elephant in the room and is something we should address as a community. We hope that smart evaluation, combined with strong pre-trained models, can partially address it: by estimating how well these models can perform tasks that reset other tasks (e.g., a drawer-opening task can reset a drawer-closing one), we can better predict which tasks are worth attempting to learn.
In addition, fine-tuning large pre-trained models with RL is difficult. Some nice work and blog posts exist on fine-tuning these large models, e.g.:
- Haonan Yu’s blog post on OpenVLA finetuning with online RL
- VLA-RL by Guanxing Lu et al.
- ConRFT by Yuhui Chen et al.
- Diffusion Steering by Andrew Wagenmaker et al.
More generally, we should be able to bootstrap specialization from the reasonable behaviors these models already encode by using them for more intelligent exploration. We’re excited to see how the community advances these techniques in the near future.
In conclusion, we are arguing for a renewed focus on training specialists. While pre-trained generalist models offer a foundation, real-world reinforcement learning enables robots to autonomously adapt to specific environments. By addressing challenges in reward learning, skill extraction, and evaluation, we can bridge the gap between research and practical deployment. The future of robotics depends on developing systems that efficiently specialize to meet diverse real-world needs.
Acknowledgements
We thank Erdem Bıyık for helpful feedback on this blog post!