Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models
Authors: Yanru Wu, Weiduo Yuan, Ang Qi, Vitor Guizilini, Jiageng Mao, Yue Wang
Institution: USC Physical Superintelligence Lab / Toyota Research Institute
Paper: arXiv:2603.16065v2, March 2026
1. Key Themes
Eliminating the Reward Engineering Bottleneck for Real-World RL
The core problem this paper solves is one of the biggest obstacles to deploying RL on physical robots: you need a reward function before RL can work, and writing that function by hand is expensive, brittle, and doesn't generalize. LRMs replace human-engineered rewards with a fine-tuned VLM that watches camera frames and decides how well the robot is doing — no simulator state access, no task-specific code, no human labelers.
The paper frames this plainly: "traditional methods rely heavily on either labor-intensive human labeling or brittle, task-specific hand-coded objectives" (Section I). The replacement is a frozen reward model that operates "in a purely zero-shot manner within these test environments" (Abstract).
Three Complementary Reward Signals Beat One Monolithic Score
Rather than asking the VLM "how well is the robot doing overall?", the authors decompose the reward into three distinct signals that address different failure modes: (1) a Temporal Contrastive Reward that compares two frames to ask "are we moving forward or backward?", (2) an Absolute Progress Reward that estimates percentage completion from a single frame, and (3) a Task Completion Reward that gives a binary yes/no on task success.
Each signal has a different role. As the paper explains: "the Temporal Contrastive Reward is designed to provide a relative directional gradient... the Absolute Progress Reward performs continuous progress regression... the Task Completion Reward acts as a definitive terminal signal" (Section III-B). The Task Completion variant achieves the highest simulation success rate at 60.93%, compared to the 56.88% imitation learning baseline (Table III).
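To make the decomposition concrete, here is a minimal sketch of how the three signals could sit behind one interface. Everything in it is an assumption for illustration: the `VLM` callable, the prompt wording, the yes/no parsing, and the +1/-1 mapping for the contrastive signal are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

Frame = bytes  # stand-in for an encoded camera image
# Hypothetical VLM interface: takes a text prompt plus frames, returns a text answer.
VLM = Callable[[str, Sequence[Frame]], str]


def _is_yes(answer: str) -> bool:
    return answer.strip().lower().startswith("yes")


def temporal_contrastive_reward(vlm: VLM, task: str, earlier: Frame, later: Frame) -> float:
    """Relative signal: +1 if the later frame looks closer to the goal, else -1 (assumed mapping)."""
    prompt = f"Task: {task}. Is the second image closer to completion than the first? Answer yes or no."
    return 1.0 if _is_yes(vlm(prompt, [earlier, later])) else -1.0


def absolute_progress_reward(vlm: VLM, task: str, frame: Frame) -> float:
    """Single-frame progress estimate, parsed from a percentage and clamped to [0, 1]."""
    prompt = f"Task: {task}. Estimate progress as a percentage from 0 to 100."
    percent = float(vlm(prompt, [frame]).strip().rstrip("%"))
    return min(max(percent / 100.0, 0.0), 1.0)


def task_completion_reward(vlm: VLM, task: str, frame: Frame) -> float:
    """Binary terminal signal: 1 if the task looks complete in this frame, else 0."""
    prompt = f"Task: {task}. Is the task complete? Answer yes or no."
    return 1.0 if _is_yes(vlm(prompt, [frame])) else 0.0
```

Keeping the three functions separate mirrors the point of the section: each one answers a narrower question than "how is the robot doing overall?", so each can be evaluated and swapped independently.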
Meaningful Gains in Very Few RL Iterations
The practical punchline: this works fast. Starting from a π0.5 imitation learning baseline, the LRM-guided RL achieves measurable performance gains in just 30 iterations across 320 parallel environments. Real-world results show a jump from 38.3% to 51.7% task success on a physical pick-and-place task.
The paper characterizes this as "remarkable sample efficiency" and notes that "our method significantly improves the success rate of the initial IL policy within just 30 RL iterations" (Abstract). For teams who can't afford thousands of real-world rollouts, this matters enormously.
Training on Human Video Closes the Gap Between Robot Clumsiness and Success Criteria
A non-obvious design choice: the authors include human-object interaction data (HOI4D, EgoDex) in the reward model training set, not just robot trajectories. The rationale is that humans demonstrate cleaner, higher-precision versions of the same manipulation goals robots are attempting.
The paper states that "human video data offers a great standard for precise manipulation and successful task completion, effectively bridging the gap between coarse robot movements and fine-grained success criteria" (Section III-A). This is significant because it means the reward model is calibrated against what good execution looks like, not just what robots typically do.
The Reward Model Gets Better As the Policy Gets Better
One of the more elegant findings: as RL training progresses, the LRM's reward signals become more aligned with ground truth, not less. The ROC-AUC for the progress reward model improves from 0.874 to 0.950, and per-trajectory Pearson correlation for the progress model rises from 0.577 to 0.671 (Table V).
The paper explains this as "emergent synchronization between RL-driven behaviors and LRM-perceived value: as the policy internalizes the LRM's physical priors, it generates trajectories with clearer semantic markers and more distinct physical transitions" (Section V). In other words, a better robot is easier for the reward model to evaluate — creating a virtuous cycle.
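For readers who want to run the same kind of alignment check on their own rollouts, the two metrics named above can be computed with a few lines of standard-library Python. This is a generic sketch of per-trajectory Pearson correlation and rank-based ROC-AUC, not the paper's evaluation code; the function names and input layout are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences (undefined if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_per_trajectory_pearson(trajectories):
    """Average the correlation over trajectories given as (predicted, ground_truth) pairs."""
    scores = [pearson(pred, truth) for pred, truth in trajectories]
    return sum(scores) / len(scores)


def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count as half)."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

Computing the correlation per trajectory and then averaging, rather than pooling all frames, keeps long episodes from dominating the score.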
2. Contrarian Perspectives
You Don't Need a Simulator to Scale Robot RL — You Need a Better Reward Model
Conventional wisdom in robotics RL is that you need a high-fidelity simulator with built-in ground-truth reward signals to safely scale training. This paper argues the bottleneck isn't simulation access — it's reward quality. The LRM operates entirely from camera images with no privileged simulator state, yet narrows the gap between imitation learning and ground-truth reward performance substantially.
The numbers make this concrete: LRM-guided RL reaches 60.93% success, against a 56.88% imitation learning baseline and a privileged "Env Reward" ceiling of 66.87% (Table III). That is a gain of about 4 points over the imitation baseline, which recovers roughly 41% of the roughly 10-point headroom between baseline and ceiling, all from vision alone. If this scales, teams that lack sim infrastructure have a credible path to RL.
Trajectory-Level Reward Evaluation Is the Wrong Unit of Analysis
Several existing systems (RoboReward, RoboMeter) evaluate entire video trajectories — either post-hoc or by accumulating all frames seen so far. The authors argue this is architecturally wrong for online RL. Post-hoc evaluation means you can't correct mid-episode errors. Accumulating frames creates inference latency that compounds over time.
The paper states directly: "this sparse and delayed signal lacks the temporal resolution necessary for active, real-time guidance... as the environment steps accumulate, this expanding visual context significantly increases the sequential computational load of the autoregressive backbone, leading to non-negligible inference latency" (Section I). The LRM's frame-level evaluation addresses both problems, and the comparative results show it outperforms both RoboReward-8B (59.06%) and RoboMeter-4B (56.56%) (Table IV).
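The latency argument can be illustrated with a toy count of how many frames the reward model has to read over an episode. This is only a proxy: real latency depends on tokens per frame and on the attention cost of the backbone, and the default of two frames per query is an assumption based on the contrastive reward comparing two frames.

```python
def frames_read_accumulating(num_queries: int) -> int:
    """Total frames read when query i re-reads all i frames collected so far."""
    return sum(range(1, num_queries + 1))


def frames_read_frame_level(num_queries: int, frames_per_query: int = 2) -> int:
    """Total frames read when every query sees a fixed, small number of frames."""
    return num_queries * frames_per_query


for queries in (10, 20, 40):
    print(queries, frames_read_accumulating(queries), frames_read_frame_level(queries))
```

The accumulating count grows quadratically with the number of queries while the frame-level count grows linearly, which is the shape of the problem the quoted passage describes.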
Fine-Tuning a VLM on Unlabeled Video Is Sufficient for Reward Generation — No Human Labels Needed
The standard assumption for training reward models is that you need human preference labels, carefully curated demonstrations, or privileged task success information. This paper uses none of that. The entire training signal comes from the temporal position of frames within video episodes — the assumption being that later in a successful trajectory means closer to goal.
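A minimal sketch of how supervision could be derived from frame position alone is shown below. The linear normalization, the "forward"/"backward" label names, and the choice to mark only the final frame as complete are assumptions made for illustration; this summary does not specify the paper's exact label construction.

```python
import random


def progress_labels(num_frames: int) -> list[float]:
    """Progress target per frame: 0.0 at the first frame, 1.0 at the last (assumed linear)."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [i / (num_frames - 1) for i in range(num_frames)]


def contrastive_pairs(num_frames: int, num_pairs: int, rng: random.Random):
    """Sample frame-index pairs labelled by whether the second frame comes later in time."""
    pairs = []
    for _ in range(num_pairs):
        i, j = rng.sample(range(num_frames), 2)  # two distinct frame indices
        pairs.append((i, j, "forward" if j > i else "backward"))
    return pairs


def completion_labels(num_frames: int, tail: int = 1) -> list[bool]:
    """Mark the last `tail` frames of a successful episode as 'task complete' (assumed)."""
    return [i >= num_frames - tail for i in range(num_frames)]
```

All three label types come from the frame index, which is why no human annotation is needed. The same property means the labels are only as good as the assumption that the episode is successful and makes steady progress.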
"We extract supervision directly from unlabeled video trajectories... utilizing the inherent temporal monotonicity of these episodes, where the reward signal intensity is strictly determined by the temporal progress of the video" (Section III-A). Twenty-four data sources, zero human reward labels. The practical implication: if this assumption holds broadly, the cost of building a reward model collapses dramatically.
3. Companies Identified
Toyota Research Institute (TRI)
Vitor Guizilini is a co-author affiliated with TRI, and TRI is listed as a named financial supporter of the USC Physical Superintelligence Lab. TRI has been an active funder and collaborator on foundation model robotics research. Relevant because this paper represents a direct output of TRI-affiliated research investment.
Physical Intelligence (π)
Referenced via their π0.5 model, which the authors use as their base imitation learning policy. The paper initializes from "π0.5 SFT baseline" (Section IV-B) and explicitly cites π0.6* ("A VLA that learns from experience," reference [26]) as related work on multi-stage RL refinement of generalist policies. Physical Intelligence's models are the foundation being improved upon here, so this paper is directly relevant to their deployment roadmap.
Nvidia
Listed as a named financial supporter of the USC Physical Superintelligence Lab in the acknowledgments. No direct product or platform usage cited, but institutional alignment is relevant given Nvidia's Isaac Lab and broader Physical AI infrastructure investments.
Qualcomm
Also listed as a named financial supporter in the acknowledgments. Relevant as an indicator of industry interest in the research direction.
Google DeepMind
Listed as a financial supporter. DeepMind has parallel work on VLM-based reward modeling (referenced through Octo, Open X-Embodiment ecosystem). Relevant as both a funder and a competitive research context.
Capital One / Dolby
Listed as financial supporters (acknowledgments). Less operationally relevant to robotics, but a signal of broad institutional backing for the USC lab.
4. People Identified
Jiageng Mao, USC Physical Superintelligence Lab (Equal Advising)
Mao is a prolific researcher at USC with multiple papers cited in the references (robot learning from physical world models, humanoid pose learning from internet video, robot learning from any images). His work spans sim-to-real transfer, humanoid control, and foundation model integration for robotics. He is a central figure in the USC Physical Superintelligence Lab's output and worth tracking for anyone building in the physical AI space.
Yue Wang, USC Physical Superintelligence Lab (Equal Advising)
Wang co-leads the USC Physical Superintelligence Lab and is acknowledged for Powell Research Award support. Wang's broader research portfolio includes physical simulation, fluid dynamics from perception, and generalist robot policies. The lab has become a notable academic-to-industry pipeline given its Toyota and Nvidia affiliations.
Vitor Guizilini, Toyota Research Institute
Guizilini is the industry-side co-author representing TRI, indicating this work has direct relevance to Toyota's internal robotics and autonomous systems research. His involvement bridges the academic findings to applied deployment contexts.
Yanru Wu, USC Physical Superintelligence Lab (Lead Author)
Wu is the first author and likely the PhD student driving this specific research thread. Worth tracking as the primary technical contributor on LRM architecture and training methodology.
5. Operating Insights
The "Interval-Hold" Strategy Solves a Real Deployment Problem — and Should Be in Every RL Pipeline
One of the most deployable engineering ideas in the paper is the Interval-Hold strategy: rather than querying the VLM reward model every single timestep (which would be computationally prohibitive), the LRM is queried every K environment steps and that reward is cached and held until the next query.
The paper describes this as: "The LRMs are queried every K environment steps to perform the forward mapping... This reward is cached and held for K steps, providing a continuous and dense supervisory stream" (Section III-D), with K=10 used as the standardized evaluation setting (Section IV-B). For any team trying to integrate VLM-based feedback into real-time control loops, this pattern directly addresses the latency mismatch between slow VLM inference (~seconds) and fast control (~milliseconds). It's a straightforward design pattern worth adopting.
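The pattern is small enough to sketch directly. The wrapper below queries a reward function every `k` steps and returns the cached value in between. The class name, the `reward_fn` callable, and the choice to query on the first step of each episode are assumptions; only the query-every-K-and-hold behavior and K=10 come from the text above.

```python
from typing import Any, Callable


class IntervalHoldReward:
    """Query a slow reward model every k steps and hold its output in between."""

    def __init__(self, reward_fn: Callable[[Any], float], k: int = 10):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.reward_fn = reward_fn  # hypothetical stand-in for the VLM reward call
        self.k = k
        self._step = 0
        self._cached = 0.0

    def reset(self) -> None:
        """Call at the start of each episode so stale rewards are not carried over."""
        self._step = 0
        self._cached = 0.0

    def __call__(self, frame: Any) -> float:
        # Refresh the cache on steps 0, k, 2k, ...; otherwise reuse the held value.
        if self._step % self.k == 0:
            self._cached = self.reward_fn(frame)
        self._step += 1
        return self._cached
```

With `k=10`, a 25-step episode triggers three reward-model calls instead of 25, while the policy still receives a reward at every step.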
Use LRM Completion Rewards as an Autonomous Data Labeler Before Using Them for Online RL
The real-world experiment reveals a deployment path that's arguably more immediately practical than full online RL: use the Task Completion Reward as an automated filter to identify which rollouts succeeded, then fine-tune on those. No simulator required, no human watching videos.
The paper describes this as: "utilizing the Task Completion Reward as an autonomous sparse reward classifier to verify goal satisfaction at the terminal state... effectively bootstrapping the SFT policy to internalize the LRM's physical priors without manual reward engineering" (Section IV-C). The result is a success rate improvement from 38.3% to 51.7% on real hardware (Table VII). For teams doing imitation learning with teleoperation data who want to self-improve without expensive annotation, this is a near-term operational pattern — run 60 rollouts, let the LRM label the successful ones, fine-tune, repeat.
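The filtering step can be sketched as follows, assuming a completion classifier that judges only the terminal frame. The `Rollout` container and the `completion_fn` callable are hypothetical names; the fine-tuning step itself is left out because it depends on the policy stack in use.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Rollout:
    frames: list[Any]   # camera frames recorded during the episode
    actions: list[Any]  # actions the policy executed


def filter_successful(
    rollouts: list[Rollout],
    completion_fn: Callable[[Any], bool],
) -> list[Rollout]:
    """Keep only rollouts whose final frame the completion model judges successful."""
    return [r for r in rollouts if r.frames and completion_fn(r.frames[-1])]


def success_rate(rollouts: list[Rollout], completion_fn: Callable[[Any], bool]) -> float:
    """Fraction of rollouts labelled successful, useful for tracking each round."""
    if not rollouts:
        return 0.0
    return len(filter_successful(rollouts, completion_fn)) / len(rollouts)
```

The kept rollouts become the next fine-tuning set. Because the classifier is not perfect, some false positives will enter that set, so spot-checking a sample of accepted rollouts is a reasonable precaution.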
Build Multi-Source, Multi-Domain Reward Training Sets — Single-Source Specialization Will Fail at Generalization
The zero-shot generalization of this reward model depends on training data diversity. The authors deliberately sampled from 24 sources spanning real robots, human hands, and multiple simulators. A team that trains internal reward models only on its own proprietary robot data is likely to end up with models that overfit to its specific setup.
As the paper states: "To ensure the resulting LRMs generalize zero-shot to unseen environments, we curate a dataset encompassing a vast range of physical interactions and semantic logic" spanning real-robot corpora, human-object interactions, and simulated benchmarks (Section III-A). The specific inclusion of human dexterity video (EgoDex, HOI4D) as a high-quality signal for "what good looks like" is a design decision with direct implications for any reward modeling pipeline — don't train only on robot data.
6. Overlooked Insights
The Task Completion Model's Near-Zero Improvement After Fine-Tuning Is Actually a Strong Signal — and Changes How You Should Allocate Training Compute
The Task Completion Reward shows almost no gain from fine-tuning: 69.38% accuracy vs. 69.23% for the zero-shot baseline (Section IV-A). Most readers will gloss over this as a null result, but it is informative: for binary success/failure judgment, the base Qwen3-VL model already performs about as well as the fine-tuned one.
The paper acknowledges: "this outcome suggests that the foundation Qwen3-VL-8B-Instruct model already possesses a robust innate capability for semantic goal recognition in zero-shot settings" (Section IV-A). The implication for practitioners: if you're resource-constrained and need to pick which reward modality to fine-tune, skip the completion model and invest compute in the progress estimation model (which showed 20% MAE reduction and 8.6 percentage point accuracy gain at the ±0.2 tolerance level). The completion reward can likely be deployed out-of-the-box from any capable VLM — the progress estimation is where specialization earns its keep.
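The two progress metrics mentioned above are simple to reproduce on held-out data. The sketch below assumes progress values normalized to [0, 1] and treats "accuracy at the ±0.2 tolerance level" as the fraction of predictions within 0.2 of the target; that reading, and the function names, are assumptions.

```python
def mean_absolute_error(predicted: list[float], target: list[float]) -> float:
    """Average absolute difference between predicted and true progress."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def accuracy_within(predicted: list[float], target: list[float], tol: float = 0.2) -> float:
    """Fraction of predictions that land within `tol` of the true progress."""
    hits = sum(1 for p, t in zip(predicted, target) if abs(p - t) <= tol)
    return hits / len(predicted)
```

Running both on the base model and the fine-tuned model, over the same frames, gives a direct measure of whether fine-tuning the progress estimator is paying for itself.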
The Performance Gap to Ground-Truth Reward Reveals the Real Ceiling — and It's Not the Reward Model
The LRM-guided RL tops out at ~61% task success on ManiSkill3, while the privileged "Env Reward" (using simulator ground truth) reaches 66.87% (Table III). That ~6 point gap is the current cost of operating from vision alone without access to simulator state. But buried in this comparison is a more important data point: the RL training itself only runs for 30 iterations. The paper never evaluates what happens at 100, 500, or 1000 iterations.
The paper notes: "we conduct 30 RL iterations for each model to ensure a fair comparison" (Section IV-B), but provides no ablation on longer training. For investors and engineering teams, this is the key unknown: is the 6-point gap to privileged reward a fundamental ceiling for vision-only reward models, or does it close with more iterations? The paper does not answer this. Anyone evaluating the commercial readiness of this approach should treat the 30-iteration result as a snapshot rather than a ceiling, and the gap to Env Reward as the room still available for improved reward modeling.
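The size of that gap follows from three Table III figures quoted in this summary: the 56.88% imitation learning baseline, the 60.93% LRM result, and the 66.87% Env Reward ceiling. The short calculation below makes the arithmetic explicit; the variable names are illustrative.

```python
baseline = 56.88    # imitation learning baseline, % success (Table III, as quoted in this summary)
lrm = 60.93         # best LRM-guided RL result, % success
env_reward = 66.87  # privileged simulator reward, % success

gain = lrm - baseline              # points gained over the baseline
remaining_gap = env_reward - lrm   # points still missing versus privileged reward
headroom = env_reward - baseline   # total points available between baseline and ceiling
recovered = gain / headroom        # share of the headroom the LRM recovers

print(f"gain over baseline:   {gain:.2f} points")
print(f"gap to Env Reward:    {remaining_gap:.2f} points")
print(f"headroom recovered:   {recovered:.1%}")
```

Tracking the recovered share at several iteration counts, rather than at 30 alone, would show whether the gap is closing or has plateaued.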