
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

DATE: May 1, 2026
SOURCE: arXiv Physical AI Research
PARTICIPANTS: Yi Wang, Jianlan Luo, et al. (arXiv Physical AI)
ARXIV: 2605.00416
// KEY TAKEAWAYS (5 ITEMS)
  1. The "Deploy-to-Train" Data Flywheel Is Now Real Hardware, Not Just Theory
  2. Long-Horizon Manipulation Is the Real Proof of Concept
  3. Sparse Rewards + Heterogeneous Fleet Data Are Solved Together
  4. One Policy Governs Eight Tasks
  5. The Infrastructure Is Production-Grade, Not Lab-Grade
// SUMMARY

Summary for Investors and Operators

This paper from AGIBOT Finch and Shanghai Innovation Institute presents Learning while Deploying (LWD), a framework that turns a deployed robot fleet into a self-improving training system. Rather than treating deployment as the endpoint of training, LWD makes deployment the source of continuous improvement, a critical architectural shift for anyone building or funding robots at scale.


1. Key Themes

The "Deploy-to-Train" Data Flywheel Is Now Real Hardware, Not Just Theory

LWD closes the loop between deployment and training on 16 physical dual-arm robots across 8 real manipulation tasks. The system achieves an average 95% success rate — up from 76% for the supervised fine-tuning (SFT) baseline — with the policy improving during deployment rather than requiring a separate training cycle.

"A single generalist policy trained with LWD improves as online fleet experience accumulates. It substantially improves over the pretrained model, reaches an average success rate of 0.95 across all tasks." (Abstract, Section V-B)

The gains aren't marginal: long-horizon tasks (3–5 minute cocktail making, Gongfu tea, fruit juicing) went from 0.68 (SFT) to 0.91 average step-wise score with LWD Online. That's a 34% relative improvement on the hardest tasks.

Long-Horizon Manipulation Is the Real Proof of Concept

Most real-world deployments involve tasks that can't be solved in a single motion primitive — they require chaining 5–8 sub-steps over several minutes, recovering from failures, and maintaining state across the entire episode. LWD specifically addresses this, and the results diverge most sharply here from competing approaches.

"The performance gap is especially pronounced on long-horizon tasks, where RL can propagate rewards through multi-step dynamic programming and stitch together value estimates across partial progress, while imitation-learning methods suffer more severely from compounding errors." (Section I)

HG-DAgger — the human-intervention imitation learning baseline — actually degraded performance on some long-horizon tasks compared to the SFT starting point, while LWD delivered the largest absolute gains precisely here. This is operationally significant: the tasks that are hardest to collect good demonstrations for are the ones that benefit most from RL.

Sparse Rewards + Heterogeneous Fleet Data Are Solved Together

A core challenge for any real-world RL system is that you can't densely instrument every robot interaction — you only know if the episode succeeded or failed. LWD's DIVL (Distributional Implicit Value Learning) learns a distribution over return values rather than a single scalar, which preserves rare high-return modes that a simple average would wash out.

"By compressing heterogeneous outcomes into a single expected value, a scalar value function blurs rare but reproducible high-return behaviors. A distributional value instead retains the return distribution, preserving these high-return modes and providing a more informative signal for policy improvement." (Section V-C-1)

The ablation is compelling: swapping DIVL for standard scalar expectile regression drops the long-horizon average score from 0.91 to 0.78 in the online setting (Table II). That is a relative drop of about 14%; put the other way, DIVL scores 16.7% higher than the scalar variant.
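
To make the "blurring" argument concrete, here is a minimal sketch, not the paper's DIVL implementation. It compares two toy return distributions that share the same mean: one with a steady low return, and one with mostly failures plus a rare high-return mode. The return atoms, the probabilities, and the helper names `mean` and `expectile` are all illustrative assumptions. A scalar expected value cannot tell the two apart, while an upper expectile computed from the full distribution can.

```python
def mean(atoms, probs):
    """Expected return of a discrete return distribution."""
    return sum(z * p for z, p in zip(atoms, probs))


def expectile(atoms, probs, tau, iters=60):
    """tau-expectile of a discrete distribution, found by bisection.

    The expectile e balances tau * E[(Z - e)+] against (1 - tau) * E[(e - Z)+].
    """
    lo, hi = min(atoms), max(atoms)
    for _ in range(iters):
        mid = (lo + hi) / 2
        above = sum(p * max(z - mid, 0.0) for z, p in zip(atoms, probs))
        below = sum(p * max(mid - z, 0.0) for z, p in zip(atoms, probs))
        if tau * above > (1 - tau) * below:
            lo = mid  # too much weighted mass above: move the estimate up
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical return atoms and two distributions with the same mean (0.1).
atoms = [0.0, 0.1, 1.0]
steady = [0.0, 1.0, 0.0]    # always a small return
bimodal = [0.9, 0.0, 0.1]   # mostly failure, with a rare high-return mode

print("means:", mean(atoms, steady), mean(atoms, bimodal))
print("0.9-expectiles:", expectile(atoms, steady, 0.9), expectile(atoms, bimodal, 0.9))
```

With these toy numbers the two means coincide, but the 0.9-expectile of the bimodal distribution is much higher, which is the kind of signal a single expected value discards.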

One Policy Governs Eight Tasks — Generalism Is Preserved Through RL

A legitimate fear in post-training is that RL will specialize the policy and destroy its generalization. LWD trains a single shared policy across all eight tasks simultaneously and shows no regression on the simpler grocery restocking tasks (already near-saturated at roughly 95% or higher for most methods) while dramatically improving long-horizon performance.

"LWD provides benefits beyond long-horizon tasks while preserving the generalist behavior of the shared policy during online learning." (Section V-B)

This matters enormously for operators who need one deployable artifact, not eight task-specific models.

The Infrastructure Is Production-Grade, Not Lab-Grade

The paper includes a full description of the distributed actor-learner system (Appendix D), including versioned snapshot data planes, at-least-once episode delivery guarantees, SPMD multi-host JAX training, and asynchronous policy synchronization back to robots.

"The Coordinator is the only orchestration singleton; both the actor fleet and the learner scale independently." (Appendix D)

This isn't a research demo — it's an architecture that can scale to larger fleets. The system pushes updated policies to robots at episode boundaries, and the learner syncs every 50 training steps.
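
The synchronization cadence described above can be sketched in a few lines. This is a toy simulation under stated assumptions, not the system from Appendix D: `PolicyStore`, `Learner`, and `Actor` are hypothetical names, and the only figures taken from the summary are the 50-step publish interval and the rule that robots pick up new weights at episode boundaries.

```python
from dataclasses import dataclass

SYNC_EVERY = 50  # learner steps between published snapshots (figure from the summary)


@dataclass
class PolicyStore:
    """Hypothetical stand-in for whatever channel carries weights to the robots."""
    version: int = 0  # training step at which the latest snapshot was published


class Learner:
    def __init__(self, store: PolicyStore) -> None:
        self.store = store
        self.step = 0

    def train_step(self) -> None:
        self.step += 1  # a real learner would run a gradient update here
        if self.step % SYNC_EVERY == 0:
            self.store.version = self.step  # publish a new snapshot


class Actor:
    def __init__(self, store: PolicyStore) -> None:
        self.store = store
        self.policy_version = store.version

    def on_episode_end(self) -> None:
        # Weights change only between episodes, never mid-episode.
        self.policy_version = self.store.version


store = PolicyStore()
learner, actor = Learner(store), Actor(store)

for _ in range(120):  # the learner trains while the robot is mid-episode
    learner.train_step()
version_during_episode = actor.policy_version  # still the old snapshot

actor.on_episode_end()  # episode boundary: pick up the newest snapshot
version_after_episode = actor.policy_version

print(version_during_episode, version_after_episode)
```

Decoupling the two clocks this way means a slow episode never blocks training, and a policy never changes underneath a robot in the middle of a task.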


2. Contrarian Perspectives

Human Demonstration Data Is Overrated — Failure Data May Be More Valuable

Most robotics companies treat failed trajectories as noise to be discarded and focus data collection efforts on high-quality expert demonstrations. LWD uses all of it — and the data composition reveals something striking: roughly one-third of the 652.5-hour offline buffer is failure data.

"Roughly one-third of the buffer is failure data, which the behavior-cloning baselines cannot use but which provides an informative learning signal for LWD." (Figure 7 caption, Appendix B-1)

The 187.9 hours of "play data" — human-guided exploration of failure modes — is treated as unsuccessful exploratory data and is still used by LWD to learn what not to do. Behavior cloning baselines simply throw this away. For operators, this reframes data collection strategy entirely: capturing failures systematically is as important as capturing successes.

Human-in-the-Loop Correction (DAgger) Can Actually Hurt Performance

The conventional wisdom in deployment-time learning is that human corrections are free signal — more intervention means better policy improvement. LWD's results challenge this directly. HG-DAgger underperformed SFT on long-horizon tasks in some cases.

"HG-DAgger yields only limited gains over the reference policy on long-horizon tasks and can even degrade performance on some tasks. A likely reason is that DAgger-style training relies on human correction data, whose variability can introduce inconsistencies and provide limited exploration of the broader state space." (Section V-B)

This has real operational implications: investing in human teleoperation infrastructure for correction-based learning may not be the right lever for complex, multi-step tasks. Autonomous rollout collection with sparse reward signals — even messy, failure-laden ones — may outperform expensive human-supervised correction pipelines.

Offline RL Pre-Training Alone Is Not Enough — But It Changes What Online Data You Need

A common assumption is that online RL from scratch is intractable for real robots due to sample complexity. The paper shows offline RL pre-training (LWD Offline) already beats all baselines at 0.88 average score, but more importantly, it conditions the online phase to require only ~4 hours of wall-clock time (~60 total robot-hours across the fleet) to reach 0.95.

"This LWD procedure typically requires only a few hours of real-world interaction." (Section I)

The insight is not "offline RL is sufficient" but rather "offline RL changes the efficiency equation for online RL." Without the offline initialization, online RL at this scale would be prohibitively expensive. The offline-to-online pipeline isn't sequential improvement — it's a prerequisite for the online phase being economically viable.


3. Companies Identified

AGIBOT (Zhiyuan Robot)

  • Description: Chinese humanoid and manipulation robotics company; the paper is from their "AGIBOT Finch" research division
  • Why relevant: This is their production research — the Agibot G1 dual-arm platform is the test hardware. The system described is likely heading toward their commercial deployment stack
  • Quote: "All experiments are conducted on the Agibot G1 dual-arm manipulation platform. Each G1 robot has two 7-DoF arms with parallel-jaw grippers and three RGB cameras." (Section V-A-1)

Physical Intelligence (π) / Black et al.

  • Description: Robotics foundation model company behind π0 and π0.5 VLA models
  • Why relevant: LWD is explicitly built on top of the π0.5 architecture and uses it as the base policy for all experiments. π0.5's commercial trajectory is directly relevant to this paper's deployability
  • Quote: "The actor follows the π0.5 flow-based VLA architecture. It consists of a PaliGemma vision-language backbone, instantiated with a Gemma-2B language model and a SigLIP vision encoder, together with a Gemma-300M action expert for flow-based action generation." (Section IV-D)
  • Quote: "We obtain the reference policy by supervised fine-tuning the pretrained π0.5 VLA policy on 336.6 hours of demonstration data." (Appendix C-1)

Google DeepMind

  • Description: AI research lab behind Gemma 3 and SigLIP models used in the value/critic network architecture
  • Why relevant: The critic and value networks use Gemma 3-270M-IT and SigLIP-So400M as their VLM backbone — DeepMind's open-weight models are foundational infrastructure for this system
  • Quote: "We implement Vψ and Qϕ with a shared Gemma3–SigLIP VLM backbone... The Gemma 3 language module and SigLIP vision encoder are initialized from publicly released Gemma 3-270M-IT and SigLIP-So400M checkpoints." (Section IV-D)

Google / QT-Opt Team (Kalashnikov et al.)

  • Description: Google Robotics team that pioneered fleet-scale off-policy RL for grasping (QT-Opt, MT-Opt)
  • Why relevant: LWD is explicitly positioned as extending this lineage to generalist VLA policies and long-horizon tasks
  • Quote: "Kalashnikov et al. demonstrate that off-policy RL can be scaled from vision-based grasping to multi-task manipulation through asynchronous robot data collection and centralized Q-function optimization. While these systems focus primarily on short-horizon manipulation and learn policies largely from scratch, LWD post-trains a pretrained generalist VLA policy across diverse real-world tasks." (Section II-C)

4. People Identified

Jianlan Luo

  • Lab/Institution: Shanghai Innovation Institute / AGIBOT Finch (corresponding author); previously UC Berkeley
  • Why notable: Lead on SERL (sample-efficient real-world RL) and "Precise and Dexterous Robotic Manipulation via Human-in-the-Loop RL" (published in Science Robotics). One of the most credible researchers on real-world robot RL at scale. His presence as corresponding author signals this paper is a serious engineering contribution, not a benchmark paper
  • Quote: "Luo et al. utilize a small number of human demonstrations to seed policy learning and then specialized a single robotic skill through real-world interaction." (Section II-B, self-referencing prior work)

Qiyang Li

  • Lab/Institution: UC Berkeley (acknowledgments indicate collaboration)
  • Why notable: Lead author on QAM (Q-Learning with Adjoint Matching), which is one of the two core algorithmic components of LWD. His work on flow-based policy optimization with RL is directly embedded in this system
  • Quote: "We thank Qiyang Li for helpful discussions." (Section VII); "Li and Levine introduce QAM, using critic gradients to improve flow-based policies through adjoint matching, achieving stable training from scratch in simulation." (Section II-B)

Sergey Levine

  • Lab/Institution: UC Berkeley
  • Why notable: Co-author on QAM and multiple foundational papers cited here (IQL, SERL, RLDG, AWAC). The theoretical lineage of LWD — IQL → DIVL, adjoint matching → QAM — runs directly through his lab. His group's work is the algorithmic substrate of this entire system
  • Quote: Multiple citations throughout; co-author on QAM [31], IQL [22], SERL [40], RLDG [56]

Yi Wang and Xinchen Li

  • Lab/Institution: AGIBOT Finch / Shanghai Innovation Institute (lead authors)
  • Why notable: First and second authors driving the system engineering. The depth of the infrastructure appendix (distributed replay, versioned snapshots, SPMD JAX training) indicates serious systems engineering capacity within AGIBOT's research arm
  • Quote: First authors, full paper

5. Operating Insights

Deploy Your Worst Robots First — Their Failures Are Your Most Valuable Training Data

The LWD framework inverts the typical deployment logic. Rather than deploying only when performance is good enough, the system needs imperfect rollouts to learn from. The offline buffer composition (34.8% failure data) and the online buffer (which explicitly captures both failed autonomous rollouts and human interventions) show that failure-rich data is a feature, not a bug.

"Interaction data collected from the robot fleet is aggregated into a shared learning process, enabling a generalist policy to continue improving across tasks... deployed robots generate experience on the target deployment distribution, the shared policy improves from the aggregated data, and the improved policy is redeployed to collect broader and more informative experience." (Section I)

For CTOs designing deployment pipelines: build telemetry and episode storage into your robot infrastructure from day one, even before you have a training loop to use it. Every failed task your fleet encounters in the field is training data you're either capturing or losing forever. The 652.5 hours of offline data in this paper were accumulated before online training began — that dataset took significant time to build, and the online gains were only possible because it existed.

Cycle Time Is a Signal, Not Just a Metric

LWD doesn't just improve success rates — it reduces cycle time by 23.75 seconds on long-horizon tasks. This is a direct consequence of the value function learning to prefer action sequences that make reliable progress rather than hesitating or retrying unnecessarily.

"LWD reduces mean cycle time by 23.75 seconds compared with reference policy. This efficiency gain is consistent with the critic-guided policy update. The learned value function favors action chunks that make reliable task progress. As a result, the policy reduces hesitations, retries, and unstable intermediate behaviors." (Section V-B)

For operators deploying in commercial settings (grocery stocking, food service, logistics), throughput is often more commercially important than success rate alone. A policy that succeeds 95% of the time and is also 10% faster than one that succeeds 93% of the time can have meaningfully better unit economics, because both gains compound into more completed tasks per robot-hour. RL-based post-training captures this efficiency dimension that pure imitation learning misses.
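
A rough back-of-the-envelope version of that throughput argument is sketched below. The 240-second baseline cycle time is a hypothetical figure chosen for illustration (the summary does not give absolute cycle times), and "10% faster" is interpreted here as a 10% shorter cycle time; `successes_per_hour` is an illustrative helper, not a metric from the paper.

```python
def successes_per_hour(success_rate: float, cycle_time_s: float) -> float:
    """Completed tasks per robot-hour, assuming back-to-back attempts."""
    attempts_per_hour = 3600.0 / cycle_time_s
    return attempts_per_hour * success_rate


BASELINE_CYCLE_S = 240.0  # hypothetical cycle time for a multi-minute task

baseline = successes_per_hour(0.93, BASELINE_CYCLE_S)
faster = successes_per_hour(0.95, BASELINE_CYCLE_S * 0.9)  # 10% shorter cycle
uplift = faster / baseline - 1.0

print(f"baseline: {baseline:.2f}/h, faster policy: {faster:.2f}/h, uplift: {uplift:.1%}")
```

Under these assumptions the two-point success gain and the shorter cycle multiply rather than add, so the throughput uplift is larger than either number alone suggests.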

Freeze the VLM Backbone During Online Updates — Only Tune the Action Head

A practical engineering decision buried in the architecture section has outsized implications for deployment stability. During online training, only the action expert (Gemma-300M) is updated; the PaliGemma vision-language backbone is frozen. This makes online policy updates computationally tractable and prevents catastrophic forgetting of visual-semantic representations.

"During online QAM updates, the policy VLM backbone is frozen and only the action expert is updated, while the value and critic networks continue to be fully fine-tuned on mixed replay. This design keeps online policy updates efficient and preserves the pretrained vision-language representations." (Section IV-D)

For engineering teams building continuous learning pipelines: this is a practical recipe for keeping online RL stable in production. Full fine-tuning of a multi-billion parameter VLM during live deployment is neither computationally feasible nor safe. Separating the "reasoning" backbone (frozen) from the "action generation" head (trainable) is the architectural pattern to adopt.
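
The frozen-backbone pattern amounts to masking the optimizer update by parameter group. The sketch below shows the idea in plain Python with a toy SGD step. The parameter names, the `action_expert/` prefix, and the gradient values are hypothetical, and a real JAX training loop would express the same mask through its optimizer library rather than a hand-written loop.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def is_trainable(name: str) -> bool:
    """Only parameters under the (hypothetical) action-expert prefix are updated."""
    return name.startswith("action_expert/")


def masked_sgd_step(params: Params, grads: Params, lr: float) -> Params:
    """Apply plain SGD to trainable parameters and copy frozen ones unchanged."""
    updated: Params = {}
    for name, weights in params.items():
        if is_trainable(name):
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)  # frozen: the gradient is ignored
    return updated


# Toy parameter tree: two backbone groups (frozen) and one action head (trainable).
params = {
    "backbone/vision": [0.2, -0.4],
    "backbone/language": [1.0],
    "action_expert/flow_head": [0.5, 0.5],
}
grads = {name: [1.0] * len(weights) for name, weights in params.items()}

new_params = masked_sgd_step(params, grads, lr=0.1)
print(new_params)
```

Because the frozen groups never receive an update, the optimizer state and the weights pushed to robots only need to cover the action head, which keeps each online update small.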


6. Overlooked Insights

The Adaptive τ Strategy Is a Free Calibration Layer That Nobody Else Has

The paper's ablation on adaptive τ (Table III) is reported as a modest improvement (0.84 → 0.88 average offline score), which undersells what's actually happening. The system uses the entropy of the learned value distribution as a real-time signal for how confident the critic is about a given state, then automatically adjusts how optimistic the TD bootstrap target should be.

"Diffuse distributions receive lower τ values to reduce overestimation, while concentrated distributions retain more optimistic targets... conditioning τ on distributional entropy helps calibrate bootstrap optimism, making targets more conservative under high uncertainty and more optimistic when the value estimate is confident." (Sections IV-A and V-C-2)

This is a self-calibrating mechanism that addresses one of the core failure modes of offline RL applied to heterogeneous data: the critic becoming overconfident in regions of the state space where it has seen little data. For operators running fleets across diverse environments (different store layouts, different object instances, different users), this automatic conservatism in novel states is practically important — it means the RL signal degrades gracefully rather than catastrophically when the robot encounters something truly out of distribution.
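
One way to picture the entropy-conditioned τ is the sketch below. It assumes a categorical value distribution and a simple linear mapping from normalized entropy to τ; the paper's actual mapping and its τ bounds are not given in this summary, so `adaptive_tau`, `tau_min=0.5`, and `tau_max=0.9` are illustrative assumptions only.

```python
import math
from typing import Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a categorical distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def adaptive_tau(probs: Sequence[float], tau_min: float = 0.5, tau_max: float = 0.9) -> float:
    """Assumed linear rule: diffuse distributions get a lower, more conservative tau."""
    return tau_max - (tau_max - tau_min) * normalized_entropy(probs)


confident = [0.0, 0.0, 1.0, 0.0]      # all mass on one return bin
diffuse = [0.25, 0.25, 0.25, 0.25]    # critic has no idea which bin applies

print("confident tau:", adaptive_tau(confident))
print("diffuse tau:", adaptive_tau(diffuse))
```

Under this rule a fully concentrated distribution keeps the most optimistic target, and a uniform one falls back to the most conservative setting, which matches the qualitative behavior the quote describes.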

The Offline Data Composition Reveals Roughly an 11:1 Ratio of Offline Data Hours to Online Data Hours

The paper reports 652.5 hours of offline data and approximately 60 robot-hours of online data (4-hour wall clock × 16 robots, minus overhead). But the breakdown of that offline data shows something important: long-horizon tasks required 287.5 hours of demonstrations alone (vs. 49.2 hours for grocery tasks), simply because each episode is 3–5 minutes long.

"Long-horizon episodes dominate the buffer by volume due to their substantially longer duration." (Figure 7 caption)
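
The ratios implied by these figures can be checked directly. The short calculation below uses only numbers quoted in this summary; the variable names are illustrative, and the 60 robot-hour online figure is the summary's approximation after overhead.

```python
offline_hours = 652.5            # total offline buffer
long_horizon_demo_hours = 287.5  # demonstrations for long-horizon tasks
grocery_demo_hours = 49.2        # demonstrations for grocery tasks
online_robot_hours = 60.0        # approximate online data, after overhead
wall_clock_hours, fleet_size = 4, 16

demo_hours = long_horizon_demo_hours + grocery_demo_hours
nominal_online_hours = wall_clock_hours * fleet_size  # before overhead

offline_to_online = offline_hours / online_robot_hours
demo_to_online = demo_hours / online_robot_hours
long_horizon_share = long_horizon_demo_hours / demo_hours

print(f"nominal online robot-hours: {nominal_online_hours}")
print(f"offline : online = {offline_to_online:.1f} : 1")
print(f"demonstrations : online = {demo_to_online:.1f} : 1")
print(f"long-horizon share of demonstrations = {long_horizon_share:.0%}")
```

On these numbers the offline buffer is about 10.9 times the online data, demonstrations alone are about 5.6 times, and long-horizon tasks account for roughly 85% of demonstration hours.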

The implication for anyone planning to replicate this framework: the data collection cost is asymmetric and front-loaded on demonstration quality for long-horizon tasks. The 102.3 hours of Gongfu Tea demonstrations and 100.5 hours of Fruit Juicing demonstrations represent a substantial human teleoperation investment before any RL benefit is realized. The paper's "few hours of real-world interaction" claim refers only to the online phase; the offline foundation required hundreds of hours of prior data collection (652.5 hours in total). Investors evaluating companies claiming LWD-style capabilities should probe how much offline demonstration data has actually been collected and whether the data pipeline infrastructure to capture rollouts and failures systematically exists.