Manchi | 晚点聊 LateTalk Summary

Episode 165 | LateTalk Podcast
Participants: Manchi (Host), Gao Shenyuan (Researcher, HKUST PhD / NVIDIA GEAR Lab)

1. Key Themes

The Self-Evolving Loop: Policy + World Model + Agent

The central thesis of the episode is that three components — a general-purpose agent (like Gemini), a world model (like DreamDojo/Genie), and a policy model (like DreamZero/SIMA) — are close to forming a closed self-reinforcing loop that could unlock autonomous improvement in physical AI. The connection mechanism is: policy outputs actions → world model predicts future states → agent evaluates and scores those states → feedback improves policy.

"Now in this loop there are three parts: a general Agent Policy, then the World Model... everyone is pushing toward generalization. So I think at some future point — I think it might happen this year — once this loop connects and the error accumulation reaches an acceptable level, the whole loop will become simpler and simpler, like achieving self-evolution." — Gao Shenyuan [00:00:04.660]

"Once the policy improves, it can automatically collect data in new environments. This data feeds the world model, which improves its physical simulation and action control. These two points directly determine whether it can predict a good future and provide precise feedback." — Gao Shenyuan [00:37:35.860]
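The loop described above (policy proposes actions, world model predicts the outcome, agent scores it, feedback updates the policy) can be illustrated with a deliberately tiny sketch. Everything here is a hypothetical stand-in: `ToyPolicy`, `world_model`, `agent_score`, and `self_improve` are invented names, the dynamics are one-dimensional, and the "feedback" step is a simple random search. It is not how DreamDojo, DreamZero, Genie, or SIMA work; it only shows how the three roles connect.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyPolicy:
    """Hypothetical policy with a single tunable parameter."""
    gain: float = 0.1

    def act(self, state: float) -> float:
        # Push the state toward the goal at 0.
        return -self.gain * state


def world_model(state: float, action: float) -> float:
    """Toy stand-in for a learned simulator: predicts the next state."""
    return state + action


def agent_score(predicted_state: float) -> float:
    """Toy stand-in for the evaluating agent: closer to 0 scores higher."""
    return -abs(predicted_state)


def rollout_score(policy: ToyPolicy, state: float) -> float:
    # policy -> world model -> agent, in that order
    return agent_score(world_model(state, policy.act(state)))


def self_improve(policy, rng, state=1.0, rounds=20, candidates=8, step=0.05):
    """Feedback step (an assumption): keep whichever perturbed policy the agent scores best."""
    for _ in range(rounds):
        best = policy
        for _ in range(candidates):
            trial = ToyPolicy(gain=policy.gain + rng.uniform(-step, step))
            if rollout_score(trial, state) > rollout_score(best, state):
                best = trial
        policy = best
    return policy


start = ToyPolicy()
improved = self_improve(start, random.Random(0))
print(rollout_score(start, 1.0), rollout_score(improved, 1.0))
```

No real environment is touched during improvement: every candidate is judged on the world model's prediction alone, which is why Gao ties the loop's viability to error accumulation in that prediction.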

Video as the Privileged Representation Space for World Models

Gao argues that video-based world models are the most scalable and promising approach. Video is, alongside language, one of the two most data-rich modalities, so a world model that predicts in video space can inherit generalization from existing foundation models. Latent or abstract spaces (such as Yann LeCun's JEPA) are cut off from that ecosystem.

"The path to AGI and physical intelligence must start from the most data-rich domains and align toward data-scarce ones. The only two most data-rich spaces currently are language and video. Robot action data, by comparison, is a relatively scarce domain." — Gao Shenyuan [00:16:24.520]

"If you construct a new latent space, it's very hard to directly leverage the strong generalization capabilities of current language and video foundation models." — Gao Shenyuan [00:16:53.520]

Data Asymmetry: World Models Need "Bad" Data, Policies Need "Good" Data

A non-obvious structural insight: policies are trained on expert trajectories (only successful actions), but world models need diverse — including failed — action data to serve as unbiased simulators. The current robotics data ecosystem is misaligned for building good world models because it was created for policies.

"World models need to fairly simulate all actions. The problem is that for the past two years, everyone was building policy models, so all accumulated data is expert data... This is a problem for world models, which are simulators — they shouldn't have a preference for actions. If given a shaky action, the world model should output a shaky result." — Gao Shenyuan [00:45:43.060]

"The world model can ingest any data. Even without any distribution processing of the data, using it to train a world model is reasonable." — Gao Shenyuan [01:16:55.060]
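As a minimal sketch of this asymmetry, assume each logged trajectory carries a success flag and an expert flag. The `Trajectory` fields and both selection functions are hypothetical; the episode does not describe a concrete data schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record for one logged episode."""
    episode_id: str
    success: bool    # did the task succeed?
    is_expert: bool  # was it a clean expert demonstration?


def policy_dataset(trajectories: List[Trajectory]) -> List[Trajectory]:
    # Imitation-style policy training keeps only successful expert data.
    return [t for t in trajectories if t.success and t.is_expert]


def world_model_dataset(trajectories: List[Trajectory]) -> List[Trajectory]:
    # A simulator should have no action preference, so nothing is filtered out.
    return list(trajectories)


logs = [
    Trajectory("ep-001", success=True, is_expert=True),
    Trajectory("ep-002", success=False, is_expert=True),   # failed grasp
    Trajectory("ep-003", success=False, is_expert=False),  # shaky teleoperation
]
print(len(policy_dataset(logs)), len(world_model_dataset(logs)))
```

The episodes dropped by the policy filter are the ones that teach a world model what a shaky or failed action looks like, which is the gap Gao describes in today's expert-only datasets.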

2. Contrarian Perspectives

The "Just Scale Policy" Camp Is Wrong — World Models Enable Compounding Advantages

The prevailing view in some robotics circles is that world models are unnecessary complexity — just keep improving the policy. Gao pushes back, arguing that world models enable evaluation, data generation, safety, and crucially, a self-reinforcing improvement loop that policies alone cannot create.

"There are two camps: one thinks world models are completely unnecessary — just keep building policy. But I believe with a world model you can do many things. The most exciting long-term prospect is building a self-evolving loop. Without a world model, robots can't iterate the way language agents or AlphaGo can." — Gao Shenyuan [01:35:07.800]

Latent Action Representations Are a Transitional Technology — Not a Permanent Solution

Gao is himself one of the pioneers of latent action for world model training, yet he openly questions whether it will remain necessary as labeled data scales up and robot embodiments converge toward humanoid form.

"Latent action — it's the most brain-dead way to apply action labels to all unlabeled video. But if all data ends up having high-quality labels, and the embodiment gap isn't a major pain point, then latent action may not be necessary." — Gao Shenyuan [01:23:38.520]

Embodiment Convergence Will Simplify the Cross-Body Problem

Most assume cross-embodiment transfer is a persistent, hard research problem. Gao suggests it may be self-resolving as robots increasingly resemble humans in structure and dynamics — at that point, one humanoid policy is sufficient.

"As robots become more and more like humans — both in appearance and dynamics — you actually just need one human-like policy. So whether you need latent actions to represent the action space, or just use the human hand's representation directly, are both worth exploring." — Gao Shenyuan [01:23:11.280]

Evaluation Metrics for World Models Are Fundamentally Broken — and This Harms the Field

LLMs and video generation models have standardized benchmarks because their input and output spaces are unified. World models for robotics do not: each team uses different robot embodiments with different action spaces, so one lab's model cannot be evaluated zero-shot on another lab's setup. As a result, no fair head-to-head comparison exists today.

"Every paper builds its own benchmark and only compares with a handful of models... The problem is that world models cannot zero-shot across embodiments. To benchmark someone else's world model, you have to retrain it on your robot's action space — that cost is high. So you can only compare limited models." — Gao Shenyuan [00:52:01.440]

The Path to Physical AGI May Run Through Virtual Worlds First, Not Robots

Google DeepMind's approach — validating everything in game environments before robots — is more principled than it appears. Game data is not subject to physical time constraints, is infinitely generatable, and encodes rich 3D/decision knowledge from human creators.

"DeepMind's style is that they like to start from games. Game data is not bound by physical time for collection. And verification is more convenient. Genie 3, for example, uses keyboard control — theoretically the same pipeline can be applied to robots." — Gao Shenyuan [00:41:53.300]

3. Companies Identified

NVIDIA GEAR Lab

NVIDIA's embodied intelligence research lab. Published DreamDojo (an action-conditioned world model with real-time capability, trained on 44,000+ hours of egocentric human video) and DreamZero (a video-backbone-based policy model positioned as a next-generation alternative to VLA models). DreamDojo builds on Cosmos, NVIDIA's video foundation model.

"DreamDojo is a relatively universal pretrained world model. We open-source it so that anyone with a new robot can quickly connect to our world model, fine-tune it, and use it." — Gao Shenyuan [01:12:10.860]

Google DeepMind

Cited as the most admired lab for its systematic approach: aligning everything (agents, VLAs, world models) to Gemini as the foundation, using Veo (their best video model) as the world model backbone, and validating the self-evolution loop in game environments first (Genie 3 + SIMA).

"The most impressive and the one I most want to follow is Google DeepMind. They very typically push everything to align with foundation models. Your agent aligns with Gemini, your VLA aligns with Gemini, your world model starts from Veo... always aligning action/decision data toward the most data-rich modalities." — Gao Shenyuan [00:57:50.260]

General Intuition

A company focused on game-based world models, joined by Anthony Hu, a prominent researcher formerly of Wayve. Their thesis: the game data accumulated during the pandemic, combined with the 3D knowledge humans encode when building games, is a powerful training substrate for decision-making agents.

"There's a company called General Intuition... their story is that during the pandemic, people played lots of games and accumulated enormous game data. Games can break through physical time limits for data collection, and humans embedded a lot of 3D knowledge when creating these games." — Gao Shenyuan [00:56:22.940]

Wayve

A UK autonomous driving company that produced a series of world models called GAIA. Mentioned as an early and serious practitioner of video-based world models for autonomous systems.

"There's a UK autonomous driving company called Wayve that made a series called GAIA world models, and there's a big name there — Anthony Hu — who joined a large lab called General Intuition." — Gao Shenyuan [00:56:22.940]

World Labs (Li Fei-Fei's company)

Focuses on explicit 3D representations for world models. This has advantages in gaming and autonomous driving (consistent spatial coordinates, no forgetting) but disadvantages for robotics, owing to a multi-stage rendering pipeline and data annotation requirements.

"Professor Li Fei-Fei's World Labs — they may be more focused on games. Using explicit 3D representations for games has advantages, including possibly for autonomous driving. But for robots, video is probably better." — Gao Shenyuan [00:55:54.320]

Meta AI (Yann LeCun's group)

Pursuing a latent/abstract space approach (JEPA/AMI) for world models. Praised for efficiency but criticized for being isolated from the current LLM/video model ecosystem, making it harder to leverage existing foundation model capabilities.

"LeCun's approach constructs a new latent space... the problem is: after predicting a latent state, if you show it to a language model, the language model can't read it. Show it to a video model, the video model also can't read it." — Gao Shenyuan [00:18:19.940]

4. People Identified

Gao Shenyuan (高深远)

Final-year PhD student at HKUST, joining NVIDIA GEAR Lab full-time. Co-first author of DreamDojo and DreamZero. Research arc: multi-agent perception → autonomous driving world models (GenAD, Vista) → game world models (AdaWorld) → robotics world models. One of the early advocates of latent action representations and video-based world models.

"My main research interest is constructing various world models — starting from autonomous driving world models, then game world models, and since last year focusing more on robotic world models and their applications." — Gao Shenyuan [00:02:01.500]

Jim Fan (范麟熙) & Yuke Zhu (朱玉可)

NVIDIA GEAR Lab researchers who initiated the DreamDojo/DreamZero direction. Gao described their research style and taste as highly aligned with his own.

"Jim Fan and Yuke Zhu's research taste and style matched mine quite well. At the time I also really wanted to collaborate with them." — Gao Shenyuan [01:07:22.220]

Demis Hassabis

Google DeepMind CEO. Cited approvingly for his vision of using world models (Genie series) combined with general agents (SIMA) to form a self-evolving loop, and specifically for his thesis that world models could dramatically accelerate scientific discovery (e.g., nuclear fusion research).

"Hassabis — I really believe in his framework. His thinking aligns closely with mine: a world model in video space, a general agent called SIMA, and together they form a self-evolving loop." — Gao Shenyuan [00:34:18.200]

Anthony Hu

Researcher formerly at Wayve (GAIA world model series) who later joined General Intuition. Mentioned as an influential figure in video-based world models for decision-making.

"There's a big name there, Anthony Hu, who joined a large lab called General Intuition." — Gao Shenyuan [00:56:22.940]

5. Operating Insights

Start the Self-Evolution Loop on Simple Tasks First, Then Expand

Rather than waiting for all three components (world model, policy, agent) to be perfect, the practical path is to get the loop working on single simple tasks first. The loop itself creates compounding improvement — imperfect components can still produce meaningful signal.

"Actually, I already have confidence that on simple tasks, this loop can be connected. Once connected on a simple task, it's solved directly. Then as the policy gets stronger, it can automatically collect data across more tasks and scenarios." — Gao Shenyuan [00:39:57.820]

Data Collection Should Shift from "Deliberate Collection" to "Collection During Work"

The robotics data paradigm is shifting: instead of dedicating sessions to data collection (set up a table, collect, reset, repeat), future collection happens organically during actual work tasks. Workers wearing portable sensors generate labeled data as a byproduct, dramatically scaling data accumulation.

"Previously, the data collection process meant the person wasn't working — collecting data WAS the work. But in the future, data collection happens during work. As long as you wear some portable peripherals, it doesn't affect the original work and comes with labels automatically. Data accumulation will be very fast." — Gao Shenyuan [01:26:31.580]

Use World Models to Democratize Robotics Evaluation — Replace Physical Benchmarks

Physical robot evaluation is both inefficient (requires human presence, physical resets) and unfair (lighting changes, positioning variance, sensor drift). World models as simulators solve both problems: scenes can be perfectly reset digitally, and evaluation runs asynchronously at compute speed.

"With a world model, you can very easily reset a scene to an identical state — just store the state on the computer. The comparison becomes completely fair. It's trading compute for efficiency and fairness." — Gao Shenyuan [00:30:24.560]
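The reset argument can be made concrete with a small sketch. It assumes the world model is a deterministic function of state and action and that the scene can be serialized as a plain dictionary. `toy_world_model`, `evaluate`, and both policies are hypothetical names chosen for illustration, not part of any released evaluation tool.

```python
import copy
from typing import Callable, Dict

State = Dict[str, float]


def toy_world_model(state: State, action: float) -> State:
    """Toy stand-in for a learned simulator: moves the gripper by the action."""
    return {"gripper_x": state["gripper_x"] + action, "target_x": state["target_x"]}


def evaluate(policy: Callable[[State], float], saved_state: State, horizon: int = 5) -> float:
    """Roll a policy out from a stored scene; higher (closer to 0) is better."""
    state = copy.deepcopy(saved_state)  # identical reset for every policy
    for _ in range(horizon):
        state = toy_world_model(state, policy(state))
    return -abs(state["target_x"] - state["gripper_x"])


def cautious(state: State) -> float:
    return 0.1 * (state["target_x"] - state["gripper_x"])


def decisive(state: State) -> float:
    return 0.5 * (state["target_x"] - state["gripper_x"])


scene = {"gripper_x": 0.0, "target_x": 1.0}  # stored once, reused for every run
scores = {p.__name__: evaluate(p, scene) for p in (cautious, decisive)}
print(scores)
```

Because each run starts from a copy of the same stored scene, differences in score come from the policies alone, not from lighting, placement, or sensor drift. The price is compute, plus whatever error the world model itself introduces.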

6. Overlooked Insights

OpenAI's Sora Team Merging Into Robotics Is a Massive Strategic Signal

This was mentioned almost in passing, but it is significant if accurate: according to Gao, OpenAI restructured the Sora team and moved it under its Robotics Lab. Read that way, it is an organizational bet that video generation capability is the foundation for physical AI, which is exactly the thesis of this episode. It would also signal that OpenAI treats robotics as a serious priority rather than an experiment, and sees world-model-scale video generation as infrastructure for robot policy training.

"OpenAI's Sora team was restructured under the Robotics Lab. So I think this year will be quite competitive. It seems they're seriously working on producing something in world models." — Gao Shenyuan [00:58:20.160]

The Three-Stage Training Pipeline (Cosmos → Egocentric Human Video → Robot Data) Is a Replicable Template — and Exposes a Startup Opportunity

Gao revealed that DreamDojo's training followed a specific three-stage curriculum: (1) Cosmos pretraining on diverse third-person video, (2) pretraining on 44,000+ hours of egocentric human video with latent actions, (3) fine-tuning on robot data. Critically, he noted that startups can skip stage 1 (very expensive) by using NVIDIA's open Cosmos weights, dramatically lowering the cost of building competitive world models. This is a template that well-resourced startups can replicate — and the bottleneck becomes stage 2 data curation and stage 3 robot embodiment selection, not raw compute.

"Theoretically, a startup doesn't need to build the Cosmos part internally — they can start from what we do after that. The cost would be dramatically lower, since the Cosmos portion is very expensive. Though this might affect competitiveness if OpenAI and Google invest heavily in this direction." — Gao Shenyuan [01:40:15.500]
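The three-stage curriculum and the "skip stage 1" shortcut can be written down as a small planning sketch. The stage names, the `Stage` fields, and `stages_to_run` are hypothetical; only the ordering and the data descriptions come from the episode.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    covered_by_open_weights: bool  # True if released base weights already include this stage


CURRICULUM: List[Stage] = [
    Stage("video_pretraining", "diverse third-person video (Cosmos)", True),
    Stage("egocentric_pretraining", "44,000+ hours of egocentric human video with latent actions", False),
    Stage("robot_finetuning", "robot data for the target embodiment", False),
]


def stages_to_run(curriculum: List[Stage], start_from_open_weights: bool) -> List[Stage]:
    """Return the stages a team still has to train itself."""
    if not start_from_open_weights:
        return list(curriculum)
    return [stage for stage in curriculum if not stage.covered_by_open_weights]


for stage in stages_to_run(CURRICULUM, start_from_open_weights=True):
    print(stage.name, "<-", stage.data)
```

Starting from open weights leaves only the egocentric and robot stages, which matches Gao's point that the remaining bottlenecks are data curation and embodiment choice rather than the most expensive pretraining run.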