Jim Fan (NVIDIA GEAR) | Sequoia AI Ascent Summary

1. Key Themes

The Great Parallel: Robotics Is Copying the LLM Playbook Step-by-Step

Jim Fan's central thesis is that robotics is structurally replicating the three-step LLM arc (pre-training → supervised fine-tuning → reinforcement learning), and that this intentional copying is the fastest path to capable robots. He calls it "the great parallel."

"Instead of simulating strings, can we simulate next physical world state? And then we can align through action fine tuning onto a thin slice of that simulation that matters for real robots. And we let reinforcement learning carry the last mile. And that's it. The great parallel, copying the LLM's success. If you can't beat them, join them." 00:02:56

VLAs Are Architecturally Wrong — WAMs Are the Successor Paradigm

Fan argues that Vision-Language-Action (VLA) models are fundamentally mis-specified: language gets the most parameters, but robots need physics and verbs, not nouns. The replacement is World Action Models (WAMs), which use video world models as the pre-training backbone.

"The last three years were dominated by VLAs or vision language action models, and models like Pi and GR00T fall in this category. So we assume that the pre-training is done by a VLM, and we simply graft an action head on top of it. But really, if you think about these models, they are LVAs because the most amount of parameters are dedicated to language. So language is first class citizen, followed by vision and action. And by design, VLAs are great at encoding knowledge and nouns, but not so much at physics and verbs." 00:02:58

"Once again, vision and action are now first class citizens... We call this new type of model World Action Models, or WAM. So let's all take a moment of silence for our dear friend VLAs. They've served us well, rest in peace, long live World Action Models." 00:06:26

Video World Models Accidentally Learned Physics — and That's the Pre-Training Signal for Robots

Physics was never coded into video generation models; it emerged from next-pixel prediction at scale. Fan's insight is that this emergent physics knowledge is exactly the pre-training substrate robotics has been missing.

"Physics emerge by predicting the next blob of pixels at scale. And even visual planning emerges. Look at how Veo solves these mazes. It solves them by running simulation forward in pixel space." 00:04:36

"DreamZero jointly decodes the next world states and next actions. And as a result, it's able to zero-shot solve tasks and verbs that it has never seen in training... If the video prediction works, the action works. If the video hallucinates, the action fails." 00:06:26

Teleoperation Is Dying — Egocentric Human Video Is the New Data Flywheel

Fan predicts teleoperation will become negligible within one to two years. The scalable replacement is sensorized egocentric human video, the manipulation equivalent of Tesla's ambient FSD data flywheel.

"In the next year or two, we'll see teleoperation dropping and dropping to almost negligible amount. And then there will be an ensemble of data wearables custom designed for different hardware and use cases. And finally, the main diet for robotics will be egocentric videos." 00:13:23

"We pre-train EgoScale on 21k hours of in-the-wild egocentric human data with zero robot data whatsoever... Then in action fine-tuning we collect only 50 hours of high-precision mocap data and four hours of tele-op. That's four hours of tele-op, less than 0.1% of our training mix." 00:11:31

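The "less than 0.1%" figure in the quote above can be checked with simple arithmetic. The sketch below assumes the training mix is measured in hours and that "21k" means exactly 21,000; the dictionary keys are illustrative labels, not names from the talk.

```python
# Hours quoted in the talk; treating "21k" as exactly 21,000 is an assumption.
hours = {
    "egocentric_human_video": 21_000,
    "mocap": 50,
    "teleop": 4,
}

total = sum(hours.values())
# Fraction of the whole mix contributed by each data source.
shares = {name: h / total for name, h in hours.items()}

for name, share in shares.items():
    print(f"{name}: {share:.3%}")
```

Under these assumptions teleoperation comes to roughly 0.02% of the mix, comfortably below the 0.1% bound Fan quotes.
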
Neural Scaling Laws Now Apply to Dexterity

Fan reports a direct empirical discovery: dexterous manipulation follows the same log-linear scaling law that language models established, six years after the original finding.

"The most fascinating finding from the paper is that we discovered this neural scaling law for dexterity. It's a very clean relationship between the amount of hours we put into pre-training and the optimal validation loss. In fact, it's a clean log-linear mathematical equation six years after the original neural scaling law for language models." 00:12:29

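The relationship described above can be illustrated with a small least-squares fit. This is a sketch only: it assumes "log-linear" means validation loss is linear in the natural log of pre-training hours, L(H) = a + b ln H, and the data points are synthetic values generated from made-up coefficients, not numbers from the paper. The function name fit_log_linear is hypothetical.

```python
import math

def fit_log_linear(hours, losses):
    """Ordinary least-squares fit of loss = a + b * ln(hours); returns (a, b)."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic points generated from a = 1.0, b = -0.05. NOT data from the paper.
hours = [1_000, 2_000, 5_000, 10_000, 20_000]
losses = [1.0 - 0.05 * math.log(h) for h in hours]

a, b = fit_log_linear(hours, losses)
# Extrapolate the fitted line to a larger data budget.
predicted = a + b * math.log(21_000)
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 21k hours={predicted:.3f}")
```

With a negative slope b, each doubling of pre-training hours lowers the fitted loss by the same fixed amount, |b| ln 2, which is what makes this kind of law useful for budgeting data collection.
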
Simulation at Scale: Compute = Environment = Data

Robotics needs millions of training environments just as LLM labs need millions of coding sandboxes for RL. NVIDIA's answer is a layered stack: real robot stations + GPU-accelerated world scans + neural world model simulators, collapsing the distinction between compute and data.

"The new post-training paradigm for robotics is a massively parallel RL system that runs on a few real robot stations, a bunch of graphics cores running world scans, and heavy inference compute running world models. Or as this equation goes, compute now equals environment, now equals data." 00:15:58

The Physical Turing Test, Physical API, and Physical Auto-Research: A Three-Achievement Roadmap to 2040

Fan lays out a specific, sequenced technology tree with timelines: human-indistinguishable dexterity in 2–3 years, software-configurable robot fleets enabling lights-out factories, and finally self-improving robots by 2040.

"The first is passing the physical Turing test. Across a wide range of activities, you cannot tell the difference between a human doing a task or a robot doing it... Physical Turing test is about unit energy in and unit labor out... Maybe it's two to three years away." 00:16:47

"I can say with 95% certainty that we'll get to the end of the technology tree by 2040. And we'll still be young." 00:18:34

2. Contrarian Perspectives

VLAs — the Dominant Industry Paradigm — Are Architecturally Backward

The mainstream bet in robotics AI (Pi, GR00T, and most funded startups) is built on VLAs. Fan is declaring the entire architectural foundation of the current generation wrong, not just suboptimal.

"By design, VLAs are great at encoding knowledge and nouns, but not so much at physics and verbs. It's kind of head heavy in the wrong places." 00:03:49

Teleoperation — the Investment Theme of the Last Three Years — Is Nearly Over

Enormous capital has flowed into teleoperation infrastructure: VR headsets, low-latency streaming rigs, and specialized hardware. Fan argues the entire category is at its ceiling and will collapse within 1–2 years.

"The past three years have been dominated by teleoperation. It's the golden era... And yet for tele-op, it's upper bounded by 24 hours per robot per day, the fundamental physical limit. And actually, who am I kidding? It's more like three hours per robot per day and only when the robot god is merciful because they throw tantrums all the time." 00:07:15

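To make the throughput ceiling concrete, the two per-robot rates in the quote can be set against the 21,000 hours of egocentric video cited elsewhere in the talk. Combining these figures is an illustration added here, not a calculation Fan presents, and the function name is hypothetical.

```python
import math

def robot_days_needed(target_hours, hours_per_robot_per_day):
    """Robot-days required to collect target_hours at a given daily yield."""
    return math.ceil(target_hours / hours_per_robot_per_day)

TARGET_HOURS = 21_000  # size of the egocentric pre-training set cited in the talk

ceiling = robot_days_needed(TARGET_HOURS, 24)   # theoretical physical limit
realistic = robot_days_needed(TARGET_HOURS, 3)  # Fan's "more like three hours"

print(ceiling, realistic)
```

Under these numbers, a hypothetical fleet of 10 robots running at the realistic rate would need 700 days to match that volume.
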
AI Video "Slop" Is the Most Valuable Pre-Training Signal in Robotics

AI-generated video is widely dismissed as low-quality "slop," yet the models that produce it have learned to simulate the next world state internally. What looks like noise points to the signal: a learned physics simulator that can serve as the pre-training backbone for robot policies.

"I can watch these cats playing banjo on security cam all day. It's peak internet. But really, look at this. No one can take this seriously until we realize that these video models are learning to simulate next world state internally." 00:04:36

Robot Data Collection Should Disappear Into Daily Life, Not Be a Dedicated Activity

The industry assumption is that robot training data requires dedicated collection sessions. Fan argues the right model is ambient, like Tesla FSD — humans generate data as a byproduct of living, not as a task.

"The data collection needs to get out of the way, fade into the background, so we can capture the full glory of human dexterity across all walks of life, across all labors of economic value." 00:10:40

3. Companies Identified

NVIDIA (GEAR / Robotics)

NVIDIA's embodied AI research group, led by Jim Fan. Mentioned as the home of DreamZero, EgoScale, DexUMI, DreamDojo, and the real-to-sim-to-real pipeline. The organization is building the full robotics stack: model architecture, data collection hardware, and neural simulation.

"We pre-train EgoScale on 21k hours of in-the-wild egocentric human data with zero robot data whatsoever... That's four hours of tele-op, less than 0.1% of our training mix." 00:11:31

Generalist (Robotics Startup, UMI Derivative)

A startup that adapted the UMI (Universal Manipulation Interface) gripper design into a wearable data-collection tool, described as one of two unicorn startups spawned by the UMI paper.

"On the left-hand side is Generalist improving this design so you can wear the gripper here." 00:09:01

Sunday

A robotics data startup making three-finger data gloves for human-demonstration collection, the second unicorn startup credited to the UMI paper lineage.

"On the right-hand side, Sunday made these three-finger data gloves." 00:09:01

Tesla

Referenced as the gold standard for ambient, scalable physical data collection via FSD — the model Fan explicitly says robotics manipulation must replicate.

"Anyone driving Tesla or Waymo here? When you're driving, you're actually contributing to the biggest physical data flywheel. And the beauty is you don't even feel it during FSD, because the data upload is an ambient process." 00:09:46

Waymo

Co-referenced with Tesla as a real-world example of the ambient physical data flywheel that robotics must emulate.

"Anyone driving Tesla or Waymo here? Anyone? You know, when you're driving, you're actually contributing to the biggest physical data flywheel." 00:09:46

Google DeepMind (Veo)

Google DeepMind's video generation model used by Fan as the primary demonstration that physics — gravity, buoyancy, lighting, reflection, refraction — emerges from next-pixel prediction at scale.

"These are some clips from Veo 3. You can see that the models, they pick up gravity, buoyancy, lighting, reflection, refraction, all by themselves. None of this is coded in." 00:04:36

OpenAI

Referenced historically as the recipient of Jensen Huang's first DGX-1 delivery, and as the organization where the early deep learning cohort (including Fan and Andrej Karpathy) first converged.

"There's a guy in shiny leather jacket, you know, big biceps, hauling in this large metal tray. And on this large piece of metal, he wrote, to Elon and the OpenAI team, to the future of computing and humanity, I present you the world's first DGX-1." 00:00:27

4. People Identified

Jensen Huang

CEO of NVIDIA. Personally delivered the world's first DGX-1 to OpenAI in 2016, the moment Fan credits as the origin point of the modern AI era.

"There's a guy in shiny leather jacket, you know, big biceps, hauling in this large metal tray... he wrote, to Elon and the OpenAI team, to the future of computing and humanity, I present you the world's first DGX-1. So that was the first time I met Jensen." 00:00:27

Andrej Karpathy

Pioneering AI researcher, formerly of OpenAI and Tesla. Identified by Fan as having signed the original DGX-1 alongside him in 2016; invoked as a peer witness to the founding moment of modern deep learning.

"Can you spot another? That's Andrej right there. So Andrej, we're going to the Computer History Museum." 00:00:27

Ilya Sutskever

Co-founder of OpenAI, referenced for his famous aphorism which Fan uses as the thematic spine of his talk.

"No one can describe what happened next better than Ilya himself. If you believe in deep learning, deep learning will believe in you." 00:01:20

Bill Dally

NVIDIA's Chief Scientist. Mentioned as having personally performed teleoperation inside NVIDIA's lab, jokingly described as collecting "the most expensive teleop trajectory ever" given his salary.

"This is NVIDIA's chief scientist, Bill Dally, operating teleoperation inside our lab. And given his salary, I think this is by far the most expensive teleop trajectory ever collected in our data set." 00:07:15

Demis Hassabis

CEO of Google DeepMind. Name-checked approvingly in the context of the Veo video model results powering physics emergence.

"Sorry, I just couldn't resist. No, bananas are too good. Thanks, Demis." 00:02:58

5. Operating Insights

Pre-Training on Human Egocentric Video Before Touching Any Robot Data Dramatically Reduces Robot Data Requirements

Fan's EgoScale result is a direct operating blueprint: pre-train on massive human video (21,000 hours), then fine-tune on a small action dataset (50 hours of high-precision mocap plus 4 hours of teleoperation). This inverts the conventional assumption that you need large robot datasets first, and has immediate implications for any team building manipulation policies.

"We pre-train EgoScale on 21k hours of in-the-wild egocentric human data with zero robot data whatsoever and during pre-training we predict these hand joints and wrist poses. Then in action fine-tuning we collect only 50 hours of high-precision mocap data and four hours of tele-op. That's four hours of tele-op, less than 0.1% of our training mix." 00:11:31

Use the Real-to-Sim-to-Real Pipeline with iPhone Scans to Scale RL Environments Without Buying More Robots

Fan describes a concrete, accessible method: photograph an environment with an iPhone, extract objects via a 3D scan pipeline, synthesize them into a physics simulator, then augment infinitely with "digital cousins." This converts a single real setup into millions of RL training environments.

"Let's say you take an iPhone picture, and you can pass this through this 3D world scan pipeline to extract all the objects, and then automatically synthesize them again inside a classical physics simulator. So all these objects are actually interactive after the scan. And then you can augment this infinitely in simulation with variations that we call digital cousins." 00:14:15

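The augmentation step of the pipeline above can be sketched as follows. This is an illustration under loud assumptions: it treats a "digital cousin" as a randomly perturbed copy of the scanned scene (jittered position, rotation, and scale), which may be simpler than what NVIDIA's pipeline does, and every name here (SceneObject, make_cousin, the jitter parameters) is hypothetical. No real scanner or simulator API is used.

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SceneObject:
    name: str
    x: float      # metres
    y: float      # metres
    yaw: float    # radians
    scale: float  # 1.0 = as scanned

def make_cousin(scene, rng, pos_jitter=0.05, yaw_jitter=0.3, scale_jitter=0.1):
    """Return a perturbed copy of a scanned scene (one hypothetical 'cousin')."""
    return [
        replace(
            obj,
            x=obj.x + rng.uniform(-pos_jitter, pos_jitter),
            y=obj.y + rng.uniform(-pos_jitter, pos_jitter),
            yaw=obj.yaw + rng.uniform(-yaw_jitter, yaw_jitter),
            scale=obj.scale * rng.uniform(1 - scale_jitter, 1 + scale_jitter),
        )
        for obj in scene
    ]

# Stand-in for the output of a scan pipeline (made-up values).
scanned = [
    SceneObject("mug", 0.40, 0.10, 0.0, 1.0),
    SceneObject("plate", 0.55, -0.05, 0.0, 1.0),
]

rng = random.Random(0)  # seeded so the variants are reproducible
cousins = [make_cousin(scanned, rng) for _ in range(1000)]
print(len(cousins), cousins[0][0])
```

Each generated scene keeps the same objects as the scan but in a slightly different layout, so a single real setup fans out into as many training environments as the simulation budget allows.
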
Check Video Prediction Quality as a Real-Time Proxy for Policy Reliability

DreamZero's finding that video hallucination tracks action failure gives operators a concrete diagnostic: monitor the robot's internal dream video during deployment. Because the correlation is tight, a degrading video prediction is a warning sign that the action is likely to fail.

"As a robot executes, we can visualize what it's dreaming about. And the correlation is very tight. If the video prediction works, the action works. If the video hallucinates, the action fails." 00:06:26

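One way such a monitor might look is sketched below. The talk does not say how hallucination is detected, so the choices here are assumptions: frames are flat lists of pixel intensities in [0, 1], prediction quality is measured as mean squared error against the frame the camera later observes, and an alarm fires when a rolling average crosses a fixed threshold. DreamMonitor, frame_mse, and the threshold value are all hypothetical.

```python
from collections import deque

def frame_mse(predicted, observed):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(predicted) != len(observed):
        raise ValueError("frames must have the same number of pixels")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)

class DreamMonitor:
    """Flags a policy as unreliable when recent prediction error runs high."""

    def __init__(self, threshold=0.02, window=5):
        self.threshold = threshold          # hypothetical value; needs tuning
        self.errors = deque(maxlen=window)  # rolling window of recent errors

    def update(self, predicted, observed):
        """Record one step; return True if the rolling mean exceeds threshold."""
        self.errors.append(frame_mse(predicted, observed))
        return sum(self.errors) / len(self.errors) > self.threshold

monitor = DreamMonitor(threshold=0.02, window=3)
observed = [0.5] * 16
good = [0.52] * 16  # close to reality: MSE is about 0.0004
bad = [0.9] * 16    # far from reality: MSE is about 0.16

# Three accurate predictions followed by one hallucinated frame.
flags = [monitor.update(good, observed) for _ in range(3)]
flags.append(monitor.update(bad, observed))
print(flags)
```

The rolling window is a design choice to avoid alarming on a single noisy frame; a deployed system would need a perceptual metric and a threshold calibrated against logged failures.
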
6. Overlooked Insights

The Physical API + Lights-Out Factory Is a Specific Near-Term Business Model, Not Sci-Fi

Fan slips in a remarkably concrete commercial vision that the audience seems to receive as futurism, but it is actually a near-term product specification: a software API layer over robot fleets that enables fully autonomous factories — "printers of atoms" — and automated wet labs for scientific discovery. This is not a 2040 prediction; Fan places it one step after the Physical Turing Test, which he says is 2–3 years away. Any company building the orchestration and API layer for heterogeneous robot fleets is positioned at the exact chokepoint Fan describes.

"You have a whole fleet of robots. And they can be configured just like any other software using APIs and command lines orchestrated someday by Opus 9.0. And if we have this physical API, we'll be able to realize lights-out factories. Those are essentially printers of atoms. They take as input design in markdown files and then output fully assembled products, completely autonomous. Or these wet labs that automate scientific discoveries in chemistry, biology, and medicine." 00:17:41

UMI Spawned Two Unicorns — The "Data Wearable" Category Is Producing Outsized Returns Per Paper

Fan mentions almost in passing that a single paper, UMI (Universal Manipulation Interface), generated two unicorn startups. This is an extraordinarily high return rate for a single academic contribution, and it signals that the data wearables category (custom exoskeletons, gloves, and sensorized tools for different robot morphologies) is likely to produce several more billion-dollar companies as the paradigm shifts away from teleoperation. The category is nascent, under-indexed by investors relative to the model side, and Fan is signaling it will scale to hundreds of thousands of hours of data, far beyond what teleoperation can reach.

"UMI is perhaps one of the greatest papers ever written in robotics data and it spawned two unicorn startups. On the left-hand side is Generalist improving this design so you can wear the gripper here and on the right-hand side, Sunday made these three-finger data gloves." 00:09:01