Sergey Levine

1. Key Themes

Robotic Foundation Models Are the True Unlock — Not Humanoids

The central thesis of the episode is that the focus on humanoid robots misses the deeper revolution: general-purpose foundation models that can control any robot body. The insight is that the intelligence layer, not the hardware, is the scarce and valuable asset.

"I think every time that people try to take robots out of the factory and into open world environments, they very quickly realize that in the real world, there's a huge range of things that can happen." [00:43:12]

"The old way of thinking about robotic software as something that drives one robot kind of naturally leads to that because if you want one very general robot, then you kind of have to have everything. But if you accept the premise that we're going to have robotic foundation models that can drive lots of different robots to do lots of different things, now it kind of unchains your thinking." [00:53:44]

Cross-Embodiment Training Produces Dramatically Better Results Than Specialized Models

The RT-X project, which pooled data from roughly 30 academic robotics labs, demonstrated that a single generalist model trained on diverse data outperforms the best specialized models each lab had developed individually. This mirrors the insight that powered language model breakthroughs.

"What we found is that the generalist model on average was about 50% more successful than whatever each individual lab had developed. And that's really, really exciting because this is kind of paralleling a lot of the development that we've seen in language models." [00:18:35]

"In the first year or so of the company, we worked almost entirely with static arms. And then in early 2025, we decided we wanted to start experimenting with mobile robots. We had very few mobile robots, so we could only collect a little bit of data with them. And our first publicly released research project on this, which came in April 2025, used a training set of which only 3% of the data was collected on mobile robots. So 97% was from statically mounted arms, but we could get the mobile robots to actually generalize very broadly." [00:20:12]

Real-World Data Is Superior to Simulation at Scale — With a Key Nuance

Levine pushes back against simulation as a substitute for real-world robotic data, drawing a direct parallel to computer vision where simulation was heavily attempted but ultimately lost to real data. Simulation has a legitimate role for edge cases but not as the primary data engine.

"Computer vision is an area where simulation has actually been used very little, despite a lot of attempts. Why? Well, it's not actually because rendering images is hard... it's that getting real images is so much easier... And I think that in robotics, a mental trap that people sometimes fall into is that they say, well, maybe in robotics, it's hard to get data... I think this is a little bit of a mistake because actually if you're serious about building general purpose robots, they'll go out into the world and do lots of things. The better you get at building generalist robots, the more robots there are and the more data there should be coming in." [00:08:01]

The Data Flywheel Has a Bootstrapping Problem — But a Clear Solution Path

Getting enough robots into the real world to generate sufficient data is the hard bootstrapping problem. Physical Intelligence's strategy is to start with teleoperation data and shift progressively toward autonomous and language-supervised learning as the model improves.

"I think it makes a lot of sense to start off building the initial foundation with teleoperation data. But as your model gets better, it should be able to leverage more accessible and more scalable data sources... One more accessible source of supervision is instructions... we found that we could actually get improvement in our policies by supervising the robot essentially through language." [00:12:23]

"For me, one of the things I really want to figure out over the next year or maybe the next few years is how to go from the foundation, which I think at this point we have a pretty good understanding of, to something that is a true data flywheel, a true continual learning system where the robot experiences more and more tasks. And the more it experiences them, the better it gets." [00:55:23]

VLAs Are the De Facto Standard — But Second-Generation Models Add a "Motor Cortex"

The technical architecture of modern robot brains is settling around vision-language-action models (VLAs). The key evolution from first-generation to second-generation VLAs is the addition of a diffusion-based "motor cortex" that generates continuous, high-dimensional action trajectories rather than treating actions as discrete text outputs.

"Second generation VLAs take inspiration from how VLMs add a visual cortex to the language model. And they also add a kind of a virtual motor cortex, a specialized little piece of circuitry whose job is to take the outputs from the language model backbone and decode them into continuous actions. And this is typically done with a diffusion — basically the same kind of technology that's used to generate images and videos is now used to generate trajectories of robot joints." [00:40:10]

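The diffusion-based decoding described above can be illustrated with a toy sampler. This is a minimal sketch, not Physical Intelligence's implementation: the trajectory is a flat list of joint values, the schedule and step count are arbitrary choices, and `toy_noise_predictor` is a hypothetical stand-in for the learned action head (it uses a closed-form answer for a single known target instead of a neural network conditioned on the language model backbone). What it shows is only the loop structure: start from noise and iteratively denoise into a continuous action trajectory.

```python
import math
import random


def make_alpha_bars(num_steps):
    """Signal fraction kept at each noise level, from ~1 (clean) down to ~0 (noise).

    Uses a cosine-shaped curve; the (num_steps + 1) denominator keeps the last
    value slightly above zero so later divisions stay well defined.
    """
    return [
        math.cos(0.5 * math.pi * (t + 1) / (num_steps + 1)) ** 2
        for t in range(num_steps)
    ]


def toy_noise_predictor(noisy, alpha_bar, target):
    """HYPOTHETICAL stand-in for a learned action head.

    A real head would be a neural network conditioned on backbone features.
    Here the "data distribution" is a single known target trajectory, for
    which the noise contained in `noisy` has a closed form.
    """
    signal = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(x - signal * m) / noise_scale for x, m in zip(noisy, target)]


def sample_actions(target, num_steps=50, seed=0):
    """Deterministic DDIM-style sampling: start from noise, denoise step by step."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure Gaussian noise
    for t in reversed(range(num_steps)):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0  # 1.0 means fully clean
        eps = toy_noise_predictor(x, a_t, target)
        # Estimate the clean trajectory implied by the predicted noise.
        x0 = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        # Re-noise that estimate to the next, lower noise level.
        x = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e for c, e in zip(x0, eps)]
    return x


if __name__ == "__main__":
    joint_targets = [0.10, -0.25, 0.40, 0.00]  # illustrative joint angles (radians)
    print(sample_actions(joint_targets))
```

Because the stand-in predictor knows the target exactly, the sampler recovers it from any random starting noise. With a trained network the same loop would instead produce a plausible trajectory for the current observation and instruction.
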
Generality Is Necessary Even If You Want Specialization

A counterintuitive but operationally important insight: even a narrow-use robot that only needs to do one task still needs a generalist model underneath it, because the real world constantly presents unexpected situations that fall outside any narrow specialist's training.

"We had a project on using our systems to assemble boxes. You think that's pretty structured — just build a box. But sometimes you grab boxes off the pile and you get two boxes instead of one. So you have to put one in the back. And maybe something is torn. So you have to discard it. And maybe someone left their phone on the table. So you have to put the phone away... The generalist model, the one that could handle a wide variety of tasks, actually becomes a better specialist because it can deal with all that weird stuff that arises." [00:45:11]

The Intelligence Layer Is the Operating System of the Robot World

Levine frames robotic foundation models as analogous to operating systems — the platform layer on top of which countless applications and form factors can be built. This has enormous implications for where value accrues in the robotics stack.

"I really hope that good robotic foundation models will provide that kind of middle layer. You know, in a computer, that layer is basically taken up by the operating system. The operating system is the thing on top of which you put applications. So if there's an intelligence layer that's pretty general, on top of which somebody can essentially prompt engineer their task, they can experiment with all the aspects of that application, really nail it correctly, design the right form factor. Then we'll see a lot of experimentation, a lot of creativity." [00:48:54]

Two Distinct Training Paradigms Are Splitting the Field — and Attacking Different Problems

VLAs and sim-to-real reinforcement learning are currently the two dominant paradigms, and they are specialized for fundamentally different problem types. Understanding which one applies to which domain is critical for investors and builders evaluating robotics companies.

"The vision language action models are basically the currently dominant paradigm for robotic manipulation problems where robots need to interact with diverse environments and diverse objects. The sim-to-real stuff is the method of choice for highly acrobatic and athletic movements, typically for humanoids... The sim-to-real stuff is great for really understanding the physics of the robot, but not so great for generalization. The vision language action models are great for generalization." [00:37:31]

2. Contrarian Perspectives

Humanoids Are Not the Most Important Hardware Bet — Platform Diversity Is

Against the dominant narrative of humanoids as the obvious robot form factor, Levine argues that the human shape is a historically contingent design choice, not an optimal one. Once the intelligence layer is abstracted away, form factor should be determined by the specific job to be done.

"My own take when it comes to robot morphology is that I actually really hope that robots will kind of end up being a little bit like personal computers where there's like general software and the form factor of the device can be very different for different jobs... Maybe if you live in a small apartment in New York, maybe you have a little home robot that attaches to the ceiling and pivots around... and maybe if you live on a farm, you have a big mobile robot, a tractor with a bunch of arms attached." [00:46:39]

Simulation Will Not Solve the Robot Data Problem — It Is a Trap

While many well-funded robotics efforts lean heavily on simulation to generate training data cheaply and at scale, Levine argues this is a mistake. He points to computer vision, where simulation was tried extensively and largely lost out to real data, as a cautionary precedent.

"In robotics, a mental trap that people sometimes fall into is that they say, well, maybe in robotics it's hard to get data... I think this is a little bit of a mistake because actually if you're serious about building general purpose robots, they'll go out into the world and do lots of things. The better you get at building generalist robots, the more robots there are and the more data there should be coming in. So it actually to me makes a lot more sense to pay a little bit more of that upfront cost to get robots out there and get the data coming in." [00:08:01]

World Models vs. VLAs Is a False Dichotomy

Against prominent researchers like Yann LeCun who position world models as the necessary foundation for true robotic intelligence, Levine argues the distinction is overstated and may be mostly about which abstraction you work at.

"The view that I would disagree with is that there are model-free policies, reinforcement learning, VLAs, and world models and these are totally separate things. I actually don't think they're separate things. And I think that the right answer will be a model that can do all of those things together — that can use world model-like prediction when it's necessary and can do the model-free stuff when that's more appropriate." [00:31:33]

Data Quality Heterogeneity Is a Feature, Not a Bug

Traditional machine learning intuition says training data should be high quality and consistent. Levine argues the opposite for robot foundation models — mixing bad and good data allows the model to learn what good looks like, which enables better generalization.

"Now the variety of data quality actually becomes a blessing rather than a curse. Because if you see lots of good things and lots of bad things, then you can figure out how to distinguish good from bad and do better at test time." [00:23:06]

The Existing Closed-World Strategy for Robots Is Fundamentally Broken

Against the incremental approach of constrained environments and highly specialized robots, Levine argues that even marginal openness to the real world immediately exposes catastrophic failure modes — citing early autonomous driving efforts as the canonical historical precedent.

"The gap between a closed world and an open world is enormous. And you can't be just a little open world. Like as soon as you're out in the wild, immediately stuff can happen. Maybe it happens rarely, but that doesn't save you. Like even if it happens rarely, you have to deal with it." [00:44:12]

3. Companies Identified

Physical Intelligence (π)

A robotics AI startup co-founded by Sergey Levine and colleagues from the RT-X academic project. Builds general-purpose robotic foundation models designed to control robots of any morphology across any task. Currently in R&D phase with cloud-based inference. Published research in April 2025 showing mobile robot generalization from only 3% mobile robot training data.

"A big difference between how Physical Intelligence is approaching this and how most other research labs approach this question is that we are not being very picky about which robots we use. We're bringing in everything and trying to build this very broad foundation." [00:05:55]

Boston Dynamics

Robotics company (Hyundai subsidiary) known for the Atlas humanoid and Spot quadruped. Originally built on pure control theory, now integrating AI models. Its robots are reportedly being deployed in Hyundai factories. Mentioned as an example of the shift toward AI-driven robot control.

"Boston Dynamics — in the past they were pure control theory. But now they've partnered with AI and they're using AI models to control the robots... there's been a lot of talk of them being used in factories. I think Hyundai is deploying them." [00:51:31]

1X / Neo (implied as "Neo")

Humanoid robot company whose product was tested in a home by Joanna Stern of the Wall Street Journal. Currently requires a human teleoperator to accompany the robot during home deployment to collect training data — illustrating the scalability challenge of real-world data collection.

"Neo is available for home use, but it comes along with a teleoperator that spends time in your house, like walking around doing something. And that's to collect that data, to train the robot. They need a lot of these deployments to collect enough data. That doesn't seem like a very scalable solution." [00:11:30]

Foundation (startup)

Early-stage humanoid robotics startup run by Mike LeBlanc focused on military applications. Training robots for highly specific single-task missions (e.g., placing explosive charges on doors) rather than generalization. Mentioned as a contrasting approach to Physical Intelligence's generalist strategy.

"I had a conversation with a guy, Mike LeBlanc, he's got a startup called Foundation and he's doing humanoids for the military. They're training them to do one thing — put an explosive charge on a door, which is a very dangerous thing for a soldier to do. So it drops from a Humvee, walks, slaps the thing on the door and comes back." [00:42:51]

Google (DeepMind / Robotics)

Google ran the "arm farm" project circa 2017-2018, which pioneered fleet-based collective robotic learning using identical robot stations. This was an early precursor to the cross-embodiment learning approach. Also a participant in the RT-X project.

"The Google arm farm project — every single robot platform at every single robot station was as close as possible to each other. They were virtually identical. And that worked very well with the state-of-the-art learning technology of 2017, 2018." [00:15:53]

Modulate.ai

AI audio analysis company. Their product Velma uses an ensemble listening model architecture with hundreds of sub-models for voice analysis including tone, timing, and intent. Mentioned as a podcast sponsor. Relevant for fraud defense, deepfake detection, and customer service moderation.

"Velma from Modulate, an AI built on ensemble listening model architecture, specializes in audio analysis. It orchestrates hundreds of smaller sub-models purpose-built to understand the nuances of voice, like tone, timing, and intent." [00:00:00]

4. People Identified

Sergey Levine

Co-founder of Physical Intelligence and professor at UC Berkeley. Pioneer of vision-language-action models and robotic foundation models. Led the RT-X project, which demonstrated that generalist robotic models outperform specialized ones by ~50%. One of the most technically credible voices in the robotic foundation model space.

"What we found is that the generalist model on average was about 50% more successful than whatever each individual lab had developed." [00:18:35]

Fei-Fei Li

Stanford professor and AI researcher. Recently appeared on the same podcast discussing world models. Co-founder of World Labs. Cited as a prominent advocate for world models as a framework for robotic intelligence.

"I had Fei-Fei Li on recently talking about world models." [00:25:45]

Yann LeCun

Chief AI Scientist at Meta. Recurring guest on the same podcast. Prominent advocate for latent-space world models as the necessary architecture for true machine intelligence, a view Levine respectfully challenges.

"Yann LeCun, for example, I know he advocates for essentially a latent space world model, which predicts a sufficient statistic of observations." [00:27:57]

Percy Liang

Stanford professor who coined the term "foundation model." Credited with establishing the intellectual framework that Physical Intelligence's work builds upon.

"The term foundation model was coined by Percy Liang and his colleagues at Stanford for precisely this reason, because this kind of broad basis of knowledge gives you a foundation on top of which you can then put other things." [00:03:17]

Mike LeBlanc

Founder of Foundation, a military humanoid robotics startup. Training robots for single, highly specific dangerous tasks for military use. Mentioned as representing a philosophically opposite approach to generalist robot development.

"I had a conversation with a guy, Mike LeBlanc, he's got a startup called Foundation and he's doing humanoids for the military." [00:42:51]

Robert Heinlein

Classic American science fiction author whom Levine reads for pleasure. Cited for his optimistic vision of technology and American culture, which Levine finds refreshing and a useful escape from the constrained thinking of rigorous engineering.

"I think I only discovered Robert Heinlein's work in the last few years. This very optimistic aspect of the American culture — I think is very refreshing, especially in today's day and age." [00:57:15]

5. Operating Insights

Use Language Feedback as a Scalable Supervision Signal Once Your Base Model Is Strong Enough

Physical Intelligence discovered that once the low-level motor skills of a model are sufficiently capable, you can improve policy quality without expensive teleoperation — simply by providing natural language corrections. This has direct implications for any team building robot learning pipelines and trying to reduce data collection costs.

"We found that we could actually get improvement in our policies by supervising the robot essentially through language. And this only started happening once the model became powerful enough that the low level skills were already pretty good. Then you could correct the robot and say like, 'you needed to pick up the plate and make sure you put the plate in the sink.' The way the model works internally is very similar to how modern reasoning models work — there are internal thoughts that are generated and then the final action is chosen based on those thoughts. So essentially this kind of language feedback supervises the internal thoughts rather than the low-level actions." [00:13:14]

Don't Over-Invest in the Perfect Hardware Before the Intelligence Layer Exists

A strategic operating principle: hardware form factor decisions are being made prematurely across the industry because teams assume they must solve for the ideal body plan before they can start. With a general intelligence layer available, form factor can be iterated upon rapidly and cheaply.

"The trouble with robots is that the barrier to entry for serious open world robotic systems is extremely high. You have to solve open research problems just to get your prototype out the door. And I think that's kind of what's actually limiting a lot of creativity." [00:50:46]

Train Offline Reinforcement Learning on Mixed-Quality Data — Let the Model Learn the Difference

Instead of filtering training data for quality, build the model to predict outcomes from both good and bad demonstrations, then optimize toward good outcomes at inference time. This transforms the practical problem of inevitably messy data collection from a liability into an asset.

"Instead of supervising the model to produce the same actions that are in the data, what you do is you supervise the model to predict the outcomes. So you train the model so that it can predict: if I see this and I do this, will that be good or will that be bad? And if you can do a really good job predicting those outcomes, then you can tell the model, okay, now do whatever will lead to the good outcome." [00:22:11]

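The outcome-prediction idea in the quote above can be sketched with a deliberately tiny example. This is an illustrative simplification, not the method used by Physical Intelligence: it treats each logged attempt as a single (state, action, success) record, replaces the learned predictor with simple counting, and uses made-up state and action names. It contrasts copying the most frequent action with choosing the action whose predicted outcome is best.

```python
from collections import Counter, defaultdict


def fit_outcome_model(episodes):
    """Estimate P(success | state, action) by counting logged outcomes."""
    counts = defaultdict(lambda: [0, 0])  # (state, action) -> [successes, trials]
    for state, action, success in episodes:
        counts[(state, action)][0] += int(success)
        counts[(state, action)][1] += 1
    return {key: wins / trials for key, (wins, trials) in counts.items()}


def choose_action(model, state):
    """Pick the action with the highest predicted success rate in this state."""
    scores = {action: p for (s, action), p in model.items() if s == state}
    if not scores:
        raise KeyError(f"no logged data for state {state!r}")
    return max(scores, key=scores.get)


def imitate_most_common(episodes, state):
    """Baseline: copy whatever action appears most often, ignoring outcomes."""
    actions = Counter(a for s, a, _ in episodes if s == state)
    return actions.most_common(1)[0][0]


# Illustrative mixed-quality log: the frequent action usually fails.
LOG = (
    [("plate_on_table", "grasp_center", False)] * 5
    + [("plate_on_table", "grasp_center", True)] * 1
    + [("plate_on_table", "grasp_rim", True)] * 3
)

if __name__ == "__main__":
    model = fit_outcome_model(LOG)
    print(imitate_most_common(LOG, "plate_on_table"))  # grasp_center
    print(choose_action(model, "plate_on_table"))      # grasp_rim
```

The failed attempts are what let the outcome model rank the two grasps. Filtering them out would leave nothing to contrast the successful grasp against, which is the sense in which mixed-quality data becomes an asset.
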
Partition Inference by Abstraction Level for On-Device Reliability

For anyone deploying AI-driven robots (or complex edge AI systems more broadly), inference can be partitioned by abstraction level: small, fast local models handle low-level reflexes, while larger cloud models handle high-level reasoning. The result is a practical design pattern that degrades gracefully under connectivity loss.

"The highest levels are maybe more appropriate to offload to a remote inference server. And the lowest levels, the ones that are really doing motor control and closing a loop very tightly on perception, run locally. The good news is that the lowest levels are probably also going to be the smallest ones in terms of number of parameters, because they're not as cognitively demanding and not as complex." [00:33:25]

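One way to picture this partitioning is a control loop that always runs the local policy and treats the remote planner as optional. The sketch below is an assumption-laden illustration rather than a description of any real system: `PartitionedController`, `max_plan_age_s`, and `safe_subgoal` are hypothetical names, the planner and policy are plain callables, and the remote call is made synchronously on every tick for simplicity (a real robot would more likely query it asynchronously at a lower rate).

```python
import time


class PartitionedController:
    """Local control loop that treats the remote planner as optional."""

    def __init__(self, remote_planner, local_policy,
                 max_plan_age_s=2.0, safe_subgoal="hold_position"):
        self.remote_planner = remote_planner  # slow, high-level (may be offline)
        self.local_policy = local_policy      # fast, low-level (always available)
        self.max_plan_age_s = max_plan_age_s
        self.safe_subgoal = safe_subgoal
        self.subgoal = safe_subgoal
        self.plan_time = None

    def step(self, observation, now=None):
        """Run one control tick and return a motor command."""
        now = time.monotonic() if now is None else now
        try:
            self.subgoal = self.remote_planner(observation)
            self.plan_time = now
        except ConnectionError:
            # Keep the last plan while it is fresh; otherwise fall back.
            stale = self.plan_time is None or now - self.plan_time > self.max_plan_age_s
            if stale:
                self.subgoal = self.safe_subgoal
        return self.local_policy(observation, self.subgoal)
```

The design choice worth noting is that the motor command never waits on the network: a lost connection changes which subgoal the local policy pursues, not whether it runs.
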
6. Overlooked Insights

Agricultural and Non-Traditional Equipment Is an Under-Discussed Early Beachhead Market for Robotic Foundation Models

In a single throwaway line, Levine mentions that partners have already adapted Physical Intelligence's models to agricultural equipment, a domain not conventionally thought of as robotics. Agriculture is a large, underserved market with relatively controlled environments, high labor costs, and strong ROI incentives. If a general robot foundation model already transfers to farm equipment with little additional training, agricultural robotics may become the first at-scale deployment of this technology, ahead of homes or even factories. That possibility is almost entirely absent from mainstream investor discourse.

"Some of them can use them for mobile robots, things like agricultural equipment that we wouldn't conventionally think of as robots in the usual sense." [00:16:45]

The 3% Mobile Data Finding Is a Hidden Signal About Capital Efficiency Across the Entire Robotics Hardware Stack

Levine mentions almost in passing that Physical Intelligence achieved broad mobile robot generalization using a training set that was only 3% mobile robot data — with 97% coming from cheap, stationary tabletop arms. The investment implication is non-obvious: if the expensive, complex, hard-to-teleoperate hardware (mobile robots, humanoids) requires only a tiny fraction of training data once a strong foundation model exists, the companies that have amassed massive datasets on cheap, stationary robot platforms may hold disproportionate leverage over the entire humanoid and mobile robotics market — without ever having built a humanoid themselves.

"Our first publicly released research project on this, which came in April 2025, used a training set of which only 3% of the data was collected on mobile robots. So 97% was from these statically mounted arms bolted to a table, but we could get the mobile robots to actually generalize very broadly. They could go into a home that was never seen in the training data, clean up the kitchen, put away the dishes." [00:20:12]